TL;DR
One autoregressive video model, every temporal order
UniTemp is a single autoregressive video generator distilled to run in any temporal order: with the same network, it generates videos forward, backward, and in between two given moments. One 1.3B model with 4 denoising steps matches dedicated single-direction models in each direction. The unified temporal control also enables results that forward-only models cannot produce: seamless looping videos, in-shot scene transitions, stable 120-second generation in either direction, and looping multi-scene stories composed non-chronologically.
Forward, backward, and inbetween generation from a single network.
A compact model with only a few denoising steps, fast in every temporal order.
Stable minute-scale generation, forward or backward.
Seamless loops, scene transitions, and looping multi-scene stories.
Bidirectional generation
One model, forward and backward
A single UniTemp network generates in both temporal directions. Its forward samples match the dedicated forward-only Self-Forcing baseline, and its backward samples reach the same quality — no per-direction specialization needed.
5-second videos
Each page shows the same prompt generated by the Self-Forcing forward baseline, UniTemp running forward, and UniTemp running backward. Backward videos are played forward — the first frames you see were generated last.
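The relationship between generation order and playback order can be sketched as follows. This is a minimal illustration, assuming the video is produced block by block and that only the order of blocks differs between directions; `to_playback_order` and the block labels are hypothetical names, not part of UniTemp's code.

```python
def to_playback_order(blocks, direction):
    """Return blocks in chronological (playback) order.

    blocks: list of blocks in the order they were generated.
    direction: "forward" or "backward" (hypothetical labels).
    """
    if direction == "forward":
        # Forward generation already emits blocks chronologically.
        return list(blocks)
    if direction == "backward":
        # Backward generation emits the latest moment first, so
        # playback order is the reverse of generation order.
        return list(reversed(blocks))
    raise ValueError(f"unknown direction: {direction!r}")


# Blocks labelled by generation step: g0 was generated first, g3 last.
generated = ["g0", "g1", "g2", "g3"]
playback = to_playback_order(generated, "backward")
print(playback)  # ['g3', 'g2', 'g1', 'g0']
```

In this sketch the block shown first during playback, `g3`, is the one generated last, which is the behaviour the caption describes.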
120-second videos
Both-end guided sink latents allow 2-minute generation with less content drift than strong forward-only baselines. We apply 3 sink frames for the Self-Forcing baseline. Each panel is downsized to 416×240 for web delivery — use full screen for the best view.
Inbetween generation
Anchored at both ends
UniTemp also generates between two given moments — the capability that powers the loops, transitions, and stories below.
Inbetween comparison
Given only the first and last frames, UniTemp generates complex motion in between. GI (Generative Inbetweening) struggles with complex motion, while UniTemp handles it efficiently with a compact model and only a few denoising steps.
Seamless looping videos
The first frame is also the last: UniTemp closes a video on itself, producing loops that play forever without a visible seam.
In-shot scene transitions
Given two different worlds as the two ends, UniTemp morphs one into the other within a single continuous shot.
Showcase · Sora-style scene connections
Looping long-shot stories
Because UniTemp can extend any clip forward or backward and bridge any two moments, it can compose a multi-scene story non-chronologically. The key block of each scene is generated first and extended in both directions. The scenes are then connected by generated transitions, including a final transition from the last scene back to the first, which closes the story into a seamless 90-second loop.
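The composition order described above can be sketched as a simple planning routine. This is an illustration under stated assumptions, not the authors' pipeline: `plan_story`, the step labels, and the `extend_blocks` parameter are hypothetical, and the sketch assumes each scene's key block is extended before the next scene is started.

```python
def plan_story(scenes, extend_blocks=1):
    """Build a generation plan and a playback timeline for a looping story.

    scenes: scene names in story (playback) order.
    extend_blocks: blocks added on each side of a key block (assumed value).
    """
    if len(scenes) < 2:
        raise ValueError("need at least two scenes to connect")

    plan = []  # generation steps, in the order they would be executed
    for scene in scenes:
        # 1) Generate the key block of the scene first.
        plan.append(("key", scene))
        # 2) Extend the key block in both temporal directions.
        for k in range(1, extend_blocks + 1):
            plan.append(("extend_backward", scene, k))
            plan.append(("extend_forward", scene, k))

    timeline = []  # what the viewer sees, in chronological order
    for i, scene in enumerate(scenes):
        # 3) Bridge each scene to the next; the modulo wraps the last
        #    scene back to the first, which closes the loop.
        nxt = scenes[(i + 1) % len(scenes)]
        plan.append(("transition", scene, nxt))
        timeline.append(scene)
        timeline.append(f"{scene}->{nxt}")
    return plan, timeline


plan, timeline = plan_story(["A", "B", "C"])
print(timeline)  # ['A', 'A->B', 'B', 'B->C', 'C', 'C->A']
```

The generation plan and the playback timeline differ on purpose: transitions are generated last but play between scenes, and the final `C->A` entry is the transition that returns the story to its first scene.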
For each story below, pick a pair of adjacent scenes: the center video is the connection generated by UniTemp, morphing the scene on its left into the scene on its right. The last pair closes the loop.
Analysis
Anchor latents ablation
Why does backward generation need anchor latents? Baseline backward generation (without anchors) shows periodic inter-block flickering; UniTemp's anchor latents (P=3) largely eliminate it, matching the flicker-free forward reference.
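One way to quantify periodic inter-block flicker is to compare frame-to-frame change at block boundaries with the change inside blocks. The sketch below is a hypothetical diagnostic, not the evaluation used for UniTemp; `boundary_flicker_ratio`, the flat-list frame representation, and the `block_size` parameter are assumptions made for illustration.

```python
def boundary_flicker_ratio(frames, block_size):
    """Ratio of mean frame-to-frame change at block boundaries vs. inside blocks.

    frames: list of frames, each a flat list of pixel values.
    block_size: number of frames per generated block (assumed known).
    A ratio near 1 suggests no extra change at boundaries; a much larger
    ratio suggests periodic inter-block flicker.
    """
    def mean_abs_diff(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    boundary, interior = [], []
    for i in range(1, len(frames)):
        d = mean_abs_diff(frames[i - 1], frames[i])
        # Frame i starts a new block when i is a multiple of block_size.
        (boundary if i % block_size == 0 else interior).append(d)

    if not boundary or not interior:
        raise ValueError("need at least one boundary and one interior transition")
    mean_interior = sum(interior) / len(interior)
    if mean_interior == 0:
        raise ValueError("interior change is zero; ratio undefined")
    return (sum(boundary) / len(boundary)) / mean_interior


# Synthetic clips: 12 frames of 4 pixels, 3 blocks of 4 frames each.
smooth = [[float(i)] * 4 for i in range(12)]
# Every other block gets a brightness offset, mimicking block-level flicker.
flickery = [[i + (3.0 if (i // 4) % 2 else 0.0)] * 4 for i in range(12)]

print(boundary_flicker_ratio(smooth, 4))    # 1.0
print(boundary_flicker_ratio(flickery, 4))  # 3.0
```

On real videos the interior change varies with motion, so a ratio like this is only meaningful when compared across methods on the same prompts.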
Citation
BibTeX
If you find UniTemp useful, please consider citing:
@article{zhang2026unitemp,
title = {UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation},
author = {Zhang, Lin and Mo, Sicheng and Cai, Zefan and Lin, Jinhong and Lin, Zihao and Gu, Jiuxiang and Singh, Krishna Kumar and Li, Yuheng and Li, Yin},
journal = {arXiv preprint},
year = {2026}
}