One model · Any temporal order

UniTemp

Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Lin Zhang1*, Sicheng Mo3, Zefan Cai1, Jinhong Lin1, Zihao Lin4, Jiuxiang Gu2, Krishna Kumar Singh2, Yuheng Li2†, Yin Li1†
1University of Wisconsin–Madison 2Adobe Research 3University of California, Los Angeles 4University of California, Davis

*Work partially done during an internship at Adobe · †Equal advising

Forward · Backward · Inbetween

One 1.3B model · 4 denoising steps · forward, backward & inbetween

TL;DR

One autoregressive video model, every temporal order

UniTemp is a single autoregressive video generator distilled to run in any temporal order: it generates videos forward, backward, and inbetween two given moments — all with the same network. One 1.3B model with 4 denoising steps matches dedicated single-direction models in each direction, and the unified temporal control unlocks results that forward-only models cannot produce: seamless looping videos, in-shot scene transitions, stable 120-second generation in either direction, and looping multi-scene stories composed non-chronologically.

One model 3 orders

forward, backward, and inbetween generation from a single network.

Efficient 1.3B · 4 steps

a compact model with only a few denoising steps — fast in every temporal order.

Long horizon 120 s

stable minute-scale generation, forward or backward.

Unlocked ∞ loops

seamless loops, scene transitions, and looping multi-scene stories.

Bidirectional generation

One model, forward and backward

A single UniTemp network generates in both temporal directions. Its forward samples match the dedicated forward-only Self-Forcing baseline, and its backward samples reach the same quality — no per-direction specialization needed.

5-second videos

Each page shows the same prompt generated by the Self-Forcing forward baseline, UniTemp running forward, and UniTemp running backward. Backward videos are played forward — the first frames you see were generated last.

120-second videos

Both-end guided sink latents allow 2-minute generation with less content drift than strong forward-only baselines. We apply 3 sink frames for the Self-Forcing baseline. Each panel is downsized to 416×240 for web delivery — use full screen for the best view.

Inbetween generation

Anchored at both ends

UniTemp also generates between two given moments — the capability that powers the loops, transitions, and stories below.

Inbetween comparison

Given only the first and last frames, UniTemp generates complex motion inbetween. GI (Generative Inbetweening) struggles with such motion, while UniTemp handles it with a compact 1.3B model and only 4 denoising steps.


Seamless looping videos

The first frame is also the last: UniTemp closes a video on itself, producing loops that play forever without a visible seam.

In-shot scene transitions

Given two different worlds as the two ends, UniTemp morphs one into the other within a single continuous shot.

Showcase · Sora-style scene connections

Looping long-shot stories

Because UniTemp can extend any clip forward, backward, or bridge any two moments, it can compose a multi-scene story non-chronologically: the key block of each scene is generated first, extended in both directions, and the scenes are then connected by generated transitions — including a final transition from the last scene back to the first, closing the story into a perfect 90-second loop.

For each story below, pick a pair of adjacent scenes: the center video is the connection generated by UniTemp, morphing the scene on its left into the scene on its right. The last pair closes the loop.

How these are generated. We first generate the key block of each of the 4 scenes (captions x-2), extend each block forward and backward to grow the scenes, and finally generate the inbetween transitions that connect scene 1 → 2 → 3 → 4 → 1. This is a far more flexible pipeline than chronological generation, while keeping the speed and quality of a compact few-step model.
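The ordering described above can be sketched as a small planner. This is only an illustration of the non-chronological schedule, not the authors' implementation: the names `Task` and `plan_looping_story` are hypothetical, and the sketch only lists which generation calls would be issued in which order, without running any model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    """One generation step in the (hypothetical) story schedule."""
    kind: str                     # "key", "extend_forward", "extend_backward", "transition"
    scene: int                    # 1-based index of the scene the step starts from
    target: Optional[int] = None  # destination scene, set only for transitions


def plan_looping_story(num_scenes: int = 4) -> List[Task]:
    """List generation steps in the order described in the text."""
    tasks: List[Task] = []

    # 1) Generate the key block of every scene first.
    for s in range(1, num_scenes + 1):
        tasks.append(Task("key", s))

    # 2) Grow each scene by extending its key block in both directions.
    for s in range(1, num_scenes + 1):
        tasks.append(Task("extend_forward", s))
        tasks.append(Task("extend_backward", s))

    # 3) Connect adjacent scenes with inbetween transitions; the last
    #    transition wraps around to scene 1 and closes the loop.
    for s in range(1, num_scenes + 1):
        nxt = s % num_scenes + 1
        tasks.append(Task("transition", s, nxt))

    return tasks


if __name__ == "__main__":
    for task in plan_looping_story():
        print(task)
```

For 4 scenes the planner lists 4 key blocks, 8 extensions, and 4 transitions, with the final transition running from scene 4 back to scene 1.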

Analysis

Anchor latents ablation

Why does backward generation need anchor latents? Baseline backward generation (without anchors) shows periodic inter-block flickering; UniTemp's anchor latents (P=3) largely eliminate it, matching the flicker-free forward reference.

Citation

BibTeX

If you find UniTemp useful, please consider citing:

@article{zhang2026unitemp,
  title   = {UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation},
  author  = {Zhang, Lin and Mo, Sicheng and Cai, Zefan and Lin, Jinhong and Lin, Zihao and Gu, Jiuxiang and Singh, Krishna Kumar and Li, Yuheng and Li, Yin},
  journal = {arXiv preprint},
  year    = {2026}
}