Research Oct 15, 2025

UniVideo: Toward a "World Simulator" via Unified Video Modeling

UniVideo Research Team
Based on paper arXiv:2510.08377

Video understanding and generation have traditionally been treated as separate disciplines. UniVideo bridges this gap with a unified framework that brings both tasks into a single, cohesive workflow.

The Architecture: The Power of Two

At its core, UniVideo is a dual-stream architecture of roughly 20 billion parameters that pairs a multimodal large language model (MLLM) for understanding with a diffusion transformer (DiT) for generation.

Why Separation Matters

Unlike previous models that attempt to use a single transformer for everything, UniVideo deconstructs the problem: the MLLM works out the "why" and the "what", while the DiT handles the "how". This separation helps avoid the hallucination issues common in smaller unified models.
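To make that division of labor concrete, here is a minimal, conceptual sketch of how the two streams might be composed. This is not the authors' released code; the class, module, and argument names are placeholders.

```python
import torch.nn as nn

class DualStreamVideoModel(nn.Module):
    """Conceptual composition of the two streams (names illustrative):
    an MLLM that interprets the request, and a DiT that renders it."""

    def __init__(self, mllm: nn.Module, dit: nn.Module):
        super().__init__()
        self.mllm = mllm  # understanding stream: works out the "why" and "what"
        self.dit = dit    # generation stream: handles the "how" (pixels over time)

    def forward(self, multimodal_inputs, noisy_latents, timestep):
        # 1) The MLLM reads the interleaved instruction (text / images / video)
        #    and emits context-aware hidden states describing the intent.
        condition = self.mllm(multimodal_inputs)          # (B, seq_len, d_model)

        # 2) The DiT predicts the denoising update for the video latents,
        #    conditioned on that semantic plan rather than on raw text embeddings.
        return self.dit(noisy_latents, timestep, context=condition)
```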

Unified Multi-modal Embedder

The key innovation of UniVideo is its use of the MLLM's hidden states as direct input to the generator. Instead of relying on a fixed text encoder (such as T5 or CLIP), UniVideo conditions generation on the dynamic, context-aware embeddings produced by Qwen2.5-VL.

This allows the model to process interleaved inputs—a mix of text, images, and video frames—as a single continuous stream of information. This is what enables "In-Context" generation.
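As a rough illustration of that conditioning pathway, assuming a Hugging Face-style processor and model that expose hidden states (exact class names and arguments vary by release, and none of this is the paper's released code):

```python
import torch

@torch.no_grad()
def encode_interleaved(mllm, processor, text, images=None, video_frames=None):
    """Encode an interleaved prompt (text + optional images / video frames)
    into one sequence of context-aware hidden states.

    `mllm` and `processor` stand in for a Qwen2.5-VL-style model and its
    preprocessor loaded via Hugging Face `transformers`; check the model card
    for the exact classes and prompt template.
    """
    inputs = processor(
        text=text,
        images=images,
        videos=video_frames,
        return_tensors="pt",
    )
    outputs = mllm(**inputs, output_hidden_states=True)

    # The last-layer hidden states serve as the conditioning signal for the
    # DiT's cross-attention, replacing a frozen T5/CLIP text embedding.
    return outputs.hidden_states[-1]        # shape: (1, seq_len, d_model)
```

Because every text, image, and video token has already attended to every other token inside the MLLM, the DiT receives a conditioning sequence that reflects the full interleaved context rather than an isolated caption.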

Unlocking New Capabilities

1. In-Context Video Generation

By treating video generation as next-token prediction in a semantic sense, UniVideo can take a reference video of a person walking and a reference image of a new outfit, and generate a video of that person walking in the new outfit while maintaining temporal consistency.
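A hypothetical end-to-end call for that outfit-swap example, reusing the `encode_interleaved` sketch above; `sample_with_dit` and `vae_decode` are assumed helper functions, not part of any published UniVideo API:

```python
# Hypothetical usage sketch; variable and helper names are illustrative only.
prompt = (
    "Generate a video of the person in the reference clip walking, "
    "wearing the outfit shown in the reference image."
)

condition = encode_interleaved(
    mllm, processor,
    text=prompt,
    images=[outfit_image],        # reference image: the new outfit
    video_frames=walking_clip,    # reference video: the person walking
)

# Assumed helpers: a diffusion sampler over the DiT and a VAE latent decoder.
video_latents = sample_with_dit(dit, condition, num_frames=49)  # frame count is arbitrary
video = vae_decode(video_latents)
```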

2. Free-Form Video Editing

Because the MLLM has been trained on vast amounts of multimodal data, it understands abstract concepts. You can issue commands like "Turn this video into a Van Gogh painting" or "Make the weather rainy", and the model applies these transformations zero-shot, without task-specific training.
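In the same hypothetical sketch, editing is just a different instruction flowing through the identical pathway; no task-specific head or fine-tune is assumed:

```python
# Same conditioning pathway, different instruction (illustrative only).
edit_condition = encode_interleaved(
    mllm, processor,
    text="Turn this video into a Van Gogh painting.",
    video_frames=source_clip,     # the clip to be edited
)
edited_latents = sample_with_dit(dit, edit_condition, num_frames=len(source_clip))
edited_video = vae_decode(edited_latents)
```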

Conclusion

UniVideo represents a significant step towards a true "World Simulator". By unifying comprehension and generation, we are moving away from rigid, single-task tools towards a fluid, conversational interface for video creation.

Read the full paper on arXiv →