Video generation has historically been fragmented. You had models for T2V (Text-to-Video), separate pipelines for editing, and entirely different systems for video understanding. UniVideo changes this paradigm.
The Dual-Stream Approach
At the heart of UniVideo lies a novel integration of two powerful components: a Multimodal Large Language Model (MLLM) and a Multimodal Diffusion Transformer (MMDiT).
Architecture Highlights
- Understanding Stream (MLLM): Built on Qwen2.5-VL, this component processes complex user instructions, essentially "seeing" the input video or understanding the nuance of a text prompt.
- Generation Stream (MMDiT): Leveraging HunyuanVideo, this stream handles the pixel-level synthesis, translating the MLLM's reasoning into high-fidelity frames.
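The hand-off between the two streams can be sketched in a few lines. This is a toy illustration with hypothetical function names, not the actual UniVideo or HunyuanVideo API: the MLLM encodes the multimodal instruction into condition embeddings, and the MMDiT consumes those embeddings at every denoising step while synthesizing video latents.

```python
# Toy sketch of the dual-stream flow (hypothetical names, tiny tensors).
import numpy as np

rng = np.random.default_rng(0)

def mllm_encode(prompt_tokens: np.ndarray) -> np.ndarray:
    """Understanding stream: map instruction tokens to condition embeddings."""
    W = rng.standard_normal((prompt_tokens.shape[-1], 64))
    return np.tanh(prompt_tokens @ W)           # (seq, 64) conditioning

def mmdit_denoise_step(latents: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Generation stream: one denoising step guided by the condition."""
    # Stand-in for cross-attention: pool the condition and drift toward it.
    ctx = cond.mean(axis=0)                     # (64,)
    return latents - t * 0.1 * (latents - ctx)

prompt = rng.standard_normal((8, 32))           # 8 instruction tokens
cond = mllm_encode(prompt)                      # MLLM output
latents = rng.standard_normal((16, 64))         # 16 video-latent tokens
for t in np.linspace(1.0, 0.0, 10):             # iterative denoising loop
    latents = mmdit_denoise_step(latents, cond, t)
print(latents.shape)  # (16, 64)
```

The point of the sketch is the interface: the generation stream never re-parses the prompt; it only ever sees the understanding stream's embeddings.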
Why "Unified"?
Previous approaches often forced a single monolithic model to handle both understanding and generation, bottlenecking one capability or the other. UniVideo's decoupled yet synchronized design allows:
- Task Composition: You can ask the model to "add a red car AND make it rain". The MLLM parses this composite request.
- Zero-Shot Transfer: It can apply editing concepts learned from images directly to video frames without specific video-editing training data.
- In-Context Learning: Provide a reference image, and the model understands the style and subject, ensuring consistency across generated frames.
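To make task composition concrete, here is a deliberately naive parser that mimics what the MLLM does internally when it receives a composite request. The real decomposition is learned, not rule-based; the `decompose` function below is purely illustrative.

```python
# Illustration only: splitting a composite edit request into atomic
# sub-instructions, each of which would condition the generator.
def decompose(instruction: str) -> list[str]:
    """Naive stand-in for the MLLM's learned instruction parsing."""
    normalized = instruction.replace(" AND ", " and ")
    return [part.strip() for part in normalized.split(" and ") if part.strip()]

request = "add a red car AND make it rain"
print(decompose(request))  # ['add a red car', 'make it rain']
```

In UniVideo the analogous step happens in embedding space, which is why the same mechanism generalizes to instructions no rule-based parser could anticipate.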
Performance & Scale
UniVideo was trained on large-scale multimodal datasets spanning both understanding and generation tasks. It achieves state-of-the-art results on benchmarks like VBench and MMBench, showing that a unified approach doesn't mean sacrificing specialized performance.
Ready to test the architecture?
Experience the power of MLLM + MMDiT in our cloud sandbox.
Try UniVideo Now