

Photo by author
# Lights, camera…
With the arrival of Veo and Sora, video generation has reached a new height. Creators are experimenting at scale, and teams are integrating these tools into their marketing workflows. However, there is a drawback: most closed systems collect your data and apply visible or invisible watermarks that label the outputs as AI-generated. If you value privacy, control, and on-device workflows, open-source models are your best option, and several now rival Veo's results.
In this article, we will review the top open-source video generation models, providing technical background and a demo video for each to help you evaluate their capabilities. Every model is available on Hugging Face and can run locally through ComfyUI or your preferred desktop AI application.
# 1. Wan 2.2 A14B
Wan 2.2 upgrades the diffusion backbone with a Mixture-of-Experts (MoE) architecture that splits the denoising timesteps across specialized experts, increasing capacity without a corresponding compute penalty. The team also curated aesthetic labels (such as lighting, composition, contrast, and color tone) to enable cinematic-looking output. Compared to Wan 2.1, the training data grew significantly (+65.6% images, +83.2% videos), improving motion, semantics, and aesthetics.
Wan 2.2 reports top performance against both open and closed systems. You can find the text-to-video and image-to-video A14B collections on Hugging Face: Wan-AI/Wan2.2-T2V-A14B and Wan-AI/Wan2.2-I2V-A14B.
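If you prefer scripting over ComfyUI, the A14B checkpoints can also be driven from Python. Below is a minimal text-to-video sketch with Hugging Face Diffusers; it assumes a Diffusers-format repackaging of the checkpoint (referenced here as `Wan-AI/Wan2.2-T2V-A14B-Diffusers`) and a recent `diffusers` release with Wan support, so treat the repo ID and generation parameters as illustrative rather than definitive.

```python
# Sketch: text-to-video with Wan 2.2 A14B via Diffusers.
# The repo ID and generation settings below are assumptions; adjust to your setup.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"  # assumed Diffusers-format repo
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A slow cinematic dolly shot of a foggy harbor at sunrise",
    height=720,
    width=1280,
    num_frames=81,            # roughly 5 seconds at 16 fps
    num_inference_steps=40,
    guidance_scale=4.0,
).frames[0]
export_to_video(frames, "wan22_t2v.mp4", fps=16)
```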
https://www.youtube.com/watch?v=KTDOGWM7HAC
# 2. HunyuanVideo
HunyuanVideo is a 13B-parameter open video foundation model trained with a causal 3D variational autoencoder (VAE) in a spatio-temporal latent space. Its transformer uses a "dual-stream to single-stream" design: text and video tokens are first processed independently with full attention and then fused, while a decoder-only multimodal LLM acts as the text encoder to improve prompt following and detail capture.
The open-source ecosystem includes code, weights, single- and multi-GPU inference (xDiT), FP8 weights, Diffusers and ComfyUI integration, a Gradio demo, and the Penguin Video Benchmark.
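For a quick local test outside ComfyUI, the Diffusers integration mentioned above can be exercised with a short script. The sketch below assumes the community Diffusers-format checkpoint `hunyuanvideo-community/HunyuanVideo`; the resolution, frame count, and step count are illustrative and should be scaled to your VRAM.

```python
# Sketch: HunyuanVideo text-to-video through its Diffusers integration.
# The repo ID and generation settings are assumptions; adjust to your hardware.
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"  # assumed Diffusers-format repo
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()          # reduce VRAM spikes while decoding latents
pipe.enable_model_cpu_offload()   # trade speed for memory on a single GPU

frames = pipe(
    prompt="A cat walks on the grass, realistic style",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "hunyuan_t2v.mp4", fps=15)
```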
https://www.youtube.com/watch?v=lvvutwtj0ac
# 3. Mochi 1
Mochi 1 is a 10B asymmetric diffusion transformer (AsymmDiT) trained from scratch and released under Apache 2.0. It pairs with an asymmetric VAE that compresses videos 8×8 spatially and 6× temporally into a 12-channel latent space, and it uses a single T5-XXL text encoder, prioritizing visual capacity over text processing.
In a preliminary assessment, the Genmo team positions Mochi 1 as the most advanced open model, with high-fidelity motion and strong prompt adherence, aiming to close the gap with closed systems.
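Mochi 1 also ships with a Diffusers pipeline, which makes it straightforward to try the preview weights locally. The sketch below assumes the `genmo/mochi-1-preview` repository and its bf16 weight variant; the frame count and frame rate are illustrative defaults.

```python
# Sketch: Mochi 1 preview via Diffusers, using the bf16 weight variant.
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit the 10B model on a single consumer GPU
pipe.enable_vae_tiling()         # keeps VAE decoding within VRAM limits

frames = pipe(
    prompt="Close-up of a chameleon eye shifting focus, macro lens, shallow depth of field",
    num_frames=84,               # illustrative clip length
).frames[0]
export_to_video(frames, "mochi_t2v.mp4", fps=30)
```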
https://www.youtube.com/watch?v=QMOMQZJN_FK
# 4. LTX-Video
LTX-Video is a DiT-based (diffusion transformer) image-to-video generator built for speed: it produces 30 fps videos at 1216×704 in real time, trained on a large, diverse dataset to balance motion and visual quality.
The lineup spans several variants: 13B dev, 13B distilled, 2B distilled, and FP8-quantized builds, along with upscalers and ready-to-use ComfyUI workflows. If you're optimizing for fast iteration and crisp motion from a single image or a short conditioning sequence, LTX-Video is a compelling choice.
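Because LTX-Video is geared toward image-to-video, a short conditioning script is the quickest way to evaluate it. The sketch below uses the image-to-video pipeline from Diffusers against the `Lightricks/LTX-Video` repository; the input image path and generation settings are placeholders.

```python
# Sketch: LTX-Video image-to-video with Diffusers. The input image path and
# the resolution/frame settings are illustrative placeholders.
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("first_frame.png")  # your conditioning frame
frames = pipe(
    image=image,
    prompt="The camera slowly pushes in as rain begins to fall on the street",
    width=704,
    height=480,
    num_frames=161,              # about 6.7 seconds at 24 fps
    num_inference_steps=50,
).frames[0]
export_to_video(frames, "ltx_i2v.mp4", fps=24)
```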
https://www.youtube.com/watch?v=7zmxtmyud_u
# 5. CogVideoX-5B
CogVideoX-5B is the higher-fidelity sibling of the 2B baseline; it is trained in bfloat16 and is recommended to run in bfloat16. It produces 6-second clips at 8 fps with a fixed 720×480 resolution and supports English prompts of up to 226 tokens.
The model's documentation details the expected video memory (VRAM) for single- and multi-GPU inference, typical runtimes (around 90 seconds for 50 steps on an H100), and how CPU offload and VAE tiling/slicing affect memory and speed.
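The memory-saving options mentioned in the documentation map directly onto Diffusers calls. Below is a minimal sketch with CPU offload and VAE tiling enabled, assuming the `THUDM/CogVideoX-5b` repository; 49 frames at 8 fps corresponds to the 6-second clip length, and the prompt is just an example.

```python
# Sketch: CogVideoX-5B text-to-video with the memory savers discussed above
# (CPU offload and VAE tiling), run in bfloat16 as recommended.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # lowers VRAM use at the cost of speed
pipe.vae.enable_tiling()         # decode the 720x480 latents in tiles

frames = pipe(
    prompt="A panda playing an acoustic guitar in a bamboo forest",
    num_frames=49,               # 6 seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "cogvideox_t2v.mp4", fps=8)
```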
https://www.youtube.com/watch?v=S2B7QGV-LO
# Choosing a video generation model
Here are some high-level takeaways to help you choose the right video generation model for your needs.
- If you want a cinema-friendly look and 720p/24 on a single RTX 4090: Wan 2.2 (A14B for the strongest results; the 5B hybrid TI2V for efficient 720p/24)
- If you need a large, general-purpose T2V/I2V foundation with robust motion and a complete open-source software (OSS) toolchain: HunyuanVideo (13B, xDiT parallelism, FP8 weights, Diffusers/ComfyUI)
- If you want a hackable, state-of-the-art (SOTA) preview with a modern architecture and a clear research roadmap: Mochi 1 (10B AsymmDiT + AsymmVAE, Apache 2.0)
- If you care about real-time I2V and editing-oriented advanced workflows: LTX-Video (30 fps at 1216×704, multiple 13B/2B and FP8 variants)
- If you need efficient 6-second 720×480 T2V, solid Diffusers support, and quantization for small VRAM budgets: CogVideoX-5B
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a Bachelor's degree in Telecommunication Engineering. His vision is to create an AI product using graph neural networks for students with mental illness.