Kandinsky 5.0: Diffusion model series for video and image generation

The flagship Video Pro rivals Veo 3 in visual quality and surpasses Wan 2.2-A14B, while Video Lite and Image Lite offer fast, lightweight alternatives for real-time use cases. Powered by K-VAE 1.0, a high-performance open-source visual autoencoder, the suite provides strong compression and a solid foundation for generative model training. The entire technology stack balances performance, scalability, and utility.

AI generation models have entered a stage of rapid "image + video" convergence. Closed-source models such as OpenAI's Sora exhibit extremely strong generative capabilities, and the open-source community is catching up quickly. Kandinsky 5 is one of the core projects in this wave: a complete family of models with a mature technical architecture, a high degree of openness, and coverage of multimodal tasks.

1. Kandinsky 5.0 positioning: an open-source unified framework for multimodal generation

Kandinsky 5 is not a single model but a zoo of generative models, covering:

  • Text-to-Image (T2I)
  • Image-to-Image (I2I)
  • Image Editing (Inpainting / Outpainting)
  • Text-to-Video (T2V)
  • Image-to-Video (I2V)

It is essentially a diffusion architecture compatible with multimodal tasks, offered at multiple model scales from lightweight to high-performance.

In the open source space, this coverage is highly strategic.

2. Technology: Unified Diffusion Transformer architecture

At the heart of Kandinsky 5 is a Diffusion Transformer (DiT)-style architecture, which serves as the backbone of the diffusion model. This is the current mainstream direction in generative models (including Sora, Stable Diffusion 3, Pika, HunyuanVideo, etc.).

Its basic technical path includes:

2.1 Architectural Essentials: Transformer as a denoiser

  • Replaces the U-Net noise predictor with a full Transformer encoder/decoder
  • Enhanced multi-scale feature processing (spatio-temporal attention)
  • Compatible with cross-modal conditioning (text, images, motion tracks)

Compared with a U-Net, a Transformer converges more reliably and is more expressive on large-scale data, especially for modeling temporal consistency in video.
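As a toy illustration of the "Transformer as denoiser" idea, the following NumPy sketch (purely hypothetical, not Kandinsky's actual code) runs one self-attention layer over flattened noisy patch tokens, standing in for a full DiT block:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention over a sequence of patch tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 16                      # token (patch embedding) dimension
n_tokens = 8 * 8            # an 8x8 grid of image patches, flattened
noisy_latent = rng.normal(size=(n_tokens, d))  # noised latent patches
wq, wk, wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))

# The "denoiser" here is a single attention layer standing in for a
# full DiT block; a real model stacks many blocks plus MLPs and norms.
predicted_noise = self_attention(noisy_latent, wq, wk, wv)
print(predicted_noise.shape)  # (64, 16) -- one noise estimate per patch
```

Because attention is applied over all patches at once, every noise estimate can draw on global context, which is the property that makes the Transformer attractive as a denoiser.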

2.2 Model scale distribution (officially disclosed)

| Model | Parameters | Tasks | Characteristics |
|---|---|---|---|
| Image Lite | ~6B | T2I / I2I / Edit | Medium scale, inference-cost friendly |
| Video Lite | ~2B | T2V | Lightweight, fast generation |
| Video Pro | ~19B | High-quality T2V / I2V | Professional-grade consistency and detail |

At roughly 19B parameters, the video model approaches the scale of large cross-modal models and has a strong capacity to learn long sequences and motion semantics.

2.3 Conditioning mechanism

Kandinsky 5 uses multiple sets of cross-modal conditions:

  • Text encoding (CLIP / T5-class encoders)
  • An image encoder as a prior
  • Additional temporal embeddings for video tasks
  • Camera motion as an auxiliary condition

This allows the model not only to generate content but also to generate motion structure.
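Such conditioning is typically injected via cross-attention: the visual latent tokens form the queries, while the condition (text, image, or camera embeddings) provides the keys and values. A minimal NumPy sketch, with all shapes and weights invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, wq, wk, wv):
    """Latent tokens attend to condition tokens (text, image, camera)."""
    q = latent_tokens @ wq      # queries come from the visual latent
    k = cond_tokens @ wk        # keys/values come from the condition
    v = cond_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return latent_tokens + softmax(scores) @ v  # residual injection

rng = np.random.default_rng(1)
d = 16
latent = rng.normal(size=(64, d))    # 64 video-latent tokens
text_emb = rng.normal(size=(10, d))  # 10 text-encoder tokens (e.g. T5)
wq, wk, wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))

out = cross_attention(latent, text_emb, wq, wk, wv)
print(out.shape)  # (64, 16): same shape, now text-conditioned
```

The residual form keeps the latent stream intact while mixing in condition information, which is why several condition types (text, image prior, camera motion) can be stacked as separate cross-attention layers.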

2.4 Video Modeling: Joint Space-Time Diffusion

Video Pro adopts:

  • 3D space-time convolution fused with Transformer blocks
  • Temporal attention layers: model frame-to-frame consistency
  • Latent video resolution compression: reduces memory requirements
  • Multi-stage decoding: progressively enhances details and textures

This technical path closely mirrors the publicly described approaches of Sora / Pika / HunyuanVideo (but at a smaller scale, and reproducible as open source).
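A common way to make joint space-time attention tractable is to factorize it: attend within each frame (spatial), then across frames at each spatial location (temporal). The reshaping pattern can be sketched as follows; the `attend` function is a trivial stand-in for real attention, since only the tensor layout matters here:

```python
import numpy as np

def attend(x):
    """Toy 'attention': each token becomes the mean of its sequence.
    A stand-in for softmax attention; only the reshaping matters."""
    return np.broadcast_to(x.mean(axis=-2, keepdims=True), x.shape)

B, T, H, W, C = 1, 4, 8, 8, 16  # batch, frames, height, width, channels
video = np.random.default_rng(2).normal(size=(B, T, H, W, C))

# Spatial attention: tokens within one frame attend to each other.
spatial = video.reshape(B * T, H * W, C)           # sequence = one frame
spatial = attend(spatial).reshape(B, T, H, W, C)

# Temporal attention: each spatial location attends across frames.
temporal = spatial.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
temporal = attend(temporal).reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)

print(temporal.shape)  # (1, 4, 8, 8, 16) -- back to the video layout
```

Factorization reduces the attention cost from O((T·H·W)²) for full 3D attention to O(T·(H·W)²) + O(H·W·T²), which is what makes minute-scale latent videos affordable.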

3. Method: Stage-based multi-task training

Kandinsky 5 employs a “phased training strategy”:

3.1 Stage 1: Basic Diffusion Training

Objectives:

  • Learn the basic visual distribution
  • Capture textures, semantic space, light and shadow structures

Training data includes:

  • Large-scale image data
  • Diverse style distribution
  • A mix of sharp and low-quality images to enhance generalization
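Stage 1 boils down to the standard noise-prediction objective: noise a clean latent at a random timestep, then train the network to recover the added noise. A toy NumPy version, where the schedule and the "denoiser" are placeholders rather than Kandinsky's actual training code:

```python
import numpy as np

rng = np.random.default_rng(3)

def diffusion_loss(denoiser, x0):
    """Standard noise-prediction objective: noise a clean latent at a
    random timestep, then score the model's estimate of that noise."""
    t = rng.uniform(0.0, 1.0)                 # continuous timestep in [0, 1]
    alpha = np.cos(t * np.pi / 2)             # toy cosine schedule
    sigma = np.sin(t * np.pi / 2)
    eps = rng.normal(size=x0.shape)
    x_t = alpha * x0 + sigma * eps            # forward noising
    eps_hat = denoiser(x_t, t)
    return np.mean((eps_hat - eps) ** 2)      # MSE on the noise

# A hypothetical "denoiser" that just echoes its input, for illustration.
x0 = rng.normal(size=(64, 16))
loss = diffusion_loss(lambda x_t, t: x_t, x0)
print(loss >= 0.0)  # True: the loss is a non-negative MSE
```

Training then consists of minimizing this loss over large image batches; the later stages keep the same objective but change what conditions and data the denoiser sees.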

3.2 Stage 2: Multimodal joint training

Text-alignment and image-conditioning tasks are added so that the model gains:

  • Text-image semantic mapping
  • Style transfer
  • Image repainting and editing

3.3 Stage 3: Video-specific

For Video Lite / Pro:

  • Train the 3D latent on a video dataset
  • Add a temporal consistency loss
  • Add camera trajectory conditioning
  • Optimize inter-frame stability and motion fluidity
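One simple form of temporal consistency loss compares the frame-to-frame differences of the prediction against those of the reference, penalizing flicker even when individual frames look fine. A hypothetical NumPy sketch (not the actual loss used by Kandinsky):

```python
import numpy as np

def temporal_consistency_loss(pred, target):
    """Penalize mismatched frame-to-frame changes: compare the temporal
    differences of the prediction with those of the reference video."""
    pred_diff = np.diff(pred, axis=0)      # frame t+1 minus frame t
    target_diff = np.diff(target, axis=0)
    return np.mean((pred_diff - target_diff) ** 2)

rng = np.random.default_rng(4)
video = rng.normal(size=(8, 16, 16, 3))  # 8 frames of 16x16 RGB latents

print(temporal_consistency_loss(video, video))            # 0.0: perfect match
print(temporal_consistency_loss(video, video[::-1]) > 0)  # True: mismatch
```

Because the loss is computed on differences rather than raw frames, it specifically targets motion fidelity and leaves per-frame appearance to the base diffusion loss.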

4. Capabilities: The actual performance of images and videos

4.1 Image (T2I)

Features:

  • Stable composition
  • Consistent texture
  • Multiple styles are controllable
  • The 6B model reaches mainstream quality

Image editing capabilities such as inpainting are stable and can handle complex edges and stylistic transitions.
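Mask-based inpainting is commonly implemented by compositing at each denoising step: known pixels are kept, masked pixels are taken from the model's current estimate. A generic NumPy sketch of that compositing step (illustrative, not Kandinsky's implementation):

```python
import numpy as np

def inpaint_step(denoised, known, mask):
    """One compositing step of mask-based inpainting: keep the known
    content where mask == 1, the freshly denoised content where mask == 0."""
    return mask * known + (1.0 - mask) * denoised

rng = np.random.default_rng(5)
known = rng.normal(size=(16, 16, 3))         # the original image (latent)
mask = np.zeros((16, 16, 1)); mask[:8] = 1   # keep top half, repaint bottom
denoised = rng.normal(size=(16, 16, 3))      # the model's current estimate

out = inpaint_step(denoised, known, mask)
print(np.allclose(out[:8], known[:8]))       # True: kept region untouched
print(np.allclose(out[8:], denoised[8:]))    # True: masked region repainted
```

Repeating this compositing at every denoising step is what lets the generated region blend smoothly into the preserved one at the mask boundary.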

4.2 Video (T2V / I2V)

The Lite version is mainly used for:

  • Short videos (5–10 seconds)
  • Lightweight content generation

The Pro version is closer to professional needs:

  • Natural motion trajectories
  • Stable inter-frame structure
  • High detail retention (hands, faces, textures)
  • A degree of "logical coherence"

Among open-source models, it sits in the first tier.

5. Engineering Deployability: Real-world advantages of open-source models

Kandinsky 5 is released under the MIT license, and the models ship with:

  • ONNX / PyTorch inference scripts
  • A multi-GPU inference scheme
  • FP8/FP16-optimized paths for some models
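The memory benefit of an FP16 path is easy to quantify: halving the bytes per weight halves the footprint, at a small rounding cost. A quick NumPy illustration (FP8 needs dedicated library support and is omitted here):

```python
import numpy as np

# Simulated weight tensor for one layer of a large model.
weights_fp32 = np.random.default_rng(6).normal(
    size=(4096, 4096)).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)  # half-precision copy

print(weights_fp32.nbytes // 2**20, "MiB")  # 64 MiB
print(weights_fp16.nbytes // 2**20, "MiB")  # 32 MiB

# For inference, the rounding cost is small relative to weight magnitude.
err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(err < 1e-2)  # True: worst-case rounding error stays small
```

Scaled to a ~19B-parameter model, the same arithmetic means roughly 38 GB of weights in FP16 versus 76 GB in FP32, which is why reduced precision is essential for fitting Video Pro onto commodity GPUs.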

6. Comparison with Other Models (Technical Perspective)

| Model | Video quality | Speed | Open source | Architecture |
|---|---|---|---|---|
| Kandinsky 5 Pro | High | Medium | Fully open source | DiT + 3D latent |
| Stable Video Diffusion | Medium | Fast | Open source | Temporal diffusion |
| Pika | High | Fast | Closed source | Undisclosed |
| Sora | Extremely high | Fast | Closed source | 3D video generation (advanced) |

From the perspective of “open source + video quality”, Kandinsky 5 is currently one of the strongest open source T2V series.

7. In one sentence

Kandinsky 5 is a unified multimodal generation framework built around the Diffusion Transformer and aimed at image and video tasks. With its complete architecture, full open-source release, and strong engineering deployability, it is one of the key projects in open-source video generation.

Its model-family design, phased training strategy, and spatio-temporal diffusion structure give it high research and application value in the open-source ecosystem.

If you’re looking to build your own AI image/video generation system, study multimodal generation, or build lightweight AI creative tools, Kandinsky 5 is a foundational framework worth diving into.

GitHub: https://github.com/kandinskylab/kandinsky-5
Hugging Face: https://huggingface.co/kandinskylab
Technical report: https://huggingface.co/papers/2511.14993
