The YuanGroup team at Peking University has launched Open-Sora, an open-source project designed to replicate OpenAI's Sora model.

Open-Sora plans to implement the functions of the Sora model through technical components such as a video VQ-VAE, a Denoising Diffusion Transformer, and a condition encoder.

The project now supports:

- Variable aspect ratio
- Variable resolution
- Variable duration

Demonstration videos: 10s video reconstruction (256×256 resolution) / 18s video reconstruction (196x)

The Open-Sora project implements the following key components and features to replicate OpenAI’s video generation model:

1. Video VQ-VAE (Vector Quantized Variational AutoEncoder): a component that compresses video into latent representations along the temporal and spatial dimensions, turning high-resolution video into a low-dimensional representation for subsequent processing and generation.
2. Denoising Diffusion Transformer: a component that generates video from the latent representations, restoring the details of the video by gradually removing noise.
3. Condition Encoder: supports multiple conditional inputs, allowing the model to generate video content based on different text descriptions or other conditions.
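The vector-quantization step at the heart of a VQ-VAE maps each encoder output to its nearest entry in a learned codebook. The following is a minimal numpy sketch of that lookup only, with a toy codebook; the function name `quantize` and all values are illustrative, not the project's actual implementation:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (toy sketch).

    latents:  (N, D) array of encoder outputs
    codebook: (K, D) array of learned code vectors
    returns:  (indices, quantized) where quantized[i] == codebook[indices[i]]
    """
    # Squared Euclidean distance between every latent and every code vector
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)
    return indices, codebook[indices]

# Toy 2-entry codebook and two latent vectors
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, q = quantize(latents, codebook)
# idx → [0, 1]: each latent snaps to its nearest code
```

In the real model the encoder and codebook are learned jointly, and the decoder reconstructs video frames from the quantized grid.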

In addition, the project implements several technologies to enhance the flexibility and quality of video generation:

1. Variable aspect ratio: a dynamic masking strategy enables parallel batch training while preserving flexible aspect ratios. High-resolution videos are resized so that the longest side is 256 pixels, keeping the aspect ratio, and then zero-padded on the right and bottom to a uniform 256×256 resolution.

2. Variable resolution: although training uses a fixed 256×256 resolution, positional interpolation at inference time allows sampling at variable resolutions. This lets the attention-based diffusion model process higher-resolution sequences than it was trained on.
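Positional interpolation resamples the trained positional-embedding table to a new sequence length, so token positions at inference stay within the range seen during training. A minimal one-dimensional sketch using linear interpolation (the project's actual scheme operates on 2D patch grids; this simplified version is my own):

```python
import numpy as np

def interpolate_pos_emb(pos_emb, new_len):
    """Linearly resample an (L, D) positional-embedding table to
    (new_len, D), mapping both old and new positions onto [0, 1]."""
    L, D = pos_emb.shape
    old_x = np.linspace(0.0, 1.0, L)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_x, old_x, pos_emb[:, d]) for d in range(D)], axis=1
    )

# A table trained for 4 positions, stretched to 7 positions
pos_emb = np.arange(4, dtype=float)[:, None]   # toy (4, 1) table
stretched = interpolate_pos_emb(pos_emb, 7)    # (7, 1), endpoints preserved
```

Because positions are normalized to [0, 1] before interpolation, a longer sequence never produces position values outside the trained range, which is the key to stable higher-resolution sampling.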

3. Variable duration: the video VQ-VAE compresses video into latent representations, enabling generation at multiple durations. Spatial positional interpolation is extended to a spatio-temporal version so that the model can handle videos of variable length.
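Extending the interpolation idea to time amounts to resampling the embedding grid along the temporal axis as well. A hedged numpy sketch of a generic per-axis resampler applied to a (T, H, W, D) space-time table (an assumed layout; the repository may factorize its embeddings differently):

```python
import numpy as np

def interp_axis(emb, new_len, axis=0):
    """Linearly resample an embedding grid along one axis, leaving the
    other axes untouched. Endpoints of the axis are preserved exactly."""
    emb = np.moveaxis(emb, axis, 0)
    L = emb.shape[0]
    old_x = np.linspace(0.0, 1.0, L)
    new_x = np.linspace(0.0, 1.0, new_len)
    flat = emb.reshape(L, -1)
    out = np.stack(
        [np.interp(new_x, old_x, flat[:, j]) for j in range(flat.shape[1])],
        axis=1,
    ).reshape((new_len,) + emb.shape[1:])
    return np.moveaxis(out, 0, axis)

# Stretch a (T=4, H=8, W=8, D=16) space-time table to 9 frames
st = np.random.rand(4, 8, 8, 16)
longer = interp_axis(st, 9, axis=0)    # shape (9, 8, 8, 16)
```

Applying the same routine along the height and width axes recovers the spatial case, so one function covers both variable resolution and variable duration.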

Project address: https://pku-yuangroup.github.io/Open-Sora-Plan/blog_cn.html
GitHub: https://github.com/PKU-YuanGroup/Open-Sora-Plan
