For the first time, VAR allows GPT-style AR models to surpass diffusion transformers in image generation.
At the same time, it exhibits scaling laws similar to those observed in large language models.
On the ImageNet 256×256 benchmark, VAR improves FID from 18.65 to 1.80 and IS from 80.4 to 356.4, while speeding up inference by roughly 20×.
Detailed introduction:
Visual Autoregressive modeling (VAR) is a new image-generation paradigm that redefines autoregressive learning over images as coarse-to-fine "next-scale prediction" (or "next-resolution prediction"), in contrast to the standard raster-scan "next-token prediction."
This simple and intuitive method allows autoregressive transformers to learn visual distributions quickly and generalize well:
For the first time, VAR allows GPT-style AR models to surpass diffusion transformers in image generation.
On the ImageNet 256×256 benchmark, VAR improves FID from 18.65 to 1.80 and IS from 80.4 to 356.4, while speeding up inference by roughly 20×.
Empirical results show that VAR outperforms the Diffusion Transformer on multiple dimensions, including image quality, inference speed, data efficiency, and scalability.
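The coarse-to-fine generation loop behind next-scale prediction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scale schedule `(1, 2, 4, 8, 16)` is illustrative, and `predict_scale` is a hypothetical stand-in for the transformer that, in the real model, predicts an entire token map at the next resolution conditioned on all coarser maps in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_scale(prev_maps, size, vocab=4096):
    """Stand-in for the VAR transformer: emits a size x size token map
    conditioned on all coarser maps (random tokens here, for illustration)."""
    return rng.integers(0, vocab, size=(size, size))

def var_generate(scales=(1, 2, 4, 8, 16)):
    """Next-scale prediction: each autoregressive step produces a whole
    token map at the next resolution, from coarse to fine."""
    maps = []
    for s in scales:
        maps.append(predict_scale(maps, s))
    return maps

maps = var_generate()
print([m.shape for m in maps])  # [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
```

Because each step emits a full token map rather than a single token, the number of autoregressive steps grows with the number of scales instead of the number of pixels, which is where the inference speedup comes from.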
As VAR models are scaled up, they exhibit power-law scaling similar to that observed in large language models, with linear correlation coefficients close to −0.998, which strongly supports this.
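The reported correlation can be understood as follows: a power law is a straight line in log-log space, so the Pearson correlation between log model size and log loss measures how cleanly the trend holds. A small sketch with hypothetical data (the sizes and exponent below are made up for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical (model size, test loss) pairs following a power law L = c * N^alpha
params = np.array([1e7, 1e8, 1e9, 1e10])
loss = 5.0 * params ** -0.08

# A power law is linear in log-log space; the Pearson correlation of
# log(loss) vs log(params) quantifies how well the fit holds.
r = np.corrcoef(np.log(params), np.log(loss))[0, 1]
print(round(r, 3))  # -1.0 for exact power-law data
```

Real measurements are noisy, so an observed coefficient near −0.998 indicates an almost perfect power-law fit.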
VAR further demonstrates zero-shot generalization on downstream tasks such as image inpainting, outpainting (extrapolation), and editing.
These results show that VAR begins to emulate two important characteristics of large language models: scaling laws and zero-shot generalization.
The researchers have released all models and code publicly to facilitate the exploration of AR/VAR models for visual generation and unified learning.
VAR offers new insights into the design of autoregressive algorithms in computer vision and is expected to drive further progress in this field.
Project address: https://github.com/FoundationVision/VAR
Demo address: https://var.vision/demo