AirLLM: Run very large models on low-memory devices

AirLLM is a tool for running very large AI models on computers with limited memory. Instead of traditional compression, it uses layer-by-layer loading: with it, you can run a 70-billion-parameter model on just 4GB of video memory, or even a 405-billion-parameter model on 8GB, without degrading model quality.
The appeal is that you can use high-performance AI models on affordable devices without expensive hardware upgrades. The tool also offers optional compression for up to a 3x speedup while preserving accuracy.

Today, as large language models keep growing, a practical problem is becoming ever more obvious:

Models are not expensive, graphics cards are expensive.

Models at the 70B, 180B, or 405B scale commonly need tens or even hundreds of gigabytes of video memory; for most individual developers, a consumer GPU simply cannot load the whole model. There are traditionally two ways around this:

  • Buy a bigger graphics card
  • Quantization compression (4-bit/8-bit)

But AirLLM takes a third path.

Not “compression”, but a change in how the model is loaded

The core idea of AirLLM is straightforward:

Don’t load the entire model into the GPU at once.

In a traditional inference framework, all of the model weights are loaded into GPU memory before the forward pass begins. Either the video memory is large enough, or you hit an out-of-memory (OOM) error.

AirLLM uses a block-wise (layer-by-layer) loading mechanism:

  • Model weights stay on disk or in CPU memory
  • During inference, only the current layer is loaded onto the GPU
  • The layer is released as soon as its computation finishes
  • Then the next layer is loaded

The GPU is only responsible for the current compute unit, not holding the entire model.
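The loop above can be sketched in plain Python. This is not AirLLM’s actual code, just a toy “model” whose layers live on disk as pickle files and are loaded, applied, and released one at a time:

```python
# A minimal sketch of layer-by-layer ("paginated") inference, in pure Python.
# AirLLM does this with Transformer layers on a GPU; here each "layer" is just
# a (scale, bias) pair applied to a vector, and "disk" is a temp directory.
import os
import pickle
import tempfile

def save_layers(layers, directory):
    """Persist each layer's weights as its own file, like sharded checkpoints."""
    for i, layer in enumerate(layers):
        with open(os.path.join(directory, f"layer_{i}.pkl"), "wb") as f:
            pickle.dump(layer, f)

def run_paginated(directory, n_layers, x):
    """Stream layers from disk one at a time; only one layer is ever in memory."""
    for i in range(n_layers):
        with open(os.path.join(directory, f"layer_{i}.pkl"), "rb") as f:
            scale, bias = pickle.load(f)       # load the current layer
        x = [scale * v + bias for v in x]      # forward pass through it
        del scale, bias                        # release before the next load
    return x

layers = [(2.0, 1.0), (0.5, 0.0), (1.0, -1.0)] # toy 3-layer "model"
with tempfile.TemporaryDirectory() as d:
    save_layers(layers, d)
    out = run_paginated(d, len(layers), [1.0, 2.0])
print(out)
```

At any moment only one layer’s weights are resident; the working set is the largest single layer, not the whole model.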

From an engineering point of view, it effectively turns the Transformer into a “paginated execution structure”.

What does this mean?

This means that video memory is no longer the only bottleneck.

In theory:

  • 4GB of video memory can run 70B-class models
  • Larger models can also be loaded in the same way
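The 4GB figure can be sanity-checked with simple arithmetic. The numbers below (80 transformer blocks, fp16 weights) assume a Llama-2-70B-like layout and are illustrative, not measured:

```python
# Back-of-envelope: why one layer of a 70B model fits in 4 GB of VRAM.
# Assumes an 80-layer, fp16 model such as Llama-2-70B; figures are illustrative.
params = 70e9            # total parameters
bytes_per_param = 2      # fp16
n_layers = 80            # transformer blocks

total_gb = params * bytes_per_param / 1e9   # whole model on disk
per_layer_gb = total_gb / n_layers          # resident set with layer-wise loading

print(f"full model: {total_gb:.0f} GB, per layer: {per_layer_gb:.2f} GB")
```

Roughly 1.75GB per layer leaves headroom for activations and the KV cache within a 4GB card, which is why the claim is plausible in principle.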

But here it must be emphasized:

Being able to run ≠ running smoothly

Because weights are repeatedly read from disk, inference is significantly slower than with the model fully loaded. The system bottleneck shifts from GPU memory to:

  • Disk IO speed (SSDs are critical)
  • CPU memory capacity
  • Data scheduling efficiency

It is more a matter of “trading IO for video memory”.
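A back-of-envelope calculation shows why disk speed dominates: in the worst case, every generated token re-reads the full weight file from storage. The figures below (140GB of fp16 weights, 3 GB/s sequential reads) are illustrative assumptions, not benchmarks:

```python
# Rough cost of trading IO for VRAM: with layer-wise loading from SSD,
# each forward pass may stream the entire model off storage again.
model_gb = 140.0         # ~70B params in fp16 (assumption)
ssd_gb_per_s = 3.0       # typical NVMe sequential read speed (assumption)

seconds_per_token = model_gb / ssd_gb_per_s
print(f"~{seconds_per_token:.0f} s per token just for weight reads")
```

Even with a fast NVMe drive, weight streaming alone costs tens of seconds per token, which is why AirLLM suits experimentation rather than latency-sensitive serving.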

How does it differ from quantization?

The idea behind quantization is:

  • Reduce parameter precision
  • Reduce video memory usage
  • Improve inference speed

The trade-off is the potential loss of accuracy.
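That precision loss is easy to demonstrate with a toy symmetric int8 quantizer (pure Python, illustrative weight values only; real schemes are more sophisticated):

```python
# Toy symmetric int8 quantization: weights are rounded onto 256 levels,
# so the round-trip is lossy. This is the accuracy trade-off in miniature.
weights = [0.8123, -1.5042, 0.0071, 2.25]
scale = max(abs(w) for w in weights) / 127   # map the largest weight to 127

q = [round(w / scale) for w in weights]      # int8 codes
deq = [c * scale for c in q]                 # reconstructed weights

errors = [abs(w - d) for w, d in zip(weights, deq)]
print(max(errors))                           # nonzero: information was lost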

The idea of AirLLM is:

  • Leave the model parameters unchanged
  • Lose no precision
  • Change only the loading mechanism

So, in theory, model quality does not degrade.

It is better suited for:

  • Probing the structure of a very large model
  • Research validation
  • Experimenting with large models without a high-end graphics card

than for deploying high-concurrency online services.

The reality of “405B on 8GB”

Technically, as long as layer-wise loading is supported, there is no theoretical upper limit on model size.

But the reality is:

  • Model files are huge
  • CPU memory usage is extremely high
  • Inference can be very slow

Therefore, a more reasonable understanding is:

AirLLM makes large models “runnable”, but does not guarantee they are “usable in production”.

It is a breakthrough in engineering, not a miracle of performance.

Real value

AirLLM is not about “replacing high-end GPUs”, but about:

  • Lower the threshold for experimentation
  • Let more people get hands-on with very large models
  • Provide a viable path for resource-constrained environments

At a time when the large-model ecosystem is increasingly concentrated around those who control compute, this kind of “structural optimization” is valuable in itself.

If you are interested in local large model inference, AirLLM represents a direction worth researching:

Instead of making the model smaller, it lets the model “run in segments”.

GitHub: https://github.com/lyogavin/airllm
YouTube:
