DeepSeek-OCR: Optical compression that makes long documents “understandable”

In the world of large models, context length is a perennial limitation. When a document is too long or its diagrams too complex, traditional language models often cannot "finish reading" it. Recently, DeepSeek-OCR, an open-source project from DeepSeek AI, proposed a disruptive idea:

"Instead of making the language model read endless text, treat the text as an image and compress it into a visual signal."

1. What is DeepSeek-OCR?

DeepSeek-OCR, full name DeepSeek-OCR: Contexts Optical Compression, is a study published by DeepSeek AI. Its core idea:
Optically (visually) compress long text contexts so that the model can understand more with fewer tokens.

In simple terms, it is not just an OCR (Optical Character Recognition) model, but a "visual context compression framework".
It converts lengthy document content into a small number of vision tokens, and then lets a language model (such as DeepSeek-3B-MoE) "restore" the text and structure from this visual information.

2. Why is it important?

Traditional text input causes a large model's token consumption to balloon rapidly:
scanning a single page can easily cost thousands of tokens, especially when the document contains tables, formulas, charts, and layout information, and the language model simply cannot fit it all in context.

The contribution of DeepSeek-OCR is:

  • compresses an entire page into fewer than 1/10 of the original number of tokens;
  • maintains a recognition accuracy of over 97% at that ratio;
  • still maintains about 60% accuracy even at 20× compression;
  • processes more than 200,000 pages of documents per day on a single A100 GPU.

This means:

Large documents that used to require hundreds of GPUs to train or process can now run on a single graphics card.
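A quick back-of-the-envelope calculation makes the saving concrete. The 2,000 text-tokens-per-page figure below is an assumed ballpark for a dense page; ~100 vision tokens per page matches the DeepEncoder figure quoted later in this article:

```python
# Token budget for a 1,000-page document, text input vs. optical compression.
# 2,000 text tokens/page is an assumed ballpark for a dense page of prose;
# ~100 vision tokens/page is the figure quoted for DeepEncoder output.
PAGES = 1_000
TEXT_TOKENS_PER_PAGE = 2_000
VISION_TOKENS_PER_PAGE = 100

text_total = PAGES * TEXT_TOKENS_PER_PAGE      # 2,000,000 tokens
vision_total = PAGES * VISION_TOKENS_PER_PAGE  # 100,000 tokens
ratio = text_total / vision_total

print(f"text: {text_total:,}  vision: {vision_total:,}  compression: {ratio:.0f}x")
```

Under these assumptions, a thousand-page corpus shrinks from two million tokens to one hundred thousand, which is what moves such workloads into single-GPU territory.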

3. Architecture principles

The architecture of DeepSeek-OCR is divided into two parts:

Module | Function | Technical highlights
DeepEncoder | Encodes image inputs into vision tokens | Captures spatial structure such as text, layout, tables, and charts
DeepSeek-3B-MoE-A570M | Mixture-of-experts decoder that recovers or understands text from vision tokens | Provides linguistic decoding and reasoning

The overall process is as follows:

  1. A page of a complex document (charts, tables, formulas) is fed in.
  2. DeepEncoder compresses it into roughly 100 vision tokens.
  3. The decoder reconstructs text or semantic understanding from these vision tokens.

Compared to traditional OCR, DeepSeek-OCR not only recognizes text, but also retains layout information and spatial logic, allowing the model to “understand” the page structure rather than just “recognize” the text.
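The two-stage flow above can be sketched in miniature. This is a toy stand-in, not the project's real API: the "encoder" here just squeezes a page into a fixed, small number of chunks, and the "decoder" reassembles them, to show the shape of the pipeline rather than its actual mechanics:

```python
# Toy sketch of the two-stage pipeline: encode a page into a fixed,
# small number of "vision tokens", then decode them back.
# All names and logic here are illustrative, not the real model.

def deep_encode(page_text: str, tokens_per_page: int = 100) -> list[str]:
    """Stand-in for DeepEncoder: compress a page into at most
    `tokens_per_page` coarse chunks (the real encoder emits learned
    visual embeddings that also capture layout)."""
    chunk = max(1, len(page_text) // tokens_per_page)
    pieces = [page_text[i:i + chunk] for i in range(0, len(page_text), chunk)]
    return pieces[:tokens_per_page]

def moe_decode(vision_tokens: list[str]) -> str:
    """Stand-in for the DeepSeek-3B-MoE decoder: reconstruct text
    from the compressed representation."""
    return "".join(vision_tokens)
```

The key property mirrored here is that the decoder only ever sees the short, fixed-size compressed sequence, never the original page.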

4. Performance and testing

In the published paper and experiments, DeepSeek-OCR's performance is striking:

Indicator | Performance
Compression ratio < 10× | Recognition accuracy ≈ 97%
Compression ratio ≈ 20× | Accuracy ≈ 60%
Single-GPU throughput | More than 200,000 pages per day
Comparison models | Outperforms GOT-OCR 2.0 (256 tokens/page) and MinerU 2.0 (6,000 tokens/page)

This makes it not only an OCR model, but also a new paradigm of “saving context budget for large models”.
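The throughput figure in the table is easy to translate into per-page terms:

```python
# Back-of-the-envelope throughput from the single-GPU figure above.
PAGES_PER_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
seconds_per_page = SECONDS_PER_DAY / PAGES_PER_DAY

print(f"{pages_per_second:.2f} pages/s, {seconds_per_page:.2f} s/page")
```

That is roughly 2.3 pages per second sustained, or under half a second per page, on a single A100.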

5. Application scenarios

The potential uses of DeepSeek-OCR are vast:

  • 📚 Scientific research and education: Batch digitization of books, academic literature, charts.
  • 💼 Enterprise file processing: Efficiently scan and structure contracts, reports, and vouchers.
  • 🔍 Large model front-end preprocessing: As a “visual compression entrance” for LLMs, it provides more context under limited tokens.
  • 🧩 Training Data Generation: Mass-produce clean corpus and visual data for LLMs/VLMs.

6. Analysis of advantages and disadvantages

Pros | Description
🔹 High compression ratio | 10–20× less token consumption
🔹 Preserves layout information | Understands tables, charts, and layout structure
🔹 Open source and reproducible | Deployable directly from GitHub and Hugging Face
🔹 Lower cost | Reduces GPU memory and inference time

Limitations | Description
⚠️ Accuracy drops at high compression | Recognition quality degrades significantly beyond 20×
⚠️ Support for complex handwriting/special fonts is unknown | Focused on print and standard documents
⚠️ Capable GPU inference still required | The encoder stage is computationally intensive
⚠️ International regulatory factors | DeepSeek AI faces usage restrictions in some regions

7. Deployment and use

The project is fully open source and available on GitHub and Hugging Face.
Recommended environment:

Python 3.12.9 
PyTorch 2.6.0 
Transformers 4.46.3 
CUDA 11.8 
Flash-Attn 2.7.3

Sample code (adapted from the project's Hugging Face model card; the model ships custom code, so trust_remote_code=True is required, and the exact prompts/arguments may change between releases):

from transformers import AutoModel, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# "<|grounding|>" asks the model to preserve layout; a plain "Free OCR."
# prompt yields text-only output.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
res = model.infer(tokenizer, prompt=prompt, image_file="sample_page.png",
                  output_path="./output", save_results=True)

print(res)
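For bulk workloads, any per-page OCR callable (for example, a thin wrapper around the model call in the sample) can be mapped over a folder of page images. The `ocr_directory` helper below is hypothetical, not part of the project:

```python
from pathlib import Path

def ocr_directory(image_dir: str, ocr_page) -> dict[str, str]:
    """Apply a per-page OCR callable to every PNG in a directory,
    returning {filename: recognized_text}. `ocr_page` is any function
    mapping an image path to text, e.g. a wrapper around the model call."""
    return {
        path.name: ocr_page(str(path))
        for path in sorted(Path(image_dir).glob("*.png"))
    }
```

Keeping the model call behind a plain callable like this also makes it easy to swap in a stub for testing, or to shard the file list across several GPUs.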

8. Future prospects

DeepSeek-OCR demonstrates a significant trend:

The future of “reading” does not necessarily depend on words.

When large models gradually expand from “text understanding” to “document understanding”, visual information compression will become a new computing paradigm.
It may be applied not only to OCR, but also to:

  • Document memory compression
  • Multimodal contextual fusion
  • Low-bandwidth remote inference
  • AI Education / Knowledge Graph Generation

At the intersection of vision and language, DeepSeek-OCR shows us that
the limit of reading lies not in the number of words, but in imagination.

📎 References:

GitHub: https://github.com/deepseek-ai/DeepSeek-OCR
