DeepSeek OCR - More than OCR
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
DeepSeek OCR’s headline idea isn’t better document reading; it’s a new way to compress long text into far fewer “vision tokens” and then decode it back with high accuracy. The practical payoff is memory: large language models struggle with very long contexts because text costs roughly one token per word, so pushing context windows from millions toward 10 million+ tokens quickly becomes expensive. DeepSeek’s approach aims to store millions of text tokens as an image-like representation that fits into a much smaller context window.
The method centers on “contexts optical compression,” treating vision as a compression medium for text rather than just a way to interpret pictures. Instead of converting an image to tokens and treating those tokens as ordinary visual features, the system is trained so that a small number of vision tokens can be decoded back into a much larger amount of text. The transcript cites results such as using 100 vision tokens to recover about 1,000 text tokens with roughly 97% accuracy, and still retaining around 60% accuracy at 20× compression (50 vision tokens for 1,000 text tokens). Although the paper is framed around OCR, the underlying claim is that this could become a general-purpose memory-compression mechanism for long-context AI.
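To make those ratios concrete, here is a back-of-envelope check: the compression ratio is simply text tokens recovered per vision token spent. The numbers are the ones quoted in the transcript; the function itself is just illustrative arithmetic.

```python
# Compression ratio = text tokens recovered / vision tokens spent.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# Figures quoted in the transcript:
print(compression_ratio(1000, 100))  # 10.0x -> ~97% decoding accuracy reported
print(compression_ratio(1000, 50))   # 20.0x -> ~60% decoding accuracy reported
```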
A key engineering piece is a two-stage “deep encoder” designed to avoid the token explosion that typical vision encoders cause. Stage one uses a SAM model (about 80 million parameters) to apply attention at high resolution, effectively deciding which details matter; a CNN then compresses the image representation by 16×. Stage two feeds the compressed representation into a CLIP-style model, which applies global attention and produces an efficient set of vision tokens. The system also supports multiple “zoom levels,” emitting a different token count per mode: 64 tokens in tiny mode, 100 in small mode, 256 in base mode, and up to about 1,800 tokens in the largest (“Gundam”) mode.
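The shape bookkeeping is easier to see in code. Below is a minimal, runnable PyTorch sketch of the two-stage flow, assuming a 1024×1024 input and 16×16 patches (our assumptions for illustration). Stand-in attention layers replace the real SAM and CLIP blocks; the point is only to show where the 16× token reduction happens.

```python
# Minimal sketch of the two-stage encoder's token flow (not the real model).
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stage 1 stand-in for SAM: patchify, then attention at high resolution.
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 16x token compression: two stride-2 convs halve each spatial side
        # twice, so the token count drops by 4 * 4 = 16.
        self.compress = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        # Stage 2 stand-in for CLIP: global attention over the compressed grid.
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, img):
        x = self.patchify(img)                   # (B, dim, 64, 64) -> 4096 patches
        b, d, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, 4096, dim)
        t = self.local_attn(t)                   # high-res stage (windowed in the real model)
        x = t.transpose(1, 2).reshape(b, d, h, w)
        x = self.compress(x)                     # (B, dim, 16, 16) -> 256 tokens
        t = x.flatten(2).transpose(1, 2)         # (B, 256, dim)
        return self.global_attn(t)               # global attention is cheap at 256 tokens

enc = DeepEncoderSketch()
out = enc(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 256, 256]) -- 4096 patches squeezed to 256 tokens
```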
The transcript contrasts this with a conventional text-tokenization approach for long documents, which might require around 6,000 text tokens for a given document, along with heavy compute and memory in the language model. DeepSeek’s encoder can represent the same document in under 800 vision tokens, with reported performance that can be better despite the compression.
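As a rough illustration of why that matters, compare the two budgets under the common assumption that self-attention cost grows quadratically with sequence length (KV-cache memory grows only linearly, so the real saving sits somewhere between the two figures):

```python
text_tokens, vision_tokens = 6000, 800       # figures cited in the transcript

print(text_tokens / vision_tokens)           # 7.5x fewer tokens in context (linear KV-cache saving)
print((text_tokens / vision_tokens) ** 2)    # ~56x cheaper attention, if cost is quadratic
```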
Benchmarks reported in the OCR setting suggest accuracy stays above 95% as long as compression remains around 10×. The work is still presented as a proof of concept for the compression idea: it demonstrates the concept on OCR tasks, but scaling to extreme ratios (for example, substituting hundreds of thousands of vision tokens for millions of text tokens) remains an open question. Still, the implications are clear: rendering older conversation history into image-like token bundles could let systems keep 10–20 million text-token equivalents in context using dramatically fewer tokens, enabling longer in-context learning and retrieval without the usual cost explosion.
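That “render old history as an image” idea can be sketched in a few lines of Pillow. Everything below (page size, font, wrapping) is an arbitrary choice for illustration, not a detail from the video; the resulting page is what would be fed to the vision encoder in place of thousands of text tokens.

```python
from PIL import Image, ImageDraw

def render_history(text: str, width: int = 1024, height: int = 1024) -> Image.Image:
    """Render conversation text onto a fixed-size page (toy layout)."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    # Naive fixed-width wrapping; a real system would typeset carefully
    # so the OCR-style decoder can recover the text reliably.
    chars_per_line = 100
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    for row, line in enumerate(lines[:60]):            # cap at one page of lines
        draw.text((8, 8 + 16 * row), line, fill="black")
    return page

page = render_history("user: hello | assistant: hi there! " * 150)
page.save("history_page.png")  # this page, not the raw text, enters the context
```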
Alongside the model, the transcript notes a decoder described as “DeepSeek 3B” with about 570 million active parameters (the 3B-total versus 570M-active split points to a mixture-of-experts design), and points to released code on GitHub and a model on Hugging Face. It also situates the work among other OCR-focused systems such as Nanonets OCR2 and PaddleOCR-VL, but positions DeepSeek’s contribution as a broader shift in how text can be stored and retrieved inside transformer pipelines: vision-token compression in place of raw text tokens.
Cornell Notes
DeepSeek OCR reframes OCR as a testbed for “contexts optical compression,” a technique that stores text as a small set of vision tokens and then decodes it back into text with high fidelity. The transcript highlights results like recovering ~1,000 text tokens from 100 vision tokens at about 97% accuracy, and still achieving ~60% accuracy at 20× compression. A two-stage deep encoder makes this feasible: SAM focuses high-resolution attention, a CNN compresses by 16×, and a CLIP-style model applies global attention to produce efficient vision tokens. Multiple token “zoom levels” let the system trade off token count against detail. If this generalizes beyond OCR, it could enable long-context memory by representing millions of text tokens with far fewer vision tokens.
- Why does long-context processing become expensive for large language models, and what does DeepSeek OCR try to change?
- What is “contexts optical compression,” and what compression ratios are cited?
- How does the two-stage deep encoder reduce the number of tokens compared with typical vision pipelines?
- What do the different “zoom levels” (tiny/small/base/Gundam) accomplish?
- How does the approach compare to representing a document with standard text tokens?
- Why is the work described as more than OCR, and what remains uncertain?
Review Questions
- What token-budget problem limits long-context language models, and how does vision-token compression address it?
- Describe the roles of SAM, the CNN 16× compression step, and the CLIP-style global attention stage in the DeepSeek encoder.
- What accuracy and compression trade-offs are reported for OCR, and what do they imply for using this method as long-term memory?
Key Points
1. DeepSeek OCR’s central contribution is compressing text into far fewer vision tokens, then decoding it back with high accuracy, aiming to reduce long-context costs.
2. “Contexts optical compression” is framed as using vision as a compression algorithm for text, not merely improving OCR.
3. A two-stage deep encoder uses SAM for high-resolution attention, a CNN to compress by 16×, and a CLIP-style model for global attention over the compressed features.
4. The system supports multiple token “zoom levels,” ranging from about 64 to about 1,800 vision tokens, enabling an adjustable trade-off between detail and token budget.
5. Reported results include ~97% accuracy when decoding ~1,000 text tokens from 100 vision tokens, and ~60% accuracy at 20× compression.
6. OCR benchmarks show >95% accuracy around 10× compression, but generalization beyond OCR and extreme scaling remain open questions.
7. The work is positioned as a potential long-context memory mechanism that could represent millions of text tokens using dramatically fewer tokens.