DeepSeek OCR - More than OCR
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
DeepSeek OCR’s headline idea isn’t better document reading; it’s a new way to compress long text into far fewer “vision tokens” and then decode it back with high accuracy. The practical payoff is memory: large language models struggle with very long contexts because text costs roughly one token per word, so pushing context windows from millions toward 10 million+ tokens quickly becomes expensive. DeepSeek’s approach aims to store millions of text tokens as an image-like representation that fits into a much smaller context window.
The method centers on “contexts optical compression,” treating vision as a compression medium for text rather than just a way to interpret pictures. Instead of converting an image to tokens and treating those tokens as ordinary visual features, the system is trained so that a small number of vision tokens can be decoded back into a much larger amount of text. The transcript cites results such as using 100 vision tokens to recover about 1,000 text tokens with roughly 97% accuracy, and still retaining around 60% accuracy at 20× compression (50 vision tokens for 1,000 text tokens). Although the paper is framed around OCR, the underlying claim is that this could become a general-purpose memory-compression mechanism for long-context AI.
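To make those ratios concrete, here is a back-of-envelope check: the compression ratio is simply text tokens recovered per vision token spent. The numbers are the ones quoted in the transcript; the function itself is just illustrative arithmetic.

```python
# Compression ratio = text tokens recovered / vision tokens spent.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# Figures quoted in the transcript:
print(compression_ratio(1000, 100))  # 10.0x -> ~97% decoding accuracy reported
print(compression_ratio(1000, 50))   # 20.0x -> ~60% decoding accuracy reported
```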
A key engineering piece is a two-stage “deep encoder” designed to avoid the token explosion that typical vision encoders cause. Stage one uses a SAM model (about 80 million parameters) to apply attention at high resolution, effectively deciding which details matter; a CNN then compresses the image representation by 16×. Stage two feeds the compressed representation into a CLIP-style model, which applies global attention and produces an efficient set of vision tokens. The system also supports multiple “zoom levels,” emitting a different token count per mode: 64 tokens in tiny mode, 100 in small mode, 256 in base mode, and up to about 1,800 tokens in the largest (“Gundam”) mode.
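The shape bookkeeping is easier to see in code. Below is a minimal, runnable PyTorch sketch of the two-stage flow, assuming a 1024×1024 input and 16×16 patches (our assumptions for illustration). Stand-in attention layers replace the real SAM and CLIP blocks; the point is only to show where the 16× token reduction happens.

```python
# Minimal sketch of the two-stage encoder's token flow (not the real model).
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stage 1 stand-in for SAM: patchify, then attention at high resolution.
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 16x token compression: two stride-2 convs halve each spatial side
        # twice, so the token count drops by 4 * 4 = 16.
        self.compress = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        # Stage 2 stand-in for CLIP: global attention over the compressed grid.
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, img):
        x = self.patchify(img)                   # (B, dim, 64, 64) -> 4096 patches
        b, d, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, 4096, dim)
        t = self.local_attn(t)                   # high-res stage (windowed in the real model)
        x = t.transpose(1, 2).reshape(b, d, h, w)
        x = self.compress(x)                     # (B, dim, 16, 16) -> 256 tokens
        t = x.flatten(2).transpose(1, 2)         # (B, 256, dim)
        return self.global_attn(t)               # global attention is cheap at 256 tokens

enc = DeepEncoderSketch()
out = enc(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 256, 256]) -- 4096 patches squeezed to 256 tokens
```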
The transcript contrasts this with a conventional text-tokenization approach for long documents, which might require around 6,000 text tokens for a given document, along with heavy compute and memory in the language model. DeepSeek’s encoder can represent the same document in under 800 vision tokens, with reported performance that can be better despite the compression.
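As a rough illustration of why that matters, compare the two budgets under the common assumption that self-attention cost grows quadratically with sequence length (KV-cache memory grows only linearly, so the real saving sits somewhere between the two figures):

```python
text_tokens, vision_tokens = 6000, 800       # figures cited in the transcript

print(text_tokens / vision_tokens)           # 7.5x fewer tokens in context (linear KV-cache saving)
print((text_tokens / vision_tokens) ** 2)    # ~56x cheaper attention, if cost is quadratic
```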
Benchmarks reported in the OCR setting suggest accuracy stays above 95% as long as compression remains around 10×. The work is still presented as a proof of concept for the compression idea: it demonstrates the concept on OCR tasks, but scaling to extreme ratios (for example, substituting hundreds of thousands of vision tokens for millions of text tokens) remains an open question. Still, the implications are clear: rendering older conversation history into image-like token bundles could let systems keep 10–20 million text-token equivalents in context using dramatically fewer tokens, enabling longer in-context learning and retrieval without the usual cost explosion.
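That “render old history as an image” idea can be sketched in a few lines of Pillow. Everything below (page size, font, wrapping) is an arbitrary choice for illustration, not a detail from the video; the resulting page is what would be fed to the vision encoder in place of thousands of text tokens.

```python
from PIL import Image, ImageDraw

def render_history(text: str, width: int = 1024, height: int = 1024) -> Image.Image:
    """Render conversation text onto a fixed-size page (toy layout)."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    # Naive fixed-width wrapping; a real system would typeset carefully
    # so the OCR-style decoder can recover the text reliably.
    chars_per_line = 100
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    for row, line in enumerate(lines[:60]):            # cap at one page of lines
        draw.text((8, 8 + 16 * row), line, fill="black")
    return page

page = render_history("user: hello | assistant: hi there! " * 150)
page.save("history_page.png")  # this page, not the raw text, enters the context
```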
Alongside the model, the transcript notes a decoder described as “DeepSeek 3B” with about 570 million active parameters (the 3B-total versus 570M-active split points to a mixture-of-experts design), and points to released code on GitHub and a model on Hugging Face. It also situates the work among other OCR-focused systems such as Nanonets OCR2 and PaddleOCR-VL, but positions DeepSeek’s contribution as a broader shift in how text can be stored and retrieved inside transformer pipelines: vision-token compression in place of raw text tokens.
Cornell Notes
DeepSeek OCR reframes OCR as a testbed for “contexts optical compression,” a technique that stores text as a small set of vision tokens and then decodes it back into text with high fidelity. The transcript highlights results like recovering ~1,000 text tokens from 100 vision tokens at about 97% accuracy, and still achieving ~60% accuracy at 20× compression. A two-stage deep encoder makes this feasible: SAM focuses high-resolution attention, a CNN compresses by 16×, and a CLIP-style model applies global attention to produce efficient vision tokens. Multiple token “zoom levels” let the system trade off token count against detail. If this generalizes beyond OCR, it could enable long-context memory by representing millions of text tokens with far fewer vision tokens.
- Why does long-context processing become expensive for large language models, and what does DeepSeek OCR try to change?
- What is “contexts optical compression,” and what compression ratios are cited?
- How does the two-stage deep encoder reduce the number of tokens compared with typical vision pipelines?
- What do the different “zoom levels” (tiny/small/base/Gundam) accomplish?
- How does the approach compare to representing a document with standard text tokens?
- Why is the work described as more than OCR, and what remains uncertain?
Review Questions
- What token-budget problem limits long-context language models, and how does vision-token compression address it?
- Describe the roles of SAM, the CNN 16× compression step, and the CLIP-style global attention stage in the DeepSeek encoder.
- What accuracy and compression trade-offs are reported for OCR, and what do they imply for using this method as long-term memory?
Key Points
1. DeepSeek OCR’s central contribution is compressing text into far fewer vision tokens, then decoding it back with high accuracy, aiming to reduce long-context costs.
2. “Contexts optical compression” is framed as using vision as a compression algorithm for text, not merely improving OCR.
3. A two-stage deep encoder uses SAM for high-resolution attention, a CNN to compress by 16×, and a CLIP-style model for global attention over the compressed features.
4. The system supports multiple token “zoom levels,” ranging from about 64 to about 1,800 vision tokens, enabling an adjustable trade-off between detail and token budget.
5. Reported results include ~97% accuracy when decoding ~1,000 text tokens from 100 vision tokens, and ~60% accuracy at 20× compression.
6. OCR benchmarks show >95% accuracy around 10× compression, but generalization beyond OCR and extreme scaling remain open questions.
7. The work is positioned as a potential long-context memory mechanism that could represent millions of text tokens using dramatically fewer tokens.