
LlamaOCR - Building your Own Private OCR System

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LlamaOCR uses Together AI’s vision-enabled Llama 3.2 model to convert screenshots into editable Markdown, making image-based text usable for downstream systems.

Briefing

LlamaOCR turns screenshots and scanned documents into editable Markdown by using a vision-capable Llama 3.2 model hosted via Together AI. The practical payoff is straightforward: text that’s trapped inside images—receipts, UI screenshots, and other non-editable content—can be converted into structured, copyable output that’s usable downstream in search, agents, and RAG pipelines.

In live examples, the OCR quality is generally strong but not guaranteed. Because the underlying vision model is stochastic, repeated runs can produce different formatting and even occasional omissions—such as missing a brand name or misplacing elements. Still, receipts often come out with correct prices, subtotals, and tax lines, and the output typically includes barcodes/TC numbers somewhere in the extracted text. The conversion sometimes struggles with Markdown structure: headings may be inferred where they don’t truly belong, and content can appear out of order. Those inconsistencies are the tradeoff for using a general-purpose multimodal model rather than a purpose-built OCR engine.

The implementation is also presented as unusually simple. The original LlamaOCR package is essentially a single TypeScript file that imports the Together AI client, selects a vision model (notably the 90B variant), and sends an image along with a carefully constrained prompt. A key prompt instruction forces the model to return only Markdown, with no extra commentary, because earlier managed-service attempts sometimes injected explanations alongside the extracted text.

Recreating the workflow in Python follows the same pattern: use the Together API with a vision model, filter for supported vision endpoints, and send either an image URL or a locally encoded image. For local images, the transcript emphasizes correct base64 encoding and matching MIME types (JPEG, PNG, GIF, WebP), since the API expects the encoded data to align with the declared format. It also provides a cost lens: vision token usage scales with image size (roughly 1,600 to 6,400 tokens), and Together's pricing differs sharply between models, at about $1.20 per million tokens for the 90B model versus $0.18 per million tokens for the 11B model, so the smaller model can be far cheaper when its accuracy and latency are acceptable.
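
A minimal sketch of that pattern using Together's Python SDK follows; the model ID, prompt wording, and file name are illustrative choices, not the transcript's exact code:

```python
import base64
import mimetypes
from together import Together  # pip install together

client = Together()  # reads TOGETHER_API_KEY from the environment

MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo"  # assumed model ID
PROMPT = (
    "Convert this image to Markdown. Return only the Markdown content, "
    "with no extra explanation or commentary."
)

def image_to_data_uri(path: str) -> str:
    """Base64-encode a local image with a MIME type that matches the file."""
    mime, _ = mimetypes.guess_type(path)  # e.g. image/jpeg, image/png
    if mime not in ("image/jpeg", "image/png", "image/gif", "image/webp"):
        raise ValueError(f"unsupported image type: {mime}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

def ocr_image(image_url: str, prompt: str = PROMPT, model: str = MODEL) -> str:
    """Send an image (remote URL or data URI) plus the Markdown-only prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ocr_image(image_to_data_uri("receipt.jpg")))
```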

Prompting matters beyond “OCR vs description.” One approach uses a prompt that describes the screenshot/UI and another that extracts OCR text; splitting these into two passes can improve control. The transcript also notes that even with good prompting, structure can be unreliable—especially for layouts where the model mislabels lines as headings.
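
Building on the sketch above, the two-pass split is just two prompts over the same image; the prompt wording here is illustrative, not the transcript's exact text:

```python
DESCRIBE_PROMPT = (
    "Describe this screenshot: what kind of page or UI is it, "
    "and what are the main visual elements?"
)
EXTRACT_PROMPT = (
    "Extract all text in this image as Markdown. "
    "Return only the Markdown, with no commentary."
)

image = image_to_data_uri("screenshot.png")
description = ocr_image(image, prompt=DESCRIBE_PROMPT)  # pass 1: what is this?
text = ocr_image(image, prompt=EXTRACT_PROMPT)          # pass 2: OCR only
```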

For harder OCR tasks, two mitigation strategies are proposed. First, use Regions of Interest: train an object detection model to locate relevant areas (like ID card fields), then run OCR only on those cropped regions to preserve structure. Second, run OCR multiple times with a smaller model and then use a larger model as a judge to reach a consensus, leveraging the tendency for errors to be inconsistent.
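
A sketch of the Regions of Interest idea, assuming an object detector that returns labeled bounding boxes; the boxes and field names below are placeholders, not the transcript's code:

```python
from PIL import Image  # pip install pillow

def ocr_regions(path: str, boxes: dict[str, tuple[int, int, int, int]]) -> dict[str, str]:
    """Crop each detected region and OCR it separately to preserve structure."""
    image = Image.open(path)
    results = {}
    for field, box in boxes.items():
        crop_path = f"{field}.png"
        image.crop(box).save(crop_path)  # box = (left, upper, right, lower)
        results[field] = ocr_image(image_to_data_uri(crop_path))
    return results

# In practice these boxes would come from a trained object detection model.
fields = ocr_regions("id_card.png", {
    "name":      (40, 120, 400, 160),
    "id_number": (40, 200, 400, 240),
})
```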

Finally, the system’s value expands into agentic scraping. A practical pipeline scrapes HTML into Markdown, downloads images, runs OCR (and optionally image descriptions) on them, and then combines the results so a downstream LLM can extract the most relevant information. The same multimodal extraction can feed multimodal RAG, including charts, plots, and diagrams—not just plain text—making LlamaOCR a building block for private, end-to-end information extraction systems.
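
As a rough sketch of such a pipeline, reusing the ocr_image helper from earlier; the library choices (httpx, markdownify) are assumptions, and the regex only catches Markdown-style image links:

```python
import re
import httpx                          # pip install httpx
from markdownify import markdownify   # pip install markdownify

def scrape_with_ocr(url: str) -> str:
    """Convert a page to Markdown, then append OCR output for each image found."""
    html = httpx.get(url, follow_redirects=True).text
    markdown = markdownify(html)
    # Find image sources in the converted Markdown: ![alt](src)
    # Note: relative image paths would need resolving against the page URL.
    image_urls = re.findall(r"!\[[^\]]*\]\(([^)\s]+)", markdown)
    sections = [markdown]
    for src in image_urls:
        sections.append(f"## OCR of {src}\n\n{ocr_image(src)}")
    return "\n\n".join(sections)  # combined text for a downstream LLM
```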

Cornell Notes

LlamaOCR converts screenshots and scanned images into editable Markdown by sending images to a vision-capable Llama 3.2 model via Together AI. Output quality is often good for receipts and UI screenshots, but results vary run-to-run because the model is stochastic, and Markdown structure (like headings and ordering) can be imperfect. The transcript shows how to recreate the workflow in Python: choose a vision model (11B for lower cost, 90B for higher capability), send an image URL or base64-encoded local image with the correct MIME type, and use a prompt that forces Markdown-only output. For tougher layouts, it recommends Regions of Interest (object detection + targeted OCR) or multi-pass OCR with a larger model acting as a judge. These techniques support agentic web scraping and multimodal RAG by extracting both text and image content.

Why does LlamaOCR sometimes produce different results for the same receipt or screenshot?

The extraction relies on a stochastic vision model, so repeated runs can yield different formatting and occasional omissions. The transcript gives examples where one run captures item names and formatting more cleanly than another, even when prices and totals are often still correct. This variability also affects Markdown structure—headings may be inferred incorrectly, and content can appear out of order.

What prompt constraint is central to getting clean OCR output?

A key instruction in the prompt forces the model to return only Markdown content, with no additional explanations or comments. The transcript contrasts this with earlier managed-service behavior that sometimes added commentary alongside the extracted text, which is undesirable when the goal is machine-readable OCR output.

How do you send images to Together AI in the recreated Python workflow?

Two main options are used: (1) provide an image URL, letting the API fetch the image, or (2) load a local image, base64-encode it, and send it with the correct MIME type. The transcript stresses that the MIME type must match the encoding (JPEG vs PNG vs GIF vs WebP), otherwise the API may not interpret the image correctly.
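
With helpers like those sketched earlier, the two options reduce to the following; the URL and file names are placeholders:

```python
# Option 1: remote URL; the API fetches the image itself.
md_remote = ocr_image("https://example.com/receipt.jpg")

# Option 2: local file; base64-encode it with a matching MIME type.
md_local = ocr_image(image_to_data_uri("receipt.png"))  # data:image/png;base64,...
```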

How does image size affect token usage and cost for vision models?

Vision inputs convert into tokens based on image size, with the transcript citing a range of roughly 1,600 to 6,400 tokens depending on dimensions. Pricing then depends on the model: the 90B model is about $1.20 per million tokens, while the 11B model is about $0.18 per million tokens, so choosing the smaller model can cut costs substantially when its accuracy is sufficient.
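
Plugging in those figures gives a rough per-image cost (token counts and prices as cited above):

```python
TOKENS_LOW, TOKENS_HIGH = 1_600, 6_400        # tokens per image, by size
PRICES = {"90B": 1.20e-6, "11B": 0.18e-6}     # dollars per token

for name, price in PRICES.items():
    print(f"{name}: ${TOKENS_LOW * price:.5f} to ${TOKENS_HIGH * price:.5f} per image")
# 90B: $0.00192 to $0.00768 per image
# 11B: $0.00029 to $0.00115 per image
```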

What strategies help when OCR structure (headings, ordering, fields) is unreliable?

Two strategies are proposed. Regions of Interest: train an object detection model to find relevant parts (e.g., ID card fields), then run OCR on those cropped regions to preserve layout. Multi-pass + judge: run OCR multiple times (e.g., three) and use a larger model to reconcile outputs into a consensus, since errors often don’t repeat consistently.
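
A sketch of the multi-pass-plus-judge approach, reusing the ocr_image helper and client from earlier; the judge model and prompt are assumptions:

```python
JUDGE_PROMPT = (
    "Below are three OCR transcriptions of the same document. "
    "Errors tend to be inconsistent between runs. Produce a single consensus "
    "transcription in Markdown, returning only the Markdown."
)

def ocr_with_consensus(image_url: str, runs: int = 3) -> str:
    """Run the small model several times, then let a larger model reconcile."""
    drafts = [ocr_image(image_url) for _ in range(runs)]
    combined = "\n\n---\n\n".join(drafts)
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",  # assumed judge
        messages=[{"role": "user", "content": f"{JUDGE_PROMPT}\n\n{combined}"}],
    )
    return response.choices[0].message.content
```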

How does this OCR approach fit into agentic scraping and RAG?

A practical pipeline scrapes HTML into Markdown, downloads images from the page, runs OCR on those images (and optionally a separate image-description prompt), then combines everything into a form a downstream LLM can use. This supports extracting information from pages where key details appear in images, including charts, plots, and diagrams—enabling multimodal RAG-style workflows.

Review Questions

  1. When would you prefer the 11B vision model over the 90B model, and what tradeoffs might you expect?
  2. How would you design a two-pass prompt strategy to separate UI description from OCR text extraction?
  3. What are the main failure modes of general-purpose vision OCR on structured documents, and how do Regions of Interest or consensus judging address them?

Key Points

  1. LlamaOCR uses Together AI’s vision-enabled Llama 3.2 model to convert screenshots into editable Markdown, making image-based text usable for downstream systems.
  2. OCR quality is generally strong but varies across runs due to stochastic model behavior, and Markdown structure (headings/order) can be imperfect.
  3. Recreating the system in Python mainly involves selecting a vision model, sending an image URL or base64-encoded local image, and using a Markdown-only prompt to avoid extra commentary.
  4. Correct MIME type handling is essential for local images: JPEG/PNG/GIF/WebP encodings must match the declared content type.
  5. Vision token usage scales with image size (roughly 1,600–6,400 tokens), so model choice directly affects cost (11B vs 90B pricing differs).
  6. For high-precision or layout-critical OCR, Regions of Interest (object detection + cropped OCR) can enforce structure.
  7. Agentic scraping improves when OCR is combined with HTML-to-Markdown conversion and image extraction, enabling multimodal RAG from text plus charts/diagrams.

Highlights

  • The workflow hinges on a prompt that forces Markdown-only output; without it, models may add explanatory text that breaks downstream parsing.
  • Even with strong OCR, Markdown structure can drift (headings may be inferred incorrectly and content can reorder) because general vision models don’t guarantee layout fidelity.
  • A practical cost lever is model size: vision inputs translate into thousands of tokens, and Together’s per-token pricing makes 11B materially cheaper than 90B.
  • For difficult documents, Regions of Interest and multi-pass consensus are presented as two concrete ways to reduce structural and transcription errors.
  • In scraping pipelines, combining HTML-to-Markdown with image OCR (and optional image descriptions) lets agents extract information hidden in UI elements and graphics.

Topics

  • LlamaOCR
  • Vision OCR
  • Together AI
  • Python Implementation
  • Agentic Scraping
