LlamaOCR - Building your Own Private OCR System
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
LlamaOCR turns screenshots and scanned documents into editable Markdown by using a vision-capable Llama 3.2 model hosted via Together AI. The practical payoff is straightforward: text that’s trapped inside images—receipts, UI screenshots, and other non-editable content—can be converted into structured, copyable output that’s usable downstream in search, agents, and RAG pipelines.
In live examples, the OCR quality is generally strong but not guaranteed. Because the underlying vision model is stochastic, repeated runs can produce different formatting and even occasional omissions—such as missing a brand name or misplacing elements. Still, receipts often come out with correct prices, subtotals, and tax lines, and the output typically includes barcodes/TC numbers somewhere in the extracted text. The conversion sometimes struggles with Markdown structure: headings may be inferred where they don’t truly belong, and content can appear out of order. Those inconsistencies are the tradeoff for using a general-purpose multimodal model rather than a purpose-built OCR engine.
The implementation is also presented as unusually simple. The original LlamaOCR package is essentially a single TypeScript file that imports Together AI, selects a vision model (notably the 90B variant), and sends an image plus a carefully constrained prompt. A key prompt instruction forces the model to return only Markdown—no extra commentary—because earlier managed-service attempts sometimes injected explanations alongside the extracted text.
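A minimal sketch of that kind of constrained prompt, in the Python register used for the recreation below (the wording is an illustrative paraphrase, not the package's exact prompt):

```python
# Illustrative paraphrase of a Markdown-only OCR prompt; the original
# LlamaOCR package's exact wording may differ.
OCR_PROMPT = (
    "Convert the provided image into Markdown. "
    "Return ONLY the Markdown content, with no explanations, no code "
    "fences, and no extra commentary. Preserve all visible text, "
    "including headers, line items, subtotals, and numbers, exactly "
    "as they appear."
)

def build_messages(image_data_url: str) -> list:
    """Pair the constrained prompt with an image for a chat-style vision API."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": OCR_PROMPT},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ],
    }]
```

The "ONLY the Markdown" constraint is the load-bearing part: without it, vision models tend to wrap the extraction in explanatory commentary.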
Recreating the workflow in Python follows the same pattern: use the Together API with a vision model, filter for supported vision endpoints, and send either an image URL or a locally encoded image. For local images, the transcript emphasizes correct base64 encoding and matching MIME types (JPEG, PNG, GIF, WebP), since the API expects the encoded data to align with the declared format. It also provides a cost lens: vision token usage scales with image size (roughly 1,600 to 6,400 tokens per image), and Together's pricing differs sharply between models (about $1.20 per million tokens for the 90B model versus about $0.18 per million for the 11B model), so the smaller model can be far cheaper when its latency and accuracy are acceptable.
Prompting matters beyond “OCR vs description.” One approach uses a prompt that describes the screenshot/UI and another that extracts OCR text; splitting these into two passes can improve control. The transcript also notes that even with good prompting, structure can be unreliable—especially for layouts where the model mislabels lines as headings.
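The two-pass split can be sketched like this, with the model call injected as a plain callable (prompt wording and the `ask` helper are illustrative, not from the transcript):

```python
# Two separate prompts: one pass to describe the UI/layout, one pass
# for strict OCR extraction. Wording is illustrative.
DESCRIBE_PROMPT = (
    "Describe this screenshot: what kind of page or UI it shows, what "
    "sections are visible, and how they are arranged."
)
EXTRACT_PROMPT = (
    "Extract all text in this image as Markdown. Output only the "
    "Markdown, with no commentary."
)

def two_pass(ask, image_url: str) -> dict:
    """Run description and OCR as separate calls for better control.

    `ask` is any callable (prompt, image_url) -> str, e.g. a thin
    wrapper around a vision chat-completions endpoint.
    """
    return {
        "description": ask(DESCRIBE_PROMPT, image_url),
        "ocr": ask(EXTRACT_PROMPT, image_url),
    }
```

Keeping the passes separate means a formatting failure in one does not contaminate the other, and each prompt can be tuned independently.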
For harder OCR tasks, two mitigation strategies are proposed. First, use Regions of Interest: train an object detection model to locate relevant areas (like ID card fields), then run OCR only on those cropped regions to preserve structure. Second, run OCR multiple times with a smaller model and then use a larger model as a judge to reach a consensus, leveraging the tendency for errors to be inconsistent.
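The second strategy, multi-run consensus with a judge, can be sketched as follows; `run_ocr` and `judge` are hypothetical callables standing in for the small and large model calls:

```python
from collections import Counter

def consensus_ocr(run_ocr, judge, image_url: str, n_runs: int = 3) -> str:
    """Run a cheap OCR model several times, then resolve disagreements.

    Because errors from a stochastic model tend to be inconsistent across
    runs, a clear majority can be accepted directly; only conflicting runs
    are escalated to the larger judge model.
    """
    outputs = [run_ocr(image_url) for _ in range(n_runs)]
    best, votes = Counter(outputs).most_common(1)[0]
    if votes > n_runs // 2:
        return best               # clear majority: no judge call needed
    return judge(outputs)         # escalate disagreement to the larger model
```

In practice the vote would be taken per line or per field rather than on whole outputs, but the economics are the same: several cheap 11B runs plus an occasional 90B judgment can undercut running the 90B model every time.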
Finally, the system’s value expands into agentic scraping. A practical pipeline scrapes HTML into Markdown, downloads images, runs OCR (and optionally image descriptions) on them, and then combines the results so a downstream LLM can extract the most relevant information. The same multimodal extraction can feed multimodal RAG, including charts, plots, and diagrams—not just plain text—making LlamaOCR a building block for private, end-to-end information extraction systems.
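The pipeline shape described above might be wired together like this; `fetch_markdown`, `list_image_urls`, and `ocr` are hypothetical injected helpers standing in for real scraping and model calls:

```python
def enrich_page(fetch_markdown, list_image_urls, ocr, url: str) -> str:
    """Combine a page's HTML-to-Markdown conversion with OCR of its images.

    Each image's extracted text is appended under the page body so a
    downstream LLM (or a multimodal RAG index) sees both sources at once.
    All three helpers are injected callables: (url) -> str Markdown,
    (url) -> list of image URLs, and (image_url) -> str extracted text.
    """
    sections = [fetch_markdown(url)]
    for img_url in list_image_urls(url):
        sections.append(f"\n## Image: {img_url}\n{ocr(img_url)}")
    return "\n".join(sections)
```

Because the helpers are injected, the same skeleton works whether the OCR step is LlamaOCR-style Markdown extraction, an image-description prompt, or both run as separate passes.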
Cornell Notes
LlamaOCR converts screenshots and scanned images into editable Markdown by sending images to a vision-capable Llama 3.2 model via Together AI. Output quality is often good for receipts and UI screenshots, but results vary run-to-run because the model is stochastic, and Markdown structure (like headings and ordering) can be imperfect. The transcript shows how to recreate the workflow in Python: choose a vision model (11B for lower cost, 90B for higher capability), send an image URL or base64-encoded local image with the correct MIME type, and use a prompt that forces Markdown-only output. For tougher layouts, it recommends Regions of Interest (object detection + targeted OCR) or multi-pass OCR with a larger model acting as a judge. These techniques support agentic web scraping and multimodal RAG by extracting both text and image content.
Why does LlamaOCR sometimes produce different results for the same receipt or screenshot?
What prompt constraint is central to getting clean OCR output?
How do you send images to Together AI in the recreated Python workflow?
How does image size affect token usage and cost for vision models?
What strategies help when OCR structure (headings, ordering, fields) is unreliable?
How does this OCR approach fit into agentic scraping and RAG?
Review Questions
- When would you prefer the 11B vision model over the 90B model, and what tradeoffs might you expect?
- How would you design a two-pass prompt strategy to separate UI description from OCR text extraction?
- What are the main failure modes of general-purpose vision OCR on structured documents, and how do Regions of Interest or consensus judging address them?
Key Points
1. LlamaOCR uses Together AI’s vision-enabled Llama 3.2 model to convert screenshots into editable Markdown, making image-based text usable for downstream systems.
2. OCR quality is generally strong but varies across runs due to stochastic model behavior, and Markdown structure (headings/order) can be imperfect.
3. Recreating the system in Python mainly involves selecting a vision model, sending an image URL or base64-encoded local image, and using a Markdown-only prompt to avoid extra commentary.
4. Correct MIME type handling is essential for local images: JPEG/PNG/GIF/WebP encodings must match the declared content type.
5. Vision token usage scales with image size (roughly 1,600–6,400 tokens), so model choice directly affects cost (11B vs 90B pricing differs sharply).
6. For high-precision or layout-critical OCR, Regions of Interest (object detection + cropped OCR) can enforce structure.
7. Agentic scraping improves when OCR is combined with HTML-to-Markdown conversion and image extraction, enabling multimodal RAG from text plus charts/diagrams.