Building a Vision App with Ollama Structured Outputs
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to their channel.
Structured outputs enforce schema-shaped JSON responses, making extracted data easier to validate, index, and reuse in apps.
Briefing
Structured outputs in Ollama make it practical to turn both text and images into validated, schema-shaped data—locally—using Python classes (Pydantic) or JavaScript schemas (Zod). That shift matters because it replaces brittle “free-form” extraction with outputs that can be checked, indexed, and fed directly into apps and downstream automation, without needing an agent framework.
The core workflow is straightforward: define a schema for the expected result, pass that schema to the model, and validate the returned JSON against it. In Python, the approach centers on Pydantic models; in JavaScript, it uses Zod. The transcript emphasizes that this isn’t just about getting “some text back.” It’s about reliably extracting entities, nesting structured objects, and pulling specific fields out of model responses—then using those fields programmatically.
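In Python, that loop looks roughly like the sketch below, which assumes the `ollama` Python client with structured-output support and a locally pulled `llama3.1` model; the `CityInfo` class and its fields are illustrative, not taken from the video:

```python
# Minimal schema -> model -> validate loop (sketch; field names are illustrative).
from pydantic import BaseModel
from ollama import chat


class CityInfo(BaseModel):
    name: str
    country: str
    population: int


response = chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me about the city of Tokyo."}],
    # Pass the Pydantic-generated JSON schema so the model returns matching JSON.
    format=CityInfo.model_json_schema(),
)

# Validate the returned JSON against the schema; this raises if it doesn't conform.
city = CityInfo.model_validate_json(response.message.content)
print(city)
```

The same pattern applies in JavaScript with a Zod schema in place of the Pydantic class.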
A first example demonstrates named-entity recognition (NER) via a simple class that can hold organizations, products, people, and locations. Results vary depending on the model and prompt quality. Using an off-the-shelf Llama 3.1 setup yields mixed accuracy—some organizations and products land correctly, but certain mentions (like specific model names) and some people/organizations can be missed. Switching to a different Llama variant (Llama 3.2) and adding a system prompt improves consistency, especially for organizations and people. The transcript also notes an iterative path: treat the initial extraction as a starting point, then refine prompts, choose better models, and even build datasets (for example using stronger models) to fine-tune for the exact entity types needed.
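One plausible shape for that NER class, written against the same `ollama` client, is sketched below; the exact field names, system prompt wording, and sample text are assumptions rather than quotes from the video:

```python
# Entity-extraction sketch: a flat schema with one list per entity type.
from pydantic import BaseModel
from ollama import chat


class EntityList(BaseModel):
    organizations: list[str]
    products: list[str]
    people: list[str]
    locations: list[str]


# Illustrative system prompt and input text (not the video's exact wording).
system_prompt = (
    "You are an expert at named-entity recognition. Extract every organization, "
    "product, person, and location mentioned in the user's text."
)
article_text = "Apple announced the iPhone 16 in Cupertino, with Tim Cook presenting."

response = chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": article_text},
    ],
    format=EntityList.model_json_schema(),
)

entities = EntityList.model_validate_json(response.message.content)
print(entities.organizations, entities.people)
```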
The next example shifts from entity extraction to image understanding. A bookstore photo is processed with a vision model using a schema that nests a list of “book” objects (title, author, confidence score), plus scene-level fields like a summary and number-of-books category. The results show the promise and the limits: the system can identify some book titles and authors correctly, detect that the scene is a bookshelf with rows of books, and even pick up some text and colors. But it also misses side text and struggles when prompts are generic or when the image quality is limited.
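A sketch of such a nested schema follows, assuming `llama3.2-vision` is available locally; the confidence scale, the number-of-books categories, and the image path are placeholders rather than the video's exact values:

```python
# Nested vision schema: a list of Book objects inside a scene-level object.
from typing import Literal
from pydantic import BaseModel
from ollama import chat


class Book(BaseModel):
    title: str
    author: str
    confidence: float  # model's own 0-1 estimate of how sure it is


class Bookshelf(BaseModel):
    books: list[Book]
    scene_summary: str
    number_of_books: Literal["none", "a few", "many"]  # illustrative categories


response = chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "List the books you can read in this photo and describe the scene.",
        "images": ["bookstore.jpg"],  # placeholder path to the bookstore photo
    }],
    format=Bookshelf.model_json_schema(),
)

shelf = Bookshelf.model_validate_json(response.message.content)
```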
Finally, a mini app extracts track listings from the backs of album covers. Using Llama 3.2 Vision, the app reads raw text from images and returns structured album details: album title plus a list of songs, with optional durations only when they appear in the source. The implementation runs as plain Python—no agent framework—processing a list of image files and saving results as Markdown. The transcript highlights practical outcomes: one album cover yields a correct track list, another includes times that align well with the printed durations, and blurry low-resolution images lead to partial transcription rather than fabricated times (helped by the schema design).
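A minimal sketch of that pipeline under the same assumptions (the `ollama` client, `llama3.2-vision`, and a `covers/` folder of JPEGs) is shown below; the prompt wording, field names, and Markdown layout are illustrative:

```python
# Album-cover pipeline sketch: plain Python, no agent framework.
from pathlib import Path
from typing import Optional
from pydantic import BaseModel
from ollama import chat


class Song(BaseModel):
    title: str
    duration: Optional[str] = None  # filled in only when a time is printed on the cover


class Album(BaseModel):
    album_title: str
    songs: list[Song]


def extract_album(image_path: Path) -> Album:
    response = chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Read the track listing on this album cover. "
                       "Include a duration only if one is printed next to the song.",
            "images": [str(image_path)],
        }],
        format=Album.model_json_schema(),
    )
    return Album.model_validate_json(response.message.content)


# Process every cover image and save one Markdown file per album.
for image_path in Path("covers").glob("*.jpg"):
    album = extract_album(image_path)
    lines = [f"# {album.album_title}", ""]
    lines += [
        f"- {song.title}" + (f" ({song.duration})" if song.duration else "")
        for song in album.songs
    ]
    Path(f"{image_path.stem}.md").write_text("\n".join(lines))
```

Making `duration` optional is what lets the model omit times entirely when the source image doesn't show them, rather than inventing values to satisfy the schema.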
Overall, structured outputs are presented as a way to build small, task-specific “vision-to-database” pipelines that run locally for privacy and cost control. Once the prompts and schemas are tuned—and optionally fine-tuned for a domain—the same extraction logic can be reused repeatedly as a background job, feeding databases, RAG systems, or other agent workflows without the overhead of complex orchestration.
Cornell Notes
Structured outputs in Ollama let developers turn text and image data into validated, JSON-shaped results instead of unreliable free-form text. Developers define Pydantic classes (Python) or Zod schemas (JavaScript), validate the model’s output against that schema, and then read individual fields such as entities, nested objects, or album track lists. The transcript shows entity extraction from text, vision-based book detection from a bookshelf image, and a Python mini app that transcribes album-cover track listings into structured album/song data (with durations included only when they appear in the source). This matters because it enables local, privacy-friendly pipelines that feed apps and downstream systems with consistent data formats.
How do structured outputs change the reliability of LLM extraction tasks?
What role do system prompts and model choice play in entity extraction?
How does schema nesting work in vision tasks like book detection?
Why does the album-cover app avoid hallucinating song durations?
What makes the mini app approach practical without an agent framework?
Review Questions
- When would you prefer schema-driven extraction over free-form prompting for an extraction pipeline?
- In the NER example, what changes improved results, and why might that matter when building a dataset for fine-tuning?
- How does the album-cover schema design influence whether durations appear in the output?
Key Points
1. Structured outputs enforce schema-shaped JSON responses, making extracted data easier to validate, index, and reuse in apps.
2. Python implementations rely on Pydantic classes; JavaScript implementations rely on Zod schemas for the same schema-driven approach.
3. Entity extraction quality depends heavily on both model choice and prompt quality, with system prompts improving consistency.
4. Vision extraction works best when prompts and schemas explicitly reflect the target domain (e.g., books, album track listings) and when image quality is sufficient.
5. Schema nesting (lists of objects inside a top-level object) supports realistic outputs like “books on a shelf” or “songs in an album.”
6. Designing optional fields in the schema can prevent hallucinations—for example, including durations only when they appear in the image.
7. Local execution supports privacy and cost control, enabling repeatable background jobs that feed databases or RAG systems.