Building a Vision App with Ollama Structured Outputs
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to their channel.
Structured outputs enforce schema-shaped JSON responses, making extracted data easier to validate, index, and reuse in apps.
Briefing
Structured outputs in Ollama make it practical to turn both text and images into validated, schema-shaped data—locally—using Python classes (Pydantic) or JavaScript schemas (Zod). That shift matters because it replaces brittle “free-form” extraction with outputs that can be checked, indexed, and fed directly into apps and downstream automation, without needing an agent framework.
The core workflow is straightforward: define a schema for the expected result, pass that schema to the model, and validate the returned JSON against it. In Python, the approach centers on Pydantic models; in JavaScript, it uses Zod. The transcript emphasizes that this isn’t just about getting “some text back.” It’s about reliably extracting entities, nesting structured objects, and pulling specific fields out of model responses—then using those fields programmatically.
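In Python, that loop looks roughly like the sketch below, which assumes the `ollama` Python client with structured-output support and a locally pulled `llama3.1` model; the `CityInfo` class and its fields are illustrative, not taken from the video:

```python
# Minimal schema -> model -> validate loop (sketch; field names are illustrative).
from pydantic import BaseModel
from ollama import chat


class CityInfo(BaseModel):
    name: str
    country: str
    population: int


response = chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me about the city of Tokyo."}],
    # Pass the Pydantic-generated JSON schema so the model returns matching JSON.
    format=CityInfo.model_json_schema(),
)

# Validate the returned JSON against the schema; this raises if it doesn't conform.
city = CityInfo.model_validate_json(response.message.content)
print(city)
```

The same pattern applies in JavaScript with a Zod schema in place of the Pydantic class.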
A first example demonstrates named-entity recognition (NER) via a simple class that can hold organizations, products, people, and locations. Results vary depending on the model and prompt quality. Using an off-the-shelf Llama 3.1 setup yields mixed accuracy—some organizations and products land correctly, but certain mentions (like specific model names) and some people/organizations can be missed. Switching to a different Llama variant (Llama 3.2) and adding a system prompt improves consistency, especially for organizations and people. The transcript also notes an iterative path: treat the initial extraction as a starting point, then refine prompts, choose better models, and even build datasets (for example using stronger models) to fine-tune for the exact entity types needed.
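One plausible shape for that NER class, written against the same `ollama` client, is sketched below; the exact field names, system prompt wording, and sample text are assumptions rather than quotes from the video:

```python
# Entity-extraction sketch: a flat schema with one list per entity type.
from pydantic import BaseModel
from ollama import chat


class EntityList(BaseModel):
    organizations: list[str]
    products: list[str]
    people: list[str]
    locations: list[str]


# Illustrative system prompt and input text (not the video's exact wording).
system_prompt = (
    "You are an expert at named-entity recognition. Extract every organization, "
    "product, person, and location mentioned in the user's text."
)
article_text = "Apple announced the iPhone 16 in Cupertino, with Tim Cook presenting."

response = chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": article_text},
    ],
    format=EntityList.model_json_schema(),
)

entities = EntityList.model_validate_json(response.message.content)
print(entities.organizations, entities.people)
```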
The next example shifts from entity extraction to image understanding. A bookstore photo is processed with a vision model using a schema that nests a list of “book” objects (title, author, confidence score), plus scene-level fields like a summary and number-of-books category. The results show the promise and the limits: the system can identify some book titles and authors correctly, detect that the scene is a bookshelf with rows of books, and even pick up some text and colors. But it also misses side text and struggles when prompts are generic or when the image quality is limited.
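A sketch of such a nested schema follows, assuming `llama3.2-vision` is available locally; the confidence scale, the number-of-books categories, and the image path are placeholders rather than the video's exact values:

```python
# Nested vision schema: a list of Book objects inside a scene-level object.
from typing import Literal
from pydantic import BaseModel
from ollama import chat


class Book(BaseModel):
    title: str
    author: str
    confidence: float  # model's own 0-1 estimate of how sure it is


class Bookshelf(BaseModel):
    books: list[Book]
    scene_summary: str
    number_of_books: Literal["none", "a few", "many"]  # illustrative categories


response = chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "List the books you can read in this photo and describe the scene.",
        "images": ["bookstore.jpg"],  # placeholder path to the bookstore photo
    }],
    format=Bookshelf.model_json_schema(),
)

shelf = Bookshelf.model_validate_json(response.message.content)
```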
Finally, a mini app extracts track listings from the backs of album covers. Using Llama 3.2 Vision, the app reads raw text from images and returns structured album details: album title plus a list of songs, with optional durations only when they appear in the source. The implementation runs as plain Python—no agent framework—processing a list of image files and saving results as Markdown. The transcript highlights practical outcomes: one album cover yields a correct track list, another includes times that align well with the printed durations, and blurry low-resolution images lead to partial transcription rather than fabricated times (helped by the schema design).
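A minimal sketch of that pipeline under the same assumptions (the `ollama` client, `llama3.2-vision`, and a `covers/` folder of JPEGs) is shown below; the prompt wording, field names, and Markdown layout are illustrative:

```python
# Album-cover pipeline sketch: plain Python, no agent framework.
from pathlib import Path
from typing import Optional
from pydantic import BaseModel
from ollama import chat


class Song(BaseModel):
    title: str
    duration: Optional[str] = None  # filled in only when a time is printed on the cover


class Album(BaseModel):
    album_title: str
    songs: list[Song]


def extract_album(image_path: Path) -> Album:
    response = chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Read the track listing on this album cover. "
                       "Include a duration only if one is printed next to the song.",
            "images": [str(image_path)],
        }],
        format=Album.model_json_schema(),
    )
    return Album.model_validate_json(response.message.content)


# Process every cover image and save one Markdown file per album.
for image_path in Path("covers").glob("*.jpg"):
    album = extract_album(image_path)
    lines = [f"# {album.album_title}", ""]
    lines += [
        f"- {song.title}" + (f" ({song.duration})" if song.duration else "")
        for song in album.songs
    ]
    Path(f"{image_path.stem}.md").write_text("\n".join(lines))
```

Making `duration` optional is what lets the model omit times entirely when the source image doesn't show them, rather than inventing values to satisfy the schema.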
Overall, structured outputs are presented as a way to build small, task-specific “vision-to-database” pipelines that run locally for privacy and cost control. Once the prompts and schemas are tuned—and optionally fine-tuned for a domain—the same extraction logic can be reused repeatedly as a background job, feeding databases, RAG systems, or other agent workflows without the overhead of complex orchestration.
Cornell Notes
Structured outputs in Ollama let developers turn text and image data into validated, JSON-shaped results instead of unreliable free-form text. Developers define Pydantic classes (Python) or Zod schemas (JavaScript), validate the model’s output against that schema, and then read individual fields such as entities, nested objects, or album track lists. The transcript shows entity extraction from text, vision-based book detection from a bookshelf image, and a Python mini app that transcribes album-cover track listings into structured album/song data (with durations included only when they appear in the source). This matters because it enables local, privacy-friendly pipelines that feed apps and downstream systems with consistent data formats.
How do structured outputs change the reliability of LLM extraction tasks?
What role do system prompts and model choice play in entity extraction?
How does schema nesting work in vision tasks like book detection?
Why does the album-cover app avoid hallucinating song durations?
What makes the mini app approach practical without an agent framework?
Review Questions
- When would you prefer schema-driven extraction over free-form prompting for an extraction pipeline?
- In the NER example, what changes improved results, and why might that matter when building a dataset for fine-tuning?
- How does the album-cover schema design influence whether durations appear in the output?
Key Points
1. Structured outputs enforce schema-shaped JSON responses, making extracted data easier to validate, index, and reuse in apps.
2. Python implementations rely on Pydantic classes; JavaScript implementations rely on Zod schemas for the same schema-driven approach.
3. Entity extraction quality depends heavily on both model choice and prompt quality, with system prompts improving consistency.
4. Vision extraction works best when prompts and schemas explicitly reflect the target domain (e.g., books, album track listings) and when image quality is sufficient.
5. Schema nesting (lists of objects inside a top-level object) supports realistic outputs like “books on a shelf” or “songs in an album.”
6. Designing optional fields in the schema can prevent hallucinations—for example, including durations only when they appear in the image.
7. Local execution supports privacy and cost control, enabling repeatable background jobs that feed databases or RAG systems.