Build Smarter AI Apps: Memory, Tools, Retrieval & Structured Output with Python, Pydantic & Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
AI apps become meaningfully more useful when they’re given four upgrades beyond plain text prompting: memory, structured outputs, tool use, and retrieval-augmented knowledge. The core takeaway is that these capabilities can be built in pure Python using Pydantic plus a local LLM via Ollama, letting developers control context, enforce response formats, and ground answers in external data.
Memory addresses a practical limitation: without it, users must re-provide context on every turn. The walkthrough demonstrates a minimal “message history” approach by passing a list of prior chat messages (with custom message types) into the Ollama chat call. Using a deterministic setup (temperature set to 0) and a simulated prior answer, the model can respond as if it already knows the conversation state. The discussion then points to more scalable memory patterns—window memory that summarizes or retains only the most recent messages, and structured memory that stores specific user facts (like weight or eating habits) in a database for later reuse.
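A minimal sketch of the message-history approach, assuming the `ollama` Python client (0.4+) and a locally pulled model such as `llama3.1`; the model name and the example messages are placeholders rather than the video's exact ones:

```python
import ollama

MODEL = "llama3.1"  # assumed local model; substitute whichever model you have pulled

# Conversation history, including a simulated prior assistant answer,
# so the model "remembers" what was already discussed.
messages = [
    {"role": "system", "content": "You are a helpful quiz assistant."},
    {"role": "user", "content": "My favourite quiz category is computers."},
    {"role": "assistant", "content": "Noted! I'll focus on computer trivia."},
    {"role": "user", "content": "What category did I say I liked?"},
]

# temperature=0 keeps the run deterministic, as in the walkthrough.
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 0})
print(response.message.content)  # should reference "computers" from the history
```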
Structured output tackles another developer pain point: raw LLM text is hard to validate and parse. By defining Pydantic models (e.g., a QuizQuestion schema with fields for the question, a correct answer, and a list of incorrect answers), the system can request responses that conform to a known structure. Field descriptions are injected into prompts to improve compliance. The flow uses the model’s JSON schema generation (via Pydantic) and then validates the returned JSON back into the Pydantic type, producing a reliable, application-ready object rather than brittle string parsing.
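A sketch of the structured-output flow, assuming Ollama's `format` parameter (available in recent releases) accepts a Pydantic-generated JSON schema; the `QuizQuestion` field names follow the description above but may differ from the video's:

```python
import ollama
from pydantic import BaseModel, Field

class QuizQuestion(BaseModel):
    question: str = Field(description="The quiz question text")
    correct_answer: str = Field(description="The single correct answer")
    incorrect_answers: list[str] = Field(description="Plausible but wrong answers")

MODEL = "llama3.1"  # assumed local model

# Field descriptions end up in the JSON schema, nudging the model toward compliance.
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Write one trivia question about Python."}],
    format=QuizQuestion.model_json_schema(),  # constrain the response to the schema
    options={"temperature": 0},
)

# Validate the raw JSON back into a typed object instead of parsing strings.
quiz_question = QuizQuestion.model_validate_json(response.message.content)
print(quiz_question.correct_answer)
```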
Tool use is presented as the most powerful enhancement. Instead of relying on the model to “know” everything, the app can call external functions—such as an API-backed trivia fetcher or a calculator—to generate facts or perform operations. The example uses the Open Trivia Database API to fetch quiz questions by category (computers or vehicles) and count. A Pydantic-based tool parameter spec constrains allowed inputs (category as a literal set; count as an integer), and the model selects the tool call with inferred arguments. Crucially, the LLM doesn’t execute the tool itself; Python code runs the function, then the tool output is appended to the conversation history for a final, formatted response (e.g., markdown).
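A condensed sketch of the tool-calling loop, assuming the Open Trivia Database endpoint `https://opentdb.com/api.php`, its numeric category IDs, and the recent `ollama` client's `tools` interface; the tool name, prompts, and structure are illustrative rather than taken verbatim from the video:

```python
import json
from typing import Literal

import ollama
import requests
from pydantic import BaseModel

class TriviaParams(BaseModel):
    category: Literal["computers", "vehicles"]  # constrain the categories the model may pick
    count: int

# Assumed OpenTDB category IDs: 18 = Science: Computers, 28 = Vehicles.
CATEGORY_IDS = {"computers": 18, "vehicles": 28}

def fetch_trivia(category: str, count: int) -> list[dict]:
    """Fetch quiz questions from the Open Trivia Database API."""
    resp = requests.get(
        "https://opentdb.com/api.php",
        params={"amount": count, "category": CATEGORY_IDS[category]},
        timeout=10,
    )
    return resp.json()["results"]

MODEL = "llama3.1"  # assumed local model
messages = [{"role": "user", "content": "Get me 3 trivia questions about computers."}]

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_trivia",
        "description": "Fetch quiz questions from the Open Trivia Database",
        "parameters": TriviaParams.model_json_schema(),
    },
}]

# 1. The model only *selects* the tool and its arguments...
response = ollama.chat(model=MODEL, messages=messages, tools=tools)
messages.append(response.message)

for call in response.message.tool_calls or []:
    # 2. ...Python validates the arguments and executes the function...
    params = TriviaParams.model_validate(call.function.arguments)
    result = fetch_trivia(params.category, params.count)
    # 3. ...and the tool output goes back into the history as a tool message.
    messages.append({"role": "tool", "name": call.function.name, "content": json.dumps(result)})

# 4. Final call: the model formats the tool output (e.g., as markdown).
final = ollama.chat(model=MODEL, messages=messages)
print(final.message.content)
```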
Finally, retrieval addresses the “knowledge cut-off” problem—models may lack information from after their training window (the transcript cites 2023 knowledge cutoffs while discussing a 2025 context). Retrieval-augmented generation is implemented with a small local knowledge base (a list of quiz Q&A items). A search tool finds the best match by comparing the query to stored question titles, and the retrieved answer is injected into the prompt flow. The result is a grounded response that depends on the local database rather than the model’s internal memory.
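A minimal sketch of the retrieval step, using `difflib` string similarity as a stand-in for whatever matching the video uses, over a tiny illustrative knowledge base:

```python
import difflib

import ollama

# Tiny local knowledge base of quiz Q&A items (illustrative content, not from the video).
KNOWLEDGE_BASE = [
    {"question": "Which company released the Llama 3 family of models?", "answer": "Meta"},
    {"question": "What year was the Python programming language first released?", "answer": "1991"},
]

def search_knowledge_base(query: str) -> dict:
    """Return the stored item whose question title best matches the query."""
    return max(
        KNOWLEDGE_BASE,
        key=lambda item: difflib.SequenceMatcher(
            None, query.lower(), item["question"].lower()
        ).ratio(),
    )

MODEL = "llama3.1"  # assumed local model
query = "Who released Llama 3?"
retrieved = search_knowledge_base(query)

# Inject the retrieved answer into the prompt so the response is grounded
# in the local database rather than the model's internal memory.
messages = [
    {"role": "system", "content": f"Answer using only this context: {retrieved['answer']}"},
    {"role": "user", "content": query},
]
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 0})
print(response.message.content)
```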
Together, these techniques form a practical blueprint for building smarter AI apps: keep context with memory, enforce correctness with structured outputs, expand capability with tool calls, and ground answers with retrieval—using Python, Pydantic, and a local Ollama model stack.
Cornell Notes
The transcript lays out four upgrades that turn a basic LLM into a more reliable AI application: memory, structured output, tool use, and retrieval. Memory keeps prior conversation context by passing message history (and can scale via window or structured memory stored in a database). Structured output uses Pydantic schemas so the model returns JSON that can be validated into typed objects, avoiding fragile text parsing. Tool use lets the model request function calls (e.g., fetching trivia from Open Trivia Database) while Python executes the API and feeds results back into the chat. Retrieval-augmented generation then grounds answers in external or private knowledge to mitigate knowledge cut-off limits.
Why does memory matter for real-world chat apps, and how is it implemented in the walkthrough?
What problem does structured output solve, and how does Pydantic enforce it?
How does tool use work in this setup, and what role does Python play?
How does retrieval address knowledge cut-off, and what does the example retrieval tool do?
What constraints are used to keep tool calls from going off the rails?
Review Questions
- How would you decide between window memory and structured memory for a new AI feature?
- What steps are required to go from a Pydantic schema to a validated structured response in the chat flow?
- In the tool-calling flow, where does execution happen, and how is tool output incorporated into the next model call?
Key Points
1. Memory can be implemented by passing prior message history into the Ollama chat call, enabling context retention across turns.
2. Window memory and structured memory are two practical scaling strategies: summarize recent turns or store specific user facts in a database (a window-memory sketch follows this list).
3. Pydantic schemas enable structured outputs by generating JSON schema for the model and validating returned JSON into typed objects.
4. Tool use expands model capability by letting the LLM request function calls while Python executes the functions and feeds results back into the conversation.
5. Tool parameter constraints (e.g., Literal categories) reduce invalid tool-call arguments and improve reliability.
6. Retrieval-augmented generation mitigates knowledge cut-off by searching an external/local knowledge base and grounding answers in retrieved content.
7. A complete RAG/tool pipeline typically follows: model selects tool → code executes tool → tool output is appended to history → model produces the final response.
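A sketch of the window-memory policy from point 2, assuming a simple "keep the last N messages" rule; the summarizing variant would instead replace the dropped turns with a model-written summary:

```python
# Keep the system prompt plus only the most recent N messages so the
# context stays bounded as the conversation grows.
def window_memory(messages: list[dict], max_messages: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```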