Build Smarter AI Apps: Memory, Tools, Retrieval & Structured Output with Python, Pydantic & Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
AI apps become meaningfully more useful when they’re given four upgrades beyond plain text prompting: memory, structured outputs, tool use, and retrieval-augmented knowledge. The core takeaway is that these capabilities can be built in pure Python using Pydantic plus a local LLM via Ollama, letting developers control context, enforce response formats, and ground answers in external data.
Memory addresses a practical limitation: without it, users must re-provide context on every turn. The walkthrough demonstrates a minimal “message history” approach by passing a list of prior chat messages (with custom message types) into the Ollama chat call. Using a deterministic setup (temperature set to 0) and a simulated prior answer, the model can respond as if it already knows the conversation state. The discussion then points to more scalable memory patterns—window memory that summarizes or retains only the most recent messages, and structured memory that stores specific user facts (like weight or eating habits) in a database for later reuse.
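A minimal sketch of the message-history approach, assuming the `ollama` Python client (0.4+) and a locally pulled model such as `llama3.1`; the model name and the example messages are placeholders rather than the video's exact ones:

```python
import ollama

MODEL = "llama3.1"  # assumed local model; substitute whichever model you have pulled

# Conversation history, including a simulated prior assistant answer,
# so the model "remembers" what was already discussed.
messages = [
    {"role": "system", "content": "You are a helpful quiz assistant."},
    {"role": "user", "content": "My favourite quiz category is computers."},
    {"role": "assistant", "content": "Noted! I'll focus on computer trivia."},
    {"role": "user", "content": "What category did I say I liked?"},
]

# temperature=0 keeps the run deterministic, as in the walkthrough.
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 0})
print(response.message.content)  # should reference "computers" from the history
```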
Structured output tackles another developer pain point: raw LLM text is hard to validate and parse. By defining Pydantic models (e.g., a QuizQuestion schema with fields for the question, a correct answer, and a list of incorrect answers), the system can request responses that conform to a known structure. Field descriptions are injected into prompts to improve compliance. The flow uses the model’s JSON schema generation (via Pydantic) and then validates the returned JSON back into the Pydantic type, producing a reliable, application-ready object rather than brittle string parsing.
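A sketch of the structured-output flow, assuming Ollama's `format` parameter (available in recent releases) accepts a Pydantic-generated JSON schema; the `QuizQuestion` field names follow the description above but may differ from the video's:

```python
import ollama
from pydantic import BaseModel, Field

class QuizQuestion(BaseModel):
    question: str = Field(description="The quiz question text")
    correct_answer: str = Field(description="The single correct answer")
    incorrect_answers: list[str] = Field(description="Plausible but wrong answers")

MODEL = "llama3.1"  # assumed local model

# Field descriptions end up in the JSON schema, nudging the model toward compliance.
response = ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Write one trivia question about Python."}],
    format=QuizQuestion.model_json_schema(),  # constrain the response to the schema
    options={"temperature": 0},
)

# Validate the raw JSON back into a typed object instead of parsing strings.
quiz_question = QuizQuestion.model_validate_json(response.message.content)
print(quiz_question.correct_answer)
```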
Tool use is presented as the most powerful enhancement. Instead of relying on the model to “know” everything, the app can call external functions—such as an API-backed trivia fetcher or a calculator—to generate facts or perform operations. The example uses the Open Trivia Database API to fetch quiz questions by category (computers or vehicles) and count. A Pydantic-based tool parameter spec constrains allowed inputs (category as a literal set; count as an integer), and the model selects the tool call with inferred arguments. Crucially, the LLM doesn’t execute the tool itself; Python code runs the function, then the tool output is appended to the conversation history for a final, formatted response (e.g., markdown).
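A condensed sketch of the tool-calling loop, assuming the Open Trivia Database endpoint `https://opentdb.com/api.php`, its numeric category IDs, and the recent `ollama` client's `tools` interface; the tool name, prompts, and structure are illustrative rather than taken verbatim from the video:

```python
import json
from typing import Literal

import ollama
import requests
from pydantic import BaseModel

class TriviaParams(BaseModel):
    category: Literal["computers", "vehicles"]  # constrain the categories the model may pick
    count: int

# Assumed OpenTDB category IDs: 18 = Science: Computers, 28 = Vehicles.
CATEGORY_IDS = {"computers": 18, "vehicles": 28}

def fetch_trivia(category: str, count: int) -> list[dict]:
    """Fetch quiz questions from the Open Trivia Database API."""
    resp = requests.get(
        "https://opentdb.com/api.php",
        params={"amount": count, "category": CATEGORY_IDS[category]},
        timeout=10,
    )
    return resp.json()["results"]

MODEL = "llama3.1"  # assumed local model
messages = [{"role": "user", "content": "Get me 3 trivia questions about computers."}]

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_trivia",
        "description": "Fetch quiz questions from the Open Trivia Database",
        "parameters": TriviaParams.model_json_schema(),
    },
}]

# 1. The model only *selects* the tool and its arguments...
response = ollama.chat(model=MODEL, messages=messages, tools=tools)
messages.append(response.message)

for call in response.message.tool_calls or []:
    # 2. ...Python validates the arguments and executes the function...
    params = TriviaParams.model_validate(call.function.arguments)
    result = fetch_trivia(params.category, params.count)
    # 3. ...and the tool output goes back into the history as a tool message.
    messages.append({"role": "tool", "name": call.function.name, "content": json.dumps(result)})

# 4. Final call: the model formats the tool output (e.g., as markdown).
final = ollama.chat(model=MODEL, messages=messages)
print(final.message.content)
```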
Finally, retrieval addresses the “knowledge cut-off” problem—models may lack information from after their training window (the transcript cites 2023 knowledge cutoffs while discussing a 2025 context). Retrieval-augmented generation is implemented with a small local knowledge base (a list of quiz Q&A items). A search tool finds the best match by comparing the query to stored question titles, and the retrieved answer is injected into the prompt flow. The result is a grounded response that depends on the local database rather than the model’s internal memory.
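A minimal sketch of the retrieval step, using `difflib` string similarity as a stand-in for whatever matching the video uses, over a tiny illustrative knowledge base:

```python
import difflib

import ollama

# Tiny local knowledge base of quiz Q&A items (illustrative content, not from the video).
KNOWLEDGE_BASE = [
    {"question": "Which company released the Llama 3 family of models?", "answer": "Meta"},
    {"question": "What year was the Python programming language first released?", "answer": "1991"},
]

def search_knowledge_base(query: str) -> dict:
    """Return the stored item whose question title best matches the query."""
    return max(
        KNOWLEDGE_BASE,
        key=lambda item: difflib.SequenceMatcher(
            None, query.lower(), item["question"].lower()
        ).ratio(),
    )

MODEL = "llama3.1"  # assumed local model
query = "Who released Llama 3?"
retrieved = search_knowledge_base(query)

# Inject the retrieved answer into the prompt so the response is grounded
# in the local database rather than the model's internal memory.
messages = [
    {"role": "system", "content": f"Answer using only this context: {retrieved['answer']}"},
    {"role": "user", "content": query},
]
response = ollama.chat(model=MODEL, messages=messages, options={"temperature": 0})
print(response.message.content)
```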
Together, these techniques form a practical blueprint for building smarter AI apps: keep context with memory, enforce correctness with structured outputs, expand capability with tool calls, and ground answers with retrieval—using Python, Pydantic, and a local Ollama model stack.
Cornell Notes
The transcript lays out four upgrades that turn a basic LLM into a more reliable AI application: memory, structured output, tool use, and retrieval. Memory keeps prior conversation context by passing message history (and can scale via window or structured memory stored in a database). Structured output uses Pydantic schemas so the model returns JSON that can be validated into typed objects, avoiding fragile text parsing. Tool use lets the model request function calls (e.g., fetching trivia from Open Trivia Database) while Python executes the API and feeds results back into the chat. Retrieval-augmented generation then grounds answers in external or private knowledge to mitigate knowledge cut-off limits.
Why does memory matter for real-world chat apps, and how is it implemented in the walkthrough?
What problem does structured output solve, and how does Pydantic enforce it?
How does tool use work in this setup, and what role does Python play?
How does retrieval address knowledge cut-off, and what does the example retrieval tool do?
What constraints are used to keep tool calls from going off the rails?
Review Questions
- How would you decide between window memory and structured memory for a new AI feature?
- What steps are required to go from a Pydantic schema to a validated structured response in the chat flow?
- In the tool-calling flow, where does execution happen, and how is tool output incorporated into the next model call?
Key Points
1. Memory can be implemented by passing prior message history into the Ollama chat call, enabling context retention across turns.
2. Window memory and structured memory are two practical scaling strategies: summarize recent turns or store specific user facts in a database (a window-memory sketch follows this list).
3. Pydantic schemas enable structured outputs by generating JSON schema for the model and validating returned JSON into typed objects.
4. Tool use expands model capability by letting the LLM request function calls while Python executes the functions and feeds results back into the conversation.
5. Tool parameter constraints (e.g., Literal categories) reduce invalid tool-call arguments and improve reliability.
6. Retrieval-augmented generation mitigates knowledge cut-off by searching an external/local knowledge base and grounding answers in retrieved content.
7. A complete RAG/tool pipeline typically follows: model selects tool → code executes tool → tool output is appended to history → model produces the final response.
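A sketch of the window-memory policy from point 2, assuming a simple "keep the last N messages" rule; the summarizing variant would instead replace the dropped turns with a model-written summary:

```python
# Keep the system prompt plus only the most recent N messages so the
# context stays bounded as the conversation grows.
def window_memory(messages: list[dict], max_messages: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```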