Build Private AI Assistant That Actually Remembers | Chatbot Memory with Ollama, LangChain & SQLite

Venelin Valkov
5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Persistent memory is implemented by saving each conversation turn to SQLite and reloading that history when rebuilding the next prompt.

Briefing

A fully local chatbot can keep “memory” across restarts by writing each conversation turn into a local SQL database and re-injecting that history into the next prompt. The build pairs Ollama for on-device model inference with LangChain for prompt/context assembly, then uses SQLite as the persistent store so the assistant can answer as if it’s continuing the same thread.

The architecture is split into three layers. At the top sits a terminal UI layer built with Rich, handling user input, rendering streamed output, and managing thread selection. Beneath it is the “Neuromind” application logic layer, which orchestrates persona prompts, thread management, context building, model initialization, and streaming. The bottom layer consists of two external services: Ollama running the language model (configured via a model config provider) and SQLite storing conversation history.

Request flow runs like this: the user types into the terminal, the app builds a context list of messages (including prior turns from the selected thread), and that message list is sent to the Ollama instance. Ollama’s response is streamed back through an internal streaming processor, which renders partial output to the terminal as it arrives. Once streaming finishes, the system persists the new human/AI messages into SQLite. On the next run, the same thread’s stored messages are pulled back and included in the next context build, enabling continuity.
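
The flow above can be sketched end to end. This is a minimal illustration, not the project's actual code: the model call is stubbed with a fake chunk generator standing in for a streamed Ollama response via LangChain, and the table schema and function names are assumptions.

```python
import sqlite3

def fake_model_stream(messages):
    # Stand-in for a streamed Ollama/LangChain response.
    for chunk in ["Hello", ", ", "world!"]:
        yield chunk

def build_context(conn, thread_id, system_prompt, user_input):
    # Prior turns from the selected thread are re-injected into the prompt.
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY id",
        (thread_id,),
    ).fetchall()
    return [("system", system_prompt)] + rows + [("human", user_input)]

def chat_turn(conn, thread_id, system_prompt, user_input):
    context = build_context(conn, thread_id, system_prompt, user_input)
    # Consume the stream, accumulating chunks as they would be rendered.
    reply = "".join(fake_model_stream(context))
    # Persist the new human/AI pair only after streaming finishes.
    conn.executemany(
        "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
        [(thread_id, "human", user_input), (thread_id, "ai", reply)],
    )
    conn.commit()
    return reply

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, thread_id TEXT, role TEXT, content TEXT)"
)
print(chat_turn(conn, "t1", "You are helpful.", "Hi"))  # Hello, world!
```

On the next call with the same `thread_id`, `build_context` pulls the stored turns back out, which is exactly what gives the assistant continuity across restarts.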

Memory is managed by a thread manager built around a thread data class containing an ID, a name, and a persona type. When a new conversation starts, the UI asks for a thread name, and the thread manager can create or fetch that thread, list existing threads, retrieve history, add messages, and clear messages. The implementation notes an async-friendly setting (`check_same_thread=False`) for SQLite, and also flags that other concurrency patterns may call for swapping in a different database or client library.
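
A sketch of that thread manager, under the assumption of a simple two-table schema; the class, table, and method names here are illustrative rather than the project's actual API:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Thread:
    id: str
    name: str
    persona: str

class ThreadManager:
    def __init__(self, db_path=":memory:"):
        # check_same_thread=False allows the connection to be shared across
        # threads, matching the async-friendly setting noted above.
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS threads (id TEXT PRIMARY KEY, name TEXT, persona TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "seq INTEGER PRIMARY KEY, thread_id TEXT, role TEXT, content TEXT)"
        )

    def get_or_create(self, thread_id, name, persona):
        self.conn.execute(
            "INSERT OR IGNORE INTO threads VALUES (?, ?, ?)",
            (thread_id, name, persona),
        )
        self.conn.commit()
        row = self.conn.execute(
            "SELECT id, name, persona FROM threads WHERE id = ?", (thread_id,)
        ).fetchone()
        return Thread(*row)

    def list_threads(self):
        rows = self.conn.execute("SELECT id, name, persona FROM threads")
        return [Thread(*r) for r in rows]

    def add_message(self, thread_id, role, content):
        self.conn.execute(
            "INSERT INTO messages (thread_id, role, content) VALUES (?, ?, ?)",
            (thread_id, role, content),
        )
        self.conn.commit()

    def history(self, thread_id):
        return self.conn.execute(
            "SELECT role, content FROM messages WHERE thread_id = ? ORDER BY seq",
            (thread_id,),
        ).fetchall()

    def clear(self, thread_id):
        self.conn.execute("DELETE FROM messages WHERE thread_id = ?", (thread_id,))
        self.conn.commit()
```

Keeping messages keyed by `thread_id` is what makes conversations switchable: clearing or deleting one thread leaves every other thread's history untouched.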

The “brain” of the app lives in app.py, where initialization wires together the thread manager, UI manager, persona loading, and the Ollama model instance. Personas come from markdown files and are loaded once at startup as system prompts. The run method parses CLI commands (including thread switching and other helper actions), then triggers streaming: a stream processor buffers output chunks and applies special handling for reasoning content when using a reasoning-capable model. If a chunk carries neither reasoning content nor regular content, the chunk handler may not run.
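
Loading personas from markdown at startup can be as simple as mapping each file's stem to its contents; the directory layout and function name below are assumptions for illustration:

```python
from pathlib import Path
import tempfile

def load_personas(persona_dir):
    """Map persona name -> system prompt, one markdown file per persona."""
    return {
        path.stem: path.read_text().strip()
        for path in Path(persona_dir).glob("*.md")
    }

# Demo with a temporary persona directory standing in for the project's.
with tempfile.TemporaryDirectory() as d:
    Path(d, "tutor.md").write_text("You are a patient tutor.")
    Path(d, "critic.md").write_text("You are a blunt code reviewer.")
    personas = load_personas(d)
    print(sorted(personas))  # ['critic', 'tutor']
```

Loading once at startup means editing a persona file requires a restart, but keeps the per-request path free of disk reads.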

Model support is configurable. The configuration lists the Qwen3 8B model and Gemini 2.5 Flash; using Gemini requires a Google GenAI API key. The project also uses Pydantic for structured inputs/outputs and FastAPI to expose a streaming-capable REST API. That API can mirror CLI functionality, including listing personas and threads, showing message counts, and deleting threads.
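
One plausible shape for that configuration, assuming Qwen3 8B served locally via Ollama and Gemini 2.5 Flash via Google GenAI; the dict structure and function are illustrative, not the project's actual config:

```python
import os

MODELS = {
    "qwen3-8b": {"provider": "ollama", "requires_api_key": None},
    "gemini-2.5-flash": {"provider": "google-genai", "requires_api_key": "GOOGLE_API_KEY"},
}

def check_model(name):
    """Return the provider for a model, failing early if its API key is missing."""
    cfg = MODELS[name]
    key_var = cfg["requires_api_key"]
    if key_var and not os.environ.get(key_var):
        raise RuntimeError(f"{name} needs {key_var} to be set")
    return cfg["provider"]

print(check_model("qwen3-8b"))  # ollama
```

Failing at startup rather than mid-conversation is the main reason to validate keys in the config layer.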

Net result: a private assistant that runs locally, streams responses, and retains conversation history across sessions—without relying on a hosted memory service—by combining Ollama inference with SQLite-backed thread history and persona-driven prompt construction.

Cornell Notes

The assistant achieves persistent memory by storing each chat turn in SQLite and reloading that history when building the next prompt. Ollama runs the model locally, while LangChain assembles the context (system prompt + prior thread messages) and streams the model’s output back to the UI. A thread manager organizes conversations into named threads tied to a persona, supporting create/get, list, history retrieval, message insertion, and clearing. The CLI uses Rich for rendering, and the same functionality is exposed through a FastAPI streaming REST interface. This setup matters because it turns a stateless chatbot into a restart-proof assistant without external memory infrastructure.

How does the chatbot “remember” after the process restarts?

After each user/assistant exchange, the app persists the messages into SQLite under the active thread. On the next run, the thread manager retrieves that stored history and the context builder injects it into the message list sent to Ollama. Because the next prompt includes prior turns from the same thread, the assistant continues the conversation rather than starting fresh.

What role do threads and personas play in memory and prompting?

A thread is a named conversation container with an ID and a persona type. The UI asks for a thread name when creating a new conversation. Personas are loaded from markdown files as system prompts at startup, then attached to threads so the context builder can apply the right system instructions when assembling messages for Ollama.

Where does streaming fit into the architecture?

Streaming happens between Ollama and the UI. The app sends a context list to Ollama, receives a streamed response, and routes chunks through a streaming processor. That processor buffers and renders partial output to the terminal via the UI manager. After streaming completes, the final messages are written to SQLite so the next session can reuse them.
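A compact sketch of that chunk pipeline: chunks are modeled here as dicts with optional `content` and `reasoning` fields (the real project pulls these out of LangChain streamed messages), and rendering is replaced by buffering:

```python
class StreamProcessor:
    """Buffers streamed chunks, separating reasoning from answer content."""

    def __init__(self):
        self.answer = []
        self.reasoning = []

    def handle_chunk(self, chunk):
        # A chunk with neither reasoning nor content is silently skipped,
        # mirroring the guard described in the briefing.
        if chunk.get("reasoning"):
            self.reasoning.append(chunk["reasoning"])
        if chunk.get("content"):
            self.answer.append(chunk["content"])
            # A real UI would render partial output here (e.g. via Rich).

    def final_text(self):
        return "".join(self.answer)

proc = StreamProcessor()
for chunk in [{"reasoning": "think..."}, {"content": "The "}, {}, {"content": "answer."}]:
    proc.handle_chunk(chunk)
print(proc.final_text())  # The answer.
```

Only `final_text()` would be persisted to SQLite, which keeps reasoning traces out of the stored conversation history.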

How is SQLite used safely in an async environment here?

The thread manager uses SQLite with `check_same_thread=False`, which permits the connection to be used across threads in async-style execution. The transcript also notes that if different concurrency behavior is required, a different SQLite client library or even a different database provider may be preferable.
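
The setting can be demonstrated directly with the standard library. By default, `sqlite3` raises `ProgrammingError` if a connection created in one thread is used from another; `check_same_thread=False` disables that check and shifts responsibility for serializing access to the caller (here, a simple lock):

```python
import sqlite3
import threading

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE log (msg TEXT)")
lock = threading.Lock()

def writer(msg):
    # Serialize access ourselves, since sqlite3 is no longer enforcing
    # single-thread use of this connection.
    with lock:
        conn.execute("INSERT INTO log VALUES (?)", (msg,))
        conn.commit()

threads = [threading.Thread(target=writer, args=(f"turn-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 4
```

This is why the transcript's caveat matters: without external locking (or a driver designed for concurrency), cross-thread use of one connection is unsafe.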

What model and API configuration options are built in?

The configuration lists the Qwen3 8B model and Gemini 2.5 Flash. Gemini requires a Google GenAI API key. The project uses Pydantic for request/response schemas and FastAPI for a streaming REST API that can perform operations like listing personas and threads, showing message counts, and deleting threads.

Review Questions

  1. How does the context builder decide which prior messages to include in the next Ollama request?
  2. What functions does the thread manager need to support to enable switching between multiple conversations?
  3. Why does the streaming processor treat reasoning content differently from normal chunk content?

Key Points

  1. Persistent memory is implemented by saving each conversation turn to SQLite and reloading that history when rebuilding the next prompt.
  2. The app is organized into UI (Rich CLI), domain logic (Neuromind), and service layer (Ollama inference plus SQLite storage).
  3. Conversation continuity depends on thread-scoped message history, not a single global chat log.
  4. Personas are loaded once from markdown system prompts and applied to threads during context construction.
  5. Responses stream from Ollama to the terminal through a streaming processor, then are persisted after streaming ends.
  6. The same capabilities are available via a FastAPI streaming REST API, including persona and thread management operations.
  7. Model choice is configurable, with Gemini requiring a Google GenAI API key and the Qwen3 8B model configured in the model settings.

Highlights

Restart-proof memory comes from writing messages to SQLite after each streamed response and re-injecting that stored history into the next context build.
Thread manager design turns chat history into switchable, named conversations tied to a persona type.
Streaming is handled as a chunk-by-chunk pipeline from Ollama back to the terminal, with reasoning content optionally extracted from LangChain message arguments.
Personas are system prompts loaded from markdown once at startup, then reused across sessions for consistent assistant behavior.
FastAPI provides a streaming REST interface that mirrors CLI operations like listing threads and deleting them.

Topics

Mentioned

  • CLI
  • API
  • REST
  • SQL
  • AI
  • UI
  • Pydantic
  • FastAPI
  • Ollama
  • LangChain
  • SQLite