
Build 100% Local Chatbot with Gemma 3, Ollama and LangChain | AI Assistant with Memory and Tool Use

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The chatbot runs locally by combining Gemma 3 for responses with Qwen 2.5 for tool-calling decisions about saving memories.

Briefing

A fully local chatbot can now keep both conversation history and long-term “memories” across separate chats—without sending data to a hosted service. The system stitches together a local Gemma 3 model for responses, a local tool-calling model (Qwen 2.5) for deciding when to save new memories, and a local SQL database (SQLite + SQLVector) to store both chat context and vectorized memory entries. The result is an assistant that can remember what a user said earlier, retrieve the most relevant saved memories for the current conversation, and selectively store new details when they matter.

The workflow is built around three stages. First, each user message is combined with prior conversation history pulled from a local database (the transcript references SQLite for conversation storage). Second, the chatbot retrieves up to five relevant memory records for the current user using vector similarity search over stored memory embeddings (generated with FastEmbed). Those memories are injected into the prompt so Gemma 3 can respond with continuity.
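
As a rough sketch of this retrieval step (written against LangChain's generic vector-store interface, since the exact SQLVector import is not shown in the transcript; the `memory_store` handle, filter shape, and query formatting are illustrative):

```python
from langchain_community.embeddings import FastEmbedEmbeddings

# Local embedding model — FastEmbed runs on-device, no API calls.
embeddings = FastEmbedEmbeddings()

def retrieve_memories(memory_store, user_id: str, conversation: str, k: int = 5):
    """Return up to k memory documents relevant to the current conversation.

    `memory_store` is assumed to be a LangChain vector store (the video uses
    SQLVector over SQLite); filter syntax varies by implementation.
    """
    return memory_store.similarity_search(
        conversation,                  # query text: the recent conversation
        k=k,                           # cap at five memories, as in the video
        filter={"user_id": user_id},   # restrict to the current user's memories
    )
```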

Third, after generating a reply, the system decides whether to save new memories. A dedicated “save memory” tool is exposed to the model. The tool-calling logic uses Qwen 2.5 to evaluate the conversation and existing memories, then returns a structured memory payload only when the model judges something should be stored. Saved memories include a content field plus an importance score from 1 to 10; per the transcript, memories rated 3 or below should not be saved. This threshold keeps the memory store from filling up with low-signal details.
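
A minimal sketch of such a tool, assuming LangChain's `@tool` decorator; the `memory_manager` and `current_user_id` names are hypothetical stand-ins for the project's own objects:

```python
from langchain_core.tools import tool

@tool
def save_memory(content: str, importance: int) -> str:
    """Save a long-term memory about the user.

    importance is a 1-10 rating; per the video's guidance, anything
    rated 3 or below should not be stored.
    """
    if importance <= 3:
        return "Skipped: importance too low."
    # Hypothetical manager call — the video's MemoryManager handles persistence.
    memory_manager.save_memory(
        content=content, importance=importance, user_id=current_user_id
    )
    return "Memory saved."
```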

Implementation details reinforce the “local-first” goal. The project is organized with a Streamlit UI (app.py) and a LangChain/LangGraph backend workflow (chatbot workflow and task-decorated functions). A configuration file initializes model providers (the transcript mentions Groq API support as an option, but the core models used are Gemma 3 and Qwen 2.5) and sets up the SQLite database and the vector store table for memories. The LangGraph checkpointing is wired to the SQLite connection so conversation state can persist across runs.
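
The checkpointing wiring might look like the following sketch, assuming the `langgraph-checkpoint-sqlite` package; the database path and the `workflow` graph are illustrative:

```python
import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver

# Reuse the same local SQLite file for conversation state so it survives restarts.
conn = sqlite3.connect("data/chatbot.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

# `workflow` stands in for the app's LangGraph StateGraph.
graph = workflow.compile(checkpointer=checkpointer)
```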

The memory system is formalized with a Pydantic-style Memory model containing memory_id, content, user_id, created_at, and importance. A MemoryManager handles saving memories by converting them into vector-store documents with metadata, retrieving memories via similarity search filtered by user_id, and also includes a SQL-based method to fetch memories directly from the table using JSON metadata extraction.
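
Based on the fields named above, the Memory model is presumably close to:

```python
from datetime import datetime
from pydantic import BaseModel, Field

class Memory(BaseModel):
    """One long-term memory record, mirroring the fields named in the video."""
    memory_id: str
    content: str
    user_id: str
    created_at: datetime
    importance: int = Field(ge=1, le=10)  # 1-10 rating assigned by the model
```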

On the UI side, the app supports multiple conversation IDs (thread_id) and a user_id. The sidebar lists retrieved memories for that user, and the main chat view streams assistant output chunk-by-chunk. In the demo behavior described, the assistant stores the user’s name with high importance (10), then later recalls it when starting a new conversation thread. It also stores other interests (like wanting to buy a motorcycle) with lower importance, demonstrating that memory saving is selective rather than automatic.

Overall, the core insight is practical: memory persistence and retrieval can be implemented as a local, tool-driven loop—retrieve relevant memories, generate a response, then conditionally store new memories—using LangChain/LangGraph orchestration plus SQLite-backed vector search and embeddings.

Cornell Notes

The system builds a local chatbot that remembers across separate chats by combining three pieces: (1) Gemma 3 for generating answers, (2) Qwen 2.5 for tool-calling decisions about what to store, and (3) a SQLite database with SQLVector for storing and retrieving memory embeddings. For each new user message, it pulls prior conversation history and retrieves up to five relevant saved memories for the current user via vector similarity search using FastEmbed embeddings. After responding, the chatbot uses a “save memory” tool that only stores new items when they clear an importance threshold (ratings of 3 or below are discarded). Memories include content, user_id, created_at, and an importance score, enabling selective long-term personalization.

How does the chatbot decide which memories to include in a new conversation?

Before generating a response, it converts the current conversation messages into a conversation string (truncated to a max character limit) and calls a memory manager method to retrieve memories for the given user_id. Retrieval uses a vector store (SQLVector) with FastEmbed embeddings and performs similarity search filtered by user_id. The system returns at most five matching memory records, then formats them into a memory template that gets injected into the prompt sent to Gemma 3.
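
A sketch of the conversation-to-query step; the character cap and message formatting are assumptions, since the transcript only mentions a maximum length:

```python
MAX_CONVERSATION_CHARS = 2000  # illustrative cap; the video uses a fixed limit

def conversation_to_query(messages) -> str:
    """Flatten LangChain chat messages into one string, keeping the most
    recent characters so the similarity query reflects the latest turns."""
    text = "\n".join(f"{m.type}: {m.content}" for m in messages)
    return text[-MAX_CONVERSATION_CHARS:]
```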

What mechanism prevents the memory store from growing with low-value details?

A dedicated “save memory” tool is exposed to the model, and a separate tool-calling model (Qwen 2.5) evaluates whether new information should be saved. The prompt instructs the model to assign an importance rating from 1 to 10, with guidance that importance 3 or below should not be saved. Only when the model triggers the tool call does the system persist a new memory.
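
The decision step might look like this sketch, assuming Ollama serves Qwen 2.5 locally and reusing the `save_memory` tool sketched earlier; the prompt wording is illustrative, and `conversation_text` is a placeholder for the truncated conversation string:

```python
from langchain_ollama import ChatOllama

# Bind the save-memory tool to the local tool-calling model.
tool_model = ChatOllama(model="qwen2.5").bind_tools([save_memory])

decision = tool_model.invoke(
    "Review the conversation and existing memories. If something new is worth "
    "remembering, call save_memory with an importance rating from 1 to 10. "
    "Do not save anything rated 3 or below.\n\n" + conversation_text
)
if decision.tool_calls:                        # the model chose to save
    payload = decision.tool_calls[0]["args"]   # structured content + importance
```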

Where are conversation history and memories stored locally?

Conversation history is stored in a local SQLite database (the transcript references SQLite and uses SQLVector for vector storage). Memories are stored as vector documents in a SQLVector-backed vector store tied to a specific memory table (configured in the data directory). The memory manager also includes a SQL query path that selects text and metadata from the memories table and extracts user_id from JSON metadata.
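
The direct SQL path presumably resembles this sketch; the `memories` table and column names are assumptions based on the description:

```python
import sqlite3

def fetch_memories_sql(db_path: str, user_id: str):
    """Read memory rows directly, filtering on user_id stored in JSON metadata."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT text, metadata FROM memories "
        "WHERE json_extract(metadata, '$.user_id') = ?",
        (user_id,),
    ).fetchall()
    conn.close()
    return rows
```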

How does the system persist state across runs and support multiple chat threads?

LangGraph checkpointing is configured to use a SQLite connection, so workflow state can persist. The Streamlit UI tracks a conversation/thread identifier (thread_id) and a user_id; switching thread_id starts a new conversation while keeping the same user’s memory sidebar populated from the stored memory records.
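
In LangGraph, that per-thread persistence comes down to passing a thread_id in the run config; a sketch (the state shape is illustrative):

```python
# Each chat thread in the UI maps to a LangGraph thread via the config dict;
# switching thread_id starts a fresh checkpointed conversation for the same user.
config = {"configurable": {"thread_id": thread_id}}
result = graph.invoke({"messages": [("user", user_input)]}, config=config)
```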

What does a saved memory record contain, and how is it used later?

Each memory has memory_id, content, user_id, created_at, and importance (1–10). When retrieving, the system converts stored vector documents back into Memory objects, then formats them into a memory template (content plus importance) that the assistant prompt includes. This lets the assistant reuse prior user-specific details and prioritize them implicitly via importance.
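
Rehydration and formatting might look like this sketch, reusing the Memory model above; the metadata keys and template wording are assumptions:

```python
def format_memories(docs) -> str:
    """Convert retrieved vector-store documents back into Memory objects and
    render them as prompt lines (content plus importance)."""
    memories = [
        Memory(
            memory_id=d.metadata["memory_id"],
            content=d.page_content,
            user_id=d.metadata["user_id"],
            created_at=d.metadata["created_at"],
            importance=d.metadata["importance"],
        )
        for d in docs
    ]
    return "\n".join(f"- {m.content} (importance: {m.importance})" for m in memories)
```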

How does the UI stream responses while keeping the message list consistent?

The Streamlit app uses a message placeholder and streams output chunk-by-chunk from the backend’s streaming call. As chunks arrive, they are appended to a full_response string and rendered live. After streaming completes, the assistant’s final message is appended to the session state message list so the chat history remains visible on refresh.
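
A sketch of that loop; `stream_reply` is a placeholder for the backend's streaming call:

```python
import streamlit as st

placeholder = st.empty()   # container that is re-rendered as chunks arrive
full_response = ""
for chunk in stream_reply(user_input):   # hypothetical backend generator
    full_response += chunk
    placeholder.markdown(full_response)  # live, incremental render

# Persist the final message so the history survives a rerun.
st.session_state.messages.append({"role": "assistant", "content": full_response})
```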

Review Questions

  1. What are the exact steps in order for handling a new user message (from history retrieval to memory retrieval to response generation to conditional memory saving)?
  2. Why does the system use a separate tool-calling model (Qwen 2.5) instead of relying on Gemma 3 alone for memory decisions?
  3. How does the memory retrieval process ensure memories belong to the correct user, and what limits are applied to the number of memories returned?

Key Points

  1. The chatbot runs locally by combining Gemma 3 for responses with Qwen 2.5 for tool-calling decisions about saving memories.

  2. Conversation history and long-term memories persist via SQLite, with SQLVector used to store memory embeddings for similarity search.

  3. For each message, the system retrieves up to five user-specific memories using FastEmbed-generated vectors and injects them into the prompt.

  4. After replying, a “save memory” tool is called only when the tool-calling model judges new information is worth storing based on an importance score.

  5. Saved memories include content, user_id, created_at, and an importance rating from 1 to 10, enabling selective personalization.

  6. The Streamlit UI supports multiple conversation IDs (thread_id) while keeping a sidebar of retrieved memories for the same user.

  7. LangGraph checkpointing is wired to the SQLite connection so workflow state can persist across sessions.

Highlights

  • Memory persistence works across separate chat threads by tying stored memories to a user_id and retrieving them on demand for each new conversation.
  • A tool-driven loop prevents memory bloat: the model assigns an importance score and only triggers the save-memory tool when the threshold is met.
  • Vector similarity search over embedded memory records (FastEmbed + SQLVector) selects the most relevant memories, capped at five per response.
  • The UI streams assistant output live while maintaining a consistent message list in Streamlit session state.

Topics

Mentioned

  • Venelin Valkov
  • UI
  • SQL
  • SQLVector
  • FastEmbed
  • LangChain
  • LangGraph
  • Pydantic
  • JSON