LangChain - Conversations with Memory (explanation & code walkthrough)

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Large language models don’t retain conversation state automatically; continuity requires re-inserting prior information into each prompt.

Briefing

Memory is the difference between a chat agent that feels coherent and one that repeatedly “forgets” what a user meant earlier—especially when people use shorthand or pronouns like “he,” “she,” or “that” to refer back to earlier details. Large language models don’t retain conversation state by themselves; they generate responses from the prompt they’re given. That means conversation continuity has to be engineered, either by stuffing prior context into the prompt or by maintaining an external record that can be re-inserted later.
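
As a minimal illustration of that engineering (not code from the video), continuity can be faked with nothing more than string concatenation; the messages and prompt framing below are hypothetical example data:

```python
# Minimal sketch (not from the video): continuity is just prior turns
# re-inserted into the next prompt. All names and messages are hypothetical.
history: list[str] = []

def build_prompt(user_message: str) -> str:
    transcript = "\n".join(history)
    return (
        "The following is a friendly conversation between a human and an AI.\n"
        f"{transcript}\n"
        f"Human: {user_message}\nAI:"
    )

history.append("Human: Hi, I'm Sam. My TV is broken.")
history.append("AI: Sorry to hear that, Sam. Is it under warranty?")
print(build_prompt("Yes, it is."))  # the model would see the whole history each turn
```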

The transcript lays out two broad strategies for memory in LangChain. The first is prompt-based memory: earlier turns are appended to the prompt so the model can answer with full conversational context. A simple “conversation buffer” does exactly that—each user message and agent reply gets stacked into the prompt. In a customer-support style example, the agent can follow along even when the user doesn’t restate everything, because the prompt keeps growing turn by turn. The catch is practical: token limits cap how much history can fit. With a context window of roughly 4,096 tokens (the figure the transcript cites), an hour-long conversation can’t be fully included verbatim.
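
In LangChain this corresponds to ConversationBufferMemory. A rough sketch, assuming the classic pre-1.0 Python API (import paths have moved in later versions) and an OpenAI key in the environment:

```python
# Sketch of the buffer approach using LangChain's classic (pre-1.0) API;
# import paths have moved across versions. Assumes OPENAI_API_KEY is set.
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = OpenAI(temperature=0)
conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),  # every turn is appended verbatim
    verbose=True,  # prints the growing prompt so you can watch it stack up
)

conversation.predict(input="Hi there, I'm Sam.")
conversation.predict(input="My TV is broken. Can I get support?")
# The second prompt already contains the first exchange verbatim.
```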

To address token limits, the transcript then shifts to memory that compresses history. “Conversation summary memory” replaces verbatim turns with an evolving summary. After each interaction, the system calls a language model again to summarize what happened so far, then feeds that summary back into the next prompt. This reduces how much text is carried forward while still preserving key facts like who the user is (Sam) and what the user is trying to do (get customer support). The tradeoff is extra computation: summarization requires additional model calls beyond the one used to generate the user-facing response.
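
A sketch of the same idea with ConversationSummaryMemory, again assuming the classic LangChain API; note the memory object takes its own LLM for the summarization calls:

```python
# Sketch of summary memory (classic LangChain API). Each save_context call
# triggers an extra LLM call that rewrites the running summary.
from langchain.llms import OpenAI
from langchain.memory import ConversationSummaryMemory

llm = OpenAI(temperature=0)
memory = ConversationSummaryMemory(llm=llm)  # needs its own LLM to summarize

memory.save_context(
    {"input": "Hi, I'm Sam. My TV is broken and it's under warranty."},
    {"output": "Sorry to hear that, Sam. Let's get you some support."},
)
print(memory.buffer)  # an evolving prose summary, not the verbatim turns
```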

A related approach, “conversation buffer window memory,” keeps only the last K interactions (or a limited slice of recent context) in the prompt. That can be enough for many real conversations, where users rarely need details from the distant past, and it’s cheaper than full-history buffering. The transcript also describes a hybrid: “summary + buffer,” where older content is summarized while the most recent turns remain verbatim. This “best of both worlds” design helps maintain continuity without letting prompts balloon.
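
In the classic API both variants are a one-line swap; the max_token_limit value below is illustrative, not a figure from the transcript:

```python
# Sketch of the windowed and hybrid variants (classic LangChain API).
from langchain.llms import OpenAI
from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
)

llm = OpenAI(temperature=0)

# Keep only the last k exchanges verbatim; older turns simply drop out.
window_memory = ConversationBufferWindowMemory(k=2)

# Hybrid: recent turns stay verbatim until the token budget is hit,
# then the overflow is folded into a summary. The limit is illustrative.
hybrid_memory = ConversationSummaryBufferMemory(llm=llm, max_token_limit=256)
```

Either object drops into a ConversationChain exactly as in the buffer example above.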

Beyond summarization, the transcript introduces structured memory. “Knowledge graph memory” extracts entities and relationships from the conversation and inserts only relevant facts into the prompt, aiming to avoid hallucinating new information. In the TV-repair example, the system builds a mini graph capturing details like “Sam” owning a “TV,” the TV being “broken,” and being “under warranty,” including a warranty number. “Entity memory” similarly caches extracted entities—such as the warranty number and a repair person named Dave—so later turns can reference them reliably. The result is an agent that can track relationships and key attributes across turns, enabling downstream actions like routing a repair request or triggering other chains.
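
Entity memory ships with a matching prompt template in the classic API; a sketch of wiring it into a chain, assuming the pre-1.0 import paths:

```python
# Sketch of structured memory in a chain (classic LangChain API): entity
# memory pairs with a prompt template that injects the extracted entities.
# Import paths here are the classic ones and have moved in later versions.
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationEntityMemory
from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE

llm = OpenAI(temperature=0)
conversation = ConversationChain(
    llm=llm,
    prompt=ENTITY_MEMORY_CONVERSATION_TEMPLATE,
    memory=ConversationEntityMemory(llm=llm),  # a second LLM does extraction
)
conversation.predict(input="Hi, I'm Sam. My TV is broken but under warranty.")
```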

Overall, the transcript frames memory as a set of engineering choices: trade prompt length for coherence, compress history when context windows run out, and use structured extraction when the agent needs durable facts rather than just a longer prompt.

Cornell Notes

LangChain memory is necessary because large language models generate responses from the prompt they receive and don’t inherently retain conversation state. The transcript compares prompt-based memory (like a conversation buffer) with compressed memory (like conversation summary and buffer window approaches) to manage token limits. It also shows structured memory options—knowledge graph memory and entity memory—that extract facts and relationships (e.g., a broken TV under warranty, warranty number, and a repair person) and reuse them in later turns. These designs help agents handle shorthand references and maintain continuity without exceeding context windows. The key practical lesson is choosing the right memory strategy based on token budget, cost, and how much durable factual tracking the agent needs.

Why does an agent need “memory” at all if it can already answer questions?

Large language models don’t retain state automatically; each response depends on the prompt provided at that moment. Without memory, the model can’t reliably resolve references like “he/she” or “that” to earlier details (names, places, times, topics). LangChain memory solves this by re-inserting prior conversation information—either verbatim, summarized, or extracted as structured facts—into the next prompt.
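
To see that re-insertion mechanism in isolation, a memory object can be driven by hand (classic LangChain API); load_memory_variables returns exactly the text a chain would splice into its next prompt:

```python
# The re-insertion mechanism in isolation (classic LangChain API):
# a memory object records turns and hands them back as prompt variables.
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({"input": "Hi, I'm Sam."}, {"output": "Hello Sam!"})
memory.save_context({"input": "My TV is broken."}, {"output": "Is it under warranty?"})

# This string is what gets spliced into the next prompt.
print(memory.load_memory_variables({})["history"])
```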

What’s the core tradeoff between a conversation buffer and token limits?

A conversation buffer appends every user and agent turn to the prompt, which preserves full context and supports continuity. But prompts grow each turn until they hit the model’s context window (the transcript cites ~4,096 tokens). That means long conversations can’t be stored verbatim, forcing alternative strategies like summarization or windowing.
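
A quick, hypothetical way to see the budget problem is to count tokens with the tiktoken tokenizer; the repeated transcript below is fabricated filler:

```python
# Back-of-the-envelope check against a ~4,096-token context window,
# using OpenAI's tiktoken tokenizer. The history is fabricated filler.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
history = "Human: Hi, I'm Sam.\nAI: Hello Sam!\n" * 400  # a long conversation

n_tokens = len(enc.encode(history))
print(n_tokens, "tokens")  # verbatim history exhausts a ~4,096-token window fast
```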

How does conversation summary memory reduce token usage, and what extra cost does it introduce?

Conversation summary memory replaces earlier verbatim turns with a running summary. After each interaction, it makes an additional call to a language model to summarize the conversation so far, then feeds that summary into the next response prompt. This lowers how much text is carried forward, but it increases total model calls because summarization happens in addition to the user-facing response.
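
The extra call can be made explicit with the classic API's predict_new_summary helper (assuming that method name, which the classic summary memory classes expose):

```python
# The summarization step in isolation (classic LangChain API; assumes the
# predict_new_summary helper exposed by the classic summary memory classes).
from langchain.llms import OpenAI
from langchain.memory import ConversationSummaryMemory
from langchain.schema import AIMessage, HumanMessage

llm = OpenAI(temperature=0)
memory = ConversationSummaryMemory(llm=llm)

new_turns = [
    HumanMessage(content="Hi, I'm Sam. My TV is under warranty."),
    AIMessage(content="Thanks Sam, let's file a repair request."),
]
# One extra LLM call: fold the new turns into the existing summary.
summary = memory.predict_new_summary(new_turns, existing_summary="")
print(summary)
```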

When is buffer window memory likely to work well, and what does it risk losing?

Buffer window memory keeps only the last K interactions (the transcript sets K=2 for demonstration). It works when most user intent and references are contained in recent turns—common in many support chats. The risk is losing earlier details that might matter later; for example, the initial introduction (“hi, I’m Sam”) can drop out of the prompt if it falls outside the window.
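
A sketch of that failure mode with the classic API; no LLM is needed, since window memory only slices the stored turns:

```python
# Demonstrating the window dropping early turns (classic LangChain API).
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=2)  # k=2, as in the transcript's demo
memory.save_context({"input": "Hi, I'm Sam."}, {"output": "Hello Sam!"})
memory.save_context({"input": "My TV is broken."}, {"output": "Is it under warranty?"})
memory.save_context({"input": "Yes, it is."}, {"output": "Great, let's proceed."})

# Only the last two exchanges survive; the introduction ("I'm Sam") is gone.
print(memory.load_memory_variables({})["history"])
```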

What’s the difference between knowledge graph memory and entity memory in practice?

Knowledge graph memory extracts entities and relationships and represents them in a structured form (the transcript shows a NetworkX-style graph). It then inserts only relevant information into the prompt, aiming to prevent hallucinated additions. Entity memory focuses on caching specific extracted entities (like warranty number, device, and people such as Dave) so later turns can reuse them reliably—useful for tracking relationships and attributes across a conversation.
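
A sketch of inspecting both stores, assuming the classic API's kg and entity_store attribute names (these have shifted across versions), with fabricated example data:

```python
# Inspecting the two structured stores (classic LangChain API; attribute
# names like kg and entity_store have shifted between versions).
from langchain.llms import OpenAI
from langchain.memory import ConversationKGMemory, ConversationEntityMemory

llm = OpenAI(temperature=0)

kg = ConversationKGMemory(llm=llm)
kg.save_context(
    {"input": "The warranty number for Sam's TV is A512423."},  # example data
    {"output": "Noted, thanks."},
)
print(kg.kg.get_triples())  # the relationship triples extracted so far

entities = ConversationEntityMemory(llm=llm)
entities.save_context(
    {"input": "Dave is the repair person for Sam's TV."},
    {"output": "Got it, I'll route the request to Dave."},
)
print(entities.entity_store.store)  # per-entity fact cache, e.g. {"Dave": ...}
```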

How do structured memories enable actions beyond “chat”?

Because structured memory stores durable facts (e.g., “TV is broken,” “under warranty,” “warranty number,” “repair person Dave”), other modules or chains can use those facts to trigger workflows. The transcript suggests that once the agent knows it’s a TV warranty issue, it can route the next step—like requesting a repair person and asking for contact information—using extracted entities rather than relying on the model to infer them.
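
How downstream routing might consume such a store is sketched below; the function, store contents, and warranty number are all hypothetical:

```python
# Hypothetical routing sketch: downstream logic reads the entity cache and
# triggers a workflow instead of asking the model to re-infer the facts.
def route_support_request(entity_store: dict) -> str:
    facts = {k.lower(): (v or "") for k, v in entity_store.items()}
    if "tv" in facts and "warranty" in facts["tv"].lower():
        return "Dispatch repair request; attach warranty number if present."
    return "Ask the user for warranty details."

# Example with facts like those extracted in the TV-repair walkthrough:
store = {
    "Sam": "Sam owns a TV that is broken.",
    "TV": "The TV is broken and under warranty, number A512423.",
    "Dave": "Dave is the repair person.",
}
print(route_support_request(store))
```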

Review Questions

  1. Compare conversation buffer, conversation summary, and buffer window memory: what does each store, and how does each manage token limits?
  2. Why do structured memories (knowledge graph and entity memory) reduce hallucination risk compared with simply appending more text to the prompt?
  3. In a multi-turn support scenario, which memory strategy would you choose if the user might reference details from 30 turns ago—and why?

Key Points

  1. Large language models don’t retain conversation state automatically; continuity requires re-inserting prior information into each prompt.
  2. Conversation buffer memory preserves full verbatim history but quickly runs into context window limits (cited around 4,096 tokens).
  3. Conversation summary memory compresses history into an evolving summary, reducing prompt size at the cost of extra summarization model calls.
  4. Buffer window memory keeps only the most recent K turns, which can be cost-effective but may drop earlier details needed later.
  5. A hybrid summary+buffer approach keeps recent turns verbatim while summarizing older context to balance coherence and token usage.
  6. Knowledge graph memory extracts entities and relationships into a structured representation and feeds only relevant facts back into prompts to limit hallucinations.
  7. Entity memory caches extracted attributes and people (like warranty number and a repair person) so later turns can reuse them reliably.

Highlights

Memory isn’t optional for coherent chat: without it, pronouns and shorthand references can’t be grounded in earlier conversation details.
Conversation summary memory works by making an extra language-model call to summarize the conversation after each turn, then using that summary in the next prompt.
Knowledge graph memory builds a mini graph of extracted facts (e.g., broken TV, warranty status, warranty number) and uses it to keep later responses grounded.
Entity memory can track relationships and attributes across turns—such as remembering Dave as the repair person and the warranty number for follow-up instructions.
