LangChain - Conversations with Memory (explanation & code walkthrough)
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Memory is the difference between a chat agent that feels coherent and one that repeatedly “forgets” what a user meant earlier—especially when people use shorthand or pronouns like “he,” “she,” or “that” to refer back to earlier details. Large language models don’t retain conversation state by themselves; they generate responses from the prompt they’re given. That means conversation continuity has to be engineered, either by stuffing prior context into the prompt or by maintaining an external record that can be re-inserted later.
The transcript lays out two broad strategies for memory in LangChain. The first is prompt-based memory: earlier turns are appended to the prompt so the model can answer with full conversational context. A simple "conversation buffer" does exactly that: each user message and agent reply gets stacked into the prompt. In a customer-support style example, the agent can follow along even when the user doesn't restate everything, because the prompt keeps growing turn by turn. The catch is practical: token limits cap how much history can fit. With a context window of roughly 4,096 tokens, typical of the models discussed in the video, an hour-long conversation can't be fully included verbatim.
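The buffer mechanism is simple enough to sketch in plain Python. This is an illustrative sketch of the idea, not LangChain's actual `ConversationBufferMemory` class; the speaker names and sample messages are placeholders.

```python
# Buffer-style memory: every turn is stored verbatim and the full
# history is replayed above each new prompt.
class ConversationBuffer:
    def __init__(self):
        self.turns = []  # list of (speaker, text) pairs

    def add(self, speaker, text):
        self.turns.append((speaker, text))

    def as_prompt(self, new_input):
        # Stack the entire history above the new user message.
        history = "\n".join(f"{s}: {t}" for s, t in self.turns)
        return f"{history}\nHuman: {new_input}\nAI:"

buf = ConversationBuffer()
buf.add("Human", "Hi, I'm Sam. My TV is broken.")
buf.add("AI", "Sorry to hear that, Sam. Is it under warranty?")
prompt = buf.as_prompt("Yes, it is.")
```

Because the whole history is re-sent on every turn, the prompt grows linearly with the conversation, which is exactly why the token limit bites.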
To address token limits, the transcript then shifts to memory that compresses history. “Conversation summary memory” replaces verbatim turns with an evolving summary. After each interaction, the system calls a language model again to summarize what happened so far, then feeds that summary back into the next prompt. This reduces how much text is carried forward while still preserving key facts like who the user is (Sam) and what the user is trying to do (get customer support). The tradeoff is extra computation: summarization requires additional model calls beyond the one used to generate the user-facing response.
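The summary approach can be sketched the same way. Here `summarize` is a hypothetical stub standing in for the extra LLM call LangChain makes after each turn; a real system would replace it with a model request.

```python
# Summary-style memory: instead of carrying turns verbatim, maintain
# one evolving summary and feed it into the next prompt.
def summarize(previous_summary, human, ai):
    # Placeholder: a real implementation calls a language model here,
    # which is the extra per-turn cost the transcript mentions.
    return (previous_summary + f" Human said {human!r}; AI said {ai!r}.").strip()

class SummaryMemory:
    def __init__(self):
        self.summary = ""

    def add(self, human, ai):
        # One additional model call per interaction to refresh the summary.
        self.summary = summarize(self.summary, human, ai)

    def as_prompt(self, new_input):
        return f"Summary of conversation so far: {self.summary}\nHuman: {new_input}\nAI:"

mem = SummaryMemory()
mem.add("Hi, I'm Sam.", "Hello Sam, how can I help?")
```

The prompt size now tracks the summary length rather than the raw transcript, at the price of those extra summarization calls.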
A related approach, “conversation buffer window memory,” keeps only the last K interactions (or a limited slice of recent context) in the prompt. That can be enough for many real conversations, where users rarely need details from the distant past, and it’s cheaper than full-history buffering. The transcript also describes a hybrid: “summary + buffer,” where older content is summarized while the most recent turns remain verbatim. This “best of both worlds” design helps maintain continuity without letting prompts balloon.
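The hybrid design can be sketched by combining a fixed-size window with a summary that absorbs whatever falls out of it. This is an illustrative sketch, not LangChain's `ConversationSummaryBufferMemory`; `summarize_turns` is a hypothetical stand-in for the LLM summarization call, and setting `k` while never summarizing gives plain buffer-window behavior.

```python
from collections import deque

def summarize_turns(old_summary, evicted):
    # Placeholder for an LLM call that folds an evicted turn into the summary.
    speaker, text = evicted
    return f"{old_summary} [{speaker}: {text}]".strip()

class SummaryBufferMemory:
    def __init__(self, k=2):
        self.k = k
        self.recent = deque()  # last k turns, kept verbatim
        self.summary = ""      # compressed older context

    def add(self, speaker, text):
        self.recent.append((speaker, text))
        while len(self.recent) > self.k:
            # The oldest turn leaves the window and enters the summary.
            self.summary = summarize_turns(self.summary, self.recent.popleft())

    def as_prompt(self, new_input):
        history = "\n".join(f"{s}: {t}" for s, t in self.recent)
        return f"Summary: {self.summary}\n{history}\nHuman: {new_input}\nAI:"

sbm = SummaryBufferMemory(k=2)
for speaker, text in [
    ("Human", "Hi, I'm Sam."),
    ("AI", "Hello, how can I help?"),
    ("Human", "My TV is broken."),
    ("AI", "Is it under warranty?"),
]:
    sbm.add(speaker, text)
```

Recent turns stay verbatim for fidelity while older ones are compressed, which is the "best of both worlds" tradeoff the transcript describes.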
Beyond summarization, the transcript introduces structured memory. “Knowledge graph memory” extracts entities and relationships from the conversation and inserts only relevant facts into the prompt, aiming to avoid hallucinating new information. In the TV-repair example, the system builds a mini graph capturing details like “Sam” owning a “TV,” the TV being “broken,” and being “under warranty,” including a warranty number. “Entity memory” similarly caches extracted entities—such as the warranty number and a repair person named Dave—so later turns can reference them reliably. The result is an agent that can track relationships and key attributes across turns, enabling downstream actions like routing a repair request or triggering other chains.
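The structured approach can be sketched as a small triple store. This is an illustrative sketch, not LangChain's `ConversationKGMemory` or entity memory classes: real systems use an LLM to extract the triples, whereas here they are supplied directly, relevance is approximated by keyword matching, and the warranty number is a made-up placeholder. Entity memory works analogously, caching attributes under each entity's name.

```python
# Knowledge-graph-style memory: store (subject, relation, object) facts
# and inject only the ones relevant to the current input.
class KnowledgeGraphMemory:
    def __init__(self):
        self.triples = []  # (subject, relation, object)

    def add_fact(self, subject, relation, obj):
        self.triples.append((subject, relation, obj))

    def relevant_facts(self, text):
        # Keep only facts whose subject or object appears in the input;
        # grounding the prompt in stored facts is what limits hallucination.
        words = text.lower()
        return [t for t in self.triples
                if t[0].lower() in words or t[2].lower() in words]

kg = KnowledgeGraphMemory()
kg.add_fact("Sam", "owns", "TV")
kg.add_fact("TV", "is", "broken")
kg.add_fact("TV", "covered by", "warranty 12345")  # placeholder number
facts = kg.relevant_facts("Can you book a repair for my TV?")
```

Only TV-related facts come back for a TV question, so the prompt carries durable facts rather than ever-longer transcript text, which is what enables downstream actions like routing a repair request.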
Overall, the transcript frames memory as a set of engineering choices: trade prompt length for coherence, compress history when context windows run out, and use structured extraction when the agent needs durable facts rather than just a longer prompt.
Cornell Notes
LangChain memory is necessary because large language models generate responses from the prompt they receive and don’t inherently retain conversation state. The transcript compares prompt-based memory (like a conversation buffer) with compressed memory (like conversation summary and buffer window approaches) to manage token limits. It also shows structured memory options—knowledge graph memory and entity memory—that extract facts and relationships (e.g., a broken TV under warranty, warranty number, and a repair person) and reuse them in later turns. These designs help agents handle shorthand references and maintain continuity without exceeding context windows. The key practical lesson is choosing the right memory strategy based on token budget, cost, and how much durable factual tracking the agent needs.
- Why does an agent need “memory” at all if it can already answer questions?
- What’s the core tradeoff between a conversation buffer and token limits?
- How does conversation summary memory reduce token usage, and what extra cost does it introduce?
- When is buffer window memory likely to work well, and what does it risk losing?
- What’s the difference between knowledge graph memory and entity memory in practice?
- How do structured memories enable actions beyond “chat”?
Review Questions
- Compare conversation buffer, conversation summary, and buffer window memory: what does each store, and how does each manage token limits?
- Why do structured memories (knowledge graph and entity memory) reduce hallucination risk compared with simply appending more text to the prompt?
- In a multi-turn support scenario, which memory strategy would you choose if the user might reference details from 30 turns ago—and why?
Key Points
1. Large language models don’t retain conversation state automatically; continuity requires re-inserting prior information into each prompt.
2. Conversation buffer memory preserves full verbatim history but quickly runs into context window limits (cited around 4,096 tokens).
3. Conversation summary memory compresses history into an evolving summary, reducing prompt size at the cost of extra summarization model calls.
4. Buffer window memory keeps only the most recent K turns, which can be cost-effective but may drop earlier details needed later.
5. A hybrid summary+buffer approach keeps recent turns verbatim while summarizing older context to balance coherence and token usage.
6. Knowledge graph memory extracts entities and relationships into a structured representation and feeds only relevant facts back into prompts to limit hallucinations.
7. Entity memory caches extracted attributes and people (like warranty number and a repair person) so later turns can reuse them reliably.