How To Implement Short Term Memory Using LangGraph
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Short-term memory in LangGraph isn’t something LLMs can keep on their own—so the practical fix is to store conversation state outside the model and feed it back in a controlled way. The walkthrough builds that capability step by step: first by using LangGraph’s checkpointer plus a per-conversation thread ID to maintain a “conversation buffer,” then by replacing volatile RAM storage with persistent PostgreSQL so the state survives restarts. The result is a production-shaped pattern for agentic chat where “what was said before” remains available across turns and across process lifecycles.
The core starting point is the stateless nature of LLM calls: each invocation behaves like a fresh conversation unless prior messages are explicitly provided. To simulate short-term memory, the guide maintains a running message history and appends it to each new LLM call. In LangGraph terms, that state is stored at every superstep via a checkpointer. A thread ID (e.g., “thread one” vs “thread two”) determines which conversation history gets loaded and updated. With this setup, asking “What is my name?” after earlier messages correctly returns the previously provided name—because the same thread’s stored messages are reattached to the next model call.
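In LangGraph, the checkpointer and thread ID handle this buffering automatically. To make the mechanics visible, here is a minimal stdlib-only sketch of the same idea: a dict keyed by thread ID holds each conversation's history, and the full history is reattached on every call. `fake_llm` is a stand-in for a real model invocation, not part of any library.

```python
# Sketch of per-thread short-term memory: each thread ID maps to its own
# message buffer, and the whole buffer is passed back on every turn.
# LangGraph's checkpointer does this for you; `fake_llm` is a stub.
from collections import defaultdict

buffers: dict[str, list[dict]] = defaultdict(list)  # thread_id -> history

def fake_llm(messages: list[dict]) -> str:
    # Stand-in: a real LLM would generate a reply from the full history.
    for m in reversed(messages):
        if "my name is" in m["content"].lower():
            return "Your name is " + m["content"].split()[-1].rstrip(".")
    return "I don't know your name."

def chat(thread_id: str, user_text: str) -> str:
    history = buffers[thread_id]
    history.append({"role": "user", "content": user_text})  # record new turn
    reply = fake_llm(history)                               # reattach full history
    history.append({"role": "assistant", "content": reply})
    return reply

chat("thread-1", "Hi, my name is Alice.")
print(chat("thread-1", "What is my name?"))  # remembered within thread-1
print(chat("thread-2", "What is my name?"))  # thread-2 has no such context
```

Because the buffers are keyed by thread ID, "thread-1" answers the name question while "thread-2" cannot, which is exactly the separation the video demonstrates.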
That works until the system restarts. The checkpointer initially stores state in RAM, which disappears when the program exits. After a restart, the guide demonstrates that the prior thread’s messages are gone, and the model can no longer answer based on earlier context. This exposes the biggest gap in the naive short-term memory implementation: volatility. The fix is persistence—LangGraph’s recommended approach is to use a PostgreSQL-backed checkpointer (the walkthrough uses a Docker-based PostgreSQL setup). After wiring the checkpointer to a database URL and running the graph inside a context manager, the same thread history can be fetched after restarting the Python process, and the model continues to respond with the remembered details.
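The walkthrough's fix is LangGraph's PostgreSQL-backed checkpointer; as a stdlib-only illustration of why a durable store fixes the volatility problem, the sketch below swaps in SQLite (a deliberate substitution, since Postgres requires a running server). Closing and reopening the database connection stands in for a process restart: the history is still there because it lives on disk, not in RAM. The schema and helper names are invented for this sketch.

```python
# Sketch of restart-safe memory: each turn is written to a database, so a
# fresh process can reload the thread's history. The video uses LangGraph's
# Postgres checkpointer; SQLite is used here only to keep the sketch
# dependency-free.
import json
import sqlite3

def save_turn(db_path: str, thread_id: str, role: str, content: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS turns (thread_id TEXT, message TEXT)")
        conn.execute(
            "INSERT INTO turns VALUES (?, ?)",
            (thread_id, json.dumps({"role": role, "content": content})),
        )

def load_history(db_path: str, thread_id: str) -> list[dict]:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS turns (thread_id TEXT, message TEXT)")
        rows = conn.execute(
            "SELECT message FROM turns WHERE thread_id = ?", (thread_id,)
        ).fetchall()
    return [json.loads(r[0]) for r in rows]

save_turn("demo_memory.db", "thread-1", "user", "My name is Alice.")
# ...the process exits and restarts; a fresh load still sees the history:
history = load_history("demo_memory.db", "thread-1")
print(history[0]["content"])
```

The same contract holds for the real setup: wire the checkpointer to a database URL, run the graph inside a context manager, and any process that connects with the same thread ID resumes the conversation.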
Once memory is persistent, the next bottleneck appears: context window overflow. Because the short-term approach keeps concatenating the full conversation history, long chats can exceed the LLM’s maximum token limit, leading to unreliable answers and hallucinations. To prevent that, the walkthrough introduces two mitigation techniques.
First is trimming: before calling the LLM, the system approximates the token count of the conversation and enforces a maximum token budget (the example uses 150 tokens). If the history would exceed the limit, older messages are not deleted from state; they are simply omitted from what gets sent to the model, so only the most recent messages that fit remain in the prompt.
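The trimming step can be sketched as follows. This is a hedged illustration of the counting-and-omitting logic, not the video's exact code: the four-characters-per-token estimate is a crude assumption, and real implementations use a proper tokenizer (LangChain ships a `trim_messages` helper for this).

```python
# Sketch of trimming: state keeps the full history, but only the newest
# messages that fit a token budget are sent to the model. The
# 4-chars-per-token estimate is an assumption for illustration only.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_for_prompt(history: list[dict], max_tokens: int = 150) -> list[dict]:
    kept: list[dict] = []
    budget = max_tokens
    for msg in reversed(history):          # walk from newest to oldest
        cost = approx_tokens(msg["content"])
        if cost > budget:
            break                          # older messages omitted, not deleted
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))            # restore chronological order

history = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens, oldest
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "What is my name?"},  # newest
]
prompt = trim_for_prompt(history, max_tokens=150)
print(len(prompt), len(history))  # the prompt shrinks; state does not
```

Note that `history` itself is untouched; trimming only changes what reaches the model on this turn.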
Second is summarization, which addresses trimming’s downside: older messages are ignored entirely, even when they contain useful information. Summarization keeps the latest messages while compressing older content into a running summary generated by the model. When the message count crosses a threshold (example triggers when more than six messages exist), the system summarizes the earlier portion, deletes those older messages from state, and retains only the summary plus the newest turns. The final flow uses a conditional graph edge: chat proceeds normally until the threshold is exceeded, then a cleanup-and-summarize node updates the summary and prunes the message list.
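The threshold-summarize-prune flow can be sketched like this. It is a simplified stand-in for the graph node the video builds (where deletions go through LangGraph's `RemoveMessage` mechanism and the routing uses a conditional edge); `summarize` is a stub for the actual LLM summarization call, and `KEEP_LAST` is an assumed value.

```python
# Sketch of summarize-and-prune: once the history exceeds a threshold,
# older messages are compressed into a running summary and removed from
# state, keeping only the summary plus the newest turns. `summarize` is a
# stub standing in for an LLM call; KEEP_LAST is an assumption.
THRESHOLD = 6   # trigger when more than six messages exist (as in the video)
KEEP_LAST = 2   # newest turns retained verbatim (assumed value)

def summarize(prev_summary: str, messages: list[dict]) -> str:
    # Stub: a real implementation asks the model to extend prev_summary.
    lines = "; ".join(m["content"] for m in messages)
    return (prev_summary + " " if prev_summary else "") + lines

def maybe_compress(state: dict) -> dict:
    msgs = state["messages"]
    if len(msgs) <= THRESHOLD:
        return state                        # below threshold: proceed normally
    old, recent = msgs[:-KEEP_LAST], msgs[-KEEP_LAST:]
    new_summary = summarize(state.get("summary", ""), old)
    return {"summary": new_summary, "messages": recent}  # prune old messages

state = {"summary": "",
         "messages": [{"role": "user", "content": f"turn {i}"} for i in range(7)]}
state = maybe_compress(state)
print(len(state["messages"]), repr(state["summary"]))
```

The `if len(msgs) <= THRESHOLD` branch plays the role of the conditional edge: most turns pass straight through, and the cleanup path only fires once the message count crosses the threshold.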
Together, the walkthrough delivers a complete pattern: store per-thread state with a checkpointer, persist it with PostgreSQL for restart safety, and manage prompt size with trimming and summarization to avoid context overflow while keeping important prior information available.
Cornell Notes
LLMs are stateless, so LangGraph short-term memory must be implemented by storing conversation state externally and reattaching it to each new LLM call. The walkthrough uses a checkpointer plus a thread ID to maintain separate conversation buffers per user/session. It then upgrades the setup from RAM (lost on restart) to a PostgreSQL-backed checkpointer so stored messages persist across process restarts. Finally, it tackles context window overflow: trimming enforces a token budget by omitting older messages from the prompt, while summarization preserves older knowledge by generating a running summary and deleting pruned messages from state. This combination yields a production-style memory pipeline for agentic chat.
Why doesn’t short-term memory “just work” with LLM invocations, and what mechanism in LangGraph compensates for that?
How does a thread ID change the behavior of memory in the example?
What breaks after a restart in the initial implementation, and how does PostgreSQL fix it?
What is the context overflow problem, and why do trimming and summarization address it differently?
In the summarization workflow, what triggers summarization and what gets deleted?
Review Questions
- How would you explain the role of a checkpointer and thread ID in implementing short-term memory in LangGraph?
- What are the tradeoffs between trimming and summarization for long-running conversations?
- Why is persistence (e.g., PostgreSQL) necessary even if short-term memory works during a single program run?
Key Points
1. Short-term memory requires explicitly storing conversation state outside the LLM and reattaching it to each new invocation.
2. LangGraph checkpointers save graph state at supersteps, enabling conversation buffers to persist across turns within a running process.
3. Thread IDs partition memory so separate conversations don’t contaminate each other’s context.
4. RAM-backed state disappears on restart; PostgreSQL-backed checkpointers provide durable persistence across process lifecycles.
5. Context overflow occurs when concatenated history exceeds the LLM context window, degrading answer quality.
6. Trimming enforces a token budget by omitting older messages from the prompt (without necessarily deleting them from state).
7. Summarization preserves older knowledge by generating a running summary and deleting pruned raw messages once the message count crosses a threshold.