How To Implement Short Term Memory Using LangGraph
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Short-term memory in LangGraph isn’t something LLMs can keep on their own—so the practical fix is to store conversation state outside the model and feed it back in a controlled way. The walkthrough builds that capability step by step: first by using LangGraph’s checkpointer plus a per-conversation thread ID to maintain a “conversation buffer,” then by replacing volatile RAM storage with persistent PostgreSQL so the state survives restarts. The result is a production-shaped pattern for agentic chat where “what was said before” remains available across turns and across process lifecycles.
The core starting point is the stateless nature of LLM calls: each invocation behaves like a fresh conversation unless prior messages are explicitly provided. To simulate short-term memory, the guide maintains a running message history and appends it to each new LLM call. In LangGraph terms, that state is stored at every superstep via a checkpointer. A thread ID (e.g., “thread one” vs “thread two”) determines which conversation history gets loaded and updated. With this setup, asking “What is my name?” after earlier messages correctly returns the previously provided name—because the same thread’s stored messages are reattached to the next model call.
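In LangGraph, the checkpointer and thread ID handle this buffering automatically. To make the mechanics visible, here is a minimal stdlib-only sketch of the same idea: a dict keyed by thread ID holds each conversation's history, and the full history is reattached on every call. `fake_llm` is a stand-in for a real model invocation, not part of any library.

```python
# Sketch of per-thread short-term memory: each thread ID maps to its own
# message buffer, and the whole buffer is passed back on every turn.
# LangGraph's checkpointer does this for you; `fake_llm` is a stub.
from collections import defaultdict

buffers: dict[str, list[dict]] = defaultdict(list)  # thread_id -> history

def fake_llm(messages: list[dict]) -> str:
    # Stand-in: a real LLM would generate a reply from the full history.
    for m in reversed(messages):
        if "my name is" in m["content"].lower():
            return "Your name is " + m["content"].split()[-1].rstrip(".")
    return "I don't know your name."

def chat(thread_id: str, user_text: str) -> str:
    history = buffers[thread_id]
    history.append({"role": "user", "content": user_text})  # record new turn
    reply = fake_llm(history)                               # reattach full history
    history.append({"role": "assistant", "content": reply})
    return reply

chat("thread-1", "Hi, my name is Alice.")
print(chat("thread-1", "What is my name?"))  # remembered within thread-1
print(chat("thread-2", "What is my name?"))  # thread-2 has no such context
```

Because the buffers are keyed by thread ID, "thread-1" answers the name question while "thread-2" cannot, which is exactly the separation the video demonstrates.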
That works until the system restarts. The checkpointer initially stores state in RAM, which disappears when the program exits. After a restart, the guide demonstrates that the prior thread’s messages are gone, and the model can no longer answer based on earlier context. This exposes the biggest gap in the naive short-term memory implementation: volatility. The fix is persistence—LangGraph’s recommended approach is to use a PostgreSQL-backed checkpointer (the walkthrough uses a Docker-based PostgreSQL setup). After wiring the checkpointer to a database URL and running the graph inside a context manager, the same thread history can be fetched after restarting the Python process, and the model continues to respond with the remembered details.
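The walkthrough's fix is LangGraph's PostgreSQL-backed checkpointer; as a stdlib-only illustration of why a durable store fixes the volatility problem, the sketch below swaps in SQLite (a deliberate substitution, since Postgres requires a running server). Closing and reopening the database connection stands in for a process restart: the history is still there because it lives on disk, not in RAM. The schema and helper names are invented for this sketch.

```python
# Sketch of restart-safe memory: each turn is written to a database, so a
# fresh process can reload the thread's history. The video uses LangGraph's
# Postgres checkpointer; SQLite is used here only to keep the sketch
# dependency-free.
import json
import sqlite3

def save_turn(db_path: str, thread_id: str, role: str, content: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS turns (thread_id TEXT, message TEXT)")
        conn.execute(
            "INSERT INTO turns VALUES (?, ?)",
            (thread_id, json.dumps({"role": role, "content": content})),
        )

def load_history(db_path: str, thread_id: str) -> list[dict]:
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS turns (thread_id TEXT, message TEXT)")
        rows = conn.execute(
            "SELECT message FROM turns WHERE thread_id = ?", (thread_id,)
        ).fetchall()
    return [json.loads(r[0]) for r in rows]

save_turn("demo_memory.db", "thread-1", "user", "My name is Alice.")
# ...the process exits and restarts; a fresh load still sees the history:
history = load_history("demo_memory.db", "thread-1")
print(history[0]["content"])
```

The same contract holds for the real setup: wire the checkpointer to a database URL, run the graph inside a context manager, and any process that connects with the same thread ID resumes the conversation.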
Once memory is persistent, the next bottleneck appears: context window overflow. Because the short-term approach keeps concatenating the full conversation history, long chats can exceed the LLM’s maximum token limit, leading to unreliable answers and hallucinations. To prevent that, the walkthrough introduces two mitigation techniques.
First is trimming: before calling the LLM, the system approximates the token count of the conversation and enforces a maximum token budget (the example uses 150 tokens). If the history would exceed the limit, older messages are not deleted from state; they are simply omitted from what gets sent to the model, so only the most recent messages that fit remain in the prompt.
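The trimming step can be sketched as follows. This is a hedged illustration of the counting-and-omitting logic, not the video's exact code: the four-characters-per-token estimate is a crude assumption, and real implementations use a proper tokenizer (LangChain ships a `trim_messages` helper for this).

```python
# Sketch of trimming: state keeps the full history, but only the newest
# messages that fit a token budget are sent to the model. The
# 4-chars-per-token estimate is an assumption for illustration only.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_for_prompt(history: list[dict], max_tokens: int = 150) -> list[dict]:
    kept: list[dict] = []
    budget = max_tokens
    for msg in reversed(history):          # walk from newest to oldest
        cost = approx_tokens(msg["content"])
        if cost > budget:
            break                          # older messages omitted, not deleted
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))            # restore chronological order

history = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens, oldest
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "What is my name?"},  # newest
]
prompt = trim_for_prompt(history, max_tokens=150)
print(len(prompt), len(history))  # the prompt shrinks; state does not
```

Note that `history` itself is untouched; trimming only changes what reaches the model on this turn.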
Second is summarization, which addresses trimming’s downside: older messages are ignored entirely, even when they contain useful information. Summarization keeps the latest messages while compressing older content into a running summary generated by the model. When the message count crosses a threshold (example triggers when more than six messages exist), the system summarizes the earlier portion, deletes those older messages from state, and retains only the summary plus the newest turns. The final flow uses a conditional graph edge: chat proceeds normally until the threshold is exceeded, then a cleanup-and-summarize node updates the summary and prunes the message list.
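The threshold-summarize-prune flow can be sketched like this. It is a simplified stand-in for the graph node the video builds (where deletions go through LangGraph's `RemoveMessage` mechanism and the routing uses a conditional edge); `summarize` is a stub for the actual LLM summarization call, and `KEEP_LAST` is an assumed value.

```python
# Sketch of summarize-and-prune: once the history exceeds a threshold,
# older messages are compressed into a running summary and removed from
# state, keeping only the summary plus the newest turns. `summarize` is a
# stub standing in for an LLM call; KEEP_LAST is an assumption.
THRESHOLD = 6   # trigger when more than six messages exist (as in the video)
KEEP_LAST = 2   # newest turns retained verbatim (assumed value)

def summarize(prev_summary: str, messages: list[dict]) -> str:
    # Stub: a real implementation asks the model to extend prev_summary.
    lines = "; ".join(m["content"] for m in messages)
    return (prev_summary + " " if prev_summary else "") + lines

def maybe_compress(state: dict) -> dict:
    msgs = state["messages"]
    if len(msgs) <= THRESHOLD:
        return state                        # below threshold: proceed normally
    old, recent = msgs[:-KEEP_LAST], msgs[-KEEP_LAST:]
    new_summary = summarize(state.get("summary", ""), old)
    return {"summary": new_summary, "messages": recent}  # prune old messages

state = {"summary": "",
         "messages": [{"role": "user", "content": f"turn {i}"} for i in range(7)]}
state = maybe_compress(state)
print(len(state["messages"]), repr(state["summary"]))
```

The `if len(msgs) <= THRESHOLD` branch plays the role of the conditional edge: most turns pass straight through, and the cleanup path only fires once the message count crosses the threshold.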
Together, the walkthrough delivers a complete pattern: store per-thread state with a checkpointer, persist it with PostgreSQL for restart safety, and manage prompt size with trimming and summarization to avoid context overflow while keeping important prior information available.
Cornell Notes
LLMs are stateless, so LangGraph short-term memory must be implemented by storing conversation state externally and reattaching it to each new LLM call. The walkthrough uses a checkpointer plus a thread ID to maintain separate conversation buffers per user/session. It then upgrades the setup from RAM (lost on restart) to a PostgreSQL-backed checkpointer so stored messages persist across process restarts. Finally, it tackles context window overflow: trimming enforces a token budget by omitting older messages from the prompt, while summarization preserves older knowledge by generating a running summary and deleting pruned messages from state. This combination yields a production-style memory pipeline for agentic chat.
Why doesn’t short-term memory “just work” with LLM invocations, and what mechanism in LangGraph compensates for that?
How does a thread ID change the behavior of memory in the example?
What breaks after a restart in the initial implementation, and how does PostgreSQL fix it?
What is the context overflow problem, and why do trimming and summarization address it differently?
In the summarization workflow, what triggers summarization and what gets deleted?
Review Questions
- How would you explain the role of a checkpointer and thread ID in implementing short-term memory in LangGraph?
- What are the tradeoffs between trimming and summarization for long-running conversations?
- Why is persistence (e.g., PostgreSQL) necessary even if short-term memory works during a single program run?
Key Points
1. Short-term memory requires explicitly storing conversation state outside the LLM and reattaching it to each new invocation.
2. LangGraph checkpointers save graph state at supersteps, enabling conversation buffers to persist across turns within a running process.
3. Thread IDs partition memory so separate conversations don’t contaminate each other’s context.
4. RAM-backed state disappears on restart; PostgreSQL-backed checkpointers provide durable persistence across process lifecycles.
5. Context overflow occurs when concatenated history exceeds the LLM context window, degrading answer quality.
6. Trimming enforces a token budget by omitting older messages from the prompt (without necessarily deleting them from state).
7. Summarization preserves older knowledge by generating a running summary and deleting pruned raw messages once the message count crosses a threshold.