Build Hour: Agent Memory Patterns
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Context engineering is the foundation for agent memory: what the model receives (prompt, state/history, tools, retrieved knowledge, memory) determines performance under a finite token budget.
Briefing
Agent memory patterns boil down to one practical problem: long-running AI agents have finite context windows, so every extra instruction, chat turn, tool output, and retrieved fact competes for tokens. Without deliberate context engineering, agents forget earlier details, get pulled into contradictory instructions, or drown in noisy tool payloads—leading to re-asking, degraded reliability, and “token spikes” that can derail tool-heavy workflows. With memory and context management, agents can stay stateful across turns and even across sessions, making troubleshooting and multi-step support feel consistent rather than brittle.
The session frames context engineering as both art and science. Judgment determines what matters at each step of reasoning or action, while repeatable methods and measurable impacts make context management systematic. A key takeaway is that modern LLM performance depends less on raw model capability than on what the model is actually given: prompt structure, state and history handling, memory storage/retrieval, and retrieval/citation pipelines all shape what the model “sees.” The talk presents three core strategies for agent memory. First, reshape and fit: trim, compact, and summarize to fit the context window. Second, isolate and route: send the right context and tools to the right sub-agent to reduce conflict. Third, extract and retrieve: pull high-quality memories at the right time.
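To make "isolate and route" concrete, here is a minimal, framework-agnostic TypeScript sketch (not the session's code): each sub-agent carries a lean prompt, a small tool set, and a filter that scopes the history it sees. The SubAgent shape, the tool names, and the keyword router are all illustrative; in the Agents SDK, handoffs would play the router's role.

```ts
// Sketch of "isolate and route": each sub-agent gets only the context
// slice and tools relevant to its job, reducing context conflict and noise.
// All names here (SubAgent, routeTurn, the keyword rules) are illustrative,
// not part of the OpenAI Agents SDK.

type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

interface SubAgent {
  name: string;
  systemPrompt: string; // lean, task-specific instructions
  tools: string[];      // small tool set with clear boundaries
  contextFilter: (history: Message[]) => Message[]; // only what this agent needs
}

const billingAgent: SubAgent = {
  name: "billing",
  systemPrompt: "You handle refunds and order questions only.",
  tools: ["get_orders", "get_policy"],
  // Keep dialogue, but drop tool outputs unrelated to orders.
  contextFilter: (h) => h.filter((m) => m.role !== "tool" || m.content.includes("order")),
};

const techAgent: SubAgent = {
  name: "troubleshooting",
  systemPrompt: "You diagnose device and connectivity issues only.",
  tools: ["run_diagnostics"],
  contextFilter: (h) => h.slice(-10), // recent turns suffice for diagnosis
};

// Naive keyword router; in practice a classifier or the SDK's handoffs
// would make this decision.
function routeTurn(userMessage: string): SubAgent {
  return /refund|order|charge/i.test(userMessage) ? billingAgent : techAgent;
}
```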
A major distinction separates short-term memory (in-session techniques) from long-term memory (cross-session continuity). Short-term memory maximizes the active context window during a conversation, while long-term memory preserves continuity across sessions by storing and later injecting summarized or extracted facts. The session illustrates the difference with a troubleshooting scenario: without memory, an agent forgets the original issues after many turns and re-asks; with memory, it keeps the unresolved thread and references earlier actions like firmware updates and background sync.
The failure modes are grouped into four categories. Context burst happens when a turn suddenly injects large tool payloads, causing token spikes. Context conflict arises when contradictory instructions or tool results collide—such as a refund policy rule that conflicts with a VIP eligibility claim. Context poisoning occurs when incorrect information enters context (including via summaries or stored memory objects) and then propagates across turns. Context noise is the overload of redundant or overly similar tool definitions and outputs that make the context harder to use.
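Context burst is also the most mechanical failure mode to detect. A minimal sketch, assuming a rough four-characters-per-token estimate (a real system would use an actual tokenizer such as tiktoken): compare the context's token count before and after a turn and flag any spike past a threshold.

```ts
// Sketch of a "context burst" detector: flag any turn whose token delta
// spikes past a budget, typically because a tool returned a huge payload.
// The chars/4 estimate and the default threshold are rough assumptions.

type Message = { role: string; content: string };

const approxTokens = (text: string) => Math.ceil(text.length / 4);

function contextTokens(history: Message[]): number {
  return history.reduce((sum, m) => sum + approxTokens(m.content), 0);
}

function detectBurst(before: Message[], after: Message[], spikeThreshold = 2000): boolean {
  const delta = contextTokens(after) - contextTokens(before);
  return delta > spikeThreshold; // one turn added an unusually large payload
}
```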
To make these ideas concrete, a dual-agent Next.js demo (built with the OpenAI Agents SDK) runs an “IT troubleshooting” assistant with tools like “get orders” and “get policy.” The demo visualizes context token growth across turns and shows context burst when large refund-policy content is injected. It then demonstrates reshape-and-fit controls: context trimming removes older turns and tool outputs once a trigger is hit; context compaction drops older tool calls/results while preserving message structure; and context summarization compresses prior dialogue into a structured “memory item” that is re-injected later. The summarization prompt is crafted to enforce factual, structured summaries with contradiction checks, temporal ordering, and hallucination control.
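The three controls can be sketched over a plain message array. This is not the demo's implementation: the message shape, budgets, and model name are assumptions, and only the summarization step calls a real API (the OpenAI Node SDK's chat.completions.create).

```ts
// Hedged sketch of the three reshape-and-fit controls described above.

import OpenAI from "openai";

type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Rough token estimate; a real system would use a tokenizer such as tiktoken.
const approxTokens = (m: Msg) => Math.ceil(m.content.length / 4);
const totalTokens = (h: Msg[]) => h.reduce((n, m) => n + approxTokens(m), 0);

// 1) Trimming: once past the budget, drop the oldest non-system messages
//    (turns and their tool outputs alike) until the history fits again.
function trim(history: Msg[], budget = 8000): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  while (totalTokens([...system, ...rest]) > budget && rest.length > 1) rest.shift();
  return [...system, ...rest];
}

// 2) Compaction: keep the message structure, but blank out older tool
//    results, usually the bulkiest and least reusable content.
function compact(history: Msg[], keepRecent = 4): Msg[] {
  const cutoff = history.length - keepRecent;
  return history.map((m, i) =>
    m.role === "tool" && i < cutoff ? { ...m, content: "[tool output compacted]" } : m
  );
}

// 3) Summarization: compress older dialogue into one structured "memory
//    item" to re-inject. The prompt mirrors the constraints described in
//    the session: factual, temporally ordered, contradiction-checked.
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function summarize(older: Msg[]): Promise<Msg> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // model choice is an assumption
    messages: [
      {
        role: "system",
        content:
          "Summarize the conversation into a structured memory item. " +
          "Include only stated facts, preserve temporal order, flag " +
          "contradictions, and do not invent details.",
      },
      { role: "user", content: older.map((m) => `${m.role}: ${m.content}`).join("\n") },
    ],
  });
  return { role: "system", content: `Memory item:\n${res.choices[0].message.content ?? ""}` };
}
```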
For cross-session memory, the session shows a “memory injection” feature that places the generated summary into the system prompt for a new session, enabling personalization—such as continuing a MacBook internet troubleshooting thread after an OS update. Guardrails for memory are emphasized: treat memory as potentially stale or incomplete, avoid over-weighting it, and don’t store secrets or accept injection-like attacks.
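A minimal sketch of that memory-injection step, assuming a simple store (the in-memory Map stands in for whatever persistence layer is actually used); note that the guardrail language about staleness is written into the prompt itself.

```ts
// Sketch of cross-session "memory injection": a stored summary from a
// previous session is placed into the new session's system prompt, framed
// so the model treats it as possibly stale or incomplete.

type Memory = { userId: string; summary: string; savedAt: string };

// In-memory stand-in for a real persistence layer (DB, KV store, etc.).
const memoryStore = new Map<string, Memory>();

async function loadMemory(userId: string): Promise<Memory | null> {
  return memoryStore.get(userId) ?? null;
}

async function buildSystemPrompt(userId: string, basePrompt: string): Promise<string> {
  const memory = await loadMemory(userId);
  if (!memory) return basePrompt;
  return [
    basePrompt,
    `Known context from a prior session (saved ${memory.savedAt}).`,
    "Treat it as possibly stale or incomplete; re-verify before acting on it:",
    memory.summary,
  ].join("\n\n");
}
```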
Finally, the Q&A extends the engineering view: evaluate memory by running standard evals with memory on/off, then build memory-specific evals for long-running tasks and long-context behavior; manage memory scope (global vs session) and prune stale memories using temporal tags or decay/consolidation; and scale memory systems by optimizing retrieval (vector DB storage, filtering, ranking, embeddings) and storage/persistence for large memory pools. The overall message is to balance what to remember, how to compress or route it, and when to retrieve it—then validate with targeted evals.
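That on/off comparison can be framed as a small harness. In the sketch below, runAgent and scoreTranscript are caller-supplied placeholders for the agent loop and the grader (for example, an LLM judge); only the comparison logic itself is shown.

```ts
// Sketch of an A/B eval: run the same task set with memory injection on
// and off, then compare mean scores per condition.

interface EvalCase { userId: string; task: string; expected: string }

type RunAgent = (task: string, opts: { memory: boolean; userId: string }) => Promise<string>;
type ScoreTranscript = (output: string, expected: string) => Promise<number>; // 0..1

async function compareMemory(
  cases: EvalCase[],
  runAgent: RunAgent,
  scoreTranscript: ScoreTranscript
) {
  const totals = { on: 0, off: 0 };
  for (const c of cases) {
    for (const memory of [true, false] as const) {
      const output = await runAgent(c.task, { memory, userId: c.userId });
      totals[memory ? "on" : "off"] += await scoreTranscript(output, c.expected);
    }
  }
  // Memory should win on long-running, long-context tasks in particular.
  return { on: totals.on / cases.length, off: totals.off / cases.length };
}
```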
Cornell Notes
Agent memory patterns are presented as a response to a hard constraint: finite context windows. The session separates short-term memory (in-session trimming, compaction, summarization) from long-term memory (cross-session continuity via extracted or summarized memories). It identifies four common failure modes—context burst, context conflict, context poisoning, and context noise—and ties them to concrete token and tool behaviors. Live demos using the OpenAI Agents SDK show how trimming removes older turns and tool outputs, compaction drops older tool results, and summarization injects a structured “memory item” back into context. Cross-session memory then personalizes new sessions by injecting the prior summary into the system prompt, with guardrails to prevent stale or unsafe memory from dominating.
- Why does agent memory matter if models are getting better at reasoning and tool use?
- What are the four failure modes of unmanaged context, and how do they show up?
- How do reshape-and-fit techniques differ: trimming vs. compaction vs. summarization?
- What does “short-term” vs. “long-term” memory mean in this framework?
- How does cross-session memory get used without letting it become stale or unsafe?
- How should memory features be evaluated and tuned in practice?
Review Questions
- Which failure mode is most likely when a single turn suddenly includes thousands of tokens from tool outputs, and what mitigation technique was demonstrated for it?
- In what situations would trimming be preferred over summarization, and what trade-off does each approach make?
- How would you design an eval plan to test whether memory improves long-running agent performance beyond what standard short-context evals measure?
Key Points
1. Context engineering is the foundation for agent memory: what the model receives (prompt, state/history, tools, retrieved knowledge, memory) determines performance under a finite token budget.
2. Separate short-term memory (in-session context management like trimming/compaction/summarization) from long-term memory (cross-session continuity via injected or retrieved memories).
3. Unmanaged context leads to four predictable failure modes: context burst, context conflict, context poisoning, and context noise.
4. Reshape-and-fit techniques reduce token pressure: trimming removes older turns, compaction drops older tool calls/results, and summarization compresses prior dialogue into structured memory objects.
5. Prompt and tool hygiene reduce conflict and noise: keep system prompts lean, keep tool sets small with clear boundaries, and return high-signal tool outputs.
6. Cross-session memory should include guardrails: treat stored summaries as potentially stale/incomplete, avoid over-weighting them, and prevent secret storage or injection-style attacks.
7. Memory improvements should be validated with evals that compare memory on/off and include memory-specific tests for long-running, long-context scenarios.