Build Hour: Agent Memory Patterns
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Context engineering is the foundation for agent memory: what the model receives (prompt, state/history, tools, retrieved knowledge, memory) determines performance under a finite token budget.
Briefing
Agent memory patterns boil down to one practical problem: long-running AI agents have finite context windows, so every extra instruction, chat turn, tool output, and retrieved fact competes for tokens. Without deliberate context engineering, agents forget earlier details, get pulled into contradictory instructions, or drown in noisy tool payloads—leading to re-asking, degraded reliability, and “token spikes” that can derail tool-heavy workflows. With memory and context management, agents can stay stateful across turns and even across sessions, making troubleshooting and multi-step support feel consistent rather than brittle.
The session frames context engineering as both art and science. Judgment determines what matters at each step of reasoning or action, while repeatable methods and measurable impacts make context management systematic. A key takeaway is that modern LLM performance depends less on raw model capability than on what the model is actually given: prompt structure, state and history handling, memory storage/retrieval, and retrieval/citation pipelines all shape what the model “sees.” The talk presents three core strategies for agent memory. First, reshape and fit: trim, compact, and summarize to fit the context window. Second, isolate and route: send the right context and tools to the right sub-agent to reduce conflict. Third, extract and retrieve: pull high-quality memories at the right time.
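To make "isolate and route" concrete, here is a minimal, framework-agnostic TypeScript sketch (not the session's code): each sub-agent carries a lean prompt, a small tool set, and a filter that scopes the history it sees. The SubAgent shape, the tool names, and the keyword router are all illustrative; in the Agents SDK, handoffs would play the router's role.

```ts
// Sketch of "isolate and route": each sub-agent gets only the context
// slice and tools relevant to its job, reducing context conflict and noise.
// All names here (SubAgent, routeTurn, the keyword rules) are illustrative,
// not part of the OpenAI Agents SDK.

type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

interface SubAgent {
  name: string;
  systemPrompt: string; // lean, task-specific instructions
  tools: string[];      // small tool set with clear boundaries
  contextFilter: (history: Message[]) => Message[]; // only what this agent needs
}

const billingAgent: SubAgent = {
  name: "billing",
  systemPrompt: "You handle refunds and order questions only.",
  tools: ["get_orders", "get_policy"],
  // Keep dialogue, but drop tool outputs unrelated to orders.
  contextFilter: (h) => h.filter((m) => m.role !== "tool" || m.content.includes("order")),
};

const techAgent: SubAgent = {
  name: "troubleshooting",
  systemPrompt: "You diagnose device and connectivity issues only.",
  tools: ["run_diagnostics"],
  contextFilter: (h) => h.slice(-10), // recent turns suffice for diagnosis
};

// Naive keyword router; in practice a classifier or the SDK's handoffs
// would make this decision.
function routeTurn(userMessage: string): SubAgent {
  return /refund|order|charge/i.test(userMessage) ? billingAgent : techAgent;
}
```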
A major distinction separates short-term memory (in-session techniques) from long-term memory (cross-session continuity). Short-term memory maximizes the active context window during a conversation, while long-term memory preserves continuity across sessions by storing and later injecting summarized or extracted facts. The session illustrates the difference with a troubleshooting scenario: without memory, an agent forgets the original issues after many turns and re-asks; with memory, it keeps the unresolved thread and references earlier actions like firmware updates and background sync.
The failure modes are grouped into four categories. Context burst happens when a turn suddenly injects large tool payloads, causing token spikes. Context conflict arises when contradictory instructions or tool results collide—such as a refund policy rule that conflicts with a VIP eligibility claim. Context poisoning occurs when incorrect information enters context (including via summaries or stored memory objects) and then propagates across turns. Context noise is the overload of redundant or overly similar tool definitions and outputs that make the context harder to use.
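Context burst is also the most mechanical failure mode to detect. A minimal sketch, assuming a rough four-characters-per-token estimate (a real system would use an actual tokenizer such as tiktoken): compare the context's token count before and after a turn and flag any spike past a threshold.

```ts
// Sketch of a "context burst" detector: flag any turn whose token delta
// spikes past a budget, typically because a tool returned a huge payload.
// The chars/4 estimate and the default threshold are rough assumptions.

type Message = { role: string; content: string };

const approxTokens = (text: string) => Math.ceil(text.length / 4);

function contextTokens(history: Message[]): number {
  return history.reduce((sum, m) => sum + approxTokens(m.content), 0);
}

function detectBurst(before: Message[], after: Message[], spikeThreshold = 2000): boolean {
  const delta = contextTokens(after) - contextTokens(before);
  return delta > spikeThreshold; // one turn added an unusually large payload
}
```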
To make these ideas concrete, a dual-agent Next.js demo (built with the OpenAI Agents SDK) runs an “IT troubleshooting” assistant with tools like “get orders” and “get policy.” The demo visualizes context token growth across turns and shows context burst when large refund-policy content is injected. It then demonstrates reshape-and-fit controls: context trimming removes older turns and tool outputs once a trigger is hit; context compaction drops older tool calls/results while preserving message structure; and context summarization compresses prior dialogue into a structured “memory item” that is re-injected later. The summarization prompt is crafted to enforce factual, structured summaries with contradiction checks, temporal ordering, and hallucination control.
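The three controls can be sketched over a plain message array. This is not the demo's implementation: the message shape, budgets, and model name are assumptions, and only the summarization step calls a real API (the OpenAI Node SDK's chat.completions.create).

```ts
// Hedged sketch of the three reshape-and-fit controls described above.

import OpenAI from "openai";

type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Rough token estimate; a real system would use a tokenizer such as tiktoken.
const approxTokens = (m: Msg) => Math.ceil(m.content.length / 4);
const totalTokens = (h: Msg[]) => h.reduce((n, m) => n + approxTokens(m), 0);

// 1) Trimming: once past the budget, drop the oldest non-system messages
//    (turns and their tool outputs alike) until the history fits again.
function trim(history: Msg[], budget = 8000): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  while (totalTokens([...system, ...rest]) > budget && rest.length > 1) rest.shift();
  return [...system, ...rest];
}

// 2) Compaction: keep the message structure, but blank out older tool
//    results, usually the bulkiest and least reusable content.
function compact(history: Msg[], keepRecent = 4): Msg[] {
  const cutoff = history.length - keepRecent;
  return history.map((m, i) =>
    m.role === "tool" && i < cutoff ? { ...m, content: "[tool output compacted]" } : m
  );
}

// 3) Summarization: compress older dialogue into one structured "memory
//    item" to re-inject. The prompt mirrors the constraints described in
//    the session: factual, temporally ordered, contradiction-checked.
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function summarize(older: Msg[]): Promise<Msg> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // model choice is an assumption
    messages: [
      {
        role: "system",
        content:
          "Summarize the conversation into a structured memory item. " +
          "Include only stated facts, preserve temporal order, flag " +
          "contradictions, and do not invent details.",
      },
      { role: "user", content: older.map((m) => `${m.role}: ${m.content}`).join("\n") },
    ],
  });
  return { role: "system", content: `Memory item:\n${res.choices[0].message.content ?? ""}` };
}
```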
For cross-session memory, the session shows a “memory injection” feature that places the generated summary into the system prompt for a new session, enabling personalization—such as continuing a MacBook internet troubleshooting thread after an OS update. Guardrails for memory are emphasized: treat memory as potentially stale or incomplete, avoid over-weighting it, and don’t store secrets or accept injection-like attacks.
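A minimal sketch of that memory-injection step, assuming a simple store (the in-memory Map stands in for whatever persistence layer is actually used); note that the guardrail language about staleness is written into the prompt itself.

```ts
// Sketch of cross-session "memory injection": a stored summary from a
// previous session is placed into the new session's system prompt, framed
// so the model treats it as possibly stale or incomplete.

type Memory = { userId: string; summary: string; savedAt: string };

// In-memory stand-in for a real persistence layer (DB, KV store, etc.).
const memoryStore = new Map<string, Memory>();

async function loadMemory(userId: string): Promise<Memory | null> {
  return memoryStore.get(userId) ?? null;
}

async function buildSystemPrompt(userId: string, basePrompt: string): Promise<string> {
  const memory = await loadMemory(userId);
  if (!memory) return basePrompt;
  return [
    basePrompt,
    `Known context from a prior session (saved ${memory.savedAt}).`,
    "Treat it as possibly stale or incomplete; re-verify before acting on it:",
    memory.summary,
  ].join("\n\n");
}
```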
Finally, the Q&A extends the engineering view: evaluate memory by running standard evals with memory on/off, then build memory-specific evals for long-running tasks and long-context behavior; manage memory scope (global vs session) and prune stale memories using temporal tags or decay/consolidation; and scale memory systems by optimizing retrieval (vector DB storage, filtering, ranking, embeddings) and storage/persistence for large memory pools. The overall message is to balance what to remember, how to compress or route it, and when to retrieve it—then validate with targeted evals.
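That on/off comparison can be framed as a small harness. In the sketch below, runAgent and scoreTranscript are caller-supplied placeholders for the agent loop and the grader (for example, an LLM judge); only the comparison logic itself is shown.

```ts
// Sketch of an A/B eval: run the same task set with memory injection on
// and off, then compare mean scores per condition.

interface EvalCase { userId: string; task: string; expected: string }

type RunAgent = (task: string, opts: { memory: boolean; userId: string }) => Promise<string>;
type ScoreTranscript = (output: string, expected: string) => Promise<number>; // 0..1

async function compareMemory(
  cases: EvalCase[],
  runAgent: RunAgent,
  scoreTranscript: ScoreTranscript
) {
  const totals = { on: 0, off: 0 };
  for (const c of cases) {
    for (const memory of [true, false] as const) {
      const output = await runAgent(c.task, { memory, userId: c.userId });
      totals[memory ? "on" : "off"] += await scoreTranscript(output, c.expected);
    }
  }
  // Memory should win on long-running, long-context tasks in particular.
  return { on: totals.on / cases.length, off: totals.off / cases.length };
}
```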
Cornell Notes
Agent memory patterns are presented as a response to a hard constraint: finite context windows. The session separates short-term memory (in-session trimming, compaction, summarization) from long-term memory (cross-session continuity via extracted or summarized memories). It identifies four common failure modes—context burst, context conflict, context poisoning, and context noise—and ties them to concrete token and tool behaviors. Live demos using the OpenAI Agents SDK show how trimming removes older turns and tool outputs, compaction drops older tool results, and summarization injects a structured “memory item” back into context. Cross-session memory then personalizes new sessions by injecting the prior summary into the system prompt, with guardrails to prevent stale or unsafe memory from dominating.
- Why does agent memory matter if models are getting better at reasoning and tool use?
- What are the four failure modes of unmanaged context, and how do they show up?
- How do reshape-and-fit techniques differ: trimming vs. compaction vs. summarization?
- What does “short-term” vs. “long-term” memory mean in this framework?
- How does cross-session memory get used without letting it become stale or unsafe?
- How should memory features be evaluated and tuned in practice?
Review Questions
- Which failure mode is most likely when a single turn suddenly includes thousands of tokens from tool outputs, and what mitigation technique was demonstrated for it?
- In what situations would trimming be preferred over summarization, and what trade-off does each approach make?
- How would you design an eval plan to test whether memory improves long-running agent performance beyond what standard short-context evals measure?
Key Points
1. Context engineering is the foundation for agent memory: what the model receives (prompt, state/history, tools, retrieved knowledge, memory) determines performance under a finite token budget.
2. Separate short-term memory (in-session context management like trimming/compaction/summarization) from long-term memory (cross-session continuity via injected or retrieved memories).
3. Unmanaged context leads to four predictable failure modes: context burst, context conflict, context poisoning, and context noise.
4. Reshape-and-fit techniques reduce token pressure: trimming removes older turns, compaction drops older tool calls/results, and summarization compresses prior dialogue into structured memory objects.
5. Prompt and tool hygiene reduce conflict and noise: keep system prompts lean, keep tool sets small with clear boundaries, and return high-signal tool outputs.
6. Cross-session memory should include guardrails: treat stored summaries as potentially stale/incomplete, avoid over-weighting them, and prevent secret storage or injection-style attacks.
7. Memory improvements should be validated with evals that compare memory on/off and include memory-specific tests for long-running, long-context scenarios.