Grab the Inside Scoop on How Google, Anthropic, and Manus Built Long-Running AI Agents
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Treat context as a runtime-compiled projection from durable state, not as an ever-growing transcript.
Briefing
Agentic context engineering hinges on one core shift: memory—not a bigger prompt window—is what makes long-running AI agents work reliably. Larger context windows and smarter models have not solved the underlying problem; they often worsen it by letting irrelevant history and tool noise drown out the signals that matter. As tasks stretch into multi-hour loops, performance can drop not because the LLM fails, but because memory construction is naive—stuffing too much into the context and treating “context” as a transcript rather than a computed, task-relevant view.
The blueprint presented centers on treating every LLM call as a fresh projection computed from durable state. Instead of dragging hundreds of turns forward, the system rebuilds a minimal slice: which instructions apply now, which artifacts matter now, and which memories should be surfaced now. That “compiled context” approach is framed as essential for multi-hour autonomy, because it prevents signal dilution and keeps attention focused on what’s relevant at each step.
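The projection idea can be made concrete with a small Python sketch. All names here (`DurableState`, `compile_context`) are hypothetical, and relevance scoring is reduced to naive keyword overlap where a real system would use embedding search or a learned ranker; the point is only the shape: each call rebuilds a minimal view from durable state rather than replaying a transcript.

```python
from dataclasses import dataclass

@dataclass
class DurableState:
    """Long-lived agent state; never sent to the model wholesale."""
    instructions: dict[str, str]  # instruction id -> text
    artifacts: dict[str, str]     # artifact handle -> one-line summary
    memories: list[str]           # durable insights from past runs

def compile_context(state: DurableState, task: str, k: int = 3) -> str:
    """Rebuild a minimal per-call view instead of dragging turns forward.

    Relevance = keyword overlap with the task (a stand-in for real retrieval).
    """
    words = set(task.lower().split())
    score = lambda text: len(words & set(text.lower().split()))
    memories = sorted(state.memories, key=score, reverse=True)[:k]
    parts = [
        "## Active instructions",
        *(t for t in state.instructions.values() if score(t) > 0),
        "## Relevant artifacts (by handle)",
        *(f"{h}: {s}" for h, s in state.artifacts.items() if score(s) > 0),
        "## Surfaced memories",
        *memories,
        "## Task",
        task,
    ]
    return "\n".join(parts)
```

Note that the durable state can grow without bound while the compiled view stays small: only the top-k relevant slices ever reach the model.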
A second pillar is a tiered memory model that separates storage from presentation. Working context is a minimal per-call view; sessions act like structured event logs across an agent’s trajectory; memory holds durable searchable insights extracted across runs; and artifacts represent large objects referenced by handles or tags rather than pasted into prompts. With this separation, the context window can stay small while the overall memory system grows arbitrarily large—mirroring traditional computer architecture with cache, RAM, and disk.
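The tier separation can be sketched as data structures. These class and field names are illustrative, not a real API; what matters is that each tier has a distinct role and that large objects live behind handles, never inline in the prompt.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SessionEvent:
    """One entry in the structured event log for a run."""
    kind: str      # e.g. "tool_call", "observation", "decision"
    payload: str
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class TieredMemory:
    # Per-call view, rebuilt each step -- the "cache".
    working: list[str] = field(default_factory=list)
    # Event log for this trajectory -- the "RAM".
    session: list[SessionEvent] = field(default_factory=list)
    # Searchable insights across runs -- the "disk".
    durable: dict[str, str] = field(default_factory=dict)
    # Large objects (repos, PDFs) stored whole, referenced by handle.
    artifacts: dict[str, bytes] = field(default_factory=dict)

    def store_artifact(self, handle: str, blob: bytes) -> str:
        """Keep the object out of the prompt; only the handle circulates."""
        self.artifacts[handle] = blob
        return handle
```

The cache/RAM/disk analogy falls straight out of the types: `working` is tiny and transient, `session` is bounded per run, and `durable`/`artifacts` can grow arbitrarily without touching the context window.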
From there, the guidance becomes operational. Default context should contain nearly nothing; retrieval should be an active decision made when the agent needs information. Long-term memory should be searchable rather than pinned permanently, because relevance-ranked retrieval helps the agent distinguish critical constraints from recent noise. Summarization must be schema-driven and ideally reversible, so multi-step reasoning doesn’t collapse into vague “glossy soup” that erases decision structure and edge-case semantics.
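Schema-driven, reversible summarization can be illustrated with a short sketch (function names and the three-field schema are assumptions, not from the source). Fixed fields preserve decision structure instead of collapsing it into prose, and a pointer back to each source event makes the compression reversible for debugging.

```python
SCHEMA = ("decision", "constraint", "open_question")

def summarize_events(events: list[dict]) -> dict:
    """Compress an event log into fixed schema fields, keeping a pointer
    back to each source event so the summary can be expanded again."""
    summary = {kind: [] for kind in SCHEMA}
    for i, ev in enumerate(events):
        if ev.get("kind") in SCHEMA:
            summary[ev["kind"]].append({"text": ev["text"], "source_event": i})
    return summary

def expand(summary: dict, events: list[dict], kind: str) -> list[dict]:
    """Reverse direction: recover the full events behind one schema field."""
    return [events[item["source_event"]] for item in summary[kind]]
```

Contrast this with freeform summarization: a paragraph of "glossy soup" cannot tell you which constraint was dropped or which decision a later step depended on, whereas a schema field plus source pointers can.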
The practical fixes extend beyond memory into system design. Heavy state should be offloaded to tools, file systems, or sandboxes, with pointers passed back instead of raw tool outputs—reducing cognitive burden and keeping context lean. Multi-agent setups should isolate scope using sub-agents (planner, executor, verifier) that communicate through structured artifacts rather than sprawling transcripts, avoiding cross-talk and reasoning drift. Prompt layout should follow caching and prefix-stability discipline: the stable identity and instruction block rarely changes, while only the variable suffix (user input and fresh tool outputs) updates, cutting latency and cost.
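The prefix-stability point can be sketched as a prompt builder (the prefix text and `handle://` pointer scheme are invented for illustration). The stable block is byte-identical across calls, so an inference-side prefix cache can reuse it, and tool results appear only as pointers, never as raw dumps.

```python
# Stable across every call: byte-identical, so a prefix cache can reuse it.
STABLE_PREFIX = (
    "You are a repo-audit agent.\n"
    "Rules: cite file paths; never paste large files, reference them by handle.\n"
)

def build_prompt(user_input: str, tool_pointers: list[str]) -> str:
    """Only the suffix varies per step; heavy state stays behind pointers."""
    suffix = "\n".join([
        "## Fresh tool results (by pointer)",
        *tool_pointers,
        "## User input",
        user_input,
    ])
    return STABLE_PREFIX + suffix
```

Because only the suffix changes between steps, token growth per call is bounded by the new input and pointer lines, not by the accumulated history.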
Finally, the payoff is framed as concrete capabilities unlocked by correct memory architecture: long-horizon autonomy for web browsing and repo audits; self-improving behavior via strategy and instruction updates stored in memory (without weight training); scalable cross-session personalization; multi-agent orchestration without context poisoning; deep reasoning over large corpora by treating repos and PDFs as artifacts; auditable enterprise systems with reconstructible session logs and memory updates; cost-stable operations through sublinear token growth; and domain-specific “agent OS” environments where finance, coding, and medical agents can maintain durable workspaces and risk state.
Common failure modes are also spelled out: dumping everything into prompts, blind summarization, assuming long context windows are unlimited RAM, using prompts as an observability sink, tool bloat, anthropomorphizing agent roles, static prompt configurations that never evolve, over-boxed harnesses, and ignoring caching/prefix discipline. The overall message is tradecraft: there’s no magic bullet—production-grade agents require engineering memory and context as a first-class runtime environment.
Cornell Notes
Long-running AI agents succeed when “context” is treated as a compiled, per-call view computed from durable state—not as a growing transcript. Memory must be engineered as a tiered system: working context for each call, sessions as structured event logs, searchable durable memory extracted across runs, and artifacts referenced by handles rather than pasted. Retrieval should beat pinning: default context stays nearly empty, and the agent fetches only what’s relevant, using schema-driven, ideally reversible summarization to preserve decision structure. Offloading heavy state to tools/sandboxes, isolating scope with sub-agents, and using caching/prefix stability further prevent context rot, latency spikes, and cost blowups. Done right, this unlocks multi-hour autonomy, scalable personalization, auditable enterprise compliance, and cost-stable agent operations.
Why do bigger context windows and smarter models still fail on long-horizon tasks?
What does “context as a compiled view” mean in practice?
How should tiered memory separate storage from presentation?
What’s the difference between retrieval and pinning for long-term memory?
Why must summarization be schema-driven and ideally reversible?
Which system design choices keep context lean beyond memory alone?
Review Questions
- What specific mechanism prevents irrelevant history from overwhelming the agent during multi-hour loops?
- How do working context, sessions, memory, and artifacts differ, and why does that separation matter for context window size?
- What failure modes arise when summarization is done without schema structure or when long-term memory is pinned instead of retrieved?
Key Points
1. Treat context as a runtime-compiled projection from durable state, not as an ever-growing transcript.
2. Build tiered memory with distinct roles: working context, sessions, searchable memory, and artifact handles.
3. Keep default context nearly empty and make retrieval an explicit, relevance-ranked decision.
4. Prefer retrieval over pinning for long-term memory to avoid attention overload and recency bias.
5. Use schema-driven, ideally reversible summarization to preserve decision structure and enable debugging.
6. Offload heavy state to tools/sandboxes and pass pointers rather than raw tool outputs to keep prompts lean.
7. Stabilize prompt prefixes for caching and isolate agent scope with sub-agents to prevent drift and latency spikes.