Generative Agents - Deep Dive and GPT-4 Recreation
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Generative agents for “interactive simulacra” are built around a practical loop: each character continuously turns observations into memories, converts those memories into higher-level inferences via reflection, and then uses the results to plan time-stamped actions that stay coherent over hours. The payoff is visible in a sandbox world where dozens of NPCs can converse, form plans, and update behavior as new interactions occur—most notably when a single idea (a Valentine’s Day party) spreads through dialogue and ends up drawing other agents to attend.
The core contribution is an architecture that makes large language models behave like semi-autonomous agents rather than one-off chatbots. Every time step, each agent is prompted to produce both its current “thinking” and its next action, including any active conversation. Those outputs are written into a memory stream as timestamped observations. From there, the system retrieves only the most useful memories—because dumping everything into the model would exceed context limits and would drown the agent in irrelevant detail.
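That per-time-step loop can be pictured with a minimal Python sketch. The `call_llm` parameter, class names, and prompt wording here are illustrative assumptions, not the original system's code:

```python
# Minimal sketch of the per-time-step agent loop: observe, prompt, act,
# and write both the observation and the output into a timestamped memory stream.
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    timestamp: float  # when the memory was recorded
    text: str         # the observation or generated output

@dataclass
class Agent:
    name: str
    memory_stream: list = field(default_factory=list)

    def step(self, observation: str, call_llm) -> str:
        # Record the raw observation as a timestamped memory.
        self.memory_stream.append(Memory(time.time(), observation))
        # Prompt the model for the agent's current thinking and next action.
        prompt = (f"{self.name} observes: {observation}\n"
                  f"What is {self.name} thinking and doing next?")
        output = call_llm(prompt)
        # The agent's own output is also written back into the memory stream.
        self.memory_stream.append(Memory(time.time(), output))
        return output
```

In a full simulation this `step` would run for every agent at every tick, with retrieval (below) deciding which memories make it into the prompt.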
Memory retrieval is governed by a weighted scheme combining recency (newer experiences matter more), importance (some events are more “poignant” than mundane chores), and relevance (what fits the agent’s current interests). Importance can be computed with an LLM-based scoring prompt—for example, ordering cleaning supplies might rate low, while applying for a loan for restaurant expansion rates higher; personal stakes like a child’s school trouble land in between. Retrieved memories then feed into two downstream processes: planning and reflection.
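The weighted scheme can be sketched as a simple scoring function. The exponential decay rate, the 1-10 importance scale, and the keyword-overlap stand-in for relevance (a real system would use embedding similarity) are all assumptions for illustration:

```python
def recency_score(hours_since_access: float, decay: float = 0.995) -> float:
    # Exponential decay: more recently accessed memories score higher.
    return decay ** hours_since_access

def relevance_score(query_words: set, memory_words: set) -> float:
    # Crude stand-in for embedding similarity: fraction of query words
    # that appear in the memory text.
    if not query_words:
        return 0.0
    return len(query_words & memory_words) / len(query_words)

def retrieval_score(hours_since: float, importance: float,
                    query: str, memory_text: str,
                    w_rec: float = 1.0, w_imp: float = 1.0,
                    w_rel: float = 1.0) -> float:
    # Weighted combination of recency, importance, and relevance.
    rel = relevance_score(set(query.lower().split()),
                          set(memory_text.lower().split()))
    return (w_rec * recency_score(hours_since)
            + w_imp * importance / 10.0  # importance rated 1-10 by an LLM prompt
            + w_rel * rel)
```

Ranking all memories by this score and taking the top few keeps the prompt inside the model's context window while still surfacing what matters now.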
Reflection is the mechanism that turns raw events into abstractions. Instead of treating conversations and actions as isolated facts, reflection summarizes them into salient high-level questions and inferences—effectively chunking details into the kind of generalized understanding humans rely on. The transcript’s example frames this as moving from “surface-level” observations to higher-level takeaways that can guide future behavior.
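One way to picture the reflection step is as a prompt that asks the model to compress recent memories into a handful of supported insights. The wording below is a guessed illustration, not the system's actual prompt:

```python
def build_reflection_prompt(recent_memories: list[str]) -> str:
    # Number the raw memories so the model can cite which ones
    # support each high-level inference it draws.
    numbered = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(recent_memories))
    return (
        "Statements about the agent's recent experience:\n"
        f"{numbered}\n"
        "What 3 high-level insights can you infer from the statements above? "
        "Cite the statement numbers that support each insight."
    )
```

The model's answers would then be written back into the memory stream as reflection-type memories, so later retrieval can surface the abstraction instead of every underlying detail.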
Planning then produces structured future action sequences. Plans include location, start time, and duration (e.g., going to the park and painting for four hours). Crucially, plans are stored back into the memory stream, so later decisions can consider observations, reflections, and prior intentions together. When agents encounter new people or information, the loop updates: plans may change, and the next time step reflects those adjustments.
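A plan entry with those fields, plus a method that renders it back into memory-stream text, might look like this (the class and field names are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PlanEntry:
    location: str        # where the action happens
    start: datetime      # when it begins
    duration: timedelta  # how long it lasts
    description: str     # what the agent intends to do

    def as_memory(self) -> str:
        # Plans are written back into the memory stream as text, so later
        # retrieval can weigh intentions alongside observations and reflections.
        end = self.start + self.duration
        return (f"Plan: {self.description} at {self.location} "
                f"from {self.start:%H:%M} to {end:%H:%M}")
```

Because the plan lives in the same stream as everything else, a new interaction mid-afternoon can trigger re-planning, and the revised entry simply supersedes the old one at the next time step.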
Beyond the AI logic, the system depends on engineering: a game-world simulation built with Phaser, scene state stored in JSON, and a rendering pipeline that flattens relevant JSON fields into prompts for the language model. Dialogue is also integrated into the memory system—messages exchanged between agents become part of each participant’s memory stream, enabling ideas to propagate socially.
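The JSON-to-prompt flattening step can be sketched in a few lines; which fields get selected, and the "key: value" rendering, are assumptions about how such a pipeline might look:

```python
import json

def flatten_scene_to_prompt(scene_json: str, fields: list[str]) -> str:
    # Pull only the listed fields out of the scene state and render them
    # as plain "key: value" lines for the language-model prompt.
    scene = json.loads(scene_json)
    lines = [f"{key}: {scene[key]}" for key in fields if key in scene]
    return "\n".join(lines)
```

Keeping this step explicit makes it easy to control exactly what world state the model sees at each tick, independent of how Phaser stores it.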
The transcript also notes that this framework can be recreated with general-purpose LLMs (including experiments using GPT-4 and earlier references to ChatGPT), suggesting the approach is less about a proprietary model and more about the orchestration: time-stepped prompting, memory weighting, reflection-based summarization, and plan-driven action loops. The broader implication is clear—NPCs that behave consistently and interact naturally could improve game simulations and enable interactive role-play and scenario testing before real-world deployment.
Cornell Notes
Generative agents achieve believable, time-consistent NPC behavior by running a repeating loop at each time step: agents observe the world, store timestamped memories, retrieve the most relevant past experiences using a weighted scheme, and then generate actions. Reflection converts raw observations and conversations into higher-level inferences by summarizing salient, abstract takeaways—helping agents generalize beyond surface details. Planning then produces structured future action sequences (including location, start time, and duration) and stores those plans back into the memory stream so later decisions can incorporate intentions and updates. This architecture supports social propagation of ideas (like a party invitation spreading through dialogue) and allows agents to revise plans when new interactions occur.
- What is the time-step loop that turns observations into coherent NPC actions?
- How does the memory system avoid overwhelming the language model with too much context?
- What does “reflection” do, and why is it necessary?
- How are plans represented, and how do they keep behavior consistent over time?
- How does social interaction propagate ideas between agents?
- What non-AI engineering pieces are needed to run the simulation?
Review Questions
- How do recency, importance, and relevance interact to determine which memories an agent retrieves at a given time step?
- Why does reflection improve an agent’s ability to plan compared with using raw observations alone?
- What elements must a generated plan include to keep an agent’s behavior consistent across time?
Key Points
1. Generative agents run a repeating time-step loop that produces thinking, actions, and dialogue, then writes those outputs into a timestamped memory stream.
2. Memory retrieval is selective and weighted by recency, importance, and relevance to avoid context overload and irrelevant clutter.
3. Reflection converts raw events and conversations into higher-level inferences by summarizing salient abstract takeaways for planning.
4. Planning generates structured future action sequences with location, start time, and duration, and stores plans back into memory for continuity.
5. Dialogue between agents becomes part of each participant’s memory, enabling ideas to spread socially and influence future actions.
6. The simulation relies on engineering beyond prompting—Phaser for the world, JSON for state, and prompt-flattening pipelines to feed the right information to the language model.