Generative Agents - Deep Dive and GPT-4 Recreation
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Generative agents for “interactive simulacra” are built around a practical loop: each character continuously turns observations into memories, converts those memories into higher-level inferences via reflection, and then uses the results to plan time-stamped actions that stay coherent over hours. The payoff is visible in a sandbox world where dozens of NPCs can converse, form plans, and update behavior as new interactions occur—most notably when a single idea (a Valentine’s Day party) spreads through dialogue and ends up drawing other agents to attend.
The core contribution is an architecture that makes large language models behave like semi-autonomous agents rather than one-off chatbots. Every time step, each agent is prompted to produce both its current “thinking” and its next action, including any active conversation. Those outputs are written into a memory stream as timestamped observations. From there, the system retrieves only the most useful memories—because dumping everything into the model would exceed context limits and would drown the agent in irrelevant detail.
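That per-time-step loop can be pictured with a minimal Python sketch. The `call_llm` parameter, class names, and prompt wording here are illustrative assumptions, not the original system's code:

```python
# Minimal sketch of the per-time-step agent loop: observe, prompt, act,
# and write both the observation and the output into a timestamped memory stream.
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    timestamp: float  # when the memory was recorded
    text: str         # the observation or generated output

@dataclass
class Agent:
    name: str
    memory_stream: list = field(default_factory=list)

    def step(self, observation: str, call_llm) -> str:
        # Record the raw observation as a timestamped memory.
        self.memory_stream.append(Memory(time.time(), observation))
        # Prompt the model for the agent's current thinking and next action.
        prompt = (f"{self.name} observes: {observation}\n"
                  f"What is {self.name} thinking and doing next?")
        output = call_llm(prompt)
        # The agent's own output is also written back into the memory stream.
        self.memory_stream.append(Memory(time.time(), output))
        return output
```

In a full simulation this `step` would run for every agent at every tick, with retrieval (below) deciding which memories make it into the prompt.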
Memory retrieval is governed by a weighted scheme combining recency (newer experiences matter more), importance (some events are more “poignant” than mundane chores), and relevance (what fits the agent’s current interests). Importance can be computed with an LLM-based scoring prompt—for example, ordering cleaning supplies might rate low, while applying for a loan for restaurant expansion rates higher; personal stakes like a child’s school trouble land in between. Retrieved memories then feed into two downstream processes: planning and reflection.
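The weighted scheme can be sketched as a simple scoring function. The exponential decay rate, the 1-10 importance scale, and the keyword-overlap stand-in for relevance (a real system would use embedding similarity) are all assumptions for illustration:

```python
def recency_score(hours_since_access: float, decay: float = 0.995) -> float:
    # Exponential decay: more recently accessed memories score higher.
    return decay ** hours_since_access

def relevance_score(query_words: set, memory_words: set) -> float:
    # Crude stand-in for embedding similarity: fraction of query words
    # that appear in the memory text.
    if not query_words:
        return 0.0
    return len(query_words & memory_words) / len(query_words)

def retrieval_score(hours_since: float, importance: float,
                    query: str, memory_text: str,
                    w_rec: float = 1.0, w_imp: float = 1.0,
                    w_rel: float = 1.0) -> float:
    # Weighted combination of recency, importance, and relevance.
    rel = relevance_score(set(query.lower().split()),
                          set(memory_text.lower().split()))
    return (w_rec * recency_score(hours_since)
            + w_imp * importance / 10.0  # importance rated 1-10 by an LLM prompt
            + w_rel * rel)
```

Ranking all memories by this score and taking the top few keeps the prompt inside the model's context window while still surfacing what matters now.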
Reflection is the mechanism that turns raw events into abstractions. Instead of treating conversations and actions as isolated facts, reflection summarizes them into salient high-level questions and inferences—effectively chunking details into the kind of generalized understanding humans rely on. The transcript’s example frames this as moving from “surface-level” observations to higher-level takeaways that can guide future behavior.
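One way to picture the reflection step is as a prompt that asks the model to compress recent memories into a handful of supported insights. The wording below is a guessed illustration, not the system's actual prompt:

```python
def build_reflection_prompt(recent_memories: list[str]) -> str:
    # Number the raw memories so the model can cite which ones
    # support each high-level inference it draws.
    numbered = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(recent_memories))
    return (
        "Statements about the agent's recent experience:\n"
        f"{numbered}\n"
        "What 3 high-level insights can you infer from the statements above? "
        "Cite the statement numbers that support each insight."
    )
```

The model's answers would then be written back into the memory stream as reflection-type memories, so later retrieval can surface the abstraction instead of every underlying detail.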
Planning then produces structured future action sequences. Plans include location, start time, and duration (e.g., going to the park and painting for four hours). Crucially, plans are stored back into the memory stream, so later decisions can consider observations, reflections, and prior intentions together. When agents encounter new people or information, the loop updates: plans may change, and the next time step reflects those adjustments.
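A plan entry with those fields, plus a method that renders it back into memory-stream text, might look like this (the class and field names are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PlanEntry:
    location: str        # where the action happens
    start: datetime      # when it begins
    duration: timedelta  # how long it lasts
    description: str     # what the agent intends to do

    def as_memory(self) -> str:
        # Plans are written back into the memory stream as text, so later
        # retrieval can weigh intentions alongside observations and reflections.
        end = self.start + self.duration
        return (f"Plan: {self.description} at {self.location} "
                f"from {self.start:%H:%M} to {end:%H:%M}")
```

Because the plan lives in the same stream as everything else, a new interaction mid-afternoon can trigger re-planning, and the revised entry simply supersedes the old one at the next time step.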
Beyond the AI logic, the system depends on engineering: a game-world simulation built with Phaser, scene state stored in JSON, and a rendering pipeline that flattens relevant JSON fields into prompts for the language model. Dialogue is also integrated into the memory system—messages exchanged between agents become part of each participant’s memory stream, enabling ideas to propagate socially.
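The JSON-to-prompt flattening step can be sketched in a few lines; which fields get selected, and the "key: value" rendering, are assumptions about how such a pipeline might look:

```python
import json

def flatten_scene_to_prompt(scene_json: str, fields: list[str]) -> str:
    # Pull only the listed fields out of the scene state and render them
    # as plain "key: value" lines for the language-model prompt.
    scene = json.loads(scene_json)
    lines = [f"{key}: {scene[key]}" for key in fields if key in scene]
    return "\n".join(lines)
```

Keeping this step explicit makes it easy to control exactly what world state the model sees at each tick, independent of how Phaser stores it.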
The transcript also notes that this framework can be recreated with general-purpose LLMs (including experiments using GPT-4 and earlier references to ChatGPT), suggesting the approach is less about a proprietary model and more about the orchestration: time-stepped prompting, memory weighting, reflection-based summarization, and plan-driven action loops. The broader implication is clear—NPCs that behave consistently and interact naturally could improve game simulations and enable interactive role-play and scenario testing before real-world deployment.
Cornell Notes
Generative agents achieve believable, time-consistent NPC behavior by running a repeating loop at each time step: agents observe the world, store timestamped memories, retrieve the most relevant past experiences using a weighted scheme, and then generate actions. Reflection converts raw observations and conversations into higher-level inferences by summarizing salient, abstract takeaways—helping agents generalize beyond surface details. Planning then produces structured future action sequences (including location, start time, and duration) and stores those plans back into the memory stream so later decisions can incorporate intentions and updates. This architecture supports social propagation of ideas (like a party invitation spreading through dialogue) and allows agents to revise plans when new interactions occur.
- What is the time-step loop that turns observations into coherent NPC actions?
- How does the memory system avoid overwhelming the language model with too much context?
- What does “reflection” do, and why is it necessary?
- How are plans represented, and how do they keep behavior consistent over time?
- How does social interaction propagate ideas between agents?
- What non-AI engineering pieces are needed to run the simulation?
Review Questions
- How do recency, importance, and relevance interact to determine which memories an agent retrieves at a given time step?
- Why does reflection improve an agent’s ability to plan compared with using raw observations alone?
- What elements must a generated plan include to keep an agent’s behavior consistent across time?
Key Points
1. Generative agents run a repeating time-step loop that produces thinking, actions, and dialogue, then writes those outputs into a timestamped memory stream.
2. Memory retrieval is selective and weighted by recency, importance, and relevance to avoid context overload and irrelevant clutter.
3. Reflection converts raw events and conversations into higher-level inferences by summarizing salient abstract takeaways for planning.
4. Planning generates structured future action sequences with location, start time, and duration, and stores plans back into memory for continuity.
5. Dialogue between agents becomes part of each participant’s memory, enabling ideas to spread socially and influence future actions.
6. The simulation relies on engineering beyond prompting—Phaser for the world, JSON for state, and prompt-flattening pipelines to feed the right information to the language model.