MemGPT - Unlimited Context Window (Memory) for LLMs | Paper review, Installation & Demo

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

MemGPT replaces brute-force context expansion with virtual context management: fast working context plus slow external memory.

Briefing

MemGPT targets a core bottleneck in today’s large language models: limited context windows that force earlier parts of a conversation or large documents to be dropped or ignored. Instead of simply pushing token limits higher, it borrows an operating-systems idea—virtual memory—and splits “context” into fast memory (the model’s normal prompt window) and slow memory (an external store acting like a hard drive). The result is a system that can keep working over long chats or very large texts by selectively retrieving and compressing what matters, rather than trying to stuff everything into the transformer’s attention window.
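
The analogy can be made concrete with a short sketch (all names here are illustrative, not MemGPT’s actual API): a small “fast” map stands in for the prompt window, an unbounded store plays the role of the hard drive, and facts are paged in on demand, evicting older ones when the budget is full.

```python
# Illustrative sketch of OS-style paging for LLM context (not MemGPT's real API).
from collections import OrderedDict

class VirtualContext:
    def __init__(self, budget_items=3):
        self.fast = OrderedDict()   # what the model sees directly (limited "RAM")
        self.slow = {}              # external store ("disk"), effectively unbounded
        self.budget = budget_items

    def remember(self, key, fact):
        self.slow[key] = fact       # everything lands in slow memory first

    def page_in(self, key):
        """Bring a fact into the prompt window, evicting the oldest if full."""
        if key not in self.fast:
            if len(self.fast) >= self.budget:
                self.fast.popitem(last=False)   # evict the oldest paged-in fact
            self.fast[key] = self.slow[key]
        return self.fast[key]

ctx = VirtualContext()
for k, v in [("name", "Venelin"), ("city", "Sofia"), ("job", "ML engineer"), ("pet", "cat")]:
    ctx.remember(k, v)
print(ctx.page_in("name"))   # pulled from slow storage into fast memory on demand
```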

The approach matters because transformer attention scales poorly with context length: increasing tokens drives a quadratic jump in compute and memory. MemGPT’s design also responds to a second practical issue found in research: even when long prompts fit, models often don’t attend uniformly—middle sections can get less focus than the beginning and end. MemGPT’s “virtual context management” aims to make the model’s effective working set more relevant by routing information between two layers.
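
A quick back-of-the-envelope illustration of that quadratic growth (numbers are purely illustrative): self-attention scores every token against every other token, so doubling the context roughly quadruples that part of the work.

```python
# Rough illustration of quadratic attention cost: the n x n score matrix.
for n in [4_000, 8_000, 16_000, 32_000]:
    pairs = n * n   # one attention score per token pair (per head, per layer)
    print(f"context {n:>6} tokens -> {pairs:>15,} pairwise scores")
# Doubling the context length quadruples the number of pairwise scores.
```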

At the system level, MemGPT uses an agent loop that can call functions to manage memory. A typical flow includes reading from memory, writing new facts, and sending the user-facing message back to the chat. Internally, it maintains a working context that the model can see directly, plus an external memory layer split into a recall queue (a structured history of recent events) and an archive storage (often a vector database or a regular database, plus the ability to store items like text files or JSON). When the conversation grows, MemGPT summarizes or compresses older content and injects that condensed representation back into the working context, allowing the agent to continue without losing the thread.
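
A hedged sketch of that agent loop, assuming a toy JSON function-calling format; the function names here (core_memory_write, archival_insert, archival_search, send_message) are stand-ins rather than MemGPT’s exact schema, and the archive is a plain list instead of a real database.

```python
# Conceptual agent loop with memory functions (names are illustrative, not MemGPT's exact API).
import json

working_context = {}     # facts the model sees directly
archival_storage = []    # long-term store (stand-in for a vector or SQL database)
recall_queue = []        # structured history of recent events

def core_memory_write(key, value):
    working_context[key] = value

def archival_insert(text):
    archival_storage.append(text)

def archival_search(query):
    # Naive keyword match standing in for embedding search.
    return [t for t in archival_storage if query.lower() in t.lower()]

def send_message(text):
    print(f"ASSISTANT: {text}")

FUNCTIONS = {
    "core_memory_write": core_memory_write,
    "archival_insert": archival_insert,
    "archival_search": archival_search,
    "send_message": send_message,
}

def agent_step(model_output: str):
    """Dispatch one (simulated) model output and log it in the recall queue."""
    call = json.loads(model_output)
    recall_queue.append(call)
    return FUNCTIONS[call["function"]](**call["arguments"])

# Simulated model outputs for one turn:
agent_step('{"function": "core_memory_write", "arguments": {"key": "name", "value": "Venelin"}}')
agent_step('{"function": "archival_insert", "arguments": {"text": "User enjoys motorcycles."}}')
agent_step('{"function": "send_message", "arguments": {"text": "Nice to meet you, Venelin!"}}')
```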

The transcript also highlights a subtle but important constraint: system prompts (“pre-prompts”) consume tokens too. Even if a model advertises a 4,000-token window, a large system instruction can shrink the usable space for actual conversation. MemGPT addresses this by separating tokens across system instructions, conversational context, and working context, using a FIFO-style queue so the model keeps the latest messages while older material is compressed into memory.
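
A minimal sketch of that budget split, assuming a crude word-count tokenizer and a placeholder summarizer: the system prompt is charged against the budget first, the newest messages are kept verbatim, and everything older is collapsed into a summary line.

```python
# Sketch of a FIFO message queue under a fixed token budget (illustrative only).
def count_tokens(text):
    return len(text.split())          # crude stand-in for a real tokenizer

def summarize(messages):
    return "Summary of earlier turns: " + "; ".join(m[:30] for m in messages)

def build_prompt(system_prompt, history, budget=60):
    used = count_tokens(system_prompt)   # the pre-prompt eats into the budget first
    kept, evicted = [], []
    for msg in reversed(history):        # keep the most recent messages verbatim
        if used + count_tokens(msg) <= budget:
            kept.insert(0, msg)
            used += count_tokens(msg)
        else:
            evicted.insert(0, msg)       # older material gets compressed instead
    summary = summarize(evicted) if evicted else ""
    return "\n".join(filter(None, [system_prompt, summary, *kept]))

history = [f"turn {i}: " + "lorem ipsum " * 5 for i in range(10)]
print(build_prompt("You are a helpful assistant with persistent memory.", history))
```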

In evaluations described during the walkthrough, MemGPT agents reportedly recall user information more reliably than baseline GPT-4 and GPT-3.5 setups, particularly in long-running interactions where memory retention is the challenge.

The practical portion shows installation steps: create a Python virtual environment (Python 3.11.6), install the MemGPT library, set an OpenAI API key via environment variables, then run the MemGPT app with a chosen model (the demo uses GPT-4). The demo illustrates core memory writes (e.g., storing the user’s name and preferences), checkpointing and reloading memory, and a higher-level feature: persona and “humans” profiles. By defining persona files and user (“human”) profiles (the demo includes a stoic-warrior character), the agent changes tone and behavior while still persisting relevant facts. The walkthrough ends with caveats: the setup works best with GPT-4, function calling is less reliable with GPT-3.5, and API costs can add up. Overall, MemGPT’s promise is clear: long-context behavior without brute-force token expansion, achieved through external memory, summarization, and retrieval-driven context management.
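
Below is a small sketch of the persistence idea shown in the demo, using a plain JSON file as the checkpoint; the function name core_memory_append mirrors the demo’s wording, but the signature, storage layout, and helpers here are illustrative, not MemGPT’s actual implementation.

```python
# Sketch of core-memory persistence across runs via a JSON checkpoint (illustrative only).
import json
from pathlib import Path

CHECKPOINT = Path("core_memory.json")

def load_core_memory():
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"human": [], "persona": []}

def core_memory_append(memory, section, fact):
    memory[section].append(fact)
    CHECKPOINT.write_text(json.dumps(memory, indent=2))   # checkpoint after every write

# First run: the user corrects their name, and it is persisted.
mem = load_core_memory()
core_memory_append(mem, "human", "First name: Venelin")
core_memory_append(mem, "human", "Prefers concise answers")

# A later run reloads the same facts instead of starting from scratch.
print(load_core_memory()["human"])
```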

Cornell Notes

MemGPT tackles limited context windows by treating an LLM like a system with virtual memory: a small “fast” working context plus a “slow” external memory store. Instead of relying on ever-longer prompts (which raise compute and memory costs due to attention’s scaling), it retrieves relevant history from an external database and compresses older conversation into summaries injected back into the working context. The design includes a recall queue for recent events and an archive storage (vector DB or regular DB) for longer-term facts, plus function-calling-style operations like read/write memory. Demos show core memory persistence (e.g., storing a user’s name and preferences) and persona-driven behavior using configurable “humans” and persona files. Reported experiments indicate better user-information recall than plain GPT-4 or GPT-3.5 baselines in long chats.

What problem does MemGPT try to solve, and why isn’t “just increase the context window” enough?

MemGPT targets the limited token budget of common LLMs (e.g., ChatGPT/GPT-4) and the practical failure modes that come with long prompts. Transformer attention incurs a quadratic increase in compute and memory as context grows, making large windows expensive. Research also suggests models may not use content in the middle of long prompts effectively, tending to focus more on the beginning and end. MemGPT instead keeps the model’s working context small while using external memory plus retrieval and compression to preserve long-range information.

How does MemGPT implement “virtual context management”?

It splits memory into two parts: fast memory (the model’s normal context window) and slow memory (an external database acting like a hard drive). The agent loop can call functions such as read memory and write memory, then decide what to bring into the working context. When the conversation grows, older content is summarized/abbreviated and re-injected so the model can continue without exceeding the token limit.

What roles do “recall storage” and “archive storage” play?

Recall storage functions like a queue of recent events that the agent can draw from when updating the working context. Archive storage is the longer-term store—often a vector database or a regular database—where the system can dump and retrieve items such as text files or JSON. In practice, the agent typically works with the archive storage and uses it to write core memory, while the recall queue helps manage what stays immediately relevant.
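
To make the division of labor concrete, here is a toy contrast (keyword overlap stands in for embedding similarity, and all names are illustrative): recall storage is a bounded log of recent events, while archive storage keeps everything and is searched on demand.

```python
# Toy contrast between recall storage (recent-event queue) and archive storage (searchable store).
from collections import deque

recall_storage = deque(maxlen=5)      # only the most recent events are kept here
archive_storage = []                  # unbounded; searched on demand

def log_event(event):
    recall_storage.append(event)      # recent history, cheap to scan
    archive_storage.append(event)     # long-term copy, e.g. a vector or SQL database in practice

def archive_search(query):
    q = set(query.lower().split())
    # Keyword overlap standing in for embedding similarity.
    return sorted(archive_storage, key=lambda e: -len(q & set(e.lower().split())))[:3]

for i in range(20):
    log_event(f"event {i}: user mentioned topic {i % 4}")

print(list(recall_storage))           # only the last 5 events survive in recall storage
print(archive_search("topic 2"))      # but the archive can still surface much older ones
```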

Why do system prompts (“pre-prompts”) matter for context limits?

System instructions consume tokens inside the model’s context window. The transcript notes that pre-prompts can approach or even exceed a thousand tokens, reducing the effective space available for conversation and working context. MemGPT separates token usage across system instructions, conversational context, and working context, using a FIFO-style approach so the latest messages remain available while older material is compressed into memory.

How does the demo show MemGPT’s memory working in practice?

The walkthrough starts with a basic user persona and shows the assistant initially addressing the user by a default placeholder name. After the user corrects the name to “Venelin,” MemGPT updates core memory (e.g., via core memory append). Exiting and rerunning demonstrates persistence: the bot later recognizes the stored name and preferences. The demo also shows checkpointing/reloading memory to verify that stored facts survive across sessions.

What are personas and “humans” in MemGPT, and how do they change behavior?

Personas and humans are configuration files that shape the agent’s role and the user’s profile. The demo creates a persona file (e.g., “Karan”) and a human profile file (e.g., “Achilles” described as a stoic warrior with interests like motorcycles). When starting a new run with these profiles, the agent’s tone and responses shift to match the persona, while memory operations still allow it to store and update user-specific facts.
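
A rough sketch of how such profile files might be folded into the agent’s prompt; the file names, directory layout, and tags below are assumptions made for illustration, not MemGPT’s actual configuration format.

```python
# Sketch: composing persona and human profile files into the agent's system prompt (illustrative).
from pathlib import Path

Path("personas").mkdir(exist_ok=True)
Path("humans").mkdir(exist_ok=True)
Path("personas/karan.txt").write_text("I am Karan, a calm, methodical assistant who answers precisely.")
Path("humans/achilles.txt").write_text("Achilles: a stoic warrior, interested in motorcycles.")

def build_system_prompt(persona_file, human_file):
    persona = Path(persona_file).read_text()
    human = Path(human_file).read_text()
    return (
        "You are an assistant with persistent memory.\n"
        f"<persona>\n{persona}\n</persona>\n"
        f"<human>\n{human}\n</human>"
    )

print(build_system_prompt("personas/karan.txt", "humans/achilles.txt"))
```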

Review Questions

  1. How does MemGPT reduce the need for long prompts, and what two memory layers does it use to do so?
  2. What token-consuming components besides user messages can shrink the effective context window, and how does MemGPT handle that?
  3. In MemGPT’s architecture, what is the difference between recall storage and archive storage, and when would each be used?

Key Points

  1. MemGPT replaces brute-force context expansion with virtual context management: fast working context plus slow external memory.
  2. Transformer attention’s scaling makes long contexts expensive; MemGPT keeps the working set small and relies on retrieval and compression.
  3. MemGPT maintains a recall queue for recent events and an archive storage (vector DB or regular DB) for longer-term facts.
  4. System prompts (“pre-prompts”) consume tokens too, so MemGPT separates token budgets across system instructions, conversational context, and working context.
  5. The agent uses function-calling-style operations (e.g., read memory, write memory) to decide what to store and what to inject into the working context.
  6. Core memory persistence is demonstrated by storing user facts (like name and preferences) and reloading them across runs/checkpoints.
  7. Personas and “humans” profiles let users shape tone and role while still benefiting from MemGPT’s memory management.

Highlights

MemGPT’s core idea is operating-systems-style virtual memory for LLMs: fast prompt context plus slow external storage.
Instead of trying to fit everything into attention, MemGPT compresses older conversation into summaries and reintroduces only what’s needed.
System prompts can quietly consume a large share of the context window, so MemGPT accounts for that token budget split.
The demo shows persistent core memory updates (name and preferences) and persona-driven dialogue using configurable profiles.
