Context Window Management in Claude Code | CampusX

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude Code’s context window is token-limited working memory, and it fills faster than expected because each turn resends prior conversation history plus tool outputs.

Briefing

Claude Code’s context window is small enough to become the bottleneck for real development work, and managing it well is the difference between steady output quality and fast degradation. The core idea is that Claude Code can only “remember” a limited amount of information at once, measured in tokens, and that limit is consumed not just by what the user sends but also by Claude Code’s replies, tool outputs, and even system scaffolding that loads at the start of each session. With a typical context window of around 200K tokens for most models (the transcript notes that some models, such as Opus 4.6, have much larger limits), the practical usable space is far less than the headline number, because large portions are pre-filled before any conversation begins.

Context window management starts with understanding what “context” means in programming: the entire codebase, spec documents (PRDs), issue trackers like Jira or GitHub issues, team discussions in Slack, prior AI-assisted chat history, and the current state of the repository all contribute to the information needed to complete a coding task. Claude Code can’t ingest an arbitrarily large codebase in one go; instead, it relies on a sliding budget of tokens. Every new message during a session forces Claude Code to re-receive the conversation history so far (since it has no persistent memory across turns), which makes token usage grow quickly over time. A simple illustration in the transcript shows how token consumption accelerates: if each turn adds 100 tokens from the user and 100 from Claude’s reply, then by turn 10 each new request already carries roughly 1,800 tokens of prior history on top of the new prompt, because the full history is resent every time.
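
To make that arithmetic concrete, here is a small back-of-the-envelope sketch in Python using the transcript’s illustrative figures (100 tokens per user message and per reply); the numbers are assumptions for illustration, not how the Anthropic API actually meters tokens:

```python
def tokens_in_context(turn: int, user_tokens: int = 100, reply_tokens: int = 100) -> int:
    """Tokens sent to the model on a given turn: all prior turns' history plus the new prompt."""
    history = (turn - 1) * (user_tokens + reply_tokens)  # every previous user + assistant exchange
    return history + user_tokens


def cumulative_tokens(turns: int) -> int:
    """Total input tokens spent across a session of `turns` turns."""
    return sum(tokens_in_context(t) for t in range(1, turns + 1))


for t in (1, 5, 10):
    print(f"turn {t:2d}: context sent = {tokens_in_context(t):5d}, cumulative = {cumulative_tokens(t):6d}")
# turn  1: context sent =   100, cumulative =    100
# turn  5: context sent =   900, cumulative =   2500
# turn 10: context sent =  1900, cumulative =  10000
```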

The transcript then breaks down what actually occupies Claude Code’s context window. A large system prompt is preloaded (around 6,000 tokens), tool schemas take another large chunk (around 8,000 tokens), and the project’s CLAUDE.md file loads at session start. Conversation history, tool outputs, MCP tool schemas, and “skills” (markdown-like playbooks for specific tasks such as EDA) also consume space. On top of that, an auto-compaction buffer reserves about 33,000 tokens for summarization. The practical takeaway: although the headline context window is ~200K tokens, usable working space is closer to ~150K tokens because tens of thousands of tokens are committed before the first prompt.
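
A rough budget calculation using the figures quoted in the transcript (treat them as the video’s estimates, not official Anthropic numbers; the 5,000-token line item for CLAUDE.md, MCP schemas, and skills is a hypothetical placeholder that varies per project):

```python
CONTEXT_WINDOW = 200_000  # headline window quoted for most models in the video

preloaded = {
    "system prompt": 6_000,
    "tool schemas": 8_000,
    "auto-compact buffer (reserved)": 33_000,
    "CLAUDE.md + MCP schemas + skills": 5_000,  # hypothetical placeholder, varies per project
}

usable = CONTEXT_WINDOW - sum(preloaded.values())
for item, tokens in preloaded.items():
    print(f"{item:38s} {tokens:>7,}")
print(f"{'left for conversation + tool output':38s} {usable:>7,}")  # ~148,000 of 200,000
```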

When the context window fills, response quality degrades. The transcript describes a threshold behavior: once token usage approaches roughly 120K–130K tokens (against the ~150K-token working limit), Claude Code’s answers become less reliable. To prevent this, Claude Code can auto-compact conversation history into the reserved summary buffer when usage reaches roughly 75%–92%. That keeps older details available in compressed form, but it’s not lossless: some nuance can disappear. For control, users can manually run /compact to summarize at chosen moments, ideally when not in the middle of an important feature build.
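
A minimal sketch of those thresholds, assuming the video’s percentages and a ~150K-token working limit (neither figure is documented Claude Code behavior):

```python
WORKING_LIMIT = 150_000    # usable tokens after preloaded overhead (the video's estimate)
AUTO_COMPACT_AT = 0.92     # upper end of the auto-compaction range mentioned
MANUAL_COMPACT_AT = 0.75   # where the video suggests running /compact yourself


def advice(tokens_used: int, mid_feature: bool) -> str:
    usage = tokens_used / WORKING_LIMIT
    if usage >= AUTO_COMPACT_AT:
        return "auto-compaction will summarize history (lossy)"
    if usage >= MANUAL_COMPACT_AT and not mid_feature:
        return "good moment to run /compact manually"
    return "keep working; keep an eye on context usage"


print(advice(120_000, mid_feature=False))  # ~80% -> good moment to run /compact manually
print(advice(142_000, mid_feature=True))   # ~95% -> auto-compaction will summarize history (lossy)
```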

If compaction can’t keep up, options narrow: start a new session (or use /clear to wipe the current conversation). The transcript also recommends workflow design to reduce context burn: build one feature per session rather than multiple features in a single long thread, keep prompts specific (avoid vague instructions), and use sub-agents for isolated or parallel tasks so each sub-agent has its own fresh context window. Finally, it introduces /claudeignore as a way to prevent large files from being pulled into context, and it argues that terminal-based Claude Code access unlocks advanced features that GUI flows may not support as fully.

Cornell Notes

Claude Code’s context window is a token-limited “working memory” that determines how much information it can use while generating code. Context grows fast because each new turn resends the full conversation history, plus token-heavy inputs like prompts, pasted code, images, tool calls, and tool outputs. Although many models are described as having ~200K tokens of context window, a large portion is preloaded by system prompts and tool/skill schemas, leaving closer to ~150K tokens for active work. As usage approaches the threshold (roughly 120K–130K tokens), response quality can degrade. To manage this, users should monitor context usage, manually run /compact around 70–75% when not mid-task, split work into separate sessions per feature, and use sub-agents for parallel isolated tasks; if needed, start a fresh session or use /clear.

What exactly counts toward Claude Code’s context window, and why does it fill so quickly?

Token usage includes both input tokens (the user’s prompt text, pasted code, and images) and output tokens (Claude’s replies). It also counts tool-related activity: when Claude Code uses tools, the tool outputs and intermediate results consume additional tokens. Because Claude Code doesn’t retain memory across turns, each new request during a session includes the prior conversation history again—so token consumption accelerates over time rather than staying constant per turn.
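
A rough per-turn accounting, with purely illustrative numbers, to show why tool output is often the largest single line item:

```python
# Illustrative numbers only: both directions count, and tool activity can dominate a turn.
turn_tokens = {
    "user prompt": 150,
    "pasted code": 1_200,
    "resent conversation history": 4_000,
    "tool calls and tool outputs (file reads, test logs)": 6_500,
    "assistant reply": 800,
}
print(sum(turn_tokens.values()), "tokens for one turn")  # 12650 tokens for one turn
```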

Why does “one long session” cost more context than “separate sessions,” even if the same features are built?

In a single session, every new turn must resend the entire conversation history so far. That means later turns include more repeated context, multiplying token usage. The transcript’s example: building four features in one session costs about 4× the context compared with building each feature in its own session (because each feature gets its own shorter conversation thread).
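
A toy calculation of that claim, assuming 10 turns per feature and 200 tokens per turn (both numbers are made up for illustration):

```python
TURN_TOKENS = 200        # one user prompt plus one reply (assumed)
TURNS_PER_FEATURE = 10   # assumed
FEATURES = 4


def session_cost(turns: int) -> int:
    """Total input tokens for a session that re-sends the full history on every turn."""
    total, history = 0, 0
    for _ in range(turns):
        total += history + TURN_TOKENS
        history += TURN_TOKENS
    return total


one_long_session = session_cost(TURNS_PER_FEATURE * FEATURES)   # 164,000
separate_sessions = FEATURES * session_cost(TURNS_PER_FEATURE)  # 44,000
print(round(one_long_session / separate_sessions, 1))           # 3.7 -> roughly the 4x the video cites
```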

What preloads into the context window before any user conversation even starts?

A large system prompt is preloaded (around 6,000 tokens), tool schemas are loaded (around 8,000 tokens), and the project’s CLAUDE.md file loads at session start as a project overview. Then come the conversation history, tool outputs from the session, and optional MCP tool schemas. “Skills” (markdown-like task playbooks) also take space. Finally, an auto-compaction buffer reserves about 33,000 tokens for summarization, which reduces the truly available working space.

How does auto-compaction work, and what trade-off does it introduce?

Auto-compaction triggers when context usage reaches roughly 75%–92% of the working budget. Claude Code summarizes the conversation history into the reserved compaction space (about 33,000 tokens) and frees the detailed history. The trade-off is that summaries aren’t lossless: some fine-grained details may be lost, which can matter if compaction happens mid-implementation.
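
A conceptual sketch of that trade-off, not Claude Code’s actual implementation: compaction swaps detailed turns for one short summary, freeing tokens but discarding detail (the messages, file names, and the 4-characters-per-token estimate are all invented for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    history: list[str] = field(default_factory=list)

    def tokens(self) -> int:
        # crude estimate: ~1 token per 4 characters (assumption)
        return sum(len(msg) for msg in self.history) // 4

    def compact(self, summary: str) -> None:
        """Replace the detailed history with one summary message (frees tokens, loses detail)."""
        self.history = [summary]


session = Session(history=[
    "User: add pagination to the /orders endpoint " * 20,
    "Assistant: edited orders.py, added limit/offset parameters " * 20,
])
before = session.tokens()
session.compact("Summary: pagination added to /orders via limit/offset in orders.py")
print(before, "->", session.tokens())  # a large token count shrinks to a small one
```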

When should a user run /compact manually instead of relying on auto-compaction?

Manual /compact is recommended proactively when context usage approaches about 70–75%, but only when the user is not in the middle of an important feature development task. This avoids compaction happening at an inconvenient moment and reduces the chance of losing critical implementation details.

What are the practical escape hatches when context management still fails?

If summaries can’t fit anymore (the compaction buffer fills), Claude Code may stop accepting further conversation in that session. At that point, the practical options are to start a new session or use /clear to delete the current conversation and effectively restart from a clean slate. The transcript also suggests splitting work so each feature gets its own session to prevent reaching this failure mode.

Review Questions

  1. How does resending conversation history each turn affect token growth over a long session?
  2. What components (system prompt, tool schemas, skills, tool outputs) consume context before the user even starts asking questions?
  3. Why can response quality degrade near the context threshold, and what workflow changes help prevent it?

Key Points

  1. Claude Code’s context window is token-limited working memory, and it fills faster than expected because each turn resends prior conversation history plus tool outputs.
  2. Token usage includes user inputs (text, pasted code, images) and Claude’s outputs, including verbose tool results.
  3. Even when a model is described as having ~200K tokens of context window, preloaded system prompts and tool/skill schemas reduce practical usable space to roughly ~150K tokens.
  4. Response quality can degrade as context usage approaches the working threshold (around 120K–130K tokens), so monitoring context usage matters.
  5. Manual /compact should be run proactively around 70–75% when not mid-task to avoid losing implementation nuance from auto-compaction.
  6. Workflow design reduces context burn: build one feature per session, keep prompts specific, and use sub-agents for isolated/parallel tasks with fresh context windows.
  7. If compaction can’t keep up, start a new session (or use /clear) to reset context and continue work cleanly.

Highlights

Context window consumption isn’t just the user’s messages—tool outputs and Claude’s replies can dominate, making the window fill rapidly.
Because conversation history is resent every turn, token usage grows nonlinearly during long sessions.
Auto-compaction helps by summarizing history into reserved space, but it’s not lossless and can drop fine details if it triggers mid-implementation.
Splitting work into separate sessions per feature can cut context cost dramatically compared with building multiple features in one long thread.
/claudeignore is positioned as a way to keep large files out of context so they don’t crowd the token budget.
