
Context Engineering vs. Prompt Engineering: Guiding LLM Agents

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Context engineering includes more than prompt wording; it also covers system instructions, chat rules, and uploaded documents that the model consumes.

Briefing

Context engineering is being misunderstood as mostly a token-efficiency exercise, but the bigger shift is about steering the “probabilistic context” an LLM agent gathers—especially when it can browse the web or query tools like MCP servers. Token optimization still matters, yet the decisive factor for correctness and usefulness increasingly comes from shaping what information the model chooses to retrieve and how reliably it can do so over time.

The discussion starts by framing context engineering as the successor to prompt engineering: prompts are only one slice of what an LLM sees. System instructions, chat rules, uploaded documents, and other instance-level materials all form part of the input. In that sense, the operator’s job is to ensure the context is accurate and aligned with the desired outcome. But most public attention focuses on “part one”—the deterministic portion that can be controlled directly, such as shrinking what gets sent into the context window. Approaches like “chain of draft” (using shorthand symbols to approximate logical thinking) aim to reduce token burn while preserving reasoning quality, because writing down intermediate symbols can help the model think more clearly.
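To make the token-efficiency idea concrete, here is a minimal sketch of what a chain-of-draft style instruction might look like, assuming the OpenAI Python SDK; the model name, wording, and example question are illustrative, not taken from the video.

```python
# Illustrative "chain of draft" style prompting: ask the model to keep its
# intermediate reasoning as terse drafts instead of full sentences, trading
# verbose chain-of-thought tokens for compact symbols.
# Assumes the OpenAI Python SDK; the model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()

CHAIN_OF_DRAFT_SYSTEM = (
    "Think step by step, but keep each intermediate step as a minimal draft of "
    "at most five words or a short symbolic expression. "
    "Give the final answer after the delimiter ####."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": CHAIN_OF_DRAFT_SYSTEM},
        {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
    ],
)
print(response.choices[0].message.content)
```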

The core argument is that this deterministic focus becomes less relevant as agents gain web access and autonomy. Once an agent can search, the model’s effective context balloons: the prompt and uploaded materials may become a tiny fraction of the total tokens processed. A multi-agent example is given: instructing Claude Opus to research a topic and providing a perspective document can still lead to hundreds of web pages being consulted, making it impossible to treat the original prompt as a measurable driver of the final answer. The model maintains “focus” largely because it has been trained to follow the user’s ask—meaning the prompt becomes probabilistic, and the operator’s real challenge shifts to shaping the environment that determines what the agent will retrieve.

To move context engineering toward this agentic reality, a set of principles is proposed. First, design for discovery: measure the rate at which a desired response returns when probabilistic context is included, and ensure the system can consistently produce acceptable outcomes even when the context window isn’t tightly bounded. Second, monitor information-source quality and track how it changes over time; users may be satisfied with results while the underlying sources remain sketchy, as suggested by concerns about source quality in tools like ChatGPT’s Deep Research. Third, take security seriously because prompt-injection attacks across the open web and MCP servers are expected. Fourth, measure overall decision accuracy using evaluation methods that incorporate source relevance—arguing that precision/recall metrics assume determinism and may mislead when context is largely uncontrollable. Fifth, version everything: prompts and agent behaviors need careful testing and version control.
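The fifth principle is the easiest to operationalize today. As a rough illustration, prompts can be treated as content-addressed artifacts so that any change in agent behavior can be traced back to a specific prompt revision; the registry layout and field names below are assumptions for the sketch, not a format prescribed in the video.

```python
# A minimal sketch of the "version everything" principle: store each prompt as
# a content-addressed record so agent behavior changes can be traced back to a
# specific prompt revision. Registry layout and field names are illustrative.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

REGISTRY = Path("prompt_registry.jsonl")

def register_prompt(name: str, text: str) -> str:
    """Store a prompt version keyed by the hash of its exact text."""
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    record = {
        "name": name,
        "version": version,
        "text": text,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    with REGISTRY.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return version

# Every agent run can then log the prompt version it used, so regressions in
# behavior can be correlated with prompt changes under test.
research_prompt_v = register_prompt(
    "research_agent_system",
    "You are a research agent. Prefer primary sources and cite every claim.",
)
print(f"research_agent_system @ {research_prompt_v}")
```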

Net effect: the goal isn’t just cheaper token usage. It’s deliberate shaping of the probabilistic context so agents make better decisions, with evaluation harnesses updated to reflect source quality across large, partially uncontrollable context windows. The broader takeaway is that frontline systems are moving from chatbots to tool-using agents, and context engineering must catch up to that shift by treating web/tool retrieval as a first-class engineering problem.

Cornell Notes

Context engineering goes beyond prompt wording and token budgeting. As LLM agents gain web access and tool use (including MCP-style server queries), most of the model’s effective context becomes probabilistic—information the system retrieves rather than content the operator can deterministically control. Token optimization methods like “chain of draft” can reduce cost, but correctness and usefulness increasingly depend on shaping what the agent searches for and how it selects sources. The proposed direction emphasizes designing for reliable discovery, monitoring and auditing source quality over time, anticipating prompt-injection security risks, measuring decision accuracy with source relevance, and versioning prompts and agent behaviors. This reframes evaluation: traditional precision/recall can fall short when context is no longer tightly bounded.

What’s the key distinction between deterministic and probabilistic context in agentic LLM systems?

Deterministic context is the portion an operator can directly control and verify—system instructions, chat rules, uploaded documents, and the exact text sent into the context window. Probabilistic context is what the model acquires indirectly when it can browse or query tools: the agent’s searches across the web or MCP servers can dominate the total tokens processed. In that regime, the original prompt and documents may become a small, hard-to-measure fraction of the information driving the answer, even if the model still follows the user’s intent due to training.

Why does token optimization become less central once agents can search widely?

Token optimization targets the deterministic slice—how efficiently the prompt and provided materials fit into the context window. But with web/tool access, the agent can ingest hundreds of pages, making the operator-controlled tokens comparatively minor. The more important engineering lever becomes steering the agent’s retrieval behavior (what it searches, where it looks, and which sources it trusts), because that retrieval largely determines the final decision.
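One way to picture that lever: instead of trimming prompt tokens, wrap the agent's search tool so queries are steered toward preferred sources first. The `web_search` callable and domain list below are hypothetical placeholders; the point is that the engineering sits in the retrieval path, not in the prompt text.

```python
# A minimal sketch of steering retrieval rather than trimming tokens: route
# queries to preferred domains before falling back to the open web.
# `web_search` is an assumed stand-in for whatever search tool the agent uses.
PREFERRED_DOMAINS = ["docs.python.org", "arxiv.org"]  # assumed preferences

def steered_search(query: str, web_search, max_results: int = 5) -> list[dict]:
    """Try domain-scoped queries before falling back to an open search."""
    results: list[dict] = []
    for domain in PREFERRED_DOMAINS:
        results += web_search(f"site:{domain} {query}")
        if len(results) >= max_results:
            return results[:max_results]
    # Fall back to an unrestricted search only if preferred sources run dry.
    return (results + web_search(query))[:max_results]
```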

How does the “Claude Opus research” example illustrate the problem?

The example describes giving Claude Opus a directive to research a topic plus a Word document containing the user’s perspective. Even with that input, the agent consults hundreds of websites before returning an answer. That makes it impractical to assume the provided document or prompt accounts for any measurable portion of the processed context. The agent’s training helps it stay aligned with the user’s ask, but the prompt’s influence becomes probabilistic—shifting the operator’s job toward shaping the search environment.

What does “design for semantic highways” mean in practice?

It means engineering for consistent outcomes despite probabilistic context. Instead of assuming a tightly bounded context window, teams should measure the rate at which a desired response comes back when the agent can retrieve information across MCP servers or the web. The goal is repeatable success: can the system reliably produce acceptable answers even when the context is not fully controllable?
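Here is a rough sketch of that measurement, assuming you have an agent entry point and some acceptance check (a rubric, regex, or LLM judge); both are stand-ins here, not components named in the video.

```python
# A minimal sketch of measuring "design for discovery": run the same task many
# times while the agent is free to retrieve probabilistic context, and track
# how often the result passes an acceptance check. `run_agent` and
# `is_acceptable` are assumed stand-ins for your own agent call and rubric.
from typing import Callable

def desired_response_rate(
    task: str,
    run_agent: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    trials: int = 20,
) -> float:
    """Fraction of runs whose answer passes the acceptance check."""
    passes = 0
    for _ in range(trials):
        answer = run_agent(task)      # retrieval happens inside the agent
        if is_acceptable(answer):     # e.g., rubric, regex, or LLM judge
            passes += 1
    return passes / trials

# Example usage with trivial stand-ins:
rate = desired_response_rate(
    "Summarize the latest guidance on MCP server security.",
    run_agent=lambda t: "stubbed agent answer mentioning MCP",
    is_acceptable=lambda a: "MCP" in a,
)
print(f"desired-response rate: {rate:.0%}")
```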

Why is source monitoring and auditing treated as a core requirement?

Because an agent can produce a good-looking answer while relying on low-quality or unreliable sources. The transcript notes that this mismatch happens often—e.g., outputs may be satisfactory while the cited sources appear “sketchy,” and auditing hundreds of sources can be difficult. The proposed principle is to track whether the agent actually uses “reliable and verified” sources and to monitor how those sources change over time.
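A minimal way to start is to log every URL the agent cites and score it against a simple trust list, so drift in source quality becomes visible over time. The trust tiers and scoring below are illustrative assumptions, not a vetted reputation system.

```python
# A minimal sketch of ongoing source auditing: record every cited URL, score
# its domain against assumed trust lists, and append the result to a running
# log so changes in source quality can be reviewed over time.
import json
from datetime import datetime, timezone
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"arxiv.org", "nist.gov", "acm.org"}   # assumed allowlist
FLAGGED_DOMAINS = {"example-content-farm.com"}            # assumed blocklist

def audit_sources(run_id: str, cited_urls: list[str], log_path: str = "source_audit.jsonl") -> dict:
    """Score cited URLs and append the audit record to a running log."""
    scores = []
    for url in cited_urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        if domain in TRUSTED_DOMAINS:
            scores.append(1.0)
        elif domain in FLAGGED_DOMAINS:
            scores.append(0.0)
        else:
            scores.append(0.5)  # unknown: neither verified nor flagged
    record = {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls": cited_urls,
        "mean_trust": sum(scores) / len(scores) if scores else None,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

print(audit_sources("run-001", [
    "https://arxiv.org/abs/2502.00001",
    "https://example-content-farm.com/post",
]))
```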

What evaluation change is suggested for probabilistic context calls?

Traditional precision/recall assumes a deterministic context window. For probabilistic context, the transcript argues that measuring decision accuracy is more informative, especially when evaluation incorporates source relevance scoring. In other words, scoring the inputs (the sources the agent used) can better predict overall response quality than answer-only metrics.
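As a sketch, decision accuracy can blend answer correctness with a judged relevance score for the sources the agent used; the weighting and the relevance judge here are assumptions, and in practice the relevance score might come from an LLM judge or human review.

```python
# A minimal sketch of scoring decision accuracy together with source relevance,
# rather than answer-only precision/recall. The blend weight and the relevance
# scores are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AgentRun:
    answer_correct: bool     # did the final decision/answer pass the rubric?
    source_relevance: float  # 0..1 judged relevance of the sources it used

def decision_accuracy(runs: list[AgentRun], relevance_weight: float = 0.3) -> float:
    """Blend answer correctness with how relevant the consulted sources were."""
    if not runs:
        return 0.0
    scores = [
        (1 - relevance_weight) * float(r.answer_correct)
        + relevance_weight * r.source_relevance
        for r in runs
    ]
    return sum(scores) / len(scores)

runs = [AgentRun(True, 0.9), AgentRun(True, 0.4), AgentRun(False, 0.8)]
print(f"decision accuracy: {decision_accuracy(runs):.2f}")
```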

Review Questions

  1. How would you redesign an evaluation harness if an agent’s web/tool retrieval dominates the context window rather than the prompt text?
  2. What security controls would you prioritize to mitigate prompt-injection risks when an agent searches across MCP servers and the open web?
  3. Which metrics would you use to assess “semantic highway” reliability, and how would you test whether the agent’s sources remain trustworthy over time?

Key Points

  1. Context engineering includes more than prompt wording; it also covers system instructions, chat rules, and uploaded documents that the model consumes.
  2. Most current “context engineering” advice focuses on deterministic token efficiency, but agent web/tool access makes probabilistic context dominate.
  3. Token-saving techniques like “chain of draft” can help, yet correctness increasingly depends on steering what information agents retrieve.
  4. Design for reliable discovery by measuring how often desired outcomes occur when the agent can gather probabilistic context across tools and the web.
  5. Audit and monitor source quality continuously, since agents can deliver acceptable answers while using unreliable or shifting sources.
  6. Treat security as a first-class requirement because prompt-injection attacks across web retrieval and MCP servers are expected.
  7. Update evaluation to measure decision accuracy with source relevance and version prompts/agent behaviors carefully.

Highlights

As agents gain web access, the prompt and uploaded materials can become a tiny fraction of the total context processed, making prompt influence effectively probabilistic.
“Chain of draft” token-reduction works partly because writing intermediate symbols helps the model think more clearly, but it doesn’t solve the retrieval steering problem.
Source quality auditing matters even when outputs look good—cited sources can be unreliable, and auditing large numbers of sources is a practical challenge.
Prompt-injection attacks across MCP servers and open-web retrieval are expected, so security planning can’t wait.
Evaluation should shift from precision/recall assumptions about determinism toward decision accuracy and source relevance scoring.
