
Context Engineering & Coding Agents with Cursor

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Tab’s next-action model is trained using large-scale acceptance/rejection data and updated in near real time via online RL.

Briefing

Cursor’s approach to AI coding hinges on a shift from “autocomplete” to autonomous coding agents—powered less by clever prompting and more by deliberate context engineering. The core claim is that software engineering can move faster when models get the right information at the right time, and when heavy computation (like code retrieval) is pushed offline so runtime stays fast and cheap.

The talk traces that evolution through Cursor’s own product history. Tab began as next-word prediction inspired by GitHub Copilot, then progressed to predicting the next line and ultimately the next action. Tab now handles more than 400 million requests per day, and that scale feeds a feedback loop: accepted suggestions are reinforced, rejected ones are penalized, and the model is updated in near real time using online RL. A key constraint is latency—suggestions slower than about 200 milliseconds disrupt developer flow—so the latest release shows fewer suggestions but with higher confidence.

From there, Cursor moves into coding agents. Instead of only generating text, agents can create or update entire code blocks after conversational prompts. Cursor emphasizes adjustable autonomy: early steps included inline diffs that use the current line plus broader file context, followed by Composer for multi-file edits with a more conversational workflow. In 2024, Cursor added a fully autonomous agent that uses more tokens for tool calling and can self-gather context, reducing the need for users to supply everything up front.

Context engineering becomes the centerpiece. As context windows grow, models still struggle to recall reliably, so the goal is minimal, high-quality context rather than maximal context. Retrieval is treated as fundamental. For codebase search, Cursor compares traditional string search tools (grep and ripgrep) with semantic search built on embeddings. Semantic search helps the agent find the correct file even when names differ from what the model “expects” (e.g., mapping a request for “top navigation” to header.tsx). Cursor also moved from an off-the-shelf embedding model to a custom one and runs A/B tests; semantic search increased follow-up questions and token usage, but the biggest win is shifting compute and latency to indexing time. The result: faster, cheaper agent responses at runtime without sacrificing acceptance quality.

The talk then widens from retrieval to agent UX and extensibility. Cursor argues that CLIs are useful but not the end state; agents should be scriptable and available across surfaces—terminal, web, phone, Slack bug reports, or Linear backlog triage. It also highlights specialized agents beyond editing, including Bugbot, an internal tool that reads and reviews code to find logic bugs and reportedly caught issues missed during reviews.

Long-horizon performance depends on planning and research upfront, plus deeper product integration: storing plans, editing files accordingly, and giving agents tools like to-do lists so they don’t lose track or waste tokens. Safety and trust remain central, with a human-in-the-loop model for shell commands via one-time prompts or allow lists.

Finally, the vision points beyond today’s interfaces: manage multiple agents in parallel (locally or in cloud sandboxes), explore model “competition” (different reasoning levels or providers), and let agents verify their work by running code and using browser automation. Michael Truell’s outlook frames Cursor’s goal as automating coding by combining model capability, autonomy, and human-computer interaction—freeing engineers from toil so they can focus on hard problems, design, and building what matters.

Cornell Notes

Cursor’s evolution of AI coding centers on context engineering and autonomy: models perform better when they receive intentional, minimal, high-quality context and when retrieval work is done ahead of time. Tab progressed from next-word prediction to next-action suggestions, using large-scale acceptance/rejection data and near-real-time online RL updates. Cursor’s coding agents then expanded from inline diffs and conversational multi-file edits to fully autonomous agents that can self-gather context via tool calling. Semantic search with embeddings (paired with string search) improves how agents retrieve the right code, while indexing shifts compute and latency offline for faster runtime. The product also emphasizes adjustable autonomy, human-in-the-loop safety for shell commands, and longer-horizon planning to raise code quality.

How did Tab evolve from basic autocomplete to a next-action system, and why does that matter for agent performance?

Tab started as next-word prediction inspired by GitHub Copilot, then moved to predicting the next line and ultimately where the cursor should go next. With more than 400 million requests per day, Tab collects data on which suggestions users accept or reject. That feedback drives a specialized next-action model trained with reinforcement: accepted behaviors are positively reinforced, rejected ones are negatively reinforced, and updates happen in near real time using online RL. This matters because agents rely on fast, reliable micro-decisions—if suggestions are slow (over ~200 ms) or low-confidence, developer flow breaks and downstream autonomy becomes harder to trust.
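As a concrete illustration of that loop, here is a minimal sketch using a toy linear acceptance model updated online. The event shape, features, and learning setup are assumptions for illustration, not Cursor’s actual architecture:

```ts
// Minimal sketch of the accept/reject feedback loop described above.
// The feature vector and linear model are illustrative assumptions.

type SuggestionEvent = {
  features: number[]; // hypothetical numeric features of (context, suggestion)
  accepted: boolean;  // did the user accept the Tab suggestion?
};

class OnlineNextActionModel {
  private weights: number[];

  constructor(dim: number, private learningRate = 0.01) {
    this.weights = new Array(dim).fill(0);
  }

  // Predicted probability that a suggestion will be accepted.
  score(features: number[]): number {
    const z = features.reduce((s, x, i) => s + x * this.weights[i], 0);
    return 1 / (1 + Math.exp(-z)); // sigmoid
  }

  // Online update: accepted suggestions are reinforced, rejected ones
  // penalized, applied as each event streams in rather than in batches.
  update(event: SuggestionEvent): void {
    const p = this.score(event.features);
    const target = event.accepted ? 1 : 0;
    const error = target - p; // log-loss gradient for a linear model
    event.features.forEach((x, i) => {
      this.weights[i] += this.learningRate * error * x;
    });
  }
}

// Usage: stream accept/reject events into the model as they arrive.
const model = new OnlineNextActionModel(3);
model.update({ features: [1, 0.5, 0.2], accepted: true });
model.update({ features: [1, 0.1, 0.9], accepted: false });
console.log(model.score([1, 0.5, 0.2])); // higher after the accept
```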

What is “context engineering” in this framework, and why isn’t bigger context always better?

Context engineering here means supplying models with the right information rather than simply stuffing the largest possible context window. As context size increases, models get worse at recalling details. Cursor’s approach favors minimal, high-quality tokens and treats retrieval as fundamental. Instead of pushing limits of the context window, the system retrieves relevant code snippets during generation—so the model sees targeted context that improves accuracy without bloating inference.
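A rough sketch of what “minimal, high-quality context” can mean in practice: rank retrieved snippets by relevance and pack them under a token budget instead of stuffing whole files into the prompt. The snippet type, relevance scores, and the 4-characters-per-token estimate are all assumptions:

```ts
// Hypothetical context assembly under a fixed token budget.

type Snippet = { path: string; text: string; relevance: number };

// Crude token estimate (~4 characters per token); an assumption,
// not a real tokenizer.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function buildContext(snippets: Snippet[], budgetTokens: number): string {
  const chosen: string[] = [];
  let used = 0;
  // Highest-relevance snippets first; skip anything that would
  // exceed the budget so the prompt stays small and targeted.
  for (const s of [...snippets].sort((a, b) => b.relevance - a.relevance)) {
    const cost = estimateTokens(s.text);
    if (used + cost > budgetTokens) continue;
    chosen.push(`// ${s.path}\n${s.text}`);
    used += cost;
  }
  return chosen.join("\n\n");
}
```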

Why does Cursor prefer semantic search (embeddings) alongside string search tools like grep and ripgrep?

String search (grep/ripgrep) finds direct matches but can miss the “right” file when the user’s intent doesn’t align with exact strings. Semantic search uses embeddings to map intent to the correct code location—for example, a request about “top navigation” can correctly retrieve header.tsx. Cursor also trained a custom embedding model and A/B tested semantic search versus grep alone. Semantic search increased follow-up questions and token usage, but it delivered a major systems win: compute and latency move to indexing time, enabling faster and cheaper runtime responses.
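The offline/online split might be sketched like this. The hash-based embed function is only a self-contained stand-in for a real embedding model (Cursor trained a custom one); everything else is illustrative:

```ts
// Sketch of the offline/online split: embed files at indexing time,
// then answer queries at runtime with a cheap similarity scan.

type IndexedFile = { path: string; vector: number[] };

const DIM = 64;

// Toy embedding: hash words into a fixed-size vector. A stand-in for
// a real embedding model, kept only so the example runs on its own.
function embed(text: string): number[] {
  const v = new Array(DIM).fill(0);
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    let h = 0;
    for (const c of word) h = (h * 31 + c.charCodeAt(0)) % DIM;
    v[h] += 1;
  }
  return v;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Offline: pay the embedding cost once, at indexing time.
function buildIndex(files: { path: string; text: string }[]): IndexedFile[] {
  return files.map(f => ({ path: f.path, vector: embed(f.text) }));
}

// Runtime: a query like "top navigation" can land on header.tsx even
// though the strings never match exactly.
function search(index: IndexedFile[], query: string, k = 3): string[] {
  const q = embed(query);
  return [...index]
    .sort((a, b) => cosine(b.vector, q) - cosine(a.vector, q))
    .slice(0, k)
    .map(f => f.path);
}
```

At runtime the only cost is a similarity scan over precomputed vectors, which is why the expensive embedding work can be paid once when the codebase is indexed.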

How does Cursor control autonomy so developers stay in charge?

Cursor builds autonomy as a dial rather than a switch. Early features prompted models to add inline suggestions as diffs using the current line plus broader file context. Composer made multi-file edits easier through a conversational UI. Later, a fully autonomous agent used more tokens for tool calling and could self-gather context, reducing the upfront burden on users. Even with autonomy, safety gates remain: when agents try to run shell commands, Cursor asks whether to run once or add to an allow list for future auto-execution, and these settings can be shared with the team via code.
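A minimal sketch of such a gate, with hypothetical askUser and exec callbacks (the talk describes the behavior, not an API):

```ts
// Human-in-the-loop gate for shell commands: previously approved
// commands auto-execute; everything else requires a decision.
// The prompt flow and allow-list handling are assumptions.

type Decision = "run-once" | "always-allow" | "deny";

const allowList = new Set<string>(); // could be persisted and shared in-repo

async function runShellCommand(
  command: string,
  askUser: (cmd: string) => Promise<Decision>,
  exec: (cmd: string) => Promise<string>,
): Promise<string | null> {
  if (!allowList.has(command)) {
    const decision = await askUser(command);
    if (decision === "deny") return null;
    if (decision === "always-allow") allowList.add(command);
  }
  return exec(command); // approved, either once or permanently
}
```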

What enables longer-horizon coding tasks beyond simple prompt changes?

Longer-horizon performance depends on more than telling the model to “plan better.” Cursor integrates planning into the product: it stores plans, supports iterative file edits aligned to those plans, and provides new tools so the agent can research and course-correct. It also allows agents to create and manage a to-do list, giving persistent task context so the model doesn’t forget goals or waste tokens. The payoff is higher-quality code because the agent starts with verified requirements and richer inputs.
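A to-do “tool” of this kind could be as simple as the following sketch; the tool surface (names, statuses, methods) is hypothetical:

```ts
// Sketch of a to-do tool an agent could call to keep persistent task
// state across a long session.

type TodoStatus = "pending" | "in-progress" | "done";
type Todo = { id: number; text: string; status: TodoStatus };

class TodoTool {
  private items: Todo[] = [];
  private nextId = 1;

  add(text: string): Todo {
    const item: Todo = { id: this.nextId++, text, status: "pending" };
    this.items.push(item);
    return item;
  }

  setStatus(id: number, status: TodoStatus): void {
    const item = this.items.find(t => t.id === id);
    if (item) item.status = status;
  }

  // Compact summary the agent can re-read each turn, so goals survive
  // long contexts without re-sending full history.
  summary(): string {
    return this.items
      .map(t => `[${t.status}] #${t.id} ${t.text}`)
      .join("\n");
  }
}
```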

What does “multiple agents” require, and what trade-offs appear in local vs cloud execution?

Running multiple agents in parallel introduces isolation and coordination problems. Locally, agents modifying overlapping files need separate code copies, such as git worktrees, plus handling of dev dependencies like databases and ports. In the cloud, sandboxed virtual machines support long-horizon tasks but add boot time and require initial environment setup. Cursor is exploring native support for these workflows rather than leaving developers to build scripts and hacks in user space. It’s also considering interfaces where one agent runs in the foreground while others run in the background.
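For the local case, worktree-based isolation might look like the sketch below. The git worktree commands are real, but the directory layout and branch naming are assumptions:

```ts
// Sketch of isolating parallel agents with git worktrees: each agent
// gets its own checkout and branch so overlapping edits don't collide.

import { execFileSync } from "node:child_process";

function createAgentWorktree(repoDir: string, agentId: string): string {
  const worktreeDir = `${repoDir}-agent-${agentId}`;
  const branch = `agent/${agentId}`;
  // `git worktree add -b <branch> <dir>` creates a new linked checkout.
  execFileSync("git", ["worktree", "add", "-b", branch, worktreeDir], {
    cwd: repoDir,
  });
  return worktreeDir; // the agent runs with cwd set to this directory
}

function removeAgentWorktree(repoDir: string, worktreeDir: string): void {
  execFileSync("git", ["worktree", "remove", worktreeDir], { cwd: repoDir });
}
```

Dev dependencies such as databases and ports still need per-agent handling, which is why the talk frames this as an open product problem rather than a solved one.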

Review Questions

  1. What specific feedback loop does Tab use to improve next-action predictions, and how quickly are updates applied?
  2. How does semantic search change the runtime cost profile of coding agents compared with grep/ripgrep alone?
  3. What product-level mechanisms (beyond prompting) does Cursor use to support longer-horizon agent work and maintain task focus?

Key Points

  1. Tab’s next-action model is trained using large-scale acceptance/rejection data and updated in near real time via online RL.
  2. Cursor treats context engineering as minimal, high-quality context plus retrieval, because larger context windows can reduce recall accuracy.
  3. Semantic search with embeddings improves code retrieval accuracy (e.g., mapping intent to header.tsx) and shifts compute/latency to indexing time for faster runtime.
  4. Coding agents are designed with adjustable autonomy, starting from inline diffs and conversational multi-file edits and progressing to fully autonomous tool-using agents.
  5. Trust and safety are enforced through human-in-the-loop controls for shell commands, including one-time prompts and allow lists that can be shared with teams.
  6. Long-horizon tasks improve when planning, to-do management, and deeper product integration are built into the agent workflow.
  7. Multiple-agent execution requires careful isolation—locally via tools like git worktrees and in the cloud via sandbox VMs—each with distinct setup and latency trade-offs.

Highlights

Tab now processes over 400 million requests per day, turning user acceptance/rejection into near-real-time online RL updates for next-action prediction.
Semantic search doesn’t just improve retrieval quality—it moves the heavy lifting to indexing time, enabling faster and cheaper agent responses at runtime.
Cursor’s autonomy is adjustable: it ranges from inline diff suggestions to fully autonomous agents that self-gather context via tool calling.
Bugbot reflects a shift toward specialized agents that read and review code for logic bugs, not only generate edits.
Multiple-agent management remains an open interface problem, with different constraints for local parallelism versus cloud sandboxes.

Mentioned

  • Michael Truell
  • Lee
  • RL
  • grep
  • ripgrep
  • DOM
  • API