Build Hour: Agent RFT
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Agent RFT trains tool-using agents end-to-end by allowing tool calls during training rollouts and updating model weights from a custom reward signal.
Briefing
Agent RFT (agent reinforcement fine-tuning) is positioned as a way to make tool-using agents faster and more accurate by training the model end-to-end on how to call tools and how to interpret tool outputs—using a custom reward signal that can be computed inside a customer’s environment. Instead of relying only on prompt tweaks, Agent RFT updates model weights based on learning signals that reflect “good” versus “bad” behavior during rollouts, where the agent explores many different tool-call sequences and learns from the outcomes.
The core distinction from earlier fine-tuning approaches is that Agent RFT allows the agent to interact with external tools during training. That interaction is routed through customer-provided endpoints: as the model explores, it can call tools, and each tool call is tagged with a unique rollout identifier so graders can evaluate the full trajectory—tool calls plus final answer—together. This design is meant to reduce “domain shift” between how models were trained and how customers’ real systems behave, because the agent learns using the same tools, data formats, and operational constraints it will face in production.
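To make the endpoint contract concrete, here is a minimal sketch of what a customer-side tool endpoint could look like, assuming a simple JSON-over-HTTP interface. The field names (`rollout_id`, `tool_name`, `arguments`) and the Flask setup are illustrative, not OpenAI's actual schema; the point is that every call carries the rollout identifier so the full trace can be reassembled for grading.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory trace store: every tool call made during a rollout, keyed by
# its rollout identifier, so a grader can later see the full trajectory.
TRACES: dict[str, list[dict]] = {}

def run_tool(tool_name: str, arguments: dict) -> str:
    # Trivial stand-in dispatch; a real endpoint would route to search,
    # file reads, shell commands, etc.
    if tool_name == "echo":
        return arguments.get("text", "")
    return f"unknown tool: {tool_name}"

@app.post("/tool")
def handle_tool_call():
    payload = request.get_json()
    rollout_id = payload["rollout_id"]   # ties this call to one training rollout
    tool_name = payload["tool_name"]
    arguments = payload.get("arguments", {})

    # Record the call so the grader can evaluate tool calls plus the
    # final answer together for this rollout.
    TRACES.setdefault(rollout_id, []).append({"tool": tool_name, "args": arguments})

    return jsonify({"output": run_tool(tool_name, arguments)})
```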
OpenAI’s team frames Agent RFT as a practical upgrade path: start with prompt engineering and task design (including tool descriptions and naming), then move to fine-tuning when those levers plateau. Agent RFT is presented as the next step when performance still lags—especially for reasoning-heavy agentic tasks where tool-use efficiency matters. Benefits highlighted include improved reasoning-model performance, better tool-use leading to stronger final answers, sample efficiency when training data is scarce, and lower latency. Latency improvements are attributed to two linked effects: fewer reasoning tokens and fewer tool calls, achieved through training that lightly penalizes excessive token use and can also enforce a tool-call budget.
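As a sketch of how those two levers might combine in a single reward, the function below folds a light token penalty and a hard tool-call budget into a grader's 0–1 correctness score. The penalty weights are made-up illustrative values, not figures OpenAI disclosed.

```python
def shaped_reward(
    correctness: float,            # 0-1 score from the grader
    num_tool_calls: int,
    reasoning_tokens: int,
    tool_call_budget: int = 10,
    token_penalty: float = 1e-5,   # illustrative weight, not a disclosed value
    call_penalty: float = 0.01,    # illustrative weight, not a disclosed value
) -> float:
    # Hard budget: trajectories that exceed the tool-call limit get no
    # credit, which is one way to enforce a tool-call budget in training.
    if num_tool_calls > tool_call_budget:
        return 0.0
    # Light penalties nudge the policy toward fewer tokens and fewer
    # calls without overwhelming the correctness signal.
    return max(
        0.0,
        correctness
        - token_penalty * reasoning_tokens
        - call_penalty * num_tool_calls,
    )
```

Keeping the penalties small relative to correctness matters: if they dominate, the policy learns to stop calling tools rather than to answer well.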
A detailed benchmark example targets financial QA under tight constraints. The task is made harder by removing the financial report context from the prompt and requiring the agent to locate the correct report among 2,800 documents using tools, then answer within 10 tool calls. Tools include a semantic search function (built with embeddings and cosine similarity), a directory listing tool, and a “cat”-style document retrieval tool. Reward is generated with a model grader rather than brittle string matching, allowing partial credit for near-correct numeric answers and tolerance for formatting differences.
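Since the semantic search tool is described as embeddings plus cosine similarity, the retrieval step can be sketched as below, assuming the 2,800 documents have already been embedded. The embedding model name and helper names are placeholders, not details from the demo.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Model name is a placeholder; any embedding model works for the sketch.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def semantic_search(query: str, doc_embeddings: np.ndarray,
                    doc_ids: list[str], top_k: int = 5) -> list[str]:
    """Rank pre-embedded documents by cosine similarity to the query."""
    q = embed(query)
    # Cosine similarity: dot product divided by the product of norms.
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [doc_ids[i] for i in best]
```

In the benchmark setup, a function like this would sit behind the tool endpoint alongside the directory-listing and document-retrieval tools.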
In the training demo, the baseline agent starts around 0.59 validation reward; after roughly 10 training steps, validation reward rises to about 0.63 while tool calls drop sharply. Downstream measurements show latency reductions (about 5 seconds, roughly 10%) and fewer output tokens (from about 2,500 to 1,500). Trace-level analysis reports fewer tool calls per trace (about 6.9 down to 4.2) and a shift toward trajectories that are both faster and higher-reward, with trade-offs where some samples lose reward when the policy becomes stricter.
Customer results reinforce the theme of optimizing agent behavior for real workflows. Cognition uses Agent RFT to tune planning in Devin, restricting tools to read-file and shell, and optimizing an F1-based reward over which files the agent selects. Ambience applies Agent RFT to ICD-10 coding from clinical transcripts using a code-search tool, improving F1 and reducing response time. Other examples include a GPU kernel-writing agent trained with as few as 100 PyTorch prompts (paired with a strong grader), and a financial reasoning workflow where a custom endpoint grader reduces hallucinations and improves core model performance. The throughline: Agent RFT works best when tasks and grading are well specified, tool behavior is mirrored in training, and the reward signal is hard to game while providing graded, not binary, feedback.
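Cognition's F1-based reward over file selection reduces to comparing the agent's chosen files against a gold set. A minimal sketch; the function name and example values are illustrative:

```python
def file_selection_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over selected files: harmonic mean of precision and recall."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the agent picked 2 of 3 relevant files plus 1 irrelevant one.
# precision = 2/3, recall = 2/3, F1 = 2/3 ~= 0.67
print(file_selection_f1({"a.py", "b.py", "x.py"}, {"a.py", "b.py", "c.py"}))
```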
Cornell Notes
Agent RFT trains tool-using agents end-to-end by letting the model call customer tools during training rollouts and then updating weights using a custom reward signal. Each rollout’s tool calls and final answer are linked via a unique identifier so graders can evaluate the full trajectory in the customer’s environment. The approach targets agentic bottlenecks—especially excessive tool calls and reasoning tokens—so it can improve accuracy while reducing latency. In a financial QA benchmark, removing report context forced the agent to search among 2,800 documents and answer within 10 tool calls; after about 10 steps, reward rose while tool calls and tokens dropped. The method is most effective when the task is constrained, the baseline sometimes succeeds (variance), and the grader provides non-brittle, partial-credit feedback aligned with real-world goals.
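The "non-brittle, partial-credit feedback" described above is the opposite of exact string matching, which fails on answers like "$1.2B" versus "1,200 million". One way to sketch such a grader is to ask a model for a 0–1 score; the prompt wording and model name here are illustrative, not the grader used in the demo.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a financial QA answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Return only a number between 0 and 1: 1 for an exact or equivalent answer,
partial credit for numerically close answers, 0 for wrong answers.
Ignore formatting differences (units, commas, currency symbols)."""

def model_grade(question: str, reference: str, answer: str) -> float:
    # Model name is a placeholder; any capable model can serve as grader.
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    try:
        score = float(resp.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable grader output counts as no credit
    return min(max(score, 0.0), 1.0)
```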
- What makes Agent RFT different from earlier fine-tuning or prompt-only optimization for agents?
- How does Agent RFT connect tool calls to grading during training?
- Why does the financial QA benchmark matter for understanding Agent RFT's latency and performance claims?
- What training knobs influence exploration and learning in Agent RFT?
- How do latency improvements show up in the demo results?
- What conditions did the team say help Agent RFT succeed?
Review Questions
- In Agent RFT, what role does the rollout identifier play in enabling custom grading of tool-using trajectories?
- Why can a brittle string-matching grader hinder learning in numeric QA tasks, and how does a model grader address this?
- What is the relationship between compute multiplier, exploration, and the ability of the agent to learn from nonzero-reward trajectories?
Key Points
1. Agent RFT trains tool-using agents end-to-end by allowing tool calls during training rollouts and updating model weights from a custom reward signal.
2. Tool calls and final answers are linked by a unique rollout ID, enabling graders to evaluate the entire trajectory using customer-side context.
3. Latency improvements come from learning policies that reduce both reasoning tokens and the number of tool calls, sometimes by enforcing a tool-call budget.
4. A financial QA benchmark demonstrates the approach under constraints: search among 2,800 documents and answer within 10 tool calls, with reward shaped via a model grader for partial credit.
5. The compute multiplier controls exploration; higher values can increase learning opportunities but require more robust endpoint infrastructure.
6. Agent RFT tends to work best with constrained tasks, nonzero baseline variance, high-quality datasets, and graders aligned with domain goals and resistant to reward hacking.
7. Customer deployments (Cognition, Ambience, GenSpark, Mako, Rogo) emphasize practical gains: faster planning/editing, improved F1, reduced response time, and lower hallucination rates when grading is well-designed.