Build Hour: Agent RFT
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Agent RFT trains tool-using agents end-to-end by allowing tool calls during training rollouts and updating model weights from a custom reward signal.
Briefing
Agent RFT (agent reinforcement fine-tuning) is positioned as a way to make tool-using agents faster and more accurate by training the model end-to-end on how to call tools and how to interpret tool outputs—using a custom reward signal that can be computed inside a customer’s environment. Instead of relying only on prompt tweaks, Agent RFT updates model weights based on learning signals that reflect “good” versus “bad” behavior during rollouts, where the agent explores many different tool-call sequences and learns from the outcomes.
The core distinction from earlier fine-tuning approaches is that Agent RFT allows the agent to interact with external tools during training. That interaction is routed through customer-provided endpoints: as the model explores, it can call tools, and each tool call is tagged with a unique rollout identifier so graders can evaluate the full trajectory—tool calls plus final answer—together. This design is meant to reduce “domain shift” between how models were trained and how customers’ real systems behave, because the agent learns using the same tools, data formats, and operational constraints it will face in production.
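To make the endpoint contract concrete, here is a minimal sketch of what a customer-side tool endpoint could look like, assuming a simple JSON-over-HTTP interface. The field names (`rollout_id`, `tool_name`, `arguments`) and the Flask setup are illustrative, not OpenAI's actual schema; the point is that every call carries the rollout identifier so the full trace can be reassembled for grading.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory trace store: every tool call made during a rollout, keyed by
# its rollout identifier, so a grader can later see the full trajectory.
TRACES: dict[str, list[dict]] = {}

def run_tool(tool_name: str, arguments: dict) -> str:
    # Trivial stand-in dispatch; a real endpoint would route to search,
    # file reads, shell commands, etc.
    if tool_name == "echo":
        return arguments.get("text", "")
    return f"unknown tool: {tool_name}"

@app.post("/tool")
def handle_tool_call():
    payload = request.get_json()
    rollout_id = payload["rollout_id"]   # ties this call to one training rollout
    tool_name = payload["tool_name"]
    arguments = payload.get("arguments", {})

    # Record the call so the grader can evaluate tool calls plus the
    # final answer together for this rollout.
    TRACES.setdefault(rollout_id, []).append({"tool": tool_name, "args": arguments})

    return jsonify({"output": run_tool(tool_name, arguments)})
```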
OpenAI’s team frames Agent RFT as a practical upgrade path: start with prompt engineering and task design (including tool descriptions and naming), then move to fine-tuning when those levers plateau. Agent RFT is presented as the next step when performance still lags—especially for reasoning-heavy agentic tasks where tool-use efficiency matters. Benefits highlighted include improved reasoning-model performance, better tool-use leading to stronger final answers, sample efficiency when training data is scarce, and lower latency. Latency improvements are attributed to two linked effects: fewer reasoning tokens and fewer tool calls, achieved through training that lightly penalizes excessive token use and can also enforce a tool-call budget.
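As a sketch of how those two levers might combine in a single reward, the function below folds a light token penalty and a hard tool-call budget into a grader's 0–1 correctness score. The penalty weights are made-up illustrative values, not figures OpenAI disclosed.

```python
def shaped_reward(
    correctness: float,            # 0-1 score from the grader
    num_tool_calls: int,
    reasoning_tokens: int,
    tool_call_budget: int = 10,
    token_penalty: float = 1e-5,   # illustrative weight, not a disclosed value
    call_penalty: float = 0.01,    # illustrative weight, not a disclosed value
) -> float:
    # Hard budget: trajectories that exceed the tool-call limit get no
    # credit, which is one way to enforce a tool-call budget in training.
    if num_tool_calls > tool_call_budget:
        return 0.0
    # Light penalties nudge the policy toward fewer tokens and fewer
    # calls without overwhelming the correctness signal.
    return max(
        0.0,
        correctness
        - token_penalty * reasoning_tokens
        - call_penalty * num_tool_calls,
    )
```

Keeping the penalties small relative to correctness matters: if they dominate, the policy learns to stop calling tools rather than to answer well.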
A detailed benchmark example targets financial QA under tight constraints. The task is made harder by removing the financial report context from the prompt and requiring the agent to locate the correct report among 2,800 documents using tools, then answer within 10 tool calls. Tools include a semantic search function (built with embeddings and cosine similarity), a directory listing tool, and a “cat”-style document retrieval tool. Reward is generated with a model grader rather than brittle string matching, allowing partial credit for near-correct numeric answers and tolerance for formatting differences.
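Since the semantic search tool is described as embeddings plus cosine similarity, the retrieval step can be sketched as below, assuming the 2,800 documents have already been embedded. The embedding model name and helper names are placeholders, not details from the demo.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Model name is a placeholder; any embedding model works for the sketch.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def semantic_search(query: str, doc_embeddings: np.ndarray,
                    doc_ids: list[str], top_k: int = 5) -> list[str]:
    """Rank pre-embedded documents by cosine similarity to the query."""
    q = embed(query)
    # Cosine similarity: dot product divided by the product of norms.
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [doc_ids[i] for i in best]
```

In the benchmark setup, a function like this would sit behind the tool endpoint alongside the directory-listing and document-retrieval tools.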
In the training demo, the baseline agent starts around 0.59 validation reward; after roughly 10 training steps, validation reward rises to about 0.63 while tool calls drop sharply. Downstream measurements show latency reductions (about 5 seconds, roughly 10%) and fewer output tokens (from about 2,500 to 1,500). Trace-level analysis reports fewer tool calls per trace (about 6.9 down to 4.2) and a shift toward trajectories that are both faster and higher-reward, with trade-offs where some samples lose reward when the policy becomes stricter.
Customer results reinforce the theme of optimizing agent behavior for real workflows. Cognition uses Agent RFT to tune planning in Devin, restricting tools to read-file and shell, and optimizing an F1-based reward over which files the agent selects. Ambience applies Agent RFT to ICD-10 coding from clinical transcripts using a code-search tool, improving F1 and reducing response time. Other examples include a GPU kernel-writing agent trained with as few as 100 PyTorch prompts (paired with a strong grader), and a financial reasoning workflow where a custom endpoint grader reduces hallucinations and improves core model performance. The throughline: Agent RFT works best when tasks and grading are well specified, tool behavior is mirrored in training, and the reward signal is hard to game while providing graded, not binary, feedback.
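Cognition's F1-based reward over file selection reduces to comparing the agent's chosen files against a gold set. A minimal sketch; the function name and example values are illustrative:

```python
def file_selection_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over selected files: harmonic mean of precision and recall."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the agent picked 2 of 3 relevant files plus 1 irrelevant one.
# precision = 2/3, recall = 2/3, F1 = 2/3 ~= 0.67
print(file_selection_f1({"a.py", "b.py", "x.py"}, {"a.py", "b.py", "c.py"}))
```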
Cornell Notes
Agent RFT trains tool-using agents end-to-end by letting the model call customer tools during training rollouts and then updating weights using a custom reward signal. Each rollout’s tool calls and final answer are linked via a unique identifier so graders can evaluate the full trajectory in the customer’s environment. The approach targets agentic bottlenecks—especially excessive tool calls and reasoning tokens—so it can improve accuracy while reducing latency. In a financial QA benchmark, removing report context forced the agent to search among 2,800 documents and answer within 10 tool calls; after about 10 steps, reward rose while tool calls and tokens dropped. The method is most effective when the task is constrained, the baseline sometimes succeeds (variance), and the grader provides non-brittle, partial-credit feedback aligned with real-world goals.
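The "non-brittle, partial-credit feedback" described above is the opposite of exact string matching, which fails on answers like "$1.2B" versus "1,200 million". One way to sketch such a grader is to ask a model for a 0–1 score; the prompt wording and model name here are illustrative, not the grader used in the demo.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a financial QA answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Return only a number between 0 and 1: 1 for an exact or equivalent answer,
partial credit for numerically close answers, 0 for wrong answers.
Ignore formatting differences (units, commas, currency symbols)."""

def model_grade(question: str, reference: str, answer: str) -> float:
    # Model name is a placeholder; any capable model can serve as grader.
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    try:
        score = float(resp.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable grader output counts as no credit
    return min(max(score, 0.0), 1.0)
```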
- What makes Agent RFT different from earlier fine-tuning or prompt-only optimization for agents?
- How does Agent RFT connect tool calls to grading during training?
- Why does the financial QA benchmark matter for understanding Agent RFT's latency and performance claims?
- What training knobs influence exploration and learning in Agent RFT?
- How do latency improvements show up in the demo results?
- What conditions did the team say help Agent RFT succeed?
Review Questions
- In Agent RFT, what role does the rollout identifier play in enabling custom grading of tool-using trajectories?
- Why can a brittle string-matching grader hinder learning in numeric QA tasks, and how does a model grader address this?
- What is the relationship between compute multiplier, exploration, and the ability of the agent to learn from nonzero-reward trajectories?
Key Points
1. Agent RFT trains tool-using agents end-to-end by allowing tool calls during training rollouts and updating model weights from a custom reward signal.
2. Tool calls and final answers are linked by a unique rollout ID, enabling graders to evaluate the entire trajectory using customer-side context.
3. Latency improvements come from learning policies that reduce both reasoning tokens and the number of tool calls, sometimes by enforcing a tool-call budget.
4. A financial QA benchmark demonstrates the approach under constraints: search among 2,800 documents and answer within 10 tool calls, with reward shaped via a model grader for partial credit.
5. The compute multiplier controls exploration; higher values can increase learning opportunities but require more robust endpoint infrastructure.
6. Agent RFT tends to work best with constrained tasks, nonzero baseline variance, high-quality datasets, and graders aligned with domain goals and resistant to reward hacking.
7. Customer deployments (Cognition, Ambience, GenSpark, Mako, Rogo) emphasize practical gains: faster planning/editing, improved F1, reduced response time, and lower hallucination rates when grading is well-designed.