
Build Hour: Reinforcement Fine-Tuning

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

RFT is best used to improve reasoning behavior when the model already has relevant knowledge but still fails to apply it correctly.

Briefing

Reinforcement fine-tuning (RFT) is positioned as the most direct way to improve an LLM’s reasoning behavior when the model already has the needed facts but still struggles to apply them correctly. Instead of training on labeled “right answers,” RFT uses a grader—implemented as a rubric or scoring function—to evaluate multiple candidate responses produced from the same input. The system then learns from those graded trajectories, making it especially useful for policy compliance, legal reasoning, and medical workflows where correctness depends on how the model reasons, not just what it knows.

The session frames model customization around two levers: improving knowledge (via prompting and retrieval augmented generation) and improving reasoning (where fine-tuning fits). Fine-tuning is treated as an investment that should come only after teams squeeze value from prompting and RAG. Within OpenAI’s platform, three fine-tuning approaches are contrasted: supervised fine-tuning (prompt/answer pairs), preference fine-tuning (better vs. worse examples), and reinforcement fine-tuning, which relies on graders to score outputs. RFT is highlighted as data efficient—often requiring only tens to hundreds of examples—because each training example yields many sampled reasoning paths and thus a richer learning signal than a single labeled target.
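
To make the contrast concrete, here is a rough sketch of what one training record looks like for each approach. These are illustrative shapes only, not the platform's exact file schemas, and field names such as `reference_labels` are assumptions:

```python
# Illustrative record shapes for the three fine-tuning approaches (not exact
# platform schemas; field names like "reference_labels" are assumptions).

sft_example = {  # supervised fine-tuning: a prompt paired with the desired answer
    "messages": [
        {"role": "user", "content": "Classify this regulation into EuroVoc categories ..."},
        {"role": "assistant", "content": '{"labels": ["finance", "trade"]}'},
    ]
}

preference_example = {  # preference fine-tuning: a better vs. worse completion for the same prompt
    "input": {"messages": [{"role": "user", "content": "Classify this regulation ..."}]},
    "preferred_output": [{"role": "assistant", "content": '{"labels": ["finance", "trade"]}'}],
    "non_preferred_output": [{"role": "assistant", "content": '{"labels": ["environment"]}'}],
}

rft_example = {  # reinforcement fine-tuning: a prompt plus reference data the grader scores against
    "messages": [{"role": "user", "content": "Classify this regulation ..."}],
    "reference_labels": ["finance", "trade"],  # read by the grader, never shown to the model
}
```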

A live walkthrough demonstrates how to set up an RFT task using a multilingual legal classification problem based on EuroVoc level-one categories. The dataset contains 7,000 samples across 23 languages, with each text assigned to one or more of 21 broad thematic classes. The demo emphasizes that data quality and sampling strategy matter: training on imbalanced label distributions can lead to “reward hacking,” where the model exploits skewed data by over-predicting frequent categories. To counter this, the workflow includes balanced sampling for the training split.
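
One simple way to build such a balanced split is to cap how many rows each label can contribute. The sketch below is a minimal example under that assumption; the demo's actual sampling logic is not shown in detail, so treat the per-label cap and the helper name as illustrative:

```python
import random
from collections import defaultdict

def balanced_sample(rows, labels_key="labels", per_label=25, seed=0):
    """Pick at most `per_label` rows for each category so that frequent labels
    cannot dominate the training split (one way to reduce reward hacking)."""
    random.seed(seed)
    by_label = defaultdict(list)
    for row in rows:
        for label in row[labels_key]:
            by_label[label].append(row)

    chosen, seen = [], set()
    for label, candidates in by_label.items():
        random.shuffle(candidates)
        for row in candidates[:per_label]:
            if id(row) not in seen:  # a multi-label row should only be added once
                seen.add(id(row))
                chosen.append(row)
    random.shuffle(chosen)
    return chosen
```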

Evaluation centers on precision, recall, and a single composite F1 score because the reinforcement system needs one grade per sample. Precision measures how many predicted labels are correct; recall measures how many expected labels were recovered. F1 is the harmonic mean of the two, so the weaker metric dominates the score, and the session notes that alternative composites like F2 can be used when recall should be weighted more heavily.
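
For a single multi-label sample, these metrics reduce to simple set arithmetic. The sketch below assumes set-based counting and a zero score when nothing matches; the session's exact edge-case conventions may differ:

```python
def precision_recall_fbeta(predicted, expected, beta=1.0):
    """Set-based precision/recall for one sample, combined into a single
    F-beta score (beta=1 gives F1; beta=2 weights recall more heavily)."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0, 1.0, 1.0  # nothing expected and nothing predicted
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta

# Two of three predicted labels are correct; two of four expected labels are recovered.
print(precision_recall_fbeta(["finance", "trade", "energy"],
                             ["finance", "trade", "environment", "law"]))
# -> approximately (0.667, 0.5, 0.571)
```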

Under the hood, the grader is implemented as executable Python scoring code (with edge-case handling to keep scores valid). Prompt optimization is treated as critical because the same prompt structure used during training should match inference-time behavior. The demo uses structured outputs to guarantee consistent response formatting, enabling reliable grader application. It also stresses “apples to apples” evaluation by reusing the same grader objects across the evaluation platform and the RFT training pipeline.
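
A grader along those lines might look like the sketch below. The function name, the `sample["output_text"]` field, and the `item["reference_labels"]` field are assumptions about the grader interface rather than the exact code shown in the session; the point is the edge-case handling that keeps every score a valid number in [0, 1]:

```python
import json

def grade(sample: dict, item: dict) -> float:
    """Hypothetical grader body: parse the model's structured output, compare
    predicted labels against the reference labels, and return an F1 score."""
    try:
        predicted = set(json.loads(sample["output_text"])["labels"])
    except (KeyError, TypeError, ValueError):
        return 0.0  # malformed or missing output earns no reward
    expected = set(item["reference_labels"])
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```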

The training process is monitored through reward curves and variance studies. Multiple runs per sample reveal how stochastic outputs affect learning signal; the system learns to pull the mean score upward toward the best observed outcomes. Additional charts track precision/recall trade-offs across checkpoints, reasoning-token behavior, and latency and grading errors; these diagnostics help detect overfitting, guide checkpoint selection, and weigh cost-performance trade-offs. A comparison against baseline models (including GPT-4.1 and o4-mini) shows the fine-tuned model improving both precision and recall, yielding a higher F1 score.

A customer spotlight from Accordance extends the theme to real-world tax strategy and optimization. Their approach emphasizes task selection (reasoning-heavy, objectively gradable problems), careful data curation (small, high-quality subsets matter because individual low-quality rows can harm RFT), and grader design that is continuous and stratified to avoid rewarding superficial guessing. They report over 40% improvements on an industry evaluation set called TaxBench.

The Q&A distills the requirements: RFT works best with reasoning models, formal rubrics that can produce continuous rewards, and nonzero baseline performance. Noisy data is addressed by cleaning and clarifying the task environment. Finally, cost and latency trade-offs are acknowledged: RFT can be used to extract frontier-level performance from cheaper models, but the number of reasoning tokens and the overhead of training and experiments can be difficult to control, making production volume a key factor in whether the investment pays off.

Cornell Notes

Reinforcement fine-tuning (RFT) improves an LLM’s reasoning when facts are already present but the model still applies them incorrectly. It trains using a grader (rubric) that scores multiple sampled outputs for the same input, turning each example into many learning trajectories. The workflow is built around careful task selection, prompt optimization, structured outputs for reliable grading, and evaluation using the same grader across both benchmarking and training. In the demo, a EuroVoc legal classification task uses precision/recall and an F1-based single reward, plus balanced sampling to prevent reward hacking. The result is higher F1 than baseline models, with training diagnostics (reward curves, variance, and precision/recall breakdowns) guiding checkpoint and iteration choices.

Why does RFT matter when supervised or preference fine-tuning already exist?

RFT targets reasoning quality rather than just output style or labeled correctness. Instead of learning from fixed prompt/answer pairs or “better vs. worse” examples, it repeatedly samples different reasoning paths for the same input, grades each candidate with a rubric, and updates the model to favor higher-scoring reasoning trajectories. That makes it especially suited to policy compliance, legal reasoning, and medical workflows where the correctness depends on the reasoning process, not only the final text.

What makes a task a good candidate for RFT?

A task needs (1) a reasoning model baseline that performs nontrivially (not near-zero), and (2) a formal rubric that can grade outputs in a way that is continuous and stratified—so the model can learn from gradations rather than a binary pass/fail. The session also stresses objective correctness: experts should agree on the right output or at least on a reliable scoring scheme.
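
As a toy illustration of “continuous and stratified,” a rubric might award partial credit in tiers rather than a single pass/fail bit. The tiers and weights below are invented for illustration, not taken from the session:

```python
def stratified_grade(predicted: set, expected: set) -> float:
    """Hypothetical stratified rubric: near-misses still earn partial credit,
    so the model gets a gradient to learn from instead of a pass/fail cliff."""
    if not expected:
        return 0.0
    if predicted == expected:
        return 1.0                                   # exact match
    overlap = len(predicted & expected)
    if overlap == len(expected):
        return 0.75                                  # everything found, but extra labels predicted
    if overlap > 0:
        return 0.25 + 0.5 * overlap / len(expected)  # partial recovery, scaled
    return 0.0                                       # nothing correct
```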

How do precision, recall, and F1 connect to reinforcement fine-tuning?

For multi-label classification, precision measures the correctness of predicted labels, while recall measures whether expected labels were recovered. RFT requires a single grade per training sample, so F1 is used as a composite score that balances precision and recall (with emphasis on the lower one). The session notes that other composites like F2 can be used if recall should be weighted more heavily.

What is “reward hacking,” and how does the demo prevent it?

Reward hacking happens when the model exploits weaknesses in the reward signal—often by learning dataset biases. In the classification demo, label imbalance could let the model boost scores by over-predicting frequent categories without learning generalizable reasoning. The workflow counters this by using balanced sampling for the training split so the model must learn across categories rather than gaming skewed distributions.

Why are structured outputs and consistent prompts emphasized?

Structured outputs enforce a predictable response schema so graders can parse results reliably and score them without formatting failures. Consistent prompt design matters because the prompt used during training should match inference-time prompting; changing prompt structure later can break the learned behavior. Together, they make grader scoring stable and comparable across evaluation and training.
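
A minimal way to pin down the response schema is a small Pydantic (v2) model; this is an assumption here, since the session may have defined the JSON schema directly. The same schema object can then be reused at training and inference time:

```python
from pydantic import BaseModel

class EurovocPrediction(BaseModel):
    """Hypothetical response schema: the model must return exactly this shape,
    so the grader can parse every answer without formatting failures."""
    labels: list[str]  # predicted EuroVoc level-one categories

# The derived JSON Schema can be attached to any request that supports
# structured outputs, and reused verbatim for both evaluation and training.
print(EurovocPrediction.model_json_schema())
```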

How do variance studies and reward curves guide training decisions?

Because model outputs are stochastic, the same sample can score differently across runs. Variance plots show the mean score, the best (max) score achieved, and the spread across multiple runs; this reveals whether there is headroom to learn and whether learning is pulling the mean toward better reasoning. Overfitting is flagged when training reward rises while validation reward stays flat. Precision/recall breakdowns across checkpoints help select the checkpoint that matches the desired trade-off.
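
The same idea can be reproduced offline by scoring each sample several times and summarizing the spread. The sketch below assumes the per-run grades have already been collected into a dictionary; the names and numbers are illustrative:

```python
import statistics

def summarize_runs(scores_by_sample: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """For each sample, report the mean (current performance), the max (the
    headroom RFT tries to pull the mean toward), and the spread across runs."""
    return {
        sample_id: {
            "mean": statistics.mean(scores),
            "max": max(scores),
            "stdev": statistics.pstdev(scores),
        }
        for sample_id, scores in scores_by_sample.items()
    }

# One sample already reaches a perfect score on some runs -> clear headroom to learn.
print(summarize_runs({"doc-17": [0.40, 0.66, 1.00, 0.50],
                      "doc-42": [0.20, 0.20, 0.25, 0.30]}))
```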

Review Questions

  1. What conditions must be true about a task and its grader for RFT to work well, and why does binary grading increase the risk of poor learning?
  2. In the EuroVoc classification demo, how do balanced sampling and F1-based grading jointly reduce reward hacking and make the reward signal learnable?
  3. How do variance and checkpoint reward curves change the way you decide whether RFT is improving generalization rather than just memorizing training batches?

Key Points

  1. RFT is best used to improve reasoning behavior when the model already has relevant knowledge but still fails to apply it correctly.

  2. Fine-tuning should come after maximizing prompting and RAG, since RFT is an investment that requires careful setup.

  3. A grader must be objective, continuous, and stratified enough to distinguish good reasoning from merely lucky guesses; binary rewards can encourage reward hacking.

  4. Balanced training sampling helps prevent the model from exploiting label skew and inflating scores without learning generalizable reasoning.

  5. Structured output schemas and consistent prompts make grader scoring reliable and keep training/inference behavior aligned.

  6. Evaluation and training should reuse the exact same grader logic to ensure performance comparisons are valid (apples-to-apples).

  7. Training diagnostics (reward curves, validation trends, variance across runs, and precision/recall breakdowns) are essential for checkpoint selection and detecting overfitting.

Highlights

RFT turns each training example into many learning trajectories by sampling multiple reasoning paths and grading them, making it data efficient.
Balanced sampling is used to avoid reward hacking where a model can game skewed label distributions instead of learning generalizable reasoning.
F1 is used as a single reinforcement grade because RFT needs one score per sample, even when precision and recall are tracked separately.
Variance across repeated runs reveals headroom: the model can sometimes hit perfect scores for a sample, and training aims to raise the mean toward those best outcomes.
Accordance reports over 40% improvements on TaxBench by using RFT with continuous, stratified graders designed to reward correct reasoning rather than superficial correctness.
