
Reinforcement Fine-Tuning—12 Days of OpenAI: Day 2

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

Reinforcement fine-tuning (RFT) trains o1 models to improve domain-specific reasoning by grading generated outputs and using reinforcement learning to reinforce correct reasoning paths.

Briefing

OpenAI is previewing reinforcement fine-tuning for its o1 model family—an approach that lets developers and researchers adapt models to specialized tasks using reinforcement learning rather than simple “copy the examples” training. The key promise: with a small set of domain examples and a scoring system, o1 can learn to reason in ways tailored to a user’s data and objectives, producing better task performance and more useful ranked outputs.

Unlike supervised fine-tuning, which teaches a model to mimic patterns in input text or images (often to change tone, style, or formatting), reinforcement fine-tuning trains the model to improve its decision-making. During training, the model is given a problem and allowed time to think, then its final answer is graded. Correct reasoning paths are reinforced, while incorrect ones are discouraged. OpenAI frames this as a technique that builds on reinforcement learning methods used internally to train frontier models, including the o1 series.

The practical demonstration centers on computational genetics for rare diseases. Justin Rhee of Lawrence Berkeley National Laboratory describes a major challenge in the field: each rare genetic condition affects relatively few people, but collectively rare diseases impact roughly 300 million people worldwide and often require months or years to diagnose. His team works with curated case-report data extracted from hundreds of scientific publications, including patient symptoms, absent symptoms (what is not present), and the causative gene known from the case report. The task is to infer which gene could be responsible given an incomplete symptom profile.

In the live walkthrough, OpenAI starts from o1 mini and trains a reinforcement fine-tuned variant on about 1,100 training examples. Each training record includes a case description (age and symptom list), an explicit list of absent symptoms to help rule out likely genes, and instructions asking the model to output a ranked list of candidate genes plus an explanation. The correct gene is used internally for grading and is not shown to the model during generation.
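For concreteness, one training record might look like the JSONL line below. The field names and the gene are illustrative assumptions, not OpenAI's actual RFT schema:

```python
import json

# Hypothetical shape of one training record; field names are illustrative
# assumptions, not OpenAI's documented RFT schema.
record = {
    "case_report": "4-year-old male with macrocephaly, hypotonia, "
                   "and developmental delay.",
    "absent_symptoms": ["seizures", "hearing loss"],
    "instructions": "Return a ranked list of candidate causative genes, "
                    "most likely first, with a brief explanation.",
    # Used only by the grader; never shown to the model during generation.
    "correct_answer": "PTEN",
}
print(json.dumps(record))  # one line of the JSONL training file
```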

Grading is handled by “graders” that compare the model’s ranked gene list to the known correct answer and return a score from 0 to 1, with partial credit when the correct gene appears lower in the ranking. A validation set is prepared in the same format but with no overlap in correct genes between training and validation, aiming to test generalization rather than memorization.

Results are tracked through validation reward and evaluation metrics such as Top-1 (correct gene is first), Top-5 (correct gene appears in the top five), and Top-Max (correct gene appears anywhere in the list). OpenAI reports that the base o1 mini improves from roughly 17% Top-1 to about 31% after reinforcement fine-tuning, while also showing gains across Top-5 and Top-Max. Example outputs highlight not just accuracy improvements but also the model’s reasoning—linking symptom combinations to likely causal genes and ranking alternatives when the top prediction is not the only plausible candidate.
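As a rough illustration of how these ranking metrics work, here is a minimal sketch; the function and toy data are assumptions for the example, not OpenAI's evaluation code:

```python
def top_k_accuracy(ranked_lists, correct_genes, k=None):
    """Fraction of cases where the correct gene appears in the top k
    of the model's ranked list; k=None scans the whole list (Top-Max)."""
    hits = 0
    for ranking, gene in zip(ranked_lists, correct_genes):
        window = ranking if k is None else ranking[:k]
        hits += gene in window
    return hits / len(correct_genes)

# Toy evaluation set with two cases.
preds = [["FOXG1", "MECP2", "CDKL5"], ["PTEN", "TSC1"]]
truth = ["MECP2", "PTEN"]
print(top_k_accuracy(preds, truth, k=1))  # Top-1  -> 0.5
print(top_k_accuracy(preds, truth, k=5))  # Top-5  -> 1.0
print(top_k_accuracy(preds, truth))       # Top-Max -> 1.0
```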

OpenAI positions reinforcement fine-tuning as a general technique with early signs of value across legal, AI safety, healthcare, and bioinformatics. The company plans a public launch early next year, while expanding access in the meantime through a “reinforcement fine-tuning research program” for universities, researchers, and enterprises working on complex, expert-driven tasks.

Cornell Notes

Reinforcement fine-tuning (RFT) for OpenAI’s o1 models trains the system to improve task-specific reasoning using reinforcement learning and graded outputs. Instead of teaching the model to imitate example text, RFT lets the model think, then scores its final ranked answer against a known correct target (with partial credit). In a genetics case study, OpenAI fine-tunes o1 mini on curated rare-disease case reports where inputs include present symptoms and absent symptoms, and the target is the causative gene. Using about 1,100 training examples and a validation set designed to prevent gene overlap, the fine-tuned model shows higher validation reward and better ranking metrics (Top-1, Top-5, Top-Max). The approach matters because it can turn “golden data” into specialized AI assistants that generalize beyond memorizing symptom-to-gene mappings.

How does reinforcement fine-tuning differ from supervised fine-tuning for o1 models?

Supervised fine-tuning focuses on making the model replicate patterns found in training inputs—useful for changing tone, style, or response format. Reinforcement fine-tuning instead trains the model to reason toward correct outcomes in a domain: the model is prompted with a problem, allowed to think, then its output is graded. Reinforcement learning reinforces reasoning paths that lead to correct answers and discourages those that lead to incorrect ones, enabling new reasoning behaviors over custom domains.
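To make the contrast concrete, here is a runnable toy: a softmax “policy” over three candidate answers is updated by a 0-or-1 reward rather than by imitating a reference answer. This is a deliberately simplified REINFORCE-style sketch, not OpenAI's actual training procedure:

```python
import math, random

# Toy policy: logits over three candidate answers, trained from reward alone.
logits = {"GENE_A": 0.0, "GENE_B": 0.0, "GENE_C": 0.0}
correct = "GENE_B"
lr = 0.5

def sample(logits):
    """Draw one answer from the softmax distribution over the logits."""
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for k, v in logits.items():
        acc += math.exp(v) / z
        if r <= acc:
            return k
    return k

for _ in range(200):
    answer = sample(logits)
    reward = 1.0 if answer == correct else 0.0  # grader stands in here
    # Policy-gradient update: raise the log-prob of rewarded answers.
    z = sum(math.exp(v) for v in logits.values())
    for k in logits:
        p = math.exp(logits[k]) / z
        grad = (1.0 if k == answer else 0.0) - p
        logits[k] += lr * reward * grad

print(max(logits, key=logits.get))  # converges toward GENE_B
```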

What does a single training example look like in the rare-disease gene prediction demo?

Each example includes (1) a case report describing the patient and symptoms (including a list of present symptoms), (2) a list of absent symptoms to help rule out genes the model might otherwise choose, (3) instructions prompting the model to output a ranked list of candidate genes plus an explanation, and (4) a correct answer gene used internally for grading rather than shown to the model during generation.

What role do graders play, and how do they score ranked gene outputs?

Graders compare the model’s ranked list to the known correct gene and return a score between 0 and 1. If the correct gene is first, the score is 1; if it appears lower in the ranking, the score decays gradually toward 0. OpenAI also mentions using a collection of graders to cover different task intents and suggests that custom graders may be possible later (e.g., via user-provided code).
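One plausible shape for such a grader is sketched below; the video does not specify the decay schedule, so the 1/rank falloff here is an assumption:

```python
def grade_ranked_list(ranked_genes, correct_gene):
    """Return a score in [0, 1]: 1.0 if the correct gene is ranked first,
    decaying toward 0 the lower it appears; 0.0 if it is missing entirely.
    The 1/rank decay is an illustrative assumption."""
    if correct_gene not in ranked_genes:
        return 0.0
    rank = ranked_genes.index(correct_gene) + 1  # 1-based position
    return 1.0 / rank

print(grade_ranked_list(["FBN1", "TGFBR2"], "FBN1"))    # 1.0
print(grade_ranked_list(["FBN1", "TGFBR2"], "TGFBR2"))  # 0.5
print(grade_ranked_list(["FBN1"], "TGFBR2"))            # 0.0
```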

Why does the validation setup matter for measuring real learning?

The validation dataset uses the same input format but is constructed so there is no overlap in correct genes between training and validation. That design reduces the chance the model can “cheat” by memorizing a symptom-to-gene mapping from the training set, forcing generalization to new genes.
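A gene-disjoint split like this is straightforward to construct. Here is a sketch, assuming records shaped like the hypothetical JSONL example above:

```python
import random

def gene_disjoint_split(records, val_fraction=0.15, seed=0):
    """Split case records so no correct gene appears in both sets,
    preventing the model from memorizing symptom-to-gene mappings."""
    genes = sorted({r["correct_answer"] for r in records})
    random.Random(seed).shuffle(genes)
    val_genes = set(genes[: int(len(genes) * val_fraction)])
    train = [r for r in records if r["correct_answer"] not in val_genes]
    val = [r for r in records if r["correct_answer"] in val_genes]
    return train, val
```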

Which metrics show improvement after reinforcement fine-tuning in the demo?

Evaluation uses Top-1 (correct gene is the first item), Top-5 (correct gene appears within the top five), and Top-Max (correct gene appears anywhere in the list). OpenAI reports that base o1 mini starts around 17% Top-1 on the ~200-example evaluation set, while the reinforcement fine-tuned o1 mini reaches about 31% Top-1, with additional gains reflected in Top-5 and Top-Max.

How is the approach expected to fit into real-world workflows beyond genetics?

The discussion frames RFT as a general technique for domains requiring deep expertise—legal, finance, engineering, insurance, and healthcare. The near-term expectation is often a hybrid workflow: combine existing bioinformatics or domain tools with o1 plus reinforcement fine-tuning so the model can handle incomplete inputs and produce ranked, explainable outputs that better match domain objectives.

Review Questions

  1. What mechanism in reinforcement fine-tuning replaces the “mimic the examples” goal of supervised fine-tuning?
  2. In the gene prediction task, how do absent symptoms and ranked outputs affect what the model must learn?
  3. Why is preventing overlap of correct genes between training and validation important for interpreting the results?

Key Points

  1. Reinforcement fine-tuning (RFT) trains o1 models to improve domain-specific reasoning by grading generated outputs and using reinforcement learning to reinforce correct reasoning paths.

  2. RFT differs from supervised fine-tuning by optimizing for correct outcomes and reasoning behavior rather than copying patterns in training examples.

  3. The rare-disease demo uses case reports with both present and absent symptoms, and it trains the model to output a ranked list of candidate causative genes plus an explanation.

  4. Grading is implemented via “graders” that score ranked outputs from 0 to 1 with partial credit depending on where the correct gene appears.

  5. A validation set with no overlap in correct genes is used to test generalization and reduce memorization risk.

  6. OpenAI reports improved ranking metrics after RFT on o1 mini, including a jump in Top-1 performance (about 17% to about 31%) on the evaluated dataset.

  7. Reinforcement fine-tuning is planned for public launch early next year, with interim access via a research program for organizations working on complex expert tasks.

Highlights

  • RFT trains o1 to think, then improves based on graded outcomes—reinforcing reasoning that leads to correct answers and discouraging reasoning that leads to wrong ones.
  • In the genetics case study, absent symptoms are explicitly included so the model learns to rule out genes, not just match symptoms.
  • A validation design with no overlap in correct genes helps demonstrate generalization rather than memorization.
  • Top-1 accuracy for o1 mini rises to about 31% after reinforcement fine-tuning in the gene-ranking task.
  • OpenAI ties the technique to its internal reinforcement learning approach used for frontier models, including the o1 series.

Topics

Mentioned

  • Mark
  • John Allard
  • Julie W
  • Justin Rhee
  • o1
  • RFT
  • API
  • Top-1
  • Top-5
  • Top-Max
  • AI
  • PhD
  • JSONL
  • Gen
  • US