
Large Scale Reward Modeling | Jonathan Ward | OpenAI Scholars Demo Day 2021

OpenAI · 5 min read

Based on OpenAI's video on YouTube.

TL;DR

Reward modeling for informal tasks can be trained from internet preference signals such as Reddit upvotes, avoiding expensive interactive labeling.

Briefing

Large-scale reward modeling can be trained from cheap, naturally occurring preference signals on the internet—without the costly, researcher-guided feedback loops used in earlier work—and it can still generalize to unseen comparisons. Jonathan Ward’s demo centers on learning what makes one response “better” than another in an informal, hard-to-define domain: creative writing. Instead of formal scoring functions, the approach treats human preference as a learnable pattern extracted from public ratings, then uses that learned reward model to steer a story-writing system.

The core setup uses Reddit’s r/writingprompts, where writers submit short story responses to a prompt and the community votes them up or down. Those vote totals become an aggregate proxy for preference. Ward trains three linked models: a generative model that produces story responses from prompts, an evaluative model that takes a prompt plus two candidate responses and predicts which one is preferred, and an agent model that starts from the generative model and is trained further on the evaluative model’s pairwise judgments. This sequence mirrors how a game-playing agent can learn to act by optimizing against a learned reward signal.
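To make the pipeline concrete, here is a minimal sketch of how the three pieces could fit together. The `Generator` and `RewardModel` interfaces, the method names, and the baseline-comparison reward are illustrative assumptions, not details from the talk.

```python
import random
from typing import Protocol

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...
    def clone(self) -> "Generator": ...
    def update(self, prompt: str, story: str, reward: float) -> None: ...

class RewardModel(Protocol):
    def preference(self, prompt: str, story_a: str, story_b: str) -> float:
        """Probability that story_a is preferred over story_b."""
        ...

def train_agent(generator: Generator, reward_model: RewardModel,
                prompts: list[str], num_steps: int) -> Generator:
    """Fine-tune a copy of the generative model against the learned reward."""
    agent = generator.clone()                  # warm-start from the generative model
    for _ in range(num_steps):
        prompt = random.choice(prompts)
        story_a = agent.generate(prompt)       # candidate from the agent
        story_b = generator.generate(prompt)   # baseline from the frozen generator
        # The evaluative model supplies the training signal: reward is high
        # when the agent's story is judged better than the baseline.
        reward = reward_model.preference(prompt, story_a, story_b)
        agent.update(prompt, story_a, reward)  # e.g. a policy-gradient step
    return agent
```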

The key empirical question is whether the reward model captures genuine preference rather than superficial shortcuts. Ward reports a hard test designed to reduce confounds such as response length and response timing. The evaluation filters to comparisons where candidate responses are roughly similar in length and created around the same time, aiming to prevent the model from learning that “longer” or “faster” automatically wins. Under this controlled test, the reward model reaches 74.2% accuracy on held-out comparisons.

Ward contextualizes that number by showing how accuracy changes with training set size and model capacity. Larger models learn faster, and all model sizes trend toward a similar performance ceiling near 74.2%. Increasing the number of training examples continues to help, especially for smaller models, but gains diminish as performance saturates. The result suggests two constraints: model capacity governs learning efficiency, while the ultimate ceiling may be set by noise in human preference labels and the inherent variability of what people like.

The demo also points to a next step: transfer across tasks and communities. If reward models trained on many subreddits can be tested on new, unseen writing tasks, they would better match real-world settings where feedback is sparse and expensive. Ward frames internet feedback as a form of “pre-training” (broad, abundant signals) and interactive human feedback as “fine-tuning” (carefully curated, expertise-driven corrections).

A cautionary note follows: Reddit’s user base is not globally representative, so bias is unavoidable if training relies on a single platform. The proposed remedy is balancing internet signals with additional, more curated datasets, potentially including expert-written examples and broader demographic coverage, to make learned notions of “good” less parochial.

Cornell Notes

Reward modeling for informal human preferences can be learned from inexpensive internet signals. Using Reddit’s r/writingprompts, Ward trains an evaluative model that predicts which of two story responses is preferred, then uses it to further train a story-generating agent. The reward model is tested on a “hard” set designed to remove shortcuts tied to response length and timing, and it achieves 74.2% accuracy on unseen comparisons. Accuracy improves with more data and larger models, but performance converges toward a ceiling near 74.2%, suggesting label noise and preference variability. The next direction is transfer across subreddits and tasks, while addressing platform bias by combining internet feedback with more balanced, curated datasets.

How does the project turn Reddit activity into a reward signal for training?

Community voting on r/writingprompts provides pairwise preference data. Each writing prompt receives multiple story responses, and the resulting upvote/downvote totals act as an aggregate proxy for which response is preferred. From this, the evaluative model is trained to take a prompt plus two candidate responses and output which one is better, effectively learning a reward function from public ratings.
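As a rough illustration, the comparison dataset might be assembled as below. The record layout (a mapping from each prompt to `(response_text, net_votes)` tuples) is an assumption about the scraped data, not something specified in the talk.

```python
from itertools import combinations

def build_comparisons(records: dict[str, list[tuple[str, int]]]) -> list[dict]:
    """Turn vote totals into pairwise preference examples."""
    comparisons = []
    for prompt, responses in records.items():
        for (text_a, votes_a), (text_b, votes_b) in combinations(responses, 2):
            if votes_a == votes_b:
                continue  # ties carry no preference signal
            winner, loser = ((text_a, text_b) if votes_a > votes_b
                             else (text_b, text_a))
            comparisons.append({"prompt": prompt,
                                "chosen": winner,
                                "rejected": loser})
    return comparisons
```

Pairing every response against every other response to the same prompt multiplies the effective training signal: a prompt with ten responses yields up to 45 comparisons.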

What are the three models in the system, and what role does each play?

First, a generative model maps writing prompts to story responses, analogous to a writer producing text. Second, an evaluative model predicts preference between two responses given the same prompt, analogous to a lurker who votes. Third, an agent model starts from a copy of the generative model and is trained further using feedback from the evaluative model’s pairwise judgments, so the agent learns to produce stories that the reward model prefers.
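The talk doesn’t spell out the evaluative model’s training objective, but a standard choice for pairwise preference learning is a Bradley-Terry style loss over scalar scores. The `score_model` callable below is a placeholder assumption for a network mapping a (prompt, response) pair to a scalar.

```python
import torch.nn.functional as F

def pairwise_loss(score_model, prompt, chosen, rejected):
    """Push the preferred response's scalar score above the other's."""
    s_chosen = score_model(prompt, chosen)      # scalar score for the preferred text
    s_rejected = score_model(prompt, rejected)  # scalar score for the other text
    # -log sigmoid(s_chosen - s_rejected) is minimized when the
    # preferred response scores strictly higher.
    return -F.logsigmoid(s_chosen - s_rejected).mean()
```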

Why was the test set designed to be “hard,” and what confounds were targeted?

The evaluation aimed to ensure the reward model learned preference rather than surface artifacts. Ward removed confounders by filtering to comparisons where candidate responses were roughly similar in length and created around the same time, reducing the chance that the model could succeed simply by favoring longer or earlier-posted responses. Explicit text was also filtered out during preparation, reflecting additional effort to remove misleading signals.
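A minimal sketch of that kind of filter follows; the length tolerance and time window are hypothetical values, not the thresholds Ward actually used.

```python
def is_hard_pair(resp_a: dict, resp_b: dict,
                 max_len_ratio: float = 1.2,           # hypothetical tolerance
                 max_time_gap_s: int = 3600) -> bool:  # hypothetical 1-hour window
    """Keep only comparisons where neither length nor posting time
    can act as a shortcut. Each resp_* dict carries 'text' and
    'created_utc'; the field names are assumptions."""
    len_a, len_b = len(resp_a["text"]), len(resp_b["text"])
    similar_length = max(len_a, len_b) <= max_len_ratio * min(len_a, len_b)
    close_in_time = abs(resp_a["created_utc"] - resp_b["created_utc"]) <= max_time_gap_s
    return similar_length and close_in_time
```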

What does the 74.2% accuracy result mean in context?

The reward model’s 74.2% accuracy is measured on held-out comparisons it hadn’t seen during training. Ward argues the ceiling likely reflects both preference-label noise (people disagree and preferences vary over time) and possibly model limitations. The learning curves show larger models learn faster but converge toward the same approximate performance level, reinforcing the idea that noise in labels may dominate the upper bound.
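One way to see why noisy labels cap accuracy: if a held-out label disagrees with the underlying “true” preference with probability ε, even a perfect predictor scores only 1 − ε. The back-of-the-envelope reading below is an inference from the reported plateau, not a figure from the talk.

```python
# If evaluation labels are "flipped" with probability eps, the best
# achievable accuracy is 1 - eps, no matter how large the model is.
ceiling = 0.742              # observed accuracy plateau from the talk
implied_noise = 1 - ceiling  # flip rate consistent with that ceiling
print(f"a {ceiling:.1%} ceiling is consistent with ~{implied_noise:.0%} label noise")
```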

How do model size and training data affect performance, according to the learning curves?

Accuracy rises as training set size increases, with the biggest improvements appearing for smaller models. Larger models (including GPT-2 XL, described as the largest trained) learn faster and reach the saturation region sooner. However, all model sizes trend toward a similar plateau near 74.2%, indicating diminishing returns from more data once the model has extracted the available signal.

What bias concern was raised, and what mitigation strategy was proposed?

An audience question asked how to eliminate the bias that comes from Reddit. The response was that bias can’t be fully removed if training relies on a single, skewed platform. The suggested approach is balancing internet feedback with other datasets, potentially curated and more representative, so the learned reward reflects broader preferences rather than Reddit’s particular culture.

Review Questions

  1. What training objective does the evaluative model use, and how is it constructed from Reddit voting data?
  2. Which specific confounds were filtered out to prevent the reward model from exploiting shortcuts like length or timing?
  3. Why might reward-model accuracy converge to a ceiling even as model size increases?

Key Points

  1. Reward modeling for informal tasks can be trained from internet preference signals such as Reddit upvotes, avoiding expensive interactive labeling.

  2. A three-stage pipeline links a generative story model, a pairwise evaluative reward model, and an agent model trained to optimize against that reward.

  3. Generalization is assessed on held-out comparisons using a hard test set that filters out confounds tied to response length and response timing.

  4. Learning curves show larger models learn faster, but performance saturates near 74.2% accuracy, suggesting a limit from noisy preference labels.

  5. Internet feedback is positioned as broad “pre-training,” while contractor-based interactive feedback functions like targeted “fine-tuning.”

  6. Platform bias is unavoidable when relying on a single community; mitigation requires balancing internet data with curated, more representative datasets.

  7. A next research step is transfer across subreddits and tasks to build reward models that work when feedback is sparse and expensive.

Highlights

Reddit voting on r/writingprompts can be converted into a pairwise preference dataset to train a reward model for creative writing.
The reward model’s 74.2% accuracy comes from a hard evaluation designed to block shortcuts based on response length and timing.
All model sizes trend toward a similar accuracy plateau near 74.2%, pointing to label noise and preference variability as a likely bottleneck.
The proposed roadmap treats internet feedback like pre-training and interactive human feedback like fine-tuning, while warning that Reddit is not globally representative.

Topics

Mentioned

  • Jonathan Ward
  • John Schulman