
OpenAI Scholars Demo Day 2019

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Discount factor γ in DQN can affect both intertemporal preferences and the bias/confidence of bootstrapped value estimates, so “high γ” can be harmful in dense-reward settings.

Briefing

OpenAI Scholars Demo Day 2019 showcased how machine learning research ideas—from reinforcement learning and language modeling to model compression and interpretability—can be turned into working prototypes in a short, mentored sprint. Across eight presentations, the central throughline was practical: each project tackled a concrete problem (learning from sparse signals, making decisions under uncertainty, compressing large models, or diagnosing what generative models produce) and then tested whether the approach actually improved outcomes.

One of the most technical talks focused on reinforcement learning’s discount factor, γ, in deep Q-networks (DQN). In theory, Blackwell optimality suggests that once γ is sufficiently high, the optimal policy no longer changes, so pushing γ toward 1 should be safe. Experiments in gridworld-style tasks complicated that picture: in sparse-reward settings, higher γ (e.g., 0.99) behaved as expected, but in dense-reward environments the highest γ performed worse than mid-range values. The proposed explanation was that γ plays a dual role: it encodes intertemporal preferences, and it also governs how much the algorithm “trusts” bootstrapped value estimates from function approximation, in effect weighting past information. To repair the mismatch, the project introduced a time-varying “myopic” schedule: start with a lower γ for an initial fraction of training, then ramp to the target γ. This simple change improved learning in dense environments and could still reach optimal performance in sparse environments, albeit sometimes more slowly. Follow-up experiments suggested the gains were not primarily due to extra exploration; instead, the schedule helped mitigate bias and improved convergence.
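A minimal sketch of such a schedule, assuming a linear ramp; the start value, ramp fraction, and function name are illustrative choices, not details confirmed in the talk:

```python
def myopic_gamma(step, total_steps, gamma_target=0.99,
                 gamma_start=0.5, ramp_fraction=0.1):
    """Discount factor to use at a given training step.

    For the first `ramp_fraction` of training, gamma ramps linearly from
    `gamma_start` to `gamma_target`; afterwards it stays at the target.
    """
    ramp_steps = max(int(ramp_fraction * total_steps), 1)
    if step >= ramp_steps:
        return gamma_target
    return gamma_start + (step / ramp_steps) * (gamma_target - gamma_start)
```

The schedule would simply replace the fixed γ wherever the TD target is formed, e.g. `target = reward + myopic_gamma(step, total_steps) * next_q_max` for non-terminal transitions.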

Another reinforcement learning project tackled robotics without dense external rewards by using intrinsic motivation. The method trained a dynamics model and used its prediction error as an intrinsic reward, pushing the agent toward states it hadn’t mastered yet. In OpenAI’s Fetch-style robotics environments (reaching, pushing, pick-and-place, sliding), the baseline PPO agent struggled under sparse rewards for harder tasks, while intrinsic rewards enabled rapid learning. Hyperparameter sensitivity mattered—especially the learning rate of the dynamics model—and early resets after success boosted performance. The results suggested intrinsic prediction error can act as a robust exploration bonus for real control problems.
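A minimal sketch of the intrinsic-reward computation, assuming a small feed-forward dynamics model in PyTorch; the architecture, error metric, and names are illustrative stand-ins rather than the project’s exact setup:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    """Mean squared prediction error of the dynamics model.

    Larger errors indicate states the model has not yet learned to
    predict, so using the error as reward pushes the agent toward them.
    """
    with torch.no_grad():
        predicted = model(state, action)
    return ((predicted - next_state) ** 2).mean(dim=-1)
```

In use, the dynamics model is trained on the same transitions the agent collects (its learning rate was reported as a particularly sensitive hyperparameter), and PPO optimizes the intrinsic reward, optionally added to the sparse extrinsic one.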

Language-focused work included fine-tuning GPT-2 small for question answering to probe common-sense reasoning. Using datasets with both answerable and unanswerable questions (including SQuAD-style splits), the model often learned a useful abstention behavior on unanswerable items, but it showed weaknesses when answers required paraphrasing or synonym/antonym shifts. Another talk used reinforcement learning for sentiment analysis by treating word selection as a sequential decision problem, with PPO training improving over supervised-only baselines on transformer and BERT-like classifiers.

Compression and interpretability appeared as well. One scholar applied knowledge distillation to shrink transformer models, but ran into data-scale issues when storing full teacher distributions; truncating the teacher’s output to the top 10 candidate tokens reduced storage and improved results. Another project evaluated GAN quality using Activation Atlas, mapping where distributions diverge inside an Inception network. Layer-by-layer visualization revealed that differences between generators and ImageNet could concentrate in later layers for some comparisons, while other divergences appeared earlier, offering a more structured way to interpret scalar GAN metrics like FID.
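A minimal sketch of the distillation truncation step, assuming NumPy and softmax teacher probabilities over the vocabulary; the renormalization choice is an assumption, since the talk did not specify how the discarded probability mass was handled:

```python
import numpy as np

def truncate_teacher_distribution(probs, k=10):
    """Keep only the top-k token probabilities from the teacher.

    Storing k (token index, probability) pairs instead of a full
    vocabulary-sized distribution per position is what shrinks the
    saved distillation data.
    """
    top_idx = np.argpartition(probs, -k)[-k:]   # indices of the k most likely tokens
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()     # renormalize over the kept tokens
    return top_idx, top_probs
```

The student is then trained to match only these k probabilities at each masked position.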

Finally, projects extended beyond classic benchmarks: an “AI physician” used observational EHR data to learn sepsis treatment policies with offline RL and off-policy evaluation, while an education-focused system trained BERT to recommend inquiry-based projects from web sources, then used active learning to refine topic predictions. Together, the day’s demos argued—through results, not just ideas—that careful problem framing plus targeted experiments can turn advanced ML concepts into usable prototypes quickly.

Cornell Notes

OpenAI Scholars Demo Day 2019 highlighted multiple ML prototypes built in a short, mentored timeframe, with a strong emphasis on turning theory into measurable behavior. In deep RL, the discount factor γ was shown to play a dual role in DQN (preferences over time and implicit confidence in bootstrapped value estimates), so high-γ runs in dense-reward settings fell short of what Blackwell optimality would predict. A time-varying “myopic” γ schedule (start lower, ramp up) improved dense-environment learning and could still reach optimal performance in sparse tasks. In robotics, intrinsic motivation via dynamics-model prediction error produced dense internal rewards that enabled sparse-reward agents to solve harder Fetch tasks. Across language and generative modeling, work ranged from GPT-2 fine-tuning for QA abstention to knowledge distillation for smaller transformers and Activation-Atlas-based GAN evaluation.

Why did high discount factors (e.g., γ=0.99) fail in dense-reward DQN experiments, even though Blackwell optimality would suggest otherwise?

The dense-reward experiments showed that the highest γ could underperform mid-range γ values, contradicting the expectation that sufficiently high γ yields a shared optimal policy. The proposed reason was that γ affects two mechanisms at once: (1) explicit intertemporal preference (discounting future rewards) and (2) implicit weighting of bootstrapped value estimates from the function approximator (confidence in past value predictions). In dense settings, this second effect can introduce harmful bias, so the “preference” interpretation alone doesn’t predict learning quality.
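To make the dual role concrete, recall the standard DQN target (a textbook form, not a formula shown in the talk):

$$
y_t = r_t + \gamma \max_{a'} Q\left(s_{t+1}, a'; \theta^{-}\right)
$$

Here γ both discounts future reward (the stated preference role) and scales the bootstrapped term $\max_{a'} Q(s_{t+1}, a'; \theta^{-})$, so a large γ amplifies whatever bias the approximate Q-function carries. In dense-reward settings the observed rewards $r_t$ already provide strong learning signal, and that amplified bootstrap bias can dominate.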

What is the “myopic” γ schedule, and how did it change learning outcomes?

Instead of using a fixed γ throughout training, the method used a time-varying schedule: choose a target γ, but for an initial fraction of training steps use a lower γ that ramps linearly to the target. This creates early training that is more “myopic,” reducing the bias introduced by bootstrapping under high γ. In dense environments, the schedule made previously poor high-γ settings competitive and often optimal. In sparse environments, fixed high γ still performed best, but the myopic schedules eventually converged to optimal performance—sometimes taking longer.

How did intrinsic motivation work for sparse-reward robotics tasks in the Fetch environments?

The approach trained a dynamics model that predicted the next state from the current state and action. Intrinsic reward was set to the prediction error: larger errors meant the agent was in less predictable (less explored) regions. This produced dense internal rewards even when the environment’s extrinsic reward was sparse (0 for success, -1 otherwise). The result was faster learning for reaching and enabling success for pushing, pick-and-place, and sliding where the baseline PPO struggled. Key tuning included the dynamics model learning rate and using early environment resets after success.

What did the GPT-2 small QA fine-tuning reveal about common-sense reasoning and model behavior?

Using SQuAD-style data with answerable and unanswerable questions, the model often learned to abstain when no answer existed in the passage and could sometimes produce plausible answers for unanswerable cases. Failures clustered around paraphrase sensitivity: when correct answers were expressed with different wording, synonyms, or reordered phrasing, the model frequently missed them. The model also showed a tendency to attend to early tokens in the question text when extracting answers from the passage.

How did knowledge distillation run into practical issues, and what fix improved results?

The initial distillation stored full teacher output distributions for masked tokens, which created enormous datasets (the scholar reported about 72 TB of saved outputs), ran up against GPU memory and sequence-length limits, and yielded very low accuracy (~5%). The fix was to truncate the teacher’s output distribution to only the top 10 candidate tokens, reducing stored data to about 384 GB and improving accuracy to around 7% within the available compute budget.

What did Activation Atlas add beyond scalar GAN metrics like FID?

Scalar metrics summarize quality with a single number, but they don’t show where in a network the distributions differ. Activation Atlas uses an Inception network to capture activations across layers, clusters them into grid cells, and creates feature visualizations (“icons”) per cell. It also colors cells by relative density/log-likelihood between distributions (e.g., between different GANs and ImageNet). This revealed that divergence could concentrate in specific layers (sometimes early, sometimes late), and it helped interpret what kinds of textures or shapes the generator was producing (or missing) in different regions of activation space.
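A minimal sketch of the density-comparison step only, using PCA and a 2-D histogram as stand-ins for the dimensionality reduction and gridding in the published Activation Atlas work; it omits the feature-visualization icons entirely:

```python
import numpy as np
from sklearn.decomposition import PCA

def density_ratio_grid(acts_gen, acts_real, bins=20):
    """Project activations from two sources to 2-D, bin them, and compare densities.

    Returns the per-cell log ratio of generator density to real-data density;
    cells far from zero mark regions of activation space where the two
    distributions diverge.
    """
    both = np.vstack([acts_gen, acts_real])
    pca = PCA(n_components=2).fit(both)
    proj = pca.transform(both)
    edges = [np.linspace(proj[:, i].min(), proj[:, i].max(), bins + 1) for i in range(2)]
    p_gen, p_real = pca.transform(acts_gen), pca.transform(acts_real)
    h_gen, _, _ = np.histogram2d(p_gen[:, 0], p_gen[:, 1], bins=edges, density=True)
    h_real, _, _ = np.histogram2d(p_real[:, 0], p_real[:, 1], bins=edges, density=True)
    eps = 1e-8
    return np.log(h_gen + eps) - np.log(h_real + eps)
```

Running this per Inception layer, with `acts_gen` and `acts_real` taken from that layer, gives the kind of layer-by-layer divergence picture described above.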

Review Questions

  1. In DQN, what two roles does γ play in the proposed explanation for dense-reward failures, and how does the myopic schedule address them?
  2. For intrinsic motivation in robotics, why does prediction error from a learned dynamics model encourage exploration, and what hyperparameter was especially important?
  3. When fine-tuning GPT-2 small for QA, what kinds of answer transformations (e.g., paraphrases) tended to cause failures, and how did the model behave on unanswerable questions?

Key Points

  1. Discount factor γ in DQN can affect both intertemporal preferences and the bias/confidence of bootstrapped value estimates, so “high γ” can be harmful in dense-reward settings.
  2. A time-varying myopic γ schedule (start lower, ramp to target) can restore learning quality without needing to search for a single best fixed γ.
  3. Intrinsic motivation via dynamics-model prediction error can turn sparse external rewards into dense internal rewards, enabling harder robotics tasks to be solved with PPO.
  4. In robotics experiments, the dynamics model learning rate and early reset strategy materially influenced success rates and convergence speed.
  5. Knowledge distillation can fail in practice when storing full teacher distributions creates unmanageable data volumes; truncating teacher outputs to top candidates can make training feasible.
  6. Activation Atlas provides interpretability for generative models by localizing distribution differences inside a classifier network rather than relying on a single scalar score.
  7. Offline RL for medical decision-making requires careful state/action discretization and off-policy evaluation (e.g., weighted importance sampling) because data come from observational behavior policies (see the sketch after this list).
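As context for the last point, the standard weighted importance sampling (WIS) estimator is a common off-policy evaluation choice; the formula below is the textbook form, not necessarily the exact estimator used in the project. It scores an evaluation policy π using trajectories collected under the observed behavior policy μ:

$$
\hat{V}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} w_i \, G_i}{\sum_{i=1}^{n} w_i},
\qquad
w_i = \prod_{t=0}^{T_i - 1} \frac{\pi\left(a^{(i)}_t \mid s^{(i)}_t\right)}{\mu\left(a^{(i)}_t \mid s^{(i)}_t\right)}
$$

where $G_i$ is the observed return of trajectory $i$. Normalizing by the summed weights trades a small amount of bias for much lower variance than ordinary importance sampling, which matters when the evaluation policy differs substantially from clinicians’ observed behavior.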

Highlights

Dense-reward DQN experiments contradicted Blackwell optimality expectations: γ=0.99 underperformed, motivating a dual-role explanation and a myopic γ schedule.
Using prediction error from a learned dynamics model as intrinsic reward enabled sparse-reward Fetch tasks like pushing and pick-and-place to reach high success rates.
Truncating teacher output distributions in knowledge distillation reduced data explosion and improved student performance compared with storing full distributions.
Activation Atlas turned GAN evaluation into layer-by-layer, feature-visualized comparisons, showing divergence concentrated in particular Inception layers.
In sepsis treatment policy learning, offline RL combined discretized patient states/actions with off-policy evaluation to estimate policy value from observational trajectories.
