Quantifying Interpretability of Models Trained on Coi… | Jorge Orbay

TL;DR

Interpretability is treated as a measurable proxy for how much a model’s decision-relevant signals align with human-recognizable objects, rather than as a purely qualitative judgment.

Briefing

Neural networks trained on more diverse experience tend to develop more features that humans can interpret, and that relationship can be measured without relying entirely on slow human review. In a Scholars Demo Day 2020 project built on CoinRun (a Mario-style platformer), Jorge Orbay frames interpretability as “mind reading” for models: instead of asking a network why it acted, the work tries to quantify how much of the model’s internal decision-relevant signals line up with the game objects a human would consider meaningful.

The starting point is the diversity hypothesis from a then-unpublished OpenAI paper by Jacob Hilton and Chris Olah: interpretable features emerge at a given level of abstraction if and only if the training distribution is diverse enough at that level. “Diversity” is defined pragmatically as the number of distinct training inputs, for example how many distinct levels an agent is trained on. In CoinRun, the agent’s environment varies across assets, textures, and platform layouts, so increasing the number of training levels increases the variety of visual and positional patterns the model must learn.

Prior work tested the hypothesis using a human-in-the-loop process: researchers inspect attribution-based feature signals and manually decide whether each feature corresponds to something a person can recognize. The results show a clear trend: models trained on very few levels (around 100) have only about 1 out of 5 features that humans can interpret, while models trained on far more levels (around 100,000) rise to roughly 4 out of 5. Performance also improves with more training data, and interpretability appears to improve alongside it.

Orbay’s contribution is an attempt to replace the human loop with an algorithmic metric. The project uses attribution—computed via derivatives of the model output with respect to inputs—to produce saliency maps that indicate which pixels (or regions) most influence the model’s decisions. In CoinRun, attribution is applied not just to a whole network, but to parts tied to the agent’s control and to a value function (an estimate of how good a state is). The resulting saliency can be overlaid on the game frame to see whether the model focuses on meaningful objects (like enemies to avoid or platforms to use) or on irrelevant artifacts.
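As a minimal sketch of attribution by derivatives, the snippet below approximates the gradient of a scalar output with respect to each pixel using central differences. The function toy_value is a hypothetical stand-in for a network output, not the CoinRun model, and a real implementation would use automatic differentiation rather than finite differences.

```python
def toy_value(frame):
    """Hypothetical stand-in for a network output: it depends only on the
    top-left 2x2 patch of the frame (squared sum of those pixels)."""
    patch_sum = frame[0][0] + frame[0][1] + frame[1][0] + frame[1][1]
    return patch_sum ** 2


def saliency_map(fn, frame, eps=1e-4):
    """Approximate |d fn / d pixel| for every pixel by central differences."""
    rows, cols = len(frame), len(frame[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            original = frame[r][c]
            frame[r][c] = original + eps
            up = fn(frame)
            frame[r][c] = original - eps
            down = fn(frame)
            frame[r][c] = original  # restore the pixel before moving on
            saliency[r][c] = abs(up - down) / (2 * eps)
    return saliency


frame = [[0.5] * 4 for _ in range(4)]  # a flat 4x4 grayscale "frame"
sal = saliency_map(toy_value, frame)

# Analytically the gradient is 2 * patch_sum = 4 on the patch, 0 elsewhere.
for row in sal:
    print(["%.1f" % v for v in row])
```

Only the four pixels the toy function reads receive non-zero saliency, which is the property the overlay in the talk relies on: high-saliency pixels mark the inputs that move the output.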

To quantify interpretability, Orbay defines a score based on the overlap between attribution and “objects of interest” masks: the numerator counts intersecting pixels where saliency lands on meaningful regions, and the denominator normalizes by total attribution area. Averaging this overlap across many frames and across many features yields an overall interpretability score.
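One way to write this score down is given here as a sketch. It assumes the attribution map is first thresholded into a binary region, and the symbols are illustrative notation rather than anything taken from the talk: A_{f,t} is the set of pixels attributed to feature f on frame t, M_t is the set of object-of-interest pixels on frame t, F is the number of features, and T is the number of frames.

```latex
S_{f,t} = \frac{\lvert A_{f,t} \cap M_t \rvert}{\lvert A_{f,t} \rvert},
\qquad
S = \frac{1}{F\,T} \sum_{f=1}^{F} \sum_{t=1}^{T} S_{f,t}
```

Each S_{f,t} lies between 0 (no attributed pixel lands on an object) and 1 (every attributed pixel does), so the model-level score S can be read as a percentage.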

The metric initially fails to match the human-in-the-loop results: it lands around 35–40% rather than the higher value expected for diversely trained models (roughly 4 in 5 under human review). The shortfall is traced to a practical issue: attribution regions are often much larger than the small, localized examples used to illustrate the method, sometimes covering a large fraction of the screen. That makes overlap-based scoring less discriminative. A refinement using receptive fields and a weighting of connected versus less-connected parts of the network is mentioned as a partial fix, but the current definition still doesn’t work reliably.
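A toy example can illustrate why large attribution regions hurt an overlap ratio. The sketch below uses a hypothetical 8x8 frame with a single 2x2 object and compares two features, one centered on the object and one not, first with tight attribution regions and then with wide ones. The grid size, object placement, and region shapes are all assumptions made for illustration.

```python
SIZE = 8  # hypothetical 8x8 frame


def square(top, left, side):
    """Binary SIZE x SIZE grid containing a filled square of 1s."""
    return [[1 if top <= r < top + side and left <= c < left + side else 0
             for c in range(SIZE)] for r in range(SIZE)]


def overlap_score(attribution, mask):
    """Attributed pixels that land on objects, divided by all attributed pixels."""
    attributed = sum(map(sum, attribution))
    if attributed == 0:
        return 0.0  # no attribution at all: define the score as 0
    hits = sum(a * m for a_row, m_row in zip(attribution, mask)
               for a, m in zip(a_row, m_row))
    return hits / attributed


mask = square(0, 0, 2)  # one 2x2 object in the top-left corner

tight_on = overlap_score(square(0, 0, 2), mask)   # 4/4: region sits on the object
tight_off = overlap_score(square(5, 5, 2), mask)  # 0/4: region sits on background
wide_on = overlap_score(square(0, 0, 7), mask)    # 4/49: wide region over the object
wide_off = overlap_score(square(1, 1, 7), mask)   # 1/49: wide region, mostly off it

print("tight gap:", tight_on - tight_off)
print("wide gap: ", round(wide_on - wide_off, 3))
```

With tight regions the two features are separated by the full range of the score. With wide regions both scores are pulled toward the fraction of the frame that objects occupy, so the gap between a well-aimed feature and a poorly aimed one nearly disappears.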

Orbay’s conclusion is cautious but forward-looking: interpretability can likely be computed algorithmically, but the overlap-based definition needs refinement and validation beyond CoinRun and beyond the current experimental setup. The project also clarifies that good attribution/saliency alignment is treated as a proxy for interpretability—grounded in how humans attend to salient objects—rather than a strict causal guarantee.

Cornell Notes

The project tests a link between training diversity and interpretability in CoinRun, where models trained on more distinct levels tend to produce features humans can understand. Prior human-in-the-loop work found interpretability rises from about 1/5 interpretable features (≈100 levels) to about 4/5 (≈100,000 levels). Orbay tries to replace that slow human process with an algorithmic metric using attribution (saliency via derivatives) and an overlap score between attribution maps and “objects of interest” masks. The overlap metric underperforms, landing around 35–40% interpretability, largely because attribution regions are often too large and cover much of the screen. The takeaway is that algorithmic interpretability is feasible in principle, but the current definition needs refinement and broader testing.

What does “interpretability” mean in this project, and why does it matter for neural networks?

Interpretability is framed as a form of “mind reading” for neural networks: unlike humans, models can’t be asked why they made a decision, so the work instead breaks down what parts of the input drive the model’s outputs. In CoinRun, that means identifying which visual regions the agent’s network attends to when controlling the character and estimating state value. Quantifying interpretability matters because it offers a measurable way to understand whether training choices (like data diversity) lead to internal features that are understandable to people.

How does the diversity hypothesis connect training data to interpretable features?

The diversity hypothesis claims interpretable features arise at a given abstraction level if and only if the training distribution is diverse enough at that level. In this project, “diverse” is operationalized as the number of distinct CoinRun levels an agent trains on. The prior human-in-the-loop experiment supports the hypothesis: models trained on ~100 levels yield only about 1 out of 5 features that humans can interpret, while models trained on ~100,000 levels yield about 4 out of 5.

What role does attribution (saliency) play in the attempt to quantify interpretability?

Attribution produces saliency maps by taking derivatives of a network’s output with respect to inputs, highlighting which pixels most affect the decision. In image classification, this would show attention on parts like a bird’s beak and feathers while ignoring irrelevant background. In CoinRun, attribution is applied to parts tied to player control and to a value function, producing overlays that can be visually checked for whether the model focuses on meaningful objects (e.g., enemies) versus irrelevant artifacts.

How is Orbay’s algorithmic interpretability score constructed?

The score uses overlap between attribution and a mask of “objects of interest.” For a frame, the method defines a binary mask where meaningful regions (objects) are assigned value 1 and background is 0. The attribution map is then intersected with that mask: the numerator counts intersecting pixels, and the denominator normalizes by total attribution area. A per-feature, per-frame score is averaged across many frames (e.g., 512) and across all features to produce an overall interpretability score for the model.
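The averaging step could be sketched as follows. The thresholding in binarize, the threshold value, and the choice to skip frames with no attributed pixels are assumptions, since the summary does not say how real-valued attribution is turned into a pixel count. All function names are hypothetical.

```python
def binarize(saliency, threshold):
    """Turn a real-valued saliency grid into a 0/1 attribution region.
    The threshold is an assumption; the talk summary does not give one."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def frame_score(region, mask):
    """Attributed pixels on objects divided by all attributed pixels."""
    area = sum(map(sum, region))
    if area == 0:
        return None  # nothing attributed on this frame; caller skips it
    hits = sum(a & m for a_row, m_row in zip(region, mask)
               for a, m in zip(a_row, m_row))
    return hits / area


def model_score(saliency, masks, threshold=0.5):
    """saliency[f][t] is the grid for feature f on frame t; masks[t] is the
    object mask for frame t. Returns the mean over features of each
    feature's mean score over frames."""
    feature_means = []
    for per_frame in saliency:
        scores = [frame_score(binarize(s, threshold), m)
                  for s, m in zip(per_frame, masks)]
        scores = [s for s in scores if s is not None]
        if scores:
            feature_means.append(sum(scores) / len(scores))
    return sum(feature_means) / len(feature_means) if feature_means else 0.0


# Two 2x2 frames, each with a single object pixel.
masks = [[[1, 0], [0, 0]],
         [[0, 1], [0, 0]]]
saliency = [
    # Feature 0 tracks the object on both frames.
    [[[0.9, 0.1], [0.0, 0.0]], [[0.2, 0.8], [0.0, 0.0]]],
    # Feature 1 lights up the whole frame, then only the background.
    [[[0.9, 0.9], [0.9, 0.9]], [[0.0, 0.0], [0.7, 0.7]]],
]
score = model_score(saliency, masks)
print(score)
```

Here feature 0 scores 1.0 on both frames while feature 1 scores 0.25 and 0.0, so the model-level average is pulled down by the diffuse feature even though one feature tracks the object perfectly.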

Why did the overlap-based interpretability metric underperform compared with human judgments?

The metric produced roughly 35–40% interpretability instead of matching the stronger human-in-the-loop trend. The main issue is that attribution regions are often much larger than the small, localized examples used for intuition, sometimes spanning tens of pixels by tens of pixels. When saliency covers a large fraction of the screen, overlap with objects becomes less informative and can look similar across models. The project notes that receptive-field-based refinement and weighting connected versus less-connected parts may help, but the current method still isn’t robust.

What do the Q&A comments imply about generalizing this approach beyond CoinRun?

A question about scaling to other games (e.g., “bouncy ball”) highlights a key constraint: CoinRun’s meaningful assets occupy a limited portion of the screen (around 50% or less), which makes overlap-based scoring more workable. In games where the meaningful assets fill the entire screen (the example given was checkers), the method may not transfer well because attribution overlap would saturate and lose contrast. The discussion suggests interpretability metrics may depend on game layout and object salience structure.
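The saturation effect can be seen from the overlap ratio itself. As a sketch, write A for the non-empty set of attributed pixels on a frame and M for the set of object pixels (both symbols are illustrative notation, not from the talk). If the objects fill the whole frame, every attributed pixel is an object pixel:

```latex
M \supseteq A
\;\Longrightarrow\; A \cap M = A
\;\Longrightarrow\; \frac{\lvert A \cap M \rvert}{\lvert A \rvert} = 1
```

The score is then 1 for every feature regardless of where the attribution lands, so it can no longer separate interpretable features from uninterpretable ones.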

Review Questions

How does the project operationalize “diversity” and “level of abstraction” when testing the diversity hypothesis in CoinRun?
Describe how attribution is computed conceptually and how it is used to generate saliency maps for the agent’s decisions.
What specific failure mode caused the overlap-based interpretability score to land around 35–40%, and what refinement was proposed to address it?

Key Points

1. Interpretability is treated as a measurable proxy for how much a model’s decision-relevant signals align with human-recognizable objects, rather than as a purely qualitative judgment.
2. The diversity hypothesis predicts interpretable features emerge when training data is diverse enough at the relevant abstraction level; in CoinRun, diversity is approximated by the number of distinct levels trained on.
3. Human-in-the-loop attribution inspection previously showed interpretability rising from about 1/5 interpretable features (~100 levels) to about 4/5 (~100,000 levels).
4. Attribution-based saliency maps (via derivatives of outputs with respect to inputs) are used to identify which pixels the model attends to for both control and value estimation.
5. Orbay’s algorithmic metric computes interpretability as normalized overlap between attribution maps and masks of “objects of interest,” averaged across many frames and features.
6. The overlap metric underestimates interpretability (about 35–40%) because attribution regions are often too large, reducing the discriminative power of overlap.
7. Algorithmic interpretability appears feasible but requires refinement (e.g., receptive-field handling and weighting) and validation across domains beyond CoinRun.

Highlights

Training on more distinct CoinRun levels correlates with a sharp increase in the fraction of features humans can interpret (from ~1/5 to ~4/5).

Attribution is used to turn model attention into pixel-level saliency maps, enabling a quantitative overlap test against human-relevant objects.

The overlap-based interpretability score fails when saliency covers too much of the screen, showing that metric design must account for attribution region size and game layout.

Good attribution/saliency alignment is treated as a proxy for interpretability grounded in how humans attend to salient objects, not as a guaranteed causal implication.

Topics

Interpretability
Diversity Hypothesis
Attribution
CoinRun
Saliency Metrics

Mentioned

Jorge Orbay
Jacob Hilton
Chris Olah
Jota Jota
Alethea Power
Cobb
Greg Brockman
Pamela

Quantifying Interpretability of Models Trained on Coi… | Jorge Orbay | OpenAI Scholars Demo Day 2020