Social learning in independent multi-agent reinforcement learning | Kamal N’dousse | OpenAI Scholars Demo Day 2020

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Social learning is framed as a testable question for independent RL agents: when does observing experts improve learning beyond what direct experience provides?

Briefing

Social learning is central to human intelligence, but it’s unclear when independent reinforcement-learning agents can benefit from one another’s behavior. Kamal N’dousse frames the core problem as a test of whether RL agents can learn from “experts” merely by observing them in shared environments—and if so, under what conditions that social signal beats learning from direct experience.

The talk opens with a human-scale motivation and a monkey parable from experimental sociology: after being punished with cold water for reaching for the bananas, the monkeys learn to punish any peer that makes the same attempt. Even after the punishment stops, the behavior persists, and newly introduced monkeys are quickly punished for trying the same forbidden action, suggesting that a cultural pattern can emerge from observation and enforcement. N’dousse notes the story is apocryphal, but treats it as a template for studying social learning in artificial agents.

To investigate, he builds tooling for independent multi-agent RL. He develops marlgrid, an open-source grid-world suite compatible with the OpenAI Gym API, designed to scale to many agents and support reproducible experiments. A key environment is “goal cycle,” where agents must visit goal tiles in a specific order to earn reward; stepping on a tile out of order triggers a configurable penalty. Tuning the penalty changes how costly exploration is: with low penalties, agents can “mess up” and still eventually find the reward, whereas with high penalties exploration becomes so aversive that agents commit early to the first successful path and avoid alternatives. This knob lets him control how difficult the task is to learn from the environment alone, which is crucial for testing whether expert demonstrations provide an advantage.
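To make the mechanics concrete, here is a minimal single-agent sketch of a goal-cycle-style task with a Gym-style reset/step interface. This is not the marlgrid code; the class name, grid layout, observation format, and reward values are assumptions chosen only to illustrate the ordered-tiles-plus-configurable-penalty idea.

```python
# Hypothetical sketch of a "goal cycle"-style task (not the actual marlgrid code).
import numpy as np

class GoalCycleSketch:
    """Gym-style interface: reset() -> obs; step(a) -> (obs, reward, done, info)."""

    def __init__(self, size=7, n_goals=3, penalty=-1.0, max_steps=100, seed=0):
        self.size, self.n_goals = size, n_goals
        self.penalty = penalty            # the configurable "knob": cost of an out-of-order visit
        self.max_steps = max_steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Place the agent and the goal tiles on distinct cells.
        cells = self.rng.choice(self.size * self.size, self.n_goals + 1, replace=False)
        self.agent = np.array(divmod(int(cells[0]), self.size))
        self.goals = [np.array(divmod(int(c), self.size)) for c in cells[1:]]
        self.next_goal = 0                # index of the next tile in the cycle
        self.t = 0
        return self._obs()

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
        self.agent = np.clip(self.agent + np.array(moves[action]), 0, self.size - 1)
        self.t += 1
        reward = 0.0
        for i, g in enumerate(self.goals):
            if np.array_equal(self.agent, g):
                if i == self.next_goal:
                    reward = 1.0                          # visited the correct tile
                    self.next_goal = (self.next_goal + 1) % self.n_goals
                else:
                    reward = self.penalty                 # out-of-order visit
        done = self.t >= self.max_steps
        return self._obs(), reward, done, {}

    def _obs(self):
        # Flat observation: agent position followed by goal positions (cycle order not marked).
        return np.concatenate([self.agent] + self.goals).astype(np.float32)
```

Raising the magnitude of `penalty` in a sketch like this makes out-of-order exploration costly, which is the same lever the talk describes for controlling how learnable the task is without social information.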

On the algorithm side, he reports that DQN struggled when long-horizon behavior required memory, even after adding an LSTM and trying a limited form of prioritized experience replay. Switching to PPO produced a major improvement. Further gains came from a practical technique he calls “hidden state refreshing”: during PPO’s repeated update steps, the LSTM hidden states are periodically recomputed under the current weights so the agent’s internal memory doesn’t go stale relative to the evolving policy. With this change, goal-cycle agents achieve higher rewards and more stable training.
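The talk describes the idea rather than an implementation, but one way to read “hidden state refreshing” is: during the repeated PPO epochs over a stored rollout, re-run the LSTM over the saved observations with the current weights instead of reusing hidden states recorded at collection time. The PyTorch sketch below is a hypothetical, simplified rendering of that reading, not N’dousse’s code.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Small LSTM policy; forward re-runs the recurrence over a whole sequence."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)   # obs_seq: (batch, time, obs_dim)
        return self.head(out), state

def ppo_update_with_refresh(policy, optimizer, obs_seq, actions, advantages,
                            old_logp, n_epochs=4, clip=0.2):
    """Repeated PPO epochs over one stored rollout, recomputing hidden states each epoch."""
    for _ in range(n_epochs):
        # "Refresh": rebuild the LSTM hidden states from the stored observations with
        # the *current* weights instead of reusing states saved at collection time.
        logits, _ = policy(obs_seq)
        dist = torch.distributions.Categorical(logits=logits)
        logp = dist.log_prob(actions)
        ratio = torch.exp(logp - old_logp)
        clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Without the refresh, the hidden states baked into the rollout reflect an older policy; recomputing them keeps the memory consistent with the weights actually being optimized.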

With the infrastructure in place, N’dousse turns to observational learning. He first tries to replicate a DeepMind result, “Observational Learning by Reinforcement Learning,” in which hard-coded experts share a simple grid with novices that learn via RL. In that earlier work, experts speed up novice learning, but the novices ultimately don’t outperform what they would achieve learning alone. In N’dousse’s replication on a cluttered grid with a single goal, however, expert presence fails to provide a learning speedup.

The more interesting outcome appears in goal cycle. When the cycle structure is masked from novice agents—so direct environmental information is less immediately actionable—novices can learn to follow expert agents’ behavior. In demonstrations, novices converge toward the experts’ strategy, though they may land at slightly lower performance when an expert happens to get trapped, indicating imitation can propagate suboptimal trajectories.
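One way to picture the masking manipulation is as an observation filter applied only to novice agents. The snippet below is a hypothetical sketch; the real environments may encode the cycle information differently, and the feature slice used here is an assumption.

```python
# Hypothetical sketch of masking cycle information for novice agents only.
import numpy as np

def mask_cycle_info(obs, order_features=slice(-3, None)):
    """Return a copy of `obs` with the goal-ordering features zeroed out."""
    masked = np.array(obs, copy=True, dtype=np.float32)
    masked[..., order_features] = 0.0
    return masked

# Novices would train on mask_cycle_info(obs); experts would see the full observation,
# so following an expert becomes the most accessible source of ordering information.
```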

The emerging conclusion is cautious: learning from experts is difficult when agents can already infer the task from direct interaction. Social cues become valuable when the environment makes the right strategy hard to discover on one’s own, and when expert behavior provides information that direct experience doesn’t readily reveal. Next steps include scaling to more goals and varying penalty settings, plus measuring whether novices truly acquire the same skill by testing them in new environments without experts. He also proposes exploring priors or mechanisms that encourage social learning, and asks whether novices could ever surpass experts, an open direction for future experiments.
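The proposed transfer test amounts to a simple evaluation loop: place the trained novice in fresh environments with no experts present and measure its average return. In the sketch below, make_env and novice_act are placeholders for an environment constructor and the novice’s policy; a Gym-style step/reset convention is assumed.

```python
import numpy as np

def evaluate_alone(make_env, novice_act, n_episodes=20):
    """Average episodic return of the novice acting alone in new environments."""
    returns = []
    for ep in range(n_episodes):
        env = make_env(seed=ep)        # a fresh layout each episode, no expert agents
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(novice_act(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```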

Cornell Notes

N’dousse investigates when independent RL agents can learn from experts just by observing them in shared environments. Using marlgrid and a goal-cycle task where agents must visit goal tiles in a fixed order, he shows that direct learning often dominates: in a cluttered single-goal grid, expert presence doesn’t speed up novice learning. But when the task structure is masked so novices can’t easily infer the correct cycle from experience, novices learn to imitate expert behavior and converge toward the experts’ strategy. Algorithmic progress, especially PPO with LSTM and “hidden state refreshing,” is presented as enabling the stable long-horizon learning needed for these social-learning tests.

Why does the goal-cycle environment matter for studying social learning?

Goal cycle requires agents to visit goal tiles in a specific order to earn reward, with penalties for out-of-order steps. The penalty value is configurable, which changes how expensive exploration is. That control matters because it determines how hard it is for a novice to discover the correct strategy from direct interaction, exactly the condition under which expert observation might become useful.

What role does “hidden state refreshing” play in the RL results?

In PPO with LSTM, agents collect trajectories and then perform multiple update steps using batches of that experience. Without refreshing, the LSTM hidden states become stale relative to the policy changes made during those updates, so the memory-driven behavior grows inconsistent with the data being learned from. N’dousse refreshes the hidden states between update iterations (or gradient steps), which improves robustness and stability and yields higher rewards on goal cycle.

How did DQN with memory perform compared with PPO?

DQN with added memory (an LSTM), even with additions such as a limited form of prioritized experience replay, did not work well for the long-horizon strategy required. Switching to PPO produced a “pretty big improvement” immediately, and additional implementation details (including hidden state refreshing) further improved performance and training stability.

What happened in the replication of DeepMind’s observational learning result?

In a cluttered grid world with a single goal, expert presence did not help novices learn faster, a result N’dousse says he found “very convincingly.” The takeaway is that social cues may be hard to extract in environments where the task can be learned effectively through direct experience.

Under what condition did expert observation start to help in goal cycle?

When the goal-cycle structure was masked from novice agents, novices learned to follow experts. This suggests experts provide actionable information that novices cannot easily derive from direct interaction alone. However, novices sometimes converged to slightly lower performance than the experts, especially when an expert’s demonstrated trajectory involved getting trapped.

Review Questions

  1. What experimental lever (environment parameter or observation condition) most directly determines whether expert behavior becomes useful to novices in goal cycle?
  2. How does hidden state refreshing address the mismatch between stored LSTM memory and the policy being updated during PPO training?
  3. Why might experts fail to help in a single-goal cluttered grid but succeed when the cycle structure is masked?

Key Points

  1. Social learning is framed as a testable question for independent RL agents: when does observing experts improve learning beyond what direct experience provides?
  2. The marlgrid framework supports scalable, configurable multi-agent grid-world experiments compatible with the OpenAI Gym API.
  3. Goal cycle uses ordered goal tiles plus configurable penalties to tune exploration difficulty, letting researchers control how learnable the task is without social information.
  4. PPO with LSTM outperformed DQN with memory for long-horizon behavior, and “hidden state refreshing” improved stability by preventing stale memory during repeated PPO updates.
  5. Replicating DeepMind’s observational-learning setup in a cluttered single-goal grid found no novice speedup from expert presence.
  6. Masking the goal-cycle structure enabled novices to imitate experts, indicating social cues matter most when direct inference is hard.
  7. Next-step evaluation emphasizes whether novices acquire transferable skill by testing them in new environments without experts, not just by matching expert trajectories during training.

Highlights

  • Goal cycle’s penalty knob changes exploration cost, effectively controlling when expert observation could beat trial-and-error learning.
  • Hidden state refreshing is presented as a practical fix for LSTM memory becoming inconsistent with PPO’s evolving policy during update steps.
  • Expert presence failed to speed learning in a single-goal cluttered grid, but imitation emerged in goal cycle when the cycle structure was masked.
  • Novices can follow experts yet still converge to slightly lower performance when experts demonstrate suboptimal behavior (e.g., getting trapped).

Topics

  • Social Learning
  • Multi-Agent Reinforcement Learning
  • Observational Learning
  • PPO with LSTM
  • Grid-World Environments

Mentioned

  • Kamal N’dousse
  • Natasha
  • Mariah
  • Kristina
  • Alethea Power
  • RL
  • DQN
  • PPO
  • LSTM
  • Gym
  • API