Teaching Robots to Walk with Reinforcement Learning
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fast, topology-evolving NEAT setup produced the first stable “walk forward” behavior for a bipedal robot in NVIDIA Isaac Sim, while several continuous-control reinforcement learning baselines (including DDPG and TD3 variants) repeatedly peaked and then collapsed into jittery or nonsensical motion. The practical takeaway is less about one magic algorithm and more about how sensitive legged-locomotion training is to action continuity, observation design, reward shaping, and model capacity.
The work starts by reframing locomotion as a continuous-control problem. Instead of choosing between discrete actions (like CartPole’s left/right), the agent must output continuous servo targets in a range (roughly −1 to +1) so multiple joints move in coordinated fashion. To make iteration feasible, the training task is simplified using OpenAI Gym’s BipedalWalker environment as a stand-in: the objective is to move forward without falling, with continuous servo positions as actions.
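To make the action-space contrast concrete, here is a minimal sketch of what “continuous servo targets” means in practice. A hypothetical linear policy stands in for the actor network; the essential point is that the output is one real value per joint, squashed into roughly −1 to +1, rather than a single discrete choice:

```python
import numpy as np

def policy_action(obs, weights):
    """Toy linear policy: map an observation vector to continuous
    servo targets, squashed into (-1, 1) with tanh. The linear map is
    a hypothetical stand-in for the actor network's output layer."""
    raw = weights @ obs        # one real value per joint
    return np.tanh(raw)        # continuous targets in (-1, 1)

# Example shapes: 4 joints driven from a 6-dimensional observation
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))
obs = rng.normal(size=6)
action = policy_action(obs, w)
assert action.shape == (4,)
assert np.all(np.abs(action) <= 1.0)   # every servo target is in range
```

Unlike CartPole’s two-way choice, each step here commits all joints simultaneously to real-valued positions, which is why discrete-action methods such as vanilla DQN do not apply directly.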
A key uncertainty is which sensory inputs are truly necessary. The full physical robot setup lacks some signals used in the reference environment (such as lidar and ground-contact sensing), and its hull-angle measurement is similar but not identical. Rather than assume a rich observation set will be available, the training experiments test whether learning can succeed with more limited inputs, initially focusing on servo positions and rotational deltas.
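A sketch of what such a reduced observation might look like, assuming only servo positions and their finite-difference rotational deltas are available (the function name and layout are illustrative, not the project’s actual code):

```python
import numpy as np

def build_observation(servo_pos, prev_servo_pos, dt):
    """Hypothetical reduced observation: current servo positions plus
    rotational deltas (finite-difference angular velocities), with no
    lidar or ground-contact channels."""
    pos = np.asarray(servo_pos, dtype=float)
    prev = np.asarray(prev_servo_pos, dtype=float)
    deltas = (pos - prev) / dt        # how fast each joint is rotating
    return np.concatenate([pos, deltas])

obs = build_observation([0.1, -0.2, 0.3, 0.0],
                        [0.0, -0.1, 0.3, 0.1], dt=0.05)
assert obs.shape == (8,)   # 4 positions + 4 deltas
```

If learning succeeds on this stripped-down vector, the policy should transfer more easily to hardware that lacks the richer sensors.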
Algorithmically, the contrast is stark. NEAT delivers usable walking behavior extremely quickly—on the order of minutes—showing stable forward motion even if the gait isn’t graceful. In side-by-side comparisons, DDPG training runs far longer (tens of minutes) and can briefly improve, but then “loses everything,” returning to erratic behavior with no clear recovery. The same pattern of instability appears in later attempts with TD3 (a refinement of DDPG), which also fails to produce reliable early gains.
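For intuition about why an evolutionary method can find a gait quickly, here is a bare-bones evaluate–select–mutate loop in plain Python. Unlike real NEAT it mutates only weights, not network topology, and the fitness function is a toy stand-in for “forward distance covered in sim”; it is meant only to show the structure of the search:

```python
import random

def evolve(fitness_fn, pop_size=50, generations=20, sigma=0.1, n_params=8):
    """Minimal evolutionary loop (no topology mutation, so only the
    spirit of NEAT): evaluate every genome, keep the best, and refill
    the population with mutated copies of the survivors."""
    random.seed(0)
    population = [[random.gauss(0, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        elite = ranked[: pop_size // 5]          # top 20% survive intact
        population = elite + [
            [w + random.gauss(0, sigma) for w in random.choice(elite)]
            for _ in range(pop_size - len(elite))
        ]
    return max(population, key=fitness_fn)       # elitism: best never lost

# Toy fitness landscape standing in for "distance walked forward"
fitness = lambda g: -sum((w - 0.5) ** 2 for w in g)
best = evolve(fitness)
```

Because every candidate is judged only on episode outcome, this search has no bootstrapped value estimates to destabilize, which is one plausible reason it does not exhibit the DDPG/TD3 “peak then collapse” pattern.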
Because the agent sometimes learns to exploit the simulator with jittery sliding, the reward function is engineered with guardrails. Forward progress is rewarded step-by-step, lateral movement is penalized, and a fail-safe triggers a negative reward when too many joints fail to move meaningfully (a direct punishment for the “shake and slide” failure mode). Additional reward variations briefly induce behaviors like jumping, but those gains are short-lived because the policy cannot recover once it commits to the shortcut.
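A hedged sketch of reward shaping along these lines follows; the coefficients, thresholds, and function name are illustrative assumptions, not the project’s actual values:

```python
def step_reward(dx, dy, joint_deltas, min_motion=0.01, min_moving=2):
    """Hypothetical per-step reward mirroring the guardrails described:
    reward forward progress, penalize lateral drift, and punish the
    'shake and slide' exploit when too few joints actually move."""
    reward = 10.0 * dx              # forward progress this step
    reward -= 5.0 * abs(dy)         # lateral movement is penalized
    moving = sum(abs(d) > min_motion for d in joint_deltas)
    if moving < min_moving:         # fail-safe: joints barely moved
        reward -= 1.0               # direct punishment for shake-and-slide
    return reward

# Honest forward step with all joints working is rewarded...
assert step_reward(0.1, 0.0, [0.5, 0.4, 0.3, 0.2]) > 0
# ...while the same displacement via frozen-joint sliding is docked.
assert step_reward(0.1, 0.0, [0.001] * 4) < step_reward(0.1, 0.0, [0.5] * 4)
```

The fail-safe matters because position-based forward reward alone cannot distinguish walking from vibrating across the floor.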
The breakthrough for the DDPG-family approach comes not from switching algorithms but from scaling the neural network. After many “shotgun” trials, increasing the actor/critic model size—from a smaller 3-layer 400/300/300 setup to a larger 4-layer 512/256/256/256 configuration—establishes a new, if still temporary, performance baseline. Even then, training often follows a familiar curve: a bump in reward followed by sudden collapse, even after hyperparameter tweaks. The best stable results so far come from leaving the TD3 paper defaults intact and changing only the network sizes.
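The capacity change is easy to quantify. The helper below (assuming plain fully connected layers and BipedalWalker-like input/output sizes, which are illustrative assumptions) compares parameter counts of the two actor configurations mentioned above:

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Weights + biases for a fully connected MLP with the given
    hidden-layer widths."""
    dims = [in_dim] + list(hidden) + [out_dim]
    return sum(dims[i] * dims[i + 1] + dims[i + 1]
               for i in range(len(dims) - 1))

# 24-dim observation, 4 continuous actions (assumed for illustration)
small = mlp_param_count(24, [400, 300, 300], 4)
large = mlp_param_count(24, [512, 256, 256, 256], 4)
assert large > small   # the "fix" was extra capacity, not a new algorithm
```

The wider-and-deeper network has roughly a quarter more parameters, which is consistent with the framing that model capacity, not the algorithm, was the binding constraint.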
Finally, the workflow is operationalized: code and Isaac Sim scene files are provided, along with the best trained model and a training script that replaces the Gym environment with Isaac Sim. The author flags a likely area to investigate—pose.r usage and coordinate/reference-frame assumptions—as a potential source of subtle training issues. The overall message is that locomotion learning in simulation is achievable, but “import deep learning and it works” is far from reality; success depends on careful simplification, reward design, and model/observation choices.
Cornell Notes
The project treats biped locomotion as a continuous-control reinforcement learning task where the policy outputs continuous servo targets (not discrete left/right choices). NEAT quickly finds a stable forward-walking gait in Isaac Sim, while DDPG and TD3 often train for much longer and then collapse into jittery or exploitative behaviors. A custom reward function rewards forward progress and penalizes side-to-side motion, and a fail-safe strongly punishes “shake-and-slide” behavior when joints don’t move enough. For DDPG-family methods, the most meaningful improvement comes from increasing actor/critic network size, which creates a temporary performance bump but still tends to degrade later. The work also emphasizes observation limitations and reference-frame details (e.g., pose.r) as likely sources of training instability.
Why is locomotion framed as a fundamentally different RL problem than CartPole-style tasks?
What sensory inputs matter, and what happens when the physical robot lacks them?
How did NEAT outperform DDPG/TD3 in practice?
What reward design choices targeted the simulator “cheats” like jittery sliding?
What change finally improved DDPG/TD3-family results, and why might it help?
Which implementation detail is flagged as a likely debugging starting point?
Review Questions
- How do continuous servo action spaces change the choice of RL algorithm compared with discrete-action environments like CartPole?
- What specific reward components and fail-safes were used to prevent the agent from exploiting jittery sliding?
- Why might increasing actor/critic network size improve early performance but still lead to later collapse during training?
Key Points
1. Locomotion learning here is treated as continuous control: policies output real-valued servo targets rather than discrete action choices.
2. NEAT found stable forward walking in minutes, while DDPG and TD3 repeatedly peaked and then collapsed into erratic or exploitative motion.
3. A custom reward function rewards forward progress and penalizes lateral movement, and it includes a fail-safe that imposes a strong penalty when joints don’t move meaningfully.
4. Reward shaping can create short-term wins (e.g., jumping behaviors) that later become traps the policy cannot recover from.
5. The most effective DDPG/TD3-family improvement came from increasing actor/critic network size (400/300/300 → 512/256/256/256), though training still often degrades after a reward bump.
6. Observation availability and reference-frame correctness (e.g., pose.r usage) are treated as major sources of uncertainty and likely instability.