Teaching Robots to Walk with Reinforcement Learning
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fast, topology-evolving NEAT setup produced the first stable “walk forward” behavior for a bipedal robot in NVIDIA Isaac Sim, while several continuous-control reinforcement learning baselines (including DDPG and TD3 variants) repeatedly peaked and then collapsed into jittery or nonsensical motion. The practical takeaway is less about one magic algorithm and more about how sensitive legged-locomotion training is to action continuity, observation design, reward shaping, and model capacity.
The work starts by reframing locomotion as a continuous-control problem. Instead of choosing between discrete actions (like CartPole’s left/right), the agent must output continuous servo targets in a range (roughly −1 to +1) so multiple joints move in coordinated fashion. To make iteration feasible, the training task is simplified using OpenAI Gym’s BipedalWalker environment as a stand-in: the objective is to move forward without falling, with continuous servo positions as actions.
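To make the action-space contrast concrete, here is a minimal sketch of what “continuous servo targets” means in practice. A hypothetical linear policy stands in for the actor network; the essential point is that the output is one real value per joint, squashed into roughly −1 to +1, rather than a single discrete choice:

```python
import numpy as np

def policy_action(obs, weights):
    """Toy linear policy: map an observation vector to continuous
    servo targets, squashed into (-1, 1) with tanh. The linear map is
    a hypothetical stand-in for the actor network's output layer."""
    raw = weights @ obs        # one real value per joint
    return np.tanh(raw)        # continuous targets in (-1, 1)

# Example shapes: 4 joints driven from a 6-dimensional observation
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))
obs = rng.normal(size=6)
action = policy_action(obs, w)
assert action.shape == (4,)
assert np.all(np.abs(action) <= 1.0)   # every servo target is in range
```

Unlike CartPole’s two-way choice, each step here commits all joints simultaneously to real-valued positions, which is why discrete-action methods such as vanilla DQN do not apply directly.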
A key uncertainty is which sensory inputs are truly necessary. The full physical robot setup lacks some signals used in the reference environment (such as lidar and ground-contact sensing), and its hull-angle measurement is similar but not identical. Rather than assume a rich observation set will be available, the training experiments test whether learning can succeed with more limited inputs, initially focusing on servo positions and rotational deltas.
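A sketch of what such a reduced observation might look like, assuming only servo positions and their finite-difference rotational deltas are available (the function name and layout are illustrative, not the project’s actual code):

```python
import numpy as np

def build_observation(servo_pos, prev_servo_pos, dt):
    """Hypothetical reduced observation: current servo positions plus
    rotational deltas (finite-difference angular velocities), with no
    lidar or ground-contact channels."""
    pos = np.asarray(servo_pos, dtype=float)
    prev = np.asarray(prev_servo_pos, dtype=float)
    deltas = (pos - prev) / dt        # how fast each joint is rotating
    return np.concatenate([pos, deltas])

obs = build_observation([0.1, -0.2, 0.3, 0.0],
                        [0.0, -0.1, 0.3, 0.1], dt=0.05)
assert obs.shape == (8,)   # 4 positions + 4 deltas
```

If learning succeeds on this stripped-down vector, the policy should transfer more easily to hardware that lacks the richer sensors.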
Algorithmically, the contrast is stark. NEAT delivers usable walking behavior extremely quickly—on the order of minutes—showing stable forward motion even if the gait isn’t graceful. In side-by-side comparisons, DDPG training runs far longer (tens of minutes) and can briefly improve, but then “loses everything,” returning to erratic behavior with no clear recovery. The same pattern of instability appears in later attempts with TD3 (a refinement of DDPG), which also fails to produce reliable early gains.
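For intuition about why an evolutionary method can find a gait quickly, here is a bare-bones evaluate–select–mutate loop in plain Python. Unlike real NEAT it mutates only weights, not network topology, and the fitness function is a toy stand-in for “forward distance covered in sim”; it is meant only to show the structure of the search:

```python
import random

def evolve(fitness_fn, pop_size=50, generations=20, sigma=0.1, n_params=8):
    """Minimal evolutionary loop (no topology mutation, so only the
    spirit of NEAT): evaluate every genome, keep the best, and refill
    the population with mutated copies of the survivors."""
    random.seed(0)
    population = [[random.gauss(0, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        elite = ranked[: pop_size // 5]          # top 20% survive intact
        population = elite + [
            [w + random.gauss(0, sigma) for w in random.choice(elite)]
            for _ in range(pop_size - len(elite))
        ]
    return max(population, key=fitness_fn)       # elitism: best never lost

# Toy fitness landscape standing in for "distance walked forward"
fitness = lambda g: -sum((w - 0.5) ** 2 for w in g)
best = evolve(fitness)
```

Because every candidate is judged only on episode outcome, this search has no bootstrapped value estimates to destabilize, which is one plausible reason it does not exhibit the DDPG/TD3 “peak then collapse” pattern.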
Because the agent sometimes learns to exploit the simulator with jittery sliding, the reward function is engineered with guardrails. Forward progress is rewarded step-by-step, lateral movement is penalized, and a fail-safe triggers a negative reward when too many joints fail to move meaningfully (a direct punishment for the “shake and slide” failure mode). Additional reward variations briefly induce behaviors like jumping, but those gains are short-lived because the policy cannot recover once it commits to the shortcut.
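A hedged sketch of reward shaping along these lines follows; the coefficients, thresholds, and function name are illustrative assumptions, not the project’s actual values:

```python
def step_reward(dx, dy, joint_deltas, min_motion=0.01, min_moving=2):
    """Hypothetical per-step reward mirroring the guardrails described:
    reward forward progress, penalize lateral drift, and punish the
    'shake and slide' exploit when too few joints actually move."""
    reward = 10.0 * dx              # forward progress this step
    reward -= 5.0 * abs(dy)         # lateral movement is penalized
    moving = sum(abs(d) > min_motion for d in joint_deltas)
    if moving < min_moving:         # fail-safe: joints barely moved
        reward -= 1.0               # direct punishment for shake-and-slide
    return reward

# Honest forward step with all joints working is rewarded...
assert step_reward(0.1, 0.0, [0.5, 0.4, 0.3, 0.2]) > 0
# ...while the same displacement via frozen-joint sliding is docked.
assert step_reward(0.1, 0.0, [0.001] * 4) < step_reward(0.1, 0.0, [0.5] * 4)
```

The fail-safe matters because position-based forward reward alone cannot distinguish walking from vibrating across the floor.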
The breakthrough for the DDPG-family approach comes not from switching algorithms but from scaling the neural network. After many “shotgun” trials, increasing the actor/critic model size—from a smaller 3-layer 400/300/300 setup to a larger 4-layer 512/256/256/256 configuration—establishes a new, if still temporary, performance baseline. Even then, training often follows a familiar curve: a bump in reward followed by sudden collapse, even after hyperparameter tweaks. The best stable results so far come from leaving the TD3 paper defaults intact and changing only the network sizes.
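The capacity change is easy to quantify. The helper below (assuming plain fully connected layers and BipedalWalker-like input/output sizes, which are illustrative assumptions) compares parameter counts of the two actor configurations mentioned above:

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Weights + biases for a fully connected MLP with the given
    hidden-layer widths."""
    dims = [in_dim] + list(hidden) + [out_dim]
    return sum(dims[i] * dims[i + 1] + dims[i + 1]
               for i in range(len(dims) - 1))

# 24-dim observation, 4 continuous actions (assumed for illustration)
small = mlp_param_count(24, [400, 300, 300], 4)
large = mlp_param_count(24, [512, 256, 256, 256], 4)
assert large > small   # the "fix" was extra capacity, not a new algorithm
```

The wider-and-deeper network has roughly a quarter more parameters, which is consistent with the framing that model capacity, not the algorithm, was the binding constraint.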
Finally, the workflow is operationalized: code and Isaac Sim scene files are provided, along with the best trained model and a training script that replaces the Gym environment with Isaac Sim. The author flags a likely area to investigate—pose.r usage and coordinate/reference-frame assumptions—as a potential source of subtle training issues. The overall message is that locomotion learning in simulation is achievable, but “import deep learning and it works” is far from reality; success depends on careful simplification, reward design, and model/observation choices.
Cornell Notes
The project treats biped locomotion as a continuous-control reinforcement learning task where the policy outputs continuous servo targets (not discrete left/right choices). NEAT quickly finds a stable forward-walking gait in Isaac Sim, while DDPG and TD3 often train for much longer and then collapse into jittery or exploitative behaviors. A custom reward function rewards forward progress and penalizes side-to-side motion, and a fail-safe strongly punishes “shake-and-slide” behavior when joints don’t move enough. For DDPG-family methods, the most meaningful improvement comes from increasing actor/critic network size, which creates a temporary performance bump but still tends to degrade later. The work also emphasizes observation limitations and reference-frame details (e.g., pose.r) as likely sources of training instability.
Why is locomotion framed as a fundamentally different RL problem than CartPole-style tasks?
What sensory inputs matter, and what happens when the physical robot lacks them?
How did NEAT outperform DDPG/TD3 in practice?
What reward design choices targeted the simulator “cheats” like jittery sliding?
What change finally improved DDPG/TD3-family results, and why might it help?
Which implementation detail is flagged as a likely debugging starting point?
Review Questions
- How do continuous servo action spaces change the choice of RL algorithm compared with discrete-action environments like CartPole?
- What specific reward components and fail-safes were used to prevent the agent from exploiting jittery sliding?
- Why might increasing actor/critic network size improve early performance but still lead to later collapse during training?
Key Points
1. Locomotion learning here is treated as continuous control: policies output real-valued servo targets rather than discrete action choices.
2. NEAT found stable forward walking in minutes, while DDPG and TD3 repeatedly peaked and then collapsed into erratic or exploitative motion.
3. A custom reward function rewards forward progress and penalizes lateral movement, and it includes a fail-safe that imposes a strong penalty when joints don’t move meaningfully.
4. Reward shaping can create short-term wins (e.g., jumping behaviors) that later become traps the policy cannot recover from.
5. The most effective DDPG/TD3-family improvement came from increasing actor/critic network size (400/300/300 → 512/256/256/256), though training still often degrades after a reward bump.
6. Observation availability and reference-frame correctness (e.g., pose.r usage) are treated as major sources of uncertainty and likely instability.