
Training a Unitree G1 to Walk w/ Reinforcement Learning

sentdex · 6 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Sim-to-real reliability improves when the actuator control law is consistent between simulation and hardware; switching from MJLab implicit PD to an explicit Python/Torch PD controller was a turning point.

Briefing

A workable path from simulation to a real Unitree G1 humanoid hinges less on fancy training tricks and more on making the simulator “match” the robot’s control loop—down to actuator control details and what the policy has (and hasn’t) experienced. After years of sim-to-real attempts, the project reaches a turning point by switching from MJLab’s default implicit PD control to an explicit PD controller implemented in Python/Torch, then validating the policy through a strict sim-to-sim pipeline before risking the real robot.

The core sim-to-real problem is distribution shift: neural policies trained in simulation can fail catastrophically when real-world physics—gravity, friction, contact dynamics—doesn’t line up. The practical response is to reduce the gap by ensuring the robot is “simulable” enough that sim-to-sim performance predicts sim-to-real behavior. The workflow starts with MJLab (chosen because others have already demonstrated sim-to-real on Unitree G1 using it), then trains a locomotion policy in a velocity-control setup. Motion imitation is mentioned as an alternative, but the focus is on velocity-based control to build toward a general-purpose gait that can steer and handle varied terrain.
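As a hedged illustration of what "velocity-based control" means in practice, a per-episode command sampler might look like the sketch below; the function name and command ranges are assumptions for illustration, not details from the video.

```python
import torch

def sample_velocity_commands(num_envs: int, device: str = "cpu") -> torch.Tensor:
    """Sample per-environment velocity targets for a steerable gait.

    The command ranges below are illustrative assumptions, not values
    taken from the video.
    """
    vx = torch.empty(num_envs, device=device).uniform_(-1.0, 1.0)  # forward/back, m/s
    vy = torch.empty(num_envs, device=device).uniform_(-0.5, 0.5)  # lateral, m/s
    wz = torch.empty(num_envs, device=device).uniform_(-1.0, 1.0)  # yaw rate, rad/s
    return torch.stack([vx, vy, wz], dim=-1)  # appended to each policy observation
```

The policy is then rewarded for tracking these commanded body velocities, which is what makes the resulting gait steerable rather than tied to a fixed reference motion.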

Training configuration is dominated by scale and hardware constraints: tens of thousands of parallel simulated environments (e.g., 50,000) speed learning on a GPU like an RTX 4090, with training times ranging from about an hour for a short run to multi-hour or day-long sessions depending on terrain complexity and rollout settings. The policy is first tested in MJLab, where it shows balance but also reveals jitter and “shuffle” behavior, symptoms the author links to downstream wear on the joints.
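For concreteness, a training-scale configuration in that spirit might look like the following sketch; apart from the 50,000 environments and the RTX 4090-class GPU, every value is an assumed placeholder rather than a setting from the video.

```python
# Hypothetical training-scale settings; only num_envs and the GPU class
# come from the video, the rest are illustrative assumptions.
train_cfg = {
    "num_envs": 50_000,       # parallel simulated robots, as cited in the video
    "device": "cuda",         # single RTX 4090-class GPU
    "rollout_steps": 24,      # steps collected per env per update (assumed)
    "num_iterations": 5_000,  # longer runs for harder terrain (assumed)
    "control_dt": 0.02,       # 50 Hz policy rate (assumed)
}
```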

The decisive step is sim-to-sim validation using the same MJLab/Isaac Lab pipeline and then the same control code path used for the live robot. A key failure mode appears when the robot falls: in training, episodes terminate before the torso reaches a “lying on the ground” state, meaning the neural network never sees that out-of-distribution condition. On the real robot, that can trigger unstable behavior. The mitigation is operational rather than learned: detect a fall/out-of-distribution condition and switch the robot into a “full damp” mode so it doesn’t lose control. This makes real-world testing safer while still allowing the policy to be evaluated on walking and uneven terrain.
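A minimal sketch of that operational fallback is shown below, assuming a projected-gravity tilt check (a common out-of-distribution proxy in legged-RL stacks); the threshold, gains, and mode names are hypothetical, not the author's values.

```python
import torch

TORSO_TILT_LIMIT = 1.0  # rad; illustrative threshold, not the author's value

def safety_step(policy, obs, projected_gravity, kp_walk=25.0, kd_walk=0.5, kd_damp=2.0):
    """Run the policy unless torso tilt looks out-of-distribution.

    `projected_gravity` is the gravity direction expressed in the torso frame,
    a common tilt proxy in legged-RL stacks (an assumption here). All gains
    are hypothetical placeholders.
    """
    # Angle between the torso's down axis and world gravity: 0 when upright.
    tilt = torch.acos(torch.clamp(-projected_gravity[2], -1.0, 1.0))
    if tilt > TORSO_TILT_LIMIT:
        # "Full damp": drop position targets and command pure joint damping so
        # the robot folds down instead of thrashing in a state it never trained on.
        return {"mode": "damp", "kp": 0.0, "kd": kd_damp, "target": None}
    return {"mode": "policy", "kp": kp_walk, "kd": kd_walk, "target": policy(obs)}
```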

Once sim-to-sim looks credible, the policy runs on the Unitree G1 in a mobile gantry setup—ratchet-strapped to tractor pallet forks—so the robot can be tested on slopes, leaf-covered ground, and gravel without free-fall risk. The author uses a Dell GB10 compute unit (low power, small form factor) and controls the robot via keyboard/GUI, with an emergency stop strategy that hard-shuts down a proprietary board.

The field tests succeed in the sense that the robot can walk, balance, and recover in challenging conditions, but hardware fragility becomes the limiting factor. The robot’s arm repeatedly detaches, forcing repairs and later upgrades to Inspire five-finger hands. Even after mechanical setbacks, the locomotion policy continues to function, reinforcing the idea that the lower-body gait is the foundation—while arms can be refined later via freezing/smoothing strategies once stability is achieved.

Beyond the immediate demo, the project frames the next milestone as practical manipulation: getting the G1 to squat, pick up items (like kitchen objects or toys), and stand—likely requiring a locomotion policy plus a separate arm/gripper policy, coordinated by a higher-level task flow. The broader takeaway is that sim-to-real becomes tractable when control interfaces and failure handling are made consistent, and when sim-to-sim is treated as a gatekeeper rather than a formality.

Cornell Notes

The project’s breakthrough for sim-to-real on a Unitree G1 comes from tightening the control-loop match between simulation and hardware. The author finds that MJLab’s default implicit PD control (which relies on privileged/engine data and future backfilling) blocks reliable sim-to-real, so they switch to an explicit PD controller implemented in Python/Torch so the same controller runs in both sim-to-sim and sim-to-real. They also address a major out-of-distribution failure: the policy never sees “lying on the ground” states during training because episodes terminate earlier, so real falls can destabilize behavior; a fall detector triggers “full damp” instead. After sim-to-sim validation, the policy runs on the real robot in a mobile gantry setup and shows stable walking on slopes and debris, though hardware issues (arm detachments) drive repairs and hand upgrades. The next target is manipulation tasks like picking up objects from the floor using separate locomotion and gripper/arm policies.

Why does sim-to-real fail so often for deep RL locomotion policies, and what practical lever reduces that risk here?

The failure mechanism is distribution shift: policies trained in simulation encounter real-world physics and contact dynamics that differ from the simulator (gravity, friction, and contact outcomes). Neural networks tend to break when observations/actions fall out of the training distribution. The practical lever used here is to reduce the gap by making the simulator and the robot’s control loop consistent—especially the actuator control law. After switching from MJLab’s implicit PD to an explicit PD controller implemented in Python/Torch, sim-to-sim becomes a reliable predictor for sim-to-real.

What role does sim-to-sim play beyond “sanity checking,” and why is it treated as a gate before touching the real robot?

Sim-to-sim is treated as the last safe checkpoint because loading a sketchy policy onto the real robot is risky. The author emphasizes that even if the physics engine is similar, pipeline differences can introduce inconsistencies. By running the same policy through an identical sim-to-sim pipeline (and using the same control code path and GUI/controls as the live setup), the team can catch mismatches early. Only after sim-to-sim looks credible do they proceed to sim-to-real.

How is the “falling” out-of-distribution problem handled during real-world testing?

Training terminates episodes when the torso reaches certain angles/conditions, so the neural network never experiences the robot lying on the ground. When the real robot falls into that unseen state, it can “lose its mind.” The mitigation is to detect a fall/out-of-distribution condition and switch the robot into a safety mode (“full damp”) rather than letting the policy continue. The author also notes a longer-term goal: eventually train a policy that can trip, fall, and recover, but near-term testing prioritizes safety.

What specific controller mismatch did the author identify as a likely blocker for sim-to-real?

The author contrasts MJLab’s implicit PD controller with an explicit PD controller. They believe the implicit PD implementation requires privileged data from the physics engine and uses future data/backfilling to compute PD terms. That means the controller behavior in simulation doesn’t match what can be reproduced on the real robot. Switching to an explicit PD controller makes the PD computation transparent and consistent across sim-to-sim and sim-to-real.
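The explicit version reduces to a control law over currently measured state only, which is what makes the same line of code reproducible on hardware. A minimal Python/Torch sketch, with hypothetical argument names and no claim to match the author's exact code:

```python
import torch

def explicit_pd_torque(q_target, q, qd, kp, kd, torque_limit):
    """Explicit PD: torque computed from the current measured state only,
    so the identical computation can run in simulation and on the robot.

    q_target : policy action mapped to joint position targets
    q, qd    : measured joint positions and velocities
    kp, kd   : per-joint gains (tensors or floats)
    """
    tau = kp * (q_target - q) - kd * qd
    return torch.clamp(tau, -torque_limit, torque_limit)
```

Nothing here requires privileged engine data or future backfilling, which is the property the switch was meant to secure.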

Why does the velocity-trained policy show “shuffle” behavior, and what observation mismatch is suspected?

The velocity strategy uses linear velocity as part of the observation state. The real Unitree G1 lacks linear velocity measurement, so the policy’s observation differs from training. The author suspects this mismatch contributes to the shuffle/jitter behavior and says the likely path forward is estimating linear velocity (or otherwise closing the sim-to-real observation gap) while keeping the same method consistent across sim and real.
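To illustrate the kind of estimate involved, here is a deliberately naive sketch: leaky integration of world-frame IMU acceleration. This is an assumption about one possible approach, not the author's method; practical estimators usually fuse leg kinematics or are themselves learned.

```python
import torch

class NaiveVelocityEstimator:
    """Illustrative linear-velocity estimate from IMU acceleration.

    Pure integration drifts, so a leak term pulls the estimate toward zero.
    This sketches the idea only; it is not the author's solution.
    """
    def __init__(self, dt: float = 0.02, leak: float = 0.99):
        self.dt, self.leak = dt, leak
        self.v = torch.zeros(3)

    def update(self, accel_world: torch.Tensor) -> torch.Tensor:
        # accel_world: IMU acceleration rotated to the world frame, gravity removed
        self.v = self.leak * self.v + accel_world * self.dt
        return self.v
```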

What hardware and operational constraints shaped the real-world experiments?

The robot is tested while ratchet-strapped to a tractor pallet-fork gantry to prevent uncontrolled falls. The author uses a Dell GB10 compute unit (low power, small form factor) for inference and controls the robot via keyboard/GUI. Emergency stop is implemented as a hard shutdown command to a proprietary board (with a Bluetooth exploit used to gain sudo/root access). Hardware fragility also dominates: arm detachments force repairs and later upgrades to Inspire five-finger hands, while locomotion continues to work even after an arm falls off.

Review Questions

  1. What changes were made to the PD controller to improve sim-to-real reliability, and why does the author think implicit PD was problematic?
  2. How does episode termination during training create an out-of-distribution condition during real falls, and what safety mechanism addresses it?
  3. What observation mismatch (related to velocity) does the author suspect is driving shuffle/jitter behavior, and what mitigation is proposed?

Key Points

  1. Sim-to-real reliability improves when the actuator control law is consistent between simulation and hardware; switching from MJLab implicit PD to an explicit Python/Torch PD controller was a turning point.
  2. Sim-to-sim validation is treated as a hard gate: policies are only deployed to the real Unitree G1 after matching pipeline/control-code behavior in simulation.
  3. Training termination conditions matter: because the policy never sees “lying on the ground,” real falls can destabilize behavior unless a safety fallback is used.
  4. A practical safety strategy is to detect fall/out-of-distribution states and switch the robot to “full damp” rather than letting the learned policy run uncontrolled.
  5. Velocity-based locomotion can develop shuffle/jitter when the real robot lacks the same observation signals (notably linear velocity) used during training.
  6. Real-world testing used a mobile gantry (tractor pallet forks) plus a hard emergency stop to reduce risk while evaluating policies on slopes and debris.
  7. Hardware limitations (arm detachments) can become the bottleneck; locomotion can still remain functional while manipulation and arm stability are addressed later.

Highlights

  • The breakthrough wasn’t just better training—it was aligning the PD controller implementation so sim-to-sim and sim-to-real use the same explicit actuator control computation.
  • Falls are a distribution shift the policy never trained for; switching to “full damp” on fall detection prevents catastrophic behavior during early real-world trials.
  • The velocity policy’s shuffle/jitter is linked to an observation mismatch: training uses linear velocity, while the real G1 lacks that measurement.
  • A mobile gantry setup enables risky terrain tests (leaves, slopes, gravel) without free-fall consequences, letting the policy be evaluated safely.
  • Repeated arm detachments forced repairs and hand upgrades, but the locomotion policy continued to work even after mechanical failures.

Topics

  • Sim To Real
  • Explicit PD Control
  • Sim To Sim Validation
  • Unitree G1 Locomotion
  • Robot Safety Dampening
