
Learning Dexterity | Alex Ray | 2018 Summer Intern Open House

OpenAI · 5 min read

Based on OpenAI's video on YouTube.

TL;DR

The robot hand learned arbitrary object rotations using reinforcement learning trained entirely in simulation, then transferred to real hardware.

Briefing

A dexterous, underactuated five-finger robot hand learned to manipulate small objects in the real world using reinforcement learning trained entirely in simulation—despite major gaps between the simulator and the hardware. The core target was “arbitrary rotation” control: given a grasped object, the system had to execute a sequence of 50 independently sampled random rotations without dropping the item. Hitting that goal mattered because it demonstrates a practical path for transferring complex continuous-control policies from synthetic environments to robots whose real-world behavior includes backlash, tendon creep, transmission issues, and other effects that are notoriously hard to model precisely.
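
As a rough sketch of that evaluation protocol, the loop below samples 50 random target rotations and counts how many are reached before a drop. The environment and policy interfaces (`env.set_goal`, `policy.act`, the `info` keys) are hypothetical; the talk does not spell out the actual API.

```python
import numpy as np

def random_quaternion(rng):
    """Sample a uniformly random 3D rotation as a unit quaternion
    (Shoemake's method)."""
    u1, u2, u3 = rng.uniform(size=3)
    return np.array([
        np.sqrt(1 - u1) * np.sin(2 * np.pi * u2),
        np.sqrt(1 - u1) * np.cos(2 * np.pi * u2),
        np.sqrt(u1) * np.sin(2 * np.pi * u3),
        np.sqrt(u1) * np.cos(2 * np.pi * u3),
    ])

def evaluate(env, policy, n_goals=50, seed=0):
    """Count how many of n_goals independently sampled target rotations
    the policy achieves before dropping the object."""
    rng = np.random.default_rng(seed)
    obs = env.reset()
    achieved = 0
    for _ in range(n_goals):
        obs = env.set_goal(random_quaternion(rng))  # hypothetical API
        while True:
            obs, info = env.step(policy.act(obs))
            if info.get("dropped"):       # the run ends on a drop
                return achieved
            if info.get("goal_reached"):  # move on to the next rotation
                achieved += 1
                break
    return achieved
```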

The project ran for about 12 months with roughly 12 people, iterating through training, testing, failure analysis, and retraining. The workflow started by training two separate neural components: a vision model to localize the object inside the hand from camera images, and a policy model to output motor actions for the hand. When deployed on the physical robot, the vision output became part of the policy’s observations; the policy then produced actions in a closed loop. Early attempts to train the two components end-to-end didn’t help, so the team kept vision and control training separate to reduce computational cost and complexity.
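
A minimal sketch of that deployment loop, with illustrative names (`vision_model`, `hand.send_motor_commands`, and so on) standing in for the project's actual interfaces:

```python
import numpy as np

def rollout(cameras, vision_model, policy, hand, horizon=1000):
    """Closed-loop control on the real robot: camera images -> pose
    estimate -> policy observation -> motor command. All interface
    names here are illustrative, not the project's actual API."""
    state = policy.initial_state()          # recurrent policy state
    for _ in range(horizon):
        images = [cam.read() for cam in cameras]
        object_pose = vision_model(images)  # position + orientation estimate
        obs = np.concatenate([hand.joint_readings(), object_pose])
        action, state = policy.act(obs, state)
        hand.send_motor_commands(action)
```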

A key theme was simplifying the learning problem until it became solvable. The team initially aimed for full six-degree-of-freedom manipulation (position and orientation), then narrowed the task stepwise: first to rotation only, then to major axis-aligned rotations, then to spinning around a single axis. When that stalled, the system shifted again toward reaching fingertip positions in space, eventually climbing back toward the full rotation objective. This “ramp” approach helped unlock progress early and prevented the project from getting stuck on an overly ambitious target.
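
One way to picture the ramp is as an ordered ladder of objectives with a promotion rule. The stages below follow the talk's description, but the promotion threshold and task names are purely illustrative:

```python
# Hypothetical sketch of the "ramp": an ordered ladder of progressively
# harder objectives, promoting once the current one trains reliably.
TASK_LADDER = [
    "reach_fingertip_positions",  # fingertips to target points in space
    "spin_about_z",               # rotate the object about a single axis
    "axis_aligned_rotations",     # e.g. bring the x-axis to point up
    "arbitrary_rotation",         # full goal: any sampled orientation
]

def next_task(current, success_rate, threshold=0.8):
    """Advance to the next rung once performance clears a threshold
    (the threshold value here is purely illustrative)."""
    i = TASK_LADDER.index(current)
    if success_rate >= threshold and i + 1 < len(TASK_LADDER):
        return TASK_LADDER[i + 1]
    return current
```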

On the perception side, multiple tracking strategies were tried—retro-reflective infrared dots, depth cameras, magnetic tracking, active illumination targets, fiducials/barcodes, and camera-based vision. The final system relied on cameras, using a vision model that could estimate object pose from rendered images.

On the simulation side, the team leaned heavily on domain randomization rather than detailed physical accuracy. Instead of modeling unknown friction values, tendon and transmission imperfections, and other hardware-specific quirks exactly, the simulator injected noise and variability that the real system would plausibly experience. Friction parameters were randomized aggressively across visually distinct hand parts, and difficult-to-model effects like backlash were approximated by motor reversal perturbations. Even vision occlusion of fingertip markers was handled probabilistically by making dots disappear for a fraction of the time.
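
A minimal sketch of per-episode randomization, assuming a generic simulator interface; the parameter names and ranges are illustrative, not the project's actual values:

```python
import numpy as np

def randomize_episode(sim, rng):
    """Resample hard-to-model physical and visual parameters at the start
    of each simulated episode, so the trained policy must tolerate the
    whole range rather than one exact physics configuration."""
    # Friction randomized aggressively, per visually distinct hand part.
    for part in sim.hand_parts():
        part.friction = rng.uniform(0.5, 1.5) * part.nominal_friction

    # Backlash approximated by perturbing motor commands at reversals.
    sim.motor_reversal_noise = rng.uniform(0.0, 0.1)

    # Fingertip markers occluded probabilistically: each dot simply
    # disappears for a fraction of timesteps instead of modeling
    # occlusion geometry explicitly.
    sim.marker_dropout_prob = rng.uniform(0.0, 0.2)
```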

The result: robust real-world manipulation with low pose error, sparse-observation control, and policies that transferred from simulation to hardware. The project also reported that randomization choices sometimes helped and sometimes hurt, but the team ultimately kept a broad set of randomizations because the system still succeeded under the practical constraint of limited real-robot trials.

Cornell Notes

The project trained a dexterous five-finger robot hand to rotate grasped objects arbitrarily using reinforcement learning trained only in simulation. The central success was transferring a learned control policy to real hardware despite mismatches like backlash, tendon creep, and transmission problems. Progress came from simplifying the manipulation objective step-by-step, training vision and control separately, and using extensive domain randomization to cover simulator-to-reality gaps. A camera-based vision model localized the object in the hand, feeding pose estimates into an actor-critic policy that drove the hand. The system achieved the target rotation sequence on the real robot, with reported low positional and rotational error and the ability to act from sparse observations.

What was the main manipulation capability the system had to achieve, and why was it a meaningful benchmark?

The primary goal was arbitrary rotation of a small object held in the robot hand—executing a sequence of 50 independently sampled random rotations without dropping the item. That benchmark matters because it requires stable continuous control under contact-rich, underactuated mechanics, not just simple reaching or fixed trajectories.

Why did the team simplify the learning objective instead of training directly on the full task?

Early attempts targeted six degrees of freedom (including lifting/positioning), but the project repeatedly narrowed the objective: first to rotation only, then to axis-aligned rotations (e.g., getting the x-axis to point up), then to spinning around the z-axis. When that still proved difficult, the approach shifted to reaching fingertip positions in space before ramping back up. This stepwise “ramp” prevented the policy from being overwhelmed by an overly complex target too early.

How did the system handle the gap between simulation physics and real hardware behavior?

Rather than modeling hardware imperfections precisely, the simulator injected variability through domain randomization. Friction was randomized across visually distinct hand parts, even though the true friction values were uncertain. Backlash was approximated with random motor-reversal perturbations rather than an explicit mechanical model. For vision occlusion of fingertip markers, the system probabilistically made dots disappear for a fraction of the time, approximating visibility states that are hard to model directly.

What role did vision tracking play, and what tracking approaches were tested before settling on the final approach?

Vision provided object localization inside the hand from camera images. Multiple tracking methods were tried—retro-reflective infrared dots, depth cameras (e.g., RealSense/Kinect-style), magnetic tracking (e.g., controller-like setups), active illumination targets (red dots on fingertips), fiducials/barcodes, and camera-based vision. The final reported solution used camera-based vision with a relatively simple multi-branch model to estimate object pose.
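
A sketch of what such a multi-branch estimator could look like in PyTorch; the layer sizes and camera count are assumptions, not the reported architecture:

```python
import torch
import torch.nn as nn

class PoseEstimator(nn.Module):
    """Illustrative multi-branch pose network: one convolutional branch
    per camera, features merged, regressing position and orientation."""
    def __init__(self, n_cameras=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            for _ in range(n_cameras)
        ])
        self.head = nn.Sequential(
            nn.Linear(64 * n_cameras, 128), nn.ReLU(),
            nn.Linear(128, 7),  # xyz position + unit quaternion
        )

    def forward(self, images):  # images: list of (B, 3, H, W) tensors
        feats = [b(img) for b, img in zip(self.branches, images)]
        out = self.head(torch.cat(feats, dim=1))
        pos, quat = out[:, :3], out[:, 3:]
        return pos, quat / quat.norm(dim=1, keepdim=True)
```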

How were the vision model and the control policy trained and combined at deployment?

Vision and control were trained separately. Training them together end-to-end didn’t help, so the team used two pipelines: a vision model to output object pose estimates and a recurrent actor-critic policy to output actions. At rollout on the real robot, camera images fed the vision model; the resulting pose observations were appended to the policy’s observation vector, and the actor network produced motor commands in a closed loop.

What software and compute setup supported reinforcement learning at scale?

Training ran on more than 6,000 CPUs and eight GPUs, with most engineering effort devoted to enabling that throughput. The training framework was related to an OpenAI system called RAPID, which also powers the Dota bots. The robotics work compared actor-critic variants and settled on a PPO-style method for the final system.
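
For reference, the clipped surrogate objective that defines PPO-style updates; this is the standard formulation, shown only to illustrate the method the talk refers to, not code from the project:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated for gradient
    descent. Clipping the probability ratio keeps each policy update
    close to the data-collecting policy."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))
```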

Review Questions

  1. What specific manipulation objective was used as the project’s “North Star,” and how was it evaluated on the real robot?
  2. Which two neural components were trained separately, and how did their outputs interact during real-world control?
  3. How did domain randomization substitute for detailed physical modeling, and what examples of randomized effects were used?

Key Points

  1. The robot hand learned arbitrary object rotations using reinforcement learning trained entirely in simulation, then transferred to real hardware.

  2. The core benchmark required executing 50 random independent rotations without dropping the object, emphasizing continuous control under contact.

  3. Task difficulty was managed through a stepwise “ramp” from full 6-DoF toward simpler rotation objectives before expanding back to the full rotation goal.

  4. Vision and control were trained separately; end-to-end training was tried but didn’t improve results, so the system combined them only at deployment.

  5. Domain randomization was used to bridge simulator-to-reality gaps, including randomized friction, approximate backlash handling, and probabilistic modeling of fingertip marker occlusion.

  6. The final system used camera-based object localization and a recurrent actor-critic policy that acted from sparse observations.

  7. Training required large-scale compute (6000+ CPUs and eight GPUs) and relied on a reinforcement-learning infrastructure related to RAPID.

Highlights

A policy trained in simulation transferred to a real underactuated five-finger hand, enabling arbitrary rotation sequences on hardware.
Progress depended less on perfect physics modeling and more on simplifying the learning target and injecting variability through domain randomization.
The system combined a camera-based pose estimator with a recurrent actor-critic controller, trained independently and fused only during real-world rollout.
Backlash and vision occlusion—two hard-to-model real-world effects—were approximated with randomized or probabilistic mechanisms in simulation.

Topics

Mentioned

  • Alex Ray
  • PPO
  • LSTM
  • CPU
  • GPU
  • DoF