
Advancing Robotics with Vision Language Action (VLA) Models | Prelim Exam Talk

7 min read

Based on the YouTube video from the channel Code Mechanics: My PhD Life in AI & Robotics. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

VLA models aim to map language + perception directly to robot actions, reducing brittleness from separate vision, language, and control modules.

Briefing

Vision-language-action (VLA) models aim to let robots go from sensing and language instructions straight to executable motor commands—without brittle pipelines that stitch together separate vision, language, and control systems. The central promise is simple: if one model can interpret what it sees and what it’s asked to do, then it can also decide how to act, improving performance while reducing the fragility that comes from chaining specialized modules.

Early efforts grounded language in what robots could physically do. The “SayCan” approach built a predefined skill library (e.g., “find an apple”) and trained a policy for each skill, paired with a value function that predicts success likelihood from camera input. At inference time, the system matches the user’s instruction to relevant skills, scores them with the value function, and executes the highest-probability option. It worked well for relatively simple, real-world kitchen tasks, but it didn’t scale: adding more skills required retraining policies and value functions for each one.

Robotics transformer policies shifted toward a unified model. RT1 (from Google DeepMind) used a transformer that takes visual inputs plus a natural-language instruction and outputs robot actions directly. Instead of one policy per skill, a single policy handled hundreds of tasks and generalized across semantically similar instructions. However, RT1's action space was discretized into bins, limiting fine-grained control. Data collection was also massive: 130,000 teleoperated episodes from a fleet of 13 robots, with a reported 97% success rate across more than 700 task instructions.

RT2 extended the recipe by incorporating internet-scale vision-language supervision. It combined robot action data with large visual question-answer datasets, co-fine-tuning a vision-language model so that action tokens could be produced in a language-like token space and then mapped back to discrete action bins. While RT2 improved capability, it remained closed-source, pushing researchers toward open alternatives. OpenVLA replicated the RT2-style framework but used a Llama 2 7 billion parameter backbone and reported a 16.5% absolute task-success improvement over RT2 while using far fewer parameters (7B vs. 55B). The improvement was attributed largely to better vision encoding/tokenization, using DinoV2 and SigLIP for the image side.

A key limitation of discretized actions is jerky, low-resolution motion. Newer action paradigms replaced discrete heads with generative control. Octo introduced a diffusion-based action head to model multimodal action distributions and produce continuous trajectories. DEXVLA scaled the diffusion "action expert" itself (up to a 1 billion parameter transformer-based diffusion expert) and used a multi-headed design to support multiple robot embodiments, achieving strong task success on common manipulation benchmarks like shirt folding, bin picking, and table bussing, and learning new embodiment skills from under 100 demonstrations.
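
The diffusion-style action head can be sketched with a toy denoising loop (an assumed structure for illustration, not the actual Octo or DEXVLA implementation): start an action chunk as pure noise and repeatedly refine it with a learned denoiser until a smooth trajectory emerges.

```python
import numpy as np

# Toy diffusion-style action head: the "denoiser" here is a stand-in that
# pulls samples toward a target trajectory; a real model would be a trained
# network conditioned on images and language.
rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 8)   # pretend expert action chunk (8 timesteps)

def toy_denoiser(x, t):
    # Stand-in for a trained network: predicts a step toward the target
    return x + 0.5 * (target - x)

x = rng.normal(size=8)              # start the action chunk as pure noise
for t in range(10):                 # iterative denoising steps
    x = toy_denoiser(x, t)
# After enough steps, x is a smooth, continuous trajectory near the target
```

The many iterative steps in this loop are exactly the inference cost that motivates the flow-matching alternative discussed next.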

Diffusion can be slow at inference because it requires many denoising steps. Pi Zero replaced denoising diffusion with flow matching, learning a vector field that transforms noise into actions via an ordinary differential equation—yielding faster control while maintaining smooth trajectories. Even so, action-language misalignment remained a bottleneck, with issues including semantic confusion between visually similar objects, error accumulation in open-loop evaluation, and weak long-horizon compositionality.

To improve long-horizon success and compositionality, hierarchical systems combined models. Pi05 split training and inference into stages: it used human-provided bounding boxes and language subtasks to pre-train with discretized actions, then used a flow-matching action expert in a two-level inference scheme (predict a high-level subtask, then generate low-level continuous actions). HiRobot tackled open-ended instruction following with a two-level “system 1 / system 2” design: a high-level VLM produced verbal responses and low-level language commands, while a lower-level VLA executed actions. The approach supported situated corrections (e.g., user says an item isn’t trash or reveals an allergy) and improved instruction accuracy and task progress.

Finally, chain-of-thought style reasoning was adapted for robotics. Visual chain-of-thought methods generated subgoal images and conditioned action sequences on them, improving performance on benchmarks such as LIBERO. A language-reasoning variant forced intermediate planning steps using extracted motion primitives and showed a reported 48% boost in task success after a single human correction, highlighting how intermediate reasoning and human-in-the-loop interventions can help robots recover from failure.

Across these threads, the throughline is that VLA systems are moving from rigid modular stacks toward end-to-end, generative, hierarchical, and reasoning-enabled control—while still wrestling with alignment, long-horizon robustness, and practical deployment constraints for mobile robots and real-world autonomy.

Cornell Notes

VLA models target a single pipeline: take camera (or other sensor) inputs plus a natural-language instruction, then output robot actions that can be executed directly. Early systems like SayCan grounded language in a skill library but scaled poorly because each new skill required new policies and value functions. RT1 and RT2 moved toward unified transformer policies that handle many tasks, with RT2 adding internet-scale vision-language supervision; OpenVLA made this approach open and reported improved task success with fewer parameters by upgrading image encoding/tokenization. To get smoother, more precise motion, later work replaced discrete action bins with generative control—diffusion (Octo, DEXVLA) and flow matching (Pi Zero)—and hierarchical designs (Pi05, HiRobot) improved long-horizon success and user-interaction handling. Chain-of-thought style intermediate reasoning further boosted performance, especially when a single user correction was introduced.

Why did SayCan struggle to scale beyond small skill libraries?

SayCan used a predefined skill library of natural-language skills (e.g., “find an apple”). For each skill, it trained a separate policy network to execute that skill and a value function that estimated success likelihood from the robot’s image input. At inference, it matched the user prompt to relevant skills, scored them with the value function, and executed the highest-scoring skill. The catch was scaling: as the number of skills grew, adding a new skill required retraining a policy and value function for that skill, making the approach increasingly handcrafted and expensive.
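The scoring scheme described above can be sketched in a few lines (illustrative names, not the paper's API): an LLM scores how relevant each skill is to the instruction, a value function scores how likely the skill is to succeed given the current image, and the robot executes the skill with the highest combined score.

```python
# Minimal SayCan-style skill selection sketch. `llm_relevance` and `value_fn`
# are toy stand-ins for the two learned models in the real system.
def saycan_select(instruction, skills, llm_relevance, value_fn, image):
    scores = {s: llm_relevance(instruction, s) * value_fn(s, image) for s in skills}
    return max(scores, key=scores.get), scores

skills = ["find an apple", "open the drawer", "pick up the sponge"]
llm = lambda instr, s: 0.9 if "apple" in instr and "apple" in s else 0.1
value = lambda s, img: 0.8  # pretend every skill is equally feasible right now

best, scores = saycan_select("bring me an apple", skills, llm, value, image=None)
```

The scaling problem is visible even in this sketch: every new entry in `skills` needs its own trained policy behind it, plus a value function that understands the new behavior.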

What changed from RT1 to RT2, and why did that matter for generalization?

RT1 used a transformer policy that took visual inputs and a natural-language instruction and directly output robot actions, enabling one unified policy for hundreds of tasks. Its action outputs were discretized into bins, and it required large teleoperation data collection (130,000 episodes from 13 robots). RT2 aimed to improve generalization by adding internet-scale vision-language supervision: it combined robot action data with large visual question-answer datasets, co-fine-tuning a vision-language model so action tokens could be produced and mapped back to discrete action bins. This transfer of “web knowledge” was intended to help the robot handle more skills and adapt better.
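The discretization described above can be sketched as follows (assumed details for illustration: 256 uniform bins over a normalized [-1, 1] range per action dimension). Discretizing turns continuous control into per-dimension classification, which is precisely what limits fine-grained motion.

```python
import numpy as np

# Sketch of RT1-style per-dimension action binning (assumed bin count/range).
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def discretize(action):
    # Map each continuous value to the nearest of N_BINS uniform bins
    idx = (action - LOW) / (HIGH - LOW) * (N_BINS - 1)
    return np.clip(np.round(idx), 0, N_BINS - 1).astype(int)

def undiscretize(bins):
    # Recover the (quantized) continuous action from bin ids
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

a = np.array([0.0, 0.5, -1.0])      # a continuous action
tokens = discretize(a)               # integer bin ids the policy predicts
recovered = undiscretize(tokens)     # quantized action executed on the robot
```

The round trip loses at most half a bin width per dimension, which is the "low-resolution motion" problem the generative action heads later address.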

How did OpenVLA improve on RT2 while using fewer parameters?

OpenVLA followed an RT2-like framework: image and instruction inputs were tokenized (images via the DinoV2 and SigLIP encoders, text via the language tokenizer), processed by a Llama 2 7 billion parameter backbone, and then detokenized into discrete robot action bins. The reported improvement was a 16.5% absolute task-success gain over RT2 while using fewer parameters (7B vs. 55B). The key technical difference highlighted was the improved vision encoding/tokenization, suggesting that better image tokenization contributed substantially to performance.
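Producing action tokens "in a language-like token space" can be sketched as reserving a slice of the language model's vocabulary for action bins (a common scheme; the exact ids here are illustrative assumptions, not the OpenVLA implementation):

```python
# Hypothetical RT2/OpenVLA-style action tokenization: map action bins onto the
# last ids of the language vocabulary, so the VLM emits actions as ordinary
# next-token predictions.
VOCAB_SIZE = 32000   # e.g., the Llama 2 vocabulary size
N_BINS = 256         # discretized action bins per dimension

def action_bin_to_token(bin_id):
    # Bin 0..255 -> the last 256 vocabulary ids
    return VOCAB_SIZE - N_BINS + bin_id

def token_to_action_bin(token_id):
    # Inverse mapping used at "detokenization" time
    return token_id - (VOCAB_SIZE - N_BINS)
```

Because actions live in the same token space as text, the same co-fine-tuned model can answer visual questions and emit executable actions.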

Why switch from discrete action bins to diffusion or flow matching?

Discrete bins limit fine-grained control, producing jerky, low-resolution motion that can fail on tasks needing smooth trajectories or continuous control. Diffusion-based heads (Octo) model multimodal action distributions and generate continuous actions by denoising from noise through multiple steps. DEXVLA scaled the diffusion action expert to a 1B-parameter transformer-based model and used multi-headed designs for different embodiments, improving task success and enabling new embodiment skills with under 100 demonstrations. Pi Zero addressed diffusion’s inference-time cost by using flow matching instead of denoising diffusion, learning a vector field that transforms noise to actions via an ODE, enabling faster control while preserving smooth trajectories.
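The flow-matching idea can be sketched as a short ODE integration (an assumed toy setup, not the Pi Zero implementation): actions are produced by integrating a learned velocity field from a noise sample at t=0 to an action at t=1, using a handful of Euler steps instead of many denoising iterations.

```python
import numpy as np

# Toy flow-matching inference: integrate dx/dt = v(x, t) from t=0 to t=1.
def euler_integrate(v_field, x0, n_steps=10):
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_field(x, k * dt)
    return x

action = np.array([0.3, -0.2, 0.7])   # pretend target action
x0 = np.zeros(3)                       # "noise" sample (zeros for clarity)
# Straight-line velocity field from x0 toward the action (the simplest
# flow-matching target): integrating it for unit time lands on the action.
v = lambda x, t: action - x0
result = euler_integrate(v, x0, n_steps=10)
```

Ten Euler steps here stand in for what diffusion would do in many more denoising steps, which is the inference-speed argument for flow matching.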

What does “hierarchical” mean in these VLA systems, and how did HiRobot use it for corrections?

Hierarchical systems split decision-making into levels. In HiRobot, a high-level VLM handled deliberation: it processed image data and the user prompt, then produced a verbal response and a low-level language command. A lower-level VLA (system 1) converted that command plus robot state (joint states and images) into motor actions. This structure let the robot incorporate situated corrections mid-task—such as when the robot picks up something that isn’t trash or when a user reveals an allergy—by updating low-level commands rather than continuing blindly.
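The two-level loop can be sketched as follows (function names and behaviors are illustrative assumptions, not HiRobot's API): a slow high-level model turns the user prompt plus observations into a short language command, and a fast low-level policy turns that command plus robot state into motor actions. A mid-task correction only needs to change the high-level command.

```python
# Hypothetical HiRobot-style two-level control sketch.
def high_level_vlm(prompt, observation):
    # Stand-in "system 2": deliberate, then emit a low-level language command
    if "allergy" in prompt:
        return "skip items containing peanuts"
    return "pick up the nearest item"

def low_level_vla(command, robot_state):
    # Stand-in "system 1": map command + state to a motor action
    return {"action": "grasp", "command": command, "state": robot_state}

# Normal step: high level commands, low level executes
cmd = high_level_vlm("clear the table", observation="table_image")
step = low_level_vla(cmd, robot_state={"gripper": "open"})

# User interjects mid-task; only the high-level command changes, and the
# low-level executor keeps running without restarting the task
new_cmd = high_level_vlm("I have a peanut allergy", observation="table_image")
```

The separation is what makes situated corrections cheap: the expensive deliberation layer re-plans in language, while the reactive layer keeps closing the control loop.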

How does chain-of-thought reasoning show up in robotics here?

Chain-of-thought adaptations insert intermediate reasoning before acting. In visual chain-of-thought, the model generates a subgoal image and conditions action sequences on it in a closed loop until the task completes. In a language-based variant, training uses extracted bounding boxes and motion primitives to force intermediate planning steps (e.g., identifying which object to approach first, then moving closer). A reported result was that a single interactive correction from a user boosted task success by 48%, illustrating how intermediate reasoning plus human feedback can help recover from failure modes like picking the wrong object.
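The closed-loop structure above can be sketched as a subgoal-then-act rollout (an assumed illustrative structure, not a specific paper's implementation): before each action, the policy produces an intermediate reasoning step, then conditions the action on it, repeating until the task completes.

```python
# Toy chain-of-thought rollout: plan a subgoal, act on it, repeat.
def plan_subgoal(task, completed):
    # Stand-in reasoning step: pick the next unfinished subtask
    remaining = [s for s in task if s not in completed]
    return remaining[0] if remaining else None

def act_toward(subgoal):
    # Stand-in action generation conditioned on the subgoal
    return f"execute:{subgoal}"

task = ["locate mug", "approach mug", "grasp mug"]
completed, log = [], []
while (sg := plan_subgoal(task, completed)) is not None:
    log.append(act_toward(sg))   # actions are conditioned on the subgoal
    completed.append(sg)
```

A human correction in this framing is just an edit to the subgoal sequence before the next action is generated, which is why a single intervention can recover an otherwise failed episode.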

Review Questions

  1. Which scaling bottleneck in SayCan makes it hard to add new skills, and how do RT1/RT2 avoid that specific issue?
  2. Compare diffusion-based action generation (Octo/DEXVLA) with flow matching (Pi Zero): what problem does flow matching target, and what misalignment issues still remain?
  3. In HiRobot’s two-level “system 1 / system 2” design, what information flows from the high-level module to the low-level module, and why does that help with mid-task user corrections?

Key Points

  1. VLA models aim to map language + perception directly to robot actions, reducing brittleness from separate vision, language, and control modules.

  2. SayCan grounded language in a skill library using per-skill policies and value functions, but scaling became costly because each new skill required retraining.

  3. RT1 and RT2 moved toward unified transformer policies that handle many tasks; RT2 added internet-scale vision-language data to improve generalization.

  4. OpenVLA made VLA research more reproducible and reported higher task success than RT2 with fewer parameters, attributing gains largely to improved image encoding/tokenization (DinoV2 + SigLIP).

  5. Discrete action bins can produce jerky motion; diffusion (Octo/DEXVLA) and flow matching (Pi Zero) generate continuous control and better handle multimodal actions.

  6. Hierarchical VLA systems (Pi05, HiRobot) improve long-horizon success and user interaction by separating high-level subtask reasoning from low-level action execution.

  7. Chain-of-thought style intermediate reasoning (visual subgoals or language planning) can improve performance, and a single user correction produced a large reported success boost in one approach.

Highlights

SayCan’s skill-library design worked for simple tasks but didn’t scale because each added skill required new trained components (policy + value function).
OpenVLA reported a 16.5% absolute task-success improvement over RT2 while cutting parameter count from 55B to 7B, with DinoV2 + SigLIP image encoding cited as a major contributor.
DEXVLA scaled the diffusion action expert to a 1B-parameter model and claimed strong results across embodiments, including learning new embodiment skills with fewer than 100 demonstrations.
Pi Zero replaced diffusion’s denoising steps with flow matching to speed up inference while keeping smooth trajectories, though action-language misalignment still limited long-horizon robustness.
HiRobot’s two-level cognition design enabled situated corrections mid-task (e.g., allergy constraints), updating low-level commands rather than restarting the task.

Topics

  • Vision Language Action
  • Robot Control Policies
  • Continuous Action Generation
  • Hierarchical VLA
  • Chain-of-Thought Robotics

Mentioned

  • Lauren Ay
  • VLA
  • RT1
  • RT2
  • VLM
  • LAR
  • RL
  • ODE
  • DEXVLA
  • Pi Zero
  • Pi05
  • HiRobot
  • PPO
  • DinoV2
  • SigLIP