Advancing Robotics with Vision Language Action (VLA) Models | Prelim Exam Talk
Based on the YouTube video by Code Mechanics: My PhD Life in AI & Robotics. If you like this content, support the original creators by watching, liking, and subscribing to their videos.
Briefing
Vision-language-action (VLA) models aim to let robots go from sensing and language instructions straight to executable motor commands—without brittle pipelines that stitch together separate vision, language, and control systems. The central promise is simple: if one model can interpret what it sees and what it’s asked to do, then it can also decide how to act, improving performance while reducing the fragility that comes from chaining specialized modules.
Early efforts grounded language in what robots could physically do. The “SayCan” approach built a predefined skill library (e.g., “find an apple”) and trained a policy for each skill, paired with a value function that predicts success likelihood from camera input. At inference time, the system matches the user’s instruction to relevant skills, scores them with the value function, and executes the highest-probability option. It worked well for relatively simple, real-world kitchen tasks, but it didn’t scale: adding more skills required retraining policies and value functions for each one.
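A minimal sketch of that select-and-execute step, assuming toy stand-ins: `llm_relevance` and `value_fn` below are illustrative placeholders for SayCan's LLM scoring and learned per-skill value functions, and the skill strings are examples, not the actual library.

```python
import numpy as np

# Minimal sketch of SayCan-style skill selection (illustrative, not the
# authors' code). `llm_relevance` and `value_fn` are toy stand-ins for the
# LLM scorer and the learned per-skill affordance value functions.

SKILLS = ["find an apple", "pick up the apple", "go to the counter"]

def llm_relevance(instruction: str, skill: str) -> float:
    # Stand-in: score a skill by word overlap with the instruction.
    # SayCan instead uses the LLM's likelihood of the skill string.
    overlap = set(instruction.lower().split()) & set(skill.split())
    return (len(overlap) + 1e-3) / len(skill.split())

def value_fn(skill: str, camera_image: np.ndarray) -> float:
    # Stand-in for the value function that predicts, from the current
    # camera view, how likely the skill is to succeed if executed.
    rng = np.random.default_rng(abs(hash(skill)) % 2**32)
    return float(rng.uniform(0.1, 1.0))

def select_skill(instruction: str, camera_image: np.ndarray) -> str:
    # SayCan's core rule: combine "what is useful" (LLM relevance) with
    # "what is possible here" (affordance value) and take the argmax.
    scores = {s: llm_relevance(instruction, s) * value_fn(s, camera_image)
              for s in SKILLS}
    return max(scores, key=scores.get)

image = np.zeros((224, 224, 3))
print(select_skill("bring me an apple", image))
```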
Robotics transformer policies shifted toward a unified model. RT1 (from Google) used a transformer that takes visual inputs plus a natural-language instruction and outputs robot actions directly. Instead of one policy per skill, a single policy handled hundreds of tasks and generalized across semantically similar instructions, with a reported 97% success rate across more than 700 tasks. The cost was data: roughly 130,000 teleoperated episodes collected with a fleet of 13 robots. RT1's action space was also discretized into bins, which limits fine-grained control.
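To make the binning limitation concrete, here is a sketch of per-dimension action discretization; the 256-bin count and the normalized action range below are illustrative assumptions, not RT1's exact configuration.

```python
import numpy as np

# Sketch of RT1-style action discretization: each continuous action
# dimension is quantized into a fixed number of bins so the policy can
# emit actions as classification targets. Bin count is illustrative.

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def discretize(action: np.ndarray) -> np.ndarray:
    # Map continuous values to integer bin indices in [0, N_BINS - 1].
    frac = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def undiscretize(bins: np.ndarray) -> np.ndarray:
    # Recover bin centers; anything finer than one bin width is lost,
    # which is the source of the "jerky, low-resolution motion" critique.
    return LOW + (bins + 0.5) * (HIGH - LOW) / N_BINS

a = np.array([0.1234567, -0.25, 0.5])
print(undiscretize(discretize(a)))  # rounded to ~1/256 resolution
```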
RT2 extended the recipe by incorporating internet-scale vision-language supervision. It combined robot action data with large visual question-answering datasets, co-fine-tuning a vision-language model so that actions could be emitted as tokens in the language model's vocabulary and then mapped back to discrete action bins. While RT2 improved capability, it remained closed-source, pushing researchers toward open alternatives. OpenVLA replicated the RT2-style framework on a Llama 2 7B backbone and reported a 16.5% absolute task-success improvement over RT2 while using far fewer parameters (7B vs. 55B). The improvement was attributed largely to better vision encoding/tokenization, using a fused DINOv2 + SigLIP image encoder.
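A sketch of the action-as-language-token idea: assume 256 discrete bins per action dimension and a reserved block of token IDs at the top of the vocabulary (the exact reserved range differs by model; this layout is illustrative).

```python
# Sketch of the RT2/OpenVLA trick of representing action bins inside the
# language model's token space. We assume 256 action bins mapped onto the
# last 256 vocabulary IDs; the exact reserved range is illustrative.

VOCAB_SIZE = 32_000   # e.g., a Llama 2-sized vocabulary
N_BINS = 256
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS

def action_bins_to_tokens(bins):
    # During co-fine-tuning, the VLM learns to emit these IDs just like
    # ordinary text tokens in its output sequence.
    return [ACTION_TOKEN_START + b for b in bins]

def tokens_to_action_bins(token_ids):
    # At inference, generated IDs in the reserved range are mapped back
    # to discrete action bins and then de-quantized to motor commands.
    return [t - ACTION_TOKEN_START for t in token_ids
            if t >= ACTION_TOKEN_START]

tokens = action_bins_to_tokens([12, 200, 7])
print(tokens_to_action_bins(tokens))  # -> [12, 200, 7]
```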
A key limitation of discretized actions is jerky, low-resolution motion, so newer action paradigms replaced discrete heads with generative control. Octo introduced a diffusion-based action head to model multimodal action distributions and produce continuous trajectories. DexVLA scaled the diffusion "action expert" itself, up to a 1-billion-parameter transformer-based diffusion expert, and used a multi-headed design to support multiple robot embodiments. It achieved strong task success on common manipulation benchmarks such as shirt folding, bin picking, and table bussing, and learned new embodiment skills from under 100 demonstrations.
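A toy sketch of sampling from a diffusion action head; `denoiser` is a stand-in for the trained network, and the chunk length, action dimension, and step count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a diffusion action head in the spirit of Octo/DexVLA:
# start from Gaussian noise and iteratively denoise it into a short chunk
# of continuous actions, conditioned on an observation embedding.

HORIZON, ACT_DIM, N_STEPS = 8, 7, 50  # chunk length, dims, denoise steps

def denoiser(noisy_actions, obs_embedding, t):
    # Stand-in for the learned network that predicts the noise to remove
    # at step t; a real action expert would be a large transformer
    # conditioned on vision-language features.
    return noisy_actions * 0.1

def sample_action_chunk(obs_embedding, rng):
    actions = rng.standard_normal((HORIZON, ACT_DIM))  # pure noise
    for t in reversed(range(N_STEPS)):
        # Each step strips away a bit of predicted noise; needing many
        # small steps is why diffusion can be slow at inference time.
        actions = actions - denoiser(actions, obs_embedding, t)
    return actions

rng = np.random.default_rng(0)
chunk = sample_action_chunk(obs_embedding=None, rng=rng)
print(chunk.shape)  # (8, 7): one continuous action chunk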
Diffusion can be slow at inference because it requires many denoising steps. Pi Zero replaced denoising diffusion with flow matching, learning a vector field that transforms noise into actions via an ordinary differential equation—yielding faster control while maintaining smooth trajectories. Even so, action-language misalignment remained a bottleneck, with issues including semantic confusion between visually similar objects, error accumulation in open-loop evaluation, and weak long-horizon compositionality.
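A minimal sketch of flow-matching inference as a plain Euler ODE solve; `vector_field` is a toy stand-in for the learned model, pushing noise toward a pretend target action, and the step count is illustrative.

```python
import numpy as np

# Sketch of flow-matching inference in the spirit of Pi Zero: a learned
# vector field v(x, t) is integrated as an ODE from noise (t=0) to
# actions (t=1), deterministically and in far fewer steps than diffusion.

HORIZON, ACT_DIM, N_STEPS = 8, 7, 10

def vector_field(x, t, obs_embedding):
    # Stand-in: a real model predicts the velocity that transports noise
    # toward the expert action distribution, conditioned on observations.
    target = np.zeros_like(x)  # pretend the expert action is all-zeros
    return (target - x) / max(1.0 - t, 1e-3)

def sample_actions(obs_embedding, rng):
    x = rng.standard_normal((HORIZON, ACT_DIM))  # start from noise
    dt = 1.0 / N_STEPS
    for i in range(N_STEPS):
        # Euler step of dx/dt = v(x, t); a handful of deterministic steps
        # is what makes this faster than iterative denoising.
        x = x + dt * vector_field(x, i * dt, obs_embedding)
    return x

rng = np.random.default_rng(0)
print(sample_actions(None, rng).shape)  # (8, 7)
```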
To improve long-horizon success and compositionality, hierarchical systems combined models. Pi 0.5 split training and inference into stages: it pre-trained with discretized actions on human-provided bounding boxes and language subtasks, then ran a two-level inference scheme in which the model first predicts a high-level language subtask and a flow-matching action expert then generates the low-level continuous actions. HiRobot tackled open-ended instruction following with a two-level "system 1 / system 2" design: a high-level VLM produced verbal responses and low-level language commands, while a lower-level VLA executed actions. The approach supported situated corrections (e.g., the user says an item isn't trash or reveals an allergy) and improved instruction accuracy and task progress.
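A schematic of the two-level control loop, with hypothetical stand-ins for both levels; the subtask strings and the 7-DoF action are illustrative, not the systems' actual interfaces.

```python
# Schematic of a two-level "system 2 / system 1" loop in the spirit of
# Pi 0.5 / HiRobot. Both functions are hypothetical stand-ins: the first
# maps observation, task, and any user correction to a language subtask,
# the second turns that subtask into continuous actions.

def high_level_vlm(image, instruction, user_utterance=None):
    # Stand-in for the slow, deliberate "system 2" model: it can fold a
    # mid-task correction (e.g., "that's not trash") into the next plan.
    if user_utterance:
        return f"revise plan given: {user_utterance}"
    return "pick up the cup"  # a low-level language command

def low_level_vla(image, subtask):
    # Stand-in for the fast "system 1" VLA (e.g., a flow-matching action
    # expert) that emits continuous actions for the current subtask.
    return [0.0] * 7  # one 7-DoF action

def control_step(image, instruction, user_utterance=None):
    # Language subtasks are the interface between the two levels, which
    # is what makes situated corrections easy to inject.
    subtask = high_level_vlm(image, instruction, user_utterance)
    action = low_level_vla(image, subtask)
    return subtask, action

print(control_step(image=None, instruction="clear the table"))
print(control_step(None, "clear the table", "that plate is not trash"))
```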
Finally, chain-of-thought style reasoning was adapted for robotics. Visual chain-of-thought methods generated subgoal images and conditioned action sequences on them, improving performance on benchmarks such as LIBERO. A language-reasoning variant forced intermediate planning steps using extracted motion primitives and showed a reported 48% boost in task success after a single human correction, highlighting how intermediate reasoning and human-in-the-loop interventions can help robots recover from failure.
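A schematic of the visual chain-of-thought loop, with stand-in networks for subgoal generation and the subgoal-conditioned policy; shapes and both functions are illustrative.

```python
import numpy as np

# Schematic of visual chain-of-thought control: first imagine a subgoal
# image, then generate actions conditioned on both the current view and
# that subgoal. Both networks are hypothetical stand-ins.

def predict_subgoal_image(obs_image, instruction):
    # Stand-in for a generative model that renders what the scene should
    # look like once the next subgoal is achieved.
    return np.clip(obs_image + 0.1, 0.0, 1.0)

def policy(obs_image, subgoal_image):
    # Stand-in for an action head conditioned on the imagined subgoal;
    # the subgoal serves as an intermediate "reasoning step" in pixels.
    return np.zeros(7)

obs = np.zeros((64, 64, 3))
subgoal = predict_subgoal_image(obs, "fold the shirt")
action = policy(obs, subgoal)
print(subgoal.shape, action.shape)
```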
Across these threads, the throughline is that VLA systems are moving from rigid modular stacks toward end-to-end, generative, hierarchical, and reasoning-enabled control—while still wrestling with alignment, long-horizon robustness, and practical deployment constraints for mobile robots and real-world autonomy.
Cornell Notes
VLA models target a single pipeline: take camera (or other sensor) inputs plus a natural-language instruction, then output robot actions that can be executed directly. Early systems like SayCan grounded language in a skill library but scaled poorly because each new skill required new policies and value functions. RT1 and RT2 moved toward unified transformer policies that handle many tasks, with RT2 adding internet-scale vision-language supervision; OpenVLA made this approach open and reported improved task success with fewer parameters by upgrading image encoding/tokenization. To get smoother, more precise motion, later work replaced discrete action bins with generative control: diffusion (Octo, DexVLA) and flow matching (Pi Zero). Hierarchical designs (Pi 0.5, HiRobot) then improved long-horizon success and user-interaction handling, and chain-of-thought style intermediate reasoning further boosted performance, especially when a single user correction was introduced.
Why did SayCan struggle to scale beyond small skill libraries?
What changed from RT1 to RT2, and why did that matter for generalization?
How did OpenVLA improve on RT2 while using fewer parameters?
Why switch from discrete action bins to diffusion or flow matching?
What does “hierarchical” mean in these VLA systems, and how did HiRobot use it for corrections?
How does chain-of-thought reasoning show up in robotics here?
Review Questions
- Which scaling bottleneck in SayCan makes it hard to add new skills, and how do RT1/RT2 avoid that specific issue?
- Compare diffusion-based action generation (Octo/DexVLA) with flow matching (Pi Zero): what problem does flow matching target, and what misalignment issues still remain?
- In HiRobot’s two-level “system 1 / system 2” design, what information flows from the high-level module to the low-level module, and why does that help with mid-task user corrections?
Key Points
1. VLA models aim to map language + perception directly to robot actions, reducing brittleness from separate vision, language, and control modules.
2. SayCan grounded language in a skill library using per-skill policies and value functions, but scaling became costly because each new skill required retraining.
3. RT1 and RT2 moved toward unified transformer policies that handle many tasks; RT2 added internet-scale vision-language data to improve generalization.
4. OpenVLA made VLA research more reproducible and reported higher task success than RT2 with fewer parameters, attributing gains largely to improved image encoding/tokenization (DINOv2 + SigLIP).
5. Discrete action bins can produce jerky motion; diffusion (Octo/DexVLA) and flow matching (Pi Zero) generate continuous control and better handle multimodal actions.
6. Hierarchical VLA systems (Pi 0.5, HiRobot) improve long-horizon success and user interaction by separating high-level subtask reasoning from low-level action execution.
7. Chain-of-thought style intermediate reasoning (visual subgoals or language planning) can improve performance, and a single user correction produced a large reported success boost in one approach.