I Watched an AI Drive a Real Car Through San Francisco Using Arrow Keys

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Standard Intelligence’s FDM-1 is trained on 11 million hours of computer action and built for long-context video, using a 1 million token context window and a video encoder that can ingest two hours of 30fps high-resolution footage.

Briefing

A new wave of AI systems is moving beyond text and static images into long-horizon “computer use” and real-time reasoning—highlighted by Standard Intelligence’s FDM-1, which can operate desktop software and even drive a real car in San Francisco using only arrow keys. The breakthrough matters because it targets the hardest part of agentic AI: reliably turning instructions into sequences of actions across complex, changing interfaces for extended periods, not just producing the next word.

FDM-1 is trained on 11 million hours of computer action and built to handle long-context video, with a 1 million token context window. Its video encoder can pack two hours of 30fps high-resolution footage directly into that context, enabling the model to learn from “tutorial-like” recordings of what a user does on a screen. Instead of predicting language tokens, it predicts frame-by-frame computer actions via an inverse dynamics approach, effectively treating the next “token” as the next control step. That design shows up in demos: the system can navigate interfaces well enough to use CAD software and Blender, including constructing a gear from scratch. The car demo is framed as the clearest evidence of generality: the model steers a real vehicle in a real city environment using arrow keys, with less than one hour of fine-tuning data and high accuracy.
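
As a rough sanity check on those figures (a back-of-envelope calculation, not something stated in the video), the claimed window implies only a handful of tokens per frame on average:

```python
# Back-of-envelope: how many tokens per frame the 1M-token window allows
# if it were filled entirely with 2 hours of 30fps video. The actual
# encoder compression scheme is not described in the video.
hours, fps = 2, 30
frames = hours * 3600 * fps            # 216,000 frames
context_tokens = 1_000_000
print(f"frames: {frames:,}")
print(f"avg tokens per frame: {context_tokens / frames:.1f}")  # ~4.6
```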

The transcript also places FDM-1 inside a broader ecosystem of fast-moving model types. Inception Labs’ Mercury 2 is described as a diffusion large language model that renders text by diffusing it into place rather than generating it token-by-token. Mercury 2 is marketed as the “fastest reasoning” LLM, running at over a thousand tokens per second on Nvidia Blackwell GPUs, with low listed pricing (25 cents per million input tokens and 75 cents per million output tokens). It supports a 128k context window, native tool use, and schema-aligned JSON output, and includes a tunable reasoning mode that trades response latency for more deliberate outputs.
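
At those listed rates, per-request costs stay small even for sizable workloads; a quick illustrative calculation (the token counts below are made up for the example):

```python
# Illustrative cost at the listed Mercury 2 rates: 25 cents per million
# input tokens, 75 cents per million output tokens. Workload sizes are
# hypothetical, chosen only to show the arithmetic.
input_rate, output_rate = 0.25, 0.75                  # USD per million tokens
input_tokens, output_tokens = 5_000_000, 2_000_000    # hypothetical workload
cost = (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate
print(f"estimated cost: ${cost:.2f}")                 # $2.75 for this workload
```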

Benchmarks and comparisons are mixed: Mercury 2 is said to trail some Gemini variants on certain tests but beat others, including Claude 4.5 Haiku and GPT-5 Nano in the cited set. Practical impressions from the demo emphasize that diffusion output can feel “more serious” and responsive when reasoning is enabled, and that the model’s speed could matter for tasks requiring lots of output quickly, especially coding. The transcript claims Mercury 2 can be tried for free without an account.

Finally, the robotics thread ties these advances to scalable control. Nvidia’s Sonic is presented as a transformer trained on robot motion itself, enabling whole-body teleoperation from human video and webcam motion tracking, plus text prompts for actions like walking sideways, dancing, or kicking. Nvidia attributes success to motion tracking as the scalable task for whole-body control, using dense frame-by-frame supervision from human mocap data and accelerating training via Isaac Lab simulation—reportedly transferring to a real G1 robot with zero-shot performance and high success across diverse motion sequences. Across all these examples, the common theme is action: models that can perceive longer sequences, reason in real time, and translate inputs into physical or interface-level behavior.

Cornell Notes

Standard Intelligence’s FDM-1 is positioned as a major step toward true “computer use” agents: it’s trained on 11 million hours of computer action and built for long-context video (1 million tokens; up to two hours of 30fps high-res video). Instead of predicting words, it predicts frame-by-frame computer actions via an inverse dynamics approach, enabling it to operate tools like Blender/CAD and even steer a real car in San Francisco using arrow keys with less than one hour of fine-tuning. The transcript connects this to other fast-evolving model directions, including Inception Labs’ Mercury 2 diffusion LLM, which aims for high-speed reasoning and tunable deliberation. Nvidia’s Sonic further pushes the action frontier by learning whole-body robot control from motion tracking and transferring zero-shot performance to a real robot using simulation acceleration.

What makes FDM-1’s training and architecture different from typical agent setups?

FDM-1 is trained on 11 million hours of computer action and designed from the ground up for long-context video. It uses a 1 million token context window and a video encoder that can fit two hours of 30fps high-resolution video directly into that context. The model’s core learning target is inverse dynamics: it predicts computer actions frame-by-frame (described as “predicting the next token,” where the token corresponds to a computer action), rather than generating text tokens. That combination is meant to let it carry out tasks over extended periods, an area where long-horizon agents have struggled.
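
A minimal sketch of what an inverse dynamics objective can look like in code may help; this is an assumed, simplified structure for illustration only, and the actual FDM-1 architecture, encoder, and action space are not described in the video:

```python
import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    """Toy inverse dynamics model: given embeddings of frame t and frame t+1,
    predict the discrete action (key press, click, ...) taken between them.
    Sizes and the action vocabulary are placeholders, not FDM-1's."""
    def __init__(self, frame_dim=512, num_actions=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * frame_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, num_actions),
        )

    def forward(self, frame_t, frame_t1):
        # Concatenate consecutive frame embeddings and score each action.
        return self.mlp(torch.cat([frame_t, frame_t1], dim=-1))

# One training step on a batch of (frame_t, frame_t+1, action) triples.
model = InverseDynamicsHead()
frame_t, frame_t1 = torch.randn(8, 512), torch.randn(8, 512)
actions = torch.randint(0, 128, (8,))  # ground-truth action ids from recordings
loss = nn.functional.cross_entropy(model(frame_t, frame_t1), actions)
loss.backward()
```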

How does FDM-1 turn video input into actions inside software like Blender or CAD?

The transcript frames it as generalization from data: the model learns to read and use interface elements (icons and controls) in a way that resembles how a person operates the UI. It can navigate interfaces well enough to run CAD-style workflows and Blender tasks, including constructing a gear from scratch. The key claim is that it isn’t just using a Blender-specific extension; it can apply its learned computer-use skills across applications because it has parameters trained to handle those interface patterns.

Why is the real-car demo treated as a milestone for “general” computer use?

The car demo is presented as evidence of full generality: the system uses arrow keys to steer a real vehicle in San Francisco. The transcript emphasizes that this is achieved with less than one hour of fine-tuning data for driving, yet produces high accuracy. The implication is that the model can translate long-horizon perception and interface control into safe-enough physical control behavior, not merely follow a scripted UI routine.

What is Mercury 2’s diffusion approach, and how does it differ from auto-regressive LLMs?

Mercury 2 is described as a diffusion large language model that renders an entire block of text at once by diffusing it into existence. In contrast, auto-regressive models generate output token-by-token. The transcript also highlights that Mercury 2 runs at over a thousand tokens per second on Nvidia Blackwell GPUs, and that it includes a tunable reasoning mode that can delay responses while producing more considered outputs.
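
To make the structural contrast concrete, here is a toy sketch of the two decoding styles; it is purely illustrative and does not reflect Mercury 2’s actual algorithm. An autoregressive decoder emits one token per model call, while a diffusion-style decoder starts from a fully masked block and fills positions in parallel over a few refinement passes:

```python
import random

VOCAB = ["the", "car", "drives", "through", "san", "francisco"]

def sample_token():
    # Stand-in for a model's token prediction.
    return random.choice(VOCAB)

def autoregressive_decode(length=6):
    """One token per step, appended left to right: `length` model calls."""
    out = []
    for _ in range(length):
        out.append(sample_token())
    return out

def diffusion_style_decode(length=6, steps=3):
    """Start fully masked; each pass fills in a share of positions in parallel."""
    seq = ["<mask>"] * length
    for step in range(steps):
        for i in range(length):
            # Unmask roughly 1/(remaining steps) of still-masked positions.
            if seq[i] == "<mask>" and random.random() < 1 / (steps - step):
                seq[i] = sample_token()
    return seq

print(autoregressive_decode())   # built sequentially
print(diffusion_style_decode())  # whole block refined in 3 parallel passes
```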

What practical capabilities does the transcript attribute to Nvidia’s Sonic for robotics?

Sonic is described as a transformer trained on the robot’s own motion and actions (arms, legs, and more), acting as a bridge between humans and the robot. It supports whole-body teleoperation from human video and webcam motion tracking, plus text prompts for behaviors like walking sideways, dancing, or kicking. The transcript also claims Sonic can adapt to the tempo and rhythm of musical audio, and that it achieved a reported 95% success rate on mobile tasks after integrating a VLA foundation model component (Groot N1.5).

How does Nvidia claim it achieves fast training and zero-shot transfer for whole-body control?

Nvidia’s approach centers on motion tracking as the scalable task for whole-body control, using dense frame-by-frame supervision from human mocap data. The mocap-derived data is said to encode the reward function (maintaining balance while positioning the limbs). Training is accelerated using Nvidia Isaac Lab simulation at 10,000x faster than real time, providing many years of virtual experience in hours of wall-clock time. The transcript reports that after three days of training, the resulting policy (referred to as Neuralet) transfers zero-shot to a real G1 robot with no fine-tuning and 100% success across 50 diverse real-world motion sequences.
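
Two quick illustrations of those claims, using only the figures quoted above plus a toy reward for the motion-tracking idea (the actual reward terms and weights are not given in the transcript):

```python
import numpy as np

# 1) Scale check: at 10,000x real time, three days of wall-clock training
#    corresponds to roughly 82 years of simulated motion.
print(3 * 10_000 / 365, "virtual years of experience")

# 2) Toy per-frame tracking reward: reward matching the mocap reference pose
#    while penalizing loss of balance. Purely illustrative.
def tracking_reward(robot_pose, reference_pose, torso_tilt_rad):
    pose_error = np.linalg.norm(robot_pose - reference_pose)
    balance_penalty = abs(torso_tilt_rad)
    return np.exp(-pose_error) - 0.5 * balance_penalty

print(tracking_reward(np.zeros(12), np.zeros(12), 0.05))  # best case ~0.975
```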

Review Questions

  1. How do FDM-1’s long-context video design and inverse dynamics training target combine to enable long-horizon computer actions?
  2. What tradeoffs does the transcript suggest between diffusion LLMs (Mercury 2) and auto-regressive LLMs, especially regarding speed and benchmark performance?
  3. What does Nvidia’s Sonic training pipeline claim to replace with motion tracking, and why does that matter for scaling to new robot skills?

Key Points

  1. Standard Intelligence’s FDM-1 is trained on 11 million hours of computer action and built for long-context video, using a 1 million token context window and a video encoder that can ingest two hours of 30fps high-resolution footage.

  2. FDM-1 predicts frame-by-frame computer actions via inverse dynamics, treating the “next token” as the next control step rather than the next word.

  3. FDM-1 demonstrates tool-level generalization by operating CAD software and Blender, including constructing a gear from scratch.

  4. The real-car demo is framed as a milestone for general computer use: the model steers a real car in San Francisco using arrow keys with less than one hour of fine-tuning data.

  5. Inception Labs’ Mercury 2 is a diffusion LLM that renders text by diffusing it into place rather than generating token-by-token, aiming for high-speed reasoning and tunable deliberation.

  6. Mercury 2 is positioned as competitively priced and fast on Nvidia Blackwell GPUs, with listed rates of 25 cents per million input tokens and 75 cents per million output tokens.

  7. Nvidia’s Sonic targets whole-body robot control by scaling motion tracking with dense mocap supervision and Isaac Lab simulation acceleration, aiming for zero-shot transfer to real robots.

Highlights

FDM-1 can operate complex desktop interfaces and steer a real car in San Francisco using only arrow keys, with less than one hour of driving fine-tuning.
FDM-1’s long-context setup is unusually large: a 1 million token context window that can hold two hours of 30fps high-resolution video.
Mercury 2 swaps token-by-token generation for diffusion-based text rendering and is described as running at over a thousand tokens per second on Nvidia Blackwell GPUs.
Nvidia’s Sonic claims zero-shot transfer to a real G1 robot after simulation-accelerated training, with 100% success across 50 diverse motion sequences.

Topics

  • Computer Use Agents
  • Diffusion LLMs
  • Long-Context Video
  • Robotics Teleoperation
  • Real-Time Reasoning

Mentioned

  • FDM-1
  • CAD
  • LLM
  • API
  • VR
  • VLA
  • GPU
  • GPT
  • FSD
  • JSON
  • mocap
  • VR whole body teleoperation
  • Isaac Lab
  • VLA foundation models
  • Neuralet