I Watched an AI Drive a Real Car Through San Francisco Using Arrow Keys
Based on MattVidPro's video on YouTube. If you enjoy this content, support the original creator by watching, liking, and subscribing.
Briefing
A new wave of AI systems is moving beyond text and static images into long-horizon “computer use” and real-time reasoning—highlighted by Standard Intelligence’s FDM-1, which can operate desktop software and even drive a real car in San Francisco using only arrow keys. The breakthrough matters because it targets the hardest part of agentic AI: reliably turning instructions into sequences of actions across complex, changing interfaces for extended periods, not just producing the next word.
FDM-1 is trained on 11 million hours of computer action and built to handle long-context video, with a 1 million token context window. Its video encoder can pack two hours of 30fps high-resolution footage directly into that context, enabling the model to learn from “tutorial-like” recordings of what a user does on a screen. Instead of predicting language tokens, it predicts frame-by-frame computer actions via an inverse dynamics approach—effectively treating the next “token” as the next control step. That design shows up in demos: the system can navigate interfaces well enough to use CAD software and Blender, including constructing a gear from scratch. The car demo is framed as the clearest evidence of generality: the model steers a real vehicle through a real city environment using arrow keys, reportedly achieving high accuracy with less than one hour of fine-tuning data.
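The inverse dynamics idea can be illustrated with a toy sketch (this is illustrative only, not FDM-1’s actual architecture): given two consecutive frames of a screen recording, recover the action that transformed one into the other, turning unlabeled video into per-frame action labels.

```python
import numpy as np

# Toy inverse-dynamics labeler (an illustrative assumption, not FDM-1's real
# model). A "frame" is a grid with a single cursor pixel; the recoverable
# actions are arrow-key presses inferred from how the cursor moved.

ACTIONS = {(0, 1): "arrow_right", (0, -1): "arrow_left",
           (1, 0): "arrow_down", (-1, 0): "arrow_up", (0, 0): "no_op"}

def cursor_pos(frame: np.ndarray) -> tuple[int, int]:
    """Return (row, col) of the single nonzero cursor pixel."""
    r, c = np.argwhere(frame)[0]
    return int(r), int(c)

def infer_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> str:
    """Inverse dynamics: map a (frame_t, frame_t+1) pair to the action between them."""
    (r0, c0), (r1, c1) = cursor_pos(frame_t), cursor_pos(frame_t1)
    return ACTIONS[(r1 - r0, c1 - c0)]

def label_recording(frames: list) -> list:
    """Turn an unlabeled screen recording into per-transition action labels."""
    return [infer_action(a, b) for a, b in zip(frames, frames[1:])]
```

Once a recording is labeled this way, the pairs (observation, action) become supervised training data, which is why tutorial-like screen captures are usable at scale.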
The transcript also places FDM-1 inside a broader ecosystem of fast-moving model types. Inception Labs’ Mercury 2 is described as a diffusion large language model that renders text by diffusing it into place rather than generating token-by-token. Mercury 2 is marketed as a “fastest reasoning” LLM, running at over a thousand tokens per second on Nvidia Blackwell GPUs, with low listed pricing (25 cents per million input tokens and 75 cents per million output tokens). It supports a 128k context window, native tool use, and schema-aligned JSON output, and includes a tunable reasoning mode that can delay responses while producing more deliberate outputs.
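The speed claim follows from the decoding style. Mercury 2’s actual method is not public, but the generic idea behind diffusion LLMs can be sketched as parallel masked-token decoding (everything below is an assumption for illustration): start from a fully masked sequence and unmask several positions per step, so a sequence finishes in far fewer steps than one-token-per-step autoregression.

```python
# Generic masked-diffusion decoding sketch (illustrative assumption, not
# Mercury 2's real algorithm). An autoregressive model emits one token per
# step, left to right; a diffusion-style decoder starts from all [MASK]
# tokens and fills several positions per step, refining in parallel.

MASK = "[MASK]"

def denoise_step(seq, predictions, k=2):
    """One reverse step: fill the k most confident still-masked slots.
    `predictions` maps position -> (token, confidence), standing in for a
    full-sequence model forward pass."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    best = sorted(masked, key=lambda i: -predictions[i][1])[:k]
    return [predictions[i][0] if i in best else t for i, t in enumerate(seq)]

def diffusion_decode(length, predictions, k=2):
    """Decode a length-token sequence; returns (tokens, steps_used)."""
    seq, steps = [MASK] * length, 0
    while MASK in seq:
        seq = denoise_step(seq, predictions, k)
        steps += 1
    return seq, steps
```

With k positions filled per step, a length-n sequence needs roughly n/k decoding steps instead of n, which is the intuition behind the throughput numbers quoted above.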
Benchmarks and comparisons are mixed: Mercury 2 is said to trail some Gemini variants on certain tests but beat others, including Claude 4.5 Haiku and GPT-5 Nano in the cited set. Practical impressions from the demo emphasize that diffusion can feel “more serious” and responsive when reasoning is enabled, and that the model’s speed could matter for tasks requiring lots of output quickly—especially coding. The transcript claims Mercury 2 can be tried for free without an account.
Finally, the robotics thread ties these advances to scalable control. Nvidia’s Sonic is presented as a transformer trained on robot motion itself, enabling whole-body teleoperation from human video and webcam motion tracking, plus text prompts for actions like walking sideways, dancing, or kicking. Nvidia attributes success to motion tracking as the scalable task for whole-body control, using dense frame-by-frame supervision from human mocap data and accelerating training via Isaac Lab simulation—reportedly transferring to a real G1 robot with zero-shot performance and high success across diverse motion sequences. Across all these examples, the common theme is action: models that can perceive longer sequences, reason in real time, and translate inputs into physical or interface-level behavior.
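“Dense frame-by-frame supervision” can be made concrete with a small sketch (an illustrative assumption, not Nvidia’s actual Sonic objective): instead of a sparse task reward, the policy is scored at every frame on how closely the robot’s joints track a reference human mocap clip.

```python
import numpy as np

# Toy dense motion-tracking reward (illustrative only; Sonic's real loss is
# not specified in the source). Both arrays have shape
# (num_frames, num_joints), joint angles in radians. The reward at each
# frame is exp(-scale * mean squared joint error), so perfect tracking
# scores 1.0 and every frame contributes a learning signal.

def tracking_reward(robot_joints: np.ndarray, mocap_joints: np.ndarray,
                    scale: float = 2.0) -> np.ndarray:
    err = np.mean((robot_joints - mocap_joints) ** 2, axis=1)
    return np.exp(-scale * err)
```

Because every frame of every mocap clip yields a supervision signal, adding more human motion data scales training directly, which is the scalability argument the transcript attributes to Nvidia.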
Cornell Notes
Standard Intelligence’s FDM-1 is positioned as a major step toward true “computer use” agents: it’s trained on 11 million hours of computer action and built for long-context video (1 million tokens; up to two hours of 30fps high-res video). Instead of predicting words, it predicts frame-by-frame computer actions via an inverse dynamics approach, enabling it to operate tools like Blender/CAD and even steer a real car in San Francisco using arrow keys with less than one hour of fine-tuning. The transcript connects this to other fast-evolving model directions, including Inception Labs’ Mercury 2 diffusion LLM, which aims for high-speed reasoning and tunable deliberation. Nvidia’s Sonic further pushes the action frontier by learning whole-body robot control from motion tracking and transferring zero-shot to a real robot using simulation-accelerated training.
What makes FDM-1’s training and architecture different from typical agent setups?
How does FDM-1 turn video input into actions inside software like Blender or CAD?
Why is the real-car demo treated as a milestone for “general” computer use?
What is Mercury 2’s diffusion approach, and how does it differ from auto-regressive LLMs?
What practical capabilities does the transcript attribute to Nvidia’s Sonic for robotics?
How does Nvidia claim it achieves fast training and zero-shot transfer for whole-body control?
Review Questions
- How do FDM-1’s long-context video design and inverse dynamics training target combine to enable long-horizon computer actions?
- What tradeoffs does the transcript suggest between diffusion LLMs (Mercury 2) and auto-regressive LLMs, especially regarding speed and benchmark performance?
- What does Nvidia’s Sonic training pipeline claim to replace with motion tracking, and why does that matter for scaling to new robot skills?
Key Points
- 1
Standard Intelligence’s FDM-1 is trained on 11 million hours of computer action and built for long-context video, using a 1 million token context window and a video encoder that can ingest two hours of 30fps high-resolution footage.
- 2
FDM-1 predicts frame-by-frame computer actions via inverse dynamics, treating the “next token” as the next control step rather than the next word.
- 3
FDM-1 demonstrates tool-level generalization by operating CAD software and Blender, including constructing a gear from scratch.
- 4
The real-car demo is framed as a milestone for general computer use: the model steers a real car in San Francisco using arrow keys with less than one hour of fine-tuning data.
- 5
Inception Labs’ Mercury 2 is a diffusion LLM that renders text by diffusing it into place rather than generating token-by-token, aiming for high-speed reasoning and tunable deliberation.
- 6
Mercury 2 is positioned as competitively priced and fast on Nvidia Blackwell GPUs, with listed rates of 25 cents per million input tokens and 75 cents per million output tokens.
- 7
Nvidia’s Sonic targets whole-body robot control by scaling motion tracking with dense mocap supervision and Isaac Lab simulation acceleration, aiming for zero-shot transfer to real robots.