
AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs

AI Explained
5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Devin, SIMA, and Figure 01 are presented as agent “containers” built around vision-language models, so upgrading the underlying model can rapidly improve performance.

Briefing

Three new agent-style AI systems—Cognition AI’s Devin, Google DeepMind’s SIMA, and Figure 01—signal a shift from chatbots that describe work to systems that can carry it out in constrained environments. The common thread is that each system acts like a “container” around a vision-language model: swap in a stronger underlying model (GPT-5 or Gemini 2, for example) and performance can jump quickly, often overnight. That matters because it reframes the pace of progress—from slow, incremental model training to rapid capability upgrades delivered through the same agent shell.
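To make the "container" framing concrete, here is a minimal sketch of such a shell in Python. Everything in it is illustrative: the VisionLanguageModel protocol, the AgentShell class, and the loop structure are assumptions for exposition, not any vendor's actual architecture.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class VisionLanguageModel(Protocol):
    """Any multimodal model the shell can call; upgrading means swapping this out."""
    def complete(self, images: list[bytes], prompt: str) -> str: ...


@dataclass
class AgentShell:
    """Fixed scaffolding (tools, memory, control loop) wrapped around a swappable model."""
    model: VisionLanguageModel

    def run(self, goal: str,
            observe: Callable[[], bytes],
            act: Callable[[str], None],
            max_steps: int = 20) -> list[str]:
        history: list[str] = []
        for _ in range(max_steps):
            frame = observe()                     # screenshot, game frame, or camera image
            step = self.model.complete(
                images=[frame],
                prompt=f"Goal: {goal}\nSteps so far: {history}\nNext action (or DONE)?",
            )
            if step.strip() == "DONE":
                break
            act(step)                             # editor command, keypress, motor primitive, etc.
            history.append(step)
        return history
```

Nothing in AgentShell changes when a stronger model is passed in, which is why the transcript expects large capability jumps to arrive through the same agent shell.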

Devin is positioned as an AI software engineer built around a code editor plus browser access. It doesn't just interpret instructions; it plans steps, reads documentation, and executes changes autonomously. In a short demonstration, it reads a blog post, runs the referenced code, and then iterates, finding edge cases and bugs the original instructions didn't mention, before returning final results and even bonus images. The headline metric comes from SWE-bench, a benchmark built from 2,294 real-world software engineering problems drawn from historical GitHub issues and their accepted solutions. Devin scores nearly 14% on the subset it was tested on, far above Claude 2 and GPT-4 at about 1.7%. But the benchmark's design also shapes what "success" looks like: it emphasizes issues whose solutions introduce new tests, which may underrepresent harder bugs where writing a clear test is difficult. Even so, the transcript argues that Devin's score could rise sharply with a stronger multimodal, long-context model, especially because SWE-bench tasks often require locating relevant code inside large repositories and coordinating edits across multiple files.
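For context on what a SWE-bench "pass" involves, evaluation is binary per issue: the agent's patch either makes that issue's previously failing tests pass or it does not. The sketch below approximates that flow; the function signature and helper names are assumptions for illustration, not the official SWE-bench harness.

```python
import subprocess
from pathlib import Path


def evaluate_task(repo_url: str, base_commit: str, model_patch: str,
                  fail_to_pass_tests: list[str], workdir: Path) -> bool:
    """Rough sketch of SWE-bench-style scoring: apply the model's patch at the
    issue's base commit, then check that the previously failing tests now pass."""
    subprocess.run(["git", "clone", repo_url, str(workdir)], check=True)
    subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)

    # The model must emit a valid unified diff; a bad file path or hunk means failure.
    (workdir / "model.patch").write_text(model_patch)
    applied = subprocess.run(["git", "apply", "model.patch"], cwd=workdir)
    if applied.returncode != 0:
        return False

    # "Resolved" only if every test tied to the issue passes after the patch.
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests], cwd=workdir)
    return result.returncode == 0
```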

Google DeepMind’s SIMA pushes the same agent idea into games. SIMA uses mouse-and-keyboard control with pixel input, trained on data gathered from humans playing a wide range of games, including commercial titles and Google-made environments. The key finding is positive transfer: training across many games helps SIMA perform better on new games than agents specialized for a single environment. In some cases, the generalist approach even beats the specialist trained on that exact game, and performance is described as approaching human capability in the tested settings. The transcript links this to broader robotics lessons—co-training with skills from other platforms can enable novel task performance.
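Mechanically, SIMA's interface can be pictured as one policy that maps screen pixels plus a language instruction to generic keyboard-and-mouse events, with no game-specific API. The sketch below is an assumption about the shape of that interface (the Action, GeneralistPolicy, and play names are illustrative), not DeepMind's implementation.

```python
from dataclasses import dataclass, field
from typing import Protocol

import numpy as np


@dataclass
class Action:
    """Low-level inputs only, so the same action space works in every game."""
    keys: list[str] = field(default_factory=list)   # e.g. ["w"] to move forward
    mouse_dx: float = 0.0                           # horizontal mouse movement
    mouse_dy: float = 0.0
    click: bool = False


class GeneralistPolicy(Protocol):
    def act(self, frame: np.ndarray, instruction: str) -> Action: ...


def play(env, policy: GeneralistPolicy, instruction: str, steps: int = 600) -> None:
    """One generalist policy, many environments: only `env` changes per game."""
    frame = env.reset()
    for _ in range(steps):
        action = policy.act(frame, instruction)   # pixels + text in, key/mouse out
        frame = env.step(action)                  # the game renders the next frame
```

Because the action space is generic input events rather than a per-game API, demonstrations gathered in one title can shape behavior in another, which is where the positive transfer comes from.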

Figure 01 brings the concept into the physical world. It's described as a humanoid robot that processes multiple camera images per second and performs manipulation tasks, such as handling dishes, without teleoperation, with its intelligence attributed to GPT-4 Vision-style perception. The estimated cost of roughly $30,000 to $150,000 per robot still limits widespread deployment, but the company's roadmap aims at automating manual labor broadly, potentially reducing labor costs toward the price of renting robots and eventually letting humans "leave the loop."
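One plausible reading of that description is a two-rate loop: a vision-language model interprets camera frames a few times per second and names the next skill, while a faster low-level controller handles the dexterous execution in between. The sketch below is an assumption about that structure (SkillController, robot_loop, and the rates are illustrative), not Figure's actual stack.

```python
import time


class SkillController:
    """Stand-in for fast low-level control (grasping, arm trajectories)."""
    def execute(self, skill: str) -> None:
        print(f"executing skill: {skill}")


def robot_loop(camera, vlm, controller: SkillController, goal: str,
               reasoning_hz: float = 2.0) -> None:
    """Slow deliberate loop: the VLM looks at the scene a few times per second and
    picks a skill; the controller does the dexterous work between decisions."""
    period = 1.0 / reasoning_hz
    while True:
        frame = camera.read()                     # latest image from the robot's camera
        skill = vlm.complete(
            images=[frame],
            prompt=f"Goal: {goal}. What should the robot do next, or DONE?",
        )
        if skill.strip() == "DONE":
            break
        controller.execute(skill)
        time.sleep(period)
```

Under this framing the hardware and low-level controller stay fixed while the vlm argument can be replaced, the same "container" logic the transcript applies to Devin and SIMA.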

Across all three systems, the transcript repeatedly returns to a job-market tension: capabilities are improving fast, yet timelines remain uncertain and control over deployment is contested. Still, the overall direction is clear—agents are becoming operational, and their performance is increasingly tied to the next generation of underlying multimodal models and tool-augmented reasoning.

Cornell Notes

Agent systems like Devin, SIMA, and Figure 01 are built as “shells” around vision-language models, so upgrading the underlying model can produce large capability jumps quickly. Devin’s results on SWE-bench (nearly 14% on its tested subset) come from autonomous planning and codebase navigation, but the benchmark’s construction—favoring issues with new tests—may underrepresent the hardest software bugs. SIMA demonstrates positive transfer in games: training across many environments improves performance on new games compared with agents specialized to one title. Figure 01 applies similar vision-based intelligence to real-world manipulation, aiming to automate manual labor, though cost and deployment control remain major constraints. Together, the systems suggest rapid progress toward practical automation, with job impacts that are likely uneven and hard to predict.

What makes Devin different from earlier “AI coding assistants,” and what does SWE-bench measure?

Devin is described as an AI software engineer system with a code editor and browser, capable of reading documentation, planning steps, and executing multi-step changes autonomously. SWE-bench measures performance on real professional software engineering problems: 2,294 issues with corresponding accepted solutions. Success requires understanding and coordinating edits across multiple functions and files, often using long contexts to locate relevant code and apply precise patches rather than selecting from multiple-choice answers.
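Because the repositories involved are large, much of the work is simply finding where to edit before any patch is written. The toy function below illustrates only that localization step, under the crude assumption that relevance can be approximated by keyword overlap with the issue text; real systems rely on much stronger retrieval and long-context reading, and this is not how Devin or the SWE-bench baselines actually do it.

```python
from pathlib import Path


def rank_files_for_issue(repo_root: Path, issue_text: str, top_k: int = 5) -> list[Path]:
    """Toy localization: score each source file by how often issue keywords appear in it."""
    keywords = {word.lower() for word in issue_text.split() if len(word) > 4}
    scored: list[tuple[int, Path]] = []
    for path in repo_root.rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        scored.append((sum(text.count(k) for k in keywords), path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored[:top_k]]
```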

Why does the SWE-bench score need context before treating it as a universal measure of software ability?

The transcript highlights dataset construction choices that can bias results. SWE-bench draws from GitHub issues and focuses on pull requests that introduce new tests, which may make some bugs easier to detect and verify. It also narrows the set of tasks to a subset of GitHub issues and a subset of software engineering skills, meaning complex issues—especially those where writing a clear test is difficult—could be underrepresented.

What is the core technical takeaway from SIMA’s game-playing results?

SIMA's central claim is positive transfer. Training on many games improves performance on new games compared with an environment-specialized agent trained only for one game. The transcript notes that in some titles (e.g., Goat Simulator 3), the generalist approach can outperform the specialist, and it describes performance as approaching the ballpark of human performance in the tested scenarios.
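The comparison behind that claim is simple to write down: evaluate a generalist trained on many games and a specialist trained on one game over the same held-out tasks, then look at the gap in success rates. The helper below sketches that bookkeeping; the data shapes and numbers are illustrative, not SIMA's reported results.

```python
def success_rate(results: list[bool]) -> float:
    return sum(results) / len(results) if results else 0.0


def positive_transfer(generalist: dict[str, list[bool]],
                      specialist: dict[str, list[bool]]) -> dict[str, float]:
    """Per-game gap between a many-game generalist and a single-game specialist.
    A positive value means training on other games helped on this one."""
    return {
        game: success_rate(generalist[game]) - success_rate(specialist[game])
        for game in specialist
    }


# Illustrative numbers only (not SIMA's reported results).
print(positive_transfer(
    generalist={"goat_simulator_3": [True, True, False, True]},
    specialist={"goat_simulator_3": [True, False, False, True]},
))  # {'goat_simulator_3': 0.25}
```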

How does the transcript connect SIMA’s approach to robotics?

The argument is that co-training with skills from other platforms enables novel task performance. The transcript draws an analogy to robotics work where a robot (e.g., RT-2) benefits from additional skills developed by other robots, transferring those capabilities to new tasks—similar to how SIMA improves across game environments.

Why is Figure 01 framed as a “container” rather than a fully independent intelligence?

Figure 01's real-time dexterity and control are treated as impressive, but the intelligence for recognizing and acting on what's in the scene is attributed to the underlying vision-language model (described as GPT-4 Vision in the transcript). The implication is that swapping in a stronger model (like GPT-5) would deepen environmental understanding and improve performance without changing the robot's overall agent structure.

What job-market uncertainty runs through the three systems’ discussion?

The transcript emphasizes that capability gains are real but unpredictable in their economic impact. It notes public distress about job implications, argues that near-term replacement is not guaranteed, and points out that deployment and control are contested. It also cites predictions that software engineering may shift toward higher-level supervision while other roles could be automated, but it stresses that timelines and outcomes remain uncertain.

Review Questions

  1. How do SWE-bench’s dataset choices (e.g., reliance on pull requests that introduce new tests) potentially affect what “progress” looks like?
  2. What does positive transfer mean in SIMA’s results, and how does it compare to training a specialist agent for a single game?
  3. In the transcript’s framing, what role does the underlying vision-language model play in Figure 01’s real-world performance, and why does that matter for future upgrades?

Key Points

  1. Devin, SIMA, and Figure 01 are presented as agent "containers" built around vision-language models, so upgrading the underlying model can rapidly improve performance.
  2. Devin's SWE-bench performance is reported at nearly 14% on its tested subset, but benchmark design may bias results toward issues with testable solutions.
  3. SWE-bench problems require coordinated, multi-file code edits and long-context localization, not just single-line fixes or multiple-choice selection.
  4. SIMA's standout result is positive transfer: training across many games improves performance on new games versus agents specialized to one environment.
  5. Figure 01's intelligence is tied to vision-based model perception, and its roadmap targets broad automation of manual labor, constrained today by cost and deployment control.
  6. Across all three, job impacts are treated as uncertain and uneven, with near-term replacement not guaranteed even as capabilities improve quickly.

Highlights

Devin's SWE-bench score is reported at nearly 14%, far ahead of Claude 2 and GPT-4, but the benchmark's construction may underrepresent the hardest bugs.
SIMA’s generalist training across many games produces positive transfer, sometimes beating agents specialized for a single title.
Figure 01 is framed as end-to-end vision-driven manipulation without teleoperation, with performance linked to the underlying vision-language model.
A recurring theme is that agent shells can deliver big capability jumps when the underlying multimodal model is upgraded.
