
Gemini 3 Pro testing is unbelievable, and World Models are BACK! [AI NEWS]

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

World Labs’ RTFM renders interactive, navigable 3D-like scenes by generating frames in real time, functioning as a learned renderer rather than a true 3D simulation.

Briefing

World Labs’ “RTFM” (real-time frame model) is pushing the idea of controllable “world models” into interactive, browser-demo territory—rendering a shifting 3D-like environment frame by frame without an actual 3D scene behind it. Users can steer a viewpoint with game-style controls or a mouse, while the system generates new images in real time as the perspective changes. The catch is that the illusion is constrained: movement is limited, reflections and fine geometry degrade (mirrors blur into smeary, artifact-prone output), and higher-quality settings can feel choppy. Still, the core breakthrough is the learned-renderer approach—an end-to-end autoregressive diffusion transformer trained on large-scale video data that learns 3D geometry, reflections, and shadows purely from observation. Even though it needs a dedicated NVIDIA H100 GPU to run, the demo signals where “world model” interfaces are headed: interactive hallucinated spaces that feel navigable, even if they’re not truly physically simulated.
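
The demo itself isn’t open, but the interaction pattern it implies (steer a camera, receive a freshly generated frame) can be sketched as a small browser-side loop. This is a minimal sketch under assumptions: the /frame endpoint, the pose fields, and the control mapping are hypothetical placeholders rather than World Labs’ actual API; it only illustrates the learned-renderer idea of conditioning each new frame on the camera pose instead of querying a stored 3D scene.

```typescript
// Minimal sketch of the interaction loop the RTFM demo describes:
// the client holds no 3D scene, it just sends a camera pose and
// receives a freshly generated frame. The /frame endpoint and the
// pose fields are hypothetical placeholders, not a real API.

interface CameraPose {
  x: number;    // position in the hallucinated space
  z: number;
  yaw: number;  // look direction, in radians
}

const canvas = document.querySelector<HTMLCanvasElement>("#view")!;
const ctx = canvas.getContext("2d")!;
const pose: CameraPose = { x: 0, z: 0, yaw: 0 };
const keys = new Set<string>();

addEventListener("keydown", (e) => keys.add(e.key));
addEventListener("keyup", (e) => keys.delete(e.key));

async function step(): Promise<void> {
  // Game-style controls nudge the pose; the "world" only exists as
  // whatever the model renders for this pose next.
  if (keys.has("w")) { pose.x += Math.sin(pose.yaw) * 0.1; pose.z += Math.cos(pose.yaw) * 0.1; }
  if (keys.has("a")) pose.yaw -= 0.05;
  if (keys.has("d")) pose.yaw += 0.05;

  // Ask the learned renderer for the next frame conditioned on the pose
  // (and, implicitly, on the frames it has already generated).
  const res = await fetch("/frame", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pose),
  });
  const frame = await createImageBitmap(await res.blob());
  ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);

  requestAnimationFrame(step);
}

requestAnimationFrame(step);
```

The point the sketch captures is that nothing persistent sits behind the frames: spatial consistency has to come from the model itself, which is exactly where the mirror and fine-geometry artifacts described above creep in.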

Google’s Gemini thread runs in parallel, with three distinct directions getting attention: a web-capable “Gemini Agent” prototype, a flood of unofficial Gemini 3 Pro-style coding demos, and broader multimodal and agentic upgrades across the AI ecosystem. The Gemini Agent concept centers on task execution inside a browser—navigating to requested sites in Google AI Studio without full file-system or computer-like access. That matters because it points toward practical productivity workflows for everyday users already living in Chrome, where an agent could potentially act on behalf of someone across routine web tasks.

The biggest buzz, though, comes from leaked or semi-public Gemini 3 Pro demonstrations shared by community users, often tied to an “ECPT” checkpoint and accessed via A/B testing in Google AI Studio. The examples lean heavily into code generation that produces real, interactive artifacts: an Eiffel Tower built from voxels via HTML/CodePen code; Space Invaders clones with CRT-like effects, particles, screen shake, and more polished gameplay; and animation-oriented outputs like a polar bear biking under a starry sky where limbs, motion, and even scarf flapping are handled with surprising consistency. Other demos go further—simulated iPhone 3G and Game Boy experiences with working apps and games, OS-like recreations with functional utilities, and a “seahorse emoji” test where the model avoids a common trap by not forcing a nonexistent emoji. The most striking educational example is a custom DNA-unzipping learning animation with labeled elements and real-time strand construction, framed as a tool that could be customized for different learners.
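
For a sense of what “code that produces real, interactive artifacts” means here, the sketch below shows a screen-shake effect of the kind the Space Invaders demos are described as including. It is illustrative only, written for this summary rather than taken from any leaked demo; the element id and tuning constants are arbitrary.

```typescript
// Illustrative only: a tiny canvas loop with the kind of "screen shake"
// the Space Invaders demos are said to generate. Not code from the
// demos themselves, just an example of an interactive artifact (input,
// state, render loop) as opposed to a static image.

const cvs = document.querySelector<HTMLCanvasElement>("#game")!;
const g = cvs.getContext("2d")!;

let shake = 0; // shake energy in pixels, decays every frame

function hit(): void {
  shake = 8; // bump the shake whenever something explodes
}

function frame(): void {
  // Jitter the whole scene while shake energy remains, then decay it.
  const dx = (Math.random() - 0.5) * shake;
  const dy = (Math.random() - 0.5) * shake;
  shake *= 0.9;

  g.setTransform(1, 0, 0, 1, dx, dy);
  g.clearRect(-10, -10, cvs.width + 20, cvs.height + 20);

  // Placeholder "player ship": the described demos also draw invaders,
  // bullets, particles, and CRT-style glow on top of a loop like this.
  g.fillStyle = "#0f0";
  g.fillRect(cvs.width / 2 - 10, cvs.height - 30, 20, 10);

  requestAnimationFrame(frame);
}

addEventListener("keydown", (e) => { if (e.key === " ") hit(); });
requestAnimationFrame(frame);
```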

Beyond Gemini, the transcript highlights a broader acceleration: OpenAI’s Sora 2 is being integrated into production pipelines via Nvidia AI’s partner tooling; ChatGPT’s iOS app reportedly supports native video input by dragging clips into prompts; and Anthropic’s Claude adds “customizable skills” that load brand guidelines or slide-deck instructions from uploaded files to constrain outputs and generate structured deliverables. The overall throughline is clear: AI is moving from text-only answers toward interactive agents, code-generated software, and controllable media—while raising urgent questions about verification, misuse, and responsible labeling as video realism improves.

Cornell Notes

World Labs’ RTFM (“real-time frame model”) turns “world models” into an interactive illusion: users navigate a 3D-like space while the system generates new frames in real time, without an underlying real 3D world. The approach is described as a learned renderer—an end-to-end autoregressive diffusion transformer trained on large-scale video data that learns geometry, reflections, and shadows from observation. Google’s Gemini developments split into a browser-task “Gemini Agent” prototype and a wave of Gemini 3 Pro-style coding demos tied to an ECPT checkpoint, showing code that generates voxels, games, OS-like interfaces, and even animated learning tools. The practical significance is productivity and software creation on demand, but the transcript also flags risks around realism, tracking, and the need for responsible use and clear labeling.

How does World Labs’ RTFM create the illusion of a navigable 3D environment, and what are its limits?

RTFM generates frames in real time as the user changes viewpoint, using an AI “learned renderer” rather than a stored 3D scene. It’s described as an autoregressive diffusion transformer trained end to end on large-scale video data, learning 3D geometry, reflections, and shadows from training examples. In the demo, users steer as in a video game, but movement is limited and visual fidelity degrades in tricky cases—mirrors blur and reflections can look artifacted or “Gaussian-splat-esque.” Quality mode can be choppy, while a speed setting smooths motion at the cost of quality.

Why does the transcript treat the Gemini Agent prototype as potentially valuable for everyday users?

The prototype focuses on performing tasks on the web inside a browser workflow. It can navigate to a requested website, but it doesn’t open its own full browser session or gain file-system/computer-level access like a more general “computer-use” setup. The practical upside is that many users already do their work in Chrome with logged-in sites; an agent that can act within that environment could automate routine tasks such as browsing, checking pages, or other web-based work.

What evidence is cited for Gemini 3 Pro-style demos being generated by a newer model rather than hand-made content?

The transcript points to community-shared demos (notably from a trusted X user) that include AI-like artifacts and, crucially, direct code outputs. Examples include HTML/CodePen code that builds a 3D voxel Eiffel Tower, and checkpoint naming references like “ECPT.” The demos are also compared against Gemini 2.5 Pro using a Space Invaders prompt, where Gemini 3 (ECPT) is described as producing a more advanced, polished game with effects like glowing CRT styling, particles, and screen shake.

How do the coding demos illustrate a shift from “generating images” to “generating working software”?

Several examples go beyond static visuals: Space Invaders clones behave like playable games; iPhone 3G and Game Boy-style outputs include working apps and games; and OS-like recreations are described as having functional utilities such as Notepad, Snake, and a Safari browser that can load Wikipedia content. The transcript frames this as LLMs translating abstract prompts into code that produces interactive systems rather than just rendering a picture.

What is the significance of Claude’s “customizable skills” feature mentioned in the transcript?

Claude can load packaged instructions as skills from uploaded files. The demo uses brand guidelines stored in a zip file: Claude reads the document and then applies those constraints when generating outputs. Another skill uses PowerPoint documentation to execute steps and produce a pitch deck that follows the specified structure. The transcript also notes poster-design skills for concept posters, and suggests a future where users can download and reuse skills.
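
The transcript doesn’t spell out how skills are packaged or invoked under the hood, so the sketch below approximates the same behavior with Anthropic’s standard Messages API via the TypeScript SDK: a guidelines file is read from disk and injected as a system-level constraint. The file name and model id are placeholders, and this is a stand-in for the described skills mechanism, not a reproduction of it.

```typescript
// Rough approximation of the "skill" behavior described in the transcript:
// load a brand-guidelines document and have Claude apply it as a constraint.
// Uses the ordinary Messages API rather than the actual skills feature,
// whose packaging and upload details aren't covered here. The file path
// and model id are placeholders.

import { readFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function main(): Promise<void> {
  const guidelines = await readFile("brand-guidelines.md", "utf8");

  const reply = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 1024,
    // The guidelines act like a loaded skill: every output must follow them.
    system: `Follow these brand guidelines exactly when producing content:\n\n${guidelines}`,
    messages: [
      { role: "user", content: "Draft a one-page concept poster description for our fall launch." },
    ],
  });

  console.log(reply.content);
}

main();
```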

What risks and responsibilities are emphasized as AI media realism and capability increase?

The transcript raises concerns about verification and misuse as video generation improves—especially the difficulty of distinguishing AI-generated depictions from real evidence in contexts like crimes. It also stresses privacy and tracking risks (illustrated by a geolocation tool demo) and argues for responsible behavior: labeling AI-generated content, avoiding deception, and using AI for beneficial applications while acknowledging the potential for harm.

Review Questions

  1. What specific technical description is given for RTFM, and how does that description connect to why the environment can update in real time?
  2. In the Gemini 3 Pro demos, what kinds of outputs are used to argue the model is producing code that creates interactive artifacts (not just visuals)?
  3. How do “customizable skills” in Claude differ from simply writing a prompt, according to the transcript’s examples?

Key Points

  1. World Labs’ RTFM renders interactive, navigable 3D-like scenes by generating frames in real time, functioning as a learned renderer rather than a true 3D simulation.
  2. RTFM’s realism has clear failure modes—reflections and fine details degrade, and movement range is limited—even though the experience can feel like controlling a virtual space.
  3. Google’s Gemini Agent prototype focuses on browser-based task execution, aiming at practical productivity for users already working in Chrome-like web environments.
  4. Unofficial Gemini 3 Pro-style demos emphasize code generation that produces voxels, playable games, and OS-like interfaces, with checkpoint references such as ECPT cited as supporting evidence.
  5. Claude’s “customizable skills” let users upload instruction files (e.g., brand guidelines or slide-deck rules) so outputs follow structured constraints automatically.
  6. The transcript repeatedly ties capability gains to responsibility: labeling AI-generated media, considering privacy and tracking risks, and anticipating misuse as realism improves.

Highlights

RTFM can make a user-controlled “world” feel real by hallucinating a changing 3D-like environment frame by frame—no actual 3D space required.
Gemini 3 Pro-style demos are presented as code-generating systems: from voxel Eiffel Towers to Space Invaders with CRT effects, particles, and screen shake.
Claude’s skills feature turns uploaded documents (brand guidelines, PowerPoint rules) into reusable constraints that guide generation end to end.
ChatGPT’s iOS app is described as supporting native video input by dragging clips into prompts, enabling strong video understanding—while access is reportedly restricted.

Topics

  • World Models
  • RTFM
  • Gemini Agent
  • Gemini 3 Pro Demos
  • Agentic Skills

Mentioned