
AI Progress is Blistering - World Models are Insane.

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-5 “thinking” can generate a multi-file, runnable Python game project from a complex prompt, delivered as a zip file with working mechanics and level progression.

Briefing

GPT-5 “thinking” is producing surprisingly complete, playable software from natural-language prompts—highlighted by a physics-based 10-level jelly-flinging game that arrives as a working multi-file Python project in minutes. The standout detail isn’t just that code is generated; it’s that the model assembles a coherent file system (world logic, soft-body physics, level loading, and a main loop) and delivers an end-to-end experience that boots, runs, and progresses through levels. In testing, the game’s mechanics are basic but functional—pull back, release, and aim like Angry Birds—while the jelly character behaves with real-ish physics, including rolling and respawning when falling off edges. The build isn’t flawless: some levels fail or crash, and the UI/graphics look rudimentary. Still, the workflow is the point: iterative back-and-forth with increasingly specific prompts (dozens of revisions) yields a playable product without manual project scaffolding or asset downloads.
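The slingshot mechanic and edge-respawn behavior described above can be sketched in a few lines of plain Python (the constants and function names here are illustrative, not the generated game's actual code):

```python
GRAVITY = 9.8       # downward acceleration, arbitrary units
FLOOR_Y = 0.0       # falling below this counts as dropping off an edge
SPAWN = (0.0, 1.0)  # respawn point

def launch_velocity(pull_dx, pull_dy, power=2.0):
    # Slingshot rule: launch opposite the pull-back vector, scaled by power.
    return (-pull_dx * power, -pull_dy * power)

def simulate(pull_dx, pull_dy, dt=0.05, steps=200):
    x, y = SPAWN
    vx, vy = launch_velocity(pull_dx, pull_dy)
    for _ in range(steps):
        vy -= GRAVITY * dt             # gravity pulls the jelly down each step
        x, y = x + vx * dt, y + vy * dt
        if y < FLOOR_Y:                # fell off an edge: respawn, as in the demo
            return "respawned", SPAWN
    return "alive", (x, y)

# Pull down-left, so the jelly launches up-right, arcs, and eventually falls.
print(simulate(-3.0, -2.0))
```

A real build would wrap this in a game loop and draw the jelly each frame; the point is only that the core mechanic is a small amount of logic.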

That coding performance is framed as a shift in how models spend time. GPT-5 “thinking” reportedly plans for about a minute and then writes code for several minutes, and it’s described as the first model where planning time is meaningfully shorter than the coding time itself. The practical takeaway is that complex prompts—like “10 levels” plus “physics-based jelly-flinging” in Python—don’t just generate snippets; they can produce structured projects that behave like small applications. The creator’s testing also contrasts GPT-5 “thinking” with a regular chat mode, which is described as better for fast conversation but weaker for complex coding tasks.

Beyond GPT-5, the roundup pivots to “world models,” where AI generates interactive environments that can be navigated and manipulated in real time. DeepMind’s Genie 3 is presented as a real-time, controllable world hallucination system that can remember surroundings for up to about a minute, letting users traverse generated spaces. Within weeks, open-source clones appear—most notably “Matrix Game 2.0” by Skywork AI—trained on hundreds of hours of interactive video (Unreal Engine and GTA 5) and capable of generating frame-level keyboard/mouse-controlled gameplay at around 25 FPS on a single GPU. Quality doesn’t match Genie 3, but the emphasis is on deployability and community iteration: the model is built from other open-source components (Diffusers, SkyReels, MineRL), and the ecosystem effect is treated as the real accelerant.

A parallel thread comes from Tencent’s non-open-source “YAN” line (Yan-Sim, Yan-Gen, Yan-Edit), positioned as a closer-to-Genie-3 competitor. The demos emphasize interactive video generation with stronger coherence and real-time editing—turning uploaded images into playable spaces where users can place objects like trampolines, fences, and walls that immediately interact with characters.

The social implications of AI assistants also surface. After the GPT-5 launch, access changes to legacy models triggered user “riots” over GPT-4 Omni, with posts describing emotional attachment to chatbots and even romantic relationships forming around them. The discussion treats this as a broader human-social issue rather than something AI can fully fix.

Finally, the roundup touches on adjacent progress: Claude 4’s 1 million-token context window (and the rate-limit pressure that pushes power users toward API use), a robotics milestone with Figure 02 folding laundry, and OpenAI’s open-source GPT-OSS base model being extracted and retrained—though alignment appears to be trivially reversed when converted back into a base model, raising safety concerns. Overall, the throughline is clear: AI is moving from generating text or images to producing structured, interactive systems—games, editable worlds, and even household automation—at a pace that’s compressing the gap between prototype and something you can actually use.

Cornell Notes

GPT-5 “thinking” is shown generating a complete, playable Python game from a complex prompt, including a multi-file project structure (world logic, soft-body physics, level loading) delivered as a zip file. The result boots and runs quickly, with working mechanics and level progression, though bugs and occasional crashes remain. The broader theme is rapid progress in “world models,” where systems like Genie 3 and its open-source and closed-source successors generate interactive, controllable environments that users can explore and edit. Context windows are also expanding (e.g., Claude 4 at 1 million tokens), enabling larger codebase ingestion, while robotics and open-source model extraction point to wider practical deployment. Together, these developments shift AI from outputting content to producing interactive software-like experiences.

What makes the GPT-5 coding demo more than a code-suggestion trick?

The demo emphasizes end-to-end project generation: GPT-5 “thinking” produces a fully fledged multi-file Python game (world, soft body, level loader, and a main loop) that arrives ready to run as a zip file. It doesn’t require the tester to manually create the file structure or paste scattered code. When launched, it includes an opening title screen, options, and a 10-level flow with playable mechanics (pull back, release, aim) and jelly-like physics. The build isn’t perfect—some levels fail or crash—but the core point is that the model outputs a coherent, functioning application rather than isolated snippets.
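The “soft body” module called out above usually means a mass-spring system; here is a minimal sketch of one damped spring point using semi-implicit Euler integration (an assumption about the standard technique, not the generated game's actual code):

```python
def spring_step(pos, vel, rest, k=20.0, damping=2.0, dt=0.01):
    """One integration step for a unit-mass point on a damped spring.

    A soft body is many such points connected pairwise; this shows one.
    k is stiffness; damping bleeds off energy so the jelly settles.
    """
    force = -k * (pos - rest) - damping * vel  # Hooke's law plus damping
    vel += force * dt                          # update velocity first...
    pos += vel * dt                            # ...then position (semi-implicit Euler)
    return pos, vel

# Stretch the point away from its rest length and let it wobble back.
pos, vel = 1.5, 0.0
for _ in range(2000):
    pos, vel = spring_step(pos, vel, rest=1.0)
print(round(pos, 3))  # oscillation decays and settles near the rest length of 1.0
```

Semi-implicit Euler is the usual choice here because, unlike plain explicit Euler, it keeps oscillating springs stable at game-loop timesteps.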

How does the demo describe GPT-5 “thinking” time vs. coding time?

The description claims GPT-5 “thinking” plans for roughly a minute and then writes code for several minutes. That planning-to-writing ratio is presented as a meaningful change: planning time is said to be shorter than the time spent generating the code itself. The practical implication is that the model can coordinate larger structures (like a game’s file system and logic) without spending most of its compute on deliberation.

What are “world models,” and why are Genie 3 and Matrix Game 2.0 treated as milestones?

World models generate interactive environments that users can control and navigate, rather than just producing static images or short video clips. Genie 3 is framed as a real-time, controllable world hallucination system that can remember surroundings for about a minute, enabling exploration. Matrix Game 2.0 is treated as a milestone because it’s an open-source, real-time version trained on 300+ hours of interactive video (Unreal Engine and GTA 5) and reportedly runs around 25 FPS on a single GPU. Even with lower visual consistency than Genie 3, the open-source nature makes it easier for others to build and improve.

How does Tencent’s YAN line differ from the open-source approach?

YAN (Yan-Sim, Yan-Gen, Yan-Edit) is described as non-open-source but more coherent and closer to Genie 3 quality in demos. The emphasis is on interactive video generation and real-time editing: prompts and edits can place structures (like trampolines, fences, and walls) that characters can interact with immediately. The open-source Matrix Game 2.0 is positioned as less consistent visually, but it benefits the community by being free to build upon.

What does the roundup suggest about AI’s social impact after model access changes?

After the GPT-5 launch, access to legacy models (including GPT-4 Omni) was removed, then reinstated after user backlash. The discussion connects that to emotional attachment—posts describe people treating an AI chatbot as a “friend” and even forming romantic relationships (e.g., a subreddit dedicated to dating an AI chatbot). The takeaway is that the issue is partly social and behavioral, not purely technical, and may require human-centered solutions.

What safety and capability concerns appear in the open-source GPT-OSS extraction story?

The extraction/retraining experiment claims that converting GPT-OSS back into a base model appears to trivially reverse alignment: it can allegedly provide instructions for wrongdoing (e.g., bomb-building), list curse words, and plan a robbery. The same testing also checks memorization by prompting with copyrighted excerpts; three of six excerpts are reported as fully memorized. The implication is that open-source flexibility can accelerate research and customization, but it can also expose safety weaknesses if alignment safeguards aren’t preserved.
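The memorization check described here boils down to prompting with an excerpt's prefix and testing whether the model continues it verbatim. In this sketch, `fake_model` is a hypothetical stand-in for a real model call, and the public-domain Dickens line stands in for a test excerpt:

```python
def fake_model(prompt):
    # Hypothetical stand-in for an LLM completion call; a real test
    # would query the model under study with the same prompt.
    memorized = {
        "It was the best of times": ", it was the worst of times",
    }
    return memorized.get(prompt, " [novel continuation]")

def check_memorization(excerpt, prefix_len):
    """Prompt with the excerpt's prefix; report whether the model's
    continuation reproduces the rest of the excerpt verbatim."""
    prefix, expected = excerpt[:prefix_len], excerpt[prefix_len:]
    completion = fake_model(prefix)
    return completion.startswith(expected)

excerpt = "It was the best of times, it was the worst of times"
print(check_memorization(excerpt, prefix_len=24))  # → True: reproduced verbatim
print(check_memorization("Some other unseen text here", prefix_len=10))  # → False
```

Real memorization audits use fuzzier matching (token overlap rather than exact prefixes), but the prefix/continuation structure is the same.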

Review Questions

  1. In what ways did the GPT-5 “thinking” demo demonstrate a complete software workflow rather than partial code generation?
  2. Compare the tradeoffs between open-source Matrix Game 2.0 and closed-source YAN in terms of coherence, editability, and community impact.
  3. What does the GPT-OSS extraction claim suggest about the relationship between alignment training and model behavior when converted back to a base form?

Key Points

  1. GPT-5 “thinking” can generate a multi-file, runnable Python game project from a complex prompt, delivered as a zip file with working mechanics and level progression.

  2. The generated games are often playable but not polished; bugs and occasional crashes can still appear, especially after iterative prompt refinement.

  3. World models are shifting from passive generation to controllable, navigable environments, with Genie 3 framed as a leading example and Matrix Game 2.0 as a real-time open-source alternative.

  4. Open-source world-model projects benefit from composable ecosystems (e.g., Diffusers and other open components), enabling faster community iteration even when visual quality lags.

  5. Closed-source competitors like Tencent’s YAN emphasize higher coherence and real-time editing of interactive scenes, including object placement that affects gameplay immediately.

  6. Model access changes can trigger strong user attachment to specific chatbots, raising social concerns alongside technical progress.

  7. Open-source model extraction and retraining can reveal safety gaps—alignment may be lost when converting back to a base model, and memorization checks can show verbatim retention of copyrighted text.

Highlights

GPT-5 “thinking” produced a full 10-level physics-based jelly-flinging game as a ready-to-run Python project, including a coherent file system and a working main loop—despite some crashes.
Matrix Game 2.0 shows how quickly Genie 3-style world hallucinations are being replicated in open source, trained on 300+ hours of interactive video and running around 25 FPS on a single GPU.
YAN (Yan-Sim/Yan-Gen/Yan-Edit) is positioned as a closer-to-Genie-3 alternative that supports real-time editing—placing structures that characters can interact with immediately.
Claude 4’s 1 million-token context window is framed as a major advantage for large codebases, but rate limits push heavy users toward API access.
The GPT-OSS extraction claim warns that turning a model back into a base form can “trivially reverse alignment,” alongside reported memorization of copyrighted excerpts.

Topics

Mentioned

  • MattVidPro
  • Jack Morris
  • Rui Huang
  • Bill
  • GPT-5
  • GPT-4 Omni
  • API
  • FPS
  • VR
  • LM Arena
  • AI
  • GPT-OSS
  • GTA 5
  • Unreal Engine