SORA 2 Storyboard mode, Google VEO 3.1 & other updates!
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A new wave of Gemini 3 demos is pushing AI beyond “generate a clip” toward “recreate software,” with users reporting models that can produce functional-looking operating systems and full UI flows from simple prompts. Early, non-public examples show Gemini 3 generating SVG code for an Xbox 360 controller from a single prompt, then escalating to recreations of Mac OS X and Windows-like interfaces, complete with a functioning Finder, Safari loading a Wikipedia page, resizable and movable windows, and even interactive tools like a sketchpad and a terminal. The standout theme is one-shot UI construction: instead of piecemeal coding, the model appears to generate the code and the interface assets in a single pass, including icons and app behaviors that would normally require significant engineering time.
That capability matters because it targets one of the hardest gaps in current generative AI: reliability under complex, multi-component constraints. Recreating an operating system isn’t just about visual similarity; it requires coherent layout, correct interactions, and internal consistency across many UI elements. In the demos, Gemini 3 also seems to handle “steep use cases” such as building multiple apps and wiring them to pre-built behaviors, which is a different challenge than producing a single image or short animation.
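To picture what “code-based image” output means here, the snippet below writes a toy SVG controller from Python. It is purely illustrative: every shape, color, and the controller.svg file name are invented, and it makes no claim about what Gemini 3 actually produced in the demos.

```python
# Toy illustration only: a minimal SVG "controller" in the spirit of the
# reported demo. Shapes, colors, and file name are invented for illustration.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="320" height="200">
  <rect x="20" y="60" width="280" height="100" rx="50" fill="#e8e8e8"/>
  <circle cx="90" cy="110" r="18" fill="#555"/>   <!-- left stick -->
  <circle cx="240" cy="95" r="8" fill="#fc0"/>    <!-- Y button (top) -->
  <circle cx="240" cy="125" r="8" fill="#3c3"/>   <!-- A button (bottom) -->
  <circle cx="225" cy="110" r="8" fill="#36f"/>   <!-- X button (left) -->
  <circle cx="255" cy="110" r="8" fill="#f33"/>   <!-- B button (right) -->
</svg>"""

with open("controller.svg", "w") as f:
    f.write(svg)
```

The point of the demo, by contrast, is that the model produces this kind of renderable code directly from a natural-language prompt, at far greater fidelity, in one pass.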
While Gemini 3’s timeline remains uncertain, Google has already moved forward on its video model line. DeepMind’s upgrade from Veo 3 to Veo 3.1 brings quality claims (better textures, realism, and audio) plus new tooling aimed at control. “Ingredients to video” has been expanded in Flow so creators can supply up to three references for one generation. An extension feature lets users lengthen existing generations while trying to maintain coherence. Most notably, “first and last frames” introduces a controllability mechanism where the model must pass through a specified start and end image; a barn-to-cowboy reveal example illustrates how the system can transform a scene across time.
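One way to picture these controls is as fields on a generation request. The sketch below is a hypothetical payload only: the field names (reference_images, first_frame, last_frame, extend_from), the file names, and the model identifier are invented for illustration and do not reflect Flow’s or the Veo API’s actual schema.

```python
# Hypothetical request payload illustrating Veo 3.1's described controls.
# All field names are invented; this is NOT the real Flow/Veo API.
request = {
    "model": "veo-3.1",                # assumed identifier
    "prompt": "A weathered barn door swings open to reveal a cowboy at dusk",
    "reference_images": [              # "ingredients to video": up to three
        "barn.png",
        "cowboy.png",
    ],
    "first_frame": "barn_closed.png",  # generation must start on this image...
    "last_frame": "cowboy_reveal.png", # ...and end on this one
    "extend_from": None,               # or the ID of an earlier generation
}

assert len(request["reference_images"]) <= 3, "Flow caps references at three"
```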
Even with those additions, Sora 2 remains the benchmark in side-by-side comparisons. The transcript’s comparisons emphasize Sora 2’s stronger character consistency, deeper motion understanding, and more convincing mouth movement during dialogue, alongside better shot-to-shot continuity. Veo 3.1’s audio can sound more realistic in some cases, but movement and timing mismatches show up more often, especially when prompts stretch the model.
On the OpenAI side, Sora 2’s update is centered on longer generations and a new “storyboard mode.” All users can now generate up to 15 seconds, while pro users get up to 25 seconds in storyboard mode. Storyboards let creators sketch a video second by second, with a prompt bar that can auto-generate storyboard scenes using ChatGPT. Early tests show the feature works best with simpler prompts; overly dense, multi-object scenarios can overwhelm the model and produce “AI madness.” The storyboard UI also supports importing existing videos into a storyboard, deleting scenes, fitting scenes to duration, and uploading reference photos.
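Conceptually, a storyboard is a timed list of scene prompts that must fit the clip’s total length. The minimal Python sketch below models that idea; the Scene and Storyboard classes and the fits() check are invented for illustration and are not Sora’s actual data model.

```python
from dataclasses import dataclass, field

MAX_SECONDS_PRO = 25  # storyboard cap for pro users (15 for all users)

@dataclass
class Scene:
    prompt: str      # what happens in this beat
    seconds: float   # how long the beat should run

@dataclass
class Storyboard:
    scenes: list[Scene] = field(default_factory=list)

    def total(self) -> float:
        return sum(s.seconds for s in self.scenes)

    def fits(self, cap: float = MAX_SECONDS_PRO) -> bool:
        # mirrors the "fit scenes to duration" idea from the storyboard UI
        return self.total() <= cap

board = Storyboard([
    Scene("Wide shot: a barn at dusk, door closed", 8),
    Scene("Door creaks open, a silhouette appears", 9),
    Scene("Close-up: cowboy steps into the light", 8),
])
assert board.fits()  # 25 seconds total, exactly at the pro cap
```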
The overall picture: Gemini 3 hints at a future where AI can produce functional software-like outputs in one shot, while video models are racing toward more controllable, longer-form generation. For now, Sora 2’s control and coherence lead in many practical comparisons, and storyboard mode is the clearest step toward turning creative intent into structured, frame-level direction.
Cornell Notes
Gemini 3 demos (reported via limited access) suggest a major leap in “one-shot” software-like generation: from producing code for an Xbox 360 controller image to recreating OS-style interfaces such as Mac OS X and Windows-like layouts with interactive elements. The key significance is constraint handling: building coherent multi-part UI behavior and assets in a single generation. Meanwhile, Google’s Veo 3.1 upgrade adds tools for control and production workflow, including “ingredients to video” (up to three references), generation extension, and “first and last frames.” In comparisons, Sora 2 is still favored for character consistency and motion understanding, even when Veo 3.1’s audio can sound more lifelike. OpenAI’s Sora 2 storyboard mode adds structured, second-by-second planning with longer outputs (15 seconds for all users; 25 seconds for pro users in storyboard mode).
What makes the Gemini 3 demos feel different from typical image/video generation?
How do Veo 3.1’s new tools change the way creators can direct a generation?
Why does Sora 2 still come out ahead in the comparisons described?
What is storyboard mode in Sora 2, and what does it enable?
What limits show up when using Sora 2 storyboard mode?
Review Questions
- Which Gemini 3 behavior in the demos suggests it can handle multi-component constraints beyond visual generation?
- What three Veo 3.1 control features are named, and how does each one affect the generation workflow?
- In Sora 2 storyboard mode, what kinds of prompts tend to work best, and what failure mode appears with overly complex instructions?
Key Points
1. Gemini 3 demos reported via limited access suggest “one-shot” generation of software-like outputs, including OS-style interfaces with interactive elements.
2. A simple SVG prompt for an Xbox 360 controller reportedly produced code-based image results, while later examples escalated to Mac OS X and Windows-like recreations.
3. Google’s Veo 3.1 upgrade adds control and workflow tools: ingredients to video (up to three references), generation extension, and first/last frames.
4. Side-by-side comparisons in the transcript favor Sora 2 for character consistency, mouth movement during dialogue, and deeper motion understanding, even when Veo 3.1 audio can sound more realistic.
5. Sora 2’s update introduces storyboard mode plus longer generations: 15 seconds for all users and up to 25 seconds for pro users in storyboard mode.
6. Storyboard mode works best with simpler prompts; dense, multi-object instructions can overwhelm the system and produce incoherent results.
7. Storyboard mode is still beta-like: scene timing and exact precision aren’t guaranteed, and auto-generated storyboards can be finicky (e.g., missing @-mentions).