
SORA 2 Storyboard mode, Google VEO 3.1 & other updates!

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini 3 demos reported via limited access suggest “one-shot” generation of software-like outputs, including OS-style interfaces with interactive elements.

Briefing

A new wave of Gemini 3 demos is pushing AI beyond “generate a clip” into “recreate software,” with users reporting models that can output working-looking operating systems and full UI flows from simple prompts. Early, non-public access examples show Gemini 3 generating an Xbox 360 controller from an SVG prompt, then escalating to recreations of Mac OS X and Windows-like interfaces—complete with elements such as a functioning Finder, Safari loading a Wikipedia page, resizable/movable windows, and even interactive tools like a sketchpad and a terminal. The standout theme is one-shot UI construction: instead of piecemeal coding, the model appears to generate the code and the interface assets in a single pass, including icons and app behaviors that normally require significant engineering time.

That capability matters because it targets one of the hardest gaps in current generative AI: reliability under complex, multi-component constraints. Recreating an operating system isn’t just about visual similarity; it requires coherent layout, correct interactions, and internal consistency across many UI elements. In the demos, Gemini 3 also seems to handle “steep use cases” such as building multiple apps and wiring them to pre-built behaviors, which is a different challenge than producing a single image or short animation.

While Gemini 3’s timeline remains uncertain, Google has already moved forward on its video model line. DeepMind’s upgrade from Veo 3 to Veo 3.1 adds quality claims (better textures, realism, and audio) plus new tooling aimed at control. “Ingredients to video” is expanded on Flow so creators can supply up to three references for one generation. An extension feature lets users extend existing generations to lengthen outputs while trying to maintain coherence. Most notably, “first and last frames” introduces a controllability mechanism where the model must pass through a specified start and end image; a barn-to-cowboy reveal example illustrates how the system can transform a scene across time.

Even with those additions, Sora 2 remains the benchmark in side-by-side comparisons. The transcript’s comparisons emphasize Sora 2’s stronger character consistency, deeper motion understanding, and more convincing mouth movement during dialogue, alongside better shot-to-shot continuity. Veo 3.1’s audio can sound more realistic in some cases, but movement and timing mismatches show up more often, especially when prompts stretch the model.

On the OpenAI side, Sora 2’s update is centered on longer generations and a new “storyboard mode.” Pro users can generate up to 25 seconds in storyboard mode, while free users get up to 15 seconds. Storyboards let creators sketch video second-by-second, with a prompt bar that can auto-generate storyboard scenes using ChatGPT. Early tests show the feature works best with simpler prompts; overly dense, multi-object scenarios can overwhelm the model and produce “AI madness.” The storyboard UI also supports importing existing videos into a storyboard, deleting scenes, fitting scenes to duration, and uploading reference photos.

The overall picture: Gemini 3 hints at a future where AI can produce functional software-like outputs in one shot, while video models are racing toward more controllable, longer-form generation. For now, Sora 2’s control and coherence lead in many practical comparisons, and storyboard mode is the clearest step toward turning creative intent into structured, frame-level direction.

Cornell Notes

Gemini 3 demos (reported via limited access) suggest a major leap in “one-shot” software-like generation: from producing code for an Xbox 360 controller image to recreating OS-style interfaces such as Mac OS X and Windows-like layouts with interactive elements. The key significance is constraint-handling—building coherent multi-part UI behavior and assets in a single generation. Meanwhile, Google’s Veo 3.1 upgrade adds tools for control and production workflow, including “ingredients to video” (up to three references), generation extension, and “first and last frames.” In comparisons, Sora 2 is still favored for character consistency and motion understanding, even when Veo 3.1’s audio can sound more lifelike. OpenAI’s Sora 2 storyboard mode adds structured, second-by-second planning with longer outputs (15 seconds for all users; 25 seconds for pro users in storyboard mode).

What makes the Gemini 3 demos feel different from typical image/video generation?

The reported examples emphasize one-shot recreation of complex, software-like systems. Instead of generating a single image or short clip, Gemini 3 appears to generate the code and assets needed for UI behavior in one pass—such as Mac OS X-like desktops with a working Finder, a Safari browser that loads a Wikipedia page, resizable/movable windows, and a sketchpad. Another trend is recreating operating systems and interfaces from simple prompts, which requires internal consistency across many components.

How do Veo 3.1’s new tools change the way creators can direct a generation?

Veo 3.1 adds multiple control mechanisms: “ingredients to video” lets users provide up to three references for one video; an extension feature allows extending an existing generation to increase length while aiming to keep coherence; and “first and last frames” lets creators specify start and end frames, forcing the model to transform between them. The transcript highlights the barn-to-cowboy reveal as an example of how first/last framing can drive a time-based transformation.

Why does Sora 2 still come out ahead in the comparisons described?

Across several side-by-side examples, Sora 2 is credited with stronger character consistency and deeper motion understanding. Dialogue scenes are described as having more convincing mouth movement, and action sequences are said to maintain coherence better when shots change. Veo 3.1 sometimes delivers more realistic audio, but movement/timing mismatches and less convincing continuity show up more often in the cited comparisons.

What is storyboard mode in Sora 2, and what does it enable?

Storyboard mode introduces a second-by-second, frame-by-frame planning workflow. Pro users can generate up to 25 seconds in storyboard mode, while all users can generate up to 15 seconds. The interface includes a prompt bar that can auto-generate storyboard scenes via ChatGPT, plus controls like fitting scenes to duration, deleting scenes, and uploading reference photos. There’s also an option to convert an existing video back into a storyboard.

What limits show up when using Sora 2 storyboard mode?

The transcript warns that storyboard mode can be overwhelmed by hypercomplex prompts. In longer 25-second storyboard tests, dense multi-object instructions (e.g., a Rube Goldberg theme park) can “melt into AI madness,” and the model may not match scene timing precisely (e.g., scene durations like 1.67 seconds aren’t reliably exact). Storyboard is described as a beta feature, and some “wishy-washiness” remains: it is helpful for structure, but not a precision tool.

Review Questions

  1. Which Gemini 3 behavior in the demos suggests it can handle multi-component constraints beyond visual generation?
  2. What three Veo 3.1 control features are named, and how does each one affect the generation workflow?
  3. In Sora 2 storyboard mode, what kinds of prompts tend to work best, and what failure mode appears with overly complex instructions?

Key Points

  1. Gemini 3 demos reported via limited access suggest “one-shot” generation of software-like outputs, including OS-style interfaces with interactive elements.
  2. A simple SVG prompt for an Xbox 360 controller reportedly produced code-based image results, while later examples escalated to Mac OS X and Windows-like recreations.
  3. Google’s Veo 3.1 upgrade adds control and workflow tools: ingredients to video (up to three references), generation extension, and first/last frames.
  4. Side-by-side comparisons in the transcript favor Sora 2 for character consistency, mouth movement during dialogue, and deeper motion understanding, even when Veo 3.1 audio can sound more realistic.
  5. Sora 2’s update introduces storyboard mode plus longer generations: 15 seconds for all users and up to 25 seconds for pro users in storyboard mode.
  6. Storyboard mode works best with simpler prompts; dense, multi-object instructions can overwhelm the system and produce incoherent results.
  7. Storyboard mode is still beta-like: scene timing and exact precision aren’t guaranteed, and auto-generated storyboards can be finicky (e.g., missing @-mentions).

Highlights

Gemini 3 examples reportedly go beyond visuals into generating working-looking operating system interfaces—suggesting a shift toward functional, code-and-UI-level generation.
Veo 3.1’s most controllable addition is “first and last frames,” which can drive a transformation across time (illustrated by a barn-to-cowboy reveal).
Sora 2’s storyboard mode turns video creation into a second-by-second planning workflow, but complex prompts can easily overwhelm it.
In the transcript’s comparisons, Sora 2 is repeatedly favored for coherence and character consistency, while Veo 3.1’s audio realism sometimes stands out.
