
Even More MASSIVE Video AI Upgrades & New Models!!! it just does not stop!

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

STG is positioned as a spatio-temporal guidance method that improves realism and detail while maintaining consistency across video frames.

Briefing

Spatio-temporal skip guidance (STG) is emerging as a practical upgrade that makes video diffusion models produce sharper details and more consistent motion across frames—without abandoning the familiar classifier-free guidance workflow. In side-by-side demos, smoke looks more physically grounded, facial features sharpen instead of turning into blobs, and motion stays coherent rather than “morphing” between frames. The most striking claim is temporal scaling: guidance that works for single images is one thing, but STG is presented as a way to keep improvements stable across an entire clip, including hair, lighting highlights, and background structures like trees, clouds, and castles.

STG is positioned as a “no-compromises” add-on that can run on its own or alongside standard classifier-free guidance. Examples highlight the difference: a butterfly that’s blurry and mushy in the baseline becomes crisp enough to show individual legs and antennae; a woman’s lips and teeth form more clearly instead of collapsing into random skin-like shapes; and a 3D cat’s viewpoint stays consistent during a jump rather than flipping positions abruptly. The workflow is also framed as accessible for open-source video generators, with the demos attributed to the Mochi generator running with STG enabled. The same technique is said to be applicable to other systems such as Stable Video Diffusion and Open Sora, with the promise of improvements “across the board.”
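To make the “skip” idea concrete, here is a minimal Python sketch of how a deliberately weakened prediction can be produced by skipping transformer blocks inside a video diffusion denoiser. The model interface (`embed`, `blocks`, `unembed`) and the `skip_blocks` argument are hypothetical stand-ins for a generic DiT-style denoiser, not Mochi’s or the STG authors’ actual code.

```python
def predict_noise(model, x_t, t, cond, skip_blocks=()):
    """Denoiser forward pass that can skip selected transformer blocks.

    Skipping blocks yields the deliberately 'weaker' prediction that
    skip guidance contrasts against the full model's output. The model
    interface here is a hypothetical stand-in, not real Mochi code.
    """
    h = model.embed(x_t, t, cond)            # patchify frames + add time/text conditioning
    for i, block in enumerate(model.blocks):
        if i in skip_blocks:                 # the "skip" in spatio-temporal skip guidance
            continue                         # treat this block as an identity map
        h = block(h)
    return model.unembed(h)                  # project back to a noise prediction
```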

The broader theme is that video quality is accelerating through both algorithmic guidance and new model releases. A major new fully open-source video generation model from Tencent is described as having strong physics and lighting understanding, with open weights and code. The tradeoff is compute: early users reportedly need around 60 GB of VRAM, far beyond typical consumer GPUs. Still, the open-source framing is treated as a community win—lowering the barrier for experimentation and potentially pushing hardware requirements down over time.

The Tencent demos emphasize complex, multi-object scenes: consistent chair patterns while people move in the background, slow-motion fire lighting a child’s face realistically, camels walking with relatively stable feet, and sand behaving plausibly as it falls from a person’s hands. Even a cat eating a cheeseburger is cited as a difficult test for leg coordination and multi-part object continuity, with the model described as handling it “pretty darn well” despite minor glitches.

Alongside model upgrades, new tools target specific styles. MiniMax (Hailuo AI) introduces I2V-01-Live, aimed at transforming 2D illustration art into more dynamic motion—especially anime-like sequences—while maintaining character consistency. The tool is explicitly not presented as a system that takes an existing video of someone speaking and transfers that motion; instead, it’s trained for illustration-driven animation, with demos showing hair moving independently and frame-hold behavior that resembles traditional animation timing.

Finally, Google’s Veo is mentioned as available via Vertex AI in private preview, with early impressions described as less impressive than newer top-tier models—suggesting it may have lost momentum since its announcement. The segment closes with speculation that OpenAI could be preparing another release soon, with community chatter centered on Sora and other OpenAI developments, though no concrete confirmation is provided.

Cornell Notes

Spatio-temporal skip guidance (STG) is presented as an add-on for video diffusion models that improves detail and realism while keeping motion consistent across frames. Demos claim STG sharpens subjects (smoke, faces, animals) and reduces frame-to-frame “morphing,” working alongside classifier-free guidance. The technique is attributed to Mochi-based examples and is said to transfer to other generators like Stable Video Diffusion and Open Sora. In parallel, Tencent’s newly released fully open-source video model is highlighted for physics and lighting, though it reportedly needs about 60 GB of VRAM. Together, these developments point toward higher-quality AI video generation becoming more accessible through both better guidance methods and open model releases.

What is spatio-temporal skip guidance (STG), and why does it matter for video diffusion?

STG is described as a guidance mechanism attached to video diffusion models that steers generation toward more accurate, higher-quality outputs. The key difference from image-only guidance is temporal behavior: STG is claimed to enhance results consistently across all frames, not just in single snapshots. In demos, that shows up as sharper smoke physics, clearer facial features, and fewer abrupt changes between frames (like a cat flipping viewpoint).
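In equation form (a sketch using standard guidance notation, not anything quoted in the video), skip guidance extrapolates the full model’s prediction away from a weakened, layer-skipped prediction:

$$
\tilde{\epsilon}(x_t, c) \;=\; \epsilon_\theta(x_t, c) \;+\; s_{\mathrm{stg}}\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_{\theta \setminus \ell}(x_t, c)\bigr)
$$

where $\epsilon_{\theta \setminus \ell}$ denotes the same denoiser with a set of blocks $\ell$ skipped and $s_{\mathrm{stg}}$ controls the guidance strength.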

How does STG relate to classifier-free guidance (CFG)?

STG is presented as able to work independently or alongside the standard classifier-free guidance used in many image/video diffusion systems. The comparison shown pairs normal CFG (top) against CFG plus STG (bottom), with the STG version producing more detailed smoke, more realistic physics, and more human-like facial structure rather than uncanny or blob-like artifacts.
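For intuition, here is a hedged sketch of how the two guidance terms might combine in a single denoising step, reusing the hypothetical `predict_noise` helper from the earlier sketch. The guidance scales and the skipped block index are illustrative defaults, not values reported in the video.

```python
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, text_emb, null_emb,
                 cfg_scale=7.0, stg_scale=1.0, skip_blocks=(14,)):
    """One denoising step combining classifier-free guidance (CFG)
    with spatio-temporal skip guidance (STG). All scale values and
    the skipped block index are illustrative, not from the video.
    """
    eps_cond   = predict_noise(model, x_t, t, text_emb)               # full model, with the prompt
    eps_uncond = predict_noise(model, x_t, t, null_emb)               # full model, prompt dropped (CFG baseline)
    eps_skip   = predict_noise(model, x_t, t, text_emb, skip_blocks)  # weakened model (STG baseline)

    # CFG extrapolates away from the unconditional prediction;
    # STG additionally extrapolates away from the layer-skipped one.
    return (eps_uncond
            + cfg_scale * (eps_cond - eps_uncond)
            + stg_scale * (eps_cond - eps_skip))
```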

What kinds of improvements are shown in the STG demos?

The transcript highlights several: smoke becomes more detailed and realistic; a woman’s face becomes more like a living person (less doll-like) and her features sharpen; a butterfly’s legs and antennae become visible; background elements like trees, clouds, and castles gain detail; and a jumping 3D cat maintains consistent positioning instead of abruptly switching front/back as it does in the baseline clip.

What is the compute tradeoff for Tencent’s open-source video model?

The model is described as fully open source with weights, but it currently requires substantial GPU memory—reportedly around 60 GB of VRAM. The open-source community is expected to work on reducing that requirement to consumer levels, but the transcript treats the current hardware need as a major barrier.
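As a practical aside, a quick pre-flight check like the sketch below (using PyTorch’s `torch.cuda.mem_get_info`) can tell you whether a single GPU has enough free memory before you attempt to load a model of this size. The 60 GB threshold simply mirrors the figure reported in the video; real usage varies with precision, resolution, and offloading.

```python
import torch

def fits_in_vram(required_gb: float = 60.0) -> bool:
    """Rough pre-flight check: is there enough free memory on the
    current GPU? The 60 GB default mirrors the requirement reported
    in the video, not a measured number."""
    if not torch.cuda.is_available():
        return False
    free_bytes, _total = torch.cuda.mem_get_info()   # bytes on the current device
    return free_bytes / 1024**3 >= required_gb

if __name__ == "__main__":
    print("Enough free VRAM:", fits_in_vram())
```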

How does I2V-01-Live differ from general video generation or motion transfer tools?

I2V-01-Live (MiniMax / Hailuo AI) is described as a model specifically trained to animate 2D illustration styles, enhancing smoothness and vivid motion. It is explicitly not framed as a system where someone uploads a video of a person speaking and the model transfers that motion onto new footage. Instead, it learns how different art styles should animate, with demos emphasizing character consistency and independent hair motion.

Why is Google’s Veo described as potentially behind newer models?

Veo is said to be available via Google’s Vertex AI platform in private preview, but early impressions are that it feels underwhelming compared with more recent top-end systems. The transcript attributes this to timing: Veo was announced earlier, and it hasn’t improved enough relative to newer releases, making it less competitive on complex prompts that require temporal consistency and physics across multiple interacting entities.

Review Questions

  1. Which specific visual artifacts does STG aim to reduce, and what evidence is given that it works across time rather than only within a single frame?
  2. What does “fully open source” change for Tencent’s model in terms of adoption and community development, and what bottleneck remains?
  3. How does I2V-01-Live’s training focus (2D illustration animation) constrain what it can and cannot do compared with general video generation or motion transfer?

Key Points

  1. STG is positioned as a spatio-temporal guidance method that improves realism and detail while maintaining consistency across video frames.

  2. STG can be used alongside classifier-free guidance, and demos claim it sharpens subjects that otherwise blur or collapse into artifacts.

  3. The STG examples emphasize temporal coherence—reducing abrupt frame-to-frame changes such as viewpoint flipping in motion.

  4. Tencent’s newly released video model is described as fully open source with weights, but it reportedly needs about 60 GB of VRAM to run today.

  5. Open-source release is treated as a catalyst for lowering hardware requirements and expanding community workflows beyond paid websites.

  6. MiniMax’s I2V-01-Live targets 2D illustration animation (including anime-like styles) and is not presented as a general motion-transfer tool for arbitrary videos.

  7. Google’s Veo is available in private preview on Vertex AI, but early impressions suggest it lags behind newer models on complex, physics-heavy prompts.

Highlights

STG is credited with making guidance improvements persist across time—smoke, faces, and animals stay more consistent from frame to frame.
A baseline cat jump is contrasted with an STG-enabled version that avoids sudden front/back position switching during motion.
Tencent’s open-source model is framed as a major community asset despite a steep current requirement of roughly 60 GB of VRAM.
I2V-01-Live focuses on animating 2D illustration styles, emphasizing character consistency and independent hair motion rather than real-world realism.

Topics

  • Spatio-Temporal Skip Guidance
  • Open-Source Video Models
  • VRAM Requirements
  • 2D Illustration Animation
  • Vertex AI Video Preview
