AI News WAVE Continues! AI Video, LLMs, & World Models!

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Llama 3.3 70B is presented as near–state-of-the-art in quality while being far cheaper than GPT-4o, with quoted input pricing of $0.10 per 1M tokens versus $2.50 and output pricing of $0.40 per 1M tokens versus $10.

Briefing

Open-source Llama 3.3 70B is being positioned as a near–top-tier alternative to GPT-4o, with pricing that undercuts closed models by an order of magnitude—especially for input tokens. Meta’s new 70B release is reported to deliver performance comparable to much larger systems in its class, including results described as roughly on par with GPT-4o and Gemini Pro 1.5 across common benchmarks and human evaluation. The practical takeaway is cost: Llama 3.3 70B is quoted at $0.10 per 1 million input tokens, versus $2.50 per 1 million input tokens for GPT-4o. Output pricing is also far lower—$0.40 per 1 million output tokens for Llama 3.3 70B compared with $10 for GPT-4o—making it especially attractive for applications that generate lots of text.
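
To make that gap concrete, the quoted rates can be applied to a hypothetical workload. The sketch below is illustrative only; the 50M-input/20M-output token volumes are assumptions, not figures from the video.

```python
# Quoted prices in USD per 1M tokens, as cited in the roundup.
PRICES = {
    "llama-3.3-70b": {"input": 0.10, "output": 0.40},
    "gpt-4o":        {"input": 2.50, "output": 10.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token volume at the quoted rates."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Hypothetical monthly workload: 50M input tokens, 20M output tokens.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 50_000_000, 20_000_000):,.2f}")
# llama-3.3-70b comes to $13.00 versus gpt-4o at $325.00, a ~25x gap.
```

At these rates the output side dominates for generation-heavy apps, which is why the roundup singles out applications that produce lots of text.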

The roundup also highlights a broader shift toward cheaper, more controllable AI systems—both for text and for media. Microsoft Copilot’s “Live Vision” adds real-time screen understanding, letting users show Copilot what’s happening as they scroll, shop, or even play location-based games like GeoGuessr. In the demo, Copilot identifies clues from on-screen text and symbols (including language cues) and helps guide decisions in real time, effectively turning “show, don’t tell” into an interactive workflow. The feature is framed as a natural extension of Microsoft’s tight relationship with OpenAI, and it raises the competitive question of why similar advanced vision-and-voice experiences aren’t yet standard in the ChatGPT app.
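
As a mental model, this kind of assistance reduces to a capture-and-ask loop over a multimodal model. The sketch below is conceptual only: Microsoft has not published a Live Vision API, and `capture_screen` and `vision_llm` are hypothetical stand-ins.

```python
import time

def live_assist(capture_screen, vision_llm, goal, interval_s=2.0, steps=10):
    """Poll the screen and return advice grounded in what is currently visible.

    `capture_screen` and `vision_llm` are injected, hypothetical callables;
    this is a conceptual pattern, not Microsoft's implementation.
    """
    history = []
    for _ in range(steps):
        frame = capture_screen()  # e.g., PNG bytes of the current screen
        advice = vision_llm(
            image=frame,
            prompt=f"Goal: {goal}. Given what is on screen, suggest the next step.",
            history=history,      # keeps guidance coherent across frames
        )
        history.append(advice)
        time.sleep(interval_s)    # approximate "real-time" cadence
    return history
```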

On the AI video front, multiple projects push toward more direct control. A GitHub project described as “motion prompting” enables interactive, physics-like motion generation from a static image: dragging a cursor can fling smoke, shaking branches can make them sway, and moving objects can cause realistic sand and character reactions. The same theme appears in Runway’s Act One update, which moves beyond transposing acting onto a still image to transposing performance onto a video—so facial movement, hand motion, and voice can be layered onto footage with the background action already in motion. The results are presented as close enough to be useful for more professional production workflows, though still imperfect.

Runway’s update sits alongside other agent and search tooling. ElevenLabs released a platform aimed at building conversational AI agents quickly for business use cases, emphasizing “build, test, and deploy” with options like voice creation, knowledge base uploads, and integrations for websites and apps. Separately, MindSearch is introduced as an open-source “AI search engine framework” that can connect to either open or closed LLMs to search the web, positioned as a cheaper alternative to relying on a single proprietary search API.

Finally, Google’s Genie 2 is presented as a real-time, command-driven AI video game generator, an early step toward “diffusion world models” that can maintain a consistent world state for short periods (about a minute) and respond to actions like movement, jumping, and camera changes. The model is described as trained on video game data and as using latent-frame transformers with guidance to improve control. While not yet a fully playable, long-session experience, the direction is clear: AI systems are moving from generating clips to generating interactive worlds, and from static outputs to live, screen-aware assistance.
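
The description maps onto a familiar action-conditioned world-model loop: encode a seed image into a latent state, then repeatedly apply a user action and decode the next frame. The sketch below is purely conceptual; Genie 2 has no public API, and the `encode`/`step`/`decode` methods are hypothetical stand-ins for the latent-frame transformer described above.

```python
# Conceptual action-conditioned rollout, in the spirit of a diffusion world
# model like Genie 2. All model methods here are hypothetical stand-ins.
KEY_TO_ACTION = {"w": "forward", "a": "left", "s": "back", "d": "right", " ": "jump"}

def run_session(model, first_frame, key_events, max_steps=60):
    """Roll a generated world forward one frame per user input."""
    state = model.encode(first_frame)       # seed the world state from an image
    frames = [first_frame]
    for _, key in zip(range(max_steps), key_events):
        action = KEY_TO_ACTION.get(key, "idle")
        state = model.step(state, action)   # next latent, conditioned on action
        frames.append(model.decode(state))  # render the frame shown to the user
    return frames  # per the roundup, consistency holds for roughly a minute
```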

Cornell Notes

  • Llama 3.3 70B is framed as a near-parity option to GPT-4o while being dramatically cheaper: quoted input pricing of $0.10 per 1M tokens versus $2.50 for GPT-4o, and output pricing of $0.40 per 1M tokens versus $10.
  • Microsoft Copilot’s Live Vision adds real-time screen understanding, enabling “show, don’t tell” help during tasks like GeoGuessr.
  • Video generation is shifting toward control: a motion-prompting project lets users drag and shake objects in interactive image-to-video demos, while Runway Act One now transposes acting onto video rather than only images.
  • ElevenLabs pushes business-focused conversational agents with fast build/deploy tooling, and MindSearch offers an open-source framework for LLM-powered web search.
  • Google’s Genie 2 aims at real-time, command-driven AI “world” generation, responding to keyboard actions and maintaining short-term consistency.

What makes Llama 3.3 70B stand out versus GPT-4o in this roundup?

The key differentiator is cost paired with reported benchmark parity. Llama 3.3 70B is described as delivering performance comparable to GPT-4o and other leading models, while pricing is quoted far lower: $0.10 per 1M input tokens for Llama 3.3 70B versus $2.50 for GPT-4o input. Output is also cheaper: $0.40 per 1M output tokens for Llama 3.3 70B versus $10 for GPT-4o. That combination matters most for applications that generate lots of text or run at scale.

How does Copilot Live Vision change the way users interact with AI assistance?

Live Vision lets users show Copilot their screen in real time, so the assistant can interpret what’s visible as the user scrolls or navigates. In the GeoGuessr demo, Copilot helps by reading on-screen clues—like Chinese characters and smaller English text—to infer a likely region (the Philippines) and then guide a next decision (north vs. south). The emphasis is on interactive, moment-to-moment guidance rather than static prompts.

What does “motion prompting” add to AI video generation control?

It enables interactive control over motion generated from a static image. Examples include dragging the mouse to fling smoke with realistic physics-like behavior, shaking branches to make them sway, and moving objects to affect how sand falls. The demos are described as more like interactive, physics-driven scenes than pre-rendered video, because the output responds to user input during generation.
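
A plausible way to represent such input, consistent with how trajectory-conditioned video models are typically driven, is to turn a mouse drag into a per-frame point track. The sketch below is illustrative; the commented `generate_video` call is a hypothetical API, not the GitHub project’s actual interface.

```python
import numpy as np

def drag_to_track(start_xy, end_xy, n_frames=24):
    """Convert a single mouse drag into a per-frame (x, y) point trajectory."""
    start = np.asarray(start_xy, dtype=float)
    end = np.asarray(end_xy, dtype=float)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]  # interpolation weights
    return start + t * (end - start)              # shape: (n_frames, 2)

# A drag up and to the right, e.g., flinging smoke from a static image.
track = drag_to_track((120, 300), (240, 180))
# video = generate_video(image, motion_tracks=[track])  # hypothetical call
```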

How is Runway Act One’s update different from its earlier approach?

Earlier, Act One transposed acting onto an image. The update shifts to transposing acting onto a video, meaning the background footage can already contain motion (e.g., a car moving in the distance) while the person’s facial and hand performance is layered on top. The roundup notes the mouth movement can be convincingly added even when the original clip had no mouth motion, though background motion artifacts (like slow car movement) can still give away the source.

What is the core promise behind Google Genie 2 as described here?

Genie 2 is presented as a real-time, command-driven AI video game generator. It generates a playable world on the fly from an initial image and then responds to keyboard actions (W/A/S/D for movement, space to jump, camera adjustments). It’s also claimed to remember parts of the world that go out of view and render them accurately when they return, with consistency described as holding for up to about a minute.

Why are ElevenLabs and MindSearch grouped together in the roundup?

Both target practical deployment of AI beyond raw model generation. ElevenLabs focuses on building conversational agents quickly for business workflows: voice creation, knowledge base uploads, transcript analysis, and fast deployment to websites and apps. MindSearch focuses on web search integration as an open-source framework that can connect to different LLMs, aiming to reduce reliance on a single proprietary search API and potentially lower costs.
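
The key design point is that both the model and the search backend are injected, so open and closed LLMs are interchangeable. The sketch below illustrates that pattern; the function names and the query-then-summarize flow are assumptions for illustration, not MindSearch’s actual interface.

```python
def answer(question, llm, search, n_results=5):
    """LLM-pluggable web search: `llm` and `search` are injected callables,
    so any open or closed model (and any search backend) can be swapped in.
    This is an illustrative pattern, not MindSearch's real API."""
    # 1. Ask the model to turn the question into a web search query.
    query = llm(f"Write one web search query for: {question}")
    # 2. Fetch results from whichever backend was injected.
    results = search(query)[:n_results]  # assumed: list of (title, snippet)
    context = "\n".join(f"- {title}: {snippet}" for title, snippet in results)
    # 3. Ground the final answer in the retrieved snippets.
    return llm(f"Using these results:\n{context}\n\nAnswer: {question}")
```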

Review Questions

  1. Which pricing numbers in the roundup most directly support the claim that Llama 3.3 70B is cheaper than GPT-4o, and for what token types?
  2. In the GeoGuessr demo, what kinds of visual evidence does Copilot Live Vision use to guide decisions?
  3. What distinguishes Runway Act One’s new video-based acting transposition from its earlier image-based version?

Key Points

  1. Llama 3.3 70B is presented as near–state-of-the-art in quality while being far cheaper than GPT-4o, with quoted input pricing of $0.10 per 1M tokens versus $2.50 and output pricing of $0.40 per 1M tokens versus $10.
  2. Meta’s Llama 3.3 70B is positioned as fully open source and available via Meta and Hugging Face links, with both local and API-based usage options.
  3. Microsoft Copilot’s Live Vision enables real-time screen understanding, letting users show Copilot what they’re doing and receive guidance as they interact with apps and games.
  4. Interactive motion control is emerging in AI video workflows, with “motion prompting” demos that respond to mouse dragging and shaking to drive physics-like changes.
  5. Runway Act One’s update shifts from acting-on-images to acting-on-video, layering facial and hand performance onto moving background footage.
  6. ElevenLabs offers a business-oriented platform for building conversational AI agents quickly, emphasizing build/test/deploy and integrations for websites and apps.
  7. Google’s Genie 2 is described as a diffusion world model that generates interactive, real-time “gameplay” from commands and can maintain short-term world consistency (about a minute).

Highlights

Llama 3.3 70B is framed as “basically on par” with top closed models while undercutting costs sharply—$0.10 per 1M input tokens and $0.40 per 1M output tokens versus GPT-4o’s $2.50 and $10.
Copilot Live Vision turns assistance into a live, screen-aware co-pilot, demonstrated through GeoGuessr-style clue reading and step-by-step guidance.
Runway Act One now transposes acting onto video, not just images—making it closer to usable performance layering for production workflows.
Motion prompting demos suggest AI video control is moving toward interactive, physics-like manipulation rather than fixed prompts.
Genie 2 is pitched as real-time, command-driven AI world generation, responding to keyboard movement and camera changes while keeping parts of the world consistent for up to a minute.

Topics

  • Llama 3.3 70B
  • Copilot Live Vision
  • AI Video Motion Control
  • Runway Act One
  • Conversational AI Agents
  • Open-Source AI Search
  • Genie 2 World Models