AI News WAVE Continues! AI Video, LLMs, & World Models!
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Open-source Llama 3.3 70B is being positioned as a near–top-tier alternative to GPT-4o, with pricing that undercuts closed models by an order of magnitude, especially for input tokens. Meta’s new 70B release is reported to perform on par with much larger models, including results described as roughly comparable to GPT-4o and Gemini 1.5 Pro across common benchmarks and human evaluation. The practical takeaway is cost: Llama 3.3 70B is quoted at $0.10 per 1 million input tokens versus $2.50 for GPT-4o, and $0.40 per 1 million output tokens versus $10, making it especially attractive for applications that generate lots of text.
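The quoted prices make the savings easy to quantify. A minimal sketch, using the per-million-token prices quoted above (the workload sizes are purely illustrative):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars for a workload, given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Illustrative monthly workload: 5M input tokens, 2M output tokens.
llama = request_cost(5_000_000, 2_000_000, 0.10, 0.40)   # $0.50 + $0.80  = $1.30
gpt4o = request_cost(5_000_000, 2_000_000, 2.50, 10.00)  # $12.50 + $20.00 = $32.50
print(f"Llama 3.3 70B: ${llama:.2f}  GPT-4o: ${gpt4o:.2f}  ratio: {gpt4o/llama:.0f}x")
```

At these quoted rates the same workload costs roughly 25x more on GPT-4o, with the gap widest for input-heavy use cases (25x on input versus 25x overall here because the example mixes both token types).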
The roundup also highlights a broader shift toward cheaper, more controllable AI systems—both for text and for media. Microsoft Copilot’s “Live Vision” adds real-time screen understanding, letting users show Copilot what’s happening as they scroll, shop, or even play location-based games like GeoGuessr. In the demo, Copilot identifies clues from on-screen text and symbols (including language cues) and helps guide decisions in real time, effectively turning “show, don’t tell” into an interactive workflow. The feature is framed as a natural extension of Microsoft’s tight relationship with OpenAI, and it raises the competitive question of why similar advanced vision-and-voice experiences aren’t yet standard in the ChatGPT app.
On the AI video front, multiple projects push toward more direct control. A GitHub project described as “motion prompting” enables interactive, physics-like motion generation from a static image: dragging a cursor can fling smoke, shaking branches can make them sway, and moving objects can cause realistic sand and character reactions. The same theme appears in Runway’s Act One update, which moves beyond transposing acting onto a still image to transposing performance onto a video—so facial movement, hand motion, and voice can be layered onto footage with the background action already in motion. The results are presented as close enough to be useful for more professional production workflows, though still imperfect.
Runway’s update sits alongside other agent and search tooling. ElevenLabs released a platform aimed at building conversational AI agents quickly for business use cases, emphasizing “build, test, and deploy” with options like voice creation, knowledge-base uploads, and integrations for websites and apps. Separately, MindSearch is introduced as an open-source “AI search engine framework” that can connect to either open or closed LLMs to search the web, positioned as a cheaper alternative to relying on a single proprietary search API.
Finally, Google DeepMind’s Genie 2 is presented as a real-time, command-driven AI video game generator, an early step toward “diffusion world models” that can maintain a consistent world state for short periods (about a minute) and respond to actions like movement, jumping, and camera changes. The model is described as trained on video game data and as using latent-frame transformers with guidance to improve control. While not yet a fully playable, long-session experience, the direction is clear: AI systems are moving from generating clips to generating interactive worlds, and from static outputs to live, screen-aware assistance.
Cornell Notes
Llama 3.3 70B is framed as a near-parity option to GPT-4o while being dramatically cheaper, with quoted input pricing of $0.10 per 1M tokens versus $2.50 for GPT-4o and output pricing of $0.40 per 1M tokens versus $10. Microsoft Copilot’s Live Vision adds real-time screen understanding, enabling “show, don’t tell” help during tasks like GeoGuessr. Video generation is shifting toward control: a motion-prompting project lets users drag and shake objects in interactive image-to-video demos, while Runway Act One now transposes acting onto video rather than only onto images. ElevenLabs pushes business-focused conversational agents with fast build-and-deploy tooling, and MindSearch offers an open-source framework for LLM-powered web search. Google DeepMind’s Genie 2 aims at real-time, command-driven AI “world” generation, responding to keyboard actions and maintaining short-term consistency.
What makes Llama 3.3 70B stand out versus GPT-4o in this roundup?
How does Copilot Live Vision change the way users interact with AI assistance?
What does “motion prompting” add to AI video generation control?
How is Runway Act One’s update different from its earlier approach?
What is the core promise behind Google Genie2 as described here?
Why are 11 Labs and Mind Search grouped together in the roundup?
Review Questions
- Which pricing numbers in the roundup most directly support the claim that Llama 3.3 70B is cheaper than GPT-4o, and for what token types?
- In the GeoGuessr demo, what kinds of visual evidence does Copilot Live Vision use to guide decisions?
- What distinguishes Runway Act One’s new video-based acting transposition from its earlier image-based version?
Key Points
- 1
Llama 3.3 70B is presented as near–state-of-the-art in quality while being far cheaper than GPT-4o, with quoted input pricing of $0.10 per 1M tokens versus $2.50 and output pricing of $0.40 per 1M tokens versus $10.
- 2
Meta’s Llama 3.3 70B is positioned as fully open source and available via Meta and Hugging Face links, with both local and API-based usage options.
- 3
Microsoft Copilot’s Live Vision enables real-time screen understanding, letting users show Copilot what they’re doing and receive guidance as they interact with apps and games.
- 4
Interactive motion control is emerging in AI video workflows, with “motion prompting” demos that respond to mouse dragging and shaking to drive physics-like changes.
- 5
Runway Act One’s update shifts from acting-on-images to acting-on-video, layering facial and hand performance onto moving background footage.
- 6
11 Labs offers a business-oriented platform for building conversational AI agents quickly, emphasizing build/test/deploy and integrations for websites and apps.
- 7
Google’s Genie2 is described as a diffusion world model that generates interactive, real-time “gameplay” from commands and can maintain short-term world consistency (about a minute).