‘Her’ AI, Almost Here? Llama 3, Vasa-1, and Altman ‘Plugging Into Everything You Want To Do’
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Meta’s newly released Llama 3 70B is arriving in a competitive state—without the full “biggest and best” model or its research paper yet—while Microsoft’s Vasa-1 and Boston Dynamics’ Atlas keep pushing AI toward more lifelike, real-time social interaction. The immediate takeaway: smaller Llama 3 models are already matching top-tier peers on several benchmarks, and the next wave of AI interfaces may be driven less by raw intelligence and more by personalization, avatar realism, and integration into everyday tasks.
Meta’s late-breaking update centers on two smaller Llama 3 releases, with Llama 3 70B positioned as competitive with Gemini Pro 1.5 and Claude Sonnet. Human-evaluated comparisons are presented against models including Mistral Medium, Claude Sonnet, and GPT-3.5, and results for a still-training “mystery model” are described as broadly similar to newer GPT-4 Turbo and Claude 3 Opus baselines. On graduate-STEM-style assessments, performance is described as nearly identical, and coding benchmarks show close competition, though the transcript flags that some benchmarks are “deeply flawed,” especially in math, where GPT-4 still appears ahead. The bigger research claim tied to Llama 3 is that performance keeps improving even after training on far more data than the “Chinchilla optimal” amount, suggesting that gains come from saturating the model with high-quality data rather than from sheer volume; coding data receives special emphasis.
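To make the "Chinchilla optimal" claim concrete, here is a rough back-of-the-envelope sketch (not from the transcript). It uses the common ~20-tokens-per-parameter heuristic from the Chinchilla scaling work, together with the roughly 15 trillion training tokens Meta reported for Llama 3; the exact constant and figures are illustrative, not a precise reconstruction of Meta's training budget.

```python
# Heuristic: compute-optimal training uses roughly 20 tokens per model parameter
# (the "Chinchilla" rule of thumb). Meta reported training Llama 3 on ~15T tokens,
# which is far beyond that point for a 70B-parameter model.

CHINCHILLA_TOKENS_PER_PARAM = 20  # rough heuristic, not an exact constant

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

llama3_70b_params = 70e9
llama3_reported_tokens = 15e12  # Meta's publicly stated training-data scale

optimal = chinchilla_optimal_tokens(llama3_70b_params)
overshoot = llama3_reported_tokens / optimal
print(f"Chinchilla-optimal: {optimal:.2e} tokens; Llama 3 trained on ~{overshoot:.1f}x that")
```

Run as written, this estimates about 1.4 trillion "optimal" tokens for a 70B model, so 15T tokens is roughly ten times past the compute-optimal point, which is why continued benchmark gains at that scale are treated as a notable result.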
Meta also signals a roadmap: multiple future Llama 3 variants with multimodality, multilingual conversation, longer context windows, and stronger overall capabilities. The transcript repeatedly stresses that the most consequential missing piece is not just model size, but the eventual release of the larger model and the paper that will clarify training details like context window length.
In parallel, Microsoft’s Vasa-1 paper spotlights a different kind of progress: generating expressive, controllable facial animation from minimal inputs. The system takes a single image plus an audio clip and produces video with detailed facial dynamics—blinking, lip motion, eyebrow and gaze behavior—at about 40 frames per second with “negligible” starting latency. The method maps facial dynamics into a latent space and uses a diffusion Transformer to connect audio to head movement and facial expression codes before rendering frames using identity features from the input image. Vasa-1 is described as trained on public VoxCeleb2 plus a smaller supplemental dataset, with the transcript emphasizing the surprisingly limited scale (2,000 hours in VoxCeleb2) compared with the massive web-scale data used by frontier systems.
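The two-stage structure described above can be sketched as a toy data-flow diagram in code. Everything here is a stand-in: the function bodies, dimensions, and names are assumptions for illustration only, replacing Vasa-1's learned encoders, diffusion Transformer, and neural renderer with trivial numpy operations so the stage boundaries and tensor shapes are visible.

```python
import numpy as np

LATENT_DIM = 64   # assumed size of the facial-dynamics latent (illustrative)
AUDIO_DIM = 128   # assumed size of per-frame audio features (illustrative)
FPS = 40          # output frame rate described in the paper

def encode_identity(image: np.ndarray) -> np.ndarray:
    """Stand-in for the identity/appearance encoder applied to the single input photo."""
    return image.mean(axis=(0, 1))  # toy global feature, shape (3,)

def audio_to_motion(audio_feats: np.ndarray) -> np.ndarray:
    """Stand-in for stage 1 (the diffusion Transformer): per-frame audio features
    mapped to head-pose and facial-expression latent codes, one per video frame."""
    return np.tanh(audio_feats[:, :LATENT_DIM])  # toy projection into the latent space

def render_frames(identity: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stand-in for stage 2 (the renderer): combine identity features with
    per-frame motion latents to produce video frames."""
    T = motion.shape[0]
    return np.zeros((T, 64, 64, 3)) + identity.reshape(1, 1, 1, -1)[..., :3]

# One second of audio at 40 fps yields 40 video frames from a single photo.
audio_feats = np.random.randn(FPS, AUDIO_DIM)
photo = np.random.rand(256, 256, 3)

identity = encode_identity(photo)
motion = audio_to_motion(audio_feats)    # stage 1: audio -> motion latents
video = render_frames(identity, motion)  # stage 2: latents + identity -> frames
print(video.shape)  # (40, 64, 64, 3)
```

The point of the sketch is the decoupling: audio drives a compact motion-latent sequence, and the input photo only enters at render time as identity conditioning, which is what lets a single image animate with arbitrary speech.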
The practical implication is social: Vasa-1 is framed as enabling real-time, lifelike avatar interactions, including in healthcare contexts. The transcript then links this to AI nurse deployments that reportedly outperform human nurses on bedside manner and technical patient-education tasks, including identifying medication impacts on lab values and detecting toxic dosages—though these claims rest on reported performance metrics and human ratings rather than independent validation.
Finally, the transcript ties these advances to a broader debate about personalization versus general intelligence. Sam Altman is cited suggesting long-term differentiation will come from AI models that plug into a user’s life context and integrate across everything they want to do. Meanwhile, AI safety timelines remain contested, with references to Anthropic’s Dario Amodei discussing rapid paths to higher-risk autonomy (ASL 3 and ASL 4). Taken together, the central message is that “Her”-like capability may arrive through tighter personalization and more convincing real-time avatars—potentially sooner than traditional AGI timelines imply.
Cornell Notes
Llama 3’s latest smaller releases are described as highly competitive—especially Llama 3 70B—despite Meta not yet releasing its largest model or the full research paper. The training approach is framed around saturating performance with high-quality data (with emphasis on coding), leading to continued gains even beyond “Chinchilla optimal” amounts. In a separate leap, Microsoft’s Vasa-1 generates lifelike avatar video from a single image and an audio clip, producing expressive facial motion (blinking, gaze, lip movement) at ~40 fps with low latency. The method uses a latent-space mapping of facial dynamics and a diffusion Transformer to connect audio to head and facial expression codes before rendering frames. Together, the developments point toward near-term “Her”-style interaction driven by personalization and avatar realism rather than only bigger base models.
- Why does Llama 3 70B’s “competitive” positioning matter if Meta hasn’t released its biggest model yet?
- What training insight is credited for Llama 3’s performance gains?
- How does Vasa-1 turn a single photo and audio into expressive video?
- What makes Vasa-1’s outputs feel more “real” than earlier deepfake approaches?
- Why is the training-data scale for Vasa-1 treated as a notable detail?
- What does the personalization debate suggest about the next interface for AI?
Review Questions
- Which benchmark categories in the transcript are described as nearly tied between Llama 3 70B and top competitors, and which ones still show GPT-4 as ahead?
- What are the two-stage components of Vasa-1’s pipeline (audio-to-motion vs motion-to-video), and what inputs feed each stage?
- How do the transcript’s AI personalization claims connect to both avatar realism (Vasa-1) and business incentives (engagement-driven differentiation)?
Key Points
1. Meta’s Llama 3 70B is presented as competitive with Gemini Pro 1.5 and Claude Sonnet based on human-evaluated comparisons, even before the largest model and full research paper arrive.
2. Llama 3’s training narrative emphasizes continued performance gains after training beyond Chinchilla “optimal,” attributed to saturation with high-quality data and special emphasis on coding data.
3. Microsoft’s Vasa-1 generates expressive avatar video from a single image plus an audio clip, targeting real-time interaction with ~40 fps output and negligible starting latency.
4. Vasa-1’s approach uses a latent-space representation of facial dynamics and a diffusion Transformer to map audio to head movement and facial expression codes before rendering frames with identity features.
5. The transcript connects avatar realism to high-stakes domains like healthcare, citing reported AI nurse performance on bedside manner, medication impact detection, and toxic dosage identification.
6. Sam Altman’s personalization framing suggests long-term differentiation may come from AI integrated into a user’s life context, potentially amplified by video avatars.
7. AI safety timelines remain disputed, with Dario Amodei describing relatively near-term possibilities for higher-risk autonomy levels (ASL 3 and ASL 4).