
‘Her’ AI, Almost Here? Llama 3, Vasa-1, and Altman ‘Plugging Into Everything You Want To Do’

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Meta’s Llama 3 70B is presented as competitive with Gemini Pro 1.5 and Claude Sonnet based on human-evaluated comparisons, even before the largest model and full research paper arrive.

Briefing

Meta’s newly released Llama 3 70B is arriving in a competitive state—without the full “biggest and best” model or its research paper yet—while Microsoft’s Vasa-1 and Boston Dynamics’ Atlas keep pushing AI toward more lifelike, real-time social interaction. The immediate takeaway: smaller Llama 3 models are already matching top-tier peers on several benchmarks, and the next wave of AI interfaces may be driven less by raw intelligence and more by personalization, avatar realism, and integration into everyday tasks.

Meta’s late-breaking update centers on two smaller Llama 3 releases, with Llama 3 70B positioned as competitive with Gemini Pro 1.5 and Claude Sonnet. Human-evaluated comparisons are presented against models including Mistral Medium, Claude Sonnet, and GPT-3.5, and a still-training “mystery model” (the largest Llama 3) is described as performing broadly similarly to the newer GPT-4 Turbo and Claude 3 Opus baselines. On graduate-level STEM-style assessments, performance is described as nearly identical, while coding benchmarks show close competition; the transcript flags, however, that some benchmarks are “deeply flawed,” especially for math, where GPT-4 still appears ahead. The bigger research claim tied to Llama 3 is that performance keeps improving even after training on far more data than the “Chinchilla-optimal” amount, with the continued gains attributed to high-quality data rather than sheer volume; coding data receives special emphasis.

Meta also signals a roadmap: multiple future Llama 3 variants with multimodality, multilingual conversation, longer context windows, and stronger overall capabilities. The transcript repeatedly stresses that the most consequential missing pieces are the eventual release of the largest model and the accompanying research paper, which should clarify training details such as context-window length.

In parallel, Microsoft’s Vasa-1 paper spotlights a different kind of progress: generating expressive, controllable facial animation from minimal inputs. The system takes a single image plus an audio clip and produces video with detailed facial dynamics—blinking, lip motion, eyebrow and gaze behavior—at about 40 frames per second with “negligible” starting latency. The method maps facial dynamics into a latent space and uses a diffusion Transformer to connect audio to head movement and facial expression codes before rendering frames using identity features from the input image. Vasa-1 is described as trained on public VoxCeleb2 plus a smaller supplemental dataset, with the transcript emphasizing the surprisingly limited scale (2,000 hours in VoxCeleb2) compared with the massive web-scale data used by frontier systems.
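
To make that two-stage flow concrete, here is a minimal structural sketch in Python. Vasa-1 has not been publicly released, so every function name, shape, and constant below is an illustrative assumption that only mirrors the flow described above (identity from the photo, audio to motion latents via the diffusion Transformer, then frame rendering); it is not Microsoft’s actual API.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed audio sample rate (not from the paper)
LATENT_DIM = 256      # assumed size of the facial-dynamics latent

def encode_identity(image: np.ndarray) -> np.ndarray:
    # Placeholder for the appearance/identity encoder applied once to the
    # single input photo; the real system uses a learned network.
    return np.zeros(LATENT_DIM)

def audio_to_motion_codes(audio: np.ndarray, num_frames: int) -> np.ndarray:
    # Placeholder for the diffusion-Transformer stage: audio in, one latent
    # code per output frame covering head pose, gaze, blinks, and lip motion.
    return np.zeros((num_frames, LATENT_DIM))

def render_frames(motion_codes: np.ndarray, identity: np.ndarray) -> np.ndarray:
    # Placeholder for the decoder that turns each motion code, conditioned on
    # the identity features, into an RGB video frame.
    return np.zeros((len(motion_codes), 512, 512, 3), dtype=np.uint8)

def generate_talking_head(image: np.ndarray, audio: np.ndarray,
                          fps: int = 40) -> np.ndarray:
    # End-to-end flow: one photo + one audio clip -> a stack of video frames.
    identity = encode_identity(image)
    num_frames = int(len(audio) / SAMPLE_RATE * fps)
    motion_codes = audio_to_motion_codes(audio, num_frames)
    return render_frames(motion_codes, identity)
```

The design point worth noticing is the decoupling: identity is extracted once from the photo, so the generative stage only has to model motion in a compact latent space, which plausibly helps keep per-frame generation fast enough for real-time rates.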

The practical implication is social: Vasa-1 is framed as enabling real-time, lifelike avatar interactions, including in healthcare contexts. The transcript then links this to AI nurse deployments that reportedly outperform human nurses on bedside manner and technical patient-education tasks, including identifying medication impacts on lab values and detecting toxic dosages, though these claims rest on reported performance metrics and human ratings.

Finally, the transcript ties these advances to a broader debate about personalization versus general intelligence. Sam Altman is cited suggesting long-term differentiation will come from AI models that plug into a user’s life context and integrate across everything they want to do. Meanwhile, AI safety timelines remain contested, with references to Anthropic’s Dario Amodei discussing rapid paths to higher-risk autonomy (ASL 3 and ASL 4). Taken together, the central message is that “Her”-like capability may arrive through tighter personalization and more convincing real-time avatars—potentially sooner than traditional AGI timelines imply.

Cornell Notes

Llama 3’s latest smaller releases are described as highly competitive, especially Llama 3 70B, despite Meta not yet releasing its largest model or the full research paper. The training approach is framed around continued gains from high-quality data (with emphasis on coding), holding up even beyond “Chinchilla-optimal” amounts. In a separate leap, Microsoft’s Vasa-1 generates lifelike avatar video from a single image and an audio clip, producing expressive facial motion (blinking, gaze, lip movement) at ~40 fps with low latency. The method uses a latent-space mapping of facial dynamics and a diffusion Transformer to connect audio to head and facial expression codes before rendering frames. Together, the developments point toward near-term “Her”-style interaction driven by personalization and avatar realism rather than only bigger base models.

Why does Llama 3 70B’s “competitive” positioning matter if Meta hasn’t released its biggest model yet?

The transcript highlights human-evaluated comparisons placing Llama 3 70B close to models like Gemini Pro 1.5 and Claude Sonnet, and it describes a still-training “mystery model” performing on par with GPT-4 Turbo and Claude 3 Opus on graduate-level STEM-style tests. That matters because it suggests Meta’s smaller releases already reach the same practical tier as leading systems, even before the largest model and paper clarify details like context-window size.

What training insight is credited for Llama 3’s performance gains?

The key claim is that performance kept improving even after training on data far beyond the Chinchilla-“optimal” amount, implying that high-quality data continues to pay off past the point where standard scaling heuristics would stop adding it. Coding data is singled out for special emphasis, and the transcript frames this as a reason the model can keep getting better rather than plateauing.
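
As a rough back-of-envelope illustration (mine, not the transcript’s): the Chinchilla paper’s commonly cited heuristic is about 20 training tokens per parameter, and Meta reported pretraining Llama 3 on roughly 15 trillion tokens. A few lines of Python show how far past that heuristic the released models sit:

```python
# Back-of-envelope: how far beyond the Chinchilla-optimal data budget the
# Llama 3 releases reportedly trained. The ~20 tokens/parameter heuristic is
# the commonly cited takeaway from Hoffmann et al. (2022); the 15T-token
# figure is Meta's reported pretraining corpus size.
CHINCHILLA_TOKENS_PER_PARAM = 20
TRAINED_TOKENS = 15e12  # reported Llama 3 pretraining tokens

for name, n_params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    optimal = CHINCHILLA_TOKENS_PER_PARAM * n_params
    ratio = TRAINED_TOKENS / optimal
    print(f"{name}: ~{ratio:.0f}x the Chinchilla-optimal "
          f"{optimal / 1e12:.2f}T tokens")
```

By this arithmetic, the 70B model trained on roughly 11 times its compute-optimal token budget and the 8B on roughly 94 times, which is the sense in which the reported gains are “beyond Chinchilla.”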

How does Vasa-1 turn a single photo and audio into expressive video?

Vasa-1 maps facial dynamics—lip motion, non-lip expressions, eye gaze, and blinking—into a latent space, then uses a diffusion Transformer to connect audio to head movements and facial expression codes. After generating those motion codes, the system renders video frames using appearance and identity features extracted from the input image.

What makes Vasa-1’s outputs feel more “real” than earlier deepfake approaches?

The transcript emphasizes facial expressiveness: blinking, eyebrow and lip behavior, and gaze direction, plus controllability over emotion (happiness to anger), distance from the camera, and where the avatar looks. It also notes the system runs around 40 frames per second with negligible starting latency, which supports real-time engagement.
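
For a sense of what real-time operation at that rate implies (a back-of-envelope of mine, not a figure from the paper), the per-frame generation budget follows directly from the frame rate:

```python
# At the reported ~40 fps, each frame must be generated within a fixed
# budget for playback to keep up with the conversation.
FPS = 40
frame_budget_ms = 1000 / FPS
print(f"Per-frame budget at {FPS} fps: {frame_budget_ms:.0f} ms")
# -> Per-frame budget at 40 fps: 25 ms
```

A 25 ms per-frame budget, plus negligible startup latency, is what lets the avatar respond within normal conversational turn-taking rather than after a visible render delay.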

Why is the training-data scale for Vasa-1 treated as a notable detail?

The transcript points out that Vasa-1 uses VoxCeleb2 (about 2,000 hours) and supplements with a smaller internal dataset. Compared with frontier systems trained on vastly larger web-scale audio/video, the relatively modest hours are presented as a reason the results are striking—even if the dataset is curated.

What does the personalization debate suggest about the next interface for AI?

Sam Altman is cited arguing that long-term differentiation will come from models personalized to a user’s whole life context and integrated into everything they want to do. The transcript links this to the idea that adding lifelike video avatars could make AI feel more engaging and potentially more “addictive,” shifting the focus from raw intelligence alone to how well the system plugs into daily routines.

Review Questions

  1. Which benchmark categories in the transcript are described as nearly tied between Llama 3 70B and top competitors, and which ones still show GPT-4 as ahead?
  2. What are the two-stage components of Vasa-1’s pipeline (audio-to-motion vs motion-to-video), and what inputs feed each stage?
  3. How do the transcript’s AI personalization claims connect to both avatar realism (Vasa-1) and business incentives (engagement-driven differentiation)?

Key Points

  1. Meta’s Llama 3 70B is presented as competitive with Gemini Pro 1.5 and Claude Sonnet based on human-evaluated comparisons, even before the largest model and full research paper arrive.

  2. Llama 3’s training narrative emphasizes continued performance gains after training beyond the Chinchilla “optimal” amount, attributed to high-quality data with special emphasis on coding data.

  3. Microsoft’s Vasa-1 generates expressive avatar video from a single image plus an audio clip, targeting real-time interaction with ~40 fps output and negligible starting latency.

  4. Vasa-1’s approach uses a latent-space representation of facial dynamics and a diffusion Transformer to map audio to head-movement and facial-expression codes before rendering frames with identity features.

  5. The transcript connects avatar realism to high-stakes domains like healthcare, citing reported AI nurse performance on bedside manner, medication-impact detection, and toxic-dosage identification.

  6. Sam Altman’s personalization framing suggests long-term differentiation may come from AI integrated into a user’s life context, potentially amplified by video avatars.

  7. AI safety timelines remain disputed, with Dario Amodei describing relatively near-term possibilities for higher-risk autonomy levels (ASL 3 and ASL 4).

Highlights

Llama 3 70B is positioned as competitive with Gemini Pro 1.5 and Claude Sonnet on human evaluations, despite Meta holding back its biggest model and the research paper.
Vasa-1 can produce lifelike facial animation from just one photo and an audio clip, with blinking, gaze, and lip/eyebrow expressiveness at about 40 fps.
Vasa-1’s pipeline is described as audio-to-motion-code generation (via a diffusion Transformer) followed by frame rendering using identity features from the input image.
The transcript’s through-line is that “Her”-like interaction may arrive through personalization and real-time avatars, not only through larger base models.
AI safety expectations vary sharply, with Dario Amodei suggesting ASL 4 could arrive between 2025 and 2028.
