Zuck's new Llama is a beast
Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Llama 3.1 is offered in 8B, 70B, and 405B sizes, with the largest variant using a 128,000-token context window.
Briefing
Meta’s latest large language model, Llama 3.1, is positioned as a major leap in open-weight AI—especially with its biggest 405B parameter variant—while also being offered for free in a way that could pressure closed competitors like OpenAI and Anthropic. Meta trained the model over months on 16,000 Nvidia H100 GPUs, a scale that implies hundreds of millions of dollars in compute and enough electricity to power a small country. The payoff is a 405B model with a 128,000-token context window, and benchmark results that claim it is mostly superior to OpenAI’s GPT-4 and can beat Claude 3.5 Sonnet on select tests.
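The compute claim can be sanity-checked with back-of-envelope arithmetic. The training duration and hourly GPU rate below are assumptions for illustration, not figures from the video:

```python
# Rough sanity check of the "16,000 H100s for months" claim.
# Assumed values (not from the video): ~80 days of training,
# ~$2.50 per H100 GPU-hour at bulk cloud rates.
GPUS = 16_000
DAYS = 80      # assumption standing in for "months"
RATE = 2.50    # assumption: USD per GPU-hour

gpu_hours = GPUS * DAYS * 24
cost = gpu_hours * RATE

print(f"{gpu_hours / 1e6:.1f}M GPU-hours")  # ~30.7M GPU-hours
print(f"~${cost / 1e6:.0f}M in compute")    # ~$77M at these assumed rates
```

Raw GPU rental alone lands in the tens of millions under these assumptions; with hardware purchase, power, networking, and failed runs included, the "hundreds of millions" framing is plausible.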
Still, raw benchmarks don’t settle whether a model is genuinely useful. Early hands-on reactions described the largest Llama as somewhat disappointing, while the smaller versions look more impressive. In practical testing, the 405B model struggled with a specific single-shot coding task: generating a Svelte 5 web application using “runes,” a Svelte feature that had not yet shipped at the time. The model performed decently at coding in general, but it showed little awareness of that particular feature, and it fell behind Claude 3.5 Sonnet, which was reportedly the only model seen to handle the task correctly in one go.
Where Llama’s strategy becomes more consequential than its initial performance is in how developers can adapt it. Llama 3.1 is “open” in the sense that the model weights are available, enabling self-hosting and fine-tuning on custom data. That matters because it shifts cost and control away from pay-per-request APIs (like GPT-4’s) and toward running models on rented or owned hardware. The transcript notes that self-hosting the biggest model is not cheap: the weights are about 230 GB, and even an RTX 4090 setup wasn’t enough to run it smoothly with Ollama. But for teams that can afford the infrastructure, the open-weight approach can reduce long-term dependency on closed model providers.
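The ~230 GB figure is consistent with a 4-bit quantized release of a 405B-parameter model. A rough sketch, where the bits-per-parameter value is an assumption (typical for 4-bit quantization once scales and metadata are included):

```python
# Why a 405B download lands near 230 GB: back-of-envelope sketch.
# Assumption: ~4.5 effective bits per parameter, typical for
# 4-bit quantized weights plus per-block scale metadata.
PARAMS = 405e9
BITS_PER_PARAM = 4.5  # assumption, not an official figure

size_gb = PARAMS * BITS_PER_PARAM / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~228 GB, close to the ~230 GB cited

# For contrast, an RTX 4090 has 24 GB of VRAM -- roughly a tenth
# of what even a quantized 405B model needs to stay in memory.
```

The same arithmetic explains why a single consumer GPU cannot run the model smoothly: the weights alone exceed VRAM by an order of magnitude, forcing slow offloading to system RAM or disk.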
Meta’s openness has limits: the training code is described as relatively small—around 300 lines of Python using PyTorch and FairScale for distributed training—yet the training data itself is not open. That data could include personal and proprietary sources such as blog posts, GitHub repositories, old Facebook content, and potentially WhatsApp messages, raising familiar privacy and consent questions even as the model weights remain accessible.
The transcript also frames a broader industry plateau. Multiple companies have poured massive compute into ever-larger models, but capability gains appear to be leveling off, with improvements since GPT-4 described as more incremental than revolutionary. Despite high-stakes rhetoric about AI’s future, programmers still aren’t being replaced, and “Skynet” scenarios remain absent. In that context, Llama 3.1 is portrayed as Meta’s most credible attempt to keep the ecosystem moving—less about sudden artificial superintelligence and more about practical developer access, customization, and competition in the model market.
Cornell Notes
Meta’s Llama 3.1 is a large language model released with open weights and a very long 128,000-token context window, its flagship being a 405B-parameter variant. Training reportedly used 16,000 Nvidia H100 GPUs over months, and Meta claims strong benchmark performance versus GPT-4 and Claude 3.5 Sonnet on some tests. Hands-on results suggest the biggest model can underperform on certain niche single-shot coding tasks, even while the smaller variants look more capable. The biggest practical advantage is that developers can self-host and fine-tune on custom data, avoiding per-request API costs—though the largest weights (~230 GB) are expensive to run locally. The training code is described as relatively compact, but the training data remains closed.
What makes Llama 3.1 strategically different from many closed competitors?
How big is Llama 3.1, and what do “405B” and “128,000-token context” mean in practice?
Why do benchmarks not settle the question of whether the model is “good”?
What does the transcript say about self-hosting costs and feasibility?
What limits the “open” aspect of Llama 3.1?
What broader trend does the transcript claim about the AI industry’s progress?
Review Questions
- Which part of Llama 3.1 is open (weights, code, or training data), and how does that affect developers’ ability to customize models?
- Why might the 405B model underperform on a specific coding task even if it scores well on benchmarks?
- What practical constraints does the transcript mention for running the 405B model locally, and what alternatives are suggested?
Key Points
1. Llama 3.1 is offered in 8B, 70B, and 405B sizes, with the largest variant using a 128,000-token context window.
2. Meta claims the 405B model performs strongly on benchmarks, including comparisons against GPT-4 and Claude 3.5 Sonnet on some tests.
3. Hands-on testing suggests the 405B model can struggle with niche single-shot tasks involving Svelte 5 “runes,” even though it codes reasonably well overall.
4. Open-weight availability enables self-hosting and fine-tuning on custom data, potentially reducing dependence on paid API access.
5. Self-hosting the 405B model is expensive and storage-heavy (about 230 GB of weights), and local hardware may not be sufficient for smooth use.
6. The training code is described as relatively compact (around 300 lines of Python with PyTorch and FairScale), but the training data remains closed.
7. The transcript frames industry progress as plateauing despite large compute investments and questions the gap between AI hype and real-world capability gains.