Llama 3 - 8B & 70B Deep Dive
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Meta’s Llama 3 release centers on two new open-weight language models—8B and 70B—that aim to outperform last generation’s Llama 2 while matching or challenging leading proprietary systems on key benchmarks. The most consequential detail is scale and training depth: both models were trained on more than 15 trillion tokens, with a reported context length of 8K, and both use grouped-query attention. That combination—very large token exposure plus architectural efficiency—helps explain why the smaller 8B model is positioned as a leap forward, even beating the largest Llama 2 variant in several comparisons.
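Grouped-query attention cuts memory and bandwidth costs by letting several query heads share one key/value head, shrinking the KV cache without giving up multi-head queries. A minimal NumPy sketch of the head-sharing idea (head counts and dimensions here are illustrative, not Llama 3's actual configuration):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of query heads attends using one shared k/v head."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each k/v head so it serves its whole group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# 8 query heads sharing 2 KV heads means a 4x smaller KV cache.
out = grouped_query_attention(
    np.random.rand(8, 4, 16), np.random.rand(2, 4, 16),
    np.random.rand(2, 4, 16), n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

The output keeps the full query-head count, so downstream projections are unchanged; only the cached keys and values are smaller.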
The 8B and 70B models arrive in two forms: a base (pre-trained) version intended for fine-tuning, and an instruction-tuned version meant for everyday chat and task execution. Meta’s model cards describe text-only inputs for now, but the transcript notes strong signals that multimodal capability is likely next—especially hints from team members about a future vision-style model where images and other modalities could be added. For developers, the practical near-term takeaway is that Llama 3 is usable immediately in instruction form, while fine-tuning workflows are expected to expand as more scripts and community variants appear.
Benchmark comparisons in the transcript highlight where Llama 3 is strongest. On GSM-style math and other reasoning-oriented tasks, the 8B model is claimed to score roughly double Mistral Instruct and Gemma Instruct, suggesting a meaningful jump in reasoning quality for a relatively small model. The 70B model is described as competitive rather than dramatically dominant: it performs strongly against Gemini Pro 1.5 and Claude 3 Sonnet on a range of evaluations while also beating prior Llama 2 results. Meta’s own evaluation set is described as 800 prompts spanning 12 use cases (advice, brainstorming, classification, coding, creative writing, extraction, roleplay, reasoning, rewriting, summarization, and more), where the 70B model comes out ahead of several baselines, including GPT-3.5, Mistral Medium, and Claude 3 Sonnet.
Training and compute details add context to the performance claims. The transcript cites a training-data cutoff of March 2023 for the 8B model and December 2023 for the 70B model, implying the models were built from datasets assembled well before release. It also notes a reported training run using 24,000 GPUs, substantially fewer than the counts sometimes associated with other frontier efforts, raising questions about how the upcoming 405B model will be trained.
Access to Llama 3 weights comes with a license gate on Hugging Face, and the transcript flags restrictions that may disappoint people expecting “open source” behavior. Two standouts: a clause preventing use of Llama outputs to improve other large language models (outside Llama 3 and Llama 3 fine-tunes), and a requirement that any fine-tuned or merged model name begin with “Llama 3.” Commercial use is allowed if other terms are followed, but the output-based restriction limits certain downstream training strategies.
Finally, the transcript walks through practical ways to try Llama 3—Ollama, LM Studio, Hugging Chat, and hosted endpoints on major cloud providers—then demonstrates a Hugging Face setup using text-generation pipelines. Early hands-on tests suggest the model is generally capable at roleplay, concise answering, and some reasoning patterns, with mixed results on certain math variants that appear sensitive to prompting and system instructions. Looking ahead, the transcript points to an imminent 405B model and hints that future releases may expand context length, multimodality, and code-focused variants.
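In the Hugging Face setup described above, chat messages are converted into a single prompt using Llama 3's chat template (in practice, `transformers` applies this automatically via the tokenizer's `apply_chat_template` or the text-generation pipeline). A sketch of the underlying prompt assembly; the special-token strings follow the published Llama 3 instruct template, but this hand-rolled function is illustrative only:

```python
def format_llama3_chat(messages):
    """Assemble a Llama 3 instruct prompt from role/content messages.

    Each turn is wrapped in header tokens and terminated with <|eot_id|>;
    the trailing assistant header cues the model to generate its reply.
    """
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                   f"{m['content']}<|eot_id|>")
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize grouped-query attention."},
]
print(format_llama3_chat(messages))
```

Knowing this format also explains why system-prompt wording can shift results in the hands-on tests: the system turn is prepended verbatim to every exchange the model sees.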
Cornell Notes
Llama 3 arrives as two open-weight models—8B and 70B—in both base and instruction-tuned formats. Both were trained on over 15 trillion tokens and use grouped-query attention, with a reported 8K context window and text-only inputs for now. Benchmark results in the transcript emphasize strong reasoning performance, including claims that the 8B model can outperform larger prior Llama 2 variants and that the 70B model is competitive with Gemini Pro 1.5 and Claude 3 Sonnet across a 12-category evaluation set. Access requires accepting a Hugging Face license with notable restrictions, including limits on using Llama outputs to train other large models. The practical message: Llama 3 is easy to try via Ollama, Hugging Chat, and hosted endpoints, while fine-tuning and future multimodal/code variants are expected to expand quickly.
- What are the two Llama 3 models released so far, and how do their formats differ?
- Why does the training scale matter for the expected quality of Llama 3?
- How do the benchmark comparisons describe Llama 3’s strengths and limits?
- What license restrictions could affect developers trying to build new models or datasets?
- What practical options exist to run Llama 3 without hosting it yourself?
- What did hands-on prompting tests suggest about Llama 3 behavior?
Review Questions
- Which Llama 3 format would you choose if you want to fine-tune the model yourself, and why?
- How do the transcript’s benchmark claims differ between the 8B and 70B models?
- What two specific license clauses are highlighted as most likely to limit downstream model training or reuse?
Key Points
1. Llama 3 is released in two open-weight sizes—8B and 70B—each available as both base (pre-trained) and instruction-tuned variants.
2. Both models were trained on over 15 trillion tokens and use grouped-query attention, supporting strong performance despite an 8K context window.
3. Meta’s reported evaluation set spans 12 task categories (800 prompts), where the 70B model is described as beating several baselines including GPT-3.5 and Claude 3 Sonnet.
4. The Hugging Face license includes restrictions that limit using Llama outputs to improve other large language models and requires “Llama 3” to appear at the start of fine-tuned/merged model names.
5. Llama 3 can be tried quickly via Ollama, LM Studio, Hugging Chat, and hosted endpoints on major cloud providers.
6. Early prompting tests suggest Llama 3 is strong at roleplay and instruction following, but math performance can be inconsistent depending on system prompts and question framing.