
Llama 3.3 70B Test - Coding, Data Extraction, Summarization, Data Labelling, RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Meta positions Llama 3.3 70B as an evolution over Llama 3.2, with reported benchmark performance near Llama 3.1 and independent scoring near GPT-4o (November).

Briefing

Meta’s Llama 3.3 70B is landing as a strong all-around text model, with independent evaluations and hands-on tests pointing to performance that tracks closely with earlier top-tier releases—especially on structured tasks like coding, data extraction, and table-based question answering. The practical takeaway: even without local hardware, the model can be run quickly via Groq’s API, and it repeatedly handles “real work” prompts that many comparable models stumble on.

The transcript frames Llama 3.3 as an evolution over Llama 3.2, with Meta reporting evaluation results near Llama 3.1 and independent scoring from Artificial Analysis (California) suggesting an intelligence level comparable to Llama 3.1 and close to November’s GPT-4o. A key reported jump is on MMLU-style benchmarks (the creator highlights “MMLU” as the biggest increase). The model is multilingual, text-only, supports a 128K context window, and was pretrained on roughly 15T tokens; its knowledge cutoff is December 2023. Availability is described as primarily through Hugging Face, with the 70B variant offered in 4-bit quantization at about 43GB of storage—still heavy enough to require serious GPU resources.

Because the creator lacks hardware for full local testing, the evaluation is done through Groq’s API. Two deployment modes are mentioned: a default “versatile” style and a speculative decoding variant that can be faster. Speed becomes a recurring theme: responses arrive quickly enough that the workflow feels practical for iterative testing.
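
For context, the calls behind these tests look roughly like the following sketch using Groq’s Python SDK. The model identifiers are assumptions inferred from the two modes mentioned (a “versatile” default and a speculative-decoding variant), and the prompt is a placeholder:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Assumed identifiers for the two modes mentioned in the video:
# the default "versatile" deployment and a speculative-decoding variant.
MODEL_DEFAULT = "llama-3.3-70b-versatile"
MODEL_SPECDEC = "llama-3.3-70b-specdec"

response = client.chat.completions.create(
    model=MODEL_DEFAULT,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a one-line summary of Llama 3.3 70B."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```

Switching `model` to the speculative-decoding identifier is the only change needed to try the faster variant.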

In coding, Llama 3.3 is tested with a pandas task: generating a dataset of the wealthiest people in each continent, then sorting by continent and ordering wealth from poorest to richest. The model produces working, well-documented code and gets the ordering right—an outcome the creator says even some other large models failed to match.
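
A minimal sketch of the kind of pandas solution the prompt asks for; the sample rows, column names, and function below are illustrative assumptions rather than the model’s actual output:

```python
import pandas as pd

# Placeholder rows; the model generated its own dataset of the
# wealthiest people per continent.
rows = [
    {"name": "Person A", "continent": "Africa", "wealth_millions": 12_000},
    {"name": "Person B", "continent": "Asia", "wealth_millions": 95_000},
    {"name": "Person C", "continent": "Asia", "wealth_millions": 40_000},
    {"name": "Person D", "continent": "Europe", "wealth_millions": 150_000},
]

def wealthiest_by_continent(data: list[dict]) -> pd.DataFrame:
    """Sort by continent, then by wealth ascending (poorest to richest),
    matching the ordering the prompt requires."""
    df = pd.DataFrame(data)
    return df.sort_values(
        by=["continent", "wealth_millions"], ascending=[True, True]
    ).reset_index(drop=True)

print(wealthiest_by_continent(rows))
```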

For data labeling, the model is given five tweets related to categories such as malicious VS Code extensions, migrating infrastructure, AI replacing doctors ASAP, “builder” motivation, and a more philosophical focus on attention and priorities. The labeling is described as extremely fast, with most outputs landing correctly; the creator notes a few areas of uncertainty (tone or topic assignment) but still calls the results close to perfect.
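
Reusing the client from the earlier sketch, a labeling call along these lines would reproduce the test; the category names and tweet text are placeholders, not the creator’s actual examples:

```python
categories = [
    "security",        # e.g. malicious VS Code extensions
    "infrastructure",  # e.g. migrating to your own servers
    "ai-and-society",  # e.g. AI replacing professions
    "motivation",      # e.g. surround yourself with builders
    "philosophy",      # e.g. attention and priorities
]

tweet = "Found yet another malicious VS Code extension stealing API tokens..."

prompt = (
    f"Classify the tweet into exactly one of these categories: {', '.join(categories)}.\n"
    f"Tweet: {tweet}\n"
    "Answer with the category name only."
)

label = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
).choices[0].message.content
print(label)
```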

Summarization and generation are also tested using Meta’s latest earnings report. The model produces concise summaries that capture the gist of revenue and other key figures, and it generates LinkedIn-style posts without adding typical social-media embellishments like emojis or rockets—unless prompted otherwise.

The strongest “accuracy” moments come from extraction and retrieval-style prompts. When given receipt markdown, it extracts quantities and key fields correctly, with only minor formatting assumptions needing adjustment (e.g., converting seconds into hours/minutes). For earnings-report questions, it answers quote-grounded facts correctly (such as the expected 2024 tax rate) and extracts table values with correct numbers, including cases where the transcript says other models either miscomputed or failed to produce the requested table structure. Overall, the creator concludes Llama 3.3 70B “passes with flying colors,” while suggesting that harder, more adversarial tests could further validate its limits—especially for agentic frameworks and scenarios where the needed information is not readily available.
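
The extraction tests boil down to pasting the markdown into the prompt and asking for structured output. A hedged sketch with an assumed receipt table and JSON schema, not the exact prompt from the video:

```python
import json

receipt_markdown = """
| Item      | Qty | Price |
|-----------|-----|-------|
| Espresso  | 2   | 3.50  |
| Croissant | 1   | 2.80  |
"""  # placeholder receipt; the video uses its own markdown

extraction_prompt = (
    "Extract every line item from the receipt below as a JSON array of objects "
    "with the keys 'item', 'quantity', and 'price'. Return only the JSON.\n\n"
    + receipt_markdown
)

raw = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": extraction_prompt}],
    temperature=0,
).choices[0].message.content

items = json.loads(raw)  # may need light cleanup if the model wraps the JSON in prose
print(items)
```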

Cornell Notes

Llama 3.3 70B is presented as a text-only multilingual model that performs close to Llama 3.1 on reported benchmarks, with independent scoring from Artificial Analysis placing it near top recent systems. In hands-on tests run via Groq’s API, it delivers fast responses and performs well on practical tasks: generating pandas code with correct sorting, labeling tweet categories with high accuracy, summarizing Meta earnings concisely, and extracting structured data from receipts and earnings tables. The model also answers quote-grounded questions correctly when the needed information is present in the provided text/markdown. The main reason it matters is that it combines speed (via Groq) with strong structured reasoning, making it usable for RAG-style workflows and data pipelines without requiring local 70B hardware.

What benchmarks and third-party evaluations are cited to position Llama 3.3 70B relative to other frontier models?

Meta’s reported evaluation metrics are described as close to Llama 3.1, and the transcript notes independent evaluations from Artificial Analysis (California) claiming Llama 3.3 reaches an intelligence level similar to Llama 3.1 and comparable to the November release of GPT-4o. The creator also highlights a major improvement on an MMLU-style benchmark, calling it the biggest jump versus prior versions.

How is the model run for testing, and what speed-related options are mentioned?

Local testing isn’t feasible for the creator, so testing uses Groq’s API. Two modes are mentioned: a default “versatile” version and a speculative decoding version that can be faster. Output latency is repeatedly described as very fast, making iterative coding and extraction prompts practical.

What coding task demonstrates Llama 3.3’s ability to follow structured data requirements?

The coding prompt asks for a pandas DataFrame of the wealthiest people in each continent, sorted first by continent and then from poorest to richest by wealth (in millions). The transcript claims Llama 3.3 is among the first open models tried that can complete this exact ordering correctly, producing well-documented code with a single function rather than splitting logic across multiple functions.

How does Llama 3.3 perform on tweet classification and what kinds of errors are noted?

Five tweets are mapped to classes related to malicious VS Code extensions, migrating to one’s own infrastructure, AI replacing doctors ASAP, surrounding oneself with builders, and a philosophical focus on attention and priorities. The model’s labeling is described as extremely fast and “perfect” or near-perfect for topic assignment in at least one example (technology/health). Uncertainty is noted for tone and one tweet’s complexity, but overall results are still described as highly accurate.

What extraction and table QA tests show the strongest accuracy, and why?

Receipt extraction from markdown is described as mostly correct: quantities and key fields match the expected answer, with only a minor issue about interpreting time units (seconds vs hours/minutes). For Meta earnings, the model correctly answers quote-grounded questions like the expected 2024 tax rate (in the mid-teens) and extracts table values with correct numbers. The transcript emphasizes that these successes occur when the needed information is present in the provided markdown/text used for the prompt.
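
In practice these quote-grounded questions are a single-document RAG setup: the earnings markdown is placed in the context and the question is asked against it. A minimal sketch under that assumption (the file name and prompt wording are hypothetical):

```python
# Assumed local copy of the earnings report converted to markdown.
earnings_markdown = open("meta_earnings.md").read()

question = "What is the expected full-year 2024 tax rate?"

answer = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system",
         "content": "Answer strictly from the provided document and quote the relevant sentence."},
        {"role": "user",
         "content": f"Document:\n{earnings_markdown}\n\nQuestion: {question}"},
    ],
    temperature=0,
).choices[0].message.content
print(answer)
```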

What limitation is implied about harder RAG scenarios?

The creator suggests a better test would involve questions where the requested information isn’t available in the provided text/markdown. In the current tests, the needed facts appear to be present in the prompt context, which helps extraction and table QA succeed. That sets up a future direction for more adversarial evaluation.
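
One way to run that harder test is to ask for a fact deliberately absent from the context and check whether the model abstains instead of guessing. A hedged sketch, reusing the client and document from the previous example:

```python
unanswerable = "What was the company's total headcount in 2015?"  # assumed to be absent from the excerpt

probe = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system",
         "content": "Answer only from the document. If the answer is not present, reply exactly 'NOT FOUND'."},
        {"role": "user",
         "content": f"Document:\n{earnings_markdown}\n\nQuestion: {unanswerable}"},
    ],
    temperature=0,
).choices[0].message.content

if probe.strip() != "NOT FOUND":
    print("Model answered despite missing evidence:", probe)
```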

Review Questions

  1. Which specific structured tasks (coding, labeling, extraction, table QA) does Llama 3.3 70B handle best in these tests, and what evidence is given for correctness?
  2. How do the Groq API modes (versatile vs speculative decoding) relate to the transcript’s claims about response speed?
  3. Why does the transcript repeatedly stress that the answerable facts were present in the provided markdown/text, and how would removing that information change the evaluation?

Key Points

  1. Meta positions Llama 3.3 70B as an evolution over Llama 3.2, with reported benchmark performance near Llama 3.1 and independent scoring near GPT-4o (November).

  2. Artificial Analysis is cited as an independent evaluator placing Llama 3.3’s intelligence level around Llama 3.1, with a notable jump on MMLU-style results.

  3. The 70B model is described as text-only, multilingual, with a 128K context window, ~15T pretraining tokens, and a December 2023 knowledge cutoff.

  4. Groq’s API enables practical testing without local 70B hardware, with two modes mentioned: versatile and speculative decoding for speed.

  5. In coding tests, Llama 3.3 produces pandas code that correctly sorts by continent and orders wealth from poorest to richest—an outcome the transcript says many other models fail to match.

  6. For data extraction and table QA, the model delivers correct numbers when the relevant facts are included in the provided markdown/text, including receipt fields and earnings-report tables.

  7. The transcript flags a future need for harder RAG tests where the requested information is not present in the prompt context.

Highlights

Llama 3.3 70B is repeatedly described as fast to respond via Groq’s API, making iterative coding and extraction workflows practical.
The pandas dataset task (continent grouping + poorest-to-richest ordering) is presented as a standout success that other tested models couldn’t match.
Receipt and earnings-table extraction are described as numerically correct, including quote-grounded answers like the expected 2024 tax rate.
The model generates LinkedIn-style posts without default emoji/rocket flair, suggesting controllable formatting behavior.
