Llama 3.3 70B Test - Coding, Data Extraction, Summarization, Data Labelling, RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Meta’s Llama 3.3 70B is landing as a strong all-around text model, with independent evaluations and hands-on tests pointing to performance that tracks closely with earlier top-tier releases—especially on structured tasks like coding, data extraction, and table-based question answering. The practical takeaway: even without local hardware, the model can be run quickly via Groq’s API, and it repeatedly handles “real work” prompts that many comparable models stumble on.
The transcript frames Llama 3.3 as an evolution over Llama 3.2, with Meta reporting evaluation results near Llama 3.1 and independent scoring from Artificial Analysis suggesting an intelligence level comparable to Llama 3.1 and close to the November release of GPT-4o. A key reported jump is on MMLU-style benchmarks (the creator highlights MMLU as the biggest increase). The model is multilingual, text-only, and supports a 128K context window, with roughly 15T tokens used for pretraining; its knowledge cutoff is December 2023. Availability is described as primarily through Hugging Face, with the 70B variant offered in 4-bit quantization at about 43 GB of storage, still heavy enough to require serious GPU resources.
Because the creator lacks the hardware for full local testing, the evaluation is done through Groq's API. Two deployment modes are mentioned: the default "versatile" model and a speculative-decoding variant that can be faster. Speed becomes a recurring theme: responses arrive quickly enough that the workflow feels practical for iterative testing.
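For orientation, the two Groq deployment modes correspond to the model IDs `llama-3.3-70b-versatile` and `llama-3.3-70b-specdec` (the speculative-decoding variant). The sketch below assumes Groq's OpenAI-compatible chat-completions endpoint; the helper names are ours, and this is an illustration rather than the creator's exact setup:

```python
import json
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_request(prompt: str, model: str = "llama-3.3-70b-versatile") -> dict:
    """Build a chat-completions payload for Groq's OpenAI-compatible API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # keep outputs stable for repeated test prompts
    }

def ask(prompt: str, api_key: str, model: str = "llama-3.3-70b-versatile") -> str:
    """Send the prompt to Groq and return the model's reply text."""
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Swap in the speculative-decoding variant for faster responses.
payload = build_request("Summarize this earnings report.", "llama-3.3-70b-specdec")
print(payload["model"])
```

Only the payload-building step runs without credentials; `ask` requires a Groq API key.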
In coding, Llama 3.3 is tested with a pandas task: generating a dataset of the wealthiest people in each continent, then sorting by continent and ordering wealth from poorest to richest. The model produces working, well-documented code and gets the ordering right—an outcome the creator says even some other large models failed to match.
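A minimal sketch of the kind of pandas solution described; the dataset values and column names here are placeholders for illustration, not the creator's prompt or real net-worth data:

```python
import pandas as pd

# Toy dataset: one wealthiest person per continent (placeholder values).
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "North America", "Africa"],
    "person": ["Person A", "Person B", "Person C", "Person D"],
    "net_worth_billions": [90.0, 150.0, 200.0, 12.0],
})

# Sort by continent, then order wealth from poorest to richest,
# matching the task's two-level ordering requirement.
result = df.sort_values(
    ["continent", "net_worth_billions"], ascending=[True, True]
).reset_index(drop=True)
print(result)
```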
For data labeling, the model is given five tweets related to categories such as malicious VS Code extensions, migrating infrastructure, AI replacing doors ASAP, “builder” motivation, and a more philosophical focus on attention and priorities. The labeling is described as extremely fast, with most outputs landing correctly; the creator notes a few areas of uncertainty (tone or topic assignment) but still calls the results close to perfect.
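A hedged sketch of how such a labeling prompt might be assembled. The category names below are loosely paraphrased from the transcript, and the prompt wording and function are our invention, not the creator's actual test harness:

```python
# Categories paraphrased from the transcript's tweet-labeling test.
CATEGORIES = [
    "malicious VS Code extensions",
    "infrastructure migration",
    "AI replacement",
    "builder motivation",
    "attention and priorities",
]

def build_labeling_prompt(tweets: list[str], categories: list[str] = CATEGORIES) -> str:
    """Assemble a single classification prompt covering all tweets."""
    lines = ["Assign exactly one category to each tweet.", "Categories:"]
    lines += [f"- {c}" for c in categories]
    lines.append("Tweets:")
    lines += [f"{i + 1}. {t}" for i, t in enumerate(tweets)]
    lines.append("Answer as `<tweet number>: <category>`, one per line.")
    return "\n".join(lines)

prompt = build_labeling_prompt(["Just shipped a new feature. Keep building."])
print(prompt)
```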
Summarization and generation are also tested using Meta’s latest earnings report. The model produces concise summaries that capture the gist of revenue and other key figures, and it generates LinkedIn-style posts without adding typical social-media embellishments like emojis or rockets—unless prompted otherwise.
The strongest "accuracy" moments come from extraction and retrieval-style prompts. Given receipt markdown, the model extracts quantities and key fields correctly, with only minor formatting assumptions needing adjustment (e.g., converting seconds into hours and minutes). For earnings-report questions, it answers quote-grounded facts correctly (such as the expected 2024 tax rate) and extracts table values with the correct numbers, including cases where, per the transcript, other models either miscomputed or failed to produce the requested table structure. Overall, the creator concludes Llama 3.3 70B "passes with flying colors," while suggesting that harder, more adversarial tests could further probe its limits, especially for agentic frameworks and scenarios where the needed information is less readily available.
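The seconds-to-hours/minutes adjustment mentioned above is a small deterministic formatting step that can live outside the model entirely; a minimal helper (the function name is ours):

```python
def seconds_to_hm(total_seconds: int) -> str:
    """Format a duration in seconds as 'Hh Mm', dropping leftover seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes = remainder // 60
    return f"{hours}h {minutes}m"

print(seconds_to_hm(5400))  # prints "1h 30m"
```

Keeping conversions like this in plain code, rather than asking the model to do the arithmetic, avoids one common source of extraction errors.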
Cornell Notes
Llama 3.3 70B is presented as a text-only multilingual model that performs close to Llama 3.1 on reported benchmarks, with independent scoring from Artificial Analysis placing it near top recent systems. In hands-on tests run via Groq’s API, it delivers fast responses and performs well on practical tasks: generating pandas code with correct sorting, labeling tweet categories with high accuracy, summarizing Meta earnings concisely, and extracting structured data from receipts and earnings tables. The model also answers quote-grounded questions correctly when the needed information is present in the provided text/markdown. The main reason it matters is that it combines speed (via Groq) with strong structured reasoning, making it usable for RAG-style workflows and data pipelines without requiring local 70B hardware.
- What benchmarks and third-party evaluations are cited to position Llama 3.3 70B relative to other frontier models?
- How is the model run for testing, and what speed-related options are mentioned?
- What coding task demonstrates Llama 3.3's ability to follow structured data requirements?
- How does Llama 3.3 perform on tweet classification, and what kinds of errors are noted?
- What extraction and table QA tests show the strongest accuracy, and why?
- What limitation is implied about harder RAG scenarios?
Review Questions
- Which specific structured tasks (coding, labeling, extraction, table QA) does Llama 3.3 70B handle best in these tests, and what evidence is given for correctness?
- How do the Groq API modes (versatile vs speculative decoding) relate to the transcript’s claims about response speed?
- Why does the transcript repeatedly stress that the answerable facts were present in the provided markdown/text, and how would removing that information change the evaluation?
Key Points
1. Meta positions Llama 3.3 70B as an evolution over Llama 3.2, with reported benchmark performance near Llama 3.1 and independent scoring near GPT-4o (November).
2. Artificial Analysis is cited as an independent evaluator placing Llama 3.3's intelligence level around Llama 3.1, with a notable jump on MMLU-style results.
3. The 70B model is described as text-only and multilingual, with a 128K context window, roughly 15T pretraining tokens, and a December 2023 knowledge cutoff.
4. Groq's API enables practical testing without local 70B hardware, with two modes mentioned: versatile and speculative decoding for speed.
5. In coding tests, Llama 3.3 produces pandas code that correctly sorts by continent and orders wealth from poorest to richest, an outcome the transcript says many other models failed to match.
6. For data extraction and table QA, the model delivers correct numbers when the relevant facts are included in the provided markdown/text, including receipt fields and earnings-report tables.
7. The transcript flags a future need for harder RAG tests where the requested information is not present in the prompt context.