
Llama 4 Test with Groq: Coding, Data Extraction, Data Labelling, Summarization, RAG

Venelin Valkov · 6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Meta’s Llama 4 lineup includes Scout (109B), Maverick (400B), and Behemoth (2T, still training), with long-context claims of 10M tokens (Scout) and 1M tokens (Maverick).

Briefing

Meta’s Llama 4 lineup—Scout (109B), Maverick (400B), and Behemoth (2T, still training)—arrives with headline claims built around huge context windows and strong benchmark performance, but real-world testing paints a more mixed picture. Meta says Llama 4 Scout supports a 10 million token context window, while Llama 4 Maverick supports a 1 million token context window and “beats” models like GPT-4.5 and Gemini 2.0 Flash on STEM benchmarks. The catch: these models are mixture-of-experts (MoE), meaning only a subset of parameters activates per token, yet the full model still must be loaded on the GPU—so the hardware requirements remain steep. For quantization, Meta’s guidance points to needing an H100-class GPU, which effectively keeps “local” experimentation out of reach for most users.

A key technical differentiator is Meta’s revamped post-training pipeline. After large-scale filtered pretraining, the models are tuned for chat behavior via instruction tuning and reinforcement-style optimization. Historically, human feedback drove instruction tuning; now the pipeline leans more on supervised fine-tuning and reinforcement learning using reward functions and methods such as GRPO, along with DPO (direct preference optimization). Meta’s stated motivation is that supervised fine-tuning and DPO can over-constrain the model, limiting exploration during later online RL and hurting accuracy—especially in reasoning, coding, and other demanding domains. Distillation also plays a role: Llama 4 Behemoth is positioned as a “teacher” model used to distill capability into smaller MoE variants like Scout and Maverick.

Independent evaluation and practical tests using Groq’s API access for Llama 4 Scout and Maverick show both strengths and failure modes. In coding-style prompts, Llama 4 Scout generated a synthetic dataset of wealthy people by continent (including Antarctica) and produced a Pandas workflow to compute the top five wealthiest per continent, completing in under two seconds. It also performed structured data extraction from a receipt image and from receipt text, returning correctly formatted fields such as line items and totals in markdown/JSON-like structures. Multimodal behavior looked solid: given an image of a receipt on a countertop, it described the scene and extracted key numbers quickly (around a second).
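
To make the receipt-extraction target concrete, here is a plain-Python stand-in for the kind of structured output the test expects. The receipt text and field names are hypothetical, and the parsing is ordinary string handling rather than a model call:

```python
import json
import re

# Hypothetical receipt text standing in for the transcript's test input.
receipt_text = """\
Latte           4.50
Bagel           3.25
Orange Juice    3.00
Total          10.75
"""

def extract_receipt(text: str) -> dict:
    """Parse simple 'item  price' lines into line items and a total."""
    items, total = [], None
    for line in text.splitlines():
        match = re.match(r"(.+?)\s{2,}(\d+\.\d{2})$", line.strip())
        if not match:
            continue
        name, price = match.group(1).strip(), float(match.group(2))
        if name.lower() == "total":
            total = price
        else:
            items.append({"item": name, "price": price})
    return {"line_items": items, "total": total}

result = extract_receipt(receipt_text)
print(json.dumps(result, indent=2))
```

Checking that line items sum to the stated total is the same sanity check the tester applied to the model's output.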

Where things became less reliable was in strict output formatting and summarization control. A tweet-labeling task that required a single JSON response triggered a “failed to generate JSON” error; the model produced partial fields but then diverged when a predicted class didn’t match the allowed label set. Summarization also showed a prompt-following gap: when asked for a concise summary, the output started with extra framing text rather than delivering only the requested summary. A follow-up attempt to generate a LinkedIn post likewise began with an unwanted preface.

Finally, retrieval-augmented generation (RAG) tests against a Meta earnings document produced grounded answers when the quote existed and returned “not enough information” when it didn’t. Table extraction worked well when values came from a single table, and remained accurate even when the task required pulling specific fields across multiple tables.

Overall, Llama 4’s MoE architecture, long-context claims, and post-training changes look promising on targeted tasks—especially coding, extraction, and multimodal OCR—but strict schema adherence and “just give me the summary” instruction following still need work. Meta’s marketing benchmark numbers may not fully match real-world performance, and the gap is likely to narrow only as more independent tests and broader availability (including smaller local variants) arrive.

Cornell Notes

Meta’s Llama 4 family (Scout 109B, Maverick 400B, Behemoth 2T) targets long-context performance and strong STEM results, using mixture-of-experts models that still require high-end GPUs because the full model must be loaded at inference. Meta attributes improvements partly to a revised post-training pipeline: instruction tuning plus reinforcement-style optimization using methods like GRPO and DPO, with an emphasis on avoiding over-constraint that can reduce exploration during RL. In hands-on tests via Groq’s API, Llama 4 Scout performed quickly and well on coding (Pandas dataset generation), receipt extraction (text and image), and RAG Q&A grounded in a Meta earnings document. Weak spots appeared in strict JSON/schema output and in prompt-following for “summary only,” where extra framing text sometimes leaked into results.

Why do mixture-of-experts (MoE) models still demand powerful hardware at inference?

MoE activates only a subset of experts per token, but the system still needs to load the full set of MoE model weights into GPU memory. That means the “active parameters” count doesn’t translate into easy local deployment. In the transcript, Meta’s quantization guidance for Llama 4 Scout points to H100-class hardware, and the tester notes that even Scout (109B) is too large for typical workable local setups.
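
A back-of-envelope calculation makes the point concrete. The parameter count comes from the transcript; the bytes-per-parameter figures are standard fp16 and int4 sizes:

```python
# Back-of-envelope GPU memory needed just to hold model weights at inference.
# fp16/bf16 = 2 bytes per parameter; int4 quantization ≈ 0.5 bytes.

def weight_memory_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (ignores activations and KV cache)."""
    return total_params_billions * 1e9 * bytes_per_param / 1e9

# Llama 4 Scout: all 109B parameters must be resident, even though only a
# subset of experts activates per token.
scout_fp16 = weight_memory_gb(109, 2.0)   # ~218 GB: multi-GPU territory
scout_int4 = weight_memory_gb(109, 0.5)   # ~54.5 GB: fits an 80 GB H100,
                                          # but not a 24 GB consumer card
print(f"Scout fp16 weights: {scout_fp16:.0f} GB")
print(f"Scout int4 weights: {scout_int4:.1f} GB")
```

Even aggressive 4-bit quantization leaves Scout above consumer-GPU memory, which matches the H100-class guidance cited in the transcript.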

What changed in Llama 4’s post-training pipeline, and why does it matter for reasoning/coding?

After large filtered pretraining, the models are tuned for chat via instruction tuning and reinforcement-style optimization. The transcript highlights a shift toward supervised fine-tuning plus reinforcement learning using reward functions (including GRPO) and preference methods like DPO. Meta’s concern is that supervised fine-tuning and DPO can over-constrain the model, reducing exploration during later online RL and leading to suboptimal accuracy—particularly in reasoning and coding domains.
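
For reference, the DPO objective mentioned here reduces, per preference pair, to a logistic loss on the policy-versus-reference log-probability margin. The numbers below are toy values, not anything from Llama 4's actual training:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    pi_logp_*  : policy log-probs of the chosen (w) and rejected (l) responses
    ref_logp_* : reference-model log-probs of the same responses
    beta       : strength of the implicit KL constraint to the reference
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy numbers: the policy already prefers the chosen response slightly more
# than the reference does, so the loss dips below -log(0.5) ≈ 0.693.
loss = dpo_loss(pi_logp_w=-4.0, pi_logp_l=-6.0,
                ref_logp_w=-5.0, ref_logp_l=-5.5, beta=0.1)
print(f"{loss:.4f}")
```

The tight coupling to the reference model is exactly the "over-constraint" Meta cites: the loss only rewards shifting preferences relative to the reference, which can limit exploration in later online RL.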

How did Llama 4 Scout perform on a coding task that required generating data and computing top-wealth entries by continent?

The test asked for at least a thousand synthetic examples with fields like name, gender, wealth, and continent, then required a Pandas DataFrame workflow to select the top five wealthiest per continent, sorted by continent and then from poorest to richest. The model produced the dataset-generation and transformation code in under two seconds, with results the tester described as “perfect,” including plausible continent coverage (Antarctica appeared, though many of its entries were excluded).
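
The Pandas portion of that workflow can be sketched as follows. The column names and the synthetic-data generation are assumptions, since the transcript does not show the model's exact code:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the transcript's 1000-row dataset.
rng = np.random.default_rng(0)
continents = ["Africa", "Asia", "Europe", "North America",
              "South America", "Oceania", "Antarctica"]
df = pd.DataFrame({
    "name": [f"person_{i}" for i in range(200)],
    "gender": rng.choice(["female", "male"], size=200),
    "continent": rng.choice(continents, size=200),
    "wealth": rng.uniform(1e6, 1e11, size=200),
})

# Top five wealthiest per continent, then sorted by continent and from
# poorest to richest within each continent.
top5 = (
    df.sort_values("wealth", ascending=False)
      .groupby("continent", group_keys=False)
      .head(5)
      .sort_values(["continent", "wealth"], ascending=[True, True])
      .reset_index(drop=True)
)
print(top5)
```

Sorting globally before `groupby(...).head(5)` is the idiomatic way to take a per-group top-k without an explicit `apply`.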

What went wrong in the tweet-labeling task that demanded strict JSON output?

When asked to classify tweets into single labels for target audience, tone/sentiment, complexity level, and main themes, the model hit a “failed to generate JSON” error. It still produced some fields (e.g., tone and complexity), but the output didn’t fully conform to the required JSON schema. Additionally, one predicted theme didn’t match the allowed list, forcing the tester to revise to the closest match.
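
One way to reproduce this failure mode is to validate the model's reply against the allowed label sets. The field names and label lists below are hypothetical stand-ins for the transcript's taxonomy:

```python
import json

# Hypothetical allowed label sets; the transcript's actual taxonomy differs.
ALLOWED = {
    "tone": {"positive", "negative", "neutral"},
    "complexity": {"beginner", "intermediate", "advanced"},
    "theme": {"ai", "product", "research", "business"},
}

def validate_labels(raw: str) -> tuple[dict, list[str]]:
    """Parse a model reply as JSON and report any off-schema labels."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, [f"failed to generate JSON: {exc}"]
    for field, allowed in ALLOWED.items():
        value = data.get(field)
        if value not in allowed:
            errors.append(f"{field}={value!r} not in allowed set")
    return data, errors

# A reply like the one in the transcript: valid JSON, but one label
# ("machine learning") falls outside the allowed theme list.
reply = '{"tone": "positive", "complexity": "intermediate", "theme": "machine learning"}'
data, errors = validate_labels(reply)
print(errors)
```

Catching off-list labels programmatically is what forced the tester to map the model's prediction to the closest allowed match by hand.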

How did the model handle summarization and LinkedIn-post generation from an earnings document?

For summarization, the tester requested a 3–4 sentence summary, but the output included extra framing text before the summary—described as a major fail for a strict “summary only” request. For a LinkedIn post, the model similarly began with an unwanted preface (“Here is a LinkedIn post based on the provided text”), though the remainder (key highlights, a closing question, and hashtags) was considered reasonably usable.
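
A pragmatic workaround, assuming the unwanted framing always follows the "Here is a ..." pattern seen in these tests, is to strip the preface in post-processing:

```python
import re

# Matches a leading framing line like "Here is a LinkedIn post based on
# the provided text:" (pattern based on the prefaces described above).
PREFACE = re.compile(
    r"^\s*here (?:is|'s) (?:a|the)\b[^\n:]*:?\s*\n?", re.IGNORECASE
)

def strip_preface(text: str) -> str:
    """Drop a leading 'Here is a ...' framing line if present."""
    return PREFACE.sub("", text, count=1).lstrip()

raw = "Here is a LinkedIn post based on the provided text:\n\nBig quarter for Meta..."
print(strip_preface(raw))
```

Text without a preface passes through unchanged, so the filter is safe to apply unconditionally.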

What did RAG tests reveal about grounding and uncertainty?

Using the Meta earnings text as context, the model answered a question about what the founder is most proud of by saying the document didn’t provide enough information—then it correctly quoted and used a relevant line for a different question. It also returned a tax-rate estimate consistent with the CFO commentary (“mid-teen” range for 2024), showing it can both ground answers when evidence exists and abstain when it doesn’t.
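
The grounded-or-abstain contract described above can be encoded directly in the prompt. The template wording below is an assumption, not the transcript's exact prompt:

```python
# Prompt template enforcing answer-from-context-or-abstain behavior.
RAG_PROMPT = """\
Use only the context below to answer the question. If the context does
not contain the answer, reply exactly: not enough information

Context:
{context}

Question: {question}
Answer:"""

def build_rag_prompt(context: str, question: str) -> str:
    """Fill the template with a retrieved document chunk and a question."""
    return RAG_PROMPT.format(context=context, question=question)

prompt = build_rag_prompt(
    context="CFO commentary: we expect the 2024 tax rate to be in the mid-teens.",
    question="What tax rate does Meta expect for 2024?",
)
print(prompt)
```

An explicit abstain instruction with a fixed fallback string is what makes the "not enough information" behavior observed in the tests easy to detect downstream.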

Review Questions

  1. In MoE models, what distinction between “active parameters” and “loaded parameters” explains why local deployment can still be difficult?
  2. Which post-training methods mentioned (e.g., GRPO, DPO) are tied to the transcript’s discussion of over-constraint and exploration during RL?
  3. What specific failure modes showed up in structured tasks: JSON/schema adherence, prompt-following for summaries, or multimodal extraction—and how did each manifest?

Key Points

  1. Meta’s Llama 4 lineup includes Scout (109B), Maverick (400B), and Behemoth (2T, still training), with long-context claims of 10M tokens (Scout) and 1M tokens (Maverick).

  2. MoE architecture reduces active computation per token, but inference still requires loading the full model weights, keeping GPU requirements high (H100-class guidance for quantization is cited).

  3. Meta’s post-training approach emphasizes supervised fine-tuning plus reinforcement-style optimization using reward functions (including GRPO) and preference methods like DPO, aiming to avoid over-constraining exploration during RL.

  4. Hands-on tests via Groq’s API found strong performance on coding workflows (Pandas transformations) and on receipt extraction from both text and images, with fast latency.

  5. Strict output formatting remains a weak point: a tweet-labeling task triggered a “failed to generate JSON” error and produced labels that didn’t always match the allowed options.

  6. Summarization and social-post generation sometimes included unwanted prefaces, indicating imperfect adherence to “summary only” or formatting constraints.

  7. RAG behavior showed both grounding (correctly quoting and extracting from the earnings document) and uncertainty handling (declining to answer when the document lacked evidence).

Highlights

Llama 4 Scout generated a complete synthetic-data + Pandas workflow for “top five wealthiest per continent” in under two seconds, with results the tester described as highly accurate.
Receipt extraction worked well in both directions: text-to-structured fields and image-to-number extraction, with markdown-formatted outputs and correct totals/line items in checked cases.
A tweet classification prompt requiring strict JSON failed—producing partial structured fields and then diverging when a predicted theme didn’t match the allowed label set.
RAG tests demonstrated grounded answers when the evidence existed and a refusal when it didn’t, including a correct “mid-teen” 2024 tax-rate response from CFO commentary.
