Local Llama 3.2 (3B) Test using Ollama - Summarization, Structured Text Extraction, Data Labelling

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Run the Q4-quantized Llama 3.2 3B model locally via Ollama for fast responses, especially when using keep-alive to avoid reload latency.

Briefing

A 3B quantized Llama 3.2 model running locally through Ollama delivers fast, usable results for structured data extraction—especially when information is clearly present in receipts or tables—but it struggles with strict instruction-following, reliable summarization, and some question-answering tasks that require precise grounding.

On an M3 MacBook Pro, the workflow starts with downloading and running the Q4-quantized “Llama 3.2” 3B model via Ollama, then calling it from Python 3.12 using the Ollama client library. The setup emphasizes speed: a “keep alive” option keeps the model resident in memory, and the author reports strong token generation rates compared with larger models. Early conversational prompts respond quickly, setting expectations that the small model can be practical for everyday use.
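
A minimal sketch of that setup, assuming the `ollama` Python package and a pulled `llama3.2:3b` tag (the exact tag and keep-alive duration are assumptions, not confirmed in the video):

```python
# Call the locally served Llama 3.2 3B model through the Ollama Python client.
import ollama

response = ollama.chat(
    model="llama3.2:3b",  # Q4-quantized 3B model, e.g. after `ollama pull llama3.2:3b`
    messages=[{"role": "user", "content": "In one sentence, what is quantization?"}],
    keep_alive="30m",     # keep the model resident in memory to avoid reload latency
)
print(response["message"]["content"])
```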

The first major test—prompted code generation—shows mixed reliability. The model produced code intended to generate a dataset of “top five” famous people per continent, sorted by continent and then by birth year. The generated function looked fine at a glance, but the output contained clear failures: repeated names, missing Australia, and no consistent sorting. The code ran, yet it didn’t follow the task constraints closely enough to be trustworthy.

For labeling, the model returned valid JSON for five technology-related tweets, including fields like target audience, tone, complexity, and main themes. However, several classifications looked off—complexity skewed toward “intermediate,” and the philosophical tweet was categorized under “education” rather than “philosophy.” The author’s takeaway is that even when the output format is correct, category accuracy may lag behind what’s needed for real training data.

Summarization and Q&A against a long Meta earnings report for Q1 2024 were also inconsistent. The model produced a LinkedIn-style post that seemed more coherent than the requested key-point summary, but a key numeric detail in the summary (total assets rising to an implausible value) undermined confidence. In direct question answering, it sometimes refused or failed to answer when the question required exact information—returning “not supported by the text” or rejecting the request with a “can’t provide financial advice” style response.
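
One lightweight fallback for these refusal-style answers is to detect marker phrases and retry with an explicitly extractive prompt. The markers and retry wording below are assumptions, not the author's method:

```python
# Sketch: detect refusal-style answers and retry with an extractive framing.
REFUSAL_MARKERS = ("not supported by the text", "can't provide financial advice")

def answer_with_fallback(ask, question: str, report: str) -> str:
    """`ask` is any callable that sends a prompt string to the local model."""
    answer = ask(f"Answer using only the report below.\n\n{report}\n\nQ: {question}")
    if any(marker in answer.lower() for marker in REFUSAL_MARKERS):
        # Second attempt: ask for a direct quote instead of an open-ended answer.
        answer = ask(
            "Quote the exact sentence or figure from the report that answers the "
            f"question, or reply UNKNOWN.\n\n{report}\n\nQ: {question}"
        )
    return answer
```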

Where the model shined was structured extraction. Given a receipt converted to markdown, it extracted store name, purchase date/time, total amount, tax amount, payment method, and item details with correct formatting; one item quantity was wrong, but pricing and descriptions matched expectations. Against the earnings report tables, it correctly built requested tables for “net cash provided by operating activities” and “purchases of property and equipment” for 2023 and 2024, even when values were not adjacent. A second table request (“operating margin,” “effective tax rate,” and “cost of revenue”) showed a sharp drop in accuracy, suggesting that table layout complexity and cross-row mixing can make performance swing.

Overall, the local 3B Q4 model appears best suited to structured extraction tasks with clear source fields, while higher-stakes summarization, labeling, and strict table reconstruction may require larger models or tighter validation pipelines.

Cornell Notes

Running a Q4-quantized Llama 3.2 3B model locally via Ollama on an M3 MacBook Pro produces fast outputs and can reliably extract structured fields from receipts and some financial tables. In tests, the model’s code generation and tweet labeling were format-correct but contained substantive errors (missing Australia; miscategorized philosophy; questionable complexity ratings). Summarization of a Meta earnings report was inconsistent—one summary produced implausible numbers—while a LinkedIn-style post looked more usable. Question answering sometimes failed due to “not supported by the text” or refusal-style responses. Table extraction worked well for one set of cash-flow metrics, but failed for a second table request, indicating sensitivity to table layout and mixed-row values.

Why does the model feel “practical” despite being only 3B parameters?

The local setup uses Ollama to run the Q4-quantized Llama 3.2 3B model, then calls it from Python with the Ollama client library. A “keep alive” setting keeps the model loaded in memory, reducing latency. On the author’s M3 MacBook Pro, response times and token generation were reported as strong compared with much larger models.

What went wrong in the code-generation test, and why does it matter?

The prompt asked for production-ready code to generate a dataset of famous people per continent, then sort by continent and birth year. The generated function looked correct, but the output repeated names, omitted Australia entirely, and didn’t sort properly. That’s a key risk: code can run while still violating constraints, so outputs need validation.
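
A small validation sketch for exactly those constraints (uniqueness, continent coverage, sort order); the record layout and continent list are assumptions:

```python
# Check a generated "famous people" dataset against the stated constraints.
EXPECTED_CONTINENTS = {
    "Africa", "Asia", "Europe", "North America", "South America", "Australia",
}

def validate_dataset(rows: list[dict]) -> list[str]:
    problems = []
    names = [r["name"] for r in rows]
    if len(names) != len(set(names)):
        problems.append("duplicate names")  # repeated people across continents
    missing = EXPECTED_CONTINENTS - {r["continent"] for r in rows}
    if missing:
        problems.append(f"missing continents: {sorted(missing)}")  # e.g. Australia
    if rows != sorted(rows, key=lambda r: (r["continent"], r["birth_year"])):
        problems.append("not sorted by continent, then birth year")
    return problems
```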

How did the model perform on tweet labeling, and what were the failure modes?

For five tweets, the model returned valid JSON with fields like target audience, tone, complexity, and main themes. Still, the classifications looked inaccurate: complexity was consistently “intermediate,” and a philosophical tweet was labeled under “education” rather than “philosophy.” The takeaway is that correct formatting doesn’t guarantee label quality for training data.
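
A hedged sketch of that labeling pattern: request JSON via Ollama's `format` option and reject outputs with missing keys. The field names mirror those above, but the exact prompt and schema used in the video are assumptions:

```python
import json
import ollama

REQUIRED_FIELDS = {"target_audience", "tone", "complexity", "main_themes"}

def label_tweet(tweet: str) -> dict:
    response = ollama.chat(
        model="llama3.2:3b",
        messages=[{
            "role": "user",
            "content": f"Label this tweet. Return JSON with keys {sorted(REQUIRED_FIELDS)}."
                       f"\n\nTweet: {tweet}",
        }],
        format="json",  # constrain the response to valid JSON
    )
    labels = json.loads(response["message"]["content"])
    missing = REQUIRED_FIELDS - labels.keys()
    if missing:
        raise ValueError(f"incomplete labels, missing {missing}")
    return labels
```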

What differed between summarization and LinkedIn-style writing?

When asked for a key-point summary of 3–4 sentences, the output included an implausible jump in total assets, which reduced trust. When asked to write a LinkedIn post from the same long Meta earnings report, the numbers appeared more consistent and the format worked better, suggesting the prompt style strongly affects reliability.

Where did structured extraction succeed most clearly?

Receipt extraction from markdown worked well: store name, purchase date/time, total amount, tax amount, payment method, and item lists were extracted with correct formatting. The author noted one incorrect quantity (“TF” instead of the expected value), but totals and prices matched. For the earnings report, the model correctly produced tables for net cash from operating activities and purchases of property/equipment for 2023 and 2024, even when values were mixed across rows.
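
One way to make such extraction auditable is to validate the model's JSON against an explicit schema, for example with Pydantic. The field names below follow the list above, but the schema and the totals check are assumptions:

```python
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Receipt(BaseModel):
    store_name: str
    purchase_datetime: str
    total_amount: float
    tax_amount: float
    payment_method: str
    items: list[LineItem]

def check_receipt(raw_json: str) -> Receipt:
    receipt = Receipt.model_validate_json(raw_json)  # raises if fields are missing or mistyped
    # Cheap sanity check (assumes item prices exclude tax): flag mismatched totals.
    subtotal = sum(item.price * item.quantity for item in receipt.items)
    if abs(subtotal + receipt.tax_amount - receipt.total_amount) > 0.05:
        print("warning: items plus tax do not sum to the stated total")
    return receipt
```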

Why did table extraction sometimes fail even when the data existed?

A second table request (operating margin, effective tax rate, and cost of revenue) produced a “total failure” despite the metrics being explicitly stated in the document. The author’s observation points to sensitivity to table layout complexity—values may be distributed across non-obvious rows/sections, and the model can mis-associate numbers.
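
A simple grounding check can catch this class of error: every value the model places in a table should appear, up to formatting, somewhere in the source document. The normalization rule below is an assumption:

```python
import re

def ungrounded_values(extracted: dict[str, str], source_text: str) -> list[str]:
    """Return the metrics whose extracted value cannot be found in the source text."""
    normalize = lambda s: re.sub(r"[,$\s]", "", s)
    haystack = normalize(source_text)
    return [
        metric for metric, value in extracted.items()
        if normalize(value) not in haystack
    ]
```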

Review Questions

  1. In the code-generation test, what specific output errors indicated instruction-following failure, and how would you validate such outputs automatically?
  2. Which tasks showed the strongest correlation between “correct format” and “correct content,” and which tasks exposed gaps between the two?
  3. What table-layout characteristics (e.g., mixed rows, non-adjacent values) seem most likely to trigger extraction errors in small quantized models?

Key Points

  1. Run the Q4-quantized Llama 3.2 3B model locally via Ollama for fast responses, especially when using keep-alive to avoid reload latency.
  2. Treat code-generation outputs as untrusted until you verify constraints like sorting, completeness (e.g., missing Australia), and uniqueness of entities.
  3. Valid JSON labeling is not the same as accurate labeling; check category quality with spot audits before using results for training.
  4. Prompt style can swing performance: LinkedIn-style writing from a long report looked more reliable than a strict key-point summary with numeric fidelity requirements.
  5. Structured extraction from receipts and some financial tables can be strong when fields are clearly present and mapped to explicit output schemas.
  6. Table extraction accuracy can collapse when requested metrics are distributed across complex or mixed table regions, even if the numbers exist in the source text.
  7. For question answering on financial documents, expect occasional refusals or “not supported by the text” responses that require fallback strategies.

Highlights

The model produced fast, usable outputs locally on an M3 MacBook Pro using Ollama with Q4 quantization and keep-alive.
Code generation ran but violated constraints—Australia was missing and sorting was wrong—showing that “working code” can still be incorrect.
Receipt extraction was largely accurate with correct totals and item pricing, but at least one quantity was wrong.
Table extraction succeeded for cash-flow metrics across 2023–2024, yet failed for a second table request involving operating margin and effective tax rate.
Summarization and Q&A were inconsistent: one summary had implausible numbers, and some answers were blocked or deemed unsupported.

Topics

  • Local Llama 3.2
  • Ollama Inference
  • Structured Data Extraction
  • Data Labeling
  • Financial Document QA
