Local Llama 3.2 (3B) Test using Ollama - Summarization, Structured Text Extraction, Data Labelling
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A 3B quantized Llama 3.2 model running locally through Ollama delivers fast, usable results for structured data extraction—especially when information is clearly present in receipts or tables—but it struggles with strict instruction-following, reliable summarization, and some question-answering tasks that require precise grounding.
On an M3 MacBook Pro, the workflow starts with downloading and running the Q4-quantized “Llama 3.2” 3B model via Ollama, then calling it from Python 3.12 using the Ollama client library. The setup emphasizes speed: a “keep alive” option keeps the model resident in memory, and the author reports strong token generation rates compared with larger models. Early conversational prompts respond quickly, setting expectations that the small model can be practical for everyday use.
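The setup described above can be sketched with the Ollama Python client. This is a minimal sketch, not the video's exact code: the model tag `llama3.2:3b`, the prompt, and the `keep_alive` value of ten minutes are assumptions, though `keep_alive` is a real parameter of the client's `chat` call.

```python
# Sketch of calling a local Llama 3.2 3B model through the Ollama Python
# client. Assumes `pip install ollama`, a running `ollama serve`, and a
# pulled model (`ollama pull llama3.2:3b`).

def build_messages(prompt: str) -> list[dict]:
    """Wrap a user prompt in the chat-message format the client expects."""
    return [{"role": "user", "content": prompt}]

def ask(prompt: str, model: str = "llama3.2:3b") -> str:
    import ollama  # imported lazily so build_messages stays usable offline
    response = ollama.chat(
        model=model,
        messages=build_messages(prompt),
        keep_alive="10m",  # keep the model resident to avoid reload latency
    )
    return response["message"]["content"]

# Example (requires a running Ollama server):
# print(ask("In one sentence, what is a quantized language model?"))
```

The lazy import keeps the message-building helper importable even on machines without the `ollama` package installed.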
The first major test—prompted code generation—shows mixed reliability. The model produced code intended to generate a dataset of “top five” famous people per continent, sorted by continent and then by birth year. The generated function looked fine at a glance, but the output contained clear failures: repeated names, missing Australia, and no consistent sorting. The code ran, yet it didn’t follow the task constraints closely enough to be trustworthy.
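Failures like these can be caught mechanically rather than by eyeballing the output. A minimal sketch, assuming a hypothetical record layout (`name`, `continent`, `birth_year`) rather than the exact structure generated in the video:

```python
# Validate a "famous people per continent" dataset against the task
# constraints described above: no repeated names, every continent present,
# and rows sorted by continent and then by birth year.

CONTINENTS = {"Africa", "Asia", "Australia", "Europe",
              "North America", "South America"}

def validate(records: list[dict]) -> list[str]:
    """Return a list of constraint violations (empty means the data passes)."""
    errors = []
    names = [r["name"] for r in records]
    if len(names) != len(set(names)):
        errors.append("duplicate names")
    missing = CONTINENTS - {r["continent"] for r in records}
    if missing:
        errors.append(f"missing continents: {sorted(missing)}")
    keys = [(r["continent"], r["birth_year"]) for r in records]
    if keys != sorted(keys):
        errors.append("not sorted by continent, then birth year")
    return errors
```

Run against the video's output, a checker like this would have flagged the repeated names and the missing Australia immediately.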
For labeling, the model returned valid JSON for five technology-related tweets, including fields like target audience, tone, complexity, and main themes. However, several classifications looked off—complexity skewed toward “intermediate,” and the philosophical tweet was categorized under “education” rather than “philosophy.” The author’s takeaway is that even when the output format is correct, category accuracy may lag behind what’s needed for real training data.
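A format check plus spot audit can separate "valid JSON" from "usable labels". The field names and allowed values below are illustrative, inferred from the description above rather than taken from the video's actual schema:

```python
# Spot-check model-generated tweet labels: valid JSON alone is not enough,
# so verify required fields and allowed values before trusting the labels.
import json

REQUIRED_FIELDS = {"target_audience", "tone", "complexity", "main_themes"}
ALLOWED_COMPLEXITY = {"beginner", "intermediate", "advanced"}

def audit_label(raw: str) -> list[str]:
    """Return problems found in one JSON label string."""
    try:
        label = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    problems = []
    missing = REQUIRED_FIELDS - label.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if label.get("complexity") not in ALLOWED_COMPLEXITY:
        problems.append(f"unexpected complexity: {label.get('complexity')!r}")
    return problems
```

Automated checks like this catch structural drift; category accuracy (e.g., "education" vs. "philosophy") still needs a human spot audit on a sample.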
Summarization and Q&A against a long Meta earnings report for Q1 2024 were also inconsistent. The model produced a LinkedIn-style post that seemed more coherent than the requested key-point summary, but a key numeric detail in the summary (total assets rising to an implausible value) undermined confidence. In direct question answering, it sometimes refused or failed to answer when the question required exact information—returning “not supported by the text” or rejecting the request with a “can’t provide financial advice” style response.
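A thin fallback layer can catch such responses before they reach downstream code. The trigger phrases below are examples drawn from the behaviour described above, not an exhaustive list:

```python
# Detect refusal-style or ungrounded answers so a pipeline can fall back:
# retry with a rephrased prompt, route to a larger model, or flag for review.
REFUSAL_MARKERS = (
    "not supported by the text",
    "can't provide financial advice",
    "cannot provide financial advice",
)

def needs_fallback(answer: str) -> bool:
    """True when the answer looks like a refusal or an ungrounded response."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)
```

Substring matching is crude but cheap; a stricter pipeline might also verify that quoted figures actually appear in the source document.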
Where the model shone was structured extraction. Given a receipt converted to markdown, it extracted store name, purchase date/time, total amount, tax amount, payment method, and item details with correct formatting; one item quantity was wrong, but pricing and descriptions matched expectations. Against the earnings report tables, it correctly built requested tables for “net cash provided by operating activities” and “purchases of property and equipment” for 2023 and 2024, even when values were not adjacent. A second table request (“operating margin,” “effective tax rate,” and “cost of revenue”) showed a sharp drop in accuracy, suggesting that table layout complexity and cross-row mixing can make performance swing.
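The receipt workflow can be sketched with Ollama's JSON output mode plus a consistency check. The prompt wording and model tag are assumptions; the field names mirror the ones listed above, and the check assumes total = items + tax, which may not hold for every receipt:

```python
# Sketch of receipt-field extraction via Ollama's JSON output mode
# (format="json"), followed by a quick arithmetic consistency check.
import json

RECEIPT_FIELDS = {"store_name", "purchase_datetime", "total_amount",
                  "tax_amount", "payment_method", "items"}

def check_receipt(data: dict) -> list[str]:
    """Flag missing fields and item sums that contradict the stated total."""
    problems = [f"missing field: {f}"
                for f in sorted(RECEIPT_FIELDS - data.keys())]
    items = data.get("items", [])
    if items and "total_amount" in data:
        # Assumes total = sum(quantity * unit_price) + tax.
        computed = sum(i["quantity"] * i["unit_price"] for i in items)
        computed += data.get("tax_amount", 0)
        if abs(computed - data["total_amount"]) > 0.01:
            problems.append(
                f"items + tax = {computed:.2f}, but total is "
                f"{data['total_amount']:.2f}")
    return problems

def extract_receipt(markdown: str, model: str = "llama3.2:3b") -> dict:
    import ollama  # lazy import: requires a running Ollama server
    response = ollama.chat(
        model=model,
        messages=[{"role": "user",
                   "content": "Extract these fields from the receipt as "
                              f"JSON: {sorted(RECEIPT_FIELDS)}.\n\n{markdown}"}],
        format="json",  # constrain the model to emit valid JSON
    )
    return json.loads(response["message"]["content"])
```

A check like this would have caught the wrong item quantity mentioned above, since the quantity-times-price sum would no longer match the printed total.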
Overall, the local 3B Q4 model appears best suited to structured extraction tasks with clear source fields, while higher-stakes summarization, labeling, and strict table reconstruction may require larger models or tighter validation pipelines.
Cornell Notes
Running a Q4-quantized Llama 3.2 3B model locally via Ollama on an M3 MacBook Pro produces fast outputs and can reliably extract structured fields from receipts and some financial tables. In tests, the model’s code generation and tweet labeling were format-correct but contained substantive errors (missing Australia; miscategorized philosophy; questionable complexity ratings). Summarization of a Meta earnings report was inconsistent—one summary produced implausible numbers—while a LinkedIn-style post looked more usable. Question answering sometimes failed due to “not supported by the text” or refusal-style responses. Table extraction worked well for one set of cash-flow metrics, but failed for a second table request, indicating sensitivity to table layout and mixed-row values.
- Why does the model feel “practical” despite being only 3B parameters?
- What went wrong in the code-generation test, and why does it matter?
- How did the model perform on tweet labeling, and what were the failure modes?
- What differed between summarization and LinkedIn-style writing?
- Where did structured extraction succeed most clearly?
- Why did table extraction sometimes fail even when the data existed?
Review Questions
- In the code-generation test, what specific output errors indicated instruction-following failure, and how would you validate such outputs automatically?
- Which tasks showed the strongest correlation between “correct format” and “correct content,” and which tasks exposed gaps between the two?
- What table-layout characteristics (e.g., mixed rows, non-adjacent values) seem most likely to trigger extraction errors in small quantized models?
Key Points
1. Run the Q4-quantized Llama 3.2 3B model locally via Ollama for fast responses, especially when using keep-alive to avoid reload latency.
2. Treat code-generation outputs as untrusted until you verify constraints like sorting, completeness (e.g., missing Australia), and uniqueness of entities.
3. Valid JSON labeling is not the same as accurate labeling; check category quality with spot audits before using results for training.
4. Prompt style can swing performance: LinkedIn-style writing from a long report looked more reliable than a strict key-point summary with numeric fidelity requirements.
5. Structured extraction from receipts and some financial tables can be strong when fields are clearly present and mapped to explicit output schemas.
6. Table extraction accuracy can collapse when requested metrics are distributed across complex or mixed table regions, even if the numbers exist in the source text.
7. For question answering on financial documents, expect occasional refusals or “not supported by the text” responses that require fallback strategies.