Local Qwen 2.5 (14B) Test using Ollama - Summarization, Structured Text Extraction, Data Labelling
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Run Qwen 2.5 14B locally with Ollama: an Ollama server plus a client (e.g., a Jupyter notebook) that sends prompts requesting either JSON or plain-text output.
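A minimal sketch of that client-server loop, assuming Ollama's default REST endpoint on localhost:11434 and the qwen2.5:14b model tag; the helper name, prompts, and tag here are illustrative, not the exact ones used in the video:

```python
# Minimal client sketch: send a prompt to a local Ollama server and request
# either plain text or JSON output. Assumes Ollama is running on its default
# port and that the model has been pulled, e.g. `ollama pull qwen2.5:14b`.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"  # assumed model tag; adjust to whatever `ollama list` shows

def ask(prompt: str, as_json: bool = False) -> str:
    """Send a single prompt to the local model and return its response text."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if as_json:
        payload["format"] = "json"  # ask Ollama to constrain the output to valid JSON
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

# Plain-text request
print(ask("Summarize the key revenue figures in two sentences."))

# JSON request
labels = json.loads(ask("Label this tweet's sentiment and main theme as JSON.", as_json=True))
print(labels)
```

Setting `format` to `"json"` is what lets the same helper serve both the free-text summarization prompts and the structured labeling prompts described below.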
Briefing
Qwen 2.5 14B running locally through Ollama (via an Ollama server) delivers a noticeable jump in quality on text-heavy tasks, especially sentiment/topic labeling and nuanced summarization, while still struggling with precise information extraction from long, structured documents. In side-by-side tests against Llama 3.2 3B, Qwen 2.5 produced more detailed, better-formatted outputs and generally handled “understand and classify” prompts more reliably, but both models faltered when asked to extract specific table values from Meta earnings materials.
On the setup side, the workflow centers on running Ollama locally, then using a Jupyter notebook to send prompts to the local model. The evaluation uses a prompt format that can request either JSON or plain text, letting the tester compare both coding-style outputs and structured labeling. For a simple coding benchmark, generating a synthetic dataset of “wealthy people” by continent and for the world overall, Qwen 2.5 14B produced code that largely matched the requested structure (including a function with a parameterized sample count and a docstring). It also generated a dataset with the expected continents and genders, though some numeric realism issues appeared (values didn’t align perfectly with the intended “million USD” framing) and the sorting logic showed minor confusion.
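For reference, here is a hand-written sketch of the kind of function that coding prompt asks for (parameterized sample count, a docstring, continent/gender columns, net worth in millions of USD). It illustrates the target structure only; it is not the model's actual output, and the column names and distribution parameters are assumptions:

```python
# Sketch of the synthetic "wealthy people" dataset generator the prompt describes.
import numpy as np
import pandas as pd

CONTINENTS = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]

def generate_wealthy_people(n_samples: int = 100, seed: int = 42) -> pd.DataFrame:
    """Generate a synthetic dataset of wealthy people with continent, gender,
    and net worth expressed in millions of USD."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "continent": rng.choice(CONTINENTS, size=n_samples),
        "gender": rng.choice(["female", "male"], size=n_samples),
        # Log-normal draw keeps values in a plausible "hundreds to thousands of
        # million USD" range, the realism point the review calls out.
        "net_worth_musd": np.round(rng.lognormal(mean=6.0, sigma=1.0, size=n_samples), 1),
    })
    # Sort richest-first; the ordering logic was where the model showed minor confusion.
    return df.sort_values("net_worth_musd", ascending=False).reset_index(drop=True)

print(generate_wealthy_people(5))
```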
The clearest win came from tweet labeling. Qwen 2.5 labeled audience type, tone, sentiment, complexity level, and main themes across multiple motivational/tech/AI-health tweets, with per-tweet latency of roughly 5 seconds. The outputs were judged more consistent and, in several cases, more plausible than Llama 3.2’s labels, particularly on topic selection and the overall fit between the label and the tweet’s intent. Qwen 2.5 also handled a summarization task on Meta’s Q1 2024 reporting text far more effectively: it extracted and presented multiple financial figures with better formatting and added details that weren’t explicitly highlighted in the provided text, while Llama 3.2’s summary was comparatively sparse.
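A sketch of how such a labeling request can be framed, reusing the hypothetical ask() helper from the client sketch above; the JSON keys mirror the label categories listed here, while the prompt wording itself is illustrative:

```python
# Structured tweet labeling via the local model, requesting JSON output.
import json

LABEL_PROMPT = """Label the following tweet. Respond with JSON only, using exactly
these keys: audience_type, tone, sentiment, complexity_level, main_themes (a list).

Tweet: {tweet}"""

def label_tweet(tweet: str) -> dict:
    """Return the model's structured labels for one tweet (roughly 5 s per tweet locally)."""
    raw = ask(LABEL_PROMPT.format(tweet=tweet), as_json=True)
    return json.loads(raw)

print(label_tweet("Small daily habits compound. Ship something tiny today."))
```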
When the task shifted from narrative understanding to strict extraction (pulling exact answers from tables in the Meta earnings PDF), results deteriorated for both models. For questions like “founder most proud of” and “expected tax rate,” Qwen 2.5 either returned unsupported content or failed to ground its answers in the document. Table-generation prompts also produced incorrect or malformed tables: Qwen 2.5 sometimes returned values that were directionally right but not reliably accurate, and both models produced clearly wrong table structures for the more complex, computed fields.
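For completeness, a sketch of the grounded-extraction style of prompt behind those questions, again via the hypothetical ask() helper. The document string is a placeholder for text extracted from the earnings PDF, and the prompt wording is an assumption; this is the pattern both models handled least reliably:

```python
# Grounded Q&A over pasted document text: instruct the model to answer only
# from the provided excerpt and to admit when the answer is not present.
EXTRACTION_PROMPT = """Answer the question using only the document below.
If the answer is not in the document, reply with "not stated".

Document:
{document}

Question: {question}
Answer:"""

document_text = "...text extracted from the earnings PDF..."  # placeholder
print(ask(EXTRACTION_PROMPT.format(document=document_text,
                                   question="What is the expected full-year tax rate?")))
```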
By the end, the takeaway is pragmatic: Qwen 2.5 14B is strong for local, text-centric workflows (summarization, structured labeling, and general understanding), but long-document table extraction still appears to require substantially larger models—suggested as 72B-class or higher—for dependable accuracy. The tradeoff is clear: better nuance and classification at local scale, with extraction precision remaining the bottleneck for earnings-style documents.
Cornell Notes
Qwen 2.5 14B running locally via Ollama performs best on tasks that require reading, interpreting, and producing structured text, such as sentiment/topic labeling and narrative summarization. In comparisons with Llama 3.2 3B, Qwen 2.5 produced more nuanced summaries and more plausible labels, with tweet labeling taking roughly 5 seconds per tweet. However, both models struggled when asked to answer grounded questions from long Meta earnings materials and to generate accurate tables from specific PDF table fields. The results suggest that local 14B models are currently better suited to understanding and classification than to precise table extraction from lengthy financial documents.
- How does Qwen 2.5 14B perform on “generate code / structured output” tasks compared with Llama 3.2 3B?
- What was the strongest area for Qwen 2.5 14B in the evaluation?
- How did Qwen 2.5 14B handle summarization of Meta’s Q1 2024 reporting text?
- Where did both models fail most clearly?
- What practical conclusion did the tester draw about local model size?
Review Questions
- In the tweet-labeling task, which label categories were requested, and what kinds of topics did Qwen 2.5 assign more convincingly than Llama 3.2?
- What evidence from the Meta earnings tests indicates that both models struggled with grounding answers in document text?
- Why might Qwen 2.5 excel at summarization and labeling while still failing at precise table extraction?
Key Points
1. Run Qwen 2.5 14B locally by using Ollama with an Ollama server and a client (e.g., a Jupyter notebook) that sends prompts requesting either JSON or text output.
2. Qwen 2.5 14B produced code that generally followed the requested dataset structure, but numeric realism and sorting logic still showed occasional mismatches.
3. Tweet sentiment/topic labeling was Qwen 2.5’s standout performance area, with outputs judged more aligned to tweet intent than Llama 3.2 3B’s.
4. Qwen 2.5 generated more nuanced, better-formatted summaries of Meta Q1 2024 reporting text than Llama 3.2, including multiple financial figures.
5. Both Qwen 2.5 14B and Llama 3.2 struggled with grounded Q&A from long PDFs, sometimes returning answers not supported by the provided text.
6. Both models also produced incorrect or malformed tables when asked to extract specific table fields from the Meta earnings document.
7. For reliable table extraction from long earnings-style documents, the results point toward using substantially larger models (around 72B-class or higher).