Local Qwen 2.5 (14B) Test using Ollama - Summarization, Structured Text Extraction, Data Labelling
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Run Qwen 2.5 14B locally with Ollama: an Ollama server plus a client (e.g., a Jupyter notebook) that sends prompts requesting either JSON or plain-text output.
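A minimal sketch of that client-server loop, assuming Ollama's default REST endpoint on localhost:11434 and the qwen2.5:14b model tag; the helper name, prompts, and tag here are illustrative, not the exact ones used in the video:

```python
# Minimal client sketch: send a prompt to a local Ollama server and request
# either plain text or JSON output. Assumes Ollama is running on its default
# port and that the model has been pulled, e.g. `ollama pull qwen2.5:14b`.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"  # assumed model tag; adjust to whatever `ollama list` shows

def ask(prompt: str, as_json: bool = False) -> str:
    """Send a single prompt to the local model and return its response text."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if as_json:
        payload["format"] = "json"  # ask Ollama to constrain the output to valid JSON
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

# Plain-text request
print(ask("Summarize the key revenue figures in two sentences."))

# JSON request
labels = json.loads(ask("Label this tweet's sentiment and main theme as JSON.", as_json=True))
print(labels)
```

Setting `format` to `"json"` is what lets the same helper serve both the free-text summarization prompts and the structured labeling prompts described below.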
Briefing
Qwen 2.5 14B running locally through Ollama (via an Ollama server) delivers a noticeable jump in quality on text-heavy tasks, especially sentiment/topic labeling and nuanced summarization, while still struggling with precise information extraction from long, structured documents. In side-by-side tests against Llama 3.2 3B, Qwen 2.5 produced more detailed, better-formatted outputs and generally handled “understand and classify” prompts more reliably, but both models faltered when asked to extract specific table values from Meta earnings materials.
On the setup side, the workflow centers on running Ollama locally, then using a Jupyter notebook to send prompts to the local model. The evaluation uses a prompt format that can request either JSON or plain text, letting the tester compare both coding-style outputs and structured labeling. For a simple coding benchmark, generating a synthetic dataset of “wealthy people” by continent and for the world overall, Qwen 2.5 14B produced code that largely matched the requested structure (including a function with a parameterized sample count and a docstring). It also generated a dataset with the expected continents and genders, though some numeric realism issues appeared (values didn’t align perfectly with the intended “million USD” framing) and the sorting logic showed minor confusion.
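For reference, here is a hand-written sketch of the kind of function that coding prompt asks for (parameterized sample count, a docstring, continent/gender columns, net worth in millions of USD). It illustrates the target structure only; it is not the model's actual output, and the column names and distribution parameters are assumptions:

```python
# Sketch of the synthetic "wealthy people" dataset generator the prompt describes.
import numpy as np
import pandas as pd

CONTINENTS = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]

def generate_wealthy_people(n_samples: int = 100, seed: int = 42) -> pd.DataFrame:
    """Generate a synthetic dataset of wealthy people with continent, gender,
    and net worth expressed in millions of USD."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "continent": rng.choice(CONTINENTS, size=n_samples),
        "gender": rng.choice(["female", "male"], size=n_samples),
        # Log-normal draw keeps values in a plausible "hundreds to thousands of
        # million USD" range, the realism point the review calls out.
        "net_worth_musd": np.round(rng.lognormal(mean=6.0, sigma=1.0, size=n_samples), 1),
    })
    # Sort richest-first; the ordering logic was where the model showed minor confusion.
    return df.sort_values("net_worth_musd", ascending=False).reset_index(drop=True)

print(generate_wealthy_people(5))
```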
The clearest win came from tweet labeling. Qwen 2.5 labeled audience type, tone, sentiment, complexity level, and main themes across multiple motivational/tech/AI-health tweets, with per-tweet latency of roughly 5 seconds. The outputs were judged more consistent and, in several cases, more plausible than Llama 3.2’s labels, particularly on topic selection and the overall fit between the label and the tweet’s intent. Qwen 2.5 also handled a summarization task on Meta’s Q1 2024 reporting text far more effectively: it extracted and presented multiple financial figures with better formatting and added details that weren’t explicitly highlighted in the provided text, while Llama 3.2’s summary was comparatively sparse.
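A sketch of how such a labeling request can be framed, reusing the hypothetical ask() helper from the client sketch above; the JSON keys mirror the label categories listed here, while the prompt wording itself is illustrative:

```python
# Structured tweet labeling via the local model, requesting JSON output.
import json

LABEL_PROMPT = """Label the following tweet. Respond with JSON only, using exactly
these keys: audience_type, tone, sentiment, complexity_level, main_themes (a list).

Tweet: {tweet}"""

def label_tweet(tweet: str) -> dict:
    """Return the model's structured labels for one tweet (roughly 5 s per tweet locally)."""
    raw = ask(LABEL_PROMPT.format(tweet=tweet), as_json=True)
    return json.loads(raw)

print(label_tweet("Small daily habits compound. Ship something tiny today."))
```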
When the task shifted from narrative understanding to strict extraction (pulling exact answers from tables in the Meta earnings PDF), results deteriorated for both models. For questions like “founder most proud of” and “expected tax rate,” Qwen 2.5 either returned unsupported content or failed to ground its answers in the document. Table-generation prompts also produced incorrect or malformed tables: Qwen 2.5 sometimes returned values that were directionally right but not reliably accurate, and both models produced clearly wrong table structures for the more complex, computed fields.
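For completeness, a sketch of the grounded-extraction style of prompt behind those questions, again via the hypothetical ask() helper. The document string is a placeholder for text extracted from the earnings PDF, and the prompt wording is an assumption; this is the pattern both models handled least reliably:

```python
# Grounded Q&A over pasted document text: instruct the model to answer only
# from the provided excerpt and to admit when the answer is not present.
EXTRACTION_PROMPT = """Answer the question using only the document below.
If the answer is not in the document, reply with "not stated".

Document:
{document}

Question: {question}
Answer:"""

document_text = "...text extracted from the earnings PDF..."  # placeholder
print(ask(EXTRACTION_PROMPT.format(document=document_text,
                                   question="What is the expected full-year tax rate?")))
```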
By the end, the takeaway is pragmatic: Qwen 2.5 14B is strong for local, text-centric workflows (summarization, structured labeling, and general understanding), but long-document table extraction still appears to require substantially larger models—suggested as 72B-class or higher—for dependable accuracy. The tradeoff is clear: better nuance and classification at local scale, with extraction precision remaining the bottleneck for earnings-style documents.
Cornell Notes
Qwen 2.5 14B running locally via Ollama performs best on tasks that require reading, interpreting, and producing structured text, such as sentiment/topic labeling and narrative summarization. In comparisons with Llama 3.2 3B, Qwen 2.5 produced more nuanced summaries and more plausible labels, with tweet labeling taking roughly 5 seconds per tweet. However, both models struggled when asked to answer grounded questions from long Meta earnings materials and to generate accurate tables from specific PDF table fields. The results suggest that local 14B models are currently better suited to understanding and classification than to precise table extraction from lengthy financial documents.
- How does Qwen 2.5 14B perform on “generate code / structured output” tasks compared with Llama 3.2 3B?
- What was the strongest area for Qwen 2.5 14B in the evaluation?
- How did Qwen 2.5 14B handle summarization of Meta’s Q1 2024 reporting text?
- Where did both models fail most clearly?
- What practical conclusion did the tester draw about local model size?
Review Questions
- In the tweet-labeling task, which label categories were requested, and what kinds of topics did Qwen 2.5 assign more convincingly than Llama 3.2?
- What evidence from the Meta earnings tests indicates that both models struggled with grounding answers in document text?
- Why might Qwen 2.5 excel at summarization and labeling while still failing at precise table extraction?
Key Points
1. Run Qwen 2.5 14B locally by using Ollama with an Ollama server and a client (e.g., a Jupyter notebook) that sends prompts requesting either JSON or text output.
2. Qwen 2.5 14B produced code that generally followed the requested dataset structure, but numeric realism and sorting logic still showed occasional mismatches.
3. Tweet sentiment/topic labeling was Qwen 2.5’s standout performance area, with outputs judged more aligned to tweet intent than Llama 3.2 3B’s.
4. Qwen 2.5 generated more nuanced, better-formatted summaries of Meta Q1 2024 reporting text than Llama 3.2, including multiple financial figures.
5. Both Qwen 2.5 14B and Llama 3.2 struggled with grounded Q&A from long PDFs, sometimes returning answers not supported by the provided text.
6. Both models also produced incorrect or malformed tables when asked to extract specific table fields from the Meta earnings document.
7. For reliable table extraction from long earnings-style documents, the results point toward using substantially larger models (around 72B-class or higher).