Gemini 2.0 Flash Thinking Test - Coding, Data Extraction, Summarization, Data Labelling, RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini 2.0 Flash Thinking is experimental and trained to generate internal reasoning steps as part of its responses, using a compute-time budget for “thinking.”
Briefing
Gemini 2.0 Flash Thinking is positioned as a fast “thinking-mode” variant that exposes its internal reasoning steps, and hands-on tests suggest that access to those steps can translate into stronger performance on tasks that need structured outputs and faithful extraction from documents. The model is described as experimental and trained to generate the thinking process as part of its responses, with an explicit test-time compute budget and a prompting style that asks it to reason before answering. While it is marketed as a Flash (mini-class) model, practical latency is mixed: creative and multi-step generations can still take tens of seconds, but document-grounded Q&A and table extraction often return in a few seconds.
In early demonstrations, Gemini 2.0 Flash Thinking handles an “astrophysicist on Mars” scenario by producing a detailed, physics-grounded refusal—arguing that changing oil on Mars is impossible due to missing infrastructure and environmental constraints. The more striking creativity test asks for 90s hip-hop style lyrics that blend slang, rhythm, and cultural references with future-AI themes. The model’s exposed reasoning includes explicit planning around rhyme, internal rhymes, and beat-like phrasing, and the resulting lyrics are judged by the tester as among the best produced across models tried.
The evaluation then shifts to practical engineering workflows. For coding-style data generation and sorting, the model is tasked with creating a dataset of at least 1,000 people with fields like name, gender, wealth, and continent, then grouping by continent and sorting poorest to richest within each group. Compared with other models that struggled with the sorting logic, Gemini 2.0 Flash Thinking produces the most convincing results, returning a pandas-style table with the expected ordering.
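The grouping-and-sorting logic the test asks for can be sketched in a few lines of pandas. This is a minimal stand-in, not the model's actual output: the sample rows and column names ("name", "gender", "wealth", "continent") are assumptions.

```python
import pandas as pd

# Hypothetical sample rows standing in for the 1,000-example dataset;
# column names and values here are illustrative assumptions.
df = pd.DataFrame([
    {"name": "Amara", "gender": "F", "wealth": 12_000, "continent": "Africa"},
    {"name": "Kofi",  "gender": "M", "wealth": 3_500,  "continent": "Africa"},
    {"name": "Li",    "gender": "F", "wealth": 90_000, "continent": "Asia"},
    {"name": "Ravi",  "gender": "M", "wealth": 45_000, "continent": "Asia"},
])

# Group by continent, then order poorest-to-richest within each group:
# a single sort on two keys, ascending on both.
ordered = df.sort_values(["continent", "wealth"], ascending=[True, True])
print(ordered.to_string(index=False))
```

The two-key sort is the step other models reportedly got wrong, e.g. sorting by wealth globally instead of within each continent.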
For data labeling—framed as knowledge distillation—the model classifies five tweets into audience, tone/sentiment, complexity level, and main themes. The outputs are formatted as structured JSON-like data (with the tester noting that strict JSON sometimes fails unless a schema is provided), and the labels are largely consistent with the tweets’ content, including negative tone for some philosophical and economics-related posts and an optimistic, general-audience stance for an AI-vs-doctors claim.
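Since strict JSON reportedly failed without schema guidance, one pragmatic mitigation is to validate each label record against the expected fields before accepting it. The record below is a hypothetical example for one tweet; only the four category names come from the test.

```python
import json

# Hypothetical label record for a single tweet; the four fields mirror the
# categories used in the test (audience, tone/sentiment, complexity, themes).
label = {
    "audience": "general",
    "tone": "optimistic",
    "complexity": "low",
    "main_themes": ["AI", "healthcare"],
}

# Minimal schema check of the kind that helps when a model's free-form
# JSON output is unreliable: reject records missing any required key.
required = {"audience", "tone", "complexity", "main_themes"}
assert required <= label.keys(), "label record is missing required fields"
print(json.dumps(label, indent=2))
```

In practice this check would run over each of the five labeled tweets, retrying or re-prompting when a record fails validation.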
The strongest evidence of “faithful extraction” comes from document tasks using Meta earnings materials converted to markdown. The model produces short, direct answers grounded in the text (including quoting a founder-related statement and pulling an expected tax-rate figure from the relevant section). It also extracts table data—net cash, purchases of property and equipment, operating margin, and effective tax rate—while providing references to the source sections. In one table comparison, the extracted numbers match the manually read values, and the tester reports fewer hallucinations than with other models.
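The table-extraction step can be illustrated with a deterministic lookup over a markdown table of the kind the earnings files convert into. The table and figures below are placeholders, not Meta's actual numbers; only the metric names come from the test.

```python
# Toy stand-in for a markdown table from the converted earnings materials;
# the values here are made-up placeholders, not real figures.
md_table = """\
| Metric | Value |
| --- | --- |
| Operating margin | 40% |
| Effective tax rate | 17.9% |
"""

def lookup(table_md: str, metric: str) -> str:
    """Return the value cell for a metric in a simple two-column markdown table."""
    # Skip the header row and the |---|---| separator, then scan data rows.
    for line in table_md.strip().splitlines()[2:]:
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0].lower() == metric.lower():
            return cells[1]
    raise KeyError(metric)

print(lookup(md_table, "Effective tax rate"))
```

The point of the comparison in the test is that the model's extracted values matched this kind of mechanical read of the source table, which is what "faithful extraction" means here.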
Finally, receipt processing combines markdown and image input. With a Piggly Wiggly receipt, Gemini 2.0 Flash Thinking returns a structured JSON object containing store name, debit card payment method, purchase date/time, item names, quantities, and prices. When the image is provided, item names exclude unrelated categories that appeared after markdown conversion, suggesting the image input can improve precision. The model also supports multi-image input, and the tester notes interest in future experiments like object detection via bounding boxes and more retrieval-augmented generation tests.
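The structured receipt output described above can be sketched as a JSON object. The key names and line items below are assumptions for illustration; only the field list (store name, payment method, date/time, items with quantities and prices) comes from the test.

```python
import json

# Sketch of the structured receipt output described in the test; key names
# and item values are illustrative assumptions, not the actual extraction.
receipt = {
    "store_name": "Piggly Wiggly",
    "payment_method": "debit card",
    "purchase_datetime": "2024-01-01T12:34:00",  # placeholder value
    "items": [
        {"name": "milk",  "quantity": 1, "price": 3.49},
        {"name": "bread", "quantity": 2, "price": 2.25},
    ],
}

# A quantity-times-price total is a cheap sanity check on extracted line items.
total = sum(i["quantity"] * i["price"] for i in receipt["items"])
print(json.dumps(receipt, indent=2))
print(f"computed total: {total:.2f}")
```

Comparing a computed total like this against the printed receipt total is one way to catch the item-name noise that markdown conversion introduced in the test.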
Cornell Notes
Gemini 2.0 Flash Thinking is an experimental “thinking-mode” model that generates its internal reasoning steps as part of responses. In hands-on tests, it shows strong performance on structured tasks—especially document-grounded question answering and table extraction—where answers stay close to the provided markdown and include source references. It also performs well on creative generation (90s hip-hop lyrics) and on coding-style data sorting, producing more reliable ordering than several other models tried. Receipt extraction from both markdown and images yields structured JSON with fields like store name, payment method, purchase date/time, and line items, with image input improving item-name accuracy. Latency is inconsistent: creative and multi-step calls can take ~20 seconds, while extraction/Q&A often returns in a few seconds.
What does “Flash Thinking” mean in practice, and how does it affect reasoning quality?
Why did the model perform better on some coding and sorting tasks than other models?
How did the model handle structured data labeling for tweets?
What evidence suggests Gemini 2.0 Flash Thinking is good at faithful extraction from earnings reports?
How did image input change receipt extraction compared with markdown alone?
What latency pattern emerged across different task types?
Review Questions
- Which task types in the tests showed the most consistent “faithful extraction” behavior, and what signals in the outputs supported that conclusion?
- How did the exposed thinking process help with either creative generation or structured extraction in the examples provided?
- What practical differences emerged between using markdown-only inputs versus adding an image for receipt parsing?
Key Points
1. Gemini 2.0 Flash Thinking is experimental and trained to generate internal reasoning steps as part of its responses, using a compute-time budget for “thinking.”
2. Despite being labeled “Flash,” real-world latency can still be tens of seconds for creative generation and multi-item labeling, while document-grounded extraction often returns in a few seconds.
3. Exposed reasoning appears to improve planning for tasks like 90s hip-hop lyric generation, where rhyme and rhythm constraints are explicitly considered.
4. On a coding-style dataset task requiring continent grouping and poorest-to-richest sorting, Gemini 2.0 Flash Thinking produced the most reliable ordering compared with other models tested.
5. For tweet labeling, the model can output structured fields (audience, tone, complexity, topics), though strict JSON may require schema guidance.
6. In Meta earnings report experiments, answers and extracted table values matched the provided markdown closely, with source references and fewer hallucinations than other models.
7. Receipt extraction works with both markdown and images; image input can correct item-name noise introduced by markdown conversion.