
Gemini 2.0 Flash Thinking Test - Coding, Data Extraction, Summarization, Data Labelling, RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini 2.0 Flash Thinking is experimental and trained to generate internal reasoning steps as part of its responses, using a test-time compute budget for “thinking.”

Briefing

Gemini 2.0 Flash Thinking is positioned as a fast “thinking-mode” variant that exposes its internal reasoning steps, and hands-on tests suggest that access to those steps can translate into stronger performance on tasks that need structured outputs and faithful extraction from documents. The model is described as experimental and trained to generate the thinking process as part of its responses, with an explicit test-time compute budget and a chain-of-thought prompting style that asks it to reason before answering. While it’s marketed as a Flash (mini-class) model, the practical results show mixed latency: creative and multi-step generations can still take tens of seconds, but document-grounded Q&A and table extraction often return in a few seconds.

In early demonstrations, Gemini 2.0 Flash Thinking handles an “astrophysicist on Mars” scenario by producing a detailed, physics-grounded refusal—arguing that changing oil on Mars is impossible due to missing infrastructure and environmental constraints. The more striking creativity test asks for 90s hip-hop style lyrics that blend slang, rhythm, and cultural references with future-AI themes. The model’s exposed reasoning includes explicit planning around rhyme, internal rhymes, and beat-like phrasing, and the resulting lyrics are judged by the tester as among the best produced across models tried.

The evaluation then shifts to practical engineering workflows. For coding-style data generation and sorting, the model is tasked with creating a dataset (at least 1,000 examples) of people with fields like name, gender, wealth, and continent, then sorting by continent and, within each continent, from poorest to richest. Compared with other models that struggled with the sorting logic, Gemini 2.0 Flash Thinking produces the most convincing results, returning a pandas-style table with the expected ordering.
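A minimal sketch of that task in Python, assuming pandas and synthetic data; the column names, continent list, and wealth ranges are illustrative rather than taken from the video:

```python
import random

import pandas as pd

# Hypothetical reconstruction of the test: build a synthetic dataset of at
# least 1,000 people, then sort by continent and from poorest to richest.
continents = ["Africa", "Asia", "Australia", "Europe", "North America", "South America"]
rng = random.Random(42)

rows = [
    {
        "name": f"Person {i}",
        "gender": rng.choice(["female", "male"]),
        "wealth_millions": round(rng.uniform(0.1, 500.0), 2),
        "continent": rng.choice(continents),
    }
    for i in range(1000)
]

df = pd.DataFrame(rows)

# Sort by continent alphabetically, then poorest-to-richest within each one.
df_sorted = df.sort_values(["continent", "wealth_millions"], ascending=[True, True])
print(df_sorted.head())
```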

For data labeling—framed as knowledge distillation—the model classifies five tweets into audience, tone/sentiment, complexity level, and main themes. The outputs are formatted as structured JSON-like data (with the tester noting that strict JSON sometimes fails unless a schema is provided), and the labels are largely consistent with the tweets’ content, including negative tone for some philosophical and economics-related posts and an optimistic, general-audience stance for an AI-vs-doctors claim.

The strongest evidence of “faithful extraction” comes from document tasks using Meta earnings materials converted to markdown. The model produces short, direct answers grounded in the text (including quoting a founder-related statement and pulling an expected tax-rate figure from the relevant section). It also extracts table data—net cash, purchases of property and equipment, operating margin, and effective tax rate—while providing references to the source sections. In one table comparison, the extracted numbers match the manually read values, and the tester reports fewer hallucinations than with other models.
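For readers who want to reproduce the setup, here is a minimal sketch assuming the google-generativeai Python SDK; the model id, file name, and prompt wording are assumptions, not details confirmed in the video:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
# Hypothetical model id for the experimental thinking variant.
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

# Earnings report previously converted to markdown (illustrative file name).
document_md = open("meta_earnings.md", encoding="utf-8").read()

prompt = f"""Answer using only the document below, and cite the section you used.

Question: What is the expected tax rate for the full year 2024?

Document:
{document_md}
"""

response = model.generate_content(prompt)
print(response.text)
```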

Finally, receipt processing combines markdown and image input. With a Piggly Wiggly receipt, Gemini 2.0 Flash Thinking returns a structured JSON object containing store name, debit card payment method, purchase date/time, item names, quantities, and prices. When the image is provided, item names exclude unrelated categories that appeared after markdown conversion, suggesting the image input can improve precision. The model also supports multi-image input, and the tester notes interest in future experiments like object detection via bounding boxes and more retrieval-augmented generation tests.
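A sketch of the image-input variant, again assuming the google-generativeai SDK, which accepts PIL images alongside text in the same request; the JSON field names and file name are illustrative:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

receipt = Image.open("receipt.jpg")  # e.g. a photo of the grocery receipt

prompt = (
    "Extract a JSON object from this receipt with keys: store_name, "
    "payment_method, purchase_datetime, and items (name, quantity, price)."
)

# Multimodal call: the SDK takes a mixed list of text and images.
# More images can be appended to the same list for multi-image input.
response = model.generate_content([prompt, receipt])
print(response.text)
```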

Cornell Notes

Gemini 2.0 Flash Thinking is an experimental “thinking-mode” model that generates its internal reasoning steps as part of responses. In hands-on tests, it shows strong performance on structured tasks—especially document-grounded question answering and table extraction—where answers stay close to the provided markdown and include source references. It also performs well on creative generation (90s hip-hop lyrics) and on coding-style data sorting, producing more reliable ordering than several other models tried. Receipt extraction from both markdown and images yields structured JSON with fields like store name, payment method, purchase date/time, and line items, with image input improving item-name accuracy. Latency is inconsistent: creative and multi-step calls can take ~20 seconds, while extraction/Q&A often returns in a few seconds.

What does “Flash Thinking” mean in practice, and how does it affect reasoning quality?

The model is described as trained to output its thinking process as part of responses, using a chain-of-thought prompting style and a test-time compute budget for reasoning. In tests, the exposed reasoning helps with tasks that benefit from planning, such as generating 90s hip-hop lyrics where the model explicitly plans rhyme/rhythm patterns, and extracting receipt fields where it lays out a step-by-step search strategy (store name, date/time formats, totals, payment method, line items, and quantities).

Why did the model perform better on some coding and sorting tasks than other models?

For a coding-style dataset task (creating a pandas DataFrame of at least 1,000 examples with fields including continent and wealth), the tester found that several other models struggled with the sorting portion. Gemini 2.0 Flash Thinking produced the most convincing results, returning a table sorted first by continent (Africa, Asia, Australia, etc.) and then from poorest to richest, with the expected structure (names, gender, wealth in millions, continent).

How did the model handle structured data labeling for tweets?

It classified five tweets into audience, tone/sentiment, complexity level, and main themes/topics. The tester noted that strict JSON sometimes wasn’t returned automatically, but structured outputs were still produced; a schema approach via a data class/Pydantic was possible through the API. The resulting labels were largely consistent with the tweets’ content—for example, intermediate complexity for most tweets, philosophy as a topic for one, and negative tone for multiple posts.
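One way to apply that schema guidance is to validate the response client-side with Pydantic, since strict JSON was not always returned automatically; the label fields, example tweet, and fence-stripping logic below are assumptions for illustration:

```python
import google.generativeai as genai
from pydantic import BaseModel

class TweetLabel(BaseModel):
    audience: str        # e.g. "general", "developers"
    tone: str            # e.g. "negative", "optimistic"
    complexity: str      # e.g. "beginner", "intermediate"
    topics: list[str]    # main themes, e.g. ["philosophy"]

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

tweet = "AI will outperform doctors at diagnosis within a decade."  # illustrative
prompt = (
    "Label this tweet. Respond with only a JSON object with keys "
    "audience, tone, complexity, and topics (a list of strings).\n\n"
    f"Tweet: {tweet}"
)

response = model.generate_content(prompt)

# The thinking variant may wrap its answer in a markdown fence, so strip it
# before validating against the schema.
raw = response.text.strip()
if raw.startswith("```"):
    raw = raw.strip("`").removeprefix("json").strip()

label = TweetLabel.model_validate_json(raw)
print(label)
```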

What evidence suggests Gemini 2.0 Flash Thinking is good at faithful extraction from earnings reports?

Using Meta earnings markdown, the model answered questions with short, direct responses that matched the relevant sections (e.g., an expected tax rate described as “mid teens” for full 2024, and a founder/CEO quote about AI progress). For table extraction, it pulled values like operating margin and effective tax rate and provided references to the source sections; the tester reported that extracted numbers matched manually read table values and saw fewer hallucinations than with other models.

How did image input change receipt extraction compared with markdown alone?

With a Piggly Wiggly receipt, markdown conversion produced item descriptions that included unrelated categories (like meat and produce). When the receipt image was provided, the structured output’s item names excluded those unrelated categories, aligning better with the actual line items. The model also produced a detailed plan for extracting fields (store name, purchase date/time, totals, payment method, and line items with quantities).

What latency pattern emerged across different task types?

Latency varied by task. Creative generation (hip-hop lyrics) took roughly 19 seconds, and tweet labeling took about 22 seconds for five tweets. Coding-style dataset sorting took around 8 seconds. Document Q&A and table extraction from markdown were faster, often around 3–4 seconds, while receipt markdown/image extraction took about 9 seconds for the markdown-based prompt plus additional image processing.
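A minimal sketch for reproducing this latency comparison, assuming the same SDK; the prompts and the use of wall-clock timing are illustrative choices:

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

def timed_call(prompt: str) -> float:
    """Return wall-clock latency in seconds for one generate_content call."""
    start = time.perf_counter()
    model.generate_content(prompt)
    return time.perf_counter() - start

print(f"creative:   {timed_call('Write 90s hip-hop lyrics about future AI.'):.1f}s")
print(f"extraction: {timed_call('What is 2 + 2? Answer with one number.'):.1f}s")
```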

Review Questions

  1. Which task types in the tests showed the most consistent “faithful extraction” behavior, and what signals in the outputs supported that conclusion?
  2. How did the exposed thinking process help with either creative generation or structured extraction in the examples provided?
  3. What practical differences emerged between using markdown-only inputs versus adding an image for receipt parsing?

Key Points

  1. Gemini 2.0 Flash Thinking is experimental and trained to generate internal reasoning steps as part of its responses, using a test-time compute budget for “thinking.”

  2. Despite being labeled “Flash,” real-world latency can still be tens of seconds for creative generation and multi-item labeling, while document-grounded extraction often returns in a few seconds.

  3. Exposed reasoning appears to improve planning for tasks like 90s hip-hop lyric generation, where rhyme and rhythm constraints are explicitly considered.

  4. On a coding-style dataset task requiring continent grouping and poorest-to-richest sorting, Gemini 2.0 Flash Thinking produced the most reliable ordering compared with other models tested.

  5. For tweet labeling, the model can output structured fields (audience, tone, complexity, topics), though strict JSON may require schema guidance.

  6. In Meta earnings report experiments, answers and extracted table values matched the provided markdown closely, with source references and fewer hallucinations than other models.

  7. Receipt extraction works with both markdown and images; image input can correct item-name noise introduced by markdown conversion.

Highlights

Gemini 2.0 Flash Thinking’s exposed “thinking” steps included explicit planning for rhyme, internal rhymes, and beat-like phrasing in 90s hip-hop lyrics.
Short, text-grounded answers from Meta earnings markdown pulled figures like the expected 2024 tax rate (“mid teens”) directly from the relevant sections.
Table extraction from earnings markdown returned values that matched manual reads, with references to where each number came from.
Receipt parsing produced accurate structured JSON, and adding the receipt image removed unrelated categories that appeared after markdown conversion.

Topics

Mentioned

  • RAG