
Gemini 2.0 - Video Analyzer with Code

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini’s Video Analyzer produces time-coded, structured outputs (JSON-like objects) that include both spoken text and visual scene descriptions.

Briefing

Gemini’s “Video Analyzer” turns uploaded videos into structured, time-coded outputs—captions, spoken transcripts, visual scene descriptions, key moments, tables, and even counts—by combining prompting with function-style JSON results. The practical payoff is metadata you can search, index, and feed into retrieval systems, not just a single transcript.

In AI Studio’s starter apps, the workflow starts with uploading a video through the Files API. Once the upload completes, the system tokenizes the content for downstream prompting. From there, users can generate A/V captions that return scene-by-scene objects tied to timecodes. Each scene’s output can include both what appears on screen and any spoken text (with spoken segments placed in quotation marks). The interface lets viewers jump to specific timestamps and see descriptions that track visual changes—such as identifying a 3D animation of a window with plants on a dark blue background—while also capturing the audio content.
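
As a minimal sketch, the same upload-and-wait step might look like this with the new google-genai Python SDK (the file name, keyword arguments, and state handling are illustrative and can differ between SDK versions):

```python
import time

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the video through the Files API; larger files are processed
# asynchronously, so the file starts out in a PROCESSING state.
video_file = client.files.upload(file="demo_video.mp4")

# Poll until tokenization/processing finishes (keyword names and state
# access may vary slightly by SDK version).
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = client.files.get(name=video_file.name)

if video_file.state.name != "ACTIVE":
    raise RuntimeError(f"Upload did not finish cleanly: {video_file.state.name}")
```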

Beyond captions, the same underlying function-based approach can be steered to produce different formats. Users can request paragraph-style narrative responses for each scene, ask for “key moments” summarized with timestamps, or generate a table that organizes detected objects and visual elements (including emojis/objects) by time. Counting is another supported mode: prompts can instruct the model to count people visible per scene, returning numeric values aligned to timecodes. The demo also shows prompt-driven customization—switching from counting people to counting phones, and then to counting trees—highlighting that the counting logic can be repurposed for different visual targets.

A key detail is that these outputs aren’t purely free-form text. The system relies on function calls such as setTimeCodes, setTimeCodesWithObjects, and setTimeCodesWithNumericValues, each designed to return structured data (timecodes plus text, objects, or numbers). The transcript emphasizes that the function call behavior depends on including the function name in the prompt; otherwise, the model may return text without triggering the structured function output.
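
As a sketch of what one of those declarations could look like with the new Python SDK (the schema below is an assumption inferred from the description above, not the starter app's exact definition):

```python
from google.genai import types

# Hypothetical declaration mirroring the setTimeCodes behavior described above:
# the model passes a list of {time, text} entries, one per scene.
set_timecodes = types.FunctionDeclaration(
    name="setTimeCodes",
    description="Record time-coded captions for the video, one entry per scene.",
    parameters={
        "type": "OBJECT",
        "properties": {
            "timecodes": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "time": {"type": "STRING", "description": "Timecode such as 01:23"},
                        "text": {"type": "STRING", "description": "Spoken text (quoted) plus a visual description"},
                    },
                    "required": ["time", "text"],
                },
            },
        },
        "required": ["timecodes"],
    },
)

video_tools = types.Tool(function_declarations=[set_timecodes])
```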

To reproduce the demo in code, the walkthrough uses Colab with a “new unified SDK.” The process mirrors the UI: upload a longer video (noting the UI seemed sensitive to length), poll the upload state until processing finishes, then call Gemini with a system prompt that instructs it to call the relevant function only once with appropriate timecodes and text. The Python implementation defines “tools” (function declarations) and helper functions to print results. Initially, removing the function mention from the user prompt yields mostly transcript-like text; adding the function name back enables proper structured outputs.
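
Continuing the sketch (and reusing the client, uploaded file, and tool from the snippets above), the generation call might look roughly like this; the model name and prompt wording are illustrative rather than the walkthrough's exact text:

```python
from google.genai import types

system_prompt = (
    "When given a video, call the provided function exactly once, "
    "with appropriate timecodes and text for each scene."
)
user_prompt = (
    "Generate A/V captions for this video by calling setTimeCodes, combining "
    "any spoken text (in quotes) with a short visual description per scene."
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # illustrative model name
    contents=[video_file, user_prompt],
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        tools=[video_tools],
        temperature=0,
    ),
)

# The structured result arrives as a function call on the response parts,
# not as plain text.
for part in response.candidates[0].content.parts:
    if part.function_call:
        print(part.function_call.name, part.function_call.args)
```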

Finally, the code extends the approach by adding a custom function, setTimeCodesWithDescriptions, to separate spoken text from visual descriptions. The result is a richer indexable representation: time-coded transcripts plus distinct visual metadata (e.g., describing a cartoon llama and on-screen UI elements). The practical conclusion is that these structured captions and visual descriptions can be converted into text chunks for RAG systems, enabling question answering over both what was said and what appeared on screen at specific moments.
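
As a rough illustration of that last step, the structured scenes could be flattened into plain-text chunks for an index; the field names below follow the setTimeCodesWithDescriptions idea and the values are invented:

```python
# Example records in the shape described above (values are made up).
scenes = [
    {"time": "00:15", "spoken_text": '"Welcome to the demo."',
     "visual_description": "Cartoon llama mascot next to an on-screen code editor."},
    {"time": "01:02", "spoken_text": "",
     "visual_description": "3D animation of a window with plants on a dark blue background."},
]

def scenes_to_chunks(scenes):
    """Turn time-coded scene records into text chunks for a RAG index."""
    chunks = []
    for scene in scenes:
        text = f"[{scene['time']}] "
        if scene["spoken_text"]:
            text += f"Spoken: {scene['spoken_text']} "
        text += f"Visual: {scene['visual_description']}"
        chunks.append({"id": scene["time"], "text": text})
    return chunks

for chunk in scenes_to_chunks(scenes):
    print(chunk["text"])
```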

Cornell Notes

Gemini’s Video Analyzer converts uploaded videos into structured, time-coded outputs rather than only producing a transcript. In AI Studio, it can generate A/V captions with spoken text, key moments summaries, tables of visual elements, and even scene-by-scene counts (people, phones, trees) by using function-style outputs aligned to timestamps. Reproducing the behavior in Python requires defining tool/function declarations (e.g., setTimeCodes, setTimeCodesWithObjects, setTimeCodesWithNumericValues) and—crucially—mentioning the function name in the prompt so the model triggers the structured call. A custom function (setTimeCodesWithDescriptions) further separates spoken text from visual descriptions, creating metadata that can be indexed for RAG systems. This makes video content searchable by both audio and on-screen visuals at specific times.

How does the Video Analyzer turn a raw video into something usable for search or RAG?

It uploads the video via the Files API, then produces structured outputs tied to timecodes. The core result is scene-by-scene JSON-like objects containing timestamps plus content—spoken text (often quoted) and visual descriptions. Because the output is time-aligned and structured, it can be reformatted into text chunks and indexed, letting a system retrieve relevant moments based on both what was said and what appeared on screen.

Why do function calls matter more than plain prompting in this workflow?

The structured modes (captions, objects, numeric counts) depend on function-style outputs such as setTimeCodes, setTimeCodesWithObjects, and setTimeCodesWithNumericValues. If the prompt doesn’t explicitly mention the function name, the model may return mostly free-form text (e.g., transcript-like output) instead of triggering the structured function call. Including the function name in the user prompt is what makes the tool output appear in the expected JSON structure.
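
A minimal illustration of the wording difference (the phrasing here is an assumption, not the demo's literal prompt):

```python
# Without naming the declared function, the model tends to reply in prose.
prompt_plain = "Describe this video scene by scene with timecodes."

# Naming the declared function steers the model into the structured call.
prompt_structured = (
    "For each scene in this video, call setTimeCodes with the timecode "
    "and a caption (spoken text in quotes plus a short visual description)."
)
```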

What’s the difference between the three built-in function patterns shown?

setTimeCodes returns timecodes with text (including spoken transcript plus visual description). setTimeCodesWithObjects returns timecodes with object-like elements (including emojis/objects) suitable for table-style visualization. setTimeCodesWithNumericValues returns timecodes with numeric values for tasks like counting people visible per scene.
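
For contrast with the caption schema sketched earlier, a numeric-count declaration might look like this (again an assumed schema rather than the starter app's exact one):

```python
from google.genai import types

# Hypothetical counting variant: one numeric value per timecode.
set_timecodes_with_numeric_values = types.FunctionDeclaration(
    name="setTimeCodesWithNumericValues",
    description="Record a numeric value (e.g. a count of people) for each scene.",
    parameters={
        "type": "OBJECT",
        "properties": {
            "timecodes": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "time": {"type": "STRING"},
                        "value": {"type": "NUMBER"},
                    },
                    "required": ["time", "value"],
                },
            },
        },
        "required": ["timecodes"],
    },
)
```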

How does the demo demonstrate prompt-driven customization for counting?

The counting prompt is adapted from “count the number of people visible” to “count the number of phones” and then to “count trees.” The same counting mechanism—returning numeric values aligned to timecodes—works as long as the prompt specifies the new target. The output then shows per-scene counts (e.g., scenes with three people vs. one person) and can be used to build charts or structured metadata.
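
The customization lives entirely in the prompt; the numeric declaration stays the same (wording below is illustrative):

```python
count_people = "Call setTimeCodesWithNumericValues with the number of people visible in each scene."
count_phones = "Call setTimeCodesWithNumericValues with the number of phones visible in each scene."
count_trees = "Call setTimeCodesWithNumericValues with the number of trees visible in each scene."
```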

What does the Python reproduction add beyond the UI?

The Colab version defines tool/function declarations in code and uses helper functions to print results. It also adds a custom function, setTimeCodesWithDescriptions, designed to return timecodes with both spoken text and a separate visual description field. That separation makes it easier to index and retrieve video information for downstream applications like RAG.
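
A sketch of that custom declaration and a small result-printing helper, with field names assumed from the description above:

```python
from google.genai import types

# Hypothetical custom declaration: spoken text and visual description as
# separate fields per timecode.
set_timecodes_with_descriptions = types.FunctionDeclaration(
    name="setTimeCodesWithDescriptions",
    description="Record spoken text and a separate visual description per scene.",
    parameters={
        "type": "OBJECT",
        "properties": {
            "timecodes": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "time": {"type": "STRING"},
                        "spoken_text": {"type": "STRING"},
                        "visual_description": {"type": "STRING"},
                    },
                    "required": ["time", "spoken_text", "visual_description"],
                },
            },
        },
        "required": ["timecodes"],
    },
)

def print_function_call(response):
    """Print the structured arguments from the model's function call, if any."""
    for part in response.candidates[0].content.parts:
        if part.function_call:
            for entry in part.function_call.args.get("timecodes", []):
                print(entry["time"], "|", entry.get("spoken_text", ""), "|",
                      entry.get("visual_description", ""))
```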

What practical limitation is mentioned when comparing UI and code?

The walkthrough notes that the AI Studio interface seemed limited with longer videos, prompting a switch to a longer (but still under ~10 minutes) video in the Python approach. In code, the upload state is polled until processing completes, then analysis runs once the file is ready.

Review Questions

  1. What changes in the prompt are necessary to ensure the model triggers a structured function call rather than returning only free-form text?
  2. How would you design a RAG indexing pipeline using setTimeCodesWithDescriptions outputs—what fields would you store and how would you chunk them?
  3. Which function pattern would you choose for (a) counting objects, (b) extracting emojis/objects for a table, and (c) producing time-coded spoken transcripts with visual descriptions?

Key Points

  1. Gemini’s Video Analyzer produces time-coded, structured outputs (JSON-like objects) that include both spoken text and visual scene descriptions.

  2. Function-style outputs (setTimeCodes, setTimeCodesWithObjects, setTimeCodesWithNumericValues) enable captions, object tables, and numeric counting aligned to timestamps.

  3. Structured function calling depends on explicitly mentioning the function name in the prompt; otherwise results may degrade into plain transcript text.

  4. Counting tasks can be repurposed by changing the target in the prompt (people → phones → trees) while keeping the numeric timecode structure.

  5. The Python/Colab reproduction uses a unified SDK: upload via Files API, poll processing state, then call Gemini with system/user prompts plus tool declarations.

  6. A custom function like setTimeCodesWithDescriptions can separate spoken text from visual descriptions, improving downstream indexing for RAG.

  7. Time-coded visual metadata is positioned as a practical way to retrieve “what happened when” in videos, not just “what was said.”

Highlights

The most valuable output isn’t just a transcript—it’s scene-by-scene JSON tied to timecodes, combining spoken text with what appears on screen.
Counting is implemented as structured numeric values per scene, making it easy to chart or filter video moments.
Function calling works only when the prompt explicitly names the function; otherwise the model returns mostly plain text.
Separating spoken text from visual descriptions (via a custom function) creates cleaner metadata for RAG indexing.
The workflow is reproducible in Python: upload, wait for processing, define tools, then run Gemini with prompts that trigger the right structured call.
