Gemini 2.0 - Video Analyzer with Code
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Gemini’s “Video Analyzer” turns uploaded videos into structured, time-coded outputs—captions, spoken transcripts, visual scene descriptions, key moments, tables, and even counts—by combining prompting with function-style JSON results. The practical payoff is metadata you can search, index, and feed into retrieval systems, not just a single transcript.
In AI Studio’s starter apps, the workflow starts with uploading a video through the Files API. Once the upload completes, the system tokenizes the content for downstream prompting. From there, users can generate A/V captions that return scene-by-scene objects tied to timecodes. Each scene’s output can include both what appears on screen and any spoken text (with spoken segments placed in quotation marks). The interface lets viewers jump to specific timestamps and see descriptions that track visual changes—such as identifying a 3D animation of a window with plants on a dark blue background—while also capturing the audio content.
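A minimal sketch of this upload-and-poll step with the google-genai ("unified") SDK might look like the following. The file name, model id, and environment-variable name are placeholders, and the upload keyword argument may differ slightly between SDK versions.

```python
# Sketch only: upload a video through the Files API and wait for processing,
# assuming the google-genai SDK and an API key in GEMINI_API_KEY.
import os
import time

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Upload the video; "demo_video.mp4" is a placeholder path.
video_file = client.files.upload(file="demo_video.mp4")

# Poll the file state until the service has finished processing the upload.
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = client.files.get(name=video_file.name)

if video_file.state.name != "ACTIVE":
    raise RuntimeError(f"Upload failed with state {video_file.state.name}")

# The processed file handle can now be passed alongside prompts in later steps.
print(video_file.name, video_file.state.name)
```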
Beyond captions, the same underlying function-based approach can be steered to produce different formats. Users can request paragraph-style narrative responses for each scene, ask for "key moments" summarized with timestamps, or generate a table that organizes detected visual elements (including emojis and objects) by time. Counting is another supported mode: prompts can instruct the model to count the people visible in each scene, returning numeric values aligned to timecodes. The demo also shows prompt-driven customization, switching from counting people to counting phones and then to counting trees, which highlights that the counting logic can be repurposed for different visual targets.
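The retargeting is purely a prompt change, roughly as sketched below. The prompt wording is a paraphrase of what the demo does, not the exact text it uses.

```python
# Illustrative only: the counting mode is repurposed by changing the object
# named in the prompt while keeping the same structured function output.
COUNT_PROMPT = (
    "For each scene in this video, count how many {target} are visible and "
    "call setTimeCodesWithNumericValues exactly once with the timecodes and counts."
)

for target in ("people", "phones", "trees"):
    prompt = COUNT_PROMPT.format(target=target)
    print(prompt)
    # Each prompt would be sent together with the uploaded video, as in the
    # earlier upload sketch:
    # client.models.generate_content(model="gemini-2.0-flash",
    #                                contents=[video_file, prompt], config=...)
```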
A key detail is that these outputs aren’t purely free-form text. The system relies on function calls such as setTimeCodes, setTimeCodesWithObjects, and setTimeCodesWithNumericValues, each designed to return structured data (timecodes plus text, objects, or numbers). The transcript emphasizes that the function call behavior depends on including the function name in the prompt; otherwise, the model may return text without triggering the structured function output.
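Declared as tools, the three function patterns might look roughly like the sketch below. The function names come from the transcript, but the parameter schemas (field names such as time, text, objects, and value) are assumptions inferred from the described outputs, not the demo's actual definitions.

```python
# Sketch of tool/function declarations for the three patterns, using the
# google-genai SDK types. Schemas are assumed, not copied from the demo.
from google.genai import types


def timecode_schema(extra_props):
    """Build an OBJECT schema with a 'timecodes' array whose items carry a
    'time' field plus the given extra per-scene properties."""
    return {
        "type": "OBJECT",
        "properties": {
            "timecodes": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {"time": {"type": "STRING"}, **extra_props},
                },
            }
        },
        "required": ["timecodes"],
    }


video_tools = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="setTimeCodes",
        description="Time-coded captions / spoken text for each scene.",
        parameters=timecode_schema({"text": {"type": "STRING"}}),
    ),
    types.FunctionDeclaration(
        name="setTimeCodesWithObjects",
        description="Time-coded text plus the objects visible in each scene.",
        parameters=timecode_schema({
            "text": {"type": "STRING"},
            "objects": {"type": "ARRAY", "items": {"type": "STRING"}},
        }),
    ),
    types.FunctionDeclaration(
        name="setTimeCodesWithNumericValues",
        description="Time-coded numeric values, e.g. counts of people per scene.",
        parameters=timecode_schema({"value": {"type": "NUMBER"}}),
    ),
])
```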
To reproduce the demo in code, the walkthrough uses Colab with a “new unified SDK.” The process mirrors the UI: upload a longer video (noting the UI seemed sensitive to length), poll the upload state until processing finishes, then call Gemini with a system prompt that instructs it to call the relevant function only once with appropriate timecodes and text. The Python implementation defines “tools” (function declarations) and helper functions to print results. Initially, removing the function mention from the user prompt yields mostly transcript-like text; adding the function name back enables proper structured outputs.
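Putting those pieces together, the generate-and-parse step might look like the following sketch. It assumes the client, video_file, and video_tools objects from the earlier snippets, and the system and user prompt wording is a paraphrase of what the walkthrough describes rather than a verbatim copy.

```python
# Sketch of calling Gemini with tool declarations and printing the structured
# result; assumes client, video_file, and video_tools from the earlier sketches.
from google.genai import types

SYSTEM_PROMPT = (
    "When asked, call the named function exactly once, passing appropriate "
    "timecodes and text for the video."
)

# Mentioning the function name in the user prompt is what triggers the
# structured function call instead of plain transcript-style text.
user_prompt = "Generate captions for each scene and call setTimeCodes with the results."

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[video_file, user_prompt],
    config=types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        tools=[video_tools],
        temperature=0,
    ),
)


def print_response(response):
    """Helper that prints function-call arguments if present, else plain text."""
    for part in response.candidates[0].content.parts:
        if part.function_call:
            print(part.function_call.name)
            for entry in part.function_call.args.get("timecodes", []):
                print(entry)
        elif part.text:
            # Without the function name in the prompt, the model may fall back
            # to returning transcript-like text here instead.
            print(part.text)


print_response(response)
```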
Finally, the code extends the approach by adding a custom function, setTimeCodesWithDescriptions, to separate spoken text from visual descriptions. The result is a richer indexable representation: time-coded transcripts plus distinct visual metadata (e.g., describing a cartoon llama and on-screen UI elements). The practical conclusion is that these structured captions and visual descriptions can be converted into text chunks for RAG systems, enabling question answering over both what was said and what appeared on screen at specific moments.
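The custom function and a simple RAG chunking step could be sketched as follows, reusing the types import and the hypothetical timecode_schema helper from the earlier sketch. The field names spoken_text and visual_description are assumptions; the transcript only says the function separates spoken text from visual descriptions.

```python
# Sketch of the custom declaration and of turning its output into text chunks
# for a RAG index. Field names are hypothetical; timecode_schema is the helper
# defined in the earlier tool-declaration sketch.
from google.genai import types

set_timecodes_with_descriptions = types.FunctionDeclaration(
    name="setTimeCodesWithDescriptions",
    description="Time-coded spoken text plus a separate visual description per scene.",
    parameters=timecode_schema({
        "spoken_text": {"type": "STRING"},
        "visual_description": {"type": "STRING"},
    }),
)


def to_rag_chunks(function_call):
    """Convert the structured call arguments into one plain-text chunk per
    scene, keeping the timecode so answers can point back into the video."""
    chunks = []
    for entry in function_call.args.get("timecodes", []):
        chunks.append(
            f"[{entry.get('time')}] spoken: {entry.get('spoken_text', '')} | "
            f"visual: {entry.get('visual_description', '')}"
        )
    return chunks
```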
Cornell Notes
Gemini’s Video Analyzer converts uploaded videos into structured, time-coded outputs rather than only producing a transcript. In AI Studio, it can generate A/V captions with spoken text, key moments summaries, tables of visual elements, and even scene-by-scene counts (people, phones, trees) by using function-style outputs aligned to timestamps. Reproducing the behavior in Python requires defining tool/function declarations (e.g., setTimeCodes, setTimeCodesWithObjects, setTimeCodesWithNumericValues) and—crucially—mentioning the function name in the prompt so the model triggers the structured call. A custom function (setTimeCodesWithDescriptions) further separates spoken text from visual descriptions, creating metadata that can be indexed for RAG systems. This makes video content searchable by both audio and on-screen visuals at specific times.
- How does the Video Analyzer turn a raw video into something usable for search or RAG?
- Why do function calls matter more than plain prompting in this workflow?
- What’s the difference between the three built-in function patterns shown?
- How does the demo demonstrate prompt-driven customization for counting?
- What does the Python reproduction add beyond the UI?
- What practical limitation is mentioned when comparing UI and code?
Review Questions
- What changes in the prompt are necessary to ensure the model triggers a structured function call rather than returning only free-form text?
- How would you design a RAG indexing pipeline using setTimeCodesWithDescriptions outputs—what fields would you store and how would you chunk them?
- Which function pattern would you choose for (a) counting objects, (b) extracting emojis/objects for a table, and (c) producing time-coded spoken transcripts with visual descriptions?
Key Points
1. Gemini’s Video Analyzer produces time-coded, structured outputs (JSON-like objects) that include both spoken text and visual scene descriptions.
2. Function-style outputs (setTimeCodes, setTimeCodesWithObjects, setTimeCodesWithNumericValues) enable captions, object tables, and numeric counting aligned to timestamps.
3. Structured function calling depends on explicitly mentioning the function name in the prompt; otherwise results may degrade into plain transcript text.
4. Counting tasks can be repurposed by changing the target in the prompt (people → phones → trees) while keeping the numeric timecode structure.
5. The Python/Colab reproduction uses a unified SDK: upload via Files API, poll processing state, then call Gemini with system/user prompts plus tool declarations.
6. A custom function like setTimeCodesWithDescriptions can separate spoken text from visual descriptions, improving downstream indexing for RAG.
7. Time-coded visual metadata is positioned as a practical way to retrieve “what happened when” in videos, not just “what was said.”