Gemini 1.5 Pro for Video Analysis

Sam Witteveen · 5 min read

Based on Sam Witteveen's YouTube video. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Gemini 1.5 Pro can extract slide-level facts and approximate timestamps from a ~50-minute video even when audio input is stripped.

Briefing

Gemini 1.5 Pro can extract highly specific information from a long video, down to approximate timestamps for when key topics appear, making video-based Q&A and slide-by-slide summaries practical within its near-1-million-token context window. The most consequential takeaway is that the model’s strength comes from pairing visual slide content with text (especially Whisper-style transcripts), enabling targeted “needle-in-the-haystack” queries, such as checking whether Noam Shazeer is cited, and pulling details like the Gemini training-data description and evaluation themes.

The workflow starts with preparing a ~50-minute cut of Jeff Dean’s recent talk on machine learning trends, then uploading it to Google AI Studio. Tokenization alone can take time, and inference latency grows as the context approaches the 1-million-token limit: answers often take roughly 1.5–3 minutes, with slowdowns most noticeable near the maximum context. Audio is stripped because Gemini 1.5 Pro (in this setup) doesn’t support audio input, so early questions rely purely on the slides.
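
For readers who want to reproduce this workflow outside the AI Studio UI, here is a minimal sketch using the google-generativeai Python SDK’s File API. The file paths, API key handling, and ffmpeg step are assumptions for illustration, not details from the video.

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key handling

# The audio track is stripped beforehand, since this setup doesn't use
# audio input, e.g.: ffmpeg -i talk.mp4 -an -c:v copy talk_noaudio.mp4

# Upload the ~50-minute video via the File API (path is hypothetical).
video_file = genai.upload_file(path="talk_noaudio.mp4")

# Wait for server-side processing; tokenizing a long video takes a while.
while video_file.state.name == "PROCESSING":
    time.sleep(10)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

model = genai.GenerativeModel("gemini-1.5-pro")
```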

Even without a transcript or audio, Gemini can answer basic identity and logistics questions. It correctly identifies Jeff Dean as the talk’s author and returns a general summary focused on large language models, multimodal models, data quality, and evaluation methods. More importantly for retrieval-style use cases, it can locate when Gemini first appears: Gemini is mentioned at about the 25-minute mark (around 25:25), and the model can sometimes narrow this to within a few seconds to half a minute.
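
Continuing the upload sketch above, a retrieval-style timestamp query is just a plain-language prompt passed alongside the uploaded file; the exact wording here is illustrative, not the prompt used in the video.

```python
# Continues the upload sketch above (model and video_file already defined).
response = model.generate_content([
    video_file,
    "At what timestamp does the speaker first start talking about Gemini?",
])
print(response.text)  # expect an answer near the 25:25 mark for this talk
```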

The slide-only approach also supports precise, high-signal queries. When asked whether Dean cites work from Noam Shazeer, Gemini returns “yes,” pointing to Shazeer’s Transformer-related contributions—consistent with Shazeer’s name appearing on the relevant “Attention is All You Need” slide. That said, some returned claims may blend slide evidence with internal knowledge; for example, Gemini also mentions Shazeer’s work on PaLM, even though the transcript suggests the name may not appear directly on a slide.

Adding a Whisper transcription changes the quality and depth of extraction. With the transcript included, Gemini can generate a breakdown of each slide with the time it appears and what was discussed, and it can pull details from Dean’s spoken wording that aren’t visible in the visuals. The transcript-plus-slides setup raises the total token load to roughly 817k tokens (~801k for the video plus ~15.5k for the transcript) and further increases latency, but it yields more complete answers, such as Dean’s description of Gemini training data: large collections of web documents, books, and code, plus quality filtering using heuristics and model-based classifiers.
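
As a rough sketch of the transcript step, the openai-whisper package can produce the text that is then passed into the prompt alongside the video; the model size and audio file name are assumptions.

```python
import whisper

# Transcribe the talk's audio locally (model size and path are assumptions).
asr = whisper.load_model("base")
transcript = asr.transcribe("talk_audio.mp3")["text"]

# Pair the transcript with the uploaded video from the earlier sketch.
response = model.generate_content([
    video_file,
    "Transcript of the talk:\n" + transcript,
    "Give a breakdown of each slide: the time it appears on screen "
    "and what is discussed while it is shown.",
])
print(response.text)
```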

Finally, the same extracted material can be repurposed into other outputs, including a generated blog post summarizing the talk’s key themes. The practical message is clear: long-context video understanding isn’t a replacement for retrieval-augmented generation (RAG), because processing cost and latency still favor RAG for most production tasks—but for deep video interrogation and slide-level information mining, Gemini 1.5 Pro is demonstrably effective.

Cornell Notes

Gemini 1.5 Pro can answer questions and extract structured information from a long video by using its large context window and the visual content of slides. Without audio support, early results rely on slide text and can still identify when Gemini is first mentioned (around 25 minutes) and answer factual questions like where the talk is held. Adding a Whisper transcription improves accuracy and coverage, enabling slide-by-slide breakdowns with timestamps and extracting details that aren’t visible in the visuals. The tradeoff is latency: near the upper end of the context window, responses can take roughly 2–3 minutes, and the long context is mainly useful for input rather than for generating extremely large outputs.

How does Gemini 1.5 Pro handle a long video when audio isn’t supported in this setup?

Audio is stripped during upload, so the system relies on the video’s visual content—especially slides. That still allows it to identify the talk’s author and produce a general summary, and it can locate topic timing by detecting when specific terms (like “Gemini”) appear on slides.

What evidence shows Gemini can pinpoint when Gemini is discussed in the talk?

A query for “when does he start talking about Gemini?” returns an approximate timestamp. The slide containing “Gemini” appears at about 25 minutes (around 25:25), and the model can sometimes narrow the timing to within a few seconds to half a minute, depending on the prompt and slide clarity.

How does the “needle in the haystack” test work, and what was the result for Noam Shazeer?

A targeted question asks whether Dean cites work from Noam Shazeer. Gemini answers “yes,” linking Shazeer to the Transformer work and pointing to the “Attention is All You Need” slide where Shazeer’s name appears. The transcript may not mention his name directly, so the model is effectively using slide evidence; however, some additional claims (like PaLM-related attribution) may reflect internal knowledge beyond what’s explicitly on slides.

What does Gemini report about Gemini’s training data, and how is that tied to timestamps?

Using slide-only input, Gemini identifies a Gemini training-data discussion around 31 minutes (about 31:13). The extracted content says Gemini is trained on web documents, books, and code, with quality filtering using heuristics plus model-based classifiers. The timestamp alignment is checked by matching the returned claim to the corresponding slide.

Why does adding a Whisper transcript improve results, and what cost comes with it?

Whisper transcription adds spoken details that may not appear in slides, enabling more accurate slide-by-slide breakdowns and extraction of comments that are hard to find in YouTube transcripts. The cost is increased latency: total token load rises (video plus ~15.5k transcript tokens), and responses can take about 2–3 minutes when operating near the million-token range.

What limitation appears in output length when using the long context window?

The long context window is mainly useful for input. Generating extremely large outputs (tens of thousands of tokens at once) isn’t supported in practice; the model tends to produce outputs on the order of thousands of tokens rather than very large batches.
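
As a small illustration of that ceiling, the output cap is configured separately from the input context in the SDK’s generation config; the 8,192-token figure reflects the model’s documented output limit, not a number from the video.

```python
import google.generativeai as genai

# Input context can approach ~1M tokens, but output length is capped
# separately; Gemini 1.5 Pro's documented ceiling is 8,192 output tokens.
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={"max_output_tokens": 8192},
)
```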

Review Questions

  1. When audio is unavailable, what parts of the video become the primary source of truth for Gemini’s answers?
  2. What are the practical latency tradeoffs as the context approaches 1 million tokens, and how does that affect interactive Q&A?
  3. In the Noam Shazeer test, how can you tell whether an answer is grounded in visible slides versus likely internal knowledge?

Key Points

  1. Gemini 1.5 Pro can extract slide-level facts and approximate timestamps from a ~50-minute video even when audio input is stripped.
  2. Inference latency rises sharply as token usage approaches the 1-million-token context window, with many answers taking roughly 1.5–3 minutes.
  3. Long-context video understanding is best treated as input capacity; it doesn’t automatically enable extremely long generations.
  4. Slide-only queries can still locate when a topic first appears (Gemini is mentioned at about 25 minutes) and answer factual questions like the venue.
  5. Adding a Whisper transcript improves coverage by capturing spoken details not present in slides, enabling slide-by-slide summaries with timestamps.
  6. Targeted “needle in the haystack” questions can work when names or key terms appear on slides, but some claims may blend slide evidence with internal knowledge.
  7. RAG remains relevant for most production retrieval tasks because processing cost and latency make full-context use impractical as a default.

Highlights

Gemini 1.5 Pro identified the first “Gemini” mention at roughly 25:25 by detecting the relevant slide in a long video.
A slide-only “needle in the haystack” question about Noam Shazeer returned “yes,” consistent with Shazeer’s name appearing on the “Attention is All You Need” slide.
With Whisper transcripts added, Gemini produced a structured breakdown of slides with timestamps and extracted training-data details: web documents, books, code, plus heuristics and model-based classifiers for quality filtering.
Near the million-token range, response times commonly stretched to about 2–3 minutes, reinforcing that long context is mainly for input rather than fast interaction.
