
Gemini 2.5 Pro for YouTube Analysis

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Use Gemini’s Files API to upload a video file (often MP4) and include the uploaded file reference in the prompt for multimodal analysis.

Briefing

Gemini 2.5 Pro can analyze YouTube videos directly—either by uploading a video file or, more conveniently, by passing a public YouTube URL into Gemini’s file input—making it possible to generate summaries, full transcripts, timestamped quotes, and even extract code from tutorial-style screen recordings. The practical payoff is speed: instead of downloading and processing media, the workflow uses Gemini’s File API (or the YouTube-URL shortcut) to fetch the content and then run multimodal prompts that return structured text like Markdown, transcripts, and time-linked segments.

For file-based workflows, the approach is straightforward: download or obtain an MP4 (or other supported video formats), upload it via the Files API, then include the resulting file reference in the prompt. Inline video inputs are possible but constrained—video must be under 20 MB, which often rules out typical HD, multi-minute recordings. When size becomes an issue, the transcript suggests older workarounds such as splitting raw video into audio plus sampled images (for example, sampling frames every second and only uploading frames that differ meaningfully), though the newer video support often makes that extra step unnecessary.
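The upload-then-reference flow described above can be sketched with the google-genai Python SDK. The model name, prompt wording, and environment-variable authentication are assumptions for illustration, not details taken from the video:

```python
def summarize_video_file(path: str, prompt: str, model: str = "gemini-2.5-pro") -> str:
    """Upload a local video via the Files API and run a multimodal prompt on it."""
    from google import genai  # third-party SDK, imported lazily so the sketch stays importable

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
    uploaded = client.files.upload(file=path)  # Files API upload; returns a file reference
    response = client.models.generate_content(
        model=model,
        contents=[uploaded, prompt],  # uploaded file reference + text prompt together
    )
    return response.text


if __name__ == "__main__":
    print(summarize_video_file("tutorial.mp4", "Summarize this video in Markdown."))
```

The key point is the `contents` list: the uploaded file reference travels alongside the text prompt in the same request, which is what makes the analysis multimodal.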

The more notable update is Gemini's ability to ingest a YouTube video directly by URL. This requires the video to be publicly listed (not private or unlisted), and the feature is still in preview, with pricing and rate limits not finalized. The current operational limits are practical: up to 8 hours of YouTube video per day, and only one video per request. That "one per request" constraint contrasts with normal file uploads, where multiple videos can be uploaded and analyzed together, which is useful for product-testing footage where Gemini can identify recurring friction points, slowdowns, and problem areas across many user sessions.
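A sketch of the URL route, again assuming the google-genai SDK: the URL is passed as file data rather than uploaded. The `looks_like_watch_url` helper is hypothetical, a cheap local sanity check before spending a request, and deliberately ignores `youtu.be` short links:

```python
from urllib.parse import urlparse, parse_qs


def looks_like_watch_url(url: str) -> bool:
    """Cheap pre-flight check: the preview feature only accepts public YouTube
    videos, so at minimum require a youtube.com/watch?v=... URL."""
    parsed = urlparse(url)
    return (
        parsed.netloc in ("youtube.com", "www.youtube.com", "m.youtube.com")
        and parsed.path == "/watch"
        and bool(parse_qs(parsed.query).get("v"))
    )


def analyze_youtube_url(url: str, prompt: str) -> str:
    """Pass a public YouTube URL as file data and run a multimodal prompt."""
    from google import genai  # third-party SDK, imported lazily
    from google.genai import types

    client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=types.Content(parts=[
            types.Part(file_data=types.FileData(file_uri=url)),  # URL as file data
            types.Part(text=prompt),
        ]),
    )
    return response.text
```

Note that `analyze_youtube_url` takes exactly one URL, matching the one-video-per-request preview limit; batching across sessions still requires the Files API route.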

Under the hood, the transcript notes a tokenization cost that scales with video content: roughly 258 tokens per frame, plus 32 tokens per second for audio, along with metadata. In practical terms, that means an hour of video is likely to fit within the model’s context budget.
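The arithmetic can be made concrete. The 1-frame-per-second sampling rate below is an assumption (the transcript only gives per-frame and per-second figures), and metadata overhead is ignored:

```python
TOKENS_PER_FRAME = 258      # approximate per-frame cost quoted in the transcript
TOKENS_PER_AUDIO_SEC = 32   # approximate per-second audio cost


def estimate_video_tokens(duration_sec: int, fps: float = 1.0) -> int:
    """Rough token estimate for a video: sampled frames plus audio, metadata excluded."""
    frames = int(duration_sec * fps)
    return frames * TOKENS_PER_FRAME + duration_sec * TOKENS_PER_AUDIO_SEC


# At 1 frame/sec, each second costs 258 + 32 = 290 tokens, so an hour
# (3600 s) comes to about 1.04 million tokens.
```

That puts an hour of footage on the order of a million-token context window, which is why the transcript treats an hour as roughly the practical ceiling per request.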

Once the YouTube URL is provided as file data, Gemini can produce more than descriptions. It can generate timestamped transcripts (including speaker attribution, such as “Sam”), and it can also perform visual Q&A and visual descriptions. A key capability highlighted is temporal understanding: Gemini can extract code from a tutorial video, reconstructing the sequence of notebook cells and returning the code wrapped in triple backticks rather than doing simple frame-by-frame OCR. The transcript emphasizes that this works even when the code appears gradually as the cursor scrolls, suggesting the model is assembling content over time.
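Prompt wording drives which artifact comes back. The strings below are illustrative assumptions, not the creator's exact prompts; the transcript prompt asks for speaker attribution, and the extraction prompt asks for ordered fenced code blocks to encourage temporal reconstruction rather than per-frame OCR:

```python
# Hypothetical prompt for a timestamped, speaker-attributed transcript.
TRANSCRIPT_PROMPT = (
    "Produce a full transcript of this video with [mm:ss] timestamps. "
    "Attribute each line to a speaker by name where possible."
)

# Hypothetical prompt for temporal code extraction from a tutorial recording.
CODE_EXTRACTION_PROMPT = (
    "Extract all code shown in this tutorial, in the order the notebook cells "
    "appear on screen, even when code is revealed gradually by scrolling. "
    "Return each cell as a separate fenced code block."
)
```

Either string would be passed as the text part alongside the video file reference or YouTube URL in a single request.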

The final takeaway is about use cases rather than mechanics. The workflow is positioned as a default way to turn YouTube URLs into written artifacts—tutorial code, recipes from cooking videos, or step-by-step instructions for team members using Loom-style recordings—while keeping the implementation relatively simple. The creator encourages experimentation and creative repurposing of the same URL-to-structured-output pipeline across different video types.

Cornell Notes

Gemini 2.5 Pro can turn YouTube videos into usable text by analyzing either uploaded video files or a public YouTube URL passed as file data. The URL method avoids downloading and supports tasks like video summaries, detailed transcripts with timestamps, and visual Q&A. A standout capability is temporal code extraction: tutorial screen recordings can be converted into reconstructed code blocks (triple backticks) rather than raw OCR. Practical limits include public-only access, up to 8 hours of YouTube video per day, and one video per request. Token costs scale with frames and audio, so long videos may require prompt and context planning.

What are the two main ways to feed video into Gemini 2.5 Pro, and when does each method matter?

One method downloads a video (commonly MP4) and uploads it through Gemini’s Files API, then passes the uploaded file reference into the prompt. The other method uses a YouTube URL directly as file data, letting Gemini fetch the publicly listed video for analysis. The file-upload route is flexible but constrained by inline size limits (under 20 MB for inline inputs), while the YouTube-URL route is convenient but limited by preview constraints like public-only availability and daily upload caps.

Why does inline video input often fail for real YouTube content?

Inline input requires the video to be under 20 MB. Many HD videos—especially multi-minute podcasts and tutorials—exceed that size, so the workflow typically shifts to the Files API upload or the YouTube-URL method instead of inline passing.
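That size check is easy to automate before choosing a route. A minimal sketch, noting that the 20 MB cap applies to the whole inline request, so this is only accurate when the accompanying prompt is short:

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # ~20 MB ceiling for inline video input


def choose_route(path: str) -> str:
    """Route small clips inline and everything else through the Files API.
    The cap covers the entire request, so treat this as an optimistic bound."""
    return "inline" if os.path.getsize(path) <= INLINE_LIMIT_BYTES else "files_api"
```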

What limitations apply to the YouTube-URL upload approach?

The video must be publicly listed (not private or unlisted). It’s still in preview with pricing/rate limits not finalized. The main operational limits mentioned are a cap of up to 8 hours of YouTube video per day and only one video per request. By contrast, normal video uploads can support multiple videos in one shot for cross-video pattern extraction.

How does Gemini’s tokenization affect how much video fits in context?

The transcript gives an approximate scaling: about 258 tokens per frame, plus 32 tokens per second for audio, along with metadata. Because of that, an hour of video is described as likely to fit within the model’s context budget, though exact results depend on frame rate, audio presence, and prompt length.

What’s the difference between transcript generation and “visual Q&A” in this workflow?

Transcript generation focuses on producing spoken content as text, including timestamps and speaker names when prompted (e.g., “Sam”). Visual Q&A and visual descriptions focus on what appears on screen—Gemini can answer questions about the video’s visual content and provide descriptions, and it can also extract structured artifacts like code from tutorial videos by reconstructing the sequence of notebook cells.

Why is code extraction from tutorial videos described as more than OCR?

The transcript emphasizes temporal assembly: the model reconstructs code in the correct order even when the code is revealed gradually through scrolling rather than shown all at once. The output is presented as code blocks (triple backticks) rather than a jumble of per-frame text, indicating it’s integrating information over time.

Review Questions

  1. What constraints make the YouTube-URL method preferable to downloading and uploading files, and what constraints make it less flexible?
  2. How do the token costs (tokens per frame and tokens per second for audio) influence planning for long videos?
  3. What prompt and output differences distinguish transcript generation from code extraction in this workflow?

Key Points

  1. Use Gemini’s Files API to upload a video file (often MP4) and include the uploaded file reference in the prompt for multimodal analysis.
  2. Inline video inputs are limited to under 20 MB, so most HD YouTube content needs the Files API upload or the YouTube-URL method.
  3. Passing a public YouTube URL as file data lets Gemini fetch and analyze the video directly, but it’s limited to preview rules like up to 8 hours per day and one video per request.
  4. Token usage scales with video content (roughly 258 tokens per frame plus 32 tokens per second for audio), so long videos require context budgeting.
  5. Gemini can generate timestamped transcripts with speaker attribution when prompted, and it can also do visual Q&A and visual descriptions.
  6. A standout workflow is temporal code extraction from tutorial videos, producing reconstructed code blocks instead of frame-by-frame OCR.
  7. The most valuable next step is applying the same URL-to-structured-output approach to creative tasks like recipes, product feedback synthesis, and Loom-style team instructions.

Highlights

Passing a public YouTube URL directly into Gemini’s file input enables analysis without downloading and re-uploading the media.
The workflow supports more than transcription: it can generate timestamped transcripts, visual descriptions, and visual Q&A.
Temporal code extraction can reconstruct notebook code from a scrolling tutorial, outputting clean code blocks rather than OCR noise.
Cross-video analysis is easier with multiple file uploads than with the one-video-per-request YouTube-URL limit.

Topics

  • YouTube URL Analysis
  • Gemini 2.5 Pro
  • Files API
  • Timestamped Transcripts
  • Code Extraction