Gemini 2.5 Pro for YouTube Analysis
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Gemini 2.5 Pro can analyze YouTube videos directly, either by uploading a video file or, more conveniently, by passing a public YouTube URL as file input. That makes it possible to generate summaries, full transcripts, timestamped quotes, and even code extracted from tutorial-style screen recordings. The practical payoff is speed: instead of downloading and processing media yourself, the workflow uses Gemini’s Files API (or the YouTube-URL shortcut) to fetch the content, then runs multimodal prompts that return structured text such as Markdown, transcripts, and time-linked segments.
For file-based workflows, the approach is straightforward: download or obtain an MP4 (or another supported video format), upload it via the Files API, then include the resulting file reference in the prompt. Inline video inputs are possible but constrained: the video must stay under 20 MB, which rules out most HD, multi-minute recordings. When size becomes an issue, the transcript suggests older workarounds such as splitting raw video into audio plus sampled images (for example, sampling a frame every second and uploading only frames that differ meaningfully), though the newer video support often makes that extra step unnecessary.
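A minimal sketch of the file-based request, written as a raw REST-style payload rather than SDK calls so the shape is visible. The field names (`contents`, `parts`, `file_data`, `file_uri`) follow the public generateContent API, but treat the exact structure as an assumption and verify it against the current docs; the file URI shown is a placeholder for whatever the Files API returns after upload.

```python
# Build a generateContent request body that references an already-uploaded
# video file. Uploading via the Files API happens first and returns a file
# URI; this helper only assembles the prompt that uses it.

def build_file_request(file_uri: str, prompt: str, mime_type: str = "video/mp4") -> dict:
    """Return a request body pairing an uploaded-file reference with a text prompt."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"mime_type": mime_type, "file_uri": file_uri}},
                {"text": prompt},
            ]
        }]
    }

# Placeholder URI standing in for the value returned by the Files API upload.
body = build_file_request("files/example-upload", "Summarize this video in Markdown.")
# POST `body` to the models/gemini-2.5-pro:generateContent endpoint with your API key.
```

Keeping the video as a `file_data` part and the instructions as a separate `text` part is what lets the same uploaded file be reused across several prompts without re-uploading.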
The more notable update is Gemini’s ability to ingest a YouTube video by URL. This requires the video to be publicly listed (not private or unlisted) and is still in preview, with pricing and rate limits not finalized. The current operational limits are practical: up to 8 hours of YouTube video per day, and only one video per request. That one-per-request constraint contrasts with normal file uploads, where multiple videos can be uploaded and analyzed together; the multi-file path is useful for product-testing footage, where Gemini can identify recurring friction points, slowdowns, and problem areas across many user sessions.
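The URL path looks like the file path, except the `file_uri` is the YouTube link itself and there is no upload step. The sketch below also enforces the preview’s one-video-per-request rule client-side; the payload shape follows the public REST API but should be treated as an assumption, and the video ID is a placeholder.

```python
# Pass a public YouTube URL directly as file_data; Gemini fetches the video.
# A small guard mirrors the preview limit of one video per request.

def build_youtube_request(url: str, prompt: str) -> dict:
    """Build a generateContent body around a single public YouTube URL."""
    if not url.startswith(("https://www.youtube.com/", "https://youtu.be/")):
        raise ValueError("Expected a public YouTube URL")
    return {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": url}},  # no Files API upload needed
                {"text": prompt},
            ]
        }]
    }

def count_videos(body: dict) -> int:
    """Count file_data parts; the URL preview allows at most one per request."""
    return sum(
        1
        for content in body["contents"]
        for part in content["parts"]
        if "file_data" in part
    )

req = build_youtube_request(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder video ID
    "Produce a timestamped transcript with speaker names.",
)
assert count_videos(req) == 1
```

Validating the limit before sending saves a round trip: a request with two YouTube URLs would be rejected by the service anyway under the current preview rules.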
Under the hood, the transcript notes a tokenization cost that scales with the video: roughly 258 tokens per sampled frame, plus 32 tokens per second for audio, along with some metadata overhead. In practical terms, that puts an hour of video near the upper end of the model’s context budget.
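Back-of-envelope arithmetic on those quoted rates, assuming one sampled frame per second (the sampling rate is an assumption, not stated in the transcript):

```python
# Context budgeting from the quoted rates: ~258 tokens per sampled frame
# plus ~32 tokens per second of audio.
TOKENS_PER_FRAME = 258
TOKENS_PER_SEC_AUDIO = 32
FPS = 1  # assumed sampling rate

def video_tokens(seconds: int, fps: int = FPS) -> int:
    """Estimate token cost for a video of the given duration."""
    return seconds * fps * TOKENS_PER_FRAME + seconds * TOKENS_PER_SEC_AUDIO

one_hour = video_tokens(3600)
print(one_hour)  # 1044000 — close to a million-token context window
```

At these rates an hour of video costs about 1.04M tokens, which is why an hour is roughly the practical ceiling per request and why longer videos need trimming or chunking.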
Once the YouTube URL is provided as file data, Gemini can produce more than descriptions. It can generate timestamped transcripts (including speaker attribution, such as “Sam”), and it can also perform visual Q&A and visual descriptions. A key capability highlighted is temporal understanding: Gemini can extract code from a tutorial video, reconstructing the sequence of notebook cells and returning the code wrapped in triple backticks rather than doing simple frame-by-frame OCR. The transcript emphasizes that this works even when the code appears gradually as the cursor scrolls, suggesting the model is assembling content over time.
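Since the reconstructed code comes back wrapped in triple backticks, pulling it out of the response text is a small parsing step. This is a generic regex helper, not part of any Gemini SDK:

```python
import re

# Extract ```-fenced code blocks from a model response, capturing the
# optional language tag and the block body.
FENCE = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)

def extract_code_blocks(response_text: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs in the order they appear."""
    return [(lang, body.strip()) for lang, body in FENCE.findall(response_text)]

# Simulated model reply containing one reconstructed notebook cell.
reply = "Here is the notebook cell:\n```python\nprint('hello')\n```\nDone."
blocks = extract_code_blocks(reply)
print(blocks)  # [('python', "print('hello')")]
```

For a multi-cell tutorial the same call returns the cells in order, which preserves the temporal sequence the model reconstructed from the screen recording.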
The final takeaway is about use cases rather than mechanics. The workflow is positioned as a default way to turn YouTube URLs into written artifacts—tutorial code, recipes from cooking videos, or step-by-step instructions for team members using Loom-style recordings—while keeping the implementation relatively simple. The creator encourages experimentation and creative repurposing of the same URL-to-structured-output pipeline across different video types.
Cornell Notes
Gemini 2.5 Pro can turn YouTube videos into usable text by analyzing either uploaded video files or a public YouTube URL passed as file data. The URL method avoids downloading and supports tasks like video summaries, detailed transcripts with timestamps, and visual Q&A. A standout capability is temporal code extraction: tutorial screen recordings can be converted into reconstructed code blocks (triple backticks) rather than raw OCR. Practical limits include public-only access, up to 8 hours of YouTube video per day, and one video per request. Token costs scale with frames and audio, so long videos may require prompt and context planning.
What are the two main ways to feed video into Gemini 2.5 Pro, and when does each method matter?
Why does inline video input often fail for real YouTube content?
What limitations apply to the YouTube-URL upload approach?
How does Gemini’s tokenization affect how much video fits in context?
What’s the difference between transcript generation and “visual Q&A” in this workflow?
Why is code extraction from tutorial videos described as more than OCR?
Review Questions
- What constraints make the YouTube-URL method preferable to downloading and uploading files, and what constraints make it less flexible?
- How do the token costs (tokens per frame and tokens per second for audio) influence planning for long videos?
- What prompt and output differences distinguish transcript generation from code extraction in this workflow?
Key Points
1. Use Gemini’s Files API to upload a video file (often MP4) and include the uploaded file reference in the prompt for multimodal analysis.
2. Inline video inputs are limited to under 20 MB, so most HD YouTube content needs the Files API upload or the YouTube-URL method.
3. Passing a public YouTube URL as file data lets Gemini fetch and analyze the video directly, but it’s limited to preview rules like up to 8 hours per day and one video per request.
4. Token usage scales with video content (roughly 258 tokens per frame plus 32 tokens per second for audio), so long videos require context budgeting.
5. Gemini can generate timestamped transcripts with speaker attribution when prompted, and it can also do visual Q&A and visual descriptions.
6. A standout workflow is temporal code extraction from tutorial videos, producing reconstructed code blocks instead of frame-by-frame OCR.
7. The most valuable next step is applying the same URL-to-structured-output approach to creative tasks like recipes, product feedback synthesis, and Loom-style team instructions.