Gemini 2.5 Pro for Audio Transcription
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini 2.5 Pro’s 64,000-token generation limit makes full podcast-length transcription more practical, enabling roughly two hours of transcript output in many workflows.
Briefing
Gemini 2.5 Pro’s jump to a 64,000-token generation limit is the practical unlock for high-quality podcast transcription at scale: it is long enough to turn roughly two hours of audio into a usable, timestamped transcript in a single pass. Earlier Gemini models could handle audio, but the bottleneck was output length: even when transcription quality was strong, generating a full podcast transcript often ran into token constraints. With 64,000 tokens of output, a 15-minute segment produces roughly 8,000 tokens of transcript, and the same math implies around two hours of transcript per pass (about 230,000 tokens of audio in, 64,000 tokens out). That shift matters because it changes the workflow from “partial transcripts and patchwork” to “complete transcripts you can search, summarize, and quote.”
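The arithmetic behind those figures is easy to replay; the snippet below is just a sanity check of the quoted numbers (the per-15-minute transcript size is the video’s rough estimate, not a hard rule).

```python
# Back-of-envelope check of the quoted figures (illustrative, not exact).
AUDIO_TOKENS_PER_SECOND = 32            # audio tokenization rate
TRANSCRIPT_TOKENS_PER_15_MIN = 8_000    # rough transcript size for 15 minutes of speech

two_hours_s = 2 * 60 * 60
input_tokens = two_hours_s * AUDIO_TOKENS_PER_SECOND                      # 230,400 tokens in
output_tokens = two_hours_s // (15 * 60) * TRANSCRIPT_TOKENS_PER_15_MIN   # 64,000 tokens out

print(f"{input_tokens:,} audio tokens in, {output_tokens:,} transcript tokens out")
```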
The transcript also lays out how audio is metered and why pipeline design still matters. Gemini tokenizes audio at roughly 32 tokens per second, which works out to about 1,920 tokens per minute and around 115,000 tokens per hour of audio. That means token limits can force splitting for long inputs: the guidance is to keep total token usage under a ~200,000-token ceiling by feeding the model chunks that produce 30,000–40,000 tokens of output. The system also down-samples audio to 16 kHz and collapses stereo to a single channel, so stereo-specific tasks (like analyzing stereo positioning) won’t work as expected.
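As a rough illustration of that splitting guidance, here is a minimal chunk planner; the 32 tokens-per-second rate comes from the video, while the exact budget and window layout are assumptions rather than documented limits.

```python
# Minimal chunk planner: split a recording into windows whose audio-token cost
# stays under an assumed per-call input budget (~200,000 tokens).
def plan_chunks(duration_s: float,
                tokens_per_s: int = 32,
                input_budget: int = 200_000) -> list[tuple[float, float]]:
    """Return (start_s, end_s) windows covering a recording of duration_s seconds."""
    max_chunk_s = input_budget / tokens_per_s   # ~6,250 s (~1h44m) of audio per call
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

print(plan_chunks(3 * 3600))   # a 3-hour recording needs two calls
```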
On the mechanics, the workflow starts with uploading audio rather than embedding it directly in the prompt. Inline audio is constrained by a per-call size limit (20 MB), while the upload API supports a single file up to 2 GB, with multiple files usable in one call. Once uploaded, the audio file reference is passed into Gemini’s content generation call, and the model can return transcripts with timestamps.
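A minimal sketch of that upload-then-generate flow, assuming the google-genai Python SDK; exact method and parameter names vary between SDK versions, so treat this as illustrative rather than canonical.

```python
from google import genai

# The client reads the API key from the environment (e.g., GEMINI_API_KEY).
client = genai.Client()

# Upload once via the File API (which handles large files), then pass the file
# handle into the generation call instead of inlining the audio bytes.
audio_file = client.files.upload(file="episode.mp3")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        "Transcribe this podcast with timestamps and speaker labels.",
        audio_file,
    ],
)
print(response.text)
```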
A key quality feature is diarization—identifying who is speaking when. The transcript argues that Gemini 2.5 Pro effectively performs diarization “out of the box,” without the traditional embedding-and-clustering pipeline. In podcasts, speakers often address each other by name (“Sam, what do you think?”), and the model can use those cues across repeated exchanges to infer turn-taking. If names aren’t explicit, the prompt can include a speaker list so the model can map voices to provided identities, even when spellings vary.
The practical output is refined with a timestamping strategy: instead of a timestamp for every sentence, the prompt can request a new timestamp only when the speaker changes, plus one every 30 seconds within the same speaker’s turn. That preserves diarization detail while producing a transcript that’s easier to navigate.
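A hypothetical prompt that combines the speaker-list hint from the previous paragraph with this timestamping strategy; the wording is an assumption to be tuned, not a documented prompt format.

```python
# Hypothetical prompt text; speaker names and phrasing are placeholders.
TRANSCRIPTION_PROMPT = """Transcribe this audio with speaker labels.
The speakers are: Sam (host) and one guest.
Timestamp rules (format HH:MM:SS):
- Add a timestamp whenever the speaker changes.
- Within the same speaker's turn, add a timestamp every 30 seconds.
- Do not add a timestamp to every sentence."""
```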
Finally, the workflow chains tasks: after generating a timestamped transcript, the transcript is summarized into bullet-point notes with timestamps at the end of each idea (hours:minutes:seconds). Those timestamps then become navigation anchors for jumping back into the recording. For podcasts longer than the model can fully transcribe, the approach is to request overlapping segments (e.g., start at 1:30:00 for the next chunk) and then stitch them using fuzzy matching on the overlap window.
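For the stitching step, a minimal sketch using Python’s difflib for the fuzzy match; the overlap window size and minimum-match threshold are assumptions, not values from the video.

```python
from difflib import SequenceMatcher

def stitch(first: str, second: str, window_chars: int = 2_000) -> str:
    """Join two transcripts generated from overlapping audio segments."""
    start = max(0, len(first) - window_chars)
    tail, head = first[start:], second[:window_chars]
    # Find the longest run of shared text inside the overlap window.
    m = SequenceMatcher(None, tail, head).find_longest_match(0, len(tail), 0, len(head))
    if m.size < 50:                      # too little shared text: fall back to plain concat
        return first + "\n" + second
    # Keep the first transcript up to the shared run, then continue with the second.
    return first[:start + m.a] + second[m.b:]
```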
The transcript closes by contrasting this approach with raw transcription tools like Whisper, which may not deliver diarization. It also points to AI Studio for quick experimentation and sketches downstream possibilities: turning transcripts into podcast-style summaries (via NotebookLM-style prompting) and using TTS systems to generate new audio—while flagging legal and voice-rights considerations.
Cornell Notes
Gemini 2.5 Pro makes podcast transcription more usable by raising the generation ceiling to 64,000 tokens, which is enough to output transcripts for about two hours of audio in many cases. Audio is processed at a rate of ~32 tokens per second (about 115,000 tokens per hour), and long recordings may require splitting to stay under token limits. The model can produce diarized transcripts—mapping who speaks when—often using conversational cues like speakers addressing each other by name, and it can also use a provided speaker list when names aren’t explicit. A practical pipeline uploads audio via the upload API, generates a timestamped transcript with diarization, then summarizes that transcript into bullet notes with timestamps for fast navigation and quoting. For recordings beyond the limit, overlapping segments and fuzzy matching help stitch transcripts together.
- Why does the 64,000-token limit change audio transcription workflows?
- How do token rates and token limits affect how long an audio file can be transcribed in one go?
- What practical steps make audio transcription work reliably (upload vs inline)?
- How does Gemini 2.5 Pro handle diarization, and what cues does it rely on?
- How can timestamps be made more useful than “one timestamp per sentence”?
- What’s the recommended method for transcribing audio longer than the model can fully output?
Review Questions
- If a podcast is longer than two hours, what overlap strategy would you use to stitch transcripts together, and why does overlap matter?
- How would you modify a prompt to reduce timestamp noise while still preserving diarization accuracy?
- What limitations arise from down-sampling to 16 kHz and converting stereo to a single channel, and how might that affect certain audio-analysis tasks?
Key Points
1. Gemini 2.5 Pro’s 64,000-token generation limit makes full podcast-length transcription more practical, enabling roughly two hours of transcript output in many workflows.
2. Audio tokenization runs at about 32 tokens per second (~115,000 tokens per hour), so long recordings may require splitting to stay under token ceilings.
3. Use the upload API for audio (up to 2 GB per file) instead of embedding audio inline, which is limited to 20 MB per call.
4. Gemini 2.5 Pro can produce diarized transcripts without the traditional embedding-and-clustering pipeline, often leveraging conversational cues like speakers addressing each other by name.
5. Prompt timestamping can be tuned to reduce clutter, e.g., a timestamp on speaker changes and every 30 seconds within the same speaker’s turn.
6. After transcription, summarizing the transcript into bullet notes with timestamps enables fast navigation and quote-finding without re-listening.
7. For recordings beyond output limits, transcribe overlapping segments and stitch them using fuzzy matching on the overlap window.