Gemini 2.5 Pro for Audio Transcription
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini 2.5 Pro’s 64,000-token generation limit makes full podcast-length transcription more practical, enabling roughly two hours of transcript output in many workflows.
Briefing
Gemini 2.5 Pro’s jump to a 64,000-token generation limit is the practical unlock for high-quality podcast transcription at scale: it is long enough to turn roughly two hours of audio into a usable, timestamped transcript in a single pass. Earlier Gemini models could handle audio, but the bottleneck was output length: even when transcription quality was strong, generating a full podcast transcript often ran into token constraints. With 64,000 tokens of output, a 15-minute segment produces roughly 8,000 tokens of transcript, and the same math implies around two hours of transcript per pass (about 230,000 tokens of audio in, 64,000 tokens out). That shift matters because it changes the workflow from “partial transcripts and patchwork” to “complete transcripts you can search, summarize, and quote.”
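The arithmetic behind those figures is easy to replay; the snippet below is just a sanity check of the quoted numbers (the per-15-minute transcript size is the video’s rough estimate, not a hard rule).

```python
# Back-of-envelope check of the quoted figures (illustrative, not exact).
AUDIO_TOKENS_PER_SECOND = 32            # audio tokenization rate
TRANSCRIPT_TOKENS_PER_15_MIN = 8_000    # rough transcript size for 15 minutes of speech

two_hours_s = 2 * 60 * 60
input_tokens = two_hours_s * AUDIO_TOKENS_PER_SECOND                      # 230,400 tokens in
output_tokens = two_hours_s // (15 * 60) * TRANSCRIPT_TOKENS_PER_15_MIN   # 64,000 tokens out

print(f"{input_tokens:,} audio tokens in, {output_tokens:,} transcript tokens out")
```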
The transcript also lays out how audio is metered and why pipeline design still matters. Gemini tokenizes audio at roughly 32 tokens per second, which works out to about 1,920 tokens per minute and around 115,000 tokens per hour of audio. That means token limits can force splitting for long inputs: the guidance is to keep total token usage under a ~200,000-token ceiling by feeding the model chunks that produce 30,000–40,000 tokens of output. The system also down-samples audio to 16 kHz and collapses stereo to a single channel, so stereo-specific tasks (like analyzing stereo positioning) won’t work as expected.
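As a rough illustration of that splitting guidance, here is a minimal chunk planner; the 32 tokens-per-second rate comes from the video, while the exact budget and window layout are assumptions rather than documented limits.

```python
# Minimal chunk planner: split a recording into windows whose audio-token cost
# stays under an assumed per-call input budget (~200,000 tokens).
def plan_chunks(duration_s: float,
                tokens_per_s: int = 32,
                input_budget: int = 200_000) -> list[tuple[float, float]]:
    """Return (start_s, end_s) windows covering a recording of duration_s seconds."""
    max_chunk_s = input_budget / tokens_per_s   # ~6,250 s (~1h44m) of audio per call
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

print(plan_chunks(3 * 3600))   # a 3-hour recording needs two calls
```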
On the mechanics, the workflow starts with uploading audio rather than embedding it directly in the prompt. Inline audio is constrained by a per-call size limit (20 MB), while the upload API supports a single file up to 2 GB, with multiple files usable in one call. Once uploaded, the audio file reference is passed into Gemini’s content generation call, and the model can return transcripts with timestamps.
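A minimal sketch of that upload-then-generate flow, assuming the google-genai Python SDK; exact method and parameter names vary between SDK versions, so treat this as illustrative rather than canonical.

```python
from google import genai

# The client reads the API key from the environment (e.g., GEMINI_API_KEY).
client = genai.Client()

# Upload once via the File API (which handles large files), then pass the file
# handle into the generation call instead of inlining the audio bytes.
audio_file = client.files.upload(file="episode.mp3")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        "Transcribe this podcast with timestamps and speaker labels.",
        audio_file,
    ],
)
print(response.text)
```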
A key quality feature is diarization—identifying who is speaking when. The transcript argues that Gemini 2.5 Pro effectively performs diarization “out of the box,” without the traditional embedding-and-clustering pipeline. In podcasts, speakers often address each other by name (“Sam, what do you think?”), and the model can use those cues across repeated exchanges to infer turn-taking. If names aren’t explicit, the prompt can include a speaker list so the model can map voices to provided identities, even when spellings vary.
The practical output is refined with a timestamping strategy: instead of a timestamp for every sentence, the prompt can request a new timestamp only when the speaker changes, plus one every 30 seconds within the same speaker’s turn. That preserves diarization detail while producing a transcript that’s easier to navigate.
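A hypothetical prompt that combines the speaker-list hint from the previous paragraph with this timestamping strategy; the wording is an assumption to be tuned, not a documented prompt format.

```python
# Hypothetical prompt text; speaker names and phrasing are placeholders.
TRANSCRIPTION_PROMPT = """Transcribe this audio with speaker labels.
The speakers are: Sam (host) and one guest.
Timestamp rules (format HH:MM:SS):
- Add a timestamp whenever the speaker changes.
- Within the same speaker's turn, add a timestamp every 30 seconds.
- Do not add a timestamp to every sentence."""
```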
Finally, the workflow chains tasks: after generating a timestamped transcript, the transcript is summarized into bullet-point notes with timestamps at the end of each idea (hours:minutes:seconds). Those timestamps then become navigation anchors for jumping back into the recording. For podcasts longer than the model can fully transcribe, the approach is to request overlapping segments (e.g., start at 1:30:00 for the next chunk) and then stitch them using fuzzy matching on the overlap window.
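For the stitching step, a minimal sketch using Python’s difflib for the fuzzy match; the overlap window size and minimum-match threshold are assumptions, not values from the video.

```python
from difflib import SequenceMatcher

def stitch(first: str, second: str, window_chars: int = 2_000) -> str:
    """Join two transcripts generated from overlapping audio segments."""
    start = max(0, len(first) - window_chars)
    tail, head = first[start:], second[:window_chars]
    # Find the longest run of shared text inside the overlap window.
    m = SequenceMatcher(None, tail, head).find_longest_match(0, len(tail), 0, len(head))
    if m.size < 50:                      # too little shared text: fall back to plain concat
        return first + "\n" + second
    # Keep the first transcript up to the shared run, then continue with the second.
    return first[:start + m.a] + second[m.b:]
```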
The transcript closes by contrasting this approach with raw transcription tools like Whisper, which may not deliver diarization. It also points to AI Studio for quick experimentation and sketches downstream possibilities: turning transcripts into podcast-style summaries (via NotebookLM-style prompting) and using TTS systems to generate new audio—while flagging legal and voice-rights considerations.
Cornell Notes
Gemini 2.5 Pro makes podcast transcription more usable by raising the generation ceiling to 64,000 tokens, which is enough to output transcripts for about two hours of audio in many cases. Audio is processed at a rate of ~32 tokens per second (about 115,000 tokens per hour), and long recordings may require splitting to stay under token limits. The model can produce diarized transcripts—mapping who speaks when—often using conversational cues like speakers addressing each other by name, and it can also use a provided speaker list when names aren’t explicit. A practical pipeline uploads audio via the upload API, generates a timestamped transcript with diarization, then summarizes that transcript into bullet notes with timestamps for fast navigation and quoting. For recordings beyond the limit, overlapping segments and fuzzy matching help stitch transcripts together.
- Why does the 64,000-token limit change audio transcription workflows?
- How do token rates and token limits affect how long an audio file can be transcribed in one go?
- What practical steps make audio transcription work reliably (upload vs inline)?
- How does Gemini 2.5 Pro handle diarization, and what cues does it rely on?
- How can timestamps be made more useful than “one timestamp per sentence”?
- What’s the recommended method for transcribing audio longer than the model can fully output?
Review Questions
- If a podcast is longer than two hours, what overlap strategy would you use to stitch transcripts together, and why does overlap matter?
- How would you modify a prompt to reduce timestamp noise while still preserving diarization accuracy?
- What limitations arise from down-sampling to 16 kHz and converting stereo to a single channel, and how might that affect certain audio-analysis tasks?
Key Points
1. Gemini 2.5 Pro’s 64,000-token generation limit makes full podcast-length transcription more practical, enabling roughly two hours of transcript output in many workflows.
2. Audio tokenization runs at about 32 tokens per second (~115,000 tokens per hour), so long recordings may require splitting to stay under token ceilings.
3. Use the upload API for audio (up to 2 GB per file) instead of embedding audio inline, which is limited to 20 MB per call.
4. Gemini 2.5 Pro can produce diarized transcripts without the traditional embedding-and-clustering pipeline, often leveraging conversational cues like speakers addressing each other by name.
5. Prompt timestamping can be tuned to reduce clutter, e.g., a timestamp on speaker changes and every 30 seconds within the same speaker’s turn.
6. After transcription, summarizing the transcript into bullet notes with timestamps enables fast navigation and quote-finding without re-listening.
7. For recordings beyond output limits, transcribe overlapping segments and stitch them using fuzzy matching on the overlap window.