Gemini TTS - Native Audio Out
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Native audio out in preview lets Gemini generate single-speaker and multi-speaker audio with voice selection and delivery-style control.
Briefing
Google’s “native audio out” for Gemini is now available in preview, letting developers generate speech directly from Gemini models with controllable voice and delivery style—single-speaker narration or multi-speaker “podcast-like” dialogue. The practical payoff is that prompts can shape not just what gets said, but how it’s performed (e.g., excited, whispering, stern), and multi-speaker configurations can produce back-and-forth conversations that resemble NotebookLM-style audio interactions.
In the walkthrough, the key capability is exposed through Google's Gemini API: after initializing the SDK and selecting a text-to-speech-capable model, a request specifies (1) a prompt that includes both the delivery-style instruction and the actual text to speak, and (2) a response modality set to audio. Voice selection is configured inside the speech settings, and the returned audio is delivered in the response candidates. The author notes a workflow difference: when running scripts outside of Google Colab, the audio payload sometimes arrives base64-encoded and needs decoding, while Colab appears to handle it without extra steps. The response also includes metadata such as token counts, though pricing for these audio tokens remains unclear during preview.
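A minimal single-speaker sketch of that flow, assuming the current google-genai Python SDK; the model name, voice name, and raw-PCM output format follow Google's public preview documentation and may change while the feature is in preview:

```python
import wave

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set the GOOGLE_API_KEY env var

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # preview TTS model name (may change)
    contents="Say excitedly: Welcome back to the channel, everyone!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],  # request audio instead of text
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"  # one of the prebuilt voices listed in AI Studio
                )
            )
        ),
    ),
)

# The audio payload comes back in the first candidate's parts. Per the video,
# outside Colab it may arrive base64-encoded; if so, decode it first with
# base64.b64decode(pcm) before writing.
pcm = response.candidates[0].content.parts[0].inline_data.data

# The preview docs describe the payload as 16-bit mono PCM at 24 kHz,
# so wrap it in a WAV container for playback.
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```

The Pro preview variant should accept the same request shape; only the model name changes.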
Two preview models are highlighted for native audio out: Gemini 2.5 Flash and Gemini 2.5 Pro. The Flash model is recommended for general voice quality, while the Pro model may be better when multi-speaker scenarios require more nuanced emotional direction. In AI Studio, users can audition voices before coding, and the UI supports generating both single-speaker and multi-speaker audio.
For single-speaker output, the prompt structure is central. Delivery cues like “say excitedly,” “whisper softly,” “laughing and giggling,” or “stern and more angry” can be placed directly in the prompt, and the model tends to follow those instructions—though the results can sometimes sound “overacted.” The walkthrough also flags that laughter control is imperfect and somewhat stochastic: the model may place laughter at the beginning, end, or both depending on generation randomness. Temperature can be tuned, but the behavior remains variable.
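A few illustrative prompt variants, patterned on the cues from the video (the exact wording here is hypothetical):

```python
# Hypothetical delivery-cue prompts: the style instruction is plain natural
# language carried in the same string as the text to be spoken.
prompts = [
    "Say excitedly: We just hit one million downloads!",
    "Whisper softly: Don't wake the baby.",
    "Say this while laughing and giggling: You will not believe what happened today.",
    "Say this in a stern, angrier tone: This is your final warning.",
]

# Each prompt can be passed as `contents` in the request shown earlier.
# Lowering temperature (set via GenerateContentConfig) can damp run-to-run
# variation in expressive cues such as laughter placement, but per the
# walkthrough it does not eliminate it.
```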
For multi-speaker audio, the workflow shifts to generating a dialogue transcript first, then converting it to speech using multi-speaker voice configuration. A sample transcript has two named speakers—David and Jenny—discussing the woolly mammoth revival, with occasional interruptions. The resulting audio alternates voices accordingly, producing a fast path from text generation to a podcast-style conversation. The author suggests that adding per-line speaking guidance can improve performance, but that doing so may increase the risk of exaggerated acting.
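A multi-speaker sketch under the same SDK assumptions; the David/Jenny names mirror the video's sample transcript, while the transcript lines and voice names here are illustrative:

```python
import wave

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Step 1 in the workflow is generating a transcript (with any text model);
# here a short hand-written stand-in is used. Each line is labeled with the
# speaker name that the voice configuration below refers to.
transcript = """\
David: So scientists actually want to bring back the woolly mammoth?
Jenny: Not exactly bring it back. More like engineer a cold-adapted elephant.
David: Wait, wait. An elephant in a fur coat?
Jenny: (laughs) That's one way to put it, yes.
"""

# Step 2 maps each named speaker to a prebuilt voice.
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=transcript,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="David",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Jenny",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                ]
            )
        ),
    ),
)

# Save the alternating-voice audio, same PCM assumptions as the earlier sketch.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("dialogue.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```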
Finally, the discussion turns to cost and deployment tradeoffs. Pricing for native audio out in preview isn’t confirmed, raising the question of whether cloud-based generation will be economical enough versus open, locally run TTS systems. Local models remain attractive for real-time use where avoiding cloud round trips can improve speed and responsiveness.
Cornell Notes
Native audio out for Gemini is available in preview, enabling developers to generate speech from Gemini with controllable voice and speaking style. Using the Gemini API, requests set the response modality to audio and include prompt instructions (what to say plus how to say it) along with voice configuration. Two preview models, 2.5 Flash and 2.5 Pro, are available; Flash is suggested for general quality, while Pro may help with emotion-heavy scenarios. Single-speaker prompts can drive whispering, excitement, stern delivery, and even laughter, though results can be stochastic and sometimes overacted. Multi-speaker output works by generating a dialogue transcript and then applying multi-speaker voice configuration to produce podcast-like back-and-forth audio.
- How does native audio out turn a text prompt into actual audio in the Gemini API workflow?
- What prompt techniques matter most for shaping how the voice performs (not just what it says)?
- Why is laughter and other expressive behavior described as unpredictable?
- How do multi-speaker "podcast" outputs work end-to-end?
- When should developers choose 2.5 Flash versus 2.5 Pro?
- What practical concerns remain during preview, especially around cost and deployment?
Review Questions
- What request parameters are required to generate audio (modality, prompt structure, and voice configuration), and where does the audio payload appear in the response?
- How does the two-step approach (transcript generation followed by multi-speaker voice configuration) enable podcast-like dialogue?
- What kinds of prompt instructions reliably change speaking style, and which expressive effects remain inconsistent?
Key Points
1. Native audio out in preview lets Gemini generate single-speaker and multi-speaker audio with voice selection and delivery-style control.
2. API calls require the response modality set to audio, plus speech/voice configuration embedded in the request.
3. Prompt instructions can steer performance (excited, whispering, stern), but results may sometimes sound overacted.
4. Laughter and other expressive cues behave stochastically, so temperature tuning helps but doesn't guarantee consistent placement.
5. Two preview models are available: 2.5 Flash (recommended for general voice quality) and 2.5 Pro (potentially better for emotion-heavy direction).
6. Multi-speaker audio is produced by generating a dialogue transcript with named speakers, then applying multi-speaker voice configuration to map voices to speakers.
7. Preview pricing and audio-token cost behavior remain unclear, so developers should plan for uncertainty and compare against local TTS options for real-time needs.