
Gemini TTS - Native Audio Out

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Native audio out in preview lets Gemini generate single-speaker and multi-speaker audio with voice selection and delivery-style control.

Briefing

Google’s “native audio out” for Gemini is now available in preview, letting developers generate speech directly from Gemini models with controllable voice and delivery style—single-speaker narration or multi-speaker “podcast-like” dialogue. The practical payoff is that prompts can shape not just what gets said, but how it’s performed (e.g., excited, whispering, stern), and multi-speaker configurations can produce back-and-forth conversations that resemble NotebookLM-style audio interactions.

In the walkthrough, the key capability is exposed through Google’s Gemini API: after initializing the SDK and selecting a text-to-speech-capable model, a request specifies (1) a prompt that includes both the instruction for delivery style and the actual text, and (2) a response modality set to audio. Voice selection is configured inside the speech settings, and the returned audio is delivered in the response candidates. The author notes a workflow difference: when running scripts outside of Google Colab, the audio payload sometimes needs conversion to base64, while Colab appears to handle it without extra steps. The response also includes metadata such as token counts, though pricing details for these audio tokens remain unclear during preview.
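The request described above can be sketched as a plain JSON-style body. This is a minimal sketch of the REST-shaped payload: the field names follow Google's Gemini API documentation at the time of writing, and the preview model name and the voice name "Kore" are assumptions that may differ in your environment.

```python
import json

# Minimal sketch of a single-speaker native-audio-out request body.
# The prompt carries both the delivery instruction and the text to speak;
# responseModalities asks for audio back instead of text.
request_body = {
    "contents": [{
        "parts": [{
            "text": "Say excitedly: We just shipped native audio out!"
        }]
    }],
    "generationConfig": {
        "responseModalities": ["AUDIO"],
        "speechConfig": {
            # Voice selection is nested inside the speech settings.
            "voiceConfig": {
                "prebuiltVoiceConfig": {"voiceName": "Kore"}
            }
        },
    },
}

print(json.dumps(request_body, indent=2))
```

A body like this would be posted to a TTS-capable preview model (e.g., a 2.5 Flash TTS variant); the audio bytes then come back in the response candidates.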

Two preview models are highlighted for native audio out: 2.5 Flash and 2.5 Pro. The Flash model is recommended for general voice quality, while the Pro model may be better when multi-speaker scenarios require more nuanced emotional direction. In AI Studio, users can audition voices before coding, and the UI supports generating both single-speaker and multi-speaker audio.

For single-speaker output, the prompt structure is central. Delivery cues like “say excitedly,” “whisper softly,” “laughing and giggling,” or “stern and more angry” can be placed directly in the prompt, and the model tends to follow those instructions—though the results can sometimes sound “overacted.” The walkthrough also flags that laughter control is imperfect and somewhat stochastic: the model may place laughter at the beginning, end, or both depending on generation randomness. Temperature can be tuned, but the behavior remains variable.

For multi-speaker audio, the workflow shifts to generating a dialogue transcript first, then converting it to speech using multi-speaker voice configuration. A sample transcript has two named speakers—David and Jenny—discussing the woolly mammoth revival, with occasional interruptions. The resulting audio alternates voices accordingly, producing a fast path from text generation to a podcast-style conversation. The author suggests that adding per-line speaking guidance can improve performance, but that doing so may increase the risk of exaggerated acting.
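The multi-speaker voice mapping can be sketched as a configuration fragment. The structure follows the Gemini API's documented multi-speaker speech config, but the voice names ("Puck", "Kore") are illustrative assumptions; what matters is that each `speaker` field matches a name used in the transcript.

```python
# Sketch of the multi-speaker speech configuration: each named speaker
# from the transcript is mapped to a distinct prebuilt voice.
multi_speaker_config = {
    "multiSpeakerVoiceConfig": {
        "speakerVoiceConfigs": [
            {
                "speaker": "David",
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Puck"}},
            },
            {
                "speaker": "Jenny",
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Kore"}},
            },
        ]
    }
}

speakers = [c["speaker"] for c in
            multi_speaker_config["multiSpeakerVoiceConfig"]["speakerVoiceConfigs"]]
print(speakers)  # ['David', 'Jenny']
```

This fragment replaces the single-speaker `voiceConfig` inside the speech settings when converting a transcript to dialogue audio.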

Finally, the discussion turns to cost and deployment tradeoffs. Pricing for native audio out in preview isn’t confirmed, raising the question of whether cloud-based generation will be economical enough versus open, locally run TTS systems. Local models remain attractive for real-time use where avoiding cloud round trips can improve speed and responsiveness.

Cornell Notes

Native audio out for Gemini is available in preview, enabling developers to generate speech from Gemini with controllable voice and speaking style. Using the Gemini API, requests set the response modality to audio and include prompt instructions (what to say plus how to say it) along with voice configuration. Two preview models—2.5 Flash and 2.5 Pro—are available; Flash is suggested for general quality, while Pro may help with emotion-heavy scenarios. Single-speaker prompts can drive whispering, excitement, stern delivery, and even laughter, though results can be stochastic and sometimes overacted. Multi-speaker output works by generating a dialogue transcript and then applying multi-speaker voice configuration to produce podcast-like back-and-forth audio.

How does native audio out turn a text prompt into actual audio in the Gemini API workflow?

The workflow initializes the Gemini SDK, selects a TTS-capable model (2.5 Flash or 2.5 Pro), and sends a request that includes: (1) a prompt that combines delivery instructions with the text to speak, and (2) the response modality set to audio. Voice selection is nested inside the speech configuration, which is passed with the request. The audio bytes come back inside the response candidates; the author notes that standalone scripts may need to convert the payload to base64, while Google Colab can work without that extra step.
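Once the raw audio bytes are extracted from the response, they typically need a WAV container before playback. The sketch below uses only the Python standard library; the 24 kHz / 16-bit / mono parameters match what the TTS preview is documented to return, but treat them as assumptions to verify against the current docs.

```python
import base64
import io
import wave

def pcm_to_wav(audio_data: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container.

    If the payload arrives base64-encoded (as it can outside Colab),
    decode it first with base64.b64decode() before calling this.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)         # mono
        wf.setsampwidth(2)         # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(audio_data)
    return buf.getvalue()

# Demo with a dummy base64 payload standing in for response bytes.
fake_b64 = base64.b64encode(b"\x00\x01" * 100)
wav_bytes = pcm_to_wav(base64.b64decode(fake_b64))
print(wav_bytes[:4])  # b'RIFF'
```

The resulting bytes can be written to a `.wav` file or played directly in a notebook audio widget.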

What prompt techniques matter most for shaping how the voice performs (not just what it says)?

Delivery cues placed at the start of the prompt—such as “say excitedly,” “whisper softly,” “laughing and giggling,” or “stern and more angry”—tend to steer the speaking style. The walkthrough also mentions formatting tricks like using a colon to separate instructions from the spoken text, and optionally wrapping instructions in quotes for longer prompts. Even with good cues, the output can sound overacted, and laughter control is inconsistent.
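The colon-separated prompt layout described above can be captured in a small helper. This function is purely a hypothetical convenience for illustration, not part of any SDK; the 80-character threshold for switching to quoted text is an arbitrary assumption.

```python
def build_tts_prompt(style: str, text: str) -> str:
    """Prefix the spoken text with a delivery-style instruction.

    Uses a colon to separate the instruction from the text, and wraps
    longer passages in quotes to keep the two clearly distinct.
    """
    if len(text) > 80:
        return f'{style}: "{text}"'
    return f"{style}: {text}"

print(build_tts_prompt("Whisper softly", "The mammoths are coming back."))
# Whisper softly: The mammoths are coming back.
```

Cues like "Say excitedly" or "Stern and more angry" would be passed as the `style` argument.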

Why is laughter and other expressive behavior described as unpredictable?

Expressive effects like laughter are treated as stochastic outcomes of the underlying generation process. The author reports that laughter may appear at the beginning of a line, at the end, or at both, depending on the generation. Temperature can be adjusted, but the behavior still varies across runs.

How do multi-speaker “podcast” outputs work end-to-end?

Multi-speaker generation is built in two steps: first, generate a dialogue transcript (~200 words in the example) with named speakers and interaction rules (e.g., occasional interruptions). Second, convert that transcript to audio using multi-speaker voice configuration, where each speaker (e.g., Jenny and David) is assigned a specific voice. Listening to the result shows alternating voices and a conversational cadence similar to a podcast.
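The first step—producing a transcript the voice configuration can key on—can be sketched as follows. The helper and the dialogue lines are illustrative; the essential point is that the speaker labels in the transcript must match the `speaker` fields in the multi-speaker voice configuration exactly.

```python
def format_transcript(turns: list[tuple[str, str]]) -> str:
    """Render (speaker, line) pairs into 'Speaker: line' transcript text."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)

# Example turns in the style of the walkthrough's mammoth dialogue,
# including an interruption.
turns = [
    ("David", "So, woolly mammoths. They're actually bringing them back?"),
    ("Jenny", "Well, a mammoth-elephant hybrid, but yes, that's the plan."),
    ("David", "Wait, hold on, a hybrid?"),
]
print(format_transcript(turns))
```

In practice, step one would have Gemini generate these turns from a topic prompt; step two sends the formatted transcript plus the multi-speaker voice configuration to a TTS-capable model.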

When should developers choose 2.5 Flash versus 2.5 Pro?

The walkthrough recommends experimenting with both, but notes that 2.5 Flash often produces voices that “work out really well” for general use. 2.5 Pro is suggested as a better candidate for multi-speaker scenarios where prompts emphasize emotion and nuanced delivery.

What practical concerns remain during preview, especially around cost and deployment?

Token and pricing details for audio generation appear uncertain in preview. The response metadata includes token counts, but the author couldn’t confirm how audio-related tokens are priced compared with standard tokens. The discussion also raises deployment tradeoffs: cloud-based native audio out may be costly or slower than local open TTS systems, which can be advantageous for real-time applications.

Review Questions

  1. What request parameters are required to generate audio (modality, prompt structure, and voice configuration), and where does the audio payload appear in the response?
  2. How does the two-step approach (transcript generation followed by multi-speaker voice configuration) enable podcast-like dialogue?
  3. What kinds of prompt instructions reliably change speaking style, and which expressive effects remain inconsistent?

Key Points

  1. Native audio out in preview lets Gemini generate single-speaker and multi-speaker audio with voice selection and delivery-style control.

  2. API calls require the response modality set to audio, plus speech/voice configuration embedded in the request.

  3. Prompt instructions can steer performance (excited, whispering, stern), but results may sometimes sound overacted.

  4. Laughter and other expressive cues behave stochastically, so temperature tuning helps but doesn’t guarantee consistent placement.

  5. Two preview models are available: 2.5 Flash (recommended for general voice quality) and 2.5 Pro (potentially better for emotion-heavy direction).

  6. Multi-speaker audio is produced by generating a dialogue transcript with named speakers, then applying multi-speaker voice configuration to map voices to speakers.

  7. Preview pricing and audio-token cost behavior remain unclear, so developers should plan for uncertainty and compare against local TTS options for real-time needs.

Highlights

Native audio out supports both single-speaker narration and multi-speaker, podcast-like dialogue by combining Gemini text generation with audio conversion.
Delivery style cues in prompts (excited, whispering, stern) can meaningfully change how the model performs the spoken output.
Multi-speaker “podcasts” are created by generating a structured transcript with speaker labels, then assigning distinct voices per speaker in multi-speaker voice configuration.
Expressive behaviors like laughter are inconsistent across runs, reflecting stochastic generation rather than deterministic control.

Topics

  • Gemini TTS
  • Native Audio Out
  • Multi-Speaker Speech
  • Prompt-Based Voice Style
  • Audio API Integration