Kyutai STT & TTS - A Perfect Local Voice Solution?
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Kyutai TTS and Kyutai's speech-to-text (STT) release target fast local speech workflows, currently supporting English and French.
Briefing
Kyutai's latest local speech stack pairs fast, small models (Kyutai TTS for text-to-speech, Kyutai STT for ASR) with voice conditioning that can sound remarkably natural, including on-the-fly voice cloning from a short sample. The practical catch: Kyutai is not releasing the voice-embedding (cloning) model itself, so users are limited to a curated set of downloadable voice embeddings and cannot straightforwardly fine-tune for new languages or custom voices.
The ASR side (speech-to-text) is currently limited to English and French, but it transcribes quickly and appears to run efficiently on capable hardware. The bigger story is the architecture Kyutai has been building toward: low-latency speech in, speech out, with ASR feeding into a larger language model and then returning through TTS. Earlier experiments reportedly showed odd behavior when prompting the integrated model, but the latency-focused design remained the key attraction—making local, interactive voice workflows more feasible.
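The low-latency loop described above can be sketched end to end. Every function below is a hypothetical placeholder standing in for one component of the pipeline (local STT, a larger language model, TTS); none of these names are Kyutai's actual API:

```python
# Conceptual sketch of the speech-in -> speech-out loop.
# All function names and return values are illustrative placeholders.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for the local STT model (English/French only)."""
    return "hello there"

def generate_reply(text: str) -> str:
    """Stand-in for the larger language model in the middle."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in for the TTS model; returns fake 'audio' bytes."""
    return text.encode("utf-8")

def voice_turn(audio_chunk: bytes) -> bytes:
    # speech in -> transcript -> LLM reply -> speech out
    return synthesize(generate_reply(transcribe(audio_chunk)))

print(voice_turn(b"...").decode("utf-8"))  # -> You said: hello there
```

The point of the design is that each stage streams with low latency, so the round trip stays fast enough for interactive use.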
On the TTS side, Kyutai TTS uses a 1.6B-parameter model and supports English and French, with multiple selectable voices. The quality is framed as competitive with other well-known TTS systems (comparisons are drawn to Chatterbox, Dia, and ElevenLabs), and the release includes a strong demonstration of voice cloning: the TTS is conditioned on a 10-second voice sample and reproduces intonation and voice characteristics even when the sample is "weird" or stylized. In the example, the system quickly generates speech in the target voice, suggesting the model can capture prosody and identity cues from limited audio.
Where the release frustrates power users is in what's missing. Kyutai provides voice embeddings and a repository of voices based on datasets such as Expresso and VCTK, but explicitly withholds the voice-embedding model so that cloning happens only with consent. That means developers can't directly fine-tune the cloning pipeline for other languages or train their own voice-to-embedding mapping without additional tooling or data. The transcript notes Kyutai is "exploring ideas" for expanding beyond English and French, but no fine-tuning path is available yet.
The practical workaround comes from the released materials on Hugging Face and Kyutai's GitHub code examples. The voices are distributed as safetensors embeddings (the voice repository offers a "safetensors" download for the embeddings and a "web" option for listening to samples). By loading an embedding, conditioning the Kyutai TTS model on it, and generating audio from custom text, the user can produce speech in any of the provided voices. The transcript also shows experimentation with the embeddings themselves: inspecting their tensor shape, saving them, and blending two voices by averaging their embeddings to create an in-between "latent space" voice. The result is speech that sits between the two source voices rather than simply switching between them.
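The embedding-blending experiment can be sketched in PyTorch. The embedding width and file name below are assumptions made so the example is self-contained (the released voices are actually safetensors files downloaded from the Hugging Face voice repository, not random tensors):

```python
import torch

# Fabricate two "voice embeddings" in place of the real downloaded
# safetensors files; 512 is an assumed width, not Kyutai's actual size.
dim = 512
voice_a = torch.randn(1, dim)
voice_b = torch.randn(1, dim)

# Averaging two voice embeddings yields a point between them in latent
# space -- the "in-between" voice demonstrated in the video.
blended = (voice_a + voice_b) / 2

print(blended.shape)  # same shape as either source embedding

# The blended tensor can be saved and later passed to the TTS model as
# a conditioning voice, just like one of the stock embeddings.
torch.save(blended, "blended_voice.pt")
```

Unequal weights (e.g. `0.7 * voice_a + 0.3 * voice_b`) would bias the result toward one source voice, which is a natural next experiment with the same mechanism.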
Overall, the release is positioned as a strong local foundation for ASR and TTS, with additional promise if/when an MLX version arrives for Mac hardware. For now, the system delivers fast, high-quality speech synthesis and usable voice control—just not the full cloning model needed for unrestricted customization.
Cornell Notes
Kyutai's TTS and speech-to-text releases bring fast local speech capabilities, currently limited to English and French. Kyutai TTS uses a 1.6B-parameter model and can generate natural-sounding speech conditioned on voice identity, including a demo of voice cloning from a 10-second sample. The quality is competitive in side-by-side comparisons, but Kyutai does not release the voice embedding/cloning model directly. Instead, users get a repository of pre-made voice embeddings (safetensors files) and can run Kyutai's code to synthesize speech, pick voices, and even blend voices by averaging embeddings. This enables experimentation, but it blocks straightforward fine-tuning for new languages or fully custom voice cloning.
What makes Kyutai’s speech stack feel “local” and low-latency, and where does the transcript say the key workflow fits?
How does Kyutai TTS handle voice identity, and what evidence is given for voice cloning quality?
Why can’t users fully clone arbitrary voices or fine-tune the system for other languages, according to the transcript?
What exactly is released for voice control, and how does the transcript use it in code?
How does the transcript demonstrate creating new “in-between” voices from existing embeddings?
What dataset and model-labeling details are mentioned for training the TTS system?
Review Questions
- What components make up Kyutai’s local speech workflow, and how does the transcript connect ASR output to TTS input through a low-latency pipeline?
- How does the transcript's code approach voice conditioning using safetensors embeddings, and what steps change when generating multiple utterances with the same voice?
- What are the practical consequences of Kyutai not releasing the voice embedding model, and how does embedding blending partially compensate for that limitation?
Key Points
- 1
Kyutai TTS and Kyutai's speech-to-text (STT) release target fast local speech workflows, currently supporting English and French.
- 2
Kyutai TTS uses a 1.6B-parameter model and can produce natural-sounding speech conditioned on voice identity, including a 10-second voice-sample cloning demo.
- 3
Kyutai provides voice embeddings and a voice repository, but withholds the voice embedding/cloning model to restrict non-consensual cloning.
- 4
Users can run the Hugging Face/GitHub examples to load safetensors voice embeddings, condition the TTS model, and generate audio from custom text.
- 5
Voice embeddings can be inspected and manipulated in PyTorch, including saving and blending embeddings by averaging to create intermediate voices.
- 6
The transcript highlights the training scale for Kyutai TTS: 2.5 million hours of audio labeled with Whisper-medium.
- 7
An MLX version is flagged as a likely next step for running locally on a Mac laptop.