
Clone ANY Voice for Free — Qwen Just Changed Everything

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Qwen’s Qwen 3 TTS family is open-sourced, including voice cloning and voice design, removing earlier API-only access barriers.

Briefing

Open-source voice cloning and “voice design” have moved from closed, API-only systems into the open TTS ecosystem: Qwen has released its Qwen 3 TTS family as open weights, including tools for cloning, designing, and generating voices. That shift matters because it lets developers download models, run them locally, and build multilingual voice experiences without being locked into a single vendor’s interface—an important change after months of rapid TTS progress that largely stayed behind paywalled access.

The Qwen 3 TTS family arrives with two model sizes. A smaller 0.6B model supports multiple languages and streaming, and, crucially, comes with both a base model and a fine-tuned variant. That pairing is positioned as a practical on-ramp for creating custom voices: instead of only using pre-made speaker presets, developers can fine-tune from the open base. The larger 1.7B model retains the smaller model’s multilingual capabilities while adding instruction control for voice design and voice cloning. In practice, that means text plus an instruction prompt can shape the output voice (for example, describing a “young anime” tone or a “documentary” delivery), and voice cloning can be triggered by providing a short audio sample so the system extracts a voice representation and renders new speech in that voice.
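
As a rough illustration of how the two tiers might be used, the sketch below assumes a hypothetical `qwen_tts` Python package and placeholder repository names; the real package, class names, repo IDs, and call signatures should be taken from the official model cards.

```python
# Hypothetical sketch only: `qwen_tts`, the repo IDs, and the method names
# are placeholders, not the released API. Check the model cards for the
# actual loading and generation code.
from qwen_tts import TTSModel  # hypothetical package and class

# Smaller 0.6B tier: multilingual, streaming, preset speakers, open base for fine-tuning.
small = TTSModel.from_pretrained("Qwen/placeholder-tts-0.6b")  # placeholder repo ID
clip = small.generate(text="Hello from the small model.", speaker="preset_speaker_1")

# Larger 1.7B tier: adds instruction control for voice design.
large = TTSModel.from_pretrained("Qwen/placeholder-tts-1.7b")  # placeholder repo ID
clip = large.generate(
    text="Welcome to tonight's documentary.",
    instruction="A calm, deep documentary narrator voice.",  # voice-design prompt
)
```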

A major theme is reducing the barrier for non-English users. The 0.6B model supports 10 languages, 9 dialects, and 49 timbres (preset voices), as described in the announcements, and the open release includes tokenizers and related components. That openness is framed as enabling training and adaptation for additional languages and dialects—especially relevant when many earlier systems performed best in English.

The transcript also highlights quality and usability features that go beyond “sound it out phonetically.” Because the system is trained on large text corpora and uses a tokenizer/codebook approach, it can correctly read out structured text such as email addresses and other symbol-heavy strings without requiring manual phonetic transcription. The system is also described as fully end-to-end, with multiple token types (text tokens, codebook tokens, and speaker embeddings) feeding a streaming decoder—contrasting with older pipelines that stitched together separate modules.
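
The token flow described above can be pictured roughly as follows. This is a conceptual, stand-in sketch of the described architecture (text tokens, codebook tokens, and a speaker embedding feeding one streaming decoder), not the model’s actual internals.

```python
# Conceptual sketch of the described end-to-end flow; every name here is a
# stand-in for illustration, not the real model code.
from dataclasses import dataclass


@dataclass
class GenerationInputs:
    text_tokens: list[int]            # tokenized input text (emails, dates, symbols included)
    speaker_embedding: list[float]    # voice identity, e.g. extracted from a reference clip
    reference_codebook_tokens: list[int] | None = None  # acoustic tokens from reference audio
    instruction_tokens: list[int] | None = None          # optional voice-design prompt (1.7B)


def streaming_decode(inputs: GenerationInputs):
    """Stand-in for the single streaming decoder: consume all token streams and
    emit acoustic (codebook) tokens incrementally, which a codec turns into
    waveform chunks as they arrive."""
    for step in range(4):          # pretend decoding loop
        acoustic_tokens = [step]   # placeholder acoustic tokens for this chunk
        yield acoustic_tokens      # downstream: codec decoder -> waveform chunk


inputs = GenerationInputs(text_tokens=[1, 2, 3], speaker_embedding=[0.1, 0.2])
for chunk in streaming_decode(inputs):
    pass  # stream each decoded chunk to audio output instead of waiting for the full clip
```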

Hands-on demos on Hugging Face show the workflow. With the 0.6B model, the user selects pre-made speakers (including Chinese dialect variants) and generates speech in other languages, sometimes with noticeable artifacts (Spanish is called out). The demos include batch inference for faster audio generation and long-form text handling such as dates and numbers. Auto language guessing is also demonstrated, supporting code-switching scenarios where a mostly single-language script can include brief shifts.
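
A hedged sketch of what the batch part of that workflow might look like, reusing the hypothetical interface from above; the speaker name, `language="auto"` flag, and repo ID are placeholders rather than documented arguments.

```python
# Hypothetical batch-generation sketch (same placeholder `qwen_tts` interface as above).
from qwen_tts import TTSModel  # hypothetical package and class

model = TTSModel.from_pretrained("Qwen/placeholder-tts-0.6b")  # placeholder repo ID

texts = [
    "La velocidad de la luz es de unos 299.792 kilómetros por segundo.",  # Spanish, numbers
    "The meeting is scheduled for March 3rd at 10:30 AM.",                # dates and times
]

# Batch inference: generate several clips in one call instead of looping,
# which the demos describe as noticeably faster for many outputs.
clips = model.generate_batch(
    texts=texts,
    speaker="preset_speaker_1",  # placeholder preset speaker name
    language="auto",             # auto language guessing, as shown in the demo
)
```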

With the 1.7B model, instruction prompts drive voice style changes—cartoon, villain, documentary, and emotional variants like neutral, excited, whispering, loud/soft, and dramatic. Voice cloning is demonstrated using roughly a 10-second sample audio, producing new speech in the cloned voice. The results are described as impressive but not perfect, with the transcript noting that real-world audio cleanup (e.g., noise reduction) can affect the clone’s fidelity.
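
The cloning step as described (a roughly 10-second reference clip in, new speech in that voice out) might look like the following; `clone_voice` and its arguments are hypothetical placeholders, not the released API.

```python
# Hypothetical voice-cloning sketch; method names and arguments are placeholders.
from qwen_tts import TTSModel  # hypothetical package and class

model = TTSModel.from_pretrained("Qwen/placeholder-tts-1.7b")  # placeholder repo ID

# ~10 seconds of reference audio of the target voice. Per the transcript, input
# quality matters: aggressive noise reduction can cut parts of the voice and
# reduce the clone's fidelity.
voice = model.clone_voice("reference_clip_10s.wav")

clip = model.generate(
    text="This sentence is rendered in the cloned voice.",
    voice=voice,
)
```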

Finally, the release includes a Hugging Face space for trying the models and a paper referencing training on over 5 million hours of speech data. The overall takeaway is that Qwen’s open-sourced Qwen 3 TTS family makes advanced voice cloning and instruction-driven voice generation more accessible for experimentation, fine-tuning, and future deployment on smaller or edge-focused runtimes.

Cornell Notes

Qwen’s Qwen 3 TTS family is now open-sourced, bringing voice cloning and instruction-based “voice design” into the open TTS world. The 0.6B model supports multilingual generation (including streaming) and is released with both base and fine-tuned variants, enabling developers to create custom voices. The 1.7B model adds instruction control for shaping voice style and supports cloning from a short audio sample by extracting a voice representation. Demos on Hugging Face show multilingual speech, batch and long-form generation, auto language guessing, and improved handling of structured text like emails without manual phonetics. The release matters because it removes the earlier API-only barrier and enables local experimentation, fine-tuning, and adaptation for more languages and dialects.

What changed with the Qwen 3 TTS release, and why does it matter for builders?

The Qwen 3 TTS family moved from closed, API-only access to open-sourced weights and components. That includes voice cloning and voice design capabilities, plus downloadable model artifacts (not just a hosted endpoint). For developers, this enables local inference, experimentation, and fine-tuning workflows rather than depending on a single provider’s interface.
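
For the “download and run locally” part, the Hugging Face Hub client can fetch released weights; `snapshot_download` is a real `huggingface_hub` utility, but the repo ID below is a placeholder to be replaced with the one on the actual model card.

```python
# Download open weights for local use. snapshot_download is a real
# huggingface_hub function; the repo ID is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Qwen/placeholder-tts-0.6b")  # placeholder repo ID
print(f"Model files downloaded to: {local_dir}")
```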

How do the 0.6B and 1.7B models differ in capabilities?

The 0.6B model is the smaller option: it supports multiple languages, streams output, and ships with both a base model and a fine-tuned variant. The 1.7B model keeps those multilingual abilities but adds instruction control for voice design and supports voice cloning via an audio sample. In demos, the 1.7B model responds to prompts like “young anime voice” or “documentary voice,” and can also switch emotional delivery (neutral, dramatic, whispering, etc.).

What does “voice design” mean in this system, based on the demos?

Voice design is driven by instruction prompts layered on top of the text to be spoken. Instead of only selecting a preset speaker, the user describes the desired delivery—e.g., cartoon-like, documentary, villain, or specific emotional styles. The model then generates speech that matches the described style, and the demos show this working across languages (with varying artifacts).
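
In practice this reduces to pairing each piece of text with a free-form style instruction. The strings below are illustrative prompts in the spirit of the demos, and the `generate` call reuses the hypothetical interface sketched earlier, not a documented API.

```python
# Illustrative voice-design prompts (hypothetical `qwen_tts` interface as before).
from qwen_tts import TTSModel  # hypothetical package and class

model = TTSModel.from_pretrained("Qwen/placeholder-tts-1.7b")  # placeholder repo ID

style_prompts = {
    "cartoon":     "A high-pitched, bouncy cartoon character voice.",
    "villain":     "A slow, menacing villain voice in a low register.",
    "documentary": "A calm, measured documentary narrator.",
    "whispering":  "Whisper the line softly, as if sharing a secret.",
}

line = "The treasure was hidden beneath the old lighthouse."
for name, instruction in style_prompts.items():
    clip = model.generate(text=line, instruction=instruction)  # hypothetical call
```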

How does voice cloning work, and what are the practical limitations mentioned?

Voice cloning is triggered by providing a short segment of audio (about 10 seconds in the demo). The model extracts a representation of the voice from that sample and uses it to render new text in the cloned voice. The transcript notes it’s impressive but not perfect, and that audio preprocessing (like noise reduction) can affect clone quality—e.g., noise reduction tools may cut parts of the voice, reducing fidelity.

What evidence is given that the model handles real-world text better than phonetic-only approaches?

The transcript emphasizes training plus a tokenizer/codebook approach that lets the model pronounce structured content such as email addresses and other symbols without requiring users to write phonetics. In the long-form demo, it also handles numbers and dates (e.g., speed of light phrasing and a scheduled meeting time) in a way that’s described as “quite good,” though not flawless.
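
The kinds of inputs being exercised look roughly like the examples below; the point is that they are passed to the model as plain text, with no phonetic rewriting.

```python
# Example inputs that exercise text normalization: no phonetic spellings needed.
structured_texts = [
    "Please email support@example.com before 5 p.m. on 12/03/2025.",
    "Light travels at about 299,792 km per second.",
    "The meeting is rescheduled to Tuesday at 10:30 AM in Room 4B.",
]
# Each string would be passed as-is (hypothetical interface as above), e.g.:
# clip = model.generate(text=structured_texts[0], speaker="preset_speaker_1")
```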

What workflow features show up in the hands-on usage?

The demos include selecting pre-made speakers with metadata (speaker name, description, native language, and dialects), running batch inference to generate many outputs faster, and producing long-form speech. There’s also a demonstration of auto language guessing, which can support code-switching where most text stays in one language with occasional words in another.

Review Questions

  1. What specific capabilities does the 1.7B model add beyond the 0.6B model, and how are they triggered in prompts?
  2. Why does releasing tokenizers and fine-tuning components change what developers can do for new languages or dialects?
  3. In the demos, what kinds of text inputs are handled without manual phonetic transcription, and what does that imply about the tokenizer/codebook approach?

Key Points

  1. Qwen’s Qwen 3 TTS family is open-sourced, including voice cloning and voice design, removing earlier API-only access barriers.
  2. The 0.6B model supports multilingual generation (including streaming) and is released with both base and fine-tuned variants for custom voice creation.
  3. The 1.7B model adds instruction control for voice design (style and emotion) and supports cloning from a short audio sample.
  4. The system is positioned as handling structured text (like email addresses) without requiring phonetic rewriting, thanks to its tokenizer/codebook approach.
  5. Hugging Face demos show practical features like batch inference, long-form generation, and auto language guessing for code-switching.
  6. Voice cloning quality depends on input audio quality; noise reduction and preprocessing can materially affect results.
  7. The release cites large-scale training (over 5 million hours of speech data) and describes an end-to-end architecture feeding a streaming decoder.

Highlights

Open-source voice cloning and instruction-driven voice design are now available in the Qwen 3 TTS family, not just through closed APIs.
The 0.6B model’s base + fine-tune release is aimed at enabling custom voices rather than only using preset speakers.
The 1.7B model can generate distinct voice styles and emotions from instruction prompts, and can clone a voice from roughly a 10-second sample.
Structured text like email addresses can be pronounced without manual phonetics, reflecting tokenizer/codebook strengths.
Demos show multilingual transfer (including dialect variants) plus batch and long-form generation, with some artifacts noted in certain languages.
