Use AI to Clone Voices & Speak OTHER LANGUAGES! - Elevenlabs + ChatGPT 4
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
ElevenLabs’ newly added multilingual text-to-speech model can render the same cloned voice across multiple languages—English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi—while keeping the speaker’s identity largely intact. In side-by-side tests using a voice cloned from the creator, the multilingual model produced English that sounded close to the older monolingual model, but it also introduced noticeable instability during longer passages, including glitches, shifting tone, and background artifacts.
The first major check was a tongue-twister in English generated with ChatGPT, then spoken using both models. The monolingual and multilingual outputs were broadly similar in intelligibility and overall quality, with differences showing up more as generation variation than as a clear upgrade or downgrade. ElevenLabs’ multilingual system was also flagged as experimental, with guidance that some numbers and symbols may be mispronounced—so the tests included attention to how the model handles pronunciation rather than relying only on casual reading.
German provided a clearer contrast. When the cloned voice—trained on English—attempted German, the monolingual model delivered a heavy, thick American accent, while the multilingual model sounded more authentically German to the listener’s ear. The experiment also compared direct German output versus “phonetic” handling (spelling words out to guide pronunciation). Both approaches looked German on the page, but the multilingual model’s output still differed from the monolingual version in cadence and accent strength, suggesting the multilingual training meaningfully changes how the voice shapes foreign phonemes.
Spanish revealed the model’s current ceiling. A longer Spanish story initially sounded decent but then degraded: the audio fluctuated between quiet and loud, the voice changed character, and the narration developed a “glitchy” quality that felt less human. When the story was translated back into English and read with the same cloned voice, the monolingual model sounded smoother and more consistent, reinforcing the pattern that multilingual performance is strongest on shorter, simpler lines and weaker on extended narration.
Beyond multilingual cloning, the transcript also highlights ElevenLabs’ “voice design” workflow for creating voices from scratch. With a paid Voice Design tier, the creator generated a randomized voice, iterated on accent settings (including attempts at British, Australian, and American), and then used the new synthetic voice to read a scripted message. The generated voices were described as capable of sounding impressive, though accent controls appeared inconsistent—changing accent strength could unexpectedly alter pitch or tone.
Overall, the tests suggest ElevenLabs’ multilingual model is a real step toward identity-preserving voice translation, especially for short-form dialogue and languages like German where accent shaping is more convincing. But longer multilingual narration still needs work to eliminate artifacts and maintain stable, human-like delivery—an issue the creator repeatedly noticed when switching between multilingual and monolingual outputs.
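The cross-language workflow described above (same cloned voice, different input language, multilingual model selected) can be sketched against ElevenLabs' public HTTP API. This is a hedged illustration, not the creator's exact setup: the endpoint shape and the `eleven_multilingual_v1` model id follow ElevenLabs' API documentation at the time of the video, while `VOICE_ID` and `API_KEY` are placeholders you would supply from your own account.

```python
# Sketch: building a text-to-speech request for a cloned voice using
# ElevenLabs' HTTP API. VOICE_ID and API_KEY are placeholders (assumptions).
API_BASE = "https://api.elevenlabs.io/v1"
VOICE_ID = "your-cloned-voice-id"   # placeholder: the cloned voice's id
API_KEY = "your-xi-api-key"         # placeholder: your ElevenLabs API key

def build_tts_request(text: str, model_id: str = "eleven_multilingual_v1"):
    """Return (url, headers, payload) for a synthesis request.

    Send with any HTTP client, e.g.:
        requests.post(url, headers=headers, json=payload)
    The response body is the generated audio.
    """
    url = f"{API_BASE}/text-to-speech/{VOICE_ID}"
    headers = {"xi-api-key": API_KEY, "Content-Type": "application/json"}
    payload = {"text": text, "model_id": model_id}
    return url, headers, payload

# Same cloned voice, different languages — only the input text changes;
# the multilingual model id stays the same for both requests.
url, headers, payload_de = build_tts_request("Guten Tag, wie geht es Ihnen?")
url, headers, payload_es = build_tts_request("Había una vez un pueblo pequeño.")
```

Note the design point the video implicitly tests: language is not a request parameter here — the multilingual model infers it from the text, which is why accent quality and long-passage stability vary by language.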
Cornell Notes
ElevenLabs’ experimental multilingual text-to-speech model can speak through an identity-cloned voice in multiple languages (English, German, Polish, Spanish, Italian, French, Portuguese, Hindi). In English, the multilingual model sounded very similar to the older monolingual model, suggesting the main change is language capability rather than a dramatic quality shift. German showed the clearest improvement: the multilingual model reduced the heavy American accent compared with the monolingual model, and “phonetic” prompting changed how words were pronounced. Spanish exposed current weaknesses—longer passages developed glitches, volume swings, and unstable voice characteristics. ElevenLabs also offers voice creation from scratch via Voice Design, letting users generate new synthetic voices and adjust accent-related parameters, though results can be inconsistent.
What languages does ElevenLabs’ experimental multilingual model support, and how was that tested?
How did the multilingual model perform in English compared with the monolingual model?
Why did German sound different across models, and what role did “phonetic” prompting play?
What went wrong with Spanish narration, and what does that imply about current multilingual limits?
How does ElevenLabs’ “voice design” feature change the workflow compared with cloning?
Review Questions
- In what ways did the multilingual model improve or fail compared with the monolingual model for German and Spanish?
- What evidence from the Spanish test suggests the multilingual model struggles with longer generations?
- How did the creator’s “phonetic” German approach differ from direct German, and what effect did it have on perceived pronunciation?
Key Points
1. ElevenLabs’ experimental multilingual text-to-speech model supports English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi using the same cloned voice identity.
2. English output from the multilingual model was broadly similar to the monolingual model, with differences that looked like normal generation variation.
3. German showed stronger accent adaptation in the multilingual model, while the monolingual model produced a noticeably heavy American accent.
4. “Phonetic” prompting for German changed pronunciation guidance, but the multilingual model still delivered a distinct accent and cadence compared with monolingual output.
5. Spanish long-form narration exposed instability in the multilingual model, including glitches, volume swings, and shifting voice characteristics.
6. ElevenLabs’ Voice Design enables creating voices from scratch (not only cloning), though accent controls can behave inconsistently.
7. For now, monolingual output appears more reliable for longer English narration, while multilingual is promising for cross-language identity-preserving speech—especially on shorter lines.