
Use AI to Clone Voices & Speak OTHER LANGUAGES! - Elevenlabs + ChatGPT 4

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ElevenLabs’ experimental multilingual text-to-speech model supports English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi using the same cloned voice identity.

Briefing

ElevenLabs’ newly added multilingual text-to-speech model can render the same cloned voice across multiple languages—English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi—while keeping the speaker’s identity largely intact. In side-by-side tests using a voice cloned from the creator, the multilingual model produced English that sounded close to the older monolingual model, but it also introduced noticeable instability during longer passages, including glitches, shifting tone, and background artifacts.

The first major check was a tongue-twister in English generated with ChatGPT, then spoken using both models. The monolingual and multilingual outputs were broadly similar in intelligibility and overall quality, with differences showing up more as generation variation than as a clear upgrade or downgrade. ElevenLabs’ multilingual system was also flagged as experimental, with guidance that some numbers and symbols may be mispronounced—so the tests included attention to how the model handles pronunciation rather than relying only on casual reading.

German provided a clearer contrast. When the cloned voice—trained on English—attempted German, the monolingual model delivered a heavy, thick American accent, while the multilingual model sounded more authentically German to the listener’s ear. The experiment also compared direct German output versus “phonetic” handling (spelling words out to guide pronunciation). Both approaches looked German on the page, but the multilingual model’s output still differed from the monolingual version in cadence and accent strength, suggesting the multilingual training meaningfully changes how the voice shapes foreign phonemes.

Spanish revealed the model’s current ceiling. A longer Spanish story initially sounded decent but then degraded: the audio fluctuated between quiet and loud, the voice changed character, and the narration developed a “glitchy” quality that felt less human. When the story was translated back into English and read with the same cloned voice, the monolingual model sounded smoother and more consistent, reinforcing the pattern that multilingual performance is strongest on shorter, simpler lines and weaker on extended narration.

Beyond multilingual cloning, the transcript also highlights ElevenLabs’ “voice design” workflow for creating voices from scratch. With a paid Voice Design tier, the creator generated a randomized voice, iterated on accent settings (including attempts at British, Australian, and American), and then used the new synthetic voice to read a scripted message. The generated voices were described as capable of sounding impressive, though accent controls appeared inconsistent—changing accent strength could unexpectedly alter pitch or tone.

Overall, the tests suggest ElevenLabs’ multilingual model is a real step toward identity-preserving voice translation, especially for short-form dialogue and languages like German where accent shaping is more convincing. But longer multilingual narration still needs work to eliminate artifacts and maintain stable, human-like delivery—an issue the creator repeatedly noticed when switching between multilingual and monolingual outputs.

Cornell Notes

ElevenLabs’ experimental multilingual text-to-speech model can speak through an identity-cloned voice in multiple languages (English, German, Polish, Spanish, Italian, French, Portuguese, Hindi). In English, the multilingual model sounded very similar to the older monolingual model, suggesting the main change is language capability rather than a dramatic quality shift. German showed the clearest improvement: the multilingual model reduced the heavy American accent compared with the monolingual model, and “phonetic” prompting changed how words were pronounced. Spanish exposed current weaknesses—longer passages developed glitches, volume swings, and unstable voice characteristics. ElevenLabs also offers voice creation from scratch via Voice Design, letting users generate new synthetic voices and adjust accent-related parameters, though results can be inconsistent.

What languages does ElevenLabs’ experimental multilingual model support, and how was that tested?

The multilingual model was described as supporting English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi. The creator tested it by using a cloned voice trained on English and then generating the same content in English (tongue-twister), followed by direct German, phonetic German, Spanish (a longer story), and several other languages (French, Polish, Hindi, Portuguese, plus Italian). The comparisons were made against the older monolingual model using the same cloned voice settings.
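The side-by-side comparisons described above can be reproduced programmatically. As a rough sketch (not something the video shows), ElevenLabs exposes a REST text-to-speech endpoint keyed by voice ID; the voice ID, API key, and voice-settings values below are placeholders, while the model identifiers reflect ElevenLabs' published `eleven_monolingual_v1` and `eleven_multilingual_v1` names:

```python
# Sketch: generate the same line with the monolingual and multilingual
# ElevenLabs models using one cloned voice, via the public REST API.
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"

def build_tts_request(voice_id: str, text: str, model_id: str, api_key: str):
    """Build URL, headers, and JSON payload for one TTS generation."""
    url = f"{API_BASE}/{voice_id}"
    headers = {
        "xi-api-key": api_key,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    }
    payload = {
        "text": text,
        "model_id": model_id,
        # Stability/similarity mirror the sliders in the ElevenLabs web UI;
        # these values are illustrative defaults, not the creator's settings.
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    }
    return url, headers, payload

def synthesize(voice_id, text, model_id, api_key, out_path):
    """POST the request and save the returned MP3 bytes to disk."""
    url, headers, payload = build_tts_request(voice_id, text, model_id, api_key)
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers=headers, method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

# Example (requires a real API key and a cloned-voice ID):
# for model in ("eleven_monolingual_v1", "eleven_multilingual_v1"):
#     synthesize("YOUR_VOICE_ID", "A lemon-themed tongue twister...",
#                model, "YOUR_XI_API_KEY", f"{model}.mp3")
```

Generating both files from identical text and settings isolates the model as the only variable, which is essentially what the creator's A/B listening test does.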

How did the multilingual model perform in English compared with the monolingual model?

In English, the multilingual output was largely comparable to the monolingual version. The same lemon-themed tongue twister was read with both models, and the creator reported no major quality or pronunciation issues—differences sounded more like normal variation between generations than a systematic multilingual improvement or failure.

Why did German sound different across models, and what role did “phonetic” prompting play?

German highlighted a contrast in accent shaping. The monolingual model produced a thick American accent when speaking German, while the multilingual model sounded more authentically German. The test also compared direct German output versus a phonetic approach (spelling words out to guide pronunciation). Both methods looked German in text, but the multilingual model still differed in delivery and accent strength, implying multilingual training changes how the voice maps foreign sounds.

What went wrong with Spanish narration, and what does that imply about current multilingual limits?

Spanish narration degraded during a longer story. The creator heard audio that became very quiet then very loud, with the voice changing character and sounding glitchy—described as background artifacts and instability that made it less human. When the story was later read in English using the monolingual model, the delivery was smoother, implying the multilingual model’s stability drops as passage length increases.
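The length-dependent instability suggests a practical workaround, which is my inference rather than something the video demonstrates: split long narration into sentence-sized segments, generate each separately, and concatenate the audio afterward. A minimal splitter might look like:

```python
# Sketch of a workaround for long-passage instability: greedily pack whole
# sentences into short chunks so each TTS generation stays brief.
import re

def split_narration(text: str, max_chars: int = 250):
    """Split text on sentence boundaries, packing sentences into chunks
    of at most max_chars (a single over-long sentence stays intact)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as its own generation request; the 250-character ceiling is an arbitrary choice, not a documented ElevenLabs limit.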

How does ElevenLabs’ “voice design” feature change the workflow compared with cloning?

Instead of cloning a specific person’s voice, Voice Design lets users create voices from scratch using randomized generation and adjustable parameters. The creator noted that access to voice design and instant voice cloning costs five dollars a month and allows creation of 10 voices. Accent-related controls (e.g., British vs Australian vs American) produced inconsistent results—accent strength could shift pitch or tone in unexpected ways—yet the generated voices could still sound convincing.

Review Questions

  1. In what ways did the multilingual model improve or fail compared with the monolingual model for German and Spanish?
  2. What evidence from the Spanish test suggests the multilingual model struggles with longer generations?
  3. How did the creator’s “phonetic” German approach differ from direct German, and what effect did it have on perceived pronunciation?

Key Points

  1. ElevenLabs’ experimental multilingual text-to-speech model supports English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi using the same cloned voice identity.
  2. English output from the multilingual model was broadly similar to the monolingual model, with differences that looked like normal generation variation.
  3. German showed stronger accent adaptation in the multilingual model, while the monolingual model produced a noticeably heavy American accent.
  4. “Phonetic” prompting for German changed pronunciation guidance, but the multilingual model still delivered a distinct accent and cadence compared with monolingual output.
  5. Spanish long-form narration exposed instability in the multilingual model, including glitches, volume swings, and shifting voice characteristics.
  6. ElevenLabs’ Voice Design enables creating voices from scratch (not only cloning), though accent controls can behave inconsistently.
  7. For now, monolingual output appears more reliable for longer English narration, while multilingual is promising for cross-language identity-preserving speech—especially on shorter lines.

Highlights

Multilingual English sounded close to monolingual English, suggesting the biggest leap is language coverage rather than a major quality overhaul.
German accent shaping improved: the multilingual model sounded more German than the monolingual model, which retained a thick American accent.
Spanish long stories exposed the multilingual model’s weak spot—glitchy artifacts and unstable volume/voice behavior.
Voice Design can generate entirely new voices from scratch, but accent strength settings can produce unexpected tonal changes.
The experimental multilingual model is best treated as a work in progress for extended narration, even when short outputs sound strong.
