
Cloning my Voice Into an AI Assistant

NetworkChuck
6 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Voice cloning with Piper works best when audio clips are short (~15 seconds), mono, consistently sampled (22.05 kHz), and stripped of music and silence.

Briefing

Cloning a voice locally is possible with open-source tools—if the data is clean and the training pipeline is handled carefully. The core takeaway is that voice quality depends less on “magic” and more on disciplined preprocessing: short, noise-free audio clips; accurate transcriptions; and a training setup that matches the model’s software and hardware requirements.

The workflow starts with building a dataset. For a self-voice, the tutorial uses Piper Recording Studio to capture hundreds of spoken utterances through a microphone, then exports the recordings into Piper’s expected structure: a Wave folder plus a metadata CSV pairing each audio filename with its transcription. The process also requires audio hygiene steps—installing FFmpeg, converting to a consistent format (mono, 22.05 kHz), and removing silence and background artifacts. When pulling voice samples from YouTube instead, the same cleanliness rules apply, but the pipeline adds extra steps: downloading audio with yt-dlp, editing out music and interruptions (with Audacity), stripping silence using FFmpeg, and splitting long recordings into many short segments (no longer than ~15 seconds) so the training data matches Piper’s constraints.
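
The audio-hygiene step can be sketched with a small script. This is a minimal illustration, not the tutorial's exact code: the silence-removal thresholds and directory layout are assumptions, and FFmpeg must be installed on the system.

```python
from pathlib import Path
import subprocess

def ffmpeg_clean_cmd(src: Path, dst: Path) -> list:
    """Build an FFmpeg command that converts a clip to mono 22.05 kHz
    and trims leading/trailing silence (threshold values are a guess)."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-ac", "1",        # mono
        "-ar", "22050",    # 22.05 kHz sample rate
        "-af", ("silenceremove=start_periods=1:start_threshold=-50dB:"
                "stop_periods=1:stop_threshold=-50dB"),
        str(dst),
    ]

def clean_dataset(src_dir: Path, out_dir: Path) -> None:
    """Run the conversion over every .wav in a directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(src_dir.glob("*.wav")):
        subprocess.run(ffmpeg_clean_cmd(wav, out_dir / wav.name), check=True)
```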

Once the dataset is ready, the pipeline transcribes every clip using Whisper (Local Whisper). A Python script runs Whisper over the Wave directory and outputs a metadata CSV that aligns filenames to transcribed text—an essential requirement for training. After that, the training environment is set up by cloning the Piper repository and installing specific dependency versions. A major pain point appears with GPU compatibility: a CUDA 11.7 requirement caused issues on an NVIDIA 4090 setup, and the fix involved adjusting Piper’s requirements and installing a compatible PyTorch version. The tutorial emphasizes that these version mismatches are a common source of “headaches,” and that getting the dependency stack aligned is as important as the model itself.
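
The CSV-writing half of that transcription step can be sketched as follows. The pipe-delimited layout matches Piper's ljspeech-style metadata format; the `write_metadata` helper and clip names are illustrative, and the Whisper call (shown in a comment) assumes the open-source `openai-whisper` package.

```python
import csv
from pathlib import Path

def write_metadata(transcripts: dict, csv_path: Path) -> None:
    """Write Piper's ljspeech-style metadata: one 'clip_id|transcription'
    line per audio file (clip IDs here are illustrative)."""
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for clip_id, text in sorted(transcripts.items()):
            writer.writerow([clip_id, text.strip()])

# In the real pipeline each transcription would come from a local Whisper
# model, roughly (assumes the openai-whisper package is installed):
#   import whisper
#   model = whisper.load_model("base")
#   text = model.transcribe(str(wav_path))["text"]
```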

Training is done by fine-tuning an existing Piper checkpoint rather than starting from scratch. The script runs a preprocessing step, then trains using the checkpoint downloaded from Hugging Face, with options for GPU acceleration and multi-GPU scaling. The author demonstrates three setups: a laptop with WSL and an NVIDIA 3080, a dual 4090 server for training a second voice, and an AWS EC2 instance with multiple GPUs (a g5.12xlarge) to train another voice in the cloud. Training can be interrupted and resumed by pointing to the latest checkpoint in the lightning logs.
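
Resuming from the newest checkpoint can be automated with a small helper. The `lightning_logs/version_*/checkpoints/` layout is PyTorch Lightning's default; the helper itself is an illustrative addition, not from the tutorial.

```python
from pathlib import Path
from typing import Optional

def latest_checkpoint(training_dir: Path) -> Optional[Path]:
    """Return the most recently written .ckpt under lightning_logs/,
    or None if training has not produced one yet."""
    ckpts = sorted(
        training_dir.glob("lightning_logs/version_*/checkpoints/*.ckpt"),
        key=lambda p: p.stat().st_mtime,
    )
    return ckpts[-1] if ckpts else None
```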

After training completes, the final checkpoint is exported to ONNX format (plus a config JSON). The resulting model can be tested via Piper TTS on the command line, then integrated into a local voice assistant. The tutorial shows deploying the ONNX files to a Home Assistant server using a Samba share, configuring the Piper integration, and refreshing the Wyoming device so the new voice appears. Once wired into the assistant, the cloned voice can be used for conversation and scripted prompts.
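
A command-line smoke test of the exported voice might look like the sketch below. The `--model`/`--output_file` flags follow the Piper CLI's documented usage, with text piped in on stdin; the helper functions and file names are illustrative additions.

```python
import subprocess

def piper_test_cmd(model: str, out_wav: str) -> list:
    """Build the Piper CLI call for a quick smoke test of an exported voice.
    The .onnx model's config JSON is expected alongside it, named
    <model>.onnx.json; the text to speak is piped in on stdin."""
    return ["piper", "--model", model, "--output_file", out_wav]

def speak(text: str, model: str, out_wav: str) -> None:
    """Synthesize text with the exported voice (requires piper on PATH)."""
    subprocess.run(piper_test_cmd(model, out_wav),
                   input=text.encode(), check=True)
```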

Finally, the results highlight a practical reality: even with correct tooling, voice quality varies widely depending on the source clips. The author’s own YouTube-derived dataset produced a poor-sounding clone until the dataset was rebuilt using Piper Recording Studio with a larger number of utterances (hundreds). The “best” clone came from pristine, diverse, well-recorded samples and enough training iterations—turning voice cloning into a data-quality problem rather than a purely technical one.

Cornell Notes

Voice cloning with open-source tools is achievable locally using Piper, but the quality hinges on dataset cleanliness and correct training dependencies. The process builds a dataset (Wave folder + metadata CSV), removes silence and background noise, splits audio into short segments, and uses Local Whisper to transcribe clips into a matching CSV. Training fine-tunes a prebuilt Piper checkpoint, with GPU acceleration and checkpoint-based resume support; dependency/version mismatches (including CUDA/PyTorch issues) can otherwise derail training. After training, the model is exported to ONNX format and integrated into a Home Assistant voice assistant via the Piper integration, requiring correct file naming and device refreshes. The author’s best results came from recording fresh, pristine utterances with Piper Recording Studio rather than relying on imperfect YouTube audio.

Why does the tutorial insist on “clean data” before training a voice clone?

Because Piper’s training quality is tightly coupled to the audio segments and their transcriptions. The workflow repeatedly removes music and silence, converts recordings to a consistent mono/22.05 kHz format, and ensures each clip is short enough for training (the tutorial targets ~15 seconds). It also generates a metadata CSV that pairs each audio filename with the exact transcription, using Local Whisper. If clips include background noise, keyboard sounds, interruptions, or inaccurate transcripts, the model learns those artifacts and the resulting voice can sound distorted or “ridiculous.”

What does Piper Recording Studio produce, and why is that structure important?

Piper Recording Studio captures utterances and then exports them into Piper’s expected dataset layout: a Wave folder containing the audio files and a metadata CSV listing each audio file name alongside its transcription. The tutorial highlights that Piper training expects this exact pairing—audio files plus a CSV that matches filenames to text—so the export step is not optional housekeeping; it’s the bridge between recording and model training.

How does the pipeline handle YouTube-sourced voice data differently from self-recorded data?

YouTube audio requires extra preprocessing. The tutorial downloads audio with yt-dlp, then uses Audacity to remove music at the beginning/end and export clean mono audio at 22.05 kHz. It uses FFmpeg to strip silence, then splits long recordings into many short segments (again targeting ~15 seconds) using scripts. Finally, it transcribes the resulting clips with Local Whisper to create the metadata CSV that Piper needs.
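
Splitting long recordings into short chunks reduces to computing segment boundaries. The sketch below splits into equal-length pieces no longer than 15 seconds for simplicity; the tutorial's scripts may instead cut on silence boundaries, so treat the equal-split strategy as an assumption.

```python
import math

def segment_bounds(duration_s: float, max_len_s: float = 15.0) -> list:
    """Compute (start, end) times that split a recording into roughly
    equal segments, each no longer than max_len_s seconds."""
    n = max(1, math.ceil(duration_s / max_len_s))
    step = duration_s / n
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(n)]
```

Each (start, end) pair can then be handed to FFmpeg (for example via `-ss` and `-t`) to cut the actual audio files.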

What caused major training failures on an NVIDIA 4090 setup, and how was it resolved?

The tutorial reports that the project required CUDA 11.7, and that the CUDA 11.7 stack did not work properly with the NVIDIA 4090. The fix was to adjust Piper’s requirements and install a compatible PyTorch version rather than the mismatched stack. The broader lesson is that Piper’s older dependencies can conflict with modern environments, so aligning dependency versions is critical.
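
As a hypothetical sketch, the fix amounts to a setup step like the following. The exact version pins are illustrative, not the tutorial's stack; match them to your GPU driver and Python version using PyTorch's install selector.

```shell
# Hypothetical fix: install a PyTorch build compiled against CUDA 11.7.
# Version pins are illustrative; check PyTorch's compatibility matrix.
pip install "torch==1.13.1+cu117" --extra-index-url https://download.pytorch.org/whl/cu117
```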

How does training get sped up and scaled across hardware?

Training uses GPU acceleration when available; otherwise it falls back to CPU (much slower). The tutorial demonstrates WSL on a laptop with an NVIDIA 3080, then a dual 4090 server, and finally a cloud GPU setup on AWS EC2 (g5.12xlarge with four GPUs). It also shows multi-GPU training configuration (DDP) for the dual/multi-GPU cases, and checkpoint resume so training can be paused and continued without starting over.

How are the trained voices deployed into a Home Assistant setup?

The exported ONNX model file and its accompanying config JSON are copied to the Home Assistant server via a Samba share. The tutorial stresses correct file naming (e.g., renaming to match the assistant/voice name) and then restarting/refreshing the Piper add-on and the Wyoming device so the new voice appears in the “Text to Speech” selection. Once configured, the assistant can use the cloned voice for responses.
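
The naming constraint can be checked programmatically before copying files over. Piper's convention places the config next to the model as `<name>.onnx.json`; the helper below is an illustrative addition, and the Samba share path is whatever your Home Assistant setup uses.

```python
from pathlib import Path

def paired_voice_files(model_path: Path) -> tuple:
    """Return the (model, config) pair the Piper integration expects:
    the config JSON must sit next to the model as <name>.onnx.json.
    Raises if the config is missing, so a rename typo is caught early."""
    config = model_path.with_suffix(model_path.suffix + ".json")
    if not config.exists():
        raise FileNotFoundError(f"expected config next to model: {config}")
    return model_path, config
```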

Review Questions

  1. What specific dataset artifacts (folder/file types) does Piper require, and how does the tutorial generate them for both self-recorded and YouTube-sourced voices?
  2. Why might a voice clone sound poor even after successful training, and which preprocessing choices in the tutorial are most likely to fix that?
  3. How do checkpointing and dependency/version alignment affect the ability to resume training and avoid GPU-related failures?

Key Points

  1. Voice cloning with Piper works best when audio clips are short (~15 seconds), mono, consistently sampled (22.05 kHz), and stripped of music and silence.

  2. Piper training requires a Wave folder plus a metadata CSV that maps each audio filename to its transcription; Local Whisper is used to generate that CSV for YouTube-derived data.

  3. Dependency and hardware compatibility matter: CUDA/PyTorch mismatches (notably CUDA 11.7 issues on NVIDIA 4090) can break training unless requirements are adjusted.

  4. Fine-tuning a prebuilt Piper checkpoint is faster and more practical than training from scratch, and training can be resumed from the latest checkpoint in lightning logs.

  5. Training quality depends heavily on source material: pristine, diverse utterances recorded with Piper Recording Studio produced far better results than imperfect YouTube audio in the author’s tests.

  6. Exported models must be converted to ONNX format (plus a config JSON) and then deployed with correct naming into Home Assistant’s Piper integration, followed by add-on/device refreshes so the voice appears.

Highlights

The tutorial’s biggest “quality lever” isn’t the model—it’s the dataset: clean, short, silence-free clips with accurate transcriptions.
A CUDA 11.7 requirement caused failures on an NVIDIA 4090; the workaround was aligning Piper’s requirements with a compatible PyTorch version.
Training can be paused and resumed by pointing to the latest checkpoint from lightning logs, avoiding wasted compute.
ONNX export plus Home Assistant Piper configuration turns a trained voice model into a usable assistant voice—after careful file naming and device refresh steps.
The author’s own clone sounded bad until the dataset was rebuilt using Piper Recording Studio with hundreds of utterances, underscoring how source quality drives outcomes.

Topics

  • Voice Cloning
  • Piper TTS
  • Local Whisper
  • ONNX Export
  • Home Assistant Integration
