
my local, AI Voice Assistant (I replaced Alexa!!)

NetworkChuck · 5 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Home Assistant can run a fully local voice pipeline by combining Whisper (STT), Piper (TTS), and Open Wake Word (wake phrase) with Assist for intent-to-action control.

Briefing

A fully local voice assistant is now practical for home automation: Home Assistant can run wake word detection, speech-to-text, intent handling, and text-to-speech offline on your own hardware, then scale those pieces across multiple devices using the Wyoming protocol. The payoff is control without cloud dependence, plus the ability to swap in a local “brain” like Llama 3 for more capable, context-aware conversations that can drive real actions in the house.

The build starts with Home Assistant on a local device (a Raspberry Pi in the demo) and adds a “voice pipeline” through add-ons. Whisper provides offline speech-to-text, Piper handles offline text-to-speech, and Open Wake Word listens for a chosen wake phrase. Home Assistant’s Assist layer then connects those audio components to home automation actions, turning phrases into commands like switching lights, while keeping everything on the local network. Early tests show the system working, though it demands careful phrasing and responds with noticeable latency, which motivates the later improvements.
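To make that flow concrete, here is a minimal Python sketch of the pipeline’s control loop. The stage functions (detect_wake_word, transcribe, match_intent, speak) are hypothetical stand-ins for Open Wake Word, Whisper, Assist, and Piper, not their real APIs; the point is only the order in which audio moves through the stages.

```python
# Hypothetical sketch of the local voice pipeline order.
# None of these functions are real Open Wake Word/Whisper/Assist/Piper APIs;
# they are stand-ins that show how audio flows through the stages.

def detect_wake_word(audio_chunk: bytes) -> bool:
    """Stand-in for Open Wake Word: returns True on the wake phrase."""
    return audio_chunk == b"hey_jarvis"          # placeholder condition

def transcribe(audio: bytes) -> str:
    """Stand-in for Whisper speech-to-text."""
    return "turn off chuck lamp"                 # placeholder transcript

def match_intent(text: str) -> str:
    """Stand-in for Assist's intent matcher: phrase -> device action."""
    return "light.turn_off:light.chuck_lamp" if "turn off" in text else "unknown"

def speak(text: str) -> None:
    """Stand-in for Piper text-to-speech."""
    print(f"[TTS] {text}")

def pipeline(mic_stream):
    for chunk in mic_stream:
        if not detect_wake_word(chunk):          # 1. wait for the wake phrase
            continue
        command_audio = next(mic_stream)         # 2. capture the spoken command
        text = transcribe(command_audio)         # 3. speech -> text (Whisper)
        action = match_intent(text)              # 4. text -> intent (Assist)
        speak(f"Done: {action}")                 # 5. confirm via TTS (Piper)

pipeline(iter([b"noise", b"hey_jarvis", b"<command audio>"]))
```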

Next comes scaling beyond a single box. The Wyoming protocol turns extra hardware into “satellites” that can listen and speak while delegating the heavy lifting back to Home Assistant. A second Raspberry Pi is configured with a ReSpeaker 2-Mic Pi Hat, then the Wyoming satellite software is installed and run as a service so it stays online. Home Assistant connects to it over the network, and the assistant can control devices from anywhere in the house. The demo also adds an LED status behavior using the ReSpeaker’s pixel ring so the user can tell when the assistant is actively listening.
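Before adding a satellite in Home Assistant, it helps to confirm the Wyoming endpoint is reachable over the network. A small sketch using only Python’s standard library; the satellite address is a placeholder, and port 10700 is the one used in the demo:

```python
import socket

def wyoming_reachable(host: str, port: int = 10700, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the Wyoming endpoint succeeds.
    This only checks network reachability, not the Wyoming protocol itself."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 192.168.1.50 is a placeholder for the satellite Pi's address.
print(wyoming_reachable("192.168.1.50"))  # expect True before adding it in HA
```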

The real leap in capability arrives when the “conversation agent” is replaced with a local large language model. Instead of relying on Home Assistant’s default conversational behavior, the system points to Ollama running Llama 3.2 (downloaded and served locally). With the LLM in place, the assistant can answer factual questions and—crucially—maintain context across turns, enabling follow-up commands like turning a light back on after a prior interaction.
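That context carrying can be seen directly against Ollama’s HTTP API. A minimal sketch using the requests library; the host and port are Ollama’s defaults, and the model tag assumes `llama3.2` has already been pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint
history = []

def ask(prompt: str) -> str:
    """Send the full message history so the model keeps context across turns."""
    history.append({"role": "user", "content": prompt})
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.2", "messages": history, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Who was the first U.S. president?"))
print(ask("When was he born?"))  # "he" resolves only because history is resent
```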

To reduce bottlenecks, the pipeline is offloaded to more powerful machines using Wyoming containers. On a Windows laptop, Docker runs Wyoming Whisper (speech-to-text) and Wyoming Piper (text-to-speech), and Home Assistant switches its voice pipeline endpoints to those remote services. The demo then adds a second LLM server (“Terry,” an AI server) so multiple models can be used together. The result is a fast, fully self-hosted assistant that can control smart lighting and handle more natural conversation—while still staying offline from the cloud.
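A sketch of that offload step using Docker’s Python SDK (`pip install docker`). The image names and ports follow the rhasspy Wyoming images; the exact model and voice arguments are common defaults, so treat them as assumptions to verify against the images’ documentation:

```python
import docker  # Docker SDK for Python: pip install docker

client = docker.from_env()

# Wyoming Whisper (speech-to-text), typically served on port 10300.
client.containers.run(
    "rhasspy/wyoming-whisper",
    command=["--model", "tiny-int8", "--language", "en"],  # assumed defaults
    ports={"10300/tcp": 10300},
    detach=True,
    name="wyoming-whisper",
)

# Wyoming Piper (text-to-speech), typically served on port 10200.
client.containers.run(
    "rhasspy/wyoming-piper",
    command=["--voice", "en_US-lessac-medium"],  # assumed default voice
    ports={"10200/tcp": 10200},
    detach=True,
    name="wyoming-piper",
)
# Home Assistant then points its pipeline at <laptop-ip>:10300 and <laptop-ip>:10200.
```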

The build ends with the remaining gaps compared with Alexa: custom wake word training (e.g., “Terry”) and custom voice generation aren’t fully solved yet. The next step is training a new wake word using Open Wake Word’s Google Colab workflow and uploading the resulting model files into Home Assistant via the Samba add-on. After hours of troubleshooting, custom voice cloning is flagged as the next frontier for a future video. Overall, the message is clear: with Home Assistant, Wyoming, and local LLM tooling, a cloud-free home assistant can be assembled piece by piece, then upgraded as hardware and models improve.

Cornell Notes

The core idea is building a voice assistant that stays local end-to-end: wake word detection, speech-to-text, intent handling, and text-to-speech run on your own devices, while the “brain” can be a local LLM served via Ollama. Home Assistant orchestrates the pipeline using add-ons like Whisper (STT), Piper (TTS), and Open Wake Word (wake phrase), then routes recognized intents to home automation actions. The Wyoming protocol lets additional Raspberry Pis act as remote “satellites” for microphones/speakers, while Docker containers can offload STT/TTS to faster machines. Swapping Home Assistant’s default conversation agent for Llama 3.2 via Ollama enables more capable, context-aware responses that can drive real actions like controlling lights.

How does Home Assistant turn raw speech into actions without using cloud services?

It uses a local voice pipeline: Open Wake Word detects the wake phrase, Whisper converts the captured speech to text, and Piper converts response text back to speech. Home Assistant’s Assist layer then performs intent recognition, mapping phrases like “turn off Chuck lamp light” into commands that control devices (e.g., Philips Hue lights). The demo configures these components under Home Assistant → Voice assistants, selecting Whisper for speech-to-text, Piper for text-to-speech, and the wake word engine for wake phrase detection.
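The last hop, intent to action, lands on a Home Assistant service call, and Home Assistant exposes the same service calls over its REST API. The effect of “turn off Chuck lamp light” can therefore be sketched like this; the URL, token, and entity ID are placeholders for your own instance:

```python
import requests

HA_URL = "http://homeassistant.local:8123"   # placeholder: your HA instance
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # created under your HA user profile

# Equivalent of the intent behind "turn off Chuck lamp light":
resp = requests.post(
    f"{HA_URL}/api/services/light/turn_off",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"entity_id": "light.chuck_lamp"},  # placeholder entity ID
    timeout=10,
)
resp.raise_for_status()
print("Light turned off")
```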

What is the Wyoming protocol used for in this setup?

Wyoming lets separate hardware run as voice “satellites” that communicate with Home Assistant over the network. A Raspberry Pi with a microphone/speaker hat (ReSpeaker 2-Mic Pi Hat) runs the Wyoming satellite software as a service, exposing an endpoint (e.g., port 10700) that Home Assistant connects to. This allows multiple listening/speaking nodes around the house while keeping the main orchestration in Home Assistant.

Why does replacing the conversation agent with Ollama + Llama 3.2 matter?

Home Assistant’s default conversation agent is described as too limited. Pointing the conversation agent to Ollama running Llama 3.2 makes the assistant smarter and more context-aware. The demo shows factual Q&A (e.g., first U.S. president) and improved multi-turn behavior, such as turning a light off and then successfully turning it back on in a follow-up command.

How does the system speed up by offloading STT and TTS to other machines?

Docker containers run Wyoming Whisper (speech-to-text) and Wyoming Piper (text-to-speech) on a Windows laptop. Home Assistant then updates its voice pipeline endpoints to those remote services (via Wyoming protocol service entries). The demo renames the old local STT/TTS endpoints to avoid confusion, then switches speech-to-text to “faster whisper” and text-to-speech to “Piper,” improving responsiveness.

What remaining limitations are highlighted, and what’s the next technical step?

Two gaps are emphasized: custom wake word recognition (so the assistant answers to “Terry”) and custom voice generation. For wake words, the plan is to use Open Wake Word’s training environment on Google Colab, run the training to produce TF Lite and ONNX files, then upload them into Home Assistant using the Samba add-on so the wake word can be set to “Terry.” Custom voice cloning is deferred to a future video after troubleshooting.
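Before uploading via Samba, the Colab output can be sanity-checked locally. A sketch using TensorFlow’s TFLite interpreter; the filename `terry.tflite` is a placeholder for whatever the training notebook actually produces:

```python
import tensorflow as tf  # pip install tensorflow

# "terry.tflite" is a placeholder for the file the Colab training run produces.
interpreter = tf.lite.Interpreter(model_path="terry.tflite")
interpreter.allocate_tensors()

# Confirm the model loads and inspect its expected input/output tensors
# before copying it into Home Assistant's wake word model directory.
for detail in interpreter.get_input_details():
    print("input:", detail["name"], detail["shape"], detail["dtype"])
for detail in interpreter.get_output_details():
    print("output:", detail["name"], detail["shape"], detail["dtype"])
```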

Review Questions

  1. What components make up the local voice pipeline in Home Assistant, and what role does each one play?
  2. How does Wyoming enable adding remote microphone/speaker hardware without rebuilding the entire assistant?
  3. What changes when the conversation agent is switched from Home Assistant’s default to an Ollama-served Llama 3.2 model?

Key Points

  1. Home Assistant can run a fully local voice pipeline by combining Whisper (STT), Piper (TTS), and Open Wake Word (wake phrase) with Assist for intent-to-action control.

  2. Wyoming protocol turns extra devices into voice “satellites,” letting a Raspberry Pi with a mic/speaker handle listening and speaking while Home Assistant orchestrates the rest.

  3. Using Ollama to serve Llama 3.2 upgrades the assistant’s conversational ability and improves multi-turn context for follow-up commands.

  4. Dockerized Wyoming Whisper and Wyoming Piper let STT/TTS run on faster hardware, and Home Assistant can switch endpoints to those remote services.

  5. Multiple LLM servers can be integrated by updating the voice assistant’s conversation agent settings to point to different Ollama instances (e.g., “Terry”).

  6. Custom wake word training for a name like “Terry” requires training a new model (TF Lite/ONNX) and uploading it into Home Assistant via Samba.

  7. Custom voice generation remains unsolved in this build and is slated for a follow-up effort after wake word training works.

Highlights

A cloud-free assistant is built by wiring offline STT (Whisper), offline TTS (Piper), and offline wake word detection (Open Wake Word) into Home Assistant’s voice pipeline.
Wyoming protocol makes it easy to add remote listening/speaking nodes—turning Raspberry Pis into satellites that Home Assistant can connect to over the network.
Switching the conversation agent to Ollama + Llama 3.2 enables context-aware back-and-forth, not just one-shot commands.
Running Wyoming Whisper and Wyoming Piper in Docker containers offloads the heavy audio work to faster hardware and speeds up responses.
The next major milestone is training a custom wake word (“Terry”) using Open Wake Word’s Colab workflow and uploading the resulting model files into Home Assistant.

Topics

  • Local Voice Assistant
  • Home Assistant
  • Wyoming Protocol
  • Ollama Llama
  • Speech To Text
  • Text To Speech
  • Wake Word Training

Mentioned

  • AI
  • STT
  • TTS
  • LLM
  • GPU
  • WSL2
  • HTTP
  • TF
  • ONNX
  • DCU