100% Local AI Speech to Speech with RAG - Low Latency | Mistral 7B, Faster Whisper ++
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
A fully local “speech to speech” assistant can run end-to-end on a single machine—microphone input becomes text in real time, that text can update a local RAG knowledge base, and the assistant can answer back with low-latency local TTS. The practical takeaway is that the system stays offline while still supporting retrieval from user-added text files and uploaded PDFs, letting voice commands directly modify what the assistant knows.
The setup combines several open-source components into one pipeline. Audio from a microphone is transcribed with Faster Whisper, then routed either to an agent for immediate responses or to voice-driven commands that write the transcript into a local “vault” text file. That vault text is converted into embeddings and stored in a vector database, which the assistant queries using cosine similarity. A key parameter, “top K,” is set to 3, meaning the assistant retrieves the three most relevant text chunks from the vault for each user query.
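As a concrete illustration, the sketch below shows how that retrieval step could look with the sentence-transformers library, which ships the all-MiniLM-L6-v2 model used here. The line-per-chunk vault format, file name, and function names are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of vault retrieval: embed every chunk once, then rank
# chunks against the query by cosine similarity and keep the top K = 3.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def load_vault(path="vault.txt"):
    # Treat each non-empty line as one chunk; real chunking strategies vary.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

vault_chunks = load_vault()  # assumes the vault already has content
vault_embeddings = model.encode(vault_chunks, convert_to_tensor=True)

def retrieve(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, vault_embeddings)[0]
    top = scores.topk(k=min(top_k, len(vault_chunks)))
    return [vault_chunks[int(i)] for i in top.indices]

print(retrieve("When is my next meeting?"))
```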
For speech output, the system uses local TTS. The video covers two options: XTTS v2 for voice generation (noted as slower) and OpenVoice for low-latency speech (positioned as faster). The assistant's behavior can also be shaped with a system prompt that defines a specific persona, in this case an assistant named Emma responding to Chris, complete with conversational quirks like complaining when tasks are requested.
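A persona like Emma's is typically just a system message prepended to the chat history. The wording below is hypothetical; the video's actual prompt may differ.

```python
# Hypothetical system prompt for the "Emma" persona described above.
system_prompt = (
    "You are Emma, a voice assistant talking to Chris. Keep answers short "
    "enough to speak aloud, and complain good-naturedly whenever Chris "
    "asks you to do a task."
)

# The prompt rides along as the first message in every chat completion call.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Emma, add my 2 p.m. meeting to the vault."},
]
```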
On the RAG side, the project supports operational voice commands such as “insert info,” “delete info,” and “print info.” “Insert info” appends the transcribed content to vault text, which then becomes available to the agent through embeddings. “Delete info” removes the vault text but requires a spoken confirmation (“yes”) to avoid accidental data loss. “Print info” provides a quick sanity check by showing what’s currently stored in the vault and, by extension, what the assistant can retrieve.
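A simple way to implement these commands is to check each transcript for a command prefix before handing it to the agent. The router below is a sketch under that assumption; the file layout, helper names, and spoken responses are illustrative.

```python
# Hypothetical router for the spoken vault commands. Anything that is not
# a command falls through to the normal agent path.
import os

VAULT = "vault.txt"

def listen_for_confirmation() -> str:
    # Stand-in for another round of mic capture + transcription.
    return input("Say 'yes' to confirm deletion: ")

def handle_command(transcript: str) -> str:
    text = transcript.lower().strip()
    if text.startswith("insert info"):
        # Append the rest of the utterance to the vault for later embedding.
        with open(VAULT, "a", encoding="utf-8") as f:
            f.write(transcript[len("insert info"):].strip() + "\n")
        return "Added to the vault."
    if text.startswith("delete info"):
        # Destructive, so require a spoken "yes" before wiping the file.
        if listen_for_confirmation().lower().strip() == "yes":
            open(VAULT, "w").close()
            return "Vault deleted."
        return "Deletion cancelled."
    if text.startswith("print info"):
        if os.path.exists(VAULT):
            with open(VAULT, encoding="utf-8") as f:
                return f.read() or "The vault is empty."
        return "The vault is empty."
    return ""  # not a command
```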
A major implementation detail is performance tuning for inference. The system tries to use the GPU aggressively: Faster Whisper runs with CUDA, the TTS model also uses CUDA, and LM Studio is used to offload the full model to the GPU for speed. Without a GPU, the creator warns, latency becomes noticeably worse.
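The faster-whisper and Coqui TTS APIs both accept a device argument, so the CUDA setup described above can look roughly like this. The Whisper model size, audio file names, and reference voice are assumptions.

```python
# Sketch of the CUDA-first configuration: both the transcription model and
# the TTS model sit on the GPU, with a CPU fallback for machines without one.
import torch
from faster_whisper import WhisperModel
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Faster Whisper with float16 on GPU; int8 keeps the CPU fallback usable.
whisper = WhisperModel("medium", device=device,
                       compute_type="float16" if device == "cuda" else "int8")

# XTTS v2 on the same device for synthesis.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

segments, _info = whisper.transcribe("input.wav")
text = " ".join(segment.text for segment in segments)
tts.tts_to_file(text=text, speaker_wav="voice_sample.wav", language="en",
                file_path="reply.wav")
```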
Model choice is treated as a lever for both quality and responsiveness. The agent's LLM is swappable: the demo starts with Mistral 7B and later switches to a larger 13B model (described as "Qwen chat 13B") for improved RAG performance, at the cost of higher latency. For embeddings, the system uses all-MiniLM-L6-v2, and for voice, XTTS v2.
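Because LM Studio exposes loaded models through an OpenAI-compatible server (http://localhost:1234/v1 by default), swapping Mistral 7B for the larger chat model is mostly a one-line change. The model identifier and prompt wording below are assumptions.

```python
# Querying the LM Studio server with the top-K vault chunks as context.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Swap this identifier for the larger 13B chat model to trade latency for
# RAG quality (identifiers depend on what is loaded in LM Studio).
MODEL = "mistral-7b-instruct"

def ask(question: str, context_chunks: list[str]) -> str:
    context = "\n".join(context_chunks)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```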
In a live test, Chris speaks meeting details using voice commands and the assistant updates the vault, then lists upcoming meetings from the stored information. In a second test, a PDF (“More Agents is All You Need”) is uploaded, converted into text, embedded, and added to the vault. After retrieval, the assistant answers a question about the paper’s method for scaling performance with the number of agents, identifying “sampling and voting” (multiple agents contributing responses) as the mechanism and noting that it improves performance as more agents are involved. The overall message: local, voice-driven RAG can be a solid baseline for AI engineering projects, especially when GPU acceleration and model selection are handled carefully.
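The PDF path reuses the existing vault machinery: extract text, append it to the vault, and let the embedding step pick it up. The sketch below uses pypdf for extraction, which is an assumption; the original project may use a different PDF library.

```python
# Convert a PDF to plain text and append it to the vault so the normal
# embedding + retrieval pipeline can answer questions about it.
from pypdf import PdfReader

def ingest_pdf(pdf_path: str, vault_path: str = "vault.txt") -> None:
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    with open(vault_path, "a", encoding="utf-8") as f:
        f.write("\n".join(pages) + "\n")

ingest_pdf("more_agents_is_all_you_need.pdf")  # illustrative file name
```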
Cornell Notes
The assistant is a fully local speech-to-speech system that combines Faster Whisper transcription, local TTS, and RAG over a user-managed "vault." Voice commands like "insert info," "delete info," and "print info" update the vault by writing transcribed speech to a text file, embedding it with all-MiniLM-L6-v2, and retrieving relevant chunks via cosine similarity (top K = 3). GPU acceleration (CUDA in Faster Whisper and TTS, plus LM Studio offloading) is emphasized to keep latency low. Model selection matters: OpenVoice is faster for low latency than XTTS v2, and switching from Mistral 7B to a larger Qwen chat 13B is described as improving RAG quality while slowing responses. The system can also ingest PDFs by converting them to text and embedding them for retrieval.
- How does the system turn spoken input into something the RAG agent can use?
- What do "insert info," "delete info," and "print info" do in practice?
- Why is GPU usage treated as central to low latency?
- How do model choices affect both quality and speed?
- How does the system handle PDFs for retrieval-augmented answers?
Review Questions
- What is the role of top K (set to 3) in the retrieval step, and how does it influence what the agent sees?
- Which components run on CUDA in this setup, and how does that relate to end-to-end latency?
- How do voice commands change the vault contents, and why does that matter for later RAG responses?
Key Points
1. The system runs fully locally: microphone speech is transcribed with Faster Whisper, answers are generated by a local LLM, and replies are spoken using local TTS.
2. Voice commands directly manage a local RAG knowledge base by writing transcribed speech into vault text and embedding it for retrieval.
3. RAG retrieval uses cosine similarity over embeddings with top K set to 3, pulling the three most relevant text chunks for each query.
4. OpenVoice is positioned as the low-latency TTS option, while XTTS v2 is slower but still usable.
5. GPU acceleration is treated as essential: Faster Whisper and TTS use CUDA, and LM Studio offloads models to the GPU for speed.
6. Model size and selection trade off quality and latency; switching from Mistral 7B to the larger Qwen chat 13B is described as improving RAG performance but increasing response time.
7. PDF ingestion works by converting the PDF to text, appending it to the vault, embedding it, and then answering questions grounded in that content.