100% Local AI Speech to Speech with RAG - Low Latency | Mistral 7B, Faster Whisper ++
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
A fully local “speech to speech” assistant can run end-to-end on a single machine—microphone input becomes text in real time, that text can update a local RAG knowledge base, and the assistant can answer back with low-latency local TTS. The practical takeaway is that the system stays offline while still supporting retrieval from user-added text files and uploaded PDFs, letting voice commands directly modify what the assistant knows.
The setup combines several open-source components into one pipeline. Audio from a microphone is transcribed with Faster Whisper, then routed either to an agent for immediate responses or to voice-driven commands that write the transcript into a local “vault” text file. That vault text is converted into embeddings and stored in a vector database, which the assistant queries using cosine similarity. A key parameter, “top K,” is set to 3, meaning the assistant retrieves the three most relevant text chunks from the vault for each user query.
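As a concrete illustration, the sketch below shows how that retrieval step could look with the sentence-transformers library, which ships the all-MiniLM-L6-v2 model used here. The line-per-chunk vault format, file name, and function names are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of vault retrieval: embed every chunk once, then rank
# chunks against the query by cosine similarity and keep the top K = 3.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def load_vault(path="vault.txt"):
    # Treat each non-empty line as one chunk; real chunking strategies vary.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

vault_chunks = load_vault()  # assumes the vault already has content
vault_embeddings = model.encode(vault_chunks, convert_to_tensor=True)

def retrieve(query, top_k=3):
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, vault_embeddings)[0]
    top = scores.topk(k=min(top_k, len(vault_chunks)))
    return [vault_chunks[int(i)] for i in top.indices]

print(retrieve("When is my next meeting?"))
```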
For speech output, the system uses local TTS. The video covers two options: XTTS v2 for voice generation (noted as slower) and OpenVoice for low-latency speech (positioned as faster). The assistant's behavior can also be shaped with a system prompt that defines a specific persona, in this case an assistant named Emma responding to Chris, complete with conversational quirks like complaining when tasks are requested.
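A persona like Emma's is typically just a system message prepended to the chat history. The wording below is hypothetical; the video's actual prompt may differ.

```python
# Hypothetical system prompt for the "Emma" persona described above.
system_prompt = (
    "You are Emma, a voice assistant talking to Chris. Keep answers short "
    "enough to speak aloud, and complain good-naturedly whenever Chris "
    "asks you to do a task."
)

# The prompt rides along as the first message in every chat completion call.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Emma, add my 2 p.m. meeting to the vault."},
]
```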
On the RAG side, the project supports operational voice commands such as “insert info,” “delete info,” and “print info.” “Insert info” appends the transcribed content to vault text, which then becomes available to the agent through embeddings. “Delete info” removes the vault text but requires a spoken confirmation (“yes”) to avoid accidental data loss. “Print info” provides a quick sanity check by showing what’s currently stored in the vault and, by extension, what the assistant can retrieve.
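A simple way to implement these commands is to check each transcript for a command prefix before handing it to the agent. The router below is a sketch under that assumption; the file layout, helper names, and spoken responses are illustrative.

```python
# Hypothetical router for the spoken vault commands. Anything that is not
# a command falls through to the normal agent path.
import os

VAULT = "vault.txt"

def listen_for_confirmation() -> str:
    # Stand-in for another round of mic capture + transcription.
    return input("Say 'yes' to confirm deletion: ")

def handle_command(transcript: str) -> str:
    text = transcript.lower().strip()
    if text.startswith("insert info"):
        # Append the rest of the utterance to the vault for later embedding.
        with open(VAULT, "a", encoding="utf-8") as f:
            f.write(transcript[len("insert info"):].strip() + "\n")
        return "Added to the vault."
    if text.startswith("delete info"):
        # Destructive, so require a spoken "yes" before wiping the file.
        if listen_for_confirmation().lower().strip() == "yes":
            open(VAULT, "w").close()
            return "Vault deleted."
        return "Deletion cancelled."
    if text.startswith("print info"):
        if os.path.exists(VAULT):
            with open(VAULT, encoding="utf-8") as f:
                return f.read() or "The vault is empty."
        return "The vault is empty."
    return ""  # not a command
```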
A major implementation detail is performance tuning for inference. The system tries to use the GPU aggressively: Faster Whisper runs with CUDA, the TTS model also uses CUDA, and LM Studio is used to offload the full model to the GPU for speed. Without a GPU, the creator warns, latency becomes noticeably worse.
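The faster-whisper and Coqui TTS APIs both accept a device argument, so the CUDA setup described above can look roughly like this. The Whisper model size, audio file names, and reference voice are assumptions.

```python
# Sketch of the CUDA-first configuration: both the transcription model and
# the TTS model sit on the GPU, with a CPU fallback for machines without one.
import torch
from faster_whisper import WhisperModel
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Faster Whisper with float16 on GPU; int8 keeps the CPU fallback usable.
whisper = WhisperModel("medium", device=device,
                       compute_type="float16" if device == "cuda" else "int8")

# XTTS v2 on the same device for synthesis.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

segments, _info = whisper.transcribe("input.wav")
text = " ".join(segment.text for segment in segments)
tts.tts_to_file(text=text, speaker_wav="voice_sample.wav", language="en",
                file_path="reply.wav")
```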
Model choice is treated as a lever for both quality and responsiveness. The agent's LLM is swappable: the demo starts with Mistral 7B and later switches to a larger 13B model (described as "Qwen chat 13B") for improved RAG performance, at the cost of higher latency. For embeddings, the system uses all-MiniLM-L6-v2, and for voice, XTTS v2.
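Because LM Studio exposes loaded models through an OpenAI-compatible server (http://localhost:1234/v1 by default), swapping Mistral 7B for the larger chat model is mostly a one-line change. The model identifier and prompt wording below are assumptions.

```python
# Querying the LM Studio server with the top-K vault chunks as context.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Swap this identifier for the larger 13B chat model to trade latency for
# RAG quality (identifiers depend on what is loaded in LM Studio).
MODEL = "mistral-7b-instruct"

def ask(question: str, context_chunks: list[str]) -> str:
    context = "\n".join(context_chunks)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```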
In a live test, Chris speaks meeting details using voice commands and the assistant updates the vault, then lists upcoming meetings from the stored information. In a second test, a PDF (“More Agents is All You Need”) is uploaded, converted into text, embedded, and added to the vault. After retrieval, the assistant answers a question about the paper’s method for scaling performance with the number of agents, identifying “sampling and voting” (multiple agents contributing responses) as the mechanism and noting that it improves performance as more agents are involved. The overall message: local, voice-driven RAG can be a solid baseline for AI engineering projects, especially when GPU acceleration and model selection are handled carefully.
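The PDF path reuses the existing vault machinery: extract text, append it to the vault, and let the embedding step pick it up. The sketch below uses pypdf for extraction, which is an assumption; the original project may use a different PDF library.

```python
# Convert a PDF to plain text and append it to the vault so the normal
# embedding + retrieval pipeline can answer questions about it.
from pypdf import PdfReader

def ingest_pdf(pdf_path: str, vault_path: str = "vault.txt") -> None:
    reader = PdfReader(pdf_path)
    pages = [page.extract_text() or "" for page in reader.pages]
    with open(vault_path, "a", encoding="utf-8") as f:
        f.write("\n".join(pages) + "\n")

ingest_pdf("more_agents_is_all_you_need.pdf")  # illustrative file name
```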
Cornell Notes
The assistant is a fully local speech-to-speech system that combines Faster Whisper transcription, local TTS, and RAG over a user-managed "vault." Voice commands like "insert info," "delete info," and "print info" update the vault by writing transcribed speech to a text file, embedding it with all-MiniLM-L6-v2, and retrieving relevant chunks via cosine similarity (top K = 3). GPU acceleration (CUDA in Faster Whisper and TTS, plus LM Studio offloading) is emphasized to keep latency low. Model selection matters: OpenVoice is faster for low latency than XTTS v2, and switching from Mistral 7B to a larger Qwen chat 13B is described as improving RAG quality while slowing responses. The system can also ingest PDFs by converting them to text and embedding them for retrieval.
- How does the system turn spoken input into something the RAG agent can use?
- What do "insert info," "delete info," and "print info" do in practice?
- Why is GPU usage treated as central to low latency?
- How do model choices affect both quality and speed?
- How does the system handle PDFs for retrieval-augmented answers?
Review Questions
- What is the role of top K (set to 3) in the retrieval step, and how does it influence what the agent sees?
- Which components run on CUDA in this setup, and how does that relate to end-to-end latency?
- How do voice commands change the vault contents, and why does that matter for later RAG responses?
Key Points
1. The system runs fully locally: microphone speech is transcribed with Faster Whisper, answers are generated by a local LLM, and replies are spoken using local TTS.
2. Voice commands directly manage a local RAG knowledge base by writing transcribed speech into vault text and embedding it for retrieval.
3. RAG retrieval uses cosine similarity over embeddings with top K set to 3, pulling the three most relevant text chunks for each query.
4. OpenVoice is positioned as the low-latency TTS option, while XTTS v2 is slower but still usable.
5. GPU acceleration is treated as essential: Faster Whisper and TTS use CUDA, and LM Studio offloads models to the GPU for speed.
6. Model size and selection trade off quality and latency; switching from Mistral 7B to the larger Qwen chat 13B is described as improving RAG performance but increasing response time.
7. PDF ingestion works by converting the PDF to text, appending it to the vault, embedding it, and then answering questions grounded in that content.