Local Gemma 4 with OpenCode & llama.cpp | Build a Local RAG with LangChain | 🔴 Live
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A local RAG app built around Gemma 4 can work surprisingly well on a single machine—but getting reliable retrieval depends less on the chat model and more on the ingestion and embedding pipeline. In a live build, Venelin Valkov runs Gemma 4 locally via llama.cpp (with a tokenizer fix) and wires it into an OpenCode + Streamlit workflow that lets users upload PDFs, convert them to markdown, index them, and chat with selected documents.
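The app code itself isn't reproduced in this briefing, so the following is a minimal Streamlit sketch of the flow described above (upload, ingest, select documents, chat). The `rag_app` module and the `ingest_pdf` / `answer_question` helpers are hypothetical placeholders for the project's own ingestion and RAG code.

```python
import streamlit as st

# Hypothetical helpers standing in for the project's own ingestion and RAG modules.
from rag_app import ingest_pdf, answer_question

st.title("Chat with your PDFs (local Gemma + llama.cpp)")

if "docs" not in st.session_state:
    st.session_state["docs"] = []

# Upload a PDF; ingest_pdf would convert it to markdown and index it.
uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    doc_id = ingest_pdf(uploaded)
    if doc_id not in st.session_state["docs"]:
        st.session_state["docs"].append(doc_id)

# Pick which indexed documents to chat with, then ask questions against them.
selected = st.multiselect("Documents", st.session_state["docs"])
if question := st.chat_input("Ask about the selected documents"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(answer_question(question, selected))
```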
The setup starts with practical constraints. Even when Gemma 4 is marketed with “effective” parameter sizes, the weights still need to be loaded in full, making consumer hardware struggle. On an M4 MacBook with 48 GB unified memory, 4-bit quantization improves throughput in llama.cpp: the “hello world” test rises from about 40 tokens/sec on 8-bit to roughly 57 tokens/sec on 4-bit. The build also notes that longer context windows demand more VRAM and can trigger CPU offloading, slowing generation.
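The video drives llama.cpp through OpenCode's local server, but the quantization and context-window trade-offs are easy to reproduce with the llama-cpp-python bindings. This is a sketch only; the GGUF filename and context size are illustrative assumptions, not the video's exact configuration.

```python
from llama_cpp import Llama

# Illustrative filename: any 4-bit (Q4) GGUF build of the model loads the same way.
llm = Llama(
    model_path="gemma-4-q4_0.gguf",  # hypothetical 4-bit quantized weights
    n_ctx=8192,        # longer context windows cost more memory and can force CPU offload
    n_gpu_layers=-1,   # offload all layers to GPU / unified memory if it fits
)

# The "hello world" style throughput test from the session, in miniature.
out = llm("Say hello world.", max_tokens=32)
print(out["choices"][0]["text"])
```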
For the RAG application, the plan is intentionally simple: upload a PDF, convert it to local markdown, store it, and then answer questions using retrieved chunks. The initial architecture uses LangChain with an in-memory/local vector store (Chroma DB), PDF-to-markdown conversion via PyMuPDF4LLM, and embeddings generated through Ollama. OpenCode is configured to expose a local OpenAI-compatible endpoint for llama.cpp using the exact Gemma 4 GGUF model filename and a large context window setting (256K).
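A minimal ingestion sketch along these lines might look as follows, assuming the PyMuPDF4LLM package (`pymupdf4llm`), the split-package LangChain layout, and illustrative chunk sizes; exact import paths vary by LangChain version, which the build runs into later. The embedding model shown is the one the session eventually settles on.

```python
import pymupdf4llm
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# 1. PDF -> markdown via PyMuPDF4LLM.
markdown_text = pymupdf4llm.to_markdown("paper.pdf")

# 2. Split the markdown into overlapping chunks; sizes here are illustrative.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.create_documents([markdown_text])

# 3. Embed the chunks via Ollama and persist them in a local Chroma collection.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")
```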
Early progress is fast—folders and modules appear, and the Streamlit UI comes up—but the build hits several integration issues. LangChain import paths change across versions, causing “chains” import failures until the correct modules are used. The Streamlit app also initially fails due to an importable “source” folder conflict, requiring project cleanup and resync.
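The failing imports aren't listed verbatim in the session, but the pattern is familiar: paths that older tutorials pull from the `langchain` meta-package have moved into split packages. As an illustration of the kind of fix involved (not the video's literal diff):

```python
# Older tutorials import everything from the `langchain` meta-package:
#   from langchain.vectorstores import Chroma
#   from langchain.embeddings import OllamaEmbeddings
# In recent releases these live in split packages instead:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

# Legacy chains still ship with the core `langchain` package:
from langchain.chains import RetrievalQA
```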
Once the app runs, the retrieval quality becomes the main battleground. A first attempt using Ollama embeddings produces wrong or irrelevant answers, including cases where the system claims it cannot find information that is clearly present in the PDF. Debugging leads to a key insight: embedding quality dominates. Switching to a dedicated embedding model (nomic-embed-text) and later to newer Nomic embeddings materially changes retrieval outcomes. With the improved embeddings, the system correctly answers a question about the paper’s evaluation hardware (Nvidia A100) and can point to the specific chunk(s) that supported the answer.
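A hedged sketch of that debugging loop, assuming the Chroma index from the ingestion step and a llama.cpp OpenAI-compatible server on `localhost:8080` (the port, model name, and prompt wording are illustrative): inspect the retrieved chunks and their scores first, then ask the locally served Gemma model to answer only from them.

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI

# Re-open the index with the dedicated embedding model (nomic-embed-text via Ollama).
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Chroma(persist_directory="chroma_db", embedding_function=embeddings)

# Check which chunks actually back the question before trusting the answer.
question = "What hardware was used for the evaluation?"
hits = vector_store.similarity_search_with_score(question, k=4)
for doc, score in hits:
    print(f"{score:.3f}  {doc.page_content[:80]}...")

# Answer with the locally served model behind llama.cpp's OpenAI-compatible API.
llm = ChatOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed", model="gemma")
context = "\n\n".join(doc.page_content for doc, _ in hits)
reply = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(reply.content)
```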
Still, reliability isn’t guaranteed. When the app is tested across multiple PDFs and longer conversations, chunk numbering and context passing can break, and some runs degrade retrieval again—sometimes producing “garbage” chunks like “introduction” that don’t match the question. The session ends with a pragmatic takeaway: for production-like deployments, fine-tuning smaller models and/or using stronger, domain-appropriate ingestion and embedding strategies may be more effective than relying on a large local model alone.
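One practical way to keep citations traceable, assuming the chunk list from the splitter above (the session's own implementation may differ), is to stamp each chunk with a stable ID and source filename at ingestion time, so retrieved chunks can be matched back to the PDF and mismatches such as a stray "introduction" chunk are easy to spot.

```python
from langchain_core.documents import Document

# Tag every chunk with a stable ID and its source file so answers can cite
# the exact chunk(s) that supported them.
def tag_chunks(chunks, source_name):
    return [
        Document(
            page_content=chunk.page_content,
            metadata={"source": source_name, "chunk_id": i},
        )
        for i, chunk in enumerate(chunks)
    ]

# After retrieval, surface the IDs alongside the answer:
#   for doc, score in hits:
#       print(doc.metadata["source"], doc.metadata["chunk_id"], round(score, 3))
```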
The broader message is that local Gemma 4 + RAG is feasible, but the fastest path to good answers is engineering the ingestion and retrieval loop—chunking strategy, embedding model choice, and traceable citations—so the system can consistently ground responses in the right parts of the documents.
Cornell Notes
Gemma 4 can power a fully local RAG workflow, but the biggest determinant of answer quality is retrieval—especially embeddings and chunking—rather than the chat model itself. In the build, Gemma 4 runs locally through llama.cpp (with a tokenizer fix), while a Streamlit + LangChain app handles PDF upload, PDF-to-markdown conversion, vector indexing (Chroma DB), and question answering. Early runs suffer from integration issues (LangChain import changes, Streamlit import conflicts) and then from retrieval failures when embeddings are weak or misconfigured. Switching to a dedicated, newer embedding model (nomic-embed-text) and iterating on retrieval settings dramatically improves accuracy and enables chunk-level citations that match the source PDF.
Why does “effective parameter size” still make Gemma 4 hard to run locally?
What changed the local generation speed when moving from 8-bit to 4-bit?
What were the main blockers during the OpenCode + Streamlit + LangChain build?
How did embeddings determine whether answers were grounded in the PDF?
Why did chunk-level debugging become necessary?
Review Questions
- What hardware constraint does the session highlight that undermines the usefulness of “effective” parameter sizes for local deployment?
- Which two categories of issues delayed the RAG app’s first successful run, and how were they resolved?
- Describe how switching embedding models changed retrieval outcomes and why chunk-level inspection mattered.
Key Points
1. Gemma 4 can run locally with llama.cpp, but “effective” parameter sizes still translate into large weight-loading requirements that stress memory.
2. On an M4 MacBook with 48 GB unified memory, 4-bit quantization improved generation speed (about 40→57 tokens/sec) compared with 8-bit.
3. A working local RAG pipeline depends on correct LangChain imports and clean project structure to avoid Streamlit import errors.
4. Retrieval accuracy often fails when embeddings are weak or misconfigured, even if the chat model is strong.
5. Switching to dedicated, newer embeddings (nomic-embed-text and updated Nomic embeddings) can dramatically improve grounding and citation correctness.
6. Chunking and retrieval settings can still break across multiple PDFs or longer sessions, so chunk-level UI inspection and chunk IDs are valuable for debugging.