
Local Gemma 4 with OpenCode & llama.cpp | Build a Local RAG with LangChain | đź”´ Live

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemma 4 can run locally with llama.cpp, but “effective” parameter sizes still translate into large weight-loading requirements that stress memory.

Briefing

A local RAG app built around Gemma 4 can work surprisingly well on a single machine—but getting reliable retrieval depends less on the chat model and more on the ingestion and embedding pipeline. In a live build, Venelin Valkov runs Gemma 4 locally via llama.cpp (with a tokenizer fix) and wires it into an OpenCode + Streamlit workflow that lets users upload PDFs, convert them to markdown, index them, and chat with selected documents.

The setup starts with practical constraints. Even though Gemma 4 is marketed with “effective” parameter sizes, the full weights must still be loaded, which makes consumer hardware struggle. On an M4 MacBook with 48 GB unified memory, 4-bit quantization improves throughput in llama.cpp: the “hello world” test rises from about 40 tokens/sec on 8-bit to roughly 57 tokens/sec on 4-bit. The build also notes that longer context windows demand more VRAM and can trigger CPU offloading, slowing generation.

For the RAG application, the plan is intentionally simple: upload a PDF, convert it to local markdown, store it, and then answer questions using retrieved chunks. The initial architecture uses LangChain with an in-memory/local vector store (Chroma DB), PDF-to-markdown conversion via PyMuPDF4LLM, and embeddings generated through Ollama. OpenCode is configured to expose a local OpenAI-compatible endpoint for llama.cpp using an exact Gemma 4 GGML model filename and a large context window setting (256K).
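The ingest-then-retrieve loop described above can be sketched in plain Python. This is a toy illustration, not the app's actual code: a bag-of-words counter stands in for the Ollama embedding model, and a plain list stands in for Chroma DB; the names `embed`, `index`, and `retrieve` are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model (e.g. nomic-embed-text via Ollama):
    # a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(markdown: str, size: int = 40) -> list[str]:
    # Naive fixed-size word chunking; real pipelines split on headings/sentences.
    words = markdown.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def index(markdown: str) -> list[tuple[int, str, Counter]]:
    # "Vector store": a list of (chunk_id, text, vector) triples.
    return [(i, c, embed(c)) for i, c in enumerate(chunk(markdown))]

def retrieve(store, question: str, k: int = 2):
    # Rank chunks by similarity to the question; keep IDs for citation.
    qv = embed(question)
    ranked = sorted(store, key=lambda row: cosine(qv, row[2]), reverse=True)
    return ranked[:k]
```

The retrieved chunk IDs can then be handed to the chat model as grounded context, which is exactly what makes chunk-level citations possible later in the build.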

Early progress is fast—folders and modules appear, and the Streamlit UI comes up—but the build hits several integration issues. LangChain import paths change across versions, causing “chains” import failures until the correct modules are used. The Streamlit app also initially fails due to an importable “source” folder conflict, requiring project cleanup and resync.

Once the app runs, retrieval quality becomes the main battleground. A first attempt using Ollama embeddings produces wrong or irrelevant answers, including cases where the system claims it cannot find information that is clearly present in the PDF. Debugging leads to a key insight: embedding quality dominates. Switching to a dedicated embedding model (nomic-embed-text) and later to newer Nomic embeddings materially changes retrieval outcomes. With the improved embeddings, the system correctly answers a question about the paper’s evaluation hardware (Nvidia A100) and can point to the specific chunk(s) that supported the answer.

Still, reliability isn’t guaranteed. When the app is tested across multiple PDFs and longer conversations, chunk numbering and context passing can break, and some runs degrade retrieval again—sometimes producing “garbage” chunks like “introduction” that don’t match the question. The session ends with a pragmatic takeaway: for production-like deployments, fine-tuning smaller models and/or using stronger, domain-appropriate ingestion and embedding strategies may be more effective than relying on a large local model alone.

The broader message is that local Gemma 4 + RAG is feasible, but the fastest path to good answers is engineering the ingestion and retrieval loop—chunking strategy, embedding model choice, and traceable citations—so the system can consistently ground responses in the right parts of the documents.

Cornell Notes

Gemma 4 can power a fully local RAG workflow, but the biggest determinant of answer quality is retrieval—especially embeddings and chunking—rather than the chat model itself. In the build, Gemma 4 runs locally through llama.cpp (with a tokenizer fix), while a Streamlit + LangChain app handles PDF upload, PDF-to-markdown conversion, vector indexing (Chroma DB), and question answering. Early runs suffer from integration issues (LangChain import changes, Streamlit import conflicts) and then from retrieval failures when embeddings are weak or misconfigured. Switching to a dedicated, newer embedding model (nomic-embed-text) and iterating on retrieval settings dramatically improves accuracy and enables chunk-level citations that match the source PDF.

Why does “effective parameter size” still make Gemma 4 hard to run locally?

The session emphasizes that “effective” sizes don’t eliminate the need to load the full weight set. Even when a model is described as 2B or 4B effective, the real weights are larger (e.g., an effective 2B behaves like ~4.5B real parameters; effective 4B behaves like ~8B). That means the GPU/unified memory must still hold the full quantized weights, so hardware requirements remain steep even with 4-bit quantization.
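The memory arithmetic behind that point is easy to check: weight storage scales with the real (not effective) parameter count times bits per weight. A rough sketch, ignoring KV cache and activation overhead:

```python
def weight_memory_gb(real_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the quantized weights."""
    bytes_total = real_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# An "effective 4B" model with ~8B real parameters:
print(weight_memory_gb(8, 8))  # 8-bit: ~8 GB
print(weight_memory_gb(8, 4))  # 4-bit: ~4 GB
```

Even at 4-bit, an effective-4B/real-8B model needs roughly 4 GB for the weights alone, before the context window's KV cache is counted — which is why longer contexts can push layers off the GPU and trigger the CPU offloading noted earlier.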

What changed the local generation speed when moving from 8-bit to 4-bit?

A llama.cpp “hello world” benchmark shows throughput rising from about 40 tokens/sec on 8-bit quantization to roughly 57 tokens/sec on 4-bit. The trade-off is potential accuracy loss, but on the author’s machine the 4-bit run is noticeably smoother.

What were the main blockers during the OpenCode + Streamlit + LangChain build?

Two integration problems dominate early debugging: (1) LangChain module paths changed across versions, so the “chains” imports failed until the updated paths were used; (2) Streamlit couldn’t import from a “source” folder, requiring project cleanup and resync. After those fixes, the app could run and start indexing PDFs.

How did embeddings determine whether answers were grounded in the PDF?

With weaker or misconfigured embeddings, the system produced wrong answers or claimed the retrieved context lacked the needed information (even when the PDF clearly contained it). After switching to a dedicated embedding model (nomic-embed-text) and then using improved Nomic embeddings, retrieval quality improved sharply: the system returned the correct evaluation GPU (Nvidia A100) and could identify the chunk that supported the answer.

Why did chunk-level debugging become necessary?

The app sometimes returned “garbage” chunks or mismatched citations, suggesting retrieval wasn’t selecting the right text spans. The debugging approach was to display retrieved chunks in the UI and request chunk IDs used for the final answer, so the developer could verify whether the retriever was actually pulling relevant passages before blaming the LLM.
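The debugging pattern — surface the retrieved chunks in the UI and make the answer cite chunk IDs — can be sketched like this. The prompt wording and function names are placeholders, not the actual app's code:

```python
def build_grounded_prompt(question: str, retrieved: list[tuple[int, str]]) -> str:
    # Tag each chunk with a stable ID so the model can cite what it used.
    context = "\n\n".join(f"[chunk {cid}]\n{text}" for cid, text in retrieved)
    return (
        "Answer using ONLY the context below. "
        "End with the chunk IDs you relied on, e.g. 'Sources: chunk 3'.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def debug_view(retrieved: list[tuple[int, str]]) -> None:
    # Showing retrieved chunks answers the key triage question first:
    # did the retriever pull relevant text, or is the LLM to blame?
    for cid, text in retrieved:
        print(f"--- chunk {cid} ---\n{text[:120]}")
```

If `debug_view` shows “garbage” chunks like a bare “introduction”, the fault lies in retrieval (embeddings or chunking), and no amount of prompt tweaking on the chat model will fix it.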

Review Questions

  1. What hardware constraint does the session highlight that undermines the usefulness of “effective” parameter sizes for local deployment?
  2. Which two categories of issues delayed the RAG app’s first successful run, and how were they resolved?
  3. Describe how switching embedding models changed retrieval outcomes and why chunk-level inspection mattered.

Key Points

  1. Gemma 4 can run locally with llama.cpp, but “effective” parameter sizes still translate into large weight-loading requirements that stress memory.
  2. On an M4 MacBook with 48 GB unified memory, 4-bit quantization improved generation speed (about 40→57 tokens/sec) compared with 8-bit.
  3. A working local RAG pipeline depends on correct LangChain imports and clean project structure to avoid Streamlit import errors.
  4. Retrieval accuracy often fails when embeddings are weak or misconfigured, even if the chat model is strong.
  5. Switching to dedicated, newer embeddings (nomic-embed-text and updated Nomic embeddings) can dramatically improve grounding and citation correctness.
  6. Chunking and retrieval settings can still break across multiple PDFs or longer sessions, so chunk-level UI inspection and chunk IDs are valuable for debugging.

Highlights

4-bit quantization in llama.cpp boosted local throughput from ~40 tokens/sec to ~57 tokens/sec on the author’s setup.
Retrieval failures weren’t fixed by changing the chat model; improving embeddings was the turning point for correct answers and citations.
The build repeatedly surfaced integration fragility: LangChain version changes and Streamlit import conflicts can derail an otherwise sound RAG design.
Displaying retrieved chunks and requesting chunk IDs made it possible to diagnose when the retriever—not the LLM—was the problem.

Mentioned

  • Venelin Valkov
  • RAG
  • LLM
  • VRAM
  • GPU
  • CPU
  • UI
  • PDF
  • API
  • GGML
  • M4
  • A100
  • RTX
  • Chroma DB
  • PyMuPDF
  • Ollama
  • Streamlit
  • UV
  • IoT