
LangChain + Retrieval Local LLMs for Retrieval QA - No OpenAI!!!

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Local Retrieval QA without OpenAI is achievable by pairing local embeddings (e.g., instructor embeddings) with a local LLM inside a LangChain retrieval chain.

Briefing

Getting rid of OpenAI entirely for Retrieval QA with LangChain is feasible, but the quality hinges on the local LLM’s context limits, prompt format sensitivity, and GPU budget. Four local models (FLAN-T5 XL, its fine-tuned Fast Chat T5 variant, StableVicuna, and WizardLM) produce noticeably different answer quality, from terse and sometimes incomplete responses to more thorough, citation-like explanations. The practical takeaway: retrieval isn’t the only bottleneck; the model’s maximum input tokens and the prompt format it expects can make or break the system.

The workflow starts by swapping out OpenAI embeddings for local embeddings (instructor embeddings with ChromaDB), then running Retrieval QA with a local language model. That part is straightforward, but token limits force new design decisions. With FLAN-T5 XL (a 3B Seq2Seq model), the system is constrained to 512 tokens, so the amount of retrieved context passed to the model (e.g., how many chunks the retriever returns) must be tuned carefully. When the retrieved text fits, answers can be correct but often lack depth: questions like “What is flash attention?” return short, non-verbose responses. When the context exceeds the model’s trained maximum, the system fails with an error rather than degrading gracefully.
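
The swap is mostly a matter of changing which classes the chain is built from. Below is a minimal sketch of the fully local setup, assuming the classic LangChain API from the time of the video plus the chromadb, InstructorEmbedding, sentence-transformers, and transformers packages; the model names follow the video's choices but may need adjusting for your hardware and a pre-built Chroma index.

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import pipeline

# Local embeddings replace OpenAIEmbeddings.
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
# Assumes a Chroma index was already persisted to the "db" directory.
vectordb = Chroma(persist_directory="db", embedding_function=embeddings)

# Local LLM replaces the OpenAI LLM; FLAN-T5 XL is a Seq2Seq model,
# so it runs under the text2text-generation task.
hf_pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-xl",
    max_length=512,  # the model's trained 512-token budget
)
llm = HuggingFacePipeline(pipeline=hf_pipe)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuffs retrieved chunks into a single prompt
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
print(qa({"query": "What is flash attention?"})["result"])
```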

A fine-tuned FLAN-T5 variant (“Fast Chat T5”) improves extraction from retrieved context in some cases, but introduces its own quirks—odd padding tokens and visible spacing/double-spacing issues in the output. Those formatting problems can be fixed with preprocessing or postprocessing, but they highlight that model-specific tokenization and generation behavior matter even when the retrieval pipeline is identical.
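
Cleanup of this kind can live in a small postprocessing step. The helper below is hypothetical (the exact artifact strings depend on the tokenizer), but it illustrates stripping leftover special tokens and collapsing the doubled spacing:

```python
import re

def clean_fastchat_output(text: str) -> str:
    """Strip generation artifacts like the ones seen with Fast Chat T5:
    leftover special tokens and runs of doubled whitespace."""
    text = text.replace("<pad>", "").replace("</s>", "")
    text = re.sub(r"\s{2,}", " ", text)  # collapse double spacing
    return text.strip()

print(clean_fastchat_output("<pad> Flash  attention  is  an  IO-aware  algorithm ."))
```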

Scaling up to StableVicuna (13B) changes the trade-off: it can handle longer contexts (up to 2048 tokens, potentially 4000), yet it becomes highly sensitive to prompt formatting. The model expects a specific “### Human / ### Assistant” structure; when the prompt doesn’t match, it may still answer but can also drift into self-questioning or irrelevant follow-ups. Attempts to rewrite the prompt sometimes help, but too much prompt manipulation can trigger “not enough context” errors, suggesting the retrieval may be adequate while the prompt-template mismatch prevents the model from using it effectively.
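
In LangChain, that mismatch is addressed by overriding the chain's default QA prompt with one matching the model's training format. A sketch, reusing the llm and vectordb objects from the first example and assuming StableVicuna's documented "### Human / ### Assistant" convention (the exact wording may need adjusting per the model card):

```python
from langchain.prompts import PromptTemplate

# Hypothetical template following the "### Human / ### Assistant" convention.
template = """### Human: Use the following context to answer the question.

{context}

Question: {question}
### Assistant:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa = RetrievalQA.from_chain_type(
    llm=llm,  # here: a StableVicuna HuggingFacePipeline, built as in the first sketch
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": prompt},  # override the default QA prompt
)
```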

The best balance comes from WizardLM, a LLaMA-based causal model configured for 1024 tokens. It delivers “best of both worlds” behavior: thorough answers without the StableVicuna prompt-format failure modes. For example, it explains flash attention in a coherent way, breaks down “I/O aware” more usefully than the earlier succinct responses, and provides detailed tool-related explanations aligned with the ToolFormer paper (including examples like Google/Bing, Wolfram Alpha/Mathway, and translation systems). The remaining constraint is operational: larger models yield higher-quality generation but require more GPU memory and can be slower, while smaller models are faster but may sacrifice detail.
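
A sketch of what a causal-LM setup in this spirit might look like; the checkpoint id and the 8-bit loading (which needs the bitsandbytes package) are assumptions for illustration, not the video's exact settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

model_id = "TheBloke/wizardLM-7B-HF"  # hypothetical hub id; substitute your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # spread layers across available GPUs/CPU
    load_in_8bit=True,   # assumed: trade some quality for much less GPU memory
)
gen = pipeline(
    "text-generation",   # causal LM, unlike the Seq2Seq FLAN-T5 setup
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,  # matches the 1024-token budget used in these tests
)
llm = HuggingFacePipeline(pipeline=gen)
```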

Overall, the path to “No OpenAI!!!” is less about LangChain mechanics and more about selecting the right local model, tuning context length, and matching prompt templates to each model’s expectations. The next step suggested is broader benchmarking and testing additional LaMini models and task-tuned variants built specifically for retrieval QA.

Cornell Notes

A LangChain Retrieval QA pipeline can run fully locally—using local embeddings and a local LLM—without OpenAI. The biggest determinants of answer quality are (1) the LLM’s maximum input tokens (e.g., FLAN-T5 XL caps at 512), (2) how well the prompt format matches the model’s training (StableVicuna is sensitive), and (3) GPU constraints when running embeddings plus generation on the same hardware. Small models tend to be fast but produce terse answers; larger models can be more detailed but may be slower and more prompt-fragile. WizardLM (LLaMA-based) emerges as the best practical balance in these tests, producing thorough, coherent answers within a 1024-token setup.

Why does removing OpenAI force new design decisions in Retrieval QA?

Without OpenAI’s large token budget, local models impose hard limits on how much retrieved context can be passed. In the FLAN-T5 XL setup, the system is constrained to 512 tokens; increasing context beyond that triggers an error. That means the retriever’s context length (e.g., retrieving 3 chunks vs. fewer) must be tuned to fit the model’s trained maximum, or the pipeline must accept shorter context or fail safely, as sketched below.
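
One way to implement that tuning, reusing the vectordb and hf_pipe objects from the first sketch: lower k and check the assembled context against the model's budget before querying. The threshold and query here are illustrative.

```python
# Hypothetical guard: keep retrieved context inside FLAN-T5 XL's 512-token limit.
retriever = vectordb.as_retriever(search_kwargs={"k": 2})  # fewer chunks -> shorter prompt

docs = retriever.get_relevant_documents("What is flash attention?")
context = "\n\n".join(d.page_content for d in docs)
n_tokens = len(hf_pipe.tokenizer(context)["input_ids"])
if n_tokens > 512:
    print(f"Context is {n_tokens} tokens; retrieve fewer chunks or shrink chunk_size.")
```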

How do FLAN-T5 XL and Fast Chat T5 differ in retrieval QA behavior?

FLAN-T5 XL (3B Seq2Seq) yields correct but often non-verbose answers, giving short responses to questions like “What is flash attention?” and “What does ‘I/O aware’ mean?” Fast Chat T5, a fine-tuned variant, can extract retrieved information a bit better, but it shows generation artifacts such as padding-token weirdness and spacing/double-spacing problems. Those formatting issues likely require preprocessing/postprocessing even if the retrieval pipeline is unchanged.

What goes wrong with StableVicuna despite its longer context window?

StableVicuna (13B) supports much longer contexts (2048, possibly 4000), but it expects a specific prompt structure (the “### Human / ### Assistant” format it was trained on). When the prompt template doesn’t match, outputs can include odd behavior like asking itself questions or drifting into irrelevant follow-ups. Prompt rewriting sometimes helps, but excessive changes can produce “not enough context” errors, implying the model isn’t interpreting the retrieved text the way the prompt expects.

Why does WizardLM perform best among the four tested models?

WizardLM (LLaMA-based causal model) is configured for 1024 tokens and produces thorough, coherent answers without the prompt-format fragility seen in StableVicuna. It explains concepts like flash attention clearly, gives a more useful breakdown of “I/O aware,” and aligns tool-related answers with ToolFormer’s framing—naming external tools and describing how tool use is trained via API calls.

What trade-off governs model choice for local retrieval QA?

Quality versus cost. Larger models can generate more detailed answers but require more GPU memory and can be slower to produce tokens. Smaller models are faster and more responsive but may be less detailed or more terse. The practical goal is to find a “sweet spot” model that fits the hardware budget while still using retrieved context effectively.

Review Questions

  1. Which failure mode is most directly tied to FLAN-T5 XL’s 512-token limit, and how would you mitigate it in a LangChain retriever configuration?
  2. How does prompt-format sensitivity explain StableVicuna’s inconsistent outputs even when retrieval context is likely present?
  3. What specific symptoms in Fast Chat T5 suggest a preprocessing/postprocessing need beyond retrieval tuning?

Key Points

  1. Local Retrieval QA without OpenAI is achievable by pairing local embeddings (e.g., instructor embeddings) with a local LLM inside a LangChain retrieval chain.
  2. Hard context limits on local models require tuning the retriever’s context length; exceeding the limit can cause outright errors rather than graceful degradation.
  3. Seq2Seq models like FLAN-T5 XL can produce correct but terse answers, especially when the system is constrained to short context windows.
  4. Fine-tuned variants such as Fast Chat T5 may improve extraction quality but can introduce formatting artifacts (padding tokens, spacing/double-spacing) that need cleanup.
  5. Larger models like StableVicuna can handle longer contexts but may be highly sensitive to prompt templates, leading to drift or self-questioning when formatting is off.
  6. WizardLM provides a strong practical balance in these tests by delivering thorough answers without the prompt-template problems seen in StableVicuna.
  7. Model selection is ultimately a quality–latency–GPU-memory trade-off: bigger models can be better but slower and more resource intensive.

Highlights

FLAN-T5 XL’s 512-token cap forces retriever context tuning; too much retrieved text triggers errors instead of partial answers.
Fast Chat T5 sometimes improves answer usefulness but shows spacing/padding-token artifacts that can degrade readability.
StableVicuna’s longer context doesn’t guarantee better QA—prompt-format mismatch can cause irrelevant self-questioning or “not enough context” failures.
WizardLM delivers the most consistently useful answers in the set, including detailed explanations aligned with ToolFormer’s tool-use framing.
