How to Use BGE Embeddings with LangChain for RAG
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
BGE embeddings from the Beijing Academy of Artificial Intelligence (BAAI) have surged to the top of major embedding benchmarks while dramatically shrinking model size, making them a practical upgrade for retrieval-augmented generation (RAG) pipelines built with LangChain and Chroma. The core takeaway is straightforward: better benchmark performance isn't the only win. The smaller BGE models cut embedding latency, reduce RAM/VRAM requirements, and can even run on a CPU, which changes the cost and operational friction of building and maintaining a vector database.
The transcript starts by challenging a common confusion in the embedding ecosystem: using an OpenAI text model doesn’t require using OpenAI embeddings. It argues against long-term reliance on OpenAI embeddings for production RAG systems, mainly due to vendor lock-in. Once a large corpus is embedded with a proprietary provider, switching later forces a full re-embedding of everything—an expensive and time-consuming reset. There’s also a risk of future deprecation: as newer embedding models arrive, older OpenAI embeddings may be retired, again requiring re-embedding. OpenAI embeddings are framed as acceptable for quick experiments—testing an idea or validating a prototype—but not ideal for major projects that need portability and longevity.
From there, the focus shifts to BGE embeddings. These models ship in separate English and Chinese variants, and the team is also working on multilingual embeddings that have not been released yet. On the referenced benchmark (the Massive Text Embedding Benchmark, or MTEB, leaderboard hosted on Hugging Face), BGE models have rapidly climbed the rankings in just the past few days. A key comparison is size: the previously favored Instructor XL embeddings weigh in at just under 5 GB, while the BGE base English model is about a tenth that size. Even the larger BGE variant, at just over 1 GB, is far smaller than Instructor XL and uses a larger embedding dimension.
The practical demonstration uses the BGE base English embedding model with LangChain and a Chroma vector store. The pipeline stays largely the same, using a Llama 2 70B model hosted on Together API for retrieval QA, while swapping only the embedding backend. The transcript specifies cosine similarity by normalizing the embeddings (setting normalize_embeddings to true) and then plugging the embedding function into Chroma in the same way as before.
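A minimal sketch of that wiring, assuming LangChain's HuggingFaceBgeEmbeddings wrapper and the BAAI/bge-base-en checkpoint; the import paths are from the classic langchain package and vary by version, and the persist directory and `texts` chunk list are illustrative:

```python
# Sketch: swapping the embedding backend to BGE in a LangChain + Chroma pipeline.
# The persist directory and the `texts` chunk list are illustrative assumptions.
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma

embedding = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en",
    model_kwargs={"device": "cuda"},  # small enough that "cpu" is also viable
    encode_kwargs={"normalize_embeddings": True},  # unit-length vectors => cosine similarity
)

# `texts` is the list of Document chunks produced earlier by the text splitter.
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory="db",
)
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
```

Because the vectors are normalized at encode time, distance-based ranking inside the store behaves like cosine-similarity ranking, which is what the normalize_embeddings flag is for.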
The operational impact is concrete: embedding roughly 1,000 text chunks drops from minutes with Instructor XL to about 35 seconds with BGE base en on a T4 GPU. Retrieval quality is described as at least comparable, with the system correctly finding the right outputs in the tested cases. Some answers include odd or inconsistent extra text, suggesting that a second pass to clean up the output could improve final responses.
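The QA step that produced those results can be assembled as in the sketch below. The Together LLM wrapper, its import path, and the model string are assumptions here; the transcript only states that a Llama 2 70B model hosted on Together API is used.

```python
# Sketch: retrieval QA over the BGE-backed Chroma store.
# The Together wrapper, import path, and model string are assumptions --
# substitute whatever LangChain-compatible LLM client you actually use.
from langchain.chains import RetrievalQA
from langchain_community.llms import Together  # assumed import path; varies by LangChain version

llm = Together(
    model="togethercomputer/llama-2-70b-chat",  # illustrative model identifier
    temperature=0.1,
    max_tokens=512,
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved chunks into a single prompt
    retriever=retriever,  # the BGE-backed retriever built above
    return_source_documents=True,  # handy for checking which chunks were retrieved
)

result = qa_chain({"query": "What is the main topic of the ingested documents?"})
print(result["result"])
```

A lightweight post-processing pass over result["result"], such as trimming trailing artifacts, is one way to handle the inconsistent extra text mentioned above.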
Overall, the “big win” is less about a dramatic jump in retrieval accuracy and more about efficiency: smaller models mean faster embedding time, lower memory demands, and the possibility of CPU-based inference. The transcript ends by recommending close attention to BGE’s upcoming multilingual embeddings, anticipating improvements for multilingual RAG use cases.
Cornell Notes
BGE embeddings from the Beijing Academy of Artificial Intelligence (BAAI) are gaining benchmark momentum while cutting model size sharply, which makes RAG systems cheaper and faster to operate. The transcript recommends avoiding long-term reliance on OpenAI embeddings due to vendor lock-in and the likelihood of future deprecations that would force full re-embedding. In a LangChain + Chroma setup, the pipeline swaps only the embedding model while keeping the Llama 2 70B QA model hosted on Together API. Using BGE base en with cosine similarity (via normalized embeddings) reduces embedding time for ~1,000 texts from minutes to about 35 seconds on a T4, with retrieval results described as comparable. The smaller footprint also lowers RAM/VRAM needs and may allow CPU inference.
- Why does the transcript discourage using OpenAI embeddings for long-term production RAG systems?
- What makes BGE embeddings attractive compared with Instructor XL embeddings in this setup?
- How is the BGE embedding model integrated into the LangChain + Chroma pipeline?
- What performance improvement is reported when switching from Instructor XL to BGE base en?
- Does the transcript claim BGE embeddings dramatically improve answer quality?
- What should builders watch for regarding multilingual support?
Review Questions
- What two lifecycle risks does the transcript associate with using OpenAI embeddings in production, and how do they affect long-term maintenance?
- In the LangChain + Chroma setup, what specific similarity approach is used with BGE base en, and how is it implemented?
- Why does the transcript frame the main advantage of BGE embeddings as efficiency rather than a major jump in retrieval quality?
Key Points
1. Avoid treating “OpenAI text model” and “OpenAI embeddings” as a coupled choice; embeddings can be swapped independently.
2. For long-term RAG projects, prioritize open-source embeddings to reduce vendor lock-in and re-embedding risk.
3. OpenAI embeddings are positioned as useful for quick testing, but production systems face deprecation and full re-embedding costs.
4. BGE embeddings from BAAI have rapidly climbed benchmark rankings while remaining much smaller than Instructor XL.
5. Using BGE base en with LangChain and Chroma can cut embedding time for ~1,000 texts from minutes to about 35 seconds on a T4.
6. Smaller embedding models reduce RAM/VRAM demands and may make CPU-based inference feasible.
7. Multilingual BGE models are expected to improve multilingual RAG performance once released.