
How to use BGE Embeddings for LangChain and RAG

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Avoid treating “OpenAI text model” and “OpenAI embeddings” as a coupled choice; embeddings can be swapped independently.

Briefing

BGE embeddings from the Beijing Academy of Artificial Intelligence (BAAI) have surged to the top of major embedding benchmarks while dramatically shrinking model size—making them a practical upgrade for retrieval-augmented generation (RAG) pipelines built with LangChain and Chroma. The core takeaway is straightforward: better benchmark performance isn’t the only win. The smaller BGE models cut embedding latency, reduce RAM/VRAM requirements, and can even run on a CPU, which changes the cost and operational friction of building and maintaining a vector database.

The transcript starts by challenging a common confusion in the embedding ecosystem: using an OpenAI text model doesn’t require using OpenAI embeddings. It argues against long-term reliance on OpenAI embeddings for production RAG systems, mainly due to vendor lock-in. Once a large corpus is embedded with a proprietary provider, switching later forces a full re-embedding of everything—an expensive and time-consuming reset. There’s also a risk of future deprecation: as newer embedding models arrive, older OpenAI embeddings may be retired, again requiring re-embedding. OpenAI embeddings are framed as acceptable for quick experiments—testing an idea or validating a prototype—but not ideal for major projects that need portability and longevity.

From there, the focus shifts to BGE embeddings. These come in separate English and Chinese variants, with multilingual embeddings in the works but not yet released. On the benchmark referenced (the Massive Text Embedding Benchmark, or MTEB, leaderboard hosted by Hugging Face), BGE models have climbed rapidly in just the past few days. A key comparison is size: the previously favored instructor XL embeddings weigh in at roughly 5 GB, while the BGE base English model is about a tenth of that size. Even the larger BGE variant, at just over 1 GB, is far smaller than instructor XL while using a larger embedding dimension.

The practical demonstration uses the BGE base English embedding model with LangChain and a Chroma vector store. The pipeline stays largely the same, swapping only the embedding backend while keeping a Llama 2 70B model hosted on the Together API for retrieval QA. The transcript specifies cosine similarity by normalizing embeddings (“normalize the embeddings as true”) and then plugs the embedding function into Chroma in the same way as before.
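The reason normalizing matters: once vectors are scaled to unit length, a plain dot product equals cosine similarity, so the vector store’s fast inner-product search behaves like a cosine search. A minimal, stdlib-only illustration of that equivalence (the vectors here are made up, not real embeddings):

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(v, v))

def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

a = [3.0, 4.0]
b = [1.0, 2.0]

# Cosine similarity computed from the raw vectors
cosine = dot(a, b) / (norm(a) * norm(b))

# After normalization, the plain dot product gives the same value
same = dot(normalize(a), normalize(b))

assert math.isclose(cosine, same)
```

This is why setting `normalize_embeddings` to true lets the rest of the retrieval pipeline stay unchanged: only the geometry of the stored vectors differs.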

The operational impact is concrete: embedding roughly 1,000 text chunks drops from minutes with instructor XL to about 35 seconds with BGE base en on a T4. Retrieval quality is described as at least comparable, with the system correctly finding the right outputs in the tested cases. Some answers include odd or inconsistent extra text, suggesting that a second pass (output cleanup) could improve final responses.

Overall, the “big win” is less about a dramatic jump in retrieval accuracy and more about efficiency: smaller models mean faster embedding time, lower memory demands, and the possibility of CPU-based inference. The transcript ends by recommending close attention to BGE’s upcoming multilingual embeddings, anticipating improvements for multilingual RAG use cases.

Cornell Notes

BGE embeddings from the Beijing Academy of Artificial Intelligence (BAAI) are gaining benchmark momentum while cutting model size sharply, which makes RAG systems cheaper and faster to operate. The transcript recommends avoiding long-term reliance on OpenAI embeddings due to vendor lock-in and the likelihood of future deprecations that would force full re-embedding. In a LangChain + Chroma setup, the pipeline swaps only the embedding model while keeping the Llama 2 70B QA model hosted on the Together API. Using BGE base en with cosine similarity (via normalized embeddings) reduces embedding time for ~1,000 texts from minutes to about 35 seconds on a T4, with retrieval results described as comparable. The smaller footprint also lowers RAM/VRAM needs and may allow CPU inference.

Why does the transcript discourage using OpenAI embeddings for long-term production RAG systems?

It highlights two main risks: lock-in and deprecation. If a large corpus is embedded using OpenAI embeddings, switching later means re-embedding everything from scratch—costly in both time and API spend. It also anticipates that OpenAI will eventually deprecate older embedding models as better ones arrive, creating a period where both may work but still requiring a full re-embed once the older model is retired.

What makes BGE embeddings attractive compared with instructor XL embeddings in this setup?

The standout advantage is size. Instructor XL is described as roughly 5 GB, while BGE base English is about a tenth of that size; even the larger BGE variant is just over 1 GB. The transcript ties this directly to operational benefits: faster embedding, lower memory requirements, and the possibility of running inference on CPU rather than relying on heavy GPU resources.

How is the BGE embedding model integrated into the LangChain + Chroma pipeline?

The code structure remains largely the same; the embedding backend is swapped. The transcript uses BGE base en and cosine similarity by normalizing embeddings (“normalize the embeddings as true”). That embedding function is then passed into Chroma similarly to the earlier instructor XL setup, and the retriever is configured to return source documents.
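That wiring can be sketched as follows. This is a reconstruction from the transcript’s description, not verified code: the exact class names and import paths (`HuggingFaceBgeEmbeddings`, the `Together` LLM wrapper) vary across LangChain releases, and running it requires the relevant packages, a model download, and a Together API key.

```python
# Sketch of the described setup; class names/locations may differ by version.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import Together  # needs TOGETHER_API_KEY in the environment
from langchain.schema import Document
from langchain.vectorstores import Chroma

# Normalizing embeddings makes inner-product search equivalent to cosine
# similarity ("normalize the embeddings as true" in the transcript).
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en",
    model_kwargs={"device": "cpu"},  # BGE base is small enough for CPU
    encode_kwargs={"normalize_embeddings": True},
)

# Illustrative document; in the video these are ~1,000 pre-split text chunks.
docs = [Document(page_content="BGE base en is about a tenth the size of instructor XL.")]

db = Chroma.from_documents(docs, embeddings)

qa = RetrievalQA.from_chain_type(
    llm=Together(model="togethercomputer/llama-2-70b-chat"),
    chain_type="stuff",
    retriever=db.as_retriever(),
    return_source_documents=True,  # surface the retrieved chunks with each answer
)
```

Because only the `embeddings` object changes relative to the earlier instructor XL setup, the Chroma and RetrievalQA plumbing is untouched.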

What performance improvement is reported when switching from instructor XL to BGE base en?

Embedding time for about 1,000 text chunks drops from several minutes with instructor XL to roughly 35 seconds with BGE base en on a T4. The transcript also notes reduced memory/VRAM needs, implying faster and lighter inference and indexing.

Does the transcript claim BGE embeddings dramatically improve answer quality?

Not necessarily. It says retrieval quality is not hugely worse or dramatically better—some outputs are comparable to the earlier approach. The bigger win is efficiency (smaller model, faster embedding, lower resource use). It also mentions that the language model sometimes produces strange extra text, suggesting a potential second-pass cleanup step.

What should builders watch for regarding multilingual support?

BGE already has English and Chinese embedding models, and the team is working on a multilingual embedding model that isn’t released yet. The transcript recommends keeping a close watch because multilingual embeddings could improve multilingual retrieval and output quality once available.

Review Questions

  1. What two lifecycle risks does the transcript associate with using OpenAI embeddings in production, and how do they affect long-term maintenance?
  2. In the LangChain + Chroma setup, what specific similarity approach is used with BGE base en, and how is it implemented?
  3. Why does the transcript frame the main advantage of BGE embeddings as efficiency rather than a major jump in retrieval quality?

Key Points

  1. Avoid treating “OpenAI text model” and “OpenAI embeddings” as a coupled choice; embeddings can be swapped independently.
  2. For long-term RAG projects, prioritize open-source embeddings to reduce vendor lock-in and re-embedding risk.
  3. OpenAI embeddings are positioned as useful for quick testing, but production systems face deprecation and full re-embedding costs.
  4. BGE embeddings from the Beijing Academy of Artificial Intelligence (BAAI) have rapidly climbed benchmark rankings while remaining much smaller than instructor XL.
  5. Using BGE base en with LangChain and Chroma can cut embedding time for ~1,000 texts from minutes to about 35 seconds on a T4.
  6. Smaller embedding models reduce RAM/VRAM demands and may make CPU-based inference feasible.
  7. Multilingual BGE models are expected to improve multilingual RAG performance once released.

Highlights

BGE base en is described as about a tenth the size of instructor XL, enabling much faster embedding and lower resource use.
Embedding ~1,000 texts reportedly drops to ~35 seconds on a T4 when switching from instructor XL to BGE base en.
The transcript’s strongest production argument against OpenAI embeddings is lock-in: switching later forces full re-embedding of the corpus.
Retrieval quality is presented as broadly comparable, while the efficiency gains are the main practical improvement.
BGE’s English and Chinese embeddings exist now, with multilingual embeddings expected to land later.