
Augmented Language Models (LLM Bootcamp)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat prompt-context construction as an information-retrieval problem rather than a manual selection task, especially when user-specific data is involved.

Briefing

Augmented language models hinge on a simple constraint: modern LLMs are strong at language and instruction-following, but they lack up-to-date world knowledge, access to a company’s private data, and reliable performance on harder tasks like math-heavy reasoning. The practical fix isn’t retraining the model for every new need—it’s giving the model the right external “tools” and “data” at inference time. The core idea is to treat an LLM less like a self-contained oracle and more like a reasoning engine that needs a curated context to answer real questions.

The starting point is retrieval augmentation, which reframes “stuffing context into the prompt” as an information-retrieval problem. Instead of manually deciding which user or document snippets to include, systems search an external corpus for the most relevant items and then inject those results into the prompt. This matters immediately in multi-user settings: rules like “include the most recent users” or “include users mentioned in the query” break down when the relationship between a question and the right data is too complex to encode with simple logic. Retrieval turns that selection step into something searchable and measurable.

Traditional search relies on inverted indexes and word-level heuristics such as Boolean filtering and ranking methods like BM25. Those approaches work well for exact or unambiguous keyword matches, but they miss semantic meaning—so ambiguous queries can return documents about the wrong sense of a word. Embeddings shift the approach by representing text (and other modalities) as dense vectors so semantically similar items land near each other in vector space. The retrieval pipeline then becomes: embed the query, find nearest neighbors among embedded documents, and pass the top matches into the LLM prompt.
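As a concrete illustration, here is a minimal sketch of that pipeline. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model purely to make the example runnable; the lecture does not prescribe any particular embedding model.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model, not prescribed by the lecture

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Reset your password from the account settings page.",
    "Enterprise plans include priority support and SSO.",
]

def embed(texts: list[str]) -> np.ndarray:
    # Encode and L2-normalize so a plain dot product equals cosine similarity.
    vectors = np.asarray(model.encode(texts))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

doc_vectors = embed(documents)                      # shape: (n_docs, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    scores = doc_vectors @ q                        # cosine similarity against every document
    top = np.argsort(-scores)[:k]                   # indices of the k best matches
    return [documents[i] for i in top]

question = "How do I get my money back?"
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```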

The lecture emphasizes that “embedding databases” are often unnecessary at small scale. For fewer than roughly 100,000 vectors, simple nearest-neighbor search with NumPy can be enough. As scale grows, approximate nearest neighbor methods speed up lookup by using specialized index structures (e.g., HNSW-based approaches) that trade a bit of accuracy for speed. But production systems need more than a fast index: they must handle metadata, filtering, embedding management, and reliable ingestion/update workflows. The practical recommendation is to start with whatever database already powers the application—many support vector search via extensions or built-in features—then graduate to dedicated vector databases when you need richer capabilities.
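Once the corpus outgrows the brute-force regime, an approximate index can be dropped in with a few lines. The sketch below uses hnswlib, one common HNSW implementation (the lecture names the approach rather than a specific library), with random vectors standing in for real document embeddings:

```python
import numpy as np
import hnswlib  # one HNSW implementation; other ANN libraries work similarly

dim, n_docs = 384, 200_000
doc_vectors = np.random.rand(n_docs, dim).astype(np.float32)  # stand-in for real embeddings

# Build the index once; it trades a small amount of recall for much faster queries than brute force.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(n_docs))
index.set_ef(50)  # higher ef: better recall, slower queries

query_vector = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_vector, k=5)  # ids and cosine distances of the 5 nearest documents
```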

Beyond nearest-neighbor retrieval, the biggest limitation is context window size. If only a few documents fit, the system can fail when the answer isn’t retrieved into those top slots. One workaround is “chains”: use an LLM to re-rank or select among a larger candidate set before the final answer prompt. Chains also generalize to other patterns like hypothetical document embeddings (have the model draft a document that might contain the answer, then retrieve against that draft’s embedding), map-reduce-style summarization across large corpora, and multi-step workflows orchestrated with frameworks such as LangChain.
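A minimal sketch of such a re-ranking chain, assuming a hypothetical llm() helper that wraps whatever completion API the application already uses:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper around a completion API (not specified by the lecture)."""
    raise NotImplementedError

def rerank_and_answer(question: str, candidates: list[str], final_k: int = 3) -> str:
    # Step 1: retrieve more candidates than fit in the answer prompt, then ask the
    # model to pick the few that actually matter.
    numbered = "\n".join(f"{i}: {doc}" for i, doc in enumerate(candidates))
    selection = llm(
        f"Question: {question}\n\nDocuments:\n{numbered}\n\n"
        f"Reply with the numbers of the {final_k} most relevant documents, comma-separated."
    )
    chosen = [candidates[int(i)] for i in selection.split(",")[:final_k]]

    # Step 2: answer using only the selected evidence.
    context = "\n\n".join(chosen)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above.")
```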

Finally, tools broaden augmentation beyond search. Rather than only retrieving documents, LLMs can call external systems—calculators, SQL databases, APIs—either through developer-defined chains or through plugin-style interfaces where the model decides when tool use is helpful. The takeaway is a hierarchy of augmentation strategies: start with retrieval and heuristics, move to chains for more complex context-building and token-limit workarounds, and use tools/plugins when the model needs to interact with external capabilities or live data.
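As one illustration of the chain-based variant, the sketch below has the model write a SQL query that the application (not the model) executes before a final answering step. The schema and the llm() helper are placeholders, not details from the lecture:

```python
import sqlite3

def llm(prompt: str) -> str:
    """Hypothetical helper around a completion API."""
    raise NotImplementedError

def answer_with_sql(question: str, db_path: str = "app.db") -> str:
    schema = "users(id, name, signup_date), orders(id, user_id, total, created_at)"  # illustrative only

    # Step 1: the model translates the question into SQL.
    sql = llm(f"Schema: {schema}\nWrite one SQLite query that answers: {question}\nReturn only the SQL.")

    # Step 2: the application executes the query against live data.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()

    # Step 3: the model turns the raw result into a natural-language answer.
    return llm(f"Question: {question}\nQuery result: {rows}\nAnswer concisely.")
```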

Cornell Notes

LLMs often need external help because they don’t have private or up-to-date knowledge and can’t reliably access a user’s data on their own. Retrieval augmentation fixes this by treating “prompt context selection” as an information-retrieval problem: embed the query and retrieve the most similar embedded documents, then insert those snippets into the prompt. Embeddings enable semantic search, while traditional inverted-index search relies on word-level correlations (e.g., BM25). At small scale, simple nearest-neighbor search (even with NumPy) can work; at larger scale, approximate nearest neighbor indices and production-grade retrieval systems become important. When context is too small, “chains” add extra LLM steps (like re-ranking) to build better context, and “tools/plugins” let models call external APIs such as SQL or calculators.

Why does “put more data in the context window” stop working as systems grow?

Context windows are limited, and token cost rises with more included text. In multi-user or large-corpus settings, manually selecting which documents to include becomes brittle—rules like “include the most recent users” or “include users mentioned in the query” fail when the mapping from question to relevant data is hard to encode. Retrieval augmentation addresses this by searching a corpus and selecting the most relevant snippets automatically, but it still faces the hard limit that only a few retrieved items can fit into the prompt.

How does retrieval augmentation differ from traditional keyword search?

Traditional search typically uses an inverted index: it flags documents containing query terms and ranks them using heuristics like BM25, often after Boolean filtering. This captures statistical word overlap but not deep semantics, so ambiguous queries can return irrelevant senses of a word. Embedding-based retrieval represents text as dense vectors so semantically similar items are close in vector space; the system then performs nearest-neighbor search to find relevant documents for the LLM context.
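The difference is easy to see with a word-overlap ranker. The snippet below uses the rank_bm25 package as one BM25 implementation (the lecture names the algorithm, not a library); on an ambiguous query, it can only score documents by term statistics, not by which sense of the word is meant:

```python
from rank_bm25 import BM25Okapi  # one BM25 implementation; an assumed choice, not named in the lecture

corpus = [
    "the jaguar is a large cat native to the americas",
    "the jaguar e-type is a classic british sports car",
    "leopards and jaguars are often confused with each other",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Both senses of "jaguar" contain the query term, so the ranking comes down to
# term frequency and document length rather than meaning.
scores = bm25.get_scores("jaguar top speed".split())
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")
```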

What makes an embedding “good,” and how is it evaluated?

A key criterion is downstream utility: the embedding should improve performance on the specific retrieval task you care about. General-purpose embeddings may not be optimal for a particular domain, so benchmarking on relevant tasks matters. Another desired property is geometric: similar concepts should be near each other (e.g., “coffee” close to “tea”), while unrelated concepts should be far apart (e.g., “ball” far from “crocodile”).
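The geometric property is straightforward to spot-check. Assuming the same kind of off-the-shelf sentence-embedding model as above (an assumption, not a lecture requirement):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed model; any text embedder works the same way

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: str, b: str) -> float:
    va, vb = model.encode([a, b])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cosine("coffee", "tea"))        # related concepts: expect a relatively high score
print(cosine("ball", "crocodile"))    # unrelated concepts: expect a noticeably lower score
```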

When is a dedicated vector database necessary versus simple nearest-neighbor search?

For fewer than about 100,000 vectors, the lecture suggests you may not notice much difference versus simple approaches like storing embeddings in an array and doing dot products in NumPy. Dedicated vector databases become more valuable at larger scale due to speed and operational features. Even then, production needs more than an index: metadata handling, filtering, embedding/version management, and reliable ingestion/update workflows matter.

What problem do “chains” solve that pure retrieval can’t?

Pure retrieval depends on the top-k results fitting into the context window. If the answer isn’t among those top retrieved documents, the LLM can’t use it. Chains address this by retrieving more candidates than can fit, then using additional LLM calls to re-rank or select the best subset for the final prompt. This increases latency and cost but improves the odds that the prompt contains the right evidence.

How do tools/plugins extend augmentation beyond retrieval?

Tools let an LLM interact with external systems (e.g., a Python interpreter for math, an archive/search API, or a SQL database). In chain-based tool use, the developer specifies the sequence of LLM calls and tool calls. In plugin-style tool use, the model decides whether to invoke a tool based on an API specification and description included in its context, then continues generation after receiving the tool output.

Review Questions

  1. What trade-offs arise when you retrieve more documents than can fit into the context window, and then use an LLM to select among them?
  2. Describe how embeddings change the retrieval problem compared with inverted-index search, and give one failure mode each approach can have.
  3. Why might production retrieval require more than just an approximate nearest-neighbor index? Name at least two operational needs mentioned.

Key Points

  1. Treat prompt-context construction as an information-retrieval problem rather than a manual selection task, especially when user-specific data is involved.
  2. Embedding-based retrieval enables semantic matching by mapping text (or other modalities) into dense vectors where similar meaning is geometrically close.
  3. For small corpora (roughly under 100,000 vectors), simple nearest-neighbor search with tools like NumPy can be sufficient; dedicated vector databases become more important at scale.
  4. Production retrieval systems need more than fast vector lookup: they must support metadata, filtering, ingestion/update reliability, and embedding management/versioning.
  5. When context windows are too small, “chains” add extra LLM steps (e.g., re-ranking) to ensure the final prompt contains the most relevant evidence.
  6. “Tools” and “plugins” expand augmentation beyond search by letting LLMs call external capabilities like SQL, calculators, and APIs, either via developer-orchestrated chains or model-chosen plugin calls.

Highlights

Retrieval augmentation reframes “which documents go into the prompt?” as a searchable ranking problem, avoiding brittle hand-written context rules.
Embedding search replaces word-overlap heuristics with semantic similarity in vector space, enabling better handling of ambiguous queries.
Under ~100,000 vectors, nearest-neighbor search can be done simply; the bigger leap comes when production needs metadata, filtering, and operational reliability.
Chains help when the answer isn’t in the top retrieved items by adding an LLM re-ranking/selection step over a larger candidate set.
Tools/plugins let LLMs interact with external systems (like SQL) so answers can come from live computation or data, not just retrieved text.

Topics

Mentioned

  • LLM
  • API
  • GPT
  • BM25
  • ANN
  • HNSW
  • hnswlib
  • MTEB
  • SQL
  • pgvector
  • MP3
  • MP4
  • GPT-3.5
  • GPT-4