Advanced RAG with Llama 3 in Langchain | Chat with PDF using Free Embeddings, Reranker & LlamaParse
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Building a high-quality “chat with your PDF” system hinges less on the language model and more on the pipeline around it: parsing complex documents into clean text, chunking and embedding that text, retrieving the most relevant chunks, reranking them, and only then prompting the LLM with tightly scoped context. The workflow demonstrated here uses open models and LangChain to create an advanced RAG stack that can answer questions about a complex financial PDF (Meta’s first-quarter earnings results) with improved retrieval ordering via a dedicated reranker.
The architecture is organized into three main components: a knowledge base, a ranker, and a chat language model (Llama 3 70B accessed through the Groq API). The knowledge base starts by parsing the PDF into structured, readable markdown. For this, the setup uses LlamaParse, chosen specifically for messy, table-heavy financial documents where formatting and nested structures (tables within tables, bullets, intricate layouts) often break simpler parsers. After parsing, the markdown is split into overlapping text chunks using a recursive character splitter (2,048 characters per chunk with 128 characters of overlap). Each chunk is then embedded into vectors using FastEmbed embeddings, and those vectors are stored in a local Qdrant vector database.
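A minimal sketch of this knowledge-base stage in Python, assuming a LLAMA_CLOUD_API_KEY is set in the environment and the earnings report is saved locally as meta-earnings.pdf; the file name, embedding model, storage path, and collection name are illustrative, not taken verbatim from the video:

```python
from llama_parse import LlamaParse
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Qdrant

# 1. Parse the table-heavy PDF into structured markdown.
parser = LlamaParse(result_type="markdown")  # reads LLAMA_CLOUD_API_KEY from the environment
parsed_docs = parser.load_data("meta-earnings.pdf")
markdown_text = "\n\n".join(doc.text for doc in parsed_docs)

# 2. Split the markdown into overlapping chunks (2,048 chars, 128 overlap).
splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=128)
chunks = splitter.create_documents([markdown_text])

# 3. Embed each chunk with FastEmbed and persist the vectors in a local Qdrant DB.
embeddings = FastEmbedEmbeddings()  # default FastEmbed model; swap in another if needed
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    path="./qdrant_db",              # on-disk Qdrant storage, no server required
    collection_name="meta_earnings",
)
```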
Retrieval happens in two stages. First, Qdrant performs similarity search against the user’s query to fetch the top candidates (the example uses top-k retrieval of five). This stage returns documents with similarity scores, but the retrieved ordering isn’t always optimal for downstream answering. To fix that, the system adds a reranker using Flashrank (via the Flashrank wrapper around the base retriever). The reranker performs pairwise ranking to filter out irrelevant chunks and move the most relevant ones to the top. In the example run, reranking adds noticeable latency (around a few seconds), but it also produces a more useful set of context documents for the LLM.
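The two stages might be wired together roughly as follows, reusing the vectorstore from the previous sketch; the exact retriever settings and Flashrank defaults used in the video may differ:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Stage 1: plain similarity search against Qdrant, keeping the top 5 candidates.
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Stage 2: wrap the base retriever so Flashrank reranks the candidates (pairwise)
# and drops the least relevant ones before they reach the LLM.
retriever = ContextualCompressionRetriever(
    base_compressor=FlashrankRerank(),
    base_retriever=base_retriever,
)

reranked_docs = retriever.invoke("What was Meta's revenue in the first quarter of 2024?")
```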
Finally, the reranked documents and the user question are combined into a prompt for the LLM (Llama 3 70B). The prompt is designed to use the provided context and reduce hallucinations—explicitly instructing the model to answer only when helpful information is present and to avoid inventing details. The chain is implemented as a LangChain question-answering flow with a “stuff” strategy, meaning the retrieved context is inserted into the prompt as a single block.
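A sketch of this answering step, assuming a GROQ_API_KEY is set; the prompt wording and the Groq model id are illustrative, not quoted from the original build:

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq

# Prompt that scopes the model to the provided context and discourages invention.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""Use only the following context to answer the question.
If the context does not contain helpful information, say you don't know
instead of inventing details.

Context:
{context}

Question: {question}
Answer:""",
)

llm = ChatGroq(model_name="llama3-70b-8192", temperature=0)

# "stuff" inserts all reranked documents into the prompt as a single block.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)

response = qa_chain.invoke("How much did revenue grow year over year?")
print(response["result"])
```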
The system’s behavior is tested with concrete questions about Meta’s earnings PDF. It correctly identifies items like the “most significant innovation” (tied to the new version of Meta AI built with Llama 3, mentioned in a Zuckerberg quote), and it extracts numeric values from tables such as first-quarter 2024 revenue ($36,455 million) and first-quarter 2023 revenue ($28,645 million), including the percentage year-over-year change. It also handles arithmetic questions derived from the tables, though not perfectly: one revenue-minus-cost calculation for 2023 is shown to be wrong, illustrating that even with strong retrieval and reranking, the LLM can still make mistakes when performing multi-step computations.
Overall, the build demonstrates that advanced RAG quality comes from disciplined document-to-vector preparation and retrieval refinement (parser + chunking + embeddings + Qdrant + reranker), while the LLM’s prompt can only partially mitigate errors—especially for calculations—without additional safeguards.
Cornell Notes
The pipeline for “chat with a PDF” is built around three parts: a knowledge base that turns PDFs into chunked embeddings, a reranker that improves which chunks reach the LLM, and an LLM (Llama 3 70B via the Groq API) that answers using the retrieved context. The knowledge base uses LlamaParse to convert a complex, table-heavy financial PDF into structured markdown, then splits it into overlapping chunks (2,048 chars with 128 overlap), embeds them with FastEmbed, and stores vectors in Qdrant. Retrieval starts with Qdrant similarity search (top-k candidates), then Flashrank reranks those candidates using pairwise ranking to push the most relevant chunks to the top. In tests on Meta earnings content, the system extracts table values accurately and answers quote-based questions well, but it can still make arithmetic mistakes, showing that retrieval quality doesn’t fully eliminate LLM errors.
Why does the build emphasize LlamaParse for financial PDFs instead of relying on generic PDF text extraction?
How do chunking choices affect retrieval quality in this RAG setup?
What role does Qdrant play, and what does “top five” retrieval mean here?
What does the reranker change compared with raw vector similarity search?
How does the prompt design try to reduce hallucinations, and what limitation remains?
What does the “stuff” chain type imply for how context is fed to the model?
Review Questions
- If the parser produced poorly formatted markdown from a table-heavy PDF, which downstream steps would likely degrade first: chunking, embeddings, retrieval, or reranking—and why?
- In this setup, where does the system spend most of its time: embeddings, vector search, reranking, or LLM inference? Identify the approximate timings mentioned and what each stage accomplishes.
- Why can a RAG system answer quote-based questions well yet still fail on arithmetic derived from extracted table values?
Key Points
1. Use LlamaParse to convert complex, table-heavy PDFs into structured markdown before any embedding or retrieval happens.
2. Split parsed markdown into overlapping chunks (2,048 characters with 128 overlap) to preserve context across boundaries.
3. Store chunk embeddings in Qdrant and start retrieval with similarity search (e.g., top-k candidates) to narrow the search space.
4. Add a reranker (Flashrank) after vector retrieval to reorder and filter candidates using pairwise ranking for better context quality.
5. Feed the reranked context plus the user question into Llama 3 70B via the Groq API using prompt instructions that discourage hallucinations.
6. Expect remaining failure modes for multi-step computations even when retrieval and reranking are strong; consider adding validation or calculator-style checks if accuracy is critical (a minimal sketch of such a check follows).
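One hypothetical way to add such a check, reusing the qa_chain sketched in the briefing: ask the model only to extract the raw figures, then perform the subtraction locally. The helper name, question wording, and regex are illustrative additions, not part of the original build:

```python
import re

def extract_millions(answer: str) -> list[float]:
    """Pull comma-grouped numbers like '36,455' out of an LLM answer as floats."""
    return [float(m.replace(",", "")) for m in re.findall(r"\d{1,3}(?:,\d{3})+", answer)]

raw = qa_chain.invoke("List Q1 2023 revenue and Q1 2023 total costs and expenses, numbers only.")
figures = extract_millions(raw["result"])
if len(figures) == 2:
    revenue, costs = figures
    # Do the arithmetic in Python instead of trusting the LLM's multi-step math.
    print(f"Income from operations (computed locally): {revenue - costs:,.0f} million")
```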