
Build 100% Local Advanced RAG System for Financial PDFs with Qwen 3.5 | Docling, LangGraph & Ollama

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The architecture runs fully locally: Streamlit UI, FastAPI back end, local inference via Ollama, and local Postgres storage with pgvector.

Briefing

A fully local “advanced RAG” stack for financial PDFs can be built end-to-end—PDF upload, parsing into citation-ready chunks, hybrid retrieval, reranking, and streamed answers—without relying on external services. The system is designed to keep every step running on a single machine, store parsed content in a local Postgres vector database, and return responses with traceable sources pulled directly from the document chunks.

At the core is a layered architecture: a Streamlit front end for uploading PDFs and chatting, a FastAPI back end that exposes REST endpoints, and a retrieval workflow that turns user questions into grounded answers. The FastAPI server acts as a facade between the UI and the data/inference layers, offering three key endpoints: one to ingest a PDF (returning a task ID and ingestion status), one to check that status, and one chat endpoint that streams results back to the client using server-sent events.

Ingestion is handled by a Docling-based pipeline that converts PDFs into markdown, including table structure detection and optional image description. The pipeline then splits the markdown into chunks by document structure (title/section/subsection). If a chunk exceeds the token limit (1,024 in this implementation), it is reduced to fit. Each chunk is also enriched with an additional LLM call that adds context before storage. The system saves both the full parsed markdown and the derived chunks into Supabase's Postgres database using pgvector, with separate tables for documents and chunks. This design ensures chunking and enrichment happen once at ingest time, so later queries can focus on retrieval and generation.
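The structure-aware chunking step can be sketched with the standard library; a whitespace word count stands in for the real tokenizer, and the heading regex is an assumption about Docling's markdown output:

```python
import re

MAX_TOKENS = 1024  # the limit used in the video's implementation

def split_by_structure(markdown: str) -> list[str]:
    """Split markdown into chunks at title/section/subsection headings (#, ##, ###)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,3} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def enforce_token_limit(chunk: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Reduce an oversized chunk by regrouping its paragraphs under the budget.
    A single paragraph larger than the budget would need further splitting
    in a real implementation."""
    if len(chunk.split()) <= max_tokens:
        return [chunk]
    pieces, current, count = [], [], 0
    for para in chunk.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_tokens:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Because this runs once at ingest time, query-time latency only pays for retrieval and generation.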

For answering questions, the retrieval layer uses query expansion plus hybrid search: an “expanded query” is generated (using a Gemma 1B model in the example) to improve embedding-based matching, while hybrid retrieval combines full-text search with vector similarity to find relevant chunks. A reranker then reorders candidates using the Flashrank library, and a LangGraph workflow orchestrates the full sequence—expansion, hybrid retrieval, reranking, and final generation. The inference layer runs locally via Ollama (with an embedding model and an LLM served through the local setup), and the chat endpoint streams token-by-token output along with pipeline stage updates and source citations.

The transcript demonstrates the system with financial filings from Nvidia and Apple. After uploading, ingestion reports show chunk counts (e.g., 11 chunks for one filing) and the stored markdown and chunk records in Supabase. Example questions show the workflow distinguishing between multiple uploaded documents and returning specific figures with citations. One query about Nvidia data center revenue growth yields quarter-over-quarter and year-over-year growth percentages, with the numbers matching the underlying table values. A second query asks for Apple cash and cash equivalents for December 27, 2025, and the system returns the exact total (45,317 million) with sources pointing to the relevant table chunks.

Overall, the build emphasizes practical deployment: everything runs locally via a root-level docker-compose command, the database uses pgvector for embeddings, and the API streams structured events so the UI can display both the answer and its retrieval trace.

Cornell Notes

The system builds a fully local RAG pipeline for financial PDFs: upload documents, convert them to markdown, split into structured chunks, enrich each chunk with an LLM, and store both markdown and chunks in a Postgres database using pgvector. For questions, it expands the query, performs hybrid retrieval (full-text + embeddings), reranks candidates with Flashrank, and uses a LangGraph workflow to generate an answer grounded in retrieved chunks. A FastAPI back end exposes ingestion and chat endpoints, streaming results to a Streamlit UI via server-sent events. The approach matters because it produces citation-ready answers from financial tables while keeping processing and storage offline and reproducible on a single machine.

How does the ingestion pipeline turn a financial PDF into citation-ready sources?

It converts PDFs to markdown using Docling with PDF pipeline options that detect table structures and can describe images when requested. The markdown is then chunked by document structure (title/section/subsection). If a chunk exceeds a token limit (1,024 in the example), the chunk is reduced. Each chunk is enriched using an LLM prompt to add extra context, and then both the full markdown and the enriched chunks are stored in Supabase (Postgres with pgvector). Later answers cite these stored chunks.
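The enrichment step can be sketched as follows; the prompt wording and the `llm` callable are assumptions, since the video only shows that an extra local LLM call adds context per chunk:

```python
# Hypothetical prompt; the exact wording used in the video is not shown.
ENRICH_PROMPT = (
    "Document title: {title}\n\n"
    "Chunk:\n{chunk}\n\n"
    "Write one short sentence of context situating this chunk within the document."
)

def enrich_chunk(title: str, chunk: str, llm) -> str:
    """Prepend LLM-generated context to a chunk before storage.

    `llm` is any callable prompt -> str; the real build invokes a local
    model served by Ollama here.
    """
    context = llm(ENRICH_PROMPT.format(title=title, chunk=chunk)).strip()
    return f"{context}\n\n{chunk}"
```

Storing the enriched text (rather than enriching at query time) is what keeps the per-query cost down to retrieval plus one generation.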

What makes retrieval “hybrid” in this setup, and why does query expansion matter?

Hybrid retrieval combines full-text search with embedding-based similarity to find relevant chunks. Before searching, the system expands the user query with a Gemma 1B model, a small local model that is adequate for the expansion step. The expanded query improves embedding matching and helps the hybrid search surface the right table sections even when the user's phrasing is more general than the table labels.
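The video does not spell out how the full-text and vector result lists are merged; reciprocal rank fusion is a common choice and makes a compact illustration of the combining step:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings of chunk IDs into one.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks ranked well by both full-text and vector search rise to the top.
    k=60 is the conventional damping constant from the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that full-text search ranks third and vector search ranks first beats one that only a single method finds, which is exactly the behavior wanted for table lookups where keywords and semantics disagree.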

How does reranking improve answer grounding?

After hybrid retrieval returns a small set of candidate chunks (three in the example trace), Flashrank reranks them to prioritize the most relevant passages. The transcript notes that even when candidates are already highly relevant, reranking still helps ensure the final context used for generation is the best match to the question.
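Conceptually, reranking reorders candidates by a query-conditioned relevance score. A stdlib sketch with a token-overlap scorer standing in for Flashrank's cross-encoder:

```python
from typing import Callable, Optional

def rerank(query: str, passages: list[str],
           score_fn: Optional[Callable[[str], float]] = None) -> list[str]:
    """Reorder candidate chunks by relevance to the query.

    `score_fn` stands in for Flashrank's cross-encoder scoring; the default
    token-overlap heuristic is only for illustration.
    """
    if score_fn is None:
        query_tokens = set(query.lower().split())
        score_fn = lambda text: len(query_tokens & set(text.lower().split()))
    return sorted(passages, key=score_fn, reverse=True)
```

The gain over raw retrieval order is that the scorer sees query and passage together, so near-duplicate table chunks with slightly different labels get separated correctly.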

What does the chat endpoint stream back to the UI?

The FastAPI chat endpoint returns a streaming response using server-sent events. The stream includes token messages for incremental text generation, pipeline-stage messages that show progress through the workflow, and source/citation events that carry the retrieved chunks used for citations. The UI iterates over these events and assembles the final answer while collecting sources for traceability.
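On the client side, that consumption loop can be sketched with the standard library; the three-way event shape (`"token"`, `"stage"`, `"sources"`) is an assumption about the payload format:

```python
import json

def parse_sse(stream_text: str) -> list[dict]:
    """Parse a server-sent-events body into JSON event payloads."""
    events = []
    for block in stream_text.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

def assemble(events: list[dict]) -> tuple[str, list[str], list[str]]:
    """Accumulate tokens into the answer while collecting stages and sources."""
    answer, stages, sources = [], [], []
    for event in events:
        if event["type"] == "token":
            answer.append(event["content"])
        elif event["type"] == "stage":
            stages.append(event["name"])
        elif event["type"] == "sources":
            sources.extend(event["chunks"])
    return "".join(answer), stages, sources
```

In the real app the stream is consumed incrementally (e.g. via HTTPX) so tokens render as they arrive; parsing the whole body at once here just keeps the sketch short.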

How does the system support multiple documents without mixing context?

Each chunk is stored with a document ID reference tied to the original PDF. During retrieval, the hybrid search and reranking operate over the indexed chunks, and the workflow selects the chunks most relevant to the current query. In the demo, questions about Nvidia and Apple return answers sourced from the correct filing, with sources pointing to chunks from the appropriate document.
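The per-document scoping rests on the chunks table referencing its parent document. An illustrative schema held as DDL strings; the column names and the 768-dimension embedding are assumptions, since the video only shows that documents and chunks live in separate pgvector-enabled tables:

```python
# Hypothetical DDL mirroring the separate documents/chunks tables shown in the video.
DOCUMENTS_DDL = """
CREATE TABLE IF NOT EXISTS documents (
    id       UUID PRIMARY KEY,
    filename TEXT NOT NULL,
    markdown TEXT NOT NULL
);
"""

CHUNKS_DDL = """
CREATE TABLE IF NOT EXISTS chunks (
    id          UUID PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id),
    content     TEXT NOT NULL,
    embedding   VECTOR(768)
);
"""
```

Because every chunk row carries `document_id`, each retrieved source traces back to the exact PDF it came from, which is how the demo keeps Nvidia and Apple answers from cross-contaminating.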

Review Questions

  1. What steps occur only during ingestion (and why is that important for performance and citation accuracy)?
  2. Describe the order of operations in the retrieval workflow from query expansion to reranking to generation.
  3. How do full-text search and embeddings work together in the hybrid retrieval approach?

Key Points

  1. The architecture runs fully locally: Streamlit UI, FastAPI back end, local inference via Ollama, and local Postgres storage with pgvector.
  2. Ingestion converts PDFs to markdown with Docling (including table structure detection) and optionally adds image descriptions.
  3. Chunking is structure-aware (title/section/subsection) and enforces a token limit (1,024) by reducing oversized chunks.
  4. Each chunk is enriched with an LLM prompt before being stored, so later queries reuse precomputed context.
  5. Hybrid retrieval combines full-text search with embedding similarity, improving recall for financial-table queries.
  6. Flashrank reranks retrieved chunks, and LangGraph orchestrates expansion → retrieval → reranking → grounded generation.
  7. The chat endpoint streams both generated tokens and citation sources via server-sent events for traceable answers.

Highlights

A single-machine RAG stack can ingest financial PDFs, store citation-ready chunks in pgvector, and answer questions with streamed, source-backed results.
Query expansion (Gemma 1B) plus hybrid retrieval (full-text + embeddings) helps the system match user questions to the right table sections.
Flashrank reranking and LangGraph orchestration produce grounded answers that match exact table values in the demo (e.g., Apple cash and cash equivalents for December 27, 2025).

Topics

  • Local RAG
  • Financial PDFs
  • Docling Parsing
  • Hybrid Retrieval
  • LangGraph Workflow

Mentioned

  • RAG
  • API
  • PDF
  • UI
  • LLM
  • OCR
  • pgvector
  • SSE
  • FastAPI
  • HTTPX
  • Qwen
  • Ollama
  • LangGraph
  • Docling
  • Flashrank
  • Gemma
  • LM
  • VM
  • SQL
  • JSON
  • CPU