Build 100% Local Advanced RAG System for Financial PDFs with Qwen 3.5 | Docling, LangGraph & Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fully local “advanced RAG” stack for financial PDFs can be built end-to-end—PDF upload, parsing into citation-ready chunks, hybrid retrieval, reranking, and streamed answers—without relying on external services. The system is designed to keep every step running on a single machine, store parsed content in a local Postgres vector database, and return responses with traceable sources pulled directly from the document chunks.
At the core is a layered architecture: a Streamlit front end for uploading PDFs and chatting, a FastAPI back end that exposes REST endpoints, and a retrieval workflow that turns user questions into grounded answers. The FastAPI server acts as a facade between the UI and the data/inference layers, offering three key endpoints: one to ingest a PDF (returning a task ID and ingestion status), one to check that status, and one chat endpoint that streams results back to the client using server-sent events.
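To make that facade concrete, here is a minimal FastAPI sketch of the three endpoints; the route paths, request model, and in-memory task store are illustrative assumptions rather than the video's actual code.

```python
# Minimal sketch of the three-endpoint FastAPI facade. Route paths, request
# models, and the in-memory task store are illustrative assumptions.
import uuid

from fastapi import FastAPI, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
tasks: dict[str, str] = {}  # task_id -> ingestion status (stand-in for a real store)

class ChatRequest(BaseModel):
    question: str

@app.post("/documents")
async def ingest_pdf(file: UploadFile):
    """Accept a PDF and kick off ingestion, returning a task ID and status."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = "processing"
    # A background worker would parse, chunk, enrich, and store the PDF here.
    return {"task_id": task_id, "status": tasks[task_id]}

@app.get("/documents/{task_id}")
async def ingestion_status(task_id: str):
    """Report whether ingestion for the given task has finished."""
    return {"task_id": task_id, "status": tasks.get(task_id, "unknown")}

@app.post("/chat")
async def chat(req: ChatRequest):
    """Stream the answer back as server-sent events."""
    async def event_stream():
        # The real system interleaves pipeline stage updates, generated
        # tokens, and source citations as separate events.
        for token in ("Answer ", "streamed ", "token ", "by ", "token."):
            yield f"data: {token}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```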
Ingestion is handled by a Docling-based pipeline that converts PDFs into markdown, including table structure detection and optional image description. The pipeline then splits the markdown into chunks along the document structure (title/section/subsection), and any chunk that exceeds the token limit (1,024 in this implementation) is reduced to fit. Each chunk is also enriched with an additional LLM call that adds context before storage. The system saves both the full parsed markdown and the derived chunks into the locally hosted Supabase Postgres database using pgvector, with separate tables for documents and chunks. This design ensures chunking and enrichment happen once at ingest time, so later queries can focus on retrieval and generation.
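A simplified sketch of that ingest step follows, assuming Docling's DocumentConverter for the PDF-to-markdown conversion; the heading-based splitter, the word-count token approximation, the enrichment prompt, and the Ollama model tag are stand-ins for the video's actual implementation.

```python
# Simplified sketch of the ingest step: Docling converts the PDF to markdown,
# the markdown is split on headings, oversized chunks are reduced to roughly
# 1,024 tokens, and each chunk is enriched with a short LLM-generated context.
# The splitter, the word-count token proxy, and the enrichment prompt are
# stand-ins for the video's implementation; the Ollama model tag is an example.
import ollama
from docling.document_converter import DocumentConverter

MAX_TOKENS = 1024

def pdf_to_markdown(path: str) -> str:
    # Docling handles layout and table structure detection during conversion.
    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()

def split_by_structure(markdown: str) -> list[str]:
    # Start a new chunk at every markdown heading (title/section/subsection).
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    # Reduce oversized chunks, approximating tokens by whitespace-separated words.
    capped = []
    for chunk in chunks:
        words = chunk.split()
        while words:
            capped.append(" ".join(words[:MAX_TOKENS]))
            words = words[MAX_TOKENS:]
    return capped

def enrich_chunk(chunk: str, doc_title: str) -> str:
    # One extra LLM call per chunk to add document-level context before storage.
    resp = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"In one sentence, situate this excerpt of '{doc_title}':\n\n{chunk}",
    }])
    return resp["message"]["content"] + "\n\n" + chunk
```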
For answering questions, the retrieval layer uses query expansion plus hybrid search: an “expanded query” is generated (using a Gemma 1B model in the example) to improve embedding-based matching, while hybrid retrieval combines full-text search with vector similarity to find relevant chunks. A reranker then reorders candidates using the Flashrank library, and a LangGraph workflow orchestrates the full sequence: expansion, hybrid retrieval, reranking, and final generation. The inference layer runs locally via Ollama, which serves both the embedding model and the LLM, and the chat endpoint streams token-by-token output along with pipeline stage updates and source citations.
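The orchestration could look roughly like the following LangGraph sketch; the node names, Ollama model tags, and the hard-coded retrieval stub are assumptions, and the real retrieval node would query Postgres/pgvector instead.

```python
# Rough LangGraph sketch of the workflow: expansion -> hybrid retrieval ->
# Flashrank reranking -> grounded generation via Ollama. Node names, model
# tags, and the retrieval stub are assumptions; the real retrieval node
# queries Postgres/pgvector instead of returning hard-coded chunks.
from typing import TypedDict

import ollama
from flashrank import Ranker, RerankRequest
from langgraph.graph import END, START, StateGraph

class RAGState(TypedDict):
    question: str
    expanded_query: str
    chunks: list[str]
    answer: str

def expand_query(state: RAGState) -> dict:
    resp = ollama.chat(model="gemma3:1b", messages=[{
        "role": "user",
        "content": f"Rewrite this question to maximize retrieval recall: {state['question']}",
    }])
    return {"expanded_query": resp["message"]["content"]}

def hybrid_retrieve(state: RAGState) -> dict:
    # Hypothetical stand-in for the full-text + pgvector query against Postgres.
    return {"chunks": ["data center revenue table chunk", "cash equivalents table chunk"]}

def rerank(state: RAGState) -> dict:
    ranker = Ranker()  # Flashrank's default lightweight reranking model
    request = RerankRequest(
        query=state["question"],
        passages=[{"id": i, "text": c} for i, c in enumerate(state["chunks"])],
    )
    return {"chunks": [r["text"] for r in ranker.rerank(request)[:5]]}

def generate(state: RAGState) -> dict:
    context = "\n\n".join(state["chunks"])
    resp = ollama.chat(model="qwen3:8b", messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {state['question']}",
    }])
    return {"answer": resp["message"]["content"]}

graph = StateGraph(RAGState)
graph.add_node("expand", expand_query)
graph.add_node("retrieve", hybrid_retrieve)
graph.add_node("rerank", rerank)
graph.add_node("generate", generate)
graph.add_edge(START, "expand")
graph.add_edge("expand", "retrieve")
graph.add_edge("retrieve", "rerank")
graph.add_edge("rerank", "generate")
graph.add_edge("generate", END)
workflow = graph.compile()

# Example: workflow.invoke({"question": "How much did data center revenue grow?"})
```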
The transcript demonstrates the system with financial filings from Nvidia and Apple. After uploading, ingestion reports show chunk counts (e.g., 11 chunks for one filing) and the stored markdown and chunk records in Supabase. Example questions show the workflow distinguishing between multiple uploaded documents and returning specific figures with citations. One query about Nvidia data center revenue growth yields quarter-over-quarter and year-over-year growth percentages, with the numbers matching the underlying table values. A second query asks for Apple cash and cash equivalents for December 27, 2025, and the system returns the exact total (45,317 million) with sources pointing to the relevant table chunks.
Overall, the build emphasizes practical deployment: everything runs locally via a root-level docker-compose command, the database uses pgvector for embeddings, and the API streams structured events so the UI can display both the answer and its retrieval trace.
Cornell Notes
The system builds a fully local RAG pipeline for financial PDFs: upload documents, convert them to markdown, split into structured chunks, enrich each chunk with an LLM, and store both markdown and chunks in a Postgres database using pgvector. For questions, it expands the query, performs hybrid retrieval (full-text + embeddings), reranks candidates with Flashrank, and uses a LangGraph workflow to generate an answer grounded in retrieved chunks. A FastAPI back end exposes ingestion and chat endpoints, streaming results to a Streamlit UI via server-sent events. The approach matters because it produces citation-ready answers from financial tables while keeping processing and storage offline and reproducible on a single machine.
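For the hybrid step specifically, one way to blend full-text and vector scores in a single Postgres query is sketched below; the table and column names (chunks, content, embedding), the equal score weights, and the embedding model tag are assumptions rather than the project's actual schema.

```python
# Hedged sketch of the hybrid retrieval query: full-text rank and pgvector
# cosine similarity computed in one statement and blended with equal weights.
# Table/column names (chunks, content, embedding), the weights, and the
# embedding model tag are assumptions, not the project's actual schema.
import ollama
import psycopg

HYBRID_SQL = """
SELECT content,
       0.5 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', %(q)s))
     + 0.5 * (1 - (embedding <=> %(qvec)s::vector)) AS score
FROM chunks
ORDER BY score DESC
LIMIT 20;
"""

def hybrid_search(conn: psycopg.Connection, query: str) -> list[str]:
    # Embed the (expanded) query locally via Ollama, then format for pgvector.
    emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    qvec = "[" + ",".join(str(x) for x in emb) + "]"
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {"q": query, "qvec": qvec})
        return [row[0] for row in cur.fetchall()]
```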
How does the ingestion pipeline turn a financial PDF into citation-ready sources?
What makes retrieval “hybrid” in this setup, and why does query expansion matter?
How does reranking improve answer grounding?
What does the chat endpoint stream back to the UI?
How does the system support multiple documents without mixing context?
Review Questions
- What steps occur only during ingestion (and why is that important for performance and citation accuracy)?
- Describe the order of operations in the retrieval workflow from query expansion to reranking to generation.
- How do full-text search and embeddings work together in the hybrid retrieval approach?
Key Points
1. The architecture runs fully locally: Streamlit UI, FastAPI back end, local inference via Ollama, and local Postgres storage with pgvector.
2. Ingestion converts PDFs to markdown with Docling (including table structure detection) and optionally adds image descriptions.
3. Chunking is structure-aware (title/section/subsection) and enforces a token limit (1,024) by reducing oversized chunks.
4. Each chunk is enriched with an LLM prompt before being stored, so later queries reuse precomputed context.
5. Hybrid retrieval combines full-text search with embedding similarity, improving recall for financial-table queries.
6. Flashrank reranks retrieved chunks, and LangGraph orchestrates expansion → retrieval → reranking → grounded generation.
7. The chat endpoint streams both generated tokens and citation sources via server-sent events for traceable answers.
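On that last point, a client such as the Streamlit UI might consume the chat endpoint's server-sent events along these lines; the URL and the event payload format are assumptions.

```python
# Hedged sketch of a client (e.g. the Streamlit UI) consuming the chat
# endpoint's server-sent events; the URL and event payload format are
# assumptions. Each event may carry a stage update, a token, or a citation.
import httpx

def stream_chat(question: str, url: str = "http://localhost:8000/chat") -> None:
    with httpx.stream("POST", url, json={"question": question}, timeout=None) as resp:
        for line in resp.iter_lines():
            if line.startswith("data: "):
                # Render the token, pipeline stage, or source citation in the UI.
                print(line.removeprefix("data: "))
```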