Vectorless RAG - Local Financial RAG Without Vector Database | Tree-Based Indexing with Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Vectorless RAG replaces embedding + vector search with a document-structure tree index built from markdown hierarchy and LLM-generated node summaries.
Briefing
Vectorless RAG can retrieve and answer questions from structured documents without any vector database by building a tree index from the document’s own hierarchy and letting a local LLM choose which sections to use. Instead of embedding chunks and running similarity search, the system splits a document (e.g., a financial PDF converted to markdown), generates short summaries for leaf sections, then rolls those summaries up into a JSON tree. At query time, the model receives the tree’s node titles and summaries and performs “tree search” to select the most relevant nodes, using their underlying text as grounded context for the final answer.
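To make the query-time step concrete, here is a minimal sketch of what the selection call could look like, assuming the `ollama` Python client, a local `gemma3:4b` model tag, and a flattened list of node entries; the prompt wording and the helper names (`NODE_INDEX`, `select_nodes`) are illustrative, not the video's actual code.

```python
import json

import ollama  # assumes a local Ollama daemon with the model pulled

# Hypothetical flattened view of the tree index: (node_id, title, summary).
NODE_INDEX = [
    ("00001", "Financial Highlights", "Q3 revenue, EPS, and margin summary."),
    ("00005", "Data Center", "Segment results and AI infrastructure partnerships."),
]

def select_nodes(question: str) -> list[str]:
    """Tree search: the LLM picks relevant node IDs from titles and summaries."""
    catalog = "\n".join(f"{nid} | {title} | {summary}" for nid, title, summary in NODE_INDEX)
    prompt = (
        "Document tree nodes, one per line as: node_id | title | summary\n"
        f"{catalog}\n\n"
        f"Question: {question}\n"
        'Answer with JSON only: {"node_ids": ["..."], "rationale": "..."}'
    )
    response = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain the model to emit valid JSON
    )
    return json.loads(response["message"]["content"])["node_ids"]

# The selected nodes' raw text would then go into a second,
# answer-generation call as grounded context.
```

Because the model returns a rationale alongside the IDs, the selection step is inspectable in a way that embedding similarity scores are not.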
The practical payoff is simpler infrastructure: no Postgres + pgvector setup, no embedding pipeline, and no vector-store maintenance. The approach also leans into document structure—exactly the kind of formatting found in financial filings (10-Q, 10-K) and legal documents—so the model can cite the right sections and reason across multiple levels of granularity. Because selection happens through the tree, the model can provide answers with an explicit rationale tied to why certain nodes were chosen, which can make prompt tuning more straightforward than debugging embedding-based retrieval.
The trade-offs are cost and scalability. Retrieval is slower because the model must do additional inference steps to traverse and reason over the tree rather than relying on fast vector similarity. Indexing is also expensive: summarizing every leaf (and then every parent node) requires one or more LLM calls per section, so costs rise with document size and number of documents. The method doesn’t scale well when the model’s context window is limited (the transcript cites a 30k context window as a pain point). Summaries-of-summaries can help, but the “basic implementation” still struggles as datasets grow.
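As a rough back-of-the-envelope count (an assumption for intuition, not a figure from the transcript): with $N_\ell$ leaf sections and $N_p$ parent nodes, one-summary-per-node indexing needs

$$\text{indexing calls} \approx N_\ell + N_p, \qquad \text{query calls} \approx s + 1,$$

where $s$ is the number of selection rounds needed to traverse the tree and the final call generates the answer. Classical RAG replaces the $s$ selection rounds with a single embedding lookup, which is why its per-query latency stays flat as collections grow.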
When to use which approach comes down to document count, structure, and cost. For small, well-structured local RAG workloads—especially where the LLM can handle the inference overhead—vectorless tree-based indexing can perform well. For thousands of documents, classical RAG with embeddings is likely faster and more scalable because vector search remains efficient, and systems can add hybrid search (keywords + embeddings) to improve chunk selection.
The implementation shown runs entirely locally with Gemma 3 (the 4-billion-parameter version, served through Ollama). Each tree node stores a title, node ID, content, and an LLM-generated summary, plus pointers to its children. The indexing step uses LangChain's markdown header text splitter to create hierarchical sections, then performs bottom-up summarization via an iterative, stack-based traversal. Query tests include questions like "What was Nvidia's total revenue and earnings per share in Q3?" and "What partnerships did Nvidia announce for AI infrastructure?" The system retrieves the relevant sections and produces answers grounded in the extracted filing text. For the partnerships question, the most relevant node is identified as "00005 data center," which cites a strategic partnership with OpenAI to deploy at least 10 GW of Nvidia systems, along with Microsoft, Oracle, and xAI. For the Q4 outlook, the retrieved outlook node yields revenue guidance of $65 billion, plus or minus 2%, and gross margins of 74.8% (GAAP) and 75.0% (non-GAAP), matching the document's figures.
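The sketch below reconstructs that indexing pipeline under stated assumptions: LangChain's real `MarkdownHeaderTextSplitter` for sectioning, the `ollama` client with an assumed `gemma3:4b` tag for summaries, and illustrative names (`TreeNode`, `build_tree`, `summarize`) that are not taken from the video's code.

```python
import itertools
from dataclasses import dataclass, field

import ollama
from langchain_text_splitters import MarkdownHeaderTextSplitter

@dataclass
class TreeNode:
    node_id: str
    title: str
    content: str = ""    # raw section text (populated on leaves)
    summary: str = ""    # LLM-generated; rolled up for parents
    children: list["TreeNode"] = field(default_factory=list)

_ids = itertools.count(1)  # sequential IDs like "00005"

def summarize(text: str) -> str:
    """One LLM call per node — the main indexing cost."""
    response = ollama.chat(
        model="gemma3:4b",  # assumed local model tag
        messages=[{"role": "user", "content": f"Summarize in 2-3 sentences:\n\n{text}"}],
    )
    return response["message"]["content"]

def build_tree(markdown: str) -> TreeNode:
    """Split on markdown headers, then summarize bottom-up with an explicit stack."""
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    root = TreeNode(node_id="00000", title="root")

    # Rebuild the hierarchy from each chunk's header metadata.
    for section in splitter.split_text(markdown):
        parent = root
        for level in ("h1", "h2", "h3"):
            title = section.metadata.get(level)
            if not title:
                continue
            child = next((c for c in parent.children if c.title == title), None)
            if child is None:
                child = TreeNode(node_id=f"{next(_ids):05d}", title=title)
                parent.children.append(child)
            parent = child
        parent.content += section.page_content

    # Iterative post-order traversal: leaves are summarized before parents.
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(node.children)
    for node in reversed(order):  # every child precedes its parent here
        if node.children:
            rollup = "\n".join(f"{c.title}: {c.summary}" for c in node.children)
            node.summary = summarize(rollup)
        elif node.content:
            node.summary = summarize(node.content)
    return root
```

Serializing the result (e.g., `dataclasses.asdict(root)`) yields the kind of JSON tree the transcript describes, with each node carrying the fields named above.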
Cornell Notes
The transcript describes a “vectorless RAG” system that answers questions from structured documents without embeddings or a vector database. It converts a document (like a financial filing) into a markdown hierarchy, splits it into sections, and uses an LLM to generate short summaries for leaf chunks. Those summaries roll up into a JSON tree, and at query time the LLM performs tree search to choose which nodes’ text to use as context for the final answer. This design removes vector-store dependencies and can improve grounding and explainability because selection follows the document’s structure. The main downsides are higher indexing and retrieval costs and weaker scalability when document collections grow large or context windows are limited.
- How does vectorless RAG retrieve relevant context if it can’t run embedding similarity search?
- Why does document structure matter so much for this approach?
- What are the biggest cost and performance trade-offs compared with classical embedding-based RAG?
- When does the transcript suggest vectorless RAG is a good fit versus classical RAG?
- What does the system store in each tree node, and how is the tree built?
- How did the example answer specific questions from the Nvidia filing?
Review Questions
- What steps transform a structured markdown document into a tree index, and where do LLM calls occur in that pipeline?
- How does tree search decide which nodes to use for a query, and how does that differ from embedding-based top-k retrieval?
- What conditions (document count, structure, context window size) make vectorless RAG less scalable, according to the transcript?
Key Points
1. Vectorless RAG replaces embedding + vector search with a document-structure tree index built from markdown hierarchy and LLM-generated node summaries.
2. Leaf summaries are generated first, then parent summaries are created bottom-up to form a JSON tree that mirrors the document’s sections.
3. At query time, the LLM performs tree search over node titles and summaries to select relevant nodes, then answers using the selected nodes’ text as grounded context.
4. The approach removes vector database dependencies (e.g., no Postgres + pgvector setup) but increases indexing and retrieval cost due to extra LLM inference steps.
5. Scalability can degrade with large document collections (thousands of documents or more) and with limited context windows (the transcript flags a 30k context window as a challenge).
6. Vectorless RAG is best suited to small, well-structured local workloads where document hierarchy is reliable and the LLM can manage the added reasoning overhead.
7. Classical embedding-based RAG remains preferable for large-scale retrieval because vector search is faster and can be enhanced with hybrid keyword + embedding search.