Vectorless RAG - Local Financial RAG Without Vector Database | Tree-Based Indexing with Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Vectorless RAG replaces embedding + vector search with a document-structure tree index built from markdown hierarchy and LLM-generated node summaries.
Briefing
Vectorless RAG can retrieve and answer questions from structured documents without any vector database by building a tree index from the document’s own hierarchy and letting a local LLM choose which sections to use. Instead of embedding chunks and running similarity search, the system splits a document (e.g., a financial PDF converted to markdown), generates short summaries for leaf sections, then rolls those summaries up into a JSON tree. At query time, the model receives the tree’s node titles and summaries and performs “tree search” to select the most relevant nodes, using their underlying text as grounded context for the final answer.
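To make the query-time step concrete, here is a minimal sketch of what the selection call could look like, assuming the `ollama` Python client, a local `gemma3:4b` model tag, and a flattened list of node entries; the prompt wording and the helper names (`NODE_INDEX`, `select_nodes`) are illustrative, not the video's actual code.

```python
import json

import ollama  # assumes a local Ollama daemon with the model pulled

# Hypothetical flattened view of the tree index: (node_id, title, summary).
NODE_INDEX = [
    ("00001", "Financial Highlights", "Q3 revenue, EPS, and margin summary."),
    ("00005", "Data Center", "Segment results and AI infrastructure partnerships."),
]

def select_nodes(question: str) -> list[str]:
    """Tree search: the LLM picks relevant node IDs from titles and summaries."""
    catalog = "\n".join(f"{nid} | {title} | {summary}" for nid, title, summary in NODE_INDEX)
    prompt = (
        "Document tree nodes, one per line as: node_id | title | summary\n"
        f"{catalog}\n\n"
        f"Question: {question}\n"
        'Answer with JSON only: {"node_ids": ["..."], "rationale": "..."}'
    )
    response = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain the model to emit valid JSON
    )
    return json.loads(response["message"]["content"])["node_ids"]

# The selected nodes' raw text would then go into a second,
# answer-generation call as grounded context.
```

Because the model returns a rationale alongside the IDs, the selection step is inspectable in a way that embedding similarity scores are not.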
The practical payoff is simpler infrastructure: no Postgres + pgvector setup, no embedding pipeline, and no vector-store maintenance. The approach also leans into document structure—exactly the kind of formatting found in financial filings (10-Q, 10-K) and legal documents—so the model can cite the right sections and reason across multiple levels of granularity. Because selection happens through the tree, the model can provide answers with an explicit rationale tied to why certain nodes were chosen, which can make prompt tuning more straightforward than debugging embedding-based retrieval.
The trade-offs are cost and scalability. Retrieval is slower because the model must do additional inference steps to traverse and reason over the tree rather than relying on fast vector similarity. Indexing is also expensive: summarizing every leaf (and then every parent node) requires one or more LLM calls per section, so costs rise with document size and number of documents. The method doesn’t scale well when the model’s context window is limited (the transcript cites a 30k context window as a pain point). Summaries-of-summaries can help, but the “basic implementation” still struggles as datasets grow.
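As a rough back-of-the-envelope count (an assumption for intuition, not a figure from the transcript): with $N_\ell$ leaf sections and $N_p$ parent nodes, one-summary-per-node indexing needs

$$\text{indexing calls} \approx N_\ell + N_p, \qquad \text{query calls} \approx s + 1,$$

where $s$ is the number of selection rounds needed to traverse the tree and the final call generates the answer. Classical RAG replaces the $s$ selection rounds with a single embedding lookup, which is why its per-query latency stays flat as collections grow.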
When to use which approach comes down to document count, structure, and cost. For small, well-structured local RAG workloads—especially where the LLM can handle the inference overhead—vectorless tree-based indexing can perform well. For thousands of documents, classical RAG with embeddings is likely faster and more scalable because vector search remains efficient, and systems can add hybrid search (keywords + embeddings) to improve chunk selection.
The implementation shown runs entirely locally with Gemma 3 (the 4-billion-parameter version, served through Ollama). Each tree node stores a title, node ID, content, and an LLM-generated summary, plus pointers to its children. The indexing step uses LangChain's markdown header text splitter to create hierarchical sections, then performs bottom-up summarization via an iterative, stack-based traversal. Query tests include questions like "What was Nvidia's total revenue and earnings per share in Q3?" and "What partnerships did Nvidia announce for AI infrastructure?" The system retrieves the relevant sections and produces answers grounded in the extracted filing text. For the partnerships question, the most relevant node is identified as "00005 data center," which cites a strategic partnership with OpenAI to deploy at least 10 GW of Nvidia systems, along with Microsoft, Oracle, and xAI. For the Q4 outlook, the retrieved outlook node yields revenue guidance of $65 billion, plus or minus 2%, and gross margins of 74.8% (GAAP) and 75.0% (non-GAAP), matching the document's figures.
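The sketch below reconstructs that indexing pipeline under stated assumptions: LangChain's real `MarkdownHeaderTextSplitter` for sectioning, the `ollama` client with an assumed `gemma3:4b` tag for summaries, and illustrative names (`TreeNode`, `build_tree`, `summarize`) that are not taken from the video's code.

```python
import itertools
from dataclasses import dataclass, field

import ollama
from langchain_text_splitters import MarkdownHeaderTextSplitter

@dataclass
class TreeNode:
    node_id: str
    title: str
    content: str = ""    # raw section text (populated on leaves)
    summary: str = ""    # LLM-generated; rolled up for parents
    children: list["TreeNode"] = field(default_factory=list)

_ids = itertools.count(1)  # sequential IDs like "00005"

def summarize(text: str) -> str:
    """One LLM call per node — the main indexing cost."""
    response = ollama.chat(
        model="gemma3:4b",  # assumed local model tag
        messages=[{"role": "user", "content": f"Summarize in 2-3 sentences:\n\n{text}"}],
    )
    return response["message"]["content"]

def build_tree(markdown: str) -> TreeNode:
    """Split on markdown headers, then summarize bottom-up with an explicit stack."""
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    root = TreeNode(node_id="00000", title="root")

    # Rebuild the hierarchy from each chunk's header metadata.
    for section in splitter.split_text(markdown):
        parent = root
        for level in ("h1", "h2", "h3"):
            title = section.metadata.get(level)
            if not title:
                continue
            child = next((c for c in parent.children if c.title == title), None)
            if child is None:
                child = TreeNode(node_id=f"{next(_ids):05d}", title=title)
                parent.children.append(child)
            parent = child
        parent.content += section.page_content

    # Iterative post-order traversal: leaves are summarized before parents.
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(node.children)
    for node in reversed(order):  # every child precedes its parent here
        if node.children:
            rollup = "\n".join(f"{c.title}: {c.summary}" for c in node.children)
            node.summary = summarize(rollup)
        elif node.content:
            node.summary = summarize(node.content)
    return root
```

Serializing the result (e.g., `dataclasses.asdict(root)`) yields the kind of JSON tree the transcript describes, with each node carrying the fields named above.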
Cornell Notes
The transcript describes a “vectorless RAG” system that answers questions from structured documents without embeddings or a vector database. It converts a document (like a financial filing) into a markdown hierarchy, splits it into sections, and uses an LLM to generate short summaries for leaf chunks. Those summaries roll up into a JSON tree, and at query time the LLM performs tree search to choose which nodes’ text to use as context for the final answer. This design removes vector-store dependencies and can improve grounding and explainability because selection follows the document’s structure. The main downsides are higher indexing and retrieval costs and weaker scalability when document collections grow large or context windows are limited.
- How does vectorless RAG retrieve relevant context if it can’t run embedding similarity search?
- Why does document structure matter so much for this approach?
- What are the biggest cost and performance trade-offs compared with classical embedding-based RAG?
- When does the transcript suggest vectorless RAG is a good fit versus classical RAG?
- What does the system store in each tree node, and how is the tree built?
- How did the example answer specific questions from the Nvidia filing?
Review Questions
- What steps transform a structured markdown document into a tree index, and where do LLM calls occur in that pipeline?
- How does tree search decide which nodes to use for a query, and how does that differ from embedding-based top-k retrieval?
- What conditions (document count, structure, context window size) make vectorless RAG less scalable, according to the transcript?
Key Points
1. Vectorless RAG replaces embedding + vector search with a document-structure tree index built from markdown hierarchy and LLM-generated node summaries.
2. Leaf summaries are generated first, then parent summaries are created bottom-up to form a JSON tree that mirrors the document’s sections.
3. At query time, the LLM performs tree search over node titles and summaries to select relevant nodes, then answers using the selected nodes’ text as grounded context.
4. The approach removes vector database dependencies (e.g., no Postgres + pgvector setup) but increases indexing and retrieval cost due to extra LLM inference steps.
5. Scalability can degrade with large document collections (thousands of documents or more) and with limited context windows (the transcript flags a 30k context window as a challenge).
6. Vectorless RAG is best suited to small, well-structured local workloads where document hierarchy is reliable and the LLM can manage the added reasoning overhead.
7. Classical embedding-based RAG remains preferable for large-scale retrieval because vector search is faster and can be enhanced with hybrid keyword + embedding search.