How to Analyze Tables In Large Financial Reports Using GPT-4 (w/Jerry Liu, LlamaIndex)
Based on Chat with data's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Advanced RAG for financial filings hinges on treating embedded tables as first-class, queryable objects—then retrieving them indirectly through summaries and a recursive lookup—rather than embedding raw table text like ordinary chunks. In large documents such as SEC 10-Ks (and even Wikipedia’s “world billionaires” tables), naive chunk-and-embed retrieval often fails because table contents are dense, number-heavy, and semantically hard to match. The result is either missed tables or answers that don’t correspond to the correct figures.
The core fix is a hierarchical indexing strategy. Text sections and extracted tables are mapped into separate “nodes”: text chunks become nodes containing the text itself, while table nodes store an LLM-generated summary plus a reference (via an ID) to a structured query engine for the underlying table. At query time, a top-level vector retriever selects the most relevant nodes based on semantic similarity. Then recursive retrieval kicks in: if a retrieved node is plain text, standard RAG continues; if it is a table summary node, the system follows the reference and runs a targeted query against the table (for example, via a pandas-based query engine). This indirect-reference approach avoids the poor retrieval behavior of embedding entire tables as plain text while still enabling precise extraction of numbers.
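A minimal sketch of that wiring, assuming recent LlamaIndex-style APIs (module paths vary across versions); `text_chunks`, `table_summaries`, and `table_engines` are illustrative names for artifacts produced by the extraction and summarization steps sketched later in this section:

```python
# Sketch: table summaries become IndexNodes that carry a reference (index_id)
# to a per-table query engine; RecursiveRetriever resolves those references
# at query time. Assumes llama_index 0.10+ module paths.
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# text_chunks: list[str] of section text; table_summaries: {id: summary};
# table_engines: {id: structured query engine over that table}.
text_nodes = [TextNode(text=chunk) for chunk in text_chunks]
summary_nodes = [
    IndexNode(text=summary, index_id=table_id)
    for table_id, summary in table_summaries.items()
]

vector_index = VectorStoreIndex(text_nodes + summary_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=3)

# A plain TextNode hit flows through standard RAG; an IndexNode hit is
# dereferenced via index_id into query_engine_dict and queried directly.
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=table_engines,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)
print(query_engine.query("How many billionaires were there in 2009?"))
```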
The demo starts with Wikipedia’s billionaires page converted to PDF and parsed with Camelot, which extracts clean tables into data frames. For each table, the system builds a pandas query engine and pairs it with a summary node. When asked, “How many billionaires were there in 2009?” the recursive retriever pulls the correct table summary node, then executes the corresponding pandas operation to return the exact value (793). A baseline RAG pipeline that simply chunks and embeds the flattened document text fails on the same question, returning “not possible to determine” and even hallucinating by pulling the wrong year’s table content.
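A hedged sketch of that extraction step; the filename and page range are assumptions, and `PandasQueryEngine` ships in the llama-index-experimental package in recent releases:

```python
# Sketch: parse the PDF with Camelot and wrap each extracted DataFrame in a
# pandas query engine keyed by the same id used in the summary IndexNodes.
import camelot

# PandasQueryEngine moved to the llama-index-experimental package in
# recent releases; older versions expose it from llama_index directly.
from llama_index.experimental.query_engine import PandasQueryEngine

tables = camelot.read_pdf("billionaires.pdf", pages="1-10")

table_dfs, table_engines = {}, {}
for i, table in enumerate(tables):
    table_id = f"table-{i}"
    table_dfs[table_id] = table.df           # Camelot exposes a pandas DataFrame
    table_engines[table_id] = PandasQueryEngine(df=table.df)

# "How many billionaires were there in 2009?" then resolves as: the recursive
# retriever matches the 2009 table's summary node, dereferences its id into
# table_engines, and the pandas engine computes the exact value (793).
```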
A second example uses an SEC filing in HTML form (a 10-K) processed with Unstructured to partition content into a document hierarchy. Tables are extracted into nodes, and each table node is again represented by an LLM summary rather than the full raw table. Even when the extracted tables are messy or partially missing formatting, the summary-based retrieval helps the system locate the right table node; then recursive retrieval queries the underlying table to answer questions like “What was the revenue in 2020?” The baseline approach again struggles, even for straightforward numeric questions.
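A sketch of the Unstructured side, under the assumption that the filing is a local HTML file; the filename is illustrative:

```python
# Sketch: partition a 10-K HTML filing with Unstructured and collect the
# table elements. Table elements expose both flattened text and, when
# detected, an HTML rendering of the table structure.
from unstructured.partition.html import partition_html

elements = partition_html(filename="sec_10k.html")

table_elements = [el for el in elements if el.category == "Table"]
for el in table_elements:
    raw_text = el.text                     # flattened, possibly messy text
    table_html = el.metadata.text_as_html  # structured form, if available
    # Downstream: summarize each table with an LLM and attach the summary
    # to an IndexNode that references a query engine, as sketched earlier.
```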
Beyond the demos, the discussion frames production tradeoffs: table parsing quality and document-tree construction strongly affect downstream accuracy; summary quality matters because retrieval depends on it; and preprocessing (parsing plus LLM summarization across many tables) can add latency, though the actual retrieval step is typically fast. The architecture is also flexible about persistence—table data can live in a document store, SQL database, or vector database—because recursive retrieval follows references rather than assuming a single storage model. Finally, there’s an open question about whether multimodal table parsing (e.g., screenshot-to-description with GPT-4V) can improve summaries versus text-based extraction, especially when tables are not cleanly parseable.
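To make the preprocessing cost concrete, here is a sketch of the offline summarization pass (one LLM call per table); the prompt wording and model choice are assumptions rather than the video’s exact configuration:

```python
# Sketch of the offline summarization pass that dominates preprocessing
# latency: one LLM call per extracted table.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")

def summarize_table(df) -> str:
    """Ask the LLM for a retrieval-friendly description of one table."""
    prompt = (
        "Summarize this table for retrieval: describe what it contains, "
        "which years and entities it covers, and the units used.\n\n"
        + df.head(20).to_string()  # truncate large tables to bound cost
    )
    return llm.complete(prompt).text

# table_dfs comes from the extraction step; this loop is embarrassingly
# parallel, but run serially it costs one LLM round trip per table.
table_summaries = {tid: summarize_table(df) for tid, df in table_dfs.items()}
```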
Cornell Notes
The approach for querying large financial documents with embedded tables replaces “embed the whole table” with a hierarchical, recursive RAG design. Tables are extracted into structured data frames, but the index stores a compact LLM summary plus a reference to a table-specific query engine (e.g., pandas). A vector retriever first selects relevant nodes using those summaries; recursive retrieval then follows table references and runs precise queries to pull exact numbers. This avoids naive RAG failures where flattened, number-dense table text leads to wrong retrieval or hallucinated year/value matches. The method works for both clean PDF tables (Camelot) and messier filings (Unstructured), and it can be persisted in different backends because references drive the lookup.
- Why does naive RAG struggle with embedded tables in financial documents?
- How does recursive retrieval improve table question answering?
- What does a “table node” contain in the LlamaIndex-style architecture described?
- How were tables extracted in the two main demos, and why does it matter?
- What tradeoffs affect latency, cost, and accuracy in this workflow?
- Does the architecture require storing tables in a specific database type?
Review Questions
- In what way does embedding a table “as-is” differ from embedding a table summary plus a reference to a query engine, and how does that change retrieval outcomes?
- Describe the sequence of steps from user question to final answer in recursive retrieval for a table-backed node.
- What factors most strongly determine whether the system retrieves the correct table node in messy filings?
Key Points
1. Naive chunk-and-embed retrieval often fails on embedded tables because raw table text is dense and number-heavy, producing poor top-k matches.
2. Index tables as summaries plus references to structured query engines, rather than embedding entire tables as plain text.
3. Use a top-level vector retriever to select relevant nodes, then apply recursive retrieval to follow table references and run precise table queries.
4. Table parsing quality (Camelot for clean PDFs, Unstructured for messier filings) directly impacts the correctness of numeric answers.
5. LLM-generated table summaries must be descriptive enough to be retrievable; weak summaries reduce table selection accuracy.
6. Preprocessing can be slow because it includes parsing and LLM summarization across many tables, while retrieval is typically faster once the index exists.
7. Recursive retrieval is storage-agnostic: referenced tables can live in document stores, SQL databases, or vector databases as long as they can be queried.