
GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla's 10-K Annual Reports)

Chat with data · 5 min read

Based on Chat with data's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert each PDF to text, then chunk it to fit model context limits before embedding.

Briefing

Answering questions across multiple massive PDF files—like several years of Tesla 10-K annual reports—becomes practical when each document is converted into searchable embeddings and the system can dynamically route a user’s question to the right year-specific “namespace” in a vector database. The core payoff: a single chat can retrieve relevant passages from 2020, 2021, and 2022 reports, then compute or summarize across them while citing the underlying page sources.

The workflow starts with ingestion. Each PDF is converted from binary into text, then split into chunks because models only handle limited context at a time. Those chunks are embedded into numeric vectors (representations of meaning) and stored in a vector store so retrieval can happen fast. The design uses separate namespaces per year—e.g., Tesla 2020, Tesla 2021, Tesla 2022—so the system can search within the correct subset of documents. Pinecone is used as the vector store, and the embeddings dimension must match the embedding model output (the transcript notes 1536 for OpenAI embeddings). During ingestion, metadata such as page numbers and source references are attached to chunks so answers can be grounded with citations.
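The ingestion step above can be sketched in a few lines. This is a minimal, offline illustration of character-based chunking with overlap and metadata attachment; the function and constant names are hypothetical, not taken from the video, and the chunk size and overlap values are placeholders for whatever the tutorial configures.

```python
# Illustrative ingestion sketch: split extracted PDF text into overlapping
# chunks and attach page/source metadata to each chunk.
CHUNK_SIZE = 1000    # characters per chunk (placeholder value)
CHUNK_OVERLAP = 200  # overlap keeps sentences intact across chunk boundaries

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping character windows."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def to_records(pages, source):
    """Attach page numbers and the source file name to every chunk."""
    records = []
    for page_num, page_text in enumerate(pages, start=1):
        for chunk in chunk_text(page_text):
            records.append({
                "text": chunk,
                "metadata": {"page": page_num, "source": source},
            })
    return records
```

Each record would then be embedded (1536-dimensional vectors for the OpenAI embedding model the transcript cites) and upserted into the year's Pinecone namespace with its metadata.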

Query-time logic is what makes multi-year analysis work. For a single-year question, the system can search one namespace directly. But cross-year questions require a routing strategy: the system must determine which years the user is referring to, then search multiple namespaces and combine the retrieved context. To avoid hard-coding, GPT-4 is used to extract the years from the user’s question (via a function that returns the referenced years). Those years are mapped to the corresponding namespaces. Once the relevant namespaces are identified, the system retrieves top-matching chunks from each, merges the context, and sends it to GPT-4 to produce a synthesized answer.
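The routing step can be sketched as follows. In the actual system, GPT-4 extracts the years via a function/tool call; a regex stands in here so the sketch runs offline. The namespace names are illustrative, not confirmed from the video.

```python
import re

def extract_years(question):
    """Stand-in for the GPT-4 function call that returns referenced years."""
    return sorted(set(re.findall(r"\b(20\d{2})\b", question)))

# Hypothetical year-to-namespace mapping
NAMESPACES = {"2020": "tesla-2020", "2021": "tesla-2021", "2022": "tesla-2022"}

def route_question(question):
    """Map referenced years to namespaces; fall back to searching all of them."""
    years = extract_years(question)
    routed = [NAMESPACES[y] for y in years if y in NAMESPACES]
    return routed or list(NAMESPACES.values())
```

A question like "Compare gross margin in 2021 and 2022" would route to the 2021 and 2022 namespaces; a question with no year mentioned falls back to searching all of them.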

The transcript demonstrates this with finance-style prompts such as “What were the risk factors for Tesla 2022?” (with page-level references) and then escalates to questions spanning multiple years, like gross margin changes over three years and revenue growth calculations. In the multi-year case, the system not only pulls supporting excerpts from each annual report but also performs arithmetic based on the retrieved statements (for example, computing revenue growth using consolidated statement figures). The demo also references a front-end and API adaptation, with LangChain chains orchestrating the steps: extracting years, retrieving from Pinecone namespaces, and generating the final response.

Implementation details keep the workflow grounded in practice: ingestion is done via scripts that load PDFs from a reports folder, chunk them (the transcript mentions chunk sizes and overlap), and insert vectors in batches because Pinecone has limits (the transcript notes a recommendation of roughly 100 vectors per insert). The result is an architecture that can chat over roughly a thousand pages across multiple PDFs while keeping answers tied to the original documents, turning tedious reading into targeted, source-cited analysis across time.
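The batched insertion the transcript mentions is a simple grouping pattern; a sketch, assuming the ~100-vectors-per-insert recommendation:

```python
def batched(vectors, batch_size=100):
    """Yield successive batches of vectors for upserting into Pinecone."""
    for i in range(0, len(vectors), batch_size):
        yield vectors[i:i + batch_size]

# In the real ingestion script, each batch would be upserted into the
# year's namespace, e.g. (hypothetical client call shape):
#   for batch in batched(all_vectors):
#       index.upsert(vectors=batch, namespace="tesla-2022")
```

Splitting a 250-vector ingestion this way yields three calls of 100, 100, and 50 vectors, staying under the per-request limit.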

Cornell Notes

The system turns each year’s Tesla 10-K PDFs into text chunks, embeds those chunks into vectors, and stores them in a vector database (Pinecone) under year-specific namespaces (e.g., Tesla 2020, Tesla 2021, Tesla 2022). When a user asks a question, GPT-4 extracts which years are referenced, maps those years to the matching namespaces, retrieves the most relevant chunks from each, and then generates a combined answer with citations to the source pages. This dynamic namespace routing is what enables cross-year analysis instead of only single-PDF Q&A. Page numbers and source references are preserved as metadata during ingestion so the final responses can be checked against the original filings.

Why convert PDFs into embeddings and store them in a vector database?

PDFs are binary, so they’re first converted to text. That text is then split into chunks to fit model context limits. Each chunk is embedded into a numeric vector representation, which makes semantic search possible. Pinecone stores these vectors along with metadata (including page numbers and source references), enabling fast retrieval of the most relevant passages when a question arrives.
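Under the hood, semantic search ranks stored vectors by similarity to the query embedding. Pinecone does this server-side at scale; the toy sketch below, with a simplified record shape, shows the idea using cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, records, k=3):
    """Return the k stored chunks most similar to the query embedding."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["values"]),
                    reverse=True)
    return ranked[:k]
```

Because chunks with similar meaning get nearby embeddings, a question about "risk factors" retrieves the risk-factor passages even when the wording differs from the filing.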

What problem does “namespaces per year” solve?

Separate namespaces let the system control what it searches. A single-year question can target one namespace (e.g., Tesla 2022). Cross-year questions require searching multiple namespaces (Tesla 2020 + Tesla 2021 + Tesla 2022) and combining the retrieved context. Without this structure, retrieval would mix years and degrade accuracy.

How does the system handle questions that mention multiple years without hard-coding?

GPT-4 extracts the years referenced in the user’s question (using a function that returns the years, then a mapping step converts those years into the corresponding namespace names). The system then retrieves from each identified namespace and merges the results before asking GPT-4 to generate the final response.

How are citations and page references preserved?

During ingestion, each chunk is associated with metadata such as page numbers and references to the original document. When retrieval pulls relevant chunks from Pinecone, that metadata travels with the context, allowing the final answer to point back to specific pages (e.g., risk factors referencing a particular page in the 2022 report).
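Turning that metadata into citations is a straightforward formatting step; a hypothetical sketch, assuming the record shape used during ingestion:

```python
def format_answer(answer, retrieved):
    """Append deduplicated page-level citations carried in chunk metadata."""
    cites = {(r["metadata"]["source"], r["metadata"]["page"]) for r in retrieved}
    lines = [f"- {src}, p. {page}" for src, page in sorted(cites)]
    return answer + "\n\nSources:\n" + "\n".join(lines)
```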

What operational constraints matter when inserting vectors into Pinecone?

The transcript notes that Pinecone insertion has limits, so vectors can’t be inserted all at once. It mentions a recommendation around inserting in batches (e.g., ~100 vectors per insert). It also stresses configuration correctness: the embedding dimension must match (1536 is cited), the environment/pod region must align with the code’s environment variable, and the index name must match the configured index.

What kinds of questions work best in this setup?

Questions that can be answered from retrieved passages plus light computation. The demo includes risk-factor questions, management/profitability themes, and multi-year metrics like gross margin changes and revenue growth. For computed metrics, the system retrieves the underlying financial statement numbers from the relevant years and then performs calculations using those retrieved figures.
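The "light computation" step is plain arithmetic over retrieved figures. A sketch of the growth calculation the demo performs (the example numbers below are illustrative, not taken from the actual filings):

```python
def growth_pct(prior, current):
    """Percentage growth between two retrieved financial statement figures."""
    return (current - prior) / prior * 100

# e.g. growth_pct(prior_year_revenue, current_year_revenue) gives the
# year-over-year revenue growth in percent.
```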

Review Questions

  1. How does dynamic year extraction change the retrieval process compared with a fixed single-namespace approach?
  2. Describe the ingestion pipeline from PDF to Pinecone, including chunking and metadata. Why is chunking necessary?
  3. What configuration details (e.g., embedding dimension, index name, environment) must match between code and Pinecone for insertion and retrieval to work?

Key Points

  1. Convert each PDF to text, then chunk it to fit model context limits before embedding.
  2. Store embeddings in Pinecone under year-specific namespaces so retrieval can be scoped to the right annual report.
  3. Use GPT-4 to extract which years a user’s question references, then map those years to the correct namespaces dynamically.
  4. Retrieve relevant chunks from each identified namespace, merge the context, and generate a synthesized answer with citations.
  5. Attach page numbers and source references as metadata during ingestion so answers can be cross-checked in the original filings.
  6. Insert vectors in batches to respect Pinecone limits, and ensure embedding dimensions and Pinecone configuration match the code.
  7. Use LangChain chains to orchestrate year extraction, vector retrieval, and final GPT-4 response generation.

Highlights

Cross-year Q&A works by dynamically routing questions to multiple year namespaces (e.g., 2020–2022) rather than searching a single document set.
The system preserves page-level citations by storing page numbers and source references as chunk metadata during ingestion.
Multi-year metrics can be computed when the system retrieves the underlying financial statement figures from each year’s report.
