
Talk to your CSV & Excel with LangChain

Sam Witteveen · 4 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LangChain’s CSV agent enables natural-language querying of CSV/Excel data by executing pandas operations through a Python REPL.

Briefing

LangChain can turn natural-language questions into accurate, on-the-fly analysis of CSV and Excel data by using a “CSV agent” that runs a Python REPL over a pandas DataFrame. The practical payoff is that users can ask things like “How many rows are in this file?” or “How many people stayed more than three years in the city and are female?” without writing pandas code—while the system still performs real filtering and counting rather than guessing.

The setup starts with a Black Friday sales dataset from Kaggle, loaded into pandas purely for sanity checks: column names such as gender (encoded as f and m), age (bucketed), stay in current city number of years, and marital status. LangChain then takes over with its own CSV loading mechanism; no training step is involved. A key implementation choice is using an OpenAI language model with temperature set to 0 to reduce randomness and limit hallucinations. Verbose mode is enabled so the prompts and intermediate reasoning steps are visible.
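The setup described above can be sketched roughly as follows. The file, column names, and rows here are illustrative stand-ins for the Kaggle dataset, and the import paths for create_csv_agent and the OpenAI wrapper vary across LangChain versions (recent releases moved the CSV agent into langchain_experimental), so treat the wiring as a version-dependent sketch:

```python
import os
import pandas as pd

# Hypothetical miniature of the Black Friday dataset's columns; the real
# Kaggle file has roughly 550,000 rows.
df = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f"],
    "age": ["0-17", "26-35", "26-35", "36-45", "18-25"],
    "stay_in_current_city_years": [1, 4, 2, 5, 4],
    "marital_status": [0, 1, 0, 1, 0],
})
df.to_csv("black_friday_sample.csv", index=False)

# Quick pandas sanity checks, mirroring the inspection described above.
print(list(df.columns))
print(df["gender"].unique())

# Agent wiring (sketch): requires an OpenAI key, so it is guarded here.
if os.environ.get("OPENAI_API_KEY"):
    from langchain_openai import OpenAI
    from langchain_experimental.agents import create_csv_agent

    agent = create_csv_agent(
        OpenAI(temperature=0),        # temperature=0 to reduce randomness
        "black_friday_sample.csv",
        verbose=True,                 # show prompts and intermediate steps
    )
    agent.run("How many rows are there?")
```

The model call only runs when a key is present; the pandas portion is self-contained.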

Under the hood, there is an important caveat: the CSV agent effectively runs a Python agent, meaning it can be vulnerable to prompt-injection-style attacks if exposed to untrusted end users. In the demo, everything runs in a controlled environment, so the risk is treated as manageable.

The agent’s workflow follows a loop of “thought → action → observation,” with a scratchpad that carries intermediate results between model calls. For a simple query—“How many rows are there?”—the agent decides it needs to count the DataFrame length and uses the Python REPL to compute it, returning 550,000 rows. For a more semantic question—“How many people are female?”—it correctly maps the user’s wording (“female”) to the dataset’s encoding (gender values f and m) by referencing the DataFrame’s column structure and then counting rows where gender equals f, yielding 135,000.
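The two queries above boil down to short pandas expressions of roughly this shape; the toy data below stands in for the 550,000-row file:

```python
import pandas as pd

# Toy DataFrame standing in for the demo dataset (illustrative values).
df = pd.DataFrame({"gender": ["f", "m", "f", "m", "f", "m", "m"]})

# "How many rows are there?" -> the agent counts the DataFrame length.
row_count = len(df)

# "How many people are female?" -> after inspecting the columns, the agent
# maps the word "female" onto the dataset's encoding and counts gender == "f".
female_count = (df["gender"] == "f").sum()

print(row_count, female_count)  # 7 3
```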

More complex filters work the same way. When asked how many people stayed in the city for more than three years, the agent writes a pandas filter on stay in current city number of years > 3 and returns the count. It can combine conditions, such as staying more than three years and being female, by generating a compound pandas query that joins both filters with a boolean AND. It also supports comparative analytics; after counting males and females, it can answer whether one group is larger.
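A minimal sketch of the compound filter the agent generates, on hypothetical data (note that pandas combines boolean conditions with & rather than the keyword and):

```python
import pandas as pd

# Illustrative stand-in; the real column is the dataset's
# "stay in current city number of years".
df = pd.DataFrame({
    "gender": ["f", "m", "f", "f", "m"],
    "stay_in_current_city_years": [4, 5, 2, 4, 1],
})

# "More than three years" alone:
long_stay = df[df["stay_in_current_city_years"] > 3]

# Compound filter: more than three years AND female.
long_stay_female = df[
    (df["stay_in_current_city_years"] > 3) & (df["gender"] == "f")
]

print(len(long_stay), len(long_stay_female))  # 3 2
```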

Excel support comes indirectly: LangChain has no native Excel import in this approach, so Excel files are converted to CSV before loading. Once converted, the agent can list column names, compute aggregates like average age, and handle frequency questions such as which country appears most. In one example, it identifies the United States as the most frequent country with 48 occurrences. Another dataset enables ratio-style questions like the male-to-female ratio, demonstrating that the same natural-language querying pattern applies across different CSVs.
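The Excel-to-CSV step can be sketched like this; the file names and columns are assumptions for illustration, and writing .xlsx requires an engine such as openpyxl (the sketch falls back to the in-memory frame if none is installed):

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the video's file.
src = pd.DataFrame({
    "country": ["United States", "India", "United States", "UK"],
    "age": [30, 40, 20, 50],
})

# Excel-to-CSV conversion: write an .xlsx, read it back, save as CSV.
try:
    src.to_excel("people.xlsx", index=False)
    frame = pd.read_excel("people.xlsx")
except ImportError:
    frame = src  # no Excel engine available; use the frame directly
frame.to_csv("people.csv", index=False)

# The CSV agent would now be pointed at people.csv; the queries it runs
# amount to operations like these:
df = pd.read_csv("people.csv")
print(list(df.columns))         # column listing
print(df["age"].mean())         # average age -> 35.0
print(df["country"].mode()[0])  # most frequent country -> United States
```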

Overall, the method enables quick, lightweight “data Q&A” apps for people who don’t want to learn pandas, while still relying on actual DataFrame operations for counts, filters, averages, and group-style computations.

Cornell Notes

LangChain’s CSV agent lets users query CSV data using natural language by translating questions into pandas operations executed via a Python REPL. The agent loads the file into a pandas DataFrame, then iteratively decides what action to take, using a scratchpad to keep intermediate results. Setting the OpenAI model temperature to 0 helps keep outputs consistent and reduces hallucinations. The approach works for multi-step questions like filtering by multiple conditions (e.g., stay length and gender) and for comparisons (e.g., whether males outnumber females). Excel files require conversion to CSV first, since there’s no native Excel import in this workflow.

How does the CSV agent answer “How many rows are there?” without any manual pandas code?

It loads the CSV into a pandas DataFrame and then uses the Python REPL to compute the DataFrame's length. In the demo dataset, the agent counts the rows this way and returns 550,000.

Why can the agent answer “How many people are female?” even though the dataset doesn’t contain the words “female” or “male”?

The dataset encodes gender as f and m in a column named gender. The agent inspects the DataFrame structure (column names and values) and maps “female” to gender == f, then counts matching rows. The demo result is 135,000.

What changes when the question becomes a compound filter like “more than three years and female”?

The agent generates a pandas query that combines conditions. It filters stay in current city number of years > 3 and gender == f, then counts the resulting rows. This is handled through the same action/observation loop, with the scratchpad carrying intermediate outputs.

How does the agent support comparative questions such as “Are there more males than females?”

It performs counts for each group (males and females) and records those numbers in its scratchpad/observations. Once it has both counts, it compares them and returns a conclusion (in the demo, there are more males than females in that dataset).
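One plausible way this comparison resolves into pandas, on illustrative data:

```python
import pandas as pd

# Toy data; the agent would carry each group's count in its scratchpad
# between steps and then compare them.
df = pd.DataFrame({"gender": ["m", "m", "m", "f", "f"]})

counts = df["gender"].value_counts()
males = counts.get("m", 0)
females = counts.get("f", 0)

answer = "more males" if males > females else "not more males"
print(males, females, answer)  # 3 2 more males
```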

What’s the limitation for Excel files in this workflow, and how is it handled?

LangChain doesn’t provide native Excel import here. The workflow converts the Excel file to CSV first, then runs the same CSV-agent approach over the converted data. After conversion, it can list column names, compute averages like average age, and find the most frequent country (e.g., the United States with 48 occurrences).

Review Questions

  1. What role does temperature=0 play in the agent’s reliability when answering questions about DataFrame contents?
  2. Describe the agent’s loop structure (thought/action/observation) and how the scratchpad helps with multi-step queries.
  3. Why is prompt injection a concern for agents that run Python, and what mitigation is implied by running in a trusted environment?

Key Points

  1. LangChain’s CSV agent enables natural-language querying of CSV/Excel data by executing pandas operations through a Python REPL.

  2. Setting the OpenAI model temperature to 0 reduces randomness and helps keep answers grounded in the underlying data.

  3. The agent uses an iterative thought/action/observation loop and a scratchpad to carry intermediate results across steps.

  4. Prompt injection risk increases when the agent runs code (Python REPL), so untrusted end users require extra caution and safer execution environments.

  5. The demo shows correct handling of semantic questions by mapping user intent (e.g., “female”) to dataset encodings (gender values f/m).

  6. Compound filters are handled by generating combined pandas conditions (e.g., stay length threshold plus gender).

  7. Excel support in this approach requires converting Excel to CSV first, since there’s no native Excel import.

Highlights

The agent answers real DataFrame questions by running pandas code via a Python REPL, not by guessing from text.
Gender queries work even when the dataset uses encoded values (f/m), because the agent leverages the DataFrame’s column structure.
Multi-condition questions translate into combined pandas filters, enabling precise counts like “stay > 3 years and female.”
Excel queries work after converting Excel to CSV, after which the same agent can compute averages and frequency counts (e.g., United States appears 48 times).

Topics

  • LangChain CSV Agent
  • Natural Language Data Querying
  • pandas DataFrame Filtering
  • Python REPL Agents
  • Excel-to-CSV Workflow