Talk to your CSV & Excel with LangChain
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LangChain’s CSV agent enables natural-language querying of CSV/Excel data by executing pandas operations through a Python REPL.
Briefing
LangChain can turn natural-language questions into accurate, on-the-fly analysis of CSV and Excel data by using a “CSV agent” that runs a Python REPL over a pandas DataFrame. The practical payoff is that users can ask things like “How many rows are in this file?” or “How many people stayed more than three years in the city and are female?” without writing pandas code—while the system still performs real filtering and counting rather than guessing.
The setup starts with a Black Friday Sales dataset from Kaggle, loaded into pandas purely for sanity checks: column names such as gender (encoded as f and m), age (bucketed), stay in current city number of years, and marital status. LangChain then takes over with its own CSV-loading mechanism; no training or fine-tuning is involved. A key implementation choice is using an OpenAI language model with temperature set to 0 to reduce randomness and limit hallucinations. Verbose mode is enabled so the prompts and intermediate reasoning steps are visible.
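The setup described above can be sketched roughly as follows. This assumes the classic LangChain API (`create_csv_agent` and the `OpenAI` LLM wrapper, which in newer releases live in `langchain_experimental` and `langchain_openai`), an installed `langchain` package, and an `OPENAI_API_KEY` in the environment; the CSV path is a placeholder.

```python
def build_csv_agent(csv_path: str):
    """Sketch of the CSV-agent setup from the video.

    Assumptions: langchain is installed and OPENAI_API_KEY is set;
    csv_path points at the (hypothetical) downloaded Kaggle CSV.
    """
    # Imports are deferred so the function can be defined without langchain.
    from langchain.llms import OpenAI
    from langchain.agents import create_csv_agent

    llm = OpenAI(temperature=0)  # temperature 0 reduces randomness/hallucination
    # verbose=True prints the prompts and intermediate reasoning steps
    return create_csv_agent(llm, csv_path, verbose=True)
```

Once built, the agent is queried with plain English, e.g. `agent.run("How many rows are there?")`.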
Under the hood, the CSV agent effectively wraps a Python agent, which means it can be vulnerable to prompt-injection-style attacks if exposed to untrusted end users. In the demo, everything runs in a controlled environment, so the risk is treated as manageable.
The agent’s workflow follows a loop of “thought → action → observation,” with a scratchpad that carries intermediate results between model calls. For a simple query—“How many rows are there?”—the agent decides it needs to count the DataFrame length and uses the Python REPL to compute it, returning 550,000 rows. For a more semantic question—“How many people are female?”—it correctly maps the user’s wording (“female”) to the dataset’s encoding (gender values f and m) by referencing the DataFrame’s column structure and then counting rows where gender equals f, yielding 135,000.
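The pandas operations the agent generates for these two questions look roughly like the following. The DataFrame here is a tiny synthetic stand-in for the real file (which has ~550,000 rows), and the column names are illustrative, not the dataset's exact headers.

```python
import pandas as pd

# Tiny stand-in for the Black Friday data; column names are illustrative.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Stay_In_Current_City_Years": [1, 4, 5, 2, 4],
})

n_rows = len(df)                        # "How many rows are there?"
n_female = (df["Gender"] == "F").sum()  # "How many people are female?"
print(n_rows, n_female)                 # → 5 3
```

The semantic step is the mapping from the user's word "female" to the dataset's `F` encoding; the counting itself is an ordinary DataFrame operation.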
More complex filters work the same way. When asked how many people stayed in the city for more than three years, the agent writes a pandas filter on stay in current city number of years > 3 and returns the count. It can combine conditions, such as staying more than three years and being female, by generating a compound pandas query (stay_in_city > 3 and gender == f). It also supports comparative analytics; after counting males and females, it can answer whether one group is larger.
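A compound condition like "stayed more than three years and female" translates to a boolean mask combining both filters. Again using a small synthetic frame with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Stay_In_Current_City_Years": [1, 4, 5, 2, 4],
})

# Mask the agent might emit for "stayed > 3 years AND female"
mask = (df["Stay_In_Current_City_Years"] > 3) & (df["Gender"] == "F")
print(mask.sum())  # → 2
```

Note that pandas uses `&` (element-wise AND) with parenthesized conditions, not Python's `and`; the agent emits this form in its REPL step.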
Excel support comes indirectly: LangChain has no native Excel import in this approach, so Excel files are converted to CSV before loading. Once converted, the agent can list column names, compute aggregates like average age, and handle frequency questions such as which country appears most. In one example, it identifies the United States as the most frequent country with 48 occurrences. Another dataset enables ratio-style questions like the male-to-female ratio, demonstrating that the same natural-language querying pattern applies across different CSVs.
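The Excel-to-CSV conversion step can be done with pandas itself, as sketched below. `pd.read_excel` requires an Excel engine such as `openpyxl` to be installed; the paths are placeholders.

```python
import pandas as pd

def excel_to_csv(xlsx_path: str, csv_path: str) -> None:
    """Convert the first sheet of an Excel file to CSV so the CSV agent
    can load it. Requires an Excel engine (e.g. openpyxl) installed."""
    pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
```

After conversion, the resulting CSV is passed to the same agent-creation call used for native CSV files.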
Overall, the method enables quick, lightweight “data Q&A” apps for people who don’t want to learn pandas, while still relying on actual DataFrame operations for counts, filters, averages, and group-style computations.
Cornell Notes
LangChain’s CSV agent lets users query CSV data using natural language by translating questions into pandas operations executed via a Python REPL. The agent loads the file into a pandas DataFrame, then iteratively decides what action to take, using a scratchpad to keep intermediate results. Setting the OpenAI model temperature to 0 helps keep outputs consistent and reduces hallucinations. The approach works for multi-step questions like filtering by multiple conditions (e.g., stay length and gender) and for comparisons (e.g., whether males outnumber females). Excel files require conversion to CSV first, since there’s no native Excel import in this workflow.
How does the CSV agent answer “How many rows are there?” without any manual pandas code?
Why can the agent answer “How many people are female?” even though the dataset doesn’t contain the words “female” or “male”?
What changes when the question becomes a compound filter like “more than three years and female”?
How does the agent support comparative questions such as “Are there more males than females?”
What’s the limitation for Excel files in this workflow, and how is it handled?
Review Questions
- What role does temperature=0 play in the agent’s reliability when answering questions about DataFrame contents?
- Describe the agent’s loop structure (thought/action/observation) and how the scratchpad helps with multi-step queries.
- Why is prompt injection a concern for agents that run Python, and what mitigation is implied by running in a trusted environment?
Key Points
1. LangChain’s CSV agent enables natural-language querying of CSV/Excel data by executing pandas operations through a Python REPL.
2. Setting the OpenAI model temperature to 0 reduces randomness and helps keep answers grounded in the underlying data.
3. The agent uses an iterative thought/action/observation loop and a scratchpad to carry intermediate results across steps.
4. Prompt injection risk increases when the agent runs code (Python REPL), so untrusted end users require extra caution and safer execution environments.
5. The demo shows correct handling of semantic questions by mapping user intent (e.g., “female”) to dataset encodings (gender values f/m).
6. Compound filters are handled by generating combined pandas conditions (e.g., stay length threshold plus gender).
7. Excel support in this approach requires converting Excel to CSV first, since there’s no native Excel import.