Python Tutorial: CSV Module - How to Read, Parse, and Write CSV Files
Based on Corey Schafer's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
CSV stands for comma-separated values, but the delimiter can be commas, tabs, dashes, or other characters as long as the parser matches it.
Briefing
CSV files store structured data as plain text, typically using a delimiter like commas to separate fields on each line. A header row names the columns (e.g., First name, Last name, Email), and each subsequent line holds the corresponding values. That simple format is exactly why CSV parsing matters: without a proper parser, names or fields that contain delimiter characters can break naive string-splitting approaches.
Python’s built-in `csv` module streamlines reading, parsing, and writing CSV data. For reading, the workflow starts by opening the file with a context manager and creating a `csv.reader` object. Iterating over that reader yields each row as a list of values, where the header row appears as the first list. With index-based access, the email is consistently the third element (index 2) when the file is structured as First name, Last name, Email. If the header row isn’t desired, the code can advance the iterator using `next(...)` to skip the first line before processing the remaining records.
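A minimal sketch of that reading workflow (the filename and sample rows are illustrative, not from the original video):

```python
import csv

# Create a small sample file so the sketch is self-contained.
with open('names.csv', 'w', newline='') as f:
    f.write('First name,Last name,Email\n')
    f.write('John,Doe,john-doe@bogusemail.com\n')

with open('names.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # advance past the header row so it isn't treated as data
    for line in csv_reader:
        print(line[2])  # email is the third field (index 2)
```

Remove the `next(csv_reader)` line and the header list `['First name', 'Last name', 'Email']` is yielded as the first row like any other.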
Writing CSV data follows a parallel pattern: open a new output file for writing, create a `csv.writer`, and pass the desired delimiter. When the delimiter changes—such as switching from commas to dashes—the output becomes harder to read, but it demonstrates an important safety feature: the writer automatically quotes fields that contain the delimiter character. In the example, an email containing a dash is wrapped in quotes so the dash inside the email doesn’t get mistaken for a field separator. Similarly, a hyphenated last name is quoted to preserve it as a single value. Using a more common delimiter like tabs (`\t`) produces a cleaner, readable file, and the same delimiter must be specified when reading back the data.
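A sketch of the writing side, assuming illustrative filenames and sample rows; note how a dash delimiter forces the writer to quote any field that itself contains a dash:

```python
import csv

rows = [['First name', 'Last name', 'Email'],
        ['Mary', 'Smith-Robinson', 'maryjacobs@bogusemail.com'],
        ['John', 'Doe', 'john-doe@bogusemail.com']]

# Dash delimiter: fields containing a dash are quoted automatically
# so they aren't mistaken for field separators.
with open('names_dash.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='-').writerows(rows)

# Tab delimiter: cleaner output; the same delimiter must be
# specified when reading the file back.
with open('names_tab.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)
```

Opening the dash-delimited file in a text editor shows `"Smith-Robinson"` and the emails wrapped in quotes, while fields without dashes are left bare.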
A key troubleshooting point emerges when the delimiter is wrong: reading a tab-delimited file with the default comma expectation results in rows that don’t split into multiple fields. Fixing the issue requires explicitly setting `delimiter='\t'` in `csv.reader` so the parser matches the file’s actual structure.
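The symptom and the fix can be demonstrated in a few lines (sample file and names are illustrative):

```python
import csv

# Write a small tab-delimited file to read back.
with open('names_tab.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(
        [['First name', 'Last name', 'Email'],
         ['John', 'Doe', 'john-doe@bogusemail.com']])

# Wrong: the default comma delimiter leaves each tab-separated row unsplit.
with open('names_tab.csv', newline='') as f:
    bad = list(csv.reader(f))
print(len(bad[0]))  # 1 -- the whole line comes back as a single field

# Right: match the file's actual delimiter.
with open('names_tab.csv', newline='') as f:
    good = list(csv.reader(f, delimiter='\t'))
print(len(good[0]))  # 3
```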
For more maintainable code, the tutorial recommends dictionary-based parsing and writing. `csv.DictReader` turns each row into a mapping keyed by the header fields, so accessing values becomes semantic (e.g., `line['Email']`) rather than index-based (e.g., `line[2]`). On the writing side, `csv.DictWriter` requires the field names upfront and can write a header row with `writeheader()`. It also makes column selection straightforward: deleting the `'Email'` key before writing a row drops that column entirely, producing an output file with only the remaining fields (First name and Last name). Overall, the `csv` module avoids brittle parsing logic while handling delimiters, quoting, headers, and structured access to fields.
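A sketch of the dictionary-based round trip, including the column removal; filenames, the tab delimiter on output, and the sample row are illustrative assumptions:

```python
import csv

# Sample input file so the sketch is self-contained.
with open('names.csv', 'w', newline='') as f:
    f.write('First name,Last name,Email\n')
    f.write('John,Doe,john-doe@bogusemail.com\n')

with open('names.csv', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)  # each row becomes a dict keyed by header
    with open('new_names.csv', 'w', newline='') as new_file:
        fieldnames = ['First name', 'Last name']  # email column intentionally omitted
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames,
                                    delimiter='\t')
        csv_writer.writeheader()  # optional: write the header row
        for line in csv_reader:
            del line['Email']  # drop the key so only the listed fields remain
            csv_writer.writerow(line)
```

Without the `del line['Email']`, `DictWriter` raises a `ValueError` for the extra key, since `'Email'` is not in `fieldnames`.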
Cornell Notes
CSV files store tabular data as plain text, using a delimiter to separate fields on each line, with a header row naming the columns. Python’s `csv.reader` reads rows as lists, where field positions are accessed by index (e.g., email at index 2 for First name, Last name, Email). Writing uses `csv.writer`, and changing the delimiter requires matching it during reading; otherwise, parsing fails (e.g., tab-delimited data read as comma-delimited yields one value per line). For clearer, safer code, `csv.DictReader` and `csv.DictWriter` map each row to dictionaries keyed by header names, enabling direct access like `line['Email']` and easy column removal by deleting keys before writing.
Why is using the `csv` module safer than splitting each line with `str.split(',')`?
How does `csv.reader` represent each row, and how are fields accessed?
What’s the correct way to skip the header row when using `csv.reader`?
What happens if the delimiter used by `csv.reader` doesn’t match the file’s delimiter?
How do `csv.DictReader` and `csv.DictWriter` improve code clarity?
Review Questions
- When using `csv.reader`, what index corresponds to the email field given a header of First name, Last name, Email?
- Why must the delimiter be specified consistently when writing with `csv.writer` and reading with `csv.reader`?
- How does deleting the `'Email'` key before `csv.DictWriter.writerow(...)` change the output file's columns?
Key Points
1. CSV stands for comma-separated values, but the delimiter can be commas, tabs, dashes, or other characters as long as the parser matches it.
2. `csv.reader` returns each row as a list, making index-based access possible but less readable than named-field access.
3. Skipping the header row with `next(reader)` prevents treating column names as data records.
4. `csv.writer` automatically quotes fields containing the delimiter character, preserving values like hyphenated emails or names.
5. Reading with the wrong delimiter causes incorrect parsing (often leaving rows unsplit), so `delimiter` must match the file format.
6. `csv.DictReader` and `csv.DictWriter` map rows to dictionaries keyed by header names, enabling clearer access like `line['Email']` and easy column removal by deleting keys.