Python Pandas Tutorial (Part 11): Reading/Writing Data to Different Sources - Excel, JSON, SQL, Etc
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Pandas can move data between common formats—CSV, tab-delimited text, Excel, JSON, and SQL databases—using a small set of consistent read/write methods, with a few format-specific parameters to get the structure right. The practical takeaway: once the right arguments are set (paths, separators, index columns, JSON orientation, and database connection details), the same filtered DataFrame can be exported and later re-imported without losing the core tabular content.
The walkthrough starts with CSV. Data is loaded from a relative path (a data folder next to the Jupyter notebook) using `pd.read_csv`, with `index_col` set to `respondent` so each survey respondent ID becomes the DataFrame index. After filtering the dataset to a specific country (e.g., `country == 'India'`), the filtered DataFrame is exported back to disk using `DataFrame.to_csv`, writing a new file such as `modified_D.csv` in the same data directory. The result is a raw CSV that preserves column headers and rows, making it easy to share or reuse.
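A minimal sketch of that round trip, assuming the tutorial's folder layout and a survey file with `respondent` and `country` columns (the source filename is a placeholder):

```python
import pandas as pd

# Load the survey from the local data folder, using the respondent ID
# column as the DataFrame index (source filename is illustrative).
df = pd.read_csv('data/survey_results_public.csv', index_col='respondent')

# Filter to a single country, then export the subset back to disk.
filt = df['country'] == 'India'
india_df = df.loc[filt]
india_df.to_csv('data/modified_D.csv')
```

The exported file keeps the respondent index as its first column, so reading it back with the same `index_col` restores the original structure.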
For tab-delimited files, the process stays nearly identical: pandas still uses CSV methods, but the delimiter changes. Writing uses `to_csv` with a tab separator (via `sep='\t'`) and a `.tsv` filename, producing a file that looks like CSV except fields are separated by tabs. Reading a TSV follows the same pattern by passing `sep='\t'` into `read_csv`.
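Continuing from the CSV sketch above, only the separator and the file extension change:

```python
# Write the same filtered frame tab-separated instead of comma-separated.
india_df.to_csv('data/modified.tsv', sep='\t')

# Reading mirrors the write: pass the same separator to read_csv.
tsv_df = pd.read_csv('data/modified.tsv', sep='\t', index_col='respondent')
```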
Excel support requires installing additional packages. The tutorial installs `xlwt` for older `.xls` writing, `openpyxl` for `.xlsx`, and `xlrd` for reading Excel files. With those in place, `DataFrame.to_excel` exports the filtered dataset to `modified.xlsx`. Importing back uses `pd.read_excel`, again setting `index_col='respondent'` so the index matches the original CSV/filtered structure. It also notes that Excel's multi-sheet layout can be handled with the `sheet_name` argument, and that `read_excel` can target specific sheets or subsets of rows and columns if needed.
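A hedged sketch of the Excel round trip; note that package roles have shifted in newer releases (`openpyxl` now also handles `.xlsx` reading, while `xlrd` 2.0+ only reads legacy `.xls` files):

```python
# Extra packages, e.g.: pip install openpyxl xlrd
# (xlwt is only needed for legacy .xls output)
india_df.to_excel('data/modified.xlsx')

excel_df = pd.read_excel('data/modified.xlsx', index_col='respondent')

# For multi-sheet workbooks, pick a sheet by name (sheet name illustrative):
# sheet_df = pd.read_excel('data/modified.xlsx', sheet_name='Sheet1',
#                          index_col='respondent')
```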
JSON is handled with `DataFrame.to_json` and `pd.read_json`, but the key detail is orientation. By default, pandas writes a dictionary-like JSON structure. Switching to a list-like format uses `orient='records'` and `lines=True`, which writes each record on its own line. Reading back requires matching those same orientation settings; otherwise, the resulting DataFrame shape can differ.
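A short sketch of both orientations, continuing with the same filtered frame:

```python
# Default orientation (orient='columns') writes a dictionary-like structure.
india_df.to_json('data/modified.json')

# List-like output: one JSON object per line.
india_df.to_json('data/modified.json', orient='records', lines=True)

# Reading back must match the export's orientation. Note that
# orient='records' does not store the index, so respondent IDs are not
# recovered automatically in this form.
json_df = pd.read_json('data/modified.json', orient='records', lines=True)
```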
Finally, SQL integration uses SQLAlchemy and a Postgres driver (`psycopg2-binary`). After creating a database connection with `create_engine` using a Postgres connection string, the filtered DataFrame is written with `DataFrame.to_sql` into a table like `sample_table`. If the table already exists, `if_exists='replace'` allows overwriting; other options include erroring or appending. Reading uses `pd.read_sql` (or `pd.read_sql_query` for custom queries), with `index_col='respondent'` to restore the index. The tutorial closes with a convenience tip: many pandas readers can load JSON directly from a URL using `pd.read_json` without downloading the file first.
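A sketch of the SQL round trip; the connection string, credentials, and database name are placeholders:

```python
from sqlalchemy import create_engine

# Requires: pip install SQLAlchemy psycopg2-binary
# Connection string format: postgresql://user:password@host:port/dbname
engine = create_engine('postgresql://dbuser:dbpass@localhost:5432/sample_db')

# Write the filtered frame; 'replace' drops and recreates an existing table.
# Other if_exists options: 'fail' (the default, which raises) and 'append'.
india_df.to_sql('sample_table', engine, if_exists='replace')

# Read the full table back, restoring the respondent index.
sql_df = pd.read_sql('sample_table', engine, index_col='respondent')

# Or load only what a query selects.
query_df = pd.read_sql_query('SELECT * FROM sample_table', engine,
                             index_col='respondent')

# Readers also accept URLs when the format matches (placeholder URL):
# remote_df = pd.read_json('https://example.com/data.json',
#                          orient='records', lines=True)
```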
Cornell Notes
Pandas can read and write the same dataset across CSV, TSV, Excel, JSON, and SQL by using consistent DataFrame methods plus a few format-specific settings. CSV and TSV rely on `read_csv`/`to_csv`, with TSV simply changing the separator to `\t`. Excel requires installing extra packages and using `to_excel`/`read_excel`, typically setting `index_col='respondent'` to preserve the index. JSON works with `to_json`/`read_json`, but the `orient` and `lines` options must match between export and import. SQL uses SQLAlchemy: connect with `create_engine`, write with `to_sql` (using `if_exists` for existing tables), and read with `read_sql` or `read_sql_query` while specifying `index_col` when needed.
- How does pandas preserve a meaningful index when moving survey data between formats like CSV and Excel?
- What changes when exporting to a tab-delimited file instead of CSV?
- Why do JSON exports sometimes re-import into a different structure, and how is that controlled?
- What extra setup is required to write and read Excel files with pandas?
- How does pandas write a DataFrame into a SQL table and handle the case where the table already exists?
- When reading from SQL, when should `read_sql_query` be used instead of `read_sql`?
Review Questions
- When exporting to JSON, what two parameters must match between `to_json` and `read_json` to keep the DataFrame structure consistent?
- In the SQL workflow, what does `if_exists='replace'` do, and what alternative behaviors does pandas offer for existing tables?
- How do `sep='\t'` and the `.tsv` extension work together to distinguish tab-delimited files from CSV in pandas?
Key Points
1. Use `pd.read_csv(..., index_col='respondent')` and `DataFrame.to_csv(...)` to move survey data into and out of CSV while preserving respondent IDs as the index.
2. Treat TSV as CSV with a different delimiter: set `sep='\t'` when reading and writing, and use a `.tsv` filename.
3. Excel read/write requires installing `xlwt`, `openpyxl`, and `xlrd`, then using `to_excel`/`read_excel` with `index_col` to restore the index.
4. JSON export/import must match orientation settings: `orient='records'` and `lines=True` require the same options in `pd.read_json` to reconstruct the DataFrame correctly.
5. SQL integration uses SQLAlchemy: connect with `create_engine`, write with `to_sql` (using `if_exists` to control behavior for existing tables), and read with `read_sql` or `read_sql_query`.
6. When working with large SQL datasets, prefer `read_sql_query` with a `SELECT` statement (and optional `WHERE` clause) to limit what pandas loads into memory.
7. Many pandas readers can load from URLs when the format matches the reader (e.g., `pd.read_json(url)` for JSON).