Python Pandas Tutorial (Part 11): Reading/Writing Data to Different Sources - Excel, JSON, SQL, Etc
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Pandas can move data between common formats—CSV, tab-delimited text, Excel, JSON, and SQL databases—using a small set of consistent read/write methods, with a few format-specific parameters to get the structure right. The practical takeaway: once the right arguments are set (paths, separators, index columns, JSON orientation, and database connection details), the same filtered DataFrame can be exported and later re-imported without losing the core tabular content.
The walkthrough starts with CSV. Data is loaded from a relative path (a data folder next to the Jupyter notebook) using `pd.read_csv`, with `index_col` set to `respondent` so each survey respondent ID becomes the DataFrame index. After filtering the dataset to a specific country (e.g., `country == 'India'`), the filtered DataFrame is exported back to disk using `DataFrame.to_csv`, writing a new file such as `modified_D.csv` in the same data directory. The result is a raw CSV that preserves column headers and rows, making it easy to share or reuse.
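A minimal sketch of that round trip, assuming the tutorial's folder layout and a survey file with `respondent` and `country` columns (the source filename is a placeholder):

```python
import pandas as pd

# Load the survey from the local data folder, using the respondent ID
# column as the DataFrame index (source filename is illustrative).
df = pd.read_csv('data/survey_results_public.csv', index_col='respondent')

# Filter to a single country, then export the subset back to disk.
filt = df['country'] == 'India'
india_df = df.loc[filt]
india_df.to_csv('data/modified_D.csv')
```

The exported file keeps the respondent index as its first column, so reading it back with the same `index_col` restores the original structure.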
For tab-delimited files, the process stays nearly identical: pandas still uses CSV methods, but the delimiter changes. Writing uses `to_csv` with a tab separator (via `sep='\t'`) and a `.tsv` filename, producing a file that looks like CSV except fields are separated by tabs. Reading a TSV follows the same pattern by passing `sep='\t'` into `read_csv`.
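Continuing from the CSV sketch above, only the separator and the file extension change:

```python
# Write the same filtered frame tab-separated instead of comma-separated.
india_df.to_csv('data/modified.tsv', sep='\t')

# Reading mirrors the write: pass the same separator to read_csv.
tsv_df = pd.read_csv('data/modified.tsv', sep='\t', index_col='respondent')
```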
Excel support requires installing additional packages. The tutorial installs `xlwt` for older `.xls` writing, `openpyxl` for `.xlsx`, and `xlrd` for reading Excel files. With those in place, `DataFrame.to_excel` exports the filtered dataset to `modified.xlsx`. Importing back uses `pd.read_excel`, again setting `index_col='respondent'` so the index matches the original CSV/filtered structure. It also notes that Excel's multi-sheet layout can be handled with the `sheet_name` argument, and that `read_excel` can target specific sheets or subsets of rows and columns if needed.
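A hedged sketch of the Excel round trip; note that package roles have shifted in newer releases (`openpyxl` now also handles `.xlsx` reading, while `xlrd` 2.0+ only reads legacy `.xls` files):

```python
# Extra packages, e.g.: pip install openpyxl xlrd
# (xlwt is only needed for legacy .xls output)
india_df.to_excel('data/modified.xlsx')

excel_df = pd.read_excel('data/modified.xlsx', index_col='respondent')

# For multi-sheet workbooks, pick a sheet by name (sheet name illustrative):
# sheet_df = pd.read_excel('data/modified.xlsx', sheet_name='Sheet1',
#                          index_col='respondent')
```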
JSON is handled with `DataFrame.to_json` and `pd.read_json`, but the key detail is orientation. By default, pandas writes a dictionary-like JSON structure. Switching to a list-like format uses `orient='records'` and `lines=True`, which writes each record on its own line. Reading back requires matching those same orientation settings; otherwise, the resulting DataFrame shape can differ.
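A short sketch of both orientations, continuing with the same filtered frame:

```python
# Default orientation (orient='columns') writes a dictionary-like structure.
india_df.to_json('data/modified.json')

# List-like output: one JSON object per line.
india_df.to_json('data/modified.json', orient='records', lines=True)

# Reading back must match the export's orientation. Note that
# orient='records' does not store the index, so respondent IDs are not
# recovered automatically in this form.
json_df = pd.read_json('data/modified.json', orient='records', lines=True)
```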
Finally, SQL integration uses SQLAlchemy and a Postgres driver (`psycopg2-binary`). After creating a database connection with `create_engine` using a Postgres connection string, the filtered DataFrame is written with `DataFrame.to_sql` into a table like `sample_table`. If the table already exists, `if_exists='replace'` allows overwriting; other options include erroring or appending. Reading uses `pd.read_sql` (or `pd.read_sql_query` for custom queries), with `index_col='respondent'` to restore the index. The tutorial closes with a convenience tip: many pandas readers can load JSON directly from a URL using `pd.read_json` without downloading the file first.
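A sketch of the SQL round trip; the connection string, credentials, and database name are placeholders:

```python
from sqlalchemy import create_engine

# Requires: pip install SQLAlchemy psycopg2-binary
# Connection string format: postgresql://user:password@host:port/dbname
engine = create_engine('postgresql://dbuser:dbpass@localhost:5432/sample_db')

# Write the filtered frame; 'replace' drops and recreates an existing table.
# Other if_exists options: 'fail' (the default, which raises) and 'append'.
india_df.to_sql('sample_table', engine, if_exists='replace')

# Read the full table back, restoring the respondent index.
sql_df = pd.read_sql('sample_table', engine, index_col='respondent')

# Or load only what a query selects.
query_df = pd.read_sql_query('SELECT * FROM sample_table', engine,
                             index_col='respondent')

# Readers also accept URLs when the format matches (placeholder URL):
# remote_df = pd.read_json('https://example.com/data.json',
#                          orient='records', lines=True)
```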
Cornell Notes
Pandas can read and write the same dataset across CSV, TSV, Excel, JSON, and SQL by using consistent DataFrame methods plus a few format-specific settings. CSV and TSV rely on `read_csv`/`to_csv`, with TSV simply changing the separator to `\t`. Excel requires installing extra packages and using `to_excel`/`read_excel`, typically setting `index_col='respondent'` to preserve the index. JSON works with `to_json`/`read_json`, but the `orient` and `lines` options must match between export and import. SQL uses SQLAlchemy: connect with `create_engine`, write with `to_sql` (using `if_exists` for existing tables), and read with `read_sql` or `read_sql_query` while specifying `index_col` when needed.
- How does pandas preserve a meaningful index when moving survey data between formats like CSV and Excel?
- What changes when exporting to a tab-delimited file instead of CSV?
- Why do JSON exports sometimes re-import into a different structure, and how is that controlled?
- What extra setup is required to write and read Excel files with pandas?
- How does pandas write a DataFrame into a SQL table and handle the case where the table already exists?
- When reading from SQL, when should `read_sql_query` be used instead of `read_sql`?
Review Questions
- When exporting to JSON, what two parameters must match between `to_json` and `read_json` to keep the DataFrame structure consistent?
- In the SQL workflow, what does `if_exists='replace'` do, and what alternative behaviors does pandas offer for existing tables?
- How do `sep='\t'` and the `.tsv` extension work together to distinguish tab-delimited files from CSV in pandas?
Key Points
1. Use `pd.read_csv(..., index_col='respondent')` and `DataFrame.to_csv(...)` to move survey data into and out of CSV while preserving respondent IDs as the index.
2. Treat TSV as CSV with a different delimiter: set `sep='\t'` when reading and writing, and use a `.tsv` filename.
3. Excel read/write requires installing `xlwt`, `openpyxl`, and `xlrd`, then using `to_excel`/`read_excel` with `index_col` to restore the index.
4. JSON export/import must match orientation settings: `orient='records'` and `lines=True` require the same options in `pd.read_json` to reconstruct the DataFrame correctly.
5. SQL integration uses SQLAlchemy: connect with `create_engine`, write with `to_sql` (using `if_exists` to control behavior for existing tables), and read with `read_sql` or `read_sql_query`.
6. When working with large SQL datasets, prefer `read_sql_query` with a `SELECT` statement (and optional `WHERE` clause) to limit what pandas loads into memory.
7. Many pandas readers can load from URLs when the format matches the reader (e.g., `pd.read_json(url)` for JSON).