Python Pandas Tutorial (Part 3): Indexes - How to Set, Reset, and Use Indexes
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Pandas indexes provide the labels that `.loc` uses for row selection, turning lookups into direct label-based queries.
Briefing
Pandas indexes turn row lookups from “search by position” into “search by label,” and that shift makes real survey data far easier to query. The tutorial starts with a small people table where the default index is just 0, 1, 2, then shows how swapping that index for a meaningful identifier, such as email, makes lookups direct and readable.
In the sample DataFrame, the index starts out as an unnamed range of integers. Pandas does not require index values to be unique, but an index works best as an identifier when its values are unique in practice, so email is used as the example. After `df.set_index('email')`, the email values appear as the left-side index (displayed in bold, with the column name shown as the index name). A key detail is that `set_index` returns a new DataFrame and leaves the existing one unchanged unless `inplace=True` is provided or the result is assigned back. Without either, printing `df` still shows the original default index; with `inplace=True`, the DataFrame’s index actually changes, and `df.index` confirms the index values and name.
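The behavior described above can be sketched as follows. The table contents are hypothetical stand-ins for the tutorial’s small people table, and the names `first_name` and `by_email` are illustrative assumptions rather than the video’s exact code.

```python
import pandas as pd

# Hypothetical sample data standing in for the tutorial's people table.
df = pd.DataFrame(
    {
        "first_name": ["Corey", "Jane", "John"],
        "last_name": ["Schafer", "Doe", "Doe"],
        "email": ["cory@example.com", "jane@example.com", "john@example.com"],
    }
)

# set_index returns a new DataFrame; df itself keeps its default RangeIndex.
by_email = df.set_index("email")
default_index_kept = isinstance(df.index, pd.RangeIndex)

# With inplace=True the existing DataFrame is modified instead.
df.set_index("email", inplace=True)
index_name = df.index.name          # the former column name becomes the index name
index_is_unique = df.index.is_unique  # pandas does not enforce this, so it is worth checking
```

Assigning the result back (`df = df.set_index("email")`) has the same effect as `inplace=True`.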
Once email becomes the index, label-based selection with `.loc` becomes the natural way to retrieve rows. The tutorial demonstrates selecting a specific person by passing the email string to `df.loc[...]`, optionally narrowing to a single column (e.g., retrieving `last_name`). Passing the old integer labels to `.loc` raises a `KeyError` because those labels no longer exist in the index. For retrieval by integer position, the tutorial points to `.iloc` as the alternative.
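A minimal sketch of these lookups, again using hypothetical sample rows (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical people table, indexed by email.
df = pd.DataFrame(
    {
        "first_name": ["Corey", "Jane", "John"],
        "last_name": ["Schafer", "Doe", "Doe"],
        "email": ["cory@example.com", "jane@example.com", "john@example.com"],
    }
).set_index("email")

# Label-based lookup: the whole row, then a single value from that row.
row = df.loc["cory@example.com"]
last_name = df.loc["cory@example.com", "last_name"]

# The old integer labels are gone, so .loc no longer accepts them.
try:
    df.loc[0]
except KeyError:
    old_label_failed = True
else:
    old_label_failed = False

# Position-based access still works through .iloc.
first_by_position = df.iloc[0]
```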
The tutorial then moves from toy data to a Stack Overflow survey dataset. The survey includes a `Respondent` column that functions as a unique ID per row, so the default range index is replaced with that ID. This can be done either after loading via `df.set_index('Respondent', inplace=True)` or during import by passing `index_col='Respondent'` to `read_csv`. With the respondent ID as the index, selecting the first respondent becomes as simple as `df.loc[1]`.
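Both approaches can be compared on a tiny in-memory CSV. The three rows and column headers below are hypothetical stand-ins for the real survey file, and `io.StringIO` is used only so the sketch runs without downloading anything.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the survey CSV.
csv_text = (
    "Respondent,Hobbyist,Country\n"
    "1,Yes,United Kingdom\n"
    "2,No,Thailand\n"
    "3,Yes,Canada\n"
)

# Option 1: load first, then set the index on the loaded DataFrame.
df_after = pd.read_csv(io.StringIO(csv_text))
df_after.set_index("Respondent", inplace=True)

# Option 2: set the index while reading the file.
df_at_load = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

# Both routes should produce the same table.
same_result = df_after.equals(df_at_load)

# With the ID as the index, the label 1 selects the first respondent.
first_respondent = df_at_load.loc[1]
```

Setting the index at load time avoids ever having a DataFrame with the throwaway range index, which is convenient when the same file is loaded repeatedly.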
A second DataFrame holding the survey schema (`schema_df`) maps survey column names to question text. Setting its index to the `Column` field makes it possible to look up what a column means without manually scanning the entire schema. For instance, looking up `Hobbyist` returns the question text (“Do you code as a hobby?”), and looking up a less familiar label like `MgrIdiot` retrieves the full prompt. To improve usability, the tutorial also sorts the schema index alphabetically using `schema_df.sort_index()`, with an option for descending order via `ascending=False`. As with `set_index`, the sorted result is returned rather than stored, so `inplace=True` (or assigning the result back) is needed if the sorting should persist for later operations.
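The schema lookup and sorting can be sketched like this. The schema rows are a hypothetical three-line excerpt: only the hobby question text comes from the summary above, the other question texts are placeholders, and the `QuestionText` column name is an assumption.

```python
import pandas as pd

# Hypothetical excerpt of the schema table; placeholder text is marked as such.
schema_df = pd.DataFrame(
    {
        "Column": ["Respondent", "MainBranch", "Hobbyist"],
        "QuestionText": [
            "(placeholder question text)",
            "(placeholder question text)",
            "Do you code as a hobby?",
        ],
    }
).set_index("Column")

# Look up what a survey column means by its label.
question = schema_df.loc["Hobbyist", "QuestionText"]

# sort_index returns a sorted copy; schema_df keeps its original order.
ascending_labels = list(schema_df.sort_index().index)
descending_labels = list(schema_df.sort_index(ascending=False).index)
original_labels = list(schema_df.index)

# To make the sorted order persist, sort in place (or assign the result back).
schema_df.sort_index(inplace=True)
```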
Overall, indexes are presented as a practical tool for making `.loc` lookups fast, readable, and aligned with how identifiers are already embedded in the data—especially when working with large, schema-driven datasets.
Cornell Notes
Indexes in pandas replace the default row numbering with meaningful labels, enabling direct lookups using `.loc`. The tutorial shows setting an index with `set_index('email')` and highlights that the change doesn’t persist unless `inplace=True` is used or the result is assigned back. After switching the index to email, rows are retrieved by label (e.g., `df.loc['cory@example.com']`), while position-based selection requires `.iloc`. In the Stack Overflow survey example, using `Respondent` as `index_col` (or via `set_index`) makes respondent-based queries straightforward. The same idea applies to the schema DataFrame: setting its index to `Column` allows quick retrieval of question text for labels like `Hobbyist` and `MgrIdiot`, and sorting the index makes navigation easier.
Why does choosing a custom index (like email) change how you query a DataFrame?
What’s the difference between `df.set_index(...)` and `df.set_index(..., inplace=True)`?
How can you set the respondent ID as the index when loading Stack Overflow survey CSV files?
How does setting the `schema` DataFrame index help interpret survey columns?
Why sort the schema index, and how is it done?
Review Questions
- When you switch a DataFrame’s index from the default integers to email strings, which indexer should you use for label-based lookup, and what happens if you try the old integers with it?
- What two methods does the tutorial present for setting `Respondent` as the index when working with CSV data, and how do they differ in timing?
- How does setting the `schema` DataFrame index to the `Column` field change the way you retrieve question text for labels like `Hobbyist`?
Key Points
1. Pandas indexes provide the labels that `.loc` uses for row selection, turning lookups into direct label-based queries.
2. `set_index` does not modify the original DataFrame unless `inplace=True` is used (or the returned DataFrame is assigned).
3. After setting a custom index (e.g., email), integer labels from the default index are no longer valid for `.loc` and raise a `KeyError`.
4. Use `.iloc` when you want integer position access even after changing the index.
5. When loading CSV survey data, `read_csv(..., index_col='Respondent')` can set the index immediately for cleaner downstream queries.
6. Setting the `schema` DataFrame index to the `Column` field enables direct retrieval of question text for any survey column label via `.loc`.
7. Sorting indexes with `sort_index()` (and optionally `ascending=False` / `inplace=True`) improves navigation through large label sets.