
Python Pandas Tutorial (Part 9): Cleaning Data - Casting Datatypes and Handling Missing Values

Corey Schafer · 5 min read

Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use `dropna(axis='index', how='any')` to remove rows with any missing value, and `how='all'` to remove only fully missing rows.

Briefing

Cleaning data in pandas often starts with two practical tasks: removing or retaining rows/columns with missing values, and converting columns into the right numeric types so calculations don’t break. The core takeaway is that pandas treats different “missing” markers differently—so you need to both standardize missing values (including custom strings) and then choose the right drop/cast strategy for the analysis you’re trying to run.

For missing values, the tutorial begins with `dropna()`, showing how pandas behaves under its defaults and how to control it. With `axis='index'` and `how='any'`, pandas drops any row that contains at least one missing value. Changing `how` to `all` keeps rows where only some fields are missing, dropping only rows where every value is missing. Switching `axis` to `columns` applies the same rule column-wise, so with `how='all'` only entirely empty columns are dropped. The lesson matters because “missing” doesn’t always mean “useless”: an analysis might tolerate a missing email address but not a missing identifier.
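
A minimal sketch of these `dropna()` variations, assuming a small made-up DataFrame (the names and values below are illustrative, not the tutorial's exact data):

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame (stand-in data, not the tutorial's exact table)
people = pd.DataFrame({
    'first': ['Corey', 'Jane', np.nan, 'Na'],
    'last': ['Schafer', 'Doe', np.nan, 'missing'],
    'email': ['corey@email.com', np.nan, np.nan, 'anon@email.com'],
    'age': ['33', '55', np.nan, 'missing'],
})

# Defaults: drop any row containing at least one missing value
people.dropna(axis='index', how='any')

# Drop only rows where every value is missing
people.dropna(axis='index', how='all')

# Same rule applied column-wise: drop columns that are entirely missing
people.dropna(axis='columns', how='all')
```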

To target missingness in specific fields, the tutorial uses `subset` with `dropna()`. By specifying `subset=['email']`, rows are removed only when the email value is missing; other columns can be incomplete without triggering a drop. When multiple columns are listed in `subset`, the `how` parameter determines whether a row is dropped only when all specified fields are missing (`how='all'`) or when any of them are missing (`how='any'`). The tutorial also emphasizes that these operations are not permanent unless `inplace=True` is used.
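
Continuing the same illustrative DataFrame, a sketch of `subset` and `inplace`:

```python
# Drop a row only when its email is missing; other fields may be blank
people.dropna(axis='index', how='any', subset=['email'])

# With several subset columns, how='all' drops a row only when every
# listed field (here both last AND email) is missing
people.dropna(axis='index', how='all', subset=['last', 'email'])

# The calls above return new DataFrames; persist the change explicitly
people.dropna(subset=['email'], inplace=True)
```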

Next comes the problem of “custom missing values.” Real datasets sometimes encode missingness as strings like `'Na'` or `'missing'` rather than actual `NaN`. The tutorial demonstrates replacing these placeholders with `numpy.nan` using `replace(..., inplace=True)`, so downstream methods like `isna()` and `dropna()` treat them consistently. It then shows how to inspect missingness with `isna()` and how to fill missing values with `fillna()`—either with a domain-appropriate constant (like `0` for numeric grades) or with a placeholder string.
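
A sketch of standardizing the placeholders and then inspecting or filling them, still on the illustrative DataFrame (the strings `'Na'` and `'missing'` follow the summary; a real dataset may use different markers):

```python
import numpy as np

# Convert placeholder strings into real NaN so pandas recognizes them
people.replace('Na', np.nan, inplace=True)
people.replace('missing', np.nan, inplace=True)

# Boolean mask of which entries are now treated as missing
people.isna()

# Fill missing values with a constant that suits the column
people.fillna('MISSING')   # placeholder string for display
people.fillna(0)           # e.g. 0 for a numeric grade column
```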

The second major pillar is casting data types. A column that looks numeric may still be stored as `object` (often strings), which breaks numeric operations such as `mean()`. The tutorial uses the `dtypes` attribute to confirm that a column like `age` is stored as strings, then converts it with `astype()`. A key caveat appears when missing values are present: converting to `int` fails because `np.nan` is a float under the hood, so the safer cast is to `float` (or to fill missing values first). Once cast correctly, `mean()` works and returns the expected average.
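
Continuing after the replacement above, a sketch of checking and casting the `age` column, assumed here to be stored as strings:

```python
# An 'object' dtype usually means the column holds strings, even if it looks numeric
people.dtypes

# people['age'].astype(int) would fail while NaN is present (NaN is a float),
# so cast to float instead, or fill the missing values first
people['age'] = people['age'].astype(float)

# Numeric operations now behave as expected
people['age'].mean()
```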

To tie everything together, the tutorial applies the same cleaning and casting steps to Stack Overflow survey data to compute average years of coding experience. The `years_code` column contains numeric values plus strings like `'less than one year'` and `'more than 50 years'`. The workflow is: convert custom missing markers during CSV loading (via `na_values`), replace `'less than one year'` with `0`, replace `'more than 50 years'` with `51`, cast the column to `float`, and then compute the mean (about 11.5 years). The result underscores why missing-value handling and type casting aren’t optional chores—they’re prerequisites for reliable analysis.
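
A sketch of that workflow under the summary's naming; the CSV filename is an assumption, and the real survey export may spell the column and labels differently (e.g., `YearsCode`, `'Less than 1 year'`, `'More than 50 years'`):

```python
import pandas as pd

# Custom markers treated as missing while loading; the filename and marker
# list are assumptions, adjust them to the actual survey export
na_vals = ['NA', 'Missing']
survey = pd.read_csv('survey_results_public.csv', na_values=na_vals)

# Replace the categorical labels with numeric stand-ins, then cast
col = 'years_code'  # some survey exports spell this 'YearsCode'
survey[col] = survey[col].replace('less than one year', 0)
survey[col] = survey[col].replace('more than 50 years', 51)
survey[col] = survey[col].astype(float)

survey[col].mean()    # roughly 11.5 years in the tutorial
survey[col].median()  # 9 years in the tutorial
```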

Cornell Notes

Missing data in pandas is handled through `dropna()` and `fillna()`, but only after standardizing what counts as “missing.” `dropna()` can remove rows or columns based on `axis` and `how` (`any` vs `all`), and `subset` lets you drop only when specific fields are missing. Custom placeholders like `'Na'` or `'missing'` must be replaced with `numpy.nan` so pandas methods recognize them. For calculations, columns that look numeric may be stored as `object` strings; `astype()` fixes this, but `np.nan` forces numeric casts to use `float` rather than `int`. The Stack Overflow example shows these steps in practice to compute average years of coding experience.

How do `dropna()`’s `axis` and `how` parameters change what gets removed?

`axis` controls whether pandas drops along rows (`axis='index'`) or columns (`axis='columns'`). With `how='any'`, pandas drops a row/column if it contains at least one missing value. With `how='all'`, pandas drops only if every value in that row/column is missing. In the tutorial’s small DataFrame, `axis='index', how='any'` removed more rows than `how='all'` because partially-missing rows were kept under the stricter “all missing” rule.

When should `subset` be used with `dropna()`?

Use `subset` when missingness in some columns is acceptable but missingness in specific columns breaks the analysis. For example, if email is required but first name and last name are optional, `dropna(subset=['email'])` keeps rows where email exists while allowing other fields to be missing. With multiple columns in `subset`, `how='all'` keeps rows as long as at least one of the listed fields is present, while `how='any'` drops rows if any listed field is missing.

Why do custom missing-value strings require replacement before using `dropna()`?

Pandas only treats actual `NaN` values as missing by default. If a dataset encodes missingness as strings like `'Na'` or `'missing'`, those won’t be recognized by `isna()` or `dropna()` until they’re converted. The tutorial replaces these placeholders across the DataFrame with `numpy.nan` using `replace(..., inplace=True)`, after which `isna()` correctly flags those entries and `dropna()` can remove them.

What goes wrong when casting a column with missing values to `int`, and what’s the fix?

`np.nan` is a float internally, so converting a column containing `NaN` to `int` fails. The tutorial shows that `astype(int)` raises an error when `NaN` is present. The fix is either to fill missing values first (e.g., with `fillna(0)` for some numeric use cases) or to cast to `float` so missing values can remain as `NaN`. In the example, casting `age` to `float` enables `mean()` to work.
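
A brief sketch of both fixes on the illustrative `age` column from earlier; filling with `0` is appropriate only when that default makes sense for the analysis:

```python
# people['age'].astype(int)  # raises an error while NaN values are present

# Fix 1: fill the missing values first, then cast to int
ages_int = people['age'].fillna(0).astype(int)

# Fix 2: cast to float so NaN can remain, which is enough for mean()
ages_float = people['age'].astype(float)
ages_float.mean()
```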

How does the Stack Overflow example compute average `years_code` despite mixed strings and numbers?

The `years_code` column includes numeric entries plus strings like `'less than one year'` and `'more than 50 years'`. The workflow is: (1) load CSV with `na_values` so known missing markers become `NaN`, (2) replace `'less than one year'` with `0`, (3) replace `'more than 50 years'` with `51`, (4) cast the column to `float`, and (5) compute `mean()`. The tutorial reports an average of roughly 11.5 years and a median of 9 years.

Review Questions

  1. What combination of `axis` and `how` would you use to drop only rows where every value is missing?
  2. Why does `astype(int)` fail on a column that contains `np.nan`, and when would `fillna()` be a better alternative than casting to `float`?
  3. In the Stack Overflow `years_code` column, what replacements are made for `'less than one year'` and `'more than 50 years'`, and why are those replacements necessary before taking the mean?

Key Points

  1. Use `dropna(axis='index', how='any')` to remove rows with any missing value, and `how='all'` to remove only fully missing rows.
  2. Use `dropna(subset=[...])` to drop rows based only on missingness in specific required columns (e.g., email).
  3. Standardize custom missing markers by replacing placeholder strings (like `'Na'` or `'missing'`) with `numpy.nan` before applying missing-value logic.
  4. Inspect missingness with `isna()` to confirm which entries pandas will treat as missing.
  5. Cast “numeric-looking” columns by checking `dtypes`; `object` columns often contain strings and will break numeric operations like `mean()`.
  6. When a column contains `np.nan`, cast to `float` (or fill missing values first) rather than `int` to avoid conversion errors.
  7. For real datasets, handle mixed representations (e.g., `'less than one year'`) by replacing them with numeric equivalents before computing summary statistics.

Highlights

`dropna()` behavior hinges on `axis` and `how`: `any` removes partially missing rows, while `all` keeps them.
Custom missing values must be converted to real `NaN` (via `replace` with `numpy.nan`) or pandas won’t treat them as missing.
Casting to `int` fails when `np.nan` is present; casting to `float` preserves missingness and enables calculations.
The Stack Overflow `years_code` mean only works after replacing categorical strings like `'less than one year'` with numeric values and casting to `float`.
Using `subset` lets analysts enforce “required fields” without discarding rows for unrelated missing columns.

Topics

  • Missing Values
  • dropna
  • fillna
  • Type Casting
  • Stack Overflow Survey Cleaning

Mentioned

  • NaN