Graphing/visualization - Data Analysis with Python and Pandas p.2

sentdex

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert date strings to real datetimes with `pd.to_datetime` to prevent overlapping or incorrect x-axis rendering.

Briefing

A clean, reliable time-series plot in pandas hinges on three details: parsing dates correctly, sorting the data into true chronological order, and handling rolling-window edge cases and warnings. Using the avocado dataset filtered to the Albany region, the workflow first rebuilds a basic “average price over time” line chart—then immediately hits two common problems: dates overlap (pandas is treating them as plain strings) and the line is so jagged it’s hard to interpret. Converting the date column with `pd.to_datetime` fixes the axis formatting, while a 25-point rolling mean (`rolling(window=25).mean()`) smooths the fluctuations into a more readable trend.

The next issue is subtler: even after smoothing, the rolling curve can look wrong if the underlying rows aren’t in the expected order. The transcript shows that the index appears reversed at first glance, but the dates themselves aren’t consistently ordered—some years run backward relative to others. Sorting by the index (`sort_index`) resolves the mismatch and produces a more plausible two-hump trend, illustrating why “it looks sorted” isn’t enough when time-series calculations depend on row order.

After the smoothed series is correct, the analysis turns practical: the rolling mean becomes a new column (`price_25MA`) so it can be exported later. That introduces expected `NaN` values at the beginning because a 25-point window can’t be computed until enough prior observations exist. The transcript also demonstrates options for dealing with missing values, including dropping rows with `NaN` via `dropna`.

A pandas warning then appears: the classic `SettingWithCopyWarning` (“A value is trying to be set on a copy of a slice from a DataFrame”). It is not a fatal error; it signals that an assignment may not land on the dataframe you intended, because the code created a filtered subset (`DF[DF['region'] == 'Albany']`) and then modified it. The fix is to copy the slice explicitly (`DF = DF.copy()`), which eliminates the warning and makes downstream transformations safer.

Finally, the tutorial scales up from one region to all regions on one chart. The straightforward approach (iterating through `DF['region'].unique()`, filtering each region, computing the rolling mean, and joining the results) runs into a performance disaster: RAM usage explodes during the joins. The cause is traced to duplicate dates created by the dataset’s structure (e.g., multiple PLU entries per date and differing `type` values like “conventional” and “organic”), which makes index-aligned joins ambiguous and extremely expensive. The remedy is to sort and set the index correctly from the source: convert dates to datetime immediately, copy the dataframe, sort by `date` (and use the sorted order consistently), and then rebuild the region-wise rolling series.
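The region-by-region workflow can be sketched as below. This is a minimal illustration and not the video's exact code: the miniature `raw` frame, the `graph_df` name, the 2-point window (in place of 25), and the choice to keep a single `type` so that each region has unique dates are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the dataset; column names follow this summary.
raw = pd.DataFrame({
    "date": ["2015-01-11", "2015-01-04", "2015-01-18"] * 4,
    "region": ["Albany"] * 6 + ["Boston"] * 6,
    "type": (["conventional"] * 3 + ["organic"] * 3) * 2,
    "average_price": [1.2, 1.1, 1.3, 1.8, 1.7, 1.9,
                      1.4, 1.5, 1.6, 2.0, 2.1, 2.2],
})

# Assumption: keeping one `type` leaves a single row per date and region.
df = raw[raw["type"] == "organic"].copy()
df["date"] = pd.to_datetime(df["date"])   # parse dates before anything else
df = df.sort_values("date")               # chronological order from the source

graph_df = pd.DataFrame()
for region in df["region"].unique():
    region_df = df[df["region"] == region].set_index("date").sort_index()
    # Window of 2 only because the sample is tiny; the tutorial uses 25.
    ma = region_df["average_price"].rolling(window=2).mean()
    ma = ma.rename(f"{region}_price_2MA")
    graph_df = ma.to_frame() if graph_df.empty else graph_df.join(ma)

print(graph_df)
```

Because every region's index holds each date once, each join adds one column without adding rows.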

With the corrected ordering and indexing, the multi-region plot becomes feasible. The transcript ends by addressing visualization limits (legend clutter, empty gaps from `NaN`s) and suggests simple matplotlib-style adjustments: set figure size, disable the legend, and optionally drop missing values before plotting.

Cornell Notes

The workflow for plotting avocado prices with pandas depends on getting time-series fundamentals right. First, convert the date column using `pd.to_datetime` so pandas treats it as real dates rather than strings. Second, sort the data into true chronological order (not just “looks reversed”) because rolling-window calculations assume row order. Third, rolling means create leading `NaN` values until enough observations exist, so use `dropna` or accept the gaps. When filtering to a region and then modifying columns, pandas may warn about “setting on a copy”; using `DF = DF.copy()` prevents confusing side effects. Scaling to all regions requires careful indexing: duplicate dates across PLUs/types can make joins explode in memory, so sorting and indexing from the source avoids that bottleneck.

Why do dates overlap on the x-axis before any smoothing is applied?

Overlapping tick labels usually mean pandas isn’t treating the date column as datetime. Converting the column with `DF['date'] = pd.to_datetime(DF['date'])` makes pandas recognize the axis as time, producing slanted but properly spaced date labels.
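A small sketch of the conversion, using made-up rows; the column names `date` and `average_price` follow this summary and may differ in the actual CSV.

```python
import pandas as pd

# Hypothetical sample rows with dates stored as plain strings.
df = pd.DataFrame({
    "date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "average_price": [1.33, 1.35, 0.93],
})

was_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

# Parse the strings into real datetimes so plots treat the axis as time.
df["date"] = pd.to_datetime(df["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

print(was_datetime, is_datetime)
```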

How can a rolling mean look “wrong” even after converting dates to datetime?

Rolling calculations depend on the order of rows. If the dataframe isn’t truly chronological, the rolling window averages the wrong neighbors. The transcript shows that sorting only by what appears to be the index order can still leave dates inconsistent across years. Applying `DF.sort_index(inplace=True)` (or sorting by the date column) corrects the row order and changes the rolling curve into a more believable trend.
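To illustrate the effect of row order, the sketch below computes the same rolling mean before and after sorting. The data is invented: two years stored newest-first within each year, and a 2-point window is used instead of 25 to keep the example short.

```python
import pandas as pd

# Hypothetical index: each year runs backward, but 2015 precedes 2016.
dates = pd.to_datetime(["2015-03-01", "2015-02-01", "2015-01-01",
                        "2016-03-01", "2016-02-01", "2016-01-01"])
df = pd.DataFrame({"average_price": [3.0, 2.0, 1.0, 6.0, 5.0, 4.0]},
                  index=dates)

# Rolling mean over rows as stored: windows pair the wrong neighbors.
unsorted_ma = df["average_price"].rolling(window=2).mean()

# Sort into true chronological order, then recompute.
df = df.sort_index()
sorted_ma = df["average_price"].rolling(window=2).mean()

print(unsorted_ma.dropna().tolist())
print(sorted_ma.dropna().tolist())
```

After sorting, the averages rise steadily with the underlying prices; before sorting, they zigzag because each window mixes out-of-order rows.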

What causes the initial `NaN` values after creating a 25-point moving average column?

A 25-point rolling mean can’t be computed until there are at least 25 observations available. Using `DF['price_25MA'] = DF['average_price'].rolling(window=25).mean()` yields `NaN` for the first 24 rows. The transcript suggests either dropping them with `dropna` or using `tail` to inspect later values.
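The leading gap can be checked on a small synthetic series. The 30 weekly prices below are invented for the example; only the column names and the 25-point window come from this summary.

```python
import pandas as pd

# Hypothetical series of 30 weekly prices.
df = pd.DataFrame(
    {"average_price": [1.0 + 0.01 * i for i in range(30)]},
    index=pd.date_range("2015-01-04", periods=30, freq="W"),
)

df["price_25MA"] = df["average_price"].rolling(window=25).mean()

# The first 24 rows have fewer than 25 observations behind them.
n_missing = int(df["price_25MA"].isna().sum())

# Drop the incomplete rows if the gaps are unwanted.
trimmed = df.dropna()

print(n_missing, len(trimmed))
```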

What does the pandas “setting on a copy of a slice” warning mean, and how is it fixed?

The warning appears when code filters a dataframe (creating a slice that may be a view or a copy) and then assigns new values to that slice. pandas warns that the change might not behave as expected because it cannot tell whether the slice still references the original data. The fix is to copy the slice explicitly before modifying it, for example by calling `DF = DF.copy()` right after filtering to Albany.
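A minimal sketch of the copy-then-modify pattern follows. The sample rows, the `albany` variable name, and the 2-point window are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "region": ["Albany", "Boston", "Albany", "Boston"],
    "average_price": [1.2, 1.5, 1.3, 1.6],
})

# Filter, then copy explicitly so later assignments target an
# independent dataframe rather than a possible view of `df`.
albany = df[df["region"] == "Albany"].copy()
albany["price_2MA"] = albany["average_price"].rolling(window=2).mean()

print(albany)
print(df.columns.tolist())   # the original frame has no new column
```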

Why does joining region-wise rolling results cause RAM to explode when plotting all regions?

The dataset contains multiple entries per date across categories like PLU and `type` (e.g., “conventional” and “organic”). When region-wise dataframes are joined on the index, every row for a given date on one side is paired with every row for that date on the other, so the result grows multiplicatively with each additional region and the join becomes extremely expensive. The transcript identifies duplicate dates as the driver and resolves it by converting dates to datetime early, copying from the source, and sorting by `date` so the rolling/join logic operates on a consistent time index.
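The row growth is easy to reproduce on a toy example. The two frames below are invented; each date appears twice, standing in for the conventional and organic rows.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-04", "2015-01-11"])

# Hypothetical frames in which every date occurs twice in the index.
dup_index = dates.repeat(2)
left = pd.DataFrame({"Albany": [1.0, 1.1, 1.2, 1.3]}, index=dup_index)
right = pd.DataFrame({"Boston": [2.0, 2.1, 2.2, 2.3]}, index=dup_index)

# Each of the 2 left rows per date pairs with each of the 2 right rows.
joined = left.join(right)

# With one row per date on each side, the join keeps the row count.
left_unique = left[~left.index.duplicated()]
right_unique = right[~right.index.duplicated()]
joined_unique = left_unique.join(right_unique)

print(len(left), len(joined), len(joined_unique))
```

Repeating such a join for dozens of regions multiplies the row count at every step, which is consistent with the memory blow-up described above.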

What practical steps make the final multi-region plot readable?

The transcript notes two issues: legend clutter and gaps caused by `NaN` values. It suggests using matplotlib figure sizing (`figsize=(...)`), turning off the legend (`legend=False`), and optionally dropping missing values (`dropna`) before plotting to remove empty gaps.
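These adjustments might look like the sketch below. The `graph_df` contents and the `(8, 5)` figure size are placeholders, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend; an assumption for this sketch
import pandas as pd

# Hypothetical wide frame: one moving-average column per region.
idx = pd.date_range("2015-01-04", periods=6, freq="W")
graph_df = pd.DataFrame({
    "Albany_price_2MA": [float("nan"), 1.75, 1.85, 1.90, 1.95, 2.00],
    "Boston_price_2MA": [float("nan"), 2.05, 2.10, 2.15, 2.20, 2.25],
}, index=idx)

# Drop rows with missing values, enlarge the figure, and hide the legend.
ax = graph_df.dropna().plot(figsize=(8, 5), legend=False)
```

Note that `dropna()` with default arguments removes any row in which at least one region is missing, so it trims the plot to the dates all regions share.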

Review Questions

  1. What specific operations ensure that rolling-window averages in pandas use the correct chronological neighbors?
  2. When would you prefer `sort_index` versus sorting by a date column, and how does each affect rolling calculations?
  3. How do duplicate dates in a dataset change the behavior and cost of dataframe joins?

Key Points

  1. Convert date strings to real datetimes with `pd.to_datetime` to prevent overlapping or incorrect x-axis rendering.
  2. Rolling-window results depend on row order; sort the dataframe chronologically (e.g., `sort_index` or sorting by the date column) before computing rolling means.
  3. A 25-point rolling mean produces leading `NaN` values until enough prior rows exist; handle them with `dropna` or by accepting gaps.
  4. Pandas “setting on a copy of a slice” warnings are avoided by copying filtered dataframes explicitly using `DF = DF.copy()`.
  5. When plotting multiple regions, region-wise joins can become memory-heavy if the index contains duplicate dates; sort and index correctly from the source to prevent ambiguous joins.
  6. For readability, adjust plot size and legend behavior, and drop missing values before plotting to eliminate empty gaps.

Highlights

Converting the date column with `pd.to_datetime` turns an overlapping, unreadable time axis into a properly spaced datetime axis.
A rolling mean can still be misleading if the dataframe isn’t truly chronological; sorting fixes the rolling curve’s shape.
The “setting on a copy” warning isn’t fatal, but `DF = DF.copy()` prevents future surprises when modifying filtered slices.
RAM explosions during multi-region joins trace back to duplicate dates (from PLU/type structure), making index-aligned joins extremely expensive.

Topics

  • Pandas Date Parsing
  • Rolling Mean
  • Time-Series Sorting
  • Pandas Warnings
  • DataFrame Joins