Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Grouping and aggregation in pandas turns a raw survey table into answers—like “What’s the median salary by country?” or “Which social networks are most popular in each country?”—by splitting data into groups, applying summary functions, and then recombining the results. The core workflow starts with aggregate functions (median, mean, describe, count) to reduce many rows into a single statistic, then scales up with groupby to produce those statistics per category such as country.
The tutorial begins with basic aggregation on the developer survey dataset. For salaries, the median is computed from the converted compensation column (ConvertedComp), yielding a “typical” figure around $57,000. That choice is deliberate: the mean salary is pulled upward by outliers, making it a less reliable measure of what most developers earn. The median also ignores missing responses (NaN), which appear when respondents skip questions.
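A minimal sketch of that median-vs-mean comparison, using a toy stand-in for the survey data (the column name ConvertedComp follows the Stack Overflow survey schema; the numbers here are illustrative, not the real results):

```python
import pandas as pd
import numpy as np

# Toy salary data with one extreme outlier and one skipped response (NaN).
df = pd.DataFrame({
    'ConvertedComp': [40_000, 55_000, 57_000, 60_000, 2_000_000, np.nan],
})

median_salary = df['ConvertedComp'].median()  # NaN is skipped automatically
mean_salary = df['ConvertedComp'].mean()      # pulled upward by the outlier

print(median_salary)  # 57000.0 -- the "typical" figure
print(mean_salary)    # much larger, distorted by the 2,000,000 row
```

Note that neither call needs explicit NaN handling: pandas aggregation methods skip missing values by default (skipna=True).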
Next comes broader, column-wise summaries. Running median across the entire DataFrame returns medians for numeric columns like age and work hours per week. For a quick statistical overview, describe provides count, mean, standard deviation, min, and quartiles (including the 50th percentile, which matches the median). The count metric is clarified as the number of non-missing entries, not a tally of individual values within rows—an important distinction that prevents common mistakes.
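A sketch of those column-wise summaries on a toy DataFrame (Age and WorkWeekHrs stand in for the survey's numeric columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40],
    'WorkWeekHrs': [40, 45, 38, np.nan],
})

# Median of every numeric column at once; NaN entries are skipped.
medians = df.median()

# Compact statistical overview: count, mean, std, min, quartiles, max.
summary = df.describe()

# count is the number of non-missing entries, not a per-row value tally,
# and the 50% row matches the column medians.
print(summary.loc['count', 'Age'])   # 3.0 (one NaN excluded)
print(summary.loc['50%', 'Age'])     # 30.0, same as medians['Age']
```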
For categorical questions, value_counts becomes the go-to tool. Applied to a yes/no hobbyist column, it produces counts for True/False. Applied to social media preferences, it ranks platforms by frequency (e.g., Reddit leading overall) and can normalize to show percentages instead of raw counts. To connect these results to geography, the tutorial introduces groupby: it splits the dataset by country, applies a function within each country group, and combines the outputs.
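The value_counts behavior can be sketched with a small Series of platform choices (the platform names are illustrative):

```python
import pandas as pd

social = pd.Series(['Reddit', 'Reddit', 'YouTube', 'Twitter', 'Reddit'])

counts = social.value_counts()                # raw frequencies, descending
shares = social.value_counts(normalize=True)  # fractions of the total

print(counts.index[0])     # 'Reddit' -- the most frequent answer
print(shares['Reddit'])    # 0.6, i.e. 60% of responses
```

Multiplying the normalized result by 100 gives percentages directly.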
A key example uses groupby with value_counts to find the most popular social media platform per country. The result is a Series with a multi-index: country as the first level and social media as the second. That structure lets analysts query any country directly (e.g., retrieving India’s top platforms) without manually filtering the dataset each time.
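A sketch of that per-country ranking, assuming survey-style column names (Country, SocialMedia) on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'USA', 'USA'],
    'SocialMedia': ['WhatsApp', 'WhatsApp', 'YouTube', 'Reddit', 'Reddit'],
})

# Series with a MultiIndex: (Country, SocialMedia) -> count.
platform_counts = df.groupby('Country')['SocialMedia'].value_counts()

# Query one country directly via the outer index level; no manual filter.
india = platform_counts.loc['India']
print(india.index[0])  # 'WhatsApp' -- India's most popular platform here
```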
The same grouping approach extends to numeric aggregates. Using groupby on country and then computing the median of converted comp yields median salaries per country (with country names as the index). For multiple statistics at once, agg supports combining functions like mean and median, returning a DataFrame indexed by country.
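Sketched on toy data (ConvertedComp again assumed as the salary column; the figures are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA'],
    'ConvertedComp': [10_000, 20_000, 90_000, 110_000],
})

# Median salary per country; country names become the index.
median_by_country = df.groupby('Country')['ConvertedComp'].median()

# Several statistics at once: agg returns a DataFrame indexed by country.
stats = df.groupby('Country')['ConvertedComp'].agg(['median', 'mean'])
print(stats.loc['USA', 'mean'])  # 100000.0
```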
The tutorial then tackles a more complex question: “What percentage of developers in each country know Python?” A direct attempt to use string matching on a groupby object fails, and the fix is to use apply with a lambda. For each country group, it counts rows where the languages column (LanguageWorkedWith) contains “Python.” To convert counts into percentages, it builds a new DataFrame by concatenating two Series—total respondents per country and Python-knowers per country—renames the columns for clarity, and computes PCT = (Python-knowers / total respondents) * 100. Sorting by this percentage highlights both meaningful patterns (e.g., countries with many respondents) and misleading edge cases (tiny sample sizes producing 100%). The result is a reusable pattern for country-level analytics: group, compute counts/aggregates, then derive rates.
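The whole pattern can be sketched end to end on toy data (column names Country and LanguageWorkedWith assumed from the survey schema; the derived column names are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'USA'],
    'LanguageWorkedWith': ['Python;Java', 'Java', 'Python', 'Python;C++', 'Go'],
})

country_grp = df.groupby('Country')

# .str.contains is not available on a GroupBy object, so apply a lambda
# that runs it on each group's Series and sums the True values.
knows_python = country_grp['LanguageWorkedWith'].apply(
    lambda s: s.str.contains('Python').sum())

# Total respondents per country.
respondents = df['Country'].value_counts()

# Combine the two Series side by side, rename, and derive the rate.
python_df = pd.concat([respondents, knows_python], axis=1)
python_df.columns = ['NumRespondents', 'NumKnowsPython']
python_df['PctKnowsPython'] = (
    python_df['NumKnowsPython'] / python_df['NumRespondents'] * 100)

print(python_df.sort_values('PctKnowsPython', ascending=False))
```

On this toy sample, India works out to 1 of 2 respondents (50%) and the USA to 2 of 3 (about 66.7%); with real data, sorting like this is exactly where the tiny-sample 100% artifacts show up.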
Cornell Notes
Grouping and aggregation in pandas let analysts answer survey questions by category. The workflow starts with aggregate functions like median, describe, and count to summarize columns while ignoring missing values (NaN). For category-level answers, groupby splits the DataFrame by a column (such as country), applies a function within each group, and recombines results—often producing multi-index outputs when using value_counts. The tutorial also shows how to compute multiple aggregates at once with agg (e.g., mean and median). Finally, it demonstrates a practical rate calculation: using groupby + apply to count “Python” mentions per country, then combining that with total respondents to compute the percentage who know Python.
Why does the tutorial prefer median salary over mean salary for the developer survey?
What’s the difference between count and value_counts in this context?
How does groupby change the shape of the result when finding the most popular social media per country?
Why does string matching for “Python” require apply after groupby?
How is the percentage of Python-knowers per country computed from grouped counts?
Review Questions
- When should median be preferred over mean in salary analysis, and what pandas methods support each choice?
- What does a multi-index represent in the output of groupby + value_counts, and how would you query a specific country’s results?
- Why does groupby + apply work for counting “Python” mentions, while direct .str.contains on a groupby object fails?
Key Points
1. Use median for “typical” salary when outliers make the mean misleading; median ignores NaN responses automatically.
2. describe on a DataFrame gives a compact statistical overview across numeric columns, including quartiles where the 50th percentile matches the median.
3. count measures non-missing entries per column; value_counts measures frequency of distinct values within a Series.
4. groupby splits data by a key (like country), applies a function per group, and recombines results—often yielding multi-index outputs with value_counts.
5. To compute category-level rates (percentages), combine grouped counts with total counts and derive the rate as (part/whole)*100.
6. When string methods fail on a groupby object, switch to apply with a lambda so each group is processed as a Series.
7. Sort derived percentages to find top categories, but watch for misleading 100% results caused by very small sample sizes.