Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Grouping and aggregation in pandas turns a raw survey table into answers—like “What’s the median salary by country?” or “Which social networks are most popular in each country?”—by splitting data into groups, applying summary functions, and then recombining the results. The core workflow starts with aggregate functions (median, mean, describe, count) to reduce many rows into a single statistic, then scales up with groupby to produce those statistics per category such as country.
The tutorial begins with basic aggregation on the developer survey dataset. For salaries, the median is computed from the converted compensation column (ConvertedComp), yielding a “typical” figure around $57,000. That choice is deliberate: the mean salary is pulled upward by outliers, making it a less reliable measure of what most developers earn. The median also ignores missing responses (NaN), which appear when respondents skip questions.
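A minimal sketch of that median-vs-mean comparison, using a toy stand-in for the survey data (the column name ConvertedComp follows the Stack Overflow survey schema; the numbers here are illustrative, not the real results):

```python
import pandas as pd
import numpy as np

# Toy salary data with one extreme outlier and one skipped response (NaN).
df = pd.DataFrame({
    'ConvertedComp': [40_000, 55_000, 57_000, 60_000, 2_000_000, np.nan],
})

median_salary = df['ConvertedComp'].median()  # NaN is skipped automatically
mean_salary = df['ConvertedComp'].mean()      # pulled upward by the outlier

print(median_salary)  # 57000.0 -- the "typical" figure
print(mean_salary)    # much larger, distorted by the 2,000,000 row
```

Note that neither call needs explicit NaN handling: pandas aggregation methods skip missing values by default (skipna=True).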
Next comes broader, column-wise summaries. Running median across the entire DataFrame returns medians for numeric columns like age and work hours per week. For a quick statistical overview, describe provides count, mean, standard deviation, min, and quartiles (including the 50th percentile, which matches the median). The count metric is clarified as the number of non-missing entries, not a tally of individual values within rows—an important distinction that prevents common mistakes.
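A sketch of those column-wise summaries on a toy DataFrame (Age and WorkWeekHrs stand in for the survey's numeric columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40],
    'WorkWeekHrs': [40, 45, 38, np.nan],
})

# Median of every numeric column at once; NaN entries are skipped.
medians = df.median()

# Compact statistical overview: count, mean, std, min, quartiles, max.
summary = df.describe()

# count is the number of non-missing entries, not a per-row value tally,
# and the 50% row matches the column medians.
print(summary.loc['count', 'Age'])   # 3.0 (one NaN excluded)
print(summary.loc['50%', 'Age'])     # 30.0, same as medians['Age']
```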
For categorical questions, value_counts becomes the go-to tool. Applied to a yes/no hobbyist column, it produces counts for True/False. Applied to social media preferences, it ranks platforms by frequency (e.g., Reddit leading overall) and can normalize to show percentages instead of raw counts. To connect these results to geography, the tutorial introduces groupby: it splits the dataset by country, applies a function within each country group, and combines the outputs.
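The value_counts behavior can be sketched with a small Series of platform choices (the platform names are illustrative):

```python
import pandas as pd

social = pd.Series(['Reddit', 'Reddit', 'YouTube', 'Twitter', 'Reddit'])

counts = social.value_counts()                # raw frequencies, descending
shares = social.value_counts(normalize=True)  # fractions of the total

print(counts.index[0])     # 'Reddit' -- the most frequent answer
print(shares['Reddit'])    # 0.6, i.e. 60% of responses
```

Multiplying the normalized result by 100 gives percentages directly.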
A key example uses groupby with value_counts to find the most popular social media platform per country. The result is a Series with a multi-index: country as the first level and social media as the second. That structure lets analysts query any country directly (e.g., retrieving India’s top platforms) without manually filtering the dataset each time.
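A sketch of that per-country ranking, assuming survey-style column names (Country, SocialMedia) on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'USA', 'USA'],
    'SocialMedia': ['WhatsApp', 'WhatsApp', 'YouTube', 'Reddit', 'Reddit'],
})

# Series with a MultiIndex: (Country, SocialMedia) -> count.
platform_counts = df.groupby('Country')['SocialMedia'].value_counts()

# Query one country directly via the outer index level; no manual filter.
india = platform_counts.loc['India']
print(india.index[0])  # 'WhatsApp' -- India's most popular platform here
```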
The same grouping approach extends to numeric aggregates. Using groupby on country and then computing the median of converted comp yields median salaries per country (with country names as the index). For multiple statistics at once, agg supports combining functions like mean and median, returning a DataFrame indexed by country.
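Sketched on toy data (ConvertedComp again assumed as the salary column; the figures are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA'],
    'ConvertedComp': [10_000, 20_000, 90_000, 110_000],
})

# Median salary per country; country names become the index.
median_by_country = df.groupby('Country')['ConvertedComp'].median()

# Several statistics at once: agg returns a DataFrame indexed by country.
stats = df.groupby('Country')['ConvertedComp'].agg(['median', 'mean'])
print(stats.loc['USA', 'mean'])  # 100000.0
```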
The tutorial then tackles a more complex question: “What percentage of developers in each country know Python?” A direct attempt to use string matching on a groupby object fails, and the fix is to use apply with a lambda. For each country group, it counts rows where the languages column (LanguageWorkedWith) contains “Python.” To convert counts into percentages, it builds a new DataFrame by concatenating two Series—total respondents per country and Python-knowers per country—renames the columns for clarity, and computes PCT = (Python-knowers / total respondents) * 100. Sorting by this percentage highlights both meaningful patterns (e.g., countries with many respondents) and misleading edge cases (tiny sample sizes producing 100%). The result is a reusable pattern for country-level analytics: group, compute counts/aggregates, then derive rates.
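The whole pattern can be sketched end to end on toy data (column names Country and LanguageWorkedWith assumed from the survey schema; the derived column names are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'USA'],
    'LanguageWorkedWith': ['Python;Java', 'Java', 'Python', 'Python;C++', 'Go'],
})

country_grp = df.groupby('Country')

# .str.contains is not available on a GroupBy object, so apply a lambda
# that runs it on each group's Series and sums the True values.
knows_python = country_grp['LanguageWorkedWith'].apply(
    lambda s: s.str.contains('Python').sum())

# Total respondents per country.
respondents = df['Country'].value_counts()

# Combine the two Series side by side, rename, and derive the rate.
python_df = pd.concat([respondents, knows_python], axis=1)
python_df.columns = ['NumRespondents', 'NumKnowsPython']
python_df['PctKnowsPython'] = (
    python_df['NumKnowsPython'] / python_df['NumRespondents'] * 100)

print(python_df.sort_values('PctKnowsPython', ascending=False))
```

On this toy sample, India works out to 1 of 2 respondents (50%) and the USA to 2 of 3 (about 66.7%); with real data, sorting like this is exactly where the tiny-sample 100% artifacts show up.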
Cornell Notes
Grouping and aggregation in pandas let analysts answer survey questions by category. The workflow starts with aggregate functions like median, describe, and count to summarize columns while ignoring missing values (NaN). For category-level answers, groupby splits the DataFrame by a column (such as country), applies a function within each group, and recombines results—often producing multi-index outputs when using value_counts. The tutorial also shows how to compute multiple aggregates at once with agg (e.g., mean and median). Finally, it demonstrates a practical rate calculation: using groupby + apply to count “Python” mentions per country, then combining that with total respondents to compute the percentage who know Python.
Why does the tutorial prefer median salary over mean salary for the developer survey?
What’s the difference between count and value_counts in this context?
How does groupby change the shape of the result when finding the most popular social media per country?
Why does string matching for “Python” require apply after groupby?
How is the percentage of Python-knowers per country computed from grouped counts?
Review Questions
- When should median be preferred over mean in salary analysis, and what pandas methods support each choice?
- What does a multi-index represent in the output of groupby + value_counts, and how would you query a specific country’s results?
- Why does groupby + apply work for counting “Python” mentions, while direct .str.contains on a groupby object fails?
Key Points
1. Use median for “typical” salary when outliers make the mean misleading; median ignores NaN responses automatically.
2. describe on a DataFrame gives a compact statistical overview across numeric columns, including quartiles where the 50th percentile matches the median.
3. count measures non-missing entries per column; value_counts measures frequency of distinct values within a Series.
4. groupby splits data by a key (like country), applies a function per group, and recombines results—often yielding multi-index outputs with value_counts.
5. To compute category-level rates (percentages), combine grouped counts with total counts and derive the rate as (part/whole)*100.
6. When string methods fail on a groupby object, switch to apply with a lambda so each group is processed as a Series.
7. Sort derived percentages to find top categories, but watch for misleading 100% results caused by very small sample sizes.