Combining multiple datasets - Data Analysis with Python and Pandas p.5
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
The core takeaway is that linking county-level unemployment data and state-level minimum wage data to county-level 2016 presidential voting percentages (focused on Donald Trump) produces weak, inconsistent statistical relationships, which suggests no clear, simple connection between these economic indicators and voting behavior. The exercise is less about proving a political theory and more about learning how to stitch messy datasets together in pandas, then measure correlation and covariance.
First, the workflow loads “unemployment by county” from a CSV and inspects the unemployment rate column. Next comes the key data-integration step: minimum wage data exists by year and state, not by county. To bridge that mismatch, the transcript builds a function that looks up the minimum wage for a given (year, state) pair using pandas indexing (via `.loc` on the minimum-wage table’s year index). That function is then mapped across the unemployment-by-county rows to create a new “minimum wage” column. The mapping approach is intentionally basic and slow—using Python’s `map`—but it reliably handles functions with multiple parameters (year and state). During this step, the transcript also notes a structural limitation: because minimum wage is not truly county-specific, the county-level dataset inherits the same state minimum wage values for all counties in that state.
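As a minimal sketch of that lookup-and-map step, assuming a minimum-wage table indexed by year with one column per state: the frame names (`min_wage`, `unemp_county`), the column names, and every value below are made up for illustration and are not taken from the video's data.

```python
import pandas as pd

# Hypothetical state-level table: one row per year, one column per state.
min_wage = pd.DataFrame(
    {"Alabama": [7.25, 7.25], "Alaska": [8.75, 9.75]},
    index=pd.Index([2015, 2016], name="Year"),
)

# Hypothetical county-level rows (made-up county names and rates).
unemp_county = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "State": ["Alabama", "Alaska", "Alaska", "Guam"],
    "County": ["Alpha County", "Beta Borough", "Gamma Borough", "Delta"],
    "Rate": [5.9, 5.3, 4.4, 6.9],
})

def get_min_wage(year, state):
    """Return the state's minimum wage for that year, or NaN if it is missing."""
    try:
        return min_wage.loc[year, state]
    except KeyError:
        # Year or state not present in the minimum-wage table.
        return float("nan")

# map() with two iterables calls get_min_wage(year, state) once per row;
# list() is needed because map() returns a lazy iterator in Python 3.
unemp_county["min_wage"] = list(
    map(get_min_wage, unemp_county["Year"], unemp_county["State"])
)
```

Every county in a given state and year receives the same value, and rows with no match (the "Guam" row here) end up as NaN, which is why a later drop of missing values is needed.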
With the merged unemployment + minimum wage dataset in place, the analysis checks whether unemployment rate and minimum wage move together. Covariance and correlation are computed, and the results don’t show a strong, convincing relationship. The transcript then loads the “2016 presidential election vote by county” dataset, which contains vote counts and percentages by county and candidate. To simplify, it filters down to Donald Trump and keeps only the percent column.
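The covariance and correlation check on unemployment rate versus minimum wage might look roughly like the following. This is a sketch with made-up numbers; the column names `Rate` and `min_wage` are assumptions rather than names confirmed by the source.

```python
import pandas as pd

# Made-up values standing in for the enriched county table.
df = pd.DataFrame({
    "Rate": [5.9, 5.3, 4.4, 6.9, 7.2],
    "min_wage": [7.25, 8.75, 9.75, None, 8.00],
})

# Drop rows where either value is missing so both statistics use the same rows.
clean = df[["Rate", "min_wage"]].dropna()

corr_matrix = clean.corr()  # Pearson correlation, unitless, between -1 and 1
cov_matrix = clean.cov()    # covariance, in units of Rate times min_wage

# Pull the single pair of interest out of each matrix.
rate_vs_wage_corr = corr_matrix.loc["Rate", "min_wage"]
rate_vs_wage_cov = cov_matrix.loc["Rate", "min_wage"]
```

Correlation is usually the easier number to read because it is scale-free, whereas covariance depends on the units of both columns.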
The next challenge is aligning schemas so the datasets can be merged: the two sources encode states differently (full state names in one, postal abbreviations in the other) and label their county and state columns differently. A state-abbreviation mapping file is loaded, converted into a dictionary, and applied so both tables use the same state format. After renaming columns (e.g., standardizing “County” and “State” casing) and setting a dual index on (county, state), the transcript merges the presidential percent-vote data with the unemployment+minimum-wage data.
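The filtering, key standardization, and merge steps can be sketched as below. All frames, column names (`county`, `st`, `cand`, `pct`, `Postal Code`), and values are hypothetical. The sketch also assumes the election table is the one carrying full state names; if the other table does, the same mapping would be applied there instead.

```python
import pandas as pd

# Hypothetical election rows with made-up counties and percentages.
pres16 = pd.DataFrame({
    "county": ["Alpha County", "Alpha County", "Beta Borough"],
    "st": ["Alabama", "Alabama", "Alaska"],
    "cand": ["Donald Trump", "Hillary Clinton", "Donald Trump"],
    "pct": [0.73, 0.24, 0.50],
})

# Hypothetical economic rows keyed by postal abbreviation.
econ = pd.DataFrame({
    "County": ["Alpha County", "Beta Borough"],
    "State": ["AL", "AK"],
    "Rate": [5.9, 5.3],
    "min_wage": [7.25, 8.75],
})

# Hypothetical abbreviation table, turned into a {full name: postal code} dict.
state_abbv = pd.DataFrame(
    {"Postal Code": ["AL", "AK"]},
    index=pd.Index(["Alabama", "Alaska"], name="State"),
)
abbv_dict = state_abbv["Postal Code"].to_dict()

# Keep one candidate, then align column names and the state format.
trump = pres16[pres16["cand"] == "Donald Trump"].copy()
trump = trump.rename(columns={"county": "County", "st": "State"})
trump["State"] = trump["State"].map(abbv_dict)

# Index both frames on (County, State) and keep only the percent column.
trump = trump.set_index(["County", "State"])[["pct"]]
econ = econ.set_index(["County", "State"])

# Inner merge on the shared index; counties present in only one table are dropped.
merged = econ.merge(trump, left_index=True, right_index=True)
```

Once `merged` holds `Rate`, `min_wage`, and `pct` side by side, `merged.corr()` and `merged.cov()` give the pairwise comparisons in one call.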
Finally, correlation and covariance are computed between (1) minimum wage and Trump’s vote percentage and (2) unemployment rate and Trump’s vote percentage. Minimum wage shows a negative association with Trump’s vote share—counties with higher minimum wages tend to have lower Trump vote percentages—while unemployment rate appears to have little relationship. Even where a directional pattern appears, the transcript emphasizes that the results aren’t statistically significant and should not be treated as evidence of a causal link. The practical value is the pandas technique: mapping multi-parameter functions, filtering by candidate, standardizing keys, and merging on county/state to enable correlation-style comparisons.
Cornell Notes
County-level unemployment data is enriched with a minimum-wage column by mapping each row’s (year, state) to the corresponding state minimum wage, then dropping missing values. The analysis checks covariance/correlation between unemployment rate and minimum wage and finds no strong relationship. Next, 2016 presidential vote-by-county data is loaded, filtered to Donald Trump, and reduced to the percent-vote column. State naming differences are resolved using a state-abbreviation mapping so presidential rows can merge with unemployment/minimum-wage rows on (county, state). Correlation/covariance between Trump’s vote percentage and the two economic variables yields a weak picture: minimum wage trends negatively with Trump support, while unemployment rate shows little association, and nothing is treated as statistically conclusive.
Why does the minimum wage need special handling when combining it with county-level unemployment data?
What pandas technique is used to create the minimum-wage column, and why is it intentionally slow?
How does the transcript deal with missing values after merging datasets?
What schema mismatches must be resolved before merging presidential results with unemployment/minimum-wage data?
How is the presidential dataset simplified to focus on one candidate?
What relationships are found between Trump vote share and the economic variables?
Review Questions
- When mapping minimum wage into a county-level dataframe, what two inputs does the lookup function require, and what pandas indexing method is used to retrieve the value?
- What steps are necessary to make two datasets mergeable on (county, state) when one uses full state names and the other uses postal abbreviations?
- After filtering presidential results to Donald Trump and keeping only the percent column, which correlation/covariance pairs are computed, and what directional pattern appears for minimum wage?
Key Points
1. Minimum wage must be mapped from (year, state) onto county-level rows because minimum wage isn’t available at the county level in the provided data.
2. A lookup function that takes more than one argument can be applied row by row with Python’s `map`, and the result assigned as a new dataframe column (wrap `map` in `list` in Python 3, where it returns an iterator).
3. Drop missing values after enrichment before running covariance/correlation, so the statistics are computed on complete rows and are not skewed by gaps from failed lookups.
4. Merging datasets requires consistent join keys; standardize state formats (full names vs postal abbreviations) using a mapping dictionary.
5. Filtering presidential results to a single candidate (Donald Trump) and keeping only the percent-vote column simplifies the statistical comparison.
6. Correlation/covariance between Trump vote share and economic variables yields weak, non-conclusive results: minimum wage trends negatively, unemployment rate shows little association.