
Combining multiple datasets - Data Analysis with Python and Pandas p.5

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Minimum wage must be mapped from (year, state) onto county-level rows because minimum wage isn’t available at the county level in the provided data.

Briefing

The core takeaway is that combining county-level unemployment data with state-level minimum wage data and then linking both to county-level 2016 presidential voting percentages (focused on Donald Trump) produces weak, inconsistent statistical relationships—suggesting no clear, simple connection between these economic indicators and voting behavior. The exercise is less about proving a political theory and more about learning how to stitch messy datasets together in pandas, then measure correlation and covariance.

First, the workflow loads “unemployment by county” from a CSV and inspects the unemployment rate column. Next comes the key data-integration step: minimum wage data exists by year and state, not by county. To bridge that mismatch, the transcript builds a function that looks up the minimum wage for a given (year, state) pair using pandas indexing (via `.loc` on the minimum-wage table’s year index). That function is then mapped across the unemployment-by-county rows to create a new “minimum wage” column. The mapping approach is intentionally basic and slow—using Python’s `map`—but it reliably handles functions with multiple parameters (year and state). During this step, the transcript also notes a structural limitation: because minimum wage is not truly county-specific, the county-level dataset inherits the same state minimum wage values for all counties in that state.
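The lookup-and-map step above can be sketched as follows. This is a minimal illustration with toy data, not the video's exact code: the table shapes, column names (`Min.Wage`, `Rate`), and values are assumptions made for the example.

```python
import pandas as pd

# Toy minimum-wage table indexed by year, one row per (year, state).
min_wage = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "State": ["CO", "HI", "CO", "HI"],
    "Min.Wage": [8.23, 7.75, 8.31, 8.50],
}).set_index("Year")

def get_min_wage(year, state):
    """Look up the state minimum wage for a (year, state) pair via .loc."""
    try:
        rows = min_wage.loc[year]              # all states for that year
        return rows[rows["State"] == state]["Min.Wage"].iloc[0]
    except (KeyError, IndexError):
        return None                            # no entry for that pair

# Toy county-level unemployment rows.
unemp = pd.DataFrame({
    "Year": [2015, 2016, 2016],
    "State": ["CO", "CO", "HI"],
    "County": ["Denver", "Denver", "Maui"],
    "Rate": [3.4, 3.1, 3.0],
})

# map() handles multi-parameter functions; in Python 3 it returns an
# iterator, so convert to a list before assigning it as a column.
unemp["min_wage"] = list(map(get_min_wage, unemp["Year"], unemp["State"]))
```

Every Denver row inherits Colorado's statewide value for its year, which is the structural limitation the transcript points out.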

With the merged unemployment + minimum wage dataset in place, the analysis checks whether unemployment rate and minimum wage move together. Covariance and correlation are computed, and the results don’t show a strong, convincing relationship. The transcript then loads the “2016 presidential election vote by county” dataset, which contains vote counts and percentages by county and candidate. To simplify, it filters down to Donald Trump and keeps only the percent column.
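Checking whether two columns move together is a one-liner in pandas. A minimal sketch, assuming a merged frame with the two columns of interest (the numbers here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Unemployment Rate": [3.4, 5.1, 4.2, 6.0, 2.9],
    "min_wage": [8.23, 7.25, 9.00, 7.25, 8.50],
})

# Covariance: the sign gives the direction, but the magnitude depends on units.
cov_matrix = df.cov()

# Pearson correlation: unit-free and bounded in [-1, 1], easier to interpret.
corr_matrix = df.corr()
```

Both methods return a symmetric matrix; the off-diagonal entry is the pairwise statistic between the two columns.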

The next challenge is aligning schemas so the datasets can be merged: the presidential dataset uses full state names (or a state field labeled differently), while the unemployment/minimum-wage pipeline uses postal abbreviations. A state-abbreviation mapping file is loaded, converted into a dictionary, and applied so the presidential data’s state field matches the unemployment dataset’s state format. After renaming columns (e.g., standardizing “County” and “state” casing) and setting a dual index on (county, state), the transcript merges the presidential percent-vote data with the unemployment+minimum-wage data.
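The key-normalization and merge steps can be sketched like this. The dictionary, column names, and values below are hypothetical stand-ins; in the video the name-to-abbreviation dict is built from a loaded CSV rather than written inline.

```python
import pandas as pd

# Hypothetical full-name -> postal-abbreviation dict (built from a file
# in the video).
state_abbv = {"Colorado": "CO", "Hawaii": "HI"}

# Toy presidential data: full state names, lowercase column names.
pres = pd.DataFrame({
    "county": ["Denver", "Maui"],
    "st": ["Colorado", "Hawaii"],
    "pct": [0.19, 0.27],
})

# Toy unemployment + minimum-wage data: postal abbreviations.
unemp = pd.DataFrame({
    "County": ["Denver", "Maui"],
    "State": ["CO", "HI"],
    "Rate": [3.1, 3.0],
    "min_wage": [8.31, 8.50],
})

# Normalize the join keys: abbreviate state names, standardize casing.
pres["st"] = pres["st"].map(state_abbv)
pres = pres.rename(columns={"county": "County", "st": "State"})

# Set a dual (County, State) index on both frames, then join on it.
pres = pres.set_index(["County", "State"])
unemp = unemp.set_index(["County", "State"])
merged = unemp.join(pres)
```

Once both frames share an identical (County, State) index, the join aligns rows automatically and the percent-vote column lands next to the economic variables.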

Finally, correlation and covariance are computed between (1) minimum wage and Trump’s vote percentage and (2) unemployment rate and Trump’s vote percentage. Minimum wage shows a negative association with Trump’s vote share—counties with higher minimum wages tend to have lower Trump vote percentages—while unemployment rate appears to have little relationship. Even where a directional pattern appears, the transcript emphasizes that the results aren’t statistically significant and should not be treated as evidence of a causal link. The practical value is the pandas technique: mapping multi-parameter functions, filtering by candidate, standardizing keys, and merging on county/state to enable correlation-style comparisons.

Cornell Notes

County-level unemployment data is enriched with a minimum-wage column by mapping each row’s (year, state) to the corresponding state minimum wage, then dropping missing values. The analysis checks covariance/correlation between unemployment rate and minimum wage and finds no strong relationship. Next, 2016 presidential vote-by-county data is loaded, filtered to Donald Trump, and reduced to the percent-vote column. State naming differences are resolved using a state-abbreviation mapping so presidential rows can merge with unemployment/minimum-wage rows on (county, state). Correlation/covariance between Trump’s vote percentage and the two economic variables yields a weak picture: minimum wage trends negatively with Trump support, while unemployment rate shows little association, and nothing is treated as statistically conclusive.

Why does the minimum wage need special handling when combining it with county-level unemployment data?

Minimum wage data is indexed by year and state, while unemployment data is indexed by county (and includes a state field). Because there’s no county-specific minimum wage, the transcript creates a function that takes (year, state) and returns the minimum wage for that pair. That function is then mapped across every unemployment-by-county row so each county inherits the minimum wage of its state for that year.

What pandas technique is used to create the minimum-wage column, and why is it intentionally slow?

A multi-parameter lookup function is mapped across the unemployment dataframe using Python’s `map`. The transcript notes that this is not the most efficient approach, but it “always works” for functions with multiple parameters (year and state). It also highlights a Python 3 detail: `map` returns an iterator, so it must be converted to a list before assigning as a new pandas column.

How does the transcript deal with missing values after merging datasets?

After mapping minimum wage into the unemployment-by-county dataframe, it drops rows with missing minimum wage values using `dropna`, restricted to the minimum-wage column so that only rows missing that value are removed. This ensures later correlation/covariance computations operate on aligned, complete numeric data.
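A small sketch of that cleanup step, with invented data; `subset` is the standard `dropna` parameter for targeting specific columns:

```python
import pandas as pd

df = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Rate": [3.4, 5.1, 4.2],
    "min_wage": [8.23, None, 9.00],  # county B had no lookup match
})

# Drop only rows missing the enriched column; NaNs elsewhere are untouched.
clean = df.dropna(subset=["min_wage"])
```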

What schema mismatches must be resolved before merging presidential results with unemployment/minimum-wage data?

The presidential dataset’s state field format doesn’t match the unemployment dataset’s state format. The transcript loads a state abbreviation file, converts it into a dictionary, and maps presidential state names to postal abbreviations. It also standardizes column names (e.g., renaming “County” and “state” casing) so both dataframes share identical merge keys.

How is the presidential dataset simplified to focus on one candidate?

The presidential dataframe is filtered to rows where the candidate field equals “Donald Trump” (using the candidate column). Then the dataframe drops all columns except the percent vote column, keeping only what’s needed for correlation/covariance with unemployment and minimum wage.
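That filter-then-trim pattern looks like this; the column names `cand` and `pct` are placeholders, not the dataset's actual schema:

```python
import pandas as pd

pres = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "cand": ["Donald Trump", "Hillary Clinton",
             "Donald Trump", "Hillary Clinton"],
    "pct": [55.0, 40.0, 62.0, 35.0],
})

# Keep only Trump's rows, then only the columns needed downstream.
trump = pres[pres["cand"] == "Donald Trump"][["county", "pct"]]
```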

What relationships are found between Trump vote share and the economic variables?

Correlation/covariance calculations suggest Trump’s vote percentage has a negative association with minimum wage: higher minimum wages align with lower Trump vote shares. Unemployment rate shows little covariance/correlation with Trump’s vote percentage. The transcript stresses that these patterns are not statistically significant and shouldn’t be over-interpreted.

Review Questions

  1. When mapping minimum wage into a county-level dataframe, what two inputs does the lookup function require, and what pandas indexing method is used to retrieve the value?
  2. What steps are necessary to make two datasets mergeable on (county, state) when one uses full state names and the other uses postal abbreviations?
  3. After filtering presidential results to Donald Trump and keeping only the percent column, which correlation/covariance pairs are computed, and what directional pattern appears for minimum wage?

Key Points

  1. Minimum wage must be mapped from (year, state) onto county-level rows because minimum wage isn’t available at the county level in the provided data.

  2. A multi-parameter pandas workflow can be built with a custom lookup function and `map`, then assigned as a new dataframe column (converting `map` to a list in Python 3).

  3. Dropping missing values after enrichment is necessary before running covariance/correlation to avoid misleading or failing computations.

  4. Merging datasets requires consistent join keys; standardize state formats (full names vs. postal abbreviations) using a mapping dictionary.

  5. Filtering presidential results to a single candidate (Donald Trump) and keeping only the percent-vote column simplifies the statistical comparison.

  6. Correlation/covariance between Trump vote share and economic variables yields weak, non-conclusive results: minimum wage trends negatively, unemployment rate shows little association.

Highlights

Mapping state minimum wage into a county dataset requires a (year, state) lookup because the minimum wage table is not county-granular.
Python 3’s `map` returns an iterator, so converting it to a list is necessary before assigning the results as a pandas column.
State-name normalization (full names to postal abbreviations) is a prerequisite for a clean merge on (county, state).
Trump’s vote percentage shows a negative relationship with minimum wage, while unemployment rate shows little relationship—yet nothing is treated as statistically significant.

Topics

  • Pandas Data Merging
  • Multi-Parameter Mapping
  • Correlation and Covariance
  • County-Level Data
  • State Abbreviation Mapping