Building an Interactive Python Map (Pt 1) - Web Scraping Wikipedia

Liam Gower · 4 min read

Based on Liam Gower's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Define the exact data fields needed before scraping so the pipeline stays focused and the DataFrame stays clean.

Briefing

The project’s core idea is to turn a static UK map into an interactive planning tool for EFL Championship fans: plot all 24 clubs on a UK map, then make each point clickable so it reveals the club, its stadium, and the distance from the viewer’s current location. The practical “win” in the first installment is how the club and stadium data gets pulled automatically from Wikipedia—so the map can be built without manually copying tables into code.

After framing the motivation—relegation from the Premier League becoming a reason to plan future match trips—the creator lays out a multi-stage build. Data collection comes first, followed by distance and direction lookups via Google Maps API, then geographic mapping, data manipulation with JSON/pandas/geopandas, and finally interactive visualization using matplotlib. This first video focuses tightly on the initial data-collection step: extracting the relevant club list and stadium names from a Wikipedia table.

The approach starts with a clear requirement: before scraping, decide what fields matter. For this project, the table needs club names and stadiums (with other columns available but not required for the next steps). Wikipedia is chosen because it provides a structured table on the EFL Championship page, including club and stadium information.

Programmatically, the workflow uses three Python tools in sequence. First, `requests` sends an HTTP GET request to fetch the Wikipedia page’s HTML and stores the response. Second, `BeautifulSoup` parses that HTML into a navigable structure (described as a parse tree), enabling targeted queries rather than sifting through raw markup. Third, pandas converts the extracted HTML table into a DataFrame using `read_html`, producing a manageable dataset.
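A minimal sketch of the first two of those steps might look like the following (the exact Wikipedia URL is an assumption, since the video does not quote it verbatim; the `read_html` hand-off is shown further below once the table has been located):

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL for the EFL Championship page.
URL = "https://en.wikipedia.org/wiki/EFL_Championship"

# Step 1: send an HTTP GET request and keep the response object.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# Step 2: parse the raw HTML into a navigable parse tree.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)  # quick sanity check that the right page came back
```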

To locate the exact table, the workflow relies on browser inspection tools: right-clicking the table in Chrome’s developer tools reveals the HTML tag and a specific class attribute (notably `wikitable sortable`). That class becomes the filter criterion in BeautifulSoup—find the table with that class, capture its HTML, and then hand it to pandas.
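Continuing the sketch above, the class spotted in the inspector becomes the BeautifulSoup filter. Whether Wikipedia adds extra classes to that table is an assumption worth checking, so the CSS-selector form (which matches both classes regardless of extras) is probably the safer default:

```python
# Match any table carrying both the "wikitable" and "sortable" classes.
table = soup.select_one("table.wikitable.sortable")

# Equivalent find() call when the class attribute is exactly "wikitable sortable":
# table = soup.find("table", class_="wikitable sortable")

print(str(table)[:200])  # the captured table HTML that gets handed to pandas
```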

Once the DataFrame is created, the process filters down to only the columns needed for the project (club and stadium, plus whatever additional fields are retained for later steps). The scraping logic is then wrapped into a reusable function, `get_efl_team_data`, which fetches the latest table from Wikipedia, parses it, filters it, and returns a DataFrame representing teams indexed from 0 to 23.
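One way that reusable function might be put together is sketched below. The column names (`Club`, `Stadium`) and the choice of the first `wikitable sortable` table are assumptions about the Wikipedia markup and may need adjusting against the live page:

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

EFL_URL = "https://en.wikipedia.org/wiki/EFL_Championship"  # assumed page


def get_efl_team_data(columns=("Club", "Stadium")):
    """Fetch the latest EFL Championship club table from Wikipedia.

    Returns a DataFrame with one row per club, indexed 0-23, filtered to
    the requested columns. Column names are assumptions about the table
    headers on the Wikipedia page.
    """
    response = requests.get(EFL_URL, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.select_one("table.wikitable.sortable")

    # read_html returns a list of DataFrames; the captured table yields one.
    df = pd.read_html(StringIO(str(table)))[0]

    # Keep only the fields the project needs and reset the 0-23 index.
    return df[list(columns)].reset_index(drop=True)
```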

By the end of this installment, the project has a clean, automated starting dataset: the 24 EFL Championship clubs paired with their stadiums. That dataset becomes the foundation for the next stages—adding coordinates and distances via Google Maps API, then building the interactive UK map and click-driven information display.

Cornell Notes

The project builds an interactive UK map for the EFL Championship by first automating club-and-stadium data collection from Wikipedia. The key move is using `requests` to fetch the Wikipedia HTML, `BeautifulSoup` to parse and locate the specific table (identified via Chrome inspection and the table class `wikitable sortable`), and pandas `read_html` to convert that table into a DataFrame. The result is a reusable function, `get_efl_team_data`, that returns the 24 clubs (0–23) with their associated stadiums. This matters because it eliminates manual data entry and ensures the map can be updated by re-running the scraper before adding distances, plotting, and interactivity.

Why does the workflow start by deciding which fields to scrape from Wikipedia?

It prevents wasted effort and messy downstream code. The project needs club identity and stadium information to place points and display details when users click. Even though the Wikipedia table includes additional columns, the scraper filters the DataFrame down to the columns that match the project’s immediate needs.

How do `requests` and `BeautifulSoup` work together in this scraper?

`requests` performs an HTTP GET to retrieve the page’s HTML and stores it in a response variable. `BeautifulSoup` then parses that HTML into a structured parse tree, making it possible to query for specific elements (like a table) instead of manually searching through raw tags.

What technique identifies the exact Wikipedia table to scrape?

Chrome developer tools. By inspecting the table in the browser, the HTML tag and class attributes become visible. The scraper uses that class—`wikitable sortable`—to find the correct table element in the parsed HTML.

How does pandas turn the scraped HTML table into usable data?

After extracting the table’s HTML into a variable, the code passes it to pandas `read_html`. Pandas interprets the HTML table structure and returns a DataFrame, after which the code filters to the desired columns (e.g., club and stadium).
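A short illustration of that hand-off, assuming the table element has already been located and the headers are named `Club` and `Stadium`:

```python
from io import StringIO

import pandas as pd

# `table` is the BeautifulSoup element found earlier; str(table) gives its HTML.
# read_html parses every table in the string and returns a list of DataFrames,
# so the single captured table comes back as the first element.
df = pd.read_html(StringIO(str(table)))[0]

# Filter to the columns the project actually needs (assumed header names).
df = df[["Club", "Stadium"]]
```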

What does wrapping the logic into `get_efl_team_data` accomplish?

It turns a one-off notebook experiment into a repeatable data pipeline. The function fetches the latest Wikipedia table, parses and extracts the correct table, converts it to a DataFrame, filters the columns, and returns the final dataset ready for the next steps (Google Maps API distances and mapping).
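A usage sketch of that pipeline, assuming the `get_efl_team_data` signature outlined earlier:

```python
# Refresh the dataset on demand; each call re-scrapes the live Wikipedia page.
teams = get_efl_team_data()

print(len(teams))    # expected: 24 clubs, indexed 0-23
print(teams.head())  # Club / Stadium columns, ready for the Google Maps step
```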

Review Questions

  1. What specific HTML element/class does the scraper use to locate the EFL Championship table on Wikipedia?
  2. Describe the end-to-end data path from an HTTP GET request to a filtered pandas DataFrame.
  3. Why is filtering to only the needed columns an important step before distance lookups and plotting?

Key Points

  1. Define the exact data fields needed before scraping so the pipeline stays focused and the DataFrame stays clean.

  2. Use `requests` to fetch Wikipedia page HTML via an HTTP GET request.

  3. Parse the HTML with `BeautifulSoup` so the code can query structured elements rather than scanning raw markup.

  4. Use browser inspection tools to identify the table’s HTML attributes (class `wikitable sortable`) and target the correct table.

  5. Convert the extracted HTML table into a pandas DataFrame with `read_html`, then filter to the columns required for the project.

  6. Encapsulate scraping logic in a function like `get_efl_team_data` to make the dataset refreshable and reusable.

Highlights

The scraper targets the exact Wikipedia table by using the table class `wikitable sortable`, identified through Chrome’s inspect tools.
A three-step pipeline—`requests` (fetch HTML) → `BeautifulSoup` (parse/query) → pandas `read_html` (DataFrame)—turns a web table into structured data.
Wrapping the scraping into `get_efl_team_data` produces an immediately reusable dataset of 24 clubs and their stadiums.
The project’s interactive map depends on this automated dataset as the foundation for later distance lookups and clickable visualization.

Topics

  • Web Scraping
  • Wikipedia Tables
  • Python Requests
  • BeautifulSoup
  • Pandas read_html
