
Python Tutorial: Web Scraping with BeautifulSoup and Requests

Corey Schafer · 4 min read

Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Install `beautifulsoup4`, `requests`, and an HTML parser like `lxml` to reliably parse real-world markup.

Briefing

Web scraping becomes practical when HTML structure is treated like a map: fetch a page with `requests`, parse it with BeautifulSoup, then extract specific tags (titles, summaries, links) and save the results to a CSV. The tutorial’s core payoff is a working workflow that pulls post headlines and descriptions from a homepage, then reconstructs direct YouTube links from embedded iframes—turning messy web markup into clean, spreadsheet-ready data.

The process starts with installing the right tools: `beautifulsoup4` (specifically the “4” line for up-to-date behavior), a compatible HTML parser (`lxml` is recommended), and the `requests` library for HTTP fetching. With those in place, BeautifulSoup turns raw HTML into a navigable object. The tutorial demonstrates key extraction patterns: `soup.title` to grab the first matching tag, `.text` to return only the human-readable content, and `soup.find(...)` / `soup.find_all(...)` to locate elements by attributes like `class` (using `class_` in Python because `class` is a reserved keyword).
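A minimal sketch of those basics, assuming a placeholder URL (the video works against a local `simple.html` file and the creator's homepage):

```python
# pip install beautifulsoup4 lxml requests
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML and hand it to BeautifulSoup with the lxml parser.
source = requests.get('https://example.com').text   # placeholder URL
soup = BeautifulSoup(source, 'lxml')

print(soup.title)       # first <title> tag, markup included
print(soup.title.text)  # only the human-readable text

# class_ with a trailing underscore, since `class` is reserved in Python
footer = soup.find('div', class_='footer')  # None if no such div exists
```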

A small “simple.html” example shows how to drill down from container tags to the exact fields needed. Once the approach works for one article—grabbing an `h2` link for the headline and a `p` tag for the summary—it scales by looping over all matching article containers using `find_all`. That same logic is then applied to the creator’s homepage, where each post sits inside an `article` element. Headlines come from an `h2` containing an anchor tag, summaries come from the first paragraph inside a `div` with class `entry-content`, and video links require extra steps because the page embeds YouTube via an `iframe`.
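A sketch of that scale-up, using the tag and class names described above (treat them as the pattern from the video, not a guaranteed live page structure):

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://example.com').text, 'lxml')  # placeholder URL

# The same drill-down that worked for one article, repeated for every match.
for article in soup.find_all('article'):
    headline = article.h2.a.text                                  # <h2> wrapping an <a>
    summary = article.find('div', class_='entry-content').p.text  # first <p> in the content div
    print(headline)
    print(summary)
```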

For the embedded videos, the tutorial extracts the iframe’s `src` attribute, then parses the URL string to isolate the YouTube video ID. It splits the URL on `/` to find the segment containing the ID, then splits again on `?` to remove query parameters. With the ID in hand, it constructs a standard watch URL (`https://youtube.com/watch?v=...`) using an f-string (noting f-strings require Python 3.6+).
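A sketch of that string surgery on a representative embed URL (the sample URL, and therefore the index `4`, are assumptions based on the standard embed format):

```python
# A sample embed src; real pages can vary, which is why the index is fragile.
vid_src = 'https://www.youtube.com/embed/abc123XYZ?version=3&rel=1'

vid_id = vid_src.split('/')[4]   # -> 'abc123XYZ?version=3&rel=1'
vid_id = vid_id.split('?')[0]    # -> 'abc123XYZ'

yt_link = f'https://youtube.com/watch?v={vid_id}'  # f-strings require Python 3.6+
print(yt_link)                   # https://youtube.com/watch?v=abc123XYZ
```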

Real-world scraping is brittle, so the tutorial adds resilience: if a post lacks a video iframe, the iframe lookup returns `None` and would crash the script. Wrapping the video-extraction block in a `try/except` prevents failure; when extraction breaks, the video link variable is set to `None` and the scraper continues to the next post.
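A sketch of that pattern, assuming `article` is one element from the `find_all` loop shown earlier:

```python
try:
    vid_src = article.find('iframe', class_='youtube-player')['src']
    vid_id = vid_src.split('/')[4].split('?')[0]
    yt_link = f'https://youtube.com/watch?v={vid_id}'
except Exception:
    # find() returned None (no iframe), so the ['src'] lookup raised;
    # record the gap and let the loop continue with the next post.
    yt_link = None
```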

Finally, the extracted fields are written to `cms_scrape.csv` using Python’s `csv` module. The script outputs three columns—headline, summary, and video link—so the results can be opened in Excel or Numbers. The tutorial closes with practical guidance: prefer public APIs for large platforms when available, and scrape responsibly to avoid overwhelming servers with rapid requests.

Cornell Notes

The tutorial builds a complete scraping pipeline: fetch HTML with `requests`, parse it with BeautifulSoup using `lxml`, then extract structured data by navigating tags and attributes. It demonstrates single-item parsing (headline from an `h2` link, summary from a `p` tag) and scaling to multiple items via `find_all` loops. For embedded YouTube videos, it pulls the iframe `src`, parses out the video ID by splitting the URL, and reconstructs a direct watch link. A `try/except` block handles missing iframes so the scraper doesn’t crash when a post lacks a video. The final step writes headline, summary, and video link into `cms_scrape.csv` for spreadsheet use.
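Assembled end to end, the pipeline looks roughly like this; the URL is a placeholder and the class names follow the structure described above, so verify them against the live page:

```python
import csv
import requests
from bs4 import BeautifulSoup

source = requests.get('https://example.com').text  # placeholder URL
soup = BeautifulSoup(source, 'lxml')

csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])

for article in soup.find_all('article'):
    headline = article.h2.a.text
    summary = article.find('div', class_='entry-content').p.text

    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']
        vid_id = vid_src.split('/')[4].split('?')[0]
        yt_link = f'https://youtube.com/watch?v={vid_id}'
    except Exception:
        yt_link = None  # post without an embedded video

    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()
```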

How does BeautifulSoup locate the exact HTML elements needed for scraping?

It uses attribute-based searches. Simple access like `soup.title` returns the first matching tag, while `.text` extracts only the displayed text. For targeted selection, `soup.find('div', class_='footer')` narrows results by attributes; `class_` is used because `class` is reserved in Python. To collect multiple matches, `soup.find_all(...)` returns a list of tags, enabling loops over repeated structures like article blocks.
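A tiny demonstration of those lookup patterns on an inline snippet (the markup here is illustrative, not the page from the video):

```python
from bs4 import BeautifulSoup

html = '''
<div class="article"><h2><a href="/a">First post</a></h2></div>
<div class="article"><h2><a href="/b">Second post</a></h2></div>
<div class="footer">Footer text</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.find('div', class_='footer').text)       # first (and only) match

for div in soup.find_all('div', class_='article'):  # list of all matches
    print(div.h2.a.text)
```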

What’s the difference between extracting one item and extracting many items from a page?

Single-item extraction uses `find(...)`, which returns the first matching tag. Multi-item extraction uses `find_all(...)`, which returns all matching tags as a list. The tutorial first proves headline/summary extraction on one article, then replaces the single `find` with `for article in soup.find_all(...)` and repeats the same inner logic for each article.

How are YouTube links reconstructed when the page embeds videos with iframes?

The scraper finds the iframe (using `find` with `class_='youtube-player'`) and reads its `src` attribute via dictionary-style access (e.g., `iframe['src']`). It then parses the URL string: first split on `/` to locate the segment containing the video ID, then split on `?` to remove query parameters. Finally, it builds `https://youtube.com/watch?v=<video_id>` using an f-string.

Why does the scraper need error handling for missing videos?

If a post lacks an iframe, the iframe search returns `None`. Attempting to access attributes on `None` triggers errors (like “NoneType object is not subscriptable”). Wrapping the video-extraction logic in a `try/except` block prevents the crash; when extraction fails, the script sets the video link variable to `None` and continues processing other posts.

How does the tutorial save scraped results into a CSV file?

It imports `csv`, opens `cms_scrape.csv` in write mode, creates a `csv.writer`, and writes headers as the first row: `headline`, `summary`, `video_link`. Inside the loop, it writes one row per article with the extracted values. After the loop, it closes the file so the spreadsheet app can open the completed CSV.
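A sketch of that sequence (`newline=''` is a small addition beyond the video that avoids blank rows on Windows):

```python
import csv

csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])  # header row first

# inside the scraping loop, one row per article:
# csv_writer.writerow([headline, summary, yt_link])

csv_file.close()  # finish the file so Excel or Numbers can open it
```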

Review Questions

  1. When would `find` be sufficient, and when should `find_all` be used instead?
  2. What string operations are used to extract a YouTube video ID from an iframe `src` URL?
  3. How does the `try/except` block change the scraper’s behavior when a post is missing an iframe?

Key Points

  1. Install `beautifulsoup4`, `requests`, and an HTML parser like `lxml` to reliably parse real-world markup.

  2. Use BeautifulSoup’s `.text` to extract only the human-readable content from tags.

  3. Navigate HTML by combining tag structure (e.g., `h2` → `a`) with attribute filters (e.g., `class_='entry-content'`).

  4. Scale from one scraped item to many by looping over `soup.find_all(...)` results.

  5. Rebuild direct YouTube watch URLs by extracting the iframe `src`, parsing out the video ID, and formatting `https://youtube.com/watch?v=...`.

  6. Wrap fragile extraction steps (like iframe parsing) in `try/except` so missing data doesn’t stop the entire scrape.

  7. Write extracted fields to `cms_scrape.csv` with `csv.writer` so results are immediately usable in Excel or Numbers.

Highlights

The scraper reconstructs YouTube links by parsing iframe `src` URLs to isolate the video ID, then generating `https://youtube.com/watch?v=<id>`.
Using `find_all` turns a working “single article” parser into a homepage-wide scraper with minimal code changes.
A `try/except` around iframe extraction prevents crashes when a post has no embedded video, storing `None` instead.
CSV output is produced with a simple `csv.writer` workflow: headers first, then one row per scraped article.
