Python Tutorial: Web Scraping with BeautifulSoup and Requests
Based on Corey Schafer's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Web scraping becomes practical when HTML structure is treated like a map: fetch a page with `requests`, parse it with BeautifulSoup, then extract specific tags (titles, summaries, links) and save the results to a CSV. The tutorial’s core payoff is a working workflow that pulls post headlines and descriptions from a homepage, then reconstructs direct YouTube links from embedded iframes—turning messy web markup into clean, spreadsheet-ready data.
The process starts with installing the right tools: `beautifulsoup4` (specifically the “4” line for up-to-date behavior), a compatible HTML parser (`lxml` is recommended), and the `requests` library for HTTP fetching. With those in place, BeautifulSoup turns raw HTML into a navigable object. The tutorial demonstrates key extraction patterns: `soup.title` to grab the first matching tag, `.text` to return only the human-readable content, and `soup.find(...)` / `soup.find_all(...)` to locate elements by attributes like `class` (using `class_` in Python because `class` is a reserved keyword).
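These extraction patterns can be sketched on a small inline document. This is an illustrative sketch, not the tutorial's exact script: the markup is invented, and it uses the stdlib `'html.parser'` where the tutorial recommends `'lxml'` (in the tutorial the HTML would come from `requests.get(url).text`).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
# In the tutorial: html = requests.get(url).text
html = """
<html><head><title>Test Site</title></head>
<body>
  <div class="article">
    <h2><a href="/post-1">First Post</a></h2>
    <p>Summary of the first post.</p>
  </div>
</body></html>
"""
# 'html.parser' is built in; the tutorial recommends 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)        # the full tag, e.g. <title>Test Site</title>
print(soup.title.text)   # human-readable content only

# class_ has a trailing underscore because 'class' is a Python keyword.
article = soup.find('div', class_='article')
print(article.h2.a.text)
```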
A small “simple.html” example shows how to drill down from container tags to the exact fields needed. Once the approach works for one article—grabbing an `h2` link for the headline and a `p` tag for the summary—it scales by looping over all matching article containers using `find_all`. That same logic is then applied to the creator’s homepage, where each post sits inside an `article` element. Headlines come from an `h2` containing an anchor tag, summaries come from the first paragraph inside a `div` with class `entry-content`, and video links require extra steps because the page embeds YouTube via an `iframe`.
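The scale-up from one article to many can be sketched with `find_all`. The markup below is a hypothetical stand-in for the homepage structure the tutorial describes (an `article` per post, an `h2` headline link, an `entry-content` div holding the summary):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the homepage layout described above.
html = """
<article>
  <h2><a href="/a">Post A</a></h2>
  <div class="entry-content"><p>Summary A.</p></div>
</article>
<article>
  <h2><a href="/b">Post B</a></h2>
  <div class="entry-content"><p>Summary B.</p></div>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

results = []
for article in soup.find_all('article'):
    headline = article.h2.a.text
    summary = article.find('div', class_='entry-content').p.text
    results.append((headline, summary))

print(results)
```

The same loop body that worked for a single article runs unchanged for every container `find_all` returns.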
For the embedded videos, the tutorial extracts the iframe’s `src` attribute, then parses the URL string to isolate the YouTube video ID. It splits the URL on `/` to find the segment containing the ID, then splits again on `?` to remove query parameters. With the ID in hand, it constructs a standard watch URL (`https://youtube.com/watch?v=...`) using an f-string (noting f-strings require Python 3.6+).
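The URL surgery is pure string splitting. A minimal sketch, using a hypothetical embed URL (the video ID here is invented):

```python
# Hypothetical iframe src; the ID is the last path segment,
# followed by query parameters after '?'.
iframe_src = 'https://www.youtube.com/embed/dQw4w9WgXcQ?version=3&rel=0'

# split('/') -> ['https:', '', 'www.youtube.com', 'embed', 'dQw4w9WgXcQ?version=3&rel=0']
vid_id = iframe_src.split('/')[4]

# Drop the query string to leave only the ID.
vid_id = vid_id.split('?')[0]

# f-strings require Python 3.6+.
yt_link = f'https://youtube.com/watch?v={vid_id}'
print(yt_link)
```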
Real-world scraping is brittle, so the tutorial adds resilience: if a post lacks a video iframe, the iframe lookup returns `None` and would crash the script. Wrapping the video-extraction block in a `try/except` prevents failure; when extraction breaks, the video link variable is set to `None` and the scraper continues to the next post.
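The resilience pattern can be sketched as follows; the markup and video ID are invented, and the second post deliberately has no iframe, so `find('iframe')` returns `None` and subscripting it raises an exception that the `except` clause absorbs:

```python
from bs4 import BeautifulSoup

# Hypothetical posts: one with an embedded video, one without.
html = """
<article><h2><a href="/a">Has Video</a></h2>
  <iframe src="https://www.youtube.com/embed/abc123?rel=0"></iframe></article>
<article><h2><a href="/b">No Video</a></h2></article>
"""
soup = BeautifulSoup(html, 'html.parser')

links = []
for article in soup.find_all('article'):
    try:
        vid_src = article.find('iframe')['src']
        vid_id = vid_src.split('/')[4].split('?')[0]
        yt_link = f'https://youtube.com/watch?v={vid_id}'
    except Exception:
        yt_link = None  # no iframe: record None and keep scraping
    links.append(yt_link)

print(links)
```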
Finally, the extracted fields are written to `CMS scrape.csv` using Python’s `csv` module. The script outputs three columns—headline, summary, and video link—so the results can be opened in Excel or Numbers. The tutorial closes with practical guidance: prefer public APIs for large platforms when available, and scrape responsibly to avoid overwhelming servers with rapid requests.
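The CSV step can be sketched with the stdlib `csv` module; the rows below are hypothetical placeholders for scraped data:

```python
import csv

# Hypothetical scraped rows: (headline, summary, video_link).
rows = [
    ('Post A', 'Summary A.', 'https://youtube.com/watch?v=abc123'),
    ('Post B', 'Summary B.', None),
]

# newline='' avoids blank lines between rows on Windows.
with open('CMS scrape.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['headline', 'summary', 'video_link'])
    for headline, summary, video_link in rows:
        writer.writerow([headline, summary, video_link])
```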
Cornell Notes
The tutorial builds a complete scraping pipeline: fetch HTML with `requests`, parse it with BeautifulSoup using `lxml`, then extract structured data by navigating tags and attributes. It demonstrates single-item parsing (headline from an `h2` link, summary from a `p` tag) and scaling to multiple items via `find_all` loops. For embedded YouTube videos, it pulls the iframe `src`, parses out the video ID by splitting the URL, and reconstructs a direct watch link. A `try/except` block handles missing iframes so the scraper doesn’t crash when a post lacks a video. The final step writes headline, summary, and video link into `CMS scrape.csv` for spreadsheet use.
How does BeautifulSoup locate the exact HTML elements needed for scraping?
What’s the difference between extracting one item and extracting many items from a page?
How are YouTube links reconstructed when the page embeds videos with iframes?
Why does the scraper need error handling for missing videos?
How does the tutorial save scraped results into a CSV file?
Review Questions
- When would `find` be sufficient, and when should `find_all` be used instead?
- What string operations are used to extract a YouTube video ID from an iframe `src` URL?
- How does the `try/except` block change the scraper’s behavior when a post is missing an iframe?
Key Points
1. Install `beautifulsoup4`, `requests`, and an HTML parser like `lxml` to reliably parse real-world markup.
2. Use BeautifulSoup’s `.text` to extract only the human-readable content from tags.
3. Navigate HTML by combining tag structure (e.g., `h2` → `a`) with attribute filters (e.g., `class_='entry-content'`).
4. Scale from one scraped item to many by looping over `soup.find_all(...)` results.
5. Rebuild direct YouTube watch URLs by extracting the iframe `src`, parsing out the video ID, and formatting `https://youtube.com/watch?v=...`.
6. Wrap fragile extraction steps (like iframe parsing) in `try/except` so missing data doesn’t stop the entire scrape.
7. Write extracted fields to `CMS scrape.csv` with `csv.writer` so results are immediately usable in Excel or Numbers.