Requests-HTML - Checking out a new HTML parsing library for Python

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Install Requests-HTML with `pip install requests-html`; the release discussed requires Python 3.6+.

Briefing

Requests-HTML is positioned as a fast, Pythonic way to fetch and parse HTML, written by the same author as the popular Requests library. The core pitch is simple: pull a page with a session’s `get` call, then use Requests-HTML’s built-in helpers to inspect encoding, headers, redirects, links, and structured content without reaching for heavier parsing workflows.

After installing with `pip install requests-html` (Python 3.6+ required), the walkthrough uses a controllable test page (“/parse-me”) to demonstrate the main ergonomics. A single fetch returns an object that supports convenient properties and methods. The HTML can be accessed in multiple forms: `raw_html` preserves the original formatting, while the library’s formatted HTML normalizes whitespace and indentation. Link extraction is equally straightforward: the `links` property returns a set of URLs, deduplicating repeats. The object also exposes metadata like response encoding, cookies, whether a redirect occurred, and a `.json()` decoder for responses that carry JSON.
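
A minimal sketch of that flow, using the library’s documented `HTMLSession` entry point. The domain is a stand-in for the video’s test page, so point it at any page you want to inspect:

```python
from requests_html import HTMLSession

session = HTMLSession()
# "/parse-me" is the controllable test page from the video; the domain
# here is a placeholder.
r = session.get("https://example.com/parse-me")

print(r.html.links)           # set of URLs found on the page, deduplicated
print(r.html.absolute_links)  # the same links, resolved to absolute URLs
print(r.html.html[:200])      # formatted (normalized) HTML as a string
```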

The most distinctive capability is JavaScript rendering. Instead of treating the page as static HTML, Requests-HTML can run client-side JavaScript via a headless browser. The workflow is to fetch the page, then call `.render()` to execute scripts and update the DOM before querying it. In the example, the rendered output includes text that only appears after JavaScript runs, confirming that the library can extract post-JS content.
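
A sketch of that render-then-query flow; the element id `#js-content` is a hypothetical placeholder for whatever the page’s script populates:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

# On first use this downloads a local Chromium; render() then executes the
# page's JavaScript in that headless browser and updates the DOM in place.
r.html.render()

# Text that only exists after scripts run is now searchable.
element = r.html.find("#js-content", first=True)  # hypothetical element id
if element:
    print(element.text)
```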

On Windows, however, the rendering step hits a practical edge: Chromium is downloaded and rendering succeeds, but cleanup fails when the library tries to delete its temporary user-data directory. The error is traced to a permissions issue during directory removal (using `shutil.rmtree`), and the presenter experiments with toggling error-ignoring behavior. Even with elevated privileges, deletion still fails, so a workaround is to avoid raising the exception and continue, or manually delete the temporary directory later.
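
Rather than editing the library as the presenter does, one call-site sketch is to catch the cleanup error, assuming it propagates out of `.render()` as an `OSError`/`PermissionError` (the exact exception type is an assumption here):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

try:
    r.html.render()
except OSError:
    # Assumption: the failed shutil.rmtree surfaces here. Rendering itself
    # has already finished at this point, so the updated DOM is usable;
    # the leftover Chromium user-data directory can be deleted manually.
    pass
```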

Beyond basic parsing, the tutorial experiments with searching the DOM—using methods like `.find()` to locate elements such as links, IDs, divs, and table cells. The search syntax is shown to be quick, but also somewhat finicky, especially when targeting attributes or classes with spaces and when mixing selector-like patterns.
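
The searches in the walkthrough use CSS-style selectors; the id and class names below are hypothetical examples:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

links = r.html.find("a")                        # every anchor tag, as a list
title = r.html.find("#main-title", first=True)  # by id (hypothetical)
divs = r.html.find("div.content")               # tag plus class (hypothetical)
cells = r.html.find("td")                       # table data cells

# `containing` filters matches by their text content.
python_links = r.html.find("a", containing="python")
```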

Finally, the walkthrough tests Requests-HTML’s pagination support, which can follow “next page” links automatically through an iterator-like interface. It appears to work on Reddit, but the pagination behavior is inconsistent on other sites. On Hacker News, the “next” traversal sometimes lands on the wrong page, suggesting that pagination detection is heuristic and may require tuning or improvement.
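
The iterator-style interface looks like this, with Hacker News as the example site from the walkthrough; the loop is capped because next-link detection decides when, or whether, iteration ends:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://news.ycombinator.com")

# Iterating follows whatever the library detects as the "next page" link,
# which is exactly where the heuristic can misfire on some sites.
for i, page in enumerate(r.html):
    print(page.url)
    if i >= 2:  # only sample the first few pages
        break
```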

Overall, Requests-HTML is presented as a strong alternative to BeautifulSoup for straightforward HTML parsing and for cases where JavaScript-rendered content matters. The main caveats are platform-specific cleanup issues during rendering (notably on Windows) and imperfect pagination across different site structures.

Cornell Notes

Requests-HTML offers a streamlined way to fetch and parse HTML in Python, with convenience methods for links, formatted vs raw HTML, response metadata, and DOM searching. Its standout feature is JavaScript rendering: calling `.render()` executes page scripts so elements populated after load become queryable. In practice, rendering works, but on Windows the library can fail to delete its temporary Chromium user-data directory due to permissions, requiring a workaround or manual cleanup. Pagination support can follow “next” links automatically, but results vary by site, indicating heuristic limitations. The net effect: fast HTML parsing with optional JS execution, plus some rough edges around cleanup and pagination reliability.

How does Requests-HTML make basic HTML parsing easier than manual parsing?

After installing `requests-html`, a fetched page becomes an object with direct helpers. The walkthrough demonstrates accessing response encoding, cookies, headers, and whether a redirect occurred. It also shows extracting links into a set (deduplicated), and viewing HTML in two forms: `raw_html` preserves original formatting, while the formatted HTML normalizes whitespace and indentation for readability.
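
All of that metadata comes from the underlying Requests response object; a minimal sketch with a stand-in URL:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

print(r.encoding)     # e.g. 'utf-8'
print(r.headers)      # response headers, a case-insensitive dict
print(r.cookies)      # cookies the server set
print(r.history)      # non-empty list of responses if a redirect occurred
print(r.html.links)   # deduplicated set of links on the page
```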

What’s the difference between raw HTML and formatted HTML in this workflow?

`raw_html` returns the HTML as received, including the original spacing and line breaks. The formatted HTML output restructures whitespace—tabs and newlines are normalized—making it easier to read and inspect, though it may not match the exact original byte-for-byte formatting.
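
Concretely, the two forms differ in both type and layout (stand-in URL again):

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

# raw_html is bytes, exactly as received over the wire.
print(type(r.html.raw_html), r.html.raw_html[:120])

# html is a decoded, re-serialized string with normalized whitespace.
print(type(r.html.html), r.html.html[:120])
```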

How does JavaScript rendering work, and what does it enable?

JavaScript rendering is triggered with `.render()`. The library uses a headless browser (Chromium) to execute scripts and update the DOM, then the updated elements can be queried. The example confirms this by extracting text that only appears after the page’s JavaScript runs, proving that post-load DOM changes are captured.
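
`.render()` also accepts timing knobs for pages whose scripts need a moment; a hedged sketch where the parameter values are illustrative and `#js-content` is a hypothetical element id:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

# sleep pauses after the page loads so slower scripts can finish;
# scrolldown scrolls the page, which can trigger lazy-loaded content.
r.html.render(sleep=1, scrolldown=2)

element = r.html.find("#js-content", first=True)  # hypothetical element id
print(element.text if element else "not found")
```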

What problem appears during `.render()` on Windows, and how is it handled?

Rendering succeeds, but cleanup fails when the library tries to remove the temporary Chromium user-data directory. The error is traced to a permissions/access-denied issue during directory deletion (via `shutil.rmtree`). The workaround explored is to ignore the deletion error (so the program continues) or manually delete the temporary directory later; elevated privileges didn’t fully resolve the deletion failure in the walkthrough.

How does pagination support behave, and why is it unreliable across sites?

Pagination is treated as an iterator-like feature (e.g., calling `.next()` or iterating over `r.html`). It appears to work on Reddit, but on Hacker News the traversal sometimes lands on an unexpected page (e.g., getting stuck on a different domain/page). That suggests pagination detection depends on site-specific link patterns and may be heuristic rather than universally correct.
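
Besides iteration, the detected next link can be inspected directly; by default `.next()` returns the URL without fetching it, which makes the heuristic easy to audit before following it:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://news.ycombinator.com")

# Returns the URL the library believes is the next page, or None.
print(r.html.next())

# fetch=True follows that link and returns the fetched response instead.
next_page = r.html.next(fetch=True)
if next_page is not None:
    print(next_page.url)
```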

What does `.find()` help with, and what are common friction points?

`.find()` is used to locate elements like links, IDs, divs, and table data. The walkthrough shows that searching can be quick, but selector-like targeting can be finicky—especially when trying to match classes/attributes with spaces or when the selector syntax doesn’t align with the library’s expectations. In some attempts, the search returns nothing or errors, prompting the presenter to adjust approach (e.g., searching for broader tags first).
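
Where CSS-style selectors get finicky, the library’s template-based `search()` can be a simpler fallback; the class names and template string below are hypothetical:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com/parse-me")  # stand-in test URL

# Class values with spaces are really multiple classes; select them with
# CSS attribute syntax or by chaining classes, not a bare find("table row").
rows = r.html.find('[class="table row"]')  # exact attribute match
rows = r.html.find("tr.table.row")         # element with both classes

# When selectors fight back, template search over the page text is a
# simpler fallback.
result = r.html.search("Requests-HTML is {}")
if result:
    print(result[0])
```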

Review Questions

  1. When would you choose `raw_html` over formatted HTML, and what practical benefit does formatting provide?
  2. What steps are required to query elements that only appear after JavaScript runs?
  3. Why might pagination work on one site but fail or misroute on another when using Requests-HTML?

Key Points

  1. Install Requests-HTML with `pip install requests-html`; the release discussed requires Python 3.6+.

  2. Use the fetched page object to inspect response details like encoding, cookies, headers, and redirect status without extra plumbing.

  3. Extract links directly into a deduplicated set using the library’s link helper methods.

  4. Use `.render()` to execute JavaScript and make post-load DOM content available for `.find()` queries.

  5. Expect Windows-specific rendering cleanup issues: deletion of Chromium’s temporary user-data directory can fail due to permissions.

  6. Pagination support can follow “next” links automatically, but heuristic detection may misidentify the correct next page on some sites.

  7. For DOM searching, start with broader tag queries and refine selectors carefully, since attribute/class targeting can be sensitive to syntax.

Highlights

  • Requests-HTML turns a fetched page into an object with immediate, high-level access to links, HTML (raw vs formatted), and response metadata.
  • JavaScript rendering via `.render()` enables extraction of content that appears only after client-side scripts run.
  • On Windows, `.render()` can succeed while still failing to delete Chromium’s temporary user-data directory due to access permissions.
  • Pagination automation can work (e.g., Reddit) but may misroute on other sites like Hacker News, reflecting imperfect next-link detection.

Topics

  • Requests-HTML Installation
  • HTML Parsing
  • JavaScript Rendering
  • DOM Searching
  • Pagination Support
  • Chromium Cleanup Errors