Build Web Scraper with Llama 3.1 | Get Structured Data By Scraping Web Content With AI
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical pipeline for turning messy, JavaScript-heavy web pages into clean structured data is built by combining Playwright for rendering, HTML-to-text for compression into readable markdown, and Llama 3.1 (via a structured-output wrapper) for extracting specific fields into a schema. The payoff is straightforward: landing pages and listing pages can be scraped into CSV or a pandas DataFrame with far less token waste than sending raw HTML directly to the model.
The workflow starts by using Playwright with a headless Chromium browser to fetch fully rendered HTML, even when the target page relies on frameworks like React or Angular. The scraper sets a realistic user agent (example: a recent Chrome on macOS) and then navigates to the target URL, capturing the page’s complete HTML after rendering. Raw HTML is intentionally treated as a poor input for Llama 3.1 because it’s large and token-heavy; instead, the HTML is converted into markdown using an HTML-to-text library while preserving links. This step produces a shorter, more model-friendly representation that still retains the textual content needed for extraction.
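A minimal sketch of these two steps, assuming the playwright and html2text packages; the user-agent string and function names here are illustrative rather than the video's exact code:

```python
import html2text
from playwright.sync_api import sync_playwright

# Illustrative user agent: a recent Chrome on macOS.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)

def fetch_rendered_html(url: str) -> str:
    """Render the page in headless Chromium and return the final HTML,
    including content produced by JavaScript frameworks."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=USER_AGENT)
        page.goto(url)
        html = page.content()  # HTML after client-side rendering
        browser.close()
    return html

def html_to_markdown(html: str) -> str:
    """Compress raw HTML into markdown-like text, preserving links."""
    converter = html2text.HTML2Text()
    converter.ignore_links = False   # keep links for extraction
    converter.ignore_images = True   # images add tokens, not text
    return converter.handle(html)
```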
With markdown in hand, the pipeline calls Llama 3.1 70B through ChatOllama (authenticated against the Groq API in the demonstrated setup). A strict system instruction prevents the model from inventing or reformatting content: it is told to “always extract data without changing it” and to output only what is requested. A helper function wraps the whole process: fetch the page with Playwright, convert it to markdown, then feed the markdown into a prompt that asks for structured extraction.
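Because the demonstrated setup authenticates against the Groq API, a ChatGroq sketch stands in here for the chat-model wrapper; the model id, prompt wording, and helper name are assumptions, not the video's exact code:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_groq import ChatGroq

# Assumed model id for Llama 3.1 70B on Groq; the wrapper reads
# GROQ_API_KEY from the environment.
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

# Strict instruction so the model neither invents nor reformats content.
SYSTEM_PROMPT = (
    "You are an expert at extracting information from web pages. "
    "Always extract data without changing it. "
    "Output only the requested information, nothing else."
)

def scrape_to_markdown(url: str) -> str:
    """Fetch and render the page, then compress it to markdown,
    reusing the helpers from the sketch above."""
    return html_to_markdown(fetch_rendered_html(url))
```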
To force consistent results, the extraction step uses structured output with a Pydantic model defining the exact fields to return. For a project landing page example, the schema includes the project name, tagline, and a short list of features/benefits. The model is then invoked with system and user messages, where the user message contains the markdown content enclosed in backticks to clearly delimit the source text. The output is parsed into the defined schema and becomes rows in a DataFrame.
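A sketch of the schema-constrained call, assuming Pydantic v2 and LangChain's with_structured_output; the field names mirror the landing-page example, and the URL list is a placeholder:

```python
import pandas as pd
from pydantic import BaseModel, Field

class ProjectInformation(BaseModel):
    """Exact fields to pull from a project landing page."""
    name: str = Field(description="Name of the project")
    tagline: str = Field(description="Short tagline or pitch for the project")
    benefits: list[str] = Field(description="Short list of features/benefits")

# Bind the schema so every response parses into ProjectInformation.
structured_llm = llm.with_structured_output(ProjectInformation)

def extract_project(url: str) -> ProjectInformation:
    """Render, compress, and extract one landing page into the schema."""
    markdown = scrape_to_markdown(url)
    return structured_llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        # Backticks delimit where the source text begins and ends.
        HumanMessage(content=f"Extract the information from:\n```\n{markdown}\n```"),
    ])

# Hypothetical URL list; parsed objects become DataFrame rows.
urls = ["https://example.com/project"]
df = pd.DataFrame([extract_project(u).model_dump() for u in urls])
```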
The approach is tested on multiple landing pages and then on a more complex “cars listing” page. For the cars scenario, a more detailed schema is defined per listing: make, model, horsepower, price, mileage in kilometers, registration year, and a URL for the car. The scraper again renders and converts the page to markdown, then extracts a list of car objects. Some extracted fields include extra noise, so a filter step cleans the “model” field by removing non-alphanumeric characters, splitting on spaces, and keeping only the first few tokens, which improves consistency when the model output contains longer strings.
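A sketch of the per-listing schema and the deterministic cleanup; the field names follow the list above, while the two-token cutoff is an assumed default for “the first few tokens”:

```python
import re
from pydantic import BaseModel, Field

class Car(BaseModel):
    make: str = Field(description="Make of the car")
    model: str = Field(description="Model of the car")
    horsepower: int = Field(description="Horsepower of the engine")
    price: int = Field(description="Listed price")
    mileage: int = Field(description="Mileage in kilometers")
    year: int = Field(description="Year of first registration")
    url: str = Field(description="URL of the car listing")

class CarListing(BaseModel):
    """Wrapper schema so the model returns a list of car objects."""
    cars: list[Car]

def clean_model_name(raw: str, keep_tokens: int = 2) -> str:
    """Strip non-alphanumeric noise, split on spaces, and keep only
    the first few tokens of the extracted model string."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", raw)
    return " ".join(cleaned.split()[:keep_tokens])
```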
Finally, the structured results are saved to CSV. The overall method is positioned as a strong starting point for small experiments and data pipelines, with a clear warning that scraping must respect site permissions and terms. In production, additional robustness would be needed, but the core pattern—render HTML, compress to markdown, then extract via schema-constrained Llama 3.1—delivers usable structured data quickly (example: five landing pages scraped in about 23 seconds).
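A closing sketch of the export step, built on the CarListing schema and clean_model_name helper above; the output filename is arbitrary:

```python
import pandas as pd

cars_llm = llm.with_structured_output(CarListing)

def listing_to_csv(url: str, path: str = "cars.csv") -> pd.DataFrame:
    """Extract every car on a listing page, clean the noisy model
    field, and persist the rows for downstream analysis."""
    markdown = scrape_to_markdown(url)
    listing = cars_llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=f"Extract all car listings from:\n```\n{markdown}\n```"),
    ])
    df = pd.DataFrame([car.model_dump() for car in listing.cars])
    df["model"] = df["model"].map(clean_model_name)  # deterministic cleanup
    df.to_csv(path, index=False)
    return df
```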
Cornell Notes
The pipeline turns rendered web pages into structured data by chaining three steps: Playwright fetches fully rendered HTML (including JavaScript-driven pages), HTML-to-text converts that HTML into compact markdown while keeping links, and Llama 3.1 extracts fields from the markdown using schema-constrained structured output. This avoids sending huge raw HTML to the model, reducing token waste and improving extraction consistency. Pydantic models define exactly which fields to return (e.g., project name, tagline, benefits; or per-car make, model, horsepower, price, mileage, year, and URL). Results are parsed into a pandas DataFrame and saved to CSV, with optional post-processing (like cleaning the “model” string) to handle noisy outputs.
- Why render with Playwright before extraction instead of scraping HTML directly?
- What role does HTML-to-text play in the pipeline?
- How does the extraction step keep outputs consistent and “schema-like”?
- What prompting tactic is used to reduce model confusion about where the source text begins and ends?
- What happens when extracted fields contain extra noise or longer-than-expected strings?
- How are extracted results turned into something usable for analysis?
Review Questions
- In what order do Playwright, HTML-to-text, and Llama 3.1 run, and what problem does each step solve?
- How does a Pydantic schema change the quality of extraction compared with asking for free-form text?
- What post-processing technique is applied to the car “model” field, and why might it be necessary?
Key Points
1. Use Playwright with headless Chromium to capture fully rendered HTML from JavaScript-heavy pages before extraction.
2. Convert HTML to compact markdown with HTML-to-text to reduce token load and remove irrelevant markup while preserving links.
3. Call Llama 3.1 70B with strict instructions to extract data without altering it or adding extra content.
4. Define a Pydantic schema for structured output so the model returns exactly the fields needed (e.g., name/tagline/benefits or car listing attributes).
5. Wrap the markdown source in backticks in the prompt to clearly delimit the input text for the model.
6. When fields come back noisy (like car model strings), apply deterministic cleaning (alphanumeric filtering and token truncation) before saving results.
7. Store extracted objects in a pandas DataFrame and export to CSV for downstream analysis pipelines.