Build Web Scraper with Llama 3.1 | Get Structured Data By Scraping Web Content With AI
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical pipeline for turning messy, JavaScript-heavy web pages into clean structured data is built by combining Playwright for rendering, HTML-to-text for compression into readable markdown, and Llama 3.1 (via a structured-output wrapper) for extracting specific fields into a schema. The payoff is straightforward: landing pages and listing pages can be scraped into CSV or a pandas DataFrame with far less token waste than sending raw HTML directly to the model.
The workflow starts by using Playwright with a headless Chromium browser to fetch fully rendered HTML, even when the target page relies on frameworks like React or Angular. The scraper sets a realistic user agent (example: a recent Chrome on macOS) and then navigates to the target URL, capturing the page’s complete HTML after rendering. Raw HTML is intentionally treated as a poor input for Llama 3.1 because it’s large and token-heavy; instead, the HTML is converted into markdown using an HTML-to-text library while preserving links. This step produces a shorter, more model-friendly representation that still retains the textual content needed for extraction.
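A minimal sketch of these two steps, assuming the playwright and html2text packages; the user-agent string and function names here are illustrative rather than the video's exact code:

```python
import html2text
from playwright.sync_api import sync_playwright

# Illustrative user agent: a recent Chrome on macOS.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)

def fetch_rendered_html(url: str) -> str:
    """Render the page in headless Chromium and return the final HTML,
    including content produced by JavaScript frameworks."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=USER_AGENT)
        page.goto(url)
        html = page.content()  # HTML after client-side rendering
        browser.close()
    return html

def html_to_markdown(html: str) -> str:
    """Compress raw HTML into markdown-like text, preserving links."""
    converter = html2text.HTML2Text()
    converter.ignore_links = False   # keep links for extraction
    converter.ignore_images = True   # images add tokens, not text
    return converter.handle(html)
```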
With markdown in hand, the pipeline calls Llama 3.1 70B through ChatOllama (authenticated against the Groq API in the demonstrated setup). A strict system instruction prevents the model from inventing or reformatting content: it is told to “always extract data without changing it” and to output only what is requested. A helper function wraps the whole process: fetch the page with Playwright, convert it to markdown, then feed the markdown into a prompt that asks for structured extraction.
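Because the demonstrated setup authenticates against the Groq API, a ChatGroq sketch stands in here for the chat-model wrapper; the model id, prompt wording, and helper name are assumptions, not the video's exact code:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_groq import ChatGroq

# Assumed model id for Llama 3.1 70B on Groq; the wrapper reads
# GROQ_API_KEY from the environment.
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

# Strict instruction so the model neither invents nor reformats content.
SYSTEM_PROMPT = (
    "You are an expert at extracting information from web pages. "
    "Always extract data without changing it. "
    "Output only the requested information, nothing else."
)

def scrape_to_markdown(url: str) -> str:
    """Fetch and render the page, then compress it to markdown,
    reusing the helpers from the sketch above."""
    return html_to_markdown(fetch_rendered_html(url))
```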
To force consistent results, the extraction step uses structured output with a Pydantic model defining the exact fields to return. For a project landing page example, the schema includes the project name, tagline, and a short list of features/benefits. The model is then invoked with system and user messages, where the user message contains the markdown content enclosed in backticks to clearly delimit the source text. The output is parsed into the defined schema and becomes rows in a DataFrame.
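A sketch of the schema-constrained call, assuming Pydantic v2 and LangChain's with_structured_output; the field names mirror the landing-page example, and the URL list is a placeholder:

```python
import pandas as pd
from pydantic import BaseModel, Field

class ProjectInformation(BaseModel):
    """Exact fields to pull from a project landing page."""
    name: str = Field(description="Name of the project")
    tagline: str = Field(description="Short tagline or pitch for the project")
    benefits: list[str] = Field(description="Short list of features/benefits")

# Bind the schema so every response parses into ProjectInformation.
structured_llm = llm.with_structured_output(ProjectInformation)

def extract_project(url: str) -> ProjectInformation:
    """Render, compress, and extract one landing page into the schema."""
    markdown = scrape_to_markdown(url)
    return structured_llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        # Backticks delimit where the source text begins and ends.
        HumanMessage(content=f"Extract the information from:\n```\n{markdown}\n```"),
    ])

# Hypothetical URL list; parsed objects become DataFrame rows.
urls = ["https://example.com/project"]
df = pd.DataFrame([extract_project(u).model_dump() for u in urls])
```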
The approach is tested on multiple landing pages and then on a more complex “cars listing” page. For the cars scenario, a more detailed schema is defined per listing: make, model, horsepower, price, mileage in kilometers, registration year, and a URL for the car. The scraper again renders and converts the page to markdown, then extracts a list of car objects. Some extracted fields include extra noise, so a filter step cleans the “model” field by removing non-alphanumeric characters, splitting on spaces, and keeping only the first few tokens, which improves consistency when the model output contains longer strings.
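A sketch of the per-listing schema and the deterministic cleanup; the field names follow the list above, while the two-token cutoff is an assumed default for “the first few tokens”:

```python
import re
from pydantic import BaseModel, Field

class Car(BaseModel):
    make: str = Field(description="Make of the car")
    model: str = Field(description="Model of the car")
    horsepower: int = Field(description="Horsepower of the engine")
    price: int = Field(description="Listed price")
    mileage: int = Field(description="Mileage in kilometers")
    year: int = Field(description="Year of first registration")
    url: str = Field(description="URL of the car listing")

class CarListing(BaseModel):
    """Wrapper schema so the model returns a list of car objects."""
    cars: list[Car]

def clean_model_name(raw: str, keep_tokens: int = 2) -> str:
    """Strip non-alphanumeric noise, split on spaces, and keep only
    the first few tokens of the extracted model string."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", raw)
    return " ".join(cleaned.split()[:keep_tokens])
```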
Finally, the structured results are saved to CSV. The overall method is positioned as a strong starting point for small experiments and data pipelines, with a clear warning that scraping must respect site permissions and terms. In production, additional robustness would be needed, but the core pattern—render HTML, compress to markdown, then extract via schema-constrained Llama 3.1—delivers usable structured data quickly (example: five landing pages scraped in about 23 seconds).
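A closing sketch of the export step, built on the CarListing schema and clean_model_name helper above; the output filename is arbitrary:

```python
import pandas as pd

cars_llm = llm.with_structured_output(CarListing)

def listing_to_csv(url: str, path: str = "cars.csv") -> pd.DataFrame:
    """Extract every car on a listing page, clean the noisy model
    field, and persist the rows for downstream analysis."""
    markdown = scrape_to_markdown(url)
    listing = cars_llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=f"Extract all car listings from:\n```\n{markdown}\n```"),
    ])
    df = pd.DataFrame([car.model_dump() for car in listing.cars])
    df["model"] = df["model"].map(clean_model_name)  # deterministic cleanup
    df.to_csv(path, index=False)
    return df
```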
Cornell Notes
The pipeline turns rendered web pages into structured data by chaining three steps: Playwright fetches fully rendered HTML (including JavaScript-driven pages), HTML-to-text converts that HTML into compact markdown while keeping links, and Llama 3.1 extracts fields from the markdown using schema-constrained structured output. This avoids sending huge raw HTML to the model, reducing token waste and improving extraction consistency. Pydantic models define exactly which fields to return (e.g., project name, tagline, benefits; or per-car make, model, horsepower, price, mileage, year, and URL). Results are parsed into a pandas DataFrame and saved to CSV, with optional post-processing (like cleaning the “model” string) to handle noisy outputs.
- Why render with Playwright before extraction instead of scraping HTML directly?
- What role does HTML-to-text play in the pipeline?
- How does the extraction step keep outputs consistent and “schema-like”?
- What prompting tactic is used to reduce model confusion about where the source text begins and ends?
- What happens when extracted fields contain extra noise or longer-than-expected strings?
- How are extracted results turned into something usable for analysis?
Review Questions
- In what order do Playwright, HTML-to-text, and Llama 3.1 run, and what problem does each step solve?
- How does a Pydantic schema change the quality of extraction compared with asking for free-form text?
- What post-processing technique is applied to the car “model” field, and why might it be necessary?
Key Points
1. Use Playwright with headless Chromium to capture fully rendered HTML from JavaScript-heavy pages before extraction.
2. Convert HTML to compact markdown with HTML-to-text to reduce token load and remove irrelevant markup while preserving links.
3. Call Llama 3.1 70B with strict instructions to extract data without altering it or adding extra content.
4. Define a Pydantic schema for structured output so the model returns exactly the fields needed (e.g., name/tagline/benefits or car listing attributes).
5. Wrap the markdown source in backticks in the prompt to clearly delimit the input text for the model.
6. When fields come back noisy (like car model strings), apply deterministic cleaning (alphanumeric filtering and token truncation) before saving results.
7. Store extracted objects in a pandas DataFrame and export to CSV for downstream analysis pipelines.