Will AI Kill Traditional Web Scraping? (GPT4V + Mistral Medium Project)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Puppeteer-rendered screenshots can replace HTML parsing when extracting information from websites that are hard to scrape reliably.
Briefing
A new scraping workflow is emerging that replaces HTML parsing with visual capture: use Puppeteer to screenshot target pages, then feed those images into GPT-4V (vision) to extract structured facts, and finally use a Mistral model to turn the extracted data into a clean, human-readable summary—optionally with text-to-speech via ElevenLabs. The practical payoff is reliability on pages that are hard to parse with traditional tools, because the extraction is driven by what’s visible in the rendered page rather than by brittle DOM selectors.
The project starts with a simple loop: define a list of URLs, render each page in a headless browser, and capture a screenshot at a chosen viewport size (full-page capture is also available, though extremely tall pages produce aspect ratios that degrade what the vision model can read). To improve access to websites, a stealth plugin is added to Puppeteer to reduce blocks and bot-detection friction. Each screenshot is then encoded to base64 and sent to a vision model with a task-specific system prompt. In the example, the prompt instructs the model to act like a web scraper and output only short, structured bullet points (specifically tech news headlines) based on what appears in the screenshot.
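A rough sketch of this capture-and-extract step, assuming puppeteer-extra with its stealth plugin and the official OpenAI Node SDK; the URL list, viewport size, and prompt wording here are illustrative rather than the video's exact code:

```ts
// capture.ts - sketch of the screenshot + vision-extraction step.
// Assumes puppeteer-extra, puppeteer-extra-plugin-stealth, and the
// openai Node SDK are installed, and OPENAI_API_KEY is set.
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import OpenAI from "openai";

puppeteer.use(StealthPlugin()); // reduce bot-detection friction

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative targets and prompt; the video uses tech front pages.
const urls = ["https://www.theverge.com", "https://www.wired.com"];
const systemPrompt =
  "You are a web scraper. From the screenshot, extract only the tech news " +
  "headlines as short, structured bullet points. Output nothing else.";

async function screenshotBase64(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // A fixed viewport keeps the aspect ratio sane; fullPage: true works,
  // but very tall captures can hurt what the vision model extracts.
  await page.setViewport({ width: 1280, height: 1600 });
  await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
  const png = (await page.screenshot({ encoding: "base64" })) as string;
  await browser.close();
  return png;
}

async function extractBullets(b64: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4-vision-preview", // GPT-4V at the time; swap in a current vision model
    max_tokens: 500,
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: "Extract the headlines from this page." },
          { type: "image_url", image_url: { url: `data:image/png;base64,${b64}` } },
        ],
      },
    ],
  });
  return res.choices[0].message.content ?? "";
}

async function main() {
  const notes: string[] = [];
  for (const url of urls) {
    notes.push(await extractBullets(await screenshotBase64(url)));
  }
  console.log(notes.join("\n\n"));
}

main();
```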
After the vision step produces bullet-point text for each site, the pipeline aggregates everything into a second prompt for a Mistral model. That prompt asks for a top-five list of the most important tech stories, formatted for conversational reading with line breaks between items and capped at five entries. The output is then converted into audio using ElevenLabs’ text-to-speech API, producing an MP3 file that can be played back as a brief news voiceover.
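The summarize-and-narrate step can be sketched with plain HTTP calls, assuming Mistral's OpenAI-style chat endpoint and ElevenLabs' text-to-speech endpoint; the prompt wording and voice ID are placeholders, not the video's exact code:

```ts
// narrate.ts - sketch of the Mistral summary + ElevenLabs voiceover step.
// Assumes MISTRAL_API_KEY and ELEVENLABS_API_KEY are set.
import { writeFile } from "node:fs/promises";

async function summarize(notes: string): Promise<string> {
  const res = await fetch("https://api.mistral.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.MISTRAL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "mistral-medium",
      messages: [
        {
          role: "user",
          content:
            "From the notes below, pick the five most important tech stories. " +
            "Write them for conversational reading, one per line, with a blank " +
            "line between items. No more than five.\n\n" + notes,
        },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function speak(text: string, outPath: string): Promise<void> {
  const voiceId = "YOUR_VOICE_ID"; // placeholder ElevenLabs voice ID
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text, model_id: "eleven_monolingual_v1" }),
    },
  );
  // The endpoint returns raw MP3 bytes.
  await writeFile(outPath, Buffer.from(await res.arrayBuffer()));
}

// Usage: const summary = await summarize(aggregatedNotes);
//        await speak(summary, "briefing.mp3");
```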
The transcript demonstrates this with multiple tech front pages (The Verge, Gizmodo, Wired, CNBC). Screenshots pop up as each site is processed, and the vision model extracts headline-like bullet points even when pages don’t fully load. The Mistral “medium” model streams the final summary, and the audio file is generated afterward. A sample voiceover includes items such as Apple planning to discontinue Apple Watch Series 9 and Watch Ultra 2, an AI facial recognition ban tied to a proposed FTC settlement, and Adobe abandoning its plan to acquire Figma.
A second test shifts the target from news to live sports pages. The same screenshot-and-vision approach is reused, but the prompt changes: extract the score, key statistics, and the best-performing player from each sports page image. The Mistral step then produces a short report suitable for voiceover. The results include a basketball match summary (score plus top player stats) and a football match summary (possession, shots on target, and goal attempts), with the system generating corresponding audio.
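Repointing the pipeline at sports pages is then just a prompt swap; a hypothetical variant of the vision prompt, paraphrased from the video's description:

```ts
// Only the vision prompt changes; capture and summarization stay the same.
// Wording is illustrative, not quoted from the video's code.
const sportsPrompt =
  "You are a web scraper. From the screenshot of this sports page, extract " +
  "the score, the key match statistics, and the best-performing player as " +
  "short bullet points. Output nothing else.";
```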
Overall, the workflow isn’t presented as a replacement for all scraping, but as a flexible alternative when visual rendering matters. The core idea is straightforward: treat web pages as images, extract what matters with vision, then structure and narrate it with an LLM—making it easier to adapt to different sites by swapping prompts rather than rewriting parsing logic.
Cornell Notes
The workflow replaces traditional HTML web scraping with a visual pipeline. Puppeteer renders each target URL and captures screenshots, which are encoded to base64 and sent to a vision model (GPT-4V) with a strict prompt to extract structured bullet points from what’s visible. Those extracted notes are aggregated and passed to a Mistral model to generate a concise, formatted summary (e.g., a top-five tech news list or a short sports report). Finally, ElevenLabs converts the text summary into an MP3 voiceover. The approach aims to be more robust on sites where DOM-based scraping is brittle or where content is difficult to parse reliably.
- Why capture screenshots with Puppeteer instead of using Beautiful Soup and DOM parsing?
- How does the pipeline turn images into usable text and then into a final summary?
- What role does prompt engineering play in making the same system work for different domains?
- What practical settings matter for screenshot-based extraction?
- How is audio output generated from the extracted and summarized text?
Review Questions
- In what ways does screenshot-based extraction reduce brittleness compared with DOM-based scraping?
- How do the prompts differ between the tech-news and sports examples, and what specific outputs do they target?
- What are the main stages of the pipeline from URL list to MP3 file, and what model handles each stage?
Key Points
1. Puppeteer-rendered screenshots can replace HTML parsing when extracting information from websites that are hard to scrape reliably.
2. A stealth plugin and viewport/full-page settings help control rendering and access, but extreme aspect ratios can break screenshot usefulness.
3. GPT-4V vision is used to extract constrained, structured bullet points from screenshots using task-specific prompts.
4. A Mistral model then converts aggregated extracted notes into a clean final format such as a top-five tech news list or a short sports voiceover report.
5. ElevenLabs text-to-speech turns the final LLM output into an MP3 audio briefing.
6. The same screenshot-and-vision pipeline can be repurposed across domains by changing prompts rather than rewriting extraction logic.