Claude 3 Vs Gemini Vs GPT-4: Who Can Make Amazing Powerpoints?
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
LLMs can generate slide content and working code quickly, but they still struggle to deliver consistently sleek, professional design without additional structure.
Briefing
LLMs can reliably generate the *facts* and basic slide structure for a presentation, but they still struggle to produce consistently sleek, polished design—especially when the task requires tight layout control, correct asset usage, and faithful adherence to a specific visual style. That gap matters because many white-collar workflows are dominated by “make the deck” labor: if models can draft content quickly, the remaining bottleneck becomes design quality and iteration, not research.
The exercise starts with a practical test: three leading models—ChatGPT Plus (GPT-4), Claude 3 Opus, and Google Gemini 1.5 Pro—are prompted to generate Python code that programmatically builds a 10-slide deck (two slides per subject) about the “five good emperors” of Rome. The prompts are kept as consistent as possible across models, including the requirement for a creative style “as if it was designed by an advertising agency,” with slide content framed as a pitch deck focused on each emperor’s top achievements.
All three models converge on similar implementation choices at the code level, producing functions that generate slides for Nerva, Trajan, Hadrian, Antoninus Pius, and Marcus Aurelius. But the outputs diverge sharply in the details that affect real-world usability. ChatGPT’s code runs and produces slides that look “normal” and largely boring, with content that appears reasonably accurate—though some rendering issues show up when transferred into Google Slides (e.g., bleeding at edges).
Claude’s result is more visually ambitious in concept: it pulls in Paul Smith–inspired color ideas, but the layout remains underwhelming. Elements such as the emperors’ reign dates overlap, and the overall composition lacks the proportional spacing that makes decks look professional.
Gemini’s code is the least stable initially, requiring multiple rounds of self-correction to fix import errors and other bugs. Even after it runs, the deck quality suffers: it repeats or misplaces content, and it appears to misunderstand the requested context—at one point treating “being one of the five good emperors” as an achievement. The most telling design failure is that Gemini attempts to use images that don’t exist, suggesting that without explicit asset constraints (or provided image files), multimodal or media-heavy slide generation can break down.
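Gemini’s attempt to reference nonexistent images points at a simple guard: verify every asset on disk before placing it on a slide. A standard-library sketch—the helper name and file paths are hypothetical:

```python
from pathlib import Path

def resolve_image(path_str, fallback=None):
    """Return a Path only if the image file actually exists on disk,
    trying an optional fallback before giving up entirely."""
    path = Path(path_str)
    if path.is_file():
        return path
    if fallback is not None and Path(fallback).is_file():
        return Path(fallback)
    return None  # caller should render a text-only slide instead of crashing

# Hypothetical usage: guard every picture insertion behind the check.
img = resolve_image("assets/trajan.jpg", fallback="assets/placeholder.png")
if img is not None:
    pass  # e.g., slide.shapes.add_picture(str(img), left, top) in python-pptx
```

Supplying real image files (or a check like this) in the prompt or harness is one way to impose the “explicit asset constraints” the test suggests are missing.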
To probe whether the design problem is specific to slide code, the test shifts to v0.dev by Vercel, which generates a website layout from a prompt. Stylistically, the website output is more interesting, but it still gets factual details wrong (it omits Nerva and includes Marcus Aurelius’ son, Commodus). The follow-up reprompt—asking it to remake the design using Paul Smith colors—shows how iterative prompting can steer aesthetics, though the available color palette may still be constrained.
The overall takeaway is pragmatic: LLMs are getting better at assembling correct content and generating working templates, but producing consistently “sleek” design requires more than raw generation. The remaining work likely involves agents and systems that can iterate on layout, enforce design rules (colors, spacing, typography), and manage assets—turning deck creation into a controlled design loop rather than a one-shot prompt.
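One way to picture that controlled design loop is to encode the design rules as checkable constraints and regenerate until no violations remain. Everything in this sketch—the palette, the thresholds, and the dict-based slide representation—is an illustrative assumption, not anything shown in the video:

```python
# Illustrative design rules for a "controlled design loop": values are assumptions.
PALETTE = {"F5F1E8", "C8102E", "1A1A1A"}   # allowed hex colors
MIN_MARGIN_IN = 0.5                         # required edge whitespace, in inches
MAX_BULLETS = 5                             # cap on bullets per slide

def violations(slide):
    """Return the design rules a slide breaks; an empty list means it passes."""
    problems = []
    if slide.get("color") not in PALETTE:
        problems.append("off-palette color")
    if slide.get("margin_in", 0.0) < MIN_MARGIN_IN:
        problems.append("content too close to the slide edge")
    if len(slide.get("bullets", [])) > MAX_BULLETS:
        problems.append("too many bullets for one slide")
    return problems

def critique_deck(deck):
    """Collect per-slide violations an agent could feed back as a follow-up prompt."""
    return {i: v for i, slide in enumerate(deck) if (v := violations(slide))}
```

An agent wrapping generation with a critic like this turns one-shot prompting into the iterate-until-clean loop the takeaway describes.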
Cornell Notes
LLMs can draft presentation content and even generate working code for slide decks, but they often fail at the design layer that makes decks look truly professional. In a side-by-side test using ChatGPT Plus (GPT-4), Claude 3 Opus, and Google Gemini 1.5 Pro, all three produced decks about the five good Roman emperors with generally correct names and achievements, yet the visual results varied widely. ChatGPT’s deck ran without fixes but looked plain and standard. Claude leaned into Paul Smith–inspired color ideas but still produced awkward layouts with overlapping elements. Gemini required multiple rounds of error fixing and also struggled with context and asset usage, including attempting to place images that weren’t provided. A website generator (v0.dev) produced more engaging styling but still missed factual details, reinforcing that design iteration and factual grounding remain separate challenges.
- What was the core experiment comparing Claude 3 Opus, GPT-4 (via ChatGPT Plus), and Gemini 1.5 Pro?
- Where did the models converge, and why does that matter?
- How did the three models differ in design and correctness outcomes?
- What did the Google Slides and preview checks reveal?
- Why switch to v0.dev by Vercel, and what did it demonstrate?
- What does the exercise imply about the next step for “amazing” deck generation?
Review Questions
- Which failure modes were most tied to design/layout (overlap, spacing, typography) versus factual/context errors (wrong emperors or wrong achievements)?
- How did missing or unspecified assets (like images) affect Gemini’s slide output, and what prompt changes might prevent that?
- What evidence from the Claude and v0.dev results suggests that iterative reprompting can improve aesthetics, even when facts remain imperfect?
Key Points
1. LLMs can generate slide content and working code quickly, but they still struggle to deliver consistently sleek, professional design without additional structure.
2. ChatGPT Plus (GPT-4) produced decks that ran and looked fairly standard, with some rendering issues when imported into Google Slides.
3. Claude 3 Opus incorporated Paul Smith–inspired color ideas but produced layout problems such as overlapping elements and weak proportional spacing.
4. Gemini 1.5 Pro required multiple rounds of error correction and still showed context/content mistakes and asset-related failures (attempting to use images that weren’t provided).
5. Switching from PowerPoint-style code to v0.dev by Vercel improved visual interest, but factual accuracy still broke (missing Nerva and adding Marcus Aurelius’ son).
6. Iterative prompting and design constraints (e.g., limiting palettes, enforcing layout rules, supplying assets) appear necessary to move from “draft decks” to “agency-grade” outputs.
7. The remaining bottleneck for white-collar deck creation is likely design iteration and control, not just content generation or research.