
Claude 3 Vs Gemini Vs GPT-4: Who Can Make Amazing Powerpoints?

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LLMs can generate slide content and working code quickly, but they still struggle to deliver consistently sleek, professional design without additional structure.

Briefing

LLMs can reliably generate the *facts* and basic slide structure for a presentation, but they still struggle to produce consistently sleek, polished design—especially when the task requires tight layout control, correct asset usage, and faithful adherence to a specific visual style. That gap matters because many white-collar workflows are dominated by “make the deck” labor: if models can draft content quickly, the remaining bottleneck becomes design quality and iteration, not research.

The exercise starts with a practical test: three leading models—ChatGPT Plus (GPT-4), Claude 3 Opus, and Google Gemini 1.5 Pro—are prompted to generate Python code that programmatically builds a 10-slide deck (two slides per subject) about the “five good emperors of Rome.” The prompts are kept as consistent as possible across models, including the requirement that the deck look creative, “as if it was designed by an advertising agency,” with slide content framed as a pitch deck and focused on each emperor’s top achievements.
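The video doesn’t reproduce the generated scripts verbatim, but programmatic PowerPoint generation in Python usually means the python-pptx library. A minimal sketch of the shape such a script tends to take (emperor data, a per-emperor slide function, and an assembly loop) might look like the following; the achievement text and layout choices are illustrative assumptions, not any model’s actual output:

```python
# Hypothetical sketch, assuming python-pptx (pip install python-pptx).
# Two slides per emperor: a title slide and an achievements slide.
from pptx import Presentation
from pptx.util import Pt

EMPERORS = {
    "Nerva": ["Adopted Trajan, securing a stable succession"],
    "Trajan": ["Expanded the empire to its greatest territorial extent"],
    "Hadrian": ["Consolidated the frontiers, including Hadrian's Wall"],
    "Antoninus Pius": ["Presided over decades of peace and stability"],
    "Marcus Aurelius": ["Wrote the Stoic classic Meditations"],
}

def add_emperor_slides(prs, name, achievements):
    """Append a title slide and a bulleted achievements slide for one emperor."""
    title_slide = prs.slides.add_slide(prs.slide_layouts[0])
    title_slide.shapes.title.text = name
    title_slide.placeholders[1].text = "One of the Five Good Emperors of Rome"

    body_slide = prs.slides.add_slide(prs.slide_layouts[1])
    body_slide.shapes.title.text = f"{name}: Top Achievements"
    frame = body_slide.placeholders[1].text_frame
    for achievement in achievements:
        paragraph = frame.add_paragraph()
        paragraph.text = achievement
        paragraph.font.size = Pt(20)

prs = Presentation()
for name, achievements in EMPERORS.items():
    add_emperor_slides(prs, name, achievements)
prs.save("five_good_emperors.pptx")  # ten slides total, two per emperor
```

Running the script and opening the resulting .pptx file (or importing it into Google Slides) is exactly the evaluation step the comparison relies on.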

All three models converge on similar implementation choices at the code level, producing functions that generate slides for Nerva, Trajan, Hadrian, Antoninus Pius, and Marcus Aurelius. But the outputs diverge sharply in the details that affect real-world usability. ChatGPT’s code runs and produces slides that look “normal” and largely boring, with content that appears reasonably accurate—though some rendering issues show up when transferred into Google Slides (e.g., bleeding at edges).

Claude’s result is more visually ambitious in concept: it pulls in Paul Smith–inspired color ideas, but the layout remains underwhelming. Elements such as the emperors’ reign dates overlap with other content, and the overall composition lacks the kind of proportional spacing that makes decks look professional.

Gemini’s code is the least stable initially, requiring multiple rounds of self-correction to fix import and other errors. Even after it runs, the deck quality suffers: it repeats or misplaces content, and it appears to misunderstand the requested context—at one point treating “being one of the five good emperors” as an achievement. The most telling design failure is that Gemini attempts to use images that don’t exist, suggesting that without explicit asset constraints (or provided image files), multimodal or media-heavy slide generation can break down.
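A small defensive pattern guards against exactly this failure mode: verify that each referenced asset exists on disk before inserting it, and degrade gracefully if it doesn’t. This is an illustrative sketch under the same python-pptx assumption (the add_portrait helper and its dimensions are hypothetical), not something shown in the video:

```python
# Illustrative asset guard: insert an image only if the file exists,
# otherwise drop in a visible placeholder instead of broken output.
import os
from pptx.util import Inches

def add_portrait(slide, image_path, left=Inches(0.5), top=Inches(1.5)):
    if os.path.exists(image_path):
        slide.shapes.add_picture(image_path, left, top, height=Inches(3))
    else:
        # Make the missing asset obvious rather than failing silently.
        box = slide.shapes.add_textbox(left, top, Inches(3), Inches(0.5))
        box.text_frame.text = f"[missing asset: {image_path}]"
```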

To probe whether the design problem is specific to slide code, the test shifts to v0.dev by Vercel, which generates a website layout from a prompt. Stylistically, the website output is more interesting, but it still gets factual details wrong (it omits Nerva and includes Marcus Aurelius’ son). The follow-up reprompt—asking it to remake the design using Paul Smith colors—shows how iterative prompting can steer aesthetics, though the color palette may still be constrained.

The overall takeaway is pragmatic: LLMs are getting better at assembling correct content and generating working templates, but producing consistently “sleek” design requires more than raw generation. The remaining work likely involves agents and systems that can iterate on layout, enforce design rules (colors, spacing, typography), and manage assets—turning deck creation into a controlled design loop rather than a one-shot prompt.

Cornell Notes

LLMs can draft presentation content and even generate working code for slide decks, but they often fail at the design layer that makes decks look truly professional. In a side-by-side test using ChatGPT Plus (GPT-4), Claude 3 Opus, and Google Gemini 1.5 Pro, all three produced decks about the five good Roman emperors with generally correct names and achievements, yet the visual results varied widely. Claude leaned into Paul Smith–inspired color ideas but still produced awkward layouts with overlapping elements. Gemini required multiple rounds of error fixing and also struggled with context and asset usage, including attempting to place images that weren’t provided. A website generator (v0.dev) produced more engaging styling, but still missed factual details, reinforcing that design iteration and factual grounding remain separate challenges.

What was the core experiment comparing Claude 3 Opus, GPT-4 (via ChatGPT Plus), and Gemini 1.5 Pro?

Each model was asked to generate Python code that programmatically creates a 10-slide pitch deck (two slides per subject) about the five good emperors of Rome. The prompts were kept largely consistent across models, including requirements for a creative, advertising-agency-like style and slide content focused on each emperor’s top achievements. The results were then run and evaluated visually via previews and by importing into Google Slides.

Where did the models converge, and why does that matter?

They converged on similar code-level structure: each produced functions that generate slides for the same set of emperors (Nerva, Trajan, Hadrian, Antoninus Pius, Marcus Aurelius). That convergence matters because it suggests LLMs can reliably handle the “plumbing” of slide generation and content assembly. The remaining differentiator becomes design quality—layout, typography, spacing, and asset handling.

How did the three models differ in design and correctness outcomes?

ChatGPT’s deck ran successfully and looked fairly standard, with some edge bleeding in Google Slides. Claude incorporated Paul Smith–inspired color ideas but produced poor layout behavior, including overlapping elements. Gemini initially produced code with errors that required repeated fixes, and even after correction it showed content/context issues (e.g., treating an emperor’s “being one of the five good emperors” as an achievement) and attempted to use images that weren’t available.

What did the Google Slides and preview checks reveal?

Visual inspection showed that even when code executes, rendering and layout can degrade. ChatGPT’s slides had edge bleeding when imported. Claude’s slides had color cues but lacked proportional layout, with overlapping elements. Gemini’s slides showed repeated, missing, or incorrect content and likely suffered from missing assets (images), which would require explicit image inputs or stronger asset constraints.
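Overlaps of this kind usually come from free-floating text boxes with hand-picked coordinates. One hedged illustration of how a generation pipeline could enforce proportional spacing (not a technique shown in the video) is to derive each element’s position from the previous one, so boxes cannot collide:

```python
# Sketch of grid-style placement with python-pptx: each box's top edge is
# computed from the previous box's bottom edge plus a fixed gap, so text
# boxes stack cleanly instead of overlapping. Measurements are assumptions.
from pptx.util import Inches

MARGIN = Inches(0.5)
GAP = Inches(0.2)

def stack_textboxes(slide, texts, width=Inches(9), height=Inches(0.8)):
    top = MARGIN
    for text in texts:
        box = slide.shapes.add_textbox(MARGIN, top, width, height)
        box.text_frame.text = text
        box.text_frame.word_wrap = True
        top = top + height + GAP  # next box starts below this one
    return top  # caller can keep stacking from here
```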

Why switch to v0.dev by Vercel, and what did it demonstrate?

The switch tested whether the design bottleneck was specific to PowerPoint-style slide code. v0.dev generated a website with sections and profiles that looked more stylistically interesting, but it still made factual mistakes (it omitted Nerva and added Marcus Aurelius’ son). A reprompt to use Paul Smith colors showed that iterative prompting can steer aesthetics, but it also suggested palette constraints and incomplete adherence to the intended visual system.

What does the exercise imply about the next step for “amazing” deck generation?

It implies that one-shot prompting isn’t enough for sleek design. Achieving professional results likely requires an agentic workflow: enforce design rules (color limits, typography, spacing), manage assets explicitly, and iterate based on rendered output. In other words, the problem shifts from generating content to running a controlled design loop that can correct layout and styling until it meets human expectations.
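As one entirely hypothetical sketch of what that loop could look like: generate a deck, check it against machine-verifiable design rules, and feed any violations back into the next prompt. The rule check and the LLM-backed generate_deck function below are stand-ins, not an implementation from the video:

```python
# Hypothetical controlled design loop. generate_deck(prompt) is assumed to
# call an LLM and return a python-pptx Presentation; check_rules encodes
# whatever design rules the pipeline enforces (here: no vertical overlap).

MAX_ITERATIONS = 5

def check_rules(prs):
    """Return human-readable violations of the deck's design rules."""
    violations = []
    for i, slide in enumerate(prs.slides, start=1):
        spans = sorted(
            (shape.top, shape.top + shape.height)
            for shape in slide.shapes
            if shape.has_text_frame
            and shape.top is not None
            and shape.height is not None
        )
        for (_, bottom), (next_top, _) in zip(spans, spans[1:]):
            if next_top < bottom:  # next box starts above this one's end
                violations.append(f"slide {i}: overlapping text boxes")
                break
    return violations

def design_loop(prompt, generate_deck):
    for _ in range(MAX_ITERATIONS):
        prs = generate_deck(prompt)
        violations = check_rules(prs)
        if not violations:
            return prs
        # Feed concrete failures back so the next generation can repair them.
        prompt += "\nFix these layout problems: " + "; ".join(violations)
    return prs  # best effort once the iteration budget is spent
```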

Review Questions

  1. Which failure modes were most tied to design/layout (overlap, spacing, typography) versus factual/context errors (wrong emperors or wrong achievements)?
  2. How did missing or unspecified assets (like images) affect Gemini’s slide output, and what prompt changes might prevent that?
  3. What evidence from the Claude and v0.dev results suggests that iterative reprompting can improve aesthetics, even when facts remain imperfect?

Key Points

  1. LLMs can generate slide content and working code quickly, but they still struggle to deliver consistently sleek, professional design without additional structure.

  2. ChatGPT Plus (GPT-4) produced decks that ran and looked fairly standard, with some rendering issues when imported into Google Slides.

  3. Claude 3 Opus incorporated Paul Smith–inspired color ideas but produced layout problems such as overlapping elements and weak proportional spacing.

  4. Gemini 1.5 Pro required multiple rounds of error correction and still showed context/content mistakes and asset-related failures (attempting to use images that weren’t provided).

  5. Switching from PowerPoint-style code to v0.dev by Vercel improved visual interest, but factual accuracy still broke (missing Nerva and adding Marcus Aurelius’ son).

  6. Iterative prompting and design constraints (e.g., limiting palettes, enforcing layout rules, supplying assets) appear necessary to move from “draft decks” to “agency-grade” outputs.

  7. The remaining bottleneck for white-collar deck creation is likely design iteration and control, not just content generation or research.

Highlights

All three models could assemble the right emperor set and generate slide code, but none delivered consistently polished, agency-level layout on the first pass.
Claude’s Paul Smith–inspired colors showed up, yet the deck still suffered from overlapping elements—proof that color alone doesn’t equal good design.
Gemini’s biggest practical weakness wasn’t only code errors; it also tried to place images that didn’t exist and misframed achievements in context.
v0.dev by Vercel produced more compelling styling than the slide-code approach, but it still missed key factual details, underscoring that aesthetics and accuracy are separate problems.

Topics

  • LLM Slide Generation
  • Python Deck Code
  • Design Constraints
  • Agentic Iteration
  • v0.dev Websites
