
Dall-E 2 VS Stable Diffusion - Direct Text to Image AI Comparison

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Runway-style text-to-video editing is already demonstrating background replacement and scene transformation while tracking subjects, suggesting major automation pressure on some VFX tasks.

Briefing

Text-to-video editing is moving fast enough to redraw the VFX job landscape, while the day-to-day choice between text-to-image tools still comes down to access, restrictions, and how faithfully each model follows prompts. A key example comes from a Runway Machine Learning demo: a tennis-court scene can be transformed into entirely different environments—switching the ground to a sandy beach, then changing the background to the moon, and later to Mars—while tracking the subject and preserving elements like shadows. The same workflow is shown shifting scenes into winter forests, enchanted forests with glowing mushrooms, and alien desert landscapes. That kind of background replacement and scene re-synthesis is already close to “edit the world, not just the pixels,” raising fears that some VFX tasks could be automated, even if other work will still require human direction.

Against that backdrop, the transcript pivots to a practical head-to-head: DALL·E 2 versus Stable Diffusion, using identical prompts and comparing both “capability” and “quality” dimensions. Stable Diffusion is positioned as the more accessible platform: it’s open source, expected to run on a home GPU with under 10 GB of VRAM, and fast, generating images in around five seconds on the web app in its current state. DALL·E 2 is described as closed and not something users can run locally, with generation times typically around 10–15 seconds depending on server load. Pricing and usage also diverge: Stable Diffusion is free to run locally, while its web app is expected to cost money (a rumored figure of about $5/month is mentioned); DALL·E 2 is pay-per-prompt with a minimum purchase tier (about $15 for 115 prompts, with four images per prompt).
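
The transcript treats “runs on a home GPU with under 10 GB of VRAM” as Stable Diffusion’s headline advantage. As an illustration only, here is a minimal sketch of what that local workflow can look like using the Hugging Face diffusers library and the CompVis v1-4 weights; neither the library nor the exact model version is named in the video, so treat both as assumptions:

```python
# Minimal local Stable Diffusion sketch (assumes the Hugging Face
# "diffusers" library and the CompVis v1-4 weights; the transcript only
# claims the model is open source and runs on a home GPU < 10 GB VRAM).
import torch
from diffusers import StableDiffusionPipeline

# Half precision roughly halves VRAM use, which is how consumer cards
# stay under the ~10 GB figure cited in the transcript.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# One of the prompts from the video's head-to-head test.
prompt = ("a ginger cat with a white chest and paws "
          "yawning and stretching on a windowsill")
image = pipe(prompt).images[0]
image.save("ginger_cat.png")
```

There is no equivalent local path for DALL·E 2, which is exactly the transcript’s point: its generation happens only on OpenAI’s servers.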

Restrictions are a major differentiator. Stable Diffusion is described as having very low restrictions in the beta—no NSFW content like nudity or extreme gore. DALL·E 2 is portrayed as much stricter, with banned words and a “G-rated” approach, including limits around celebrities and copyrighted material. The transcript also notes that DALL·E 2 struggles with certain prompt types: it can’t reliably generate specific famous people (the example test uses “movie still Walter White from Breaking Bad”), and it tends to produce weaker or inconsistent results for text rendering.

In the prompt-by-prompt comparison, DALL·E 2 often wins on coherence and “photography-like” consistency—especially for character actions (like a ginger cat yawning and stretching) and for prompts that benefit from a more literal photographic style. Stable Diffusion more often surprises with creative variations and, in some cases, better scene elements (like a “shih tzu on a pirate ship” where upscaling artifacts hurt DALL·E 2’s clarity). For other prompts—like “tsunami in a jar”—both models miss parts of the instruction, but Stable Diffusion is favored in at least one of the tested outputs.

Overall, the transcript lands on a nuanced conclusion: the two systems are close enough that prompt choice matters. Stable Diffusion is favored for flexibility, local use, and fewer restrictions; DALL·E 2 is favored when users want tighter coherence and certain prompt interpretations, while accepting higher constraints and less control over deployment.

Cornell Notes

The transcript compares DALL·E 2 and Stable Diffusion through both platform-level factors (access, cost, restrictions, hardware) and a prompt-by-prompt image quality test. Stable Diffusion is presented as open source and runnable on a home GPU under 10 GB VRAM, with fast generation times (about five seconds on the web app). DALL·E 2 is described as closed, server-based, slower (often 10–15 seconds), and more restrictive, including banned words and limits around celebrities/copyright. In the head-to-head prompts, DALL·E 2 frequently produces more coherent, photo-like results, while Stable Diffusion sometimes delivers stronger creative takes and better clarity in certain cases. The practical takeaway: neither model dominates universally; the “best” choice depends on the prompt and what tradeoffs matter most.

What platform differences most affect everyday users choosing between Stable Diffusion and DALL·E 2?

Stable Diffusion is positioned as open source and runnable locally on a home GPU with under 10 GB of VRAM, enabling offline workflows and custom apps. DALL·E 2 is described as closed and not available for local GPU execution, relying on server-side generation. The transcript also contrasts expected generation speed (Stable Diffusion ~5 seconds on the web app; DALL·E 2 often ~10–15 seconds depending on load) and cost structure (Stable Diffusion free locally, with a rumored ~$5/month web app; DALL·E 2 priced per prompt with a minimum purchase tier).
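
The cost structure lends itself to a quick worked comparison. Using only the figures mentioned in the transcript (a ~$15 minimum tier, 115 prompts, four images per prompt; these are the video’s claims, not official pricing):

```python
# Per-prompt and per-image cost implied by the transcript's DALL-E 2
# figures: a ~$15 minimum tier buying 115 prompts, 4 images per prompt.
tier_price_usd = 15.00
prompts_per_tier = 115
images_per_prompt = 4

cost_per_prompt = tier_price_usd / prompts_per_tier    # ~$0.130
cost_per_image = cost_per_prompt / images_per_prompt   # ~$0.033

print(f"cost per prompt: ${cost_per_prompt:.3f}")
print(f"cost per image:  ${cost_per_image:.3f}")
```

That works out to roughly three cents per image, against free local generation or the rumored ~$5/month web app on the Stable Diffusion side.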

How do content restrictions differ, and why does that matter for prompt writing?

Stable Diffusion is described as having very low restrictions in beta, with the main limitation being no NSFW output like nudity or extreme gore. DALL·E 2 is described as having high restrictions: banned words (the word “shot” is cited as an example) and a general push toward “G-rated” results, including limits around celebrities and copyrighted material. This affects what users can safely type and what kinds of scenes or wording are likely to be blocked or altered.

Why does the “Walter White” test favor Stable Diffusion?

The transcript uses “movie still Walter White from Breaking Bad in a lab coat holding a beaker of green liquid” to probe famous-character fidelity. Stable Diffusion is described as generating Walter White consistently, including lab coat and green liquid, with strong overall resemblance. DALL·E 2 is described as failing to recreate Walter White specifically because famous-person likenesses were stripped from its training data; outputs become “science-looking men” that only partially resemble the character (e.g., glasses, balding, mustache/beard) but not the full distinct face.

In the “tsunami in a jar” prompt, what kinds of failures show up for both models?

Both models are said to match the general mood but miss key elements. Stable Diffusion outputs are described as having better studio lighting and water-splash behavior in one case, yet they sometimes omit the jar entirely. DALL·E 2 is described as producing a jar more often, but the tsunami effect can be unclear (and one output is described as having orange water, which doesn’t fit the prompt). The transcript concludes that neither output is perfect, but Stable Diffusion is favored in at least one tested result for including the tsunami-like splash while keeping the jar present.

What does the ginger cat prompt reveal about action and coherence?

The prompt specifies a ginger cat with a white chest and paws yawning and stretching on a windowsill. Stable Diffusion is described as producing a coherent ginger cat but not always capturing the exact action (yawning/stretching). DALL·E 2 is described as better at matching the action—showing yawning and stretching more clearly—though faces can still look mish-mashed. Across multiple variations, DALL·E 2 is repeatedly favored for action fidelity and overall coherence.

Why does the shih tzu on a pirate ship comparison lean toward Stable Diffusion?

The transcript highlights upscaling/clarity problems in DALL·E 2 outputs: pirate ships can become fuzzy or clay-like, and faces may look mish-mashed or artifact-heavy. Stable Diffusion outputs are described as clearer and more realistic-looking, with the dog’s face and the pirate-ship scene holding together better across variations. The comparison calls this the most surprising win for Stable Diffusion, largely due to DALL·E 2’s quality degradation.

Review Questions

  1. Which factor matters more for you—local hardware control, generation speed, or content restrictions—and how would that choice change your prompt strategy?
  2. Pick one prompt from the transcript (e.g., Walter White, tsunami in a jar, ginger cat yawning). What specific prompt element did each model fail or succeed on?
  3. How do the transcript’s examples suggest each model handles “coherence” versus “creative variation,” and what tradeoff does that imply for future prompt writing?

Key Points

  1. Runway-style text-to-video editing is already demonstrating background replacement and scene transformation while tracking subjects, suggesting major automation pressure on some VFX tasks.
  2. Stable Diffusion is presented as open source and runnable on a home GPU under 10 GB VRAM, while DALL·E 2 is described as closed and server-based.
  3. Stable Diffusion is described as faster on the web app (around five seconds), while DALL·E 2 often takes longer (about 10–15 seconds depending on load).
  4. Stable Diffusion is described as having low restrictions in beta (no NSFW like nudity or extreme gore), whereas DALL·E 2 has high restrictions including banned words and tighter content rules.
  5. In prompt tests, DALL·E 2 often wins on coherence and photo-like consistency, especially for action prompts like a cat yawning and stretching.
  6. Stable Diffusion is repeatedly favored for creative variation and, in some cases, clearer results when DALL·E 2 suffers from artifacts or upscaling problems.
  7. Famous-character prompts (e.g., Walter White) are described as a major weakness for DALL·E 2, while Stable Diffusion is described as handling them more directly.

Highlights

A Runway demo shows a single recorded tennis-court scene being re-edited into radically different worlds—beach, moon, Mars—while keeping the subject and shadows aligned.
Stable Diffusion’s open-source, local-GPU promise (under 10 GB VRAM) is positioned as the biggest practical advantage over DALL·E 2’s closed, server-only workflow.
DALL·E 2’s restrictions are illustrated not just by topic limits but by banned words (the transcript cites “shot” as an example).
In the “Walter White” test, Stable Diffusion is described as producing the character reliably, while DALL·E 2 outputs only partial, off-brand resemblance.
The “shih tzu on a pirate ship” comparison is framed as a clarity/quality win for Stable Diffusion, with DALL·E 2 outputs suffering from fuzzy or clay-like artifacts.
