I Tried All The AI Video Services So You Don't Have To

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

AI video tools often deliver realistic motion and faces, but they frequently miss specific prompt constraints like exact character likeness and prop placement.

Briefing

AI video tools can produce strikingly realistic clips fast, but they also struggle with basic prompt fidelity, consistency, and cost control. Across multiple services, the most reliable outcome wasn't "perfectly generated cinema" but a messy mix of partial wins (coherent motion, good audio, decent faces) and bizarre failures (wrong subjects, warped anatomy, nonsensical perspective, and content that triggers refusals). The practical takeaway: results depend less on clever wording than on which platform you're using, how long you're willing to wait, and how much you're prepared to pay per attempt.

The testing began with a celebrity mashup prompt: "Generate a 10-second video of Will Smith enjoying freshly cooked pasta while listening to Eminem." One service produced something usable but visibly off-target; most notably, the generated scene included a problematic "hand in the spaghetti" detail among other mismatches. Another platform's output was slower and stranger, with motion and framing that felt like a commercial or a surreal stare rather than the intended moment. A third tool generated results faster but leaned into absurdity, delivering a clip that looked more like a broken proof-of-concept than a polished narrative.

As prompts escalated, the gaps widened. A “yoga ball fighting a guy with a mustache in a hoodie” request highlighted how some tools can nail comedic timing and recognizable action beats (including reflections and mirror-like effects), while others stall in the queue or return visually confusing scenes where objects and body logic don’t hold together. The creator repeatedly adjusted expectations using a tier-list approach that weighed both quality and generation time, because a “great” result that takes too long or costs too much can still lose in practice.

The strongest contrast came from repeated patterns: some services delivered coherent motion and sound more often, while others either failed to match key constraints (like specific characters or consistent visual continuity) or produced outputs that were “either horse crap or not too bad,” with few middle-ground successes. One platform’s results were described as “shockingly good” for certain prompts, while another was criticized as expensive and slow—sometimes producing realistic-looking clips but with distracting errors that made the outcome feel uncanny or wrong.

The experiment also turned into a social prompt roulette. Viewers supplied prompts under time and subscription constraints, and the random selection process generated a stream of increasingly unhinged ideas—mustache characters, mayonnaise-heavy scenarios, anime waifus, and absurd “cinema” prompts. Some requests were refused for policy reasons (notably when content violated community guidelines), and others slipped through but produced unsettling or grotesque imagery. Even when the results were funny, they often revealed the same underlying limitation: these systems can remix style and motion convincingly while still missing the “story logic” implied by the prompt.

By the end, the testing wasn’t about declaring a single winner for all cases. It was about identifying which tools were best for speed, which were best for coherence, and which were best avoided when cost and latency mattered. The final vibe: AI video generation is entertaining and occasionally impressive, but prompt fidelity, consistency, and platform economics remain the real bottlenecks.

Cornell Notes

Multiple AI text-to-video services were stress-tested with celebrity, action, and comedic prompts to see which platforms deliver usable results without excessive delay or cost. Outputs varied sharply: some tools produced fast, chaotic clips; others generated more coherent scenes but were slow or expensive. Prompt fidelity often broke down—characters, props, and even basic visual logic (like hands, reflections, and anatomy) could be wrong despite realistic rendering. The practical lesson was to judge by both quality and generation time, since “best-looking” clips can lose if they’re too costly or unreliable. The prompt roulette with viewer suggestions further showed that policy refusals and bizarre remixes are common when prompts push boundaries.

Why did the Will Smith + Eminem pasta prompt produce mixed results across services?

The intended scene was straightforward: Will Smith enjoying freshly cooked pasta while listening to Eminem. Some services returned realistic-looking faces and motion, but key details still drifted—most notably an unnatural “hand in the spaghetti” artifact and other mismatches. Other platforms either took longer to generate or produced surreal framing that didn’t match the “iconic moment” vibe, showing that even when the general theme is captured, specific constraints (exact characters, consistent prop placement, and coherent action) often fail.

How did the tester decide rankings when generation time and cost differed?

The ranking method explicitly weighed both quality and time. A service that eventually produced something impressive could still rank lower if it was slow or expensive enough to make iteration impractical. Conversely, a faster tool that produced consistently “good enough” results could rank higher even if the clips were more ridiculous or less faithful to the prompt.
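
The video doesn't give an explicit formula for this weighting, but the trade-off is easy to make concrete. The sketch below is purely illustrative: the weights, the Attempt fields, and the example numbers are assumptions for demonstration, not figures from the tests.

    from dataclasses import dataclass

    @dataclass
    class Attempt:
        service: str
        quality: float    # subjective quality, 0.0 ("horse crap") to 1.0 ("shockingly good")
        latency_s: float  # wall-clock seconds to generate the clip
        cost_usd: float   # price charged per generation attempt

    def score(a: Attempt, w_quality=0.5, w_speed=0.3, w_cost=0.2) -> float:
        """Weighted score, higher is better. Weights are illustrative guesses."""
        speed = 1.0 / (1.0 + a.latency_s / 60.0)  # decays as generation time grows
        cheap = 1.0 / (1.0 + a.cost_usd)          # decays as per-attempt price grows
        return w_quality * a.quality + w_speed * speed + w_cost * cheap

    # A slow "great" clip can lose to a fast "good enough" one:
    slow_great = Attempt("Service A", quality=0.9, latency_s=600, cost_usd=2.00)
    fast_okay  = Attempt("Service B", quality=0.7, latency_s=45,  cost_usd=0.20)
    print(score(slow_great))  # ~0.54
    print(score(fast_okay))   # ~0.69

Under a scoring rule like this, the faster, cheaper service wins despite its lower raw quality, which matches the tier-list behavior described above.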

What did the yoga ball fighting prompt reveal about motion and visual logic?

The yoga ball battle request highlighted how some tools can generate comedic action beats (including reflections/mirror-like effects and a sense of fighting), while others struggle with object and body consistency. Even when the clip looked dynamic, the physical plausibility could be off—weights and gym objects could appear wrong, and the choreography could reduce to repetitive stances or confusing perspectives.

Why did viewer prompts like “anime waifu” and mayonnaise lead to refusals or unsettling outputs?

Prompts that combined sexualized framing (e.g., "voluptuous anime waifu") with explicit or fetish-adjacent elements (like mayonnaise eating) sometimes triggered community-guideline refusals on at least one platform. Other services generated content but with disturbing or uncanny visuals (warped faces, exaggerated expressions, surreal prop behavior), demonstrating that policy enforcement and content safety vary by platform and that "allowed" content can still look grotesque.

What practical limitation did the tester keep returning to: can these tools refine a concept over multiple generations?

A recurring complaint was that outputs often behaved like separate, self-contained attempts rather than iterative improvements. The tester suggested that once a clip is generated, it’s not easy to treat it as a baseline and “make it better” in a controlled way, which makes experimentation costly when prompts don’t land on the first try.

What did the “mustache man” and “soy latte developer” prompts demonstrate about creativity vs. control?

These prompts produced highly stylized, meme-like scenes—often “cinematic” in framing—showing the systems’ ability to remix themes and generate entertaining absurdity. But the tester still noted that the results could be inconsistent: characters might morph, details might drift, and the story logic could collapse into surreal imagery rather than the intended narrative.

Review Questions

  1. Which factors mattered most in the tier-list: prompt accuracy, visual realism, generation speed, or cost—and how did those trade off against each other?
  2. Give an example of a prompt detail that was not reliably preserved across services (character identity, prop placement, or action logic). What happened instead?
  3. Why do policy refusals and “allowed but unsettling” outputs both matter when evaluating AI video tools for real use?

Key Points

  1. AI video tools often deliver realistic motion and faces, but they frequently miss specific prompt constraints like exact character likeness and prop placement.
  2. Generation speed and per-clip cost can outweigh raw visual quality when iterating on prompts.
  3. A practical ranking should weigh both quality and latency; a slow "best" result may be less useful than a faster "good enough" one.
  4. Prompt fidelity tends to degrade under complex, multi-constraint requests (celebrity + specific action + specific audio + specific props).
  5. Some prompts trigger community-guideline refusals, while others pass but still produce uncanny or disturbing imagery.
  6. Viewer-driven prompt roulette shows that these systems can be entertaining and meme-ready, yet control and consistency remain weak for narrative accuracy.
  7. Many outputs behave like standalone remixes rather than controllable iterations, making refinement difficult and expensive.

Highlights

The Will Smith + Eminem pasta test produced realistic-looking clips but with notable prompt drift—especially unnatural prop/hand placement—undercutting “iconic moment” accuracy.
The yoga ball fight prompt exposed a recurring pattern: some services can nail comedic action beats and reflections, while others return confusing physical logic and repetitive stances.
The anime waifu + mayonnaise prompts demonstrated both policy enforcement (generation failures) and platform-dependent “allowed” outputs that can still look grotesque or unsettling.
A tier-list approach based on both quality and generation time captured the real-world tradeoff: speed and cost often decide the winner, not just aesthetics.
