Four Video Models vs. Real Use Cases | End of Year Mega Test
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
AI video quality in late 2025 isn’t just about “best visuals”: it’s about tradeoffs between controllability, native audio, and how reliably models handle physics-heavy details like reflections, motion, and anatomy. Across multiple standardized prompts and reference-image tests, Hailuo 2.3 (MiniMax) repeatedly landed as the most usable option overall, especially when camera movement and real-world optics mattered. Veo 3.1 (Google) emerged as the top pick when native audio is required, while Sora 2 (OpenAI) delivered strong cinematic motion but struggled with reference-image restrictions and occasional low-resolution/consistency issues. LTX2 (LTX AI) lagged in professional reliability despite promising features like native 1080p and plans for open weights.
The showdown began with a single-dancer “K-pop inspired squid dance” prompt. Veo 3.1 produced the most coherent interpretive dance, with Sora 2 close behind. Hailuo 2.3 showed better dance fidelity and detail but suffered noticeable anatomical morphing when the character’s body orientation became complex. LTX2’s concept was understandable, yet anatomy and motion quality were poor enough to place it last.
A second test used a reference image: a “fruit themed anime duel” (lemon vs. banana) with “three clean hits” and smear frames. Veo 3.1 again won on animation clarity and hit timing. Hailuo 2.3 stayed consistent but leaned more toward Flash-style motion than anime-style frame behavior, with “mushiness” during clashes. Sora 2 introduced creative changes to character design and even added extra fingers, making the action harder to read. LTX2’s fighting was too fast and visually unclear to judge effectively.
The most decisive differences showed up in physics and camera-control challenges. In a horror-style scene (rain, windshield reflections, and a distorted lemon-tree creature), Hailuo 2.3 handled reflections and camera push-ins best and kept the creature from incorrectly “glitching” onto the pickup truck, an error that repeatedly appeared with Veo 3.1. Sora 2 was blocked from using the original reference because it included a person, forcing it into a weaker, less controllable recreation. LTX2 managed reflections better than Veo 3.1 in places, but overall usability still trailed.
Product and camera-command tests reinforced the same pattern. For a glossy 360° Crocs product prompt, Hailuo 2.3 and Sora 2 were both strong, with Sora 2 offering native audio but also more subtle, artifact-prone output. In camera-control prompts (tracking shots, zooms, and lens-like framing), Hailuo 2.3 adhered most faithfully to the intended shot composition, while Veo 3.1 sometimes added unwanted elements (like trees growing into frame) and LTX2 frequently broke coherence.
In the final cyberpunk action scenario, Hailuo 2.3 again felt most cinematic and consistent with the reference, while Veo 3.1 offered native audio but less convincing scene structure. Sora 2 performed well only when it wasn’t constrained by reference-image rules, and LTX2 remained inconsistent.
By the end, the practical recommendation was clear: choose Hailuo 2.3 for the highest overall cinematic usability and control (especially reflections and camera behavior), choose Veo 3.1 when native audio is non-negotiable, and treat Sora 2 as a strong but more workflow-constrained option due to reference limitations and occasional resolution/consistency issues. LTX2 was viewed as promising for the future, particularly if open weights arrive, but not yet dependable enough for professional production workflows.
Cornell Notes
Late-2025 AI video quality hinges on tradeoffs: controllability, native audio, and physics reliability. Hailuo 2.3 (MiniMax) repeatedly produced the most usable cinematic results, especially with reflections, camera push-ins, and scene coherence when reference images were allowed. Veo 3.1 (Google) often won when native audio mattered and when motion timing needed to stay readable, but it struggled with reflection/physics in some reference-based horror scenes. Sora 2 delivered strong cinematic motion and native audio, yet reference-image restrictions (people in the reference) and occasional low-resolution/artifact issues reduced consistency. LTX2 had native audio and native 1080p, but anatomy, coherence, and action readability lagged behind the top contenders.
- Why did Hailuo 2.3 come out on top overall in the showdown?
- What made Veo 3.1 the go-to choice when native audio is required?
- How did Sora 2’s reference-image limitation affect performance?
- What were the main weaknesses of LTX2 in these comparisons?
- What did the product test suggest about each model’s handling of text/logos and 360° rotation?
Review Questions
- In which specific test did Veo 3.1’s reflection/physics errors become a dealbreaker, and what did Hailuo 2.3 do differently?
- What workflow constraint made Sora 2 less suitable for reference-image-driven projects, and how did that show up in the horror scene?
- Why did LTX2 lose despite having native 1080p and native audio, and what recurring failure modes affected usability?
Key Points
1. Hailuo 2.3 (MiniMax) was the most consistently usable model across physics-heavy and camera-control tasks, especially windshield reflections and scene coherence.
2. Veo 3.1 (Google) was the strongest practical choice when native audio generation is required, even though some physics details can fail under reference constraints.
3. Sora 2 can look cinematic and generate native audio, but reference-image restrictions involving people can exclude it from the most controlled tests.
4. LTX2’s native 1080p and native audio didn’t translate into reliable anatomy, action readability, or coherence for professional workflows.
5. Reference-image tests exposed the biggest differences: models that handled reflections and camera movement correctly were far more usable than those that only looked good in isolated moments.
6. For product-style prompts with rotation and text/logos, Hailuo 2.3 and Sora 2 both performed well, but both could hallucinate or mis-render text on the back angles.
7. A practical production strategy could combine models: use Hailuo 2.3 for visuals/control and Veo 3.1 for native audio when sound is essential.