The First AI Art Generator That Can Spell: New FREE Open Source AI Art Generator
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Text-to-image AI has long struggled with one basic requirement: producing readable, correctly spelled words. Midjourney V4 can generate beautiful images, but its lettering often turns into gibberish or near-misses, prompting a years-long search for models that can actually spell. Google's earlier Parti line of models showed incremental progress (from partial letters to occasional full words at very large parameter counts), but access was restricted, leaving most users stuck with models that can't reliably write.
A new candidate aims to change that: Google's Muse, a masked modeling system that works in discrete token space and is positioned as a more efficient alternative to pixel-space diffusion approaches such as DALL·E 2 and Imagen. Instead of generating images pixel by pixel, it predicts randomly masked image tokens, guided by text embeddings extracted from a pre-trained large language model. Google claims state-of-the-art image generation performance, specifically including coherent spelling, while also reducing generation time through parallel decoding: many masked tokens are predicted simultaneously rather than one at a time. The approach also supports mask-free editing workflows: after an initial image is produced, the system can iteratively resample image tokens conditioned on a new text prompt to modify objects, swap elements (like changing the food next to a latte), or alter scene details (such as changing a background from New York to Paris or adding hot air balloons).
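The "predict masked tokens, commit the confident ones in parallel" loop can be sketched with a toy stand-in for the real model. Everything here is illustrative: `toy_predictor`, the tiny codebook, and the confidence scores are invented placeholders rather than Google's actual architecture; only the shape of the loop (start fully masked, commit the highest-confidence predictions in parallel each step) reflects the approach described above.

```python
import random

MASK = -1
VOCAB = list(range(16))  # tiny pretend codebook of discrete image tokens


def toy_predictor(tokens, prompt):
    """Stand-in for the learned model: propose a (token, confidence) pair
    for every currently masked position, conditioned on the prompt."""
    rng = random.Random(sum(map(ord, prompt)) + tokens.count(MASK))
    return {i: (rng.choice(VOCAB), rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def parallel_decode(prompt, length=12, steps=4):
    """Start fully masked; each step commits a batch of the most confident
    proposals in parallel instead of decoding one token at a time."""
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_predictor(tokens, prompt)
        if not proposals:
            break
        # Commit a growing fraction each step; the last step commits all.
        keep = max(1, len(proposals) // (steps - step))
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:keep]
        for i, (tok, _conf) in best:
            tokens[i] = tok
    return tokens


grid = parallel_decode("a latte with the word 'hello' in latte art")
```

Editing works the same way in reverse: re-mask the tokens covering the region you want to change and rerun the loop with the new prompt, leaving the rest of the grid fixed.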
The most immediately actionable news in the transcript comes from DeepFloyd, a research lab working in partnership with Stability AI. DeepFloyd announced a model called IF that produces sharply legible text in generated images; examples include clear, Photoshop-like lettering and consistent word shapes across multiple prompts. The announcement is framed around a playful poem ("if I were a model, I'd be open source"), and the model is presented as likely to be open source, with free variants available through websites, much as Stable Diffusion is distributed. The transcript also highlights a range of themed generations, from spray-painted phrases on walls to signage-style text and character prompts, where the spelling is described as unusually accurate for text-to-image systems.
For users who want something to try right now, the transcript points to Karlo, a Kakao Brain model that can produce text accurately in some cases. Karlo is available via a free iOS/Android app called B^ DISCOVER, and it can also be accessed directly on Hugging Face. The results are described as prompt-dependent: simple words like "hello" can come out readable, though errors still happen (such as missing or extra letters). The model is labeled alpha, with a full release expected later.
Taken together, the message is clear: the next wave of text-to-image progress is shifting from "pretty pictures with messy text" toward systems that can reliably render readable words, whether through Google's token-based masked modeling approach, DeepFloyd's open-source IF push, or free early-access tools like Karlo.
Cornell Notes
Text-to-image models have improved dramatically at generating images, but readable spelling has remained a weak spot. Midjourney V4 often produces attractive visuals with unreliable or incorrect text. Google's newer approach uses masked modeling in discrete token space, predicting masked image tokens guided by text embeddings from a large language model; Google claims this yields state-of-the-art image quality and coherent spelling while being more efficient than pixel-space diffusion methods. DeepFloyd's IF model is presented as producing unusually accurate lettering and is expected to be open source with free variants. For hands-on testing, Karlo is offered for free via a mobile app and Hugging Face, with spelling accuracy that varies by prompt.
Why do many popular text-to-image models fail at spelling even when images look great?
What is the core technical shift in Google’s spelling-focused model description?
How does the model enable editing beyond generating a fresh image?
What makes DeepFloyd's IF model stand out in the transcript?
What practical options does the transcript offer for testing spelling today?
Review Questions
- What specific limitation of text-to-image models is repeatedly highlighted, and how do the examples illustrate it?
- How does masked modeling in discrete token space differ from pixel-space diffusion in the transcript’s description?
- What evidence does the transcript provide that Karlo's spelling performance is prompt-dependent rather than consistently accurate?
Key Points
1. Midjourney V4 can generate visually strong images but often produces unreliable or incorrect text, motivating demand for spelling-capable models.
2. Google's token-based masked modeling approach aims to improve spelling by predicting masked image tokens guided by text embeddings from a large language model.
3. Google claims efficiency gains via parallel decoding and positions the method as more efficient than pixel-space diffusion approaches while maintaining or improving results.
4. DeepFloyd's IF model is presented as producing unusually accurate, legible text and is framed as likely open source with free variants.
5. Karlo offers a free, prompt-dependent way to test spelling today via a mobile app (B^ DISCOVER) and Hugging Face, though it remains in alpha.
6. Across the options, spelling accuracy is treated as a key differentiator: better models are judged by how readable the generated words are, not just image quality.