
The First AI Art Generator That Can Spell: New FREE Open Source AI Art Generator

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Midjourney V4 can generate visually strong images but often produces unreliable or incorrect text, motivating demand for spelling-capable models.

Briefing

Text-to-image AI has long struggled with one basic requirement: producing readable, correctly spelled words. Midjourney V4 can generate beautiful images, but its lettering often turns into gibberish or near-misses, prompting a years-long search for models that can actually spell. Google’s earlier Parti line of models showed incremental progress (from partial letters to occasional full words at very large parameter counts), but access was restricted, leaving most users stuck with models that can’t reliably write.

A new candidate aims to change that. Google’s Muse is described as a masked modeling system in discrete token space, positioned as a more efficient alternative to pixel-space diffusion approaches (such as DALL·E 2 and Imagen). Instead of generating images pixel by pixel, it predicts randomly masked image tokens, guided by text embeddings extracted from a pre-trained large language model. Google claims state-of-the-art image generation performance, specifically including coherent spelling, while also reducing generation time through parallel decoding. The approach is also tied to “mask-free” editing workflows: after an initial image is produced, the system can iteratively resample image tokens conditioned on a new text prompt to modify objects, swap elements (like changing the food next to a latte), or alter scene details (such as changing backgrounds from New York to Paris or adding hot air balloons).
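
The masked-token generation loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_predictor` is a random stand-in for the real transformer, and the keep-the-most-confident schedule is modeled on published masked-token decoders, not on any code from the transcript.

```python
import random

MASK = -1  # sentinel value for a masked image token

def toy_predictor(tokens, seed):
    """Stand-in for the real transformer: propose a token id and a
    confidence score for every currently masked position."""
    rng = random.Random(seed)
    return {
        i: (rng.randrange(1024), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }

def parallel_decode(num_tokens=16, steps=4, seed=0):
    """Start from a fully masked token grid; on each pass, commit the
    most confident share of proposals, so the image fills in over a
    handful of passes rather than one token at a time."""
    tokens = [MASK] * num_tokens
    for step in range(steps):
        proposals = toy_predictor(tokens, seed + step)
        if not proposals:
            break
        # Keep roughly an equal share of the remaining masks per step.
        keep = max(1, len(proposals) // (steps - step))
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:keep]:
            tokens[i] = tok
    # Fill any positions still masked after the scheduled passes.
    for i, (tok, _conf) in toy_predictor(tokens, seed).items():
        tokens[i] = tok
    return tokens

print(parallel_decode())
```

The design point this sketch captures is why the claimed speedup is plausible: many tokens are committed per pass, so a full image takes a handful of passes instead of hundreds of diffusion steps or thousands of one-at-a-time autoregressive steps.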

The most immediately actionable news in the transcript comes from DeepFloyd AI, working in partnership with Stability AI. DeepFloyd announced an “IF” model that produces sharply legible text in generated images; examples include clear, Photoshop-like lettering and consistent word shapes across multiple prompts. The announcement is framed around a playful poem (“if I were a model, I’d be open source”), and the model is presented as likely to be open source, with free variants available through websites, similar to how Stable Diffusion is distributed. The transcript also highlights a range of themed generations (spray-painted phrases on walls, signage-style text, and character prompts) where the spelling is described as unusually accurate for text-to-image systems.

For users who want something to try right now, the transcript points to Karlo, a Kakao Brain model that can produce text accurately in some cases. Karlo is available via a free iOS/Android app called B^ DISCOVER, and it can also be accessed directly on Hugging Face. The results are described as prompt-dependent: simple words like “hello” can come out readable, though errors still happen (such as missing or extra letters). The model is labeled alpha, with a full release expected later.

Taken together, the message is clear: the next wave of text-to-image progress is shifting from “pretty pictures with messy text” toward systems that can reliably render readable words, whether through Google’s token-based masked modeling approach, DeepFloyd’s open-source IF push, or free early-access tools like Karlo.

Cornell Notes

Text-to-image models have improved dramatically at generating images, but readable spelling has remained a weak spot. Midjourney V4 often produces attractive visuals with unreliable or incorrect text. Google’s newer approach, Muse, uses masked modeling in discrete token space, predicting masked image tokens guided by text embeddings from a large language model; Google claims this yields state-of-the-art image quality and coherent spelling while being more efficient than pixel-space diffusion methods. DeepFloyd AI’s IF model is presented as producing unusually accurate lettering and is expected to be open source with free variants. For hands-on testing, Karlo is offered for free via a mobile app and Hugging Face, with spelling accuracy that varies by prompt.

Why do many popular text-to-image models fail at spelling even when images look great?

The transcript contrasts strong visual generation (e.g., Midjourney V4 producing high-quality logos and scenes) with weak text rendering, where letters can become gibberish or near-English. The implied issue is that common generation pipelines don’t reliably enforce character-level structure, so text becomes an artifact of image synthesis rather than a constrained, readable output.

What is the core technical shift in Google’s spelling-focused model description?

Google’s approach is described as masked modeling in discrete token space. The text prompt is first encoded into embeddings by a pre-trained large language model, then the system predicts randomly masked image tokens conditioned on those embeddings. The transcript contrasts this with pixel-space diffusion methods (like those behind earlier image generators), claiming the token approach is more efficient, with generation sped up by parallel decoding, while maintaining strong language alignment.

How does the model enable editing beyond generating a fresh image?

The transcript describes “zero-shot, mask-free editing,” where the system iteratively resamples image tokens conditioned on a text prompt to edit an already-produced image. Examples include changing the food next to a latte while retaining latte detail, swapping backgrounds (New York vs. Paris vs. San Francisco), and adding or changing elements like hot air balloons or leaf colors around a gazebo.
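
The iterative-resampling idea can be sketched with a toy token grid. The re-masking fraction, round count, and random stand-in for the transformer are all assumptions for illustration; a real system would score and refill tokens under the new prompt rather than pick positions at random.

```python
import random

MASK = -1  # sentinel value for a re-masked image token

def resample_edit(tokens, prompt_seed, fraction=0.5, rounds=3):
    """Sketch of mask-free editing: repeatedly re-mask a fraction of an
    already-generated token grid and refill those positions as if
    conditioned on the new prompt, leaving the rest of the image alone."""
    rng = random.Random(prompt_seed)
    tokens = list(tokens)
    n_remask = max(1, int(len(tokens) * fraction))
    for _ in range(rounds):
        positions = rng.sample(range(len(tokens)), n_remask)
        for i in positions:
            tokens[i] = MASK  # drop the old content at these spots
        for i in positions:
            # Stand-in for the transformer's prediction under the new prompt.
            tokens[i] = rng.randrange(1024)
    return tokens

original = [7] * 16  # pretend this grid encodes the original scene
edited = resample_edit(original, prompt_seed=99)
print(edited)
```

Because only the resampled positions change, untouched tokens carry over unchanged, which is the sketch-level analogue of the latte keeping its detail while the food beside it is swapped.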

What makes DeepFloyd AI’s IF model stand out in the transcript?

The transcript emphasizes legible, consistent text in generated images; examples are described as clear enough to look like typed or edited text. DeepFloyd’s announcement includes a poem implying open sourcing, and the model is presented as likely to be open source with free variants, similar to how Stable Diffusion is distributed.

What practical options does the transcript offer for testing spelling today?

Two immediate paths are highlighted: (1) DeepFloyd’s IF via community updates (a Discord link is mentioned), and (2) Karlo, available for free through the B^ DISCOVER iOS/Android app and also on Hugging Face. Karlo is in alpha, and spelling accuracy is prompt-dependent: simple words like “hello” can come out readable, but letter mistakes still occur.

Review Questions

  1. What specific limitation of text-to-image models is repeatedly highlighted, and how do the examples illustrate it?
  2. How does masked modeling in discrete token space differ from pixel-space diffusion in the transcript’s description?
  3. What evidence does the transcript provide that Karlo’s spelling performance is prompt-dependent rather than consistently accurate?

Key Points

  1. Midjourney V4 can generate visually strong images but often produces unreliable or incorrect text, motivating demand for spelling-capable models.

  2. Google’s token-based masked modeling approach (Muse) aims to improve spelling by predicting masked image tokens guided by text embeddings from a large language model.

  3. Google claims efficiency gains via parallel decoding and positions the method as more efficient than pixel-space diffusion approaches while maintaining or improving results.

  4. DeepFloyd AI’s IF model is presented as producing unusually accurate, legible text and is framed as likely open source with free variants.

  5. Karlo offers a free, prompt-dependent way to test spelling today via a mobile app (B^ DISCOVER) and Hugging Face, though it remains in alpha.

  6. Across the options, spelling accuracy is treated as a key differentiator: better models are judged by how readable the generated words are, not just image quality.

Highlights

Midjourney V4 delivers impressive imagery but frequently fails at producing real, readable English text.
Google’s spelling-focused system, Muse, uses masked modeling in discrete token space and claims state-of-the-art results with coherent spelling plus faster generation via parallel decoding.
DeepFloyd AI’s IF model is showcased with sharp, readable lettering and is positioned as open source, potentially making spelling-capable generation widely accessible.
Karlo is available for free (mobile app and Hugging Face) and can spell sometimes: often close enough to read, but still prone to missing or extra letters.
