NEW Text to Image AI "Simulacrabot" Compares to DALL-E 2 & is OPEN SOURCE!
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Simulacrabot is an early-access text-to-image bot accessed through Discord that uses Stable Diffusion and is tuned with aesthetic ratings.
Briefing
A new open-source text-to-image bot called “Simulacrabot” is drawing comparisons to major paid systems by pairing Stable Diffusion with a highly curated dataset of synthetic images and human aesthetic ratings. The core advantage isn’t just raw generation; it’s the way the model is tuned using “Simulacra Static Captions,” a dataset built from more than 238,000 AI-generated images. Each image is paired with its caption and an aesthetic score, forming image-caption-rating triplets derived from over 40,000 user-submitted prompts. The result is a royalty-free resource with more than 176,000 ratings that can also be reused for downstream research and training.
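To make the triplet structure concrete, here is a minimal sketch of how such a dataset might be represented and filtered for aesthetic tuning. The record layout, field names, and rating threshold are assumptions for illustration, not the dataset’s actual schema:

```python
# Hypothetical sketch of an image-caption-rating triplet dataset like the
# one described above. Field names and the 1-10 rating scale are assumptions.

from dataclasses import dataclass

@dataclass
class RatedCaption:
    image_path: str   # path to the AI-generated image
    caption: str      # the user-submitted prompt/caption
    rating: float     # human aesthetic score (assumed 1-10 scale)

def keep_high_quality(records, threshold=7.0):
    """Keep only triplets whose aesthetic rating meets the threshold,
    so a model can be tuned toward images that humans rated highly."""
    return [r for r in records if r.rating >= threshold]

# Toy records (entirely made up):
records = [
    RatedCaption("img_001.png", "origami godzilla destroying origami tokyo", 8.4),
    RatedCaption("img_002.png", "blurry frog", 3.1),
]
print([r.image_path for r in keep_high_quality(records)])  # ['img_001.png']
```

The key design idea the transcript attributes to the dataset is exactly this kind of rating-based filtering or weighting: the ratings, not just the captions, steer the model toward aesthetically pleasing output.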
Simulacrabot is accessed through Discord and operates under early-access rules: no NSFW content, no copyrighted material (such as generating Disney princesses), no hateful content, and no personal information. Despite those guardrails, the examples shown are striking for how closely the outputs follow prompt style and for how quickly the system returns results. The model produces coherent, visually pleasing images across multiple categories—origami “Godzilla” destroying origami Tokyo, near-photoreal portraits (including Albert Einstein), stylized oil paintings that still read clearly as a single scene, and character concepts like an orangutan described with a trending ArtStation prompt.
The most direct comparisons come from side-by-side prompting against DALL·E 2. In one test, both systems generate an aesthetically strong image, but the Simulacrabot output is described as missing only a small amount of “coherency” relative to DALL·E 2. The transcript repeatedly frames this as a competitive gap that could narrow further as open-source pipelines improve—especially with free upscalers that can raise resolution. Another test targets photorealism: a “nature photograph of a frog” looks convincing enough to be mistaken for a real photo, though the frog isn’t perfectly centered. Re-running the same prompt yields a slightly different composition, reinforcing that the model is responsive but not always perfectly aligned.
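The upscalers mentioned above are learned super-resolution models; as a purely mechanical illustration of what “raising resolution” means (not how those tools actually work), a nearest-neighbor upscale on a tiny pixel grid looks like this:

```python
# Illustrative nearest-neighbor upscale of a tiny "image" (2D grid of pixel
# values). Real open-source upscalers use trained super-resolution networks
# that add plausible detail; this sketch only repeats existing pixels.

def upscale_nearest(pixels, factor):
    """Return a grid `factor` times larger in each dimension,
    repeating each pixel to fill the new cells."""
    out = []
    for row in pixels:
        big_row = [p for p in row for _ in range(factor)]
        out.extend([big_row[:] for _ in range(factor)])
    return out

tiny = [[0, 255],
        [255, 0]]
big = upscale_nearest(tiny, 2)
print(len(big), len(big[0]))  # 4 4
```

Learned upscalers go further by hallucinating texture and edges, which is why they can meaningfully narrow the perceived quality gap with higher-resolution closed systems.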
Simulacrabot also appears strong at faces, a notoriously difficult area for text-to-image models. Examples include a photorealistic red-bearded man eating ice cream and multiple “Grand Theft Auto 5 cover art” style celebrity depictions, including Jeff Bezos, Mark Zuckerberg, and Elon Musk. The outputs are presented as accurate enough to be recognizable, with the “stitched art” look matching the GTA cover aesthetic.
Overall, the takeaway is that open-source text-to-image systems—when trained or tuned with large, rating-based aesthetic datasets—can reach a level of visual quality that rivals mainstream paid tools. That puts pressure on pricing and performance assumptions for closed models like DALL·E 2, particularly as iteration speed and community tooling accelerate.
Cornell Notes
Simulacrabot is an early-access text-to-image bot accessed via Discord that uses Stable Diffusion and is tuned with the “Simulacra Static Captions” dataset. That dataset contains 238,000+ synthetic images generated from models including Stable Diffusion, with captions and aesthetic ratings derived from 40,000+ user prompts. The ratings-based training approach is presented as boosting visual quality and coherence, making outputs consistently attractive across styles—photorealism, oil painting, and character concepts. Examples also highlight strong face generation and recognizable celebrity “Grand Theft Auto 5 cover art” style results. The practical implication: open-source systems can compete closely with paid models like DALL·E 2, especially as resolution and tooling improve.
What dataset is central to Simulacrabot’s performance, and why does it matter?
How does Simulacrabot’s output quality compare with DALL·E 2 in the examples shown?
What kinds of prompts were used to test the bot, and what failure modes appeared?
Why does the transcript emphasize faces as a benchmark for text-to-image quality?
What constraints does Simulacrabot enforce, and how does that shape the kinds of examples shown?
Review Questions
- How does the use of aesthetic rating triplets in “Simulacra Static Captions” differ from training on captions alone, and what effect is claimed in the transcript?
- Which specific output categories (photorealism, stylized painting, character concepts, faces) show the strongest results for Simulacrabot, and what limitations are mentioned?
- What role do resolution and upscaling tools play in narrowing the gap between open-source systems and DALL·E 2, according to the examples?
Key Points
1. Simulacrabot is an early-access text-to-image bot accessed through Discord that uses Stable Diffusion and is tuned with aesthetic ratings.
2. “Simulacra Static Captions” provides 238,000+ synthetic images with caption-image-rating triplets, including 176,000+ ratings released under public domain terms.
3. The rating-based dataset design is presented as a major driver of improved visual quality and coherence, not just faster generation.
4. Simulacrabot outputs are shown as competitive in multiple styles, including photorealistic scenes, oil painting aesthetics, and 3D render-like prompts.
5. Face generation is highlighted as a key strength, with examples framed as more convincing than Midjourney in the current comparison.
6. Comparisons to DALL·E 2 suggest the remaining gap may be smaller than expected, potentially tied to coherency and output resolution rather than fundamentals.
7. Guardrails restrict NSFW content, copyrighted characters, hateful content, and personal information, shaping the prompt examples used.