
AI is Speeding Up AGAIN! HUGE Open Source AI Advancements!

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Apple’s MDM multi-resolution diffusion model can generate images and text-to-video outputs across multiple resolutions, with code planned for release soon.

Briefing

Apple is moving deeper into generative AI research with a multi-resolution diffusion model (MDM) designed to produce high-quality images and videos, and it plans to release code soon. The model can generate at multiple resolutions and supports text-to-video, with results broadly in line with today’s diffusion baselines: minor warping and short clips, but no collapse in quality. A key efficiency claim is that the approach doesn’t require a pre-trained VAE, a component Stable Diffusion XL typically needs to render full-resolution outputs. That design choice could translate into faster generation and a lighter pipeline than standard SDXL workflows, while the model’s ability to render text inside images appears limited, similar to SDXL-style “imperfect” text handling. The most consequential takeaway isn’t just the output quality; it’s Apple’s involvement, which signals that major consumer platforms are actively building research-grade generative systems rather than waiting on the sidelines.

Open-source acceleration is also reshaping what “at home” AI looks like. A new SDXL-based distillation model called SSD-1B is presented as roughly 60% faster and using about 40% less VRAM than full SDXL, with similar image quality despite being smaller. The practical implication is that many users may be able to run SDXL-class generation with far less hardware, potentially around 3 GB of VRAM in a ComfyUI-style setup, though the transcript treats exact numbers as uncertain. For Mac users, a latent consistency model integrated into ComfyUI is described as enabling SDXL image generation in under a second on M1 and M2 Macs, positioning local generation as both private and faster than many free web demos.

On the language-model side, a 7B open model named Zephyr 7B Beta is pitched as a standout small model, outperforming prior 7B contenders on benchmark scores and even challenging much larger systems in the same evaluation context. It’s built on the Mistral architecture and trained with the UltraChat dataset, which includes dialogue from ChatGPT, plus feedback data judged by GPT-4, raising the familiar question of how much “cheating” is involved when proprietary models help label training data. Still, the performance narrative is clear: small open models are getting close enough to mainstream chat quality that they may soon run locally, even on phones.

Multimodal hallucination correction gets a more direct fix with “Woodpecker,” a method aimed at vision-language models that invent objects not actually present in an image. Instead of retraining the underlying model, Woodpecker identifies key objects mentioned in a caption, queries the model to verify whether those objects truly exist, uses another LLM to validate the questions, and then rewrites the caption accordingly. Reported results show large jumps in mitigation accuracy across different vision models, with the broader message that post-processing can unlock reliability without rebuilding the base system.

The roundup closes with a fast-moving set of adjacent trends: monetization for custom chatbots on Poe, fine-tuning for highly specific image styles (including Toy Story), rapid upgrades in AI video generators like Pika and Genmo, and the rise of 2D-to-3D workflows such as DreamCraft3D. It also flags CommonCanvas, an open diffusion model trained on Creative Commons images, framed as a potential ethical and legal pressure-release valve for AI image training. Together, the throughline is speed plus practicality: models are getting smaller, faster, and more usable locally, while reliability and customization improve through both research and productization.

Cornell Notes

Apple introduced a multi-resolution diffusion model (MDM) for generating images and videos, with plans to release code soon. The approach supports multiple resolutions and text-to-video, and it avoids the pre-trained VAE that Stable Diffusion XL typically relies on, which may improve efficiency and speed. Open-source SDXL distillation (SSD-1B) and ComfyUI workflows are pushing SDXL-class generation onto consumer hardware, including claims of sub-second generation on M1/M2 Macs. On the reliability front, Woodpecker targets multimodal hallucinations by verifying caption-claimed objects and correcting captions without retraining the underlying vision model. The broader theme: generative AI is accelerating locally while also getting more dependable and customizable through new training and post-processing methods.

What makes Apple’s MDM diffusion model potentially more efficient than Stable Diffusion XL?

MDM is described as not needing a pre-trained VAE to produce full-resolution outputs. Since Stable Diffusion XL typically uses a pre-trained VAE to bring images to full resolution, skipping that component can reduce pipeline complexity. The transcript links this design choice to likely efficiency gains, potentially making MDM faster than standard SDXL generation, while still supporting multi-resolution outputs and text-to-video.
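The pipeline difference can be sketched in a few lines. Everything below is a toy stand-in (the function names and shapes are invented for illustration, not Apple's or Stability's actual code); the only point is where the extra VAE decode stage sits.

```python
# Toy contrast between the two pipeline shapes. All functions are
# placeholders invented for illustration; only the control flow matters.

def denoise_in_latent_space(prompt, shape):
    # Stand-in for SDXL's denoising loop over compressed latents.
    return ("latent", shape)

def vae_decode(latent):
    # Stand-in for the separate pre-trained VAE that turns latents
    # into full-resolution pixels.
    return ("image", 1024)

def denoise_pixels(prompt, resolution, init=None):
    # Stand-in for MDM-style denoising directly in pixel space.
    return ("image", resolution)

def sdxl_style_generate(prompt):
    # Latent diffusion: denoise in latent space, then decode with a VAE.
    latent = denoise_in_latent_space(prompt, shape=(4, 128, 128))
    return vae_decode(latent)  # the extra model stage SDXL depends on

def mdm_style_generate(prompt):
    # Multi-resolution diffusion: refine pixels coarse-to-fine,
    # with no separate VAE decode at the end.
    image = denoise_pixels(prompt, resolution=64)
    for res in (256, 1024):
        image = denoise_pixels(prompt, resolution=res, init=image)
    return image

print(sdxl_style_generate("a cat"))  # ('image', 1024), via a VAE decode
print(mdm_style_generate("a cat"))   # ('image', 1024), no VAE stage
```

Dropping the VAE stage removes one model's worth of weights and compute from the end of every generation, which is where the claimed efficiency gain would come from.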

How does SSD-1B aim to bring SDXL-quality results to lower-cost hardware?

SSD-1B is presented as an SDXL-based distillation model that is about 60% faster and uses about 40% less VRAM. The claim is that distillation preserves image quality close to full-sized SDXL while shrinking compute and memory requirements. The transcript notes that exact VRAM numbers aren’t specified in the paper, but consensus estimates suggest around 6 GB for SDXL and potentially around 3 GB in a ComfyUI-style setup.
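Those figures can be sanity-checked with quick arithmetic. The 6 GB baseline and the reading of "60% faster" as 1.6x throughput are assumptions for illustration, not measurements:

```python
# Back-of-envelope check of the claimed savings. The baseline figure and
# the interpretation of "60% faster" are assumptions, not benchmarks.

SDXL_VRAM_GB = 6.0      # rough consensus baseline cited in the transcript
VRAM_REDUCTION = 0.40   # SSD-1B's claimed VRAM saving
SPEEDUP = 0.60          # "about 60% faster", read here as 1.6x throughput

ssd1b_vram_gb = SDXL_VRAM_GB * (1 - VRAM_REDUCTION)
relative_time = 1 / (1 + SPEEDUP)  # time per image relative to SDXL

print(f"estimated SSD-1B VRAM: {ssd1b_vram_gb:.1f} GB")  # 3.6 GB
print(f"time per image vs SDXL: {relative_time:.2f}x")   # 0.62x
```

The ~3.6 GB result is consistent with the transcript's "around 3 GB in a ComfyUI-style setup" ballpark, which is the gap between "needs a workstation GPU" and "runs on a mid-range laptop card".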

Why is the “latent consistency model” for ComfyUI on M1/M2 Macs a big deal in this roundup?

The transcript says that with a latent consistency model layered on top of a ComfyUI base installation for Apple silicon (M1 and M2), users can produce full SDXL images in under a second. The emphasis is on local, private generation that can be faster than free web SDXL options, making high-end image generation more accessible without relying on remote servers.
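The sub-second claim follows from step counts: a consistency model needs only a handful of denoising steps where standard samplers need dozens. The step counts and per-step time below are illustrative assumptions, not benchmarks from the video.

```python
# Illustrative step-count arithmetic; per-step time and step counts are
# assumed values, not measured on any actual M1/M2 machine.

def generation_time(steps, seconds_per_step):
    # Diffusion sampling time scales roughly linearly with step count.
    return steps * seconds_per_step

standard = generation_time(steps=30, seconds_per_step=0.15)  # typical sampler
lcm = generation_time(steps=4, seconds_per_step=0.15)        # consistency model

print(f"standard sampler: ~{standard:.1f}s per image")
print(f"LCM sampler:      ~{lcm:.1f}s per image")
```

With per-step cost held fixed, cutting 30 steps down to 4 is what moves a multi-second generation under the one-second mark.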

What is Woodpecker doing to reduce hallucinations in multimodal models?

Woodpecker corrects captions by checking whether objects mentioned in the caption actually exist in the image. It first identifies the key objects referenced in the caption, then poses questions to the model to verify those objects, uses another LLM to validate the questions, and finally rewrites the caption based on the validation output. Reported improvements include raising MiniGPT-4 from 54% to 86% and mPLUG-Owl from 62% to 86%, with an overall ~80% accuracy in mitigating hallucinations, all without retraining the underlying vision model.
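The loop described above can be sketched with stub functions. Every call below is a placeholder (the real system queries a vision-language model and validates questions with a second LLM, which is skipped here); only the identify-verify-rewrite control flow is the point.

```python
# Toy Woodpecker-style correction loop. All "models" are stubs; the
# question-validation step with a second LLM is omitted for brevity.

def extract_key_objects(caption):
    # Step 1: pull out the object nouns the caption claims are present.
    vocabulary = {"dog", "cat", "frisbee", "car"}
    words = [w.strip(".,").lower() for w in caption.split()]
    return [w for w in words if w in vocabulary]

def object_in_image(obj, detected_objects):
    # Step 2: stand-in for asking the vision model "is there a <obj>?"
    return obj in detected_objects

def rewrite_caption(caption, hallucinated):
    # Step 3: drop claims about objects that failed verification.
    for obj in hallucinated:
        caption = caption.replace(f" and a {obj}", "")
    return caption

def woodpecker_correct(caption, detected_objects):
    claimed = extract_key_objects(caption)
    hallucinated = [o for o in claimed if not object_in_image(o, detected_objects)]
    return rewrite_caption(caption, hallucinated)

fixed = woodpecker_correct("A dog catching a frisbee and a cat.", {"dog", "frisbee"})
print(fixed)  # "A dog catching a frisbee." -- the phantom cat is removed
```

Because the correction happens entirely in this post-processing loop, the underlying vision model's weights never change, which is the headline property of the method.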

What training-data strategy is highlighted for Zephyr 7B Beta, and why does it matter?

Zephyr 7B Beta is described as using the UltraChat dataset, which contains dialogue from ChatGPT, plus additional feedback data in which responses were judged by GPT-4. The transcript frames this as potentially “cheating” in spirit, but the practical point is that small open models are being trained with high-quality proprietary-model supervision. The result is strong benchmark performance for a 7B model, including claims that it can beat the free version of ChatGPT on an AlpacaEval leaderboard in the cited evaluation context.
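That data recipe can be sketched as a two-stage stub: a teacher model supplies responses, and a judge model ranks candidates into preference pairs. The function names and the length-based "judge" are invented for illustration; pairs like these are what preference-tuning methods for small open models consume.

```python
# Schematic of teacher-supervised data construction. All model calls are
# stubs; the length-based judge is a deliberately crude stand-in for GPT-4.

def teacher_respond(prompt):
    # Stand-in for collecting a ChatGPT-style dialogue turn.
    return f"A detailed teacher answer to: {prompt}"

def judge_rank(prompt, candidates):
    # Stand-in for a GPT-4-style judge scoring candidate answers.
    return sorted(candidates, key=len, reverse=True)

def build_preference_pair(prompt, candidates):
    # Best vs. worst candidate becomes a (chosen, rejected) training pair.
    ranked = judge_rank(prompt, candidates)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

prompt = "Explain VRAM in one line."
pair = build_preference_pair(prompt, ["Video memory.", teacher_respond(prompt)])
print(pair["chosen"])  # the longer, teacher-written answer wins here
```

The "cheating" question in the transcript is about exactly this structure: both the responses and the rankings come from proprietary models, so the small model inherits their quality without needing their scale.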

Review Questions

  1. Which architectural component does MDM avoid compared with Stable Diffusion XL, and how does that relate to efficiency?
  2. How does Woodpecker improve multimodal caption reliability without retraining the base vision model?
  3. What does distillation aim to preserve when shrinking SDXL models like SSD-1B, and what hardware implications are claimed?

Key Points

  1. Apple’s MDM multi-resolution diffusion model can generate images and text-to-video outputs across multiple resolutions, with code planned for release soon.

  2. MDM’s design avoids a pre-trained VAE, a component Stable Diffusion XL typically uses for full-resolution rendering, potentially improving speed and efficiency.

  3. SSD-1B is an SDXL-based distillation model positioned as ~60% faster and ~40% lower VRAM than full SDXL while maintaining similar image quality.

  4. ComfyUI plus a latent consistency model is described as enabling sub-second SDXL generation on Apple silicon M1 and M2 Macs, emphasizing local, private workflows.

  5. Woodpecker reduces multimodal hallucinations by verifying caption-claimed objects through question-and-validation loops, then rewriting captions, without retraining the underlying vision model.

  6. Zephyr 7B Beta is presented as a top-performing 7B open language model trained with ChatGPT dialogue and GPT-4-judged feedback, narrowing the gap to larger systems in benchmarks.

  7. AI video and 3D pipelines are accelerating through rapid model updates (e.g., Pika and Genmo) and 2D-to-3D approaches like DreamCraft3D, while Creative Commons-based training (CommonCanvas) is framed as an ethically safer data source.

Highlights

Apple’s MDM model supports multi-resolution image generation and text-to-video, and it reportedly skips the pre-trained VAE that Stable Diffusion XL needs for full-resolution outputs.
SSD-1B claims SDXL-class quality with major efficiency gains—about 60% faster and around 40% less VRAM—making local high-quality generation more feasible.
Woodpecker tackles multimodal hallucinations with a verification-and-rewrite loop, reporting large accuracy jumps (e.g., MiniGPT-4 54%→86%) without retraining the vision model.
ComfyUI on M1/M2 Macs, paired with a latent consistency model, is described as enabling full SDXL generations in under a second.

Topics

  • Apple MDM Diffusion
  • SDXL Distillation
  • ComfyUI Latent Consistency
  • Multimodal Hallucination Correction
  • Open 7B Language Models
  • AI Video Updates
  • 2D-to-3D Generation
  • Creative Commons Training

Mentioned

  • MDM
  • VAE
  • SDXL
  • VRAM
  • LLM
  • GPT
  • M1
  • M2
  • AlpacaEval