ALL Recent AI Advancements! Open Source LLMs at GPT-4 Potential, AI Music, Txt to Speech
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
GPT-4 Vision can prioritize text instructions embedded inside an image over the user’s typed prompt, even when the two conflict.
Briefing
OpenAI’s GPT-4 Vision appears to be getting a surprising kind of “instruction-following” behavior: when text inside an image conflicts with the user’s prompt, the model tends to prioritize the instructions embedded in the picture. A reported example has GPT-4 Vision obeying a handwritten note presented as an image—responding as if it were a “picture of a rose” even when the user’s request says otherwise. The behavior isn’t perfectly stable; when pressed, the model can backtrack, apologize, and admit the content is handwritten text rather than an image of a rose. Still, the pattern suggests that vision models may treat visual instructions as higher priority than plain-language instructions, raising both practical uses (image-based control) and security concerns (potential jailbreak-style prompting).
That same theme—capabilities accelerating fast in open ecosystems—shows up in the push toward open-source language models that can approach GPT-4 on math. A new reasoning-focused agent called Torah (a tool integrated reasoning agent for mathematical problem solving) is reported to score extremely high on math benchmarks, with results described as competitive with GPT-4 on GSM8K. The key point isn’t just the benchmark number; it’s that Torah is open source, meaning it can be downloaded, run locally, and built upon without relying on a single closed provider. A larger Torah variant (70B) is also mentioned as trailing closely behind the smaller code model, reinforcing that open models are narrowing the gap on difficult reasoning tasks.
Multimodal open-source models are also moving toward “instant” interaction. Adapt AI Labs is open sourcing Fuyu 8B, a multimodal foundation model designed to understand images quickly and respond in under 100 milliseconds for large-image inputs. The model is positioned as lightweight enough to run on a phone, with examples including handwriting transcription and reading complex graphs. The architecture is described as straightforward and similar in spirit to other open multimodal systems such as LLaVA 1.5 and Flamingo, but the emphasis here is speed plus practical deployability.
On the real-time communication front, latency is becoming the headline. PlayHT’s “PlayHT 2.0 Turbo” is presented as producing speech with roughly 300-millisecond low latency, enabling back-and-forth conversation that feels natural rather than delayed. The system supports voice cloning and offers developer SDKs, with a demo-style call-and-response exchange used to illustrate how quickly the audio can arrive.
AI music generation is also accelerating, with multiple directions emerging: 11 Labs is previewing synthetic music work described as coherent and capable of incorporating requested lyrics; meanwhile, a Refusion music generator is framed as easy to use and social-media-like, though limited in output length (around 12 seconds) and more heavily censored than some alternatives. Finally, a Microsoft Azure AI team paper is highlighted for “idea-to-image” improvements—linking GPT-4 Vision with Stable Diffusion XL and iteratively teaching it to prompt SDXL more effectively, producing outputs described as approaching Dolly 3-level quality and improving text handling, conversation understanding, and image-to-image recreation.
Taken together, the throughline is clear: vision models are gaining control over generation and interpretation, open models are closing benchmark gaps, and low-latency audio plus multimodal pipelines are making AI feel more like real-time interaction than a slow batch process.
Cornell Notes
GPT-4 Vision is reported to prioritize instructions embedded in an image over the user’s typed prompt, which can be leveraged for control but also raises jailbreak-style concerns. Open-source math and multimodal models are rapidly improving: Torah is described as competitive with GPT-4 on math benchmarks and is runnable locally, while Adapt AI Labs’ Fuyu 8B targets fast image understanding with sub-100ms responses. Real-time speech is moving from novelty to usability, with PlayHT’s “Turbo” mode aiming for ~300ms latency and supporting voice cloning. AI music generation is advancing in parallel, with 11 Labs previewing synthetic music and Refusion offering a simpler, more censored generator. Microsoft’s Azure AI team also reports an “idea-to-image” pipeline that uses GPT-4 Vision to teach Stable Diffusion XL better prompting, improving text and image-to-image results.
What does the GPT-4 Vision “image instruction priority” behavior mean, and why does it matter?
Why is Torah’s open-source status significant for math reasoning?
What makes Fuyu 8B notable compared with many other multimodal models?
How does PlayHT’s low-latency speech change what AI conversations can feel like?
What are the main differences in the AI music generation options mentioned?
How does the Microsoft Azure “idea-to-image” approach improve Stable Diffusion XL outputs?
Review Questions
- If an image contains conflicting instructions, what behavior should designers expect from GPT-4 Vision, and how could that affect prompt safety?
- How do open-source math agents like Torah change the way researchers and developers can test and deploy reasoning models?
- What practical advantage does sub-100ms multimodal inference (like Fuyu 8B) offer compared with slower image understanding systems?
Key Points
- 1
GPT-4 Vision can prioritize text instructions embedded inside an image over the user’s typed prompt, even when the two conflict.
- 2
The “image-first” behavior appears to be imperfect: under pressure, the model may acknowledge the visual content is handwritten text rather than the claimed image description.
- 3
Torah is an open-source reasoning agent for math that is reported to score competitively with GPT-4 on GSM8K and can be run locally.
- 4
Adapt AI Labs’ Fuyu 8B targets fast multimodal responses (under 100 milliseconds) and is described as small enough to potentially run on a phone.
- 5
PlayHT’s “PlayHT 2.0 Turbo” emphasizes real-time speech with about 300ms low latency and supports voice cloning for natural-feeling conversations.
- 6
AI music generation is splitting into different tradeoffs: 11 Labs emphasizes quality, while Refusion emphasizes ease of use and social discovery but with shorter outputs and heavier censorship.
- 7
Microsoft Azure’s GPT-4 Vision + Stable Diffusion XL pipeline improves text-to-image and image-to-image results by teaching better prompting through iterative training.