Get AI summaries of any video or article — Sign up free
ALL Recent AI Advancements! Open Source LLMs at GPT-4 Potential, AI Music, Txt to Speech thumbnail

ALL Recent AI Advancements! Open Source LLMs at GPT-4 Potential, AI Music, Txt to Speech

MattVidPro·
6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-4 Vision can prioritize text instructions embedded inside an image over the user’s typed prompt, even when the two conflict.

Briefing

OpenAI’s GPT-4 Vision appears to be getting a surprising kind of “instruction-following” behavior: when text inside an image conflicts with the user’s prompt, the model tends to prioritize the instructions embedded in the picture. A reported example has GPT-4 Vision obeying a handwritten note presented as an image—responding as if it were a “picture of a rose” even when the user’s request says otherwise. The behavior isn’t perfectly stable; when pressed, the model can backtrack, apologize, and admit the content is handwritten text rather than an image of a rose. Still, the pattern suggests that vision models may treat visual instructions as higher priority than plain-language instructions, raising both practical uses (image-based control) and security concerns (potential jailbreak-style prompting).

That same theme—capabilities accelerating fast in open ecosystems—shows up in the push toward open-source language models that can approach GPT-4 on math. A new reasoning-focused agent called Torah (a tool integrated reasoning agent for mathematical problem solving) is reported to score extremely high on math benchmarks, with results described as competitive with GPT-4 on GSM8K. The key point isn’t just the benchmark number; it’s that Torah is open source, meaning it can be downloaded, run locally, and built upon without relying on a single closed provider. A larger Torah variant (70B) is also mentioned as trailing closely behind the smaller code model, reinforcing that open models are narrowing the gap on difficult reasoning tasks.

Multimodal open-source models are also moving toward “instant” interaction. Adapt AI Labs is open sourcing Fuyu 8B, a multimodal foundation model designed to understand images quickly and respond in under 100 milliseconds for large-image inputs. The model is positioned as lightweight enough to run on a phone, with examples including handwriting transcription and reading complex graphs. The architecture is described as straightforward and similar in spirit to other open multimodal systems such as LLaVA 1.5 and Flamingo, but the emphasis here is speed plus practical deployability.

On the real-time communication front, latency is becoming the headline. PlayHT’s “PlayHT 2.0 Turbo” is presented as producing speech with roughly 300-millisecond low latency, enabling back-and-forth conversation that feels natural rather than delayed. The system supports voice cloning and offers developer SDKs, with a demo-style call-and-response exchange used to illustrate how quickly the audio can arrive.

AI music generation is also accelerating, with multiple directions emerging: 11 Labs is previewing synthetic music work described as coherent and capable of incorporating requested lyrics; meanwhile, a Refusion music generator is framed as easy to use and social-media-like, though limited in output length (around 12 seconds) and more heavily censored than some alternatives. Finally, a Microsoft Azure AI team paper is highlighted for “idea-to-image” improvements—linking GPT-4 Vision with Stable Diffusion XL and iteratively teaching it to prompt SDXL more effectively, producing outputs described as approaching Dolly 3-level quality and improving text handling, conversation understanding, and image-to-image recreation.

Taken together, the throughline is clear: vision models are gaining control over generation and interpretation, open models are closing benchmark gaps, and low-latency audio plus multimodal pipelines are making AI feel more like real-time interaction than a slow batch process.

Cornell Notes

GPT-4 Vision is reported to prioritize instructions embedded in an image over the user’s typed prompt, which can be leveraged for control but also raises jailbreak-style concerns. Open-source math and multimodal models are rapidly improving: Torah is described as competitive with GPT-4 on math benchmarks and is runnable locally, while Adapt AI Labs’ Fuyu 8B targets fast image understanding with sub-100ms responses. Real-time speech is moving from novelty to usability, with PlayHT’s “Turbo” mode aiming for ~300ms latency and supporting voice cloning. AI music generation is advancing in parallel, with 11 Labs previewing synthetic music and Refusion offering a simpler, more censored generator. Microsoft’s Azure AI team also reports an “idea-to-image” pipeline that uses GPT-4 Vision to teach Stable Diffusion XL better prompting, improving text and image-to-image results.

What does the GPT-4 Vision “image instruction priority” behavior mean, and why does it matter?

When instructions appear as text inside an image, GPT-4 Vision can follow those visual instructions even if they conflict with the user’s typed prompt. A cited example has the model treating a handwritten note as if it were an instruction to identify the image as “a picture of a rose,” despite the user’s request. The behavior matters because it changes how prompts should be designed for reliable outputs—and it creates a potential security angle where malicious or misleading instructions could be embedded in images.

Why is Torah’s open-source status significant for math reasoning?

Torah is described as an open-source reasoning agent for mathematical problem solving that can be downloaded and run on a user’s own machine. Benchmark results are reported as extremely close to GPT-4 on GSM8K, with the smaller Torah code model scoring near GPT-4 and a larger 70B variant also trailing closely. The open-source angle matters because it enables independent evaluation, local deployment, and community development rather than depending on a single closed API.

What makes Fuyu 8B notable compared with many other multimodal models?

Fuyu 8B is positioned as a fast multimodal foundation model that can respond to large images in under 100 milliseconds, aiming for near-instant interaction. It’s also described as small enough to potentially run on a phone. Examples include accurate transcription of messy handwriting and reading complex graphs, with an architecture said to be simple and similar to other open multimodal approaches like LLaVA 1.5 and Flamingo.

How does PlayHT’s low-latency speech change what AI conversations can feel like?

PlayHT’s “PlayHT 2.0 Turbo” is described as generating audio with roughly 300-millisecond low latency, producing responses in under a second. With voice cloning and developer SDKs, this supports more natural back-and-forth dialogue—closer to human conversation timing than typical delayed text-to-speech pipelines.

What are the main differences in the AI music generation options mentioned?

11 Labs is previewing synthetic music work described as coherent and capable of including requested lyrics, with a focus on quality. Refusion is framed as easier to generate music from lyrics on its site and includes a social/explore-style interface, but output length is limited (about 12 seconds) and it’s more censored (e.g., restrictions on swears) compared with less-censored alternatives like Sono AI.

How does the Microsoft Azure “idea-to-image” approach improve Stable Diffusion XL outputs?

The approach links GPT-4 Vision with Stable Diffusion XL and iteratively teaches GPT-4 Vision how to prompt SDXL more effectively over time. Reported improvements include better text handling, stronger understanding of conversational context, and the ability to do image-to-image tasks by referencing specific example images. The results are described as approaching Dolly 3-level quality, and the method is presented as potentially applicable to other text-to-image models beyond SDXL.

Review Questions

  1. If an image contains conflicting instructions, what behavior should designers expect from GPT-4 Vision, and how could that affect prompt safety?
  2. How do open-source math agents like Torah change the way researchers and developers can test and deploy reasoning models?
  3. What practical advantage does sub-100ms multimodal inference (like Fuyu 8B) offer compared with slower image understanding systems?

Key Points

  1. 1

    GPT-4 Vision can prioritize text instructions embedded inside an image over the user’s typed prompt, even when the two conflict.

  2. 2

    The “image-first” behavior appears to be imperfect: under pressure, the model may acknowledge the visual content is handwritten text rather than the claimed image description.

  3. 3

    Torah is an open-source reasoning agent for math that is reported to score competitively with GPT-4 on GSM8K and can be run locally.

  4. 4

    Adapt AI Labs’ Fuyu 8B targets fast multimodal responses (under 100 milliseconds) and is described as small enough to potentially run on a phone.

  5. 5

    PlayHT’s “PlayHT 2.0 Turbo” emphasizes real-time speech with about 300ms low latency and supports voice cloning for natural-feeling conversations.

  6. 6

    AI music generation is splitting into different tradeoffs: 11 Labs emphasizes quality, while Refusion emphasizes ease of use and social discovery but with shorter outputs and heavier censorship.

  7. 7

    Microsoft Azure’s GPT-4 Vision + Stable Diffusion XL pipeline improves text-to-image and image-to-image results by teaching better prompting through iterative training.

Highlights

GPT-4 Vision reportedly follows instructions written inside an image even when they contradict the user’s prompt—suggesting “seeing” can override “telling.”
Torah’s open-source math reasoning performance is described as nearly matching GPT-4 on GSM8K, with both 34B and 70B variants highlighted.
PlayHT’s Turbo mode targets roughly 300ms latency, making AI speech feel responsive enough for conversation.
Fuyu 8B is pitched as a fast, phone-capable multimodal model that can transcribe handwriting and interpret graphs quickly.
Microsoft’s Azure AI team links GPT-4 Vision to Stable Diffusion XL to iteratively improve prompting, pushing image quality toward Dolly 3-level results.

Topics

Mentioned