The Tech that’s *probably* inside GPT-5 just got Open Sourced!

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude 3 Haiku can be pushed toward Claude 3 Opus-like quality using an open-source prompting pipeline that generates a task-specific system prompt from Claude 3 Opus outputs.

Briefing

Large language models don’t improve only by training ever-larger weights; many of the biggest gains come from “extracting” more capability out of models people already have. A viral open-source notebook tied to Matt Schumer’s work claims a smaller Claude 3 variant (Claude 3 Haiku) can be pushed close to Claude 3 Opus quality by feeding it carefully constructed examples and then generating a system prompt that makes the small model behave like the top performer. The practical pitch is straightforward: similar output quality, but far lower cost and latency, because the heavy lifting happens through prompting and example generation rather than by paying for the most expensive model on every request.

The notebook’s workflow starts with a task description plus a single input/output example given to Claude 3 Opus. From that seed, the repository generates a diverse set of additional examples, then uses the task and those examples to produce a system prompt suitable for Claude 3 Haiku (and other small or large models). It also saves the resulting system prompt and examples into a Python file formatted for generation—aimed at developers who want to drop the technique into their own products quickly. A concrete example in the transcript imagines building an AI story-writing website: instead of paying for Claude 3 Opus or GPT-4-class APIs for every user request, the approach targets Claude 3 Haiku while claiming Opus-level results, making the product cheaper to run for both the builder and end users.
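The workflow above can be sketched in a few lines. This is an illustrative outline only, not the notebook's actual code: `call_opus`, `generate_examples`, and `build_system_prompt` are hypothetical names, and the model call is stubbed so the flow runs without an API key (a real version would call the Claude 3 Opus API at that point).

```python
def call_opus(prompt: str) -> str:
    # Stub standing in for a real Claude 3 Opus API call.
    return "EXAMPLE_INPUT -> EXAMPLE_OUTPUT"

def generate_examples(task: str, seed_example: str, n: int = 3) -> list[str]:
    """Ask the strong model for n additional examples in the seed's shape."""
    return [
        call_opus(
            f"Task: {task}\nSeed example:\n{seed_example}\n"
            "Write one more diverse input/output example."
        )
        for _ in range(n)
    ]

def build_system_prompt(task: str, examples: list[str]) -> str:
    """Fold the task and generated examples into a system prompt
    for the smaller model (Claude 3 Haiku in the video's example)."""
    joined = "\n\n".join(examples)
    return (
        f"You are an expert at the following task:\n{task}\n\n"
        f"Follow the style of these examples:\n{joined}"
    )

task = "Write a short story opening in the user's requested genre."
seed = "Input: noir\nOutput: Rain hammered the neon sign outside my office."
examples = generate_examples(task, seed)
system_prompt = build_system_prompt(task, examples)
```

The key design point is that the expensive model runs only during this one-time setup; production requests hit the cheap model with the saved `system_prompt`.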

That “distillation without retraining” theme is reinforced by another open-source technique called Quiet Star, attributed to Bindu Reddy. Quiet Star pushes reasoning into the generation process by having the model generate internal rationales—essentially token-by-token “inner monologues”—and then using a reward mechanism to teach which rationales lead to better outcomes. Reported results include a jump for a 7B model on common-sense question answering from 36% to 47%, alongside doubled math performance. The transcript also notes that while Quiet Star is open source and can be applied after a model is already trained (including to Claude 3 and ChatGPT), it costs more to run because it adds extra reasoning tokens.

The transcript then connects these ideas into a compounding stack: Quiet Star-style internal thinking, plus example-driven prompting that teaches a small model how to act, plus prompt-level Chain-of-Thought prompting that forces stepwise planning. The endgame is less about waiting for GPT-5 or Claude 4 to be released and more about using existing models as teachers—turning their capabilities into reusable prompting patterns and agent workflows.

Finally, the transcript points to Schumer’s open-source “Claude investor” agent, described as a constrained system that chains multiple Claude 3 calls to gather financial data, analyze sentiment and trends, and rank stocks with price targets—while explicitly warning it isn’t financial advice. The recurring message is that value can be extracted through orchestration and prompting, even when the underlying model weights remain closed. Open sourcing these prompting and agent patterns, the transcript argues, could accelerate adoption widely—because developers can implement the techniques without waiting for new frontier model releases.

Cornell Notes

The transcript argues that large-model performance can be improved dramatically without retraining the base model—by extracting capability through prompting, example generation, and reasoning scaffolds. A key example is an open-source notebook attributed to Matt Schumer that uses Claude 3 Opus to generate a system prompt and diverse examples so a smaller Claude 3 Haiku can perform close to Opus quality at lower cost and latency. Another technique, Quiet Star (Bindu Reddy), adds token-level “inner monologues” during generation and uses reward learning to favor better rationales, reporting gains like 36%→47% on common-sense QA for a 7B model and doubled math performance. The transcript suggests these methods can be compounded with Chain-of-Thought prompting and agent orchestration to push small models toward near top-tier behavior.

How can Claude 3 Haiku be made to approach Claude 3 Opus quality without training a new model?

The approach described uses an open-source notebook that starts with Claude 3 Opus given (1) a task description and (2) one input/output example. Claude 3 Opus then generates additional diverse examples similar in structure to the seed. Those examples plus the task description are used to create a system prompt that’s fed to Claude 3 Haiku (and can work with other small or large models). The claim is that the small model’s behavior becomes aligned with the top model’s style and problem-solving pattern, yielding near-Opus quality at a fraction of the cost and latency.

What exactly does Quiet Star change during generation, and why does it help reasoning?

Quiet Star changes the generation process by making the model produce internal rationales—an “inner monologue”—token by token. At each step, it mixes predictions with and without these rationales, then a reward mechanism teaches the model to prefer the rationales that lead to better outcomes. The transcript reports that this improved a 7B model’s common-sense QA from 36% to 47% (an 11-percentage-point increase) and doubled math performance, while noting it costs more to run because it generates extra reasoning tokens.
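The mixing step can be illustrated with a toy calculation. The numbers and the fixed mixing weight below are made up for illustration; in the actual method the blend between the with-rationale and without-rationale predictions is learned, not hand-set.

```python
# Toy sketch of Quiet Star-style mixing: blend the next-token distribution
# produced without a rationale with the one conditioned on an inner monologue.
vocab = ["Paris", "London", "Rome"]
p_base = [0.40, 0.35, 0.25]       # prediction without reasoning tokens
p_rationale = [0.80, 0.15, 0.05]  # prediction after an internal rationale
w = 0.6                           # mixing weight (learned in the real method)

p_mixed = [(1 - w) * b + w * r for b, r in zip(p_base, p_rationale)]
total = sum(p_mixed)
p_mixed = [p / total for p in p_mixed]  # renormalize to a valid distribution

best = vocab[p_mixed.index(max(p_mixed))]  # the rationale sharpens the choice
```

A useful rationale pulls probability mass toward the better answer, and the reward signal reinforces rationales that do so.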

Why does the transcript emphasize “compounding” techniques like Quiet Star, example prompting, and Chain of Thought?

The core idea is that each technique targets a different bottleneck in getting good outputs. Quiet Star improves reasoning by forcing internal planning during token generation. Example-driven prompting (the Claude Opus→Haiku method) teaches a smaller model how to behave for a specific task using generated examples and a system prompt. Chain-of-Thought prompting forces step-by-step planning at the prompt level. Combined, the transcript suggests the model can think more before answering and also has better task-specific guidance, potentially multiplying gains.
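The prompt-level part of this stack is easy to show concretely. Below is a minimal sketch of stacking an example-driven system prompt with a Chain-of-Thought instruction; the function names and wording are illustrative, not taken from the notebook.

```python
def chain_of_thought_prompt(question: str) -> str:
    """Prompt-level Chain of Thought: ask for explicit steps before answering."""
    return (
        f"{question}\n\n"
        "Think step by step. List each step of your reasoning, "
        "then give the final answer on a line starting with 'Answer:'."
    )

def combined_prompt(system_prompt: str, question: str) -> str:
    """Stack the task-specific system prompt with the CoT instruction."""
    return f"{system_prompt}\n\n{chain_of_thought_prompt(question)}"

prompt = combined_prompt(
    system_prompt="You write story openings. Follow the provided examples.",
    question="Write a two-sentence opening for a mystery story.",
)
```

Each layer is independent: the system prompt supplies task behavior, the CoT suffix forces stepwise planning, and a Quiet Star-style model would add internal rationales underneath both.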

What trade-offs come with these reasoning-heavy methods?

The transcript highlights two main trade-offs. First, Quiet Star-style approaches are more expensive to run because they add additional tokens for inner rationales. Second, agentic workflows that chain many model calls can become “token-eating monstrosities,” even if they improve output quality. The proposed mitigation is using smaller models for most calls (e.g., Haiku) and relying on techniques that reduce the number of expensive calls.
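The economics behind that mitigation are back-of-the-envelope arithmetic. The per-million-token prices below are illustrative placeholders (real prices vary by provider and change over time); the point is the ratio, not the exact figures.

```python
# Hypothetical input-token prices per million tokens, for illustration only.
PRICE_PER_MTOK = {"opus": 15.00, "haiku": 0.25}

def cost(model: str, tokens: int) -> float:
    """Dollar cost of pushing `tokens` input tokens through `model`."""
    return PRICE_PER_MTOK[model] * tokens / 1_000_000

# An agent run that burns 2M tokens across chained calls:
all_opus = cost("opus", 2_000_000)
# Mitigation: route 95% of the traffic to the small model.
mostly_haiku = cost("haiku", 1_900_000) + cost("opus", 100_000)
```

Under these placeholder prices the routed version is over an order of magnitude cheaper, which is why the transcript frames small-model prompting as the way to make token-hungry agents affordable.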

How do the described agents fit into the broader theme of extracting value from existing models?

Agents are presented as orchestration layers that turn model capability into task execution. The transcript cites Matt Schumer’s open-source “Claude investor” agent as an example: it’s constrained (behavior tightly controlled), chains multiple Claude 3 calls to gather financial data and analyze sentiment/trends, ranks stocks by investment potential, and outputs price targets—while warning it’s not financial advice. The broader theme is that even with closed model weights, developers can extract value through prompting patterns, constraints, and multi-step tool use.
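The chaining pattern itself is simple to sketch. This is not the Claude investor's code: the step names are hypothetical and `ask_model` is a stub, where a real agent would make a constrained Claude 3 API call at each step.

```python
def ask_model(step: str, context: str) -> str:
    # Stub standing in for a constrained Claude 3 call at each step.
    return f"[{step}] based on: {context[:40]}"

def run_agent(ticker: str) -> dict:
    """Chain three model calls: gather data, analyze it, then rank."""
    data = ask_model("gather_data", f"Fetch recent financials for {ticker}")
    sentiment = ask_model("analyze_sentiment", data)
    ranking = ask_model("rank_and_target", sentiment)
    return {"data": data, "sentiment": sentiment, "ranking": ranking}

report = run_agent("ACME")
```

Each call's output feeds the next call's context, which is both the source of the agent's capability and the reason token counts multiply across the chain.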

Review Questions

  1. What are the two inputs used to start the Opus→Haiku prompting pipeline, and how does the system prompt get produced?
  2. How does Quiet Star’s token-level inner monologue differ from prompt-level Chain-of-Thought prompting?
  3. What kinds of cost increases are mentioned for reasoning and agent workflows, and what strategies are suggested to manage them?

Key Points

  1. Claude 3 Haiku can be pushed toward Claude 3 Opus-like quality using an open-source prompting pipeline that generates a task-specific system prompt from Claude 3 Opus outputs.

  2. The Opus→Haiku method begins with a task description plus one input/output example, then expands that into diverse examples and uses them to construct the system prompt.

  3. Quiet Star improves reasoning by generating token-level internal rationales and using a reward mechanism to reinforce the rationales that lead to better results.

  4. Reported Quiet Star gains include a 7B model improving common-sense QA from 36% to 47% and doubling math performance, at the cost of higher inference compute.

  5. Chain-of-Thought prompting is treated as a prompt-level planning scaffold, while Quiet Star is treated as a generation-time reasoning scaffold; both can be combined with example-driven prompting.

  6. Agent systems (like a constrained Claude investor agent) can chain multiple model calls to execute tasks, but they can become expensive due to token volume.

  7. A recurring thesis is that open-sourcing prompting/agent patterns can let developers extract more value from existing models without waiting for new frontier releases.

Highlights

A prompting pipeline claims Claude 3 Haiku can be made nearly as capable as Claude 3 Opus by using Claude 3 Opus once to generate diverse examples and a system prompt, then running the smaller model for production.
Quiet Star adds token-by-token “inner monologues” and uses reward learning to select better rationales; the transcript reports 36%→47% common-sense QA gains on a 7B model and doubled math performance.
The transcript’s central strategy is compounding: internal reasoning (Quiet Star) + task-specific example prompting (Opus→Haiku) + stepwise planning prompts (Chain of Thought) to squeeze more performance out of smaller models.
