How to OPTIMIZE your prompts for better Reasoning!
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Prompt quality in large language model (LLM) work depends heavily on context and input design, not just on the question itself. Microsoft’s new Prompt Wizard framework targets that bottleneck by treating prompt engineering as an optimization problem: it iteratively improves both the instruction text and the in-context learning (ICL) examples, using an LLM to critique outputs and then update what goes into the next run. The payoff is a more reliable way to generate task-specific prompts that can also produce stronger chain-of-thought reasoning traces, including synthetic reasoning data for training.
Prompt Wizard is built around three main ideas. First is feedback-driven refinement: an LLM generates critiques, and those critiques drive revisions to the prompt instructions and example sets. Second is joint optimization: the system searches for better combinations of instructions and diverse yet task-aligned synthetic examples rather than relying on a single static template. Third, and most notable, is self-generated chain-of-thought. Instead of only asking for an answer, the framework encourages the model to produce reasoning traces and then uses validation-style checks to improve problem solving. The end result is an “optimized prompt” package: refined instructions, a set of few-shot examples, reasoning/validation formatting, and an expert-like persona prompt tailored to the task.
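That “optimized prompt” package can be pictured as a small data structure. The sketch below is illustrative only: the field names, the `(question, reasoning, answer)` example shape, and the `<ANS_START>`/`<ANS_END>` marker wording are assumptions for this example, not Prompt Wizard’s actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizedPrompt:
    """Bundle of artifacts a prompt-optimization run might produce (illustrative)."""
    persona: str        # expert-like identity, e.g. "You are a math expert..."
    instruction: str    # the refined task instruction
    few_shot_examples: list = field(default_factory=list)  # (question, reasoning, answer) triples
    answer_format: str = "Wrap the final answer between <ANS_START> and <ANS_END>."

    def render(self, question: str) -> str:
        """Assemble the full prompt string sent to the model."""
        shots = "\n\n".join(
            f"Q: {q}\nReasoning: {r}\nA: {a}" for q, r, a in self.few_shot_examples
        )
        return (f"{self.persona}\n\n{self.instruction}\n\n{shots}\n\n"
                f"{self.answer_format}\n\nQ: {question}")
```

Rendering then just concatenates persona, instruction, examples, and answer-format directive ahead of the new question.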
Under the hood, the workflow is staged. Stage one refines the prompt instruction itself. The system starts with a base instruction (for example, a math-focused identity plus a minimal directive like “let’s think step by step”), then mutates the instruction using “thinking styles”, similar in spirit to prompt-mutation approaches such as DeepMind’s Promptbreeder. Each mutated version is scored by an evaluation loop. The LLM then critiques why certain mutations performed better, calling out issues like an unclear approach, missing goals, or insufficient structure, and synthesizes an improved instruction. This loop repeats for multiple iterations.
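The stage-one mutate/score/critique loop can be sketched as follows. This is a minimal illustration, not Prompt Wizard’s actual implementation: `llm` stands in for any chat-completion call, `score` for the task evaluation, and the prompt templates are invented for the example.

```python
def refine_instruction(base_instruction, thinking_styles, llm, score, iterations=3):
    """Iteratively mutate, score, and critique a task instruction (illustrative sketch)."""
    best, best_score = base_instruction, score(base_instruction)
    for _ in range(iterations):
        # 1) Mutate: rewrite the current best instruction under each "thinking style".
        mutants = [llm(f"Rewrite this instruction in the style '{style}':\n{best}")
                   for style in thinking_styles]
        # 2) Score each mutant on the evaluation set.
        scored = sorted(((score(m), m) for m in mutants), reverse=True)
        top_score, top = scored[0]
        # 3) Critique: ask the LLM why the winner beat the weakest mutant,
        #    then synthesize an improved instruction from that critique.
        critique = llm(f"Why does this instruction work better?\n"
                       f"Good: {top}\nWeak: {scored[-1][1]}")
        candidate = llm(f"Improve this instruction using the critique.\n"
                        f"Instruction: {top}\nCritique: {critique}")
        cand_score = score(candidate)
        # Keep whichever instruction scored highest so far.
        if cand_score >= max(top_score, best_score):
            best, best_score = candidate, cand_score
        elif top_score > best_score:
            best, best_score = top, top_score
    return best, best_score
```

Because `llm` and `score` are plain callables, the same loop shape works against any OpenAI-compatible endpoint and any eval metric.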
Stage two combines the improved instruction with ICL examples. The framework again critiques and synthesizes, selecting diverse examples from training data and generating additional synthetic ones when needed. The goal is not only correctness, but better reasoning traces that match the task’s intent and structure.
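One simple way to pick diverse yet task-aligned examples is greedy selection that favors candidates with the least word overlap against the examples already chosen. This heuristic is a stand-in for whatever selection logic the framework actually uses; it is shown only to make the “diverse examples” idea concrete.

```python
def select_diverse(examples, k):
    """Greedily pick k examples, each maximizing novelty vs. the picks so far."""
    def words(e):
        return set(e.lower().split())

    chosen = [examples[0]]  # seed with the first example
    while len(chosen) < k and len(chosen) < len(examples):
        covered = set().union(*(words(c) for c in chosen))

        def novelty(e):
            # Fraction of this example's words not already covered.
            w = words(e)
            return len(w - covered) / max(len(w), 1)

        remaining = [e for e in examples if e not in chosen]
        chosen.append(max(remaining, key=novelty))
    return chosen
```

A real system would likely use embedding distance rather than word overlap, but the greedy structure is the same.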
A concrete walkthrough uses the GSM8K dataset of grade-school math word problems. The setup includes environment variables for API access and a Hugging Face key to download the dataset. The optimization run uses a YAML-based configuration system and can target different model endpoints; the walkthrough notes using GPT-4o and warns that some models (such as o1) may not handle chain-of-thought generation as expected. During execution, the base instruction begins very simple, with five few-shot examples, then evolves into a much more structured prompt: it adds explicit steps such as comprehending the problem, extracting key information, planning strategically, executing with precision, verifying via cross-checking for logical consistency, and presenting the final answer in a specific “answer start/answer end” format.
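The “answer start/answer end” convention matters because it makes the final answer machine-checkable, which is what the evaluation loop needs. A minimal parser might look like the following; the exact marker strings `<ANS_START>`/`<ANS_END>` are assumed here for illustration.

```python
import re

def extract_answer(completion: str):
    """Pull the final answer out of a chain-of-thought completion.

    Assumes the prompt instructed the model to wrap its answer in
    <ANS_START>...<ANS_END> markers; returns None if they are missing.
    """
    match = re.search(r"<ANS_START>(.*?)<ANS_END>", completion, re.DOTALL)
    return match.group(1).strip() if match else None
```

Returning `None` for a missing marker lets the scorer count an unparseable completion as a failure instead of crashing the run.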
The practical message is that teams with existing evals and test sets can run Prompt Wizard to automatically produce stronger, model-specific prompting artifacts, often reducing the need for manual prompt iteration or heavier “agent” setups. The framework is released under an MIT license, includes code and scenarios, and is designed to be adapted to other APIs that follow OpenAI-compatible formats.
Cornell Notes
Prompt Wizard turns prompt engineering into an automated optimization loop. It refines both the instruction text and the in-context learning (ICL) examples by having an LLM generate critiques, score results, and synthesize improved prompts over multiple iterations. A key feature is self-generated chain-of-thought reasoning plus validation-style checks, aiming to improve problem-solving and produce synthetic reasoning traces for training. In a GSM8K math example, a minimal starting instruction (“let’s think step by step” plus a math expert identity) evolves into a detailed, structured prompt with explicit planning, execution, verification, and answer formatting. The approach matters because prompt quality strongly determines LLM output quality, and the best prompt often changes across models and fine-tuning runs.
- Why does prompt optimization matter more than simply choosing a strong model?
- What are the three core mechanisms behind Prompt Wizard’s optimization loop?
- How does Prompt Wizard refine the instruction before touching in-context examples?
- What changes in stage two once the instruction is improved?
- What does the GSM8K example show about the evolution of a prompt?
- Why might chain-of-thought prompting behave differently across model endpoints?
Review Questions
- How does Prompt Wizard use LLM-generated feedback to iteratively improve both instructions and example sets?
- Describe the difference between stage one (instruction refinement) and stage two (ICL example optimization) in Prompt Wizard’s workflow.
- In the GSM8K walkthrough, what specific prompt elements were added as the optimized instruction evolved from a minimal directive?
Key Points
1. Prompt quality for LLM tasks depends strongly on the context and inputs, so prompt optimization should be treated as a systematic process rather than manual guesswork.
2. Prompt Wizard automates prompt engineering by using an LLM to critique outputs, score performance, and synthesize improved instructions and examples in an iterative loop.
3. The framework optimizes instructions and in-context learning examples jointly, aiming for both correctness and better-structured reasoning traces.
4. Self-generated chain-of-thought plus validation-style checks are central to improving problem-solving and generating synthetic reasoning data.
5. A staged workflow first mutates and refines the instruction text, then selects and synthesizes ICL examples to strengthen reasoning and task alignment.
6. The GSM8K example demonstrates how a short starting prompt can evolve into a detailed, step-by-step math procedure with explicit verification and answer formatting.
7. Model endpoint behavior can affect chain-of-thought generation, so Prompt Wizard runs should be tested and configured per model/API.