
How to OPTIMIZE your prompts for better Reasoning!

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Prompt quality for LLM tasks depends strongly on the context and inputs, so prompt optimization should be treated as a systematic process rather than manual guesswork.

Briefing

Prompt quality in large language model (LLM) work depends heavily on context and input design, not just the question itself. Microsoft’s new PromptWizard framework targets that bottleneck by treating prompt engineering as an optimization problem: it iteratively improves both the instruction text and the in-context learning (ICL) examples, using an LLM to critique outputs and then update what goes into the next run. The payoff is a more reliable way to generate task-specific prompts that can also produce stronger chain-of-thought-style reasoning traces, including synthetic reasoning data for training.

PromptWizard is built around three main ideas. First is feedback-driven refinement: an LLM generates critiques, and those critiques drive revisions to the prompt instructions and example sets. Second is joint optimization, where the system searches for better combinations of instructions and diverse yet task-aligned synthetic examples rather than relying on a single static template. Third, and most notable, is self-generated chain-of-thought steps: instead of only asking for an answer, the framework encourages the model to produce reasoning traces and then uses validation-style checks to improve problem solving. The end result is an “optimized prompt” package: refined instructions, a set of few-shot examples, reasoning/validation formatting, and an expert-like persona prompt tailored to the task.
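
That final package is easy to picture as a plain data structure. The sketch below is illustrative only; the field and function names are assumptions based on the components listed above, not PromptWizard’s actual types.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizedPrompt:
    """Illustrative container for the artifacts described above.
    Field names are assumptions, not PromptWizard's real API."""
    expert_persona: str                 # expert-like persona/system prompt
    instruction: str                    # refined task instruction
    few_shot_examples: list = field(default_factory=list)  # selected or synthetic ICL examples
    reasoning_format: str = ""          # chain-of-thought / validation formatting rules

def render(prompt: OptimizedPrompt, question: str) -> str:
    """Assemble the pieces into a single prompt string for the target model."""
    examples = "\n\n".join(prompt.few_shot_examples)
    return "\n\n".join([prompt.expert_persona, prompt.instruction,
                        examples, prompt.reasoning_format,
                        f"Question: {question}"])
```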

Under the hood, the workflow is staged. Stage one refines the prompt instruction itself. The system starts with a base instruction (for example, a math-focused identity plus a minimal directive like “let’s think step by step”), then mutates the instruction using “thinking styles”, similar in spirit to prompt-mutation approaches such as DeepMind’s Promptbreeder. Each mutated version is scored by an evaluation loop. The LLM then critiques why certain mutations performed better, calling out issues like an unclear approach, missing goals, or insufficient structure, and synthesizes an improved instruction. This loop repeats for multiple iterations, as sketched below.
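
A minimal sketch of that stage-one loop, assuming an `llm(prompt) -> str` callable and a `score` function backed by the evaluation set; these names are hypothetical stand-ins, not taken from the PromptWizard codebase.

```python
THINKING_STYLES = [
    "Let's think step by step.",
    "Break the problem into sub-goals before solving.",
    "State assumptions explicitly, then compute.",
]

def score(instruction: str, eval_set) -> float:
    """Stub: run the instruction over eval_set and return task accuracy."""
    raise NotImplementedError

def refine_instruction(base: str, eval_set, llm, iterations: int = 3) -> str:
    best = base
    for _ in range(iterations):
        # 1. Mutate the current best instruction with different thinking styles.
        candidates = [llm(f"Rewrite this instruction in the style '{s}':\n{best}")
                      for s in THINKING_STYLES]
        # 2. Score each mutation against the evaluation set; keep the leaders.
        leaders = sorted(candidates, key=lambda c: score(c, eval_set), reverse=True)[:2]
        # 3. Critique: why did the leaders win? (unclear approach, missing
        #    goals, insufficient structure, ...)
        critique = llm("Explain why these instructions performed best:\n" + "\n".join(leaders))
        # 4. Synthesize an improved instruction from the critique.
        best = llm(f"Using this critique, write one improved instruction:\n{critique}")
    return best
```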

Stage two combines the improved instruction with ICL examples. The framework again critiques and synthesizes, selecting diverse examples from training data and generating additional synthetic ones when needed. The goal is not only correctness, but better reasoning traces that match the task’s intent and structure.
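
Stage two can be sketched the same way. Again, the helper names below are hypothetical stand-ins for LLM calls and a diversity heuristic, not the framework’s real interface.

```python
def pick_diverse(pool: list, k: int) -> list:
    """Stub: choose up to k examples that cover distinct problem types."""
    return pool[:k]

def optimize_examples(instruction: str, train_pool: list, llm, k: int = 5) -> list:
    # Select diverse, task-aligned examples from the training data.
    examples = pick_diverse(train_pool, k)
    # Backfill with synthetic examples when the training pool is thin.
    while len(examples) < k:
        examples.append(llm(f"Write one new worked example for this task:\n{instruction}"))
    # Critique the set and revise weak reasoning traces so they match the
    # task's intended structure, not just the right final answers.
    critique = llm("Critique the reasoning quality of these examples:\n" + "\n\n".join(examples))
    return [llm(f"Revise this example per the critique:\n{critique}\n\n{ex}") for ex in examples]
```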

A concrete walkthrough uses the GSM8K dataset (math word problems). The setup includes environment variables for API access and a Hugging Face key to download the dataset. The optimization run uses a configuration system (YAML) and can target different model endpoints; the walkthrough notes using GPT-4o and warns that some models (like o1) may not handle chain-of-thought generation as expected. During execution, the base instruction begins very simple, with five few-shot examples, then evolves into a much more structured prompt: it adds explicit steps like comprehending the problem, extracting key information, planning strategically, executing with precision, verifying by cross-checking logical consistency, and presenting the final answer in a specific “answer start/answer end” format.
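
The setup side of that walkthrough amounts to something like the sketch below. The environment variable names and the `run_optimization` entry point are placeholders, not the repo’s actual ones; the real runs are driven by the project’s own YAML configs and scripts.

```python
import os

# Credentials the walkthrough expects in the environment. Variable names
# here are illustrative; check the repo's docs for the exact ones.
openai_key = os.environ["OPENAI_API_KEY"]  # target model endpoint (e.g., GPT-4o)
hf_token = os.environ["HF_TOKEN"]          # Hugging Face key to download GSM8K

# Runs are driven by YAML configs naming the dataset, the model endpoint,
# and iteration counts. `run_optimization` is a placeholder for the repo's
# own run script, not a real entry point.
def run_optimization(config_path: str) -> None:
    raise NotImplementedError

run_optimization("configs/gsm8k.yaml")
```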

The practical message is that teams with existing evals and test sets can run PromptWizard to automatically produce stronger, model-specific prompting artifacts, often reducing the need for manual prompt iteration or heavier “agent” setups. The framework is released under an MIT license, includes code and example scenarios, and is designed to be adapted to other APIs that follow OpenAI-compatible formats.

Cornell Notes

PromptWizard turns prompt engineering into an automated optimization loop. It refines both the instruction text and the in-context learning (ICL) examples by having an LLM generate critiques, score results, and synthesize improved prompts over multiple iterations. A key feature is self-generated chain-of-thought reasoning plus validation-style checks, aiming to improve problem-solving and produce synthetic reasoning traces for training. In a GSM8K math example, a minimal starting instruction (“let’s think step by step” plus a math expert identity) evolves into a detailed, structured prompt with explicit planning, execution, verification, and answer formatting. The approach matters because prompt quality strongly determines LLM output quality, and the best prompt often changes across models and fine-tuning runs.

Why does prompt optimization matter more than simply choosing a strong model?

LLM output quality is conditional on the context and inputs provided. Even as models improve, the “best” prompt can shift across model versions and fine-tuning runs. PromptWizard addresses this by optimizing the instruction and the ICL examples together, rather than relying on a one-size-fits-all template.

What are the three core mechanisms behind PromptWizard’s optimization loop?

(1) Feedback-driven refinement: the LLM critiques outputs and those critiques update prompts and examples. (2) Joint optimization: instructions and diverse, task-aligned synthetic examples are optimized together. (3) Self-generated chain-of-thought steps: the system incorporates reasoning traces and validation checks to improve solving and to generate training-like reasoning data.

How does PromptWizard refine the instruction before touching in-context examples?

Stage one starts with a base instruction (e.g., math expert identity plus “let’s think step by step”), then mutates it using “thinking styles.” Each mutated instruction is scored on task performance. The LLM then critiques what’s missing or unclear (such as goals, approach, or structure) and synthesizes a revised instruction. This repeats for multiple iterations.

What changes in stage two once the instruction is improved?

Stage two combines the refined instruction with ICL examples. The system critiques and synthesizes example sets, selecting diverse examples from training data and generating additional synthetic ones when needed. This strengthens both correctness and the quality/structure of chain-of-thought reasoning traces.

What does the GSM8K example show about the evolution of a prompt?

The run begins with a minimal directive and few-shot setup (five examples). Over iterations, the prompt expands into a structured workflow: comprehend the problem, extract key information, plan strategically, execute with precision (including math-specific handling like percentages and proportions), verify by cross-checking for logical consistency, and present the final answer using an “answer start/answer end” format. It also produces an expert persona/system-style prompt tailored to the dataset.
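
To make the end state concrete, an evolved instruction of the kind described might read like the paraphrase below. This is an illustration, not the verbatim output from the video, and the delimiter tokens shown are assumptions.

```
You are a mathematics expert who solves word problems rigorously.
1. Comprehend the problem and restate what is being asked.
2. Extract the key quantities and their relationships.
3. Plan a solution strategy before computing.
4. Execute with precision, handling percentages and proportions carefully.
5. Verify by cross-checking each step for logical consistency.
6. Present only the final answer between the delimiters:
<ANS_START> ... <ANS_END>
```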

Why might chain-of-thought prompting behave differently across model endpoints?

The walkthrough notes that some models (specifically o1) may not handle chain-of-thought generation as expected when instructed to produce different kinds of reasoning traces. That implies PromptWizard’s optimization may need endpoint-specific configuration and testing.
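
Because the framework targets OpenAI-compatible formats, endpoint-specific testing mostly means changing the base URL and model name. A minimal sketch with the `openai` Python client; the URL, key, and model names are placeholders:

```python
from openai import OpenAI

# Point the client at any OpenAI-compatible endpoint; values are placeholders.
client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="gpt-4o",  # swap per endpoint; some models (e.g., o1) restrict CoT-style instructions
    messages=[{"role": "user", "content": "Let's think step by step: what is 12 * 7?"}],
)
print(resp.choices[0].message.content)
```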

Review Questions

  1. How does PromptWizard use LLM-generated feedback to iteratively improve both instructions and example sets?
  2. Describe the difference between stage one (instruction refinement) and stage two (ICL example optimization) in PromptWizard’s workflow.
  3. In the GSM8K walkthrough, what specific prompt elements were added as the optimized instruction evolved from a minimal directive?

Key Points

  1. Prompt quality for LLM tasks depends strongly on the context and inputs, so prompt optimization should be treated as a systematic process rather than manual guesswork.

  2. PromptWizard automates prompt engineering by using an LLM to critique outputs, score performance, and synthesize improved instructions and examples in an iterative loop.

  3. The framework optimizes instructions and in-context learning examples jointly, aiming for both correctness and better-structured reasoning traces.

  4. Self-generated chain-of-thought plus validation-style checks are central to improving problem-solving and generating synthetic reasoning data.

  5. A staged workflow first mutates and refines the instruction text, then selects and synthesizes ICL examples to strengthen reasoning and task alignment.

  6. The GSM8K example demonstrates how a short starting prompt can evolve into a detailed, step-by-step math procedure with explicit verification and answer formatting.

  7. Model endpoint behavior can affect chain-of-thought generation, so PromptWizard runs should be tested and configured per model/API.

Highlights

PromptWizard treats prompt engineering as optimization: it iteratively mutates instructions, scores results, and uses LLM critiques to synthesize better prompts.
The framework explicitly targets chain-of-thought quality by generating reasoning traces and pairing them with validation-style checks.
In the GSM8K demo, a minimal “let’s think step by step” instruction expands into a full structured math workflow with extraction, planning, execution, verification, and strict answer formatting.
PromptWizard can generate synthetic in-context examples when training examples are limited or absent, strengthening task alignment without manual curation.