
Prompt Engineering Technique: Chain-of-Thought Prompting, Enabling the Reasoning Ability of LLMs

DataScienceChampion · 5 min read

Based on DataScienceChampion's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Chain-of-thought prompting can improve multi-step reasoning by encouraging intermediate steps rather than jumping straight to an answer.

Briefing

Chain-of-thought prompting can unlock stronger multi-step reasoning from large language models—often without changing model size—by explicitly steering the model to work through intermediate steps. The practical takeaway is straightforward: adding “let’s think step by step” can fix some reasoning failures, but harder tasks may still require few-shot examples that demonstrate step-by-step solutions.

The lesson starts by grounding chain-of-thought in how humans solve multi-step problems. When people tackle a word problem, they naturally break it into intermediate calculations before reaching a final answer. Large language models can mimic that behavior when prompts include examples of chain-of-thought reasoning. That framing matters because simply scaling up models has not consistently solved challenging categories like arithmetic, common-sense reasoning, and symbolic reasoning—areas where explicit reasoning structure can be the missing ingredient.

A series of demonstrations shows how prompt design affects outcomes. For a symbolic task—concatenating the last letters of “Amy Brown”—a zero-shot prompt fails until the instruction is augmented with “let’s think step by step.” After that change, the model correctly identifies the last letters (“y” and “n”) and returns “YN.”
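The two prompt variants can be sketched as plain strings; the wording below is illustrative rather than the transcript's verbatim prompts, and the helper function is just a deterministic reference for the answer the model should reach.

```python
# Zero-shot prompt for the last-letter task.
zero_shot = 'Take the last letters of the words in "Amy Brown" and concatenate them.'

# Zero-shot + chain-of-thought: the only change is the appended instruction.
zero_shot_cot = zero_shot + " Let's think step by step."

def last_letter_concat(name: str) -> str:
    """Deterministic reference for what the model should produce."""
    return "".join(word[-1] for word in name.split())

# Expected target once the model reasons through the steps: "y" + "n"
print(last_letter_concat("Amy Brown"))  # -> yn
```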

Another example—reversing “lollipop”—illustrates the limits of minimal prompting. Zero-shot fails, few-shot without chain-of-thought fails, and even “zero-shot + chain-of-thought” still produces an incorrect result. The breakthrough comes only when few-shot examples include step-by-step reasoning, after which the model successfully reverses the string. The pattern is iterative: try zero-shot first, then add chain-of-thought, and if performance remains poor, move to few-shot with explicit reasoning demonstrations.
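The escalation described above can be laid out as a ladder of prompt variants. The example wording is a sketch (the few-shot exemplar with "cat" is invented for illustration), and the reference function shows the correct target output.

```python
# Prompt-escalation ladder for the "lollipop" reversal task.
task = 'Reverse the string "lollipop".'

prompts = [
    # 1. zero-shot
    task,
    # 2. zero-shot + chain-of-thought
    task + " Let's think step by step.",
    # 3. few-shot with explicit step-by-step reasoning
    #    (the variant that finally succeeds in the lesson)
    (
        'Q: Reverse the string "cat".\n'
        "A: The letters are c, a, t. Reading them back to front gives "
        't, a, c. Answer: "tac".\n'
        f"Q: {task}\nA:"
    ),
]

def reverse_string(s: str) -> str:
    """Deterministic reference for the correct reversal."""
    return s[::-1]

print(reverse_string("lollipop"))  # -> popillol
```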

The lesson then shifts from generating answers to verifying them. A math word problem asks whether a student’s proposed solution is correct. Without explicit reasoning, the model agrees with the student, even though a manual check reveals a key mistake: the student uses $100 per square foot instead of the problem’s $10 per square foot, leading to an incorrect final cost. When prompted to “work out your own solution first, then compare,” the model produces a correct intermediate computation (using the correct $10 per square foot) and only then judges the student’s answer, correctly concluding that the student’s solution is wrong.
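The verification pattern can be sketched as a prompt template plus the arithmetic it should catch. The instruction text is paraphrased, and the installation area below is a hypothetical figure; only the $10 vs. $100 per-square-foot rates come from the lesson.

```python
# "Solve first, then compare" verification prompt (paraphrased).
verify_prompt = (
    "First work out your own solution to the problem. "
    "Then compare your solution to the student's solution and "
    "evaluate whether the student's solution is correct. "
    "Don't decide until you have done the problem yourself."
)

def maintenance_cost(square_feet: float, rate_per_sq_ft: float) -> float:
    """Maintenance cost at a given per-square-foot rate."""
    return square_feet * rate_per_sq_ft

area = 1000  # hypothetical installation size, for illustration only
correct = maintenance_cost(area, 10)   # the problem's $10 per sq ft
student = maintenance_cost(area, 100)  # the student's mistaken $100 per sq ft

# The tenfold rate error inflates the maintenance term tenfold.
print(correct, student)  # -> 10000.0 100000.0
```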

Finally, the lesson offers a rule-of-thumb for when chain-of-thought prompting is most likely to help. Benefits are strongest when the task is challenging and requires multi-step reasoning, when the model is large (the lesson cites roughly ≥100 billion parameters as a threshold for impressive results), and when scaling alone yields a relatively flat improvement curve—suggesting that explicit reasoning structure adds value beyond brute-force model size. It also notes why smaller models struggle: weaker arithmetic, inability to reliably construct final answers, and failures even on simpler symbol-mapping tasks. For those without access to very large models, the lesson points to fine-tuning approaches aimed at improving reasoning ability.

Cornell Notes

Chain-of-thought prompting improves large language model performance on tasks that require multi-step reasoning by encouraging intermediate steps. In simple symbolic mapping, adding “let’s think step by step” can turn an incorrect zero-shot output into a correct one (e.g., extracting last letters from “Amy Brown”). For harder tasks like reversing “lollipop,” zero-shot and even zero-shot with chain-of-thought may fail, while few-shot examples that include step-by-step reasoning succeed. For evaluation tasks, instructing the model to solve first and then compare against a student’s work forces more reliable verification. The approach works best with challenging reasoning tasks and sufficiently large models (the lesson cites ≥100B parameters).

Why does adding “let’s think step by step” help on some tasks but not others?

It helps when the task benefits from decomposing work into intermediate steps that the model can mirror. In the “Amy Brown” last-letter concatenation example, the model initially fails under zero-shot, then succeeds once the prompt explicitly requests step-by-step thinking, producing “YN.” But for “lollipop” reversal, zero-shot still fails, and even “zero-shot + chain-of-thought” remains incorrect—suggesting the model needs concrete step-by-step demonstrations, not just the instruction. The successful version uses few-shot examples where the reasoning steps are shown, indicating that harder tasks may require exemplars of the reasoning pattern.

What’s the iterative prompting strategy demonstrated for the “lollipop” reversal task?

The workflow is: try zero-shot first; if it fails, try few-shot; if it still fails, add chain-of-thought to zero-shot; if that fails, add chain-of-thought to few-shot. The transcript shows that zero-shot fails, few-shot fails, zero-shot with “let’s think step by step” still fails, and only few-shot with chain-of-thought examples produces the correct reversal. The key idea is to keep escalating prompt structure until the model reliably follows the reasoning pattern.
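This escalation loop can be expressed as a small driver function. It is a minimal sketch: `ask` stands in for any LLM call, and `is_correct` is whatever task-specific check is available; both names are placeholders, not an API from the lesson.

```python
from typing import Callable, Optional

def escalate(
    prompt_variants: list[str],
    ask: Callable[[str], str],
    is_correct: Callable[[str], bool],
) -> Optional[str]:
    """Try each prompt variant in order; return the first correct answer."""
    for prompt in prompt_variants:
        answer = ask(prompt)
        if is_correct(answer):
            return answer
    return None  # every variant failed; rethink the prompt design

# Toy demo with a fake model that only "succeeds" on the last variant.
variants = ["zero-shot", "few-shot", "zero-shot + CoT", "few-shot + CoT"]
fake_llm = lambda p: "popillol" if p == "few-shot + CoT" else "wrong"
print(escalate(variants, fake_llm, lambda a: a == "popillol"))  # -> popillol
```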

How does chain-of-thought prompting change behavior in solution-verification problems?

Without explicit reasoning, the model can rubber-stamp an incorrect student answer. In the student-solution check example, the model initially claims the student’s solution is correct, even though the student misapplies the maintenance cost rate ($100 per square foot instead of $10 per square foot). When instructed to “work out your own solution first, then compare,” the model computes the correct intermediate cost (using the correct $10 per square foot) and then contrasts it with the student’s result, arriving at the correct verdict that the student’s solution is wrong.

What rule-of-thumb conditions make chain-of-thought prompting most effective?

The lesson lists three conditions: (1) the task is challenging and needs multi-step reasoning; (2) the model is large enough—results are described as unimpressive below about 100B parameters; and (3) the scaling curve is relatively flat, meaning simply increasing model size doesn’t yield big gains, so explicit reasoning structure becomes more valuable. If one or more conditions don’t hold, chain-of-thought benefits shrink.

Why might smaller language models fail to benefit from chain-of-thought prompting?

The transcript points to several failure modes: small models can fail even on relatively easy symbol mapping tasks, have weaker arithmetic abilities, and may not reliably produce a final answer that can be constructed from intermediate reasoning. It also suggests that chain-of-thought success depends on emergent abilities tied to model scale, so prompting alone may not compensate for limited underlying capability.

Review Questions

  1. In the “lollipop” reversal example, which specific prompt variant finally succeeds, and what distinguishes it from the earlier attempts?
  2. What instruction changes the model from agreeing with an incorrect student solution to correctly rejecting it, and why does that instruction matter?
  3. According to the lesson’s rule of thumb, what three conditions determine when chain-of-thought prompting is most likely to work?

Key Points

  1. Chain-of-thought prompting can improve multi-step reasoning by encouraging intermediate steps rather than jumping straight to an answer.

  2. For some symbolic tasks, adding “let’s think step by step” to a zero-shot prompt can be enough to correct errors.

  3. Harder reasoning tasks may require few-shot examples that explicitly demonstrate step-by-step solutions, not just the instruction to think step by step.

  4. For verification tasks, forcing the model to solve first and then compare against a provided solution reduces the chance of rubber-stamping mistakes.

  5. Chain-of-thought is most effective when tasks are genuinely multi-step and challenging.

  6. The lesson cites roughly ≥100B parameters as a threshold for impressive chain-of-thought benefits; smaller models often show weaker arithmetic and symbol handling.

  7. If scaling model size alone yields limited gains, chain-of-thought prompting is more likely to add value beyond brute-force capacity.

Highlights

Adding “let’s think step by step” turns a failed zero-shot symbolic task (last-letter concatenation for “Amy Brown”) into a correct output (“YN”).
Reversing “lollipop” requires few-shot examples with step-by-step reasoning; zero-shot and zero-shot with chain-of-thought still fail.
In solution checking, instructing the model to compute its own solution first prevents agreement with a student’s incorrect maintenance-cost calculation.
Chain-of-thought prompting is framed as most useful for challenging multi-step tasks and sufficiently large models (around ≥100B parameters).

Mentioned

  • LLMs
  • GPT