Solving Wordle using information theory

3Blue1Brown · 6 min read

Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Model each Wordle feedback pattern as a partition of candidate answers, then score guesses by the expected information gained from that partition.

Briefing

Wordle can be treated as a problem in information theory: each color pattern (green/yellow/gray) functions as a measurement that reduces uncertainty about the hidden five-letter word. Building an optimal solver then becomes a matter of choosing guesses that maximize the expected information gained, quantified using entropy, so the remaining candidate set shrinks as efficiently as possible. The practical payoff is a strategy that performs well in simulations and explains why certain opening words tend to work better than others.

The approach starts with the game mechanics: a guess yields a pattern describing which letters appear and whether they match the correct positions. With roughly 13,000 valid guess words but only about 2,300 curated answer words, the solver’s job is to pick guesses that best narrow the answer space within six attempts. Early on, the method assumes all candidate answers are equally likely. For any proposed guess, it enumerates every possible feedback pattern and computes (1) the probability of each pattern occurring and (2) the information content of each pattern, measured in bits. Patterns that are likely to occur (like “all grays”) carry little information; rare patterns that split the candidate set sharply carry more. The solver then selects the guess with the highest expected information, i.e., the entropy of the pattern distribution.
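To make that scoring step concrete, here is a minimal Python sketch under the uniform-prior assumption; the helper names and the exact duplicate-letter handling are illustrative choices, not code from the video.

    from collections import Counter
    from math import log2

    def feedback(guess: str, answer: str) -> tuple:
        """Return the color pattern for a guess: 2 = green, 1 = yellow, 0 = gray."""
        pattern = [0] * 5
        unmatched = Counter()
        # First pass: mark greens and count answer letters that were not matched exactly.
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                pattern[i] = 2
            else:
                unmatched[a] += 1
        # Second pass: mark yellows, consuming leftover answer letters.
        for i, g in enumerate(guess):
            if pattern[i] == 0 and unmatched[g] > 0:
                pattern[i] = 1
                unmatched[g] -= 1
        return tuple(pattern)

    def expected_information(guess: str, candidates: list) -> float:
        """Entropy (in bits) of the feedback-pattern distribution under a uniform prior."""
        counts = Counter(feedback(guess, answer) for answer in candidates)
        n = len(candidates)
        return -sum((c / n) * log2(c / n) for c in counts.values())

Scoring every allowed guess this way and taking the maximum reproduces the "highest expected information" selection described above.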

This “version 1” solver repeatedly applies the same logic: after each feedback pattern, it restricts the candidate set to the words consistent with that pattern, then chooses the next guess whose pattern distribution has the highest entropy over the reduced set. In a uniform-probability world, entropy is essentially a fancy way of counting how many candidates remain, because all consistent words are treated as equally likely. Simulations across all possible Wordle answers yield an average score of about 4.124, which is respectable but leaves room for improvement, particularly in converting four-guess games into three-guess wins.
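A sketch of that loop, reusing the feedback and expected_information helpers above (the function names are illustrative):

    def consistent(candidates, guess, observed_pattern):
        """Keep only the answers that would have produced the observed pattern for this guess."""
        return [w for w in candidates if feedback(guess, w) == observed_pattern]

    def best_guess_v1(allowed_guesses, candidates):
        """Version 1: pick the guess whose pattern distribution has the highest entropy."""
        return max(allowed_guesses, key=lambda g: expected_information(g, candidates))

    # One round of play, after Wordle returns a pattern for the last guess:
    # candidates = consistent(candidates, last_guess, observed_pattern)
    # next_guess = best_guess_v1(allowed_guesses, candidates)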

The key upgrade is replacing uniform assumptions with a probability model based on English word frequencies. Using frequency data (from Mathematica’s word frequency function, sourced from Google Books Ngram), the solver assigns higher prior probability to common words and lower probability to obscure ones. A sigmoid cutoff turns raw frequency ranks into a smoother “likely vs unlikely” probability distribution, reflecting the intuition that very rare words should not dominate uncertainty even if they remain technically possible. With non-uniform probabilities, entropy no longer matches the raw count of remaining candidates; instead it reflects effective uncertainty, where many “extra” candidates contribute little because the model deems them unlikely.
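A rough sketch of that prior and the resulting weighted entropy; the sigmoid parameters here are placeholders rather than the values used in the video.

    from collections import defaultdict
    from math import exp, log2

    def sigmoid_prior(words_by_frequency, center=3000, width=500):
        """Turn frequency rank (0 = most common) into a prior: common words near 1, rare words near 0."""
        weights = {
            word: 1.0 / (1.0 + exp((rank - center) / width))
            for rank, word in enumerate(words_by_frequency)
        }
        total = sum(weights.values())
        return {word: w / total for word, w in weights.items()}

    def weighted_entropy(guess, candidates, prior):
        """Entropy of the pattern distribution when candidates are not equally likely."""
        mass = defaultdict(float)
        total = sum(prior[w] for w in candidates)
        for w in candidates:
            mass[feedback(guess, w)] += prior[w] / total
        return -sum(p * log2(p) for p in mass.values() if p > 0)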

A further refinement targets expected performance rather than pure information gain. After estimating how likely each candidate answer is, the solver links remaining uncertainty (in bits) to the expected number of additional guesses, using an empirical regression fit to data from earlier simulations. This “version 2” strategy improves average performance to about 3.6. Incorporating the true Wordle answer list as a prior pushes performance slightly further (around 3.43), and a deeper two-step lookahead suggests “crane” as the best opener in that framework. The overall conclusion is that no method can reliably average close to three guesses: two guesses simply cannot extract enough information to pin down the answer every time, so perfect third-guess certainty is out of reach.
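The bits-to-guesses link can be sketched as a simple fitted function; the functional form and coefficients below are placeholders standing in for the regression the video fits to its simulation data.

    def expected_remaining_guesses(bits: float) -> float:
        """Empirical map from remaining uncertainty (bits) to expected guesses still needed."""
        a, b = 1.0, 0.3   # hypothetical coefficients; in practice these come from regression
        return a + b * bits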

Cornell Notes

Wordle feedback patterns can be treated as information that reduces uncertainty about the hidden answer. A solver can score candidate guesses by computing the expected information gained (entropy in bits) from the distribution of possible color patterns. In a first version, all remaining candidate answers are assumed equally likely, so entropy roughly tracks the number of remaining possibilities. Performance improves when the solver uses a non-uniform prior based on English word frequencies, so unlikely candidates contribute less to uncertainty. A final refinement estimates expected remaining guesses from the current uncertainty, improving average scores in simulations to around 3.6.

How does a Wordle guess translate into “information” measured in bits?

Each guess produces a color pattern (greens/yellows/greys) that partitions the candidate answer set into subsets consistent with that feedback. For a given guess, the solver considers every possible pattern, computes the probability of each pattern (how often it would occur under its current model of answer likelihoods), and assigns information to each pattern as −log2(p). Rare patterns (small p) carry more bits; common patterns carry fewer. The guess’s overall score is the expected information: sum over patterns of p(pattern)·(−log2 p(pattern)), which is the entropy of the pattern distribution.
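A quick numerical check of the −log2(p) rule:

    from math import log2

    # A pattern matching 1/16 of the candidates carries -log2(1/16) = 4 bits;
    # one matching half of them carries only 1 bit.
    print(-log2(1 / 16))  # 4.0
    print(-log2(1 / 2))   # 1.0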

Why does maximizing entropy help, and why do “all grays” patterns tend to be poor?

Entropy rewards guesses whose possible outcomes are spread across many feedback patterns rather than concentrated in a single dominant one. “All grays” is often the most probable outcome for a guess, so observing it yields little information: it does not sharply reduce the candidate set. The solver therefore prefers guesses that split the candidate set into many small, comparably likely groups, so that whichever pattern comes back, the remaining possibilities shrink substantially.

What changes when the solver stops assuming all candidate answers are equally likely?

With a uniform prior, entropy largely corresponds to log2(number of consistent candidates). With a frequency-based prior, the solver assigns higher probability to common words and much lower probability to obscure ones. Then entropy reflects effective uncertainty rather than raw count: many low-probability candidates can remain consistent with the feedback but contribute little to uncertainty because their probabilities are tiny. This is why two patterns with the same number of matching words can have different entropies under a non-uniform model.
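A small illustration of that difference: the same number of remaining candidates can correspond to very different amounts of uncertainty once the prior is non-uniform.

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Four remaining candidates, uniform prior: log2(4) = 2 bits of uncertainty.
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
    # Same four candidates, but one common word dominates the prior: far less uncertainty.
    print(entropy([0.85, 0.05, 0.05, 0.05]))   # about 0.85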

How does the solver connect “bits of uncertainty” to the expected number of remaining guesses?

After incorporating non-uniform probabilities, the solver estimates expected score by combining (a) the probability that each candidate guess is itself the true answer and (b) the expected number of additional guesses needed if it is not. For the latter, it uses empirical data from version 1 simulations to relate uncertainty levels (in bits) to the average number of guesses needed to finish. A regression produces a function f that maps uncertainty to expected remaining guesses, enabling the solver to choose guesses that optimize expected game score rather than just expected information.
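Putting the pieces together, one way to score a guess by expected cost rather than pure information looks roughly like this; it reuses weighted_entropy and expected_remaining_guesses from the sketches above, and the exact combination rule is an assumption, not the video's formula.

    def expected_guesses_from_here(guess, candidates, prior, current_entropy):
        """Expected number of guesses from this point onward (lower is better)."""
        total = sum(prior[w] for w in candidates)
        p_correct = prior[guess] / total if guess in candidates else 0.0
        info = weighted_entropy(guess, candidates, prior)
        remaining_bits = max(current_entropy - info, 0.0)
        # If this guess is the answer, the game ends now (cost 1);
        # otherwise expect this guess plus f(remaining uncertainty) more.
        return p_correct * 1.0 + (1.0 - p_correct) * (1.0 + expected_remaining_guesses(remaining_bits))

The solver would evaluate this quantity for every allowed guess and pick the minimum.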

What performance results come from the different solver versions?

The uniform-prior “version 1” solver averages about 4.124 guesses across all possible Wordle answers. The frequency-informed “version 2” improves the average to around 3.6, with occasional cases requiring more than six guesses due to tradeoffs between maximizing information and optimizing expected score. Using the true Wordle answer list as part of the prior can reach about 3.43 in the best performance reported, and a deeper two-step search suggests “crane” as the best opener.

Why can’t an algorithm guarantee an average near three guesses?

The argument rests on information limits: even with optimal play, the amount of uncertainty that can be removed in two guesses is bounded by how finely the feedback patterns can distinguish words in the available set. A brute-force estimate suggests that after two ideal guesses, the best-case remaining uncertainty is still about one bit, roughly a 50-50 situation, so forcing the answer into the third slot every time is not feasible.
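A back-of-the-envelope version of that limit (the numbers are illustrative, rounded from the figures quoted above):

    from math import log2

    # Total uncertainty over roughly 2,300 possible answers:
    print(log2(2300))   # about 11.2 bits
    # If about 1 bit remains after two guesses, roughly two candidates are still
    # effectively in play, so the third guess is close to a coin flip.
    print(2 ** 1)       # 2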

Review Questions

  1. In the entropy-based scoring, what is the relationship between a pattern’s probability and its information value?
  2. How does a non-uniform prior (word frequency) change what entropy means compared with the uniform-prior case?
  3. What is the difference between choosing the next guess by maximizing expected information versus maximizing expected game score?

Key Points

  1. Model each Wordle feedback pattern as a partition of candidate answers, then score guesses by the expected information gained from that partition.

  2. Compute information in bits using −log2(p), and compute a guess’s quality as the expected value of that quantity over all possible feedback patterns.

  3. Uniform priors make entropy roughly equivalent to counting remaining candidates, but frequency-based priors make entropy reflect effective uncertainty instead of raw set size.

  4. A frequency-informed prior can be built from English word frequencies (e.g., via Google Books Ngram data) and converted into probabilities using a sigmoid-style cutoff.

  5. Improving beyond entropy-only selection requires mapping remaining uncertainty (bits) to expected remaining guesses, using simulation data and regression.

  6. Even with optimal lookahead, Wordle’s feedback structure limits how much uncertainty can be removed in two moves, preventing consistently solving in three guesses.

Highlights

Entropy turns Wordle feedback into a measurable quantity: rare feedback patterns carry more bits because they sharply reduce the candidate set.
With a non-uniform prior, entropy no longer equals log2(number of remaining candidates); low-probability “extra” candidates contribute little to uncertainty.
The solver’s best improvement comes from shifting from “maximize expected information” to “maximize expected score,” using an empirical relationship between bits of uncertainty and remaining guess count.
A deeper two-step search points to “crane” as the best opener, even though the earlier one-step entropy logic might suggest different candidates.
A brute-force information limit implies that after two optimal guesses, the remaining uncertainty can’t reliably be low enough to guarantee a third-guess win every time.
