ChatGPT Fails Basic Logic but Now Has Vision, Wins at Chess and Prompts a Masterpiece
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Language models still stumble on basic logical generalization, yet they can perform impressively on tasks that look like reasoning, from chess to image generation, highlighting a gap between "knowing facts" and reliably applying logic in new directions.
A central thread comes from a paper dubbed the "reversal curse," which finds that models often fail to generalize a simple implication: if a model learns that "A is B," it may not infer "B is A" in a new prompt. The transcript illustrates this with named-entity reversals: asked about Tom Cruise's mother, the system can identify her, but asked about the mother, it can't reliably name the famous son. A similar example uses Gabriel Macht: one chat correctly identifies Suzanne Pulier as Gabriel Macht's mother, but a follow-up chat asking for the famous son of Suzanne Pulier produces a wrong or confused answer (the transcript notes the model's attempt to connect her to Elon Musk via family relations) while failing to retrieve Gabriel Macht.
The same asymmetry shows up even when personal-data concerns are removed. The transcript describes tests with a real island in Norway named Hugo. When prompted with "Hugo Norway," the model can't provide the island's description or key facts like its length and county. But when the prompt supplies a descriptive phrase that matches the model's training associations and then asks for the name, the model can output "Hugo." That pattern suggests the model's knowledge is triggered by specific input-to-output mappings rather than by a stable, bidirectional understanding of the underlying entity relationships.
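A minimal sketch of this kind of forward-versus-reverse probe is shown below, assuming the OpenAI Python client (v1 style); the model name, the `ask` helper, and the exact prompts are illustrative stand-ins, not the queries used in the video.

```python
# Hedged sketch: probe the same fact in both directions.
# Assumes the OpenAI Python client (v1) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Forward direction: famous entity -> related fact (tends to succeed).
print(ask("Who is Tom Cruise's mother?"))

# Reverse direction: the same fact queried backwards (often fails).
print(ask("Who is Mary Lee Pfeiffer's famous son?"))
```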
Researchers connect this to how language models learn: training optimizes next-token prediction, which can make forward associations easier than backward inference. Neel Nanda (of Google DeepMind) is cited describing an input-output asymmetry: the model learns that tokens like "Mary Lee Pfeiffer" follow prompts about Tom Cruise's mother, but it does not represent the relationship as an equation-like link between variables that can be queried in either direction.
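To make the asymmetry concrete, here is a deliberately toy sketch: a word-level bigram counter (standing in for a real transformer) trained only on forward-ordered, invented text accumulates statistics for what follows "Cruise," but nothing conditioned on "Pfeiffer," so there is no learned path back to the famous son.

```python
# Toy illustration of input-output asymmetry in next-token training.
# A bigram counter stands in for a language model; the corpus is invented.
from collections import Counter, defaultdict

corpus = [
    "Tom Cruise 's mother is Mary Lee Pfeiffer",
    "Tom Cruise starred in Top Gun",
]

next_token = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        next_token[prev][nxt] += 1  # count forward transitions only

# Forward query: learned continuations after "Cruise".
print(next_token["Cruise"])    # Counter({"'s": 1, 'starred': 1})

# Reverse query: the training text never continues past "Pfeiffer",
# so nothing maps from the mother's name back to "Tom Cruise".
print(next_token["Pfeiffer"])  # Counter()
```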
At the same time, other results complicate the story. GPT-3.5 is reported to play chess at around 1800 Elo in a 150-game sample with extremely high move legality, and the transcript argues that such performance doesn't require memorizing every position. Yet counterfactual testing suggests the "reasoning" can be brittle: when chess problems are reformatted so that the same task depends less on memorized patterns, accuracy can drop toward random. Related work (cited from the Allen Institute for AI and multiple universities) finds transformers handle simpler compositional tasks well but fail as multi-step structure grows, implying that systematic problem-solving doesn't automatically emerge from maximum-likelihood training.
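For a concrete sense of what "move legality" measures, here is a minimal sketch assuming the third-party `python-chess` package; the `model_moves` list is a hypothetical stand-in for moves sampled from a language model, with the final move deliberately illegal.

```python
# Hedged sketch: count how many model-proposed SAN moves are legal in sequence.
# Assumes the `python-chess` package (pip install chess); the moves are invented.
import chess

model_moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "Qh5"]  # last move is illegal here

board = chess.Board()
legal = 0
for san in model_moves:
    try:
        board.push_san(san)  # raises ValueError for illegal or unparsable moves
        legal += 1
    except ValueError:
        break  # a real evaluation would log the failure and end the game
print(f"legal moves before first failure: {legal}/{len(model_moves)}")
```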
The transcript ends by placing these limitations alongside rapid capability gains: GPT-4-class systems can now take image input with GPT-4 Vision and produce "masterpiece" images through DALL·E 3, and the broader AGI timeline debate remains largely unchanged because researchers expect reasoning to be supported by more than one approach: tool use, retrieval, reinforcement-learning systems like MuZero, and efficient search methods.
In short: the systems can look like they reason, but they often generalize in one direction only, and their reliability depends heavily on how prompts line up with learned mappings and training-derived structure—an important constraint as AI moves toward more autonomous, logic-heavy applications.
Cornell Notes
The reversal curse highlights a core weakness in language models: they often fail to generalize simple logical implications in reverse. If a model can answer “A is B,” it may not correctly infer “B is A” when the prompt direction flips, even for non-personal facts like a Norwegian island. This behavior aligns with next-token training and an input-output asymmetry—models learn forward mappings that trigger the right outputs, not stable bidirectional relationships. Yet the same systems can still excel at tasks like chess and image generation, suggesting performance can come from pattern matching, compositional cues in prompts, or auxiliary mechanisms rather than consistent deductive logic. The practical takeaway is that “reasoning-like” performance can mask fragile generalization.
- What is the "reversal curse," and how does it show up in the examples?
- Why do the Hugo Norway tests matter if the model can't answer the description directly?
- How does next-token prediction relate to forward-vs-reverse inference?
- Why can chess performance look strong while deeper tests suggest fragile reasoning?
- What do compositional-reasoning and "Faith and Fate"-style results imply?
- How do researchers reconcile these reasoning limits with ongoing AGI optimism?
Review Questions
- In the reversal curse, what specific change in prompting direction causes the failure, and why does that reveal more than simple factual gaps?
- How do counterfactual task formats help distinguish memorization/pattern matching from genuine multi-step reasoning?
- What evidence in the transcript suggests that chain-of-thought prompting alone may not produce systematic reasoning?
Key Points
1. The reversal curse shows that language models often fail when asked to infer "B is A" after learning "A is B," revealing weak bidirectional generalization.
2. Entity-to-attribute knowledge can be triggered by descriptive prompts even when direct name-to-description queries fail, as illustrated by Hugo Norway.
3. Next-token training and input-output asymmetry help explain why forward associations can work while reverse inference breaks down.
4. High performance on tasks like chess can coexist with poor performance on counterfactual or compositional variants, indicating brittle reliance on prompt-aligned patterns.
5. Research on compositional tasks suggests transformers may excel at linearized pattern matching but struggle as multi-step structure becomes more complex.
6. AGI timelines remain contested because reasoning may come from hybrid systems (retrieval, tool use, decomposition, and reinforcement-learning/search) rather than from one model's deductive logic alone.