ChatGPT Fails Basic Logic but Now Has Vision, Wins at Chess and Prompts a Masterpiece
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Language models still stumble on basic logical generalization, yet they can perform impressively on tasks that look like reasoning, from chess to image generation, highlighting a gap between "knowing facts" and reliably applying logic in new directions.
A central thread comes from a paper dubbed the "reversal curse," which finds that models often fail to generalize a simple implication: if a model learns that "A is B," it may not infer "B is A" in a new prompt. The transcript illustrates this with named-entity reversals: asked about Tom Cruise's mother, the system can identify her, but asked about the mother, it can't reliably name the famous son. A similar example uses Gabriel Macht: one chat correctly identifies Suzanne Pulier as Gabriel Macht's mother, but a follow-up chat asking for the famous son of Suzanne Pulier produces a wrong or confused answer (the transcript notes the model's attempt to connect her to Elon Musk via family relations) while failing to retrieve Gabriel Macht.
The same asymmetry shows up even when personal-data concerns are removed. The transcript describes tests with a real island in Norway named Hugo. When prompted with "Hugo Norway," the model can't provide the island's description or key facts like its length and county. But when the prompt supplies a descriptive phrase that matches the model's training associations and then asks for the name, the model can output "Hugo." That pattern suggests the model's knowledge is triggered by specific input-to-output mappings rather than by a stable, bidirectional understanding of the underlying entity relationships.
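A minimal sketch of this kind of forward-versus-reverse probe is shown below, assuming the OpenAI Python client (v1 style); the model name, the `ask` helper, and the exact prompts are illustrative stand-ins, not the queries used in the video.

```python
# Hedged sketch: probe the same fact in both directions.
# Assumes the OpenAI Python client (v1) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Forward direction: famous entity -> related fact (tends to succeed).
print(ask("Who is Tom Cruise's mother?"))

# Reverse direction: the same fact queried backwards (often fails).
print(ask("Who is Mary Lee Pfeiffer's famous son?"))
```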
Researchers connect this to how language models learn: training optimizes next-token prediction, which can make forward associations easier than backward inference. Neel Nanda (of Google DeepMind) is cited describing an input-output asymmetry: the model learns that tokens like "Mary Lee Pfeiffer" follow prompts about Tom Cruise's mother, but it does not represent the relationship as an equation-like link between variables that can be queried in either direction.
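To make the asymmetry concrete, here is a deliberately toy sketch: a word-level bigram counter (standing in for a real transformer) trained only on forward-ordered, invented text accumulates statistics for what follows "Cruise," but nothing conditioned on "Pfeiffer," so there is no learned path back to the famous son.

```python
# Toy illustration of input-output asymmetry in next-token training.
# A bigram counter stands in for a language model; the corpus is invented.
from collections import Counter, defaultdict

corpus = [
    "Tom Cruise 's mother is Mary Lee Pfeiffer",
    "Tom Cruise starred in Top Gun",
]

next_token = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        next_token[prev][nxt] += 1  # count forward transitions only

# Forward query: learned continuations after "Cruise".
print(next_token["Cruise"])    # Counter({"'s": 1, 'starred': 1})

# Reverse query: the training text never continues past "Pfeiffer",
# so nothing maps from the mother's name back to "Tom Cruise".
print(next_token["Pfeiffer"])  # Counter()
```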
At the same time, other results complicate the story. GPT-3.5 is reported to play chess at around 1800 Elo in a 150-game sample with extremely high move legality, and the transcript argues that such performance doesn't require memorizing every position. Yet counterfactual testing suggests the "reasoning" can be brittle: when chess problems are reformatted so that the same task depends less on memorized patterns, accuracy can drop toward random. Related work (cited from the Allen Institute for AI and multiple universities) finds transformers handle simpler compositional tasks well but fail as multi-step structure grows, implying that systematic problem-solving doesn't automatically emerge from maximum-likelihood training.
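For a concrete sense of what "move legality" measures, here is a minimal sketch assuming the third-party `python-chess` package; the `model_moves` list is a hypothetical stand-in for moves sampled from a language model, with the final move deliberately illegal.

```python
# Hedged sketch: count how many model-proposed SAN moves are legal in sequence.
# Assumes the `python-chess` package (pip install chess); the moves are invented.
import chess

model_moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "Qh5"]  # last move is illegal here

board = chess.Board()
legal = 0
for san in model_moves:
    try:
        board.push_san(san)  # raises ValueError for illegal or unparsable moves
        legal += 1
    except ValueError:
        break  # a real evaluation would log the failure and end the game
print(f"legal moves before first failure: {legal}/{len(model_moves)}")
```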
The transcript ends by placing these limitations alongside rapid capability gains: GPT-4-class systems can now take image input with GPT-4 Vision and produce "masterpiece" images through DALL·E 3, and the broader AGI timeline debate remains largely unchanged because researchers expect reasoning to be supported by more than one approach: tool use, retrieval, reinforcement-learning systems like MuZero, and efficient search methods.
In short: the systems can look like they reason, but they often generalize in one direction only, and their reliability depends heavily on how prompts line up with learned mappings and training-derived structure—an important constraint as AI moves toward more autonomous, logic-heavy applications.
Cornell Notes
The reversal curse highlights a core weakness in language models: they often fail to generalize simple logical implications in reverse. If a model can answer “A is B,” it may not correctly infer “B is A” when the prompt direction flips, even for non-personal facts like a Norwegian island. This behavior aligns with next-token training and an input-output asymmetry—models learn forward mappings that trigger the right outputs, not stable bidirectional relationships. Yet the same systems can still excel at tasks like chess and image generation, suggesting performance can come from pattern matching, compositional cues in prompts, or auxiliary mechanisms rather than consistent deductive logic. The practical takeaway is that “reasoning-like” performance can mask fragile generalization.
- What is the "reversal curse," and how does it show up in the examples?
- Why do the Hugo Norway tests matter if the model can't answer the description directly?
- How does next-token prediction relate to forward-vs-reverse inference?
- Why can chess performance look strong while deeper tests suggest fragile reasoning?
- What do compositional-reasoning and "Faith and Fate"-style results imply?
- How do researchers reconcile these reasoning limits with ongoing AGI optimism?
Review Questions
- In the reversal curse, what specific change in prompting direction causes the failure, and why does that reveal more than simple factual gaps?
- How do counterfactual task formats help distinguish memorization/pattern matching from genuine multi-step reasoning?
- What evidence in the transcript suggests that chain-of-thought prompting alone may not produce systematic reasoning?
Key Points
1. The reversal curse shows that language models often fail when asked to infer "B is A" after learning "A is B," revealing weak bidirectional generalization.
2. Entity-to-attribute knowledge can be triggered by descriptive prompts even when direct name-to-description queries fail, as illustrated by Hugo Norway.
3. Next-token training and input-output asymmetry help explain why forward associations can work while reverse inference breaks down.
4. High performance on tasks like chess can coexist with poor performance on counterfactual or compositional variants, indicating brittle reliance on prompt-aligned patterns.
5. Research on compositional tasks suggests transformers may excel at linearized pattern matching but struggle as multi-step structure becomes more complex.
6. AGI timelines remain contested because reasoning may come from hybrid systems (retrieval, tool use, decomposition, and reinforcement-learning/search) rather than from one model's deductive logic alone.