
25 crazy software bugs explained

Fireship · 5 min read

Based on Fireship's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Unsigned integer underflow can wrap values and invert intended behavior, turning one line of simple logic into a radically different outcome.

Briefing

A single line of bad logic can turn everyday software into real-world catastrophe—whether that means freezing a music player, wiping out millions in trading losses, or killing people in flight. Across 25 infamous bugs, the through-line is clear: small mistakes in assumptions, data handling, timing, or unit conversions can cascade into failures that are expensive, dangerous, or both.

The tour begins with “feature” bugs that start as harmless quirks in games and consumer devices. In Sid Meier’s Civilization, Gandhi’s aggression value is treated as an unsigned integer; when diplomacy reduces it, underflow wraps the value around to a maximum, flipping a pacifist into a “diabolical thermonuclear enthusiast.” Players embraced the chaos enough that it effectively became lore. Real systems were less forgiving. Microsoft’s Zune, its iPod knockoff, froze on New Year’s Eve because leap-year day counting wasn’t handled correctly, trapping the device in an infinite loop until someone removed the battery. On Intel Pentium chips, the infamous fdiv bug produced incorrect floating-point division results due to a flawed SRT division implementation with missing lookup-table entries: rare in practice, but serious enough to trigger major PR fallout.
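
The Zune freeze is a textbook infinite loop with an unreachable exit. Below is a minimal sketch reconstructing the widely reported driver logic (a simplified reconstruction, not the original source): on December 31 of a leap year the day counter sits at 366, and neither branch of the loop makes progress.

```c
#include <stdbool.h>
#include <stdio.h>

static bool is_leap_year(int year) {
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Convert days-since-Jan-1-1980 into a year by stripping whole years.
 * On December 31 of a leap year, days reaches 366: the while condition
 * holds, but the inner `days > 366` test fails, so nothing is ever
 * subtracted and the loop never exits. */
static int days_to_year(int days) {
    int year = 1980;
    while (days > 365) {
        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year++;
            }                  /* days == 366: no progress, device freezes */
        } else {
            days -= 365;
            year++;
        }
    }
    return year;
}

int main(void) {
    printf("%d\n", days_to_year(365));  /* fine; days_to_year(366) hangs */
    return 0;
}
```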

As the list moves into higher-stakes failures, the consequences scale quickly. A FaceTime group-call exploit on iPhone let a caller hear the recipient’s audio before the call was answered, and if the recipient pressed the power button to dismiss it, the camera could activate as well; a 14-year-old discovered the flaw, which Apple patched only after it went viral. In finance, a 2024 Chase ATM glitch let people withdraw large sums immediately after depositing fake checks, leading to lawsuits and potential criminal charges. Other failures weren’t about fraud but about brittle systems: AT&T’s long-distance network crashed in 1990 when one faulty switch rebooted and tripped the same fault in its neighbors, blocking 50 million calls. At airports, Heathrow Terminal 5’s baggage system broke down when multiple software systems failed to coordinate, causing 500+ canceled flights, 42,000 lost bags, and a $16 million fix.

The most dramatic disasters come from unit mistakes, timing errors, and security oversights. NASA’s Mars Climate Orbiter burned up after one team used imperial units while another used metric, breaking the mission’s calculations. An Ariane 5 rocket exploded in 1996 after a conversion error turned a 64-bit floating-point value into a 16-bit integer, sending the vehicle 90° off course. Heartbleed in 2014 exposed servers running vulnerable OpenSSL implementations through a missing bounds check in the TLS heartbeat extension, letting attackers repeatedly request memory contents—leaving roughly two-thirds of internet servers at risk.

The final stretch turns deadly. Toyota’s electronic throttle control and braking logic issues led to recalls, injuries, and deaths. In aviation and defense, software misbehavior contributed to crashes and combat tragedies: a Patriot battery’s timing error let a Scud strike through, killing 28 U.S. soldiers; display confusion and timing lag in the Aegis system contributed to a civilian airliner being shot down; and the Boeing 737 MAX disasters traced back to flawed sensor logic in the Maneuvering Characteristics Augmentation System (MCAS). Even medical devices weren’t safe: the Therac-25 radiation machine delivered lethal doses due to race conditions and the removal of mechanical interlocks.

The common lesson is less about “bugs happen” and more about how assumptions fail under real conditions—leap years, edge-case inputs, overflow, concurrency, inconsistent sensors, and mismatched units. The stakes rise when software controls money, infrastructure, aircraft, or human bodies, making testing, validation, and defensive design non-negotiable.

Cornell Notes

Software failures in the real world often start as small logic errors, like underflow, missing bounds checks, or unit mismatches, but they can cascade into massive financial losses, infrastructure outages, and even deaths. Examples include Gandhi’s unsigned-integer underflow in Civilization, Microsoft’s Zune freezing due to leap-year day handling, and the Pentium fdiv bug caused by missing lookup-table entries. Higher-stakes incidents show how timing and conversions can break missions (Mars Climate Orbiter’s imperial-vs-metric mismatch; Ariane 5’s floating-point-to-integer conversion error). Security bugs like Heartbleed demonstrate how a single missing bounds check in OpenSSL’s TLS heartbeat can expose sensitive memory across a large portion of the internet.

How can a “harmless” arithmetic mistake become a dramatic behavioral change, even in a game?

In Civilization, Gandhi’s aggression level is stored as an unsigned integer. When diplomacy reduces aggression below its floor, the subtraction underflows (1 - 2 wraps to 255 in 8-bit unsigned arithmetic). That wraparound effectively maxes out Gandhi’s aggression, turning a pacifist into an aggressive, destructive character, and players came to treat the bug as a feature.
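
A minimal sketch of the wraparound, assuming the commonly described 8-bit unsigned representation:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t aggression = 1;   /* Gandhi's low base aggression */
    aggression -= 2;          /* democracy subtracts 2: 1 - 2 wraps to 255 */
    printf("aggression = %u\n", (unsigned)aggression);  /* prints 255 */
    return 0;
}
```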

What made the Pentium fdiv bug so damaging despite being rare?

The Pentium fdiv issue involved floating-point division returning incorrect values. It traced to the SRT division algorithm, which speeds up division by using a lookup table to estimate quotient digits. The table should have held 1,066 entries, but five were missing, causing certain input combinations to produce wrong results at the hardware level. Even at Intel’s estimated rate of about 1 in 9 billion division operations, the correctness failure was serious enough to trigger major corporate and PR consequences.
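
No special tooling was needed to see the flaw. A check that circulated at the time uses one specific operand pair: algebraically the result is zero, but an affected Pentium returned a quotient wrong enough to leave a residue of roughly 256.

```c
#include <stdio.h>

int main(void) {
    /* Identity check: x - (x / y) * y should be ~0 for any x, y. */
    double x = 4195835.0;
    double y = 3145727.0;
    printf("residue = %g\n", x - (x / y) * y);  /* ~0 on a correct FPU,
                                                   ~256 on a flawed Pentium */
    return 0;
}
```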

Why did the FaceTime group-call exploit lead to both audio glitches and camera activation?

The exploit involved starting a FaceTime call, swiping up to add another person to create a group call, and then adding the attacker’s own number. FaceTime would incorrectly treat the group call as live, transmitting the recipient’s audio before they answered and enabling eavesdropping. If the recipient pressed the power button to dismiss the call, the system could activate the camera as well, suggesting the software wasn’t validating call state before switching audio/video behavior.
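
Apple’s code isn’t public, but the described behavior is what you would expect if a guard like the hypothetical one below were missing or bypassed, with media activation keyed to a confused group-call state instead of an explicit answer:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical call-state model; not Apple's actual code. */
typedef enum { CALL_RINGING, CALL_ANSWERED, CALL_DECLINED } call_state;

/* The guard the bug behaved as if were absent: media should flow only
 * after the recipient explicitly answers, no matter how the group-call
 * state got confused. */
static bool may_activate_media(call_state state) {
    return state == CALL_ANSWERED;
}

int main(void) {
    call_state state = CALL_RINGING;  /* recipient hasn't picked up */
    if (may_activate_media(state))
        puts("mic/camera on");
    else
        puts("media stays off until the call is answered");
    return 0;
}
```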

How do unit and conversion errors translate into mission-ending failures?

Mars Climate Orbiter (1999) failed because one team’s software produced thrust data in imperial units (pound-force seconds) while another’s expected metric (newton-seconds), breaking the trajectory calculations and costing $125 million. Ariane 5 (1996) exploded after a conversion error in its inertial reference system: a 64-bit floating-point value was forced into a 16-bit integer, making the rocket think it was 90° off course and triggering a catastrophic trajectory correction.
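
Both failure modes fit in a few lines. The sketch below (illustrative values, not mission data) shows the skipped unit factor and an Ariane-style narrowing conversion:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Unit mismatch: producing pound-force seconds where the consumer
     * expects newton seconds silently skews every downstream number. */
    double impulse_lbf_s = 100.0;                    /* illustrative value */
    double impulse_N_s   = impulse_lbf_s * 4.44822;  /* 1 lbf.s = 4.44822 N.s */
    printf("%.1f lbf.s = %.1f N.s\n", impulse_lbf_s, impulse_N_s);

    /* Ariane-style narrowing: forcing a wide float into int16_t. The
     * flight software raised an unhandled exception here; in C the
     * out-of-range conversion is undefined behavior. Either way the
     * guidance data is destroyed. */
    double horizontal_velocity = 70000.0;  /* illustrative, exceeds 32767 */
    int16_t narrowed = (int16_t)horizontal_velocity;
    printf("%.0f narrowed to %d\n", horizontal_velocity, narrowed);
    return 0;
}
```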

What exactly enabled Heartbleed, and why did it scale so broadly?

Heartbleed (2014) was caused by a missing bounds check in OpenSSL’s TLS heartbeat extension. Normally a heartbeat request carries a payload plus its length, and the server echoes the payload back. Without validation, an attacker could claim a payload length far larger than what was actually sent, and the server would reply with up to 64 KB of whatever sat in adjacent memory, request after request. Because about two-thirds of internet servers ran vulnerable OpenSSL, the bug’s impact was widespread and exploitation left little trace.
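
A simplified sketch of the pattern, not OpenSSL’s actual code: the reply length is taken from the attacker-controlled request instead of the bytes that actually arrived.

```c
#include <stdio.h>
#include <string.h>

/* Echo back `claimed_len` bytes without checking how many arrived. */
static void heartbeat(const unsigned char *buf, size_t actual_len,
                      size_t claimed_len, unsigned char *out) {
    /* The one-line fix, in spirit:
     * if (claimed_len > actual_len) return; */
    (void)actual_len;
    memcpy(out, buf, claimed_len);   /* BUG: trusts the attacker's length */
}

int main(void) {
    unsigned char memory[64];
    memcpy(memory, "hi??secret_key=hunter2??session=abc123??", 40);
    /* Only the first 2 bytes ("hi") are the real heartbeat payload;
     * the rest stands in for adjacent server memory. */
    unsigned char reply[64] = {0};
    heartbeat(memory, 2, 40, reply);          /* attacker claims 40 bytes */
    printf("server echoed: %.40s\n", reply);  /* leaks the adjacent data */
    return 0;
}
```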

Which patterns show up repeatedly in deadly systems failures?

Across military, medical, and automotive examples, the recurring patterns are timing and overflow problems (e.g., the Patriot battery’s 24-bit clock error after long continuous operation), inconsistent or faulty sensor inputs (e.g., the Boeing 737 MAX’s angle-of-attack sensor logic), and concurrency/race conditions (e.g., Therac-25 delivering lethal doses due to a race condition). In each case, the system lacked robust validation, redundancy, or safe-fail behavior when assumptions broke.
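
The Patriot arithmetic is easy to reproduce: 0.1 s has no exact binary representation, and a 24-bit fixed-point approximation loses about 9.5e-8 s on every tick, which compounds over long uptimes.

```c
#include <stdio.h>

int main(void) {
    double true_tenth   = 0.1;                    /* intended tick length */
    double stored_tenth = 209715.0 / 2097152.0;   /* 0.1 chopped to 24 bits */
    double ticks        = 100.0 * 3600.0 * 10.0;  /* tenth-second ticks in 100 h */
    double drift        = (true_tenth - stored_tenth) * ticks;
    /* ~0.3433 s of drift; a Scud covers hundreds of meters in that time,
     * enough to slip outside the predicted tracking gate. */
    printf("clock drift after 100 h: %.4f s\n", drift);
    return 0;
}
```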

Review Questions

  1. Pick one incident involving numeric representation (unsigned underflow, rounding/truncation, overflow, or unit conversion). Explain the specific representation mistake and the cascade effect it caused.
  2. Compare Heartbleed and the Ariane 5 failure: both involve data handling errors. What kind of data handling failed in each case (bounds checking vs numeric conversion), and what was the resulting impact?
  3. Choose a deadly-systems example (Therac-25, Patriot, 737 MAX, or Toyota). What safety mechanism failed—validation, redundancy, interlocks, or timing—and how did that failure translate into harm?

Key Points

  1. Unsigned integer underflow can wrap values and invert intended behavior, turning one line of simple logic into a radically different outcome.

  2. Leap-year and calendar handling bugs can freeze systems when loops never reach an exit condition.

  3. Rare hardware-level arithmetic errors (like missing lookup-table entries) can still trigger major real-world consequences when they affect correctness.

  4. Security vulnerabilities often come from missing bounds checks or state validation, enabling attackers to read memory or trigger unintended device behavior.

  5. Cascading failures frequently start with one component misbehaving (a switch reboot, a UI workflow confusion, or a misrouted update) and then propagate through dependencies.

  6. Unit mismatches and numeric conversion mistakes can destroy missions because downstream calculations assume consistent measurement systems and data types.

  7. In safety-critical software, lack of redundancy, interlocks, or robust handling of edge conditions can turn software defects into physical harm.

Highlights

  • Gandhi’s “pacifist” status in Civilization flipped because unsigned-integer underflow wrapped an aggression of 1 - 2 around to 255.
  • Heartbleed exploited a missing bounds check in OpenSSL’s TLS heartbeat extension, letting attackers repeatedly request server memory contents; roughly two-thirds of internet servers were vulnerable.
  • Mars Climate Orbiter burned up after an imperial-vs-metric unit mismatch broke its trajectory calculations, a $125 million error.
  • The Therac-25 radiation machine delivered lethal doses due to race conditions and the removal of mechanical interlocks, relying entirely on software safeguards.
  • The Patriot missile failure traced to a 24-bit clock error that accumulated over 100 hours of continuous operation, mistiming the tracking window; the missed intercept killed 28 American soldiers.

Topics

  • Unsigned Underflow
  • Leap Year Logic
  • Floating-Point Division
  • Unit Conversion
  • Security Vulnerabilities
  • Cascading Failures
  • Race Conditions

Mentioned

  • Robert Morris
  • Gandhi
  • SRT
  • TLS
  • Y2K
  • C++
  • RFID
  • MIT
  • NASA
  • F-35
  • OBOGS
  • MCAS