
Software Horror Stories | The Standup

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Medical data systems can misroute procedures when insurance claim IDs are mapped into the wrong patient chart context, charting a procedure on a patient who never underwent it.

Briefing

Software in healthcare and entertainment both carries a special kind of risk: when data is wrong or systems fail, the blast radius can be human, financial, or reputational. One developer’s time at Epic, working on Data Link (which turns insurance claims into medical records), produced “dev horror” moments where seemingly small mapping mistakes could have routed procedures to the wrong people. In one patient-safety escalation, a doctor reviewing a new mother’s chart noticed an external procedure labeled “circumcision” about three-quarters of the way down the record. The root cause wasn’t that she had undergone the procedure; the system associated insurance claim IDs with the wrong patient context, so procedure data billed on a shared claim could land on the mother’s chart instead of the baby’s. The team traced the issue, involved industry experts, and fixed the mapping logic before any harm occurred. A second scenario, also caught before it could matter, highlighted the same structural danger in transplant data: donor and recipient claim identifiers could be mixed, potentially pairing claims with the wrong individual.

The stakes weren’t theoretical. Epic’s development process was built around preventing exactly these kinds of catastrophic data errors. Changes started in a shared main environment with large-scale test data, then moved through mandatory design documentation, sign-off code reviews, and staged QA environments (QA1 and QA2) that were deliberately detached from development. Reviewers had to be different across code review rounds, and significant changes could force the work back to earlier steps. Even deployment followed a patch-based model: hospitals received “special upgrades,” and if a fix needed to be shipped later across different Epic versions, the entire process could repeat from the appropriate branch—sometimes with auto-application, sometimes not.

Outside healthcare, other horror stories came from the mechanics of scaling and release discipline. One engineer described a government-contracting site where load testing accidentally triggered a cache clear at the same time as stress traffic. The cache-buster logic had been designed to clear only during low-traffic windows, but the test conditions caused a domino effect across dependent systems, taking the production site down for 30 to 45 minutes—long enough to stop logins and prevent payments. The failure wasn’t just capacity; it was cascading overload in a multi-layer architecture.

Another incident involved a database-driven pipeline for mapping soldiers to university class options. Despite extensive testing and war-room monitoring, the mapping logic started sending people to the wrong classes on the main signup day. Errors weren’t obvious via error rates; they surfaced only by inspecting data patterns and dashboards. With stored procedures and downstream dependencies, rollback wasn’t straightforward, so the team had to reverse-engineer the faulty logic while users kept enrolling.

In entertainment, a Netflix developer recounted a smaller but humiliating production lesson: a bug involving a PHP “static” variable (which keeps its value across function calls) distorted logging behavior and lingered far longer than it should have, embarrassing enough that a non-programmer boss had to point out the obvious. Later, a React/Next.js-era rewrite caused a major performance collapse: server-side rendering plus GraphQL fetching left containers handling only about 10 requests per second, traffic was shed, and the site became effectively unusable until the team fixed the bottleneck. Across all these stories, the recurring theme was clear: data correctness and release testing aren’t optional, because a system’s failure modes can turn a “small” change into a large outage or a dangerous misrouting of information.
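The video doesn’t detail the static-variable bug itself, but the failure class is well known: a PHP function-level static keeps its value across calls. Below is an invented TypeScript analogue, not the actual Netflix code, showing how module-level state in a long-lived server leaks across requests the same way:

```typescript
// Invented illustration (the real bug was in PHP): module-level state in
// a long-lived server persists across requests, just as a PHP `static`
// variable persists across function calls.

let lastLoggedUser: string | null = null; // lives for the whole process

function logRequest(user: string, message: string): void {
  // Intended as "don't repeat the user banner for consecutive lines",
  // but the flag outlives any single request, so interleaved requests
  // from different users inherit stale state and mislabel their logs.
  if (user !== lastLoggedUser) {
    console.log(`--- ${user} ---`);
    lastLoggedUser = user;
  }
  console.log(message);
}

logRequest("alice", "login ok");
logRequest("bob", "login ok");  // resets the banner state...
logRequest("alice", "logout");  // ...so alice is re-bannered mid-session
```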

Cornell Notes

Epic’s Data Link work highlighted how insurance-claim identifiers can cause medical procedures to appear on the wrong patient charts—circumcision on a mother’s chart and potential donor/recipient transplant mix-ups—problems caught via patient-safety escalations before harm occurred. Epic’s safeguards relied on staged environments (main → QA1 → QA2), mandatory sign-offs, and different reviewers across code review rounds, plus patch-based “special upgrades” for hospital deployments. Other engineers described production failures from load testing that accidentally cleared caches and triggered cascading overload, and from database logic that mis-mapped users during a high-traffic signup event. In web releases, performance disasters came from server-side rendering and GraphQL fetching patterns that made each request far more expensive than expected, collapsing throughput to roughly 10 requests per second per container.

How can a medical records system accidentally attach a procedure to a patient who never underwent it?

The Data Link system ingested insurance claims and mapped them into medical chart context. In one escalation, a doctor saw an “external procedure” labeled “circumcision” on a new mother’s chart shortly after childbirth. The team traced it to how insurers bill procedures against claim IDs: the system placed the procedure data onto the chart tied to the wrong claim context. The fix prevented procedures from being recorded under the wrong patient identity, avoiding real-world harm.
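A minimal sketch of that failure class, using entirely hypothetical types rather than Epic’s actual data model: when one claim covers both mother and newborn, routing chart updates by claim ID alone can attach a procedure to the wrong chart.

```typescript
// Hypothetical model of the bug class; none of these names are Epic's.

interface ClaimLine {
  claimId: string;
  patientId: string;     // who the procedure was actually performed on
  procedureCode: string; // billed procedure identifier
}

interface Chart {
  patientId: string;
  procedures: string[];
}

const charts = new Map<string, Chart>([
  ["mother-1", { patientId: "mother-1", procedures: [] }],
  ["baby-1", { patientId: "baby-1", procedures: [] }],
]);

// BUG: assumes one claim ID belongs to exactly one patient chart.
const chartByClaim = new Map<string, string>([["claim-42", "mother-1"]]);

function attachProcedureBuggy(line: ClaimLine): void {
  const chartId = chartByClaim.get(line.claimId); // ignores line.patientId
  if (chartId) charts.get(chartId)?.procedures.push(line.procedureCode);
}

function attachProcedureFixed(line: ClaimLine): void {
  // Resolve the chart from the patient on the claim line itself, so a
  // procedure performed on the baby can never land on the mother's chart.
  charts.get(line.patientId)?.procedures.push(line.procedureCode);
}
```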

What makes Epic’s release process unusually strict compared with typical web teams?

Epic’s workflow emphasized preventing life-or-death data errors through multiple gates: development happened in a shared main environment with large-scale test data, then changes required approvals and code reviews with sign-off. QA1 ran in a detached environment with regression tests and scripted validation, and QA2 repeated review in another detached environment. Code review rounds required different developers, and significant QA-requested changes could kick work back to earlier stages. Deployments used patch-like “special upgrades,” and fixes often had to be re-applied across version branches.
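Two invariants in that workflow, distinct reviewers across rounds and kick-back on significant changes, translate naturally into tooling guards. This sketch is illustrative only, with invented names; it is not Epic’s actual system:

```typescript
// Illustrative guards for two process invariants; not Epic's tooling.

type Stage = "design" | "codeReview" | "qa1" | "qa2" | "release";

interface ChangeRecord {
  stage: Stage;
  reviewers: string[]; // one entry per completed review round
}

function recordReview(change: ChangeRecord, reviewer: string): void {
  // Each review round must be done by someone who hasn't reviewed before.
  if (change.reviewers.includes(reviewer)) {
    throw new Error(`${reviewer} already reviewed an earlier round`);
  }
  change.reviewers.push(reviewer);
}

function flagSignificantChange(change: ChangeRecord): void {
  // A significant QA-requested change restarts the gated flow.
  change.stage = "design";
  change.reviewers = [];
}
```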

Why did a load test take down a production site for 30–45 minutes?

The team load-tested while a cache-buster mechanism cleared cached content. Cache clearing was normally restricted to low-traffic periods because clearing at the wrong time can be catastrophic. During the test, the cache clear coincided with stress traffic, overwhelming downstream systems in a domino effect. The result was loss of login and inability to complete payments, plus knock-on failures for dependent systems.
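The implied guard gates cache clears on live load as well as the clock. A hedged sketch, with invented window and threshold values:

```typescript
// Sketch only: the quiet-hours window and RPS threshold are invented.

const LOW_TRAFFIC_WINDOW = { startHourUtc: 2, endHourUtc: 4 };
const MAX_RPS_FOR_CLEAR = 50;

function canClearCache(now: Date, currentRps: number): boolean {
  const hour = now.getUTCHours();
  const inWindow =
    hour >= LOW_TRAFFIC_WINDOW.startHourUtc &&
    hour < LOW_TRAFFIC_WINDOW.endHourUtc;
  // Checking live load matters as much as the clock: clearing a cache
  // under heavy traffic sends a stampede of misses to origin systems.
  return inWindow && currentRps <= MAX_RPS_FOR_CLEAR;
}

// canClearCache(new Date(), 1200) -> false: refuse the clear under load.
```

The second condition is the one this incident argues for: a load test running inside the “quiet” window still produces exactly the stress traffic that makes a cold cache catastrophic.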

How did a stored-procedure mapping pipeline fail even after extensive testing?

A database pipeline mapped soldiers to university class options based on profile attributes such as zip code and status. Confidence from testing and UAT didn’t prevent a logic path that only manifested at peak signup volume. On the main signup day, the mappings started routing users to the wrong classes. The errors weren’t obvious from error rates; the team had to spot-check the data with pre-written queries and then reverse-engineer the faulty logic while users kept enrolling.
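The detection tactic, checking domain invariants in the output data rather than watching error rates, might look like this sketch. The field names and the zip-to-region rule are invented; the real checks were pre-written queries run against the live mappings.

```typescript
// Spot-check sketch: assert a domain invariant over pipeline output.

interface Assignment {
  soldierId: string;
  zipCode: string;
  classId: string;
  classRegion: string;
}

// Hypothetical invariant: a soldier's class should sit in the region
// implied by their zip code.
function regionForZip(zip: string): string {
  return zip.startsWith("9") ? "west" : "east"; // toy rule
}

function findSuspectAssignments(rows: Assignment[]): Assignment[] {
  return rows.filter((r) => regionForZip(r.zipCode) !== r.classRegion);
}

const suspects = findSuspectAssignments([
  { soldierId: "s1", zipCode: "90210", classId: "c7", classRegion: "east" },
]);
// Mis-mappings like this raise no exceptions and no HTTP errors; only a
// data-level check surfaces them.
if (suspects.length > 0) console.warn(`suspect mappings: ${suspects.length}`);
```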

What went wrong in a Next.js/GraphQL-style rewrite that caused throughput to collapse?

The rewrite used server-side rendering and GraphQL fetching patterns that effectively made each request far heavier than expected. GraphQL’s nested data fetching meant the system resolved many promises and pulled large portions of the data graph before rendering. In production, each container could handle only about 10 requests per second, so scaling couldn’t keep up, traffic was shed, and the site became unusable until the bottleneck was corrected.
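A toy model of the bottleneck, with invented latencies and fan-out: nested, request-time resolution serializes I/O on every render, and batching is the first lever.

```typescript
// Toy model only: latencies and fan-out are invented. Each SSR render
// resolves a user, a 20-item list, and one detail record per item.

async function fetchOne(latencyMs: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, latencyMs));
}

async function renderPageNaive(): Promise<void> {
  await fetchOne(5); // user
  await fetchOne(5); // list
  for (let i = 0; i < 20; i++) {
    await fetchOne(5); // N+1: one awaited detail fetch per row
  }
  // ~110 ms of serialized I/O per render; add CPU-heavy rendering and
  // capped container concurrency, and per-container throughput can sag
  // toward the ~10 req/s the story describes.
}

async function renderPageBatched(): Promise<void> {
  await fetchOne(5); // user
  await fetchOne(5); // list
  // Batching collapses 100 ms of serialized waits into ~5 ms of
  // parallel waiting.
  await Promise.all(Array.from({ length: 20 }, () => fetchOne(5)));
}
```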

Review Questions

  1. Which identifier-mapping mistake in the healthcare story could have routed procedures to the wrong patient, and how was it detected?
  2. Describe Epic’s staged QA and code review structure and explain why different reviewers matter.
  3. In the load-testing outage, what single operational action triggered cascading failures, and why did it matter when it occurred?

Key Points

  1. Medical data systems can misroute procedures when insurance claim IDs are mapped into the wrong patient chart context, charting a procedure on a patient who never underwent it.
  2. Epic’s safeguards relied on staged environments (main, QA1, QA2), mandatory approvals, and different code reviewers across review rounds to reduce the chance of dangerous data errors.
  3. Patch-based deployments (“special upgrades”) mean fixes may require repeating the full process across version branches, especially when hospital systems aren’t on the same baseline.
  4. Load testing can be dangerous when it triggers operational behaviors like cache clearing outside intended low-traffic windows, leading to cascading overload across dependent services.
  5. Database-driven pipelines can fail in ways that won’t show up as obvious error rates; correctness may require targeted data spot checks and careful monitoring during peak events.
  6. Performance rewrites using server-side rendering and GraphQL can collapse throughput if request-time data fetching becomes much more expensive than expected, especially when scaling assumptions don’t match production data sizes.

Highlights

A doctor’s chart review surfaced a mapping bug where “circumcision” appeared on a new mother’s record due to insurance-claim ID context, not because she had undergone the procedure.
Epic’s multi-stage QA (QA1 and QA2) and mandatory sign-offs were designed to prevent catastrophic data correctness failures from reaching hospitals.
A cache-buster clearing during a load test triggered a domino effect that took a production site down for 30–45 minutes.
A stored-procedure mapping system for class enrollment passed testing but routed users to the wrong classes on signup day, forcing reverse-engineering under war-room pressure.
A Next.js-era rewrite with GraphQL server-side rendering collapsed container throughput to roughly 10 requests per second, making the site effectively unusable until fixed.

Topics

  • Medical Data Mapping
  • Release Engineering
  • Load Testing Outages
  • Database Stored Procedures
  • GraphQL Performance
  • Server-Side Rendering
