Software Horror Stories | The Standup
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Software in both healthcare and entertainment carries a special kind of risk: when data is wrong or systems fail, the blast radius can be human, financial, or reputational. One developer’s time at Epic, working on Data Link (which turns insurance claims into medical records), produced “dev horror” moments where seemingly small mapping mistakes could have routed procedures to the wrong people. In one patient-safety escalation, a doctor reviewing a new mother’s chart noticed an external procedure labeled “circumcision” appearing about three-quarters of the way down the record. The root cause wasn’t that the procedure had occurred; it was that the system associated insurance claim IDs with the wrong patient context, so one patient’s chart could inherit procedure data from another patient’s claim—here, the mother’s chart picking up data from the baby’s claim. The team traced the issue, involved industry experts, and fixed the mapping logic before any harm occurred. A second scenario, also caught before it could matter, highlighted the same structural danger in transplant data: donor and recipient claim identifiers could be mixed, potentially pairing claims with the wrong individual.
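To make that failure mode concrete, here is a minimal hypothetical sketch, not Epic’s actual code: the data shapes, field names, and the subscriber-versus-patient distinction are all invented for illustration. It shows how a claim-to-chart mapper can attach a procedure to the wrong chart when it resolves patient context from the wrong identifier on the claim.

```typescript
// Hypothetical data shapes; Epic's real Data Link schema is not shown in the source.
interface Claim {
  claimId: string;
  subscriberId: string; // assumed: the policy holder (e.g. the mother)
  patientId: string;    // assumed: the person the procedure was billed for (e.g. the baby)
  procedureCode: string;
}

interface ChartEntry {
  chartOwnerId: string;
  procedureCode: string;
  sourceClaimId: string;
}

// Buggy version: resolves the chart from the policy holder, so the baby's
// procedure claim lands on the mother's chart.
function mapClaimToChartBuggy(claim: Claim): ChartEntry {
  return {
    chartOwnerId: claim.subscriberId, // wrong identifier: policy holder, not patient
    procedureCode: claim.procedureCode,
    sourceClaimId: claim.claimId,
  };
}

// Fixed version: the chart owner is the patient the claim is actually about.
function mapClaimToChart(claim: Claim): ChartEntry {
  return {
    chartOwnerId: claim.patientId,
    procedureCode: claim.procedureCode,
    sourceClaimId: claim.claimId,
  };
}
```

The transplant scenario has the same shape: once donor and recipient identifiers travel on the same claim, selecting the wrong field pairs the data with the wrong person.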
The stakes weren’t theoretical. Epic’s development process was built around preventing exactly these kinds of catastrophic data errors. Changes started in a shared main environment with large-scale test data, then moved through mandatory design documentation, sign-off code reviews, and staged QA environments (QA1 and QA2) that were deliberately detached from development. Reviewers had to be different across code review rounds, and significant changes could force the work back to earlier steps. Even deployment followed a patch-based model: hospitals received “special upgrades,” and if a fix needed to be shipped later across different Epic versions, the entire process could repeat from the appropriate branch—sometimes with auto-application, sometimes not.
Outside healthcare, other horror stories came from the mechanics of scaling and release discipline. One engineer described a government-contracting site where load testing accidentally triggered a cache clear at the same time as stress traffic. The cache-buster logic had been designed to clear only during low-traffic windows, but the test conditions caused a domino effect across dependent systems, taking the production site down for 30 to 45 minutes—long enough to stop logins and prevent payments. The failure wasn’t just capacity; it was cascading overload in a multi-layer architecture.
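The source doesn’t show the cache-buster code, but the failure mode it describes—a clear that was supposed to run only during a low-traffic window firing while stress traffic was hitting the system—suggests a guard along these lines. This is a hypothetical sketch; the window, threshold, and metrics interface are invented.

```typescript
// Hypothetical cache-clear guard; names, thresholds, and the metrics source are assumptions.
interface TrafficMetrics {
  requestsPerSecond: number;
}

const LOW_TRAFFIC_RPS = 50; // assumed "safe to clear" ceiling

function shouldClearCache(now: Date, metrics: TrafficMetrics): boolean {
  const hour = now.getUTCHours();
  const inMaintenanceWindow = hour >= 2 && hour < 4; // assumed low-traffic window
  // Checking live traffic as well as the clock keeps a load test (or any
  // unexpected spike) from wiping warm caches while downstream services are
  // already saturated, which is the cascade described above.
  return inMaintenanceWindow && metrics.requestsPerSecond < LOW_TRAFFIC_RPS;
}
```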
Another incident involved a database-driven pipeline for mapping soldiers to university class options. Despite extensive testing and war-room monitoring, the mapping logic started sending people to the wrong classes on the main signup day. Errors weren’t obvious via error rates; they surfaced only by inspecting data patterns and dashboards. With stored procedures and downstream dependencies, rollback wasn’t straightforward, so the team had to reverse-engineer the faulty logic while users kept enrolling.
In entertainment, a Netflix developer recounted a smaller but humiliating production lesson: a “static” variable bug in PHP logging behavior persisted longer than expected, embarrassing enough that a non-programmer boss had to point out the obvious. Later, a React/Next.js-era rewrite caused a major performance collapse: server-side rendering plus GraphQL fetching left containers handling only about 10 requests per second, traffic had to be shed, and the site became effectively unusable until the team fixed the bottleneck. Across all these stories, the recurring theme was clear: data correctness and release testing aren’t optional, because a system’s failure modes can turn a “small” change into a large outage or a dangerous misrouting of information.
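As a rough illustration of why request-time data fetching can cap throughput, consider a hypothetical server-side render that issues several GraphQL queries per request. This is not the Netflix code; the endpoint, queries, and timings are invented, and the point is only the shape of the cost.

```typescript
// Hypothetical SSR data fetching; endpoint and queries are invented for illustration.
async function fetchGraphQL(query: string): Promise<unknown> {
  const res = await fetch("https://example.internal/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  return res.json();
}

// Sequential request-time fetching: if each query takes ~100 ms, every render
// costs ~300 ms of wall time, and a container with a small worker pool tops out
// at a few dozen requests per second at best.
async function renderPageSequential(): Promise<unknown[]> {
  const user = await fetchGraphQL("{ viewer { id } }");
  const catalog = await fetchGraphQL("{ catalog { rows { id } } }");
  const recs = await fetchGraphQL("{ recommendations { id } }");
  return [user, catalog, recs];
}

// Cheaper shape: run independent queries concurrently (or cache/batch them),
// so render latency approaches the slowest single query instead of the sum.
async function renderPageConcurrent(): Promise<unknown[]> {
  return Promise.all([
    fetchGraphQL("{ viewer { id } }"),
    fetchGraphQL("{ catalog { rows { id } } }"),
    fetchGraphQL("{ recommendations { id } }"),
  ]);
}
```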
Cornell Notes
Epic’s Data Link work highlighted how insurance-claim identifiers can cause medical procedures to appear on the wrong patient charts—circumcision on a mother’s chart and potential donor/recipient transplant mix-ups—problems caught via patient-safety escalations before harm occurred. Epic’s safeguards relied on staged environments (main → QA1 → QA2), mandatory sign-offs, and different reviewers across code review rounds, plus patch-based “special upgrades” for hospital deployments. Other engineers described production failures from load testing that accidentally cleared caches and triggered cascading overload, and from database logic that mis-mapped users during a high-traffic signup event. In web releases, performance disasters came from server-side rendering and GraphQL fetching patterns that made each request far more expensive than expected, collapsing throughput to roughly 10 requests per second per container.
How can a medical records system accidentally attach procedures to the wrong person even when no procedure actually happens?
What makes Epic’s release process unusually strict compared with typical web teams?
Why did a load test take down a production site for 30–45 minutes?
How did a stored-procedure mapping pipeline fail even after extensive testing?
What went wrong in a Next.js/GraphQL-style rewrite that caused throughput to collapse?
Review Questions
- Which identifier-mapping mistake in the healthcare story could have routed procedures to the wrong patient, and how was it detected?
- Describe Epic’s staged QA and code review structure and explain why different reviewers matter.
- In the load-testing outage, what single operational action triggered cascading failures, and why did it matter when it occurred?
Key Points
1. Medical data systems can misroute procedures when insurance claim IDs are mapped into the wrong patient chart context, even if no real procedure occurs.
2. Epic’s safeguards relied on staged environments (main, QA1, QA2), mandatory approvals, and different code reviewers across review rounds to reduce the chance of dangerous data errors.
3. Patch-based deployments (“special upgrades”) mean fixes may require repeating the full process across version branches, especially when hospital systems aren’t on the same baseline.
4. Load testing can be dangerous when it triggers operational behaviors like cache clearing outside intended low-traffic windows, leading to cascading overload across dependent services.
5. Database-driven pipelines can fail in ways that won’t show up as obvious error rates; correctness may require targeted data spot checks and careful monitoring during peak events.
6. Performance rewrites using server-side rendering and GraphQL can collapse throughput if request-time data fetching becomes much more expensive than expected, especially when scaling assumptions don’t match production data sizes.