I Survived A DDOS
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A startup’s app was hit by a sustained denial-of-service (DoS) campaign that first showed up as sudden, unexplained slowness—then escalated into massive request floods, connection exhaustion, and repeated “back again” waves over roughly two weeks. The core lesson wasn’t just that DoS attacks happen; it was that mitigation depends on fast operational muscle memory, tight logging/forensics, and the willingness to iterate rules as attackers adapt.
The trouble began on a Wednesday evening in mid-June 2023, when the team noticed the app becoming unusable shortly after a recent release. Slack activity and load balancer alerts pointed to a traffic spike: the system was receiving around 10x its normal peak traffic, even though no new deployment explained it. Investigation in AWS tooling revealed the API Gateway scaling out, and the team suspected a DoS. They turned on a Web Application Firewall (WAF) rule set that had been under consideration during security certification planning. Once enabled, the WAF helped quickly, but its initial rules filtered out only about 10% of the traffic, which still left enough volume to hurt; the team then used WAF request sampling and access logs to identify the dominant sources.
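The sampling step is concrete enough to sketch. Below is a minimal example (not the team's actual tooling) of pulling AWS WAF sampled requests with boto3 and tallying the top source countries and client IPs; the web ACL ARN and rule metric name are placeholders, and a regional scope is assumed since the ACL fronts an ALB/API Gateway.

```python
# Sketch: tally top source countries/IPs from WAF sampled requests.
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

end = datetime.now(timezone.utc)
resp = wafv2.get_sampled_requests(
    WebAclArn="arn:aws:wafv2:us-east-1:123456789012:regional/webacl/app-acl/EXAMPLE",  # placeholder ARN
    RuleMetricName="rate-limit-rule",   # hypothetical rule metric name
    Scope="REGIONAL",                   # REGIONAL for an ALB/API Gateway web ACL
    TimeWindow={"StartTime": end - timedelta(hours=1), "EndTime": end},
    MaxItems=500,
)

countries = Counter(s["Request"]["Country"] for s in resp["SampledRequests"])
client_ips = Counter(s["Request"]["ClientIP"] for s in resp["SampledRequests"])
print("Top source countries:", countries.most_common(5))
print("Top client IPs:", client_ips.most_common(5))
```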
Early samples showed most suspicious traffic originating from North Korea, Russia, and China—regions the team didn’t serve. They created a geographic rule and the DoS stopped. Afterward, the team planned follow-ups: export logs for querying in AWS Athena and add alerts tied to WAF allow/block outcomes. For a short window, the incident looked contained.
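The geographic rule itself is simple to express in AWS WAF terms. A sketch of what such a rule could look like follows; the rule name, priority, and exact country list are assumptions based on the regions mentioned above, and the dict would be appended to the web ACL's Rules list (for example via update_web_acl, which also needs the current lock token from get_web_acl).

```python
# Hedged sketch of a WAFv2 geo-match block rule (names/priority are placeholders).
geo_block_rule = {
    "Name": "block-non-served-regions",
    "Priority": 0,
    "Statement": {
        # ISO country codes: North Korea, Russia, China
        "GeoMatchStatement": {"CountryCodes": ["KP", "RU", "CN"]}
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,    # keeps request sampling available
        "CloudWatchMetricsEnabled": True,  # feeds the allow/block alerting mentioned above
        "MetricName": "block-non-served-regions",
    },
}
```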
Then the attack returned. A second wave brought staggering numbers—hundreds of millions of requests within an hour—and attempts to block non-US traffic failed because the attacker redirected traffic and kept probing. The team found a consistent “keep-alive” header pattern across sampled requests that didn’t match their own front-end behavior. That clue led to identifying a Slowloris-style approach: overwhelming the target by maintaining many simultaneous HTTP connections. Even when WAF alerts showed heavy blocking, enough traffic still slipped through to keep the database and endpoints overwhelmed.
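Hunting for a shared header signature in WAF samples can be scripted. This is an illustrative sketch only: the exact header pattern the team matched on is not given in the source, so the Connection/keep-alive signature below is a placeholder, and the function expects the SampledRequests list from the earlier get_sampled_requests call.

```python
# Sketch: count sampled requests per client IP that carry a suspect header value.
from collections import Counter

SUSPECT_HEADER = ("connection", "keep-alive")   # hypothetical signature

def header_signature_counts(sampled_requests):
    """Tally client IPs whose sampled requests match the suspect header."""
    hits = Counter()
    for sample in sampled_requests:
        headers = {
            h["Name"].lower(): h["Value"].lower()
            for h in sample["Request"].get("Headers", [])
        }
        if headers.get(SUSPECT_HEADER[0]) == SUSPECT_HEADER[1]:
            hits[sample["Request"]["ClientIP"]] += 1
    return hits

# e.g. header_signature_counts(resp["SampledRequests"]).most_common(10)
```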
Mitigation shifted toward broader containment: blocking non-authenticated traffic, adding rules for known malicious IP ranges (including “onion network”/Tor-class traffic and other bad-reputation sources), and patching a front-end/back-end issue that allowed requests to reach the API Gateway and backend in ways that worsened the flood. A hotfix removed the problematic path and tightened enforcement, which ended the immediate flood, but the campaign kept returning for nearly two weeks in a pattern of daily morning surges. Eventually, it stopped as abruptly as it began.
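For the “known bad sources” layer, AWS ships managed rule groups covering IP reputation and anonymizer/Tor traffic. Whether the team used these exact groups is an assumption; the sketch below only shows how such rules are declared (names and priorities are placeholders).

```python
# Hedged sketch: AWS-managed rule groups for bad-reputation and anonymizer/Tor traffic.
# Managed rule group references take OverrideAction rather than Action.
reputation_rules = [
    {
        "Name": "aws-ip-reputation",
        "Priority": 1,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesAmazonIpReputationList",
            }
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "aws-ip-reputation",
        },
    },
    {
        "Name": "aws-anonymous-ip",   # covers Tor exit nodes, VPNs, hosting proxies
        "Priority": 2,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesAnonymousIpList",
            }
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "aws-anonymous-ip",
        },
    },
]
```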
In the aftermath, the team emphasized practical preparedness: use a cloud provider with a WAF, keep load balancer/access logs enabled, learn Athena-based log querying, and rehearse incident response so rules can be applied during an attack. The takeaway was blunt—there’s no guaranteed way to prevent DoS, but there are ways to survive it without burning out or losing the ability to operate when the next wave hits.
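Rehearsing the Athena piece might look like the following: a query for the top client IPs in the last hour of load balancer access logs, submitted via boto3. This assumes the ALB logs are already mapped to an Athena table (here called alb_logs, following the AWS documentation example); the database and S3 output location are placeholders.

```python
# Sketch: submit an Athena query for the noisiest client IPs in the last hour.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

TOP_TALKERS_SQL = """
SELECT client_ip,
       count(*)                         AS requests,
       count_if(elb_status_code >= 500) AS server_errors
FROM alb_logs
WHERE parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
      > now() - interval '1' hour
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 25;
"""

athena.start_query_execution(
    QueryString=TOP_TALKERS_SQL,
    QueryExecutionContext={"Database": "default"},                            # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)
```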
Cornell Notes
The incident started as sudden app slowness and quickly turned into a DoS attack that overwhelmed infrastructure with extreme traffic volumes. Turning on a previously debated WAF helped stop the first wave: it filtered traffic, and its request sampling let the team identify suspicious sources. A second wave adapted: traffic was redirected so geo-blocking didn't work, and repeated connection-heavy behavior suggested a Slowloris-style technique using many simultaneous HTTP connections. Even with WAF alerts firing, enough requests still reached critical systems to overload the database and endpoints, forcing broader blocking (including of non-authenticated traffic) and a hotfix to prevent harmful request paths. After two weeks of daily surges, the attack finally stopped.
How did the team realize the slowness was a DoS rather than a bad release?
What role did the WAF play in stopping the first wave?
Why did blocking non-US traffic fail during the second wave?
What clue pointed to a Slowloris-style connection exhaustion attack?
What finally reduced the impact enough for the team to recover?
What preparedness steps were emphasized after the incident?
Review Questions
- What specific evidence in the AWS metrics and logs supported the shift from “possible bad release” to “active DoS”?
- How did attacker adaptation undermine the initial geo-blocking strategy, and what new signal replaced geography as the key diagnostic?
- Why can WAF alerts still coincide with continued downtime, and what additional controls helped in this case?
Key Points
1. Sudden 10x peak traffic plus API Gateway scaling out—without a corresponding release change—can be a strong early indicator of a DoS.
2. Enabling a WAF and using its sampling plus access logs can quickly identify dominant attack sources and enable targeted blocking.
3. Geo-blocking can fail when attackers redirect traffic; mitigation must evolve based on observed request patterns, not assumptions.
4. Consistent connection-related behavior (e.g., keep-alive patterns) can point to Slowloris-style exhaustion, where many simultaneous connections overwhelm resources.
5. WAF blocks alone may not restore service if enough traffic still reaches critical paths; broader controls like blocking non-authenticated traffic and patching request flow may be necessary.
6. Operational readiness matters: keep load balancer/access logs enabled and practice AWS Athena queries so patterns are actionable during an attack.
7. DoS campaigns can persist for weeks with daily surges; survival depends on iterative containment and avoiding burnout.