I Survived A DDOS
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A startup’s app was hit by a sustained denial-of-service (DoS) campaign that first showed up as sudden, unexplained slowness—then escalated into massive request floods, connection exhaustion, and repeated “back again” waves over roughly two weeks. The core lesson wasn’t just that DoS attacks happen; it was that mitigation depends on fast operational muscle memory, tight logging/forensics, and the willingness to iterate rules as attackers adapt.
The trouble began on a Wednesday evening in mid-June 2023, when the team noticed the app becoming unusable shortly after a recent release. Slack activity and load balancer alerts pointed to a traffic spike: the system was receiving around 10x its normal peak traffic, even though no new deployment explained it. Investigation in AWS tooling revealed the API Gateway scaling out, and the team suspected a DoS. They turned on a Web Application Firewall (WAF) rule set that had been under consideration during security certification planning. Once enabled, the WAF helped quickly, but its initial rules filtered out only about 10% of the traffic, which still left enough volume to hurt; the team then used WAF request sampling and access logs to identify the dominant sources.
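The sampling step is concrete enough to sketch. Below is a minimal example (not the team's actual tooling) of pulling AWS WAF sampled requests with boto3 and tallying the top source countries and client IPs; the web ACL ARN and rule metric name are placeholders, and a regional scope is assumed since the ACL fronts an ALB/API Gateway.

```python
# Sketch: tally top source countries/IPs from WAF sampled requests.
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

end = datetime.now(timezone.utc)
resp = wafv2.get_sampled_requests(
    WebAclArn="arn:aws:wafv2:us-east-1:123456789012:regional/webacl/app-acl/EXAMPLE",  # placeholder ARN
    RuleMetricName="rate-limit-rule",   # hypothetical rule metric name
    Scope="REGIONAL",                   # REGIONAL for an ALB/API Gateway web ACL
    TimeWindow={"StartTime": end - timedelta(hours=1), "EndTime": end},
    MaxItems=500,
)

countries = Counter(s["Request"]["Country"] for s in resp["SampledRequests"])
client_ips = Counter(s["Request"]["ClientIP"] for s in resp["SampledRequests"])
print("Top source countries:", countries.most_common(5))
print("Top client IPs:", client_ips.most_common(5))
```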
Early samples showed most suspicious traffic originating from North Korea, Russia, and China—regions the team didn’t serve. They created a geographic rule and the DoS stopped. Afterward, the team planned follow-ups: export logs for querying in AWS Athena and add alerts tied to WAF allow/block outcomes. For a short window, the incident looked contained.
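The geographic rule itself is simple to express in AWS WAF terms. A sketch of what such a rule could look like follows; the rule name, priority, and exact country list are assumptions based on the regions mentioned above, and the dict would be appended to the web ACL's Rules list (for example via update_web_acl, which also needs the current lock token from get_web_acl).

```python
# Hedged sketch of a WAFv2 geo-match block rule (names/priority are placeholders).
geo_block_rule = {
    "Name": "block-non-served-regions",
    "Priority": 0,
    "Statement": {
        # ISO country codes: North Korea, Russia, China
        "GeoMatchStatement": {"CountryCodes": ["KP", "RU", "CN"]}
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,    # keeps request sampling available
        "CloudWatchMetricsEnabled": True,  # feeds the allow/block alerting mentioned above
        "MetricName": "block-non-served-regions",
    },
}
```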
Then the attack returned. A second wave brought staggering numbers—hundreds of millions of requests within an hour—and attempts to block non-US traffic failed because the attacker redirected traffic and kept probing. The team found a consistent “keep-alive” header pattern across sampled requests that didn’t match their own front-end behavior. That clue led to identifying a Slowloris-style approach: overwhelming the target by maintaining many simultaneous HTTP connections. Even when WAF alerts showed heavy blocking, enough traffic still slipped through to keep the database and endpoints overwhelmed.
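Hunting for a shared header signature in WAF samples can be scripted. This is an illustrative sketch only: the exact header pattern the team matched on is not given in the source, so the Connection/keep-alive signature below is a placeholder, and the function expects the SampledRequests list from the earlier get_sampled_requests call.

```python
# Sketch: count sampled requests per client IP that carry a suspect header value.
from collections import Counter

SUSPECT_HEADER = ("connection", "keep-alive")   # hypothetical signature

def header_signature_counts(sampled_requests):
    """Tally client IPs whose sampled requests match the suspect header."""
    hits = Counter()
    for sample in sampled_requests:
        headers = {
            h["Name"].lower(): h["Value"].lower()
            for h in sample["Request"].get("Headers", [])
        }
        if headers.get(SUSPECT_HEADER[0]) == SUSPECT_HEADER[1]:
            hits[sample["Request"]["ClientIP"]] += 1
    return hits

# e.g. header_signature_counts(resp["SampledRequests"]).most_common(10)
```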
Mitigation shifted toward broader containment: blocking non-authenticated traffic, adding rules for known malicious IP ranges (including “onion network”/Tor-class traffic and other bad-reputation sources), and patching a front-end/back-end issue that allowed requests to reach the API Gateway and backend in ways that worsened the flood. A hotfix removed the problematic path and tightened enforcement, which ended the immediate flood, but the campaign kept returning for nearly two weeks in a pattern of daily morning surges. Eventually, it stopped as abruptly as it began.
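For the “known bad sources” layer, AWS ships managed rule groups covering IP reputation and anonymizer/Tor traffic. Whether the team used these exact groups is an assumption; the sketch below only shows how such rules are declared (names and priorities are placeholders).

```python
# Hedged sketch: AWS-managed rule groups for bad-reputation and anonymizer/Tor traffic.
# Managed rule group references take OverrideAction rather than Action.
reputation_rules = [
    {
        "Name": "aws-ip-reputation",
        "Priority": 1,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesAmazonIpReputationList",
            }
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "aws-ip-reputation",
        },
    },
    {
        "Name": "aws-anonymous-ip",   # covers Tor exit nodes, VPNs, hosting proxies
        "Priority": 2,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesAnonymousIpList",
            }
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "aws-anonymous-ip",
        },
    },
]
```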
In the aftermath, the team emphasized practical preparedness: use a cloud provider with a WAF, keep load balancer/access logs enabled, learn Athena-based log querying, and rehearse incident response so rules can be applied during an attack. The takeaway was blunt—there’s no guaranteed way to prevent DoS, but there are ways to survive it without burning out or losing the ability to operate when the next wave hits.
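Rehearsing the Athena piece might look like the following: a query for the top client IPs in the last hour of load balancer access logs, submitted via boto3. This assumes the ALB logs are already mapped to an Athena table (here called alb_logs, following the AWS documentation example); the database and S3 output location are placeholders.

```python
# Sketch: submit an Athena query for the noisiest client IPs in the last hour.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

TOP_TALKERS_SQL = """
SELECT client_ip,
       count(*)                         AS requests,
       count_if(elb_status_code >= 500) AS server_errors
FROM alb_logs
WHERE parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
      > now() - interval '1' hour
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 25;
"""

athena.start_query_execution(
    QueryString=TOP_TALKERS_SQL,
    QueryExecutionContext={"Database": "default"},                            # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)
```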
Cornell Notes
The incident started as sudden app slowness and quickly turned into a DoS attack that overwhelmed infrastructure with extreme traffic volumes. Turning on a previously debated WAF helped stop the first wave: it filtered traffic, and its request sampling let the team identify suspicious sources. A second wave adapted: traffic was redirected so geo-blocking didn't work, and repeated connection-heavy behavior suggested a Slowloris-style technique using many simultaneous HTTP connections. Even with WAF alerts firing, enough requests still reached critical systems to overload the database and endpoints, forcing broader blocking (including of non-authenticated traffic) and a hotfix to prevent harmful request paths. After two weeks of daily surges, the attack finally stopped.
How did the team realize the slowness was a DoS rather than a bad release?
What role did the WAF play in stopping the first wave?
Why did blocking non-US traffic fail during the second wave?
What clue pointed to a Slowloris-style connection exhaustion attack?
What finally reduced the impact enough for the team to recover?
What preparedness steps were emphasized after the incident?
Review Questions
- What specific evidence in the AWS metrics and logs supported the shift from “possible bad release” to “active DoS”?
- How did attacker adaptation undermine the initial geo-blocking strategy, and what new signal replaced geography as the key diagnostic?
- Why can WAF alerts still coincide with continued downtime, and what additional controls helped in this case?
Key Points
1. Sudden 10x peak traffic plus API Gateway scaling out—without a corresponding release change—can be a strong early indicator of a DoS.
2. Enabling a WAF and using its sampling plus access logs can quickly identify dominant attack sources and enable targeted blocking.
3. Geo-blocking can fail when attackers redirect traffic; mitigation must evolve based on observed request patterns, not assumptions.
4. Consistent connection-related behavior (e.g., keep-alive patterns) can point to Slowloris-style exhaustion, where many simultaneous connections overwhelm resources.
5. WAF blocks alone may not restore service if enough traffic still reaches critical paths; broader controls like blocking non-authenticated traffic and patching request flow may be necessary.
6. Operational readiness matters: keep load balancer/access logs enabled and practice AWS Athena queries so patterns are actionable during an attack.
7. DoS campaigns can persist for weeks with daily surges; survival depends on iterative containment and avoiding burnout.