Rate Limiting
Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Rate limiting enforces fairness by blocking or delaying requests that exceed a configured allowance per key (user, API key, or IP/device identifier).
Briefing
Rate limiting is the practical mechanism for stopping one user from monopolizing a service—by blocking or delaying requests that exceed a defined allowance over time. The core idea is simple: set a maximum number of actions per interval, then enforce it so spam can’t drown out everyone else. Twitch chat’s “slow mode” is a familiar example, effectively limiting how often each account can post so a single spammer can’t dominate the conversation. The same pattern matters far beyond chat: login endpoints need protection against brute-force attempts, expensive API endpoints must prevent resource hogging, and any public interface benefits from guardrails that keep traffic fair and predictable.
A major challenge is that “rate limit” isn’t one algorithm—it’s a design choice with tradeoffs in fairness, burst handling, and implementation complexity. The fixed window limiter is the most straightforward: count requests within a predefined time block and reset the counter at the window boundary. It’s easy to implement and easy to reason about, but it can allow bursts up to about 2× the configured limit when traffic is timed around the reset. GitHub’s API illustrates the risk: a fixed window of 5,000 requests per hour aligned to wall-clock hour boundaries means a client can legally send 5,000 just before midnight and another 5,000 right after, creating a 10,000-request spike across a short span.
Sliding window limiters aim to smooth those spikes. Instead of resetting capacity all at once, they refill gradually, effectively associating each request with a time it remains “active” in the allowance. That reduces burstiness and better matches high-load traffic patterns, but it’s harder for users to predict and can be expensive if it requires tracking timestamps for every request. Many real systems therefore use approximations—often described as “floating windows”—to get sliding-window behavior without storing every event. Upstash and Cloudflare are cited as using approximated sliding windows that blend counts from overlapping fixed windows based on how much each overlaps the current moment.
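One common way to approximate a sliding window without storing per-request timestamps is to keep counts for two adjacent fixed windows and weight the previous one by how much it overlaps the trailing interval. A minimal sketch of that blending idea (illustrative names; a single key, for brevity):

```python
class ApproxSlidingWindow:
    """Sliding-window estimate that blends two adjacent fixed windows
    (a sketch of the overlap-weighted approximation described above)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts: dict[int, int] = {}  # window index -> request count

    def allow(self, now: float) -> bool:
        idx = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current window elapsed
        prev = self.counts.get(idx - 1, 0)
        curr = self.counts.get(idx, 0)
        # Weight the previous window's count by how much of it still overlaps
        # a sliding `window`-length interval ending at `now`.
        estimated = prev * (1 - elapsed) + curr
        if estimated < self.limit:
            self.counts[idx] = curr + 1
            return True
        return False

limiter = ApproxSlidingWindow(limit=10, window=60)
allowed_prev = sum(limiter.allow(now=30.0) for _ in range(10))  # fills the first window
right_after = sum(limiter.allow(now=61.0) for _ in range(10))   # just past the boundary
# A fixed window would admit all 10 again here; the blended estimate admits only ~1.
```

The store only ever holds two counters per key, which is what makes this cheap at scale compared with tracking every timestamp.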
Token bucket limiters take a different approach: tokens accumulate at a steady refill rate, and each request consumes one token. When the bucket is empty, requests are blocked. This model naturally supports bursts up to the bucket’s capacity while still enforcing a lower long-term average rate. It also avoids the fixed-window “doubling” problem, because the burst is capped by the number of tokens available at once. The tradeoff is communication: telling users exactly when they can try again is less intuitive because tokens refill continuously rather than at a clear boundary.
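The token bucket mechanics can be sketched as follows (an illustrative single-process version; the refill-on-read trick avoids running a background timer):

```python
class TokenBucket:
    """Tokens refill continuously at `rate` per second, up to `capacity`;
    each request consumes one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Lazily refill based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)  # average 1 req/s, bursts of up to 5
burst = sum(bucket.allow(now=0.0) for _ in range(10))  # only 5 of 10 succeed
later = bucket.allow(now=1.0)  # by t=1s, one token has refilled
```

Note how the burst is capped at 5 (the capacity) regardless of timing, which is why the fixed-window doubling problem cannot occur here.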
The transcript also stresses operational details that often decide whether rate limiting works in production. The limiter state must persist across restarts and scale-out instances, so in-memory counters aren’t enough for horizontally scaled or serverless deployments. A key-value store such as Redis is recommended, with guidance to fail open if the store is unavailable to avoid turning a dependency outage into a denial-of-service. Rate limiting should use sensible keys (user ID, API key, or IP/device identifiers for unauthenticated traffic), return 429 responses with rate-limit headers (including “retry after” information), and can be paired with throttling to reduce burst impact. In the end, the choice is pragmatic: fixed windows for simplicity, approximated sliding windows for smoothing at scale, and token buckets when bursts must be allowed without sacrificing a strict average limit.
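The fail-open guidance can be made concrete with a small wrapper. This is a sketch under assumed names: `store_check` stands in for a real Redis-backed counter check, and `StoreUnavailable` for whatever connection error the client library raises:

```python
class StoreUnavailable(Exception):
    """Raised when the shared counter store (e.g. Redis) cannot be reached."""

class FailOpenLimiter:
    """Wraps a shared-store check so a store outage does not block all traffic."""

    def __init__(self, store_check):
        self.store_check = store_check  # callable: key -> bool (hypothetical API)

    def allow(self, key: str) -> bool:
        try:
            return self.store_check(key)
        except StoreUnavailable:
            # Fail open: if the store is down, let requests through rather than
            # turning a dependency outage into a self-inflicted denial of service.
            return True

def broken_store(key: str) -> bool:
    raise StoreUnavailable("connection refused")

limiter = FailOpenLimiter(broken_store)
print(limiter.allow("user:42"))  # True: traffic still flows during the outage
```

When the check does fire, the rejected request should get an HTTP 429 with rate-limit headers (including retry timing), as the summary above notes.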
Cornell Notes
Rate limiting prevents abuse by enforcing a maximum request rate per user or key, blocking requests that exceed the allowance during a time period. Fixed window limiters are simple and predictable but can permit spikes up to roughly 2× the limit when traffic is timed around the reset boundary (GitHub’s 5,000/hour example shows how midnight alignment can create a short burst). Sliding window limiters smooth traffic by refilling gradually, but naive implementations can be resource-heavy; many systems use approximations like floating windows (e.g., Upstash/Cloudflare) to avoid tracking every timestamp. Token bucket limiters refill tokens continuously, allowing controlled bursts up to bucket capacity while enforcing a long-term average rate; they’re flexible but harder to explain to users. Production setups also require persistent shared state (often Redis), sensible rate-limit keys, and 429 responses with headers.
- Why does fixed window rate limiting sometimes allow bursts larger than the configured limit?
- How do sliding window limiters reduce burstiness compared with fixed windows?
- What problem do approximated (floating) sliding windows solve, and how do they work?
- What makes token bucket rate limiting good at handling bursts while enforcing a long-term average?
- What production concerns determine whether a rate limiter works correctly at scale?
Review Questions
- Compare fixed window, sliding window, and token bucket: which one is most vulnerable to reset-boundary bursts, and why?
- Why might a naive sliding window be too expensive in high-traffic systems, and what approximation is used instead?
- What does “fail open” mean for rate limiting dependencies, and why does it matter?
Key Points
1. Rate limiting enforces fairness by blocking or delaying requests that exceed a configured allowance per key (user, API key, or IP/device identifier).
2. Fixed window limiters are easy to implement but can allow short spikes up to about 2× the limit when traffic is timed around window resets (e.g., GitHub’s hour-aligned 5,000/hour).
3. Sliding window limiters smooth traffic by refilling gradually, but naive versions can require tracking many timestamps; approximated floating windows reduce that cost.
4. Token bucket limiters support controlled bursts up to bucket capacity while enforcing a long-term average via continuous token refill; they’re flexible but harder to communicate precisely to users.
5. Production rate limiting requires persistent shared state (commonly Redis) for horizontal scaling and serverless, plus a fail-open strategy if the store is unavailable.
6. Rate-limited responses should use HTTP 429 and include rate-limit headers (and retry timing) so clients can back off correctly.
7. Use sensible rate-limit keys and consider combining throttling with rate limiting to reduce burst impact.