
Rate Limiting

Theo - t3.gg · 6 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

Rate limiting enforces fairness by blocking or delaying requests that exceed a configured allowance per key (user, API key, or IP/device identifier).

Briefing

Rate limiting is the practical mechanism for stopping one user from monopolizing a service—by blocking or delaying requests that exceed a defined allowance over time. The core idea is simple: set a maximum number of actions per interval, then enforce it so spam can’t drown out everyone else. Twitch chat’s “slow mode” is a familiar example, effectively limiting how often each account can post so a single spammer can’t dominate the conversation. The same pattern matters far beyond chat: login endpoints need protection against brute-force attempts, expensive API endpoints must prevent resource hogging, and any public interface benefits from guardrails that keep traffic fair and predictable.

A major challenge is that “rate limit” isn’t one algorithm—it’s a design choice with tradeoffs in fairness, burst handling, and implementation complexity. The fixed window limiter is the most straightforward: count requests within a predefined time block and reset the counter at the window boundary. It’s easy to implement and easy to reason about, but it can allow bursts up to about 2× the configured limit when traffic is timed around the reset. GitHub’s API illustrates the risk: a fixed window of 5,000 requests per hour aligned to wall-clock hour boundaries means a client can legally send 5,000 just before midnight and another 5,000 right after, creating a 10,000-request spike across a short span.

Sliding window limiters aim to smooth those spikes. Instead of resetting capacity all at once, they refill gradually, effectively associating each request with a time it remains “active” in the allowance. That reduces burstiness and better matches high-load traffic patterns, but it’s harder for users to predict and can be expensive if it requires tracking timestamps for every request. Many real systems therefore use approximations—often described as “floating windows”—to get sliding-window behavior without storing every event. Upstash and Cloudflare are cited as using approximated sliding windows that blend counts from overlapping fixed windows based on how much each overlaps the current moment.

Token bucket limiters take a different approach: tokens accumulate at a steady refill rate, and each request consumes one token. When the bucket is empty, requests are blocked. This model naturally supports bursts up to the bucket’s capacity while still enforcing a lower long-term average rate. It also avoids the fixed-window “doubling” problem, because the burst is capped by the number of tokens available at once. The tradeoff is communication: telling users exactly when they can try again is less intuitive because tokens refill continuously rather than at a clear boundary.

The transcript also stresses operational details that often decide whether rate limiting works in production. The limiter state must persist across restarts and scale-out instances, so in-memory counters aren’t enough for horizontally scaled or serverless deployments. A key-value store such as Redis is recommended, with guidance to fail open if the store is unavailable to avoid turning a dependency outage into a denial-of-service. Rate limiting should use sensible keys (user ID, API key, or IP/device identifiers for unauthenticated traffic), return 429 responses with rate-limit headers (including “retry after” information), and can be paired with throttling to reduce burst impact. In the end, the choice is pragmatic: fixed windows for simplicity, approximated sliding windows for smoothing at scale, and token buckets when bursts must be allowed without sacrificing a strict average limit.

Cornell Notes

Rate limiting prevents abuse by enforcing a maximum request rate per user or key, blocking requests that exceed the allowance during a time period. Fixed window limiters are simple and predictable but can permit spikes up to roughly 2× the limit when traffic is timed around the reset boundary (GitHub’s 5,000/hour example shows how midnight alignment can create a short burst). Sliding window limiters smooth traffic by refilling gradually, but naive implementations can be resource-heavy; many systems use approximations like floating windows (e.g., Upstash/Cloudflare) to avoid tracking every timestamp. Token bucket limiters refill tokens continuously, allowing controlled bursts up to bucket capacity while enforcing a long-term average rate; they’re flexible but harder to explain to users. Production setups also require persistent shared state (often Redis), sensible rate-limit keys, and 429 responses with headers.

Why does fixed window rate limiting sometimes allow bursts larger than the configured limit?

A fixed window counts requests within a predefined interval and resets the counter at the window boundary. If a client sends requests right before the reset and then immediately after, it can consume the full allowance twice in a short real-world period. The transcript’s GitHub example uses 5,000 requests per hour aligned to wall-clock hour starts: sending 5,000 just before midnight and another 5,000 right after means 10,000 requests can land across a two-minute span even though the per-hour limit is never exceeded within each hour.

How do sliding window limiters reduce burstiness compared with fixed windows?

Sliding windows refill capacity continuously rather than all at once. Conceptually, each request occupies a “slot” for the duration of the window, so the limiter allows a new request only when enough earlier requests have aged out. The transcript notes that naive sliding windows can require tracking timestamps for every request, which is why approximated “floating window” methods are common in real systems.

What problem do approximated (floating) sliding windows solve, and how do they work?

Approximated sliding windows aim to keep sliding-window behavior (smoother enforcement) without the cost of storing all request timestamps. The described approach counts allowed requests in the previous fixed window and the current fixed window, then weights those counts by how much each window overlaps the current “floating” window. The weighted sum approximates the true sliding-window limit while remaining more efficient.

What makes token bucket rate limiting good at handling bursts while enforcing a long-term average?

Tokens accumulate at a constant refill rate up to a maximum bucket capacity. Each request consumes one token; when tokens run out, requests are blocked. This means bursts are capped by bucket size (maximum burst capacity), while the refill rate enforces the long-term average. The transcript contrasts this with fixed windows, where timing around resets can create effective 2× spikes; token buckets prevent that kind of doubling because the burst is limited by available tokens.

What production concerns determine whether a rate limiter works correctly at scale?

Rate limiter state must be shared and persistent across multiple servers and serverless instances; in-memory counters break under horizontal scaling or restarts. The transcript recommends storing limiter data in Redis (with expiring keys) and using an ephemeral in-memory cache only as a performance optimization. It also advises failing open if the persisted store fails, returning 429 with rate-limit headers (including retry timing), and choosing correct keys (user ID/API key for authenticated traffic; IP/device fingerprint/installation ID or shared limiter for unauthenticated traffic).

Review Questions

  1. Compare fixed window, sliding window, and token bucket: which one is most vulnerable to reset-boundary bursts, and why?
  2. Why might a naive sliding window be too expensive in high-traffic systems, and what approximation is used instead?
  3. What does “fail open” mean for rate limiting dependencies, and why does it matter?

Key Points

  1. Rate limiting enforces fairness by blocking or delaying requests that exceed a configured allowance per key (user, API key, or IP/device identifier).
  2. Fixed window limiters are easy to implement but can allow short spikes up to about 2× the limit when traffic is timed around window resets (e.g., GitHub’s hour-aligned 5,000/hour).
  3. Sliding window limiters smooth traffic by refilling gradually, but naive versions can require tracking many timestamps; approximated floating windows reduce that cost.
  4. Token bucket limiters support controlled bursts up to bucket capacity while enforcing a long-term average via continuous token refill; they’re flexible but harder to communicate precisely to users.
  5. Production rate limiting requires persistent shared state (commonly Redis) for horizontal scaling and serverless, plus a fail-open strategy if the store is unavailable.
  6. Rate-limited responses should use HTTP 429 and include rate-limit headers (and retry timing) so clients can back off correctly.
  7. Use sensible rate-limit keys and consider combining throttling with rate limiting to reduce burst impact.

Highlights

Fixed window enforcement can be gamed by timing requests around the reset boundary, enabling effective spikes even when the per-window limit is respected.
Approximated sliding windows (“floating windows”) preserve smoother behavior without storing every request timestamp—by weighting counts from overlapping fixed windows.
Token bucket limiters cap bursts by bucket capacity while still enforcing a steady long-term rate through continuous token refill.
A practical rate limiter needs more than an algorithm: it must persist state across servers, fail open on dependency outages, and return 429 plus actionable rate-limit headers.

Topics

  • Rate Limiting Algorithms
  • Fixed Window
  • Sliding Window
  • Token Bucket
  • Production Rate Limiting
