Scaling One Million Checkboxes
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
A one-million-checkbox website launched on June 26 quickly turned into a mainstream, real-time stress test, surpassing 650 million checkbox updates before being shut down two weeks later. The core insight wasn’t about checkboxes themselves; it was about how to engineer “shared state at massive scale” when every user interaction must appear instantly to everyone else. The project also became a practical lesson in cost control: bandwidth and update frequency, not just server CPU, ended up driving the hardest scaling decisions.
The site started with a deliberately simple architecture: the checkbox state lived as a compact bitset (1 million bits, one bit per checkbox). Clients rendered only what was visible (using React Window) and applied incremental updates rather than re-rendering the entire million-item DOM. When a user toggled a checkbox, the server flipped the corresponding bit and broadcast the change to all connected clients, avoiding the need to send a million checkbox objects over the wire.
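A minimal sketch of that model, assuming the redis-py client; the key name `checkboxes` and the helper names are illustrative, not taken from the project:

```python
import redis

NUM_BOXES = 1_000_000
r = redis.Redis()

def toggle(index: int) -> bool:
    """Flip one checkbox bit and return its new value."""
    if not 0 <= index < NUM_BOXES:
        raise ValueError("checkbox index out of range")
    # GETBIT then SETBIT is not atomic under concurrency; the Lua sketch
    # near the end of this briefing shows an atomic variant.
    new_value = r.getbit("checkboxes", index) ^ 1
    r.setbit("checkboxes", index, new_value)
    return bool(new_value)

def full_state() -> bytes:
    """The entire board: 1,000,000 bits is only 125,000 bytes."""
    return r.get("checkboxes") or b"\x00" * (NUM_BOXES // 8)
```

The whole board fits in about 125 KB, which is why a bitset beats shipping a million boolean objects.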
Early load was dominated by the mismatch between expected traffic and reality. Tens of thousands of users arrived within hours from Hacker News, Mastodon, and Twitter; the site crashed repeatedly until it stabilized by day two. The initial approach used JSON with base64-encoded snapshots for full state recovery, plus incremental updates for normal operation. That choice added heavy serialization and payload overhead, and the system eventually ran into Redis connection pressure and WebSocket throughput limits.
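A back-of-envelope illustration of that overhead; the exact wire format of the real site isn't documented here, so the payload shape is an assumption:

```python
import base64
import json

raw = b"\x00" * 125_000                    # 1,000,000 bits as raw bytes
encoded = base64.b64encode(raw).decode()   # base64 inflates size by ~33%
snapshot = json.dumps({"full_state": encoded})

print(len(raw))       # 125000 bytes raw
print(len(encoded))   # 166668 bytes after base64
print(len(snapshot))  # plus JSON framing, per snapshot, per client
```

With tens of thousands of connected clients, each periodic snapshot multiplies that ~167 KB, which is why snapshot frequency became the dominant bandwidth knob.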
Scaling required a series of targeted engineering pivots. The most important were (1) batching updates to reduce per-message overhead, (2) moving from socket.io to raw WebSockets to cut framing overhead, and (3) adding connection pooling, though it initially conflicted with the deployment model and still required careful tuning. Redis was treated as the shared state store: it held the bitset and used Pub/Sub to fan out events. To recover from missed WebSocket updates (e.g., backgrounded tabs), the system periodically sent full snapshots, but snapshot frequency and incremental update size had to be reduced to keep bandwidth, and therefore cost, under control.
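A sketch of what update batching can look like, assuming an asyncio server; the 100 ms flush interval, binary frame format, and broadcast stub are illustrative choices, not the project's actual values:

```python
import asyncio
import struct

FLUSH_INTERVAL = 0.1        # flush pending toggles every 100 ms (illustrative)
pending: set[int] = set()   # indices toggled since the last flush

async def broadcast(frame: bytes) -> None:
    # Stub: the real system would write one binary frame to every connected
    # raw WebSocket instead of sending a message per individual toggle.
    print(f"broadcasting {len(frame)} bytes")

def record_toggle(index: int) -> None:
    pending.add(index)

async def flush_loop() -> None:
    while True:
        await asyncio.sleep(FLUSH_INTERVAL)
        if pending:
            batch = sorted(pending)
            pending.clear()
            # One compact frame of unsigned 32-bit indices per interval,
            # instead of one JSON message per toggle.
            await broadcast(struct.pack(f"!{len(batch)}I", *batch))
```

Batching trades a small amount of latency (at most one flush interval) for a large reduction in per-message framing and system-call overhead.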
A key debugging moment came from state synchronization: without ordering metadata, clients could apply stale incremental updates after receiving a newer full snapshot, producing incorrect checkbox views. The fix was to introduce timestamps/counters so clients could drop out-of-date update batches and only apply increments that matched the latest snapshot baseline.
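A sketch of that ordering check, assuming every snapshot and update batch carries a monotonically increasing counter; the class and field names are hypothetical:

```python
class ClientState:
    def __init__(self) -> None:
        self.bits = bytearray(125_000)  # local copy of the 1M-bit board
        self.version = -1               # counter of the last applied message

    def apply_snapshot(self, snapshot_bits: bytes, version: int) -> None:
        if version <= self.version:
            return                      # stale snapshot: ignore it
        self.bits = bytearray(snapshot_bits)
        self.version = version

    def apply_batch(self, toggled: list[int], version: int) -> None:
        if version <= self.version:
            return                      # stale batch from before the snapshot
        for i in toggled:
            # MSB-first within each byte, matching Redis SETBIT semantics.
            self.bits[i // 8] ^= 1 << (7 - i % 8)
        self.version = version
```

Without the version guard, a batch generated before a snapshot but delivered after it would silently flip bits the snapshot had already corrected.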
As traffic kept climbing, the project also faced adversarial behavior and operational failures. A bug let checkbox indices jump into the 100 million range, growing the bitset past its intended size and forcing a rollback to a truncated, validated state. Later, bots drove extreme traffic, leading to a denial-of-service situation that was mitigated by putting the site behind Cloudflare and tightening rate limiting. When Redis replicas and connection pooling didn’t behave as expected, the solution became pragmatic: spin up replicas, route traffic to private IPs, and add automated process restarts when Redis connections caused Flask workers to crash.
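A sketch of the kind of guardrails these failures imply: strict index validation plus a simple per-connection rate limit. The threshold and helper names are illustrative assumptions, not the site's real values:

```python
import time

NUM_BOXES = 1_000_000
MAX_TOGGLES_PER_SEC = 20     # illustrative limit

def validate_index(index: int) -> None:
    # Redis SETBIT grows the value to fit any offset, so one unchecked
    # index near 100M would balloon the 125 KB bitset to ~12.5 MB.
    if not 0 <= index < NUM_BOXES:
        raise ValueError(f"index {index} outside the 1M-checkbox board")

class RateLimiter:
    """Crude fixed-window limiter, one instance per connection."""
    def __init__(self) -> None:
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            self.window_start, self.count = now, 0
        self.count += 1
        return self.count <= MAX_TOGGLES_PER_SEC
```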
Finally, the shutdown plan turned the same real-time engineering into a lifecycle feature. Checkboxes were “frozen” if they weren’t toggled within a time window; frozen state was tracked in Redis and distributed like the main bitset. This allowed the system to gradually stop accepting meaningful changes and wind down safely. The project concluded with a Go rewrite that delivered major performance gains over Python, and a broader takeaway: building fast, hackable systems can work—if scaling constraints (especially bandwidth, serialization, and ordering) are treated as first-class design problems from day one.
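A sketch of how such a freeze can be enforced atomically with a Lua script in Redis (the Cornell Notes summary below mentions the freeze was implemented with Redis Lua); the key names and return convention here are assumptions:

```python
import redis

r = redis.Redis()

# A second "frozen" bitset gates writes to the main one; checking and
# flipping happen in a single atomic script execution.
toggle_if_unfrozen = r.register_script("""
    local idx = tonumber(ARGV[1])
    if redis.call('GETBIT', KEYS[2], idx) == 1 then
        return -1                            -- frozen: reject the toggle
    end
    local new = 1 - redis.call('GETBIT', KEYS[1], idx)
    redis.call('SETBIT', KEYS[1], idx, new)  -- flip atomically
    return new
""")

def toggle(index: int) -> int:
    # Returns the new bit value, or -1 if the checkbox is frozen.
    return toggle_if_unfrozen(keys=["checkboxes", "frozen"], args=[index])
```

Distributing the frozen bitset to clients the same way as the main bitset let the UI gray out frozen boxes without any new synchronization machinery.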
Cornell Notes
The “1 million checkboxes” site scaled from a small experiment into a high-traffic, real-time shared-state system, driven by a compact bitset model and WebSocket broadcasting. Each checkbox toggle flipped a single bit in Redis and pushed incremental updates to connected clients, while clients rendered only visible items to avoid DOM overload. The biggest scaling bottlenecks emerged from payload overhead (JSON/base64 snapshots), WebSocket throughput, Redis connection limits, and bandwidth cost from frequent full-state snapshots. Correctness also required ordering: clients needed timestamps/counters so they could drop stale incremental updates that arrived after a newer snapshot. The project ultimately stabilized with batching, raw WebSockets, snapshot throttling, ordering metadata, and operational safeguards, then wound down via a “freeze” mechanism implemented with Redis Lua.
- Why store checkbox state as a bitset instead of an array of booleans or objects?
- What made bandwidth and payload size the real cost driver?
- How did update batching and raw WebSockets improve scalability?
- What correctness bug appeared when clients received full snapshots and incremental updates?
- Why did the system need a “freeze” shutdown plan?
- How were operational failures handled during peak traffic?
Review Questions
- If a client receives a full snapshot and then an incremental update that was generated earlier, what mechanism prevents the client from applying the stale change?
- Which scaling lever would you prioritize first for a real-time shared-state app: reducing snapshot frequency, batching incremental updates, or changing the state representation—and why?
- How does freezing inactive checkboxes reduce system load, and what state must be tracked to enforce it consistently across clients?
Key Points
1. Represent shared checkbox state as a compact bitset so a single user action becomes a single-bit flip in Redis.
2. Avoid sending million-item DOM updates; render only visible checkboxes and keep the full state locally as a bitset.
3. Treat bandwidth as a first-class constraint: throttle full-state snapshots and shrink incremental update payloads (especially when using JSON/base64).
4. Batch incremental updates and prefer raw WebSockets over higher-level socket frameworks when throughput matters.
5. Add ordering metadata (timestamps/counters) so clients can discard stale incremental updates that arrive after a newer snapshot.
6. Use Redis Pub/Sub for fan-out, but monitor Redis connection limits and add operational safeguards (process restarts, server rotation).
7. Plan for lifecycle and shutdown: a “freeze” mechanism can gradually reduce interaction and safely wind down a real-time system.