
Scaling One Million Checkboxes

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Represent shared checkbox state as a compact bitset so a single user action becomes a single-bit flip in Redis.

Briefing

A one-million-checkbox website launched on June 26 quickly turned into a mainstream, real-time stress test—hitting hundreds of millions of checkbox updates (passing 650 million) before being shut down two weeks later. The core insight wasn’t about checkboxes themselves; it was about how to engineer “shared state at massive scale” when every user interaction must appear instantly to everyone else. The project also became a practical lesson in cost control: bandwidth and update frequency—not just server CPU—ended up driving the hardest scaling decisions.

The site started with a deliberately simple architecture: the checkbox state lived as a compact bitset (1 million bits, one bit per checkbox). Clients rendered only what was visible (using React Window) and relied on updates rather than re-rendering the entire million-item DOM. When a user toggled a checkbox, the server flipped the corresponding bit and broadcast the change to all connected clients, avoiding the need to send a million checkbox objects over the wire.
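The bitset model above can be sketched in a few lines. This is an illustrative in-memory version (the real system kept the bitset in Redis via `SETBIT`), and the 5-byte message format is a hypothetical encoding, not the project's actual wire format:

```python
# Minimal sketch of the server-side state model: 1,000,000 checkboxes
# stored as a bitset, so a toggle is a single-bit flip. A bytearray
# stands in for Redis here, purely for illustration.

NUM_BOXES = 1_000_000
state = bytearray(NUM_BOXES // 8)  # 125,000 bytes hold 1M bits

def toggle(index: int) -> bool:
    """Flip the bit for checkbox `index`; return its new value."""
    byte, bit = divmod(index, 8)
    state[byte] ^= 1 << bit
    return bool(state[byte] & (1 << bit))

def make_update(index: int, value: bool) -> bytes:
    """Encode one toggle as a compact 5-byte message (hypothetical
    format): 4-byte little-endian index + 1-byte value."""
    return index.to_bytes(4, "little") + bytes([value])

new_value = toggle(42)             # user checks box 42
msg = make_update(42, new_value)   # 5 bytes broadcast to all clients
```

The payoff is visible in the sizes: the full state is 125 KB, and each broadcast update is a few bytes rather than a serialized checkbox object.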

Early load was dominated by the mismatch between expected traffic and reality. Tens of thousands of users arrived within hours from Hacker News, Mastodon, and Twitter; the site crashed repeatedly until the system stabilized by day two. The initial approach used JSON and base64-encoded snapshots for full state recovery, plus incremental updates for normal operation. That choice created a heavy serialization and payload overhead, and the system eventually ran into Redis connection pressure and WebSocket throughput limits.

Scaling required a series of targeted engineering pivots. The most important were (1) batching updates to reduce per-message overhead, (2) moving away from socket.io toward raw WebSockets to cut framing and inefficiency, and (3) adding connection pooling—though it initially conflicted with the deployment model and still required careful tuning. Redis was treated as the shared state store: it held the bitset and used Pub/Sub to fan out events. To recover from missed WebSocket updates (e.g., backgrounded tabs), the system periodically sent full snapshots, but snapshot frequency and incremental update size had to be reduced to keep bandwidth—and therefore cost—under control.

A key debugging moment came from state synchronization: without ordering metadata, clients could apply stale incremental updates after receiving a newer full snapshot, producing incorrect checkbox views. The fix was to introduce timestamps/counters so clients could drop out-of-date update batches and only apply increments that matched the latest snapshot baseline.

As traffic kept climbing, the project also faced adversarial behavior and operational failures. A bug allowed checkbox counts to jump into the 100 million range, effectively corrupting the intended bitset size and forcing a rollback to a truncated, validated state. Later, bots drove extreme traffic, leading to a denial-of-service situation that was mitigated by putting the site behind Cloudflare and tightening rate limiting. When Redis replicas and connection pooling didn’t behave as expected, the solution became pragmatic: spin up replicas, route traffic to private IPs, and add automated process restarts when Redis connections caused Flask workers to crash.

Finally, the shutdown plan turned the same real-time engineering into a lifecycle feature. Checkboxes were “frozen” if they weren’t toggled within a time window; frozen state was tracked in Redis and distributed like the main bitset. This allowed the system to gradually stop accepting meaningful changes and wind down safely. The project concluded with a Go rewrite that delivered major performance gains over Python, and a broader takeaway: building fast, hackable systems can work—if scaling constraints (especially bandwidth, serialization, and ordering) are treated as first-class design problems from day one.

Cornell Notes

The “1 million checkboxes” site scaled from a small experiment into a high-traffic, real-time shared-state system, driven by a compact bitset model and WebSocket broadcasting. Each checkbox toggle flipped a single bit in Redis and pushed incremental updates to connected clients, while clients rendered only visible items to avoid DOM overload. The biggest scaling bottlenecks emerged from payload overhead (JSON/base64 snapshots), WebSocket throughput, Redis connection limits, and bandwidth cost from frequent full-state snapshots. Correctness also required ordering: clients needed timestamps/counters so they could drop stale incremental updates that arrived after a newer snapshot. The project ultimately stabilized with batching, raw WebSockets, snapshot throttling, ordering metadata, and operational safeguards, then wound down via a “freeze” mechanism implemented with Redis Lua.

Why store checkbox state as a bitset instead of an array of booleans or objects?

A bitset compresses the entire shared state into 1 million bits (about 125 KB), where each checkbox corresponds to a single bit: 1 means checked, 0 means unchecked. That makes updates cheap (flip one bit) and avoids sending a million checkbox records. Clients keep a local bitset representation and render only what’s in view, while the server broadcasts the minimal information needed to update everyone else.

What made bandwidth and payload size the real cost driver?

Full-state recovery required sending snapshots. When snapshots were encoded as base64 and wrapped in JSON, the payload grew substantially and serialization work increased. Even with a compact 1 million-bit state, sending large snapshots too often (e.g., every 30 seconds) multiplied bandwidth quickly across thousands of clients. The project reduced snapshot frequency and trimmed incremental update sizes to cap cost.
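The overhead is easy to quantify. The sketch below shows how base64-plus-JSON inflates a 125 KB snapshot; the client count and interval in the comment are illustrative assumptions, not figures from the video:

```python
import base64
import json

# A full snapshot of 1M checkbox bits is exactly 125,000 bytes raw.
raw = bytes(1_000_000 // 8)

# Base64 inflates the payload by ~33%: 125,000 -> 166,668 bytes.
b64 = base64.b64encode(raw)

# Wrapping it in JSON adds a little more framing on top.
payload = json.dumps({"full_state": b64.decode("ascii")})

# At, say, 10,000 connected clients receiving a snapshot every 30 s,
# that is roughly 166,686 * 10,000 / 30 ≈ 55 MB/s of egress —
# which is why snapshot frequency had to come down.
```

Sending the raw bytes over a binary WebSocket frame instead would cut the snapshot by a quarter before any compression is applied.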

How did update batching and raw WebSockets improve scalability?

Batching grouped multiple checkbox changes into fewer messages, reducing per-message overhead and improving throughput. Switching from socket.io to raw WebSockets removed extra framing and inefficiencies. Together, these changes reduced pressure on Redis and improved the rate at which updates could be broadcast and applied.

What correctness bug appeared when clients received full snapshots and incremental updates?

Clients could receive a newer full snapshot and then apply an older incremental update batch that arrived later, producing an incorrect view. The fix added ordering metadata (timestamps/counters) to full snapshots and incremental batches. Clients then dropped update batches whose timestamp/counter lagged behind the latest snapshot baseline.
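The ordering fix can be sketched with a monotonically increasing sequence counter. The video describes timestamps/counters; the exact scheme below is an assumed simplification:

```python
class ClientState:
    """Client-side view that drops incremental batches older than the
    latest full snapshot it has applied."""

    def __init__(self):
        self.snapshot_seq = 0        # seq of the latest snapshot applied
        self.checked: set[int] = set()

    def apply_snapshot(self, seq: int, checked: set[int]) -> None:
        if seq >= self.snapshot_seq:
            self.snapshot_seq = seq
            self.checked = set(checked)

    def apply_batch(self, seq: int, toggles: list[int]) -> bool:
        """Apply a batch only if it is newer than the snapshot baseline;
        otherwise drop it as stale and report False."""
        if seq <= self.snapshot_seq:
            return False
        for i in toggles:
            self.checked.symmetric_difference_update({i})
        return True

c = ClientState()
c.apply_snapshot(seq=10, checked={1, 2})
dropped = c.apply_batch(seq=9, toggles=[3])   # stale: generated pre-snapshot
applied = c.apply_batch(seq=11, toggles=[3])  # newer: applied
```

Without the `seq` comparison, the late-arriving batch at `seq=9` would re-toggle boxes the snapshot had already settled.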

Why did the system need a “freeze” shutdown plan?

Instead of abruptly stopping, the site gradually stopped accepting meaningful changes. A checkbox became “frozen” if it wasn’t toggled within a time window (e.g., 10 minutes). Frozen state was tracked in Redis and distributed to clients so they disabled interaction for those boxes. This reduced future activity and allowed a controlled wind-down.
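The freeze rule reduces to a per-checkbox timestamp comparison. The real system tracked this in Redis (reportedly via Lua scripts); the in-memory dict below is only an illustration:

```python
FREEZE_WINDOW = 10 * 60  # seconds; the video cites a ~10-minute window

last_toggled: dict[int, float] = {}  # checkbox index -> last toggle time

def record_toggle(index: int, now: float) -> None:
    last_toggled[index] = now

def is_frozen(index: int, now: float) -> bool:
    """A box freezes once it has gone FREEZE_WINDOW without a toggle;
    frozen boxes stop accepting changes during the wind-down."""
    last = last_toggled.get(index)
    return last is not None and (now - last) >= FREEZE_WINDOW

record_toggle(0, now=0.0)
frozen_early = is_frozen(0, now=5.0)    # still active
frozen_late = is_frozen(0, now=601.0)   # window elapsed: frozen
```

Distributing the frozen set to clients the same way as the main bitset meant no new synchronization machinery was needed for the shutdown.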

How were operational failures handled during peak traffic?

When Flask workers crashed—often tied to Redis connection issues—the system used pragmatic automation: a script monitored the number of running Flask processes and restarted the systemd unit when too few were running. NGINX was updated to temporarily remove unhealthy servers from rotation. For Redis load, a replica was spun up to spread connections when connection pooling didn’t behave as expected.

Review Questions

  1. If a client receives a full snapshot and then an incremental update that was generated earlier, what mechanism prevents the client from applying the stale change?
  2. Which scaling lever would you prioritize first for a real-time shared-state app: reducing snapshot frequency, batching incremental updates, or changing the state representation—and why?
  3. How does freezing inactive checkboxes reduce system load, and what state must be tracked to enforce it consistently across clients?

Key Points

  1. Represent shared checkbox state as a compact bitset so a single user action becomes a single-bit flip in Redis.
  2. Avoid sending million-item DOM updates; render only visible checkboxes and keep the full state locally as a bitset.
  3. Treat bandwidth as a first-class constraint: throttle full-state snapshots and shrink incremental update payloads (especially when using JSON/base64).
  4. Batch incremental updates and prefer raw WebSockets over higher-level socket frameworks when throughput matters.
  5. Add ordering metadata (timestamps/counters) so clients can discard stale incremental updates that arrive after a newer snapshot.
  6. Use Redis Pub/Sub for fan-out, but monitor Redis connection limits and add operational safeguards (process restarts, server rotation).
  7. Plan for lifecycle and shutdown: a “freeze” mechanism can gradually reduce interaction and safely wind down a real-time system.

Highlights

The entire shared state fit into 1 million bits (about 125 KB), enabling single-bit updates and efficient broadcasting.
Scaling broke on payload overhead and bandwidth: JSON/base64 snapshots and frequent full-state refreshes became the cost bottleneck.
A subtle synchronization bug emerged when clients applied stale incremental updates after receiving newer snapshots—fixed by adding timestamps/counters.
Stability required both performance engineering (batching, raw WebSockets) and operations (Redis load management, automated worker restarts).
The shutdown wasn’t a hard stop: checkboxes were “frozen” after inactivity, with frozen state distributed like the main bitset.

Topics

  • Real-Time WebSockets
  • Bitset State
  • Redis Pub/Sub
  • Bandwidth Optimization
  • State Synchronization

Mentioned

  • DOM
  • CPU
  • RLE
  • JSON
  • UTF-8
  • UTF-16
  • TCP
  • HTTP
  • API
  • P2P
  • DDoS
  • Lua
  • Redis
  • Pub/Sub
  • WebSockets
  • CDN