A Million Chess Boards (in a Single Process!)
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
One Million Chessboards runs as one continuous chess-like world, on a single server process, where pieces can move across board boundaries and updates propagate instantly to all players. In the 10 days after launch, more than 150,000 players made over 15 million moves, generating hundreds of millions of queries, all while the server stayed a single, unchanged process. The scale isn't just a flex; it forces hard engineering tradeoffs around bandwidth, state distribution, and real-time consistency.
The core design choice is a global game state rather than sharding into many independent games. That avoids the overhead of coordinating turns across processes, but it introduces new constraints: pieces can move between boards, yet captures across board boundaries are restricted to prevent immediate "queen takes king" situations. The restriction unintentionally creates emergent tactics: players build "indestructible" structures and invent patterns reminiscent of other grid games such as Hnefatafl. An early attempt to relax the rule so it applied only to unmoved pieces caused problems, and the stricter original rule proved more stable.
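The cross-board capture restriction can be sketched as a simple legality check. This is a minimal illustration, assuming the rule means a capture is only legal when attacker and target occupy the same 8×8 board; the function names and exact rule details are assumptions, not the project's actual code.

```python
BOARD = 8  # squares per board edge in the global grid

def same_board(x1, y1, x2, y2):
    """True if both squares fall on the same 8x8 board."""
    return (x1 // BOARD, y1 // BOARD) == (x2 // BOARD, y2 // BOARD)

def capture_allowed(ax, ay, tx, ty):
    """Cross-board *moves* are fine, but cross-board *captures* are not,
    so a queen one board over cannot instantly take a neighboring king."""
    return same_board(ax, ay, tx, ty)
```

Under this check, squares (7, 0) and (8, 0) are adjacent on the global grid but belong to different boards, so a capture between them would be rejected.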
To make the system fast enough, the architecture leans on snapshotting and move batching instead of broadcasting everything. The server keeps an 8,000×8,000 board as a dense array of 64-bit values representing pieces, with metadata packed into the same structure. Because shipping the full board to every client is impossible, clients receive an initial snapshot (a 95×95 region around their view center) and then only incremental "move batches" near their current position. Updates are throttled and spatially filtered: the grid is divided into 50×50-tile zones, and a client receives moves from the 3×3 block of zones around its position.
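Two of those ideas are easy to sketch: packing a piece into a single 64-bit value, and computing the 3×3 zone neighborhood a client subscribes to. The bit layout below is hypothetical (the source only says pieces and metadata share one 64-bit structure), as are the function names.

```python
ZONE = 50  # tiles per zone edge

def pack_piece(piece_type, color, piece_id):
    """Hypothetical layout: 4 bits of type, 1 color bit, id in the upper bits.
    Everything fits in one 64-bit integer, so the board is a flat dense array."""
    return (piece_id << 8) | (color << 4) | piece_type

def unpack_piece(v):
    """Invert pack_piece: (type, color, id)."""
    return v & 0xF, (v >> 4) & 1, v >> 8

def zones_for(x, y):
    """The 3x3 block of 50x50-tile zones around a client at tile (x, y);
    the client only receives move batches originating in these zones."""
    zx, zy = x // ZONE, y // ZONE
    return [(zx + dx, zy + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```

A dense array of such integers keeps the whole world cache-friendly, and zone filtering means a move only fans out to clients whose neighborhood contains the move's zone.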
The snapshot and batching sizes aren't arbitrary. The system is tuned to the client's visible area: players can see up to 35×35 tiles when zoomed in and up to 70×70 when zoomed out, with panning that can shift the view by up to 10 tiles. A 95×95 snapshot ensures that moving within that range doesn't require immediate reloading, which helps avoid "loading spinner" moments. Snapshots are sent when the client's position drifts more than about 12 tiles from the last snapshot.
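The drift threshold reduces to a small check on the client's view center. A minimal sketch, assuming the ~12-tile drift is measured per-axis (Chebyshev distance); the actual metric and names are assumptions.

```python
SNAPSHOT_SIZE = 95     # snapshot is a 95x95 region centered on the client
DRIFT_THRESHOLD = 12   # resend once the center drifts this many tiles

def needs_snapshot(last_center, current_center):
    """True when the view center has drifted beyond the threshold
    since the last snapshot, triggering a fresh 95x95 region."""
    dx = abs(current_center[0] - last_center[0])
    dy = abs(current_center[1] - last_center[1])
    return max(dx, dy) > DRIFT_THRESHOLD
</n``` 

Because 95 comfortably covers a 70-tile zoomed-out view plus a 10-tile pan, the fresh snapshot usually arrives before the player can scroll off the edge of the cached region.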
For bandwidth and latency, the protocol uses Protocol Buffers encoded as binary wire format and compressed with Zstandard. The project also borrows from the earlier “1 million checkboxes” scaling approach: minimize bandwidth as the unbounded cost, and use batching to reduce per-update overhead. Even global metadata (like player counts) is offloaded via Cloudflare caching with TTL-based refreshes, cutting server fan-out.
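The bandwidth win from binary encoding is easy to demonstrate. The sketch below packs moves as fixed-width integers (a hypothetical field layout, not the project's actual protobuf schema) and uses zlib as a stand-in for Zstandard, since the point, avoiding repeated self-describing field names, holds for either compressor.

```python
import json
import struct
import zlib

# 100 illustrative moves: (fromX, fromY, toX, toY, pieceId)
moves = [(1200, 3400, 1201, 3401, 7)] * 100

# JSON repeats every field name in every move...
as_json = json.dumps(
    [dict(fx=a, fy=b, tx=c, ty=d, pid=e) for a, b, c, d, e in moves]
).encode()

# ...while a binary encoding packs each move into 12 fixed-width bytes.
as_bin = b"".join(struct.pack("<HHHHI", *m) for m in moves)

print(len(as_json), len(as_bin))                          # raw sizes
print(len(zlib.compress(as_json)), len(zlib.compress(as_bin)))  # compressed
```

Protocol Buffers goes further than this fixed-width sketch (varint encoding, field tags), but the shape of the saving is the same: structure lives in the schema, not in every message.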
Finally, the multiplayer consistency problem is handled with optimistic client-side movement plus server validation, including rollback. Moves carry tokens and sequence numbers; when the server rejects a move, the client reverts. But conflicts can arise from timing—two players may move into the same square before either rejection arrives—so the client tracks dependencies between moves using a conflict graph and unwinds all related moves when necessary. The result is a system that stays responsive under concurrency while still converging to the server’s ground truth.
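The dependency-tracking idea can be modeled compactly: each pending optimistic move records which earlier pending moves touched the same squares, and a rejection unwinds the whole transitive set. This is a simplified sketch of the described conflict graph, with assumed names and no board mutation.

```python
from collections import defaultdict

class PendingMoves:
    """Track optimistic moves awaiting server validation. A rejection
    unwinds the move plus every later move that touched its squares."""

    def __init__(self):
        self.moves = {}               # seq -> (from_square, to_square)
        self.deps = defaultdict(set)  # seq -> later seqs depending on it

    def apply(self, seq, frm, to):
        # Any earlier pending move sharing a square becomes a dependency:
        # if it is rejected, this move's premise is invalid too.
        for other_seq, (f, t) in self.moves.items():
            if {f, t} & {frm, to}:
                self.deps[other_seq].add(seq)
        self.moves[seq] = (frm, to)

    def reject(self, seq):
        """Server rejected `seq`: collect it and all transitive dependents,
        returning them newest-first so the client can undo in order."""
        to_undo, stack = set(), [seq]
        while stack:
            s = stack.pop()
            if s in to_undo or s not in self.moves:
                continue
            to_undo.add(s)
            stack.extend(self.deps[s])
        for s in to_undo:
            self.moves.pop(s, None)
            self.deps.pop(s, None)
        return sorted(to_undo, reverse=True)
```

For example, if move 2 continued from the square move 1 vacated, rejecting move 1 also unwinds move 2, while an unrelated move 3 survives; accepted moves would simply be removed from `moves` without touching dependents.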
The project also reflects on what didn’t land: many chess players were surprised by the color assignment and rule deviations, and UI clarity around cross-board behavior lagged expectations. Still, the engineering takeaway is clear: with careful state encoding, spatial interest management, caching, and measured performance decisions, a million-board MMO can run in one process without collapsing under load.
Cornell Notes
A million-board chess MMO can run as a single global game state on one server process by combining spatial interest management with efficient binary updates. Clients don’t receive the whole 8,000×8,000 board; they get an initial 95×95 snapshot around their view and then only move batches from nearby 50×50 zones (a 3×3 zone neighborhood). Protocol Buffers plus Zstandard compression keeps bandwidth manageable, while Cloudflare caching reduces repeated global fan-out requests. To keep interaction snappy, the client applies moves optimistically and rolls back on server rejection, using move tokens, sequence numbers, and a dependency graph to unwind conflicting move sets.
- Why is “one global game” harder than sharding, and what rule choices make it workable?
- How does the system avoid sending an impossible amount of state to clients?
- Why do the snapshot and update thresholds use “magic numbers” like 95×95 and ~12 tiles?
- What makes the rollback system necessary, and how does it decide what to undo?
- Why use Protocol Buffers and Zstandard instead of JSON?
- How does the server keep concurrency manageable while serving snapshots?
Review Questions
- What specific mechanisms limit each client’s data to a local region of the 8,000×8,000 board, and how are they parameterized (snapshot size, zone size, neighborhood size)?
- Describe how optimistic updates and rollback work together to maintain responsiveness without permanently diverging from server ground truth.
- Why does the dependency graph approach for rollback simplify correctness, and what kinds of move relationships does it treat as “related”?
Key Points
1. The game runs as one continuous global chess-like world with cross-board movement, avoiding sharding but requiring careful rule constraints to prevent immediate king captures.
2. Clients receive a 95×95 snapshot around their view and then only move batches from nearby 50×50 zones (a 3×3 zone neighborhood), preventing full-state broadcast.
3. Snapshot and update thresholds are tuned to client visibility and panning limits so new snapshots arrive before players need them, reducing loading interruptions.
4. Protocol Buffers plus Zstandard compression reduce bandwidth versus JSON by avoiding repeated self-describing field data.
5. Cloudflare caching with TTLs offloads repeated global metadata requests (e.g., player counts) from the server.
6. Low-latency play comes from optimistic client-side movement, while server validation triggers rollback using move tokens, sequence numbers, and a dependency graph to unwind conflicting move sets.
7. Performance decisions were validated with profiling and measurement rather than relying on napkin CPU timing assumptions.