
A Million Chess Boards (in a Single Process!)

The PrimeTime · 6 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The game runs as one continuous global chess-like world with cross-board movement, avoiding sharding but requiring careful rule constraints to prevent immediate king captures.

Briefing

One Million Chessboards (onemillionchessboards.com) runs as one continuous chess-like world—on a single server process—where pieces can move across board boundaries and updates propagate instantly to all players. In the 10 days after launch, more than 150,000 players made over 15 million moves, generating hundreds of millions of queries, all while the server stayed single-process and unchanged. The scale isn’t just a flex; it forces hard engineering tradeoffs around bandwidth, state distribution, and real-time consistency.

The core design choice is a global game state rather than sharding into many independent games. That avoids the overhead of coordinating turns across processes, but it introduces new constraints: pieces can move between boards, yet captures across boards are restricted to prevent immediate “queen takes king” behavior. The restriction unintentionally creates emergent tactics—players build “indestructible” structures and invent patterns reminiscent of other grid games like Hnefatafl. After early play, relaxing the rule to apply only to unmoved pieces led to problems, so the stricter original rule proved more stable.

To make the system fast enough, the architecture leans on snapshotting and move batching instead of broadcasting everything. The server keeps an 8,000 by 8,000 board as a dense array of 64-bit values representing pieces, with metadata packed into the same structure. Because shipping the full board to every client is impossible, clients receive an initial snapshot (a 95 by 95 region around their view center) and then only incremental “move batches” near their current position. Updates are throttled and spatially filtered: the grid is divided into 50 by 50 tile zones, and a client receives moves from a 3x3 block of zones around where they are.
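A minimal sketch of this layout in Go. The exact bit layout and function names (`packPiece`, `zoneOf`, `zonesAround`) are assumptions for illustration; the source only says each square is a 64-bit value with piece data and metadata packed together, with 50×50 zones fanned out as a 3×3 neighborhood.

```go
package main

import "fmt"

const (
	boardSize = 8000 // 8,000 x 8,000 squares (1,000 x 1,000 boards of 8x8)
	zoneSize  = 50   // moves are fanned out per 50x50 zone
)

// packPiece packs a hypothetical piece kind, color, and "has moved" flag
// into one uint64; 0 means an empty square. The real bit layout is not
// documented in the source.
func packPiece(kind uint8, white bool, moved bool) uint64 {
	v := uint64(kind) // bits 0-7: piece kind (e.g. 1=pawn ... 6=king)
	if white {
		v |= 1 << 8 // bit 8: color
	}
	if moved {
		v |= 1 << 9 // bit 9: moved flag (relevant to capture rules)
	}
	return v
}

// zoneOf maps a square coordinate to its zone index.
func zoneOf(x, y int) (int, int) { return x / zoneSize, y / zoneSize }

// zonesAround returns the 3x3 block of zones a client at zone (zx, zy)
// receives move batches from, clipped at the board edge.
func zonesAround(zx, zy int) [][2]int {
	maxZone := boardSize/zoneSize - 1
	var out [][2]int
	for dx := -1; dx <= 1; dx++ {
		for dy := -1; dy <= 1; dy++ {
			nx, ny := zx+dx, zy+dy
			if nx >= 0 && nx <= maxZone && ny >= 0 && ny <= maxZone {
				out = append(out, [2]int{nx, ny})
			}
		}
	}
	return out
}

func main() {
	zx, zy := zoneOf(120, 4999)
	fmt.Println(zx, zy, len(zonesAround(zx, zy))) // prints "2 99 9"
}
```

A client in an interior zone subscribes to 9 zones (450×450 tiles of potential updates); at a corner of the world the neighborhood clips down to 4.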

The snapshot and batching sizes aren’t arbitrary. The system is tuned to the client’s visible area: players can see up to 35 by 35 tiles when zoomed in and up to 70 by 70 when zoomed out, with panning that can shift the view by up to 10 tiles. A 95 by 95 snapshot ensures that moving within that range doesn’t require immediate reloading, which helps avoid “loading spinner” moments. Snapshots are sent when the client’s position drifts more than about 12.5 tiles from the last snapshot.
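The refresh rule can be sketched as a single check. The 12.5-tile threshold is the figure quoted in the source; the distance metric (Chebyshev here) and the function name are assumptions.

```go
package main

import "fmt"

const (
	snapshotSize   = 95   // tiles per side of a snapshot
	driftThreshold = 12.5 // tiles of drift before a new snapshot is sent
)

// needsSnapshot reports whether the client's view center has drifted far
// enough from the last snapshot center to warrant a fresh snapshot.
// Chebyshev (max-axis) distance is an assumption; the source does not say
// which metric is used.
func needsSnapshot(lastX, lastY, curX, curY float64) bool {
	dx, dy := curX-lastX, curY-lastY
	if dx < 0 {
		dx = -dx
	}
	if dy < 0 {
		dy = -dy
	}
	d := dx
	if dy > d {
		d = dy
	}
	return d > driftThreshold
}

func main() {
	fmt.Println(needsSnapshot(100, 100, 110, 100)) // 10 tiles: still covered
	fmt.Println(needsSnapshot(100, 100, 113, 100)) // 13 tiles: refresh
}
```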

For bandwidth and latency, the protocol uses Protocol Buffers, a compact binary wire format, compressed with Zstandard. The project also borrows from the earlier “One Million Checkboxes” scaling approach: minimize bandwidth as the unbounded cost, and use batching to reduce per-update overhead. Even global metadata (like player counts) is offloaded via Cloudflare caching with TTL-based refreshes, cutting server fan-out.

Finally, the multiplayer consistency problem is handled with optimistic client-side movement plus server validation, including rollback. Moves carry tokens and sequence numbers; when the server rejects a move, the client reverts. But conflicts can arise from timing—two players may move into the same square before either rejection arrives—so the client tracks dependencies between moves using a conflict graph and unwinds all related moves when necessary. The result is a system that stays responsive under concurrency while still converging to the server’s ground truth.
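The optimistic-apply-then-revert cycle can be sketched as follows. The structures (`pendingMove`, `applyOptimistic`, `onServerReject`) are invented for illustration; the source describes tokens, sequence numbers, and rollback but not the client's actual data model. Safe reversal of a single move like this assumes no later pending move touched the same squares, which is exactly what the conflict graph handles.

```go
package main

import "fmt"

type square struct{ x, y int }

// pendingMove records enough to undo one optimistic move.
type pendingMove struct {
	seq      uint64 // client-assigned sequence number
	token    string // token the server echoes back on accept/reject
	from, to square
	captured uint64 // piece previously on the destination, for rollback
}

type client struct {
	board   map[square]uint64      // nonzero value = piece on that square
	nextSeq uint64
	pending map[string]pendingMove // keyed by token
}

// applyOptimistic moves a piece locally before the server confirms it.
func (c *client) applyOptimistic(token string, from, to square) {
	c.nextSeq++
	m := pendingMove{seq: c.nextSeq, token: token, from: from, to: to, captured: c.board[to]}
	c.board[to] = c.board[from]
	delete(c.board, from)
	c.pending[token] = m
}

// onServerReject reverts the rejected move, restoring any captured piece.
func (c *client) onServerReject(token string) {
	m, ok := c.pending[token]
	if !ok {
		return
	}
	c.board[m.from] = c.board[m.to]
	if m.captured != 0 {
		c.board[m.to] = m.captured
	} else {
		delete(c.board, m.to)
	}
	delete(c.pending, token)
}

func main() {
	c := &client{board: map[square]uint64{{0, 0}: 1}, pending: map[string]pendingMove{}}
	c.applyOptimistic("tok-1", square{0, 0}, square{0, 1})
	c.onServerReject("tok-1")
	fmt.Println(c.board[square{0, 0}]) // piece restored after rollback
}
```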

The project also reflects on what didn’t land: many chess players were surprised by the color assignment and rule deviations, and UI clarity around cross-board behavior lagged expectations. Still, the engineering takeaway is clear: with careful state encoding, spatial interest management, caching, and measured performance decisions, a million-board MMO can run in one process without collapsing under load.

Cornell Notes

A million-board chess MMO can run as a single global game state on one server process by combining spatial interest management with efficient binary updates. Clients don’t receive the whole 8,000×8,000 board; they get an initial 95×95 snapshot around their view and then only move batches from nearby 50×50 zones (a 3×3 zone neighborhood). Protocol Buffers plus Zstandard compression keeps bandwidth manageable, while Cloudflare caching reduces repeated global fan-out requests. To keep interaction snappy, the client applies moves optimistically and rolls back on server rejection, using move tokens, sequence numbers, and a dependency graph to unwind conflicting move sets.

Why is “one global game” harder than sharding, and what rule choices make it workable?

Keeping everything in one process avoids cross-process coordination, but it means every move affects the shared world immediately. That creates rule conflicts: pieces can move between boards, yet captures across board boundaries are restricted to prevent immediate queen-to-king captures across adjacent boards. The restriction later produced emergent tactics—players exploited it to form “indestructible” structures—so relaxing it after launch (to apply only to unmoved pieces) was considered a mistake and the stricter version was favored.

How does the system avoid sending an impossible amount of state to clients?

The full board is far too large to broadcast. Instead, the server sends a snapshot (a 95×95 region around the player’s screen-centered position) and then sends move batches only for moves near that region. The grid is partitioned into 50×50 zones; a client receives moves from the 3×3 block of zones centered on the client’s current zone. Since moves are the only state changes, this keeps bandwidth proportional to what’s relevant to each player.
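Seen from the server side, the per-client filter reduces to a zone-distance check. The function name is illustrative, not from the source.

```go
package main

import "fmt"

const zoneSide = 50 // tiles per zone side

// shouldDeliver reports whether a move at (moveX, moveY) falls inside the
// 3x3 block of zones around a client positioned at (clientX, clientY).
func shouldDeliver(moveX, moveY, clientX, clientY int) bool {
	dzx := moveX/zoneSide - clientX/zoneSide
	dzy := moveY/zoneSide - clientY/zoneSide
	return dzx >= -1 && dzx <= 1 && dzy >= -1 && dzy <= 1
}

func main() {
	fmt.Println(shouldDeliver(1049, 200, 1000, 200)) // same zone: delivered
	fmt.Println(shouldDeliver(1200, 200, 1000, 200)) // 4 zones away: skipped
}
```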

Why do the snapshot and update thresholds use “magic numbers” like 95×95 and ~12 tiles?

Those sizes are derived from what clients can actually see and how far they can pan between updates. Players can see up to 35×35 tiles when zoomed in and up to 70×70 when zoomed out, with panning shifting the view by up to 10 tiles. A 95×95 snapshot provides enough coverage so that moving within that range shouldn’t require a new snapshot. The server then triggers a new snapshot when the client’s position drifts more than about 12.5 tiles from the last snapshot center.
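One plausible reading of these numbers (an assumption; the source only quotes the figures): a 95×95 snapshot leaves (95 − 70) / 2 = 12.5 tiles of slack on each side of the largest 70×70 view, which is exactly the quoted drift threshold.

```go
package main

import "fmt"

// slackPerSide returns the margin between the snapshot edge and the largest
// visible view, per side: once the view center drifts past this, the view
// could reach beyond the snapshot, so a refresh is needed.
func slackPerSide(snapshot, maxView float64) float64 {
	return (snapshot - maxView) / 2
}

func main() {
	fmt.Println(slackPerSide(95, 70)) // prints "12.5"
}
```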

What makes the rollback system necessary, and how does it decide what to undo?

Optimistic client movement is used to target near-zero perceived latency, but server validation can reject moves due to conflicts or rule violations. Timing issues mean a client might temporarily show an invalid state (e.g., two rooks in the same square) until it learns about other players’ moves. When a rejection or conflict is detected, the client unwinds not just one move but all moves related to the conflict. Moves are considered related if they touch the same squares or pieces, and the client maintains a dependency graph; when conflicts occur, it unwinds all moves in the merged dependency graph.
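The "unwind all related moves" bookkeeping can be sketched with a union-find over pending moves: any two moves touching the same square end up in one group, so a rejection condemns the whole group. The source describes a dependency graph but not this exact structure; names here are illustrative.

```go
package main

import "fmt"

type square struct{ x, y int }

// conflictGroups merges pending moves that share squares into one set,
// so they can be unwound together on rejection.
type conflictGroups struct {
	parent map[int]int    // move id -> representative (union-find)
	bySq   map[square]int // last pending move id touching each square
}

func newConflictGroups() *conflictGroups {
	return &conflictGroups{parent: map[int]int{}, bySq: map[square]int{}}
}

func (g *conflictGroups) find(id int) int {
	for g.parent[id] != id {
		id = g.parent[id]
	}
	return id
}

func (g *conflictGroups) union(a, b int) { g.parent[g.find(a)] = g.find(b) }

// addMove registers a pending move; any earlier pending move touching the
// same from/to squares is merged into its group.
func (g *conflictGroups) addMove(id int, from, to square) {
	g.parent[id] = id
	for _, sq := range []square{from, to} {
		if prev, ok := g.bySq[sq]; ok {
			g.union(prev, id)
		}
		g.bySq[sq] = id
	}
}

// related reports whether two pending moves must be unwound together.
func (g *conflictGroups) related(a, b int) bool { return g.find(a) == g.find(b) }

func main() {
	g := newConflictGroups()
	g.addMove(1, square{0, 0}, square{0, 1}) // move 1
	g.addMove(2, square{0, 1}, square{0, 2}) // shares square {0,1} with move 1
	g.addMove(3, square{5, 5}, square{5, 6}) // untouched elsewhere
	fmt.Println(g.related(1, 2), g.related(1, 3)) // prints "true false"
}
```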

Why use Protocol Buffers and Zstandard instead of JSON?

Protocol Buffers are a binary wire format with typed fields, so they avoid repeating field names and other self-describing overhead that JSON carries. The system compresses these binary messages with Zstandard to further reduce bandwidth. The tradeoff is that Protocol Buffers aren’t self-contained like JSON—decoders must know the schema—but the bandwidth savings matter at million-board scale.
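The overhead difference is easy to demonstrate with a toy move message. This is not the project's actual schema, and plain fixed-width fields stand in for Protocol Buffers here (the real system also applies Zstandard on top); the point is only that JSON repeats field names in every message while a typed binary encoding sends just the values.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// move is a hypothetical wire message for one chess move.
type move struct {
	Seq   uint32 `json:"seq"`
	FromX uint16 `json:"fromX"`
	FromY uint16 `json:"fromY"`
	ToX   uint16 `json:"toX"`
	ToY   uint16 `json:"toY"`
}

// jsonSize is the self-describing JSON encoding, field names included.
func jsonSize(m move) int {
	b, _ := json.Marshal(m)
	return len(b)
}

// binarySize is a schema-dependent fixed-width encoding: 4 + 4*2 = 12 bytes.
func binarySize(m move) int {
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, m)
	return buf.Len()
}

func main() {
	m := move{Seq: 7, FromX: 4999, FromY: 120, ToX: 4999, ToY: 121}
	fmt.Println(jsonSize(m), binarySize(m))
}
```

At 15 million moves, the bytes spent re-sending `"fromX":` and friends in every message add up, which is why a schema-on-both-ends format wins despite losing JSON's self-containedness.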

How does the server keep concurrency manageable while serving snapshots?

The board is stored as a dense array protected by a read-write mutex. Writers validate and apply moves, while snapshot readers take a read lock, copy the relevant region, and release the lock quickly. The design emphasizes lock hold time: readers copy contiguous sequential values (fast on modern CPUs), and post-processing (like filtering empty pieces) happens after releasing the lock to minimize contention.
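A sketch of that read pattern, with the board shrunk for the example (the real one is 8,000×8,000). The structure names are assumptions; the key points from the source are the read-write mutex, the contiguous copy under the read lock, and post-processing outside it.

```go
package main

import (
	"fmt"
	"sync"
)

const side = 100 // shrunk board edge for the example

type board struct {
	mu      sync.RWMutex
	squares [side * side]uint64 // dense row-major array, 0 = empty
}

// snapshot copies a w x h region whose top-left corner is (x0, y0).
// The read lock is held only for the row copies; filtering empty squares
// or encoding for the wire would happen after RUnlock to keep lock hold
// time minimal.
func (b *board) snapshot(x0, y0, w, h int) []uint64 {
	out := make([]uint64, 0, w*h)
	b.mu.RLock()
	for y := y0; y < y0+h; y++ {
		row := b.squares[y*side+x0 : y*side+x0+w]
		out = append(out, row...) // contiguous copy per row: cache-friendly
	}
	b.mu.RUnlock()
	return out
}

func main() {
	var b board
	b.mu.Lock()
	b.squares[5*side+5] = 42 // writer places one piece at (5, 5)
	b.mu.Unlock()
	region := b.snapshot(0, 0, 10, 10)
	fmt.Println(len(region), region[5*10+5]) // prints "100 42"
}
```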

Review Questions

  1. What specific mechanisms limit each client’s data to a local region of the 8,000×8,000 board, and how are they parameterized (snapshot size, zone size, neighborhood size)?
  2. Describe how optimistic updates and rollback work together to maintain responsiveness without permanently diverging from server ground truth.
  3. Why does the dependency graph approach for rollback simplify correctness, and what kinds of move relationships does it treat as “related”?

Key Points

  1. The game runs as one continuous global chess-like world with cross-board movement, avoiding sharding but requiring careful rule constraints to prevent immediate king captures.

  2. Clients receive a 95×95 snapshot around their view and then only move batches from nearby 50×50 zones (a 3×3 zone neighborhood), preventing full-state broadcast.

  3. Snapshot and update thresholds are tuned to client visibility and panning limits so new snapshots arrive before players need them, reducing loading interruptions.

  4. Protocol Buffers plus Zstandard compression reduce bandwidth versus JSON by avoiding repeated self-describing field data.

  5. Cloudflare caching with TTLs offloads repeated global metadata requests (e.g., player counts) from the server.

  6. Low-latency play comes from optimistic client-side movement, while server validation triggers rollback using move tokens, sequence numbers, and a dependency graph to unwind conflicting move sets.

  7. Performance decisions were validated with profiling and measurement rather than relying on napkin CPU timing assumptions.

Highlights

A million-board chess MMO operated as a single-process server, yet still supported 150,000+ players and 15 million+ moves in the first 10 days.
Interest management is spatial: clients get 95×95 snapshots and then only moves from a 3×3 neighborhood of 50×50 zones around their current position.
Optimistic movement plus dependency-graph rollback tackles real-time conflicts where multiple players act before server confirmations arrive.
Protocol Buffers (binary) and Zstandard compression keep the wire protocol small enough for massive query volume.
Rule restrictions on cross-board captures unintentionally generated emergent “indestructible” structures and tactics.

Topics

  • Single-Process MMO
  • Spatial Interest Management
  • Binary Protocols
  • Optimistic Rollback
  • Concurrency and Locking

Mentioned

  • Theo
  • MMO
  • JSON
  • Protocol Buffers
  • Zstandard
  • RLE
  • FEN
  • Go
  • CPU
  • TTL
  • VM
  • mutex
  • Cloudflare
  • Nginx
  • React
  • WebSocket