My biggest failure to date
Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
T3 Chat’s first major outage wasn’t caused by a single bug—it was the result of a high-stakes migration to Convex that repeatedly hit scaling and operational blind spots, followed by a separate failure mode tied to websocket reconnection behavior and search indexing churn. The core takeaway: even when a workload seems “within limits,” real-world traffic patterns (background tabs, long-lived websocket connections, and bursty subscription invalidations) can turn a stable system into a self-amplifying load problem.
The outage’s user-facing symptoms were severe: chats wouldn’t load, creating new chats barely worked, and the app lagged heavily for hours, leaving the site almost unusable. The immediate technical root was the websocket connection layer used by Convex—once that layer failed, queries stalled and the system effectively stopped delivering updates. That failure traced back to how the migration was executed and how client subscriptions behaved while migration writes were happening.
Before the migration, T3 Chat stored data in a MySQL database on PlanetScale. Moving to Convex wasn’t a simple storage swap; it required a near-rewrite of the app’s data flow. With Convex, the client subscribes to queries over websockets, receiving incremental updates as data changes. That model reduces complex edge cases around synchronization, but it also means that write-heavy operations can explode into massive query traffic if subscriptions are active at the wrong time.
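To make that model concrete, here is a minimal sketch of a subscription-backed Convex query; the table, index, and field names are assumptions for illustration, not T3 Chat's actual schema.

```ts
// convex/messages.ts -- minimal subscription-backed query (hypothetical schema)
import { query } from "./_generated/server";
import { v } from "convex/values";

export const list = query({
  args: { threadId: v.id("threads") },
  handler: async (ctx, { threadId }) => {
    // Any write that changes this result set re-runs the query and pushes
    // the fresh result over the websocket to every subscribed client.
    return await ctx.db
      .query("messages")
      .withIndex("by_thread", (q) => q.eq("threadId", threadId))
      .collect();
  },
});
```

On the client, `useQuery(api.messages.list, { threadId })` holds a live subscription for as long as the component is mounted, which is exactly why write-heavy operations against actively subscribed data get expensive.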
The migration plan pulled each user’s MySQL data into Convex in chunks (attachments, threads, messages) and then finalized with a “migration done” flag. The first attempt failed on identity mismatches in JWTs: an OpenAuth library bug produced malformed or inconsistent user IDs, causing migrations to loop for affected users. The fix switched to reading the user ID from the token’s nested properties field and introduced an opt-in T3 Chat beta, which initially looked promising.
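A hedged sketch of that shape in Convex, assuming hypothetical function names, table names, and a per-user migrationDone flag:

```ts
// convex/migrate.ts -- hypothetical sketch of the chunked per-user migration.
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

// Insert one chunk of rows pulled from MySQL (attachments, threads, or messages).
export const insertChunk = internalMutation({
  args: {
    table: v.union(
      v.literal("attachments"),
      v.literal("threads"),
      v.literal("messages")
    ),
    rows: v.array(v.any()),
  },
  handler: async (ctx, { table, rows }) => {
    for (const row of rows) {
      await ctx.db.insert(table, row);
    }
  },
});

// Finalize: flip the per-user "migration done" flag so the client can cut over.
export const markDone = internalMutation({
  args: { userId: v.id("users") },
  handler: async (ctx, { userId }) => {
    await ctx.db.patch(userId, { migrationDone: true });
  },
});
```

That flag is also what later made it possible to gate client subscriptions, as described below.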
A second attempt failed when too many migrations ran concurrently, producing overwhelming mutation throughput. A workpool mechanism with a maximum-parallelism limit (later reduced to three) was introduced to queue migrations instead of running them all at once.
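For reference, a sketch of what that capping could look like with Convex's workpool component (@convex-dev/workpool); the pool name and the enqueued migration function are hypothetical:

```ts
// convex/migrations.ts -- hypothetical sketch of capped migration parallelism.
import { Workpool } from "@convex-dev/workpool";
import { internalMutation } from "./_generated/server";
import { components, internal } from "./_generated/api";
import { v } from "convex/values";

// Cap concurrent migration jobs; the video describes dialing this down to 3.
const migrationPool = new Workpool(components.migrationWorkpool, {
  maxParallelism: 3,
});

export const enqueueUserMigration = internalMutation({
  args: { userId: v.id("users") },
  handler: async (ctx, { userId }) => {
    // Jobs beyond the parallelism cap wait in the queue instead of
    // hammering the database all at once.
    await migrationPool.enqueueMutation(ctx, internal.migrate.migrateUser, {
      userId,
    });
  },
});
```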
The third attempt initially improved things, especially after a key change that prevented thread prefetch subscriptions from running before migration completion—reducing load by 40–70x. But the system still collapsed after go-live when query throughput spiked dramatically. The deeper issue was subtle: while migration writes were streaming, the client’s “snappy navigation” feature kept subscriptions active for the top threads in the sidebar. As threads and messages updated during migration, those subscriptions triggered hundreds of additional queries, effectively turning the migration into a background DDoS against the Convex query layer.
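Convex's React client supports exactly this kind of gating: passing "skip" in place of query arguments keeps the subscription closed. A sketch, reusing the hypothetical api.messages.list query from above:

```ts
import { useQuery } from "convex/react";
import { api } from "../convex/_generated/api";
import { Id } from "../convex/_generated/dataModel";

// Gate the sidebar prefetch subscription on the per-user migrationDone flag
// so streaming migration writes can't fan out into hundreds of query re-runs.
function useThreadPrefetch(threadId: Id<"threads">, migrationDone: boolean) {
  // "skip" keeps the websocket subscription closed until migration completes.
  return useQuery(api.messages.list, migrationDone ? { threadId } : "skip");
}
```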
After the migration finally stabilized, a new outage emerged the next morning. Convex’s postmortem attributed the cascade to three interacting problems: search index compaction invalidated caches every ~30 minutes, causing subscription refresh storms; websocket backoff logic failed to apply under certain failure cases, leading to rapid reconnect loops; and emergency operational tooling accidentally placed the deployment back onto default/free-tier hardware resources, making the overload far more damaging. Background tabs kept websocket connections alive for long periods, so the reconnection storm persisted even after fixes were deployed.
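To illustrate the backoff failure (a generic sketch, not Convex's actual client code): a reconnect loop is only safe if an exponential, jittered delay applies on every close and error path, and the failure mode described above was a path where it did not.

```ts
// Generic websocket reconnect loop with exponential backoff and jitter.
function reconnectDelayMs(attempt: number): number {
  const cap = 30_000; // never wait more than 30s between attempts
  const base = Math.min(cap, 500 * 2 ** attempt);
  // Jitter spreads retries out so thousands of tabs don't reconnect in lockstep.
  return base / 2 + Math.random() * (base / 2);
}

function connectWithBackoff(url: string, attempt = 0): void {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // reset the backoff once a connection succeeds
  };
  ws.onclose = () => {
    // The delay must apply on *every* exit path; a path that skips it
    // turns idle background tabs into a tight reconnect storm.
    setTimeout(() => connectWithBackoff(url, attempt + 1), reconnectDelayMs(attempt));
  };
}
```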
Going forward, T3 Chat plans a status platform and in-app outage reporting, real paging and automated alerting, better client refresh/disconnect mechanisms for stale tabs, and a review of all dependencies’ load characteristics. The message is blunt: reliability at this scale requires both technical safeguards (backoff, subscription gating, queueing) and operational discipline (paging, status, and tooling that prevents misconfiguration).
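One plausible shape for the stale-tab mechanism (an assumption, not the announced implementation) is to use the browser's Page Visibility API to drop connections from long-hidden tabs; disconnect() and reconnect() below stand in for whatever the real client exposes.

```ts
// Hypothetical sketch: drop the websocket after a tab has been hidden for a
// while, and resubscribe when it becomes visible again.
const STALE_AFTER_MS = 5 * 60_000; // assumption: 5 minutes hidden counts as stale
let staleTimer: ReturnType<typeof setTimeout> | undefined;
let socket: WebSocket | undefined;

function disconnect(): void {
  socket?.close();
  socket = undefined;
}

function reconnect(url = "wss://example.invalid/sync"): void {
  // A fresh connection also picks up newly deployed client code.
  if (!socket) socket = new WebSocket(url); // placeholder endpoint
}

document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden") {
    // Start the staleness clock when the tab goes to the background.
    staleTimer = setTimeout(disconnect, STALE_AFTER_MS);
  } else {
    if (staleTimer !== undefined) clearTimeout(staleTimer);
    reconnect();
  }
});
```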
Cornell Notes
T3 Chat’s outage stemmed from a Convex migration that behaved differently under real traffic than in testing, then escalated into a websocket reconnection storm tied to search indexing and operational misconfiguration. Users couldn’t load chats because the Convex websocket layer failed, and query throughput spiked when client subscriptions stayed active during migration writes. Earlier migration failures were traced to identity issues in JWTs (an OpenAuth library bug) and then to too much concurrent migration throughput, prompting queued migrations via Convex workpools. After partial stabilization, a separate morning incident involved search index compaction invalidating caches every ~30 minutes, reconnection backoff not triggering correctly, and emergency tooling accidentally reverting hardware resources to free-tier. The lesson: background tabs, subscription invalidations, and operational tooling can turn “within limits” workloads into cascading failures.
What made the migration to Convex fundamentally different from a typical database change?
Why did the first migration attempt loop for some users?
How did the team prevent too many migrations from running at once in the second attempt?
What was the “snappy navigation” subscription problem during the third attempt?
What caused the next-morning outage after migration stabilization?
Why couldn’t T3 Chat simply force users to refresh to stop the load?
Review Questions
- Which migration failure was caused by identity mismatches in JWTs, and what specific field change fixed it?
- Explain how active sidebar subscriptions during migration could multiply query traffic even if migration throughput was capped.
- List the three interacting causes Convex cited for the morning outage and describe how background tabs made the problem worse.
Key Points
1. T3 Chat’s outage began when Convex websocket connectivity failed, which then stalled chat loading and created severe lag.
2. The Convex migration required a near-rewrite of data flow because the client subscribes to updates over websockets rather than relying on request/response syncing.
3. Identity bugs in JWT handling (malformed subject IDs due to an OpenAuth library issue) caused the first migration to loop for some users.
4. Concurrency control mattered: running many migrations at once created mutation and load spikes, leading to queued migrations via Convex workpools.
5. A “snappy navigation” optimization (top-thread message prefetch subscriptions) unintentionally amplified query traffic during migration writes, creating a self-reinforcing load pattern.
6. The later morning outage combined search index compaction cache invalidations, broken reconnection backoff in a failure mode, and an operational mistake that reverted hardware resources to free-tier capacity.
7. Reliability plans now prioritize operational readiness: real paging, in-app status/updates, and mechanisms to disconnect or refresh stale clients, especially background tabs.