My biggest failure to date
Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
T3 Chat’s first major outage wasn’t caused by a single bug—it was the result of a high-stakes migration to Convex that repeatedly hit scaling and operational blind spots, followed by a separate failure mode tied to websocket reconnection behavior and search indexing churn. The core takeaway: even when a workload seems “within limits,” real-world traffic patterns (background tabs, long-lived websocket connections, and bursty subscription invalidations) can turn a stable system into a self-amplifying load problem.
The outage’s user-facing symptoms were severe: chats wouldn’t load, creating new chats barely worked, and the app lagged heavily for hours, leaving the site almost unusable. The immediate technical root was the websocket connection layer used by Convex—once that layer failed, queries stalled and the system effectively stopped delivering updates. That failure traced back to how the migration was executed and how client subscriptions behaved while migration writes were happening.
Before the migration, T3 Chat stored data in a MySQL database on PlanetScale. Moving to Convex wasn’t a simple storage swap; it required a near-rewrite of the app’s data flow. With Convex, the client subscribes to queries over websockets, receiving incremental updates as data changes. That model reduces complex edge cases around synchronization, but it also means that write-heavy operations can explode into massive query traffic if subscriptions are active at the wrong time.
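To make that model concrete, here is a minimal sketch of a subscription-backed Convex query; the table, index, and field names are assumptions for illustration, not T3 Chat's actual schema.

```ts
// convex/messages.ts -- minimal subscription-backed query (hypothetical schema)
import { query } from "./_generated/server";
import { v } from "convex/values";

export const list = query({
  args: { threadId: v.id("threads") },
  handler: async (ctx, { threadId }) => {
    // Any write that changes this result set re-runs the query and pushes
    // the fresh result over the websocket to every subscribed client.
    return await ctx.db
      .query("messages")
      .withIndex("by_thread", (q) => q.eq("threadId", threadId))
      .collect();
  },
});
```

On the client, `useQuery(api.messages.list, { threadId })` holds a live subscription for as long as the component is mounted, which is exactly why write-heavy operations against actively subscribed data get expensive.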
The migration plan pulled each user’s MySQL data into Convex in chunks (attachments, threads, messages) and then finalized with a “migration done” flag. The first attempt failed on identity mismatches in JWTs: an OpenAuth library bug produced malformed or inconsistent user IDs, causing migrations to loop for affected users. The fix switched to reading the user ID from the token’s nested properties field and introduced an opt-in T3 Chat beta, which initially looked promising.
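A hedged sketch of that shape in Convex, assuming hypothetical function names, table names, and a per-user migrationDone flag:

```ts
// convex/migrate.ts -- hypothetical sketch of the chunked per-user migration.
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

// Insert one chunk of rows pulled from MySQL (attachments, threads, or messages).
export const insertChunk = internalMutation({
  args: {
    table: v.union(
      v.literal("attachments"),
      v.literal("threads"),
      v.literal("messages")
    ),
    rows: v.array(v.any()),
  },
  handler: async (ctx, { table, rows }) => {
    for (const row of rows) {
      await ctx.db.insert(table, row);
    }
  },
});

// Finalize: flip the per-user "migration done" flag so the client can cut over.
export const markDone = internalMutation({
  args: { userId: v.id("users") },
  handler: async (ctx, { userId }) => {
    await ctx.db.patch(userId, { migrationDone: true });
  },
});
```

That flag is also what later made it possible to gate client subscriptions, as described below.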
A second attempt failed when too many migrations ran concurrently, producing overwhelming mutation throughput. A workpool mechanism with a maximum-parallelism limit (later reduced to three) was introduced to queue migrations instead of running them all at once.
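For reference, a sketch of what that capping could look like with Convex's workpool component (@convex-dev/workpool); the pool name and the enqueued migration function are hypothetical:

```ts
// convex/migrations.ts -- hypothetical sketch of capped migration parallelism.
import { Workpool } from "@convex-dev/workpool";
import { internalMutation } from "./_generated/server";
import { components, internal } from "./_generated/api";
import { v } from "convex/values";

// Cap concurrent migration jobs; the video describes dialing this down to 3.
const migrationPool = new Workpool(components.migrationWorkpool, {
  maxParallelism: 3,
});

export const enqueueUserMigration = internalMutation({
  args: { userId: v.id("users") },
  handler: async (ctx, { userId }) => {
    // Jobs beyond the parallelism cap wait in the queue instead of
    // hammering the database all at once.
    await migrationPool.enqueueMutation(ctx, internal.migrate.migrateUser, {
      userId,
    });
  },
});
```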
The third attempt initially improved things, especially after a key change that prevented thread prefetch subscriptions from running before migration completion—reducing load by 40–70x. But the system still collapsed after go-live when query throughput spiked dramatically. The deeper issue was subtle: while migration writes were streaming, the client’s “snappy navigation” feature kept subscriptions active for the top threads in the sidebar. As threads and messages updated during migration, those subscriptions triggered hundreds of additional queries, effectively turning the migration into a background DDoS against the Convex query layer.
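Convex's React client supports exactly this kind of gating: passing "skip" in place of query arguments keeps the subscription closed. A sketch, reusing the hypothetical api.messages.list query from above:

```ts
import { useQuery } from "convex/react";
import { api } from "../convex/_generated/api";
import { Id } from "../convex/_generated/dataModel";

// Gate the sidebar prefetch subscription on the per-user migrationDone flag
// so streaming migration writes can't fan out into hundreds of query re-runs.
function useThreadPrefetch(threadId: Id<"threads">, migrationDone: boolean) {
  // "skip" keeps the websocket subscription closed until migration completes.
  return useQuery(api.messages.list, migrationDone ? { threadId } : "skip");
}
```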
After the migration finally stabilized, a new outage emerged the next morning. Convex’s postmortem attributed the cascade to three interacting problems: search index compaction invalidated caches every ~30 minutes, causing subscription refresh storms; websocket backoff logic failed to apply under certain failure cases, leading to rapid reconnect loops; and emergency operational tooling accidentally placed the deployment back onto default/free-tier hardware resources, making the overload far more damaging. Background tabs kept websocket connections alive for long periods, so the reconnection storm persisted even after fixes were deployed.
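To illustrate the backoff failure (a generic sketch, not Convex's actual client code): a reconnect loop is only safe if an exponential, jittered delay applies on every close and error path, and the failure mode described above was a path where it did not.

```ts
// Generic websocket reconnect loop with exponential backoff and jitter.
function reconnectDelayMs(attempt: number): number {
  const cap = 30_000; // never wait more than 30s between attempts
  const base = Math.min(cap, 500 * 2 ** attempt);
  // Jitter spreads retries out so thousands of tabs don't reconnect in lockstep.
  return base / 2 + Math.random() * (base / 2);
}

function connectWithBackoff(url: string, attempt = 0): void {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // reset the backoff once a connection succeeds
  };
  ws.onclose = () => {
    // The delay must apply on *every* exit path; a path that skips it
    // turns idle background tabs into a tight reconnect storm.
    setTimeout(() => connectWithBackoff(url, attempt + 1), reconnectDelayMs(attempt));
  };
}
```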
Going forward, T3 Chat plans a status platform and in-app outage reporting, real paging and automated alerting, better client refresh/disconnect mechanisms for stale tabs, and a review of all dependencies’ load characteristics. The message is blunt: reliability at this scale requires both technical safeguards (backoff, subscription gating, queueing) and operational discipline (paging, status, and tooling that prevents misconfiguration).
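One plausible shape for the stale-tab mechanism (an assumption, not the announced implementation) is to use the browser's Page Visibility API to drop connections from long-hidden tabs; disconnect() and reconnect() below stand in for whatever the real client exposes.

```ts
// Hypothetical sketch: drop the websocket after a tab has been hidden for a
// while, and resubscribe when it becomes visible again.
const STALE_AFTER_MS = 5 * 60_000; // assumption: 5 minutes hidden counts as stale
let staleTimer: ReturnType<typeof setTimeout> | undefined;
let socket: WebSocket | undefined;

function disconnect(): void {
  socket?.close();
  socket = undefined;
}

function reconnect(url = "wss://example.invalid/sync"): void {
  // A fresh connection also picks up newly deployed client code.
  if (!socket) socket = new WebSocket(url); // placeholder endpoint
}

document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden") {
    // Start the staleness clock when the tab goes to the background.
    staleTimer = setTimeout(disconnect, STALE_AFTER_MS);
  } else {
    if (staleTimer !== undefined) clearTimeout(staleTimer);
    reconnect();
  }
});
```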
Cornell Notes
T3 Chat’s outage stemmed from a Convex migration that behaved differently under real traffic than in testing, then escalated into a websocket reconnection storm tied to search indexing and operational misconfiguration. Users couldn’t load chats because the Convex websocket layer failed, and query throughput spiked when client subscriptions stayed active during migration writes. Earlier migration failures were traced to identity issues in JWTs (an OpenAuth library bug) and then to too much concurrent migration throughput, prompting queued migrations via Convex workpools. After partial stabilization, a separate morning incident involved search index compaction invalidating caches every ~30 minutes, reconnection backoff not triggering correctly, and emergency tooling accidentally reverting hardware resources to free-tier. The lesson: background tabs, subscription invalidations, and operational tooling can turn “within limits” workloads into cascading failures.
What made the migration to Convex fundamentally different from a typical database change?
Why did the first migration attempt loop for some users?
How did the team prevent too many migrations from running at once in the second attempt?
What was the “snappy navigation” subscription problem during the third attempt?
What caused the next-morning outage after migration stabilization?
Why couldn’t T3 Chat simply force users to refresh to stop the load?
Review Questions
- Which migration failure was caused by identity mismatches in JWTs, and what specific field change fixed it?
- Explain how active sidebar subscriptions during migration could multiply query traffic even if migration throughput was capped.
- List the three interacting causes Convex cited for the morning outage and describe how background tabs made the problem worse.
Key Points
1. T3 Chat’s outage began when Convex websocket connectivity failed, which then stalled chat loading and created severe lag.
2. The Convex migration required a near-rewrite of data flow because the client subscribes to updates over websockets rather than relying on request/response syncing.
3. Identity bugs in JWT handling (malformed subject IDs due to an OpenAuth library issue) caused the first migration to loop for some users.
4. Concurrency control mattered: running many migrations at once created mutation and load spikes, leading to queued migrations via Convex workpools.
5. A “snappy navigation” optimization (top-thread message prefetch subscriptions) unintentionally amplified query traffic during migration writes, creating a self-reinforcing load pattern.
6. The later morning outage combined search index compaction cache invalidations, broken reconnection backoff in a failure mode, and an operational mistake that reverted hardware resources to free-tier capacity.
7. Reliability plans now prioritize operational readiness: real paging, in-app status/updates, and mechanisms to disconnect or refresh stale clients, especially background tabs.