Why the internet went down for 2.5 hours yesterday

Theo - t3.gg · 5 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

Cloudflare’s June 12, 2025 outage lasted 2 hours and 28 minutes and was driven by a failure in the storage infrastructure underlying Worker KV.

Briefing

A 2-hour, 28-minute outage at Cloudflare on June 12, 2025 knocked out a large share of services that depend on Cloudflare’s Worker KV storage layer, affecting identity, security, real-time features, uploads, and parts of the developer platform. The blast radius was unusually broad: Cloudflare reported that 90%+ of Worker KV requests failed, and the failure cascaded into downstream products, leaving many users unable to log in or access protected resources.

The incident’s core cause was a failure in the underlying storage infrastructure used by Worker KV, a “glue” service that many Cloudflare offerings rely on for configuration, authentication, and asset delivery. Cloudflare said the proximate trigger involved a third-party vendor storage failure, but emphasized that it bears ultimate responsibility for its dependency choices and architecture. No stored data was permanently deleted, though Cloudflare acknowledged that some transient data, such as analytics events expected to persist, was likely lost during the outage window.

What made the outage especially alarming was the dependency chain between Cloudflare and Google Cloud. Cloudflare’s Worker KV was widely assumed to be independent of Google Cloud infrastructure, but Cloudflare later confirmed that Worker KV’s cold storage layer depends on Google Cloud. That meant a failure in Google-provided storage components could take down a service that sits at the center of Cloudflare’s platform. The outage also hit services built on top of Worker KV even when those services were hosted elsewhere (for example, applications using Cloudflare-managed components for storage-backed features).

Cloudflare’s internal timeline shows how quickly the problem was detected and escalated. Warp saw issues with new device provisioning, then Cloudflare Access alerts triggered as error rates spiked. Within roughly 16 minutes of identifying a shared cause, the incident was upgraded to P1, and later to P0, prompting a broad internal response. During mitigation, teams worked in parallel to reduce reliance on the failing storage: Access explored migrating off Worker KV backing stores, Gateway began gracefully degrading identity and device posture checks, and engineering worked on alternative backing stores and temporary workarounds to keep critical services functioning.

The list of impacted products was extensive: Worker KV itself, Cloudflare Access, Gateway, Warp, Turnstile and Challenges, Workers and Workers AI, parts of the dashboard, image uploads (with peak failure rates reported at 100%), stream/real-time features, and various asset and page delivery paths. Some components were less affected—Cloudflare’s CDN was “mostly up”—but many user-facing flows still broke because they depended on KV lookups for authentication, configuration, or bot checks.

Cloudflare’s forward-looking plan focuses on resiliency: reducing single points of failure in storage infrastructure it doesn’t fully own, accelerating migration of the cold storage layer to R2 (Cloudflare’s S3 alternative), and adding product-level blast-radius controls so Worker KV outages don’t cascade into total platform failures. It also includes tooling to progressively re-enable KV namespaces during storage incidents, aiming to avoid overwhelming services when caches fail and cold reads surge. Cloudflare acknowledged the incident as customer-impacting, took ownership of the failure, and promised a full postmortem and architectural changes to prevent similar cascading outages.

Cornell Notes

Cloudflare’s June 12, 2025 outage lasted 2 hours 28 minutes and stemmed from a storage failure underlying Worker KV, a core “key-value glue” service used for configuration, authentication, and asset delivery. Cloudflare reported that 90%+ of Worker KV requests failed, causing cascading outages across many products—especially identity and security flows like Cloudflare Access, Gateway, Warp, Turnstile/Challenges, and parts of Workers and Workers AI. A key revelation was that Worker KV’s cold storage layer depends on Google Cloud, contradicting earlier assumptions of full independence. Cloudflare mitigated by escalating incident severity quickly and running parallel work to reduce dependency, degrade nonessential checks, and route critical operations toward alternative backing stores. Going forward, it plans to improve storage redundancy, accelerate migration to R2, and add controls to limit blast radius during storage incidents.

Why did Worker KV failure cascade into so many different Cloudflare products?

Worker KV acts as shared infrastructure for many downstream services. Cloudflare described it as the “glue” holding together configuration, authentication, and asset delivery across products. When Worker KV cold reads/writes failed, services that rely on KV for identity checks, bot challenges, and runtime configuration also failed. Cloudflare reported 90%+ Worker KV request failures, which translated into broad outages for Access, Gateway, Warp, Turnstile/Challenges, Workers-related features, and even parts of the dashboard.
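
To make that dependency concrete, here is a minimal TypeScript sketch, not Cloudflare’s code, of a Worker whose request path needs KV for both configuration and session data; the bindings (CONFIG_KV, SESSION_KV), keys, and helper are made-up illustrations. When the KV backing store fails, this path fails even though compute is healthy.

```ts
// Minimal sketch of a Worker whose request path depends on KV lookups.
// Binding names, keys, and the helper below are illustrative assumptions.
export interface Env {
  CONFIG_KV: KVNamespace;
  SESSION_KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    try {
      // Runtime configuration and the caller's session both live in KV,
      // so a KV storage failure breaks this path even if compute is healthy.
      const config = await env.CONFIG_KV.get("tenant-config", "json");
      const session = await env.SESSION_KV.get(getSessionId(request));
      if (!config || !session) {
        return new Response("Unauthorized", { status: 401 });
      }
      return new Response("OK", { status: 200 });
    } catch (err) {
      // When the KV backing store is down, every product sharing this
      // pattern starts erroring at once: the cascade described above.
      return new Response("Service unavailable", { status: 503 });
    }
  },
};

// Illustrative helper: pull a session identifier out of the Cookie header.
function getSessionId(request: Request): string {
  const cookie = request.headers.get("Cookie") ?? "";
  return cookie.match(/session=([^;]+)/)?.[1] ?? "";
}
```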

What dependency surprised people most, and how did it change the interpretation of the outage?

Cloudflare said Worker KV’s cold storage layer depends on Google Cloud. The assumption had been that Cloudflare’s infrastructure was fully independent of Google Cloud, but that turned out not to be true for key pieces. That dependency meant a storage failure in the underlying cold storage layer could take down Worker KV and, by extension, many Cloudflare services, even if those services were otherwise “owned” and operated by Cloudflare.

What were the most visible user-facing impacts reported during the incident?

Cloudflare reported severe login and access failures: Access was down because it pulled configs from cold storage; standard login flows relying on Turnstile and Google OIDC failed; and Gateway isolation sessions failed. Uploads were also heavily affected, with image uploads reaching a 100% failure rate at peak. Real-time/streaming and Workers AI were described as down or struggling, while CDN delivery was “mostly up,” indicating that not every path depended equally on KV.

How did Cloudflare respond operationally once the shared cause was identified?

The incident timeline shows rapid escalation: Warp saw provisioning issues, then Access alerts triggered as error rates rose. Once Cloudflare linked the separate service incidents to a single shared cause, it upgraded the incident to P1 within about 16 minutes, then to P0 shortly after. Mitigation included parallel efforts: Access explored removing the Worker KV dependency by migrating backing stores; Gateway degraded identity/device posture rules and temporarily dropped rules referencing KV to reduce load; and engineering worked on alternative backing stores and temporary KV workarounds to unblock critical services.
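
To picture the graceful-degradation step, the sketch below (an assumption for illustration, not Cloudflare’s implementation) shows a policy evaluator that skips rules requiring KV while the store is unhealthy and keeps enforcing the rest.

```ts
// Hedged sketch of degrading KV-backed policy rules during a storage outage.
// The Rule shape and the skip-on-outage behavior are illustrative assumptions.
interface Rule {
  id: string;
  requiresKv: boolean; // e.g. device-posture state that lives in Worker KV
  evaluate(request: Request): Promise<boolean>;
}

async function evaluatePolicies(
  rules: Rule[],
  request: Request,
  kvHealthy: boolean,
): Promise<boolean> {
  for (const rule of rules) {
    if (rule.requiresKv && !kvHealthy) {
      // Degraded mode: skip rules that need the failing store rather than
      // erroring on every request. Whether a given rule fails open like this
      // or fails closed is a per-product security decision.
      continue;
    }
    if (!(await rule.evaluate(request))) {
      return false; // a rule explicitly denied the request
    }
  }
  return true; // all evaluable rules passed
}
```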

What resiliency changes did Cloudflare say it would make after the outage?

Cloudflare’s plan centered on reducing single points of failure in storage infrastructure it doesn’t fully own, accelerating work to improve Worker KV storage redundancy, and migrating the cold storage layer to R2. It also planned blast-radius controls: short-term product remediations so Worker KV failures don’t take everything down, progressive re-enablement of KV namespaces during storage incidents, and safeguards to prevent self-inflicted overload when caches fail and cold reads surge.
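
The progressive re-enablement idea can be sketched as a ramp: after the backing store recovers, only a growing fraction of namespaces is allowed to hit cold storage, so caches refill without a thundering herd. The schedule and hash bucketing below are illustrative assumptions, not Cloudflare’s actual tooling.

```ts
// Hedged sketch: progressively re-enable KV namespaces after a storage incident.
// The ramp schedule and hash-based bucketing are illustrative assumptions.
const RAMP_STEPS = [0.01, 0.05, 0.25, 0.5, 1.0]; // fraction of namespaces enabled
const STEP_DURATION_MS = 5 * 60 * 1000;          // hold each step for 5 minutes

// Deterministically map a namespace id to a bucket in [0, 1).
function bucketFor(namespaceId: string): number {
  let hash = 0;
  for (const ch of namespaceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return (hash % 1000) / 1000;
}

// A namespace may hit cold storage only once the ramp has reached its bucket;
// everything else keeps serving cached data or a controlled error.
function coldReadsEnabled(namespaceId: string, recoveryStartMs: number): boolean {
  const elapsed = Date.now() - recoveryStartMs;
  const step = Math.min(Math.floor(elapsed / STEP_DURATION_MS), RAMP_STEPS.length - 1);
  return bucketFor(namespaceId) < RAMP_STEPS[step];
}
```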

Review Questions

  1. What role does Worker KV play in Cloudflare’s platform, and why does its failure disproportionately affect identity and security products?
  2. How did the confirmed Google Cloud dependency of Worker KV’s cold storage alter the likely root-cause narrative?
  3. Which mitigation strategies (degradation, alternative backing stores, namespace controls) were used to reduce blast radius during the incident?

Key Points

  1. Cloudflare’s June 12, 2025 outage was driven by a failure in the storage infrastructure underlying Worker KV, lasting 2 hours 28 minutes.
  2. Worker KV is a foundational dependency for many Cloudflare services; Cloudflare reported 90%+ Worker KV request failures, triggering cascading outages.
  3. Cloudflare confirmed Worker KV’s cold storage layer depends on Google Cloud, contradicting earlier assumptions of full independence.
  4. Identity and security flows were hit hardest, including Cloudflare Access, Gateway, Warp, and bot/verification systems like Turnstile and Challenges.
  5. Image uploads failed at peak (reported 100%), while some delivery paths like the CDN were “mostly up,” showing uneven dependency on KV.
  6. Mitigation relied on fast incident escalation (P1 then P0) and parallel work to reduce dependency, degrade nonessential checks, and use alternative backing approaches.
  7. Planned fixes include improved storage redundancy, accelerating migration of cold storage to R2, and adding controls to limit blast radius via progressive KV namespace re-enablement.

Highlights

  • A storage failure behind Worker KV produced a platform-wide cascade: Cloudflare reported 90%+ Worker KV request failures and broad outages across identity, security, and Workers-related services.
  • Cloudflare’s later confirmation that Worker KV cold storage depends on Google Cloud reframed the incident as a dependency-chain failure, not just an internal service problem.
  • Cloudflare mitigated by degrading dependent logic (especially identity/device posture checks) and working in parallel on alternative backing-store paths to keep critical services alive.
  • Forward plans emphasized blast-radius reduction: resiliency upgrades, accelerated cold-storage migration to R2, and tooling to progressively re-enable KV namespaces during storage incidents.

Topics

  • Cloudflare Outage
  • Worker KV
  • Incident Mitigation
  • Google Cloud Dependency
  • Resiliency and R2

Mentioned

  • Cloudflare
  • AWS
  • Google Cloud
  • Azure
  • Cloudflare Workers
  • Cloudflare Access
  • Cloudflare Gateway
  • Cloudflare Warp
  • Cloudflare Turnstile
  • Cloudflare Challenges
  • Cloudflare Dashboard
  • Cloudflare CDN
  • Cloudflare R2
  • Cloudflare Workers AI
  • Cloudflare Magic Transit
  • Cloudflare Magic WAN
  • PostHog
  • T3 Chat
  • Heroku
  • Spanner
  • R2
  • Dane (CTO of Cloudflare)
  • Devin Smiley
  • Ripley Park
  • KV
  • DOS
  • DDoS
  • JS
  • WASM
  • SSO
  • OIDC
  • P1
  • P0
  • IO
  • AI