
Vercel Finally Caught Up

Theo - t3.gg · 5 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Active CPU pricing shifts billing from wall-clock duration to CPU time actually used, targeting AI streaming workloads where requests wait on external services.

Briefing

Vercel’s latest release package is a direct attempt to close the cost and reliability gap for long-running, low-CPU workloads—especially AI inference and agent-style tasks—while also adding enterprise-grade security and developer ergonomics. The headline change is “active CPU pricing,” which shifts billing away from wall-clock time toward the moments when code is actually using CPU. That matters because many AI workloads spend most of their lifetime waiting on external services (like OpenAI streaming tokens), meaning traditional “per duration” pricing can punish teams even when their compute utilization is tiny.

The transcript lays out why this pricing mismatch happens. In classic serverless, each request gets an isolated instance that stays alive for the full duration of the request, so billing tracks how long the instance runs—even if the CPU is mostly idle while waiting for I/O. For short web requests (database fetches, rendering), duration and CPU usage correlate well. For AI streaming, they don’t: a request can run for tens of seconds while using only nanoseconds or microbursts of CPU to handle events, verify inputs, and forward streamed output. Under that model, teams can end up paying for long runtimes that don’t translate into meaningful CPU work.

Active CPU pricing is presented as Vercel’s answer to Cloudflare Workers’ “net CPU” approach, where billing tracks CPU time rather than elapsed time. The transcript compares workload shapes using two axes—duration and CPU intensity—and argues that AI endpoints land in a “high duration, low CPU” corner that historically didn’t get much optimization. Vercel’s new model is designed for “I/O-bound backends” that scale instantly but remain idle between operations, such as AI inference agents, MCP servers, and other workflows that don’t fit quick request/response patterns.

Beyond pricing, the release adds sandboxing for untrusted code via “Vercel Sandbox,” positioned as an SDK-powered, ephemeral execution environment for code generated by AI agents or submitted by users. There’s also “Queues,” a limited-beta queue/message system meant to offload long-running background work so users don’t have to wait for slow operations inside a request. For app security, “Bot ID” introduces invisible bot filtering for critical routes (login, signup, checkout, and expensive API actions), with a basic mode that’s free across plans and a “deep analysis” mode for stronger detection.

Finally, Vercel ships an “AI gateway” beta: a single endpoint to access multiple model providers (including OpenAI, xAI, Anthropic, and Google) with improved routing, observability, and fallback behavior. The transcript frames this as a way to sidestep provider-specific pain—rate limits, reliability issues, and negotiation friction—by routing requests to whichever backend performs best.

Taken together, the changes aim to make Vercel more competitive for AI-heavy production systems: cheaper for long-running inference, safer for untrusted code and automation, and more resilient when model providers struggle.

Cornell Notes

Vercel’s biggest shift is “active CPU pricing,” moving billing toward CPU time actually used rather than wall-clock duration. That targets a key mismatch in AI workloads: long-running requests (often streaming tokens) can spend most of their lifetime waiting on external services while using very little CPU. The transcript argues this pricing model is especially important for “high duration, low CPU” workloads like AI inference agents and similar I/O-bound backends.

Alongside pricing, Vercel adds sandboxing for untrusted code (via an SDK), a limited-beta queue system (“Queues”) for background tasks, and “Bot ID” for invisible bot filtering on critical routes. It also introduces an “AI gateway” beta to unify access to multiple model providers with routing, observability, and fallback to improve reliability and reduce rate-limit headaches.

Why does wall-clock billing become a problem for AI streaming workloads?

AI endpoints can run for 20 seconds (or longer) while doing almost no CPU work during most of that time. CPU is used briefly for verification, request setup, and handling streamed tokens, but the request is largely idle while waiting on external APIs. Under duration-based serverless billing, teams pay for the whole lifetime of the instance even when CPU utilization is near zero, creating a cost mismatch for “high duration, low CPU” workloads.

How does active CPU pricing change the cost equation?

Active CPU pricing charges for CPU time when code is actively executing (measured in active CPU milliseconds), not for how long the function stays alive. The transcript contrasts this with traditional serverless billing and likens it to Cloudflare’s net CPU billing, arguing that both CPU-time models align cost with real compute work. The practical implication: long-running inference that streams output can become dramatically cheaper, because most of the request’s lifetime doesn’t translate into CPU usage.
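As a back-of-the-envelope illustration (the rates below are hypothetical placeholders, not Vercel's actual prices), the gap between the two billing models for a streaming request comes down to the ratio of wall-clock time to CPU time:

```python
# Hypothetical comparison of duration-based vs. active-CPU billing for one
# AI streaming request. All rates are illustrative, not real pricing.

def duration_cost(wall_seconds: float, rate_per_gb_hour: float,
                  memory_gb: float = 1.0) -> float:
    """Classic serverless: billed for the wall-clock time the instance is alive."""
    return (wall_seconds / 3600) * memory_gb * rate_per_gb_hour

def active_cpu_cost(cpu_seconds: float, rate_per_cpu_hour: float) -> float:
    """Active CPU pricing: billed only for CPU time actually consumed."""
    return (cpu_seconds / 3600) * rate_per_cpu_hour

# A 20-second streaming request that spends only 50 ms on actual CPU work.
wall, cpu = 20.0, 0.050
old = duration_cost(wall, rate_per_gb_hour=0.18)
new = active_cpu_cost(cpu, rate_per_cpu_hour=0.18)
print(f"duration-billed: ${old:.8f}  cpu-billed: ${new:.8f}  ratio: {wall / cpu:.0f}x")
```

With these made-up numbers the billed quantity shrinks by a factor of 400, which is the "high duration, low CPU" effect the transcript describes.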

What workload shapes benefit most from the new model?

The transcript uses a two-axis framing: duration (wall-clock time) and intensity (CPU usage). It claims AI inference endpoints sit in a rare corner—high duration with extremely low CPU intensity—where older optimizations focused on reducing duration for quick request/response flows. Workloads like database-driven SSR are more “medium duration, low CPU,” while image encoding is “high CPU, medium duration,” and those don’t benefit as much from CPU-time billing.
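The two-axis framing can be sketched as a simple classifier (the thresholds are arbitrary and binary, so the transcript's "medium" cases collapse onto one side; this only illustrates the quadrant idea):

```python
# Place a workload on the transcript's two axes: wall-clock duration and
# CPU intensity (share of lifetime spent actually computing).
# Thresholds are illustrative, not from any real billing system.

def workload_shape(wall_seconds: float, cpu_seconds: float) -> str:
    duration = "high duration" if wall_seconds >= 5.0 else "low duration"
    cpu_share = cpu_seconds / wall_seconds  # fraction of lifetime on CPU
    intensity = "high CPU" if cpu_share >= 0.5 else "low CPU"
    return f"{duration}, {intensity}"

print(workload_shape(20.0, 0.05))  # AI token streaming: mostly idle waiting
print(workload_shape(0.2, 0.02))   # database-driven SSR: quick, I/O-bound
print(workload_shape(3.0, 2.8))    # image encoding: compute-dominated
```

Only the first shape (long lifetime, tiny CPU share) sees a dramatic win from CPU-time billing; the compute-dominated shape pays for its CPU either way.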

What is Vercel Sandbox meant to solve, and how is it positioned?

Vercel Sandbox is presented as a secure, ephemeral execution environment for untrusted code—specifically code generated by AI agents or submitted by users. It’s described as an SDK that can be called from other environments too, taking code content (e.g., a buffer) and executing commands in isolation with configurable resources (example shown: 4 vCPUs, a Node 22 runtime, and a 5-minute timeout). The goal is to prevent untrusted code from impacting other parts of the system.
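The Vercel Sandbox SDK itself is JavaScript and its exact API isn't shown here, so as a language-agnostic sketch of the underlying idea—ephemeral, isolated execution with a hard timeout—here is a minimal Python analogue (a subprocess is not a real security boundary; it only illustrates the shape of the workflow):

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_seconds: float = 5.0) -> str:
    """Execute untrusted code in a short-lived child process with a hard
    timeout, discarding its workspace afterwards. Conceptual analogue of an
    ephemeral sandbox, NOT an actual isolation boundary on its own."""
    with tempfile.TemporaryDirectory() as tmp:   # ephemeral workspace
        path = os.path.join(tmp, "snippet.py")
        with open(path, "w") as f:
            f.write(code)
        result = subprocess.run(
            [sys.executable, "-I", path],        # -I: Python isolated mode
            capture_output=True, text=True,
            timeout=timeout_seconds, cwd=tmp,    # raises on timeout
        )
        return result.stdout

print(run_untrusted("print(2 + 2)"))  # → 4
```

A production sandbox would add real isolation (containers or microVMs, resource limits like the 4 vCPUs mentioned above, and network policy); the ephemeral-workspace-plus-timeout pattern is the part this sketch captures.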

How do Queues and Bot ID address different production pain points?

Queues targets reliability and user experience for long-running tasks by offloading work to a queue so users don’t wait inside a request; it supports background processing with retries on failure and durable-workflow concepts. Bot ID targets abuse and automation by performing invisible bot filtering on critical routes before expensive backend operations run. Together, they reduce both “timeouts from slow work” and “costly bot-driven abuse.”
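The queue-with-retries pattern described above can be sketched in a few lines (this is an illustrative in-process sketch, not Vercel's Queues API; `flaky` simulates a job that fails twice before succeeding):

```python
import collections

def process_queue(jobs, handler, max_retries: int = 3):
    """Drain a FIFO of background jobs, re-enqueueing failures up to
    max_retries before moving a job to a dead-letter list. Illustrative
    sketch of queue-with-retries, not any real queue service's API."""
    queue = collections.deque((job, 0) for job in jobs)
    done, dead = [], []
    while queue:
        job, tries = queue.popleft()
        try:
            done.append(handler(job))
        except Exception:
            if tries + 1 < max_retries:
                queue.append((job, tries + 1))  # retry later, at the back
            else:
                dead.append(job)                # give up: dead-letter it
    return done, dead

attempt_count = collections.Counter()

def flaky(job):
    """Hypothetical handler: 'email' fails transiently twice, then succeeds."""
    attempt_count[job] += 1
    if job == "email" and attempt_count[job] < 3:
        raise RuntimeError("transient failure")
    return f"{job}: ok"

print(process_queue(["email", "resize"], flaky))
# → (['resize: ok', 'email: ok'], [])
```

The key property is that the slow or failing job never blocks the caller: other jobs keep draining, and the failure is retried in the background—the user-experience win the transcript attributes to Queues.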

What problem does the AI gateway aim to fix across model providers?

The AI gateway is positioned as a single endpoint to access multiple model providers with better uptime, faster responses, and no lock-in. The transcript emphasizes provider-specific issues—rate limits, reliability, and negotiation friction (especially with Anthropic)—and argues that routing and fallback across providers can avoid those bottlenecks. It also highlights improved observability with per-model usage, latency, and error metrics.
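The routing-with-fallback behavior can be sketched generically (the provider names and stub functions are hypothetical stand-ins, not the actual AI gateway API):

```python
def call_with_fallback(providers, prompt):
    """Try each model provider in preference order, falling back on any
    error (rate limit, outage, timeout). Sketch of the gateway's fallback
    idea, not the real Vercel AI gateway interface."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors[name] = str(exc)     # record and try the next provider
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical stubs simulating one rate-limited and one healthy provider.
def openai_stub(prompt):
    raise RuntimeError("429 rate limited")

def anthropic_stub(prompt):
    return f"answer to {prompt!r}"

providers = [("openai", openai_stub), ("anthropic", anthropic_stub)]
print(call_with_fallback(providers, "hello"))  # falls back to the second provider
```

A real gateway would also track the per-provider latency and error metrics mentioned above and use them to reorder the preference list, rather than trying a fixed order.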

Review Questions

  1. Active CPU pricing is designed for which category of workloads, and what two metrics define that category?
  2. How do Queues and Bot ID differ in what they protect: user wait time vs. abuse prevention?
  3. What kinds of provider-specific failures does the AI gateway try to mitigate, and why does fallback matter?

Key Points

  1. Active CPU pricing shifts billing from wall-clock duration to CPU time actually used, targeting AI streaming workloads where requests wait on external services.
  2. The cost mismatch is worst for “high duration, low CPU” endpoints—common in inference and agent-style systems that stream tokens.
  3. Vercel Sandbox provides an SDK-based, isolated environment for running untrusted or AI-generated code safely and ephemerally.
  4. Queues (limited beta) enables background processing via queues so slow operations don’t block user requests and can be retried reliably.
  5. Bot ID adds invisible bot filtering for critical routes, with a free basic mode and a stronger deep-analysis mode for higher-risk actions.
  6. The AI gateway beta unifies access to multiple model providers with routing, fallback, and per-model observability to reduce rate-limit and reliability pain.
  7. The release package collectively aims to make Vercel more competitive for production AI: cheaper inference, safer execution, and more resilient model access.

Highlights

Active CPU pricing is framed as the fix for AI streaming’s “pay for time, not work” problem—requests can run for seconds while using almost no CPU.
Vercel Sandbox is positioned as an SDK for safely executing untrusted code generated by AI agents or submitted by users in isolated, ephemeral environments.
Bot ID focuses on invisible bot filtering for high-cost routes like login, signup, and LLM-powered endpoints, reducing both false positives and downstream abuse.
The AI gateway beta is pitched as a reliability and rate-limit workaround by routing across multiple providers with fallback and detailed observability.

Topics

  • Active CPU Pricing
  • Vercel Sandbox
  • Bot ID
  • Queues
  • AI Gateway
  • AI Inference Costs

Mentioned

  • Theo
  • CPU
  • vCPUs
  • SSR
  • MCP
  • DX
  • AI