Why the Smartest AI Teams Are Panic-Buying Compute: The 36-Month AI Infrastructure Crisis Is Here
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
AI-driven demand for inference compute is colliding with a physically constrained hardware supply chain, creating a structural “36-month” infrastructure crisis that will push inference costs sharply higher and force enterprises to rethink how they plan, buy, and run AI. The central warning: this isn’t a temporary tech shortage. It’s an economic transformation—demand keeps accelerating while memory, advanced chips, and GPU allocation can’t scale fast enough—so competitive dynamics across industries will shift as budgets tighten and capacity becomes scarce.
The pressure starts with consumption. Heavy enterprise users can burn through roughly a billion tokens per worker per year, while “ceiling” usage for intense workloads can reach 25 billion tokens annually. Token growth isn’t just linear; it accelerates as model capability improves and as AI becomes embedded across everyday software—email, document editors, development environments, and CRM—turning AI from a tool into ambient, continuous consumption. The shift toward agentic systems compounds the effect: agents can run continuously and call other AI in loops, consuming orders of magnitude more tokens than human-in-the-loop workflows. At enterprise scale, the math becomes stark: a 10,000-person company spending about $20 million per year on inference at 1 billion tokens per worker could jump to $200 million at 100 billion tokens per year, and those figures assume stable pricing and available capacity—assumptions the crisis directly breaks.
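For a rough sense of how these figures compound, the sketch below turns headcount, per-worker token volume, and a blended token price into an annual spend estimate. The $2-per-million-token price is an assumption chosen so the 1-billion-token baseline lands near the briefing's roughly $20 million figure; it is not a number from the video.

```python
# Back-of-envelope inference spend (illustrative only; the blended token price
# below is an assumption, not a figure from the video).
def annual_inference_spend(workers: int, tokens_per_worker: float,
                           usd_per_million_tokens: float) -> float:
    """Annual inference cost in USD, assuming a flat blended price per million tokens."""
    total_tokens = workers * tokens_per_worker
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Baseline from the briefing: 10,000 workers at ~1 billion tokens each,
# priced at an assumed $2 per million tokens -> roughly $20M per year.
print(annual_inference_spend(10_000, 1e9, 2.00))   # 20_000_000.0

# Same workforce at 10x per-worker consumption, same assumed price -> ~$200M per year.
print(annual_inference_spend(10_000, 1e10, 2.00))  # 200_000_000.0
```

Small shifts in either per-worker consumption or per-token pricing move the total by tens of millions, which is why the briefing treats stable pricing and available capacity as fragile assumptions.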
On the supply side, the bottleneck is memory and advanced semiconductor capacity. New DRAM fabrication capacity takes 3–4 years to bring online and existing capacity is already fully allocated; high-bandwidth memory (HBM) is effectively sold out and can’t be substituted at scale. DRAM prices are projected to surge: TrendForce forecasts that memory costs will add 40–60% to inference infrastructure costs in the first half of 2026, with effective inference costs potentially doubling or tripling within 18 months. Even if DRAM supply improves, the advanced chip layer is constrained: TSMC’s leading nodes are fully allocated, with capacity expansion timelines stretching to 2028 and beyond. GPU availability adds another choke point. Nvidia’s H100 and newer Blackwell GPUs dominate AI workloads and are sold out, with lead times exceeding six months and multi-year hyperscaler purchase agreements locking up most production.
A key twist is that cloud providers aren’t neutral capacity brokers. AWS, Azure, Google Cloud, and similar players also sell competing AI products (Gemini, Copilot, AWS AI services). When compute is scarce, allocation becomes zero-sum: GPUs sent to enterprises mean fewer GPUs for internal products. Hyperscalers are therefore incentivized to prioritize their own strategic workloads, tightening enterprise rate limits and making allocation commitments harder.
The result is a market that won’t “clear” smoothly. Supply can’t respond quickly, demand can’t be deferred, and information is asymmetric: hyperscalers understand the constraints far better than most enterprises do. Pricing spikes are therefore likely to be abrupt rather than gradual, echoing the DRAM price volatility of the 2016 shortage and the GPU price surges driven by crypto mining.
For enterprise leaders, traditional capex planning and depreciation schedules fail because demand and hardware usefulness change faster than procurement cycles. The transcript’s practical playbook centers on four moves: secure inference capacity now via throughput/availability SLAs (not just price-per-token), build a routing layer that can shift workloads across providers and models to preserve optionality, treat hardware as a consumable with accelerated refresh/lease strategies, and invest aggressively in inference efficiency (smaller models where possible, better prompting, caching, retrieval augmentation, quantization). The overarching message is that winners will be those who act within the shrinking window to lock capacity and reduce cost exposure before the crisis peaks.
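To make the routing-layer idea concrete, here is a minimal sketch of a dispatcher that tries providers in cost order and falls back when one is rate-limited or out of capacity. The Provider and CapacityError types and the cheapest-first ordering are illustrative assumptions, not the playbook's actual design; a production router would also weigh model capability, latency, and contracted SLA capacity.

```python
# Minimal provider-routing sketch (illustrative; adapters and error types are
# placeholders, not a specific vendor SDK).
from dataclasses import dataclass
from typing import Callable, List

class CapacityError(Exception):
    """Raised by a provider adapter when it hits a rate limit or exhausted quota."""

@dataclass
class Provider:
    name: str
    call: Callable[[str], str]        # sends a prompt, returns a completion
    usd_per_million_tokens: float     # rough blended price used to order attempts

def route(prompt: str, providers: List[Provider]) -> str:
    """Try providers cheapest-first; shift the workload when one lacks capacity."""
    for provider in sorted(providers, key=lambda p: p.usd_per_million_tokens):
        try:
            return provider.call(prompt)
        except CapacityError:
            continue                  # preserve optionality: fall through to the next provider
    raise RuntimeError("No provider currently has capacity for this request")
```

Keeping prompts and evaluation harnesses provider-agnostic is what makes this kind of fallback cheap to exercise when rate limits tighten.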
Cornell Notes
Inference compute is becoming scarce because AI demand keeps rising while memory, advanced chip capacity, and GPU allocation can’t scale quickly enough. Token consumption accelerates further with agentic systems and AI embedded across business software, pushing enterprise usage far beyond earlier planning assumptions. Supply constraints are structural: DRAM/HBM production timelines are long, HBM is effectively allocated, TSMC leading nodes are fully booked, and Nvidia GPUs are sold out under multi-year hyperscaler agreements. Cloud providers also have competing incentives because they route scarce capacity to their own AI products first. Enterprises that rely on traditional capex/depreciation planning risk stranded assets and budget shocks; the recommended response is to secure capacity via SLAs, build workload routing for optionality, refresh hardware on shorter cycles, and reduce token costs through efficiency techniques.
Why does token demand grow so fast in enterprise AI, and why does agentic automation make it worse?
What specific hardware constraints make this crisis structural rather than cyclical?
How do cloud providers’ incentives worsen enterprise allocation during scarcity?
Why do traditional enterprise procurement and depreciation models break under these conditions?
What does “secure capacity now” mean in practice, beyond simply negotiating price?
What is a routing layer, and why is it framed as a competitive advantage?
Review Questions
- How do agentic systems change the relationship between per-worker token usage and total enterprise inference demand?
- Which supply constraints (memory, TSMC capacity, Nvidia GPU allocation) are described as the main reasons demand growth can’t be met quickly?
- What operational changes—capacity SLAs, routing layers, hardware refresh/lease strategy, and inference efficiency—are proposed to reduce cost and availability risk?
Key Points
1. Inference compute scarcity is driven by structural hardware constraints (memory/HBM, advanced chip capacity, and Nvidia GPU allocation), not a short-lived software or scheduling issue.
2. Token consumption can rise nonlinearly as AI capability improves and as AI becomes embedded across business software, turning usage into continuous ambient demand.
3. Agentic systems remove human rate limits and can multiply token consumption by orders of magnitude, forcing planning to cover both workers and deployed agents, including centrally deployed ones.
4. Cloud providers are incentivized to prioritize their own AI products when compute is scarce, making enterprise allocation less reliable and rate limits tighter.
5. DRAM/HBM bottlenecks and long fab timelines make supply inelastic, so inference pricing is likely to spike abruptly rather than rise gradually.
6. Traditional capex/depreciation planning fails when demand and hardware usefulness change faster than procurement cycles, increasing the risk of stranded assets.
7. A practical response centers on securing capacity via SLAs, building a workload routing layer for optionality, treating hardware as a consumable with faster refresh/lease cycles, and investing in inference efficiency to reduce token costs.