AI Networking is CRAZY!! (but is it fast enough?)
Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
AI training clusters can spend substantial time waiting on the network, with tail latency from stragglers slowing overall job completion.
Briefing
AI training is bottlenecked less by raw GPU power and more by the network’s ability to move huge data volumes with near-zero tolerance for delay. Training clusters of expensive GPU servers can spend a striking share of time waiting on network transfers and synchronization, a problem often described as “tail latency”—the slowest stragglers that drag down overall job completion. In practice, that means AI workloads behave like a high-speed, tightly coupled distributed system: storage-to-GPU traffic must flow fast, and GPUs must constantly exchange data with each other. When congestion or packet loss creeps in, the entire training run slows.
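The straggler effect is easy to see with a toy model: each synchronized training step ends only when the slowest worker finishes, so step time is a max, not a mean. A minimal Python sketch, where all timings and probabilities are illustrative assumptions, not measurements:

```python
import random

random.seed(0)

NUM_WORKERS = 256
NUM_STEPS = 100

def step_time(num_workers: int, straggler_prob: float) -> float:
    """One synchronized step: every GPU waits at a barrier (e.g. an
    all-reduce) for the slowest worker, so step time is the max."""
    times = []
    for _ in range(num_workers):
        base = random.uniform(9.0, 11.0)        # ms of compute + transfer
        if random.random() < straggler_prob:
            base *= 5.0                          # hit a congested/slow link
        times.append(base)
    return max(times)                            # barrier: wait for slowest

for p in (0.0, 0.01):
    total = sum(step_time(NUM_WORKERS, p) for _ in range(NUM_STEPS))
    print(f"straggler prob {p:.2f}: {total:.0f} ms for {NUM_STEPS} steps")
```

With 256 workers, even a 1% per-worker chance of hitting congestion means most steps contain at least one straggler, so the whole job runs several times slower than its average worker would suggest. That is the tail-latency problem in miniature.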
The core tension is that traditional data center networking was built for CPU-style workloads and typical web traffic patterns, not for AI’s training behavior. Conventional Ethernet networks rely heavily on TCP/IP mechanisms that handle congestion by dropping packets and retransmitting them—effective for many applications, but costly when latency is intolerable and throughput demands are extreme. Even when Ethernet speeds rise (400 GbE today and 800 GbE arriving), the underlying behavior under congestion remains a challenge for AI training.
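One way to see why drop-and-retransmit hurts at these speeds is the classic Mathis back-of-envelope model, in which a single TCP flow's steady-state throughput scales with 1/sqrt(loss rate). A sketch using assumed intra-data-center numbers (jumbo frames, 50 microsecond RTT):

```python
import math

def tcp_throughput_gbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis model: throughput ≈ (MSS / RTT) * sqrt(3/2) / sqrt(p)."""
    bps = (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)
    return bps / 1e9

MSS = 9000     # bytes; jumbo frames, an assumed value
RTT = 50e-6    # 50 us intra-data-center round trip, an assumed value

for p in (1e-7, 1e-5, 1e-3):
    gbps = tcp_throughput_gbps(MSS, RTT, p)
    print(f"loss {p:.0e}: single flow capped near {gbps:,.0f} Gb/s")
```

Under these assumptions, 0.1% packet loss caps a single flow well below a 400 GbE port, while at negligible loss the link itself becomes the limit. That gap is why lossless behavior matters more, not less, as line rates climb.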
InfiniBand emerged as a purpose-built alternative for high-performance computing. It targets low latency and high reliability by using its own transport stack rather than TCP/IP, aiming for a lossless fabric and reducing overhead. The centerpiece is RDMA (remote direct memory access), which enables memory-to-memory transfers that bypass much of the CPU and OS networking stack on both ends. Instead of repeatedly copying data through layers of the network stack, RDMA lets a GPU read data directly from remote memory, cutting latency and improving efficiency—especially when paired with InfiniBand’s high-performance fabric.
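To make the data-path difference concrete, here is a toy latency budget contrasting a kernel TCP path (syscalls, buffer copies, interrupts) with an RDMA path (post a work request, then the NIC writes straight into remote memory). Every per-hop cost below is an illustrative assumption, not a measurement:

```python
def tcp_path_us(payload_kb: float) -> float:
    """Toy budget for a kernel networking path (all costs assumed)."""
    syscall   = 2.0                    # user/kernel crossing + stack work
    copies    = 2 * payload_kb * 0.05  # copy into, then out of, kernel buffers
    interrupt = 3.0                    # receive interrupt + process wakeup
    wire      = 1.0                    # time on the wire
    return syscall + copies + interrupt + wire

def rdma_path_us(payload_kb: float) -> float:
    """Toy budget for an RDMA path (all costs assumed)."""
    doorbell = 0.3                     # post a work request to the NIC
    wire     = 1.0
    dma      = payload_kb * 0.01       # NIC DMAs straight into remote memory
    return doorbell + wire + dma

for kb in (4, 64, 1024):
    print(f"{kb:>5} KB: TCP ~{tcp_path_us(kb):.1f} us vs RDMA ~{rdma_path_us(kb):.1f} us")
```

The specific numbers are invented, but the structure is the point: the per-message overheads and per-byte copies of the kernel path disappear from the RDMA path, which is exactly what "bypassing the CPU and OS networking stack" buys.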
Still, InfiniBand is not a universal fix. It can be harder to staff and operate because many network engineers cut their teeth on Ethernet. It also adds integration complexity when Ethernet must remain in place for compatibility across a broader environment. Cost is another friction point, with InfiniBand gear often priced above comparable Ethernet switches and routers. And troubleshooting support can be less accessible simply because the Ethernet ecosystem is larger.
Ethernet’s counterattack focuses on bringing AI-friendly features into a familiar framework. The industry leans on lossless and congestion-management mechanisms such as ECN (explicit congestion notification) and PFC (priority flow control). ECN signals congestion before packet drops occur, while PFC can pause specific traffic classes to protect latency-sensitive training flows. With RDMA over Converged Ethernet (RoCE), Ethernet can also support memory-to-memory transfers, though it still depends on lossless fabric behavior at lower layers to avoid congestion collapse.
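How ECN and PFC divide the work can be sketched as a switch-queue policy: mark packets probabilistically once the queue passes a threshold (WRED-style), and only fall back to pausing the sender's priority class if the queue keeps growing anyway. The thresholds here are illustrative assumptions; real switches expose them as WRED/PFC buffer settings:

```python
import random

random.seed(1)

ECN_MIN, ECN_MAX = 20, 80   # queue depths (packets) over which marking ramps up
PFC_XOFF = 100              # depth at which the port pauses the sender's class

def handle_packet(queue_depth: int) -> str:
    if queue_depth >= PFC_XOFF:
        return "PFC-PAUSE"              # last resort: stop the traffic class
    if queue_depth > ECN_MIN:
        ramp = (queue_depth - ECN_MIN) / (ECN_MAX - ECN_MIN)
        if random.random() < min(ramp, 1.0):
            return "ECN-MARK"           # early signal: endpoints slow down
    return "FORWARD"

for depth in (10, 50, 90, 120):
    print(depth, handle_packet(depth))
```

The design intent is visible in the ordering: ECN acts early and gently on individual flows, while PFC is the blunt backstop that keeps the fabric lossless when marking alone has not relieved the queue.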
By the end, the choice becomes a tradeoff between performance mechanisms and operational reality. InfiniBand’s RDMA + low-latency fabric is compelling, but Ethernet’s ubiquity, engineer familiarity, and broader ecosystem make it attractive for many deployments. The proposed “AI data center” design leans Ethernet: a Clos (fat-tree) spine/leaf topology built from Juniper 800 GbE switches, plus software-defined networking to tune DCB features like PFC and ECN so the network keeps behaving correctly as training conditions change.
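The Clos sizing itself is simple arithmetic. A minimal non-blocking two-tier leaf/spine calculator, assuming hypothetical 64-port switches with half of each leaf's ports facing GPUs (real designs also weigh buffers, rail-optimized layouts, and failure domains):

```python
def leaf_spine(num_gpus: int, ports: int = 64):
    """Non-blocking 2-tier sizing: half of each leaf's ports face GPUs,
    half face spines, with leaf uplinks spread evenly across the spines."""
    down = ports // 2                    # GPU-facing ports per leaf (1:1 subscription)
    if num_gpus > down * ports:
        raise ValueError("exceeds a single 2-tier pod; add a spine tier")
    leaves = -(-num_gpus // down)        # ceiling division
    spines = -(-leaves * down // ports)  # spread all uplinks over spine ports
    return leaves, spines

for gpus in (256, 1024, 2048):
    l, s = leaf_spine(gpus)
    print(f"{gpus} GPUs -> {l} leaves, {s} spines (64-port, 1:1)")
```

Under these assumptions a single two-tier pod tops out at 2,048 GPU ports; beyond that, the fat-tree grows a third (super-spine) tier rather than oversubscribing the fabric.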
Cornell Notes
AI model training depends on fast, reliable networking as much as on GPU compute. Large GPU clusters can lose significant time waiting on the network, especially due to tail latency caused by stragglers and congestion. InfiniBand targets this with a low-latency, lossless HPC fabric and RDMA, which enables memory-to-memory transfers that bypass much of the CPU/OS networking stack. Ethernet is catching up for AI using RoCE (RDMA over Converged Ethernet) plus congestion-control and lossless mechanisms like ECN and PFC. The practical decision often comes down to performance versus operational complexity, staffing, and ecosystem maturity.
- Why does networking become a bottleneck specifically during AI training?
- What is tail latency, and why does it matter for training clusters?
- How does InfiniBand reduce latency compared with conventional Ethernet over TCP/IP?
- What makes RDMA over Ethernet (RoCE) possible, and what condition must Ethernet meet?
- How do ECN and PFC work together to support AI traffic on Ethernet?
- Why might an organization still prefer Ethernet even if InfiniBand is strong?
Review Questions
- In what ways do congestion and packet loss translate into longer AI training times beyond just reduced throughput?
- Compare RDMA’s role in InfiniBand versus RoCE on Ethernet—what problem does RDMA solve, and what additional requirements remain?
- What operational factors (beyond raw speed) influence whether a data center chooses InfiniBand or Ethernet for AI networking?
Key Points
1. AI training clusters can spend substantial time waiting on the network, with tail latency from stragglers slowing overall job completion.
2. Traditional Ethernet over TCP/IP can add latency under congestion because it handles problems by dropping packets and retransmitting them.
3. InfiniBand targets low latency with a dedicated HPC-oriented stack, a lossless fabric approach, and RDMA for memory-to-memory transfers that bypass much CPU/OS involvement.
4. InfiniBand adoption can be harder due to staffing gaps, integration complexity with existing Ethernet, and often higher equipment costs.
5. Ethernet can support AI traffic using RoCE (RDMA over Converged Ethernet) but still needs lossless-like behavior to avoid congestion collapse.
6. ECN and PFC are key Ethernet mechanisms for AI: ECN signals congestion early, while PFC pauses specific traffic classes to protect latency-sensitive flows.
7. The practical network choice often balances performance features against ecosystem maturity, engineer familiarity, and deployment complexity.