CPU, GPU… DPU?
Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Data centers are running out of CPU headroom as virtualization expands from servers into networking and security—and NVIDIA’s BlueField 3 data processing unit (DPU) is positioned as the fix. The core shift is simple: instead of forcing the CPU to handle high-volume packet forwarding, encryption/decryption, and inspection workloads, a DPU acts like a specialized “server inside a server” that takes over those network functions, keeping the CPU focused on general compute.
Virtualization began by consolidating many physical machines into fewer hosts, with each virtual machine sharing CPU and memory resources. That worked—until the same consolidation logic spread to the rest of the stack. Modern data centers increasingly virtualize switches, routers, firewalls, and security appliances using platforms such as VMware NSX. As more networking and cybersecurity functions move into software, the CPU becomes the bottleneck: it must manage many operating systems and also perform tasks it wasn’t built for, especially as network speeds climb from 1 Gbps to 10, 100, and 200 Gbps. The transcript frames this as a mismatch between general-purpose processing and specialized data-plane work like traffic inspection, encryption, and large-scale data movement.
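To make that mismatch concrete, here is a back-of-envelope sketch in Python (not from the video): it estimates how many packets per second each link speed carries and how few CPU cycles that leaves per packet on a single core. The 64-byte minimum frame size and the 3 GHz core clock are illustrative assumptions.

```python
# Illustrative arithmetic only: the per-packet CPU budget shrinks as link speed rises.
# Assumes 64-byte minimum Ethernet frames plus ~20 bytes of preamble/inter-frame gap,
# and a single 3 GHz CPU core handling every packet.
LINK_SPEEDS_GBPS = [1, 10, 100, 200]
FRAME_BITS = (64 + 20) * 8      # bits on the wire per minimum-size frame
CPU_CLOCK_HZ = 3.0e9            # assumed core clock

for gbps in LINK_SPEEDS_GBPS:
    packets_per_sec = gbps * 1e9 / FRAME_BITS
    cycles_per_packet = CPU_CLOCK_HZ / packets_per_sec
    print(f"{gbps:>4} Gbps ~ {packets_per_sec / 1e6:6.1f} Mpps "
          f"-> ~{cycles_per_packet:6.1f} CPU cycles per packet")
```

At 1 Gbps a single core has roughly two thousand cycles to spend on each minimum-size packet; at 200 Gbps that budget collapses to around ten, which is the gap specialized data-plane hardware is meant to fill.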
The first step in relieving that pressure is the smart NIC (network interface card), which offloads some networking and security tasks from the CPU. But smart NICs eventually hit limits as workloads grow and more functions get virtualized. That’s where the DPU enters. NVIDIA’s BlueField 3 is described as an SoC (system on a chip) in a smart-NIC form factor that runs its own operating system; in the VMware environment described, ESXi is installed directly on the DPU. In practical terms, firewall and networking software can run on the DPU, so network traffic no longer has to traverse the CPU path.
The operational payoff is twofold: performance and mobility. The lab demonstrates that VMware vSphere 8 with the distributed services engine (previously called Project Monterey) supports Universal Pass Through (UPT) for NVIDIA DPUs. That enables vMotion of a VM that uses the DPU as its network interface without breaking connectivity—something the transcript says fails with PCIe pass-through smart NIC setups.
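As a concrete illustration of what UPT makes possible, the sketch below uses pyVmomi (VMware’s Python SDK for the vSphere API) to trigger a vMotion of a VM to another host. The vCenter address, credentials, and VM/host names are placeholders, and the code is ordinary vMotion automation rather than anything DPU-specific; the point is that, per the lab, this migration completes without dropping connectivity when the VM’s interface is a UPT-backed DPU port.

```python
# Hedged sketch: trigger a vMotion via pyVmomi. All names and credentials are
# placeholders for a lab environment; skipping certificate validation is lab-only.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="changeme",
                  sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

vm = find_by_name(vim.VirtualMachine, "dpu-test-vm")        # VM whose vNIC is UPT/DPU-backed
target = find_by_name(vim.HostSystem, "esxi02.lab.local")   # destination host in the cluster

# A RelocateSpec naming only a destination host is a plain compute vMotion.
spec = vim.vm.RelocateSpec(host=target)
WaitForTask(vm.RelocateVM_Task(spec))
print("vMotion task finished")

Disconnect(si)
```

The interesting behavior is outside the code: with UPT the guest keeps its network connectivity across the move, while the transcript reports the same migration failing with a PCIe pass-through smart NIC.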
On raw throughput and latency, the lab’s numbers are stark. With a standard NIC on the 25 GbE cluster, bandwidth is reported at 64 megabits per second, latency at around 0.273 ms, and throughput at 548,000 operations per second. Switching to the BlueField 3 DPU raises bandwidth to 90 megabits per second, lowers latency to about 0.19 ms, and lifts operations per second to 770,000. Crucially, the CPU remains less stressed in the DPU case, implying not just faster networking but better efficiency.
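Expressed as relative changes, those same figures work out as follows; the short Python snippet below is just arithmetic over the numbers quoted above, not additional data.

```python
# Derive percentage deltas from the reported lab figures (standard NIC vs. BlueField 3 DPU).
std_nic = {"bandwidth_mbps": 64, "latency_ms": 0.273, "ops_per_sec": 548_000}
dpu     = {"bandwidth_mbps": 90, "latency_ms": 0.190, "ops_per_sec": 770_000}

bw_gain  = (dpu["bandwidth_mbps"] / std_nic["bandwidth_mbps"] - 1) * 100   # ~ +41%
lat_drop = (1 - dpu["latency_ms"] / std_nic["latency_ms"]) * 100           # ~ -30%
ops_gain = (dpu["ops_per_sec"] / std_nic["ops_per_sec"] - 1) * 100         # ~ +41%

print(f"bandwidth: +{bw_gain:.0f}%, latency: -{lat_drop:.0f}%, ops/sec: +{ops_gain:.0f}%")
```

That is roughly a 40% gain in bandwidth and operations per second and a 30% drop in latency, achieved while the CPU stays less loaded.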
The broader argument is that DPUs change how virtual machines should be designed and how data centers scale. Rather than adding more servers to spread CPU load—an approach that increases space and power use—the DPU adds specialized processing capacity within the existing server footprint. The transcript also notes that NVIDIA’s DPU programmability stack (DOCA) enables customization, and it points to additional white papers for scaling beyond the single workload and 25 GbE test environment.
Cornell Notes
Virtualization pushed data-center networking and security workloads onto the CPU, and that general-purpose chip increasingly becomes the bottleneck as network speeds rise. Smart NICs offload some tasks, but the next step is a DPU—specifically NVIDIA’s BlueField 3—described as a “server inside a server” that runs its own OS and can take over packet processing and security functions. In a VMware vSphere 8 environment, Universal Pass Through (UPT) with the distributed services engine enables vMotion to keep working when a DPU is assigned to a VM, avoiding the connectivity breakage seen with PCIe pass-through. In a lab using a 25 GbE cluster, the DPU improved reported bandwidth (64 to 90 Mbps), reduced latency (~0.273 ms to ~0.19 ms), and increased operations per second (548k to 770k) while keeping the CPU less stressed. The implication: DPUs can improve both performance and efficiency, shaping future VM and data-center design.
Why does the CPU become a bottleneck in modern virtualized data centers?
How do smart NICs differ from DPUs in offloading network work?
What makes NVIDIA’s BlueField 3 DPU unusual in the VMware setup described?
What problem does Universal Pass Through (UPT) solve for vMotion?
What performance results were reported when switching from a standard NIC to a BlueField 3 DPU?
Why is the DPU presented as more power-efficient than adding more servers?
Review Questions
- What specific types of workloads shift from the CPU to the DPU, and why does that matter as network speeds increase?
- How does UPT with VMware vSphere 8 change the behavior of vMotion compared with PCIe pass-through smart NICs?
- Based on the lab numbers, what tradeoffs improve when using a DPU instead of a standard NIC (throughput, latency, CPU stress, or all of these)?
Key Points
1. Virtualization expanded beyond servers into networking and security, increasing CPU load and creating a data-plane bottleneck as link speeds rise.
2. Smart NICs offload some networking/security tasks, but they can fall short when workloads and virtualized functions scale further.
3. NVIDIA BlueField 3 is positioned as a DPU that can run its own OS (ESXi) and handle network traffic and security workloads inside the server.
4. VMware vSphere 8 with the distributed services engine supports Universal Pass Through (UPT), enabling vMotion to work with DPU-assigned network interfaces.
5. In a 25 GbE lab test, the DPU improved reported bandwidth (64 to 90 Mbps), reduced latency (~0.273 ms to ~0.19 ms), and increased operations per second (548k to 770k).
6. DPUs are framed as more power-efficient than scaling out with additional servers because they add specialized capacity within existing hosts.
7. Programmability via NVIDIA’s DOCA is highlighted as a way to customize DPU behavior for different networking and security workloads.