
Ethernet is DEAD?? Mac Studio is 100x FASTER!!

NetworkChuck · 5 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Apple’s Tahoe 26.2 update enables RDMA over Thunderbolt, cutting inter-machine latency from ~300 microseconds to ~3 microseconds.

Briefing

Local AI clustering is back on the table—because Apple’s latest software update slashes the latency bottleneck that previously made multi-Mac setups slower than running models on a single machine. In earlier tests with five Mac Studios, adding more computers cut performance by 91%, and the culprit wasn’t GPU horsepower or memory capacity. It was network latency between machines, which forced AI to use slower “pipeline parallelism” and made more advanced parallel strategies unusable.

The new build uses four fully loaded Mac Studios (512 GB unified memory each, 80 GPU cores each, 8 TB storage each), connected as a cluster with Thunderbolt networking plus Ethernet for discovery. The hardware scale is enormous: 2 TB unified memory total, 32 TB storage total, and 320 GPU cores across the cluster. The key change isn’t just moving to Thunderbolt 5; it’s what Apple enabled in a Tahoe 26.2 update. Apple quietly turned on RDMA (remote direct memory access) over Thunderbolt ports, bypassing the traditional TCP/IP networking stack that adds overhead and delays.

That shift is dramatic in measured terms: latency drops from roughly 300 microseconds to about 3 microseconds. With that improvement, the cluster can finally use tensor parallelism—where multiple machines split the computation of each model layer and work together—without the communication “chitchat” that previously turned tensor parallelism into a slowdown. The earlier math was brutal: with 160 communication events per token and ~300 microseconds per message, waiting time ballooned to roughly 48 milliseconds per token. After RDMA, those waits shrink enough to make tensor parallelism practical.
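The waiting-time arithmetic can be checked directly; a quick sketch using the figures reported above (160 messages per token, ~300 µs over TCP/IP, ~3 µs with RDMA):

```python
# Per-token communication wait under tensor parallelism:
# (messages per token) x (latency per message).
MESSAGES_PER_TOKEN = 160    # communication events per token (from the video)
TCP_LATENCY_S = 300e-6      # ~300 microseconds over the TCP/IP stack
RDMA_LATENCY_S = 3e-6       # ~3 microseconds with RDMA over Thunderbolt

wait_tcp_ms = MESSAGES_PER_TOKEN * TCP_LATENCY_S * 1000
wait_rdma_ms = MESSAGES_PER_TOKEN * RDMA_LATENCY_S * 1000

print(f"TCP/IP: {wait_tcp_ms:.1f} ms of waiting per token")   # 48.0 ms
print(f"RDMA:   {wait_rdma_ms:.2f} ms of waiting per token")  # 0.48 ms
```

At ~48 ms of pure communication wait per token, the network alone caps throughput near 20 tokens per second before any compute happens; at ~0.48 ms, the wait becomes negligible.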

Tests show the performance difference clearly. For a Llama 3.3 70B FP16 model, the older pipeline approach on the cluster lands around 5 tokens per second, while the tensor-parallel approach with RDMA enabled jumps to about 16 tokens per second (and the smaller-model pipeline case also improves when clustered). The pattern holds as model size increases: clustering doesn’t degrade speed the way it did earlier; instead, it often increases throughput while keeping GPU utilization high.

The real stress test comes with very large models. The setup runs Qwen3 Coder 480B and then scales to Kimi K2 (a trillion-parameter “thinking” model), using multiple nodes when a single Mac can’t fit the model. The cluster also runs multiple large models concurrently—loading a trillion-parameter model alongside a 671B model and then adding additional Llama variants—while maintaining workable performance. Some instability appears during beta software operations (including failures that require rebooting and reinitializing the cluster), but the overall result is a functional, responsive local system rather than a fragile science project.

Beyond benchmarks, the cluster is used through real applications like Open WebUI, with coding workflows attempted via tools such as Xcode and OpenCode. The system can become overloaded under heavy “thinking” workloads, but it remains usable enough to demonstrate that clustering can support interactive development.

Bottom line: the cluster’s success hinges on network latency—not raw compute—and Apple’s RDMA-over-Thunderbolt update is what removes the old ceiling. The experiment suggests local multi-Mac AI can now be fast enough to matter, turning clustering from a curiosity into a practical architecture for running the largest models available locally (at least for those willing to pay for the hardware).

Cornell Notes

A multi-Mac AI cluster that previously performed worse than a single Mac is now fast enough to be useful. The breakthrough is Apple’s Tahoe 26.2 software update enabling RDMA (remote direct memory access) over Thunderbolt, cutting inter-machine latency from ~300 microseconds to ~3 microseconds. That latency reduction makes tensor parallelism viable again, avoiding the heavy per-token communication overhead that previously slowed everything down. With four fully spec’d Mac Studios (512 GB unified memory each), the cluster improves token throughput on Llama 3.3 70B and scales to very large models like Qwen3 Coder 480B and Kimi K2 (a trillion-parameter “thinking” model). The setup also supports real apps (Open WebUI, coding tools), though beta instability can require cluster reboots.

Why did earlier multi-Mac clustering perform 91% worse, even with powerful GPUs?

The slowdown was traced to network latency between machines. Even when bandwidth improved, the per-message delay forced the system into pipeline parallelism, which increases waiting between sequential layer chunks. The result was “capacity without speed”: models could be split across machines, but tokens still had to wait on inter-node communication, turning throughput into a crawl.

What specific Apple change makes tensor parallelism workable on Thunderbolt connections?

Apple’s Tahoe 26.2 update enables RDMA (remote direct memory access) over Thunderbolt. Instead of using the TCP/IP stack (which adds overhead and CPU involvement before data reaches GPU memory), RDMA provides direct memory-to-memory access between GPUs. Reported latency drops from ~300 microseconds to ~3 microseconds, reducing the communication penalty that previously killed tensor parallelism.

How do pipeline parallelism and tensor parallelism differ, and why does latency matter more for tensor parallelism?

Pipeline parallelism splits model layers across machines and processes them sequentially per token—Mac 1 runs layers 1–20, then passes results to Mac 2, and so on. Tensor parallelism keeps every machine working on every layer by splitting the math for each layer and combining results. Tensor parallelism requires far more inter-node messages per token (described as 160 communications per token), so high latency multiplies into large waiting time and can make tensor parallelism slower than pipeline parallelism.
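The distinction can be made concrete with a minimal pure-Python sketch (toy matrices, not the actual cluster code): tensor parallelism splits one layer’s weight matrix across nodes, and the gather step that recombines partial results is where the per-token message count comes from.

```python
# Toy illustration: one "layer" is a matrix-vector product y = W @ x.
# Pipeline parallelism: each node owns whole layers and runs them in sequence.
# Tensor parallelism: every node owns a slice of EVERY layer's weight matrix
# (here, a slice of W's rows) and partial outputs are gathered afterwards.

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4x2 weight matrix for one layer
x = [1, 1]                             # input activation

# Single machine: the whole product.
y_single = matvec(W, x)

# Tensor parallelism across 2 "nodes": split W's rows, compute the halves
# independently, then combine. In a real cluster the combine step is a
# network message per layer -- the source of the 160 communications/token.
node0, node1 = W[:2], W[2:]
y_tensor = matvec(node0, x) + matvec(node1, x)  # list concat = gather step

assert y_single == y_tensor  # identical result, work split across nodes
print(y_single)  # [3, 7, 11, 15]
```

The gather step has to happen at (nearly) every layer, which is why tensor parallelism multiplies messages per token while pipeline parallelism only needs one hand-off per machine boundary.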

What performance changes were observed on Llama 3.3 70B when switching from the old approach to RDMA-enabled tensor parallelism?

On the cluster, the older pipeline approach produced about 5 tokens per second. With Apple’s fix (RDMA-enabled tensor parallelism), throughput rose to roughly 16 tokens per second for the same Llama 3.3 70B FP16 model. The improvement is attributed to the latency reduction that makes the extra tensor-parallel communications affordable.
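Those throughput figures translate into per-token latency as follows (a quick check using the ~5 and ~16 tokens/second numbers reported above):

```python
pipeline_tps = 5.0    # tokens/second, pipeline parallelism (reported)
tensor_tps = 16.0     # tokens/second, RDMA tensor parallelism (reported)

ms_per_token_pipeline = 1000 / pipeline_tps  # 200.0 ms per token
ms_per_token_tensor = 1000 / tensor_tps      # 62.5 ms per token
speedup = tensor_tps / pipeline_tps          # 3.2x

print(f"pipeline: {ms_per_token_pipeline:.1f} ms/token")
print(f"tensor:   {ms_per_token_tensor:.1f} ms/token")
print(f"speedup:  {speedup:.1f}x")
```

Note that the ~48 ms of communication wait eliminated per token accounts for only part of the 137.5 ms saved, so the latency fix plausibly also unlocks better GPU scheduling, not just shorter waits.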

Can the cluster run extremely large models locally, and what happens when multiple large models run at once?

Yes. The cluster loads large models such as Qwen3 Coder 480B and scales to Kimi K2 (a trillion-parameter “thinking” model), using multiple nodes when needed. It also attempts concurrent workloads—running a trillion-parameter model alongside a 671B model and additional Llama variants. Memory utilization stays around the mid-range (roughly ~50% in the described monitoring), but beta software can fail during loading, requiring reboots and reinitialization.
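A rough memory-footprint estimate shows why a trillion-parameter model needs multiple nodes; the bytes-per-parameter figures here are standard quantization sizes, not numbers from the video:

```python
# Back-of-envelope: model weight footprint = parameters x bytes per parameter.
# (Ignores KV cache and runtime overhead, which add more on top.)
GB = 1024**3

def weights_gb(params, bytes_per_param):
    return params * bytes_per_param / GB

one_trillion = 1_000_000_000_000
node_memory_gb = 512  # one Mac Studio's unified memory

fp16_gb = weights_gb(one_trillion, 2)    # ~1863 GB: exceeds any single node
q4_gb = weights_gb(one_trillion, 0.5)    # ~466 GB: close to one node's limit

print(f"1T params @ FP16:  {fp16_gb:.0f} GB")
print(f"1T params @ 4-bit: {q4_gb:.0f} GB")
print(f"4-bit fits on one 512 GB node? {q4_gb < node_memory_gb}")
```

Even heavily quantized, a trillion-parameter model leaves almost no headroom on one 512 GB node, which is consistent with the ~50% cluster-wide memory utilization described when the model is spread across all four machines.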

Does clustering work with real user interfaces and coding workflows, not just benchmark tools?

Open WebUI is used from a separate server to connect to the cluster and run the Kimmy K2 thinking model. Coding attempts are made through tools like Xcode and OpenCode, with the model generating and analyzing code. Under heavy “thinking” loads the system can become unresponsive, but the cluster remains functional enough to complete tasks, with the caveat that beta instability may require restarting the cluster.

Review Questions

  1. What mechanism causes tensor parallelism to become slower than pipeline parallelism under high inter-node latency, and how does RDMA change that outcome?
  2. How does the cluster’s hardware configuration (unified memory size, GPU core count, and Thunderbolt 5 connectivity) relate to the observed token throughput improvements?
  3. What signs of beta instability appear during large-model loading, and what operational steps are used to recover?

Key Points

  1. Apple’s Tahoe 26.2 update enables RDMA over Thunderbolt, cutting inter-machine latency from ~300 microseconds to ~3 microseconds.
  2. The earlier 91% clustering slowdown was driven by latency, not GPU compute or memory capacity.
  3. Lower latency makes tensor parallelism practical again by reducing the per-token communication overhead that previously caused long waits.
  4. A four–Mac Studio cluster (512 GB unified memory each) can run and accelerate models like Llama 3.3 70B FP16 using RDMA-enabled tensor parallelism.
  5. Scaling to very large models (480B and trillion-parameter “thinking” models) is feasible locally by distributing across multiple nodes.
  6. Concurrent multi-model workloads are possible, but beta software can require cluster reboots and reinitialization after failures.
  7. Interactive use through Open WebUI and coding tools demonstrates clustering is more than a benchmark exercise, though overload and beta bugs can still disrupt responsiveness.

Highlights

RDMA over Thunderbolt is the turning point: latency drops from ~300 microseconds to ~3 microseconds, making multi-Mac AI speed viable.
Tensor parallelism previously failed because it multiplies communication events per token (described as 160), turning latency into tens of milliseconds of waiting.
With RDMA enabled, Llama 3.3 70B FP16 throughput rises from about 5 tokens per second to roughly 16 tokens per second on the cluster.
The cluster scales from a 480B model up to a trillion-parameter “thinking” model and can run multiple large models at once.
Real workflows work: Open WebUI connects to the cluster and coding tasks are attempted via Xcode/OpenCode, not just synthetic tests.

Topics

  • RDMA
  • Tensor Parallelism
  • Thunderbolt 5
  • Local AI Clustering
  • Mac Studio
