
I built an AI supercomputer with 5 Mac Studios

NetworkChuck · 5 min read

Based on NetworkChuck's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

XO Labs can auto-discover multiple Mac Studios and form a local inference cluster with a web GUI and an OpenAI-compatible API.

Briefing

Five Mac Studios can be stitched into a local AI cluster with XO Labs—but the limiting factor isn’t just how much model memory the machines have. Networking bandwidth and overhead quickly cap token throughput, even when the cluster is able to run models that normally require cloud-scale hardware.

The build starts with a clear goal: run Meta’s Llama 3.1 405B locally. That model is typically served by large GPU fleets, and the transcript lays out why it’s so hard—model size scales with parameter count, and parameter count drives the amount of GPU memory (VRAM) needed for inference. The creator uses Llama 3.2 as a baseline and then walks up through larger variants, noting practical VRAM requirements (for example, multi‑tens of gigabytes for 70B-class models, and far beyond consumer setups for 405B). To make large models fit on smaller hardware, the transcript emphasizes quantization—reducing precision (FP32/FP16 down to int8 and lower) to shrink memory footprints while accepting some accuracy loss.
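The parameter-count-to-memory relationship described above can be sketched as a back-of-envelope calculation. This is a rough illustration of weights-only memory at the standard bytes-per-parameter for each precision; real usage is higher because of the KV cache, activations, and runtime overhead, none of which are counted here.

```python
# Approximate weights-only memory for the Llama sizes discussed above.
# Bytes per parameter: FP32 = 4, FP16 = 2, int8 = 1, int4 = 0.5.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Size of the model weights alone, in gigabytes (ignores KV cache etc.)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for size in (1, 70, 405):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):.1f} GB"
                    for p in BYTES_PER_PARAM)
    print(f"Llama {size}B -> {row}")
```

This is where the transcript's numbers come from: a 70B model at FP16 needs on the order of 140GB for weights alone, and 405B at FP16 is around 810GB, which is why quantization is unavoidable on this hardware.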

The hardware plan hinges on Apple’s M-series “unified memory” architecture. Each Mac Studio used in the cluster has 64GB of unified RAM, which can be shared with the GPU. With five units, that becomes a theoretical 320GB pool for inference, and the creator argues this helps because unified memory reduces the usual system-memory-to-VRAM transfer overhead. The cluster is then assembled using XO Labs, a beta tool aimed at AI clustering across heterogeneous hardware. XO auto-discovers nodes and provides a web GUI and an OpenAI-compatible API, letting the cluster split work across machines.

The first performance test uses 10GbE networking via a 10 gigabit switch. A single Mac Studio running a small model (around 1B parameters) reaches roughly 117 tokens per second. But once all five nodes join the cluster, throughput drops sharply—about 29 tokens per second—signaling that network constraints dominate. The transcript also notes an operational pain point: model downloads and distribution can be slow and sometimes buggy, with the model appearing to download on multiple nodes rather than being perfectly shared.

To reduce the bottleneck, the creator swaps to Thunderbolt networking using a bridge/hub setup to reach higher bandwidth (up to ~40Gbps) and more direct PCIe access. Token rates improve modestly (roughly 50 tokens per second for smaller tests with two nodes), but the five-node cluster still lands in the low double digits for tokens per second. The transcript repeatedly returns to the same conclusion: bandwidth and coordination overhead limit how fast the cluster can generate text, even when RAM capacity is sufficient.

The “biggest model” moment comes with the 405B target. Running it on a single Mac Studio fails due to memory pressure, triggering heavy swap usage and risking timeouts. With five nodes, swap stays inactive and the system eventually begins generating output—around 0.5–0.8 tokens per second—confirming the cluster can run the model locally, just not at interactive speeds.

Finally, the transcript compares XO’s performance on Apple’s MLX inference stack versus local inference workflows like Ollama, and demonstrates integration with Fabric (via OpenAI-compatible APIs) for streaming chat and summarization. The overall takeaway is pragmatic: local AI clustering with Mac Studios is feasible for very large models, but networking remains the main obstacle to making it fast—especially for the largest Llama variants.
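Because the cluster speaks the OpenAI-compatible Chat Completions shape, any generic client can talk to it, which is what makes the Fabric integration possible. A minimal sketch, assuming a hypothetical local endpoint and model name (the host, port, and model identifier below are placeholders, not values from the video):

```python
# Minimal sketch of calling an OpenAI-compatible endpoint like the one
# the cluster exposes. BASE_URL and the model name are assumptions.
import json
import urllib.request

BASE_URL = "http://localhost:52415/v1"  # hypothetical cluster address

def chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build a Chat Completions request body (the OpenAI-compatible shape)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request and return the first choice's message text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Pointing a tool like Fabric at the cluster is then just a matter of setting its base URL to the cluster's address instead of OpenAI's.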

Cornell Notes

XO Labs can turn five Mac Studios into a local AI cluster that auto-discovers nodes and runs LLM inference across machines. The cluster’s ability to handle very large models depends heavily on Apple’s unified memory (64GB per Mac Studio) and quantization, which reduces precision to fit models into available memory. In practice, token throughput collapses when moving from one node to five, with 10GbE dropping performance from ~117 tokens/sec to ~29 tokens/sec. Thunderbolt improves results somewhat, but five-node generation still stays in the low double digits. The 405B model becomes runnable only when distributed across all five nodes, producing output at roughly 0.5–0.8 tokens/sec—proof of feasibility, not speed.

Why does model size (parameters) translate into a practical hardware requirement for local LLMs?

The transcript links parameter count to learned knowledge: each parameter is a numerical weight in the neural network that helps the model predict responses. More parameters generally mean more patterns and nuance, but they also drive memory needs for inference. That’s why a ~1B-parameter Llama variant can run quickly on a single Mac Studio, while 70B and especially 405B require far more VRAM/unified memory and often quantization to fit.

What role does quantization play in making large models run on consumer GPUs?

Quantization reduces precision so the model occupies less memory. The transcript walks through FP32 (full precision), FP16 (half precision with small precision loss), and then integer quantization like int8 (about 4× smaller with ~1–3% loss, depending on method). It also mentions lower levels like int4, where memory shrinks further but accuracy loss can rise (roughly 10–30% for very low precision), with noticeable degradation on complex tasks like coding or reasoning.

How does Apple unified memory change the feasibility of clustering Mac Studios for LLM inference?

Instead of separate system RAM and GPU VRAM, the M-series Mac Studio uses a unified memory pool. The transcript claims this reduces transfer overhead between CPU memory and GPU memory. With five Mac Studios at 64GB each, the creator expects a large shared pool (theoretical 320GB) that the GPU can use, making it more plausible to load and run models that would otherwise exceed dedicated VRAM limits.
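The capacity argument above reduces to simple arithmetic. A rough check, counting weights only (KV cache and runtime overhead ignored), of whether 405B parameters fit in the theoretical five-node pool:

```python
# Theoretical aggregate unified memory: five Mac Studios at 64GB each.
POOL_GB = 5 * 64  # 320 GB

def fits(params_billions: float, bytes_per_param: float) -> bool:
    """Weights-only check: billions of params x bytes/param = gigabytes."""
    return params_billions * bytes_per_param <= POOL_GB

print(fits(405, 2.0))   # FP16: 810 GB of weights, exceeds the pool
print(fits(405, 0.5))   # int4: ~202 GB of weights, fits with headroom
```

This shows why quantization and clustering have to be combined: even the full 320GB pool cannot hold 405B at FP16, but a 4-bit variant fits comfortably.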

What bottleneck appears when scaling from one Mac Studio to a five-node cluster?

Token throughput drops sharply when all five nodes join. With 10GbE, a small model runs around 117 tokens/sec on one node, but falls to about 29 tokens/sec on the five-node cluster. The transcript attributes this to networking bandwidth and overhead: even if the cluster can split work, the nodes must communicate large amounts of data during inference.
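The size of that networking tax can be made concrete with the two figures reported above (illustrative arithmetic only, using the summary's approximate numbers):

```python
# Scaling math from the reported 10GbE results.
single_node = 117  # tokens/sec, small model on one Mac Studio
five_nodes = 29    # tokens/sec, same model on the five-node cluster

efficiency = five_nodes / single_node
print(f"Cluster runs at {efficiency:.0%} of single-node speed")
```

In other words, adding four more machines over 10GbE left the cluster at roughly a quarter of one machine's throughput, which is why the creator moves to Thunderbolt next.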

Why does Thunderbolt help, and why doesn’t it fully solve the speed problem?

Thunderbolt provides higher bandwidth (around 40Gbps) and more direct PCIe access, reducing overhead compared with standard TCP/IP Ethernet paths. The transcript reports better performance when using Thunderbolt (e.g., two-node tests around ~50 tokens/sec versus the 10GbE results), but five-node generation still remains limited to low double digits. The remaining constraint is still coordination and bandwidth overhead across multiple nodes.

How does the cluster handle the 405B model, and what does the result imply?

A single Mac Studio can’t run the 405B target without heavy swap usage (the transcript shows swap rising and warns about disk pressure/timeouts). With five nodes, the model eventually starts generating text while swap stays inactive, but speed is very low—about 0.5–0.8 tokens/sec. That demonstrates feasibility on local hardware, but also shows that networking and orchestration prevent interactive performance.

Review Questions

  1. What specific hardware feature in the Mac Studios is presented as enabling large-model inference without dedicated VRAM, and how is it expected to reduce overhead?
  2. How do the transcript’s token-per-second results change when moving from one node to five nodes over 10GbE, and what bottleneck is blamed?
  3. Why does quantization allow larger models to fit, and what trade-offs does the transcript associate with lower-bit quantization (e.g., int4)?

Key Points

  1. XO Labs can auto-discover multiple Mac Studios and form a local inference cluster with a web GUI and an OpenAI-compatible API.

  2. Model size scales with parameter count, which drives memory requirements for inference; quantization is used to shrink models to fit available memory.

  3. Apple’s unified memory architecture is positioned as a key enabler for running large models on Mac Studio GPUs without separate VRAM constraints.

  4. 10GbE networking becomes a major bottleneck when scaling to five nodes, causing a large drop in tokens per second compared with single-node runs.

  5. Thunderbolt networking improves throughput by increasing bandwidth and reducing overhead, but it still doesn’t eliminate the multi-node speed ceiling.

  6. The 405B Llama target becomes runnable only when distributed across all five nodes, producing output at roughly 0.5–0.8 tokens/sec—slow but feasible.

  7. Model download/distribution overhead can dominate setup time and can behave inconsistently across nodes, affecting practical usability.

Highlights

A five-node Mac Studio cluster can run Llama 3.1 405B locally, but generation speed lands around 0.5–0.8 tokens per second.
Scaling from one node to five over 10GbE drops throughput dramatically (about 117 tokens/sec down to ~29 tokens/sec for small-model tests).
Thunderbolt networking boosts performance versus 10GbE, yet five-node clusters still struggle to reach high token rates.
Unified memory is treated as the critical Mac advantage: 64GB per Mac Studio becomes a large shared pool for GPU inference.
XO Labs supports an OpenAI-compatible API, enabling integration with tools like Fabric for chat and summarization workflows.

Topics

  • AI Clustering
  • Mac Studio
  • LLM Quantization
  • Unified Memory
  • Thunderbolt Networking

Mentioned

  • VRAM
  • GPU
  • CPU
  • MLX
  • API
  • TCP/IP
  • PCIe
  • FP32
  • FP16
  • int8
  • int4
  • LLM