
INFINITE Inference Power for AI

sentdex · 6 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The Comino Grando server is framed as an inference-focused alternative to datacenter GPUs, with a stated price around the low $30,000 range for six NVIDIA 4090s.

Briefing

A single Comino Grando server built around six NVIDIA 4090 GPUs is positioned as a cost-effective “inference powerhouse,” delivering far more usable large-model throughput per dollar than top-tier datacenter accelerators, while staying within a roughly low-$30,000 price range. The pitch is straightforward: for workloads dominated by inference (running models to generate answers), this setup delivers “bang for your buck” that undercuts the cost of a single H100, making it attractive for teams that need lots of model calls rather than heavy training.

The hardware design is central to the claim. Fitting six 4090s into one chassis works because the consumer cards are cooled aggressively enough for continuous operation: Comino ships the system fully water-cooled, with power delivery handled by four power supplies totaling about 6 kW. The transcript contrasts this with earlier Comino testing: server GPUs are built to tolerate high temperatures for long durations, but electrical heat still degrades components over time, so the cooling strategy matters. In a sustained stress test running all six 4090s at full utilization for about an hour, temperatures stabilized in the upper 60s to low 70s °C with ambient around 24 °C, while noise rose sharply at maximum performance (about 77 dB from a meter).
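
The power budget can be sanity-checked with simple arithmetic. In the sketch below, only the ~6 kW / four-PSU figure comes from the summary; the 450 W number is NVIDIA's published RTX 4090 board power, and the platform overhead is a rough guess for CPU, RAM, pumps, fans, and storage.

```python
# Rough power-budget check for the six-4090 build described above.
GPU_TDP_W = 450          # stock RTX 4090 board power (NVIDIA spec)
NUM_GPUS = 6
PLATFORM_W = 1200        # assumption: CPU, RAM, pumps, fans, storage
PSU_TOTAL_W = 6000       # four PSUs totaling ~6 kW, per the summary

gpu_draw = GPU_TDP_W * NUM_GPUS        # GPU load at full tilt
total_draw = gpu_draw + PLATFORM_W     # estimated whole-system load
headroom = PSU_TOTAL_W - total_draw    # margin left in the PSUs

print(f"GPU draw:    {gpu_draw} W")
print(f"Est. system: {total_draw} W")
print(f"Headroom:    {headroom} W ({headroom / PSU_TOTAL_W:.0%} of capacity)")
```

Even with a generous platform estimate, the four supplies leave meaningful headroom at sustained full utilization, which is consistent with the hour-long stress test described above.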

Why use 4090s in a server at all? The argument is price-to-performance. The 4090 is described as “S tier” for cost efficiency, with the main shortcomings tied to enterprise training features: no NVLink and no SXM variant, which limits memory pooling across GPUs. That makes the system less suitable for training models that must span multiple GPUs, where alternatives like the A100, H100, or larger-memory datacenter GPUs (the transcript mentions a 144 GB-per-card GH200) are still preferred.

Where the Comino Grando shines is inference. The tester runs the open-source “Qwen 72B,” described as approaching GPT-4-like performance in some respects. In practice, Qwen 72B is characterized as strong at factual understanding, reasoning, and programming, but weaker at strict instruction-following and prompt-structured behavior. Beyond chat, the system is used for an “educational lecturer” style project where the model streams topic content and can be directed asynchronously, with ideas to improve chunked streaming, delivery speed, and voice options.

The machine also supports real-time robotics experiments via image-to-depth inference. A Wi‑Fi-based quadcopter (a small Ryze Tello drone) sends camera frames to the Comino server, where an RGB-to-depth model generates depth information fast enough to support real-time control loops; running such a depth model on the drone hardware itself is considered impractical due to size constraints. The transcript frames this as a sign that modern open-source models have become “plug and play” compared with earlier years, when tasks like object detection and RGB-to-depth/segmentation were far harder.

Finally, the server is used to probe language-model behavior through multi-model “conversation” experiments. Six models are loaded across six GPUs and prompted with a question plus conversation history, including a philosophical debate about whether AI should have rights. The results are mixed: models can argue on either side, but they often avoid strong opinions and show difficulty engaging in genuine back-and-forth debate. The takeaway is less about hardware alone and more about what compute enables—rapid experimentation with model behavior, robotics perception, and inference-heavy applications—paired with clear limitations around training-scale features and the system’s loud, server-grade operation.

Cornell Notes

A Comino Grando server using six NVIDIA 4090 GPUs is presented as a high-throughput inference machine that can cost far less than datacenter alternatives while still delivering strong real-world performance. The system is fully water-cooled and draws about 6 kW total power across four power supplies, with stress-test temperatures stabilizing in the upper 60s to low 70s °C at full utilization. The 4090 choice is justified by price-to-performance, but the lack of NVLink and SXM limits multi-GPU memory pooling, making it less suitable for certain large training workloads. Inference experiments include running Qwen 72B for chat, reasoning, and programming, plus using RGB-to-depth models to support real-time drone control by sending images to the server for inference. Multi-model “AI rights” debates highlight both the flexibility of open models and their tendency to avoid deeply opinionated, adversarial argumentation.

Why does the six-4090 Comino Grando setup make sense specifically for inference rather than training?

Inference benefits from high cost-efficiency and throughput, and the 4090 is described as “S tier” for price-to-performance. The system is built to run continuously with heavy water cooling and ~6 kW power delivery, enabling sustained high utilization. Training is where the limitations show up: the 4090 lacks NVLink and an SXM variant, so memory can’t be pooled across GPUs. That makes it harder to train models that span multiple GPUs, where datacenter GPUs like the A100/H100 or higher-memory options (the transcript mentions the 144 GB-per-card GH200) are better suited.

What cooling and power details support the claim that six 4090s can run reliably in one chassis?

The server is fully water-cooled, and power is handled by four power supplies totaling about 6 kW. In a full-tilt test running all six 4090s at maximum utilization for about an hour, temperatures stabilized in the upper 60s to low 70s °C with ambient around 24 °C. Noise rises at peak performance, reaching roughly 77 dB from about a meter away, reinforcing that the cooling comes with a loud, server-like tradeoff.

How does Qwen 72B behave in practice on this multi-GPU inference setup?

Qwen 72B is described as strong for informational tasks—factual understanding and reasoning—and also “pretty good” at programming. It’s less reliable for strict instruction-following or prompt-structured behavior, tending to work better in a chatbot-style question-and-answer mode. The transcript also notes that it can fit across the six 4090s at half precision, whereas it’s considered too large for a single H100 at half precision.
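
The half-precision fit can be checked with back-of-the-envelope math. The parameter count and the “fits on six 4090s, not one H100” claim come from the summary; the 80 GB and 24 GB figures are the cards' published VRAM capacities, and this counts weights only (the KV cache and activations make the six-4090 fit tighter in practice).

```python
# Weights-only VRAM math behind the "six 4090s, not one H100" claim.
PARAMS = 72e9                # Qwen 72B parameter count
BYTES_PER_PARAM = 2          # half precision (fp16 / bf16)
GIB = 1024**3

weights_gib = PARAMS * BYTES_PER_PARAM / GIB   # ~134 GiB of weights alone

h100_gib = 80                # single 80 GB H100
rig_gib = 6 * 24             # six RTX 4090s at 24 GB each

print(f"fp16 weights: {weights_gib:.0f} GiB")
print(f"one H100:     {h100_gib} GiB -> fits: {weights_gib <= h100_gib}")
print(f"six 4090s:    {rig_gib} GiB -> fits: {weights_gib <= rig_gib}")
```

Roughly 134 GiB of weights overflows a single 80 GB card but fits (barely) across the six-4090 pool, matching the transcript's observation.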

What does the RGB-to-depth drone experiment show about using a server for real-time perception?

The drone (a Ryze Tello) has a forward-facing camera, and the experiment tests whether RGB-to-depth models can support depth-aware control. Because the drone is Wi‑Fi-based and the communication chain is fragile, Wi‑Fi is described as an annoyance for robotics. The key workaround is architectural: images are sent from the drone to the Comino server for inference, and the resulting depth information is returned quickly enough to support real-time operation. The transcript suggests the depth model is too heavy to run on the small drone hardware itself.
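
The offloaded-perception loop might be structured roughly like this. Everything below is a hypothetical stand-in (DepthStub, control_step, the fake frames), not the actual code from the video; a real version would stream camera frames over the drone's Wi‑Fi link and run a genuine RGB-to-depth model on the server.

```python
class DepthStub:
    """Stand-in for the server-side RGB-to-depth model (hypothetical)."""
    def infer(self, frame):
        # A real model returns a per-pixel depth map; here we fake a single
        # "distance to nearest obstacle" value from a list of distances.
        return min(frame) if frame else float("inf")

def control_step(depth_m, stop_distance_m=0.5):
    """Trivial control policy: keep moving until something is close."""
    return "forward" if depth_m > stop_distance_m else "hover"

server_model = DepthStub()
frames = [[3.0, 2.5], [1.2, 0.9], [0.4, 0.6]]  # fake "frames" of distances

commands = []
for frame in frames:
    depth = server_model.infer(frame)   # inference happens server-side
    commands.append(control_step(depth))

print(commands)
```

The point is the division of labor: the drone only captures and acts, while the heavy model runs on the server, which is exactly why the fragile Wi‑Fi hop becomes the limiting factor.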

What did the multi-model “AI rights” conversation experiment reveal about language-model debate?

Six large language models are loaded one per GPU and prompted with a question plus shared conversation history. The models can argue on either side of the AI-rights question, and they often reflect the tendency of internet-trained models to lean toward granting AI rights. However, they struggle to produce strongly formed opinions and are difficult to get into genuine back-and-forth debate; they often default to providing information rather than committing to argumentative stances. The transcript frames this as a challenge for “thinking through” rather than merely generating text.
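
The turn-taking structure of the experiment could be sketched like this, with stub functions standing in for the six GPU-resident models (all names and replies here are hypothetical; the video loads one real LLM per GPU):

```python
def make_stub(name, stance):
    """Build a fake model; a real one would run an LLM on its own GPU."""
    def model(question, history):
        # Real models would condition on the question and shared history.
        return f"{name}: {stance}"
    return model

models = [
    make_stub("model_a", "AI systems should have some rights."),
    make_stub("model_b", "Rights require sentience; AI lacks it."),
    make_stub("model_c", "It depends on how 'rights' is defined."),
]

question = "Should AI have rights?"
history = []
for _round in range(2):              # two full rounds of "debate"
    for model in models:
        reply = model(question, history)
        history.append(reply)        # shared history grows each turn

print(len(history))                  # 3 models x 2 rounds = 6 entries
```

Even with this simple round-robin scaffolding, the transcript's finding is that real models fed the shared history tend to restate information rather than escalate a committed argument.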

Review Questions

  1. What hardware features of the 4090-based server enable sustained inference at high utilization, and what features limit it for multi-GPU training?
  2. How do the transcript’s observations about Qwen 72B’s instruction-following compare with its strengths in reasoning and programming?
  3. Why does the RGB-to-depth drone setup rely on server-side inference, and what role does communication (Wi‑Fi) play in the system’s practicality?

Key Points

  1. The Comino Grando server is framed as an inference-focused alternative to datacenter GPUs, with a stated price around the low $30,000 range for six NVIDIA 4090s.
  2. Full water cooling and ~6 kW of total power delivery (four power supplies) keep temperatures stable during sustained full utilization.
  3. Stress testing reported upper-60s to low-70s °C temperatures at full load, with peak noise around 77 dB from a meter.
  4. The 4090’s lack of NVLink and SXM support limits memory pooling, making multi-GPU training less practical than with A100/H100-class hardware.
  5. Qwen 72B is described as strong at factual reasoning and programming, but weaker at strict instruction-following and prompt-structured outputs.
  6. Real-time robotics perception is demonstrated by running RGB-to-depth inference on the server and sending results back to a Wi‑Fi-controlled drone.
  7. Multi-model conversation experiments show that open models can discuss philosophical topics, but they often avoid deeply opinionated, adversarial debate.

Highlights

Six NVIDIA 4090s in one water-cooled chassis are presented as a practical inference platform, not a training rig—especially because NVLink/SXM are missing.
At full utilization, temperatures stabilized in the upper 60s to low 70s °C, but maximum performance noise climbed to about 77 dB from a meter.
Qwen 72B is portrayed as excellent for reasoning and programming while struggling with strict instruction-following.
Depth perception for a drone is achieved by sending camera frames to the server for RGB-to-depth inference, enabling real-time control despite the drone’s limited compute.
Even with multiple models loaded, getting them to debate with strong, committed opinions proved harder than expected.

Topics

  • Inference Hardware
  • Water Cooling
  • Qwen 72B
  • RGB-to-Depth
  • Multi-Model Debate

Mentioned

  • Comino
  • Comino Grando
  • NVIDIA
  • Qwen
  • H100
  • A100
  • GH200
  • Supermicro
  • Wall Street Bets
  • Daniel Kukwa
  • NVLink
  • SXM