Computing and GPUs (3) - Infrastructure & Tooling - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Treat infrastructure as two different problems: fast iterative development and scalable, failure-tolerant training/evaluation.
Briefing
Deep learning progress over the past five years has tracked compute growth closely enough that hardware choices now shape what experiments are even feasible. The practical takeaway is a two-stage workflow—iterative development and longer training/evaluation—and each stage pushes different infrastructure decisions: fast, repeatable loops for coding and debugging, then scalable, failure-tolerant runs for hyperparameter sweeps and multi-day training.
For development, the most common setup is a dedicated machine with at least one GPU, often up to four, or a reserved cloud instance that keeps the environment ready. The goal isn’t maximum throughput; it’s minimizing friction so small training loops can run quickly and results can be inspected via a UI or dashboard. Once a model trains on one GPU, the workflow shifts to launching many experiments—different hyperparameters, multiple architectures, or larger models that need more GPUs or faster turnaround. That’s where compute strategy matters: launching experiments in parallel reduces the “iteration cycle” time, which can be as valuable as raw speed.
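As an illustration of running experiments in parallel, here is a minimal sketch that launches one training process per GPU, assuming a hypothetical `train.py` entry point that accepts a `--lr` flag; the script name and flag are stand-ins, not something from the lecture.

```python
import os
import subprocess

# One hyperparameter setting per GPU; each process sees only its assigned device.
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]

procs = []
for gpu_id, lr in enumerate(learning_rates):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # pin the run to a single GPU
    procs.append(subprocess.Popen(["python", "train.py", "--lr", str(lr)], env=env))

for p in procs:   # wait for all runs; results would then be compared in a dashboard
    p.wait()
```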
Compute isn’t just a slogan. OpenAI’s “AI and compute” analysis plots major results against the compute used to train them (in FLOPs, on a log scale), showing that landmark systems like AlphaGo Zero used roughly an order of magnitude more compute than their predecessors, even as contemporaneous results cluster near each other on that scale. Yet the same discussion includes an appendix highlighting that some influential work achieved strong results with modest compute. “Attention Is All You Need,” for example, replaced expensive recurrent computation (LSTMs) with the Transformer’s attention-based matrix multiplications, which parallelize well and train faster.
That balance—creativity versus brute-force scaling—feeds directly into GPU selection. Nvidia dominates the current ecosystem, with Google’s TPU available as an alternative on Google Cloud. Nvidia’s GPU lineup follows a roughly yearly architecture cadence (Kepler, Maxwell, Pascal, Volta, Turing), with server cards first, then enthusiast, then consumer variants. The most important hardware dimensions for deep learning are memory capacity (especially for recurrent models) and whether the GPU supports mixed precision via tensor cores. Tensor cores accelerate deep-learning operations by combining 16-bit multiplications with 32-bit accumulation, delivering both speedups and better memory utilization. Straight 16-bit can help, but mixed precision is generally the better default.
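For a concrete picture, here is a minimal sketch of mixed-precision training, assuming PyTorch (the lecture doesn’t prescribe a framework): `autocast` runs tensor-core-friendly ops in 16-bit while `GradScaler` keeps gradients from underflowing; the model, data, and hyperparameters are placeholders.

```python
import torch
from torch import nn

device = "cuda"  # tensor cores require a Volta-or-newer Nvidia GPU
model = nn.Linear(512, 10).to(device)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                   # rescales the loss so fp16 gradients don't underflow

# Synthetic batch standing in for a real data loader.
inputs = torch.randn(64, 512, device=device)
targets = torch.randint(0, 10, (64,), device=device)

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                    # matmuls run in fp16, accumulation stays in fp32
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                      # backward pass on the scaled loss
    scaler.step(optimizer)                             # unscales gradients, then takes the optimizer step
    scaler.update()                                    # adapts the loss scale for the next iteration
```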
Older architectures like Kepler (the K80) are often several times slower than current Pascal- and Volta-class options, so they’re usually better left to cloud “cheap” tiers. In practice, the current choices span Pascal and Volta: the 1080 Ti (Pascal, useful for used or DIY builds), the P100 as a mid-range server option, and the V100 (Volta) as the top cloud performer. Cloud providers—Amazon Web Services, Google Cloud, and Microsoft Azure—offer largely similar GPU services, but availability and rollout timing can lag; even after a new GPU such as the V100 ships, getting quota and the right instance types from a provider can take months.
Cost tradeoffs depend on workload shape. Owning a quad-GPU machine can be cheaper over time, but cloud spot instances can dramatically speed experimentation by running many jobs in parallel—at the risk of termination. The break-even point depends on how many experiments are run and how much faster parallelism shortens the feedback loop. For small teams, scaling often hits a DevOps wall: self-managed clusters require orchestration, updates, and failure handling. Cloud shifts that burden.
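To make the break-even idea concrete, a rough sketch follows; the hardware price and spot rate are illustrative assumptions, not quotes from the lecture or any provider.

```python
# Back-of-the-envelope comparison of owning a quad-GPU machine vs renting spot GPUs.
OWNED_MACHINE_COST = 8000.0      # assumed one-time cost of a 4x GPU workstation (USD)
SPOT_RATE_PER_GPU_HOUR = 0.90    # assumed cloud spot price per GPU-hour (USD)

def breakeven_gpu_hours(machine_cost: float, spot_rate: float) -> float:
    """GPU-hours of usage at which buying the machine costs the same as renting."""
    return machine_cost / spot_rate

hours = breakeven_gpu_hours(OWNED_MACHINE_COST, SPOT_RATE_PER_GPU_HOUR)
days_at_full_load = hours / (4 * 24)   # all four local GPUs busy around the clock
print(f"Owning breaks even after ~{hours:,.0f} GPU-hours (~{days_at_full_load:,.0f} days at full load)")
```

The point isn’t the specific numbers but the shape of the tradeoff: heavy, steady usage favors owning, while bursts of many parallel experiments favor renting.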
Finally, GPUs aren’t only for images. As long as libraries can use CUDA acceleration, GPUs can speed up training for text models (including LSTMs/GRUs via unrolled parallel computation) and even tabular workflows after preprocessing. The main “don’t bother” case is when the workload lacks GPU-accelerated implementations; otherwise, GPU use is increasingly the default path.
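As a small illustration that the same GPU path applies to text models, the sketch below moves an LSTM onto CUDA when it is available; PyTorch and the layer sizes are assumptions for the example.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU when no GPU is present

# A toy text encoder: on the GPU, cuDNN runs the LSTM's per-timestep matmuls in parallel.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True).to(device)
batch = torch.randn(32, 50, 128, device=device)            # (batch, sequence length, features)

output, (hidden, cell) = lstm(batch)
print(output.shape, "computed on", device)
```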
Cornell Notes
The core message is that compute availability and infrastructure choices strongly influence what deep learning experiments can be run, especially because development is iterative while training/evaluation is longer and more failure-prone. For development, a local workstation with 1–4 GPUs (or a reserved cloud instance) supports quick training loops and rapid debugging. For hyperparameter searches and multi-day training, cloud instances often win because they’re easier to provision and scale, and spot instances can cut costs—if the training pipeline can tolerate interruptions. GPU selection should prioritize enough VRAM for meaningful batch sizes and support for mixed precision via tensor cores, which can deliver major speedups. Nvidia remains the dominant GPU ecosystem, while TPUs are a viable specialized alternative on Google Cloud.
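As a sketch of what “tolerate interruptions” means in practice, the snippet below saves and restores training state so a terminated spot job can resume; PyTorch, the placeholder model, and the checkpoint path are assumptions for the example.

```python
import os
import torch
from torch import nn

CKPT_PATH = "checkpoint.pt"   # assumed to live on storage that survives instance termination

model = nn.Linear(512, 10)    # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
start_step = 0

if os.path.exists(CKPT_PATH):                        # resume after a spot interruption
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 1_000):
    inputs = torch.randn(64, 512)
    loss = model(inputs).pow(2).mean()               # placeholder objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:                              # checkpoint often enough that little work is lost
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
```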
How does the “two-stage” workflow (development vs training/evaluation) change infrastructure needs?
Why do tensor cores and mixed precision matter more than raw FLOPs alone?
What practical criteria should guide GPU choice for deep learning experiments?
How do cloud GPU availability and rollout timing affect the “build vs buy” decision?
When do spot instances make sense, and what must the training pipeline support?
Are GPUs only useful for images, or can they accelerate text and tabular workloads too?
Review Questions
- What specific GPU features (beyond general speed) determine whether mixed precision will accelerate training effectively?
- How does the cost-benefit of cloud spot instances depend on the number of experiments and the ability to tolerate interruptions?
- Why might a self-managed on-prem cluster become a DevOps burden as team size grows?
Key Points
1. Treat infrastructure as two different problems: fast iterative development and scalable, failure-tolerant training/evaluation.
2. Use VRAM capacity as a first-order constraint; if the model/batch doesn’t fit, speed comparisons are meaningless.
3. Prioritize mixed precision with tensor cores for deep learning workloads to gain speed and better memory utilization.
4. Nvidia dominates GPU tooling for deep learning; TPUs are a specialized alternative available on Google Cloud.
5. Cloud GPU availability can lag due to provider rollout and quota processes, so “newest GPU on demand” isn’t guaranteed.
6. Spot instances can cut costs sharply, but only work well when training jobs can resume or tolerate termination.
7. Build vs cloud hinges on workload shape: parallel experimentation can justify cloud costs, while steady usage often favors owning hardware.