Resource Management (4) - Infrastructure & Tooling - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Resource management aims to let many users launch experiments easily while ensuring dependencies are ready and the right GPUs/CPUs are allocated without contention.
Briefing
Resource management in deep learning is about making shared compute usable: multiple people need to launch experiments quickly, with dependencies handled and the right GPUs and CPUs allocated, without fighting over hardware. The spectrum runs from low-tech coordination (a spreadsheet where people reserve machines) to full systems built for machine learning clusters. A practical middle ground is automation on a single machine: a short script can assign free GPUs to incoming jobs so experiments start with the correct resources and minimal manual bookkeeping.
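As a concrete illustration of that single-machine automation, the sketch below (not a script from the course; the function names and the free-GPU heuristic are assumptions) parses `nvidia-smi` output to find GPUs with no running compute processes and pins a job to them via `CUDA_VISIBLE_DEVICES`:

```python
import os
import subprocess

def free_gpus():
    """Return indices of GPUs with no compute processes, per nvidia-smi.

    Treats "no process on the GPU" as "free" -- a heuristic, not a lock,
    so two scripts racing each other can still collide.
    """
    # UUIDs of GPUs that currently have a compute process running on them.
    busy = set(subprocess.run(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout.split())
    free = []
    for line in subprocess.run(
            ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
            capture_output=True, text=True, check=True).stdout.splitlines():
        index, uuid = (field.strip() for field in line.split(","))
        if uuid not in busy:
            free.append(index)
    return free

def launch(cmd, n_gpus=1):
    """Run `cmd` pinned to `n_gpus` currently free GPUs."""
    gpus = free_gpus()
    if len(gpus) < n_gpus:
        raise RuntimeError(f"need {n_gpus} free GPUs, found {len(gpus)}")
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(gpus[:n_gpus]))
    subprocess.run(cmd, env=env, check=True)

# Example: launch(["python", "train.py"], n_gpus=2)
```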
For larger teams and multi-user environments, workload managers like SLURM provide the standard approach. Users declare requirements—such as “two GPUs, eight CPUs, and 12 GB of RAM”—then submit a job. The scheduler queues it, waits until the requested resources are available, locks them down to prevent contention, and runs the job. That model removes the need for manual reservation and makes it easy to run repeatable batches of experiments.
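To make the declaration concrete, a job matching the quoted request could be submitted as below. This is a hypothetical sketch (the job name and training command are placeholders), though the `#SBATCH` directives shown are standard SLURM options:

```python
import subprocess

# Hypothetical job script: 2 GPUs, 8 CPUs, and 12 GB of RAM, as in the example above.
SBATCH_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=train-example
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=12G
python train.py
"""

# sbatch accepts the job script on stdin and returns immediately; SLURM queues
# the job and runs it once the requested resources are free and reserved for it.
result = subprocess.run(["sbatch"], input=SBATCH_SCRIPT, text=True,
                        capture_output=True, check=True)
print(result.stdout)  # e.g. "Submitted batch job <id>"
```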
Container tooling then addresses the dependency problem. Docker packages an entire software stack into a lightweight unit, avoiding the heavier overhead of full virtual machines. Kubernetes takes this further by orchestrating many Docker containers across a cluster, allocating compute to containers as resources become available. In the course's compute setup, Kubernetes runs JupyterHub (serving JupyterLab to each user) on shared GPUs and CPUs, so multiple users can work concurrently on the same underlying hardware.
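As a sketch of how those responsibilities divide, the snippet below uses the official Kubernetes Python client to ask the cluster to run a Docker image with one GPU. The pod and image names are placeholders, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # authenticate with the cluster via local kubeconfig

# Docker supplies the packaged environment (the image); Kubernetes decides
# which node has a free GPU and schedules the container there.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job"),  # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="registry.example.com/trainer:latest",  # placeholder image
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"},  # needs the NVIDIA device plugin
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```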
On top of Kubernetes, machine-learning-specific platforms aim to handle common training patterns. Kubeflow (an open-source project that originated at Google, though much of its development now comes from outside the company) includes modules for launching and managing Jupyter notebooks with specified GPU counts, so a user can request a "two-GPU notebook" without knowing which physical machine will host it. Kubeflow also supports multi-step workflows, where preprocessing and data preparation can be CPU-heavy while later training is GPU-heavy. For example, downloading and cropping a large image dataset may not benefit from GPUs, but the subsequent concatenation and training steps do. The platform's goal is to allocate different resources at different stages rather than overprovisioning everything for the entire pipeline.
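A sketch of that stage-specific allocation using the Kubeflow Pipelines SDK (the kfp v1-style `ContainerOp` API; image names and commands are placeholders): preprocessing requests only CPUs, while training requests two GPUs and runs afterward.

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="prep-then-train")
def prep_then_train():
    # CPU-heavy stage: download and crop the image dataset; no GPU requested.
    prep = dsl.ContainerOp(
        name="preprocess",
        image="registry.example.com/preprocess:latest",  # placeholder image
        command=["python", "preprocess.py"],
    )
    prep.set_cpu_request("8")

    # GPU-heavy stage: train on the prepared data, only after preprocessing.
    train = dsl.ContainerOp(
        name="train",
        image="registry.example.com/train:latest",  # placeholder image
        command=["python", "train.py"],
    )
    train.set_gpu_limit(2)
    train.after(prep)

# Compile to a workflow spec that Kubeflow can execute on the cluster.
if __name__ == "__main__":
    kfp.compiler.Compiler().compile(prep_then_train, "pipeline.yaml")
```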
The discussion also notes related workflow tooling such as Polyaxon, described as open source with enterprise features and as more actively developed than Kubeflow. Finally, a practical question arises about cost and utilization: in the cloud, teams can request 50 GPUs for a day and stop paying afterward, but on-prem hardware can sit idle. The conversation floats the idea of sharing or contributing idle on-prem GPUs to a peer-to-peer style "cloud," potentially combining ownership with compensation, though no concrete solution is offered. Overall, Kubernetes is positioned as the dominant container orchestration layer for packing workloads onto shared infrastructure, with higher-level ML platforms like Kubeflow handling the workflow and user experience on top of it.
Cornell Notes
Resource management for deep learning focuses on allocating GPUs and CPUs to many users while keeping dependencies and workflows manageable. SLURM solves multi-user contention by letting users declare resource needs and scheduling jobs when those resources are free. Docker packages dependencies into portable units, and Kubernetes orchestrates many containers across a cluster, enabling shared compute for environments like JupyterHub. Kubeflow builds on Kubernetes to provide GPU-backed notebooks and multi-step workflows where preprocessing may use CPUs while later training uses GPUs. This matters because overprovisioning (e.g., reserving 96 CPUs plus GPUs for every step) wastes hardware and increases cost or delays.
- Why do teams move beyond spreadsheets to systems like SLURM for GPU access?
- How do Docker and Kubernetes split responsibilities in shared compute?
- What does Kubeflow add on top of Kubernetes for machine learning users?
- What is a multi-step workflow, and why does it change resource allocation?
- What question comes up about on-prem GPUs sitting idle, and what idea is floated?
Review Questions
- How does SLURM prevent contention compared with manual reservation methods?
- Describe how Docker and Kubernetes work together to support multi-user environments like JupyterHub.
- Give an example of a multi-step ML pipeline and explain which stages likely need CPUs versus GPUs.
Key Points
1. Resource management aims to let many users launch experiments easily while ensuring dependencies are ready and the right GPUs/CPUs are allocated without contention.
2. Spreadsheets can coordinate GPU access but scale poorly; automation scripts can handle simple single-machine GPU allocation.
3. SLURM enables multi-user scheduling by letting jobs declare resource requirements and running them only when those resources are available and locked.
4. Docker packages dependencies into portable containers, while Kubernetes orchestrates those containers across a cluster to share compute efficiently.
5. Kubeflow adds ML-specific workflow support on top of Kubernetes, including GPU-backed Jupyter notebooks and multi-step pipelines with stage-specific resource needs.
6. Multi-step workflows require different resource allocations per stage (e.g., CPU-heavy preprocessing followed by GPU training), avoiding wasteful overprovisioning.
7. Idle on-prem GPUs raise a utilization challenge; peer-to-peer sharing is suggested as a potential way to combine ownership with external demand.