A guide to automating and accelerating deep learning training on a self-hosted Kubernetes GPU cluster.
This guide is designed for researchers and hobbyists who want to set up and manage their own Kubernetes GPU cluster on bare metal servers to speed up deep learning training workflows. It simplifies the process of cluster setup and GPU container deployment, enabling efficient local or cloud-like training environments without relying on external cloud providers.
This guide is tailored for Ubuntu 16.04 and may become outdated as Kubernetes evolves; users should verify compatibility with newer OS versions and Kubernetes releases. Disabling the firewall (ufw) is not recommended for production environments. Contributions to keep the guide updated are encouraged.
Prepare multiple Ubuntu 16.04 bare metal servers with SSH access and sudo privileges
Disable ufw firewall on all nodes (not recommended for production)
Ensure internet access and open required ports (6443, 443, 8080, 30000-32767 if needed)
Initialize the master node manually by following the provided setup instructions, or use the fast setup script
Join GPU worker nodes to the cluster by running the provided join commands
Build your GPU container using the guide's instructions to enable GPU support in your workloads
Use the provided scripts and YAML files to automate setup and deployment steps where possible
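The node preparation steps above can be repeated over SSH rather than typed on every machine. The following is only a sketch: `run_on_nodes` and `DRY_RUN` are hypothetical names, not part of the guide's scripts, and the node hostnames are placeholders. It defaults to a dry run that prints the commands instead of executing them.

```shell
# run_on_nodes CMD NODE...
# Hypothetical helper: runs CMD on every NODE over SSH.
# With DRY_RUN=1 (the default) it only prints what it would run.
run_on_nodes() {
  local cmd="$1"
  shift
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh $node $cmd"
    else
      ssh "$node" "$cmd"
    fi
  done
}

# Example (placeholder hostnames): disable ufw on two workers.
#   DRY_RUN=0 run_on_nodes "sudo ufw disable" gpu-node-1 gpu-node-2
```

If `sudo` on the nodes asks for a password, the command may need an interactive terminal (for example `ssh -t`), so treat this as a starting point rather than a finished tool.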
kubeadm init
Initializes the Kubernetes master node to start the cluster.
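After `kubeadm init` finishes, it prints instructions for copying the admin kubeconfig so that `kubectl` can be used as a regular user. A minimal sketch of that step follows; `setup_kubeconfig` is a hypothetical helper name, and the default paths are the ones kubeadm normally uses.

```shell
# setup_kubeconfig [SRC] [DEST]
# Hypothetical helper: copies the admin kubeconfig written by kubeadm
# into the current user's kubectl config location.
setup_kubeconfig() {
  local src="${1:-/etc/kubernetes/admin.conf}"
  local dest="${2:-$HOME/.kube/config}"
  mkdir -p "$(dirname "$dest")" || return 1
  if [ -r "$src" ]; then
    cp "$src" "$dest" || return 1
  else
    # admin.conf is normally owned by root, so fall back to sudo.
    sudo cp "$src" "$dest" || return 1
    sudo chown "$(id -u):$(id -g)" "$dest" || return 1
  fi
  # The file contains cluster-admin credentials; keep it private.
  chmod 600 "$dest"
}

# Example, on the master node after kubeadm init:
#   setup_kubeconfig
```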
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
Command to join a worker node to the Kubernetes cluster.
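If the token or hash printed by `kubeadm init` has been lost, both can be regenerated on the master. The sketch below shows one way to do it; `ca_cert_hash` and `format_join_command` are hypothetical helper names, the CA path is kubeadm's usual default, and the `openssl` pipeline assumes an RSA cluster CA key.

```shell
# Simplest route, run on the master: create a new token and print
# a complete join command.
#   sudo kubeadm token create --print-join-command

# ca_cert_hash [CA_CERT]
# Prints the SHA-256 hash of the cluster CA public key, i.e. the <hash>
# part of --discovery-token-ca-cert-hash. Assumes an RSA CA key.
ca_cert_hash() {
  local ca="${1:-/etc/kubernetes/pki/ca.crt}"
  openssl x509 -pubkey -in "$ca" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //'
}

# format_join_command MASTER_IP TOKEN HASH
# Assembles the join command from its three placeholders.
format_join_command() {
  printf 'kubeadm join %s:6443 --token %s --discovery-token-ca-cert-hash sha256:%s\n' \
    "$1" "$2" "$3"
}

# Example, run on the master (the IP and token are placeholders):
#   format_join_command 192.168.1.10 "<token>" "$(ca_cert_hash)"
```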
kubectl apply -f <gpu-device-plugin.yaml>
Deploys the NVIDIA GPU device plugin to enable GPU support in the cluster.
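Once the device plugin is running, a small test pod is a quick way to check that GPUs can be scheduled. The sketch below assumes the plugin advertises GPUs under the resource name `nvidia.com/gpu` (older Kubernetes releases used a different name) and that the chosen image ships `nvidia-smi`. `write_gpu_test_pod`, the pod name `gpu-test`, and the output file name are hypothetical.

```shell
# write_gpu_test_pod IMAGE [OUTPUT_FILE]
# Hypothetical helper: writes a pod manifest that requests one GPU
# and runs nvidia-smi once.
write_gpu_test_pod() {
  local image="$1"
  local out="${2:-gpu-test-pod.yaml}"
  cat > "$out" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Example:
#   write_gpu_test_pod "<gpu-container-name>"
#   kubectl apply -f gpu-test-pod.yaml
#   kubectl logs gpu-test
```

If the pod stays in `Pending`, `kubectl describe pod gpu-test` usually shows whether the scheduler could not find a node with a free GPU.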
docker build -t <gpu-container-name> .
Builds a Docker container image configured for GPU-accelerated deep learning.
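What the build context might contain is sketched below. This is not the guide's own Dockerfile: the base image is deliberately left as a build argument, `train.py` is a hypothetical training script, and the sketch assumes the chosen base image already provides CUDA libraries and a `python` interpreter. `write_gpu_dockerfile` is a hypothetical helper name.

```shell
# write_gpu_dockerfile [DIR]
# Hypothetical helper: writes a minimal Dockerfile into DIR.
# The quoted heredoc keeps ${BASE_IMAGE} literal for Docker to expand.
write_gpu_dockerfile() {
  local dir="${1:-.}"
  mkdir -p "$dir" || return 1
  cat > "$dir/Dockerfile" <<'EOF'
# Base image is supplied at build time, e.g. a CUDA-enabled framework image.
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
WORKDIR /workspace
# train.py is a placeholder for your own training script.
COPY train.py .
CMD ["python", "train.py"]
EOF
}

# Example (both angle-bracket values are placeholders):
#   write_gpu_dockerfile .
#   docker build --build-arg BASE_IMAGE="<cuda-base-image>" -t "<gpu-container-name>" .
```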
kubectl get nodes
Lists all nodes in the Kubernetes cluster and their status.
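For scripted checks, the node list can be filtered to show only nodes that have not become ready. A minimal sketch, assuming the default column layout of `kubectl get nodes` (name first, status second); `not_ready_nodes` is a hypothetical helper name.

```shell
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# names of nodes whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" therefore still counts as ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example:
#   kubectl get nodes --no-headers | not_ready_nodes
#
# Allocatable GPUs per node can be listed with custom columns, assuming
# the device plugin exposes the nvidia.com/gpu resource:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```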