Kubernetes-native Deep Learning Framework

ElasticDL is a Kubernetes-native deep learning framework that enables fault-tolerant, elastically scheduled distributed training for TensorFlow and PyTorch models.
ElasticDL is designed for data scientists and ML engineers who want to run distributed deep learning training jobs on Kubernetes clusters with improved resource utilization and fault tolerance. It is especially useful in environments where cluster resources are shared and preemption occurs, allowing training jobs to continue running despite resource fluctuations.
ElasticDL requires a Kubernetes cluster to provide its elastic scheduling and fault-tolerance features. It does not rely on Kubernetes extensions such as Kubeflow; instead, it interacts with the Kubernetes APIs directly to launch and manage the pods of a training job. Users should ensure their cluster supports priority-based preemption to get the most out of elastic scheduling. ElasticDL tolerates worker failures without checkpoint recovery: the master partitions training data into shards and hands them out as tasks, so the tasks of a failed worker can simply be reassigned to the surviving workers. Proper model and data preparation nonetheless remain essential.
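To make "interacts with the Kubernetes APIs directly" concrete, the sketch below uses the official Kubernetes Python client to create a single worker pod, roughly the pattern an elastic job master follows when scaling workers up or down. This is a simplified illustration under assumed names (the pod name, labels, and image are placeholders), not ElasticDL's actual internals.

    from kubernetes import client, config

    # Load credentials from ~/.kube/config; inside a cluster,
    # use config.load_incluster_config() instead.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # A minimal worker pod spec; names and image are placeholders.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="test-mnist-worker-0",
            labels={"elasticdl-job-name": "test-mnist"},
        ),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(name="worker", image="elasticdl:mnist")],
        ),
    )

    # Creating and deleting pods like this is how an elastic job adds or
    # removes workers as cluster resources fluctuate.
    v1.create_namespaced_pod(namespace="default", body=pod)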
To get started:

1. Set up a Kubernetes cluster (local, on-premises, or cloud-based such as Google Kubernetes Engine).
2. Install the ElasticDL client from PyPI: pip install elasticdl-client
3. Prepare your model using the TensorFlow Estimator, Keras, or PyTorch APIs (a minimal Keras sketch follows this list).
4. Package your model (for example, into a Docker image) and make the training data accessible to the Kubernetes cluster.
5. Launch training with the elasticdl CLI and the appropriate parameters, as in the example command below.
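For step 3, a model definition is a plain Python module. The sketch below shows the general shape such a module might take with the Keras functional API, loosely following the mnist.mnist_functional_api.custom_model path used in the example command below; the exact function names and signatures a given ElasticDL version expects are assumptions here, so treat the model zoo as the authoritative reference.

    import tensorflow as tf

    def custom_model():
        # A small convolutional network for 28x28 grayscale MNIST images.
        inputs = tf.keras.Input(shape=(28, 28), name="image")
        x = tf.keras.layers.Reshape((28, 28, 1))(inputs)
        x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
        x = tf.keras.layers.Flatten()(x)
        outputs = tf.keras.layers.Dense(10)(x)  # logits for the 10 digit classes
        return tf.keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")

    def loss(labels, predictions):
        # Sparse softmax cross-entropy between integer labels and logits.
        return tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=tf.reshape(labels, [-1]), logits=predictions
            )
        )

    def optimizer(lr=0.01):
        return tf.optimizers.SGD(lr)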
The following command starts a distributed training job on Kubernetes using a Keras model defined in the model zoo, with the specified training data and volume mount:

    elasticdl train \
        --image_name=elasticdl:mnist \
        --model_zoo=model_zoo \
        --model_def=mnist.mnist_functional_api.custom_model \
        --training_data=/data/mnist/train \
        --job_name=test-mnist \
        --volume="host_path=/data,mount_path=/data"
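Model zoo modules typically also define how raw records from --training_data are decoded into tensors. The sketch below assumes records are serialized tf.train.Example protos with "image" and "label" features; both the dataset_fn signature and the feature schema are assumptions for illustration, since the real format depends on how the data was prepared.

    import tensorflow as tf

    def dataset_fn(dataset, mode, metadata):
        # Hypothetical schema: a 28x28 float "image" and an int64 "label".
        def parse(record):
            features = tf.io.parse_single_example(
                record,
                {
                    "image": tf.io.FixedLenFeature([28, 28], tf.float32),
                    "label": tf.io.FixedLenFeature([1], tf.int64),
                },
            )
            return features["image"], features["label"]

        return dataset.map(parse)

Because the master hands out data shards as small tasks, a parsing function like this runs unchanged whether the job has two workers or twenty.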