%%capture
! pip install azure-core azure-ai-ml
! curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
In this guide, we’ll see how you can do multi-node/multi-GPU training on AzureML using Hugging Face accelerate.
More specifically, we’ll fine-tune an image classification model from timm on the Oxford-IIIT Pet dataset. We use this dataset as it is small and works well for getting started.
Prerequisites:
- You have already created an AzureML workspace
- You have your workspace’s associated subscription ID, Resource Group name, and AzureML workspace name.
- You have the necessary quota for GPU instances, so you can follow along.
Step 0 - Setup Local Environment
First things first, we’ll need to set up a local environment that has the required dependencies to interface with AzureML.
Next, we log in with the Azure CLI. If you’re running on your own machine rather than in a notebook, you can run this in your terminal instead.
! az login
Here we define all the imports needed for this notebook:
from pathlib import Path
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Environment
# Authenticate!
credential = DefaultAzureCredential()

# Run this to check auth worked
credential.get_token("https://management.azure.com/.default")
Step 1 - Connect to AzureML Workspace
Now, we should be able to authenticate with AzureML SDK v2 to connect to our workspace.
For that, we’ll need some info from you, which you’ll have to fill into the cell below:
- Subscription ID: The Azure subscription where your resource was created.
- Resource Group Name: The name of the Azure Resource Group your AzureML Resource was created in.
- Workspace Name: The name of your AzureML Resource.
All of this information can be found in the Azure Portal. Just navigate to your AzureML Resource and find it in the “Overview” section. ✅
# Replace these values with yours!
="YOUR AZUREML SUBSCRIPTION ID"
aml_sub="NAME OF AZURE RESOURCE GROUP YOUR INSTANCE WAS CREATED IN"
aml_rsg= "NAME OF YOUR AZUREML RESOURCE"
aml_ws_name
# Get a handle to the workspace
= MLClient(
ml_client =credential,
credential=aml_sub,
subscription_id=aml_rsg,
resource_group_name=aml_ws_name,
workspace_name )
Step 2 - Create Compute Targets
Next, you’ll want to create a couple of compute targets. There are many ways to do this, but for this example, we will just use the Web UI.
Navigate to your AzureML Portal, and create two compute clusters:
- One named cpu-cluster, which is a CPU instance:
  - Set min nodes to 0
  - Set max nodes to 1
- Another named gpu-cluster, which is a GPU cluster. For this example, we used Standard_NC12 instances:
  - Set min nodes to 0
  - Set max nodes to 2
For more detailed instructions on creating compute clusters, you can refer to the AzureML Docs.
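If you’d rather create these clusters from code instead of the Web UI, the SDK v2 equivalent looks roughly like the sketch below. The Standard_DS3_v2 CPU SKU is an assumption on our part; any CPU size your quota allows works.

# Sketch: create the two clusters with the SDK instead of the Web UI
from azure.ai.ml.entities import AmlCompute

cpu_cluster = AmlCompute(
    name="cpu-cluster",
    size="Standard_DS3_v2",  # assumption: any small CPU SKU is fine
    min_instances=0,
    max_instances=1,
)
gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC12",
    min_instances=0,
    max_instances=2,
)

# begin_create_or_update returns a poller; .result() blocks until provisioning finishes
ml_client.compute.begin_create_or_update(cpu_cluster).result()
ml_client.compute.begin_create_or_update(gpu_cluster).result()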
# If your targets aren't named as described above, feel free to update here.
cpu_compute_target = 'cpu-cluster'
gpu_compute_target = 'gpu-cluster'

# Train on 2 nodes with 2 GPUs each (4 GPUs total).
# If you didn't use Standard_NC12 instances, or if you desire a different number of nodes
# per training run, you may need to update these values accordingly.
num_training_nodes = 2
num_gpus_per_node = 2
Step 3 - Upload Data to AzureML
Can’t do much training if we don’t have any data! 😅
So, let’s get some data into AzureML! To do that, we’ll create a data-prep-step that:
- downloads the compressed data from a URL
- extracts it to a new location in our AzureML workspace’s storage
Once we do this, we’ll be able to mount this data to our training run later. 💾
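As a side note, folder inputs like this are mounted read-only into the job by default; if you ever want to be explicit about it, SDK v2 exposes the mount mode on Input. A small sketch (the shortened datastore path here is purely illustrative):

# Sketch: an explicit read-only mount of the extracted folder
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

pets_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/PETS",  # illustrative short-form path
    mode=InputOutputModes.RO_MOUNT,
)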
We start off by creating a ./src directory where all of our code will live. AzureML uploads all the files within this source directory, so we want to keep it clean.
We’ll also define an experiment name, so all the jobs we run here are grouped together.
from pathlib import Path
experiment_name = 'accelerate-cv-multinode-example'
src_dir = './src'
Path(src_dir).mkdir(exist_ok=True, parents=True)
Define Data Upload Script
Here’s the data upload script. It simply takes in a path (to a .tar.gz file) and extracts it to output_folder. 📝
%%writefile {src_dir}/read_write_data.py
import argparse
import os
import tarfile
# Parse the input tarball path and the destination folder
parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
parser.add_argument("--output_folder", type=str)
args = parser.parse_args()

# Extract the archive into the output folder on the workspace datastore
file = tarfile.open(args.input_data)
output_path = os.path.join(args.output_folder)
file.extractall(output_path)
file.close()
Define Data Upload Job
Now that we have some code to run, we can define the job. The cell below defines:
- Inputs: the inputs to our script. In our case, it’s a tar.gz file stored at a URL, which will be downloaded when the job runs. We provide it to the script we wrote above via the --input_data flag.
- Outputs: the path where we will save the outputs in our workspace’s data store. We pass this to --output_folder in our script.
- Environment: we use one of AzureML’s curated environments, which will result in the job starting faster. Later, for the training job, we’ll define a custom environment.
- Compute: we tell the job to run on our cpu-cluster.

Any inputs/outputs you define can be referenced via ${{inputs.<name>}} and ${{outputs.<name>}} in the command, so their values are passed along to the script.
# Input in this case is a URL that will be downloaded
inputs = {
    "pets_zip": Input(
        type=AssetTypes.URI_FILE,
        path="https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz",
    ),
}

# Define output data. The resulting path will be used in train.py
outputs = {
    "pets": Output(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/PETS",
    )
}

# Define our job
job = command(
    code=src_dir,
    command="python read_write_data.py --input_data ${{inputs.pets_zip}} --output_folder ${{outputs.pets}}",
    inputs=inputs,
    outputs=outputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute=cpu_compute_target,
    experiment_name=experiment_name,
    display_name='data-prep-step',
)
Run Data Upload Job
If everything goes smoothly, the cell below should launch the data-prep job and spit out a link for you to watch it run.
You only need to run this job once; after that, you can reference its output as many times as you like in the training step we define in the next section.
# submit the command
returned_job = ml_client.jobs.create_or_update(job)
returned_job
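If you’d rather block in the notebook until the job finishes (instead of watching it in the browser), you can stream its logs:

# Optional: tail the job's logs in the notebook until it completes
ml_client.jobs.stream(returned_job.name)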
Step 4 - Train
Ok, we have some data! 🙏
Let’s see how we can set up multi-node/multi-GPU training with accelerate.
Define Training Environment
For the training job, we’ll define a custom training environment, as our dependencies aren’t included in the curated environments offered by AzureML. We pin most of these to specific versions so the environment won’t break in the future or when we share it with others.
%%writefile {src_dir}/train_environment.yml
name: aml-video-accelerate
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pip
  - scikit-learn
  - scipy
  - pandas
  - pip:
      - pyarrow==9.0.0
      - azure-identity>=1.8.0
      - transformers==4.24.0
      - timm==0.6.12
      - git+https://github.com/huggingface/accelerate.git@5315290b55ea9babd95a281a27c51d87b89d7c85
      - fire==0.4.0
      - torchmetrics==0.10.3
      - av==9.2.0
      - torch==1.12.1
      - torchvision==0.13.1
      - tensorboard
      - mlflow
      - setfit
      - azure-keyvault-secrets
      - azureml-mlflow
      - azure-ai-ml
Now we use the conda environment file we just wrote to specify additional dependencies on top of AzureML’s curated openmpi3.1.2-ubuntu18.04 Docker image.
For more information on creating environments with the AzureML SDK v2, check out the docs.
# Define environment from conda specification
train_environment = Environment(
    name="aml-accelerate",
    description="Custom environment for Accelerate + PytorchVideo training",
    conda_file=str(Path(src_dir) / "train_environment.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
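If you plan to reuse this environment across notebooks or jobs, you can optionally register it in the workspace. Registration isn’t required for the training job below, which references the Environment object directly:

# Optional: register the environment so later jobs can reference it by name/version
registered_env = ml_client.environments.create_or_update(train_environment)
print(registered_env.name, registered_env.version)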
Define Training Script
For our training script, we’re going to use the complete_cv_example.py script from the official accelerate examples on GitHub.
! wget -O {src_dir}/train.py -nc https://raw.githubusercontent.com/huggingface/accelerate/main/examples/complete_cv_example.py
Define Training Job
The moment of truth! Let’s see if we can train an image classifier using multiple GPUs across multiple nodes on AzureML 🤞
Here, we’ll define a job called train-step, where we define:
- An input, pets, which points to the data store path where we stored our processed data earlier.
- Our training command, providing the following flags:
  - --data_dir: supplies the input reference path
  - --with_tracking: makes sure we save logs
  - --checkpointing_steps epoch: makes sure we save checkpoints every epoch
  - --output_dir ./outputs: saves to the ./outputs directory, which is a special directory in AzureML meant for saving any artifacts from training
- The train_environment we defined above.
- The distribution as PyTorch, specifying process_count_per_instance, which is how many GPUs there are per node (in our case, 2).
For more information on how Multi-Node GPU training works on AzureML, you can refer to the docs.
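For some intuition on why this “just works”: AzureML’s PyTorch distribution launches process_count_per_instance processes on each node and sets the standard torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which Accelerator() reads when it is constructed. A minimal sketch of what each launched process sees:

from accelerate import Accelerator

# No extra configuration needed: the distributed env vars are already set,
# so each of the 4 processes discovers its rank and world size automatically.
accelerator = Accelerator()
print(
    f"rank {accelerator.process_index} of {accelerator.num_processes} "
    f"(local rank {accelerator.local_process_index})"
)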
# Define inputs, which in our case is the path populated by read_write_data.py
inputs = dict(
    pets=Input(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/PETS/images",
    ),
)

# Define the job!
job = command(
    code=src_dir,
    inputs=inputs,
    command="python train.py --data_dir ${{inputs.pets}} --with_tracking --checkpointing_steps epoch --output_dir ./outputs",
    environment=train_environment,
    compute=gpu_compute_target,
    instance_count=num_training_nodes,  # In this example, a 2-node cluster was created.
    distribution={
        "type": "PyTorch",
        # Set process count to the number of GPUs per node.
        # In our case (using Standard_NC12) we have 2 GPUs per node.
        "process_count_per_instance": num_gpus_per_node,
    },
    experiment_name=experiment_name,
    display_name='train-step',
)
Run Training Job
# Run it! 🚀
train_job = ml_client.jobs.create_or_update(job)
train_job
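Once the run completes, you can pull the artifacts it saved to ./outputs (checkpoints, logs) back to your machine. The local download path below is just an example:

# Download the completed job's artifacts locally
ml_client.jobs.download(name=train_job.name, download_path="./train_job_artifacts")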