%%capture
! pip install azure-core azure-ai-ml
! curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
In this guide, we’ll see how you can do multi-node/multi-GPU training on AzureML using Hugging Face accelerate.
More specifically, we’ll fine-tune an image classification model from timm on the Oxford-IIIT Pet dataset. We use this dataset as it is small and works well for getting started.
Prerequisites:
- You have already created an AzureML workspace
- You have your workspace’s associated subscription ID, Resource Group name, and AzureML workspace name.
- You have the necessary quota for GPU instances, so you can follow along.
Step 0 - Setup Local Environment
First things first, we’ll need to set up a local environment that has the required dependencies to interface with AzureML.
Next, we log in with the Azure CLI. If you’re running on your own machine rather than in a notebook, you can run this in your terminal instead.
! az login
Here we define all the imports needed for this notebook:
from pathlib import Path
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Environment
# Authenticate!
credential = DefaultAzureCredential()

# Run this to check auth worked
credential.get_token("https://management.azure.com/.default")
Step 1 - Connect to AzureML Workspace
Now, we should be able to authenticate with AzureML SDK v2 to connect to our workspace.
For that, we’ll need some info from you, which you’ll have to fill into the cell below:
- Subscription ID: The Azure subscription where your resource was created.
- Resource Group Name: The name of the Azure Resource Group your AzureML Resource was created in.
- Workspace Name: The name of your AzureML Resource.
All of this information can be found in the Azure Portal. Just navigate to your AzureML Resource and find it in the “Overview” section. ✅
# Replace these values with yours!
="YOUR AZUREML SUBSCRIPTION ID"
aml_sub="NAME OF AZURE RESOURCE GROUP YOUR INSTANCE WAS CREATED IN"
aml_rsg= "NAME OF YOUR AZUREML RESOURCE"
aml_ws_name
# Get a handle to the workspace
= MLClient(
ml_client =credential,
credential=aml_sub,
subscription_id=aml_rsg,
resource_group_name=aml_ws_name,
workspace_name )
Step 2 - Create Compute Targets
Next, you’ll want to create a couple of compute targets. There are many ways to do this, but for this example, we will just use the Web UI.
Navigate to your AzureML Portal, and create two compute clusters:
- One named cpu-cluster, which is a CPU instance:
  - Set min nodes to 0
  - Set max nodes to 1
- Another named gpu-cluster, which is a GPU cluster. For this example, we used Standard_NC12 instances:
  - Set min nodes to 0
  - Set max nodes to 2
For more detailed instructions on creating compute clusters, you can refer to the AzureML Docs.
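If you’d rather create these clusters from code instead of the Web UI, the SDK v2 equivalent looks roughly like the sketch below. The Standard_DS3_v2 CPU SKU is an assumption on our part; any CPU size your quota allows works.

# Sketch: create the two clusters with the SDK instead of the Web UI
from azure.ai.ml.entities import AmlCompute

cpu_cluster = AmlCompute(
    name="cpu-cluster",
    size="Standard_DS3_v2",  # assumption: any small CPU SKU is fine
    min_instances=0,
    max_instances=1,
)
gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC12",
    min_instances=0,
    max_instances=2,
)

# begin_create_or_update returns a poller; .result() blocks until provisioning finishes
ml_client.compute.begin_create_or_update(cpu_cluster).result()
ml_client.compute.begin_create_or_update(gpu_cluster).result()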
# If your targets aren't named as described above, feel free to update here.
cpu_compute_target = 'cpu-cluster'
gpu_compute_target = 'gpu-cluster'

# Train on 2 nodes with 2 GPUs each (4 GPUs total).
# If you didn't use Standard_NC12 instances, or if you desire a different number of nodes
# per training run, you may need to update these values accordingly.
num_training_nodes = 2
num_gpus_per_node = 2
Step 3 - Upload Data to AzureML
Can’t do much training if we don’t have any data! 😅
So, let’s get some data into AzureML! To do that, we’ll create a data-prep-step that:
- downloads the compressed data from a URL
- extracts it to a new location in our AzureML workspace’s storage
Once we do this, we’ll be able to mount this data to our training run later. 💾
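As a side note, folder inputs like this are mounted read-only into the job by default; if you ever want to be explicit about it, SDK v2 exposes the mount mode on Input. A small sketch (the shortened datastore path here is purely illustrative):

# Sketch: an explicit read-only mount of the extracted folder
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

pets_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/PETS",  # illustrative short-form path
    mode=InputOutputModes.RO_MOUNT,
)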
We start off by creating a ./src directory where all of our code will live. AzureML uploads all the files within this source directory, so we want to keep it clean.
We’ll also define an experiment name, so all the jobs we run here are grouped together.
from pathlib import Path
experiment_name = 'accelerate-cv-multinode-example'
src_dir = './src'
Path(src_dir).mkdir(exist_ok=True, parents=True)
Define Data Upload Script
Here’s the data upload script. It simply takes in a path (to a .tar.gz file) and extracts it to output_folder. 📝
%%writefile {src_dir}/read_write_data.py
import argparse
import os
import tarfile
# Parse the input tarball path and the destination folder
parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
parser.add_argument("--output_folder", type=str)
args = parser.parse_args()

# Extract the archive into the output folder on the workspace datastore
file = tarfile.open(args.input_data)
output_path = os.path.join(args.output_folder)
file.extractall(output_path)
file.close()
Define Data Upload Job
Now that we have some code to run, we can define the job. The cell below defines:
- Inputs: the inputs to our script. In our case, it’s a tar.gz file stored at a URL, which will be downloaded when the job runs. We provide it to the script we wrote above via the --input_data flag.
- Outputs: the path where we will save the outputs in our workspace’s data store. We pass this to --output_folder in our script.
- Environment: we use one of AzureML’s curated environments, which will result in the job starting faster. Later, for the training job, we’ll define a custom environment.
- Compute: we tell the job to run on our cpu-cluster.

Any inputs/outputs you define can be referenced via ${{inputs.<name>}} and ${{outputs.<name>}} in the command, so their values are passed along to the script.
# Input in this case is a URL that will be downloaded
inputs = {
    "pets_zip": Input(
        type=AssetTypes.URI_FILE,
        path="https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz",
    ),
}

# Define output data. The resulting path will be used in train.py
outputs = {
    "pets": Output(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/PETS",
    )
}

# Define our job
job = command(
    code=src_dir,
    command="python read_write_data.py --input_data ${{inputs.pets_zip}} --output_folder ${{outputs.pets}}",
    inputs=inputs,
    outputs=outputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute=cpu_compute_target,
    experiment_name=experiment_name,
    display_name='data-prep-step',
)
Run Data Upload Job
If everything goes smoothly, the cell below should launch the data-prep job and spit out a link for you to watch it run.
You only need to run this job once; after that, you can reference its output as many times as you like in the training step we define in the next section.
# submit the command
returned_job = ml_client.jobs.create_or_update(job)
returned_job
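If you’d rather block in the notebook until the job finishes (instead of watching it in the browser), you can stream its logs:

# Optional: tail the job's logs in the notebook until it completes
ml_client.jobs.stream(returned_job.name)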
Step 4 - Train
Ok, we have some data! 🙏
Let’s see how we can set up multi-node/multi-GPU training with accelerate.
Define Training Environment
For the training job, we’ll define a custom training environment, as our dependencies aren’t included in the curated environments offered by AzureML. We pin most of these to specific versions so the environment won’t break in the future or when we share it with others.
%%writefile {src_dir}/train_environment.yml
name: aml-video-accelerate
channels:
  - conda-forge
dependencies:
  - python=3.9
  - numpy
  - pip
  - scikit-learn
  - scipy
  - pandas
  - pip:
      - pyarrow==9.0.0
      - azure-identity>=1.8.0
      - transformers==4.24.0
      - timm==0.6.12
      - git+https://github.com/huggingface/accelerate.git@5315290b55ea9babd95a281a27c51d87b89d7c85
      - fire==0.4.0
      - torchmetrics==0.10.3
      - av==9.2.0
      - torch==1.12.1
      - torchvision==0.13.1
      - tensorboard
      - mlflow
      - setfit
      - azure-keyvault-secrets
      - azureml-mlflow
      - azure-ai-ml
Now we use the conda environment file we just wrote to specify additional dependencies on top of AzureML’s curated openmpi3.1.2-ubuntu18.04 Docker image.
For more information on creating environments with the AzureML SDK v2, check out the docs.
# Define environment from conda specification
train_environment = Environment(
    name="aml-accelerate",
    description="Custom environment for Accelerate + PytorchVideo training",
    conda_file=str(Path(src_dir) / "train_environment.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
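If you plan to reuse this environment across notebooks or jobs, you can optionally register it in the workspace. Registration isn’t required for the training job below, which references the Environment object directly:

# Optional: register the environment so later jobs can reference it by name/version
registered_env = ml_client.environments.create_or_update(train_environment)
print(registered_env.name, registered_env.version)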
Define Training Script
For our training script, we’re going to use the complete_cv_example.py script from the official accelerate examples on GitHub.
! wget -O {src_dir}/train.py -nc https://raw.githubusercontent.com/huggingface/accelerate/main/examples/complete_cv_example.py
Define Training Job
The moment of truth! Let’s see if we can train an image classifier using multiple GPUs across multiple nodes on AzureML 🤞
Here, we’ll define a job called train-step, where we define:
- An input, pets, which points to the data store path where we stored our processed data earlier.
- Our training command, providing the following flags:
  - --data_dir: supplies the input reference path
  - --with_tracking: makes sure we save logs
  - --checkpointing_steps epoch: makes sure we save checkpoints every epoch
  - --output_dir ./outputs: saves to the ./outputs directory, which is a special directory in AzureML meant for saving any artifacts from training
- The train_environment we defined above.
- The distribution as PyTorch, specifying process_count_per_instance, which is how many GPUs there are per node (in our case, 2).
For more information on how Multi-Node GPU training works on AzureML, you can refer to the docs.
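For some intuition on why this “just works”: AzureML’s PyTorch distribution launches process_count_per_instance processes on each node and sets the standard torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which Accelerator() reads when it is constructed. A minimal sketch of what each launched process sees:

from accelerate import Accelerator

# No extra configuration needed: the distributed env vars are already set,
# so each of the 4 processes discovers its rank and world size automatically.
accelerator = Accelerator()
print(
    f"rank {accelerator.process_index} of {accelerator.num_processes} "
    f"(local rank {accelerator.local_process_index})"
)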
# Define inputs, which in our case is the path populated by read_write_data.py
inputs = dict(
    pets=Input(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/PETS/images",
    ),
)

# Define the job!
job = command(
    code=src_dir,
    inputs=inputs,
    command="python train.py --data_dir ${{inputs.pets}} --with_tracking --checkpointing_steps epoch --output_dir ./outputs",
    environment=train_environment,
    compute=gpu_compute_target,
    instance_count=num_training_nodes,  # In this example, a 2-node cluster was created.
    distribution={
        "type": "PyTorch",
        # Set process count to the number of GPUs per node.
        # In our case (using Standard_NC12) we have 2 GPUs per node.
        "process_count_per_instance": num_gpus_per_node,
    },
    experiment_name=experiment_name,
    display_name='train-step',
)
Run Training Job
# Run it! 🚀
train_job = ml_client.jobs.create_or_update(job)
train_job
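Once the run completes, you can pull the artifacts it saved to ./outputs (checkpoints, logs) back to your machine. The local download path below is just an example:

# Download the completed job's artifacts locally
ml_client.jobs.download(name=train_job.name, download_path="./train_job_artifacts")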