Cloud Integration

Amazon Web Services

Amazon Web Services (AWS) offers EC2 instances with NVIDIA GPU support. NVTabular can be used with 1x, 4x, or 8x GPU instances or with multiple nodes. We use an EC2 instance with 8x NVIDIA A100 GPUs to demonstrate the steps below. Check the hourly price ($/h) of this instance type before launching it, and choose a different instance type if it does not fit your budget.

To run NVTabular on the cloud using AWS, do the following:

Start the AWS EC2 instance with the NVIDIA Deep Learning AMI using the AWS CLI.

# Starts the P4D instance with 8x NVIDIA A100 GPUs (take a look at the $/h for this instance type before using them)
aws ec2 run-instances --image-id ami-04c0416d6bd8e4b1f --count 1 --instance-type p4d.24xlarge --key-name <MyKeyPair> --security-groups <my-sg>

SSH into the machine.
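
The SSH step needs the public address of the instance. The following is a minimal sketch for looking it up with the AWS CLI, assuming you have the instance ID that `run-instances` printed. The function name and the `AWS_CLI` override variable are hypothetical (the override exists only so the function can be dry-run), and the `ubuntu` login user in the example is an assumption that depends on the AMI.

```shell
# Print the public IP address of an EC2 instance.
# AWS_CLI is a hypothetical override (defaults to "aws") that allows a dry run.
get_public_ip() {
    instance_id="$1"
    "${AWS_CLI:-aws}" ec2 describe-instances \
        --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text
}

# Example with placeholders (the login user depends on the AMI):
#   ssh -i <MyKeyPair>.pem ubuntu@"$(get_public_ip <instance-id>)"
```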

Create a RAID volume.

Depending on the EC2 instance type, the machine may include local disk storage. Combining these disks into a RAID 0 volume can improve I/O performance. In our experience, two NVMe volumes yield the best performance. Run the following commands:

sudo mdadm --create --verbose /dev/md0 --level=0 --name=MY_RAID --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1

sudo mkfs.ext4 -L MY_RAID /dev/md0
sudo mkdir -p /mnt/raid
sudo mount LABEL=MY_RAID /mnt/raid

sudo chmod -R 777 /mnt/raid

# Copy dataset inside raid directory:
cp -r data/ /mnt/raid/data/
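
Before copying a large dataset, it can help to confirm that the RAID mount exists and is writable. The following is a minimal sketch; the helper name `check_target_dir` is hypothetical and not part of NVTabular.

```shell
# Succeed if the directory exists and is writable; otherwise report and fail.
check_target_dir() {
    dir="${1:-/mnt/raid}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "not ready: $dir" >&2
        return 1
    fi
}

# Example: check the mount, then show free space and the state of the md array.
#   check_target_dir /mnt/raid && df -h /mnt/raid && cat /proc/mdstat
```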

Launch the NVTabular Docker container by running the following command:

docker run --gpus all --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_PTRACE -v /mnt/raid:/raid nvcr.io/nvidia/nvtabular:0.3 /bin/bash

Start the jupyter-lab server by running the following command:

jupyter-lab --allow-root --ip='0.0.0.0' --NotebookApp.token='<password>'
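
To reach the JupyterLab server from a local browser without opening port 8888 in the security group, one option is an SSH tunnel. The sketch below only prints the command so that it can be reviewed first. The helper name is hypothetical, and the `ubuntu` login user is an assumption that depends on the AMI.

```shell
# Print (do not run) an SSH command that forwards a local port to the instance.
# -N: do not run a remote command; -L: forward a local port.
# "ubuntu" is an assumed login user.
tunnel_cmd() {
    key="$1"; host="$2"; port="${3:-8888}"
    echo "ssh -i $key -N -L $port:localhost:$port ubuntu@$host"
}

tunnel_cmd '<MyKeyPair>.pem' '<public-ip>'
```

With the tunnel open, the server would be reachable at http://localhost:8888 on the local machine.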

Google Cloud Platform

The Google Cloud Platform (GCP) offers Compute Engine instances with NVIDIA GPU support. We’re using a VM with 8x NVIDIA A100 GPUs and eight local SSD-NVMe devices configured as RAID 0 to demonstrate the steps below.

To run NVTabular on the cloud using GCP, do the following:

Configure and create the VM as follows:
- GPU: 8xA100 (a2-highgpu-8g)
- Boot Disk: Ubuntu version 18.04
- Storage: Local 8xSSD-NVMe

Install the NVIDIA drivers and CUDA by running the following commands:

curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt -y update
sudo apt -y install cuda
nvidia-smi # Check installation

For more information, refer to Install GPU drivers in the Google Cloud documentation.
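
Beyond reading the `nvidia-smi` table by eye, the GPU count can be checked in a script. The following is a minimal sketch; the helper name `count_gpus` is hypothetical, and it assumes that a working `nvidia-smi` prints one line per GPU for this query.

```shell
# Print the number of GPUs reported by nvidia-smi, or 0 if it is not installed.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name --format=csv,noheader | wc -l | tr -d ' '
    else
        echo 0
    fi
}

echo "GPUs visible: $(count_gpus)"
```

On the VM configured above, the expected count would be 8.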

Install Docker by running the following commands:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia-merlin.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia-merlin.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get -y update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi # Check Installation

Configure the storage as RAID 0 by running the following commands:

sudo mdadm --create --verbose /dev/md0 --level=0 --name=MY_RAID --raid-devices=2 /dev/nvme0n1 /dev/nvme0n2
sudo mkfs.ext4 -L MY_RAID /dev/md0
sudo mkdir -p /mnt/raid
sudo mount LABEL=MY_RAID /mnt/raid
sudo chmod -R 777 /mnt/raid

# Copy data to RAID
cp -r data/ /mnt/raid/data/
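
The VM described above has eight local SSDs, while the `mdadm` command lists two devices. The following sketch shows how the command could be built for any number of devices. The helper name `build_raid0_cmd` is hypothetical, and it only prints the command instead of running it. Check the actual device names (for example with `lsblk`) before using the output.

```shell
# Print (do not run) an mdadm RAID 0 command for the given devices.
# The first argument is the array name; the remaining arguments are devices.
build_raid0_cmd() {
    name="$1"; shift
    echo "sudo mdadm --create --verbose /dev/md0 --level=0 --name=$name --raid-devices=$# $*"
}

build_raid0_cmd MY_RAID /dev/nvme0n1 /dev/nvme0n2
```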

Launch the NVTabular Docker container by running the following command:

docker run --gpus all --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_PTRACE -v /mnt/raid:/raid nvcr.io/nvidia/nvtabular:0.3 /bin/bash

Databricks

Databricks has developed a web-based platform on top of Apache Spark to provide automated cluster management. Databricks currently supports custom containers.

To run NVTabular on Databricks, do the following:

Create a custom NVTabular container using Databricks runtime.

NOTE: If any of the default dependencies that come with the Databricks cluster are changed to a different version, the Databricks cluster won’t be able to detect the Spark driver. As a workaround, the NVIDIA RAPIDS team has created a Docker container so that RAPIDS can run inside a Databricks cluster.
Extend the container and add NVTabular and PyTorch so that they can run inside Databricks.
Select the appropriate version of the NVTabular Conda repo.

NOTE: All versions of the NVTabular conda repo are listed here.
Clone the cloud-ml-examples repo by running the following command:
```
git clone https://github.com/rapidsai/cloud-ml-examples.git
```

Add the selected version of the NVTabular Conda repo to the rapids-spec.txt file by running the following commands:

cd cloud-ml-examples/databricks
echo "https://conda.anaconda.org/nvidia/linux-64/nvtabular-0.6.1-py38_0.tar.bz2" >> docker/rapids-spec.txt
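
Because `>>` appends every time it runs, repeating the step adds a duplicate line to the spec file. The following is a minimal sketch of an append that is safe to repeat; the helper name `append_once` is hypothetical.

```shell
# Append a line to a file only if an identical line is not already present.
# -x matches whole lines, -F treats the pattern as a fixed string.
append_once() {
    line="$1"; file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once "https://conda.anaconda.org/nvidia/linux-64/nvtabular-0.6.1-py38_0.tar.bz2" docker/rapids-spec.txt
```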

To install PyTorch, add the following line to the Dockerfile. It installs the fastai pip package, which pulls in PyTorch as a dependency:
```
RUN pip install fastai
```

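Build the container and push it to Docker Hub or the AWS Elastic Container Registry by running the following commands:

# Build context is the docker/ directory that holds rapids-spec.txt (assumed to contain the Dockerfile)
docker build --tag <repo_name>/databricks_nvtabular:latest ./docker
docker push <repo_name>/databricks_nvtabular:latest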

Use the custom container to spin up the Databricks cluster.
Select a GPU node for the Worker and Driver. Once the Databricks cluster is up, NVTabular will be running inside of it.

AWS SageMaker

AWS SageMaker is a service from AWS to “build, train and deploy machine learning” models. It automates and manages the MLOps workflow. It supports Jupyter notebook instances, which let users work directly in Jupyter Notebook or JupyterLab without any additional configuration. This section explains how to run NVIDIA Merlin (NVTabular) on AWS SageMaker notebook instances. The procedure is adapted from a Twitter post by Eugene. We tested the workflow on February 1, 2022, but it is not integrated into our CI workflows, so future releases of Merlin or its dependencies can cause issues.

To run the MovieLens example on AWS SageMaker, do the following:

Log in to your AWS console and select AWS SageMaker.
Select Notebook -> Notebook instances -> Create notebook instance. Give the instance a name and select a notebook instance type with GPUs; for example, we selected ml.p3.2xlarge. Review the costs associated with each instance type. As the platform identifier, select notebook-al2-v1. The previous platform identifier runs TensorFlow 2.1.x, and we had more issues updating it to TensorFlow 2.6.x. The volume size can be increased in the Additional configuration section.
After the instance is running, connect to JupyterLab.
Start a terminal to get access to the command line.
The image contains many conda environments, which require ~60GB of disk space. To free disk space, you can remove some of them from the folder /home/ec2-user/anaconda3/envs/
Clone the NVTabular repository and install the conda environment.

cd /home/ec2-user/SageMaker/
git clone https://github.com/NVIDIA-Merlin/NVTabular.git
conda env create -f=NVTabular/conda/environments/nvtabular_aws_sagemaker.yml
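
If `conda env create` fails for lack of disk space, it can help to see which existing environments are the largest before removing any. The following is a minimal sketch; the helper name `env_sizes` is hypothetical, and the default path is the environments folder mentioned above.

```shell
# List the entries of a folder by size in KiB, largest first.
env_sizes() {
    dir="${1:-/home/ec2-user/anaconda3/envs}"
    du -sk "$dir"/* 2>/dev/null | sort -rn
}

# Example: show the five largest environments, then remove one that is unused.
#   env_sizes | head -n 5
#   conda env remove -n <unused-env>
```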

Activate the conda environment:

source /home/ec2-user/anaconda3/etc/profile.d/conda.sh
conda activate nvtabular

Install additional packages, such as TensorFlow or PyTorch:

pip install tensorflow-gpu
pip install torch
pip install graphviz

Install Transformers4Rec, torchmetrics, and ipykernel:

conda install -y -c nvidia -c rapidsai -c numba -c conda-forge transformers4rec
conda install -y torchmetrics ipykernel

Register the conda environment as a Jupyter kernel:

python -m ipykernel install --user --name=nvtabular
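
To confirm that the kernel was registered, the output of `jupyter kernelspec list` can be checked. The following is a minimal sketch; the helper name `has_kernel` is hypothetical, and it assumes the usual listing format of one indented kernel name per line followed by its path.

```shell
# Read `jupyter kernelspec list` output on stdin.
# Succeed if a line starts with the given kernel name.
has_kernel() {
    grep -Eq "^[[:space:]]*$1[[:space:]]"
}

# Example:
#   jupyter kernelspec list | has_kernel nvtabular && echo "kernel registered"
```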

Switch back to JupyterLab, select the nvtabular kernel, and run the MovieLens example.

This workflow enables NVTabular ETL and training with TensorFlow or PyTorch. Deployment with Triton Inference Server will follow soon.
