# SOK DLRM Benchmark

This document demonstrates how to prepare the dataset and run the SOK DLRM benchmark.

## How to Prepare Dataset

We provide two approaches to prepare the data: use the Criteo Terabyte dataset directly, or generate a synthetic dataset with the HugeCTR data generator, as described below.

### How to Prepare Criteo Terabyte Dataset

```bash
git clone https://github.com/NVIDIA-Merlin/HugeCTR.git
cd HugeCTR/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
# $DATA is the target directory for the split dataset
python3 split_bin.py train_data.bin $DATA/train --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"

python3 split_bin.py test_data.bin $DATA/test --slot_size_array="[39884406,39043,17289,7420,20263,3,7120,1543,63,38532951,2953546,403346,10,2208,11938,155,4,976,14,39979771,25641295,39664984,585935,12972,108,36]"
```
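The `--slot_size_array` argument lists the cardinality of each of the 26 categorical slots. As a quick sanity check, you can sum it to get the total number of embedding rows the model must hold; a minimal sketch (the array is the Criteo one used above, and the offset computation is only an illustration of how per-slot keys can be mapped into a single index space):

```python
# Criteo Terabyte slot_size_array used in the commands above.
slot_size_array = [
    39884406, 39043, 17289, 7420, 20263, 3, 7120, 1543, 63,
    38532951, 2953546, 403346, 10, 2208, 11938, 155, 4, 976, 14,
    39979771, 25641295, 39664984, 585935, 12972, 108, 36,
]
assert len(slot_size_array) == 26  # one cardinality per categorical slot

# Total number of embedding rows across all slots (~187.8M).
print(f"total vocabulary: {sum(slot_size_array):,}")

# Cumulative offsets: slot i's keys occupy [offsets[i], offsets[i] + size_i).
offsets, acc = [], 0
for size in slot_size_array:
    offsets.append(acc)
    acc += size
print("first five offsets:", offsets[:5])
```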

### How to Prepare Synthetic Dataset

- Step 1: start a container with native HugeCTR.

The Merlin NGC container with native HugeCTR can be used directly: `nvcr.io/nvidia/merlin/merlin-training:22.05`. To start the container, you can refer to the related instructions here.

```bash
# $YourDataDir is the target directory for the synthetic dataset
docker run --privileged=true --gpus=all -it --rm -v $YourDataDir:/home/workspace nvcr.io/nvidia/merlin/merlin-training:22.05
cd /home/workspace
```
- Step 2: run the following script to generate a synthetic dataset; you can modify `num_samples` and `eval_num_samples` as needed. A quick file-size sanity check follows the script.

```python
import hugectr
from hugectr.tools import DataGenerator, DataGeneratorParams

data_generator_params = DataGeneratorParams(
    format = hugectr.DataReaderType_t.Raw,
    label_dim = 1,
    dense_dim = 13,
    num_slot = 26,
    i64_input_key = False,
    source = "./dlrm_raw/train_data.bin",
    eval_source = "./dlrm_raw/test_data.bin",
    slot_size_array = [203931, 18598, 14092, 7012, 18977, 4, 6385, 1245, 49, 186213, 71328, 67288, 11, 2168, 7338, 61, 4, 932, 15, 204515, 141526, 199433, 60919, 9137, 71, 34],
    nnz_array = [1] * 26,  # one-hot: a single key per slot per sample
    num_samples = 5242880,
    eval_num_samples = 1310720
)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()
```
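Before splitting, it is worth sanity-checking the generated binaries. The sketch below assumes the usual DLRM raw layout of one 4-byte field per label, dense, and categorical value (160 bytes per sample); this layout is an assumption on our part, so adjust it if your HugeCTR version packs the data differently:

```python
import os

# Assumption: 1 label + 13 dense + 26 categorical fields, 4 bytes each.
BYTES_PER_SAMPLE = (1 + 13 + 26) * 4  # 160 bytes

# Sample counts taken from DataGeneratorParams above.
for path, num_samples in [("./dlrm_raw/train_data.bin", 5242880),
                          ("./dlrm_raw/test_data.bin", 1310720)]:
    expected = num_samples * BYTES_PER_SAMPLE
    actual = os.path.getsize(path)
    print(f"{path}: {actual} bytes on disk, {expected} expected")
```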
- Step 3: split the binary files.

```bash
cd /home/workspace
git clone https://github.com/NVIDIA-Merlin/HugeCTR.git

# Note: `--slot_size_array` must match the slot_size_array used in Step 2.
python3 HugeCTR/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/preprocess/split_bin.py ./dlrm_raw/train_data.bin ./splited_dataset/train/ --slot_size_array="[203931,18598,14092,7012,18977,4,6385,1245,49,186213,71328,67288,11,2168,7338,61,4,932,15,204515,141526,199433,60919,9137,71,34]"

python3 HugeCTR/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/preprocess/split_bin.py ./dlrm_raw/test_data.bin ./splited_dataset/test/ --slot_size_array="[203931,18598,14092,7012,18977,4,6385,1245,49,186213,71328,67288,11,2168,7338,61,4,932,15,204515,141526,199433,60919,9137,71,34]"
```

## Environment

```bash
# $YourDataDir is the directory where you saved the dataset
docker run --privileged=true --gpus=all -it --rm -v $YourDataDir:/home/workspace nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.05
```
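Once inside the container, you can quickly confirm that TensorFlow sees all the GPUs passed through by `--gpus=all` before launching the benchmark; a minimal check using the standard TensorFlow API:

```python
import tensorflow as tf

# One physical device should appear per GPU (8 on the benchmark machines).
gpus = tf.config.list_physical_devices("GPU")
print(f"{len(gpus)} GPUs visible")
```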

## How to Run Benchmark

```bash
git clone https://github.com/NVIDIA-Merlin/HugeCTR.git
cd HugeCTR/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# FP32 result with global batch size = 65536
# Note: --lr=24 is tuned for the real Criteo dataset. This learning rate is too large
# for a synthetic dataset and is likely to cause the loss to become NaN.
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=/home/workspace/splited_dataset/ --global_batch_size=65536 --xla --compress --eval_in_last --epochs=1000 --lr=24

# AMP result with global batch size = 65536
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=/home/workspace/splited_dataset/ --global_batch_size=65536 --xla --amp --eval_in_last --epochs=1000 --lr=24

# FP32 result with global batch size = 55296
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=/home/workspace/splited_dataset/ --global_batch_size=55296 --xla --compress --epochs=1000 --lr=24

# AMP result with global batch size = 55296
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=/home/workspace/splited_dataset/ --global_batch_size=55296 --xla --amp --epochs=1000 --lr=24
```
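`horovodrun -np 8` launches one worker per GPU, and the global batch is split across the workers in the usual data-parallel fashion (how `main.py` shards the batch is an assumption on our part, but it matches the standard Horovod setup). A back-of-the-envelope sketch using the synthetic dataset sizes from Step 2:

```python
import math

num_gpus = 8
global_batch_size = 65536
print("per-GPU batch:", global_batch_size // num_gpus)  # 8192

# Synthetic dataset sizes from the data generator above.
train_samples, eval_samples = 5242880, 1310720
print("train steps per epoch:", math.ceil(train_samples / global_batch_size))  # 80
print("eval steps per pass:", math.ceil(eval_samples / global_batch_size))     # 20
```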

Note: For better performance, you can use the custom interact op provided here. After installing it, you can add `--custom_interact` to the commands above (this is optional). Detailed performance numbers can be found in the tables below.

## Performance

### Performance on 8 x A100

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|------------|---------------|--------------------------|-----|-----------------|-----|----------|--------------------------|----------------------------|----------------------|--------------------------------|------------------------------|
| 65536 | 1 epoch | at end | yes | no | no | yes | 8.79 | 0.10 | 8.89 | 8.25 | 8.16M |
| 65536 | 1 epoch | at end | yes | no | yes | no | 6.72 | 0.09 | 6.81 | 6.30 | 10.78M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | no | no | yes | 8.04 | 1.59 | 9.63 | 7.48 | 7.60M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | no | yes | no | 6.52 | 1.94 | 8.46 | 6.07 | 10.45M |
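The last two columns are roughly consistent with throughput ≈ global batch size / average iteration time; the reported throughput runs a few percent higher, presumably because of how the benchmark accounts for per-step overhead, so treat this only as a plausibility check:

```python
# Plausibility check for the first row of the table above.
global_batch_size = 65536
avg_iteration_s = 8.25e-3  # average time of iteration, in seconds

estimate = global_batch_size / avg_iteration_s
print(f"estimated throughput: {estimate / 1e6:.2f}M samples/s")  # ~7.94M vs 8.16M reported
```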

### Performance on 8 x V100

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|------------|---------------|--------------------------|-----|-----------------|-----|----------|--------------------------|----------------------------|----------------------|--------------------------------|------------------------------|
| 65536 | 1 epoch | at end | yes | no | no | yes | 19.25 | 0.21 | 19.46 | 18.04 | 3.66M |
| 65536 | 1 epoch | at end | yes | no | yes | no | 12.91 | 0.19 | 13.10 | 12.10 | 5.53M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | no | no | yes | 18.48 | 4.03 | 22.51 | 16.24 | 3.45M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | no | yes | no | 12.11 | 3.18 | 15.29 | 10.65 | 5.36M |

### Performance with custom interact op

- 8 x A100 (82GB embedding table) with custom interact op:

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|------------|---------------|--------------------------|-----|-----------------|-----|----------|--------------------------|----------------------------|----------------------|--------------------------------|------------------------------|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 5.93 | 0.09 | 6.02 | 5.55 | 12.08M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 5.06 | 0.07 | 5.13 | 4.74 | 14.51M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 5.23 | 1.44 | 6.67 | 4.87 | 11.66M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 4.99 | 1.26 | 6.25 | 4.64 | 12.50M |

- 8 x V100 (82GB embedding table) with custom interact op:

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|------------|---------------|--------------------------|-----|-----------------|-----|----------|--------------------------|----------------------------|----------------------|--------------------------------|------------------------------|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 17.52 | 0.19 | 17.71 | 16.42 | 4.02M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 10.20 | 0.15 | 10.35 | 9.56 | 6.99M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 16.45 | 3.59 | 20.04 | 14.45 | 3.85M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 9.69 | 2.54 | 12.23 | 8.52 | 6.62M |

## Profile

```bash
cd HugeCTR/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# The report is written to the working directory (report1.nsys-rep, or report1.qdrep
# with older nsys versions) and can be opened in the Nsight Systems GUI.
nsys profile --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none --trace-fork-before-exec=true horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --early_stop=30
```