# SOK DLRM Benchmark

This is the SOK DLRM benchmark on the Criteo Terabyte dataset.
## Prepare Dataset

```shell
git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
# $DATA is the target directory for the split dataset
python3 split_bin.py train_data.bin $DATA/train
python3 split_bin.py test_data.bin $DATA/test
```
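If you want to spot-check `train_data.bin` before splitting, the short sketch below reads a few samples. It assumes HugeCTR's raw Criteo layout of 40 int32 values per sample (1 label, 13 dense features, 26 categorical features); if your binary was generated with a different configuration, adjust the constants accordingly.

```python
import numpy as np

# Assumed raw layout: 1 label + 13 dense + 26 categorical int32 values per sample.
INTS_PER_SAMPLE = 1 + 13 + 26

def peek_samples(path, num_samples=4):
    """Print the first few samples of a HugeCTR-style raw binary file."""
    raw = np.fromfile(path, dtype=np.int32, count=num_samples * INTS_PER_SAMPLE)
    for row in raw.reshape(-1, INTS_PER_SAMPLE):
        label, dense, cats = row[0], row[1:14], row[14:]
        print(f"label={label}  dense[:4]={dense[:4]}  cats[:4]={cats[:4]}")

peek_samples("train_data.bin")
```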
## Environment

```shell
docker run --privileged=true --gpus=all -it --rm nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.11

git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/
mkdir build
cd build
# use "-DSM=70" on V100
cmake -DSM=80 ..
make -j
make install
cd ..
cp -r sparse_operation_kit /usr/local/lib/
```
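After the build, you can quickly verify that the copied package is importable inside the container. The snippet assumes `/usr/local/lib/` is on Python's module search path (as in the `cp` step above); otherwise append it to `sys.path` first.

```python
import sys

# Only needed if /usr/local/lib/ is not already on the module search path.
sys.path.append("/usr/local/lib/")

import sparse_operation_kit as sok
print("SOK imported from:", sok.__file__)
```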
## Run Benchmark

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# FP32 result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --eval_in_last

# AMP result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --amp --custom_interact --eval_in_last
```
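For orientation, `main.py` is responsible for per-process GPU pinning and for sharding the embedding tables across workers. A minimal sketch of the usual Horovod + SOK initialization pattern is shown below; the exact call names (in particular `sok.Init`) follow the SOK 1.x TensorFlow API and should be treated as an assumption, so check `main.py` for the authoritative version.

```python
import tensorflow as tf
import horovod.tensorflow as hvd
import sparse_operation_kit as sok

hvd.init()

# Pin each worker process to a single GPU before any TensorFlow op runs.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)

# Initialize SOK so the embedding tables are sharded across all workers.
sok.Init(global_batch_size=65536)
```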
## Performance
8 x A100 (82GB embedding table):
| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluation time (minutes) | total time (minutes) | average time per iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 5.93 | 0.09 | 6.02 | 5.55 | 12.08M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 5.06 | 0.07 | 5.13 | 4.74 | 14.51M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 5.23 | 1.44 | 6.67 | 4.87 | 11.66M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 4.99 | 1.26 | 6.25 | 4.64 | 12.50M |
8 x V100 (82GB embedding table):
| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluation time (minutes) | total time (minutes) | average time per iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 17.52 | 0.19 | 17.71 | 16.42 | 4.02M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 10.20 | 0.15 | 10.35 | 9.56 | 6.99M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 16.45 | 3.59 | 20.04 | 14.45 | 3.85M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 9.69 | 2.54 | 12.23 | 8.52 | 6.62M |
Note: The custom interact op can be found here; be careful to clone it from https://github.com/NVIDIA/DeepLearningExamples/ rather than from https://gitlab-master.nvidia.com/wraveane/tensorflow-dot-based-interact as written in its README.
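For context, the `--custom_interact` option swaps in a fused CUDA kernel for DLRM's feature-interaction step. A plain-TensorFlow sketch of what that step computes (pairwise dot products between the bottom-MLP output and the embedding vectors, concatenated back with the bottom-MLP output) is shown below; the fused op additionally handles padding and mixed-precision details that are omitted here.

```python
import tensorflow as tf

def dot_interaction(bottom_mlp_out, embeddings):
    """Un-fused reference of the DLRM feature interaction.

    bottom_mlp_out: [batch, emb_dim] output of the bottom MLP.
    embeddings:     [batch, num_tables, emb_dim] gathered embedding vectors.
    """
    # Stack the dense vector with the embeddings: [batch, num_tables + 1, emb_dim].
    features = tf.concat([tf.expand_dims(bottom_mlp_out, 1), embeddings], axis=1)
    # All pairwise dot products: [batch, num_tables + 1, num_tables + 1].
    interactions = tf.matmul(features, features, transpose_b=True)
    # Keep only the strictly lower triangle (each unique pair once, no self-interactions).
    num_features = features.shape[1]
    ones = tf.ones((num_features, num_features))
    lower = tf.linalg.band_part(ones, -1, 0) - tf.linalg.band_part(ones, 0, 0)
    flat = tf.boolean_mask(interactions, tf.cast(lower, tf.bool), axis=1)
    # Concatenate with the bottom-MLP output as input to the top MLP.
    return tf.concat([bottom_mlp_out, flat], axis=1)
```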
## Profile

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/
nsys profile --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none --trace-fork-before-exec=true horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --early_stop=30
```