# SOK DLRM Benchmark

This is the SOK DLRM benchmark on the Criteo Terabyte dataset.

## Prepare Dataset

```shell
git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/documents/tutorials/DLRM_Benchmark/
# train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
# $DATA is the target directory for the split dataset
python3 split_bin.py train_data.bin $DATA/train
python3 split_bin.py test_data.bin $DATA/test
```
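The actual splitting logic lives in `split_bin.py` in the repo. As a rough illustration only (this sketch is an assumption, not the real script), splitting a fixed-record binary file into smaller shards can look like this; the record size of 160 bytes (1 label + 13 dense + 26 categorical 4-byte values) and the shard size are assumed values:

```python
import os
import sys

RECORD_BYTES = 160        # assumed: (1 label + 13 dense + 26 categorical) * 4 bytes
SHARD_RECORDS = 1_000_000  # records per output shard (arbitrary choice)

def split_bin(src_path, dst_dir):
    """Chunk a fixed-record binary file into smaller shard files in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    shard = 0
    with open(src_path, "rb") as src:
        while True:
            buf = src.read(RECORD_BYTES * SHARD_RECORDS)
            if not buf:
                break
            with open(os.path.join(dst_dir, f"shard_{shard:05d}.bin"), "wb") as dst:
                dst.write(buf)
            shard += 1

if __name__ == "__main__" and len(sys.argv) == 3:
    split_bin(sys.argv[1], sys.argv[2])
```

Refer to the shipped `split_bin.py` for the layout the benchmark actually expects.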

## Environment

```shell
docker run --privileged=true --gpus=all -it --rm nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.11
git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/
mkdir build
cd build
# use "-DSM=70" on V100
cmake -DSM=80 ..
make -j
make install
cd ..
cp -r sparse_operation_kit /usr/local/lib/
```
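Before launching the benchmark, it can help to confirm the build is actually importable from Python. A minimal check, assuming the package import name `sparse_operation_kit`:

```python
# Smoke test: is sparse_operation_kit importable after "make install"?
import importlib.util

def sok_available() -> bool:
    """Return True if the sparse_operation_kit package can be found."""
    return importlib.util.find_spec("sparse_operation_kit") is not None

print("SOK importable:", sok_available())
```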

## Run Benchmark

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/
# FP32 result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --eval_in_last

# AMP result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --amp --custom_interact --eval_in_last
```
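Note that `--global_batch_size` is the batch across all workers: with `horovodrun -np 8`, each GPU processes a fraction of it. Assuming `main.py` follows the usual Horovod convention of an even split (an assumption, not verified against the script), the per-GPU batch works out as:

```python
# Per-GPU batch under Horovod data parallelism (assumed even split).
def per_replica_batch(global_batch_size: int, num_gpus: int) -> int:
    assert global_batch_size % num_gpus == 0, "global batch must divide evenly"
    return global_batch_size // num_gpus

print(per_replica_batch(65536, 8))  # 8192 samples per GPU
print(per_replica_batch(55296, 8))  # 6912 samples per GPU
```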

## Performance

8 x A100 (82GB embedding table):

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 5.93 | 0.09 | 6.02 | 5.55 | 12.08M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 5.06 | 0.07 | 5.13 | 4.74 | 14.51M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 5.23 | 1.44 | 6.67 | 4.87 | 11.66M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 4.99 | 1.26 | 6.25 | 4.64 | 12.50M |

8 x V100 (82GB embedding table):

| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluating time (minutes) | total time (minutes) | average time of iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 17.52 | 0.19 | 17.71 | 16.42 | 4.02M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 10.20 | 0.15 | 10.35 | 9.56 | 6.99M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 16.45 | 3.59 | 20.04 | 14.45 | 3.85M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 9.69 | 2.54 | 12.23 | 8.52 | 6.62M |
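The throughput and iteration-time columns can be cross-checked against each other: throughput is roughly batch size divided by average iteration time. The figures do not match exactly, presumably because the reported throughput is measured somewhat differently (an assumption), but they agree to within a few percent:

```python
# Sanity check: throughput implied by batch size and average iteration time.
def throughput(batch_size: int, iter_ms: float) -> float:
    """Samples per second for one training iteration of iter_ms milliseconds."""
    return batch_size / (iter_ms / 1000.0)

# A100 FP32 row: 65536 samples @ 5.55 ms/iter
print(f"{throughput(65536, 5.55) / 1e6:.2f}M")  # 11.81M, close to the reported 12.08M
```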

Note: be careful to clone the custom interact op from https://github.com/NVIDIA/DeepLearningExamples/, not from https://gitlab-master.nvidia.com/wraveane/tensorflow-dot-based-interact as written in the README.

## Profile

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/
nsys profile --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none --trace-fork-before-exec=true horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --early_stop=30
```