# SOK DLRM Benchmark

This is the SOK DLRM benchmark on the Criteo Terabyte dataset.
## Prepare Dataset

```shell
git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# train_data.bin and test_data.bin are the binary datasets generated by HugeCTR
# $DATA is the target directory for the split dataset
python3 split_bin.py train_data.bin $DATA/train
python3 split_bin.py test_data.bin $DATA/test
```
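If you want to spot-check `train_data.bin` before splitting, the short sketch below reads a few samples. It assumes HugeCTR's raw Criteo layout of 40 int32 values per sample (1 label, 13 dense features, 26 categorical features); if your binary was generated with a different configuration, adjust the constants accordingly.

```python
import numpy as np

# Assumed raw layout: 1 label + 13 dense + 26 categorical int32 values per sample.
INTS_PER_SAMPLE = 1 + 13 + 26

def peek_samples(path, num_samples=4):
    """Print the first few samples of a HugeCTR-style raw binary file."""
    raw = np.fromfile(path, dtype=np.int32, count=num_samples * INTS_PER_SAMPLE)
    for row in raw.reshape(-1, INTS_PER_SAMPLE):
        label, dense, cats = row[0], row[1:14], row[14:]
        print(f"label={label}  dense[:4]={dense[:4]}  cats[:4]={cats[:4]}")

peek_samples("train_data.bin")
```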
## Environment

```shell
docker run --privileged=true --gpus=all -it --rm nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.11

git clone https://gitlab-master.nvidia.com/dl/hugectr/hugectr.git
cd hugectr/
cd sparse_operation_kit/
mkdir build
cd build
# use "-DSM=70" on V100
cmake -DSM=80 ..
make -j
make install
cd ..
cp -r sparse_operation_kit /usr/local/lib/
```
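After the build, you can quickly verify that the copied package is importable inside the container. The snippet assumes `/usr/local/lib/` is on Python's module search path (as in the `cp` step above); otherwise append it to `sys.path` first.

```python
import sys

# Only needed if /usr/local/lib/ is not already on the module search path.
sys.path.append("/usr/local/lib/")

import sparse_operation_kit as sok
print("SOK imported from:", sok.__file__)
```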
## Run Benchmark

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/

# FP32 result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --eval_in_last

# AMP result
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --amp --custom_interact --eval_in_last
```
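For orientation, `main.py` is responsible for per-process GPU pinning and for sharding the embedding tables across workers. A minimal sketch of the usual Horovod + SOK initialization pattern is shown below; the exact call names (in particular `sok.Init`) follow the SOK 1.x TensorFlow API and should be treated as an assumption, so check `main.py` for the authoritative version.

```python
import tensorflow as tf
import horovod.tensorflow as hvd
import sparse_operation_kit as sok

hvd.init()

# Pin each worker process to a single GPU before any TensorFlow op runs.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)

# Initialize SOK so the embedding tables are sharded across all workers.
sok.Init(global_batch_size=65536)
```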
## Performance
8 x A100 (82GB embedding table):
| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluation time (minutes) | total time (minutes) | average time per iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 5.93 | 0.09 | 6.02 | 5.55 | 12.08M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 5.06 | 0.07 | 5.13 | 4.74 | 14.51M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 5.23 | 1.44 | 6.67 | 4.87 | 11.66M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 4.99 | 1.26 | 6.25 | 4.64 | 12.50M |
8 x V100 (82GB embedding table):
| batch size | exit criteria | frequency of evaluation | xla | custom interact | amp | compress | training time (minutes) | evaluation time (minutes) | total time (minutes) | average time per iteration (ms) | throughput (samples/second) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65536 | 1 epoch | at end | yes | yes | no | yes | 17.52 | 0.19 | 17.71 | 16.42 | 4.02M |
| 65536 | 1 epoch | at end | yes | yes | yes | no | 10.20 | 0.15 | 10.35 | 9.56 | 6.99M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | no | yes | 16.45 | 3.59 | 20.04 | 14.45 | 3.85M |
| 55296 | AUC > 0.8025 | every 3793 steps | yes | yes | yes | no | 9.69 | 2.54 | 12.23 | 8.52 | 6.62M |
Note: The custom interact op can be found here; be careful to clone it from https://github.com/NVIDIA/DeepLearningExamples/ rather than from https://gitlab-master.nvidia.com/wraveane/tensorflow-dot-based-interact as written in its README.
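For context, the `--custom_interact` option swaps in a fused CUDA kernel for DLRM's feature-interaction step. A plain-TensorFlow sketch of what that step computes (pairwise dot products between the bottom-MLP output and the embedding vectors, concatenated back with the bottom-MLP output) is shown below; the fused op additionally handles padding and mixed-precision details that are omitted here.

```python
import tensorflow as tf

def dot_interaction(bottom_mlp_out, embeddings):
    """Un-fused reference of the DLRM feature interaction.

    bottom_mlp_out: [batch, emb_dim] output of the bottom MLP.
    embeddings:     [batch, num_tables, emb_dim] gathered embedding vectors.
    """
    # Stack the dense vector with the embeddings: [batch, num_tables + 1, emb_dim].
    features = tf.concat([tf.expand_dims(bottom_mlp_out, 1), embeddings], axis=1)
    # All pairwise dot products: [batch, num_tables + 1, num_tables + 1].
    interactions = tf.matmul(features, features, transpose_b=True)
    # Keep only the strictly lower triangle (each unique pair once, no self-interactions).
    num_features = features.shape[1]
    ones = tf.ones((num_features, num_features))
    lower = tf.linalg.band_part(ones, -1, 0) - tf.linalg.band_part(ones, 0, 0)
    flat = tf.boolean_mask(interactions, tf.cast(lower, tf.bool), axis=1)
    # Concatenate with the bottom-MLP output as input to the top MLP.
    return tf.concat([bottom_mlp_out, flat], axis=1)
```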
## Profile

```shell
cd hugectr/sparse_operation_kit/documents/tutorials/DLRM_Benchmark/
nsys profile --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none --trace-fork-before-exec=true horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=$DATA --global_batch_size=65536 --xla --compress --custom_interact --early_stop=30
```