Performance of demo model using Dense Embedding Layer

This page reports the performance of the demo model introduced in Examples/DenseDemo.

Profiling commands

If MPI or multiple CPU processes are used, add --trace-fork-before-exec=true so that the timelines for all GPUs are collected.

nsys profile --trace=nvtx,cuda --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none -f true -o profiling_filename \
python3 script.py --arguments
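When the same profiling run is launched from a script, the command can be assembled programmatically. The following is a minimal sketch, not part of the demo: the helper name `build_nsys_command` and its parameters are hypothetical, and only the nsys flags listed above are used.

```python
import shlex


def build_nsys_command(output_name, script, script_args=(), multi_process=False):
    """Return the nsys command line as a list of arguments.

    Hypothetical helper: it only arranges the flags shown in this document.
    """
    cmd = [
        "nsys", "profile",
        "--trace=nvtx,cuda",
        "--sample=none",
        "--backtrace=none",
        "--cudabacktrace=none",
        "--cpuctxsw=none",
    ]
    if multi_process:
        # Needed when MPI or multiple CPU processes are used,
        # so that the timelines for all GPUs are collected.
        cmd.append("--trace-fork-before-exec=true")
    cmd += ["-f", "true", "-o", output_name, "python3", script, *script_args]
    return cmd


if __name__ == "__main__":
    command = build_nsys_command(
        "profiling_filename", "script.py", ["--arguments"], multi_process=True
    )
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(command))
```

Passing the returned list to `subprocess.run` would start the profiling run; printing it first makes it easy to check the flags.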

Infrastructure

TensorFlow 2.5
embedding_vec_size: 4
slot_num: 100
nnz_per_slot: 10
batch size for a single GPU: 8192
batch size for 8 GPUs: 65536
DGX A100
NsightSystems-linux-cli-public-2021.2.1.58
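To get a feel for the workload these settings imply, the number of embedding keys looked up per iteration can be estimated. This is a rough sketch that assumes every sample carries exactly slot_num * nnz_per_slot keys and that the 8-GPU batch is split evenly across GPUs; the function names are hypothetical.

```python
# Settings copied from the list above.
SLOT_NUM = 100
NNZ_PER_SLOT = 10
EMBEDDING_VEC_SIZE = 4


def keys_per_batch(batch_size, slot_num=SLOT_NUM, nnz_per_slot=NNZ_PER_SLOT):
    """Keys looked up per iteration, assuming every slot is fully populated."""
    return batch_size * slot_num * nnz_per_slot


def values_per_batch(batch_size, vec_size=EMBEDDING_VEC_SIZE):
    """Embedding vector elements gathered per iteration under the same assumption."""
    return keys_per_batch(batch_size) * vec_size


if __name__ == "__main__":
    for label, batch_size, gpus in (("1 GPU", 8192, 1), ("8 GPUs", 65536, 8)):
        print(
            f"{label}: {batch_size // gpus} samples per GPU, "
            f"{keys_per_batch(batch_size)} keys, "
            f"{values_per_batch(batch_size)} vector elements per iteration"
        )
```

Under these assumptions the per-GPU batch size is 8192 in both configurations, so the 8-GPU run is a weak-scaling measurement.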

Performance Numbers

End-to-end elapsed time (milliseconds)

|             | 1 GPU  | 8 GPUs |
| ----------- | ------ | ------ |
| Original TF | 179.85 | N/A    |
| SOK         | 25.90  | 45.36  |

Queries per second

|             | 1 GPU  | 8 GPUs  |
| ----------- | ------ | ------- |
| Original TF | 45548  | N/A     |
| SOK         | 316269 | 1444925 |
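The throughput figures are consistent with dividing the batch size by the end-to-end elapsed time. The sketch below reproduces that arithmetic and derives the speedup and multi-GPU scaling from the numbers above; the function names are hypothetical, and small differences from the reported values are expected because the elapsed times are rounded to two decimals.

```python
def queries_per_second(batch_size, elapsed_ms):
    """Samples processed per second for one iteration of the given duration."""
    return batch_size / (elapsed_ms / 1000.0)


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


def scaling_efficiency(multi_gpu_qps, single_gpu_qps, num_gpus):
    """Fraction of ideal linear scaling achieved by the multi-GPU run."""
    return multi_gpu_qps / (num_gpus * single_gpu_qps)


if __name__ == "__main__":
    # Elapsed times and batch sizes taken from this document.
    tf_1gpu = queries_per_second(8192, 179.85)
    sok_1gpu = queries_per_second(8192, 25.90)
    sok_8gpu = queries_per_second(65536, 45.36)

    print(f"Original TF, 1 GPU: {tf_1gpu:.0f} queries/s")
    print(f"SOK, 1 GPU:         {sok_1gpu:.0f} queries/s")
    print(f"SOK, 8 GPUs:        {sok_8gpu:.0f} queries/s")
    print(f"SOK speedup over TF on 1 GPU: {speedup(179.85, 25.90):.2f}x")
    print(f"SOK 8-GPU scaling efficiency: {scaling_efficiency(sok_8gpu, sok_1gpu, 8):.0%}")
```

With the reported numbers, SOK is roughly 6.9 times faster than the original TensorFlow implementation on one GPU, and the 8-GPU run reaches about 57% of ideal linear scaling.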