Performance of demo model using Dense Embedding Layer
The performance of demo model introduced in Examples/DenseDemo.
Profiling commands
Add --trace-fork-before-exec=true
if MPI or multiple CPU processes is used to collect the timelines for all GPUs.
nsys profile --trace=nvtx,cuda --sample=none --backtrace=none --cudabacktrace=none --cpuctxsw=none -f true -o profiling_filename \
python3 script.py --arguments
Infrastructure
TensorFlow 2.5
embedding_vec_size: 4
slot_num: 100
nnz_per_slot: 10
batchsize for single GPU: 8192
batchsize for 8 GPUs: 65536
DGX A100
NsightSystems-linux-cli-public-2021.2.1.58
Performance Numbers
end2end elapsed time (miliseconds)
1 GPU |
8 GPUs |
|
---|---|---|
Original TF |
179.85 |
—— |
SOK |
25.90 |
45.36 |
Query per seconds
1 GPU |
8 GPUs |
|
---|---|---|
Original TF |
45548 |
—— |
SOK |
316269 |
1444925 |