Demo model using Dense Embedding Layer

This file demonstrates how to build a DNN model with a dense embedding layer, in which no reduction is performed within each slot (feature field), using TensorFlow and SparseOperationKit.

You can find the source code in sparse_operation_kit/documents/tutorials/DenseDemo/.


Required Python modules: cupy, mpi4py, nvtx

model structure

This demo model consists of a dense embedding layer followed by 7 fully connected layers. The first 6 fully connected layers each have 1024 output units, and the last one has 1 output unit.
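To make the shapes concrete, here is a minimal NumPy sketch of the structure described above. It is not the demo's actual TensorFlow/SOK code: the function names `build_params` and `forward`, the ReLU activation, and the weight initialization are assumptions made for illustration. The point it shows is that a dense embedding layer keeps one vector per key (no sum or mean within a slot), so the first fully connected layer sees `slot_num * nnz_per_slot * embedding_vec_size` inputs.

```python
import numpy as np


def build_params(vocabulary_size, slot_num, nnz_per_slot, embedding_vec_size,
                 num_dense_layers=6, units=1024, seed=0):
    """Create an embedding table plus (num_dense_layers + 1) dense layers.

    Hypothetical helper for illustration only; initialization is arbitrary.
    """
    rng = np.random.default_rng(seed)
    table = rng.normal(scale=0.01, size=(vocabulary_size, embedding_vec_size))
    # No reduction within a slot: every key contributes its own vector.
    in_dim = slot_num * nnz_per_slot * embedding_vec_size
    layers = []
    for out_dim in [units] * num_dense_layers + [1]:
        w = rng.normal(scale=np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))
        b = np.zeros(out_dim)
        layers.append((w, b))
        in_dim = out_dim
    return table, layers


def forward(keys, table, layers):
    """keys: integer array of shape [batch, slot_num, nnz_per_slot]."""
    emb = table[keys]                    # [batch, slot_num, nnz_per_slot, vec]
    x = emb.reshape(keys.shape[0], -1)   # concatenate all vectors per sample
    for w, b in layers[:-1]:
        x = np.maximum(x @ w + b, 0.0)   # ReLU is an assumption
    w, b = layers[-1]
    return x @ w + b                     # logits, shape [batch, 1]


if __name__ == "__main__":
    table, layers = build_params(vocabulary_size=8192, slot_num=100,
                                 nnz_per_slot=10, embedding_vec_size=4)
    keys = np.random.default_rng(1).integers(0, 8192, size=(2, 100, 10))
    print(forward(keys, table, layers).shape)
```

With the flag values used later in this document (`slot_num=100`, `nnz_per_slot=10`, `embedding_vec_size=4`), the first fully connected layer would receive 4000 inputs per sample under this sketch.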


Generate datasets

This command generates a random dataset. By default, the output file is named data.file; to use a different name, add --filename=XXX to the command.

$ python3 \
    --global_batch_size=65536 \
    --slot_num=100 \
    --nnz_per_slot=10 \

Split the whole dataset into multiple shards

When MPI is used, each CPU process should have its own data reader, and each data reader should read from a different data source. The whole dataset is therefore split into shards.

The split files are saved as save_prefix[split_id].file, for example, data_0.file, data_1.file. Samples are assigned to shards in order, as contiguous blocks. For instance, if the whole dataset is [s0, s1, s2, s3, s4, s5, s6, s7] and it is split into 4 shards, each shard holds 2 samples: [s0, s1], [s2, s3], [s4, s5], [s6, s7], respectively.
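The linear arrangement described above can be sketched in a few lines of plain Python. The helper names `split_linearly` and `shard_filename` are hypothetical and are not taken from the demo scripts. The sketch also assumes the number of samples divides evenly by the number of shards; how the real script treats a remainder is not stated here.

```python
def split_linearly(samples, split_num):
    """Split samples into split_num contiguous, equally sized shards."""
    if split_num <= 0:
        raise ValueError("split_num must be positive")
    if len(samples) % split_num != 0:
        # Assumption: only the evenly divisible case is handled.
        raise ValueError("number of samples must be divisible by split_num")
    shard_size = len(samples) // split_num
    return [samples[i * shard_size:(i + 1) * shard_size]
            for i in range(split_num)]


def shard_filename(save_prefix, split_id):
    """Build a shard name following the save_prefix[split_id].file pattern."""
    return f"{save_prefix}{split_id}.file"


if __name__ == "__main__":
    samples = [f"s{i}" for i in range(8)]
    for split_id, shard in enumerate(split_linearly(samples, 4)):
        print(shard_filename("data_", split_id), shard)
```

Because each shard is a contiguous block, the process with split ID `k` reads exactly the samples it would have seen at positions `k * shard_size` through `(k + 1) * shard_size - 1` of the unsplit file.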

$ python3 \
    --filename="./data.file" \
    --split_num=8 \

Run this demo written with TensorFlow

This version uses only 1 GPU.

$ python3 \
    --data_filename="./data.file" \
    --global_batch_size=8192 \
    --vocabulary_size=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \

Run this demo written with SOK + MirroredStrategy

$ python3 \
    --data_filename="./data.file" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \

Run this demo written with SOK + MultiWorkerMirroredStrategy + MPI

Add --oversubscribe to the mpiexec command if there are not enough slots.

$ mpiexec -n 8 --allow-run-as-root \
    python3 \
    --data_filename="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --data_splited=1 \
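
The flag values in the commands above are related by simple arithmetic, sketched below. The sketch rests on two assumptions that this document does not state explicitly: each of the 8 MPI processes drives one GPU, and the global batch is divided evenly among the GPUs. The helper names are hypothetical. Under those assumptions, each GPU processes 8192 samples per step, which matches the global batch size of the single-GPU TensorFlow run, and the per-GPU vocabulary limit adds up to 65536 keys across all GPUs.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Samples handled by each replica per step, assuming an even split."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("global_batch_size must be divisible by num_replicas")
    return global_batch_size // num_replicas


def total_vocabulary_capacity(max_vocabulary_size_per_gpu, num_gpus):
    """Total keys that fit across all GPUs (assumes capacities simply add up)."""
    return max_vocabulary_size_per_gpu * num_gpus


if __name__ == "__main__":
    num_gpus = 8  # assumption: one GPU per MPI process (mpiexec -n 8)
    print(per_replica_batch_size(65536, num_gpus))
    print(total_vocabulary_capacity(8192, num_gpus))
```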