Demo model using Dense Embedding Layer

This file demonstrates how to build a DNN model with a dense embedding layer, where no reduction is conducted within each slot (feature field), using TensorFlow and SparseOperationKit.
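To make "no reduction" concrete, here is a minimal TensorFlow sketch (sizes are illustrative, not taken from the demo scripts) contrasting a dense lookup, which keeps every embedding vector, with a reduced lookup, which combines the vectors inside each slot:

import tensorflow as tf

vocabulary_size, embedding_vec_size = 8192, 4
batch_size, slot_num, nnz_per_slot = 2, 3, 10

params = tf.random.uniform([vocabulary_size, embedding_vec_size])
keys = tf.random.uniform([batch_size, slot_num, nnz_per_slot],
                         maxval=vocabulary_size, dtype=tf.int64)

# Dense embedding: one vector per key, nothing is combined inside a slot.
dense_out = tf.nn.embedding_lookup(params, keys)
print(dense_out.shape)    # (2, 3, 10, 4)

# A reduced ("sparse") embedding would instead combine the nnz_per_slot
# vectors of each slot, e.g. by summation, yielding (2, 3, 4).
reduced_out = tf.reduce_sum(dense_out, axis=2)
print(reduced_out.shape)  # (2, 3, 4)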

You can find the source code in sparse_operation_kit/documents/tutorials/DenseDemo/.

Requirements

Python modules: cupy, mpi4py, nvtx

Model structure

This demo model consists of a dense embedding layer followed by 7 fully connected layers: the first 6 fully connected layers have 1024 output units each, and the last one has 1 output unit.
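The fully connected stack can be sketched as follows (a rough outline assuming ReLU activations in the hidden layers; check the demo scripts for the exact activations and initializers):

import tensorflow as tf

def build_dense_stack(num_dense_layers=6, units=1024):
    # num_dense_layers hidden layers of 1024 units, plus a 1-unit output
    # layer, giving the 7 fully connected layers described above.
    layers = [tf.keras.layers.Dense(units, activation="relu")
              for _ in range(num_dense_layers)]
    layers.append(tf.keras.layers.Dense(1))
    return tf.keras.Sequential(layers)

The embedding output of shape [batch_size, slot_num, nnz_per_slot, embedding_vec_size] would typically be flattened per sample before entering this stack.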

Steps

Generate the dataset

This command generates a random dataset. By default, the output filename is data.file; you can specify a different filename by adding --filename=XXX when running the command.

$ python3 gen_data.py \
    --global_batch_size=65536 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --iter_num=30 
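For a sense of scale: with these arguments the dataset holds global_batch_size * iter_num = 65536 * 30 = 1,966,080 samples, each carrying slot_num * nnz_per_slot = 1,000 keys. A rough sketch of equivalent random data (illustrative only; the on-disk format is defined by gen_data.py):

import numpy as np

global_batch_size, slot_num, nnz_per_slot, iter_num = 65536, 100, 10, 30
num_samples = global_batch_size * iter_num  # 1,966,080
keys = np.random.randint(0, 8192,  # hypothetical vocabulary size
                         size=(num_samples, slot_num, nnz_per_slot),
                         dtype=np.int64)
labels = np.random.randint(0, 2, size=(num_samples, 1)).astype(np.float32)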

Split the whole dataset into multiple shards

When MPI is used, we'd like each CPU process to have its own data reader, with each data reader reading from a different data source. Therefore, the whole dataset is split into shards.

The split files will be saved as save_prefix[split_id].file, for example data_0.file, data_1.file, and so on. The samples in each shard are arranged linearly: for instance, if the whole sample sequence is [s0, s1, s2, s3, s4, s5, s6, s7] and it is split into 4 shards, each shard owns 2 samples, namely [s0, s1], [s2, s3], [s4, s5], [s6, s7], respectively (see the sketch after the command below).

$ python3 split_data.py \
    --filename="./data.file" \
    --split_num=8 \
    --save_prefix="./data_"
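The linear sharding amounts to an even split of the sample axis (illustrative; split_data.py defines the actual file I/O):

import numpy as np

samples = np.arange(8)              # stands in for [s0, s1, ..., s7]
split_num = 4
shards = np.split(samples, split_num)
# shards -> [0, 1], [2, 3], [4, 5], [6, 7]
# With save_prefix "./data_", shard i would be written to f"./data_{i}.file".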

Run this demo written with TensorFlow

This is a model-parallelism demo implemented with native TensorFlow APIs.

$ mpiexec -n 8 --allow-run-as-root \
    python3 run_tf.py \
    --data_filename="./data_" \
    --global_batch_size=65536 \
    --vocabulary_size=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam" \
    --data_splited=1
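Because the dataset was split beforehand (--data_splited=1), each MPI process can derive its own shard name from its rank and the common prefix. A sketch of that lookup (the demo script's actual argument handling may differ):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
data_filename_prefix = "./data_"
my_shard = f"{data_filename_prefix}{rank}.file"  # e.g. ./data_3.file on rank 3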

Run this demo written with SOK + MirroredStrategy

$ python3 run_sok_MirroredStrategy.py \
    --data_filename="./data.file" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam" 
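The key difference from the plain TensorFlow run is that SOK must be initialized inside the strategy scope. A minimal sketch following the SOK v1 initialization pattern (verify sok.Init and its arguments against your installed SOK version):

import tensorflow as tf
import sparse_operation_kit as sok

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Initialize SOK before building any SOK embedding layers.
    sok.Init(global_batch_size=65536)
    # ... build the model (SOK dense embedding + dense stack) here ...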

Run this demo written with SOK + MultiWorkerMirroredStrategy + MPI

Add --oversubscribe to mpiexec if there are not enough slots.

$ mpiexec -n 8 --allow-run-as-root \
    python3 run_sok_MultiWorker_mpi.py \
    --data_filename="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=8192 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --data_splited=1 \
    --optimizer="adam"
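MultiWorkerMirroredStrategy discovers its peers through the TF_CONFIG environment variable. One common way to derive it from the MPI rank (a sketch with hypothetical ports; run_sok_MultiWorker_mpi.py may configure its workers differently):

import json, os
from mpi4py import MPI
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
workers = [f"localhost:{12345 + i}" for i in range(size)]  # hypothetical ports
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": rank},
})
strategy = tf.distribute.MultiWorkerMirroredStrategy()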

Run this demo written with SOK + Horovod

$ horovodrun -np 8 -H localhost:8 \
    python3 run_sok_horovod.py \
    --data_filename_prefix="./data_" \
    --global_batch_size=65536 \
    --max_vocabulary_size_per_gpu=1024 \
    --slot_num=100 \
    --nnz_per_slot=10 \
    --num_dense_layers=6 \
    --embedding_vec_size=4 \
    --optimizer="adam"
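Under Horovod, each process pins exactly one GPU before doing any TensorFlow work. The standard Horovod + TF2 setup looks like this (a sketch; run_sok_horovod.py additionally initializes SOK):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # One process per GPU: make only the local GPU visible.
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")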