# Features in SparseOperationKit #

## Model-Parallelism GPU Embedding Layer ##

SOK provides GPU embedding layers that take advantage of model parallelism. No further data transformation from model parallelism to data parallelism is required. SOK implements several GPU embedding layers that use different algorithms to deliver maximum performance in different application scenarios. SOK supports both single-machine and multi-machine cluster deployments.

![avatar](../images/workflow_of_embeddinglayer.png)

## Sparse Embedding Layer ##

The sparse embedding layer is equivalent to `tf.nn.embedding_lookup_sparse`, except that the sparse embedding layer in SOK operates in a model-parallel manner. The sparse embedding layer supports the `Mean` and `Sum` combiners.

### Distributed Sparse Embedding ###

The distributed sparse embedding scatters keys across GPUs by computing `gpu_id = key % number_of_gpus`. For example, if there are 8 GPUs, then `key=1000` is placed on GPU-0 and `key=1001` is placed on GPU-1. The following picture depicts its forward propagation process.

```{image} ../images/distributed_sparse_embedding.png
:class: bg_primary
:width: 50%
:align: center
```

To reduce the overhead of looking up multiple embedding tables with identical embedding vector sizes, the distributed sparse embedding combines them into one huge embedding table. Each sub-embedding table is called a slot, also known as a feature field. To avoid ambiguity, the input keys across the embedding tables must be represented with a unified encoding.

When reducing the embedding vectors within slots (feature fields), SOK uses the collective operation `Reduce-Scatter`. During backward propagation, `All-Gather` is used to accumulate the gradients.

## Dense Embedding Layer ##

SOK's dense embedding layer is equivalent to `tf.nn.embedding_lookup`, except that it works in a model-parallel manner.

### All2All Dense Embedding ###

The all-2-all dense embedding distributes each key according to `gpu_id = key % gpu_num`. For example, if there are 8 GPUs, then `key=1000` is placed on GPU-0 and `key=1001` is placed on GPU-1. The following picture illustrates the forward propagation process.

```{image} ../images/all2all_dense_embedding.png
:class: bg_primary
:width: 50%
:align: center
```

To reduce the overhead of looking up multiple embedding tables with identical embedding vector sizes, the all-2-all dense embedding combines them into one huge embedding table. Each sub-embedding table is called a slot, also known as a feature field. To avoid ambiguity, the input keys across the embedding tables must be represented with a unified encoding.

During forward propagation, an `All2All` communication primitive is first used to exchange keys among all GPUs. Then, another `All2All` is used to exchange embedding vectors among all GPUs. During backward propagation, `All2All` is used to exchange the top gradients among all GPUs.
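
For reference, the single-GPU TensorFlow computation that the sparse embedding layer is equivalent to is a `tf.nn.embedding_lookup_sparse` call with a `sum` or `mean` combiner. The sketch below shows only this data-parallel reference; SOK computes the same per-sample result, but with the embedding table sharded across GPUs rather than replicated on every GPU.

```python
import tensorflow as tf

vocab_size, embedding_vec_size = 8, 4
params = tf.random.uniform([vocab_size, embedding_vec_size])

# Two samples: sample 0 holds keys [1, 5], sample 1 holds key [3].
sp_ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],
                         values=tf.constant([1, 5, 3], dtype=tf.int64),
                         dense_shape=[2, 2])

# "sum" and "mean" reduce the looked-up embedding vectors within each sample.
summed = tf.nn.embedding_lookup_sparse(params, sp_ids, None, combiner="sum")
averaged = tf.nn.embedding_lookup_sparse(params, sp_ids, None, combiner="mean")
print(summed.shape, averaged.shape)  # (2, 4) (2, 4)
```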
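
The key-to-GPU mapping used by both the distributed sparse embedding and the all-2-all dense embedding can be illustrated with a few lines of Python. This is a minimal sketch of the `gpu_id = key % number_of_gpus` rule, not SOK's actual implementation.

```python
import numpy as np

def scatter_keys(keys: np.ndarray, number_of_gpus: int) -> dict:
    """Group keys by the GPU that owns them: gpu_id = key % number_of_gpus."""
    gpu_ids = keys % number_of_gpus
    return {gpu: keys[gpu_ids == gpu] for gpu in range(number_of_gpus)}

keys = np.array([1000, 1001, 1007, 2048])
print(scatter_keys(keys, number_of_gpus=8))
# key=1000 and key=2048 land on GPU-0, key=1001 on GPU-1, key=1007 on GPU-7
```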
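
Fusing several same-width embedding tables into one table requires the unified key encoding mentioned above, so that keys from different slots never collide. The offset-based scheme below is only one possible encoding, shown for illustration; it is an assumption and not necessarily the encoding SOK applies internally.

```python
import numpy as np

slot_vocab_sizes = [1000, 5000, 300]                    # vocabulary size per slot (feature field)
slot_offsets = np.cumsum([0] + slot_vocab_sizes[:-1])   # [0, 1000, 6000]

def to_unified_key(slot_id: int, local_key: int) -> int:
    """Map a (slot, local key) pair into the key space of the fused table."""
    return int(slot_offsets[slot_id]) + local_key

# The same local key 42 maps to distinct rows of the fused table per slot.
print([to_unified_key(s, 42) for s in range(3)])  # [42, 1042, 6042]
```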
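
The two `All2All` exchanges of the dense embedding forward pass can be simulated with plain NumPy. This sketch only mimics the data movement (keys out, vectors back); real SOK kernels use GPU collectives, and the reordering of the returned vectors into the original key order is omitted for brevity.

```python
import numpy as np

gpu_num, vec_size = 2, 4
# Each "GPU" owns the fused-table rows whose key % gpu_num equals its id.
tables = {g: {k: np.full(vec_size, float(k), dtype=np.float32)
              for k in range(g, 16, gpu_num)}
          for g in range(gpu_num)}
local_keys = {0: np.array([3, 8]), 1: np.array([5, 8])}  # keys held on each GPU

# All2All #1: every GPU sends each key to the GPU that owns it.
routed = {owner: {src: keys[keys % gpu_num == owner] for src, keys in local_keys.items()}
          for owner in range(gpu_num)}

# Local lookup on the owning GPU, then All2All #2: vectors return to the source GPU.
returned = {src: np.concatenate(
                [np.stack([tables[owner][int(k)] for k in routed[owner][src]])
                 if len(routed[owner][src]) else np.empty((0, vec_size), np.float32)
                 for owner in range(gpu_num)])
            for src in local_keys}
print(returned[0])  # embedding vectors for the keys issued by GPU-0 ([8, 3] after routing)
```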