Benchmark the HPS Plugin for TensorFlow

Benchmark Setup

The inference SavedModel that leverages HPS can be deployed with the Triton TensorFlow backend, as demonstrated in hps_tensorflow_triton_deployment_demo.ipynb. We measure the inference performance of this deployment method with the Triton Performance Analyzer to verify the effectiveness of the HPS integration into TensorFlow.

We train a DLRM model following JoC TensorFlow2 DLRM on the Criteo 1TB dataset filtered with a frequency threshold of 15. The trained model is exported in the SavedModel format and is about 16 GB, almost all of which is embedding weights because the dense layer weights are small. We compare three deployment methods on the Triton Inference Server:

  • Deploy the trained SavedModel directly with the TensorFlow backend.

  • Deploy the inference SavedModel that leverages HPS with the TensorFlow backend. The workflow of deriving the inference SavedModel is illustrated in HPS TensorFlow WorkFlow; a minimal sketch of this derivation step is shown after this list.

  • Deploy the ensemble model that combines the Triton HPS backend and the Triton TensorFlow backend. The demo for this deployment method can be found at the HPS Triton Ensemble notebook.
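
The sketch below illustrates the derivation step mentioned in the second bullet: replacing the native TensorFlow embedding lookup with an HPS lookup layer before saving the inference model. It assumes the `hps.Init` and `hps.LookupLayer` APIs as they appear in the HPS plugin demo notebooks; the model name, configuration file, table ID, input names, and the simplified dense part are illustrative placeholders rather than the exact values used in this benchmark.

```python
import tensorflow as tf
import hierarchical_parameter_server as hps  # HPS plugin for TensorFlow

# Initialize HPS from a parameter-server configuration file (illustrative values).
hps.Init(global_batch_size=131072, ps_config_file="dlrm.json")

# HPS lookup layer that replaces the native TF embedding lookup for inference.
lookup_layer = hps.LookupLayer(
    model_name="dlrm",        # must match the model name in dlrm.json
    table_id=0,               # index of the embedding table to query
    emb_vec_size=128,         # embedding vector dimension
    emb_vec_dtype=tf.float32,
)

# Criteo DLRM inputs: 26 categorical features and 13 dense features per sample.
cat_keys = tf.keras.Input(shape=(26,), dtype=tf.int64, name="categorical_features")
dense_in = tf.keras.Input(shape=(13,), dtype=tf.float32, name="dense_features")

embeddings = lookup_layer(cat_keys)            # (batch, 26, 128) embedding vectors
flat = tf.keras.layers.Flatten()(embeddings)   # (batch, 26 * 128)

# Stand-in for the pre-trained dense part of DLRM; in practice the trained
# dense-layer weights are reused here instead of a fresh Dense layer.
logits = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Concatenate()([flat, dense_in]))

inference_model = tf.keras.Model([cat_keys, dense_in], logits)
# Save for deployment with the Triton TensorFlow backend.
inference_model.save("dlrm_hps_inference")
```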

The benchmark is conducted on a DGX A100, and only one GPU is utilized, with one Triton model instance placed on it. The best-case performance for HPS is studied by sending the same batch of inference data repeatedly, so that the embedding lookup is always served by the GPU embedding cache of HPS. As for the method of deploying the trained SavedModel directly, the entire embedding weights (~16 GB) fit into the 80 GB memory of the A100 GPU, so this method can be regarded as the baseline of the benchmark.
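
As a concrete illustration of this setup, the sketch below sends one fixed batch to the deployed model over and over through the Triton Python client, so that after the first request every embedding lookup hits the GPU embedding cache. The model name ("dlrm_hps") and the input tensor names and shapes are placeholders determined by the deployed SavedModel's serving signature; note also that the numbers reported below were collected server-side with the Triton Performance Analyzer, not with this client-side script.

```python
import time
import numpy as np
import tritonclient.grpc as grpcclient

BATCH = 1024
client = grpcclient.InferenceServerClient(url="localhost:8001")

# One fixed batch, reused for every request (best case for the GPU embedding cache).
rng = np.random.default_rng(0)
cat = rng.integers(0, 100_000, size=(BATCH, 26)).astype(np.int64)   # categorical keys
dense = rng.random(size=(BATCH, 13)).astype(np.float32)             # dense features

inputs = [
    grpcclient.InferInput("categorical_features", list(cat.shape), "INT64"),
    grpcclient.InferInput("dense_features", list(dense.shape), "FP32"),
]
inputs[0].set_data_from_numpy(cat)
inputs[1].set_data_from_numpy(dense)

# Warm up once, then time repeated requests with the identical batch.
client.infer("dlrm_hps", inputs)
n_requests = 100
start = time.perf_counter()
for _ in range(n_requests):
    client.infer("dlrm_hps", inputs)
elapsed = time.perf_counter() - start
print(f"avg client-side latency: {1e3 * elapsed / n_requests:.3f} ms/batch")
```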

Results and Analysis

The per-batch latency in milliseconds, measured on the server side, is shown in the following table and in Fig. 1 (the Y-axis of the figure is logarithmic).

| Batch size | Native TF Model | HPS Plugin TF Model | Ensemble Model |
|-----------:|----------------:|--------------------:|---------------:|
| 32         | 0.636           | 0.612               | 0.854          |
| 64         | 0.754           | 0.651               | 1.008          |
| 256        | 0.756           | 0.669               | 1.802          |
| 1024       | 0.922           | 0.870               | 4.420          |
| 2048       | 1.176           | 1.100               | 8.087          |
| 4096       | 1.447           | 1.552               | 23.383         |
| 8192       | 2.443           | 2.524               | 42.303         |
| 16384      | 3.901           | 4.402               | 82.310         |
| 32768      | 7.681           | 8.320               | 291.335        |
| 65536      | 13.665          | 15.577              | 617.968        |
| 131072     | 29.636          | 29.952              | 1190.025       |

[Figure ../_images/latency.png] Fig. 1: JoC DLRM inference benchmark - latency



The throughput in K samples per second, measured on the server side, is shown in the following table and in Fig. 2 (the Y-axis of the figure is logarithmic).

| Batch size | Native TF Model | HPS Plugin TF Model | Ensemble Model |
|-----------:|----------------:|--------------------:|---------------:|
| 32         | 50.314          | 52.287              | 37.470         |
| 64         | 84.880          | 98.310              | 63.492         |
| 256        | 338.624         | 382.660             | 142.064        |
| 1024       | 1110.629        | 1177.011            | 231.674        |
| 2048       | 1741.496        | 1861.818            | 253.245        |
| 4096       | 2830.684        | 2639.175            | 175.169        |
| 8192       | 3353.254        | 3245.641            | 193.650        |
| 16384      | 4199.948        | 3721.944            | 199.052        |
| 32768      | 4266.111        | 3938.461            | 112.475        |
| 65536      | 4795.901        | 4207.228            | 106.050        |
| 131072     | 4422.729        | 4376.068            | 110.142        |

[Figure ../_images/throughput.png] Fig. 2: JoC DLRM inference benchmark - throughput
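
The two tables are consistent with each other: both are measured per batch on the server side, so throughput in K samples per second is simply the batch size divided by the latency in milliseconds. A quick check against the HPS Plugin TF Model column at batch size 1024:

```python
# Throughput (K samples/s) = batch_size / latency (ms), because
# samples per millisecond equals thousands of samples per second.
batch_size = 1024
latency_ms = 0.870              # HPS Plugin TF Model, latency table
print(batch_size / latency_ms)  # ~1177.0, matching 1177.011 in the throughput table
```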



As the benchmark results indicate, the performance of the inference SavedModel is comparable to that of the trained SavedModel. In other words, the best-case performance of HPS embedding lookup, when served by the GPU embedding cache, is on par with that of native TensorFlow GPU embedding lookup. In practice, however, large embedding weights often cannot fit in GPU memory, a case that native TensorFlow GPU embedding lookup cannot handle. The HPS plugin for TensorFlow can serve embedding tables that exceed GPU memory through its hierarchical memory storage and still provide a low-latency embedding lookup service via an efficient GPU caching mechanism.
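
To make the caching idea concrete, the toy sketch below mimics a hierarchical lookup in plain Python: a small "GPU cache" dictionary is consulted first, and misses fall back to a larger "host memory" store, filling the cache along the way. This is only a conceptual illustration of the behavior described above, not the actual HPS implementation, which performs these steps on the GPU with its own insertion and eviction policies and additional storage tiers.

```python
import numpy as np

EMB_DIM = 128

# Toy two-level store: a small "GPU cache" in front of a large "host memory" table.
host_table = {k: np.random.rand(EMB_DIM).astype(np.float32) for k in range(100_000)}
gpu_cache: dict[int, np.ndarray] = {}
GPU_CACHE_CAPACITY = 10_000

def lookup(keys: np.ndarray) -> np.ndarray:
    """Return embedding vectors for keys, filling the cache on misses (FIFO eviction)."""
    out = np.empty((len(keys), EMB_DIM), dtype=np.float32)
    for i, k in enumerate(keys):
        k = int(k)
        vec = gpu_cache.get(k)
        if vec is None:                                  # cache miss: fetch from slower tier
            vec = host_table[k]
            if len(gpu_cache) >= GPU_CACHE_CAPACITY:
                gpu_cache.pop(next(iter(gpu_cache)))     # evict the oldest cached key
            gpu_cache[k] = vec
        out[i] = vec
    return out

# Repeating the same batch of keys turns every lookup after the first into a cache hit,
# which is exactly the best case studied in this benchmark.
batch = np.random.randint(0, 100_000, size=1024)
print(lookup(batch).shape)   # (1024, 128)
```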

Among the three deployment methods, the performance of the Triton ensemble model is much worse than that of the other two, which can be attributed to the overhead of transferring data between the two backends. Specifically, the embedding vectors output by the HPS backend are fed to the TensorFlow backend, and their size grows linearly with the batch size (reaching 131072 * 128 * 4 bytes at a batch size of 131072). Consequently, the latency is orders of magnitude higher than that of the other two methods.
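
For a sense of scale, the snippet below evaluates the expression above for a few batch sizes, assuming 128-dimensional float32 embedding vectors as stated; the per-request payload crossing the backend boundary grows linearly and reaches tens of megabytes at the largest batch sizes.

```python
# Size of the embedding vectors passed from the HPS backend to the TensorFlow
# backend, using the expression batch_size * 128 * 4 bytes from the text above.
EMB_DIM, BYTES_PER_FLOAT = 128, 4

for batch_size in (4096, 32768, 131072):
    payload_mib = batch_size * EMB_DIM * BYTES_PER_FLOAT / 2**20
    print(f"batch {batch_size:>6}: {payload_mib:.0f} MiB per request")
# batch   4096: 2 MiB per request
# batch  32768: 16 MiB per request
# batch 131072: 64 MiB per request
```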