Profiling HPS
A critical part of optimizing the inference performance of HPS is being able to measure changes in performance as you experiment with different optimization strategies and data distributions. There are two ways to profile HPS:
- The HPS profiler. The hps_profiler application performs benchmark tasks for the Hierarchical Parameter Server. Instructions for compiling and installing it are provided in Build and install the HPS Profiler.
- The Triton Perf Analyzer. For detailed documentation of the Triton Perf Analyzer, refer to its documentation. For how to use the Triton Perf Analyzer to profile HPS, see Profile HPS with Triton Perf Analyzer below.
HPS profiler
The hps_profiler application generates inference requests to HPS and measures the throughput and latency of different components, such as the embedding cache, the database backend, and the lookup session. To get representative results, hps_profiler measures throughput and latency over a configurable number of iterations, repeating the measurement until it reaches the specified number of iterations.
For example, if --embedding_cache is used, the results look like the following:
$ hps_profiler --iterations 1000 --num_key 2000 --powerlaw --alpha 1.2 --config /hugectr/model/ps.json --table_size 630000 --warmup_iterations 100 --embedding_cache
...
*** Measurement Results ***
The Benchmark of: Apply for workspace from the memory pool for Embedding Cache Lookup
Latencies [900 iterations] min = 0.000285ms, mean = 0.000384853ms, median = 0.000365ms, 95% = 0.000428ms, 99% = 0.000465ms, max = 0.009736ms, throughput = 2.73973e+06/s
The Benchmark of: Copy the input to workspace of Embedding Cache
Latencies [900 iterations] min = 0.010842ms, mean = 0.0117076ms, median = 0.011596ms, 95% = 0.012219ms, 99% = 0.016642ms, max = 0.027379ms, throughput = 86236.6/s
The Benchmark of: Deduplicate the input embedding key for Embedding Cache
Latencies [900 iterations] min = 0.019159ms, mean = 0.0272492ms, median = 0.027262ms, 95% = 0.028104ms, 99% = 0.029548ms, max = 0.052309ms, throughput = 36681.1/s
The Benchmark of: Lookup the embedding keys from Embedding Cache
Latencies [900 iterations] min = 0.178875ms, mean = 0.231377ms, median = 0.227815ms, 95% = 0.267493ms, 99% = 0.284738ms, max = 0.47672ms, throughput = 4389.53/s
The Benchmark of: Merge output from Embedding Cache
Latencies [900 iterations] min = 0.007656ms, mean = 0.00850756ms, median = 0.008434ms, 95% = 0.009117ms, 99% = 0.011863ms, max = 0.018697ms, throughput = 118568/s
The Benchmark of: Missing key synchronization insert into Embedding Cache
Latencies [900 iterations] min = 0.105163ms, mean = 0.15741ms, median = 0.153763ms, 95% = 0.192302ms, 99% = 0.208846ms, max = 0.402043ms, throughput = 6503.52/s
The Benchmark of: Native Embedding Cache Query API
Latencies [900 iterations] min = 0.021729ms, mean = 0.0227739ms, median = 0.02253ms, 95% = 0.023695ms, 99% = 0.025035ms, max = 0.043024ms, throughput = 44385.3/s
The Benchmark of: decompress/deunique output from Embedding Cache
Latencies [900 iterations] min = 0.011247ms, mean = 0.0121274ms, median = 0.011953ms, 95% = 0.013055ms, 99% = 0.014706ms, max = 0.022186ms, throughput = 83661/s
The Benchmark of: The hit rate of Embedding Cache
Occupancy [900 iterations] min = 0.719323, mean = 0.843972, median = 0.854749, 95% = 0.894188, 99% = 0.90276, max = 0.918169
Build and install the HPS Profiler
To build the HPS profiler from source, do the following:
Download the HugeCTR repository and the third-party modules that it relies on by running the following commands:
$ git clone https://github.com/NVIDIA/HugeCTR.git
$ cd HugeCTR
$ git submodule update --init --recursive
Pull the NGC Docker image and run it
Pull the container using the following command:
docker pull nvcr.io/nvidia/merlin/merlin-hugectr:23.09
Launch the container in interactive mode (mount the HugeCTR root directory into the container for your convenience) by running this command:
docker run --gpus all --rm -it --cap-add SYS_NICE --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -u root -v $(pwd):/HugeCTR -w /HugeCTR -p 8888:8888 nvcr.io/nvidia/merlin/merlin-hugectr:23.09
Here is an example of how you can build the HPS Profiler with the relevant build options:
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DSM="70;80" -DENABLE_INFERENCE=ON -DENABLE_PROFILER=ON .. # Target is NVIDIA V100 / A100 with Inference mode ON.
$ make -j && make install
You will find hps_profiler under the bin folder.
Create a synthetic embedding table
The embedding generator is used to generate synthetic HugeCTR sparse model files that can be loaded into HugeCTR HPS for inference. To generate a HugeCTR embedding file, refer to the Model generator.
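If you prefer to script a small synthetic table yourself, the following is a minimal sketch. It assumes the common HugeCTR sparse-model layout of a folder containing a binary key file (int64 keys) and an emb_vector file (float32 vectors); the folder name and sizes below are placeholders, so verify the layout against the Model generator before relying on it.
```python
# Minimal sketch: write a synthetic sparse model folder (assumed layout:
# a "key" file of int64 keys plus an "emb_vector" file of float32 vectors).
import os
import numpy as np

table_size = 630000                 # number of embedding keys (matches --table_size used below)
vec_size = 16                       # matches "embedding_vecsize_per_table" in ps.json
out_dir = "synthetic_sparse_model"  # hypothetical output folder

os.makedirs(out_dir, exist_ok=True)
keys = np.arange(table_size, dtype=np.int64)
vectors = np.random.rand(table_size, vec_size).astype(np.float32)

keys.tofile(os.path.join(out_dir, "key"))
vectors.tofile(os.path.join(out_dir, "emb_vector"))
```
Point "sparse_files" in the HPS configuration at the generated folder.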
Use the HPS Profiler to get the measurement results
Generate the HPS JSON configuration file based on the synthetic model file. For information about the HPS configuration, refer to the HPS configuration documentation. Here is an example:
{
    "supportlonglong": true,
    "models": [{
        "model": "model_name",
        "sparse_files": ["The path of synthetic embedding files"],
        "dense_file": "",
        "network_file": "",
        "num_of_worker_buffer_in_pool": 2,
        "num_of_refresher_buffer_in_pool": 1,
        "deployed_device_list": [0],
        "max_batch_size": 1024,
        "default_value_for_each_table": [0.0],
        "cache_refresh_percentage_per_iteration": 0.1,
        "hit_rate_threshold": 1.0,
        "gpucacheper": 0.9,
        "gpucache": true,
        "maxnum_des_feature_per_sample": 0,
        "maxnum_catfeature_query_per_table_per_sample": [26],
        "embedding_vecsize_per_table": [16]
    }]
}
NOTE: The product of max_batch_size and maxnum_catfeature_query_per_table_per_sample needs to be greater than or equal to the --num_key option of the hps_profiler. With the example configuration above, 1024 * 26 = 26624, which satisfies --num_key 2000.
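To check this constraint against your configuration before a run, a small script such as the following can be used (a sketch; the ps.json path and the num_key value are placeholders for your own settings):
```python
# Sanity check (sketch): max_batch_size * maxnum_catfeature_query_per_table_per_sample >= --num_key
import json

num_key = 2000                      # the value you plan to pass via --num_key
with open("ps.json") as f:          # path to your HPS configuration file
    model = json.load(f)["models"][0]

capacity = model["max_batch_size"] * model["maxnum_catfeature_query_per_table_per_sample"][0]
assert capacity >= num_key, f"--num_key {num_key} exceeds max_batch_size * keys per sample = {capacity}"
print(f"OK: capacity {capacity} >= num_key {num_key}")
```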
Add arguments to hps_profiler for benchmarking
$ hps_profiler
--config: required.
Usage: HPS_Profiler [options]
Optional arguments:
-h --help shows help message and exits [default: false]
-v --version prints version information and exits [default: false]
--config The path of the HPS json configuration file [required]
--powerlaw Generate the queried key that in each iteration based on the power distribution [default: false]
--table_size The number of keys in the embedded table [default: 100000]
--alpha Alpha of power distribution [default: 1.2]
--hot_key_percentage Percentage of hot keys in embedding tables [default: 0.2]
--hot_key_coverage The probability of the hot key in each iteration [default: 0.8]
--num_key The number of keys to query for each iteration [default: 1000]
--iterations The number of iterations of the test [default: 1000]
--warmup_iterations Performance results in warmup stage will be discarded [default: 0]
--embedding_cache Enable embedding cache profiler, including the performance of lookup, insert, etc. [default: false]
--database_backend Enable database backend profiler, which is to get the lookup performance of VDB/PDB [default: false]
--refresh_embeddingcache Enable refreshing embedding cache. If the embedding cache tool is also enabled, the refresh will be performed asynchronously [default: false]
--lookup_session Enable lookup_session profiler, which is E2E profiler, including embedding cache and data backend query delay [default: false]
Measurement example of the HPS Lookup Session
$ hps_profiler --iterations 1000 --num_key 2000 --powerlaw --alpha 1.2 --config /hugectr/Model_Samples/wdl/wdl_infer/model/ps.json --table_size 630000 --warmup_iterations 100 --lookup_session
...
*** Measurement Results ***
The Benchmark of: End-to-end lookup embedding keys for Lookup session
Latencies [900 iterations] min = 0.190813ms, mean = 0.243117ms, median = 0.238085ms, 95% = 0.283761ms, 99% = 0.346377ms, max = 0.511712ms, throughput= 4200.18/s
Measurement example of the HPS Data Backend
$ hps_profiler --iterations 1000 --num_key 2000 --powerlaw --alpha 1.2 --config /hugectr/Model_Samples/wdl/wdl_infer/model/ps.json --table_size 630000 --warmup_iterations 100 --database_backend
...
*** Measurement Results ***
The Benchmark of: Lookup the embedding key from default HPS database Backend
Latencies [900 iterations] min = 0.075086ms, mean = 0.127312ms, median = 0.121235ms, 95% = 0.166826ms, 99% = 0.219295ms, max = 0.285409ms, throughput = 8248.44/s
NOTE:
- If the --powerlaw option is added, the queried embedding keys are generated from a power-law distribution with the specified --alpha value.
- If the --hot_key_percentage and --hot_key_coverage options are added, the first --table_size * --hot_key_percentage keys are treated as hot keys and appear in each request with probability --hot_key_coverage (see the sketch after this list). For example, with --hot_key_percentage=0.01, --hot_key_coverage=0.9 and --table_size=1000, the first 1000 * 0.01 = 10 keys will appear in the request with a probability of 90%.
- It is recommended to enable only one of the three components (--embedding_cache, --database_backend and --lookup_session) per run to get the most accurate results, because the measurement results of the lookup session include the performance of the database backend and the embedding cache.
- If the static embedding table is enabled in the HPS JSON file, the hps_profiler does not support the refresh operation.
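The following is a small illustration of the two key-generation schemes described above; it is not the profiler's actual implementation, and the function names and use of NumPy are assumptions made for the example.
```python
import numpy as np

rng = np.random.default_rng()

def powerlaw_keys(table_size, num_key, alpha):
    """Low key IDs are queried far more often: P(key k) ~ 1 / (k + 1)**alpha."""
    weights = 1.0 / np.arange(1, table_size + 1) ** alpha
    return rng.choice(table_size, size=num_key, p=weights / weights.sum())

def hot_keys(table_size, num_key, hot_key_percentage, hot_key_coverage):
    """The first table_size * hot_key_percentage keys appear with probability hot_key_coverage."""
    num_hot = max(1, int(table_size * hot_key_percentage))
    is_hot = rng.random(num_key) < hot_key_coverage            # e.g. ~90% of queries hit hot keys
    keys = rng.integers(num_hot, table_size, size=num_key)     # cold keys by default
    keys[is_hot] = rng.integers(0, num_hot, size=is_hot.sum()) # replace with hot keys
    return keys

# Example matching the note above: 10 hot keys out of 1000, covering ~90% of the queries.
batch = hot_keys(table_size=1000, num_key=2000, hot_key_percentage=0.01, hot_key_coverage=0.9)
```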
Profile HPS with Triton Perf Analyzer
To profile HPS with the Triton Perf Analyzer, make sure you know how to deploy your model using the hugectr backend in Triton. If you are not familiar with this, refer to the HugeCTR backend documentation.
To profile HPS, follow the procedure below:
1. Prepare your embedding table. This can be either a real model trained by HugeCTR or a synthetic model generated using the model generator.
2. Prepare your HPS configuration file (ps.json), as demonstrated above.
3. Prepare the JSON-formatted request that Triton requires. The request can be generated using the request generator.
4. After everything is prepared, start Triton. For example:
tritonserver --model-repository=/dir/to/model/ --load-model=your_model_name --model-control-mode=explicit --backend-directory=/usr/local/hugectr/backends --backend-config=hugectr,ps=/dir/to/your/ps.json
5. Run the Triton Perf Analyzer. For example:
perf_analyzer -m your_model_name --collect-metrics -f perf_output.csv --verbose-csv --input-data your_generated_request.json
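If you want to hand-craft a small --input-data file instead of using the request generator, the sketch below shows one way to do it. The input tensor names (DES, CATCOLUMN, ROWINDEX), their contents, and the file name are assumptions based on typical HugeCTR backend samples; verify them against your model's config.pbtxt, and prefer the request generator for real benchmarks.
```python
# Sketch: write a perf_analyzer --input-data JSON file with random categorical keys.
# Tensor names and layout are assumptions; check your model's config.pbtxt.
import json
import random

num_samples = 16        # number of requests in the data file
keys_per_sample = 26    # matches maxnum_catfeature_query_per_table_per_sample in ps.json
num_dense = 0           # matches maxnum_des_feature_per_sample in ps.json
table_size = 630000

data = []
for _ in range(num_samples):
    cat = [random.randrange(table_size) for _ in range(keys_per_sample)]
    data.append({
        "DES": {"content": [0.0] * num_dense, "shape": [num_dense]},
        "CATCOLUMN": {"content": cat, "shape": [keys_per_sample]},
        # Row offsets, one slot per key in this toy example.
        "ROWINDEX": {"content": list(range(keys_per_sample + 1)), "shape": [keys_per_sample + 1]},
    })

with open("your_generated_request.json", "w") as f:
    json.dump({"data": data}, f)
```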
HPS Profiler vs. Triton Perf Analyzer
| Functionalities | HPS profiler | Triton Perf Analyzer |
| --- | --- | --- |
| Profile client-side E2E pipeline | NO | YES |
| Profile server-side key lookup session | YES | YES |
| Profile the embedding cache component | YES | NO |
| Profile the database backend component | YES | NO |
| Support different key distributions | YES | YES |
| Concurrency support | NO | YES |
| GPU/memory utilization | NO | YES |