
Multi-GPU Offline Inference

Overview

In HugeCTR version 3.4.1, we provide Python APIs to perform multi-GPU offline inference. This work leverages the HugeCTR Hierarchical Parameter Server and enables concurrent execution on multiple devices. Multi-GPU offline inference currently supports the Norm and Parquet dataset formats.

This notebook explains how to perform multi-GPU offline inference with the HugeCTR Python APIs. For more details about the API, see the HugeCTR Python Interface documentation.

Setup

To set up the environment, refer to HugeCTR Example Notebooks and follow the instructions there before running the following cells.

Data Generation

HugeCTR provides a tool to generate synthetic datasets. The Data Generator class is capable of generating datasets in different formats and with different distributions. We will generate multi-hot Parquet datasets with a power-law distribution for this notebook:

import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

data_generator_params = DataGeneratorParams(
  format = hugectr.DataReaderType_t.Parquet,
  label_dim = 2,
  dense_dim = 2,
  num_slot = 3,
  i64_input_key = True,
  nnz_array = [2, 1, 3],
  source = "./multi_hot_parquet/file_list.txt",
  eval_source = "./multi_hot_parquet/file_list_test.txt",
  slot_size_array = [10000, 10000, 10000],
  check_type = hugectr.Check_t.Non,
  dist_type = hugectr.Distribution_t.PowerLaw,
  power_law_type = hugectr.PowerLaw_t.Short,
  num_files = 32,
  eval_num_files = 8)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()
[HCTR][08:59:54.134][INFO][RK0][main]: Generate Parquet dataset
[HCTR][08:59:54.134][INFO][RK0][main]: train data folder: ./multi_hot_parquet, eval data folder: ./multi_hot_parquet, slot_size_array: 10000, 10000, 10000, nnz array: 2, 1, 3, #files for train: 32, #files for eval: 8, #samples per file: 40960, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][08:59:54.136][INFO][RK0][main]: ./multi_hot_parquet exist
[HCTR][08:59:54.140][INFO][RK0][main]: ./multi_hot_parquet/train/gen_0.parquet
[HCTR][08:59:55.615][INFO][RK0][main]: ./multi_hot_parquet/train/gen_1.parquet
[HCTR][08:59:55.850][INFO][RK0][main]: ./multi_hot_parquet/train/gen_2.parquet
[HCTR][08:59:56.078][INFO][RK0][main]: ./multi_hot_parquet/train/gen_3.parquet
[HCTR][08:59:56.311][INFO][RK0][main]: ./multi_hot_parquet/train/gen_4.parquet
[HCTR][08:59:56.534][INFO][RK0][main]: ./multi_hot_parquet/train/gen_5.parquet
[HCTR][08:59:56.770][INFO][RK0][main]: ./multi_hot_parquet/train/gen_6.parquet
[HCTR][08:59:56.959][INFO][RK0][main]: ./multi_hot_parquet/train/gen_7.parquet
[HCTR][08:59:57.152][INFO][RK0][main]: ./multi_hot_parquet/train/gen_8.parquet
[HCTR][08:59:57.309][INFO][RK0][main]: ./multi_hot_parquet/train/gen_9.parquet
[HCTR][08:59:57.496][INFO][RK0][main]: ./multi_hot_parquet/train/gen_10.parquet
[HCTR][08:59:57.671][INFO][RK0][main]: ./multi_hot_parquet/train/gen_11.parquet
[HCTR][08:59:57.879][INFO][RK0][main]: ./multi_hot_parquet/train/gen_12.parquet
[HCTR][08:59:58.069][INFO][RK0][main]: ./multi_hot_parquet/train/gen_13.parquet
[HCTR][08:59:58.240][INFO][RK0][main]: ./multi_hot_parquet/train/gen_14.parquet
[HCTR][08:59:58.423][INFO][RK0][main]: ./multi_hot_parquet/train/gen_15.parquet
[HCTR][08:59:58.619][INFO][RK0][main]: ./multi_hot_parquet/train/gen_16.parquet
[HCTR][08:59:58.833][INFO][RK0][main]: ./multi_hot_parquet/train/gen_17.parquet
[HCTR][08:59:59.017][INFO][RK0][main]: ./multi_hot_parquet/train/gen_18.parquet
[HCTR][08:59:59.176][INFO][RK0][main]: ./multi_hot_parquet/train/gen_19.parquet
[HCTR][08:59:59.358][INFO][RK0][main]: ./multi_hot_parquet/train/gen_20.parquet
[HCTR][08:59:59.527][INFO][RK0][main]: ./multi_hot_parquet/train/gen_21.parquet
[HCTR][08:59:59.722][INFO][RK0][main]: ./multi_hot_parquet/train/gen_22.parquet
[HCTR][08:59:59.939][INFO][RK0][main]: ./multi_hot_parquet/train/gen_23.parquet
[HCTR][09:00:00.107][INFO][RK0][main]: ./multi_hot_parquet/train/gen_24.parquet
[HCTR][09:00:00.294][INFO][RK0][main]: ./multi_hot_parquet/train/gen_25.parquet
[HCTR][09:00:00.509][INFO][RK0][main]: ./multi_hot_parquet/train/gen_26.parquet
[HCTR][09:00:00.695][INFO][RK0][main]: ./multi_hot_parquet/train/gen_27.parquet
[HCTR][09:00:00.955][INFO][RK0][main]: ./multi_hot_parquet/train/gen_28.parquet
[HCTR][09:00:01.190][INFO][RK0][main]: ./multi_hot_parquet/train/gen_29.parquet
[HCTR][09:00:01.365][INFO][RK0][main]: ./multi_hot_parquet/train/gen_30.parquet
[HCTR][09:00:01.509][INFO][RK0][main]: ./multi_hot_parquet/train/gen_31.parquet
[HCTR][09:00:01.698][INFO][RK0][main]: ./multi_hot_parquet/file_list.txt done!
[HCTR][09:00:01.708][INFO][RK0][main]: ./multi_hot_parquet/val/gen_0.parquet
[HCTR][09:00:01.895][INFO][RK0][main]: ./multi_hot_parquet/val/gen_1.parquet
[HCTR][09:00:02.062][INFO][RK0][main]: ./multi_hot_parquet/val/gen_2.parquet
[HCTR][09:00:02.255][INFO][RK0][main]: ./multi_hot_parquet/val/gen_3.parquet
[HCTR][09:00:02.472][INFO][RK0][main]: ./multi_hot_parquet/val/gen_4.parquet
[HCTR][09:00:02.665][INFO][RK0][main]: ./multi_hot_parquet/val/gen_5.parquet
[HCTR][09:00:02.888][INFO][RK0][main]: ./multi_hot_parquet/val/gen_6.parquet
[HCTR][09:00:03.110][INFO][RK0][main]: ./multi_hot_parquet/val/gen_7.parquet
[HCTR][09:00:03.303][INFO][RK0][main]: ./multi_hot_parquet/file_list_test.txt done!
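
The dataset sizes implied by the generator settings can be worked out by hand. The following is a minimal sketch that only restates numbers from the parameters and the log above (32 training files, 8 evaluation files, 40960 samples per file); the variable names are illustrative and are not part of the HugeCTR API.

```python
# Values copied from DataGeneratorParams and the generator log above.
num_files = 32            # training files
eval_num_files = 8        # evaluation files
samples_per_file = 40960  # "#samples per file" reported in the log

train_samples = num_files * samples_per_file
eval_samples = eval_num_files * samples_per_file

print("training samples:  ", train_samples)   # 1310720
print("evaluation samples:", eval_samples)    # 327680
```

These totals are useful later when choosing batch sizes, because they bound how many full batches a single pass over each dataset can provide.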

Train from Scratch

We can train from scratch by performing the following steps with Python APIs:

  1. Create the solver, reader and optimizer, then initialize the model.

  2. Construct the model graph by adding input, sparse embedding and dense layers in order.

  3. Compile the model and have an overview of the model graph.

  4. Dump the model graph to a JSON file.

  5. Fit the model. The model weights and optimizer states are saved implicitly at the snapshot interval.

  6. Dump one batch of evaluation results to files.
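
The layer widths needed in step 2 can be checked before building the graph. This is a rough sketch, assuming the fourth argument of DataReaderSparseParam is the number of slots for that sparse input and that each slot produces one combined embedding vector of size embedding_vec_size. The dictionary and variable names are illustrative only.

```python
# Values taken from the training script in this notebook.
embedding_vec_size = 16
dense_dim = 2
# Assumed meaning: number of slots per sparse input (data1 has 2, data2 has 1).
slot_num = {"data1": 2, "data2": 1}

# Each Reshape layer flattens (slots, embedding_vec_size) into a single row.
leading_dim = {name: n * embedding_vec_size for name, n in slot_num.items()}

# Concat joins both reshaped embeddings with the dense features.
concat_width = sum(leading_dim.values()) + dense_dim

print(leading_dim)    # {'data1': 32, 'data2': 16}
print(concat_width)   # 50
```

The results agree with the leading_dim values passed to the two Reshape layers and with the (None, 50) output shape that the model summary reports for the Concat layer.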

%%writefile multi_hot_train.py
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(model_name = "multi_hot",
                              max_eval_batches = 1,
                              batchsize_eval = 131072,
                              batchsize = 16384,
                              lr = 0.001,
                              vvgpu = [[0]],
                              i64_input_key = True,
                              repeat_dataset = True,
                              use_cuda_graph = True)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./multi_hot_parquet/file_list.txt"],
                                  eval_source = "./multi_hot_parquet/file_list_test.txt",
                                  check_type = hugectr.Check_t.Non,
                                  slot_size_array = [10000, 10000, 10000])
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 2, label_name = "label",
                        dense_dim = 2, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        [hugectr.DataReaderSparseParam("data1", [2, 1], False, 2),
                        hugectr.DataReaderSparseParam("data2", 3, False, 1),]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 4,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "data1",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 2,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "data2",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=32))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=16))                            
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "reshape2", "dense"], top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["relu1"],
                            top_names = ["fc2"],
                            num_output=2))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.MultiCrossEntropyLoss,
                            bottom_names = ["fc2", "label"],
                            top_names = ["loss"],
                            target_weight_vec = [0.5, 0.5]))
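# Compile the model and print an overview of the graph.
model.compile()
model.summary()
# Dump the model graph; the inference step loads this JSON file.
model.graph_to_json("multi_hot.json")
# Train for 1100 iterations; weights and optimizer states are snapshotted at iteration 1000.
model.fit(max_iter = 1100, display = 200, eval_interval = 1000, snapshot = 1000, snapshot_prefix = "multi_hot")
# Dump one batch of evaluation predictions and labels for the later correctness check.
model.export_predictions("multi_hot_pred_1000", "multi_hot_label_1000")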
Overwriting multi_hot_train.py
!python3 multi_hot_train.py
HugeCTR Version: 3.7
====================================================Model Init=====================================================
[HCTR][09:00:10.032][INFO][RK0][main]: Initialize model: multi_hot
[HCTR][09:00:10.032][INFO][RK0][main]: Global seed is 69819197
[HCTR][09:00:10.135][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][09:00:11.978][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:00:11.978][INFO][RK0][main]: Start all2all warmup
[HCTR][09:00:11.978][INFO][RK0][main]: End all2all warmup
[HCTR][09:00:11.979][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][09:00:11.980][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][09:00:11.985][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][09:00:11.985][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][09:00:12.176][INFO][RK0][main]: Vocabulary size: 30000
[HCTR][09:00:12.177][INFO][RK0][main]: max_vocabulary_size_per_gpu_=21845
[HCTR][09:00:12.179][INFO][RK0][main]: max_vocabulary_size_per_gpu_=10922
[HCTR][09:00:12.181][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][09:00:43.965][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][09:00:43.965][INFO][RK0][main]: gpu0 init embedding done
[HCTR][09:00:43.965][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][09:00:43.965][INFO][RK0][main]: gpu0 init embedding done
[HCTR][09:00:43.969][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][09:00:43.972][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][09:00:43.972][INFO][RK0][main]: label                                   Dense                         Sparse                        
label                                   dense                          data1,data2                   
(None, 2)                               (None, 2)                               
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape                  
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash      data1                         sparse_embedding1             (None, 2, 16)                 
------------------------------------------------------------------------------------------------------------------
DistributedSlotSparseEmbeddingHash      data2                         sparse_embedding2             (None, 1, 16)                 
------------------------------------------------------------------------------------------------------------------
Reshape                                 sparse_embedding1             reshape1                      (None, 32)                    
------------------------------------------------------------------------------------------------------------------
Reshape                                 sparse_embedding2             reshape2                      (None, 16)                    
------------------------------------------------------------------------------------------------------------------
Concat                                  reshape1                      concat1                       (None, 50)                    
                                        reshape2                                                                                  
                                        dense                                                                                     
------------------------------------------------------------------------------------------------------------------
InnerProduct                            concat1                       fc1                           (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
ReLU                                    fc1                           relu1                         (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
InnerProduct                            relu1                         fc2                           (None, 2)                     
------------------------------------------------------------------------------------------------------------------
MultiCrossEntropyLoss                   fc2                           loss                                                        
                                        label                                                                                     
------------------------------------------------------------------------------------------------------------------
[HCTR][09:00:43.977][INFO][RK0][main]: Save the model graph to multi_hot.json successfully
=====================================================Model Fit=====================================================
[HCTR][09:00:43.977][INFO][RK0][main]: Use non-epoch mode with number of iterations: 1100
[HCTR][09:00:43.977][INFO][RK0][main]: Training batchsize: 16384, evaluation batchsize: 131072
[HCTR][09:00:43.977][INFO][RK0][main]: Evaluation interval: 1000, snapshot interval: 1000
[HCTR][09:00:43.977][INFO][RK0][main]: Dense network trainable: True
[HCTR][09:00:43.977][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][09:00:43.977][INFO][RK0][main]: Sparse embedding sparse_embedding2 trainable: True
[HCTR][09:00:43.977][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][09:00:43.977][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][09:00:43.977][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][09:00:43.977][INFO][RK0][main]: Training source file: ./multi_hot_parquet/file_list.txt
[HCTR][09:00:43.977][INFO][RK0][main]: Evaluation source file: ./multi_hot_parquet/file_list_test.txt
[HCTR][09:00:46.346][INFO][RK0][main]: Iter: 200 Time(200 iters): 2.36888s Loss: 0.346413 lr:0.001
[HCTR][09:00:48.421][INFO][RK0][main]: Iter: 400 Time(200 iters): 2.07362s Loss: 0.345891 lr:0.001
[HCTR][09:00:50.519][INFO][RK0][main]: Iter: 600 Time(200 iters): 2.09809s Loss: 0.345239 lr:0.001
[HCTR][09:00:52.586][INFO][RK0][main]: Iter: 800 Time(200 iters): 2.06616s Loss: 0.344346 lr:0.001
[HCTR][09:00:54.656][INFO][RK0][main]: Iter: 1000 Time(200 iters): 2.0697s Loss: 0.343731 lr:0.001
[HCTR][09:00:54.686][INFO][RK0][main]: Evaluation, AUC: 0.499013
[HCTR][09:00:54.686][INFO][RK0][main]: Eval Time for 1 iters: 0.006811s
[HCTR][09:00:54.692][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][09:00:54.830][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][09:00:54.848][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][09:00:54.851][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][09:00:54.852][INFO][RK0][main]: Done
[HCTR][09:00:54.852][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][09:00:54.853][INFO][RK0][main]: Done
[HCTR][09:00:54.886][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][09:00:54.887][INFO][RK0][main]: Done
[HCTR][09:00:54.887][INFO][RK0][main]: Rank0: Write optimzer state to file
[HCTR][09:00:54.888][INFO][RK0][main]: Done
[HCTR][09:00:54.904][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][09:00:54.906][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][09:00:54.909][INFO][RK0][main]: Dumping dense optimizer states to file, successful
[HCTR][09:00:55.915][INFO][RK0][main]: Finish 1100 iterations with batchsize: 16384 in 11.94s.
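
The snapshot taken at iteration 1000 leaves behind the weight files that the inference step loads. The helper below is hypothetical and not part of HugeCTR. It only reproduces the naming pattern that can be inferred from the file names used in this notebook (snapshot prefix, table index, iteration), so treat the pattern as an assumption and check the files on disk.

```python
def snapshot_files(prefix, iteration, num_sparse_tables):
    """Build the model file names inferred from this notebook's snapshot output."""
    dense = f"{prefix}_dense_{iteration}.model"
    sparse = [f"{prefix}{i}_sparse_{iteration}.model" for i in range(num_sparse_tables)]
    return dense, sparse

# snapshot_prefix = "multi_hot", snapshot = 1000, two SparseEmbedding layers.
dense_model_file, sparse_model_files = snapshot_files("multi_hot", 1000, 2)
print(dense_model_file)
print(sparse_model_files)
```

Building the names in one place keeps the training prefix and the inference configuration from drifting apart when the snapshot iteration changes.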

Multi-GPU Offline Inference

We can demonstrate multi-GPU offline inference by performing the following steps with Python APIs:

  1. Configure the inference hyperparameters.

  2. Initialize the inference model. The model is a collection of inference sessions deployed on multiple devices.

  3. Run inference on the evaluation dataset.

  4. Check the correctness of the inference by comparing it with the dumped evaluation results.

Note: The max_batchsize configured within InferenceParams is the global batch size. The value for max_batchsize should be divisible by the number of deployed devices. The numpy array returned by InferenceModel.predict is of the shape (max_batchsize * num_batches, label_dim).
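
The arithmetic behind this note can be sketched as follows. The numbers come from this notebook (a global batch size of 16384, eight devices, eight batches, label_dim of 2, and an evaluation dump of one batch of 131072 samples). The variable names are illustrative, and the assumption that the dump holds batchsize_eval * max_eval_batches samples follows from the training settings rather than from the API documentation.

```python
max_batchsize = 16384               # global batch size in InferenceParams
deployed_devices = list(range(8))   # GPUs 0..7
num_batches = 8                     # first argument of InferenceModel.predict
label_dim = 2

# The global batch must split evenly across the deployed devices.
if max_batchsize % len(deployed_devices) != 0:
    raise ValueError("max_batchsize must be divisible by the number of devices")
per_device_batchsize = max_batchsize // len(deployed_devices)

# Shape of the array returned by predict, as stated in the note.
pred_shape = (max_batchsize * num_batches, label_dim)

# Number of values dumped during training: one evaluation batch of 131072 samples.
batchsize_eval, max_eval_batches = 131072, 1
dumped_values = batchsize_eval * max_eval_batches * label_dim

print(per_device_batchsize)                           # 2048
print(pred_shape)                                     # (131072, 2)
print(pred_shape[0] * pred_shape[1] == dumped_values) # True
```

The per-device value of 2048 is what each inference session reports as its max batch size in the log, and the matching element counts are what make an element-wise comparison with the dumped evaluation results possible.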

import hugectr
from hugectr.inference import InferenceModel, InferenceParams
import numpy as np
from mpi4py import MPI

model_config = "multi_hot.json"
inference_params = InferenceParams(
    model_name = "multi_hot",
    max_batchsize = 16384,
    hit_rate_threshold = 1.0,
    dense_model_file = "multi_hot_dense_1000.model",
    sparse_model_files = ["multi_hot0_sparse_1000.model", "multi_hot1_sparse_1000.model"],
    deployed_devices = [0, 1, 2, 3, 4, 5, 6, 7],
    use_gpu_embedding_cache = True,
    cache_size_percentage = 0.5,
    i64_input_key = True
)
inference_model = InferenceModel(model_config, inference_params)
pred = inference_model.predict(
    8,
    "./multi_hot_parquet/file_list_test.txt",
    hugectr.DataReaderType_t.Parquet,
    hugectr.Check_t.Non,
    [10000, 10000, 10000]
)
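# Load the predictions dumped by export_predictions during training.
ground_truth = np.loadtxt("multi_hot_pred_1000")
print("pred: ", pred)
print("ground_truth: ", ground_truth)
# pred has shape (num_samples, label_dim); flatten it to match the dumped 1-D array.
diff = pred.flatten() - ground_truth
mse = np.mean(diff * diff)
print("mse: ", mse)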
[HCTR][09:01:06.069][WARNING][RK0][main]: default_value_for_each_table.size() is not equal to the number of embedding tables
[HCTR][09:01:06.072][INFO][RK0][main]: Global seed is 3072588155
[HCTR][09:01:06.222][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
  GPU 1 ->  node 0
  GPU 2 ->  node 0
  GPU 3 ->  node 0
  GPU 4 ->  node 1
  GPU 5 ->  node 1
  GPU 6 ->  node 1
  GPU 7 ->  node 1
[HCTR][09:01:23.761][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:01:23.763][INFO][RK0][main]: Start all2all warmup
[HCTR][09:01:23.996][INFO][RK0][main]: End all2all warmup
[HCTR][09:01:24.013][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][09:01:24.013][INFO][RK0][main]: default_emb_vec_value is not specified using default: 0
[HCTR][09:01:24.013][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][09:01:24.013][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][09:01:24.013][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][09:01:24.347][INFO][RK0][main]: Table: hps_et.multi_hot.sparse_embedding1; cached 19849 / 19849 embeddings in volatile database (PreallocatedHashMapBackend); load: 19849 / 18446744073709551615 (0.00%).
[HCTR][09:01:24.622][INFO][RK0][main]: Table: hps_et.multi_hot.sparse_embedding2; cached 9996 / 9996 embeddings in volatile database (PreallocatedHashMapBackend); load: 9996 / 18446744073709551615 (0.00%).
[HCTR][09:01:24.622][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][09:01:24.622][INFO][RK0][main]: Create embedding cache in device 0.
[HCTR][09:01:24.628][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.628][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.628][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.628][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.628][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.641][INFO][RK0][main]: Create embedding cache in device 1.
[HCTR][09:01:24.646][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.646][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.646][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.646][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.646][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.647][INFO][RK0][main]: Create embedding cache in device 2.
[HCTR][09:01:24.652][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.652][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.652][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.652][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.652][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.654][INFO][RK0][main]: Create embedding cache in device 3.
[HCTR][09:01:24.659][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.659][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.659][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.659][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.659][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.662][INFO][RK0][main]: Create embedding cache in device 4.
[HCTR][09:01:24.667][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.667][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.667][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.667][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.667][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.669][INFO][RK0][main]: Create embedding cache in device 5.
[HCTR][09:01:24.675][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.675][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.675][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.675][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.675][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.679][INFO][RK0][main]: Create embedding cache in device 6.
[HCTR][09:01:24.683][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.683][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.683][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.683][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.683][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.685][INFO][RK0][main]: Create embedding cache in device 7.
[HCTR][09:01:24.688][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 0.500000
[HCTR][09:01:24.688][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][09:01:24.688][INFO][RK0][main]: The size of thread pool: 80
[HCTR][09:01:24.688][INFO][RK0][main]: The size of worker memory pool: 2
[HCTR][09:01:24.688][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][09:01:24.768][INFO][RK0][main]: Create inference session on device: 0
[HCTR][09:01:24.768][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:24.768][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:24.768][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:24.768][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:24.768][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:24.768][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:24.768][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:24.768][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:24.768][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:24.768][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:25.520][INFO][RK0][main]: Create inference session on device: 1
[HCTR][09:01:25.520][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:25.520][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:25.520][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:25.520][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:25.520][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:25.520][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:25.520][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:25.520][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:25.520][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:25.520][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:26.275][INFO][RK0][main]: Create inference session on device: 2
[HCTR][09:01:26.275][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:26.275][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:26.275][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:26.275][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:26.275][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:26.275][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:26.275][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:26.275][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:26.275][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:26.275][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:27.035][INFO][RK0][main]: Create inference session on device: 3
[HCTR][09:01:27.035][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:27.035][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:27.035][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:27.035][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:27.035][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:27.035][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:27.035][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:27.035][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:27.035][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:27.035][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:27.781][INFO][RK0][main]: Create inference session on device: 4
[HCTR][09:01:27.781][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:27.781][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:27.781][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:27.781][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:27.781][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:27.781][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:27.781][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:27.781][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:27.781][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:27.781][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:28.534][INFO][RK0][main]: Create inference session on device: 5
[HCTR][09:01:28.534][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:28.534][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:28.534][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:28.534][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:28.534][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:28.534][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:28.534][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:28.534][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:28.534][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:28.534][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:29.291][INFO][RK0][main]: Create inference session on device: 6
[HCTR][09:01:29.291][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:29.291][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:29.291][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:29.291][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:29.291][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:29.291][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:29.291][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:29.291][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:29.291][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:29.291][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:30.037][INFO][RK0][main]: Create inference session on device: 7
[HCTR][09:01:30.037][INFO][RK0][main]: Model name: multi_hot
[HCTR][09:01:30.037][INFO][RK0][main]: Use mixed precision: False
[HCTR][09:01:30.037][INFO][RK0][main]: Use cuda graph: True
[HCTR][09:01:30.037][INFO][RK0][main]: Max batchsize: 2048
[HCTR][09:01:30.037][INFO][RK0][main]: Use I64 input key: True
[HCTR][09:01:30.038][INFO][RK0][main]: start create embedding for inference
[HCTR][09:01:30.038][INFO][RK0][main]: sparse_input name data1
[HCTR][09:01:30.038][INFO][RK0][main]: sparse_input name data2
[HCTR][09:01:30.038][INFO][RK0][main]: create embedding for inference success
[HCTR][09:01:30.038][INFO][RK0][main]: Inference stage skip MultiCrossEntropyLoss layer, replaced by Sigmoid layer
[HCTR][09:01:30.807][INFO][RK0][main]: Create inference data reader on 8 GPU(s)
[HCTR][09:01:30.807][INFO][RK0][main]: num of DataReader workers: 8
[HCTR][09:01:30.915][INFO][RK0][main]: Vocabulary size: 30000

[INFO] Inference time for 8 batches: 0.182527
pred:  [[0.51329887 0.4888402 ]
 [0.55268604 0.62567735]
 [0.48302165 0.5015869 ]
 ...
 [0.52275413 0.46319592]
 [0.46984023 0.5436093 ]
 [0.48216432 0.48920953]]
ground_truth:  [0.513299 0.48884  0.552686 ... 0.543609 0.482164 0.48921 ]
mse:  8.482603947165404e-14
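
An MSE of this size is about what rounding alone would produce. The printed values suggest that the dumped file keeps six decimal places, although that is an inference from the output and not a documented guarantee. A rounding error of that width is roughly uniform, with variance (1e-6)^2 / 12, or about 8.3e-14. The sketch below reproduces the effect with synthetic stand-in predictions; it does not use HugeCTR or the real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in predictions with the same shape as in this notebook: (131072, 2).
pred = rng.uniform(0.0, 1.0, size=(131072, 2))

# Simulate a text dump that keeps six decimal places, flattened in row-major order.
ground_truth = np.round(pred.flatten(), 6)

diff = pred.flatten() - ground_truth
mse = np.mean(diff * diff)
expected = (1e-6) ** 2 / 12  # variance of a uniform error of width 1e-6

print("mse:     ", mse)
print("expected:", expected)

# An absolute tolerance check is an alternative to inspecting the MSE by eye.
matches = np.allclose(pred.flatten(), ground_truth, rtol=0.0, atol=1e-6)
print("within 1e-6:", matches)
```

A result far above this level would point to a real mismatch, such as loading the wrong snapshot or reading the evaluation samples in a different order.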