# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

Deploying Ranking Models with Merlin Systems#

NVIDIA Merlin is an open source framework that accelerates and scales end-to-end recommender system pipelines. The framework consists of several components, including Merlin Core, Merlin Models, NVTabular, and Merlin Systems. Merlin Systems is the focus of this example.

The purpose of the Merlin Systems library is to make it easy for Merlin users to move their recommender systems from development to deployment on Triton Inference Server, an open-source inference serving software that standardizes AI model deployment and execution and delivers fast, scalable AI in production.

Please ensure you have followed the Quick-start for ranking, run the preprocessing.py and ranking.py scripts, and saved the NVTabular preprocessing workflow and the trained ranking model in an accessible location. You also need to follow the instructions in the inference README.

Merlin Systems takes the data preprocessing workflow defined in NVTabular and loads that into Triton Inference Server as a model. Subsequently it does the same for the trained model.
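
Under the hood, this is roughly what the inference.py script does when it exports the ensemble. The sketch below is illustrative only; it assumes workflow is the loaded NVTabular Workflow, model is the trained Merlin Models ranking model, and ensemble_export_path is the directory you will later point Triton at.

from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.workflow import TransformWorkflow

# Chain the preprocessing workflow and the ranking model into one serving graph
serving_ops = (
    workflow.input_schema.column_names
    >> TransformWorkflow(workflow)
    >> PredictTensorflow(model)
)

# Export the graph as Triton models plus an executor model that ties them together
ensemble = Ensemble(serving_ops, workflow.input_schema)
ensemble.export(ensemble_export_path)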

Learning Objectives#

This Jupyter notebook example demonstrates

  • deploying an NVTabular model and a ranking model to Triton Inference Server as an ensemble

  • sending a request to Triton

  • generating prediction results for a given query (a batch)

Starting Triton Inference Server#

After we export the ensemble, we are ready to start the Triton Inference Server. The server is installed in all the Merlin inference containers. If you are not using one of our containers, then ensure it is installed in your environment. For more information, see the Triton Inference Server documentation.

You can start the server by running the following command:

tritonserver --model-repository=<path to the saved ensemble folder>

For the --model-repository argument, specify the same path as the ensemble_export_path that you specified previously when executing the inference.py script.

After you run the tritonserver command, wait until your terminal shows messages like the following example:

I0414 18:29:50.741833 4067 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0414 18:29:50.742197 4067 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0414 18:29:50.783470 4067 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
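
If you prefer to check readiness programmatically instead of watching the logs, you can poll the server with the Triton client. A minimal sketch, assuming the HTTP endpoint is exposed on localhost:8000:

import time
import tritonclient.http as httpclient

# Poll the server for up to 60 seconds until it reports ready
readiness_client = httpclient.InferenceServerClient(url="localhost:8000")
for _ in range(60):
    try:
        if readiness_client.is_server_ready():
            print("Triton is ready.")
            break
    except Exception:
        pass  # server may still be starting up
    time.sleep(1)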

Import libraries.

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"

import cudf
import json
import numpy as np
import pandas as pd
from nvtabular.workflow import Workflow
import tritonclient.grpc as grpcclient
2023-05-16 23:44:57.737454: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.8/dist-packages/merlin/dtypes/mappings/torch.py:43: UserWarning: PyTorch dtype mappings did not load successfully due to an error: No module named 'torch'
  warn(f"PyTorch dtype mappings did not load successfully due to an error: {exc.msg}")

Load the saved NVTabular workflow. We will use the workflow’s input schema below when sending a request to Triton.

input_data_path = os.environ.get("INPUT_FOLDER", "/outputs/dataset/")
workflow_stored_path = os.path.join(input_data_path, "workflow")
workflow = Workflow.load(workflow_stored_path)
workflow.input_schema
name tags dtype is_list is_ragged
0 user_id () DType(name='int32', element_type=<ElementType.... False False
1 item_id () DType(name='int32', element_type=<ElementType.... False False
2 video_category () DType(name='int8', element_type=<ElementType.I... False False
3 gender () DType(name='int8', element_type=<ElementType.I... False False
4 age () DType(name='int8', element_type=<ElementType.I... False False

Load the saved output names as a list.

output_targets_path = 'outputs.json'
with open(output_targets_path, "r") as f:
    outputs = json.load(f)

print(outputs)
['click/binary_output', 'like/binary_output']

We prepare a batch to send as a recommendation request to Triton; the response will contain probability scores for each target column. Since we are serving an ensemble that contains both our NVTabular workflow and the ranking model, we can send raw (not preprocessed) data, and the served NVTabular model will transform it the same way the training data was preprocessed.

Note that in this example we do not build the request from the raw .csv file, because during preprocessing we removed some users and items from the dataset based on the minimum frequencies we set. Instead, we use the raw validation data generated by the train and eval split step to send a request.

batch = cudf.read_parquet(
    os.path.join(input_data_path, "_cache/02/eval/", "part.0.parquet"),
    columns=workflow.input_schema.column_names,
).reset_index(drop=True)
batch = batch.iloc[:10, :]
print(batch)
   user_id  item_id  video_category  gender  age
0    16794   221049               0       2    2
1    23542    61962               0       0    0
2    85886   281786               0       0    0
3     6016    26929               0       4    1
4    66043    30710               0       0    0
5    39752   222908               0       2    1
6     8365   273888               0       0    0
7    73739   280425               0       0    0
8    27552    28110               0       2    1
9    17866    69910               0       2    2

Deploy models on Triton Inference Server#

First we need to ensure that we have a client connected to the server that we started. To do this, we use the Triton HTTP client library.

import tritonclient.http as client

# Create a triton client
try:
    triton_client = client.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))
client created.
# ensure triton is in a good state
triton_client.is_server_live()
triton_client.get_model_repository_index()
GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
POST /v2/repository/index, headers None

<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '191'}>
bytearray(b'[{"name":"0_transformworkflowtriton","version":"1","state":"READY"},{"name":"1_predicttensorflowtriton","version":"1","state":"READY"},{"name":"executor_model","version":"1","state":"READY"}]')
[{'name': '0_transformworkflowtriton', 'version': '1', 'state': 'READY'},
 {'name': '1_predicttensorflowtriton', 'version': '1', 'state': 'READY'},
 {'name': 'executor_model', 'version': '1', 'state': 'READY'}]

Now that our server is running, we can send requests to it. In the code below, we build a request from the batch and send it to Triton with the send_triton_request utility.

from merlin.systems.triton.utils import send_triton_request
response = send_triton_request(workflow.input_schema, batch, outputs)

Print out the response.

print(response)
{'click/binary_output': array([[0.50231797],
       [0.50405663],
       [0.50262684],
       [0.5003805 ],
       [0.50613105],
       [0.4995402 ],
       [0.5027875 ],
       [0.5036676 ],
       [0.4998571 ],
       [0.5052081 ]], dtype=float32), 'like/binary_output': array([[0.49693626],
       [0.49303743],
       [0.49347958],
       [0.49609515],
       [0.4981295 ],
       [0.49890146],
       [0.49202597],
       [0.49149314],
       [0.5004128 ],
       [0.49684843]], dtype=float32)}

The response consists of a probability value for each row in the batch request for each target, i.e., click and like.
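
For reference, the same request can also be built by hand with the Triton gRPC client we imported earlier, instead of the send_triton_request helper. The sketch below is illustrative; it assumes the ensemble is served under the name executor_model (as shown in the repository index above), that the gRPC endpoint is on localhost:8001, and that each scalar input column is sent as an N x 1 tensor:

# Map the batch's numpy dtypes to Triton datatype strings
triton_dtypes = {"int32": "INT32", "int8": "INT8"}

batch_pd = batch.to_pandas()
triton_inputs = []
for col in workflow.input_schema.column_names:
    values = batch_pd[col].to_numpy().reshape(-1, 1)
    triton_input = grpcclient.InferInput(col, list(values.shape), triton_dtypes[str(values.dtype)])
    triton_input.set_data_from_numpy(values)
    triton_inputs.append(triton_input)

requested_outputs = [grpcclient.InferRequestedOutput(name) for name in outputs]

grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")
result = grpc_client.infer("executor_model", triton_inputs, outputs=requested_outputs)
print(result.as_numpy("click/binary_output"))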

Summary#

Congratulations on completing this quick start guide example series!

In this quick start example series, you preprocessed and transformed the data with NVTabular, trained a single-task or multi-task ranking model with Merlin Models, and finally deployed these models on Triton Inference Server.