# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
Triton for Recommender Systems
NVIDIA Triton Inference Server (TIS) simplifies the deployment of AI models at scale in production. It allows us to deploy and serve our model for inference, and it supports a number of machine learning frameworks such as TensorFlow and PyTorch.
The last step of a machine learning (ML)/deep learning (DL) pipeline is to deploy the ETL workflow and the saved model to production. In the production setting, we want to transform the input data exactly as we did during training (ETL): we need to apply the same mean/std to continuous features and the same categorical mapping to convert categories to continuous integers before the DL model makes a prediction. Therefore, we deploy the NVTabular workflow together with the PyTorch model as an ensemble model to Triton Inference Server. The ensemble model guarantees that the same transformations are applied to the raw inputs.
Objectives:
Learn how to deploy a model to Triton
Deploy saved NVTabular and PyTorch models to Triton Inference Server
Send requests for predictions
Pull and start the inference docker container
At this point, before connecting to the Triton Server, we launch the inference docker container and then load the exported ensemble t4r_pytorch to the inference server. This is done with the scripts below:
Launch the docker container:
docker run -it --gpus device=0 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <path_to_saved_models>:/workspace/models/ nvcr.io/nvidia/merlin/merlin-inference:21.09
This script mounts your local model repository folder, which contains your saved models, to the /workspace/models directory in the merlin-inference docker container.
Start Triton server
After you have started the merlin-inference container, you can start the Triton server with the command below. You need to provide the correct path to the models folder.
tritonserver --model-repository=<path_to_models> --model-control-mode=explicit
Note: The model-repository path for our example is /workspace/models. Because the server is started with explicit model control, the models haven't been loaded yet. Below, we will request the Triton server to load the saved ensemble model.
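If you want to verify from Python that the server has come up before moving on, a minimal polling snippet like the following works. This is an optional addition, not part of the original notebook; it only reuses the client calls shown later in section 1.3.
import time
import tritonhttpclient

client = tritonhttpclient.InferenceServerClient(url="localhost:8000")
for _ in range(30):
    try:
        # is_server_live() returns True once Triton answers the /v2/health/live endpoint
        if client.is_server_live():
            print("Triton is live")
            break
    except Exception:
        pass  # server not reachable yet
    time.sleep(1)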
1. Deploy PyTorch and NVTabular Model to Triton Inference Server
Our Triton server has already been launched and is ready to receive requests. Remember that we already exported the saved PyTorch model in the previous notebook and generated the config files for Triton Inference Server.
# Import dependencies
import os
from time import time
import argparse
import numpy as np
import pandas as pd
import sys
import cudf
1.2 Review exported files
Triton expects a specific directory structure for our models, in the following format:
<model-name>/
[config.pbtxt]
<version-name>/
[model.savedmodel]/
<pytorch_saved_model_files>/
...
Let's check out our model repository layout. You can install the tree utility with apt-get install tree and then run !tree /workspace/models/ to print out the model repository layout as below:
├── t4r_pytorch
│ ├── 1
│ └── config.pbtxt
├── t4r_pytorch_nvt
│ ├── 1
│ │ ├── model.py
│ │ ├── __pycache__
│ │ │ └── model.cpython-38.pyc
│ │ └── workflow
│ │ ├── categories
│ │ │ ├── cat_stats.category_id.parquet
│ │ │ ├── unique.brand.parquet
│ │ │ ├── unique.category_code.parquet
│ │ │ ├── unique.category_id.parquet
│ │ │ ├── unique.event_type.parquet
│ │ │ ├── unique.product_id.parquet
│ │ │ ├── unique.user_id.parquet
│ │ │ └── unique.user_session.parquet
│ │ ├── metadata.json
│ │ └── workflow.pkl
│ └── config.pbtxt
└── t4r_pytorch_pt
├── 1
│ ├── model_info.json
│ ├── model.pkl
│ ├── model.pth
│ ├── model.py
│ └── __pycache__
│ └── model.cpython-38.pyc
└── config.pbtxt
Triton needs a config file to understand how to interpret the model. Let's look at the generated config file. It defines the input columns with their datatypes and dimensions, as well as the output layer. Manually creating this config file can be complicated, so NVTabular generates it for us with the export_pytorch_ensemble() function, which we used in the previous notebook.
The config file needs the following information:
name: The name of our model. Must be the same name as the parent folder.
platform: The type of framework serving the model.
input: The inputs our model expects.
    name: Should correspond with the model input name.
    data_type: Should correspond to the input's data type.
    dims: The dimensions of the request for the input. For models that support input and output tensors with variable-size dimensions, those dimensions can be listed as -1 in the input and output configuration.
output: The output parameters of our model.
    name: Should correspond with the model output name.
    data_type: Should correspond to the output's data type.
    dims: The dimensions of the output.
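If you want to see these fields in the generated file itself, you can simply print one of the configs from the repository (the path below assumes the model repository layout printed above):
# Inspect the generated Triton config for the NVTabular workflow model
with open("/workspace/models/t4r_pytorch_nvt/config.pbtxt", "r") as f:
    print(f.read())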
1.3. Loading Model
Next, let's build a client to connect to our server. The InferenceServerClient object is what we'll be using to talk to Triton.
import tritonhttpclient
try:
    triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)
    print("client created.")
except Exception as e:
    print("channel creation failed: " + str(e))

triton_client.is_server_live()
client created.
GET /v2/health/live, headers None
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
True
triton_client.get_model_repository_index()
POST /v2/repository/index, headers None
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '201'}>
bytearray(b'[{"name":"t4r_pytorch","version":"1","state":"UNAVAILABLE","reason":"unloaded"},{"name":"t4r_pytorch_nvt","version":"1","state":"UNLOADING"},{"name":"t4r_pytorch_pt","version":"1","state":"UNLOADING"}]')
[{'name': 't4r_pytorch',
'version': '1',
'state': 'UNAVAILABLE',
'reason': 'unloaded'},
{'name': 't4r_pytorch_nvt', 'version': '1', 'state': 'UNLOADING'},
{'name': 't4r_pytorch_pt', 'version': '1', 'state': 'UNLOADING'}]
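The states above (UNAVAILABLE, UNLOADING) show that the models are known to the server but not currently loaded. If you prefer to check this programmatically instead of reading the raw response, you can filter the repository index (a small illustrative snippet, not part of the original notebook):
# List the models that are currently in the READY state (none yet, since nothing is loaded)
index = triton_client.get_model_repository_index()
print([m["name"] for m in index if m.get("state") == "READY"])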
We load the ensemble model:
model_name = "t4r_pytorch"
triton_client.load_model(model_name=model_name)
POST /v2/repository/models/t4r_pytorch/load, headers None
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 't4r_pytorch'
If all models are loaded successfully, you should see the status "successfully loaded" next to each model name in your terminal.
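Before sending requests, you can also confirm programmatically that the ensemble is ready, reusing the same triton_client (is_model_ready is part of the Triton HTTP client API):
# Returns True once the ensemble model and its dependencies are loaded and ready
triton_client.is_model_ready(model_name)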
2. Send Requests for Predictions
Load raw data for inference: we select the first 50 interactions and filter out sessions with fewer than 2 interactions. For this tutorial, as an example, we use the Oct-2019 dataset that we also used for model training.
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", "/workspace/data/")

# Read the raw interactions and sort them by event timestamp
df = cudf.read_parquet(os.path.join(INPUT_DATA_DIR, 'Oct-2019.parquet'))
df = df.sort_values('event_time_ts')
batch = df.iloc[:50, :]

# Keep only sessions with more than one interaction
sessions_to_use = batch.user_session.value_counts()
filtered_batch = batch[batch.user_session.isin(sessions_to_use[sessions_to_use.values > 1].index.values)]
filtered_batch.head()
|  | user_session | event_type | product_id | category_id | category_code | brand | price | user_id | event_time_ts | prod_first_event_time_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3562914 | 1637332 | view | 1307067 | 2053013558920217191 | computers.notebook | lenovo | 251.74 | 550050854 | 1569888001 | 1569888001 |
| 5173328 | 4202155 | view | 1004237 | 2053013555631882655 | electronics.smartphone | apple | 1081.98 | 535871217 | 1569888004 | 1569888004 |
| 3741261 | 1808164 | view | 1480613 | 2053013561092866779 | computers.desktop | pulser | 908.62 | 512742880 | 1569888005 | 1569888005 |
| 4996937 | 3794756 | view | 31500053 | 2053013558031024687 | <NA> | luminarc | 41.16 | 550978835 | 1569888008 | 1569888008 |
| 5589259 | 5470852 | view | 28719074 | 2053013565480109009 | apparel.shoes.keds | baden | 102.71 | 520571932 | 1569888010 | 1569888010 |
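As a quick sanity check (a small addition for illustration, not in the original notebook), we can confirm how many qualifying sessions ended up in the filtered batch:
# Number of sessions with at least two interactions, and the total number of rows sent for inference
print(filtered_batch.user_session.nunique(), "sessions,", len(filtered_batch), "rows")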
import warnings
warnings.filterwarnings("ignore")
import nvtabular.inference.triton as nvt_triton
import tritonclient.grpc as grpcclient
inputs = nvt_triton.convert_df_to_triton_input(filtered_batch.columns, filtered_batch, grpcclient.InferInput)
output_names = ["output"]

outputs = []
for col in output_names:
    outputs.append(grpcclient.InferRequestedOutput(col))

MODEL_NAME_NVT = "t4r_pytorch"

with grpcclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer(MODEL_NAME_NVT, inputs)
    print(col, ':\n', response.as_numpy(col))
output :
[[-12.86381 -13.449438 -9.572359 ... -12.689846 -13.033402
-13.294905 ]
[-24.320768 -26.130745 -4.3342614 ... -24.07727 -25.470228
-26.27378 ]
[-22.867298 -24.897617 -6.6269407 ... -23.640343 -23.620872
-24.977371 ]
[-21.455946 -22.92965 -4.8912797 ... -21.020473 -22.514032
-22.958193 ]
[-24.569319 -26.149971 -4.223791 ... -24.316437 -25.649946
-26.920403 ]
[-14.218529 -14.833358 -8.438756 ... -14.013732 -14.700138
-14.71361 ]]
Visualise top-k predictions
from transformers4rec.torch.utils.examples_utils import visualize_response
visualize_response(filtered_batch, response, top_k=5, session_col='user_session')
- Top-5 predictions for session `1167651`: 1045 || 229 || 233 || 1085 || 10
- Top-5 predictions for session `1637332`: 11 || 7 || 4 || 2 || 3
- Top-5 predictions for session `1808164`: 162 || 142 || 226 || 80 || 200
- Top-5 predictions for session `3794756`: 3 || 2 || 26 || 364 || 10
- Top-5 predictions for session `4202155`: 2 || 57 || 36 || 38 || 10
- Top-5 predictions for session `5470852`: 1710 || 233 || 805 || 555 || 10
As you can see, we first obtained the prediction results (logits) from the trained model head, and then, using the handy visualize_response util function, we extracted the top-k encoded item ids from the logits. In other words, we generated recommended items for each session in the batch.
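If you prefer to work with the raw logits directly, the same top-k encoded item ids can be recovered with a plain numpy argsort (an illustrative sketch; visualize_response additionally maps each row back to its session):
# Recover the top-5 encoded item ids per row directly from the logits
logits = response.as_numpy("output")
top_k_items = np.argsort(-logits, axis=1)[:, :5]
print(top_k_items)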
This is the end of the tutorial. You successfully …
performed feature engineering with NVTabular
trained transformer-based, session-based recommendation models with Transformers4Rec
deployed a trained model to Triton Inference Server, sent requests, and got responses from the server.
Unload models and shut down the kernel
triton_client.unload_model(model_name="t4r_pytorch")
triton_client.unload_model(model_name="t4r_pytorch_nvt")
triton_client.unload_model(model_name="t4r_pytorch_pt")
POST /v2/repository/models/t4r_pytorch/unload, headers None
{"parameters":{"unload_dependents":false}}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 't4r_pytorch'
POST /v2/repository/models/t4r_pytorch_nvt/unload, headers None
{"parameters":{"unload_dependents":false}}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 't4r_pytorch_nvt'
POST /v2/repository/models/t4r_pytorch_pt/unload, headers None
{"parameters":{"unload_dependents":false}}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 't4r_pytorch_pt'
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
{'status': 'ok', 'restart': True}