# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

Scaling Criteo: Triton Inference with Merlin Models TensorFlow#

This notebook is created using the latest stable merlin-tensorflow container.

Overview#

In the previous notebook, we processed the Criteo dataset with NVTabular and trained a DLRM model with Merlin Models TensorFlow. Finally, we want to deploy our pipeline to Triton Inference Server (TIS), which can serve our model in a production environment.

We can send raw data to the API endpoint. TIS will execute the same NVTabular workflow for feature engineering and generate predictions for the processed data with the Merlin Models TensorFlow model. We deploy the pipeline as an ensemble and receive the prediction scores. This notebook is based on the example Serving Ranking Models With Merlin Systems in Merlin Systems. If you are interested in more details, we recommend going through that example first.

Learning objectives#

  • Deploy an ensemble pipeline of NVTabular and Merlin Models TensorFlow to Triton Inference Server

  • Get prediction from Triton Inference Server

Saved NVTabular workflow and Merlin Models#

We load the required libraries.

import os
import glob
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf

import tritonclient.grpc as grpcclient
from nvtabular.workflow import Workflow
# The following import is required even though it is not referenced in the program.
# It loads artifacts that affect the schema and how the model is saved on disk.
import merlin.models.tf as mm  # noqa: F401
from merlin.schema.tags import Tags
from merlin.systems.dag.ops.workflow import TransformWorkflow
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ensemble import Ensemble
from merlin.systems.triton import convert_df_to_triton_input
from merlin.core.dispatch import get_lib

We define the path for the saved workflow and model.

BASE_DIR = os.environ.get("BASE_DIR", '/raid/data/criteo/test_dask/output/')
original_data_path = os.environ.get("INPUT_FOLDER", "/raid/data/criteo/converted/criteo")
input_path = BASE_DIR
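
As an optional sanity check (an addition to this notebook), we can verify that the saved workflow and model directories produced by the previous notebooks exist at these paths:

# Optional: verify that the saved NVTabular workflow and DLRM model exist.
# The directory names match the paths loaded below.
assert os.path.isdir(os.path.join(input_path, "workflow"))
assert os.path.isdir(os.path.join(input_path, "dlrm"))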

We load the NVTabular workflow.

workflow = Workflow.load(os.path.join(input_path, "workflow"))

We need to remove the target columns from the workflow. The target columns are required to train our model. However, we do not know the targets during inference in the production environment.

label_columns = workflow.output_schema.select_by_tag(Tags.TARGET).column_names
workflow.remove_inputs(label_columns)
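
As a quick check (an addition to this notebook), we can confirm that the target columns are no longer required as inputs to the workflow:

# Optional sanity check: the removed label columns should no longer be
# required as workflow inputs during inference.
for col in label_columns:
    assert col not in workflow.input_schema.column_names
print(workflow.output_schema.column_names)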

We load the saved Merlin Models TensorFlow model.

model = tf.keras.models.load_model(os.path.join(input_path, "dlrm"))

Deploying Ensemble to Triton Inference Server#

We create our prediction pipeline:

  • the NVTabular workflow is executed via TransformWorkflow()

  • the TensorFlow model generates predictions from the output of the NVTabular workflow

serving_operators = (
    workflow.input_schema.column_names >>
    TransformWorkflow(workflow) >>
    PredictTensorflow(model)
)

We create the Ensemble graph.

ensemble = Ensemble(serving_operators, workflow.input_schema)

We generate the Triton Inference Server artifacts and export them in the export_path directory.

export_path = os.path.join(input_path, "ensemble")
ens_conf, node_confs = ensemble.export(export_path)
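
To see what was written to disk, we can list the top level of the exported model repository (an optional addition; the exact directory names depend on the Merlin Systems version):

# Optional: each subdirectory is one Triton model (the NVTabular workflow,
# the TensorFlow model, and the ensemble/executor model), each containing a
# config.pbtxt and a numbered version folder.
for entry in sorted(os.listdir(export_path)):
    print(entry)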

After we export the ensemble, we are ready to start the Triton Inference Server. The server is installed in the merlin-tensorflow container. If you are not using one of our containers, ensure that it is installed in your environment. For more information, see the Triton Inference Server documentation.

You can start the server by running the following command:

tritonserver --model-repository=/raid/data/criteo/test_dask/output/ensemble --backend-config=tensorflow,version=2

For the --model-repository argument, specify the same value as the export_path that you specified previously in the ensemble.export method.
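
Before sending requests, you can optionally verify from Python that the server is up and the ensemble is loaded. This sketch assumes the default gRPC port 8001 and that the ensemble is exposed as "executor_model", the model name used below:

# Optional readiness check (assumes the default gRPC port 8001 and the
# "executor_model" model name used later in this notebook).
with grpcclient.InferenceServerClient("localhost:8001") as client:
    print("server live: ", client.is_server_live())
    print("server ready:", client.is_server_ready())
    print("model ready: ", client.is_model_ready("executor_model"))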

Get prediction from Triton Inference Server#

After Triton Inference Server has started and loaded all models, we can send raw data as a request and receive the predictions.

We read three example rows from the last Parquet file of the raw data.

df_lib = get_lib()
input_cols = workflow.input_schema.column_names
# read in data for request
batch = df_lib.read_parquet(
    sorted(glob.glob(original_data_path + "/*.parquet"))[-1],
    num_rows=3,
    columns=input_cols
)
batch
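
As an optional check (an addition to this notebook), we can count the missing values in the sampled rows; these are the values we fill with 0 in the workaround below:

# Optional: count missing values per column in the sampled batch.
# These NA values are filled with 0 before building the request.
print(batch.isna().sum())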

We generate a Triton Inference Server request object.

Currently, NA/None values are not supported for int32 columns. As a workaround, we will fill NA values with 0. This will be updated in the future.

# create inputs and outputs
inputs = convert_df_to_triton_input(workflow.input_schema, batch.fillna(0), grpcclient.InferInput)
output_cols = ensemble.output_schema.column_names
outputs = [
    grpcclient.InferRequestedOutput(col)
    for col in output_cols
]

We send the request to Triton Inference Server.

# send request to tritonserver
with grpcclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer("executor_model", inputs, request_id="1", outputs=outputs)

We print out the predictions.

for col in ensemble.output_schema.column_names:
    print(col, response.as_numpy(col), response.as_numpy(col).shape)

Summary#

In this example, we deployed a recommender system pipeline as an ensemble. First, NVTabular transformed the raw data into features, and then the Merlin Models TensorFlow model, using the DLRM architecture, generated predictions from the transformed data. Deploying both steps as one ensemble ensures that the training and production environments use the same feature engineering.

Next steps#

If you are interested in more details of the pipeline, we recommend trying out the Merlin Systems example.

In our Merlin repository, we provide another end-to-end example that uses a candidate retrieval model and a ranking model. In addition, it uses an approximate nearest neighbor index and a feature store.