# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ================================

Taking the Next Step with Merlin Models: Define Your Own Architecture

This notebook is created using the latest stable merlin-tensorflow container.

In Iterating over Deep Learning Models using Merlin Models, we conducted a benchmark of standard and deep learning-based ranking models provided by the high-level Merlin Models API. The library also includes the standard components of deep learning that let recsys practitioners and researchers to define custom models, train and export them for inference.

In this example, we combine pre-existing blocks and demonstrate how to create the DLRM architecture.

Learning objectives

Understand the building blocks of Merlin Models
Define a model architecture from scratch

Introduction to Merlin-models core building blocks

The Block is the core abstraction in Merlin Models and is the class from which all blocks inherit. The class extends the tf.keras.layers.Layer base class and implements a number of properties that simplify the creation of custom blocks and models. These properties include the Schema object for determining the embedding dimensions, input shapes, and output shapes. Additionally, the Block has a ModelContext instance to store and retrieve public variables and share them with other blocks in the same model as additional meta-data.

Before deep-diving into the definition of the DLRM architecture, let’s start by listing the core components you need to know to define a model from scratch:

Features Blocks

They include input blocks to process various inputs based on their types and shapes. Merlin Models supports three main blocks:

EmbeddingFeatures: Input block for embedding-lookups for categorical features.
SequenceEmbeddingFeatures: Input block for embedding-lookups for sequential categorical features (3D tensors).
ContinuousFeatures: Input block for continuous features.

Transformations Blocks

They include various operators commonly used to transform tensors in various parts of the model, such as:

AsDenseFeatures: It takes a dictionary of raw input tensors and transforms the sparse tensors into dense tensors.
L2Norm: It takes a single or a dictionary of hidden tensors and applies an L2-normalization along a given axis.
LogitsTemperatureScaler: It scales the output tensor of predicted logits to lower the model’s confidence.

Aggregations Blocks

They include common aggregation operations to combine multiple tensors, such as:

ConcatFeatures: Concatenate dictionary of tensors along a given dimension.
StackFeatures: Stack dictionary of tensors along a given dimension.
CosineSimilarity: Calculate the cosine similarity between two tensors.

Connects Methods

The base class Block implements different connects methods that control how to link a given block to other blocks:

connect: Connect the block to other blocks sequentially. The output is a tensor returned by the last block.
connect_branch: Link the block to other blocks in parallel. The output is a dictionary containing the output tensor of each block.
connect_with_shortcut: Connect the block to other blocks sequentially and apply a skip connection with the block’s output.
connect_with_residual: Connect the block to other blocks sequentially and apply a residual sum with the block’s output.

Prediction Tasks

Merlin Models introduces the PredictionTask layer that defines the necessary blocks and transformation operations to compute the final prediction scores. It also provides the default loss and metrics related to the given prediction task.
Merlin Models supports the core tasks: BinaryClassificationTask, MultiClassClassificationTask, andRegressionTask. In addition to the preceding tasks, Merlin Models provides tasks that are specific to recommender systems: NextItemPredictionTask, and ItemRetrievalTask.

Implement the DLRM model with MovieLens-1M data

Now that we have introduced the core blocks of Merlin Models, let’s take a look at how we can combine them to define the DLRM architecture:

import tensorflow as tf
import merlin.models.tf as mm

from merlin.datasets.entertainment import get_movielens
from merlin.schema.tags import Tags

2022-04-12 17:57:26.504390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 24570 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:65:00.0, compute capability: 8.6

We use the get_movielens function to download, extract, and preprocess the MovieLens 1M dataset:

train, valid = get_movielens(variant="ml-1m")

WARNING:tensorflow:From /models/merlin/models/utils/nvt_utils.py:14: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.

WARNING:tensorflow:From /models/merlin/models/utils/nvt_utils.py:14: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2022-04-12 17:57:27.081082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /device:GPU:0 with 24570 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:65:00.0, compute capability: 8.6
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(

We display the first five rows of the validation data and use them to check the outputs of each building block:

valid.head()

	userId	movieId	title	genres	gender	age	occupation	zipcode	TE_age_rating	TE_gender_rating	TE_occupation_rating	TE_zipcode_rating	TE_movieId_rating	TE_userId_rating	rating_binary	rating
0	178	60	60	[3, 7, 14]	2	1	8	178	-0.520464	1.792874	-0.076353	-0.251986	-0.320740	-0.461858	1	4.0
1	81	1408	1409	[1]	1	5	11	240	1.955704	-0.537666	1.541820	1.849453	0.161224	1.619103	1	4.0
2	183	349	352	[1, 9]	1	1	4	58	-0.561167	-0.602045	-0.140828	0.369887	-0.701068	-0.095035	0	3.0
3	153	1310	1311	[2, 6]	1	1	10	338	-0.535551	-0.506479	0.173980	0.671975	-0.082473	0.599116	1	4.0
4	297	1491	1496	[5, 4]	2	1	11	408	-0.523482	1.630173	1.541820	-0.721210	-3.000164	-0.781899	0	1.0

We convert the first five rows of the valid dataset to a batch of input tensors:

batch = mm.sample_batch(valid, batch_size=5, shuffle=False, include_targets=False)
batch["userId"]

<tf.Tensor: shape=(5, 1), dtype=int32, numpy=
array([[178],
       [ 81],
       [183],
       [153],
       [297]], dtype=int32)>

Define the inputs block

For the sake of simplicity, let’s create a schema with a subset of the following continuous and categorical features:

sub_schema = train.schema.select_by_name(
    [
        "userId",
        "movieId",
        "title",
        "gender",
        "TE_zipcode_rating",
        "TE_movieId_rating",
        "rating_binary",
    ]
)

We define the continuous layer based on the schema:

continuous_block = mm.ContinuousFeatures.from_schema(sub_schema, tags=Tags.CONTINUOUS)

We display the output tensor of the continuous block by using the data from the first batch. We can see the raw tensors of the continuous features:

continuous_block(batch)

{'TE_zipcode_rating': <tf.Tensor: shape=(5, 1), dtype=float32, numpy=
 array([[-0.25198567],
        [ 1.8494534 ],
        [ 0.36988667],
        [ 0.67197526],
        [-0.7212096 ]], dtype=float32)>,
 'TE_movieId_rating': <tf.Tensor: shape=(5, 1), dtype=float32, numpy=
 array([[-0.3207402 ],
        [ 0.16122401],
        [-0.70106816],
        [-0.08247337],
        [-3.0001638 ]], dtype=float32)>}

We connect the continuous block to a MLPBlock instance to project them into the same dimensionality as the embedding width of categorical features:

deep_continuous_block = continuous_block.connect(mm.MLPBlock([64]))
deep_continuous_block(batch).shape

2022-04-12 17:57:48.227735: I tensorflow/stream_executor/cuda/cuda_blas.cc:1792] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

TensorShape([5, 64])

We define the categorical embedding block based on the schema:

embedding_block = mm.EmbeddingFeatures.from_schema(sub_schema)

We display the output tensor of the categorical embedding block using the data from the first batch. We can see the embeddings tensors of categorical features with a default dimension of 64:

embeddings = embedding_block(batch)
embeddings.keys(), embeddings["userId"].shape

(dict_keys(['userId', 'movieId', 'title', 'gender']), TensorShape([5, 64]))

Let’s store the continuous and categorical representations in a single dictionary using a ParallelBlock instance:

dlrm_input_block = mm.ParallelBlock(
    {"embeddings": embedding_block, "deep_continuous": deep_continuous_block}
)
print("Output shapes of DLRM input block:")
for key, val in dlrm_input_block(batch).items():
    print("\t%s : %s" % (key, val.shape))

Output shapes of DLRM input block:
	userId : (5, 64)
	movieId : (5, 64)
	title : (5, 64)
	gender : (5, 64)
	deep_continuous : (5, 64)

By looking at the output, we can see that the ParallelBlock class applies embedding and continuous blocks, in parallel, to the same input batch. Additionally, it merges the resulting tensors into one dictionary.

Define the interaction block

Now that we have a vector representation of each input feature, we will create the DLRM interaction block. It consists of three operations:

Apply a dot product between all continuous and categorical features to learn pairwise interactions.
Concat the resulting pairwise interaction with the deep representation of conitnuous features (skip-connection).
Apply an MLPBlock with a series of dense layers to the concatenated tensor.

First, we use the connect_with_shortcut method to create first two operations of the DLRM interaction block:

from merlin.models.tf.blocks.dlrm import DotProductInteractionBlock

dlrm_interaction = dlrm_input_block.connect_with_shortcut(
    DotProductInteractionBlock(), shortcut_filter=mm.Filter("deep_continuous"), aggregation="concat"
)

The Filter operation allows us to select the deep_continuous tensor from the dlrm_input_block outputs.

The following diagram provides a visualization of the operations that we constructed in the dlrm_interaction object.

dlrm_interaction(batch)

<tf.Tensor: shape=(5, 2080), dtype=float32, numpy=
array([[ 0.03531839,  0.        ,  0.02178912, ...,  0.00348584,
         0.01123738,  0.05082896],
       [ 0.        ,  0.06999855,  0.38183114, ...,  0.02661334,
         0.00329179, -0.0324194 ],
       [ 0.03445464,  0.        ,  0.25753298, ..., -0.0443273 ,
         0.08484615, -0.04135836],
       [ 0.        ,  0.        ,  0.17358088, ..., -0.0163713 ,
         0.02033711, -0.03035038],
       [ 0.25441766,  0.        ,  0.5767709 , ...,  0.01078878,
        -0.02322949,  0.04039076]], dtype=float32)>

Then, we project the learned interaction using a series of dense layers:

deep_dlrm_interaction = dlrm_interaction.connect(mm.MLPBlock([64, 128, 512]))
deep_dlrm_interaction(batch)

<tf.Tensor: shape=(5, 512), dtype=float32, numpy=
array([[0.00196931, 0.        , 0.01411253, ..., 0.00167978, 0.00330653,
        0.        ],
       [0.00134648, 0.02019053, 0.04212135, ..., 0.00931738, 0.        ,
        0.        ],
       [0.00061063, 0.00702857, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00954365, 0.        , 0.00239623, ..., 0.        , 0.        ,
        0.        ],
       [0.01073698, 0.04097259, 0.        , ..., 0.00655706, 0.01244057,
        0.        ]], dtype=float32)>

Define the Prediction block

At this stage, we have created the DLRM block that accepts a dictionary of categorical and continuous tensors as input. The output of this block is the interaction representation vector of shape 512. The next step is to use this hidden representation to conduct a given prediction task. In our case, we use the label rating_binary and the objective is: to predict if a user A will give a high rating to a movie B or not.

We use the BinaryClassificationTask class and evaluate the performances using the AUC metric. We also use the LogitsTemperatureScaler block as a pre-transformation operation that scales the logits returned by the task before computing the loss and metrics:

from merlin.models.tf.core.transformations import LogitsTemperatureScaler

binary_task = mm.BinaryClassificationTask(
    sub_schema,
    pre=LogitsTemperatureScaler(temperature=2),
)

Define, train, and evaluate the final DLRM Model

We connect the deep DLRM interaction to the binary task and the method automatically generates the Model class for us. We note that the Model class inherits from tf.keras.Model class:

model = mm.Model(deep_dlrm_interaction, binary_task)
type(model)

We train the model using the built-in tf.keras fit method:

model.compile(optimizer="adam", metrics=[tf.keras.metrics.AUC()])
model.fit(train, batch_size=1024, epochs=1)

Let’s check out the model evaluation scores:

metrics = model.evaluate(valid, batch_size=1024, return_dict=True)
metrics

2022-04-12 17:58:03.691971: W tensorflow/core/grappler/optimizers/loop_optimizer.cc:907] Skipping loop optimization for Merge node with control input: cond/then/_0/cond/cond/branch_executed/_128

196/196 [==============================] - 3s 8ms/step - rating_binary/binary_classification_task/auc: 0.7464 - loss: 2.2079 - regularization_loss: 0.0000e+00 - total_loss: 2.2079

{'rating_binary/binary_classification_task/auc': 0.746394157409668,
 'loss': 2.0516271591186523,
 'regularization_loss': 0.0,
 'total_loss': 2.0516271591186523}

Note that the evaluate() progress bar shows the loss score for every batch, whereas the final loss stored in the dictionary represents the total loss across all batches.

Save the model so we can use it for serving predictions in production or for resuming training with new observations:

model.save("custom_dlrm")

WARNING:absl:Function `_wrapped_model` contains input name(s) TE_age_rating, TE_gender_rating, TE_movieId_rating, TE_occupation_rating, TE_userId_rating, TE_zipcode_rating, movieId, userId with unsupported characters which will be renamed to te_age_rating, te_gender_rating, te_movieid_rating, te_occupation_rating, te_userid_rating, te_zipcode_rating, movieid, userid in the SavedModel.
WARNING:absl:Found untraced functions such as sequential_block_9_layer_call_fn, sequential_block_9_layer_call_and_return_conditional_losses, binary_classification_task_layer_call_fn, binary_classification_task_layer_call_and_return_conditional_losses, sequential_block_9_layer_call_fn while saving (showing 5 of 155). These functions will not be directly callable after loading.

INFO:tensorflow:Assets written to: custom_dlrm/assets

INFO:tensorflow:Assets written to: custom_dlrm/assets

Conclusion

Merlin Models provides common and state-of-the-art RecSys architectures in a high-level API as well as all the required low-level building blocks for you to create your own architecture (input blocks, MLP layers, prediction tasks, loss functions, etc.). In this example, we explored a subset of these pre-existing blocks to create the DLRM model, but you can view our documentation to discover more. You can also contribute to the library by submitting new RecSys architectures and custom building Blocks.

Next steps

To learn more about how to deploy the trained DLRM model, please visit Merlin Systems library and execute the Serving-Ranking-Models-With-Merlin-Systems.ipynb notebook that deploys an ensemble of a NVTabular Workflow and a trained model from Merlin Models to Triton Inference Server.