# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
Getting Started with NVTabular: Process Tabular Data On GPU#
This notebook is created using the latest stable merlin-tensorflow container.
Overview#
Merlin NVTabular is a library for processing tabular data. It lets data scientists and ML engineers easily process data, leveraging custom operators designed specifically for machine learning workflows. The processing is carried out on the GPU with best practices baked into the library. Running on the GPU translates to faster iteration cycles and, thanks to leveraging dask, enables working on arbitrarily large datasets. NVTabular is part of the Merlin open source framework, which allows for a seamless transition to working with your preprocessed data using the numerous other Merlin libraries, including ones for model construction and serving.
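As a quick illustration of what that means in practice, here is a minimal sketch (the `./data/` directory and the partition size are hypothetical, not part of this notebook) of pointing NVTabular at a collection of Parquet files on disk; the files are read lazily in dask partitions, so the full dataset never needs to fit in GPU memory at once:

```python
import nvtabular as nvt

# Hypothetical path: a directory of Parquet files larger than GPU memory.
# NVTabular/dask reads it in partitions of roughly the requested size.
dataset = nvt.Dataset("./data/", engine="parquet", part_size="256MB")
```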
Training a machine learning model often requires preprocessing data and engineering features. In this example, we want to train a neural network with embedding layers based on two categorical features: `userId` and `movieId`. Embedding layers require that the categorical features are continuous integers. In this example, we will show how to use the `Categorify` operator to transform the categorical features for training a model.
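To see why this matters, recall that an embedding layer is essentially a lookup table with `input_dim` rows, so every id has to fall in `[0, input_dim)`; a raw id larger than the number of distinct movies would index past the end of the table. A minimal TensorFlow sketch (the sizes below are illustrative, not taken from this dataset):

```python
import tensorflow as tf

num_movies = 62_000  # illustrative: number of distinct movies after encoding
embedding = tf.keras.layers.Embedding(input_dim=num_movies, output_dim=8)

# Works: ids produced by Categorify are continuous integers in [0, num_movies)
embedding(tf.constant([0, 1, 61_999]))

# Would index out of range: a raw, un-encoded movieId can exceed the table size
# embedding(tf.constant([131_262]))
```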
Core features of Merlin NVTabular:

- Many different operators (`Categorify`, `FillMissing`, `TargetEncoding`, `Groupby`, etc.) tailored for processing tabular data at scale (see the composition sketch after this list)
- Flexible APIs targeted to both production and research
- Deep integration with the NVIDIA Merlin platform, including Merlin Models for constructing and training Deep Learning models and Merlin Systems for model serving
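All of these operators share the same composition API: column selections are piped through operators with `>>`, and the resulting branches can be added together into a single output, exactly as we do later in this notebook. A small sketch (the `age` and `income` columns are hypothetical and not part of MovieLens):

```python
import nvtabular as nvt

# Hypothetical continuous columns: fill missing values, then standardize
continuous = ["age", "income"] >> nvt.ops.FillMissing() >> nvt.ops.Normalize()

# Categorical columns: encode as continuous integers
categorical = ["userId", "movieId"] >> nvt.ops.Categorify()

# Branches combine with `+` into a single workflow
workflow = nvt.Workflow(continuous + categorical)
```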
Learning objectives#
- Processing the MovieLens dataset
- Understanding Merlin NVTabular high-level concepts (Dataset, Workflow)
- A first look at operators and defining the preprocessing workflow
Downloading the dataset#
MovieLens25M#
The MovieLens25M is a popular dataset for recommender systems and is widely used in academic publications. The dataset contains 25M movie ratings for 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information of MovieLens, but the original dataset provides metadata for the movies as well, such as the genres each movie belongs to.
To streamline obtaining data, we will use a function from Merlin Models.
import os
from merlin.datasets.entertainment import get_movielens
input_path = os.environ.get("INPUT_DATA_DIR", os.path.expanduser("~/merlin-framework/movielens/"))
get_movielens(variant="ml-1m", path=input_path);
2022-08-31 04:04:15.362393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:15.362838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:15.362976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
The original dataset has been preprocessed to make it easier to work with. Instead of having to deal with `.dat` files, we can read the files directly into a `DataFrame` using the Parquet format.
The data has already been split for us into train and validation sets.
ls {input_path}/ml-1m #noqa
README ratings.dat users.dat
movies.dat train.parquet users_converted.parquet
movies_converted.parquet transformed/ valid.parquet
from merlin.core.dispatch import get_lib
train = get_lib().read_parquet(f'{input_path}ml-1m/train.parquet')
valid = get_lib().read_parquet(f'{input_path}ml-1m/valid.parquet')
From the provided train and validation sets, we will extract `userId`, `movieId`, and `rating`.
train.head()
|        | userId | movieId | rating | timestamp |
|--------|--------|---------|--------|-----------|
| 259658 | 1587   | 356     | 3      | 974740825 |
| 834974 | 5018   | 1299    | 4      | 962583606 |
| 153802 | 988    | 1721    | 4      | 976397962 |
| 363802 | 2119   | 151     | 4      | 974997851 |
| 834543 | 5015   | 1393    | 4      | 962594210 |
Processing the dataset with NVTabular#
Defining the workflow#
Before we can leverage NVTabular, we need to convert our data to a Merlin `Dataset`.

We achieve this by passing the `DataFrame` to the `Dataset` constructor.
import nvtabular as nvt
from merlin.schema.tags import Tags
train_ds = nvt.Dataset(train)
valid_ds = nvt.Dataset(valid)
train_ds, valid_ds
(<merlin.io.dataset.Dataset at 0x7f39bce8b880>,
<merlin.io.dataset.Dataset at 0x7f39bce8b820>)
Now that we have read in our data, let’s define a workflow.
A workflow consists of one or more preprocessing steps that will be applied to our data.
We begin by converting the `userId` and `movieId` columns to categories. In our dataset, they are already represented as integers, but many models require them to be continuous integers, which is not something we can guarantee about our input data unless we preprocess it. Furthermore, in order to train models on our data, we need to ensure we handle categories not seen in the train dataset.

We accomplish both of these with the `Categorify` operator.
output = ['userId', 'movieId'] >> nvt.ops.Categorify()
Above, we are instructing NVTabular to select the `userId` and `movieId` columns and to apply the `Categorify` operator to them. We store the result as `output`.
When we run the cell, the actual operation is not performed. Only a graph representation of the operation is created.
output.graph
Let us also add our target to the set of returned columns.

Additionally, we tag the `rating` column with appropriate tags. This will allow other components of the Merlin Framework to use this information and minimize the code we will have to write to perform complex operations such as training or serving a Deep Learning model.

If you would like to learn more about using `Tags`, take a look at the NVTabular and Merlin Models integrated example notebook in the Merlin Models repository.
output += ['rating'] >> nvt.ops.AddMetadata(tags=[Tags.REGRESSION, Tags.TARGET])
We are now ready to construct a `Workflow` that will run the operations we defined above.
workflow = nvt.Workflow(output)
Applying the workflow to the train and validation sets#
NVTabular follows the familiar `sklearn` API. We can fit the workflow to our train set and subsequently use it to transform our validation dataset.
workflow.fit_transform(train_ds).to_parquet('train')
workflow.transform(valid_ds).to_parquet('valid')
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
We have `fit` our workflow to the train set. During this operation, the workflow computed and stored a mapping from the `userId` and `movieId` values in the dataset to their encoded representation as continuous integers.

Subsequently, we transformed the train set and encoded the `userId` and `movieId` columns (both operations were performed when we called `fit_transform`).

Last but not least, we transformed our validation dataset using the values computed on the train set.

We output both datasets to disk.
ls train
_file_list.txt _metadata _metadata.json part_0.parquet schema.pbtxt
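Because the fitted workflow holds the learned category mappings, it is often worth persisting it alongside the transformed data so the exact same encoding can be reapplied later, for example on new data or at serving time. A minimal sketch, assuming the `Workflow` save/load API available in recent NVTabular releases:

```python
# Persist the fitted workflow (including the userId/movieId mappings) to disk
workflow.save('workflow')

# Later: restore it and transform new data with the identical encoding
reloaded = nvt.Workflow.load('workflow')
valid_again = reloaded.transform(valid_ds)
```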
Let us now load our transformed data and see whether everything looks as expected.
train_transformed = nvt.Dataset('train', engine='parquet')
valid_transformed = nvt.Dataset('valid', engine='parquet')
train_transformed.head()
|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 320    | 28      | 3      |
| 1 | 1278   | 354     | 4      |
| 2 | 3408   | 63      | 4      |
| 3 | 1747   | 569     | 4      |
| 4 | 204    | 99      | 4      |
Let’s finish off this notebook by training a DLRM (a Deep Learning Recommendation Model introduced in Deep Learning Recommendation Model for Personalization and Recommendation Systems) on our preprocessed data.
To learn more about the integration between NVTabular and Merlin Models, please see the NVTabular and Merlin Models integrated example in the Merlin Models repository.
Training a DLRM model#
We define the DLRM model, whose prediction task here is regression on the rating. From the `schema`, the categorical features are identified (and embedded), and the target column is automatically inferred thanks to the schema tags. We talk more about the schema in the next example notebook, Advanced NVTabular Workflow.
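If you want to check what the model will pick up from the schema before training, you can inspect it by tag; a quick sketch:

```python
from merlin.schema.tags import Tags

# Columns the model will treat as categorical inputs
print(train_transformed.schema.select_by_tag(Tags.CATEGORICAL).column_names)

# The column tagged as the regression target
print(train_transformed.schema.select_by_tag(Tags.TARGET).column_names)
```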
import tensorflow
import merlin.models.tf as mm
model = mm.DLRMModel(
train_transformed.schema,
embedding_dim=64,
bottom_block=mm.MLPBlock([128, 64]),
top_block=mm.MLPBlock([128, 64, 32]),
prediction_tasks=mm.RegressionTask('rating')
)
opt = tensorflow.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=opt)
model.fit(train_transformed, validation_data=valid_transformed, batch_size=1024, epochs=5)
# Continue training for a few more epochs with a lower learning rate
model.optimizer.learning_rate = 1e-4
metrics = model.fit(train_transformed, validation_data=valid_transformed, batch_size=1024, epochs=3)
2022-08-31 04:04:17.240242: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-31 04:04:17.241058: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.241240: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.241375: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.241645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.241789: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.241929: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-31 04:04:17.242051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 24576 MB memory: -> device: 0, name: Quadro RTX 8000, pci bus id: 0000:08:00.0, compute capability: 7.5
Epoch 1/5
782/782 [==============================] - 6s 5ms/step - loss: 1.2843 - root_mean_squared_error: 1.1333 - regularization_loss: 0.0000e+00 - val_loss: 0.8349 - val_root_mean_squared_error: 0.9137 - val_regularization_loss: 0.0000e+00
Epoch 2/5
782/782 [==============================] - 4s 5ms/step - loss: 0.8268 - root_mean_squared_error: 0.9093 - regularization_loss: 0.0000e+00 - val_loss: 0.8105 - val_root_mean_squared_error: 0.9003 - val_regularization_loss: 0.0000e+00
Epoch 3/5
782/782 [==============================] - 3s 4ms/step - loss: 0.8017 - root_mean_squared_error: 0.8954 - regularization_loss: 0.0000e+00 - val_loss: 0.7988 - val_root_mean_squared_error: 0.8938 - val_regularization_loss: 0.0000e+00
Epoch 4/5
782/782 [==============================] - 3s 4ms/step - loss: 0.7863 - root_mean_squared_error: 0.8868 - regularization_loss: 0.0000e+00 - val_loss: 0.7915 - val_root_mean_squared_error: 0.8897 - val_regularization_loss: 0.0000e+00
Epoch 5/5
782/782 [==============================] - 3s 4ms/step - loss: 0.7734 - root_mean_squared_error: 0.8794 - regularization_loss: 0.0000e+00 - val_loss: 0.7855 - val_root_mean_squared_error: 0.8863 - val_regularization_loss: 0.0000e+00
Epoch 1/3
782/782 [==============================] - 4s 4ms/step - loss: 0.7360 - root_mean_squared_error: 0.8579 - regularization_loss: 0.0000e+00 - val_loss: 0.7744 - val_root_mean_squared_error: 0.8800 - val_regularization_loss: 0.0000e+00
Epoch 2/3
782/782 [==============================] - 4s 5ms/step - loss: 0.7267 - root_mean_squared_error: 0.8525 - regularization_loss: 0.0000e+00 - val_loss: 0.7722 - val_root_mean_squared_error: 0.8788 - val_regularization_loss: 0.0000e+00
Epoch 3/3
782/782 [==============================] - 3s 4ms/step - loss: 0.7221 - root_mean_squared_error: 0.8497 - regularization_loss: 0.0000e+00 - val_loss: 0.7727 - val_root_mean_squared_error: 0.8791 - val_regularization_loss: 0.0000e+00
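Once training has finished, the model can be evaluated on the validation set and saved for later reuse. A short sketch (the metric key matches the name printed in the training logs above; exact values will vary between runs):

```python
# Evaluate on the transformed validation set
eval_metrics = model.evaluate(valid_transformed, batch_size=1024, return_dict=True)
print(eval_metrics["root_mean_squared_error"])

# Save the trained model so it can be reloaded or served later
model.save("dlrm_model")
```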
Conclusion#
NVTabular exposes operators tailored for processing tabular data at scale with machine learning best practices baked into the library. It tightly integrates with the rest of the Merlin Framework to streamline model construction, training and serving.
Next steps#
In subsequent notebooks, we will define more advanced workflows and custom operators. We will also take a closer look at exporting NVTabular datasets and workflows, and at running them in different environments (CPU, GPU, and multi-GPU).