Ranking script#
The ranking.py script is a template that leverages the Merlin models library (Tensorflow API) to build, train, and evaluate ranking models. At the end, you can either save the model for inference or persist the model predictions to a file.
The Merlin models library provides building blocks on top of Tensorflow (Keras) that make it easy to build and train advanced neural ranking models. There are blocks for representing input features, configuring model outputs/heads, performing feature interactions, and defining losses, metrics, and negative sampling, among others.
Ranking in multi-stage RecSys#
Large online services like social media, streaming, e-commerce, and news provide a very broad catalog of items and leverage recommender systems to help users find relevant items. Those companies typically deploy recommender system pipelines with multiple stages, in particular retrieval and ranking. The retrieval stage selects a few hundred or thousand items from a large catalog. It can be a heuristic approach (like most recent items) or a scalable model like Matrix Factorization, the Two-Tower architecture, or YouTubeDNN. Then, the ranking stage scores the relevance of the candidate items provided by the previous stage for a given user and context.
Multi-task learning for ranking models#
It is common to find scenarios where you need to score the likelihood of different user-item events, e.g., clicking, liking, sharing, commenting, following the author, etc. Multi-Task Learning (MTL) techniques have been popular in deep learning to train a single model that is able to predict multiple targets at the same time.
By using MTL, it is typically possible to improve the accuracy of somewhat correlated tasks, in particular for sparser targets, for which less training data is available. And instead of spending computational resources to train and deploy a different model for each task, you can train and deploy a single MTL model that is able to predict multiple targets.
You can find more details on the multi-task learning building blocks provided by the models library in this post.
The ranking.py script makes it easy to use multi-task learning backed by the models library. It is enabled automatically when you provide more than one target column to the --tasks argument.
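For illustration, assuming a dataset preprocessed as described in the Quick-start preprocessing documentation (the paths below reuse the $OUT_DATASET_PATH variable from the example command later in this page), a hypothetical command like the following would train a single model on both the click and like targets with MTL enabled:

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like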
Supported models#
The ranking.py script makes it easy to use the baseline and advanced deep ranking models available in the models library.
The script can also be used as an advanced example that demonstrates how to set specific hyperparameters using the models API.
Baseline ranking architectures#
- MLP (--model=mlp) - Simple multi-layer perceptron architecture. More info in MLPBlock.
- Wide and Deep (--model=wide_n_deep) - Aims to leverage the ability of neural networks to generalize and the capacity of linear models to memorize relevant feature interactions. The deep part is an MLP model, with categorical features represented as embeddings, which are concatenated with continuous features and fed through multiple MLP layers. The wide part is a linear model that takes a sparse representation of the categorical features (i.e., one-hot or multi-hot representation). More info in WideAndDeepModel and its paper.
- DeepFM (--model=deepfm) - A combination of a Factorization Machine and a Deep Neural Network. More info in DeepFMModel and its paper.
- DLRM (--model=dlrm) - Continuous features are concatenated and combined by the bottom MLP to produce an embedding, just like the categorical embeddings. The factorization machine layer performs 2nd-order feature interaction over those embeddings, which need to have the same dim. Those outputs are then concatenated and processed through the top MLP layer to output the predictions. More info in DLRMModel and its paper.
- DCN-v2 (--model=dcn) - The Improved Deep & Cross Network combines an MLP network with a cross-network for powerful and bounded feature interactions. More info in DCNModel and its paper.
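As a sketch of how an architecture is selected (the hyperparameter values below are illustrative placeholders, not tuned recommendations), the following hypothetical command would train a DCN-v2 model with a single cross layer, reusing the dataset paths from the example command later in this page:

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --model dcn --dcn_interacted_layer_num 1 --mlp_layers 128,64,32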
Multi-task learning architectures#
- MMOE (--model=mmoe) - The Multi-gate Mixture-of-Experts (MMoE) is one of the most popular models for multi-task learning on tabular data. It allows parameters to be automatically allocated to capture either shared task information or task-specific information. The core components of MMoE are experts and gates. Instead of using a shared bottom for all tasks, it has multiple expert sub-networks processing the input features independently from each other. Each task has an independent gate, which, based on the inputs, dynamically weights how much the task leverages each expert's output. More info on MMOEBlock and its paper.
- CGC (--model=cgc) - Instead of having all experts shared by all tasks as in MMoE, it splits the experts into task-specific and shared ones, in an architecture named Customized Gate Control (CGC) Model. More info on CGCBlock and its paper.
- PLE (--model=ple) - In the same paper that introduced CGC, the authors proposed stacking multiple CGC models on top of each other to form a multi-level MTL model, so that the model can progressively combine shared and task-specific experts. They named this approach Progressive Layered Extraction (PLE). Experiments in their paper showed accuracy improvements of PLE over CGC. More info on PLEBlock and its paper.
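As an illustrative sketch (expert, gate, and layer sizes below are placeholders), the MTL architectures are selected and configured with their specific flags, for example:

# MMoE with 4 experts shared by all tasks
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,share,follow --model mmoe --mmoe_num_mlp_experts 4 --expert_mlp_layers 64 --gate_dim 64

# PLE stacking 2 CGC modules, each with 1 task-specific and 2 shared experts
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,share,follow --model ple --ple_num_layers 2 --cgc_num_task_experts 1 --cgc_num_shared_experts 2 --expert_mlp_layers 64 --gate_dim 64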
Best practices#
Modeling inputs features#
Neural networks operate on dense / continuous float inputs. Continuous features fit nicely into that format, but categorical features need to be represented accordingly. It is assumed that during preprocessing the categorical features were encoded as contiguous ids. They are then typically represented by the model using:
- One-hot encoding - Sparse representation where each categorical value is represented by a binary feature that is 1 only for the actual value. If the categorical feature contains a list of values, it can be encoded with multi-hot encoding, with 1s for all values in the list. This encoding is useful to represent low-cardinality categorical features or to provide input to linear models.
- Embedding - This encoding is very popular for deep neural networks. Each categorical value is mapped to a 1D continuous vector, which can be trainable or pre-trained. The embeddings are stored in embedding layers or tables, whose first dim is the cardinality of the categorical feature and whose second dim is the embedding size.
Dealing with high-cardinality categorical features#
We explain in the Quick-start preprocessing documentation that large services might have categorical features with very high cardinality (e.g., in the order of hundreds of millions or higher), like user id or item id. Those features typically require a lot of memory to be stored (e.g., with embedding tables) or processed (e.g., with one-hot encoding). In addition, most of the categorical values are very infrequent, and it is not possible to learn good embeddings for them.
The preprocessing documentation describes some options to deal with high-cardinality features: frequency capping, filtering out rows with infrequent values, and hashing.
You might also decide to keep the original high cardinality of the categorical features for a higher personalization level and better accuracy.
The embedding tables are typically responsible for most of the parameters of recommender system models. For large-scale systems, where the number of users and items is in the order of hundreds of millions, it is typically necessary to use a distributed embeddings solution, so that the embedding tables can be sharded across multiple compute devices (e.g., GPU, CPU).
Defining the embedding size#
It is common sense that the higher the cardinality of a categorical feature, the higher its embedding dimension should be, as its vector space gets more complex.
By default, the models library uses a heuristic method that sets the embedding sizes based on the features' cardinality (implementation here), which can be scaled with --embedding_sizes_multiplier. The models library API also allows setting specific embedding dims for each or all categorical features.
Some models supported by this script (DLRM and DeepFM) require the embedding sizes of all categorical features to be the same (--embedding_dim) because of their feature interaction approach based on Factorization Machines.
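For example (the values below are illustrative only), you could let the heuristic infer the embedding dims and scale them up for a DCN-v2 model, or force a single dim for DLRM:

# Scale the heuristic embedding sizes by 3x
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --model dcn --embedding_sizes_multiplier 3

# Force the same embedding dim (64) for all categorical features, as required by DLRM
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --model dlrm --embeddings_dim 64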
Regularization#
Neural networks typically require regularization to avoid overfitting, in particular when trained on small data or for many epochs, which can make them memorize the train set. This script provides typical regularization techniques such as Dropout (--dropout) and L2 regularization of the model parameters (--l2_reg) and embeddings (--embeddings_l2_reg).
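As a sketch with placeholder values (the appropriate amount of regularization depends on your dataset size and number of epochs):

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --model dlrm --embeddings_dim 64 --dropout 0.1 --l2_reg 1e-4 --embeddings_l2_reg 1e-6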
Class weights#
Typically, positive user interactions are just a small fraction of the items that were exposed to users. That leads to class-imbalanced targets.
A common technique to deal with this problem in machine learning is to assign a higher weight to the loss of examples with the infrequent target - the positive class in this case.
You can set the positive class weight for single-task learning models with --stl_positive_class_weight. For multi-task learning, you can set the class weight for each target separately with --mtl_pos_class_weight_*, where * must be replaced by the target name. In both cases, the negative class weight is always 1.0.
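For example (the weights below are illustrative placeholders, not tuned values):

# Single-task: weight the positive class of the click loss 5x
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 5

# Multi-task: set a positive class weight per target
python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like --mtl_pos_class_weight_click 2 --mtl_pos_class_weight_like 5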
Negative sampling#
If you have only positive interactions in your training data, you can use negative sampling to include synthetic negative examples in the training batch. The negative samples are generated by adding, for each positive example, N negative examples that keep the user features and replace the features of the target item with those of items interacted with by other users in the batch. You can easily set the number of negative examples for training (--in_batch_negatives_train) and evaluation (--in_batch_negatives_eval).
This functionality requires that the user and item features are tagged accordingly, as explained in the Quick-start preprocessing documentation.
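For illustration, assuming the train and eval sets contain only positive interactions and the user/item features were tagged during preprocessing, the following hypothetical command would add 16 in-batch negatives per positive example for both training and evaluation:

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --in_batch_negatives_train 16 --in_batch_negatives_eval 16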
Multi-task learning#
Loss weights#
When using multi-task learning, you can balance the learning of the multiple tasks by setting loss weights. You can set them by providing --mtl_loss_weight_* for each task, where * must be replaced by the target name.
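For example (the weights below are placeholders), to emphasize the sparser like and follow tasks relative to click:

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow --mtl_loss_weight_click 1 --mtl_loss_weight_like 3 --mtl_loss_weight_follow 3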
Setting tasks sample space#
For some datasets, some targets might depend on other target columns. For example, the preprocessed TenRec dataset has positive (=1) like, follow, and share events only if the user has also clicked on the item (click=1).
You might want to model dependent tasks explicitly by setting the sample space, i.e., computing the loss of the dependent tasks only for examples where the dominant target is 1. That makes the dependent targets less sparse, as their value is always 0 when the dominant target is 0.
The script allows setting the tasks' sample space with --tasks_sample_space, which accepts comma-separated values. The order of the sample spaces should match the order of --tasks. An empty value means the task will be trained on the entire space, i.e., its loss is computed for all examples in the dataset.
For the TenRec dataset, you could use --tasks click,like,share,follow and --tasks_sample_space=,click,click,click, meaning that the click task will be trained on the entire space and the other tasks will be trained only on the click space.
We have observed empirically that if you want a model to predict all tasks at the same time (e.g., the likelihood of a user to click, like, and share a post), it is better to train all tasks on the entire space. On the other hand, if you want to train an MTL model that predicts rarer events (e.g., add-to-cart, purchase) given prior events (e.g., click), you typically get better accuracy on the dependent tasks by training them in the dominant task's sample space, while training the dominant task on the entire space.
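Putting it together for the TenRec targets, a hypothetical command that trains click on the entire space and the dependent tasks on the click space could look like this (the model choice and other hyperparameters are placeholders):

python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,share,follow --tasks_sample_space=,click,click,click --model mmoe --mmoe_num_mlp_experts 4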
Command line arguments#
In this section we describe the command line arguments of the ranking.py script.
You can check how to set up the Docker image to run the ranking.py script with Docker.
This is an example command line for running the training for the TenRec dataset in our Docker image, which is explained here.
The parameters and their values can be separated by either a space or =.
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --save_model_path ./saved_model
Inputs#
--train_data_path
Path of the train set. It expects a folder with parquet files.
If not provided, the model will not be trained (in case you want to use
--load_model_path to load a pre-trained model)
--eval_data_path
Path of the eval set. It expects a folder with parquet files.
If not provided, the model will not be evaluated
--predict_data_path
Path of a dataset for prediction. It expects a folder with parquet files.
If provided, it will compute the predictions for this dataset and
save those predictions to --predict_output_path
--load_model_path
If provided, loads a model saved by --save_model_path
instead of initializing the parameters randomly
--keep_columns
Comma-separated list of columns from the schema that
should be kept by the dataloader.
--ignore_columns
Comma-separated list of columns from the schema that
should be ignored by the dataloader.
Tasks#
--tasks Columns (comma-sep) with the target columns to
be predicted. A regression/binary classification
head is created for each of the target columns.
If more than one column is provided, then multi-
task learning is used to combine the tasks
losses. If 'all' is provided, all columns tagged
as target in the schema are used as tasks. By
default 'all'
--tasks_sample_space
Columns (comma-sep) to be used as sample space
for each task. This list of columns should match
the order of columns in --tasks. Typically this
is used to explicitly model that the task event
(e.g. purchase) can only occur when another
binary event has already happened (e.g. click).
Then by setting for example
--tasks=click,purchase
--tasks_sample_space=,click, you configure the
training to compute the purchase loss only for
examples with click=1, making the purchase
target less sparse.
Model#
--model {mmoe,cgc,ple,dcn,dlrm,mlp,wide_n_deep,deepfm}
Types of ranking model architectures that are
supported. Any of these models can be used with
multi-task learning (MTL). But these three are
specific to MTL: 'mmoe', 'cgc' and 'ple'. By default
'mlp'
--activation
Activation function supported by Keras, like:
'relu', 'selu', 'elu', 'tanh', 'sigmoid'. By
default 'relu'
--mlp_init Keras initializer for MLP layers.
By default 'glorot_uniform'.
--l2_reg L2 regularization factor. By default 1e-5.
--embeddings_l2_reg
L2 regularization factor for embedding tables.
It operates only on the embeddings in the
current batch, not on the whole embedding table.
By default 0.0
--embedding_sizes_multiplier
When --embedding_dim is not provided, the
embedding dimensions are automatically inferred
from the categorical features' cardinality. This
factor allows increasing/decreasing the embedding
dim based on the cardinality. Typical values
range between 2 and 10. By default 2.0
--dropout Dropout rate. By default 0.0
--mlp_layers
Comma-separated dims of MLP layers.
It is used by MLP model and also for dense blocks
of DLRM, DeepFM, DCN and Wide&Deep.
By default '128,64,32'
--stl_positive_class_weight
Positive class weight for single-task models. By
default 1.0. The negative class weight is fixed
to 1.0
DCN-v2#
--dcn_interacted_layer_num
Number of interaction layers for DCN-v2
architecture. By default 1.
DLRM and DeepFM#
--embeddings_dim
Sets the embedding dim for all embedding columns
to be the same. This is only used for --model
'dlrm' and 'deepfm'
Wide&Deep#
--wnd_hashed_cross_num_bins
Used with Wide&Deep model. Sets the number of
bins for hashing feature interactions. By
default 10000.
--wnd_wide_l2_reg
Used with Wide&Deep model. Sets the L2 reg of
the wide/linear sub-network. By default 1e-5.
--wnd_ignore_combinations
Feature interactions to ignore. Separate feature
combinations with ',' and columns with ':'. For
example: --wnd_ignore_combinations='item_id:item
_category,user_id:user_gender'
Wide&Deep and DeepFM#
--multihot_max_seq_length
DeepFM and Wide&Deep support multi-hot
categorical features for the wide/linear sub-
network. But they require setting the maximum
list length, i.e., number of different multi-hot
values that can exist in a given example. By
default 5.
MMOE#
--mmoe_num_mlp_experts
Number of experts for MMOE. All of them are
shared by all the tasks. By default 4.
CGC and PLE#
--cgc_num_task_experts
Number of task-specific experts for CGC and PLE.
By default 1.
--cgc_num_shared_experts
Number of shared experts for CGC and PLE. By
default 2.
--ple_num_layers
Number of CGC modules to stack for PLE
architecture. By default 1.
Expert-based MTL models#
--expert_mlp_layers
For MTL models (MMOE, CGC, PLE) sets the MLP
layers of experts.
It expects a comma-separated list of layer dims.
By default '64'
--gate_dim Dimension of the gate MLP layer. By default
64
--mtl_gates_softmax_temperature
Sets the softmax temperature for the gates'
output layer, which provides the weights for the
weighted average of the experts' outputs. By default
1.0
Multi-task learning models#
--use_task_towers
Creates a task-specific MLP tower before each
task's head. By default True.
--tower_layers
MLP architecture of task-specific towers.
It expects a comma-separated list of layer dims.
By default '64'
Negative sampling#
--in_batch_negatives_train
If greater than 0, enables in-batch sampling,
providing this number of negative samples per
positive. This requires that your data contains
only positive examples, and that item features
are tagged accordingly in the schema, for
example, by setting --item_features in the
preprocessing script.
--in_batch_negatives_eval
Same as --in_batch_negatives_train for
evaluation.
Training and evaluation#
--lr LR Learning rate
--lr_decay_rate
Learning rate decay factor. By default 0.99
--lr_decay_steps
Learning rate decay steps. It decreases the LR
at this frequency, by default every 100 steps
--train_batch_size
Train batch size. By default 1024. Larger batch
sizes are recommended for better performance.
--eval_batch_size
Eval batch size. By default 1024. Larger batch
sizes are recommended for better performance.
--epochs EPOCHS Number of epochs. By default 1.
--optimizer {adagrad,adam}
Optimizer. By default 'adam'
--train_metrics_steps
How often train metrics should be computed
during training. You might increase this number
to reduce the frequency and slightly increase
training throughput. By default 10.
--validation_steps
If not predicting, logs the validation metrics
for this number of steps at the end of each
training epoch. By default 10.
--random_seed
Random seed for some reproducibility. By default
42.
--train_steps_per_epoch
Number of train steps per epoch. Set this for
quick debugging.
--shuffled_train
Shuffles data during training.
Logging#
--metrics_log_frequency
How often metrics should be logged to
Tensorboard or Weights&Biases. By default every
50 steps.
--log_to_tensorboard
Enables logging to Tensorboard.
--log_to_wandb
Enables logging to Weights&Biases. This requires
signing up for a free Weights&Biases account at https://wandb.ai/
and providing an API key, which you can get at
https://wandb.ai/authorize
--wandb_project
Name of the Weights&Biases project to log to
--wandb_entity
Name of the Weights&Biases team/org to log to
--wandb_exp_group
Not used by the script itself. It is only used
to log some extra info for organizing experiments
in Weights&Biases.
Outputs#
--output_path
Folder to save training and logging assets.
--save_model_path
If provided, model is saved to this path after
training. It can be loaded later with --load_model_path
--predict_output_path
If provided, the prediction scores will be saved
to this path, according to --predict_output_format
and --predict_output_keep_cols
--predict_output_keep_cols
Comma-separated list of columns to keep in the
output prediction file. If no columns are
provided, all columns are kept together with the
prediction scores.
--predict_output_format {parquet,csv,tsv}
Format of the output prediction files. By
default 'parquet', which is the most performant
format.
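For example, to score a new dataset with a previously saved model and keep only the id columns alongside the prediction scores (the predict folder and the user_id/item_id column names below are hypothetical placeholders), something along these lines could be used:

python -m quick_start.scripts.ranking.ranking --load_model_path ./saved_model --predict_data_path $OUT_DATASET_PATH/predict --output_path ./outputs/ --predict_output_path ./outputs/predictions --predict_output_format parquet --predict_output_keep_cols user_id,item_id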