Preprocessing script#
The preprocessing.py is a template script that provides basic preprocessing and feature engineering operations for tabular data, so that the data is better represented for neural models. It uses the NVTabular and dask-cudf libraries for GPU-accelerated preprocessing.
In this document we describe the provided preprocessing and feature engineering options and the corresponding command line arguments.
Best practices#
In this section we list some best practices on preprocessing and feature engineering for preparing data for neural models.
Dataset#
The typical data to train recommender systems is the log of user interactions with items from a platform like an e-commerce site, news portal, social network, ad network, or streaming media platform, among others. The logged user interactions might contain explicit feedback from users, e.g. like, dislike, rating, or implicit feedback events, e.g. click, comment, add-to-cart, purchase, which might be positive or negative, e.g. items shown to the user and ignored.
Defining the task#
You need to prepare the dataset according to the desired task.
Retrieval - The model objective is to return the top-k recommended items for a given user. In this case, the data can contain only positive interactions, as retrieval models are typically trained using negative sampling from other users' interactions, not requiring implicit negatives.
Ranking - The model objective is to score the relevance of a target item for a given user. In this case, you have at least one target column that expresses the implicit or explicit feedback you want to predict. Typically each target is used for either a binary classification task (e.g. predicting binary events like click, --binary_classif_targets) or a regression task (e.g. estimating a rating, --regression_targets). You can see below an example of the TenRec dataset, which is suitable for ranking.
Preprocessing features#
When preparing the data, you need to include features that are relevant for predicting a user interaction, which might include user features that are static (e.g. user id, age, gender), dynamic contextual features (e.g. location, device) and item features (e.g. item id, category, price).
For neural networks there is an important distinction between categorical and continuous features.
Continuous features
Continuous features (--continuous_features) are naturally fed into neural networks; they typically just need to be normalized to avoid numerical scaling issues. Typical approaches for normalizing continuous features are standardization (Z-scaling) and min-max scaling.
It is important to have a strategy for imputing missing values (e.g. with a constant float, or with a statistic like the mean or median), as null (NaN) values are not accepted as input by neural networks.
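For illustration, here is a minimal sketch of how continuous features could be normalized and imputed with NVTabular ops; the column names are hypothetical and this is not the script's exact workflow.
import nvtabular as nvt
from nvtabular import ops

# Hypothetical continuous columns
cont_cols = ["price", "user_age"]

# Impute nulls with a constant float, then standardize (Z-scaling).
# ops.NormalizeMinMax() could be used instead for min-max scaling.
cont_features = cont_cols >> ops.FillMissing(fill_val=0.0) >> ops.Normalize()

workflow = nvt.Workflow(cont_features)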
Categorical features
Categorical features (--categorical_features) are nominal data, typically strings or id numbers that don't have any meaningful order or scaling properties. They are typically categorified, i.e. encoded as contiguous ids, so that when fed to a model they can be represented either as a one-hot encoding for linear models or as embeddings for neural networks.
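As a minimal illustrative sketch (hypothetical column names, not the script's exact workflow), categorification can be expressed with NVTabular's Categorify op:
import nvtabular as nvt
from nvtabular import ops

# Encode raw categorical values as contiguous integer ids
cat_features = ["user_id", "item_id", "video_category"] >> ops.Categorify()
workflow = nvt.Workflow(cat_features)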
Dealing with high-cardinality data#
Large services might have categorical features with very high cardinality (e.g. on the order of hundreds of millions or higher), like user id or item id. They typically require a lot of memory to be stored (e.g. in embedding tables) or processed (e.g. with one-hot encoding). In addition, most of the categorical values are very infrequent, and it is not possible to learn good embeddings for them. Thus, you need to make some modeling choices in order to preprocess those categorical features accordingly. Here are some options:
Keep the original high cardinality - If you are going to use a model with a distributed embedding solution that supports sharding the embedding tables across multiple devices (typically GPUs) to avoid going out of memory, then you can categorify these features just as you do for low-cardinality ones.
Frequency capping (--categ_min_freq_capping) - Infrequent values are mapped to 0, forming a cluster of infrequent / cold-start users/items that can be useful for training the model to deal with them.
Filtering out infrequent values (--min_user_freq, --min_item_freq) - You might filter out interactions from infrequent or fresh users or items, which are typically the majority of the system's interactions, as they follow a long-tail distribution.
Hashing - An additional option is to hash the categorical values into a number of buckets much lower than the feature cardinality. That way, you introduce collisions as a trade-off for lower final cardinality and memory requirements on the modeling side. This can be done in the preprocessing or in the modeling (see the sketch below).
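Here is a minimal sketch of frequency capping and hashing with NVTabular ops; the column names and bucket sizes are hypothetical, frequency capping corresponds to --categ_min_freq_capping, and hashing is shown with the HashBucket op, which is not necessarily exposed by this script.
import nvtabular as nvt
from nvtabular import ops

# Frequency capping: values seen fewer than 10 times are mapped to the infrequent/null id (0)
capped_items = ["item_id"] >> ops.Categorify(freq_threshold=10)

# Hashing: map values into a fixed number of buckets, accepting some collisions
hashed_users = ["user_id"] >> ops.HashBucket(num_buckets=1_000_000)

workflow = nvt.Workflow(capped_items + hashed_users)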
Feature Engineering#
Feature engineering allows designing new features from raw data that can provide useful information to the model with respect to the prediction task.
In this section we list common feature engineering techniques. Most of them are implemented as ops in NVTabular. User-defined functions (UDFs) can be implemented with the LambdaOp, which is very useful, for example, for temporal and geographic feature engineering.
TIP: This preprocessing script provides just basic feature engineering. For more advanced techniques, you can either copy preprocessing.py and change it, or create a class inheriting from the PreprocessingRunner class (in preprocessing.py) and override the generate_nvt_features() method to customize the preprocessing workflow with different NVTabular ops, as sketched below.
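A customization could look roughly like the following sketch; it assumes generate_nvt_features() takes no extra arguments and returns the NVTabular op graph, so check the actual signature in preprocessing.py before adapting it.
from nvtabular import ops
from quick_start.scripts.preproc.preprocessing import PreprocessingRunner

class MyPreprocessingRunner(PreprocessingRunner):
    def generate_nvt_features(self):
        # Reuse the base op graph and extend it (assumed return type: NVTabular op group)
        features = super().generate_nvt_features()
        # Example: add a log-transformed version of a hypothetical "price" column
        return features + (["price"] >> ops.LogOp())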
Continuous features
Smoothing long-tailed distributions of continuous features with Log, so that the range of large numbers is compressed and the range of small numbers is expanded.
Continuous features can be represented as categorical features by either binarization (converting to binary) or binning (converting to multiple categorical or ordinal values). That might be useful to group together values that are similar, e.g., periods of the day, age ranges of users, etc.
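A minimal sketch of a log transform and binning with NVTabular ops; the column names are borrowed from the TenRec example only for illustration, and the bin boundaries are hypothetical.
import nvtabular as nvt
from nvtabular import ops

# Log-transform a long-tailed continuous feature (LogOp computes log(x + 1))
log_watch = ["watching_times"] >> ops.LogOp()

# Bin a continuous feature into ordinal buckets using explicit boundaries
age_buckets = ["age"] >> ops.Bucketize({"age": [18, 25, 35, 50, 65]})

workflow = nvt.Workflow(log_watch + age_buckets)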
Categorical features
Besides contiguous ids, categorical features can also be represented by global statistics of their values, or by statistics conditioned on other columns. Some popular techniques are:
Count encoding - represents the count of a given categorical value across the whole dataset (e.g. count of user past interactions)
Target encoding - represents a statistic of a target column conditioned on the values of a categorical column. One example would be computing the average of the click binary target segmented by the item id categorical values, which represents the item's Click-Through Rate (CTR), i.e. its likelihood to be clicked by a random user. Target encoding is a very powerful feature engineering technique and has been key to many of our winning solutions for RecSys competitions. You can create target-encoded features with this script by setting the --target_encoding_features and --target_encoding_targets arguments to define which categorical columns and targets should be used for generating the target-encoded features.
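Features like these can be produced with NVTabular's TargetEncoding op. The following is an illustrative sketch (not the script's exact code) for an item id column and a click target.
import nvtabular as nvt
from nvtabular import ops

# Out-of-fold target encoding of the "click" target per "item_id" value,
# smoothed towards the global target mean for infrequent items
te_features = ["item_id"] >> ops.TargetEncoding(["click"], kfold=5, p_smooth=10)

workflow = nvt.Workflow(te_features)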
Temporal features
Extracting temporal features from timestamps, like day of week, day, month, year, quarter, hour, period of the day, among others.
Compute the “age” of the item or how long the user has been active in the system, e.g. by subtracting the timestamp when the user/item was seen for the first time from the interaction timestamp.
Trending features might also be useful: for example, continuous features that accumulate the user's engagement in a specific product category over the last month, quarter, or semester.
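Here is a sketch of extracting simple date/time features with NVTabular's LambdaOp; the basic script does not do this by default, and the "timestamp" column name is hypothetical.
import nvtabular as nvt
from nvtabular import ops

# Extract day of week and hour from a datetime column via user-defined functions
day_of_week = ["timestamp"] >> ops.LambdaOp(lambda col: col.dt.dayofweek) >> ops.Rename(postfix="_dayofweek")
hour = ["timestamp"] >> ops.LambdaOp(lambda col: col.dt.hour) >> ops.Rename(postfix="_hour")

workflow = nvt.Workflow(day_of_week + hour)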
Geographic features
You can treat zip codes, cities, states, and countries as categorical features.
If latitude/longitude are available, you can also compute distances, e.g. the distance between a hotel (item) location and the user location / airport / touristic landmark.
You can also enrich the data by adding features based on external geolocation data (e.g. from census or government sources).
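For example, a distance feature could be computed with a user-defined function like the haversine sketch below; the latitude/longitude columns are hypothetical and this is not part of the script.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers between two points given in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical user and item coordinates
df = pd.DataFrame({"user_lat": [37.77], "user_lon": [-122.42],
                   "item_lat": [40.71], "item_lon": [-74.00]})
df["user_item_distance_km"] = haversine_km(df["user_lat"], df["user_lon"],
                                           df["item_lat"], df["item_lon"])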
Data set splitting#
There are many approaches for splitting (--dataset_split_strategy) the data into train and eval sets:
random - Examples are randomly assigned to the train and eval sets (according to a percentage, --random_split_eval_perc).
random by user - Like random, but stratified by user. It ensures that users have examples in both the train and eval sets. This approach doesn't provide cold-start users in the eval set.
temporal - Uses a reference timestamp (--dataset_split_temporal_timestamp) to split the train and eval sets. Typically this is the most realistic approach, as deployed models will not have access to future information when performing predictions.
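Conceptually, the temporal split corresponds to something like the following dask-cudf sketch; the file path, column name, and timestamp value are hypothetical.
import dask_cudf

ddf = dask_cudf.read_parquet("/data/interactions.parquet")
split_ts = 1632000000  # reference timestamp (--dataset_split_temporal_timestamp)

# Examples at or after the reference timestamp go to the eval set
train_df = ddf[ddf["timestamp"] < split_ts]
eval_df = ddf[ddf["timestamp"] >= split_ts]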
Command line arguments#
In this section we describe the command line arguments of the preprocessing script.
The input data format can be CSV, TSV, or Parquet, but the latter is recommended because it is a columnar format that is faster to preprocess. The output format of the preprocessing is Parquet.
You can check how to set up the Docker container to run the preprocessing.py script.
Here is an example command line for running preprocessing for the TenRec dataset in our Docker image, which is explained here.
The parameters and their values can be separated by either a space or an = sign.
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
python -m quick_start.scripts.preproc.preprocessing --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
Inputs#
--data_path
Path to the data
--eval_data_path
Path to eval data, if data was already split. Must have
the same schema as train data (in --data_path).
--predict_data_path
Path to data to be preprocessed for prediction.
This data is expected to have the same input features as
train data but not targets, as this data is used for prediction.
--input_data_format {csv,tsv,parquet}
Input data format
--csv_sep Character separator for CSV files. Default is ','. You
can use 'tab' for tab-separated data, or use
--input_data_format tsv
--csv_na_values
String in the original data that should be replaced by
NULL
Outputs#
--output_path
Output path where the preprocessed files will be
saved. Default is ./results/
--output_num_partitions
Number of partitions of the output data, which results
in this number of output files. Default is 10.
--persist_intermediate_files
Whether to persist/cache the intermediate
preprocessing files. Enabling this might be necessary
for larger datasets.
Features and targets definition#
--control_features
Columns (comma-separated) that should be kept as is in
the output files. For example,
--control_features=session_id,timestamp
--categorical_features
Columns (comma-sep) with categorical/discrete features
that will be encoded/categorified to contiguous ids in
the preprocessing. These columns are tagged as
'categorical' in the schema, so that Merlin Models can
automatically create embedding tables for them.
--continuous_features
Columns (comma-sep) with continuous features that will
be standardized and tagged in the schema as
'continuous', so that Merlin Models can represent
and combine them with embeddings properly.
--continuous_features_fillna
Replaces NULL values with this float. You can also set
it to 'median' to fill nulls with the median value.
--user_features
Columns (comma-sep) that should be tagged in the
schema as user features. This information might be
useful for modeling later.
--item_features
Columns (comma-sep) that should be tagged in the
schema as item features. This information might be
useful for modeling later, for example, for in-batch
sampling if your data contains only positive examples.
--user_id_feature
Column that contains the user id feature, for tagging
in the schema. This information is used in the
preprocessing if you set --min_user_freq or
--max_user_freq
--item_id_feature
Column that contains the item id feature, for tagging
in the schema. This information is used in the
preprocessing if you set --min_item_freq or
--max_item_freq
--timestamp_feature
Column containing a timestamp or date feature. The
basic preprocessing doesn't extract date and time
features from it. It is just tagged as 'timestamp' in
the schema and used for splitting train / eval data if
--dataset_split_strategy=temporal is used.
--session_id_feature SESSION_ID_FEATURE
This is just for tagging this feature.
--binary_classif_targets
Columns (comma-sep) that should be tagged in the
schema as binary target. Merlin Models will create a
binary classification head for each of these targets.
--regression_targets
Columns (comma-sep) that should be tagged in the
schema as regression target. Merlin Models will create a
regression head for each of these targets.
Target encoding features#
--target_encoding_features
Columns (comma-sep) with categorical/discrete
features for which target encoding features will be
generated, with the average of the target columns
for each categorical value. The target columns are
defined in --target_encoding_targets. If
--target_encoding_features is not provided but
--target_encoding_targets is, all categorical
features will be used.
--target_encoding_targets
Columns (comma-sep) with target columns that will be
used to compute target encoding features with the
average of the target columns for categorical
features value. The categorical features are defined
in --target_encoding_features. If
--target_encoding_targets is not provided but
--target_encoding_features is, all target columns
will be used.
--target_encoding_kfold
Number of folds for target encoding, in order to
avoid that the current example is considered in the
target encoding feature computation, which could
cause overfitting for infrequent categorical values.
Default is 5
--target_encoding_smoothing
Smoothing factor that is used in the target encoding
computation, as statistics for infrequent
categorical values might be noisy. It makes the target
encoding formula = `(sum_target_per_categ_value +
global_target_avg * smooth) / (categ_value_count +
smooth)`. Default is 10
Data casting and filtering#
--to_int32 Cast these columns (comma-sep) to int32.
--to_int16 Cast these columns (comma-sep) to int16, to save some
memory.
--to_int8 Cast these columns (comma-sep) to int8, to save some
memory.
--to_float32
Cast these columns (comma-sep) to float32
Filtering and frequency capping#
--categ_min_freq_capping
Value used for min frequency capping. If greater than 0, all categorical values which are less frequent than this threshold will be mapped to the null value encoded id.
--min_user_freq
Users with frequency lower than this value are removed
from the dataset (before data splitting).
--max_user_freq
Users with frequency higher than this value are
removed from the dataset (before data splitting).
--min_item_freq
Items with frequency lower than this value are removed
from the dataset (before data splitting).
--max_item_freq
Items with frequency higher than this value are
removed from the dataset (before data splitting).
--num_max_rounds_filtering
Max number of rounds interleaving users and items
frequency filtering. If a small number of rounds is
chosen, some low-frequency users or items might be kept
in the dataset. Default is 5
--filter_query
A filter query condition compatible with dask-cudf
`DataFrame.query()`
Dataset splitting (train and eval sets)#
--dataset_split_strategy {random,random_by_user,temporal}
If None, no data split is performed. If 'random',
samples are assigned randomly to eval set according to
--random_split_eval_perc. If 'random_by_user', users
will have examples in both train and eval set,
according to the proportion specified in
--random_split_eval_perc. If 'temporal', the
--timestamp_feature is compared with
--dataset_split_temporal_timestamp to split the eval set
based on time.
--random_split_eval_perc
Percentage of examples to be assigned to eval set. It
is used with --dataset_split_strategy 'random' and
'random_by_user'
--dataset_split_temporal_timestamp
Used when --dataset_split_strategy 'temporal'. It
assigns to the eval set all examples where the
--timestamp_feature >= this value.
CUDA cluster options#
--enable_dask_cuda_cluster
Initializes a LocalCUDACluster for multi-GPU preprocessing.
This is recommended for larger datasets to avoid out-of-memory
errors when multiple GPUs are available. Default is False.
--dask_cuda_visible_gpu_devices
Ids of GPU devices that should be used for
preprocessing, if any. For example:
--dask_cuda_visible_gpu_devices=0,1. Default is None, for using all GPUs.
--dask_cuda_gpu_device_spill_frac
Percentage of GPU memory used at which
LocalCUDACluster should spill memory to CPU, before
raising out-of-memory errors. Default is 0.7