Categorify

class nvtabular.ops.Categorify(freq_threshold=0, out_path=None, tree_width=None, na_sentinel=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, start_index=0, single_table=False, cardinality_memory_limit=None)

Bases: nvtabular.ops.stat_operator.StatOperator

Most datasets contain categorical features, and these variables are typically stored as text values. Machine learning algorithms generally cannot consume text values directly. The Categorify operation can be added to a workflow to transform categorical features into unique integer values.

Example usage:

import nvtabular as nvt

# Define pipeline (CATEGORICAL_COLUMNS is a list of column names; dataset is an nvt.Dataset)
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(freq_threshold=10)

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
proc.transform(dataset).to_parquet('./test/')

Example for frequency hashing:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})


# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
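
To make the thresholding idea concrete, here is a minimal pure-Python sketch of frequency-based encoding on the same toy columns. It is not NVTabular's implementation: the helper `encode_with_threshold` is hypothetical, the id assignment order (sorted values) is an assumption, and the sketch deliberately ignores `num_buckets`, so every infrequent value simply shares the null id 0.

```python
from collections import Counter


def encode_with_threshold(values, freq_threshold=0, null_id=0):
    """Hypothetical helper: map values to integer ids; rare values share null_id."""
    counts = Counter(values)
    # Keep only categories whose count is not below the threshold.
    kept = sorted(v for v, c in counts.items() if c >= freq_threshold)
    # Assumption: ids are assigned in sorted order, starting after the null id.
    mapping = {v: i for i, v in enumerate(kept, start=null_id + 1)}
    return [mapping.get(v, null_id) for v in values]


authors = ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A']
product_ids = [100, 101, 102, 101, 102, 103, 103]

# Only 'User_A' appears at least 3 times; product 100 appears fewer than 2 times.
author_codes = encode_with_threshold(authors, freq_threshold=3)
product_codes = encode_with_threshold(product_ids, freq_threshold=2)

print(author_codes)   # [1, 0, 0, 0, 1, 0, 1]
print(product_codes)  # [0, 1, 2, 1, 2, 3, 3]
```

The integer ids that Categorify itself produces for this data may differ, especially when `num_buckets` is also set, as in the example above.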

Example with multi-hot:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'userID': [10001, 10002, 10003],
    'productID': [30003, 30005, 40005],
    'categories': [['Cat A', 'Cat B'], ['Cat C'], ['Cat A', 'Cat C', 'Cat D']],
    'label': [0, 0, 1]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['userID', 'productID', 'categories']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
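
Conceptually, a multi-hot (list) column is encoded element by element against one shared vocabulary, so each row keeps its length and order. The pure-Python sketch below illustrates that idea on the `categories` column. The helper `encode_multi_hot` is hypothetical and the sorted id order is an assumption, so the ids NVTabular assigns may differ.

```python
def encode_multi_hot(rows, null_id=0):
    """Hypothetical helper: encode each list element with one shared vocabulary."""
    # Build a single vocabulary across all elements of all rows.
    vocab = sorted({item for row in rows for item in row})
    # Assumption: ids are assigned in sorted order, starting after the null id.
    mapping = {item: i for i, item in enumerate(vocab, start=null_id + 1)}
    # Encode element-wise, preserving the length and order of every row.
    return [[mapping.get(item, null_id) for item in row] for row in rows]


categories = [['Cat A', 'Cat B'], ['Cat C'], ['Cat A', 'Cat C', 'Cat D']]
encoded = encode_multi_hot(categories)

print(encoded)  # [[1, 2], [3], [1, 3, 4]]
```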

Parameters

freq_threshold (int or dictionary:{column: freq_limit_value}, default 0) – Categories with a count/frequency below this threshold will be omitted from the encoding, and the corresponding data will be mapped to the “null” category. Can be given as either an integer or a dictionary with column names as keys and frequency limits as values. If a dictionary is used, all targeted columns must be included in the dictionary.
encode_type ({"joint", "combo"}, default "joint") – If “joint”, the columns within any multi-column group will be jointly encoded. If “combo”, the combination of values will be encoded as a new column. Note that replacement is not allowed for “combo”, because the same column name can be included in multiple groups.
tree_width (dict or int, optional) – Tree width of the hash-based groupby reduction for each categorical column. High-cardinality columns may require a large tree_width, while low-cardinality columns can likely use tree_width=1. If passing a dict, each key and value should correspond to the column name and width, respectively. The default value is 8 for all columns.
out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.
on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).
na_sentinel (default 0) – Label to use for null-category mapping
cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns.
dtype – If specified, categorical labels will be cast to this dtype after encoding is performed.
name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.
search_sorted (bool, default False) – Set to True to use the searchsorted algorithm for encoding.
num_buckets (int, or dictionary:{column: num_hash_buckets}) – Column-wise modulo to apply after hash function. Note that this means that the corresponding value will be the categorical cardinality of the transformed categorical feature. If given as an int, that value will be used as the number of “hash buckets” for every feature. If a dictionary is passed, it will be used to specify explicit mappings from a column name to a number of buckets. In this case, only the columns specified in the keys of num_buckets will be transformed.
max_size (int or dictionary:{column: max_size_value}, default 0) – Sets the maximum size of the embedding table for each column. For example, if max_size is set to 1000, only the 999 most frequent values for each column will be encoded, and the rest will be mapped to a single value (0). To map the rest to a number of buckets, set the num_buckets parameter > 1. In that case, the topK value will be max_size - num_buckets - 1. When max_size is set, freq_threshold should not be given. If the num_buckets parameter is set, it must be smaller than the max_size value.
start_index (int, default 0) – The start index where Categorify will begin to translate dataframe entries into integer values, including an initial out-of-vocabulary encoding value. For instance, if our original translated dataframe entries appear as [[1], [1, 4], [3, 2], [2]], with an out-of-vocabulary value of 0, then with a start_index of 16, Categorify will reserve 16 as the out-of-vocabulary encoding value, and our new translated dataframe entry will now be [[17], [17, 20], [19, 18], [18]]. This parameter is useful to reserve an initial segment of non-negative translated integers for special user-defined values.
cardinality_memory_limit (int or str, default None) – Upper limit on the “allowed” memory usage of the internal DataFrame and Table objects used to store unique categories. By default, this limit is 12.5% of the total memory. Note that this argument is meant as a guide for internal optimizations and UserWarnings within NVTabular, and does not guarantee that the memory limit will be satisfied.
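
The arithmetic behind `start_index` and `max_size` can be sketched in plain Python. Both helpers below are hypothetical illustrations rather than NVTabular code. `shift_by_start_index` reproduces the worked example from the `start_index` description. `encode_top_k` follows the `max_size` description without `num_buckets`, and its assignment of ids in descending frequency order is an assumption.

```python
from collections import Counter


def shift_by_start_index(encoded_rows, start_index=0):
    """Hypothetical helper: offset every encoded id, including the OOV id 0."""
    return [[code + start_index for code in row] for row in encoded_rows]


def encode_top_k(values, max_size, other_id=0):
    """Hypothetical helper: keep the (max_size - 1) most frequent values.

    Expects max_size >= 1; all remaining values share other_id.
    """
    counts = Counter(values)
    top = [v for v, _ in counts.most_common(max_size - 1)]
    # Assumption: ids follow descending frequency, starting at 1.
    mapping = {v: i for i, v in enumerate(top, start=1)}
    return [mapping.get(v, other_id) for v in values]


# start_index: the example from the parameter description.
rows = [[1], [1, 4], [3, 2], [2]]
shifted = shift_by_start_index(rows, start_index=16)
print(shifted)  # [[17], [17, 20], [19, 18], [18]]

# max_size=3 leaves room for the 2 most frequent values plus the shared id 0.
top_k_codes = encode_top_k(['a', 'a', 'a', 'b', 'b', 'c'], max_size=3)
print(top_k_codes)  # [1, 1, 1, 2, 2, 0]
```

With `start_index=16`, the out-of-vocabulary id 0 becomes 16, which leaves the ids 0 through 15 free for user-defined values.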

fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame): Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(categories): Finalize statistics calculation - the workflow calls this function with the computed statistics from the ‘fit’ object.

clear(): Clear the internal state of the operator’s stats.

process_vocabs(vocabs): Process vocabs passed in by the user.

set_storage_path(new_path, copy=False)

transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Transform the dataframe by applying this operator to the set of input columns

Parameters

columns (list of str or list of list of str) – The columns to apply this operator to
df (DataFrame) – A pandas or cudf DataFrame that this operator will work on

Returns

Returns a transformed dataframe for this operator

Return type

DataFrame

column_mapping(col_selector)

property output_tags

property output_dtype

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector

get_embedding_sizes(columns)

inference_initialize(columns, inference_config)