nvtabular.ops.Categorify

class nvtabular.ops.Categorify(freq_threshold=0, out_path=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, single_table=False, cardinality_memory_limit=None, tree_width=None, split_out=1, split_every=8, **kwargs)[source]

Bases: merlin.dag.ops.stat_operator.StatOperator

Most datasets contain categorical features, and these variables are typically stored as text values. Machine learning algorithms cannot operate on such text values directly. The Categorify operation can be added to a workflow to transform categorical features into unique integer values.

Encoding Convention:

- `0`: Not used by `Categorify` (reserved for padding).
- `1`: Null and NaN values.
- `[2, 2 + num_buckets)`: OOV values (including hash buckets).
- `[2 + num_buckets, max_size)`: Unique vocabulary.
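The index layout above can be sketched in plain Python. This is only an illustration of the convention, not NVTabular code: the helper name `encoding_ranges` and its `vocab_size` argument are hypothetical, and a single OOV index is assumed when `num_buckets` is not given.

```python
def encoding_ranges(num_buckets=1, vocab_size=0):
    """Return half-open [start, stop) index ranges for each kind of value.

    Hypothetical helper that mirrors the documented convention only.
    """
    oov_start = 2                          # 0 is padding, 1 is null/NaN
    vocab_start = oov_start + num_buckets  # vocabulary follows the OOV block
    return {
        "padding": (0, 1),
        "null": (1, 2),
        "oov": (oov_start, vocab_start),
        "vocab": (vocab_start, vocab_start + vocab_size),
    }


# Ten hash buckets and five vocabulary entries
ranges = encoding_ranges(num_buckets=10, vocab_size=5)
print(ranges)
```

With these assumptions, the first vocabulary entry receives index `2 + num_buckets`, so increasing `num_buckets` shifts every vocabulary index upward.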

Example usage:

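import nvtabular as nvt

# CATEGORICAL_COLUMNS (a list of column names) and dataset (an nvt.Dataset)
# are assumed to be defined already, as in the examples below.

# Define pipeline
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(freq_threshold=10)

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
proc.transform(dataset).to_parquet('./test/')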

Example for frequency hashing:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})


# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
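
To see which values in the toy dataset above would fall below their thresholds, the counts can be reproduced with the standard library alone. This is a rough sketch rather than what `Categorify` does internally: the helper `split_by_threshold` is hypothetical, and it assumes a value is kept when its count is greater than or equal to `freq_threshold`.

```python
from collections import Counter

authors = ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A']
product_ids = [100, 101, 102, 101, 102, 103, 103]


def split_by_threshold(values, freq_threshold):
    """Split distinct values into (kept, oov) lists by their frequency."""
    counts = Counter(values)
    kept = sorted(v for v, c in counts.items() if c >= freq_threshold)
    oov = sorted(v for v, c in counts.items() if c < freq_threshold)
    return kept, oov


kept_authors, oov_authors = split_by_threshold(authors, 3)
kept_products, oov_products = split_by_threshold(product_ids, 2)

print(kept_authors, oov_authors)
print(kept_products, oov_products)
```

Under that assumption, only `User_A` stays in the `author` vocabulary, and only product `100` is routed to the OOV hash buckets for `productID`.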

Example with multi-hot:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'userID': [10001, 10002, 10003],
    'productID': [30003, 30005, 40005],
    'categories': [['Cat A', 'Cat B'], ['Cat C'], ['Cat A', 'Cat C', 'Cat D']],
    'label': [0,0,1]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['userID', 'productID', 'categories']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
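
For the list-valued `categories` column, each element of a row is encoded separately, so a row of strings becomes a row of integers. The pure-Python sketch below illustrates the idea only. The vocabulary order (alphabetical here) and the helper `encode_row` are assumptions, so the actual indices produced by `Categorify` may differ.

```python
categories = [['Cat A', 'Cat B'], ['Cat C'], ['Cat A', 'Cat C', 'Cat D']]

# Indices 0, 1 and 2 are taken by padding, nulls and the single OOV index.
OOV_INDEX = 2
FIRST_VOCAB_INDEX = 3

# ASSUMPTION: alphabetical vocabulary order, chosen only for illustration.
unique_values = sorted({value for row in categories for value in row})
vocab = {value: FIRST_VOCAB_INDEX + i for i, value in enumerate(unique_values)}


def encode_row(row):
    """Encode every element of a multi-hot row; unknown values map to OOV."""
    return [vocab.get(value, OOV_INDEX) for value in row]


encoded = [encode_row(row) for row in categories]
print(encoded)
```

The row lengths are preserved, which is what allows the encoded column to be used as a multi-hot feature.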

Parameters
  • freq_threshold (int or dictionary:{column: freq_limit_value}, default 0) – Categories with a count/frequency below this threshold will be omitted from the encoding and corresponding data will be mapped to the OOV indices. Can be given as either an integer or a dictionary with column names as keys and frequency limits as values. If a dictionary is used, all targeted columns must be included in the dictionary.

  • encode_type ({"joint", "combo"}, default "joint") – If “joint”, the columns within any multi-column group will be jointly encoded. If “combo”, the combination of values will be encoded as a new column. Note that replacement is not allowed for “combo”, because the same column name can be included in multiple groups.

  • split_out (dict or int, optional) – Number of files needed to store the unique values of each categorical column. High-cardinality columns may require split_out>1, while low-cardinality columns should be fine with the split_out=1 default. If passing a dict, each key and value should correspond to the column name and value, respectively. The default value is 1 for all columns.

  • split_every (dict or int, optional) – Number of adjacent partitions to aggregate in each tree-reduction node. The default value is 8 for all columns.

  • out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.

  • on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).

  • cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns.

  • dtype – If specified, categorical labels will be cast to this dtype after encoding is performed.

  • name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.

  • search_sorted (bool, default False) – Set to True to use a searchsorted algorithm for encoding.

  • num_buckets (int, or dictionary:{column: num_oov_indices}, optional) – Number of indices to reserve for out-of-vocabulary (OOV) encoding at transformation time. By default, all OOV values will be mapped to the same index (2). If num_buckets is set to an integer greater than one, a column-wise hash and modulo will be used to map each OOV value to an index in the range [2, 2 + num_buckets). A dictionary may be used if the desired num_buckets behavior varies by column.

  • max_size (int or dictionary:{column: max_size_value}, optional) – Set the maximum size of the expected embedding table for each column. For example, if max_size is set to 1000, only the 997 most-frequent values will be included in the unique-value vocabulary, and all remaining non-null values will be mapped to the OOV indices (indices 0 and 1 will still be reserved for padding and nulls). To use multiple OOV indices for infrequent values, set the num_buckets parameter accordingly. Note that max_size cannot be combined with freq_threshold, and it cannot be less than num_buckets + 2. By default, the total number of encoding indices will be unconstrained.

  • cardinality_memory_limit (int or str, optional) – Upper limit on the “allowed” memory usage of the internal DataFrame and Table objects used to store unique categories. By default, this limit is 12.5% of the total memory. Note that this argument is meant as a guide for internal optimizations and UserWarnings within NVTabular, and does not guarantee that the memory limit will be satisfied.
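
The `num_buckets` and `max_size` descriptions above involve some index arithmetic that a small sketch may make clearer. Both helpers below are hypothetical, and CRC32 is only a stand-in for whatever hash function NVTabular applies, so the bucket chosen for a given value will not match the library's output.

```python
import zlib


def oov_index(value, num_buckets=1):
    """Map an out-of-vocabulary value to an index in [2, 2 + num_buckets)."""
    # ASSUMPTION: CRC32 is used here purely as an illustrative, stable hash.
    digest = zlib.crc32(str(value).encode("utf-8"))
    return 2 + digest % num_buckets


def vocab_capacity(max_size, num_buckets=1):
    """Vocabulary slots left after reserving padding, null and OOV indices."""
    return max_size - 2 - num_buckets


print(oov_index("unseen_user"))                   # single OOV index
print(oov_index("unseen_user", num_buckets=10))   # one of ten buckets
print(vocab_capacity(1000))                       # slots left for vocabulary
```

With a single OOV index, a `max_size` of 1000 leaves 997 vocabulary slots, which matches the example in the `max_size` description. Reserving more hash buckets reduces the vocabulary capacity by the same amount.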

__init__(freq_threshold=0, out_path=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, single_table=False, cardinality_memory_limit=None, tree_width=None, split_out=1, split_every=8, **kwargs)[source]
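
The parameter descriptions state a few constraints on the constructor arguments: a `freq_threshold` dictionary must cover every targeted column, `max_size` cannot be combined with `freq_threshold`, and `max_size` cannot be less than `num_buckets + 2`. A pre-flight check along those lines might look like the sketch below. The functions `per_column` and `check_arguments` are hypothetical and are not part of NVTabular; the sketch also assumes that a value of 0 means "not set" for both `freq_threshold` and `max_size`.

```python
def per_column(value, columns, default, require_all=False, name="argument"):
    """Expand a scalar-or-dict argument into a {column: value} dict."""
    if not isinstance(value, dict):
        return {col: (default if value is None else value) for col in columns}
    if require_all:
        missing = [col for col in columns if col not in value]
        if missing:
            raise ValueError(f"{name} has no entry for columns: {missing}")
    return {col: value.get(col, default) for col in columns}


def check_arguments(columns, freq_threshold=0, num_buckets=None, max_size=0):
    """Validate the documented constraints and return per-column settings."""
    thresholds = per_column(freq_threshold, columns, 0,
                            require_all=True, name="freq_threshold")
    buckets = per_column(num_buckets, columns, 1)  # default: one OOV index
    sizes = per_column(max_size, columns, 0)
    for col in columns:
        if sizes[col] and thresholds[col]:
            raise ValueError(
                f"{col}: max_size cannot be combined with freq_threshold")
        if sizes[col] and sizes[col] < buckets[col] + 2:
            raise ValueError(
                f"{col}: max_size cannot be less than num_buckets + 2")
    return thresholds, buckets, sizes


settings = check_arguments(
    ["author", "productID"],
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20},
)
print(settings)
```

Running such a check before building the workflow surfaces configuration mistakes early instead of during `fit`.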

Methods

__init__([freq_threshold, out_path, …])

clear()

Clear the internal state of the operator’s stats.

column_mapping(col_selector)

compute_column_schema(col_name, input_schema)

compute_input_schema(root_schema, …)

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

compute_output_schema(input_schema, col_selector)

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

compute_selector(input_schema, selector, …)

create_node(selector)

export(path, input_schema, output_schema, …)

Export the class object as a config and all related files to the user defined path.

fit(col_selector, ddf)

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(categories)

Finalize statistics calculation - the workflow calls this function with the computed statistics from the ‘fit’ object.

get_embedding_sizes(columns)

inference_initialize(columns, inference_config)

load_artifacts([artifact_path])

Load artifacts from disk required for operator function.

output_column_names(col_selector)

Given a set of column names, returns the names of the transformed columns this operator will produce

process_vocabs(vocabs)

Process vocabs passed in by the user.

save_artifacts([artifact_path])

Save artifacts required to reload operator state from disk

set_storage_path(new_path[, copy])

transform(col_selector, df)

Transform the dataframe by applying this operator to the set of input columns

validate_schemas(parents_schema, …[, …])

Provides a hook method that sub-classes can override to implement schema validation logic.

Attributes

dependencies

Defines an optional list of column dependencies for this operator.

dynamic_dtypes

export_name

Provides a clear, common English identifier for this operator.

fitted

is_subgraph

label

output_dtype

output_properties

output_tags

supported_formats

supports

Returns what kind of data representation this operator supports

fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)[source]

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(categories)[source]

Finalize statistics calculation - the workflow calls this function with the computed statistics from the ‘fit’ object.

clear()[source]

Clear the internal state of the operator’s stats.

process_vocabs(vocabs)[source]

Process vocabs passed in by the user.

set_storage_path(new_path, copy=False)[source]

transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Transform the dataframe by applying this operator to the set of input columns

Parameters
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • df (Transformable) – A pandas or cudf dataframe that this operator will work on

Returns

Returns a transformed dataframe or dictarray for this operator

Return type

Transformable

column_mapping(col_selector)[source]

property output_tags

property output_dtype

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]

get_embedding_sizes(columns)[source]

inference_initialize(columns, inference_config)[source]