nvtabular.ops.Categorify#
- class nvtabular.ops.Categorify(freq_threshold=0, out_path=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, single_table=False, cardinality_memory_limit=None, tree_width=None, split_out=1, split_every=8, **kwargs)[source]#
Bases: StatOperator
Most datasets contain categorical features, and these variables are typically stored as text values. Most machine learning algorithms cannot operate on text values directly. The Categorify operation can be added to the workflow to transform categorical features into unique integer values.
Encoding Convention:
- `0`: Not used by `Categorify` (reserved for padding).
- `1`: Null and NaN values.
- `[2, 2 + num_buckets)`: Out-of-vocabulary (OOV) values, including hash buckets.
- `[2 + num_buckets, max_size)`: Unique vocabulary.
Example usage:
```python
# Define pipeline
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(freq_threshold=10)

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
proc.transform(dataset).to_parquet('./test/')
```
Example for frequency hashing:
```python
import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C',
               'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
```
Example with multi-hot:
```python
import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'userID': [10001, 10002, 10003],
    'productID': [30003, 30005, 40005],
    'categories': [['Cat A', 'Cat B'], ['Cat C'],
                   ['Cat A', 'Cat C', 'Cat D']],
    'label': [0, 0, 1]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['userID', 'productID', 'categories']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
```
- Parameters:
freq_threshold (int or dictionary:{column: freq_limit_value}, default 0) – Categories with a count/frequency below this threshold will be omitted from the encoding, and the corresponding data will be mapped to the OOV indices. Can be given either as an integer or as a dictionary with column names as keys and frequency limits as values. If a dictionary is used, all targeted columns must be included in it.
encode_type ({"joint", "combo"}, default "joint") – If “joint”, the columns within any multi-column group will be jointly encoded. If “combo”, the combination of values will be encoded as a new column. Note that replacement is not allowed for “combo”, because the same column name can be included in multiple groups.
split_out (dict or int, optional) – Number of files needed to store the unique values of each categorical column. High-cardinality columns may require split_out>1, while low-cardinality columns should be fine with the split_out=1 default. If passing a dict, each key and value should correspond to the column name and value, respectively. The default value is 1 for all columns.
split_every (dict or int, optional) – Number of adjacent partitions to aggregate in each tree-reduction node. The default value is 8 for all columns.
out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.
on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).
cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns.
dtype – If specified, categorical labels will be cast to this dtype after encoding is performed.
name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.
search_sorted (bool, default False) – Set to True to use the searchsorted algorithm for encoding.
num_buckets (int, or dictionary:{column: num_oov_indices}, optional) – Number of indices to reserve for out-of-vocabulary (OOV) encoding at transformation time. By default, all OOV values will be mapped to the same index (2). If num_buckets is set to an integer greater than one, a column-wise hash and modulo will be used to map each OOV value to an index in the range [2, 2 + num_buckets). A dictionary may be used if the desired num_buckets behavior varies by column.
max_size (int or dictionary:{column: max_size_value}, optional) – Set the maximum size of the expected embedding table for each column. For example, if max_size is set to 1000, only the first 997 most- frequent values will be included in the unique-value vocabulary, and all remaining non-null values will be mapped to the OOV indices (indices 0 and 1 will still be reserved for padding and nulls). To use multiple OOV indices for infrequent values, set the num_buckets parameter accordingly. Note that max_size cannot be combined with freq_threshold, and it cannot be less than num_buckets + 2. By default, the total number of encoding indices will be unconstrained.
cardinality_memory_limit (int or str, optional) – Upper limit on the “allowed” memory usage of the internal DataFrame and Table objects used to store unique categories. By default, this limit is 12.5% of the total memory. Note that this argument is meant as a guide for internal optimizations and UserWarnings within NVTabular, and does not guarantee that the memory limit will be satisfied.
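As a rough illustration of how freq_threshold and max_size constrain the unique-value vocabulary, consider the pure-Python sketch below. The `build_vocab` helper is hypothetical and is not NVTabular's internal logic; it only reproduces the arithmetic described above (indices 0 and 1 reserved for padding and nulls, plus `num_buckets` OOV indices):

```python
from collections import Counter

# Hypothetical sketch (not NVTabular's code): build a vocabulary the way
# the freq_threshold and max_size parameters constrain it.

def build_vocab(values, freq_threshold=0, max_size=0, num_buckets=1):
    counts = Counter(v for v in values if v is not None)
    # Drop categories whose frequency is below the threshold.
    kept = [v for v, c in counts.most_common() if c >= freq_threshold]
    if max_size:
        # Indices 0 and 1 are reserved for padding and nulls, and
        # num_buckets indices are reserved for OOV, so only
        # max_size - 2 - num_buckets slots remain for real categories.
        kept = kept[: max_size - 2 - num_buckets]
    return kept

authors = ['User_A', 'User_B', 'User_C', 'User_C',
           'User_A', 'User_B', 'User_A']
print(build_vocab(authors, freq_threshold=3))           # ['User_A']
print(build_vocab(authors, max_size=4, num_buckets=1))  # only 1 slot left
```

This also matches the max_size example above: with `max_size=1000` and the default single OOV index, 1000 - 2 - 1 = 997 slots remain for the most frequent values.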
- __init__(freq_threshold=0, out_path=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, single_table=False, cardinality_memory_limit=None, tree_width=None, split_out=1, split_every=8, **kwargs)[source]#
Methods
- __init__([freq_threshold, out_path, ...])
- clear() – Clear the internal state of the operator's stats.
- column_mapping(col_selector)
- compute_column_schema(col_name, input_schema)
- compute_input_schema(root_schema, ...) – Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use.
- compute_output_schema(input_schema, col_selector) – Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce.
- compute_selector(input_schema, selector, ...)
- create_node(selector)
- export(path, input_schema, output_schema, ...) – Export the class object as a config and all related files to the user-defined path.
- fit(col_selector, ddf) – Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.
- fit_finalize(categories) – Finalize statistics calculation; the workflow calls this function with the computed statistics from the 'fit' object.
- get_embedding_sizes(columns)
- inference_initialize(columns, inference_config)
- load_artifacts([artifact_path]) – Load artifacts from disk required for operator function.
- output_column_names(col_selector) – Given a set of column names, returns the names of the transformed columns this operator will produce.
- process_vocabs(vocabs) – Process vocabs passed in by the user.
- save_artifacts([artifact_path]) – Save artifacts required to reload operator state from disk.
- set_storage_path(new_path[, copy])
- transform(col_selector, df) – Transform the dataframe by applying this operator to the set of input columns.
- validate_schemas(parents_schema, ...[, ...]) – Provides a hook method that sub-classes can override to implement schema validation logic.
Attributes
- dependencies – Defines an optional list of column dependencies for this operator.
- dynamic_dtypes
- export_name – Provides a clear, common English identifier for this operator.
- fitted
- is_subgraph
- label
- output_properties
- supported_formats
- supports – Returns what kind of data representation this operator supports.
- fit(col_selector: ColumnSelector, ddf: DataFrame)[source]#
Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.
- fit_finalize(categories)[source]#
Finalize statistics calculation; the workflow calls this function with the computed statistics from the 'fit' object.
- transform(col_selector: ColumnSelector, df: DataFrame) DataFrame [source]#
Transform the dataframe by applying this operator to the set of input columns
- Parameters:
col_selector (ColumnSelector) – The columns to apply this operator to
df (Transformable) – A pandas or cudf dataframe that this operator will work on
- Returns:
Returns a transformed dataframe or dictarray for this operator
- Return type:
Transformable
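Conceptually, transform applies the fitted category mapping column-wise. The pure-Python analogue below is hypothetical (the real operator works on pandas/cudf dataframes and cached category files), but it shows the shape of the lookup:

```python
# Hypothetical analogue of transform (not NVTabular's code): replace each
# raw value in a column with its encoded integer id via a fitted mapping.

def transform_column(values, mapping, oov_index=2, null_index=1):
    out = []
    for v in values:
        if v is None:
            out.append(null_index)                 # nulls -> 1
        else:
            out.append(mapping.get(v, oov_index))  # unseen values -> OOV
    return out

# Mapping assumed produced by a prior fit: with a single OOV bucket,
# vocabulary ids start at 3 (0 = padding, 1 = null, 2 = OOV).
mapping = {"User_A": 3, "User_B": 4}
print(transform_column(["User_A", None, "User_Z"], mapping))  # [3, 1, 2]
```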
- property output_tags#
- property output_dtype#
- compute_selector(input_schema: Schema, selector: ColumnSelector, parents_selector: ColumnSelector, dependencies_selector: ColumnSelector) ColumnSelector [source]#