nvtabular.ops.Categorify
class nvtabular.ops.Categorify(freq_threshold=0, out_path=None, tree_width=None, na_sentinel=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, start_index=0, single_table=False, cardinality_memory_limit=None)
Bases: nvtabular.ops.stat_operator.StatOperator
Most datasets contain categorical features, and these variables are typically stored as text values. Machine learning algorithms generally cannot consume text values directly. Adding the Categorify operation to a workflow transforms categorical features into unique integer values.
Example usage:
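    import nvtabular as nvt

    # CATEGORICAL_COLUMNS (a list of column names) and dataset (an nvt.Dataset)
    # are assumed to be defined already.

    # Define pipeline
    cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(freq_threshold=10)

    # Initialize the workflow and execute it
    proc = nvt.Workflow(cat_features)
    proc.fit(dataset)
    proc.transform(dataset).to_parquet('./test/')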
Example for frequency hashing:
    import cudf
    import nvtabular as nvt

    # Create toy dataset
    df = cudf.DataFrame({
        'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
        'productID': [100, 101, 102, 101, 102, 103, 103],
        'label': [0, 0, 1, 1, 1, 0, 0]
    })
    dataset = nvt.Dataset(df)

    # Define pipeline
    CATEGORICAL_COLUMNS = ['author', 'productID']
    cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
        freq_threshold={"author": 3, "productID": 2},
        num_buckets={"author": 10, "productID": 20})

    # Initialize the workflow and execute it
    proc = nvt.Workflow(cat_features)
    proc.fit(dataset)
    ddf = proc.transform(dataset).to_ddf()

    # Print results
    print(ddf.compute())
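The idea of combining freq_threshold with num_buckets can be sketched in plain Python. This is a conceptual illustration only, not NVTabular's implementation: the helper name frequency_hash_encode, the use of zlib.crc32 as the hash function, and the id layout (0 for null, 1 to num_buckets for hashed infrequent values, then one id per frequent value) are all assumptions made for the example.

```python
import zlib
from collections import Counter


def frequency_hash_encode(values, freq_threshold, num_buckets):
    """Toy frequency hashing (hypothetical helper, not part of NVTabular).

    Assumed id layout, for illustration only:
      0                     -> null / missing
      1 .. num_buckets      -> hashed ids shared by infrequent values
      num_buckets + 1 ...   -> one dedicated id per frequent value
    """
    counts = Counter(v for v in values if v is not None)

    # Values seen at least freq_threshold times keep a dedicated id.
    # Order: descending count, ties broken by string form (an assumption).
    frequent = sorted(
        (v for v, c in counts.items() if c >= freq_threshold),
        key=lambda v: (-counts[v], str(v)),
    )
    vocab = {v: num_buckets + 1 + i for i, v in enumerate(frequent)}

    def encode(v):
        if v is None:
            return 0
        if v in vocab:
            return vocab[v]
        # Infrequent values: hash, then apply the modulo to pick a bucket.
        return 1 + zlib.crc32(str(v).encode("utf-8")) % num_buckets

    return [encode(v) for v in values]


authors = ["User_A", "User_B", "User_C", "User_C", "User_A", "User_B", "User_A"]
encoded_authors = frequency_hash_encode(authors, freq_threshold=3, num_buckets=10)
```

With these inputs only User_A reaches the threshold of 3, so it receives a dedicated id, while User_B and User_C fall into hash buckets and may collide with each other.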
Example with multi-hot:
    import cudf
    import nvtabular as nvt

    # Create toy dataset
    df = cudf.DataFrame({
        'userID': [10001, 10002, 10003],
        'productID': [30003, 30005, 40005],
        'categories': [['Cat A', 'Cat B'], ['Cat C'], ['Cat A', 'Cat C', 'Cat D']],
        'label': [0, 0, 1]
    })
    dataset = nvt.Dataset(df)

    # Define pipeline
    CATEGORICAL_COLUMNS = ['userID', 'productID', 'categories']
    cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()

    # Initialize the workflow and execute it
    proc = nvt.Workflow(cat_features)
    proc.fit(dataset)
    ddf = proc.transform(dataset).to_ddf()

    # Print results
    print(ddf.compute())
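For a list-valued (multi-hot) column such as categories, the encoding applies to each element of each list, with one vocabulary shared across all rows. A plain-Python sketch of that idea follows. The helper name encode_multi_hot is hypothetical, and assigning ids in sorted order starting at 1 (with 0 reserved for null or unseen values) is an assumption, so the ids NVTabular actually produces may differ.

```python
def encode_multi_hot(rows):
    """Toy element-wise encoding of a list column (hypothetical helper)."""
    # Build a single vocabulary from every element of every row.
    unique_values = sorted({item for row in rows for item in row})
    # Assumption: id 0 is reserved for null/unseen values, so ids start at 1.
    vocab = {value: idx for idx, value in enumerate(unique_values, start=1)}
    # Each row keeps its length; only the elements are replaced by ids.
    encoded = [[vocab.get(item, 0) for item in row] for row in rows]
    return encoded, vocab


categories = [["Cat A", "Cat B"], ["Cat C"], ["Cat A", "Cat C", "Cat D"]]
encoded_categories, category_vocab = encode_multi_hot(categories)
```

The row structure is preserved: a row with three categories still holds three integers after encoding.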
- Parameters
freq_threshold (int or dictionary:{column: freq_limit_value}, default 0) – Categories with a count/frequency below this threshold will be omitted from the encoding and corresponding data will be mapped to the “null” category. Can be given either as an integer or as a dictionary with column names as keys and frequency limits as values. If a dictionary is used, it must include every targeted column.
encode_type ({"joint", "combo"}, default "joint") – If “joint”, the columns within any multi-column group will be jointly encoded. If “combo”, the combination of values will be encoded as a new column. Note that replacement is not allowed for “combo”, because the same column name can be included in multiple groups.
tree_width (dict or int, optional) – Tree width of the hash-based groupby reduction for each categorical column. High-cardinality columns may require a large tree_width, while low-cardinality columns can likely use tree_width=1. If passing a dict, each key and value should correspond to the column name and width, respectively. The default value is 8 for all columns.
out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.
on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).
na_sentinel (default 0) – Label to use for null-category mapping
cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns.
dtype – If specified, categorical labels will be cast to this dtype after encoding is performed.
name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.
search_sorted (bool, default False) – Set to True to use the searchsorted algorithm for encoding.
num_buckets (int, or dictionary:{column: num_hash_buckets}) – Column-wise modulo to apply after hash function. Note that this means that the corresponding value will be the categorical cardinality of the transformed categorical feature. If given as an int, that value will be used as the number of “hash buckets” for every feature. If a dictionary is passed, it will be used to specify explicit mappings from a column name to a number of buckets. In this case, only the columns specified in the keys of num_buckets will be transformed.
max_size (int or dictionary:{column: max_size_value}, default 0) – Sets the maximum size of the embedding table for each column. For example, if max_size is set to 1000, only the 999 most frequent values for each column will be encoded, and the rest will be mapped to a single value (0). To map the rest to a number of buckets instead, set the num_buckets parameter > 1. In that case, the topK value will be max_size - num_buckets - 1. If max_size is set, freq_threshold should not be given. If the num_buckets parameter is set, it must be smaller than the max_size value.
start_index (int, default 0) – The start index where Categorify will begin to translate dataframe entries into integer values, including an initial out-of-vocabulary encoding value. For instance, if our original translated dataframe entries appear as [[1], [1, 4], [3, 2], [2]], with an out-of-vocabulary value of 0, then with a start_index of 16, Categorify will reserve 16 as the out-of-vocabulary encoding value, and our new translated dataframe entry will now be [[17], [17, 20], [19, 18], [18]]. This parameter is useful to reserve an initial segment of non-negative translated integers for special user-defined values.
cardinality_memory_limit (int or str, default None) – Upper limit on the “allowed” memory usage of the internal DataFrame and Table objects used to store unique categories. By default, this limit is 12.5% of the total memory. Note that this argument is meant as a guide for internal optimizations and UserWarnings within NVTabular, and does not guarantee that the memory limit will be satisfied.
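The max_size and start_index descriptions above can be made concrete with a small plain-Python sketch. It is illustrative only: the function names top_k_encode and apply_start_index are hypothetical, and the tie-breaking order among equally frequent values is an assumption rather than documented NVTabular behavior.

```python
from collections import Counter


def top_k_encode(values, max_size):
    """Keep the (max_size - 1) most frequent values; map the rest to 0."""
    counts = Counter(values)
    # Assumption: ties are broken by string form to keep results deterministic.
    ranked = sorted(counts, key=lambda v: (-counts[v], str(v)))
    kept = ranked[: max_size - 1]
    vocab = {value: idx for idx, value in enumerate(kept, start=1)}
    return [vocab.get(v, 0) for v in values]


def apply_start_index(encoded_rows, start_index):
    """Shift every encoded id, including the out-of-vocabulary id 0."""
    return [[code + start_index for code in row] for row in encoded_rows]


# max_size=3 leaves room for the two most frequent values plus id 0.
product_ids = [100, 101, 102, 101, 102, 103, 101]
top_two = top_k_encode(product_ids, max_size=3)

# Same numbers as the start_index description: 0 becomes 16, 1 becomes 17, ...
shifted = apply_start_index([[1], [1, 4], [3, 2], [2]], start_index=16)
```

Here shifted reproduces the start_index example from the parameter description, [[17], [17, 20], [19, 18], [18]], and the out-of-vocabulary value moves from 0 to 16.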
__init__(freq_threshold=0, out_path=None, tree_width=None, na_sentinel=None, cat_cache='host', dtype=None, on_host=True, encode_type='joint', name_sep='_', search_sorted=False, num_buckets=None, vocabs=None, max_size=0, start_index=0, single_table=False, cardinality_memory_limit=None)
Methods
__init__([freq_threshold, out_path, …])
clear() – Clear the internal state of the operator’s stats.
column_mapping(col_selector)
compute_column_schema(col_name, input_schema)
compute_input_schema(root_schema, …) – Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use.
compute_output_schema(input_schema, col_selector) – Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce.
compute_selector(input_schema, selector, …)
create_node(selector)
fit(col_selector, ddf) – Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.
fit_finalize(categories) – Finalize statistics calculation. The workflow calls this function with the computed statistics from the ‘fit’ object.
get_embedding_sizes(columns)
inference_initialize(columns, inference_config)
load_artifacts([artifact_path]) – Load artifacts from disk required for operator function.
output_column_names(col_selector) – Given a set of column names, returns the names of the transformed columns this operator will produce.
process_vocabs(vocabs) – Process vocabs passed in by the user.
save_artifacts([artifact_path]) – Save artifacts required to reload operator state from disk.
set_storage_path(new_path[, copy])
transform(col_selector, df) – Transform the dataframe by applying this operator to the set of input columns.
validate_schemas(parents_schema, …[, …]) – Provides a hook method that sub-classes can override to implement schema validation logic.
Attributes
dependencies – Defines an optional list of column dependencies for this operator.
dynamic_dtypes
is_subgraph
label
output_properties
supported_formats
supports – Returns what kind of data representation this operator supports.
fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)
Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.
fit_finalize(categories)
Finalize statistics calculation. The workflow calls this function with the computed statistics from the ‘fit’ object.
transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Transform the dataframe by applying this operator to the set of input columns.
- Parameters
col_selector (ColumnSelector) – The columns to apply this operator to
df (DataFrame) – A pandas or cudf dataframe that this operator will work on
- Returns
Returns a transformed dataframe for this operator
- Return type
DataFrame
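The fit, fit_finalize, and transform methods form a two-phase pattern: statistics are gathered first, then applied. The toy class below mimics that call order on plain Python lists. ToyCategorify is a hypothetical stand-in, not NVTabular's Categorify. In particular, the real fit returns a dask future that the workflow computes, whereas this sketch returns the statistics directly, and the id assignment (sorted order, 0 for unseen values) is an assumption.

```python
class ToyCategorify:
    """Hypothetical stand-in mirroring the fit -> fit_finalize -> transform order."""

    def __init__(self):
        self.categories = {}

    def fit(self, columns, data):
        # Phase 1: collect the unique values seen in each selected column.
        return {col: sorted(set(data[col])) for col in columns}

    def fit_finalize(self, stats):
        # Phase 2: store the computed statistics as a value -> id mapping.
        # Assumption: id 0 is reserved for values not seen during fit.
        for col, uniques in stats.items():
            self.categories[col] = {v: i for i, v in enumerate(uniques, start=1)}

    def transform(self, columns, data):
        # Phase 3: replace each value with its id; unseen values become 0.
        out = dict(data)
        for col in columns:
            mapping = self.categories[col]
            out[col] = [mapping.get(v, 0) for v in data[col]]
        return out


train = {"author": ["User_A", "User_B", "User_C"], "label": [0, 0, 1]}
op = ToyCategorify()
op.fit_finalize(op.fit(["author"], train))
result = op.transform(["author"], {"author": ["User_B", "User_Z"], "label": [1, 0]})
```

Columns outside the selection (label here) pass through untouched, and a value that was absent during fit (User_Z) falls back to id 0.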
property output_dtype
compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector