TargetEncoding
- 
class nvtabular.ops.TargetEncoding(target, target_mean=None, kfold=None, fold_seed=42, p_smooth=20, out_col=None, out_dtype=None, tree_width=None, cat_cache='host', out_path=None, on_host=True, name_sep='_', drop_folds=True)[source]
- Bases: nvtabular.ops.stat_operator.StatOperator

Target encoding is a common feature-engineering technique for categorical columns in tabular datasets. For each categorical group, the mean of a continuous target column is calculated, and the group-specific mean of each row is used to create a new feature (column). To prevent overfitting, the following additional logic is applied:

1. Cross Validation: To prevent overfitting on the training data, a cross-validation strategy is used. The data is split into k random "folds", and the mean values within the i-th fold are calculated with data from all other folds. The cross-validation strategy is only employed when the dataset is used to update recorded statistics. For transformation-only workflow execution, global-mean statistics are used instead.

2. Smoothing: To prevent overfitting for low-cardinality categories, the means are smoothed with the overall mean of the target variable.

Target Encoding Function:

    TE = ((mean_cat * count_cat) + (mean_global * p_smooth)) / (count_cat + p_smooth)

    count_cat   := count of the categorical value
    mean_cat    := mean target value of the categorical value
    mean_global := mean target value of the whole dataset
    p_smooth    := smoothing factor

(A small worked sketch of this formula follows the parameter list below.)

Example usage:

    # First, we can transform the label columns to binary targets
    LABEL_COLUMNS = ['label1', 'label2']
    labels = ColumnSelector(LABEL_COLUMNS) >> (lambda col: (col > 0).astype('int8'))

    # We target encode cat1, cat2, and the cross column cat2 x cat3
    target_encode = (
        ['cat1', 'cat2', ['cat2', 'cat3']] >>
        nvt.ops.TargetEncoding(
            labels,
            kfold=5,
            p_smooth=20,
            out_dtype="float32",
        )
    )
    processor = nvt.Workflow(target_encode)

- Parameters
- target (str) – Continuous target column to use for the encoding of cat_groups. The same continuous target will be used for all cat_groups. 
- target_mean (float) – Global mean of the target column to use for encoding. Supplying this value up-front will improve performance. 
- kfold (int, default 3) – Number of cross-validation folds to use while gathering statistics. 
- fold_seed (int, default 42) – Random seed to use for numpy-based fold assignment. 
- p_smooth (int, default 20) – Smoothing factor. 
- out_col (str or list of str, default is problem-specific) – Name of output target-encoding column. If cat_groups includes multiple elements, this should be a list of the same length (and elements must be unique). 
- out_dtype (str, default is problem-specific) – dtype of output target-encoding columns. 
- tree_width (dict or int, optional) – Tree width of the hash-based groupby reduction for each categorical column. High-cardinality columns may require a large tree_width, while low-cardinality columns can likely use tree_width=1. If passing a dict, each key and value should correspond to the column name and width, respectively. The default value is 8 for all columns. 
- cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns. 
- out_path (str, optional) – Root directory where category statistics will be written out in parquet format. 
- on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure). 
- name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups. 
- drop_folds (bool, default True) – Whether to drop the "__fold__" column that is created. This is really only useful for unit tests. 
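
As a hedged illustration of the smoothing formula above, the sketch below recomputes the encoding for a toy pandas frame; the column names ("cat1", "label1") and the data are made up for this example and are not produced by the operator itself.

    import pandas as pd

    # Toy data: a categorical column and a binary target (illustrative only).
    df = pd.DataFrame({
        "cat1":   ["a", "a", "a", "b", "b", "c"],
        "label1": [1,   0,   1,   0,   0,   1],
    })

    p_smooth = 20
    mean_global = df["label1"].mean()                       # mean target over the whole dataset
    stats = df.groupby("cat1")["label1"].agg(["mean", "count"])

    # TE = ((mean_cat * count_cat) + (mean_global * p_smooth)) / (count_cat + p_smooth)
    te = (stats["mean"] * stats["count"] + mean_global * p_smooth) / (stats["count"] + p_smooth)
    print(te)  # each category's encoding is pulled toward the global mean

With only a handful of rows per category, the smoothed values stay close to the global mean of 0.5, which is the behavior p_smooth is meant to provide for low-cardinality categories.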
 
 - 
fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)[source]
- Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow. 
 - 
fit_finalize(dask_stats)[source]
- Finalize the statistics calculation - the workflow calls this function with the computed statistics from 'fit'. 
 - 
property dependencies
 - 
compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]
 - 
property output_dtype
 - 
property target_columns
 - 
transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]
- Transform the dataframe by applying this operator to the set of input columns.

- Parameters
- col_selector (ColumnSelector) – The columns to apply this operator to 
- df (Dataframe) – A pandas or cudf dataframe that this operator will work on 
 
- Returns
- Returns a transformed dataframe for this operator 
- Return type
- DataFrame
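
transform is normally invoked by the workflow rather than called directly. As a minimal sketch, assuming a parquet file containing the columns used in the example above (the file path and column names are illustrative):

    import nvtabular as nvt

    dataset = nvt.Dataset("train.parquet")  # assumed input with cat1, cat2, cat3, label1

    target_encode = (
        ["cat1", ["cat2", "cat3"]] >>
        nvt.ops.TargetEncoding("label1", kfold=5, p_smooth=20, out_dtype="float32")
    )
    workflow = nvt.Workflow(target_encode)

    # fit() gathers the per-fold and global category statistics;
    # transform() then appends the target-encoded columns.
    encoded = workflow.fit_transform(dataset).to_ddf().compute()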