nvtabular.ops.TargetEncoding

class nvtabular.ops.TargetEncoding(target, target_mean=None, kfold=None, fold_seed=42, p_smooth=20, out_col=None, out_dtype=None, split_out=None, split_every=None, cat_cache='host', out_path=None, on_host=True, name_sep='_', drop_folds=True, tree_width=None)[source]

Bases: merlin.dag.ops.stat_operator.StatOperator

Target encoding is a common feature-engineering technique for categorical columns in tabular datasets. For each categorical group, the mean of a continuous target column is calculated, and each row receives its group's mean as a new feature (column). To prevent overfitting, the following additional logic is applied:

1. Cross Validation: To prevent overfitting on the training data, a cross-validation strategy is used: the data is split into k random “folds”, and the mean values for rows in the i-th fold are calculated using data from all other folds. The cross-validation strategy is only employed when the dataset is used to update recorded statistics (i.e. during fit). For transform-only workflow execution, global-mean statistics are used instead.

2. Smoothing: To prevent overfitting for low-cardinality categories, the per-category means are smoothed with the overall mean of the target variable.

Target Encoding Function:

TE = ((mean_cat*count_cat)+(mean_global*p_smooth)) / (count_cat+p_smooth)

count_cat := count of the categorical value
mean_cat := mean target value of the categorical value
mean_global := mean target value of the whole dataset
p_smooth := smoothing factor

Example usage:

import nvtabular as nvt
from merlin.dag import ColumnSelector

# First, we can transform the label columns to binary targets
LABEL_COLUMNS = ['label1', 'label2']
labels = ColumnSelector(LABEL_COLUMNS) >> (lambda col: (col > 0).astype('int8'))
# We target encode cat1, cat2 and the cross column cat2 x cat3
target_encode = (
    ['cat1', 'cat2', ['cat2', 'cat3']] >>
    nvt.ops.TargetEncoding(
        labels,
        kfold=5,
        p_smooth=20,
        out_dtype="float32",
    )
)
processor = nvt.Workflow(target_encode)
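
The workflow can then be fit and used to apply the encoding. A minimal sketch, assuming df is a cudf or pandas DataFrame containing the cat1, cat2, cat3 and label columns:

dataset = nvt.Dataset(df)
processor.fit(dataset)
encoded = processor.transform(dataset).to_ddf().compute()

fit computes and records the per-category statistics; transform then adds the encoded columns to the output.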
Parameters
  • target (str) – Continuous target column to use for the encoding of cat_groups. The same continuous target will be used for all cat_groups.

  • target_mean (float) – Global mean of the target column to use for encoding. Supplying this value up-front will improve performance.

  • kfold (int, default 3) – Number of cross-validation folds to use while gathering statistics.

  • fold_seed (int, default 42) – Random seed to use for numpy-based fold assignment.

  • p_smooth (int, default 20) – Smoothing factor.

  • out_col (str or list of str, default is problem-specific) – Name of output target-encoding column. If cat_groups includes multiple elements, this should be a list of the same length (and elements must be unique).

  • out_dtype (str, default is problem-specific) – dtype of output target-encoding columns.

  • split_out (dict or int, optional) – Number of files needed to store the final result of each groupby reduction. High-cardinality groups may require a large split_out, while low-cardinality columns can likely use split_out=1 (default). If passing a dict, each key and value should correspond to the column name and value, respectively (see the dict-based sketch after this parameter list). The default value is 1 for all columns.

  • split_every (dict or int, optional) – Number of adjacent partitions to aggregate in each tree-reduction node. The default value is 8 for all columns.

  • cat_cache ({"device", "host", "disk"} or dict) – Location to cache the list of unique categories for each categorical column. If passing a dict, each key and value should correspond to the column name and location, respectively. Default is “host” for all columns.

  • out_path (str, optional) – Root directory where category statistics will be written out in parquet format.

  • on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).

  • name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.

  • drop_folds (bool, default True) – Whether to drop the “__fold__” column created during statistics collection. Keeping the column (drop_folds=False) is mainly useful for unit tests.
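
For per-column control, the split_out, split_every and cat_cache parameters also accept dicts keyed by column name, as noted above. A small sketch reusing the column names from the earlier example (the specific values are illustrative, not recommendations):

target_encode = (
    ['cat1', 'cat2'] >>
    nvt.ops.TargetEncoding(
        'label1',
        kfold=5,
        cat_cache={'cat1': 'device', 'cat2': 'host'},  # keep the smaller cat1 category list on the GPU
        split_out={'cat1': 1, 'cat2': 4},              # spread high-cardinality cat2 statistics over 4 files
    )
)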

__init__(target, target_mean=None, kfold=None, fold_seed=42, p_smooth=20, out_col=None, out_dtype=None, split_out=None, split_every=None, cat_cache='host', out_path=None, on_host=True, name_sep='_', drop_folds=True, tree_width=None)[source]

Methods

__init__(target[, target_mean, kfold, …])

clear()

column_mapping(col_selector)

compute_column_schema(col_name, input_schema)

compute_input_schema(root_schema, …)

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

compute_output_schema(input_schema, col_selector)

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

compute_selector(input_schema, selector, …)

create_node(selector)

export(path, input_schema, output_schema, …)

Export the class object as a config and all related files to the user defined path.

fit(col_selector, ddf)

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(dask_stats)

Finalize statistics calculation - the workflow calls this function with the computed statistics from the ‘fit’ object.

load_artifacts([artifact_path])

Load artifacts from disk required for operator function.

output_column_names(col_selector)

Given a set of columns names returns the names of the transformed columns this operator will produce

save_artifacts([artifact_path])

Save artifacts required to reload operator state from disk.

set_storage_path(new_path[, copy])

transform(col_selector, df)

Transform the dataframe by applying this operator to the set of input columns

validate_schemas(parents_schema, …[, …])

Provides a hook method that sub-classes can override to implement schema validation logic.

Attributes

dependencies

dynamic_dtypes

export_name

Provides a clear, common English identifier for this operator.

fitted

is_subgraph

label

output_dtype

output_properties

output_tags

supported_formats

supports

Returns what kind of data representation this operator supports

target_columns

fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)[source]

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(dask_stats)[source]

Finalize statistics calculation - the workflow calls this function with the computed statistics from the ‘fit’ object.

property dependencies

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]

column_mapping(col_selector)[source]

property output_dtype

property output_tags

property target_columns

set_storage_path(new_path, copy=False)[source]

clear()[source]

transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Transform the dataframe by applying this operator to the set of input columns

Parameters
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • df (DataFrame) – A pandas or cudf dataframe that this operator will work on

Returns

Returns a transformed dataframe or dictarray for this operator

Return type

Transformable