JoinGroupby(cont_cols=None, stats=('count',), tree_width=None, cat_cache='host', out_path=None, on_host=True, name_sep='_')
One of the ways to create new features is to calculate the basic statistics of the data that is grouped by categorical features. This operator groups the data by the given categorical feature(s) and calculates the desired statistics of requested continuous features (along with the count of rows in each group). The aggregated statistics are merged with the data (by joining on the desired categorical columns).
import nvtabular
from nvtabular import ops

# Use JoinGroupby to define a NVTabular workflow.
# tmpdir is any writable directory (e.g. pytest's tmpdir fixture).
groupby_features = ['cat1', 'cat2', 'cat3'] >> ops.JoinGroupby(
    out_path=str(tmpdir), stats=['sum', 'count'], cont_cols=['num1']
)
processor = nvtabular.Workflow(groupby_features)
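To make the group-then-join idea concrete, here is a minimal sketch in plain pandas. It is not the NVTabular implementation: the toy data and the output column names (cat1_count, cat1_num1_sum) are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): one categorical and one continuous column.
df = pd.DataFrame({
    "cat1": ["a", "a", "b", "b", "b"],
    "num1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Per-group statistics: the row count of each group and the sum of num1.
stats = (
    df.groupby("cat1")
    .agg(cat1_count=("num1", "size"), cat1_num1_sum=("num1", "sum"))
    .reset_index()
)

# Join the aggregated statistics back onto every row of the original data.
joined = df.merge(stats, on="cat1", how="left")
print(joined)
```

Every input row is kept, and each row gains the statistics of the group it belongs to.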
cont_cols (list of str or WorkflowNode) – The continuous columns to calculate statistics for (for each unique group in each column in columns).
stats (list of str, default ("count",)) – List of statistics to calculate for each unique group. Note that “count” corresponds to the group itself, while all other statistics correspond to a specific continuous column. Supported statistics include [“count”, “sum”, “mean”, “std”, “var”].
tree_width (dict or int, optional) – Tree width of the hash-based groupby reduction for each categorical column. High-cardinality columns may require a large tree_width, while low-cardinality columns can likely use tree_width=1. If passing a dict, each key and value should correspond to the column name and width, respectively. The default value is 8 for all columns.
cat_cache (ToDo Describe) – TEXT
out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.
on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).
name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.
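The interaction between stats, cont_cols and name_sep can be sketched with a small pure-Python helper. The function output_column_names is hypothetical, and the naming pattern it produces (group name, then continuous column, then statistic, with name_sep used only to join the columns of a multi-column group) is an assumption; check the columns of a fitted workflow before relying on it.

```python
def output_column_names(groups, cont_cols, stats, name_sep="_"):
    """Hypothetical helper: list the new column names for each group."""
    names = []
    for group in groups:
        # A multi-column group is given as a list of column names.
        cols = [group] if isinstance(group, str) else list(group)
        prefix = name_sep.join(cols)
        for stat in stats:
            if stat == "count":
                # "count" describes the group itself, not a continuous column.
                names.append(f"{prefix}_count")
            else:
                # Every other statistic is computed per continuous column.
                names.extend(f"{prefix}_{cont}_{stat}" for cont in cont_cols)
    return names


print(output_column_names(["cat1", ["cat2", "cat3"]], ["num1"], ["sum", "count"]))
```

Note that "count" yields one column per group regardless of how many continuous columns are requested, which matches the description of stats above.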
fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)
Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.
Finalize statistics calculation. The workflow calls this function with the computed statistics from the ‘fit’ object.
transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Transform the dataframe by applying this operator to the set of input columns
columns (list of str or list of list of str) – The columns to apply this operator to
df (Dataframe) – A pandas or cudf dataframe that this operator will work on
Returns a transformed dataframe for this operator
- Return type: DataFrame
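The fit, finalize and transform steps described above can be sketched with a toy operator. ToyJoinGroupby is a hypothetical stand-in written in plain pandas, not the NVTabular class: it computes statistics eagerly instead of returning a dask future, takes a plain list of column names instead of a ColumnSelector, and its output column names are assumptions.

```python
import pandas as pd


class ToyJoinGroupby:
    """Hypothetical stand-in illustrating the fit/finalize/transform flow."""

    def __init__(self, cont_cols, stats=("count",)):
        self.cont_cols = list(cont_cols)
        self.stats = list(stats)
        self.tables = {}

    def fit(self, columns, df):
        # Build one statistics table per categorical column.
        result = {}
        for cat in columns:
            grouped = df.groupby(cat)
            table = pd.DataFrame(index=grouped.size().index)
            for stat in self.stats:
                if stat == "count":
                    table[f"{cat}_count"] = grouped.size()
                else:
                    for cont in self.cont_cols:
                        table[f"{cat}_{cont}_{stat}"] = grouped[cont].agg(stat)
            result[cat] = table.reset_index()
        return result

    def fit_finalize(self, stats):
        # Store the computed statistics for later use by transform.
        self.tables = stats

    def transform(self, columns, df):
        # Left-join each statistics table onto the input rows.
        out = df
        for cat in columns:
            out = out.merge(self.tables[cat], on=cat, how="left")
        return out


df = pd.DataFrame({"cat1": ["a", "b", "a"], "num1": [1.0, 2.0, 3.0]})
op = ToyJoinGroupby(cont_cols=["num1"], stats=["count", "mean"])
op.fit_finalize(op.fit(["cat1"], df))
out = op.transform(["cat1"], df)
print(out)
```

Separating fit from transform means the statistics are computed once over the fitting data and can then be joined onto any later batch that shares the same categorical columns.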
compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector