nvtabular.ops.JoinGroupby

class nvtabular.ops.JoinGroupby(cont_cols=None, stats=('count',), split_out=None, split_every=None, cat_cache='host', out_path=None, on_host=True, name_sep='_', tree_width=None)[source]

Bases: merlin.dag.ops.stat_operator.StatOperator

One of the ways to create new features is to calculate the basic statistics of the data that is grouped by categorical features. This operator groups the data by the given categorical feature(s) and calculates the desired statistics of requested continuous features (along with the count of rows in each group). The aggregated statistics are merged with the data (by joining on the desired categorical columns).

Example usage:

# Use JoinGroupby to define an NVTabular workflow
import nvtabular
from nvtabular import ops

groupby_features = ['cat1', 'cat2', 'cat3'] >> ops.JoinGroupby(
    out_path=str(tmpdir), stats=['sum', 'count'], cont_cols=['num1']
)
processor = nvtabular.Workflow(groupby_features)
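
Fitting the workflow computes the per-group statistics and transforming joins them back onto each row. A minimal follow-on sketch, assuming a toy cudf dataframe; the output column names in the comment illustrate the naming scheme rather than a guarantee of the API:

import cudf

df = cudf.DataFrame({
    'cat1': ['a', 'a', 'b'], 'cat2': ['x', 'y', 'x'],
    'cat3': ['u', 'u', 'v'], 'num1': [1.0, 2.0, 3.0],
})
out = processor.fit_transform(nvtabular.Dataset(df)).to_ddf().compute()
# One aggregate column per (group, statistic) pair,
# e.g. 'cat1_count' and 'cat1_num1_sum'.
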
Parameters
  • cont_cols (list of str or WorkflowNode) – The continuous columns to calculate statistics for (statistics are computed for each unique group in each selected categorical column).

  • stats (list of str, default ('count',)) – List of statistics to calculate for each unique group. Note that “count” corresponds to the group itself, while all other statistics correspond to a specific continuous column. Supported statistics include [“count”, “sum”, “mean”, “std”, “var”].

  • split_out (dict or int, optional) – Number of files needed to store the final result of each groupby reduction. High-cardinality grouping columns may require a large split_out, while low-cardinality columns can likely use split_out=1 (the default). If a dict is passed, each key should be a grouping-column name and the corresponding value the split_out to use for that column (see the sketch after this parameter list). The default value is 1 for all columns.

  • split_every (dict or int, optional) – Number of adjacent partitions to aggregate in each tree-reduction node. The default value is 8 for all columns.

  • cat_cache ({"device", "host", "disk"}, default "host") – Where to cache the pre-computed group statistics when they are read back to be joined during transform.

  • out_path (str, optional) – Root directory where groupby statistics will be written out in parquet format.

  • on_host (bool, default True) – Whether to convert cudf data to pandas between tasks in the hash-based groupby reduction. The extra host <-> device data movement can reduce performance. However, using on_host=True typically improves stability (by avoiding device-level memory pressure).

  • name_sep (str, default "_") – String separator to use between concatenated column names for multi-column groups.
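
The split_out option is easiest to see with a per-column dict. A hedged sketch; the column names, statistics, and output path below are illustrative placeholders, not part of the API:

from nvtabular import ops

groupby_features = ['user_id', 'item_id'] >> ops.JoinGroupby(
    cont_cols=['purchase_amt'],
    stats=['mean', 'std', 'count'],
    # give the high-cardinality grouping column more output files
    split_out={'user_id': 4, 'item_id': 1},
    out_path='/tmp/nvt_groupby_stats',
)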

__init__(cont_cols=None, stats=('count',), split_out=None, split_every=None, cat_cache='host', out_path=None, on_host=True, name_sep='_', tree_width=None)[source]

Methods

__init__([cont_cols, stats, split_out, …])

clear()

column_mapping(col_selector)

compute_column_schema(col_name, input_schema)

compute_input_schema(root_schema, …)

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

compute_output_schema(input_schema, col_selector)

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

compute_selector(input_schema, selector, …)

create_node(selector)

export(path, input_schema, output_schema, …)

Export the class object as a config and all related files to the user-defined path.

fit(col_selector, ddf)

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(dask_stats)

Finalize statistics calculation - the workflow calls this function with the computed statistics from the 'fit' object.

load_artifacts([artifact_path])

Load artifacts from disk required for operator function.

output_column_names(col_selector)

Given a set of columns names returns the names of the transformed columns this operator will produce

save_artifacts([artifact_path])

Save artifacts required to reload operator state from disk

set_storage_path(new_path[, copy])

transform(col_selector, df)

Transform the dataframe by applying this operator to the set of input columns

validate_schemas(parents_schema, …[, …])

Provides a hook method that sub-classes can override to implement schema validation logic.

Attributes

cont_names

dependencies

dynamic_dtypes

export_name

Provides a clear, common English identifier for this operator.

fitted

is_subgraph

label

output_dtype

output_properties

output_tags

supported_formats

supports

Returns what kind of data representation this operator supports

property cont_names
fit(col_selector: merlin.dag.selector.ColumnSelector, ddf: dask.dataframe.core.DataFrame)[source]

Calculate statistics for this operator, and return a dask future to these statistics, which will be computed by the workflow.

fit_finalize(dask_stats)[source]

Finalize statistics calculation - the workflow calls this function with the computed statistics from the 'fit' object.

transform(col_selector: merlin.dag.selector.ColumnSelector, df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Transform the dataframe by applying this operator to the set of input columns

Parameters
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • df (DataFrame) – A pandas or cudf dataframe that this operator will work on

Returns

Returns a transformed dataframe or dictarray for this operator

Return type

Transformable
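
In practice fit(), fit_finalize(), and transform() are driven by the Workflow rather than called directly. A hedged sketch, reusing the groupby_features node from the example above with placeholder training and validation dataframes:

import nvtabular

workflow = nvtabular.Workflow(groupby_features)
workflow.fit(nvtabular.Dataset(train_df))      # runs fit() and fit_finalize()
valid_out = workflow.transform(nvtabular.Dataset(valid_df)).to_ddf().compute()  # runs transform()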

property dependencies
compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]
column_mapping(col_selector)[source]
set_storage_path(new_path, copy=False)[source]
clear()[source]