nvtabular.workflow.workflow.Workflow#

class nvtabular.workflow.workflow.Workflow(output_node: Node, client: Optional[distributed.Client] = None)[source]#

Bases: object

The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first fit the workflow by calculating statistics on the dataset, and then once fit we can transform datasets by applying these statistics.

Example usage:

import merlin.io
import nvtabular

# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
Parameters:

  • output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing

__init__(output_node: Node, client: Optional[distributed.Client] = None)[source]#

Methods

__init__(output_node[, client])

clear_stats()

Removes calculated statistics from each node in the workflow graph

fit(dataset)

Calculates statistics for this workflow on the input dataset

fit_schema(input_schema)

Computes input and output schemas for each node in the Workflow graph

fit_transform(dataset)

Convenience method to both fit the workflow and transform the dataset in a single call.

get_subworkflow(subgraph_name)

load(path[, client])

Load up a saved workflow object from disk

remove_inputs(input_cols)

Removes input columns from the workflow.

save(path[, modules_byvalue])

Save this workflow to disk

transform(data)

Transforms the data by applying the graph of operators to it.

Attributes

input_dtypes

input_schema

output_dtypes

output_node

output_schema

subworkflows

transform(data)[source]#
transform(dataset: Dataset) Dataset
transform(dataframe: DataFrame) DataFrame

Transforms the data by applying the graph of operators to it.

Requires the fit method to have already been called, or the Workflow to have been previously fit and re-loaded from disk (using the load method).

This method returns data of the same type as its input.

In the case of a Dataset, the computation is lazy: it does not happen until the produced Dataset is consumed or written out to disk, e.g. with dataset.compute().

Parameters:

data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform

Returns:

Transformed Dataset or DataFrame with the workflow graph applied to it

Return type:

Dataset or DataFrame

Raises:

NotImplementedError – If passed an unsupported data type to transform.
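
A minimal sketch of both call forms, assuming a workflow that has already been fit and placeholder paths (VALID_PATH, VALID_OUT_PATH):

import pandas as pd
import merlin.io

# Dataset in, Dataset out: lazy until consumed or written to disk
transformed = workflow.transform(merlin.io.Dataset(VALID_PATH))
transformed.to_parquet(output_path=VALID_OUT_PATH)  # triggers the computation

# DataFrame in, DataFrame out: the same graph applied eagerly
raw_df = pd.read_parquet(VALID_PATH)
out_df = workflow.transform(raw_df)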

fit_schema(input_schema: Schema)[source]#

Computes input and output schemas for each node in the Workflow graph

Parameters:

input_schema (Schema) – The input schema to use

Returns:

This workflow where each node in the graph has a fitted schema

Return type:

Workflow
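
fit_schema only propagates column schemas through the graph; no data statistics are computed. A minimal sketch, assuming the placeholder column lists from the class-level example:

from merlin.schema import ColumnSchema, Schema

# Build an input schema by hand rather than reading it from a dataset
input_schema = Schema([ColumnSchema(name) for name in CAT_COLUMNS + CONT_COLUMNS + ["label"]])
workflow.fit_schema(input_schema)
print(workflow.output_schema.column_names)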

property subworkflows#
property input_dtypes#
property input_schema#
property output_schema#
property output_dtypes#
property output_node#
get_subworkflow(subgraph_name)[source]#
remove_inputs(input_cols) Workflow[source]#

Removes input columns from the workflow.

This is useful for the case of inference where you might need to remove label columns from the processed set.

Parameters:

input_cols (list of str) – List of column names to remove from the workflow

Returns:

This workflow with the input columns removed from it

Return type:

Workflow
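
For example, at serving time the label column is usually absent from incoming requests; a small sketch:

# Drop the label so the workflow matches inference-time inputs,
# which will not include the target column.
inference_workflow = workflow.remove_inputs(["label"])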

fit(dataset: Dataset) Workflow[source]#

Calculates statistics for this workflow on the input dataset

Parameters:

dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.

Returns:

This Workflow with statistics calculated on it

Return type:

Workflow
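
A minimal sketch, assuming a placeholder path for the training split:

import merlin.io

# Fit on the training split only; the statistics learned here are
# reused when transforming the validation and test splits.
workflow.fit(merlin.io.Dataset(TRAIN_PATH))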

fit_transform(dataset: Dataset) Dataset[source]#

Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset)

Parameters:

dataset (Dataset) – Input dataset to calculate statistics on, and transform results

Returns:

Transformed Dataset with the workflow graph applied to it

Return type:

Dataset

See also

fit, transform
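
A minimal sketch, equivalent to the two-step fit/transform shown in the class-level example (paths are placeholders):

import merlin.io

train = merlin.io.Dataset(TRAIN_PATH)

# One call fits the statistics and returns the transformed dataset
transformed_train = workflow.fit_transform(train)
transformed_train.to_parquet(output_path=TRAIN_OUT_PATH)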

save(path: Union[str, PathLike], modules_byvalue=None)[source]#

Save this workflow to disk

Parameters:
  • path (Union[str, os.PathLike]) – The path to save the workflow to

  • modules_byvalue

    A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.

    In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value.
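
A short sketch using a hypothetical output path; “auto” is a convenient default when custom operator code may not be importable where the workflow is later loaded:

# Embed heuristically-chosen modules by value, so deserialization
# does not require them to be installed on the target host.
workflow.save("/tmp/my_workflow", modules_byvalue="auto")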

classmethod load(path: Union[str, PathLike], client=None) Workflow[source]#

Load up a saved workflow object from disk

Parameters:
  • path (Union[str, os.PathLike]) – The path to load the workflow from

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing

Returns:

The Workflow loaded from disk

Return type:

Workflow
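
A matching sketch for the save example above (same hypothetical path); the restored workflow keeps its fitted statistics, so it can transform new data without refitting:

import merlin.io
from nvtabular import Workflow

restored = Workflow.load("/tmp/my_workflow")
out = restored.transform(merlin.io.Dataset(VALID_PATH))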

clear_stats()[source]#

Removes calculated statistics from each node in the workflow graph

See also

nvtabular.ops.stat_operator.StatOperator.clear
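
A minimal sketch: clearing statistics before refitting the same graph on fresh data (NEW_TRAIN_PATH is a placeholder):

import merlin.io

# Drop previously fitted statistics, then refit from scratch
workflow.clear_stats()
workflow.fit(merlin.io.Dataset(NEW_TRAIN_PATH))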