nvtabular.workflow.workflow.Workflow

class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)[source]

Bases: object

The Workflow class applies a graph of operations onto a dataset, letting you transform datasets for feature engineering and preprocessing. This class follows an API similar to transformers in sklearn: first fit the workflow by calculating statistics on the dataset, and once fit, transform datasets by applying those statistics.

Example usage:

# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
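The fit-then-transform pattern above can be sketched in plain Python. This is a conceptual stand-in, not NVTabular code: `ToyNormalize` is a hypothetical operator that, like NVTabular's stat operators, gathers statistics in a fit phase and applies them in a transform phase.

```python
import statistics

# Hypothetical stand-in for a stat operator: fit() gathers statistics,
# transform() applies them -- the same two-phase pattern Workflow follows.
class ToyNormalize:
    def fit(self, values):
        # First pass: compute statistics on the (training) data only.
        self.mean = statistics.mean(values)
        self.stdev = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Second pass: apply the stored statistics to any dataset.
        return [(v - self.mean) / self.stdev for v in values]

op = ToyNormalize().fit([2.0, 4.0, 6.0])
print(op.transform([4.0]))  # the training mean maps to 0.0
```

Because the statistics are stored on the operator, the same fitted object can transform the training set, the validation set, or serving-time data consistently.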
Parameters
  • output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing

__init__(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)[source]

Methods

__init__(output_node[, client])

clear_stats()

Removes calculated statistics from each node in the workflow graph

fit(dataset)

Calculates statistics for this workflow on the input dataset

fit_schema(input_schema)

Computes input and output schemas for each node in the Workflow graph

fit_transform(dataset)

Convenience method to both fit the workflow and transform the dataset in a single call.

load(path[, client])

Load up a saved workflow object from disk

remove_inputs(input_cols)

Removes input columns from the workflow.

save(path[, modules_byvalue])

Save this workflow to disk

transform(data)

Transforms the data by applying the graph of operators to it.

Attributes

input_dtypes

input_schema

output_dtypes

output_node

output_schema

transform(data)[source]
transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset
transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Transforms the data by applying the graph of operators to it.

Requires the fit method to have already been called, or a Workflow that has already been fit and re-loaded from disk (using the load method).

This method returns data of the same type as its input.

In the case of a Dataset, the computation is lazy: it won’t happen until the produced Dataset is consumed or written out to disk, e.g. with a dataset.compute().
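The deferred execution can be illustrated in plain Python (an analogy, not NVTabular code) with a generator: building the pipeline records the work, and nothing runs until the result is consumed.

```python
log = []  # records when work actually happens

def lazy_double(rows):
    # Calling this builds a generator but runs nothing yet, much like
    # Workflow.transform() returning a not-yet-computed Dataset.
    for row in rows:
        log.append(row)
        yield row * 2

pending = lazy_double([1, 2, 3])
print(log)                    # [] -- nothing has executed yet
materialized = list(pending)  # analogous to calling dataset.compute()
print(materialized)           # [2, 4, 6]
```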

Parameters

data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform

Returns

Transformed Dataset or DataFrame with the workflow graph applied to it

Return type

Dataset or DataFrame

Raises

NotImplementedError – If passed an unsupported data type to transform.

fit_schema(input_schema: merlin.schema.schema.Schema)[source]

Computes input and output schemas for each node in the Workflow graph

Parameters

input_schema (Schema) – The input schema to use

Returns

This workflow where each node in the graph has a fitted schema

Return type

Workflow
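As a toy illustration (plain Python, not NVTabular internals), schema propagation can be pictured as threading a schema through each node in order, with every node recording its input and output. Here a "schema" is just a list of column names and the two nodes are hypothetical:

```python
# Toy sketch of fit_schema-style propagation: each node computes its
# output schema from its input schema, and the result is threaded along.
def propagate(input_schema, nodes):
    schemas = []
    schema = input_schema
    for compute_output in nodes:
        out = compute_output(schema)
        schemas.append((schema, out))  # each node stores (input, output)
        schema = out
    return schemas

# Two hypothetical nodes: one drops a column, one renames the rest.
drop_label = lambda s: [c for c in s if c != "label"]
to_ids = lambda s: [c + "_id" for c in s]
result = propagate(["user", "item", "label"], [drop_label, to_ids])
print(result[-1][1])  # ['user_id', 'item_id']
```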

property input_dtypes
property input_schema
property output_schema
property output_dtypes
property output_node
remove_inputs(input_cols) → nvtabular.workflow.workflow.Workflow[source]

Removes input columns from the workflow.

This is useful for the case of inference where you might need to remove label columns from the processed set.

Parameters

input_cols (list of str) – List of column names to remove from the workflow

Returns

This workflow with the input columns removed from it

Return type

Workflow

fit(dataset: merlin.io.dataset.Dataset) → nvtabular.workflow.workflow.Workflow[source]

Calculates statistics for this workflow on the input dataset

Parameters

dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split, this data should be the training dataset only.

Returns

This Workflow with statistics calculated on it

Return type

Workflow

fit_transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset[source]

Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset)

Parameters

dataset (Dataset) – Input dataset to calculate statistics on and to transform

Returns

Transformed Dataset with the workflow graph applied to it

Return type

Dataset

See also

fit, transform

save(path, modules_byvalue=None)[source]

Save this workflow to disk

Parameters
  • path (str) – The path to save the workflow to

  • modules_byvalue

    A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.

    In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value.
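The by-reference behavior is analogous to how Python's standard pickle handles module-level functions: only a lookup key (module and qualified name) is stored, not the code, so the module must be importable wherever the payload is loaded. A stdlib illustration (an analogy, not NVTabular's actual serialization code):

```python
import pickle
import json  # stands in for a module a workflow operator depends on

# pickle serializes a module-level function *by reference*: the payload
# records where to find it (module + name), not the function's code.
payload = pickle.dumps(json.loads)
print(b"json" in payload, b"loads" in payload)  # True True -- just a key

# Loading works here only because the json module is importable; on a
# host missing the module, by-value serialization would be needed.
restored = pickle.loads(payload)
print(restored('{"a": 1}'))
```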

classmethod load(path, client=None) → nvtabular.workflow.workflow.Workflow[source]

Load up a saved workflow object from disk

Parameters
  • path (str) – The path to load the workflow from

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing

Returns

The Workflow loaded from disk

Return type

Workflow

clear_stats()[source]

Removes calculated statistics from each node in the workflow graph

See also

nvtabular.ops.stat_operator.StatOperator.clear