Workflow

class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)

Bases: object

The Workflow class applies a graph of operations to a dataset, letting you transform datasets for feature engineering and preprocessing. Its API is similar to that of Transformers in sklearn: you first fit the workflow by calculating statistics on a dataset, and once it is fit you can transform datasets by applying those statistics.

Example usage:

import merlin.io
import nvtabular

# CAT_COLUMNS, CONT_COLUMNS and the *_PATH names are placeholders defined by the caller

# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)

Parameters

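  • output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing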

transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset

Transforms the dataset by applying the graph of operators to it. Requires that the fit method has already been called, or that previously calculated statistics have been loaded from disk.

This method returns a Dataset object whose transformations are evaluated lazily: no computation happens until the returned Dataset is consumed or written out to disk.

Parameters

dataset (Dataset) – Input dataset to transform

Returns

Transformed Dataset with the workflow graph applied to it

Return type

Dataset
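
The lazy behaviour described above can be illustrated with a small stand-in written in plain Python. This is only a sketch of the idea, not NVTabular's implementation: LazyDataset, its transform and compute methods, and the double helper are hypothetical names chosen for the illustration.

```python
class LazyDataset:
    """Hypothetical stand-in for a dataset whose transformations are deferred."""

    def __init__(self, rows, pending=()):
        self._rows = rows
        self._pending = tuple(pending)

    def transform(self, func):
        # Record the operation; nothing is computed yet.
        return LazyDataset(self._rows, self._pending + (func,))

    def compute(self):
        # Consuming the dataset is what finally runs the recorded operations.
        rows = list(self._rows)
        for func in self._pending:
            rows = [func(row) for row in rows]
        return rows


calls = []


def double(value):
    calls.append(value)  # lets us observe when the work actually happens
    return value * 2


lazy = LazyDataset([1, 2, 3]).transform(double)
work_done_before_consume = len(calls)  # only the plan has been recorded so far
result = lazy.compute()                # the recorded operations run here
```

The practical consequence is that errors in an operator, and most of the run time, tend to surface at the point where the result is consumed or written, not at the transform call itself.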

fit_schema(input_schema: merlin.schema.schema.Schema)

Fits the workflow to the input schema, computing the Schema for each node in the workflow graph.

Parameters

input_schema (Schema) – The input schema to use

Returns

This workflow, with a fitted schema on each node in the graph

Return type

Workflow
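
One possible use of fit_schema is to inspect which columns a workflow would produce without computing any statistics. The helper below is a sketch under assumptions: output_columns_for is a hypothetical name, and it assumes the Schema exposed by output_schema has a column_names attribute.

```python
def output_columns_for(workflow, input_schema):
    """Return the output column names the workflow would produce.

    Assumes `workflow` is an nvtabular Workflow and `input_schema`
    is a merlin Schema describing the incoming data.
    """
    # fit_schema is documented to return the workflow itself.
    fitted = workflow.fit_schema(input_schema)
    # Assumption: the resulting Schema exposes `column_names`.
    return list(fitted.output_schema.column_names)
```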

property input_dtypes
property input_schema
property output_schema
property output_dtypes
property output_node

remove_inputs(input_cols) → nvtabular.workflow.workflow.Workflow

Removes input columns from the workflow.

This is useful at inference time, when you might need to remove label columns from the set of columns being processed.

Parameters

input_cols (list of str) – List of column names to remove

Returns

This workflow with the input columns removed from it

Return type

Workflow
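
A minimal sketch of the inference case mentioned above. strip_labels_for_inference is a hypothetical helper name, and the label column name in the usage note is a placeholder. Whether remove_inputs modifies the workflow in place is not stated here, so the sketch uses the returned workflow.

```python
def strip_labels_for_inference(workflow, label_columns):
    """Return a workflow that no longer expects the label columns as input."""
    # Assumption: the workflow was built with the label columns present,
    # and the data seen at inference time does not contain them.
    return workflow.remove_inputs(list(label_columns))
```

For the workflow in the class-level example, this might be called as strip_labels_for_inference(workflow, ["label"]).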

fit(dataset: merlin.io.dataset.Dataset) → nvtabular.workflow.workflow.Workflow

Calculates statistics for this workflow on the input dataset.

Parameters

dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.

Returns

This Workflow with statistics calculated on it

Return type

Workflow
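
The reason for fitting on the training dataset only can be shown with a plain-Python stand-in for a normalization step. This is an illustration of the fit-then-apply idea, not NVTabular code: fit_normalizer and apply_normalizer are hypothetical helpers, and the numbers are made up for the example.

```python
import statistics


def fit_normalizer(values):
    """Compute the statistics (mean, std) that a later transform will reuse."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def apply_normalizer(stats, values):
    """Apply previously computed statistics without recomputing them."""
    return [(v - stats["mean"]) / stats["std"] for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
valid = [2.0, 6.0]

stats = fit_normalizer(train)               # statistics come from training data only
train_out = apply_normalizer(stats, train)
valid_out = apply_normalizer(stats, valid)  # validation reuses the training statistics
```

Because the validation values are scaled with the training mean and standard deviation, no information from the validation split leaks into the statistics.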

fit_transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset

Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).

Parameters

dataset (Dataset) – Input dataset to calculate statistics on and then transform

Returns

Transformed Dataset with the workflow graph applied to it

Return type

Dataset

See also

fit, transform
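
The equivalence stated above can be written out as two small helpers. Both function names are hypothetical, and each assumes it is given an nvtabular Workflow and a merlin.io Dataset.

```python
def fit_then_transform(workflow, dataset):
    """Two-step form that the description calls equivalent to fit_transform."""
    workflow.fit(dataset)
    return workflow.transform(dataset)


def fit_transform_once(workflow, dataset):
    """Single-call form using the convenience method."""
    return workflow.fit_transform(dataset)
```

The single-call form suits the training split. For validation or test data, calling transform alone reuses the statistics already calculated during fitting.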

save(path)

Save this workflow to disk.

Parameters

path (str) – The path to save the workflow to

classmethod load(path, client=None) → nvtabular.workflow.workflow.Workflow

Load a saved workflow object from disk.

Parameters
  • path (str) – The path to load the workflow from

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing

Returns

The Workflow loaded from disk

Return type

Workflow
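
A sketch of a save and load round trip, for example to fit a workflow in one process and apply it in another. save_and_reload is a hypothetical helper and path is whatever location the caller chooses. Because load is a classmethod, the sketch calls it on the workflow's own class.

```python
def save_and_reload(workflow, path, client=None):
    """Persist a workflow, then load it back as a new object."""
    workflow.save(path)
    # `load` is a classmethod, so it can be called on the workflow's class.
    return type(workflow).load(path, client=client)
```

With a real workflow, type(workflow).load(path) is the same call as nvtabular.Workflow.load(path).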

clear_stats()

Removes calculated statistics from each node in the workflow graph.
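
One situation where this might be used is refitting an existing workflow on a different dataset. The helper below is a sketch: refit_on is a hypothetical name, and whether fit already discards earlier statistics is not stated here, so the explicit clear_stats call is a conservative assumption.

```python
def refit_on(workflow, new_dataset):
    """Discard previously calculated statistics, then fit on a new dataset."""
    workflow.clear_stats()
    # fit is documented to return the workflow with statistics calculated.
    return workflow.fit(new_dataset)
```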