Workflow

class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)

The Workflow class applies a graph of operations to a dataset, letting you transform datasets for feature engineering and preprocessing. The class follows an API similar to transformers in scikit-learn: we first fit the workflow by calculating statistics on the dataset, and once fit we can transform new datasets by applying those statistics.

Example usage:

import nvtabular

# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(nvtabular.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(nvtabular.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(nvtabular.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)

Parameters
  • output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing

transform(dataset: nvtabular.io.dataset.Dataset) → nvtabular.io.dataset.Dataset

Transforms the dataset by applying the graph of operators to it. Requires the fit method to have already been called, or for calculated statistics to be loaded from disk.

This method returns a Dataset object with the transformations lazily loaded. None of the actual computation happens until the produced Dataset is consumed or written out to disk.

Parameters

dataset (Dataset) – The input dataset to transform

Returns

The transformed dataset, evaluated lazily

Return type

Dataset
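
For example, a minimal sketch of the lazy behavior (workflow is an already fitted Workflow; VALID_PATH and VALID_OUT_PATH are placeholder paths): the transform call itself does no work, and computation only runs once the result is written out or materialized.

# lazily transform; no computation happens yet
transformed = workflow.transform(nvtabular.io.Dataset(VALID_PATH))

# computation happens here, while writing the result to parquet
transformed.to_parquet(output_path=VALID_OUT_PATH)

# alternatively, materialize the result as a dask dataframe
ddf = workflow.transform(nvtabular.io.Dataset(VALID_PATH)).to_ddf()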

fit_schema(input_schema: nvtabular.columns.schema.Schema) → nvtabular.workflow.workflow.Workflow

Computes the output schema for this workflow by propagating the given input schema through the graph of operators, without calculating statistics on any data

fit(dataset: nvtabular.io.dataset.Dataset) → nvtabular.io.dataset.Dataset

Calculates statistics for this workflow on the input dataset

Parameters

dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
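
Since fit returns the workflow itself, fitting can be chained onto construction. A minimal sketch, reusing the graph from the example above (TRAIN_PATH is a placeholder path):

# fit returns the Workflow itself, so construction and fitting can be chained
workflow = nvtabular.Workflow(cat_features + cont_features + "label").fit(
    nvtabular.io.Dataset(TRAIN_PATH)
)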

fit_transform(dataset: nvtabular.io.dataset.Dataset) → nvtabular.io.dataset.Dataset

Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).

Parameters

dataset (Dataset) – The input dataset to fit the workflow on and then transform

Returns

The transformed dataset

Return type

Dataset
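
A minimal sketch (TRAIN_PATH and TRAIN_OUT_PATH are placeholder paths), fitting and transforming the training data in a single call:

# equivalent to workflow.fit(train_ds) followed by workflow.transform(train_ds)
train_ds = nvtabular.io.Dataset(TRAIN_PATH)
workflow.fit_transform(train_ds).to_parquet(output_path=TRAIN_OUT_PATH)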

save(path)

Save this workflow to disk

Parameters

path (str) – The path to save the workflow to
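
A minimal sketch (the directory path is arbitrary); saving persists the graph and any fitted statistics so the workflow can be reloaded later:

# persist the fitted workflow to a directory on disk
workflow.save("./saved_workflow")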

classmethod load(path, client=None)

Load a saved workflow object from disk

Parameters
  • path (str) – The path to load the workflow from

  • client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing

Returns

The loaded workflow

Return type

Workflow
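
A minimal sketch restoring the workflow saved above and applying it to new data (VALID_PATH and VALID_OUT_PATH are placeholder paths):

# reload the fitted workflow and transform a new dataset with it
workflow = nvtabular.Workflow.load("./saved_workflow")
workflow.transform(nvtabular.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)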

clear_stats()

Removes calculated statistics from each node in the workflow graph