nvtabular.workflow.workflow.Workflow

class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)

Bases: object

The Workflow class applies a graph of operations to a dataset for feature engineering and preprocessing. Its API resembles that of transformers in scikit-learn: first fit the workflow by computing statistics on a dataset, then transform datasets by applying those statistics.

Example usage:

# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
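
The fit-then-transform contract shown above can be illustrated without NVTabular or a GPU. The following is a minimal pure-Python sketch, not NVTabular code: `ToyNormalize` is a hypothetical stand-in for a statistics-based operator, and the use of the population standard deviation is an assumption made for the example only.

```python
from statistics import fmean, pstdev


class ToyNormalize:
    """Hypothetical stand-in for a fitted operator; not part of NVTabular."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, values):
        # Statistics are computed once, from the training data only.
        self.mean = fmean(values)
        self.std = pstdev(values)
        return self

    def transform(self, values):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        # Fall back to 1.0 so a constant column does not divide by zero.
        scale = self.std or 1.0
        return [(v - self.mean) / scale for v in values]


train = [1.0, 2.0, 3.0, 4.0]
valid = [2.5, 5.0]

op = ToyNormalize().fit(train)      # statistics come from the training split
train_out = op.transform(train)
valid_out = op.transform(valid)     # the same statistics are reused here
```

The point of the pattern is that the validation data is scaled with the training statistics, so both splits end up in the same feature space.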

Parameters:
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
client (distributed.Client, optional) – A Dask distributed client. Defaults to None.

__init__(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)

Methods

`__init__`(output_node[, client])
`clear_stats`()	Removes calculated statistics from each node in the workflow graph
`fit`(dataset)	Calculates statistics for this workflow on the input dataset
`fit_schema`(input_schema)	Computes input and output schemas for each node in the Workflow graph
`fit_transform`(dataset)	Convenience method to both fit the workflow and transform the dataset in a single call.
`load`(path[, client])	Load up a saved workflow object from disk
`remove_inputs`(input_cols)	Removes input columns from the workflow.
`save`(path[, modules_byvalue])	Save this workflow to disk
`transform`(data)	Transforms the data by applying the graph of operators to it.

Attributes

`input_dtypes`
`input_schema`
`output_dtypes`
`output_node`
`output_schema`

transform(data)

transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset

transform(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Transforms the data by applying the graph of operators to it.

Requires that the fit method has already been called, or that the Workflow was previously fit and re-loaded from disk (using the load method).

This method returns data of the same type as its input.

For a Dataset, the computation is lazy: it does not run until the resulting Dataset is consumed or written out to disk, e.g. with dataset.compute().
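
As an illustration of what lazy computation means here, the sketch below defers work until `compute()` is called. It is a pure-Python analogy under stated assumptions, not how Dataset is implemented: `LazyResult` and `double` are hypothetical names invented for this example.

```python
class LazyResult:
    """Hypothetical container that defers work until compute() is called."""

    def __init__(self, source, func):
        # Only the recipe is stored; nothing is evaluated yet.
        self._source = source
        self._func = func

    def compute(self):
        # The transformation actually runs here.
        return [self._func(row) for row in self._source]


calls = []


def double(x):
    calls.append(x)  # record each invocation so the laziness is observable
    return 2 * x


lazy = LazyResult([1, 2, 3], double)
calls_before = len(calls)   # no work has happened yet
result = lazy.compute()
calls_after = len(calls)    # work happened during compute()
```

A practical consequence is that errors in the transformation may only surface when the result is consumed, not when transform is called.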

Parameters: data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform
Returns: Transformed Dataset or DataFrame with the workflow graph applied to it
Return type: Dataset or DataFrame
Raises: NotImplementedError – If transform is passed an unsupported data type.
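
The behavior described for transform (the output type matches the input type, and unsupported types raise NotImplementedError) can be sketched with type-based dispatch. This example uses `functools.singledispatch` on built-in containers purely for illustration; whether NVTabular uses this mechanism is an assumption, and `toy_transform` is a hypothetical function, not part of the library.

```python
from functools import singledispatch


@singledispatch
def toy_transform(data):
    # Fallback for any type without a registered handler.
    raise NotImplementedError(f"unsupported type: {type(data).__name__}")


@toy_transform.register
def _(data: list):
    # A list in gives a list out.
    return [x * 2 for x in data]


@toy_transform.register
def _(data: tuple):
    # A tuple in gives a tuple out.
    return tuple(x * 2 for x in data)


list_out = toy_transform([1, 2])
tuple_out = toy_transform((1, 2))
```

Calling `toy_transform("abc")` falls through to the default handler and raises NotImplementedError, mirroring the documented failure mode.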

fit_schema(input_schema: merlin.schema.schema.Schema)

Computes input and output schemas for each node in the Workflow graph

Parameters: input_schema (Schema) – The input schema to use
Returns: This workflow where each node in the graph has a fitted schema
Return type: Workflow

property input_dtypes

property input_schema

property output_schema

property output_dtypes

property output_node

remove_inputs(input_cols) → nvtabular.workflow.workflow.Workflow

Removes input columns from the workflow.

This is useful at inference time, when you might need to remove label columns from the processed set.

Parameters: input_cols (list of str) – List of column names to remove
Returns: This workflow with the input columns removed from it
Return type: Workflow