nvtabular.workflow.workflow.Workflow#
- class nvtabular.workflow.workflow.Workflow(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Bases: object
The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first `fit` the workflow by calculating statistics on the dataset, and then once fit we can `transform` datasets by applying these statistics.

Example usage:
```python
# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
```
- Parameters:
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
- __init__(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Methods

- `__init__(output_node[, client])`
- `clear_stats()` – Removes calculated statistics from each node in the workflow graph
- `fit(dataset)` – Calculates statistics for this workflow on the input dataset
- `fit_schema(input_schema)` – Computes input and output schemas for each node in the Workflow graph
- `fit_transform(dataset)` – Convenience method to both fit the workflow and transform the dataset in a single call.
- `load(path[, client])` – Load up a saved workflow object from disk
- `remove_inputs(input_cols)` – Removes input columns from the workflow.
- `save(path[, modules_byvalue])` – Save this workflow to disk
- `transform(data)` – Transforms the data by applying the graph of operators to it.
Attributes

- `input_dtypes`
- `input_schema`
- `output_schema`
- `output_dtypes`
- `output_node`
- transform(data)[source]#
- transform(dataset: Dataset) → Dataset
- transform(dataframe: DataFrame) → DataFrame
Transforms the data by applying the graph of operators to it.
Requires the `fit` method to have already been called, or a Workflow that has already been fit and re-loaded from disk (using the `load` method).

This method returns data of the same type as its input. In the case of a Dataset, the computation is lazy: it won't happen until the produced Dataset is consumed or written out to disk, e.g. with `dataset.compute()`.
- Parameters:
data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform
- Returns:
Transformed Dataset or DataFrame with the workflow graph applied to it
- Return type:
Dataset or DataFrame
- Raises:
NotImplementedError – If passed an unsupported data type to transform.
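For instance, a minimal sketch of transforming with a fitted workflow; `workflow` is assumed to be an already-fitted Workflow and the path constants are placeholders like those in the class-level example:

```python
import merlin.io

# build a lazy, transformed Dataset; no computation happens yet
valid = merlin.io.Dataset(VALID_PATH)
transformed = workflow.transform(valid)

# computation runs when the result is consumed...
df = transformed.compute()

# ...or when it is written out to disk
transformed.to_parquet(output_path=VALID_OUT_PATH)
```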
- fit_schema(input_schema: Schema)[source]#
Computes input and output schemas for each node in the Workflow graph
- Parameters:
input_schema (Schema) – The input schema to use
- Returns:
This workflow where each node in the graph has a fitted schema
- Return type:
Workflow
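As an illustrative sketch (the column names and the operator graph below are hypothetical), `fit_schema` propagates schemas through the graph without computing statistics on any data:

```python
import nvtabular as nvt
from merlin.schema import ColumnSchema, Schema

# hypothetical input schema with three columns
input_schema = Schema([ColumnSchema("user_id"), ColumnSchema("item_id"), ColumnSchema("price")])

workflow = nvt.Workflow(["user_id", "item_id"] >> nvt.ops.Categorify())
workflow.fit_schema(input_schema)

# every node in the graph now has input/output schemas attached
print(workflow.output_schema.column_names)
```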
- property input_dtypes#
- property input_schema#
- property output_schema#
- property output_dtypes#
- property output_node#
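Once a workflow has been fit (or `fit_schema` has been called), these properties describe its input/output contract. Continuing the sketch above:

```python
# inspect what the fitted workflow consumes and produces
print(workflow.input_schema.column_names)   # expected input columns
print(workflow.output_schema.column_names)  # produced output columns
print(workflow.output_dtypes)               # output column name -> dtype
```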
- remove_inputs(input_cols) → Workflow [source]#
Removes input columns from the workflow.
This is useful for the case of inference where you might need to remove label columns from the processed set.
- Parameters:
input_cols (list of str) – The input column names to remove from the workflow
- Returns:
This workflow with the input columns removed from it
- Return type:
Workflow
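For example, to reuse a workflow at inference time without the training label (a sketch; the "label" column name is a placeholder):

```python
# drop the label column that was present during fit;
# returns this same workflow with the input removed
inference_workflow = workflow.remove_inputs(["label"])
```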
- fit(dataset: Dataset) → Workflow [source]#
Calculates statistics for this workflow on the input dataset
- Parameters:
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns:
This Workflow with statistics calculated on it
- Return type:
Workflow
- fit_transform(dataset: Dataset) → Dataset [source]#
Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling `workflow.fit(dataset)` followed by `workflow.transform(dataset)`
- Parameters:
dataset (Dataset) – Input dataset to calculate statistics on, and transform results
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
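A minimal sketch, using the same placeholder paths as the class-level example:

```python
# fit statistics and transform the training data in one call
transformed = workflow.fit_transform(merlin.io.Dataset(TRAIN_PATH))
transformed.to_parquet(output_path=TRAIN_OUT_PATH)
```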
- save(path, modules_byvalue=None)[source]#
Save this workflow to disk
- Parameters:
path (str) – The path to save the workflow to
modules_byvalue –
A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.
In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value.
- classmethod load(path, client=None) → Workflow [source]#
Load up a saved workflow object from disk
- Parameters:
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing
- Returns:
The Workflow loaded from disk
- Return type:
Workflow
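A round-trip sketch combining `save` and `load` (the directory path is a placeholder):

```python
# persist the fitted workflow to a directory on disk
workflow.save("./saved_workflow")

# later, possibly in a different process, restore and reuse it
restored = nvtabular.Workflow.load("./saved_workflow")
restored.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
```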