Workflow

class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)

Bases: object

The Workflow class applies a graph of operations to a dataset, letting you perform feature engineering and preprocessing. Its API resembles that of Transformers in sklearn: you first fit the workflow by calculating statistics on a dataset, and once it is fit you can transform datasets by applying those statistics.

Example usage:

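import merlin.io
import nvtabular

# CAT_COLUMNS, CONT_COLUMNS and the *_PATH names are placeholders supplied by the caller
# define a graph of operations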
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
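
The fit-then-transform split used in the example above can be illustrated with a plain-Python toy. This is only a sketch of the pattern, not NVTabular code: the class `ToyNormalize` is hypothetical, and its use of the population standard deviation is an assumption made for the illustration.

```python
from statistics import mean, pstdev


class ToyNormalize:
    """Hypothetical stand-in for a statistics-based operator (not part of NVTabular)."""

    def fit(self, values):
        # Fit step: compute and store statistics from the training data only.
        self.mean = mean(values)
        self.std = pstdev(values)  # population std, chosen for this toy
        return self

    def transform(self, values):
        # Transform step: apply the stored statistics to any dataset.
        return [(v - self.mean) / self.std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
valid = [3.0, 6.0]

op = ToyNormalize().fit(train)      # statistics come from the training data
train_out = op.transform(train)
valid_out = op.transform(valid)     # validation data reuses the training statistics
```

The point of the split is that the validation data is scaled with the statistics learned from the training data, rather than with statistics of its own.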

Parameters: output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply

transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset

Transforms the dataset by applying the graph of operators to it. Requires that the fit method has already been called, or that previously calculated statistics have been loaded from disk.

This method returns a Dataset object with the transformations applied lazily. None of the actual computation happens until the returned Dataset is consumed or written out to disk.

Parameters: dataset (Dataset) – Input dataset to transform
Returns: Transformed Dataset with the workflow graph applied to it
Return type: Dataset
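
The deferred execution described for transform can be pictured with a small plain-Python toy. This is a sketch of the general idea only: `LazyDataset`, `map` and `compute` are hypothetical names for the illustration and say nothing about how merlin.io.Dataset is implemented.

```python
class LazyDataset:
    """Hypothetical toy container that records steps instead of running them."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def map(self, fn):
        # Record the step and return a new lazy dataset; nothing is computed here.
        return LazyDataset(self.rows, self.steps + (fn,))

    def compute(self):
        # Consuming the dataset is what finally runs the recorded steps.
        out = []
        for row in self.rows:
            for fn in self.steps:
                row = fn(row)
            out.append(row)
        return out


calls = []


def double(x):
    calls.append(x)  # track when the function is really executed
    return 2 * x


lazy = LazyDataset([1, 2, 3]).map(double)
calls_before = len(calls)   # still zero: the step was only recorded
result = lazy.compute()     # the work happens here
calls_after = len(calls)
```

In the Workflow example at the top of this page, the call to to_parquet plays the role of the consuming step.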

fit_schema(input_schema: merlin.schema.schema.Schema)

Fits the schema onto the workflow, computing the Schema for each node in the workflow graph.

Parameters: input_schema (Schema) – The input schema to use
Returns: This workflow where each node in the graph has a fitted schema
Return type: Workflow

property input_dtypes

property input_schema

property output_schema

property output_dtypes

property output_node

remove_inputs(input_cols) → nvtabular.workflow.workflow.Workflow

Removes input columns from the workflow.

This is useful for inference, where you may need to remove label columns from the processed set.

Parameters: input_cols (list of str) – List of column names to remove
Returns: This workflow with the input columns removed from it
Return type: Workflow