nvtabular.workflow.workflow.Workflow
- class nvtabular.workflow.workflow.Workflow(output_node: WorkflowNode, client: Optional[distributed.Client] = None)
Bases: object
The Workflow class applies a graph of operations to a dataset, letting you transform datasets for feature engineering and preprocessing. The class follows an API similar to Transformers in sklearn: first fit the workflow by calculating statistics on the dataset, then transform datasets by applying these statistics.
Example usage:
    import merlin.io
    import nvtabular

    # CAT_COLUMNS, CONT_COLUMNS and the *_PATH names are placeholders defined by the caller

    # define a graph of operations
    cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
    cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
    workflow = nvtabular.Workflow(cat_features + cont_features + "label")

    # calculate statistics on the training dataset
    workflow.fit(merlin.io.Dataset(TRAIN_PATH))

    # transform the training and validation datasets and write out as parquet
    workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
    workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
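The fit-then-transform pattern described above can be illustrated without NVTabular. The following is a minimal sketch in plain Python: `ToyNormalize` is a hypothetical stand-in, not an NVTabular operator, and it only shows that statistics are computed once on the training data and then reused unchanged on other data.

```python
from statistics import mean, pstdev


class ToyNormalize:
    """Hypothetical stand-in for a fitted operator; not part of NVTabular."""

    def fit(self, values):
        # Statistics come only from the data passed to fit (the training split).
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics; nothing is recomputed here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [2.0, 4.0, 6.0, 8.0]
valid = [5.0, 11.0]

op = ToyNormalize().fit(train)
train_out = op.transform(train)
valid_out = op.transform(valid)  # uses the training mean and std
```

Because the validation values are scaled with the training statistics, both splits end up on the same scale, which is the reason for fitting on the training data only.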
- Parameters:
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
- __init__(output_node: WorkflowNode, client: Optional[distributed.Client] = None)
Methods
__init__(output_node[, client]) – Removes calculated statistics from each node in the workflow graph
fit(dataset) – Calculates statistics for this workflow on the input dataset
fit_schema(input_schema) – Fits the schema onto the workflow, computing the Schema for each node in the Workflow Graph
fit_transform(dataset) – Convenience method to both fit the workflow and transform the dataset in a single call.
load(path[, client]) – Load up a saved workflow object from disk
remove_inputs(input_cols) – Removes input columns from the workflow.
save(path) – Save this workflow to disk
transform(dataset) – Transforms the dataset by applying the graph of operators to it.
Attributes
- transform(dataset: Dataset) → Dataset
Transforms the dataset by applying the graph of operators to it. Requires the fit method to have already been called, or calculated statistics to be loaded from disk.
This method returns a Dataset object with the transformations applied lazily. None of the actual computation happens until the produced Dataset is consumed or written out to disk.
- Parameters:
dataset (Dataset) – Input dataset to transform
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
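The lazy behaviour described for transform can be pictured with a plain Python generator. This is only an analogy under the assumption that "lazy" means work is deferred until the result is consumed. It does not show how Dataset is implemented, and `lazy_transform` is a hypothetical helper.

```python
def lazy_transform(rows, fn, log):
    """Return a generator; fn runs only when the result is iterated."""
    for row in rows:
        log.append(row)  # records the moment work actually happens
        yield fn(row)


calls = []
pending = lazy_transform([1, 2, 3], lambda x: x * 10, calls)

calls_before = len(calls)  # nothing has been computed yet
result = list(pending)     # consuming the generator triggers the work
calls_after = len(calls)
```

In the same spirit, calling transform alone is cheap. The cost is paid when the returned Dataset is consumed or written out, for example with to_parquet as in the class example.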
- fit_schema(input_schema: Schema)
Fits the schema onto the workflow, computing the Schema for each node in the Workflow Graph
- Parameters:
input_schema (Schema) – The input schema to use
- Returns:
This workflow where each node in the graph has a fitted schema
- Return type:
Workflow
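As a rough illustration of what "computing the Schema for each node" means, the sketch below propagates a simple column-to-dtype mapping through a linear chain of operations. Everything here is hypothetical: real schemas are Schema objects rather than dicts, the workflow is a graph rather than a list, and `propagate_schema`, `categorify` and `drop_label` are names invented for this example.

```python
def propagate_schema(ops, input_schema):
    """Compute the schema after each op in a linear chain.

    Returns a list whose first entry is the input schema and whose
    last entry is the output schema.
    """
    schemas = [dict(input_schema)]
    for op in ops:
        schemas.append(op(schemas[-1]))
    return schemas


def categorify(schema):
    # Hypothetical op: string columns become integer codes.
    return {col: ("int64" if dtype == "str" else dtype) for col, dtype in schema.items()}


def drop_label(schema):
    # Hypothetical op: removes the label column from the schema.
    return {col: dtype for col, dtype in schema.items() if col != "label"}


schemas = propagate_schema(
    [categorify, drop_label],
    {"city": "str", "price": "float32", "label": "int8"},
)
output_schema = schemas[-1]
```

Note that no data is touched: only column names and dtypes flow through the chain, which is why a schema can be fitted before any statistics are calculated.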
- property input_dtypes
- property input_schema
- property output_schema
- property output_dtypes
- property output_node
- remove_inputs(input_cols) → Workflow
Removes input columns from the workflow.
This is useful for the case of inference where you might need to remove label columns from the processed set.
- Parameters:
input_cols – The input columns to remove from the workflow
- Returns:
This workflow with the input columns removed from it
- Return type:
Workflow
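The inference use case mentioned above can be sketched with a toy pipeline. `ToyWorkflow` below is a hypothetical stand-in that maps each output column to one input column and a function. It is not how NVTabular represents its graph, but it shows why removing the label input lets rows without a label be processed.

```python
class ToyWorkflow:
    """Hypothetical stand-in: output column -> (input column, function)."""

    def __init__(self, steps):
        self.steps = dict(steps)

    @property
    def input_columns(self):
        return sorted({src for src, _ in self.steps.values()})

    def remove_inputs(self, input_cols):
        # Drop every step that depends on one of the removed input columns.
        drop = set(input_cols)
        self.steps = {
            out: (src, fn) for out, (src, fn) in self.steps.items() if src not in drop
        }
        return self

    def transform(self, row):
        return {out: fn(row[src]) for out, (src, fn) in self.steps.items()}


workflow = ToyWorkflow(
    {
        "price_scaled": ("price", lambda v: v / 100.0),
        "label": ("label", int),
    }
)

request = {"price": 250.0}  # an inference request carries no label
workflow.remove_inputs(["label"])
served = workflow.transform(request)
```

Without the call to `remove_inputs`, the toy `transform` would raise a `KeyError` for the missing label column.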
- fit(dataset: Dataset) → Workflow
Calculates statistics for this workflow on the input dataset
- Parameters:
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns:
This Workflow with statistics calculated on it
- Return type:
Workflow
- fit_transform(dataset: Dataset) → Dataset
Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).
- Parameters:
dataset (Dataset) – Input dataset to calculate statistics on and then transform
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
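The stated equivalence between fit_transform and fit followed by transform can be checked on a toy operator. `ToyCategorify` is a hypothetical stand-in that assigns integer codes to categories. Its encoding scheme, with code 0 reserved for unseen values, is an assumption made for this sketch and not a description of NVTabular's Categorify.

```python
class ToyCategorify:
    """Hypothetical stand-in that maps categories to integer codes."""

    def fit(self, values):
        # Codes start at 1; 0 is reserved for values not seen during fit.
        self.codes_ = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return [self.codes_.get(v, 0) for v in values]

    def fit_transform(self, values):
        # One call that does exactly fit followed by transform.
        return self.fit(values).transform(values)


data = ["nyc", "sf", "nyc", "la"]

one_call = ToyCategorify().fit_transform(data)
two_calls = ToyCategorify().fit(data).transform(data)
```

Both variables hold the same encoded list, so the single call is purely a convenience.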
- save(path)
Save this workflow to disk
- Parameters:
path (str) – The path to save the workflow to
- classmethod load(path, client=None) → Workflow
Load up a saved workflow object from disk
- Parameters:
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU and multi-node processing
- Returns:
The Workflow loaded from disk
- Return type:
Workflow
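The save and load pair forms a round trip: statistics computed by fit are written to disk and can later be restored without refitting. The sketch below mimics that round trip with a hypothetical `ToyWorkflow` that stores its statistics as JSON. The file name `stats.json` and the on-disk layout are assumptions made for this example and say nothing about the format NVTabular actually writes.

```python
import json
import os
import tempfile


class ToyWorkflow:
    """Hypothetical stand-in that persists fitted statistics as JSON."""

    def __init__(self, stats=None):
        self.stats = stats or {}

    def save(self, path):
        # Treat path as a directory and write the statistics inside it.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "stats.json"), "w") as f:
            json.dump(self.stats, f)

    @classmethod
    def load(cls, path):
        # Rebuild an instance from the saved statistics.
        with open(os.path.join(path, "stats.json")) as f:
            return cls(json.load(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow")
    ToyWorkflow({"price_mean": 5.0}).save(path)
    restored = ToyWorkflow.load(path)
```

A restored workflow satisfies the requirement stated for transform, which accepts either a prior fit call or statistics loaded from disk.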