nvtabular.workflow.workflow.Workflow#
- class nvtabular.workflow.workflow.Workflow(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Bases: object
The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first fit the workflow by calculating statistics on the dataset, and then once fit we can transform datasets by applying these statistics.
Example usage:
    # define a graph of operations
    cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
    cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
    workflow = nvtabular.Workflow(cat_features + cont_features + "label")

    # calculate statistics on the training dataset
    workflow.fit(merlin.io.Dataset(TRAIN_PATH))

    # transform the training and validation datasets and write out as parquet
    workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
    workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
- Parameters:
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing
- __init__(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Methods

__init__(output_node[, client])
clear_stats() – Removes calculated statistics from each node in the workflow graph
fit(dataset) – Calculates statistics for this workflow on the input dataset
fit_schema(input_schema) – Fits the schema onto the workflow, computing the Schema for each node in the Workflow Graph
fit_transform(dataset) – Convenience method to both fit the workflow and transform the dataset in a single call.
load(path[, client]) – Load up a saved workflow object from disk
remove_inputs(input_cols) – Removes input columns from the workflow.
save(path[, modules_byvalue]) – Save this workflow to disk
transform(dataset) – Transforms the dataset by applying the graph of operators to it.

Attributes

input_dtypes
input_schema
output_dtypes
output_node
output_schema
- transform(dataset: Dataset) Dataset [source]#
Transforms the dataset by applying the graph of operators to it. Requires the fit method to have already been called, or calculated statistics to be loaded from disk.
This method returns a Dataset object, with the transformations lazily loaded. None of the actual computation will happen until the produced Dataset is consumed, or written out to disk.
- Parameters:
dataset (Dataset) – Input dataset to transform
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
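For example, a minimal sketch (the dataset paths and the already-fitted workflow are assumptions for illustration):

    import merlin.io

    # builds the transformed dataset lazily; no computation happens yet
    transformed = workflow.transform(merlin.io.Dataset(VALID_PATH))

    # work is triggered only when the result is consumed or written out
    transformed.to_parquet(output_path=VALID_OUT_PATH)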
- fit_schema(input_schema: Schema)[source]#
Fits the schema onto the workflow, computing the Schema for each node in the Workflow Graph
- Parameters:
input_schema (Schema) – The input schema to use
- Returns:
This workflow where each node in the graph has a fitted schema
- Return type:
Workflow
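A minimal sketch of fitting schemas without computing statistics; the dataset path is an assumption, and the input schema is read from the dataset's schema attribute:

    import merlin.io

    # propagate the input schema through every node in the graph
    workflow.fit_schema(merlin.io.Dataset(TRAIN_PATH).schema)

    # the computed output schema is then available on the workflow
    print(workflow.output_schema.column_names)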
- property input_dtypes#
- property input_schema#
- property output_schema#
- property output_dtypes#
- property output_node#
- remove_inputs(input_cols) Workflow [source]#
Removes input columns from the workflow.
This is useful for the case of inference where you might need to remove label columns from the processed set.
- Parameters:
input_cols (list of str) – List of input column names to remove from the workflow
- Returns:
This workflow with the input columns removed from it
- Return type:
Workflow
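For instance, a brief sketch of preparing a fitted workflow for inference (the "label" column name is illustrative):

    # drop the target column so serving-time data doesn't need to include it
    workflow.remove_inputs(["label"])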
- fit(dataset: Dataset) Workflow [source]#
Calculates statistics for this workflow on the input dataset
- Parameters:
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns:
This Workflow with statistics calculated on it
- Return type:
Workflow
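A minimal sketch, reusing the training path from the class-level example above:

    import merlin.io

    # statistics are computed on the training split only
    workflow.fit(merlin.io.Dataset(TRAIN_PATH))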
- fit_transform(dataset: Dataset) Dataset [source]#
Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).
- Parameters:
dataset (Dataset) – Input dataset to calculate statistics on, and transform results
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
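A short sketch of the equivalence (the dataset path is an assumption):

    import merlin.io

    train = merlin.io.Dataset(TRAIN_PATH)

    # this single call...
    transformed = workflow.fit_transform(train)

    # ...does the same work as these two
    workflow.fit(train)
    transformed = workflow.transform(train)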
- save(path, modules_byvalue=None)[source]#
Save this workflow to disk
- Parameters:
path (str) – The path to save the workflow to
modules_byvalue –
A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.
In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value.
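A brief sketch (the output path is illustrative):

    # persist the fitted graph and its statistics; "auto" uses a heuristic
    # to decide which modules to serialize by value
    workflow.save("./workflow", modules_byvalue="auto")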
- classmethod load(path, client=None) Workflow [source]#
Load up a saved workflow object from disk
- Parameters:
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing
- Returns:
The Workflow loaded from disk
- Return type:
Workflow
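A minimal sketch of loading on a Dask cluster; the client setup and paths are assumptions:

    import nvtabular
    from dask.distributed import Client

    client = Client()  # connect to (or start) a local Dask cluster
    workflow = nvtabular.Workflow.load("./workflow", client=client)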