nvtabular.workflow.workflow.Workflow#
- class nvtabular.workflow.workflow.Workflow(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Bases: object
The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first `fit` the workflow by calculating statistics on the dataset, and then once fit we can `transform` datasets by applying these statistics.

Example usage:
```python
# define a graph of operations
cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
workflow = nvtabular.Workflow(cat_features + cont_features + "label")

# calculate statistics on the training dataset
workflow.fit(merlin.io.Dataset(TRAIN_PATH))

# transform the training and validation datasets and write out as parquet
workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
```
- Parameters:
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
- __init__(output_node: WorkflowNode, client: Optional[distributed.Client] = None)[source]#
Methods

- `__init__(output_node[, client])`
- `clear_stats()` – Removes calculated statistics from each node in the workflow graph
- `fit(dataset)` – Calculates statistics for this workflow on the input dataset
- `fit_schema(input_schema)` – Computes input and output schemas for each node in the Workflow graph
- `fit_transform(dataset)` – Convenience method to both fit the workflow and transform the dataset in a single call.
- `load(path[, client])` – Load up a saved workflow object from disk
- `remove_inputs(input_cols)` – Removes input columns from the workflow.
- `save(path[, modules_byvalue])` – Save this workflow to disk
- `transform(data)` – Transforms the data by applying the graph of operators to it.
Attributes

- `input_dtypes`
- `input_schema`
- `output_schema`
- `output_dtypes`
- `output_node`
- transform(data)[source]#
- transform(dataset: Dataset) → Dataset
- transform(dataframe: DataFrame) → DataFrame
Transforms the data by applying the graph of operators to it.
Requires the `fit` method to have already been called, or a Workflow that has already been fit and re-loaded from disk (using the `load` method).

This method returns data of the same type as its input. In the case of a Dataset, the computation is lazy: it won't happen until the produced Dataset is consumed or written out to disk, e.g. with `dataset.compute()`.
- Parameters:
data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform
- Returns:
Transformed Dataset or DataFrame with the workflow graph applied to it
- Return type:
Dataset or DataFrame
- Raises:
NotImplementedError – If passed an unsupported data type to transform.
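For instance, a minimal sketch of transforming with a fitted workflow; `workflow` is assumed to be an already-fitted Workflow and the path constants are placeholders like those in the class-level example:

```python
import merlin.io

# build a lazy, transformed Dataset; no computation happens yet
valid = merlin.io.Dataset(VALID_PATH)
transformed = workflow.transform(valid)

# computation runs when the result is consumed...
df = transformed.compute()

# ...or when it is written out to disk
transformed.to_parquet(output_path=VALID_OUT_PATH)
```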
- fit_schema(input_schema: Schema)[source]#
Computes input and output schemas for each node in the Workflow graph
- Parameters:
input_schema (Schema) – The input schema to use
- Returns:
This workflow where each node in the graph has a fitted schema
- Return type:
Workflow
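As an illustrative sketch (the column names and the operator graph below are hypothetical), `fit_schema` propagates schemas through the graph without computing statistics on any data:

```python
import nvtabular as nvt
from merlin.schema import ColumnSchema, Schema

# hypothetical input schema with three columns
input_schema = Schema([ColumnSchema("user_id"), ColumnSchema("item_id"), ColumnSchema("price")])

workflow = nvt.Workflow(["user_id", "item_id"] >> nvt.ops.Categorify())
workflow.fit_schema(input_schema)

# every node in the graph now has input/output schemas attached
print(workflow.output_schema.column_names)
```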
- property input_dtypes#
- property input_schema#
- property output_schema#
- property output_dtypes#
- property output_node#
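Once a workflow has been fit (or `fit_schema` has been called), these properties describe its input/output contract. Continuing the sketch above:

```python
# inspect what the fitted workflow consumes and produces
print(workflow.input_schema.column_names)   # expected input columns
print(workflow.output_schema.column_names)  # produced output columns
print(workflow.output_dtypes)               # output column name -> dtype
```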
- remove_inputs(input_cols) → Workflow [source]#
Removes input columns from the workflow.
This is useful for the case of inference where you might need to remove label columns from the processed set.
- Parameters:
input_cols (list of str) – The input column names to remove from the workflow
- Returns:
This workflow with the input columns removed from it
- Return type:
Workflow
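For example, to reuse a workflow at inference time without the training label (a sketch; the "label" column name is a placeholder):

```python
# drop the label column that was present during fit;
# returns this same workflow with the input removed
inference_workflow = workflow.remove_inputs(["label"])
```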
- fit(dataset: Dataset) → Workflow [source]#
Calculates statistics for this workflow on the input dataset
- Parameters:
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns:
This Workflow with statistics calculated on it
- Return type:
Workflow
- fit_transform(dataset: Dataset) → Dataset [source]#
Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling `workflow.fit(dataset)` followed by `workflow.transform(dataset)`
- Parameters:
dataset (Dataset) – Input dataset to calculate statistics on, and transform results
- Returns:
Transformed Dataset with the workflow graph applied to it
- Return type:
Dataset
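A minimal sketch, using the same placeholder paths as the class-level example:

```python
# fit statistics and transform the training data in one call
transformed = workflow.fit_transform(merlin.io.Dataset(TRAIN_PATH))
transformed.to_parquet(output_path=TRAIN_OUT_PATH)
```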
- save(path, modules_byvalue=None)[source]#
Save this workflow to disk
- Parameters:
path (str) – The path to save the workflow to
modules_byvalue –
A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.
In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value.
- classmethod load(path, client=None) → Workflow [source]#
Load up a saved workflow object from disk
- Parameters:
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing
- Returns:
The Workflow loaded from disk
- Return type:
Workflow
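A round-trip sketch combining `save` and `load` (the directory path is a placeholder):

```python
# persist the fitted workflow to a directory on disk
workflow.save("./saved_workflow")

# later, possibly in a different process, restore and reuse it
restored = nvtabular.Workflow.load("./saved_workflow")
restored.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
```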