nvtabular.workflow.workflow.Workflow
-
class
nvtabular.workflow.workflow.
Workflow
(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)
Bases: object
The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first fit the workflow by calculating statistics on the dataset, and then once fit we can transform datasets by applying these statistics.
Example usage:
    import merlin.io
    import nvtabular.ops

    # define a graph of operations
    cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
    cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
    workflow = nvtabular.Workflow(cat_features + cont_features + "label")

    # calculate statistics on the training dataset
    workflow.fit(merlin.io.Dataset(TRAIN_PATH))

    # transform the training and validation datasets and write out as parquet
    workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
    workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
- Parameters
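output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU processing and multi-node processing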
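The fit-then-transform pattern described above can be sketched in plain Python. This is only an illustration of the idea: `ToyNormalize` is a hypothetical stand-in, not an NVTabular operator, and it works on lists rather than on a Dataset.

```python
from statistics import mean, pstdev


class ToyNormalize:
    """Hypothetical stand-in for a statistics-based operator (not an NVTabular class)."""

    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, values):
        # Statistics are calculated once, from the training data only.
        self.mean = mean(values)
        self.std = pstdev(values)
        return self

    def transform(self, values):
        # Transforming reuses the stored statistics; it never recomputes them.
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        return [(v - self.mean) / self.std for v in values]


train = [1.0, 2.0, 3.0]
valid = [2.0, 4.0]

op = ToyNormalize().fit(train)
train_out = op.transform(train)
valid_out = op.transform(valid)  # scaled with the training mean and std
```

The validation values are scaled with the statistics learned from the training values, which is the same reason `fit` is meant to see the training dataset only.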
-
__init__
(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)
Methods
__init__
(output_node[, client]) Creates a workflow from the last node of an operator graph and an optional Dask client
fit
(dataset) Calculates statistics for this workflow on the input dataset
fit_schema
(input_schema) Computes input and output schemas for each node in the Workflow graph
fit_transform
(dataset) Convenience method to both fit the workflow and transform the dataset in a single call.
load
(path[, client]) Load a saved workflow object from disk
remove_inputs
(input_cols) Removes input columns from the workflow.
save
(path[, modules_byvalue]) Save this workflow to disk
transform
(data) Transforms the data by applying the graph of operators to it.
Attributes
-
transform
(data)
-
transform
(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset
-
transform
(dataframe: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Transforms the data by applying the graph of operators to it.
Requires the fit method to have already been called, or a Workflow that has already been fit and re-loaded from disk (using the load method). This method returns data of the same type as its input.
In the case of a Dataset, the computation is lazy: it won’t happen until the produced Dataset is consumed or written out to disk, e.g. with a dataset.compute().
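Lazy evaluation of this kind can be pictured with a small toy class. The sketch below is an assumption-laden illustration: `LazyColumn` and its `map` method are hypothetical and are not part of `merlin.io.Dataset`; only the `compute` name echoes the text above.

```python
class LazyColumn:
    """Hypothetical illustration of deferred execution (not the merlin.io.Dataset API)."""

    def __init__(self, values, steps=()):
        self._values = values
        self._steps = tuple(steps)

    def map(self, fn):
        # Only record the step; no work is done here.
        return LazyColumn(self._values, self._steps + (fn,))

    def compute(self):
        # All recorded steps run now, when the result is actually requested.
        out = list(self._values)
        for fn in self._steps:
            out = [fn(v) for v in out]
        return out


calls = []


def double(v):
    calls.append(v)  # track when the function really runs
    return v * 2


lazy = LazyColumn([1, 2, 3]).map(double)
calls_before_compute = len(calls)  # nothing has executed yet
result = lazy.compute()
```

Building the pipeline is cheap; the cost is paid only when the result is consumed.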
- Parameters
data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform
- Returns
Transformed Dataset or DataFrame with the workflow graph applied to it
- Return type
Dataset or DataFrame
- Raises
NotImplementedError – If an unsupported data type is passed to transform.
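A function that returns the same type it receives and rejects anything else can be sketched with `functools.singledispatch` from the standard library. This is one possible way to express such type-based dispatch, shown on lists and tuples; `toy_transform` is hypothetical and nothing here describes how the library itself implements `transform`.

```python
from functools import singledispatch


@singledispatch
def toy_transform(data):
    # Fallback for any type without a registered handler.
    raise NotImplementedError(f"unsupported data type: {type(data).__name__}")


@toy_transform.register
def _(data: list):
    # A list goes in, a list comes out.
    return [v * 2 for v in data]


@toy_transform.register
def _(data: tuple):
    # A tuple goes in, a tuple comes out.
    return tuple(v * 2 for v in data)


list_result = toy_transform([1, 2])
tuple_result = toy_transform((1, 2))
```

Calling `toy_transform("abc")` falls through to the base function and raises `NotImplementedError`.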
-
fit_schema
(input_schema: merlin.schema.schema.Schema)
Computes input and output schemas for each node in the Workflow graph
- Parameters
input_schema (Schema) – The input schema to use
- Returns
This workflow where each node in the graph has a fitted schema
- Return type
Workflow
-
property
input_dtypes
-
property
input_schema
-
property
output_schema
-
property
output_dtypes
-
property
output_node
-
remove_inputs
(input_cols) → nvtabular.workflow.workflow.Workflow
Removes input columns from the workflow.
This is useful for the case of inference where you might need to remove label columns from the processed set.
- Parameters
input_cols (list of str) – List of column names to remove
- Returns
This workflow with the input columns removed from it
- Return type
Workflow
-
fit
(dataset: merlin.io.dataset.Dataset) → nvtabular.workflow.workflow.Workflow
Calculates statistics for this workflow on the input dataset
- Parameters
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns
This Workflow with statistics calculated on it
- Return type
Workflow
-
fit_transform
(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset
Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).
- Parameters
dataset (Dataset) – Input dataset to calculate statistics on and then transform
- Returns
Transformed Dataset with the workflow graph applied to it
- Return type
Dataset
-
save
(path, modules_byvalue=None)
Save this workflow to disk
- Parameters
path (str) – The path to save the workflow to
modules_byvalue – A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized.
In lieu of an explicit list, pass None to serialize all modules by reference, or pass “auto” to use a heuristic to infer which modules to serialize by value.
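Why by-reference serialization depends on the loading host can be seen with the standard library's `pickle`. It is used here only as a stand-in to show the general behaviour; which serializer `save` uses internally is not stated above.

```python
import json
import pickle

# A function from an importable module is pickled by reference:
# the payload holds the module and attribute names, not the code itself.
payload = pickle.dumps(json.dumps)

stores_module_name = b"json" in payload
stores_function_name = b"dumps" in payload

# Unpickling imports the module again and looks the name up, so the
# module has to be importable on the host that does the loading.
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

Serializing a module by value instead embeds its code in the saved file, which is what makes the result loadable on a host where that module is not installed.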
-
classmethod
load
(path, client=None) → nvtabular.workflow.workflow.Workflow
Load a saved workflow object from disk
- Parameters
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-GPU processing and multi-node processing
- Returns
The Workflow loaded from disk
- Return type
Workflow