Workflow
--------
class nvtabular.workflow.workflow.Workflow(output_node: nvtabular.workflow.node.WorkflowNode, client: Optional[distributed.Client] = None)[source]

    The Workflow class applies a graph of operations to a dataset, letting you transform datasets for feature engineering and preprocessing. This class follows an API similar to Transformers in sklearn: we first fit the workflow by calculating statistics on the dataset, and then once fit we can transform datasets by applying those statistics.

    Example usage:

        # define a graph of operations
        cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify()
        cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize()
        workflow = nvtabular.Workflow(cat_features + cont_features + "label")

        # calculate statistics on the training dataset
        workflow.fit(merlin.io.Dataset(TRAIN_PATH))

        # transform the training and validation datasets and write out as parquet
        workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH)
        workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH)
- Parameters
output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply
-
transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset[source]

    Transforms the dataset by applying the graph of operators to it. Requires the fit method to have already been called, or calculated statistics to be loaded from disk.

    This method returns a Dataset object with the transformations lazily loaded. None of the actual computation happens until the produced Dataset is consumed or written out to disk.
- Parameters
dataset (Dataset) – Input dataset to transform
- Returns
Transformed Dataset with the workflow graph applied to it
- Return type
Dataset
-
fit_schema(input_schema: merlin.schema.schema.Schema)[source]

    Fits the schema onto the workflow, computing the Schema for each node in the Workflow Graph.
- Parameters
input_schema (Schema) – The input schema to use
- Returns
This workflow where each node in the graph has a fitted schema
- Return type
Workflow
-
property input_dtypes
-
property input_schema
-
property output_schema
-
property output_dtypes
-
property output_node
-
remove_inputs(input_cols) → nvtabular.workflow.workflow.Workflow[source]

    Removes input columns from the workflow. This is useful for inference, where you might need to remove label columns from the processed set.

- Parameters
input_cols (list of str) – List of column names to remove from the workflow
- Returns
This workflow with the input columns removed from it
- Return type
Workflow
-
fit(dataset: merlin.io.dataset.Dataset) → nvtabular.workflow.workflow.Workflow[source]

    Calculates statistics for this workflow on the input dataset.
- Parameters
dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only.
- Returns
This Workflow with statistics calculated on it
- Return type
Workflow
-
fit_transform(dataset: merlin.io.dataset.Dataset) → merlin.io.dataset.Dataset[source]

    Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling workflow.fit(dataset) followed by workflow.transform(dataset).
- Parameters
dataset (Dataset) – Input dataset to calculate statistics on, and transform results
- Returns
Transformed Dataset with the workflow graph applied to it
- Return type
Dataset
-
save(path)[source]

    Save this workflow to disk.
- Parameters
path (str) – The path to save the workflow to
-
classmethod load(path, client=None) → nvtabular.workflow.workflow.Workflow[source]

    Load a saved workflow object from disk.
- Parameters
path (str) – The path to load the workflow from
client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing
- Returns
The Workflow loaded from disk
- Return type
Workflow