nvtabular.workflow.workflow.Workflow#
- class nvtabular.workflow.workflow.Workflow(output_node: Node, client: Optional[distributed.Client] = None)[source]#
- Bases: - object- The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations. This class follows an API similar to Transformers in sklearn: we first - fitthe workflow by calculating statistics on the dataset, and then once fit we can- transformdatasets by applying these statistics.- Example usage: - # define a graph of operations cat_features = CAT_COLUMNS >> nvtabular.ops.Categorify() cont_features = CONT_COLUMNS >> nvtabular.ops.FillMissing() >> nvtabular.ops.Normalize() workflow = nvtabular.Workflow(cat_features + cont_features + "label") # calculate statistics on the training dataset workflow.fit(merlin.io.Dataset(TRAIN_PATH)) # transform the training and validation datasets and write out as parquet workflow.transform(merlin.io.Dataset(TRAIN_PATH)).to_parquet(output_path=TRAIN_OUT_PATH) workflow.transform(merlin.io.Dataset(VALID_PATH)).to_parquet(output_path=VALID_OUT_PATH) - Parameters:
- output_node (WorkflowNode) – The last node in the graph of operators this workflow should apply 
 - __init__(output_node: Node, client: Optional[distributed.Client] = None)[source]#
 - Methods - __init__(output_node[, client])- Removes calculated statistics from each node in the workflow graph - fit(dataset)- Calculates statistics for this workflow on the input dataset - fit_schema(input_schema)- Computes input and output schemas for each node in the Workflow graph - fit_transform(dataset)- Convenience method to both fit the workflow and transform the dataset in a single call. - get_subworkflow(subgraph_name)- load(path[, client])- Load up a saved workflow object from disk - remove_inputs(input_cols)- Removes input columns from the workflow. - save(path[, modules_byvalue])- Save this workflow to disk - transform(-> ~merlin.io.dataset.Dataset)- Transforms the data by applying the graph of operators to it. - Attributes - transform(data)[source]#
- transform(dataset: Dataset) Dataset
- transform(dataframe: DataFrame) DataFrame
- Transforms the data by applying the graph of operators to it. - Requires the - fitmethod to have already been called, or using a Workflow that has already beeen fit and re-loaded from disk (using the- loadmethod).- This method returns data of the same type. - In the case of a Dataset. The computation is lazy. It won’t happen until the produced Dataset is consumed, or written out to disk. e.g. with a dataset.compute(). - Parameters:
- data (Union[Dataset, DataFrameType]) – Input Dataset or DataFrame to transform 
- Returns:
- Transformed Dataset or DataFrame with the workflow graph applied to it 
- Return type:
- Dataset or DataFrame 
- Raises:
- NotImplementedError – If passed an unsupoprted data type to transform. 
 
 - fit_schema(input_schema: Schema)[source]#
- Computes input and output schemas for each node in the Workflow graph - Parameters:
- input_schema (Schema) – The input schema to use 
- Returns:
- This workflow where each node in the graph has a fitted schema 
- Return type:
 
 - property subworkflows#
 - property input_dtypes#
 - property input_schema#
 - property output_schema#
 - property output_dtypes#
 - property output_node#
 - remove_inputs(input_cols) Workflow[source]#
- Removes input columns from the workflow. - This is useful for the case of inference where you might need to remove label columns from the processed set. - Parameters:
- Returns:
- This workflow with the input columns removed from it 
- Return type:
 - See also 
 - fit(dataset: Dataset) Workflow[source]#
- Calculates statistics for this workflow on the input dataset - Parameters:
- dataset (Dataset) – The input dataset to calculate statistics for. If there is a train/test split this data should be the training dataset only. 
- Returns:
- This Workflow with statistics calculated on it 
- Return type:
 
 - fit_transform(dataset: Dataset) Dataset[source]#
- Convenience method to both fit the workflow and transform the dataset in a single call. Equivalent to calling - workflow.fit(dataset)followed by- workflow.transform(dataset)- Parameters:
- dataset (Dataset) – Input dataset to calculate statistics on, and transform results 
- Returns:
- Transformed Dataset with the workflow graph applied to it 
- Return type:
- Dataset 
 
 - save(path: Union[str, PathLike], modules_byvalue=None)[source]#
- Save this workflow to disk - Parameters:
- path (Union[str, os.PathLike]) – The path to save the workflow to 
- modules_byvalue – - A list of modules that should be serialized by value. This should include any modules that will not be available on the host where this workflow is ultimately deserialized. - In lieu of an explicit list, pass None to serialize all modules by reference or pass “auto” to use a heuristic to infer which modules to serialize by value. 
 
 
 - classmethod load(path: Union[str, PathLike], client=None) Workflow[source]#
- Load up a saved workflow object from disk - Parameters:
- path (Union[str, os.PathLike]) – The path to load the workflow from 
- client (distributed.Client, optional) – The Dask distributed client to use for multi-gpu processing and multi-node processing 
 
- Returns:
- The Workflow loaded from disk 
- Return type: