merlin.dag package#
- class merlin.dag.BaseOperator[source]#
Bases: object
Base class for all operator classes.
- compute_selector(input_schema: Schema, selector: ColumnSelector, parents_selector: ColumnSelector | None = None, dependencies_selector: ColumnSelector | None = None) ColumnSelector [source]#
Provides a hook method for sub-classes to override to implement custom column selection logic.
- Parameters:
input_schema (Schema) – Schemas of the columns to apply this operator to
selector (ColumnSelector) – Column selector to apply to the input schema
parents_selector (ColumnSelector) – Combined selectors of the upstream parents feeding into this operator
dependencies_selector (ColumnSelector) – Combined selectors of the upstream dependencies feeding into this operator
- Returns:
Revised column selector to apply to the input schema
- Return type:
ColumnSelector
- compute_input_schema(root_schema: Schema, parents_schema: Schema, deps_schema: Schema, selector: ColumnSelector) Schema [source]#
Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use
- Parameters:
root_schema (Schema) – Base schema of the dataset before running any operators.
parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator
deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator
selector (ColumnSelector) – The column selector to apply to the input schema
- Returns:
The schemas of the columns used by this operator
- Return type:
Schema
- compute_output_schema(input_schema: Schema, col_selector: ColumnSelector, prev_output_schema: Schema | None = None) Schema [source]#
Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce
- Parameters:
input_schema (Schema) – The schemas of the columns to apply this operator to
col_selector (ColumnSelector) – The column selector to apply to the input schema
- Returns:
The schemas of the columns produced by this operator
- Return type:
Schema
- validate_schemas(parents_schema: Schema, deps_schema: Schema, input_schema: Schema, output_schema: Schema, strict_dtypes: bool = False)[source]#
Provides a hook method that sub-classes can override to implement schema validation logic.
Sub-class implementations should raise an exception if the schemas are not valid for the operations they implement.
- Parameters:
parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator
deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator
input_schema (Schema) – The schemas of the columns to apply this operator to
output_schema (Schema) – The schemas of the columns produced by this operator
strict_dtypes (bool, optional) – Enables strict checking for column dtype matching if True; defaults to False
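As a rough illustration of the "raise an exception if the schemas are not valid" contract, the sketch below models schemas as plain dictionaries (column name to dtype string) rather than merlin Schema objects. The ToyFloatOnlyOperator class and the float32 rule are hypothetical and are not part of merlin.

```python
class ToyFloatOnlyOperator:
    """Hypothetical operator that only accepts float32 input columns."""

    def validate_schemas(self, parents_schema, deps_schema, input_schema,
                         output_schema, strict_dtypes=False):
        # Assumption: schemas are plain dicts of {column_name: dtype_string}.
        if not strict_dtypes:
            return  # dtype checking is only enabled when strict_dtypes is True
        for name, dtype in input_schema.items():
            if dtype != "float32":
                raise ValueError(
                    f"Column '{name}' has dtype {dtype}, expected float32"
                )


op = ToyFloatOnlyOperator()
schema = {"price": "float32", "user_id": "int64"}

# Passes silently because strict_dtypes defaults to False.
op.validate_schemas({}, {}, schema, schema)

# With strict checking enabled, the int64 column is rejected.
try:
    op.validate_schemas({}, {}, schema, schema, strict_dtypes=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```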
- transform(col_selector: ColumnSelector, transformable: Transformable) Transformable [source]#
Transform the dataframe by applying this operator to the set of input columns
- Parameters:
col_selector (ColumnSelector) – The columns to apply this operator to
transformable (Transformable) – A pandas or cudf dataframe that this operator will work on
- Returns:
The transformed dataframe or dictarray produced by this operator
- Return type:
Transformable
- column_mapping(col_selector)[source]#
Compute which output columns depend on which input columns
- Parameters:
col_selector (ColumnSelector) – A selector containing a list of column names
- Returns:
Mapping from output column names to the list of input columns they rely on
- Return type:
Dict[str, List[str]]
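To make the shape of such a mapping concrete, here is a minimal sketch for a hypothetical operator that renames every column by appending a suffix. The function name toy_column_mapping and the "_scaled" suffix are illustrative assumptions, not merlin behaviour.

```python
def toy_column_mapping(column_names, suffix="_scaled"):
    """Hypothetical mapping for an operator that appends a suffix to each column."""
    # Each output column depends on exactly one input column in this example.
    return {f"{name}{suffix}": [name] for name in column_names}


mapping = toy_column_mapping(["price", "quantity"])
```

An operator that combines several columns into one output would instead list all of the contributing input names under that single output key.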
- property dynamic_dtypes#
- output_column_names(col_selector: ColumnSelector) ColumnSelector [source]#
Given a set of column names, returns the names of the transformed columns this operator will produce
- property dependencies: List[str | Any]#
Defines an optional list of column dependencies for this operator. This lets you consume columns that aren’t part of the main transformation workflow.
- Returns:
Extra dependencies of this operator. Defaults to None
- Return type:
str, list of str or ColumnSelector, optional
- property output_dtype#
- property output_tags#
- property output_properties#
- property supports: Supports#
Returns what kind of data representation this operator supports
- class merlin.dag.Graph(output_node: Node, subgraphs: Dict[str, Node] | None = None)[source]#
Bases: object
Represents a DAG composed of Nodes, each of which contains an operator that transforms dataframes or dataframe-like data
- property input_dtypes#
- property output_dtypes#
- property column_mapping#
- construct_schema(root_schema: Schema, preserve_dtypes=False) Graph [source]#
Given the schema of a dataset to transform, determine the output schema of the graph
- property input_schema#
- property leaf_nodes#
- property output_schema#
- class merlin.dag.Node(selector=None)[source]#
Bases: object
A Node is a group of columns that you want to apply the same transformations to. Nodes can be transformed by shifting (>>) operators onto them, which returns a new Node with the transformations applied. This lets you define the graph of operations that makes up your workflow.
- Parameters:
selector (ColumnSelector) – Defines which columns to select from the input Dataset using column names and tags.
- property selector#
- add_dependency(dep: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#
Add a dependency node to this node
- Parameters:
dep (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Dependency to be added
- add_parent(parent: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#
Add a parent node to this node
- Parameters:
parent (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Parent to be added
- add_child(child: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#
Add a child node to this node
- Parameters:
child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be added
- remove_child(child: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#
Remove a child node from this node
- Parameters:
child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be removed
- compute_schemas(root_schema: Schema, preserve_dtypes: bool = False)[source]#
Defines the input and output schema
- validate_schemas(root_schema: Schema, strict_dtypes: bool = False)[source]#
Check if this Node’s input schema matches the output schemas of parents and dependencies
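- Parameters:
root_schema (Schema) – Base schema of the dataset before running any operators
strict_dtypes (bool, optional) – Enables strict checking for column dtype matching if True; defaults to False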
- Raises:
ValueError – If parents and dependencies don’t provide an expected column based on the input schema
ValueError – If the dtype of a column from parents and dependencies doesn’t match the expected dtype based on the input schema
- remove_inputs(input_cols: List[str]) List[str] [source]#
Remove input columns and all output columns that depend on them.
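The dependency-driven removal described here can be sketched with a plain dictionary that maps output columns to the inputs they rely on. The helper toy_remove_inputs is hypothetical, and returning the list of removed output names is an assumption made for the example, not a statement about what the merlin method returns.

```python
from typing import Dict, List


def toy_remove_inputs(column_mapping: Dict[str, List[str]],
                      input_cols: List[str]) -> List[str]:
    """Drop every output that depends on a removed input; return the dropped names."""
    removed_outputs = [
        out for out, inputs in column_mapping.items()
        if any(col in input_cols for col in inputs)
    ]
    # Mutate the mapping in place so later lookups no longer see those outputs.
    for out in removed_outputs:
        del column_mapping[out]
    return removed_outputs


mapping = {
    "price_scaled": ["price"],
    "price_x_qty": ["price", "quantity"],
    "qty_log": ["quantity"],
}
removed = toy_remove_inputs(mapping, ["price"])
```

Note that "price_x_qty" is dropped even though only one of its two inputs was removed, since an output cannot be computed with a missing input.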
- property parents_with_dependencies#
- property grouped_parents_with_dependencies#
- property input_columns#
- property output_columns#
- property column_mapping#
- property dependency_columns#
- property label#
- property graph#
- Nodable#
alias of Union[Node, str, List[str], ColumnSelector, List[Union[str, List[str], Node, ColumnSelector]]]
- classmethod construct_from(nodable: Node | str | List[str] | ColumnSelector | List[str | List[str] | Node | ColumnSelector])[source]#
Convert Node-like objects to a Node or list of Nodes.
- Parameters:
nodable (Nodable) – Node-like objects to convert to a Node or list of Nodes.
- Returns:
New Node(s) corresponding to the Node-like input objects
- Return type:
Union[Node, List[Node]]
- Raises:
TypeError – If supplied input cannot be converted to a Node or list of Nodes
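The conversion rules suggested by the Nodable alias can be pictured with a small stand-alone sketch. ToyNode and toy_construct_from are hypothetical stand-ins written for this example; the branching below is an assumption about how such a conversion could work, not a copy of merlin's implementation.

```python
from typing import List, Union


class ToyNode:
    """Hypothetical stand-in for Node; it only stores column names."""

    def __init__(self, names: List[str]):
        self.names = names


def toy_construct_from(nodable) -> Union[ToyNode, List[ToyNode]]:
    if isinstance(nodable, ToyNode):
        return nodable  # already a node, nothing to convert
    if isinstance(nodable, str):
        return ToyNode([nodable])
    if isinstance(nodable, list):
        if all(isinstance(item, str) for item in nodable):
            return ToyNode(nodable)  # a flat list of names becomes one node
        # A mixed list is converted element by element into a list of nodes.
        return [toy_construct_from(item) for item in nodable]
    raise TypeError(f"Cannot convert {type(nodable).__name__} to a node")


nodes = toy_construct_from(["a", ["b", "c"]])
```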
- class merlin.dag.ColumnSelector(names: str | List[str] | None = None, subgroups: List[ColumnSelector] | None = None, tags: List[str | Tags] | None = None)[source]#
Bases: object
A ColumnSelector describes a group of columns to be transformed by Operators in a Graph. Operators can be applied to the selected columns by shifting (>>) operators onto the ColumnSelector, which returns a new Node with the transformations applied. This lets you define the graph of operations that makes up your workflow.
- Parameters:
names (list of (str or tuple of str)) – The columns to select from the input Dataset. The elements of this list are strings indicating the column names in most cases, but can also be tuples of strings for feature crosses.
subgroups (list of ColumnSelector objects, optional) – This provides an alternate syntax for grouping column names together (instead of nesting tuples inside the list of names)
tags (list of Tags) – The columns to select from the input dataset based on Tags. Any column with at least one of the provided tags will be selected.
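The shifting (>>) behaviour described for ColumnSelector can be illustrated with Python's __rshift__ hook. The classes ToySelector and ToyNode below are hypothetical, and operators are represented by plain strings purely for illustration; this is a sketch of the chaining idea, not merlin's code.

```python
class ToyNode:
    """Hypothetical node: a set of columns plus the operators applied so far."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = list(ops)

    def __rshift__(self, op):
        # Shifting an operator onto a node returns a NEW node;
        # the original node is left untouched.
        return ToyNode(self.columns, self.ops + [op])


class ToySelector:
    """Hypothetical selector that only tracks column names."""

    def __init__(self, names):
        self.names = list(names)

    def __rshift__(self, op):
        # The first shift turns the selector into a node carrying one operator.
        return ToyNode(self.names, [op])


node = ToySelector(["price", "quantity"]) >> "fill_missing" >> "normalize"
```

Because every shift returns a fresh object, intermediate nodes can be reused as the starting point for several branches of the same graph.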
- property tags#
- property names#
- property grouped_names#
- resolve(schema)[source]#
Takes a schema and produces a new selector containing the selected column names. How the selection occurs (by tags or by name) does not matter.
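A minimal sketch of this idea, assuming a schema modelled as a dictionary from column name to a set of tag strings: the hypothetical toy_resolve helper returns plain column names regardless of whether each column matched by name or by tag.

```python
def toy_resolve(schema, names=(), tags=()):
    """schema: {column_name: set_of_tags}. Returns the names selected by name or tag."""
    wanted_tags = set(tags)
    selected = []
    for column, column_tags in schema.items():
        # A column is kept if it is named explicitly or shares at least one tag.
        if column in names or column_tags & wanted_tags:
            selected.append(column)
    return selected


schema = {
    "user_id": {"categorical", "user"},
    "price": {"continuous"},
    "clicked": {"target"},
}
resolved = toy_resolve(schema, names=["price"], tags=["user"])
```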
- filter_columns(other_selector: ColumnSelector)[source]#
Narrow the content of this selector to the columns that would be selected by another
- Parameters:
other_selector (ColumnSelector) – Other selector to apply as the filter
- Returns:
This selector filtered by the other selector
- Return type:
ColumnSelector
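Narrowing one selection by another amounts to an order-preserving intersection. The sketch below works on plain lists of names; toy_filter_columns is a hypothetical helper, and handling of tags or subgroups is deliberately left out.

```python
def toy_filter_columns(names, other_names):
    """Keep only the names the other selection would also pick, preserving order."""
    allowed = set(other_names)
    return [name for name in names if name in allowed]


filtered = toy_filter_columns(
    ["user_id", "price", "quantity"],
    ["price", "quantity", "clicked"],
)
```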