merlin.dag package#

class merlin.dag.BaseOperator[source]#

Bases: object

Base class for all operator classes.

compute_selector(input_schema: Schema, selector: ColumnSelector, parents_selector: ColumnSelector | None = None, dependencies_selector: ColumnSelector | None = None) ColumnSelector[source]#

Provides a hook method for sub-classes to override to implement custom column selection logic.

Parameters:
  • input_schema (Schema) – Schemas of the columns to apply this operator to

  • selector (ColumnSelector) – Column selector to apply to the input schema

  • parents_selector (ColumnSelector, optional) – Combined selectors of the upstream parents feeding into this operator, by default None

  • dependencies_selector (ColumnSelector, optional) – Combined selectors of the upstream dependencies feeding into this operator, by default None

Returns:

Revised column selector to apply to the input schema

Return type:

ColumnSelector

compute_input_schema(root_schema: Schema, parents_schema: Schema, deps_schema: Schema, selector: ColumnSelector) Schema[source]#

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

Parameters:
  • root_schema (Schema) – Base schema of the dataset before running any operators.

  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • selector (ColumnSelector) – The column selector to apply to the input schema

Returns:

The schemas of the columns used by this operator

Return type:

Schema

compute_output_schema(input_schema: Schema, col_selector: ColumnSelector, prev_output_schema: Schema | None = None) Schema[source]#

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

Parameters:
  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • col_selector (ColumnSelector) – The column selector to apply to the input schema

Returns:

The schemas of the columns produced by this operator

Return type:

Schema

validate_schemas(parents_schema: Schema, deps_schema: Schema, input_schema: Schema, output_schema: Schema, strict_dtypes: bool = False)[source]#

Provides a hook method that sub-classes can override to implement schema validation logic.

Sub-class implementations should raise an exception if the schemas are not valid for the operations they implement.

Parameters:
  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • output_schema (Schema) – The schemas of the columns produced by this operator

  • strict_dtypes (Boolean, optional) – Enables strict checking for column dtype matching if True, by default False

transform(col_selector: ColumnSelector, transformable: Transformable) Transformable[source]#

Transform the dataframe by applying this operator to the set of input columns

Parameters:
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • transformable (Transformable) – A pandas or cudf dataframe that this operator will work on

Returns:

The transformed dataframe or dictarray produced by this operator

Return type:

Transformable

column_mapping(col_selector)[source]#

Compute which output columns depend on which input columns

Parameters:

col_selector (ColumnSelector) – A selector containing a list of column names

Returns:

Mapping from output column names to list of the input columns they rely on

Return type:

Dict[str, List[str]]
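As an illustration of the return shape described above, the following minimal sketch builds a mapping for an operator that emits one renamed column per input column. The helper `suffix_column_mapping` and the column names are hypothetical and are not part of merlin.dag; the sketch uses plain lists of names rather than a ColumnSelector.

```python
from typing import Dict, List


def suffix_column_mapping(input_columns: List[str], suffix: str) -> Dict[str, List[str]]:
    """Map each output column name to the input columns it is derived from.

    Hypothetical helper: models an operator that emits one renamed column
    per input column. It is not part of merlin.dag.
    """
    return {f"{name}{suffix}": [name] for name in input_columns}


mapping = suffix_column_mapping(["price", "quantity"], "_scaled")
print(mapping)
```

Each key is an output column name and each value lists the input columns that output relies on, so an operator that combines several inputs would list more than one name per key.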

compute_column_schema(col_name, input_schema)[source]#
property dynamic_dtypes#
output_column_names(col_selector: ColumnSelector) ColumnSelector[source]#

Given a set of column names, returns the names of the transformed columns this operator will produce

Parameters:

col_selector (ColumnSelector) – The columns to apply this operator to

Returns:

The names of columns produced by this operator

Return type:

ColumnSelector

property dependencies: List[str | Any]#

Defines an optional list of column dependencies for this operator. This lets you consume columns that aren’t part of the main transformation workflow.

Returns:

Extra dependencies of this operator. Defaults to None

Return type:

str, list of str or ColumnSelector, optional

property output_dtype#
property output_tags#
property output_properties#
property label: str#
create_node(selector)[source]#
property supports: Supports#

Returns what kind of data representation this operator supports

class merlin.dag.Graph(output_node: Node, subgraphs: Dict[str, Node] | None = None)[source]#

Bases: object

Represents a DAG composed of Nodes, each of which contains an operator that transforms dataframes or dataframe-like data

subgraph(name: str) Graph[source]#
property input_dtypes#
property output_dtypes#
property column_mapping#
construct_schema(root_schema: Schema, preserve_dtypes=False) Graph[source]#

Given the schema of a dataset to transform, determine the output schema of the graph

Parameters:
  • root_schema (Schema) – The schema of a dataset to be transformed with this DAG

  • preserve_dtypes (bool, optional) – Whether to keep any dtypes that may already be present in the schemas, by default False

Returns:

This DAG after the schemas have been filled in

Return type:

Graph

property input_schema#
property leaf_nodes#
property output_schema#
remove_inputs(to_remove)[source]#

Removes columns from a Graph

Starting at the leaf nodes, traverses the graph looking for the columns to remove. When a column is found, it is removed, and the removal is then propagated to any output columns derived from it.

Parameters:
  • to_remove (array_like) – A list of input column names to remove from the graph

Returns:

The same graph with columns removed

Return type:

Graph
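The propagation step described above can be sketched with plain dictionaries. This is a simplified illustration, not the library's implementation: it assumes the graph is a linear chain of stages, each represented by a hypothetical column mapping (output name to input names), and it walks from the earliest stage to the latest.

```python
from typing import Dict, List, Set


def propagate_removal(
    stages: List[Dict[str, List[str]]], to_remove: List[str]
) -> List[Dict[str, List[str]]]:
    """Drop columns and everything derived from them, stage by stage.

    `stages` is a hypothetical stand-in for a graph: an ordered list of
    column mappings (output name -> input names), earliest stage first.
    """
    removed: Set[str] = set(to_remove)
    pruned = []
    for mapping in stages:
        kept = {}
        for output, inputs in mapping.items():
            if removed.intersection(inputs):
                # This output depends on a removed column, so it is removed too.
                removed.add(output)
            else:
                kept[output] = inputs
        pruned.append(kept)
    return pruned


stages = [
    {"a_log": ["a"], "b_log": ["b"]},
    {"a_log_norm": ["a_log"], "b_log_norm": ["b_log"]},
]
result = propagate_removal(stages, ["a"])
print(result)
```

Removing `a` also removes `a_log` and `a_log_norm`, because each is derived from a column that has already been removed.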

classmethod get_nodes_by_op_type(nodes, op_type)[source]#

class merlin.dag.Node(selector=None)[source]#

Bases: object

A Node is a group of columns that you want to apply the same transformations to. Nodes can be transformed by shifting operators onto them, which returns a new Node with the transformations applied. This lets you define a graph of operations that makes up your workflow.

Parameters:

selector (ColumnSelector) – Defines which columns to select from the input Dataset using column names and tags.
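The "shifting" pattern described above relies on Python's `>>` operator, which a class can overload through `__rshift__`. The sketch below shows the general mechanism only. `SketchNode`, `fill_missing`, and `double` are hypothetical names, and the class is a stand-in rather than the actual Node implementation.

```python
from typing import Callable, List, Optional


class SketchNode:
    """Hypothetical stand-in for a node; not the merlin.dag.Node class."""

    def __init__(self, columns: List[str], op: Optional[Callable] = None):
        self.columns = columns
        self.op = op
        self.parents: List["SketchNode"] = []

    def __rshift__(self, op: Callable) -> "SketchNode":
        # `node >> op` creates a new node and leaves the original untouched.
        child = SketchNode(self.columns, op)
        child.parents.append(self)
        return child


def fill_missing(values):
    """Hypothetical operator: replace None with 0."""
    return [0 if v is None else v for v in values]


def double(values):
    """Hypothetical operator: multiply every value by 2."""
    return [v * 2 for v in values]


root = SketchNode(["price", "quantity"])
workflow = root >> fill_missing >> double
```

Because each `>>` returns a new node whose parent is the node on the left, chaining operators builds up a graph without mutating the starting node.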

property selector#
add_dependency(dep: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#

Adds a dependency node to this node

Parameters:

dep (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Dependency to be added

add_parent(parent: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#

Adds a parent node to this node

Parameters:

parent (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Parent to be added

add_child(child: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#

Adds a child node to this node

Parameters:

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be added

remove_child(child: str | List[str] | ColumnSelector | Node | List[str | List[str] | Node | ColumnSelector])[source]#

Removes a child node from this node

Parameters:

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be removed

compute_schemas(root_schema: Schema, preserve_dtypes: bool = False)[source]#

Defines the input and output schema

Parameters:
  • root_schema (Schema) – Schema of the input dataset

  • preserve_dtypes (bool, optional) – True if we don’t want to override dtypes in the current schema, by default False

validate_schemas(root_schema: Schema, strict_dtypes: bool = False)[source]#

Check if this Node’s input schema matches the output schemas of parents and dependencies

Parameters:
  • root_schema (Schema) – Schema of the input dataset

  • strict_dtypes (bool, optional) – If an error should be raised when column dtypes don’t match, by default False

Raises:
  • ValueError – If parents and dependencies don’t provide an expected column based on the input schema

  • ValueError – If the dtype of a column from parents and dependencies doesn’t match the expected dtype based on the input schema
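The two error conditions listed above can be illustrated with a small stand-alone check. This is a sketch under simplifying assumptions: schemas are modeled as hypothetical `{column name: dtype name}` dictionaries, and `validate_input_schema` is an illustrative function, not a merlin.dag API.

```python
from typing import Dict


def validate_input_schema(
    upstream: Dict[str, str], expected: Dict[str, str], strict_dtypes: bool = False
) -> None:
    """Raise ValueError if upstream columns do not satisfy the expected schema.

    Schemas are modeled as hypothetical {column name: dtype name} dicts.
    """
    for name, dtype in expected.items():
        if name not in upstream:
            raise ValueError(f"Missing expected column '{name}'")
        if strict_dtypes and upstream[name] != dtype:
            raise ValueError(
                f"Column '{name}' has dtype {upstream[name]}, expected {dtype}"
            )


# With the default strict_dtypes=False, a dtype difference is tolerated.
validate_input_schema({"a": "int64"}, {"a": "int32"})
```

Passing `strict_dtypes=True` to the same call would raise ValueError, as would an expected column that is absent from the upstream schema.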

remove_inputs(input_cols: List[str]) List[str][source]#

Remove input columns and all output columns that depend on them.

Parameters:

input_cols (List[str]) – The input columns to remove

Returns:

The output columns that were removed

Return type:

List[str]

exportable(backend: str | None = None)[source]#
property parents_with_dependencies#
property grouped_parents_with_dependencies#
property input_columns#
property output_columns#
property column_mapping#
property dependency_columns#
property label#
property graph#
Nodable#

alias of Union[Node, str, List[str], ColumnSelector, List[Union[str, List[str], Node, ColumnSelector]]]

classmethod construct_from(nodable: Node | str | List[str] | ColumnSelector | List[str | List[str] | Node | ColumnSelector])[source]#

Convert Node-like objects to a Node or list of Nodes.

Parameters:

nodable (Nodable) – Node-like objects to convert to a Node or list of Nodes.

Returns:

New Node(s) corresponding to the Node-like input objects

Return type:

Union[Node, List[Node]]

Raises:

TypeError – If supplied input cannot be converted to a Node or list of Nodes
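A conversion of this kind is typically a chain of type checks that ends in a TypeError. The sketch below illustrates the idea with a hypothetical `NameNode` class and `to_node` function. It handles only strings, lists, and nodes, and it is an assumption about the general shape of such a conversion rather than a copy of the library's logic.

```python
from typing import List, Union


class NameNode:
    """Hypothetical stand-in for a node that selects a list of column names."""

    def __init__(self, names: List[str]):
        self.names = names


def to_node(nodable) -> Union[NameNode, List[NameNode]]:
    """Convert a string, list, or node into a node or a list of nodes."""
    if isinstance(nodable, NameNode):
        return nodable
    if isinstance(nodable, str):
        return NameNode([nodable])
    if isinstance(nodable, list):
        if all(isinstance(item, str) for item in nodable):
            # A flat list of names becomes a single node.
            return NameNode(nodable)
        # A mixed list is converted element by element.
        return [to_node(item) for item in nodable]
    raise TypeError(f"Cannot convert {type(nodable).__name__} to a node")


single = to_node("user_id")
flat = to_node(["user_id", "item_id"])
mixed = to_node(["price", ["user_id", "item_id"], single])
```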

class merlin.dag.ColumnSelector(names: str | List[str] | None = None, subgroups: List[ColumnSelector] | None = None, tags: List[str | Tags] | None = None)[source]#

Bases: object

A ColumnSelector describes a group of columns to be transformed by Operators in a Graph. Operators can be applied to the selected columns by shifting (>>) operators onto the ColumnSelector, which returns a new Node with the transformations applied. This lets you define the graph of operations that makes up your Graph.

Parameters:
  • names (list of (str or tuple of str)) – The columns to select from the input Dataset. The elements of this list are strings indicating the column names in most cases, but can also be tuples of strings for feature crosses.

  • subgroups (list of ColumnSelector objects, optional) – This provides an alternate syntax for grouping column names together (instead of nesting tuples inside the list of names)

  • tags (list of Tags) – The columns to select from the input dataset based on Tags. Any column with at least one of the provided tags will be selected.

property tags#
property names#
property grouped_names#
resolve(schema)[source]#

Takes a schema and produces a new selector containing the selected column names. How the selection occurs (by tags or by name) does not matter.
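To make the idea concrete, the following sketch resolves a selection against a schema, picking a column if it is named explicitly or carries at least one requested tag. The schema is modeled as a hypothetical `{column name: set of tags}` dictionary, and `resolve_columns` is an illustrative function, not the method documented here.

```python
from typing import Dict, List, Set


def resolve_columns(
    schema: Dict[str, Set[str]], names: List[str], tags: List[str]
) -> List[str]:
    """Return the schema columns picked by name or by at least one tag.

    `schema` is a hypothetical {column name: set of tags} mapping.
    """
    wanted_tags = set(tags)
    return [
        column
        for column, column_tags in schema.items()
        if column in names or wanted_tags & column_tags
    ]


schema = {
    "user_id": {"categorical", "user"},
    "item_id": {"categorical", "item"},
    "price": {"continuous"},
}
resolved = resolve_columns(schema, names=["price"], tags=["user"])
print(resolved)
```

The result is a plain list of column names, regardless of whether each column was matched by name or by tag.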

filter_columns(other_selector: ColumnSelector)[source]#

Narrow the content of this selector to the columns that would be selected by another

Parameters:

other_selector (ColumnSelector) – Other selector to apply as the filter

Returns:

This selector filtered by the other selector

Return type:

ColumnSelector
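The narrowing behavior can be pictured as an order-preserving intersection of two name lists. The sketch below assumes selectors reduce to flat lists of column names. `filter_names` is a hypothetical helper and does not account for tags or subgroups.

```python
from typing import List


def filter_names(names: List[str], other_names: List[str]) -> List[str]:
    """Keep only the names that the other selection would also select."""
    allowed = set(other_names)
    return [name for name in names if name in allowed]


filtered = filter_names(["a", "b", "c"], ["c", "a", "d"])
print(filtered)
```

In this sketch the order of the first list is kept, and names that appear only in the second list are ignored.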