merlin.dag package#

class merlin.dag.BaseOperator[source]#

Bases: object

Base class for all operator classes.

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: Optional[merlin.dag.selector.ColumnSelector] = None, dependencies_selector: Optional[merlin.dag.selector.ColumnSelector] = None) merlin.dag.selector.ColumnSelector[source]#

Provides a hook method for sub-classes to override to implement custom column selection logic.

Parameters
  • input_schema (Schema) – Schemas of the columns to apply this operator to

  • selector (ColumnSelector) – Column selector to apply to the input schema

  • parents_selector (ColumnSelector) – Combined selectors of the upstream parents feeding into this operator

  • dependencies_selector (ColumnSelector) – Combined selectors of the upstream dependencies feeding into this operator

Returns

Revised column selector to apply to the input schema

Return type

ColumnSelector

compute_input_schema(root_schema: merlin.schema.schema.Schema, parents_schema: merlin.schema.schema.Schema, deps_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector) merlin.schema.schema.Schema[source]#

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

Parameters
  • root_schema (Schema) – Base schema of the dataset before running any operators.

  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • selector (ColumnSelector) – The column selector to apply to the input schema

Returns

The schemas of the columns used by this operator

Return type

Schema

compute_output_schema(input_schema: merlin.schema.schema.Schema, col_selector: merlin.dag.selector.ColumnSelector, prev_output_schema: Optional[merlin.schema.schema.Schema] = None) merlin.schema.schema.Schema[source]#

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

Parameters
  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • col_selector (ColumnSelector) – The column selector to apply to the input schema

Returns

The schemas of the columns produced by this operator

Return type

Schema

validate_schemas(parents_schema: merlin.schema.schema.Schema, deps_schema: merlin.schema.schema.Schema, input_schema: merlin.schema.schema.Schema, output_schema: merlin.schema.schema.Schema, strict_dtypes: bool = False)[source]#

Provides a hook method that sub-classes can override to implement schema validation logic.

Sub-class implementations should raise an exception if the schemas are not valid for the operations they implement.

Parameters
  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • output_schema (Schema) – The schemas of the columns produced by this operator

  • strict_dtypes (Boolean, optional) – Enables strict checking for column dtype matching if True, by default False

transform(col_selector: merlin.dag.selector.ColumnSelector, transformable: merlin.core.protocols.Transformable) merlin.core.protocols.Transformable[source]#

Transform the dataframe by applying this operator to the set of input columns

Parameters
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • transformable (Transformable) – A pandas or cudf dataframe that this operator will work on

Returns

The transformed dataframe or dictarray produced by this operator

Return type

Transformable

column_mapping(col_selector)[source]#

Compute which output columns depend on which input columns

Parameters

col_selector (ColumnSelector) – A selector containing a list of column names

Returns

Mapping from output column names to list of the input columns they rely on

Return type

Dict[str, List[str]]
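
As an illustration of the returned shape only, the following minimal sketch builds such a mapping for a hypothetical operator that appends a suffix to each input column name. The function name `suffix_column_mapping` and the `_scaled` suffix are assumptions made for this example and are not part of merlin.dag.

```python
from typing import Dict, List


def suffix_column_mapping(input_names: List[str], suffix: str = "_scaled") -> Dict[str, List[str]]:
    """Map each output column name to the list of input columns it is derived from."""
    # Hypothetical one-to-one operator: every output depends on exactly one input.
    return {f"{name}{suffix}": [name] for name in input_names}


mapping = suffix_column_mapping(["price", "quantity"])
print(mapping)
```

An operator that combines several inputs into one output would list all of those inputs under the single output key.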

compute_column_schema(col_name, input_schema)[source]#
property dynamic_dtypes#
output_column_names(col_selector: merlin.dag.selector.ColumnSelector) merlin.dag.selector.ColumnSelector[source]#

Given a set of column names, returns the names of the transformed columns this operator will produce

Parameters

col_selector (ColumnSelector) – The columns to apply this operator to

Returns

The names of columns produced by this operator

Return type

ColumnSelector

property dependencies: List[Union[str, Any]]#

Defines an optional list of column dependencies for this operator. This lets you consume columns that aren’t part of the main transformation workflow.

Returns

Extra dependencies of this operator. Defaults to None

Return type

str, list of str or ColumnSelector, optional

property output_dtype#
property output_tags#
property output_properties#
property label: str#
create_node(selector)[source]#
property supports: merlin.dag.base_operator.Supports#

Returns what kind of data representation this operator supports

class merlin.dag.Graph(output_node: merlin.dag.node.Node, subgraphs: Optional[Dict[str, merlin.dag.node.Node]] = None)[source]#

Bases: object

subgraph(name: str) merlin.dag.graph.Graph[source]#
property input_dtypes#
property output_dtypes#
property column_mapping#
construct_schema(root_schema: merlin.schema.schema.Schema, preserve_dtypes=False) merlin.dag.graph.Graph[source]#
property input_schema#
property leaf_nodes#
property output_schema#
remove_inputs(to_remove)[source]#

Removes columns from a Graph

Starting at the leaf nodes, walks through the graph looking for the columns to remove. When a column is found it is removed, and the removal is then propagated to any output columns derived from it.

Parameters
  • graph (Graph) – The graph to remove columns from

  • to_remove (array_like) – A list of input column names to remove from the graph

Returns

The same graph with columns removed

Return type

Graph
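
The propagation of a removal can be sketched in plain Python, independently of the actual Graph implementation. In this sketch each node is reduced to a hypothetical column mapping (output name to input names), the stages are ordered from the input side toward the output side, and each stage is assumed to read only the outputs of the stage before it. `propagate_removal` is an illustrative helper, not a merlin.dag function.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a node: output column name -> input column names.
ColumnMapping = Dict[str, List[str]]


def propagate_removal(stages: List[ColumnMapping], to_remove: List[str]) -> List[ColumnMapping]:
    """Drop every output column that depends, directly or transitively, on a removed input."""
    removed: Set[str] = set(to_remove)
    pruned: List[ColumnMapping] = []
    for mapping in stages:
        kept: ColumnMapping = {}
        dropped: Set[str] = set()
        for out_col, in_cols in mapping.items():
            if removed.intersection(in_cols):
                dropped.add(out_col)
            else:
                kept[out_col] = in_cols
        pruned.append(kept)
        # Assumption: the next stage reads only this stage's outputs.
        removed = dropped
    return pruned


stages = [
    {"a_log": ["a"], "b_log": ["b"]},
    {"a_x_b": ["a_log", "b_log"], "b_norm": ["b_log"]},
]
pruned_stages = propagate_removal(stages, ["a"])
print(pruned_stages)
```

Removing `a` drops `a_log` in the first stage, which in turn drops `a_x_b` in the second stage, while columns derived only from `b` survive.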

classmethod get_nodes_by_op_type(nodes, op_type)[source]#
class merlin.dag.Node(selector=None)[source]#

Bases: object

A Node is a group of columns that you want to apply the same transformations to. Nodes can be transformed by shifting operators on to them, which returns a new Node with the transformations applied. This lets you define a graph of operations that makes up your workflow.

Parameters

selector (ColumnSelector) – Defines which columns to select from the input Dataset using column names and tags.
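
The shifting idiom relies on Python's `__rshift__` operator overload. The toy class below is not the merlin.dag Node: `ToyNode`, its attributes, and the two example functions are hypothetical stand-ins that only show how `>>` can return a new node while leaving the original untouched.

```python
from typing import Callable, List, Optional


class ToyNode:
    """Hypothetical stand-in for a node; not the merlin.dag implementation."""

    def __init__(self, columns: List[str], op: Optional[Callable] = None,
                 parent: Optional["ToyNode"] = None):
        self.columns = columns
        self.op = op
        self.parent = parent

    def __rshift__(self, op: Callable) -> "ToyNode":
        # Return a new node that records the operator and links back to this one.
        return ToyNode(self.columns, op=op, parent=self)

    def ops(self) -> List[str]:
        """Names of the operators applied so far, in application order."""
        chain = [] if self.parent is None else self.parent.ops()
        if self.op is not None:
            chain.append(self.op.__name__)
        return chain


def fill_missing(values):
    return [0 if v is None else v for v in values]


def double(values):
    return [v * 2 for v in values]


base = ToyNode(["price"])
workflow = base >> fill_missing >> double
print(workflow.ops())
```

Because each `>>` builds a new object, `base` can be reused as the starting point for several independent branches.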

property selector#
add_dependency(dep: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]#

Adds a dependency node to this node

Parameters

dep (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Dependency to be added

add_parent(parent: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]#

Adds a parent node to this node

Parameters

parent (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Parent to be added

add_child(child: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]#

Adds a child node to this node

Parameters

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be added

remove_child(child: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]#

Removes a child node from this node

Parameters

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be removed

compute_schemas(root_schema: merlin.schema.schema.Schema, preserve_dtypes: bool = False)[source]#

Defines the input and output schema

Parameters
  • root_schema (Schema) – Schema of the input dataset

  • preserve_dtypes (bool, optional) – True if we don’t want to override dtypes in the current schema, by default False

validate_schemas(root_schema: merlin.schema.schema.Schema, strict_dtypes: bool = False)[source]#

Check if this Node’s input schema matches the output schemas of parents and dependencies

Parameters
  • root_schema (Schema) – Schema of the input dataset

  • strict_dtypes (bool, optional) – If an error should be raised when column dtypes don’t match, by default False

Raises
  • ValueError – If parents and dependencies don’t provide an expected column based on the input schema

  • ValueError – If the dtype of a column from parents and dependencies doesn’t match the expected dtype based on the input schema
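
The two failure modes can be illustrated with a small self-contained check. This is a sketch under simplifying assumptions: schemas are modelled as plain dictionaries from column name to dtype string, and `check_input_columns` is a hypothetical helper rather than the method documented here.

```python
from typing import Dict


def check_input_columns(upstream: Dict[str, str], expected: Dict[str, str],
                        strict_dtypes: bool = False) -> None:
    """Raise ValueError if an expected column is missing or, when strict, has another dtype."""
    for name, dtype in expected.items():
        if name not in upstream:
            raise ValueError(f"Missing column '{name}' in the upstream schemas")
        if strict_dtypes and upstream[name] != dtype:
            raise ValueError(
                f"Column '{name}' has dtype {upstream[name]}, expected {dtype}"
            )


upstream_schema = {"user_id": "int32", "price": "float64"}
expected_schema = {"user_id": "int64"}

# Passes: the column exists and dtypes are not compared by default.
check_input_columns(upstream_schema, expected_schema)

try:
    check_input_columns(upstream_schema, expected_schema, strict_dtypes=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
print(strict_error)
```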

remove_inputs(input_cols: List[str]) List[str][source]#

Remove input columns and all output columns that depend on them.

Parameters

input_cols (List[str]) – The input columns to remove

Returns

The output columns that were removed

Return type

List[str]

exportable(backend: Optional[str] = None)[source]#
property parents_with_dependencies#
property grouped_parents_with_dependencies#
property input_columns#
property output_columns#
property column_mapping#
property dependency_columns#
property label#
property graph#
Nodable#

alias of Union[Node, str, List[str], merlin.dag.selector.ColumnSelector, List[Union[Node, str, List[str], merlin.dag.selector.ColumnSelector]]]

classmethod construct_from(nodable: Union[merlin.dag.node.Node, str, List[str], merlin.dag.selector.ColumnSelector, List[Union[merlin.dag.node.Node, str, List[str], merlin.dag.selector.ColumnSelector]]])[source]#

Convert Node-like objects to a Node or list of Nodes.

Parameters

nodable (Nodable) – Node-like objects to convert to a Node or list of Nodes.

Returns

New Node(s) corresponding to the Node-like input objects

Return type

Union[Node, List[Node]]

Raises

TypeError – If supplied input cannot be converted to a Node or list of Nodes
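
A conversion of this kind can be sketched as a type dispatch. The rules below (a string becomes a single-column node, a list of strings becomes one node, any other list is converted element by element) are assumptions chosen for the example, and `SketchNode` and `to_node` are hypothetical names, not the library's implementation.

```python
from typing import List, Union


class SketchNode:
    """Hypothetical stand-in for a node that only stores column names."""

    def __init__(self, names: List[str]):
        self.names = list(names)


def to_node(nodable) -> Union[SketchNode, List[SketchNode]]:
    """Convert a node-like object to a SketchNode or a list of SketchNodes."""
    if isinstance(nodable, SketchNode):
        return nodable
    if isinstance(nodable, str):
        return SketchNode([nodable])
    if isinstance(nodable, list):
        # Assumed rule: a flat list of strings is a single group of columns.
        if all(isinstance(item, str) for item in nodable):
            return SketchNode(nodable)
        return [to_node(item) for item in nodable]
    raise TypeError(f"Cannot convert {type(nodable).__name__} to a node")


single = to_node("user_id")
grouped = to_node(["user_id", "item_id"])
mixed = to_node(["user_id", ["price", "quantity"]])
print(single.names, grouped.names, [node.names for node in mixed])
```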

class merlin.dag.ColumnSelector(names: Optional[List[str]] = None, subgroups: Optional[List[merlin.dag.selector.ColumnSelector]] = None, tags: Optional[List[Union[str, merlin.schema.tags.Tags]]] = None)[source]#

Bases: object

A ColumnSelector describes a group of columns to be transformed by Operators in a Graph. Operators can be applied to the selected columns by shifting (>>) operators on to the ColumnSelector, which returns a new Node with the transformations applied. This lets you define a graph of operations that makes up your Graph.

Parameters
  • names (list of (str or tuple of str)) – The columns to select from the input Dataset. The elements of this list are strings indicating the column names in most cases, but can also be tuples of strings for feature crosses.

  • subgroups (list of ColumnSelector objects, optional) – This provides an alternate syntax for grouping column names together (instead of nesting tuples inside the list of names)

property tags#
property names#
property grouped_names#
resolve(schema)[source]#

Takes a schema and produces a new selector containing the selected column names. How the selection occurs (by tags or by name) does not matter.
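
The idea of turning a mixed name and tag selection into plain column names can be sketched as follows. The schema here is modelled as a dictionary from column name to a set of tag strings, which is an assumption made for the example; `resolve_names` is a hypothetical helper and not the method documented above.

```python
from typing import Dict, List, Set


def resolve_names(schema: Dict[str, Set[str]], names: List[str], tags: List[str]) -> List[str]:
    """Return the schema columns selected either by name or by at least one tag."""
    wanted_names = set(names)
    wanted_tags = set(tags)
    # Iterate in schema order so the result is deterministic.
    return [
        col for col, col_tags in schema.items()
        if col in wanted_names or wanted_tags & col_tags
    ]


toy_schema = {
    "user_id": {"categorical", "user"},
    "price": {"continuous"},
    "clicks": {"target"},
}
resolved = resolve_names(toy_schema, names=["price"], tags=["user"])
print(resolved)
```

After resolution the result contains only names, so downstream code no longer needs to know whether a column was picked by tag or by name.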

filter_columns(other_selector: merlin.dag.selector.ColumnSelector)[source]#

Narrow the content of this selector to the columns that would be selected by another

Parameters

other_selector (ColumnSelector) – Other selector to apply as the filter

Returns

This selector filtered by the other selector

Return type

ColumnSelector
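
On plain lists of column names, the narrowing behaviour amounts to an order-preserving intersection. The sketch below assumes both selectors have already been reduced to name lists; `filter_names` is a hypothetical helper used only for illustration.

```python
from typing import List


def filter_names(names: List[str], other_names: List[str]) -> List[str]:
    """Keep only the names that the other selection would also select."""
    allowed = set(other_names)
    # Preserve the order of the original selection.
    return [name for name in names if name in allowed]


filtered = filter_names(["user_id", "price", "clicks"], ["clicks", "user_id"])
print(filtered)
```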