merlin.dag package

class merlin.dag.BaseOperator[source]

Bases: object

Base class for all operator classes.

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: merlin.dag.selector.ColumnSelector, dependencies_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]
compute_input_schema(root_schema: merlin.schema.schema.Schema, parents_schema: merlin.schema.schema.Schema, deps_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector) → merlin.schema.schema.Schema[source]

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use.

Parameters
  • root_schema (Schema) – Base schema of the dataset before running any operators.

  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator.

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator.

  • selector (ColumnSelector) – The column selector to apply to the input schema.

Returns

The schemas of the columns used by this operator

Return type

Schema
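The behaviour described for compute_input_schema can be sketched without merlin by modelling a schema as a plain dictionary of column name to dtype. Everything in this sketch is an assumption made for illustration: the function name toy_compute_input_schema, the dictionary representation, and the choice to merge the three upstream schemas before selecting are not taken from the library.

```python
from typing import Dict, List

# Toy stand-in (assumption): a "schema" is a mapping of column name -> dtype string.
ToySchema = Dict[str, str]


def toy_compute_input_schema(
    root_schema: ToySchema,
    parents_schema: ToySchema,
    deps_schema: ToySchema,
    selected_names: List[str],
) -> ToySchema:
    """Merge the upstream schemas, then keep only the selected columns."""
    # Later sources override earlier ones when a column name appears twice.
    upstream = {**root_schema, **parents_schema, **deps_schema}
    missing = [name for name in selected_names if name not in upstream]
    if missing:
        raise ValueError(f"Columns not found upstream: {missing}")
    return {name: upstream[name] for name in selected_names}


root = {"user_id": "int64", "price": "float32"}
parents = {"price_log": "float32"}
deps = {"country": "str"}

input_schema = toy_compute_input_schema(root, parents, deps, ["price_log", "country"])
print(input_schema)  # {'price_log': 'float32', 'country': 'str'}
```

Only the selected columns survive, so the result describes what the operator will actually read rather than everything available upstream.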

compute_output_schema(input_schema: merlin.schema.schema.Schema, col_selector: merlin.dag.selector.ColumnSelector, prev_output_schema: Optional[merlin.schema.schema.Schema] = None) → merlin.schema.schema.Schema[source]

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce.

Parameters
  • input_schema (Schema) – The schemas of the columns to apply this operator to.

  • col_selector (ColumnSelector) – The column selector to apply to the input schema.

Returns

The schemas of the columns produced by this operator

Return type

Schema

column_mapping(col_selector)[source]
compute_column_schema(col_name, input_schema)[source]
property dynamic_dtypes
output_column_names(col_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector[source]

Given a set of column names, returns the names of the transformed columns this operator will produce.

Parameters

columns (list of str, or list of list of str) – The columns to apply this operator to.

Returns

The names of columns produced by this operator

Return type

list of str, or list of list of str

property dependencies

Defines an optional list of column dependencies for this operator. This lets you consume columns that aren’t part of the main transformation workflow.

Returns

Extra dependencies of this operator. Defaults to None.

Return type

str, list of str or ColumnSelector, optional

property output_dtype
property output_tags
property output_properties
property label
create_node(selector)[source]
property supports

Returns what kind of data representation this operator supports

class merlin.dag.Graph(output_node: merlin.dag.node.Node)[source]

Bases: object

property input_dtypes
property output_dtypes
property column_mapping
construct_schema(root_schema: merlin.schema.schema.Schema, preserve_dtypes=False) → merlin.dag.graph.Graph[source]
property input_schema
property leaf_nodes
property output_schema
remove_inputs(to_remove)[source]

Removes columns from a Graph

Starting at the leaf nodes, walks down the graph looking for the columns to remove. When a column is found, it is removed, and the removal is then propagated to any other output columns derived from that column.

Parameters
  • graph (Graph) – The graph to remove columns from

  • to_remove (array_like) – A list of input column names to remove from the graph

Returns

The same graph with columns removed

Return type

Graph
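The propagation step described for remove_inputs can be illustrated with a toy model in which each derived column records the columns it was computed from. This is a sketch under stated assumptions, not the library's implementation: toy_remove_inputs and the derived_from mapping are hypothetical names, and the real method operates on Graph nodes rather than a dictionary.

```python
from typing import Dict, List, Set


def toy_remove_inputs(
    derived_from: Dict[str, List[str]], to_remove: List[str]
) -> Dict[str, List[str]]:
    """Drop every column that depends, directly or indirectly, on a removed input."""
    removed: Set[str] = set(to_remove)
    changed = True
    # Repeat until a full pass removes nothing new, so removals cascade downstream.
    while changed:
        changed = False
        for column, sources in derived_from.items():
            if column not in removed and any(src in removed for src in sources):
                removed.add(column)
                changed = True
    return {col: srcs for col, srcs in derived_from.items() if col not in removed}


# Hypothetical derived columns and the columns each one was computed from.
derived_from = {
    "price_log": ["price"],
    "price_log_norm": ["price_log"],
    "user_id_cat": ["user_id"],
}

remaining = toy_remove_inputs(derived_from, ["price"])
print(remaining)  # {'user_id_cat': ['user_id']}
```

Removing price also removes price_log and, through it, price_log_norm, while the unrelated user_id_cat column is kept.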

classmethod get_nodes_by_op_type(nodes, op_type)[source]
class merlin.dag.Node(selector=None)[source]

Bases: object

A Node is a group of columns that you want to apply the same transformations to. Nodes can be transformed by shifting operators onto them, which returns a new Node with the transformations applied. This lets you define a graph of operations that makes up your workflow.

Parameters

selector (ColumnSelector) – Defines which columns to select from the input Dataset using column names and tags.
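The idea of shifting operators onto a node to obtain a new node can be sketched with Python's `__rshift__` hook. ToyNode and add_suffix below are hypothetical stand-ins written for this example; they do not reproduce the merlin.dag.Node interface and only show how `>>` can build a chain while leaving the original node unchanged.

```python
from typing import Callable, List, Optional


class ToyNode:
    """Hypothetical stand-in for a node: a column list plus a link to its parent."""

    def __init__(self, columns: List[str], parent: Optional["ToyNode"] = None):
        self.columns = columns
        self.parent = parent

    def __rshift__(self, op: Callable[[str], str]) -> "ToyNode":
        # Applying an operator never mutates this node; it returns a new child node.
        return ToyNode([op(column) for column in self.columns], parent=self)


def add_suffix(suffix: str) -> Callable[[str], str]:
    """Toy 'operator' that only renames the columns it is applied to."""
    def rename(column: str) -> str:
        return f"{column}{suffix}"
    return rename


features = ToyNode(["price", "quantity"])
logged = features >> add_suffix("_log")
scaled = logged >> add_suffix("_norm")

print(scaled.columns)    # ['price_log_norm', 'quantity_log_norm']
print(features.columns)  # ['price', 'quantity']
```

Because each `>>` returns a fresh node that points back to its parent, the chain of nodes forms the graph of operations.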

property selector
add_dependency(dep: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]

Adds a dependency node to this node.

Parameters

dep (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Dependency to be added

add_parent(parent: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]

Adds a parent node to this node.

Parameters

parent (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Parent to be added

add_child(child: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]

Adds a child node to this node.

Parameters

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be added

remove_child(child: Union[str, merlin.dag.selector.ColumnSelector, merlin.dag.node.Node, List[Union[str, merlin.dag.node.Node, merlin.dag.selector.ColumnSelector]]])[source]

Removes a child node from this node.

Parameters

child (Union[str, ColumnSelector, Node, List[Union[str, Node, ColumnSelector]]]) – Child to be removed

compute_schemas(root_schema: merlin.schema.schema.Schema, preserve_dtypes: bool = False)[source]

Defines the input and output schema

Parameters
  • root_schema (Schema) – Schema of the input dataset

  • preserve_dtypes (bool, optional) – True if we don’t want to override dtypes in the current schema, by default False

validate_schemas(root_schema: merlin.schema.schema.Schema, strict_dtypes: bool = False)[source]

Check if this Node’s input schema matches the output schemas of parents and dependencies

Parameters
  • root_schema (Schema) – Schema of the input dataset

  • strict_dtypes (bool, optional) – If an error should be raised when column dtypes don’t match, by default False

Raises
  • ValueError – If parents and dependencies don’t provide an expected column based on the input schema

  • ValueError – If the dtype of a column from parents and dependencies doesn’t match the expected dtype based on the input schema
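The two error conditions listed for validate_schemas can be sketched with dictionary-based schemas. The names toy_validate_schemas and check, the error messages, and the dictionary representation are assumptions made for this example and are not taken from merlin.

```python
from typing import Dict

ToySchema = Dict[str, str]  # column name -> dtype string (toy representation)


def toy_validate_schemas(
    expected_input: ToySchema, upstream: ToySchema, strict_dtypes: bool = False
) -> None:
    """Raise ValueError if the upstream columns do not satisfy the expected input."""
    for name, expected_dtype in expected_input.items():
        if name not in upstream:
            raise ValueError(f"Missing column '{name}' in upstream schemas")
        # Dtype mismatches are only treated as errors when strict_dtypes is set.
        if strict_dtypes and upstream[name] != expected_dtype:
            raise ValueError(
                f"Column '{name}' has dtype {upstream[name]}, expected {expected_dtype}"
            )


def check(expected: ToySchema, upstream: ToySchema, strict: bool) -> str:
    """Run the validation and report the outcome as a string."""
    try:
        toy_validate_schemas(expected, upstream, strict_dtypes=strict)
        return "ok"
    except ValueError as err:
        return f"error: {err}"


expected = {"price": "float32", "country": "str"}
upstream = {"price": "float64", "country": "str"}

lenient_result = check(expected, upstream, strict=False)
strict_result = check(expected, upstream, strict=True)
missing_result = check(expected, {"price": "float32"}, strict=False)

print(lenient_result)  # ok
print(strict_result)   # error: Column 'price' has dtype float64, expected float32
print(missing_result)  # error: Missing column 'country' in upstream schemas
```

A missing column is always an error in this sketch, whereas a dtype mismatch is tolerated unless strict_dtypes is True.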

remove_inputs(input_cols)[source]
property exportable
property parents_with_dependencies
property grouped_parents_with_dependencies
property input_columns
property output_columns
property column_mapping
property dependency_columns
property label
property graph
classmethod construct_from(nodable)[source]
class merlin.dag.ColumnSelector(names: Optional[List[str]] = None, subgroups: Optional[List[merlin.dag.selector.ColumnSelector]] = None, tags: Optional[List[Union[str, merlin.schema.tags.Tags]]] = None)[source]

Bases: object

A ColumnSelector describes a group of columns to be transformed by Operators in a Graph. Operators can be applied to the selected columns by shifting (>>) operators on to the ColumnSelector, which returns a new Node with the transformations applied. This lets you define a graph of operations that makes up your Graph.

Parameters
  • names (list of (str or tuple of str)) – The columns to select from the input Dataset. The elements of this list are strings indicating the column names in most cases, but can also be tuples of strings for feature crosses.

  • subgroups (list of ColumnSelector objects, optional) – This provides an alternate syntax for grouping column names together (instead of nesting tuples inside the list of names).

property tags
property names
property grouped_names
resolve(schema)[source]

Takes a schema and produces a new selector containing the selected column names. How the selection was specified (by tags or by names) does not matter.
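As a minimal sketch of the resolution idea, the following toy function turns a mix of explicit names and tags into a flat list of column names. It assumes a schema modelled as a dictionary of column name to tag set; toy_resolve and that representation are hypothetical and are not the ColumnSelector.resolve implementation.

```python
from typing import Dict, List, Set


def toy_resolve(
    schema: Dict[str, Set[str]], names: List[str], tags: List[str]
) -> List[str]:
    """Return concrete column names selected either by name or by tag."""
    wanted_tags = set(tags)
    # Columns requested by name come first, provided they exist in the schema.
    resolved = [name for name in names if name in schema]
    # Then add any remaining column carrying at least one requested tag.
    for column, column_tags in schema.items():
        if column not in resolved and column_tags & wanted_tags:
            resolved.append(column)
    return resolved


# Hypothetical schema: column name -> tags attached to that column.
schema = {
    "user_id": {"categorical"},
    "price": {"continuous"},
    "item_id": {"categorical"},
}

resolved_names = toy_resolve(schema, names=["price"], tags=["categorical"])
print(resolved_names)  # ['price', 'user_id', 'item_id']
```

After resolution the result contains only names, so downstream code no longer needs to know whether a column was picked by name or by tag.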

filter_columns(other_selector)[source]