merlin.dag.Operator

class merlin.dag.Operator

Bases: object

Base class for all operator classes.

__init__()

Methods

__init__()

column_mapping(col_selector)

Compute which output columns depend on which input columns

compute_column_schema(col_name, input_schema)

compute_input_schema(root_schema, ...)

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

compute_output_schema(input_schema, col_selector)

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

compute_selector(input_schema, selector[, ...])

Provides a hook method for sub-classes to override to implement custom column selection logic.

create_node(selector)

export(path, input_schema, output_schema, ...)

Export the class object as a config and all related files to the user defined path.

load_artifacts([artifact_path])

Load artifacts from disk required for operator function.

output_column_names(col_selector)

Given a set of column names, returns the names of the transformed columns this operator will produce

save_artifacts([artifact_path])

Save the artifacts required to reload operator state from disk

transform(col_selector, transformable)

Transform the dataframe by applying this operator to the set of input columns

validate_schemas(parents_schema, ...[, ...])

Provides a hook method that sub-classes can override to implement schema validation logic.

Attributes

dependencies

Defines an optional list of column dependencies for this operator.

dynamic_dtypes

export_name

Provides a clear, common English identifier for this operator.

is_subgraph

label

output_dtype

output_properties

output_tags

supported_formats

supports

Returns what kind of data representation this operator supports

compute_selector(input_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector, parents_selector: Optional[merlin.dag.selector.ColumnSelector] = None, dependencies_selector: Optional[merlin.dag.selector.ColumnSelector] = None) → merlin.dag.selector.ColumnSelector

Provides a hook method for sub-classes to override to implement custom column selection logic.

Parameters
  • input_schema (Schema) – Schemas of the columns to apply this operator to

  • selector (ColumnSelector) – Column selector to apply to the input schema

  • parents_selector (ColumnSelector) – Combined selectors of the upstream parents feeding into this operator

  • dependencies_selector (ColumnSelector) – Combined selectors of the upstream dependencies feeding into this operator

Returns

Revised column selector to apply to the input schema

Return type

ColumnSelector
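As an illustration of what an override of this hook might do, here is a minimal plain-Python sketch. It does not import merlin: the schema is assumed to be a dict of column name to dtype name, the selector a plain list of column names, and `compute_selector_sketch` and `NUMERIC_DTYPES` are hypothetical names introduced for this example only.

```python
from typing import Dict, List

# Stand-ins used only in this sketch: a schema is a dict of column name -> dtype
# name, and a selector is a plain list of column names.
NUMERIC_DTYPES = {"int32", "int64", "float32", "float64"}


def compute_selector_sketch(input_schema: Dict[str, str], selector: List[str]) -> List[str]:
    """Return a revised selector that keeps only the numeric columns."""
    return [name for name in selector if input_schema.get(name) in NUMERIC_DTYPES]


schema = {"user_id": "int64", "city": "str", "price": "float32"}
revised = compute_selector_sketch(schema, ["user_id", "city", "price"])
print(revised)
```

A real override would receive and return ColumnSelector objects; the filtering rule shown here is only one possible kind of custom selection logic.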

compute_input_schema(root_schema: merlin.schema.schema.Schema, parents_schema: merlin.schema.schema.Schema, deps_schema: merlin.schema.schema.Schema, selector: merlin.dag.selector.ColumnSelector) → merlin.schema.schema.Schema

Given the schemas coming from upstream sources and a column selector for the input columns, returns a set of schemas for the input columns this operator will use

Parameters
  • root_schema (Schema) – Base schema of the dataset before running any operators.

  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • selector (ColumnSelector) – The column selector to apply to the input schema

Returns

The schemas of the columns used by this operator

Return type

Schema

compute_output_schema(input_schema: merlin.schema.schema.Schema, col_selector: merlin.dag.selector.ColumnSelector, prev_output_schema: Optional[merlin.schema.schema.Schema] = None) → merlin.schema.schema.Schema

Given a set of schemas and a column selector for the input columns, returns a set of schemas for the transformed columns this operator will produce

Parameters
  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • col_selector (ColumnSelector) – The column selector to apply to the input schema

Returns

The schemas of the columns produced by this operator

Return type

Schema
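The relationship between the input schema, the selector, and the output schema can be sketched in plain Python. This example does not import merlin: schemas are assumed to be dicts of column name to dtype name, the selector is a plain list of names, and `compute_output_schema_sketch` with its `suffix` and `output_dtype` parameters is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Optional


def compute_output_schema_sketch(
    input_schema: Dict[str, str],
    col_selector: List[str],
    suffix: str = "",
    output_dtype: Optional[str] = None,
) -> Dict[str, str]:
    """Derive output column names and dtypes from the selected input columns."""
    return {
        # Keep the input dtype unless the operator declares its own output dtype.
        f"{name}{suffix}": output_dtype or input_schema[name]
        for name in col_selector
    }


input_schema = {"price": "float32", "qty": "int64"}
output_schema = compute_output_schema_sketch(
    input_schema, ["price"], suffix="_log", output_dtype="float64"
)
print(output_schema)
```

Only the selected columns appear in the result, which mirrors the description above: the output schema covers the transformed columns the operator produces, not every upstream column.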

validate_schemas(parents_schema: merlin.schema.schema.Schema, deps_schema: merlin.schema.schema.Schema, input_schema: merlin.schema.schema.Schema, output_schema: merlin.schema.schema.Schema, strict_dtypes: bool = False)

Provides a hook method that sub-classes can override to implement schema validation logic.

Sub-class implementations should raise an exception if the schemas are not valid for the operations they implement.

Parameters
  • parents_schema (Schema) – The combined schemas of the upstream parents feeding into this operator

  • deps_schema (Schema) – The combined schemas of the upstream dependencies feeding into this operator

  • input_schema (Schema) – The schemas of the columns to apply this operator to

  • output_schema (Schema) – The schemas of the columns produced by this operator

  • strict_dtypes (bool, optional) – If True, enables strict checking that column dtypes match. Defaults to False.
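A sub-class implementation of this hook might look roughly like the following plain-Python sketch. It does not import merlin: schemas are assumed to be dicts of column name to dtype name, the function takes only two of the four schemas for brevity, and the specific checks (missing columns, dtype equality under `strict_dtypes`) are assumptions made for illustration.

```python
from typing import Dict

# Stand-in for a schema: column name -> dtype name. Not merlin's Schema class.
SchemaSketch = Dict[str, str]


def validate_schemas_sketch(
    parents_schema: SchemaSketch,
    input_schema: SchemaSketch,
    strict_dtypes: bool = False,
) -> None:
    """Raise ValueError if an input column is missing upstream, or (when
    strict_dtypes is True) if its dtype differs from the upstream dtype."""
    for name, dtype in input_schema.items():
        if name not in parents_schema:
            raise ValueError(f"Missing upstream column: {name}")
        if strict_dtypes and parents_schema[name] != dtype:
            raise ValueError(
                f"Dtype mismatch for {name}: "
                f"expected {dtype}, got {parents_schema[name]}"
            )


upstream = {"price": "float32"}
# Passes: dtypes are not compared unless strict_dtypes is True.
validate_schemas_sketch(upstream, {"price": "float64"})
try:
    validate_schemas_sketch(upstream, {"price": "float64"}, strict_dtypes=True)
except ValueError as err:
    print(err)
```

Raising an exception, rather than returning a status flag, follows the contract stated above for sub-class implementations.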

transform(col_selector: merlin.dag.selector.ColumnSelector, transformable: merlin.core.protocols.Transformable) → merlin.core.protocols.Transformable

Transform the dataframe by applying this operator to the set of input columns

Parameters
  • col_selector (ColumnSelector) – The columns to apply this operator to

  • transformable (Transformable) – A pandas or cuDF dataframe that this operator will work on

Returns

A transformed dataframe or dictarray for this operator

Return type

Transformable
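The shape of a `transform` implementation can be sketched without merlin or a dataframe library. In this example a dict of column name to list of values stands in for the Transformable, a plain list of names stands in for the ColumnSelector, and `ScaleSketch` is a hypothetical operator. Returning only the selected columns is a choice made for this sketch.

```python
from typing import Dict, List

Columns = Dict[str, List[float]]  # stand-in for a dataframe-like object


class ScaleSketch:
    """Hypothetical operator that multiplies the selected columns by a factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, col_selector: List[str], transformable: Columns) -> Columns:
        # Build a new mapping rather than mutating the input in place.
        return {
            name: [value * self.factor for value in transformable[name]]
            for name in col_selector
        }


data = {"price": [1.0, 2.5], "qty": [3.0, 4.0]}
scaled = ScaleSketch(factor=2.0).transform(["price"], data)
print(scaled)
```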

column_mapping(col_selector)

Compute which output columns depend on which input columns

Parameters

col_selector (ColumnSelector) – A selector containing a list of column names

Returns

Mapping from output column names to list of the input columns they rely on

Return type

Dict[str, List[str]]
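To make the return shape concrete, here is a minimal plain-Python sketch of such a mapping. It does not import merlin: `identity_mapping` and `suffix_mapping` are hypothetical helpers that take a list of column names where a real operator would receive a ColumnSelector.

```python
from typing import Dict, List


def identity_mapping(names: List[str]) -> Dict[str, List[str]]:
    """Each output column depends only on the input column of the same name."""
    return {name: [name] for name in names}


def suffix_mapping(names: List[str], suffix: str) -> Dict[str, List[str]]:
    """Hypothetical renaming operator: output 'x<suffix>' depends on input 'x'."""
    return {f"{name}{suffix}": [name] for name in names}


columns = ["user_id", "item_id"]
print(identity_mapping(columns))
print(suffix_mapping(columns, "_hashed"))
```

The keys are output column names and each value lists the input columns that output relies on, so an operator that combines several inputs into one output would have a value list with more than one entry.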

load_artifacts(artifact_path: Optional[os.PathLike] = None)

Load artifacts from disk required for operator function.

Parameters

artifact_path (str or os.PathLike, optional) – The path where artifacts are loaded from

save_artifacts(artifact_path: Optional[os.PathLike] = None) → None

Save the artifacts required to reload operator state from disk

Parameters

artifact_path (str or os.PathLike, optional) – The path where artifacts are to be saved
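The intended round trip between `save_artifacts` and `load_artifacts` can be sketched as follows. This is a self-contained illustration, not merlin's implementation: `VocabSketch`, its `vocab` attribute, the `vocab.json` file name, and the fallback to the current directory when no path is given are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


class VocabSketch:
    """Hypothetical operator whose state is a small vocabulary dict."""

    FILE_NAME = "vocab.json"  # file name chosen for this sketch only

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def save_artifacts(self, artifact_path: Optional[str] = None) -> None:
        # Fall back to the current directory when no path is given (an assumption).
        target = Path(artifact_path or ".")
        target.mkdir(parents=True, exist_ok=True)
        (target / self.FILE_NAME).write_text(json.dumps(self.vocab))

    def load_artifacts(self, artifact_path: Optional[str] = None) -> None:
        source = Path(artifact_path or ".")
        self.vocab = json.loads((source / self.FILE_NAME).read_text())


with tempfile.TemporaryDirectory() as tmp:
    original = VocabSketch()
    original.vocab = {"a": 0, "b": 1}
    original.save_artifacts(tmp)

    # A fresh instance recovers the same state from disk.
    restored = VocabSketch()
    restored.load_artifacts(tmp)

print(restored.vocab == original.vocab)
```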

compute_column_schema(col_name, input_schema)

property dynamic_dtypes

property is_subgraph

output_column_names(col_selector: merlin.dag.selector.ColumnSelector) → merlin.dag.selector.ColumnSelector

Given a set of column names, returns the names of the transformed columns this operator will produce

Parameters

col_selector (ColumnSelector) – The columns to apply this operator to

Returns

The names of columns produced by this operator

Return type

ColumnSelector

property dependencies: List[Union[str, Any]]

Defines an optional list of column dependencies for this operator. This lets you consume columns that aren’t part of the main transformation workflow.

Returns

Extra dependencies of this operator. Defaults to None

Return type

str, list of str or ColumnSelector, optional

property output_dtype

property output_tags

property output_properties

property label: str

create_node(selector)

property supports: merlin.dag.operator.Supports

Returns what kind of data representation this operator supports

property supported_formats: merlin.dag.operator.DataFormats

property export_name

Provides a clear, common English identifier for this operator.

Returns

Name of the current class as spelled in module.

Return type

str

export(path: str, input_schema: merlin.schema.schema.Schema, output_schema: merlin.schema.schema.Schema, **kwargs)

Export the class object as a config and all related files to the user defined path.

Parameters
  • path (str) – Artifact export path

  • input_schema (Schema) – A schema with information about the inputs to this operator.

  • output_schema (Schema) – A schema with information about the outputs of this operator.

  • params (dict, optional) – Parameters dictionary of key, value pairs stored in exported config, by default None.

  • node_id (int, optional) – The placement of the node in the graph (starts at 1), by default None.

  • version (int, optional) – The version of the operator, by default 1.

Returns

model_config – The config for the exported operator.

Return type

dict