API Documentation

Workflow Constructors

Workflow(output_node, client)

The Workflow class applies a graph of operations onto a dataset, letting you transform datasets to do feature engineering and preprocessing operations.

WorkflowNode

alias of merlin.dag.node.Node

Categorical Operators

Bucketize(boundaries)

This operation transforms continuous features into categorical features with bins based on the provided bin boundaries.
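
The binning idea can be sketched in plain Python. This is an illustration only, not NVTabular's implementation: the function name `bucketize` is hypothetical, and the convention that a value equal to a boundary falls into the upper bin is an assumption.

```python
from bisect import bisect_right

def bucketize(values, boundaries):
    """Map each continuous value to the index of the bin it falls into."""
    edges = sorted(boundaries)
    # bisect_right counts how many boundaries are <= the value, so a value
    # equal to a boundary lands in the bin that starts at that boundary
    # (an assumed convention for this sketch).
    return [bisect_right(edges, v) for v in values]

bins = bucketize([-3.0, 0.5, 10.0, 42.0], boundaries=[0.0, 10.0])
print(bins)
```

With two boundaries there are three bins (0, 1 and 2), so the four sample values map to `[0, 1, 2, 2]`.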

Categorify([freq_threshold, out_path, …])

Most datasets contain categorical features, and these variables are typically stored as text values.

DropLowCardinality([min_cardinality])

DropLowCardinality drops low cardinality categorical columns.

HashBucket(num_buckets)

This op maps categorical columns to a contiguous integer range by first hashing the column, then reducing modulo the number of buckets.
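
A minimal sketch of hash-then-modulo, assuming a single integer `num_buckets`. The helper name `hash_bucket` and the choice of MD5 are assumptions made for illustration; the hash function the library actually uses is not stated here.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Hash a categorical value, then reduce modulo the number of buckets."""
    # MD5 is an illustrative, process-independent hash (an assumption).
    # Python's built-in hash() is salted per process for strings by default,
    # so it would not give repeatable bucket ids across runs.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

cities = ["paris", "tokyo", "lima", "paris"]
bucket_ids = [hash_bucket(c, num_buckets=10) for c in cities]
print(bucket_ids)
```

Every id falls in the range 0 to `num_buckets - 1`, and equal inputs always share a bucket. Distinct values may collide, which is the trade-off for not having to store a vocabulary.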

HashedCross(num_buckets)

This op creates hashed cross columns by first combining categorical features and hashing the combined feature, then reducing modulo the number of buckets.
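
The combine, hash, and modulo steps can be sketched as follows. The helper `hashed_cross`, the `_` separator, and the MD5 hash are all assumptions chosen for illustration, not the library's actual behavior.

```python
import hashlib

def hashed_cross(row, columns, num_buckets):
    """Combine several categorical values, hash the result, reduce modulo."""
    # Join the values of the crossed columns into one string (the separator
    # is an arbitrary choice for this sketch).
    combined = "_".join(str(row[c]) for c in columns)
    digest = hashlib.md5(combined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

rows = [
    {"city": "paris", "device": "mobile"},
    {"city": "paris", "device": "desktop"},
    {"city": "paris", "device": "mobile"},
]
crossed = [hashed_cross(r, ["city", "device"], num_buckets=100) for r in rows]
print(crossed)
```

Rows with the same combination of values (the first and third here) always receive the same crossed id.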

TargetEncoding(target[, target_mean, kfold, …])

Target encoding is a common feature-engineering technique for categorical columns in tabular datasets.
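
In its simplest form, target encoding replaces each category with the mean of the target over the rows in that category. The sketch below shows only that basic idea; the function name `target_encode` is hypothetical, and it deliberately ignores the `target_mean` and `kfold` parameters from the signature above, which presumably control smoothing and out-of-fold computation.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]

encoded = target_encode(["a", "b", "a", "b", "a"], [1, 0, 0, 0, 1])
print(encoded)
```

Computing the means on the same rows they are applied to leaks the target into the feature, which is why practical implementations usually compute them out of fold.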

Continuous Operators

Clip([min_value, max_value])

This operation clips continuous values so that they are within a min/max bound. For instance, by setting the min value to 0, you can replace all negative values with 0. This is helpful in cases where you want to log normalize values.
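
The clip-then-log pattern described above can be sketched in plain Python. The helper `clip` is hypothetical, and the use of `math.log1p` (the log of 1 + x) as the log transform is an assumption chosen so that a clipped value of 0 stays finite.

```python
import math

def clip(values, min_value=None, max_value=None):
    """Bound each value to [min_value, max_value]; None leaves a side open."""
    out = []
    for v in values:
        if min_value is not None and v < min_value:
            v = min_value
        if max_value is not None and v > max_value:
            v = max_value
        out.append(v)
    return out

clipped = clip([-5.0, 3.0, 100.0], min_value=0.0)
# After clipping at 0 every value is non-negative, so the log is defined.
logged = [math.log1p(v) for v in clipped]
print(clipped, logged)
```

Without the clip, the negative input would raise a `ValueError` in the log step.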

LogOp()

This operator calculates the log of continuous columns.

Normalize([out_dtype])

Standardizing the features around 0 with a standard deviation of 1 is a common technique to compare measurements that have different units.
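
Standardization subtracts the column mean and divides by the column standard deviation. A minimal sketch, assuming the population standard deviation and a non-constant column; the function name `standardize` is hypothetical.

```python
import statistics

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    # Population standard deviation is an assumption of this sketch.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```

The sample column has mean 5 and standard deviation 2, so 9.0 maps to 2.0 and 2.0 maps to -1.5. A constant column has zero standard deviation and would need special handling.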

NormalizeMinMax([out_dtype])

This operator standardizes continuous features such that they are between 0 and 1.
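
Min-max scaling maps the column minimum to 0 and the maximum to 1. A minimal sketch, assuming the column is not constant; the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10.0, 15.0, 20.0])
print(scaled)
```

For the sample column this yields `[0.0, 0.5, 1.0]`.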

Missing Value Operators

Dropna()

This operation detects and filters out rows with missing values.

FillMissing([fill_val, add_binary_cols])

This operation replaces missing values with a constant pre-defined value.

FillMedian([add_binary_cols])

This operation replaces missing values with the median value for the column.
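
The three missing-value strategies above can be compared side by side in plain Python. The helper names are hypothetical, `None` stands in for a missing value, and the `add_binary_cols` option is not modelled.

```python
import statistics

def dropna(rows):
    """Keep only rows in which no field is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_missing(values, fill_val=0):
    """Replace missing entries with a constant."""
    return [fill_val if v is None else v for v in values]

def fill_median(values):
    """Replace missing entries with the median of the observed entries."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

ages = [30, None, 20, 40]
rows = [{"age": 30, "city": "lima"}, {"age": None, "city": "oslo"}]

kept = dropna(rows)
constant_filled = fill_missing(ages, fill_val=0)
median_filled = fill_median(ages)
print(kept, constant_filled, median_filled)
```

Dropping rows loses data, a constant fill can distort the distribution, and a median fill keeps the column's center in place; which one is appropriate depends on the feature.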

Row Manipulation Operators

DifferenceLag(partition_cols[, shift])

Calculates the difference between two consecutive rows of the dataset.
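
A minimal sketch of a partitioned lag difference. It assumes the rows are already sorted so that each partition is contiguous, and that a row with no predecessor in its partition gets a missing value; the function name `difference_lag` and its column arguments are hypothetical.

```python
def difference_lag(rows, partition_col, value_col, shift=1):
    """Subtract the value `shift` rows earlier, within the same partition."""
    diffs = []
    for i, row in enumerate(rows):
        j = i - shift
        same_partition = (
            0 <= j < len(rows) and rows[j][partition_col] == row[partition_col]
        )
        if same_partition:
            diffs.append(row[value_col] - rows[j][value_col])
        else:
            # No earlier row in this partition: leave the difference missing.
            diffs.append(None)
    return diffs

events = [
    {"user": "u1", "ts": 10},
    {"user": "u1", "ts": 25},
    {"user": "u2", "ts": 7},
    {"user": "u2", "ts": 9},
]
lags = difference_lag(events, "user", "ts")
print(lags)
```

Here the result is `[None, 15, None, 2]`: the difference is never taken across the boundary between users `u1` and `u2`.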

Filter(f)

Filters rows from the dataset.

Groupby([groupby_cols, sort_cols, aggs, …])

Groupby Transformation

JoinExternal(df_ext, on[, how, on_ext, …])

Join each dataset partition to an external table.

JoinGroupby([cont_cols, stats, split_out, …])

One of the ways to create new features is to calculate the basic statistics of the data that is grouped by categorical features.
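
The idea of grouping by a categorical column, computing a statistic, and joining it back onto every row can be sketched as follows. The helper `join_groupby_mean` and the name of the new column are hypothetical, and only the mean is shown.

```python
from collections import defaultdict

def join_groupby_mean(rows, cat_col, cont_col):
    """Attach the per-category mean of cont_col to every row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cat_col]].append(row[cont_col])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    new_col = f"{cat_col}_{cont_col}_mean"  # illustrative naming only
    return [{**row, new_col: means[row[cat_col]]} for row in rows]

sales = [
    {"store": "a", "amount": 10.0},
    {"store": "a", "amount": 30.0},
    {"store": "b", "amount": 5.0},
]
enriched = join_groupby_mean(sales, "store", "amount")
print(enriched)
```

Unlike a plain group-by, the number of rows is unchanged: each row simply gains a feature describing its group.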

Schema Operators

AddMetadata([tags, properties])

This operator will add user defined tags and properties to a Schema.

AddProperties([properties])

AddTags([tags])

Rename([f, postfix, name])

This operation renames columns by one of several methods.

ReduceDtypeSize([float_dtype])

ReduceDtypeSize changes the dtypes of numeric columns.

TagAsItemFeatures([tags])

TagAsItemID([tags])

TagAsUserFeatures([tags])

TagAsUserID([tags])

List Operators

ListSlice(start[, end, pad, pad_value])

Slices a list column.
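
Slicing a list column applies the same slice to the list stored in every row. A minimal sketch using Python slice semantics; the helper `list_slice` is hypothetical, and the `pad` and `pad_value` options from the signature above are not modelled.

```python
def list_slice(lists, start, end=None):
    """Apply the slice [start:end] to the list in each row."""
    return [lst[start:end] for lst in lists]

history = [[1, 2, 3, 4], [5, 6], [7]]
first_two = list_slice(history, start=0, end=2)
last_two = list_slice(history, start=-2)
print(first_two, last_two)
```

Rows shorter than the requested slice keep whatever elements they have, so the output lists can still differ in length.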

ValueCount()

The operator calculates the min and max lengths of multihot columns.
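
The statistic itself is simple to illustrate: measure the length of the list in each row and keep the extremes. The helper name `value_count` and the returned dictionary layout are assumptions made for this sketch.

```python
def value_count(lists):
    """Return the minimum and maximum list length in a multihot column."""
    lengths = [len(lst) for lst in lists]
    return {"min": min(lengths), "max": max(lengths)}

stats = value_count([[1, 2, 3, 4], [5, 6], [7]])
print(stats)
```

For the sample column the lengths are 4, 2 and 1, so the result is a minimum of 1 and a maximum of 4.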

Vector Operators

ColumnSimilarity(left_features[, …])

Calculates the similarity between two columns using tf-idf, cosine or inner product as the distance metric.
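
Two of the metrics named above, cosine and inner product, can be sketched for a single pair of dense vectors. The helper names are hypothetical, tf-idf weighting is not shown, and returning 0.0 for a zero-length vector is an assumption of this sketch.

```python
import math

def inner_product(left, right):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(left, right))

def cosine_similarity(left, right):
    """Inner product divided by the product of the vector norms."""
    norm = math.sqrt(inner_product(left, left)) * math.sqrt(
        inner_product(right, right)
    )
    # Treat a zero vector as having no similarity (an assumed convention).
    return inner_product(left, right) / norm if norm else 0.0

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
dot = inner_product([1.0, 2.0], [3.0, 4.0])
print(same, orthogonal, dot)
```

Cosine similarity ignores vector length, so the first pair scores 1.0 even though the magnitudes differ, while the inner product grows with magnitude.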

User-Defined Function Operators

LambdaOp

alias of merlin.dag.ops.udf.UDF

Operator Base Classes

Operator()

Base class for all operator classes.

StatOperator()

Base class for statistical operator classes.