API Documentation#

Workflow Constructors#

Workflow(output_node[, client])

The Workflow class applies a graph of operations to a dataset, letting you perform feature engineering and preprocessing transformations.

WorkflowNode

alias of Node

Categorical Operators#

Bucketize(boundaries)

This operation transforms continuous features into categorical features with bins based on the provided bin boundaries.
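
A minimal pure-Python sketch of the binning that Bucketize performs. Whether bins are closed on the left or the right is an assumption here and may differ from NVTabular's implementation:

```python
import bisect

def bucketize(values, boundaries):
    # Map each continuous value to the index of its bin. Values
    # below the first boundary go to bin 0; values at or above
    # the last boundary go to the final bin.
    return [bisect.bisect_right(boundaries, v) for v in values]

bins = bucketize([-5.0, 0.5, 1.0, 10.0], boundaries=[0, 1, 2])
```

With `n` boundaries, the output takes one of `n + 1` integer bin ids.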

Categorify([freq_threshold, out_path, ...])

Most datasets contain categorical features, and these variables are typically stored as text values; this operation encodes them as contiguous integer ids.
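
A pure-Python sketch of the encoding idea, not NVTabular's implementation. The alphabetical id assignment and the convention of reserving id 0 for rare or out-of-vocabulary values are assumptions made for illustration:

```python
from collections import Counter

def categorify(values, freq_threshold=0):
    # Build a vocabulary of categories seen at least freq_threshold
    # times and map each value to a contiguous integer id; id 0 is
    # reserved for rare/out-of-vocabulary values.
    counts = Counter(values)
    vocab = {}
    for v in sorted(counts):
        if counts[v] >= freq_threshold:
            vocab[v] = len(vocab) + 1
    return [vocab.get(v, 0) for v in values]

ids = categorify(["b", "a", "a", "b", "c"], freq_threshold=2)
```

Here "c" appears only once, falls below the frequency threshold, and maps to the reserved id 0.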

DropLowCardinality([min_cardinality])

DropLowCardinality drops low cardinality categorical columns.

HashBucket(num_buckets)

This op maps categorical columns to a contiguous integer range by first hashing the column, then reducing modulo the number of buckets.
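
The hash-then-modulo step can be sketched in pure Python. The choice of MD5 as the stable hash is an assumption for reproducibility, not NVTabular's actual hash function:

```python
import hashlib

def hash_bucket(values, num_buckets):
    # Hash each categorical value with a stable hash (Python's
    # built-in hash() is salted per process), then reduce modulo
    # the number of buckets to get a contiguous integer range.
    def stable_hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)
    return [stable_hash(str(v)) % num_buckets for v in values]

buckets = hash_bucket(["apple", "banana", "apple"], num_buckets=10)
```

Equal input values always land in the same bucket, but distinct values may collide.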

HashedCross(num_buckets)

This op creates hashed cross columns by first combining categorical features and hashing the combined feature, then reducing modulo the number of buckets.
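
A sketch of the crossing step under the same assumptions as the HashBucket sketch above (stable MD5 hash, `"_"` as the join separator; both are illustrative choices, not NVTabular's internals):

```python
import hashlib

def hashed_cross(col_a, col_b, num_buckets):
    # Combine the two categorical values, hash the combination with
    # a stable hash, then reduce modulo the number of buckets.
    def stable_hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)
    return [stable_hash(f"{a}_{b}") % num_buckets
            for a, b in zip(col_a, col_b)]

crossed = hashed_cross(["US", "US", "DE"], ["mobile", "web", "mobile"], 100)
```

The cross captures interactions (e.g. country x device) without materializing the full cartesian vocabulary.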

TargetEncoding(target[, target_mean, kfold, ...])

Target encoding is a common feature-engineering technique for categorical columns in tabular datasets.
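
A sketch of one common smoothed variant of target encoding. The additive-smoothing formula below is an assumption for illustration; NVTabular's operator also supports out-of-fold encoding via the `kfold` parameter, which this sketch omits:

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=20.0):
    # Replace each category with a smoothed mean of the target: a
    # blend of the per-category mean and the global mean, weighted
    # by the category's count (rarer categories shrink harder
    # toward the global mean).
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [encoding[c] for c in categories]
```

Without out-of-fold splitting, encodings computed on the training targets can leak label information; the `kfold` machinery exists to prevent exactly that.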

Continuous Operators#

Clip([min_value, max_value])

This operation clips continuous values so that they are within a min/max bound. For instance, by setting the min value to 0, you can replace all negative values with 0. This is helpful in cases where you want to log normalize values.
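
A minimal sketch of the clipping semantics, assuming `None` means "that side is unbounded":

```python
def clip(values, min_value=None, max_value=None):
    # Bound each value into [min_value, max_value]; a side set to
    # None is left unbounded.
    out = []
    for v in values:
        if min_value is not None and v < min_value:
            v = min_value
        if max_value is not None and v > max_value:
            v = max_value
        out.append(v)
    return out
```

Clipping at 0 first is what makes a subsequent log transform safe on columns that contain negative noise.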

LogOp()

This operator calculates the log of continuous columns.

Normalize([out_dtype])

Standardizing the features around 0 with a standard deviation of 1 is a common technique to compare measurements that have different units.
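
A pure-Python sketch of the standardization `(x - mean) / std`. Using the population standard deviation (rather than the sample one) is an assumption here:

```python
import statistics

def normalize(values):
    # Standardize to zero mean and unit standard deviation:
    # (x - mean) / std.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption
    return [(v - mean) / std for v in values]

out = normalize([1.0, 2.0, 3.0, 4.0])
```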

NormalizeMinMax([out_dtype])

This operator standardizes continuous features such that they are between 0 and 1.
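
The min-max rescaling can be sketched as `(x - min) / (max - min)` (edge cases such as a constant column are ignored in this sketch):

```python
def normalize_min_max(values):
    # Rescale to [0, 1]: the minimum maps to 0, the maximum to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```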

Missing Value Operators#

Dropna()

This operation detects and filters out rows with missing values.

FillMissing([fill_val, add_binary_cols])

This operation replaces missing values with a constant pre-defined value.

FillMedian([add_binary_cols])

This operation replaces missing values with the median value for the column.
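
A sketch of both fill strategies, using `None` to stand in for missing values; the tuple return when `add_binary_col` is set mirrors the idea behind the `add_binary_cols` parameter but is an illustrative choice, not the operator's actual interface:

```python
import statistics

def fill_missing(values, fill_val=0.0, add_binary_col=False):
    # Replace None with a constant; optionally also return a
    # parallel 0/1 column flagging which rows were missing.
    filled = [fill_val if v is None else v for v in values]
    if add_binary_col:
        was_missing = [1 if v is None else 0 for v in values]
        return filled, was_missing
    return filled

def fill_median(values):
    # Replace None with the median of the observed values.
    med = statistics.median([v for v in values if v is not None])
    return [med if v is None else v for v in values]
```

The binary indicator lets a downstream model learn that "was missing" is itself a signal, independent of the fill value.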

Row Manipulation Operators#

DifferenceLag(partition_cols[, shift])

Calculates the difference between each row and the row shift positions earlier within the same partition.
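
A sketch of the lagged difference, assuming the rows are already sorted so that each partition's rows are contiguous and in order:

```python
def difference_lag(values, partitions, shift=1):
    # Difference each value against the value `shift` rows earlier
    # within the same partition; rows with no such predecessor
    # get None.
    out = []
    for i, (v, p) in enumerate(zip(values, partitions)):
        j = i - shift
        if j >= 0 and partitions[j] == p:
            out.append(v - values[j])
        else:
            out.append(None)
    return out
```

This is useful for features like "time since the user's previous event" when partitioning by a user id column.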

Filter(f)

Filters rows from the dataset.

Groupby([groupby_cols, sort_cols, aggs, ...])

Groups rows by the given columns, optionally sorting within each group, and applies the requested aggregations.

JoinExternal(df_ext, on[, how, on_ext, ...])

Join each dataset partition to an external table.

JoinGroupby([cont_cols, stats, split_out, ...])

One of the ways to create new features is to calculate the basic statistics of the data that is grouped by categorical features.
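
A sketch of the idea for a single statistic: compute the per-category mean of a continuous column, then join it back onto every row as a new feature (JoinGroupby supports several such statistics; `mean` is just the one illustrated here):

```python
from collections import defaultdict

def join_groupby_mean(cats, conts):
    # Compute the mean of a continuous column per category, then
    # broadcast that statistic back to every row of the category.
    sums, counts = defaultdict(float), defaultdict(int)
    for c, v in zip(cats, conts):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in counts}
    return [means[c] for c in cats]
```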

Schema Operators#

AddMetadata([tags, properties])

This operator will add user-defined tags and properties to a Schema.

AddProperties([properties])

This operator adds user-defined properties to a Schema.

AddTags([tags])

This operator adds user-defined tags to a Schema.

Rename([f, postfix, name])

This operation renames columns by one of several methods: a user-defined function, a postfix appended to each name, or an explicit new name.

ReduceDtypeSize([float_dtype])

ReduceDtypeSize changes the dtypes of numeric columns.

TagAsItemFeatures([tags])

TagAsItemID([tags])

TagAsUserFeatures([tags])

TagAsUserID([tags])

List Operators#

ListSlice(start[, end, pad, pad_value])

Slices a list column.
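
A sketch of slicing a list (multihot) column row by row. The exact padding semantics, and handling of negative indices, are assumptions here and may differ from the operator's behavior:

```python
def list_slice(rows, start, end=None, pad=False, pad_value=0):
    # Slice every row's list; optionally pad short results so each
    # row comes out with the same length (end - start).
    out = []
    for row in rows:
        sliced = row[start:end]
        if pad and end is not None:
            target = end - start
            sliced = sliced + [pad_value] * (target - len(sliced))
        out.append(sliced)
    return out
```

Padding to a fixed length is what makes a ragged list column batchable as a dense tensor.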

ValueCount()

The operator calculates the min and max lengths of multihot columns.

Vector Operators#

ColumnSimilarity(left_features[, ...])

Calculates the similarity between two columns using tf-idf, cosine or inner product as the distance metric.
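
The cosine metric for two dense vectors can be sketched as `dot(a, b) / (||a|| * ||b||)` (the operator works on sparse columns and also supports tf-idf weighting and plain inner product, which this sketch omits):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
```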

User-Defined Function Operators#

LambdaOp

alias of UDF

Operator Base Classes#

Operator()

Base class for all operator classes.

StatOperator()

Base class for statistical operator classes.