merlin.schema package
- class merlin.schema.Schema(column_schemas=None)
Bases: object
A collection of column schemas for a dataset.
- property column_names
- select(selector) → Schema
Select matching columns from this Schema object using a ColumnSelector
- Parameters:
selector (ColumnSelector) – Selector that describes which columns match
- Returns:
New object containing only the ColumnSchemas of selected columns
- Return type:
Schema
- excluding(selector) → Schema
Select non-matching columns from this Schema object using a ColumnSelector
- Parameters:
selector (ColumnSelector) – Selector that describes which columns to exclude
- Returns:
New object containing only the ColumnSchemas of columns that do not match the selector
- Return type:
Schema
- select_by_tag(tags: str | Tags | List[str | Tags]) → Schema
Select matching columns from this Schema object using a list of tags
- select_by_name(names: List[str]) → Schema
Select matching columns from this Schema object using a list of column names
- get(col_name: str, default: ColumnSchema | None = None) → ColumnSchema
Get a ColumnSchema by name
- Parameters:
col_name (str) – Name of the column to get
default (ColumnSchema, optional) – Default value to return if the column is not found. (Default value = None)
- Returns:
Retrieved column schema (or default value, if not found)
- Return type:
ColumnSchema
- property first: ColumnSchema
Returns the first ColumnSchema in the Schema. Useful when you have selected down to a single column via select_by_name or select_by_tag and want that column's schema directly.
- Returns:
The first column schema present in this Schema object
- Return type:
ColumnSchema
- Raises:
ValueError – If this Schema object contains no column schemas
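The selection and lookup behaviour described above can be sketched without Merlin installed. The classes below, `MiniSchema` and `MiniColumnSchema`, are hypothetical stand-ins written for illustration only; they are not Merlin's implementation. The sketch assumes that `select_by_tag` keeps a column when it carries any of the requested tags, which the reference above does not state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MiniColumnSchema:
    """Hypothetical stand-in for ColumnSchema: a name plus a set of tags."""
    name: str
    tags: frozenset = frozenset()


class MiniSchema:
    """Hypothetical stand-in for Schema; not Merlin's implementation."""

    def __init__(self, column_schemas=None):
        # A dict preserves insertion order, so `first` is well defined.
        self._columns: Dict[str, MiniColumnSchema] = {
            col.name: col for col in (column_schemas or [])
        }

    @property
    def column_names(self) -> List[str]:
        return list(self._columns)

    def select_by_name(self, names: List[str]) -> "MiniSchema":
        # Return a new object; the original schema is left unchanged.
        return MiniSchema([c for n, c in self._columns.items() if n in names])

    def select_by_tag(self, tags) -> "MiniSchema":
        # Assumption: a column matches if it has ANY of the requested tags.
        wanted = {tags} if isinstance(tags, str) else set(tags)
        return MiniSchema([c for c in self._columns.values() if wanted & c.tags])

    def get(self, col_name: str, default=None):
        # Fall back to `default` instead of raising when the name is absent.
        return self._columns.get(col_name, default)

    @property
    def first(self) -> MiniColumnSchema:
        if not self._columns:
            raise ValueError("Schema contains no column schemas")
        return next(iter(self._columns.values()))


schema = MiniSchema([
    MiniColumnSchema("user_id", frozenset({"user", "id"})),
    MiniColumnSchema("item_id", frozenset({"item", "id"})),
    MiniColumnSchema("price", frozenset({"continuous"})),
])
ids = schema.select_by_tag("id")
only_price = schema.select_by_name(["price"])
```

Here `only_price.first` is the natural way to unwrap a single-column selection, while `schema.get("missing")` returns the default rather than raising.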
- class merlin.schema.ColumnSchema(name: str, tags: TagSet | List[str | Tags] | None = <factory>, properties: Dict | None = <factory>, dtype: DType | None = None, is_list: bool | None = None, is_ragged: bool | None = None, dims: InitVar[Tuple | Shape] = None)
Bases: object
A schema containing metadata of a dataframe column.
- property shape
- with_name(name: str) → ColumnSchema
Create a copy of this ColumnSchema object with a different column name
- Parameters:
name (str) – New column name
- Returns:
Copied object with new column name
- Return type:
ColumnSchema
- with_tags(tags: str | Tags) → ColumnSchema
Create a copy of this ColumnSchema object with different column tags
- Parameters:
tags (str | Tags) – New column tags
- Returns:
Copied object with new column tags
- Return type:
ColumnSchema
- with_properties(properties: dict) → ColumnSchema
Create a copy of this ColumnSchema object with different column properties
- with_dtype(dtype, is_list: bool | None = None, is_ragged: bool | None = None) → ColumnSchema
Create a copy of this ColumnSchema object with different column dtype
- Parameters:
dtype (np.dtype) – New column dtype
is_list (bool, optional) – Whether rows in this column contain lists. (Default value = None)
is_ragged (bool, optional) – Whether lists in this column have varying lengths. (Default value = None)
- Returns:
Copied object with new column dtype
- Return type:
ColumnSchema
- with_shape(shape: Tuple | Shape) → ColumnSchema
Create a copy of this object with a new shape
- Parameters:
shape (Union[Tuple, Shape]) – Object to set as shape, must be either a tuple or Shape.
- Returns:
A copy of this object containing the provided shape value
- Return type:
ColumnSchema
- Raises:
TypeError – If shape is neither a tuple nor a Shape
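Every `with_*` method above returns a copy rather than modifying the object in place. A minimal sketch of that copy-on-change pattern, using a frozen dataclass and `dataclasses.replace`, is shown below. `CopyOnWriteColumn` is a hypothetical stand-in, not Merlin's `ColumnSchema`, and it accepts only plain tuples as shapes because the real `Shape` class is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class CopyOnWriteColumn:
    """Hypothetical stand-in for ColumnSchema; not Merlin's implementation."""
    name: str
    dtype: Optional[str] = None
    shape: Optional[Tuple] = None

    def with_name(self, name: str) -> "CopyOnWriteColumn":
        # replace() builds a new instance; `self` is left untouched.
        return replace(self, name=name)

    def with_shape(self, shape) -> "CopyOnWriteColumn":
        # Assumption: only tuples are accepted in this sketch.
        if not isinstance(shape, tuple):
            raise TypeError("shape must be a tuple")
        return replace(self, shape=shape)


original = CopyOnWriteColumn("price", dtype="float32")
renamed = original.with_name("unit_price")
shaped = original.with_shape((None, 3))
```

After these calls `original` still has the name `price` and no shape, while `renamed` and `shaped` each carry one change and keep the remaining fields.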
- class merlin.schema.Tags(value)
Bases: Enum
Standard tags used in the Merlin ecosystem
- CATEGORICAL = 'categorical'
- CONTINUOUS = 'continuous'
- LIST = 'list'
- SEQUENCE = 'sequence'
- TEXT = 'text'
- TOKENIZED = 'tokenized'
- TIME = 'time'
- ID = 'id'
- USER = 'user'
- ITEM = 'item'
- SESSION = 'session'
- CONTEXT = 'context'
- TARGET = 'target'
- REGRESSION = 'regression'
- CLASSIFICATION = 'classification'
- BINARY = 'binary'
- MULTI_CLASS = 'multi_class'
- USER_ID = 'user_id'
- ITEM_ID = 'item_id'
- SESSION_ID = 'session_id'
- TEXT_TOKENIZED = 'text_tokenized'
- BINARY_CLASSIFICATION = 'binary_classification'
- MULTI_CLASS_CLASSIFICATION = 'multi_class_classification'
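Because Tags derives from Enum and each member has a string value, the standard Enum lookups apply. The sketch below uses `MiniTags`, a hypothetical three-member subset defined with the standard library, so it runs without Merlin; it illustrates general Enum behaviour only.

```python
from enum import Enum


class MiniTags(Enum):
    """Hypothetical subset of the Tags members listed above."""
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    USER_ID = "user_id"


by_value = MiniTags("user_id")     # look up a member from its string value
by_name = MiniTags["CONTINUOUS"]   # look up a member from its name
```

Calling the class with a string that is not a member value, such as `MiniTags("unknown")`, raises ValueError.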