merlin.models.tf.WideAndDeepModel#
- merlin.models.tf.WideAndDeepModel(schema: merlin.schema.schema.Schema, deep_block: merlin.models.tf.core.base.Block, wide_schema: Optional[merlin.schema.schema.Schema] = None, deep_schema: Optional[merlin.schema.schema.Schema] = None, wide_preprocess: Optional[merlin.models.tf.core.base.Block] = None, deep_input_block: Optional[merlin.models.tf.core.base.Block] = None, wide_input_block: Optional[merlin.models.tf.core.base.Block] = None, deep_regularizer: Optional[Union[str, keras.regularizers.Regularizer]] = None, wide_regularizer: Optional[Union[str, keras.regularizers.Regularizer]] = None, deep_dropout: Optional[float] = None, wide_dropout: Optional[float] = None, prediction_tasks: Optional[Union[merlin.models.tf.prediction_tasks.base.PredictionTask, List[merlin.models.tf.prediction_tasks.base.PredictionTask], merlin.models.tf.prediction_tasks.base.ParallelPredictionBlock, ModelOutput, merlin.models.tf.core.combinators.ParallelBlock]] = None, **wide_body_kwargs) merlin.models.tf.models.base.Model [source]#
The Wide&Deep architecture [1] was proposed by Google in 2016 to balance between the ability of neural networks to generalize and capacity of linear models to memorize relevant feature interactions. The deep part is an MLP model, with categorical features represented as embeddings, which are concatenated with continuous features and fed through multiple MLP layers. The wide part is a linear model takes a sparse representation of categorical features (i.e. one-hot or multi-hot representation). Both wide and deep sub-models output a logit, which is summed and followed by sigmoid for binary classification loss.
Example Usage:
1. Using default input block ```python wide_deep = ml.benchmark.WideAndDeepModel( schema, deep_block=ml.MLPBlock([32, 16]), wide_schema=wide_schema, deep_schema=deep_schema, prediction_tasks=ml.BinaryOutput("click"), ) wide_deep.compile(optimizer="adam") wide_deep.fit(train_data, epochs=10) ``` 2. Custom input block ```python deep_embedding = ml.Embeddings(schema, embedding_dim_default=8, infer_embedding_sizes=False) model = ml.WideAndDeepModel( schema, deep_input_block = ml.InputBlockV2(schema=schema, categorical=deep_embedding), wide_schema=wide_schema, wide_preprocess=ml.CategoryEncoding(wide_schema, output_mode="multi_hot", sparse=True), deep_block=ml.MLPBlock([32, 16]), prediction_tasks=ml.BinaryOutput("click"), ) ``` 3. Wide preprocess with one-hot categorical features and hashed 2nd-level feature interactions ```python model = ml.WideAndDeepModel( schema, wide_schema=wide_schema, deep_schema=deep_schema, wide_preprocess=ml.ParallelBlock( [ # One-hot representations of categorical features ml.CategoryEncoding(wide_schema, output_mode="one_hot", sparse=True), # One-hot representations of hashed 2nd-level feature interactions ml.HashedCrossAll(wide_schema, num_bins=1000, max_level=2, sparse=True), ], aggregation="concat", ), deep_block=ml.MLPBlock([31, 16]), prediction_tasks=ml.BinaryOutput("click"), ) ``` 4. Wide preprocess with multi-hot categorical features and hashed 2nd-level multi-hot feature interactions ```python one_hot_schema = schema.select_by_name(['categ_1', 'categ_2']) multi_hot_schema = schema.select_by_name(['categ_multi_hot_3']) wide_schema = one_hot_schema + multi_hot_schema # One-hot features one_hot_encoding = mm.SequentialBlock( mm.Filter(one_hot_schema), mm.CategoryEncoding(one_hot_schema, sparse=True, output_mode="one_hot"), ) ``` If your dataset contains multi-hot categorical features, i.e. features that may contain multiple categorical values for a data sample, you can instantiate the `AsDenseFeatures` block that converts the sparse representation of multi-hot features into a dense one (with maximum size defined) where the missing values are padded with zeros, as in the following example. ```python # Multi-hot features multi_hot_encoding = mm.SequentialBlock( mm.Filter(multi_hot_schema), # Assuming max size of multi-hot features is 5 ml.AsDenseFeatures(max_seq_length=5), mm.CategoryEncoding(multi_hot_schema, sparse=True, output_mode="multi_hot") ) ``` Linear models are not able to compute feature interaction (like MLPs). So to give the wide part more power we perform paired feature interactions as a preprocessing step, so that every possible combination of the values of two categorical features is mapped to a single id. That way, the model is be able to pick paired feature relationships, e.g., a pattern between the a category of a product and the city of a user. Although, this approach leads to very high-cardinality resulting feature (product between the two features cardinalities). So typically we apply the hashing trick to limit the resulting cardinality. Below you can see how easily you can compute crossed features with Merlin Models. Note: some feature combinations might not add information to the model, for example the feature cross between the item id and item category, as every item only maps to a single item category. You can explicitly ignore those combinations to reduce a bit the feature space. ```python # 2nd-level features interaction features_crossing = mm.SequentialBlock( mm.Filter(wide_schema), # Assuming max size of multi-hot features is 5 ml.AsDenseFeatures(max_seq_length=5), mm.HashedCrossAll( wide_schema, # The crossed features will be hashed to this number of bins num_bins=100, # Performs 2nd feature interactions, typically max is 3rd level max_level=2, output_mode="multi_hot", sparse=True, ignore_combinations=[["item_id", "item_category"], ["item_id", "item_brand"]] ), ) model = ml.WideAndDeepModel( schema, wide_schema=wide_schema, deep_schema=deep_schema, wide_preprocess=ml.ParallelBlock( [ one_hot_encoding, multi_hot_encoding, features_crossing ], aggregation="concat", ), deep_block=ml.MLPBlock([32, 16]), prediction_tasks=ml.BinaryOutput("click"), ) ``` 5. On Wide&Deep paper [1] they proposed usage of separate optimizers for dense (AdaGrad) and sparse embeddings parameters (FTRL). You can implement that by using `MultiOptimizer` class. For example: ```python wide_model = model.blocks[0].parallel_layers["wide"] deep_model = model.blocks[0].parallel_layers["deep"] multi_optimizer = ml.MultiOptimizer( default_optimizer="adagrad", optimizers_and_blocks=[ ml.OptimizerBlocks("ftrl", wide_model), ml.OptimizerBlocks("adagrad", deep_model), ], ) ```
References
[1] Cheng, Koc, Harmsen, Shaked, Chandra, Aradhye, Anderson et al. “Wide & deep learning for recommender systems.” In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7-10. (2016).
- Parameters
schema (Schema) – The Schema with the input features
deep_block (Block) – Block (structure) of deep model.
wide_schema (Optional[Schema]) – The ‘Schema’ of input features for wide model, by default no features would be sent to wide model, and the model would become a pure deep model, if specified, only features in wide_schema would be sent to wide model
deep_schema (Optional[Schema]) – The ‘Schema’ of input features for deep model, by default all features would be sent to deep model. deep_schema and wide_schema could contain the same features
wide_preprocess (Optional[Block]) – Transformation block for preprocess data in wide model, such as CategoryEncoding, HashedCross, and HashedCrossAll. Please note the schema of transformation block should be the same as the wide_schema. See example usages. If wide_schema is provided and wide_preprocess, the CategoryEncoding transformation is used by default for one-hot encoding.
deep_input_block (Optional[Block]) – The input block to be used by the deep part. It not provided, it is created internally by using the deep_schema. Defaults to None.
wide_input_block (Optional[Block]) – The input block to be used by the wide part. It not provided, it is created internally by using the wide_schema. Defaults to None.
deep_regularizer (Optional[RegularizerType]) – Regularizer function applied to the last layer kernel weights matrix and biases of the MLP layer of the wide part. Defaults to None.
wide_regularizer (Optional[RegularizerType]) – Regularizer function applied to the last layer kernel weights matrix and biases of the last MLP layer of deep part). Defaults to None.
deep_dropout (Optional[float]) – The dropout to be used by the last layer of deep part. Defaults to None.
wide_dropout (Optional[float]) – The dropout to be used by the last layer of wide part. Defaults to None.
prediction_tasks (Optional[Union[PredictionTask,List[PredictionTask],) – ParallelPredictionBlock,ModelOutputType] The prediction tasks to be used, by default this will be inferred from the Schema. For custom prediction tasks we recommending using OutputBlock and blocks based on ModelOutput than the ones based in PredictionTask (that will be deprecated).
- Return type
Model