Training Recommender Systems on Multi-modal Data


Recommender systems are often trained on tabular data containing numeric fields (such as item price or the number of purchases a user has made) and categorical fields (such as user and item IDs).

Multi-modal data refers to data in other modalities, such as text, images, and video. Such data can provide additional rich inputs to recommender systems and potentially improve their effectiveness.

Examples include:

  • Movie recommendation, where the movie poster, plot, and synopsis can be used.

  • Music recommendation, where audio features and lyrics can be used.

  • Itinerary planning and attraction recommendation, where text (user profiles, attraction descriptions, and reviews) and photos can be used.

Often, features are extracted from multi-modal data using domain-specific networks, such as ResNet for images and BERT for text. These extracted features, also called pretrained embeddings, are then combined with other trainable features and embeddings for the recommendation task.
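To make the idea of combining pretrained and trainable embeddings concrete, here is a minimal sketch in plain Python. It is not the method used in the notebooks: the random vectors stand in for features that a network such as ResNet or BERT would produce offline, and all names (`pretrained_item_features`, `item_id_embeddings`, `item_representation`, `score`) and dimensions are hypothetical. Concatenation followed by a dot product is only one simple way to combine the two kinds of embedding.

```python
import random

random.seed(0)

NUM_ITEMS = 5
PRETRAINED_DIM = 8  # assumed size of an extracted image/text feature vector
ID_DIM = 4          # assumed size of the trainable item-ID embedding


def random_vector(dim):
    """Return a list of `dim` Gaussian samples (placeholder for real values)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# Stand-in for features extracted offline by a pretrained network.
# In a real pipeline these stay frozen while the rest of the model trains.
pretrained_item_features = [random_vector(PRETRAINED_DIM) for _ in range(NUM_ITEMS)]

# Trainable embedding table; a real model would update it by gradient descent.
item_id_embeddings = [random_vector(ID_DIM) for _ in range(NUM_ITEMS)]


def item_representation(item_id):
    """Concatenate the trainable ID embedding with the frozen pretrained features."""
    return item_id_embeddings[item_id] + pretrained_item_features[item_id]


def score(user_vector, item_id):
    """Dot product between a user vector and the combined item vector."""
    item_vector = item_representation(item_id)
    return sum(u * v for u, v in zip(user_vector, item_vector))


# A hypothetical user vector with the same size as the combined item vector.
user_vector = random_vector(ID_DIM + PRETRAINED_DIM)

# Rank items from highest to lowest score for this user.
ranking = sorted(range(NUM_ITEMS), key=lambda i: score(user_vector, i), reverse=True)
print(ranking)
```

In practice the combined vector would typically feed into further trainable layers instead of a bare dot product, but the split between a frozen table and a trainable table is the same.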

This series of notebooks demonstrates the use of multi-modal data (text and images) for movie recommendation, using the MovieLens 25M dataset.