# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#     http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

Training Recommender Systems on Multi-modal Data


Recommender systems are often trained on tabular data containing numeric fields (such as item price or the number of a user’s purchases) and categorical fields (such as user and item IDs).

Multi-modal data refers to data in other modalities, such as text, images, and video. Such data can provide additional rich inputs to recommender systems and potentially improve their effectiveness.

Several examples include:

  • Movie recommendation, where the movie poster, plot, and synopsis can be used.

  • Music recommendation, where audio features and lyrics can be used.

  • Itinerary planning and attraction recommendation, where text (user profiles, attraction descriptions and reviews) and photos can be used.

Often, features are extracted from multi-modal data using domain-specific networks, such as ResNet for images and BERT for text. These extracted features, also called pretrained embeddings, are then combined with other trainable features and embeddings for the recommendation task.
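As a rough illustration of how pretrained embeddings might be combined with trainable ones, the sketch below concatenates a trainable item-ID embedding with frozen text and image vectors for a batch of items. Everything in it is an assumption made for illustration: the dimensions, the array names, the `item_features` helper, and the random arrays that stand in for real BERT and ResNet outputs. It is not the pipeline used in these notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this example.
num_items, id_dim, text_dim, image_dim = 5, 8, 16, 12

# Stand-ins for frozen pretrained features. In practice these would be
# produced once by a text model (e.g. BERT) and an image model (e.g. ResNet).
pretrained_text = rng.normal(size=(num_items, text_dim))
pretrained_image = rng.normal(size=(num_items, image_dim))

# Trainable item-ID embedding table, randomly initialized.
item_id_embedding = rng.normal(scale=0.1, size=(num_items, id_dim))


def item_features(item_ids):
    """Concatenate trainable ID embeddings with frozen pretrained embeddings."""
    item_ids = np.asarray(item_ids)
    return np.concatenate(
        [
            item_id_embedding[item_ids],
            pretrained_text[item_ids],
            pretrained_image[item_ids],
        ],
        axis=1,
    )


batch = item_features([0, 3, 4])
print(batch.shape)  # (3, 36)
```

During training, only the ID embedding table (and any layers on top of the concatenated vector) would be updated, while the pretrained vectors stay fixed. Concatenation is just one simple way to combine them.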

This series of notebooks demonstrates the use of multi-modal data (text and images) for movie recommendation, using the MovieLens 25M dataset.