[1]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
Getting Started MovieLens: Download and Convert
MovieLens25M
MovieLens25M is a popular dataset for recommender systems and is used in academic publications. The dataset contains 25M movie ratings for 62,000 movies given by 162,000 users. Many projects use only the user/item/rating information of MovieLens, but the original dataset also provides metadata for the movies, for example, the genres each movie belongs to. Although the metadata may not improve the state-of-the-art results of our neural network architecture, we will use it to show how to encode multi-hot features.
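Before we start, here is a minimal sketch of what multi-hot encoding means (illustrative only; the vocabulary and helper below are hypothetical and not part of this notebook's pipeline): each movie maps to a 0/1 vector with one slot per genre, and several slots can be active at once.

# Hypothetical toy vocabulary built from a handful of genres
vocab = {g: i for i, g in enumerate(sorted(["Adventure", "Animation", "Children", "Comedy", "Fantasy"]))}

def multi_hot(movie_genres, vocab):
    # One slot per genre in the vocabulary; flip every slot whose genre the movie has
    vec = [0] * len(vocab)
    for genre in movie_genres:
        vec[vocab[genre]] = 1
    return vec

multi_hot(["Adventure", "Comedy"], vocab)  # -> [1, 0, 0, 1, 0]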
Getting Started
[2]:
# External dependencies
import os
import cudf # cuDF is an implementation of Pandas-like Dataframe on GPU
import time
import gc
from os import path
from sklearn.model_selection import train_test_split
[3]:
cudf.__version__
[3]:
'0.18.0'
We define our base directory, which will contain the data.
[4]:
BASE_DIR = os.path.expanduser('~/nvt-examples/movielens/data/')  # expand '~' explicitly; os.path functions do not expand it
If the data is not available in the base directory, we download and unzip it.
[5]:
if not path.exists(os.path.join(BASE_DIR, 'ml-25m')):
    os.makedirs(BASE_DIR, exist_ok=True)
    zip_path = os.path.join(BASE_DIR, 'ml-25m.zip')
    if not path.exists(zip_path):
        # wget downloads into the current directory; move the archive into BASE_DIR
        os.system("wget http://files.grouplens.org/datasets/movielens/ml-25m.zip")
        os.system("mv ml-25m.zip " + BASE_DIR)
    os.system("unzip " + zip_path + " -d " + BASE_DIR)
Convert the dataset
First, we take a look at the movie metadata.
[6]:
movies = cudf.read_csv(os.path.join(BASE_DIR, 'ml-25m/movies.csv'))
movies.head()
[6]:
|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
We can see that genres is a multi-hot categorical feature, with a different number of genres per movie. Currently, genres is a single string, and we want to split it into a list of strings. In addition, we drop the title column.
[7]:
movies['genres'] = movies['genres'].str.split('|')
movies = movies.drop('title', axis=1)
movies.head()
[7]:
|   | movieId | genres |
|---|---|---|
| 0 | 1 | [Adventure, Animation, Children, Comedy, Fantasy] |
| 1 | 2 | [Adventure, Children, Fantasy] |
| 2 | 3 | [Comedy, Romance] |
| 3 | 4 | [Comedy, Drama, Romance] |
| 4 | 5 | [Comedy] |
We save the movie genres in Parquet format so that NVTabular can use them in the next notebook.
[8]:
movies.to_parquet(os.path.join(BASE_DIR, "ml-25m", "movies_converted.parquet"))
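As an optional sanity check (not required for the next notebook), we can read the file back to confirm that the list-typed genres column survives the Parquet round trip; this assumes the installed cuDF version supports reading list columns from Parquet:

# Read the converted file back and inspect the genres column
check = cudf.read_parquet(os.path.join(BASE_DIR, "ml-25m", "movies_converted.parquet"))
check.head()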
Splitting into train and validation datasets
We load the movie ratings.
[9]:
ratings = cudf.read_csv(os.path.join(BASE_DIR, "ml-25m", "ratings.csv"))
ratings.head()
[9]:
|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 296 | 5.0 | 1147880044 |
| 1 | 1 | 306 | 3.5 | 1147868817 |
| 2 | 1 | 307 | 5.0 | 1147868828 |
| 3 | 1 | 665 | 5.0 | 1147878820 |
| 4 | 1 | 899 | 3.5 | 1147868510 |
We drop the timestamp column and split the ratings into a training and a validation dataset, using a simple random split.
[10]:
ratings = ratings.drop('timestamp', axis=1)
train, valid = train_test_split(ratings, test_size=0.2, random_state=42)
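As an aside, if you prefer to keep the split entirely on the GPU, RAPIDS cuML ships its own train_test_split with a compatible interface. A minimal sketch, assuming cuML is installed alongside cuDF (the module path may differ between RAPIDS releases); it is not used in the rest of the notebook:

# Optional GPU-native alternative to sklearn's splitter (assumes cuML is installed)
from cuml.model_selection import train_test_split as gpu_train_test_split

# Same 80/20 random split as above, but executed on the GPU
train, valid = gpu_train_test_split(ratings, test_size=0.2, random_state=42)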
We save both datasets to disk.
[11]:
train.to_parquet(os.path.join(BASE_DIR, "train.parquet"))
valid.to_parquet(os.path.join(BASE_DIR, "valid.parquet"))