NVTabular Example Notebooks

We have created a collection of Jupyter notebooks based on different datasets. These example notebooks demonstrate how to use NVTabular with TensorFlow, PyTorch, and HugeCTR. Each example provides additional information about NVTabular’s features.

If you’d like to create a full conda environment to run the example notebooks, do the following:

  1. Use the provided environment files to install the CUDA Toolkit (version 11.0 or 11.2).

  2. Clone the NVTabular repo and run the following commands from the root directory:

    conda env create -f=conda/environments/nvtabular_dev_cuda11.2.yml
    conda activate nvtabular_dev_11.2
    python -m ipykernel install --user --name=nvt
    pip install -e .
    jupyter notebook
    

    When opening a notebook, be sure to select nvt from the Kernel->Change Kernel menu.

Structure

The example notebooks are structured as follows and should be reviewed in this order:

  • 01-Download-Convert.ipynb: Demonstrates how to download the dataset and convert it into the correct format so that NVTabular can consume it.

  • 02-ETL-with-NVTabular.ipynb: Demonstrates how to execute the preprocessing and feature engineering pipeline (ETL) with NVTabular on the GPU (a minimal workflow sketch follows this list).

  • 03-Training-with-TF.ipynb: Demonstrates how to train a model with TensorFlow based on the ETL output.

  • 03-Training-with-PyTorch.ipynb: Demonstrates how to train a model with PyTorch based on the ETL output.

  • 03-Training-with-HugeCTR.ipynb: Demonstrates how to train a model with HugeCTR based on the ETL output.
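
To give a rough sense of what the 02-ETL notebooks do, a minimal NVTabular workflow looks like the sketch below. The column names and operator choices are illustrative placeholders, not the schema of any particular example:

    import nvtabular as nvt
    from nvtabular import ops

    # Illustrative columns; each example notebook defines its own schema.
    cat_features = ["userId", "movieId"] >> ops.Categorify()
    cont_features = ["price"] >> ops.FillMissing() >> ops.Normalize()

    # Combine the feature definitions into a single workflow.
    workflow = nvt.Workflow(cat_features + cont_features)

    train = nvt.Dataset("train.parquet")
    workflow.fit(train)                                 # compute statistics on the GPU
    workflow.transform(train).to_parquet("train_out")  # write the transformed data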

Available Example Notebooks

1. Getting Started with MovieLens

The MovieLens25M dataset is popular for recommender systems and is used in academic publications. Most users are familiar with this dataset, so this example notebook focuses primarily on the basic concepts of NVTabular, which include the following (a short dataloader sketch follows the list):

  • Learning NVTabular with NVTabular’s high-level API

  • Using single-hot/multi-hot categorical input features with NVTabular

  • Using the NVTabular dataloader with a TensorFlow Keras model

  • Using the NVTabular dataloader with PyTorch
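
As a taste of the dataloader topics above, here is a minimal sketch of the NVTabular TensorFlow dataloader (the import path is the one used by the NVTabular versions these notebooks target); the path, column names, and batch size are placeholders rather than the notebook’s actual values:

    from nvtabular.loader.tensorflow import KerasSequenceLoader

    # Placeholder path and columns; the notebook uses the MovieLens schema,
    # where "genres" is a multi-hot (list) column.
    train_loader = KerasSequenceLoader(
        "train.parquet",
        batch_size=65536,
        label_names=["rating"],
        cat_names=["userId", "movieId", "genres"],
        cont_names=[],
        shuffle=True,
    )

    features, labels = next(iter(train_loader))  # one GPU-resident batch
    # The loader can also be passed directly to tf.keras Model.fit().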

2. Advanced Ops with Outbrain

The Outbrain dataset is based on a Kaggle competition in which Kagglers were challenged to predict which ads and other forms of sponsored content their global users would click. This example notebook demonstrates how to use the available NVTabular operators, write a custom operator, and train a Wide&Deep model with the NVTabular dataloader in TensorFlow (a custom-operator sketch follows below).
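
For context on the custom-operator part: the lightest way to inject your own logic into an NVTabular pipeline is a Python callable, which the >> operator wraps as a LambdaOp. The column name and transformation below are illustrative; the notebook’s own operator is more involved:

    import nvtabular as nvt
    from nvtabular import ops

    # A callable on the right-hand side of >> becomes a LambdaOp that is
    # applied column-wise on the GPU. "geo_location" is an illustrative column.
    country = (
        ["geo_location"]
        >> (lambda col: col.str.slice(0, 2))
        >> ops.Rename(postfix="_country")
    )

    workflow = nvt.Workflow(country)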

3. Scaling Large Datasets with Criteo

Criteo provides the largest publicly available dataset for recommender systems: 1TB of uncompressed click logs containing 4 billion examples. This example notebook demonstrates how to scale NVTabular, use multiple GPUs and multiple nodes with NVTabular for ETL, and train a recommender system model with the NVTabular dataloader for PyTorch. A sketch of the multi-GPU setup follows.
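
As orientation, scaling NVTabular across GPUs mostly amounts to attaching a Dask-CUDA cluster to the workflow. This is a minimal single-node sketch; the paths, columns, and partition size are placeholders, and exactly how the client is attached can vary between NVTabular versions:

    import glob

    import nvtabular as nvt
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster()  # one Dask worker per visible GPU
    client = Client(cluster)

    # part_size controls how much data each worker processes at a time.
    dataset = nvt.Dataset(
        glob.glob("criteo/*.parquet"), engine="parquet", part_size="1GB"
    )

    features = ["C1", "C2"] >> nvt.ops.Categorify()  # illustrative columns
    workflow = nvt.Workflow(features, client=client)
    workflow.fit(dataset)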

4. Multi-GPU with MovieLens

In the Getting Started with MovieLens example, we explain the fundamentals of NVTabular and its dataloader, HugeCTR, and Triton Inference. With this example, we revisit the same dataset but demonstrate how to perform multi-GPU training with the NVTabular dataloader in TensorFlow; a sketch of one common data-parallel setup follows.
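
One common data-parallel recipe for this kind of Keras training uses Horovod; treat the following as a sketch of that general pattern, not a verbatim excerpt from the notebook, and note that the model and dataloader are stand-ins:

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU.
    gpus = tf.config.list_physical_devices("GPU")
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Stand-in model; the notebook builds an embedding-based Keras model.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])

    # Scale the learning rate by the number of workers and wrap the optimizer.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="binary_crossentropy")

    # Launch with: horovodrun -np <num_gpus> python train.py
    model.fit(
        train_loader,  # placeholder for the NVTabular Keras dataloader
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    )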

5. Winning Solution of the RecSys2020 Competition

Twitter provided a dataset for the RecSys2020 challenge. The goal was to predict user engagement based on 200M user-tweet pairs. This example notebook demonstrates how to use NVTabular’s available operators for feature engineering and train an XGBoost model on the GPU with Dask; a training sketch follows.
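
The GPU XGBoost-with-Dask pattern referred to here looks roughly as follows; the file path and column names are placeholders for the features the notebook engineers with NVTabular:

    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    client = Client(LocalCUDACluster())

    # Placeholder path and label column.
    ddf = dd.read_parquet("features/*.parquet")
    X, y = ddf.drop(columns=["label"]), ddf["label"]

    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    params = {
        "objective": "binary:logistic",
        "tree_method": "gpu_hist",  # histogram algorithm on the GPU
        "eval_metric": "logloss",
    }
    result = xgb.dask.train(client, params, dtrain, num_boost_round=100)
    booster = result["booster"]  # trained model; result["history"] holds metrics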

6. Applying the Techniques to other Tabular Problems with Rossmann

Rossmann operates over 3,000 drug stores across seven European countries. Kaggle hosted this dataset as a competition: given historical sales data for 1,115 Rossmann stores, the goal is to forecast the Sales column for the test set.

Running the Example Notebooks

You can run the example notebooks by installing NVTabular and the other required libraries. Alternatively, Docker containers with these libraries pre-installed are available at http://ngc.nvidia.com/catalog/containers/. Depending on which example you want to run, use one of these Docker containers:

  • Merlin-Tensorflow-Training (contains NVTabular with TensorFlow)

  • Merlin-Pytorch-Training (contains NVTabular with PyTorch)

  • Merlin-Training (contains NVTabular with HugeCTR)

  • Merlin-Inference (contains NVTabular with TensorFlow and Triton Inference support)

To run the example notebooks using Docker containers, do the following:

  1. Pull and start the container by running the following command:

    docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host <docker container> /bin/bash
    

    NOTES:

    • If you are running Docker version 19.03 or later, change --runtime=nvidia to --gpus all.

    • If you are running the Getting Started with MovieLens, Advanced Ops with Outbrain, or Tabular Problems with Rossmann example notebooks, add -v ${PWD}:/root/ to the docker command above. ${PWD} expands to your current working directory on the host; mount this same directory into the Merlin-Inference container if you would like to run the inference example. Follow the instructions for starting and launching the Triton server that are given in the inference notebooks.

    • If you are running the Training-with-HugeCTR notebooks, add --cap-add SYS_NICE to the docker run command to suppress the set_mempolicy: Operation not permitted warnings.

The container opens a shell when the docker run command completes. You will need to start JupyterLab inside the container from a prompt that looks similar to this:

    root@2efa5b50b909:

  2. Install JupyterLab with pip by running the following command:

    pip install jupyterlab
    

    For more information, see the JupyterLab Installation Guide.

  3. Start the jupyter-lab server by running the following command:

    jupyter-lab --allow-root --ip='0.0.0.0' --NotebookApp.token='<password>'
    
  4. Open any browser to access the jupyter-lab server at <host IP>:8888 (for example, http://localhost:8888 when running locally).

  5. Once in the server, navigate to the /nvtabular/ directory and try out the examples.