Scaling Large Datasets with Criteo

Criteo provides the largest publicly available dataset for recommender systems: 1 TB of uncompressed click logs containing 4 billion examples.

These examples demonstrate how to:

  • Use multiple GPUs and nodes with NVTabular for feature engineering.

  • Train recommender system models with Merlin Models for TensorFlow.

  • Train recommender system models with HugeCTR using multiple GPUs.

  • Run inference with Triton Inference Server and Merlin Models for TensorFlow.
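The multi-GPU feature-engineering step listed above can be sketched with NVTabular and Dask-CUDA roughly as follows. This is a minimal illustration, not code from the notebooks: the file glob, the column names (`C1`, `C2`, `I1`, `I2`), the worker count, and the output path are all placeholder assumptions, and it requires a machine with CUDA GPUs.

```python
# Hypothetical sketch: scale NVTabular preprocessing across local GPUs
# with a Dask-CUDA cluster. Paths and column names are illustrative.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

import nvtabular as nvt
from nvtabular import ops

# One Dask worker per GPU; n_workers=2 assumes a 2-GPU machine.
cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

# Categorical columns are encoded to contiguous integer ids;
# continuous columns are imputed and standardized.
cat_features = ["C1", "C2"] >> ops.Categorify()
cont_features = ["I1", "I2"] >> ops.FillMissing() >> ops.Normalize()

workflow = nvt.Workflow(cat_features + cont_features, client=client)

# Partitioned reads keep each chunk within GPU memory.
dataset = nvt.Dataset("day_*.parquet", part_size="128MB")

# fit() computes statistics (category mappings, means, stds) across
# all partitions; transform() applies them and writes Parquet output.
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(output_path="processed/")
```

Scaling to multiple nodes follows the same pattern: the `LocalCUDACluster` is replaced by a distributed Dask cluster, and the workflow code itself is unchanged.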

We recommend using the latest stable Merlin containers for these examples. Each notebook lists the container it requires.
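As a sketch of how such a container is typically launched, the command below pulls and runs a Merlin image from the NVIDIA NGC catalog. The image name and `latest` tag are illustrative; check NGC and the notebook you are running for the exact image and tag, and adjust the published ports and mounted directory to your setup.

```shell
# Run a Merlin TensorFlow container with GPU access (image/tag are
# examples -- consult the NGC catalog for the current stable release).
docker run --gpus all --rm -it \
  -p 8888:8888 \
  -v "$(pwd)":/workspace \
  nvcr.io/nvidia/merlin/merlin-tensorflow:latest
```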

Explore the following notebooks:

  • Training and Deployment with TensorFlow

  • Training with HugeCTR