Contributing to HugeCTR

Overview of Contributing to HugeCTR

We’re grateful for your interest in HugeCTR and value your contributions. You can contribute to HugeCTR by:

  • submitting a feature, documentation, or bug request.

    NOTE: After we review your request, we’ll assign it to a future release. If you think the issue should be prioritized over others, comment on the issue.

  • proposing and implementing a new feature.

    NOTE: Once we agree to the proposed design, you can go ahead and implement the new feature using the steps outlined in the Contribute New Code section.

  • implementing a pending feature or fixing a bug.

    NOTE: Use the steps outlined in the Contribute New Code section. If you need more information about a particular issue, add your comments on the issue.

Contribute New Code

  1. Build HugeCTR or Sparse Operation Kit (SOK) from source using the steps outlined in the Set Up the Development Environment With Merlin Containers section.

  2. File an issue and add a comment stating that you’ll work on it.

  3. Start coding.

    NOTE: Don’t forget to add or update the unit tests properly.

  4. Create a pull request for your work.

  5. Wait for a maintainer to review your code.

    You may be asked to make additional edits to your code if necessary. Once approved, a maintainer will merge your pull request.

If you have any questions or need clarification, don’t hesitate to add comments to your issue and we’ll respond promptly.
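
As a rough illustration of steps 3 and 4 above, the sketch below creates a working branch and later pushes it so that a pull request can be opened from it. The issue number, branch name, and remote name are hypothetical placeholders, and the sketch assumes you have a remote you can push to (for example, your own fork). The `run` helper is illustrative only: it prints each command and executes it only when `DRY_RUN=0`.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical placeholders: use the number of the issue you filed
# and the name of a remote you can push to.
ISSUE_ID="${ISSUE_ID:-0000}"
REMOTE="${REMOTE:-origin}"
BRANCH="issue-${ISSUE_ID}"

# Create a dedicated branch before you start coding.
start_branch() {
  run git checkout -b "$BRANCH"
}

# Stage, commit, and push your work once the code and unit tests are ready.
publish_branch() {
  run git add -A
  run git commit -m "Address issue #${ISSUE_ID}"
  run git push -u "$REMOTE" "$BRANCH"
}

start_branch
# ... edit the code and add or update the unit tests here ...
publish_branch
```

With the default `DRY_RUN=1`, the script only prints the git commands so that you can review them first. The pull request itself is then created from the pushed branch.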

How to Start Your Development

Set Up the Development Environment With Merlin Containers

We provide options to disable the installation of HugeCTR and HugeCTR Triton Backend in Merlin Dockerfiles so that our contributors can build the development environment (container) from them. To start developing, clone the HugeCTR code into this environment and build it.

Note: The following terminal messages are not errors if you are working in such a container.

groups: cannot find name for group ID 1007
I have no name!@56a762eae3f8:/hugectr

The Merlin CTR Dockerfile and the Merlin Tensorflow Dockerfile provide a set of arguments for setting up your HugeCTR development container.

The arguments and configurations in this example can be used when building any of these containers:

docker build --pull -t ${DST_IMAGE} -f ${DOCKER_FILE} --build-arg RELEASE=false --build-arg RMM_VER=vnightly --build-arg CUDF_VER=vnightly --build-arg NVTAB_VER=vnightly --build-arg HUGECTR_DEV_MODE=true --no-cache .

For RMM_VER, CUDF_VER, and NVTAB_VER, specify a release tag such as v1.0, or vnightly if you want to build with the head of the main branch. Specifying HUGECTR_DEV_MODE=true disables the HugeCTR installation.
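
The command above can be wrapped in a small script so that the image tag, the Dockerfile path, and the dependency version are set in one place. This is a minimal sketch, not an official build script: the default values for `DST_IMAGE` and `DOCKER_FILE` are hypothetical placeholders, `DEP_VER` is a name introduced here for illustration, and the `run` helper only prints the command unless `DRY_RUN=0` is set.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical placeholders: replace with your own image tag and Dockerfile path.
DST_IMAGE="${DST_IMAGE:-hugectr:devel}"
DOCKER_FILE="${DOCKER_FILE:-path/to/Dockerfile}"
# A release tag (for example v1.0), or vnightly for the head of the main branch.
DEP_VER="${DEP_VER:-vnightly}"

build_dev_image() {
  run docker build --pull -t "$DST_IMAGE" -f "$DOCKER_FILE" \
    --build-arg RELEASE=false \
    --build-arg RMM_VER="$DEP_VER" \
    --build-arg CUDF_VER="$DEP_VER" \
    --build-arg NVTAB_VER="$DEP_VER" \
    --build-arg HUGECTR_DEV_MODE=true \
    --no-cache .
}

build_dev_image
```

Run the script once with the default `DRY_RUN=1` to check the assembled command, then rerun it with `DRY_RUN=0` from the directory that serves as the build context.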

Docker CLI Quick Reference

$ docker build [<opts>] <path> | <URL>
               Build a new image from the source code at PATH
  -f, --file path/to/Dockerfile
               Path to the Dockerfile to use. Default: Dockerfile.
  --build-arg <varname>=<value>
               Name and value of a build argument defined with ARG
               Dockerfile instruction
  -t "<name>[:<tag>]"
               Repository names (and optionally with tags) to be applied
               to the resulting image
  --label <key>=<value>
               Set metadata for an image
  -q, --quiet  Suppress the build output and print the image ID on success
  --rm         Remove intermediate containers after a successful build

Build HugeCTR Training Container from Source

To build HugeCTR Training Container from source, do the following:

  1. Build the hugectr:devel image using the steps outlined in the Set Up the Development Environment With Merlin Containers section. Note that these instructions apply only to the Merlin CTR Dockerfile.

  2. Download the HugeCTR repository and the third-party modules that it relies on by running the following commands:

    $ git clone https://github.com/NVIDIA/HugeCTR.git
    $ cd HugeCTR
    $ git submodule update --init --recursive
    
  3. Build HugeCTR from scratch using one or any combination of the following options:

    • SM: You can use this option to build HugeCTR with a specific compute capability (-DSM=80) or multiple compute capabilities (-DSM="70;75"). The default compute capability is 70, which corresponds to the NVIDIA V100 GPU. For more information, refer to the Compute Capability table. Compute capability 60 is not supported for inference deployments. For more information, refer to the Quick Start for the HugeCTR backend of Triton Inference Server.

    • CMAKE_BUILD_TYPE: You can use this option to build HugeCTR with Debug or Release. When using Debug to build, HugeCTR will print more verbose logs and execute GPU tasks in a synchronous manner.

    • ENABLE_MULTINODES: You can use this option to build HugeCTR with multi-node support. This option is set to OFF by default. For more information, refer to the deep and cross network samples directory on GitHub.

    • ENABLE_INFERENCE: You can use this option to build HugeCTR in inference mode, which was designed for the inference framework. In this mode, an inference shared library will be built for the HugeCTR Backend. Only interfaces that support the HugeCTR Backend can be used. Therefore, you can’t train models in this mode. This option is set to OFF by default. To build the inference container, refer to the Build HugeCTR Inference Container from Source section.

    • ENABLE_HDFS: You can use this option to build HugeCTR together with HDFS to enable HDFS-related functions. Before setting this option to ON, make sure that you are using the hugectr:devel_train.with_hdfs container or that you have correctly built Hadoop on your system. This option is set to OFF by default.

    Here are some examples of how you can build HugeCTR using these build options:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 .. # Target is NVIDIA V100, with all other options left at their defaults.
    $ make -j && make install
    
    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Release -DSM="70;80" -DENABLE_MULTINODES=ON .. # Target is NVIDIA V100 / A100 with the multi-node mode on.
    $ make -j && make install
    
    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Debug -DSM="70;80" .. # Target is NVIDIA V100 / A100 with Debug mode.
    $ make -j && make install
    

    By default, HugeCTR is installed at /usr/local. However, you can use CMAKE_INSTALL_PREFIX to install HugeCTR to a non-default location:

    $ cmake -DCMAKE_INSTALL_PREFIX=/opt/HugeCTR -DSM=70 ..
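
The build options above can also be driven from shell variables, which helps when you switch between configurations often. The following is a sketch under stated assumptions rather than a supported build script: the variable names `BUILD_TYPE`, `MULTINODES`, `PREFIX`, and `JOBS` and the `run` helper are introduced here for illustration, and only the CMake options themselves come from the list above. It also passes an explicit job count to make, because GNU make places no limit on the number of parallel jobs when `-j` is given without a number, which can exhaust memory on smaller machines.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Illustrative variable names; the defaults mirror the values described above.
SM="${SM:-70}"
BUILD_TYPE="${BUILD_TYPE:-Release}"
MULTINODES="${MULTINODES:-OFF}"
PREFIX="${PREFIX:-/usr/local}"
# Cap parallel jobs at the number of available cores (falls back to 4).
JOBS="${JOBS:-$(nproc 2>/dev/null || echo 4)}"

# The body runs in a subshell so that `cd build` does not leak into the caller.
build_hugectr() (
  set -e
  run mkdir -p build
  run cd build
  run cmake -DCMAKE_BUILD_TYPE="$BUILD_TYPE" -DSM="$SM" \
    -DENABLE_MULTINODES="$MULTINODES" \
    -DCMAKE_INSTALL_PREFIX="$PREFIX" ..
  run make -j"$JOBS"
  run make install
)

build_hugectr
```

For example, setting `SM="70;80"` and `MULTINODES=ON` before calling `build_hugectr` corresponds to the second build example above. Keep `DRY_RUN=1` to inspect the commands, and set `DRY_RUN=0` from the root of the HugeCTR repository to execute them.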
    

Build HugeCTR Inference Container from Source

To build the HugeCTR inference container from source, do the following:

  1. Build the hugectr:devel_inference image using the steps outlined in the Set Up the Development Environment With Merlin Containers section. Note that these instructions apply only to the Merlin CTR Dockerfile.

  2. Download the HugeCTR repository and the third-party modules that it relies on by running the following commands:

    $ git clone https://github.com/NVIDIA/HugeCTR.git
    $ cd HugeCTR
    $ git submodule update --init --recursive
    
  3. Here is an example of how you can build the HugeCTR inference container using the build options:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Release -DSM="70;80" -DENABLE_INFERENCE=ON .. # Target is NVIDIA V100 / A100 with Inference mode ON.
    $ make -j && make install
    

Build Sparse Operation Kit (SOK) from Source

To build the Sparse Operation Kit component in HugeCTR, do the following:

  1. Build the hugectr:tf-plugin Docker image using the steps outlined in the Set Up the Development Environment With Merlin Containers section. Note that these instructions apply only to the Merlin Tensorflow Dockerfile.

  2. Download the HugeCTR repository by running the following command:

    $ git clone https://github.com/NVIDIA/HugeCTR.git hugectr
    
  3. Build and install libraries to the system paths by running the following commands:

    $ cd hugectr/sparse_operation_kit
    $ python setup.py install
    

    You can configure different environment variables when compiling SOK. Refer to this section for more details.