Release Notes
What’s New in Version 23.12
Lock-free Inference Cache in HPS
We have added a new lock-free GPU embedding cache for the hierarchical parameter server, which can further improve the performance of embedding table lookup in inference. It does not lead to data inconsistency even when concurrent model updates or missing key insertions are in use, because we ensure cache consistency through an asynchronous stream synchronization mechanism. To enable the lock-free GPU embedding cache, a user only needs to set `"embedding_cache_type"` to `dynamic` and `"use_hctr_cache_implementation"` to `false`.
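A minimal sketch of what such an HPS configuration might look like; only `embedding_cache_type` and `use_hctr_cache_implementation` come from this release note, while the model name, paths, and other keys are illustrative placeholders:

```python
import json

# Hypothetical HPS configuration; only the two cache-related knobs below are
# taken from this release note, everything else is a placeholder.
hps_config = {
    "supportlonglong": True,
    "models": [{
        "model": "demo_model",                        # placeholder model name
        "sparse_files": ["demo_model/sparse.model"],  # placeholder path
        "embedding_cache_type": "dynamic",            # enables the lock-free GPU embedding cache
        "use_hctr_cache_implementation": False,       # opt out of the legacy HCTR cache implementation
    }],
}

with open("hps_config.json", "w") as f:
    json.dump(hps_config, f, indent=2)
```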
Official SOK Release
SOK is no longer an experimental package; it is now officially supported by HugeCTR:
Do `import sparse_operation_kit as sok` instead of `from sparse_operation_kit import experiment as sok`.
`sok.DynamicVariable` supports Merlin-HKV as its backend.
Parallel dump and load functions have been added.
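A brief sketch of the new import path and an HKV-backed dynamic variable; the constructor arguments shown are assumptions for illustration and should be checked against the SOK documentation:

```python
import sparse_operation_kit as sok  # official import path as of this release

sok.init()

# Hypothetical usage: a dynamic variable backed by Merlin-HKV.
# The "var_type" value and other arguments are assumptions for illustration.
v = sok.DynamicVariable(dimension=16, var_type="hybrid", initializer="uniform")
```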
Code Cleaning and Deprecation
Deprecated the `Model::export_predictions` function. Use the `Model::check_out_tensor` function instead.
Deprecated the `Norm` and legacy `Raw` DataReaders. Use `hugectr.DataReaderType_t.RawAsync` or `hugectr.DataReaderType_t.Parquet` as their alternatives.
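For users migrating off the deprecated readers, a sketch of selecting the Parquet reader through the Python interface; the file-list paths and cardinalities are placeholders:

```python
import hugectr

# Placeholder paths; substitute your own dataset file lists.
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,  # alternative to the deprecated Norm/Raw readers
    source=["./train/_file_list.txt"],
    eval_source="./val/_file_list.txt",
    slot_size_array=[10000] * 26,  # illustrative per-slot cardinalities
    check_type=hugectr.Check_t.Non,
)
```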
Issues Fixed:
Improved the performance of the HKV lookup via the SOK.
Fixed an illegal memory access issue in the SOK backward pass that occurred in a corner case.
Resolved the mean combiner returning zeroes when the pooling factor is zero, which could make the SOK lookup return NaN.
Fixed some dependency-related build issues.
Optimized the performance of the dynamic embedding table (DET) in the SOK.
Fixed a crash that occurred when a user specified negative keys while using the DET via the SOK.
Resolved an occasional correctness issue in the SOK backward propagation phase when handling thousands of embedding tables.
Removed runtime errors that occurred with TensorFlow >= 2.13.
Known Issues:
If you set `max_eval_batches` and `batchsize_eval` to large values such as 5000 and 12000, respectively, the training process can trigger an illegal memory access error. The issue comes from CUB and is fixed in its latest version. However, because that version is only included in CUDA 12.3, which our NGC container does not use yet, please rebuild HugeCTR with the newest CUB as a workaround until we update our NGC container to rely upon that version of CUDA. Otherwise, try to avoid such large `max_eval_batches` and `batchsize_eval` values.
HugeCTR can lead to a runtime error if client code calls RMM’s `rmm::mr::set_current_device_resource()`, because HugeCTR’s Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and it becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, as it could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 23.11
Code Cleaning and Deprecation
The offline inference has been deprecated from our documentation, notebook suite, and code. Please check out the HPS plugin for TensorFlow and TensorRT. Multi-GPU inference is now illustrated in the HPS TRT notebook.
We are working on deprecating the Embedding Training Cache (ETC). If you try using that feature, it still works but emits a deprecation warning message. In a near-future release, it will be removed from the API and code. Please refer to NVIDIA HierarchicalKV as an alternative.
In this release, we have also cleaned up our C++ code and CMakeLists.txt to improve maintainability and fix minor potential issues. More code cleanup will follow in future releases.
General Updates:
Enabled support for the static CUDA runtime. You can now experimentally enable the feature by specifying `-DUSE_CUDART_STATIC=ON` when configuring the code with CMake (for example, `cmake -DUSE_CUDART_STATIC=ON ..`); the dynamic CUDA runtime is still used by default.
Added HPS as a custom extension for TorchScript. A user can leverage the HPS embedding lookup during the inference of a scripted torch module.
Issues Fixed:
Resolved a couple of performance regressions that appeared when the SOK is used together with HKV, related to the unique operation and unified memory.
Reduced the unnecessary memory consumption of intermediate buffers when loading and dumping SOK embeddings.
Fixed the Interaction Layer to support a large `num_slots`.
Resolved an occasional runtime error when using multiple H800 GPUs.
Known Issues:
If you set `max_eval_batches` and `batchsize_eval` to large values such as 5000 and 12000, respectively, the training process can trigger an illegal memory access error. The issue comes from CUB and is fixed in its latest version. However, because that version is only included in CUDA 12.3, which our NGC container does not use yet, please rebuild HugeCTR with the newest CUB as a workaround until we update our NGC container to rely upon that version of CUDA. Otherwise, try to avoid such large `max_eval_batches` and `batchsize_eval` values.
HugeCTR can lead to a runtime error if client code calls RMM’s `rmm::mr::set_current_device_resource()`, because HugeCTR’s Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and it becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, as it could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 23.08
Hierarchical Parameter Server:
Support for static EC fp8 quantization: We now support fp8 quantization in the static cache. HPS performs fp8 quantization on the embedding vector when reading the embedding table if the `fp8_quant` configuration is enabled, and performs fp32 dequantization on the embedding vector corresponding to the queried embedding key in the static embedding cache, so as to ensure the accuracy of the dense part prediction.
Large model deployment demo based on the HPS TensorRT plugin: This demo shows how to use the HPS TRT plugin to build a complete TRT engine for deploying a 147 GB embedding table based on a 1 TB Criteo dataset. We also provide a static embedding implementation for fully offloading embedding tables to host page-locked memory for benchmarks on x86 and Grace Hopper Superchip.
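A hedged sketch of what enabling the quantization might look like in an HPS model entry; apart from the static cache type and `fp8_quant`, which this note names, the keys and values are placeholders:

```python
import json

# Hypothetical model entry; "fp8_quant" and the static embedding cache come
# from this release note, the rest is illustrative.
model_entry = {
    "model": "demo_model",
    "sparse_files": ["demo_model/sparse.model"],
    "embedding_cache_type": "static",
    "fp8_quant": True,  # quantize embeddings to fp8 on load, dequantize to fp32 on lookup
}
print(json.dumps(model_entry, indent=2))
```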
Issues Fixed
Resolved a Kafka update ingestion error that prevented handing over online parameter updates coming from Kafka message queues to Redis database backends.
Fixed an issue where the HPS Triton backend re-initialized the embedding cache, caused by an undefined null when getting the embedding cache on the corresponding device.
HugeCTR Training & SOK:
Dense Embedding Support in Embedding Collection: We added dense embedding support to the embedding collection. To use the dense embedding, a user just needs to specify `concat` as the combiner; see the sketch after this list. For more information, please refer to dense_embedding.py.
Refined the sequence mask layer and attention softmax layer to support cross-attention.
Introduced a more generalized reshape layer that allows users to reshape a source tensor to a destination tensor without dimension restrictions. Please refer to the Reshape Layer API for more detailed information.
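As referenced above, a hedged sketch of a concat-combiner lookup in the embedding collection Python API; the table configuration, names, and sizes are placeholders, and the exact keyword arguments should be verified against the embedding collection documentation:

```python
import hugectr

# Hypothetical embedding collection setup; names and sizes are placeholders.
table_config = hugectr.EmbeddingTableConfig(
    name="demo_table", max_vocabulary_size=100000, ev_size=16
)
ebc_config = hugectr.EmbeddingCollectionConfig()
ebc_config.embedding_lookup(
    table_config=table_config,
    bottom_name="data1",
    top_name="emb_out",
    combiner="concat",  # dense embedding: concatenate instead of pooling
)
```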
Issues Fixed
Fixed an error when using Localized Variable in the Sparse Operation Kit.
Fixed a bug in the Sparse Operation Kit backward computation.
Fixed some SOK performance bugs by replacing the calls to `DeviceSegmentedSort` with `DeviceSegmentedRadixSort`.
Fixed a bug on the SOK Python API side that led to duplicate calls to the model’s forward function and thus degraded the performance.
Reduced the CPU launch overhead:
Removed dynamic vector allocation in the DataDistributor.
Removed the use of the checkout value tensor from the DataReader. The data reader generated a nested std::vector on the fly and returned the vector to the embedding collection, which incurred a lot of host overhead. We have made it a class member so that the overhead is eliminated.
Aligned with the latest Parquet update: we fixed a bug caused by the `parquet_reader_options::set_num_rows()` update in cuDF 23.06 (PR).
Fixed a core23 assertion in debug mode: we fixed an assertion bug that occurred when the new core library is enabled and HugeCTR is built in debug mode.
General Updates:
Cleaned up logging code. Added compile-time format-string validation. Fixed an issue where HCTR_PRINT did not interpret format strings properly.
Enabled experimental support for the static CUDA runtime. Use `-DUSE_CUDART_STATIC=ON` when configuring the code with CMake.
Modified the data preprocessing documentation to clarify the correct commands to use in different situations. Fixed errors in the descriptions of arguments.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM’s `rmm::mr::set_current_device_resource()`, because HugeCTR’s Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and it becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, as it could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 23.06
In this release, we have fixed issues and enhanced the code.
3G Embedding Updates:
Refactored the `DataDistributor`-related code.
New SOK `load()` and `dump()` APIs are usable in TensorFlow 2. To use the API, specify `sok_vars` in addition to `path`. `sok_vars` is a list of `sok.variable` and/or `sok.dynamic_variable`. If you want to store optimizer states such as `m` and `v` of `Adam`, the `optimizer` must be specified as well. The `optimizer` must be a `tf.keras.optimizers.Optimizer` or `sok.OptimizerWrapper`, while their underlying type must be `SGD`, `Adamax`, `Adadelta`, `Adagrad`, or `Ftrl`.
```python
import sparse_operation_kit as sok

sok.load(path, sok_vars, optimizer=None)
sok.dump(path, sok_vars, optimizer=None)
```
These APIs are independent of the number of GPUs in use and the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
Issues Fixed:
Fixed the segmentation fault and wrong initialization that occurred when the embedding table fusion is enabled while using the HPS UVM implementation.
`cudaDeviceSynchronize()` is removed when building HugeCTR in debug mode, so you can enable the CUDA Graph even in debug mode.
Modified some notebooks to use the most recent version of the NGC container.
Fixed the `EmbeddingTableCollection` utest to run correctly with multiple GPUs.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM’s `rmm::mr::set_current_device_resource()`, because HugeCTR’s Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and it becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, as it could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and this GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 23.04
Hierarchical Parameter Server Enhancements:
HPS Table Fusion: From this release, you can fuse tables of the same embedding vector size in HPS. We support this feature in the HPS plugin for TensorFlow and the Triton backend for HPS. To turn on table fusion, set `fuse_embedding_table` to `true` in the HPS JSON file (see the sketch after this list). This feature requires that the key values in different tables do not overlap and that the embedding lookup layers are not dependent on each other in the model graph. For more information, refer to HPS configuration and the HPS table fusion demo notebook. This feature can reduce the embedding lookup latency significantly when there are multiple tables and the GPU embedding cache is employed. About a 3x speedup is achieved on V100 for the fused case demonstrated in the notebook compared to the unfused one.
UVM Support: We have upgraded the static embedding solution. For embedding tables whose size exceeds the device memory, we save high-frequency embeddings in the HBM as an embedding cache and offload the remaining embeddings to UVM. Compared with the dynamic cache solution that offloads the remaining embeddings to the volatile DB, the UVM solution has higher CPU lookup throughput. We will support online updating of the UVM solution in a future release. Users can switch between different embedding cache solutions through the `embedding_cache_type` configuration parameter.
Triton Perf Analyzer Request Generator: We have added an inference request generator to produce the JSON request format required by the Triton Perf Analyzer. By using this request generator together with the model generator, you can use the Triton Perf Analyzer to profile HPS performance and do stress testing. For API documentation and demo usage, please refer to the README.
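As referenced above, a hedged sketch of turning on table fusion in the HPS JSON file; only `fuse_embedding_table` comes from this note, and the remaining keys, names, and paths are placeholders:

```python
import json

# Hypothetical HPS configuration fragment; "fuse_embedding_table" is the knob
# described in this release note, everything else is illustrative.
hps_config = {
    "supportlonglong": True,
    "fuse_embedding_table": True,  # fuse tables that share an embedding vector size
    "models": [{
        "model": "demo_model",
        "sparse_files": ["demo_model/sparse0.model", "demo_model/sparse1.model"],
    }],
}

with open("hps_config.json", "w") as f:
    json.dump(hps_config, f, indent=2)
```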
General Updates:
DenseLayerComputeConfig: MLP and CrossLayer support asynchronous weight gradient computations with data gradient backpropagation during training. We have added a new member, `hugectr.DenseLayerComputeConfig`, to `hugectr.DenseLayer` for configuring the computing behavior (a sketch follows this list). The knob for enabling asynchronous weight gradient computations has been moved from `hugectr.CreateSolver` to `hugectr.DenseLayerComputeConfig.async_wgrad`. The knob for controlling the fusion mode of weight gradients and bias gradients has been moved from `hugectr.DenseLayerSwitchs` to `hugectr.DenseLayerComputeConfig.fuse_wb`.
Hopper Architecture Support: Users can build HugeCTR from scratch with compute capability 9.0 (`-DSM=90`) so that it can run on Hopper architectures. Note that our NGC container does not support this compute capability yet. Users who are unfamiliar with how to build HugeCTR can refer to the HugeCTR Contribution Guide.
RoCE Support for Hybrid Embedding: With the parameter `CommunicationType.IB_NVLink_Hier` in `HybridEmbeddingParams`, RoCE is supported. We have also added two environment variables, `HUGECTR_ROCE_GID` and `HUGECTR_ROCE_TC`, so that a user can control the RoCE NIC’s GID and traffic class. Refer to the HybridEmbeddingParam documentation: https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#hybridembeddingparam-class
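As referenced above, a hedged sketch of configuring the new compute behavior; the layer wiring and sizes are illustrative, and the exact constructor arguments of `DenseLayerComputeConfig` should be verified against the Python API documentation:

```python
import hugectr

# Assumed keyword arguments, mirroring the knobs named in this release note.
compute_config = hugectr.DenseLayerComputeConfig(
    async_wgrad=True,  # overlap weight-gradient computation with data-gradient backprop
    fuse_wb=False,     # fusion mode of weight gradients and bias gradients
)

# Hypothetical MLP layer that uses the compute config; names are placeholders.
layer = hugectr.DenseLayer(
    layer_type=hugectr.Layer_t.MLP,
    bottom_names=["dense"],
    top_names=["mlp1"],
    num_outputs=[1024, 512, 256],
    compute_config=compute_config,
)
```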
Documentation Updates:
Data Reader: We have enhanced our Raw data reader to read multi-hot input data, connecting with an embedding collection seamlessly. The raw dataset format is strengthened as well. Refer to our online documentation for more details. We have refined the description of the Norm dataset as well.
Embedding Collection: We have added the knob `is_exclusive_keys` to enable potential acceleration if a user has already preprocessed the input of the embedding collection to make the resulting tables exclusive with one another. We have also added the knob `comm_strategy` to the embedding collection so that users can configure an optimized communication strategy in multi-node training.
HPS Plugin: We have fixed the unit of measurement for DLRM inference benchmark results that leverage the HPS plugin. We have updated the user guide for the HPS plugin for TensorFlow and the HPS plugin for TensorRT.
Embedding Cache: We have updated the usage and descriptions of the three types of embedding cache.
Issues Fixed:
We added a slot emptiness check to prevent `SparseParam` from being misused.
We revised the MPI lifetime service into an MPI init service with slightly greater scope and a clearer interface. In this effort, we also fixed a rare bug that could lead to access violations during the MPI shutdown procedure.
We fixed a segmentation fault that occurred when a GPU has no embedding wgrad to update.
SOK build and runtime errors related to the TF version: We made the [SOK Experiment](https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/experiment) compatible with TensorFlow >= v2.11.0. The legacy SOK does not support that or newer versions of TensorFlow.
HPS required CPU memory to be at least 2.5x larger than the model size during its initialization. From this release, we parse the model embedding files in chunks and reduce the required memory to 1.3x the model size.
Known Issues:
HugeCTR can lead to a runtime error if client code calls RMM’s `rmm::mr::set_current_device_resource()`, because HugeCTR’s Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and it becomes visible to other libraries in the same process. Refer to [this issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/356). As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. But be cautious, as it could affect the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](https://github.com/NVIDIA-Merlin/HugeCTR/issues/243).
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 23.02
HPS Enhancements:
Enabled the HPS Tensorflow plugin.
Enabled the max_norm clipping for the HPS Tensorflow plugin.
Optimized the performance of HPS HashMap fetch.
Enabled the HPS Profiler.
Google Cloud Storage (GCS) Support:
Added support for Google Cloud Storage (GCS) for both training and inference. For more details, check out the GCS section in the training with remote file system notebook.
Issues Fixed:
Fixed a bug in the HPS static table that led to wrong results when the batch size is larger than 256.
Fixed a preprocessing issue in the `wdl_prediction` notebook.
Corrected how devices are set and managed in HPS and InferenceModel.
Fixed a debug build error.
Fixed a build error related to CUDA 12.0.
Fixed reported issues with the Multi-Process HashMap in a notebook, along with a couple of other minor issues.
Known Issues:
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 4.3
Important
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.3`. Afterward, the library will use calendar versioning only, such as `v23.12`.
Support for BERT and Variants: This release includes support for BERT in HugeCTR. The documentation includes updates to the MultiHeadAttention layer and adds documentation for the SequenceMask layer. For more information, refer to the samples/bst directory of the repository in GitHub.
HPS Plugin for TensorFlow integration with TensorFlow-TensorRT (TF-TRT): This release includes plugin support for integration with TensorFlow-TensorRT. For sample code, refer to the Deploy SavedModel using HPS with Triton TensorFlow Backend notebook.
Deep & Cross Network Layer version 2 Support: This release includes support for Deep & Cross Network version 2. For conceptual information, refer to https://arxiv.org/abs/2008.13535. The documentation for the MultiCross Layer is updated.
Enhancements to Hierarchical Parameter Server:
RedisClusterBackend now supports TLS/SSL communication. For sample code, refer to the Hierarchical Parameter Server Demo notebook. The notebook is updated with step-by-step instructions to show you how to set up HPS to use Redis with (and without) encryption. The Volatile Database Parameters documentation for HPS is updated with the `enable_tls`, `tls_ca_certificate`, `tls_client_certificate`, `tls_client_key`, and `tls_server_name_identification` parameters.
MultiProcessHashMapBackend includes a bug fix for an issue that prevented configuring the shared memory size when using JSON file-based configuration.
On-device input keys are supported now, so an extra host-to-device copy is removed to improve performance.
A dependency on the XX-Hash library is removed. The library is no longer used by HugeCTR.
Added static table support to the embedding cache. The static table is suitable when the embedding table can be placed entirely in GPU memory. In this case, the static table is more than three times faster than the embedding cache lookup. The static table does not support embedding updates.
Support for New Optimizers:
Added support for the SGD, Momentum SGD, Nesterov Momentum, AdaGrad, RMS-Prop, Adam, and FTRL optimizers for the dynamic embedding table (DET). For sample code, refer to the `test_embedding_table_optimizer.cpp` file in the `test/utest/embedding_collection/` directory of the repository on GitHub.
Added support for the FTRL optimizer for dense networks.
Data Reading from S3 for Offline Inference: In addition to reading during training, HugeCTR now supports reading data from remote file systems such as HDFS and S3 during offline inference by using the DataSourceParams API. The HugeCTR Training and Inference with Remote File System Example is updated to demonstrate the new functionality.
Documentation Enhancements:
The setup instructions for running the example notebooks are revised for clarity.
The example notebooks are also updated to show the use of a data preprocessing script that simplifies the user experience.
Documentation for the MLP Layer is new.
Several 2022 talks and blogs are added to the HugeCTR Talks and Blogs page.
Issues Fixed:
The original CUDA device with NUMA bind is now correctly restored after a call to some HugeCTR APIs. This issue sometimes led to a problem when you mixed calls to HugeCTR and other CUDA-enabled libraries.
Fixed an occasional CUDA kernel launch failure of embedding when HugeCTR is installed with the DEBUG macro.
Fixed an SOK build error that was related to TensorFlow v2.1.0 and higher. The issue was that the C++ API and C++ standard were updated to use C++17.
Fixed a CUDA 12-related compilation error.
Known Issues:
HugeCTR can lead to a runtime error if client code calls the RMM `rmm::mr::set_current_device_resource()` method. The error is due to the Parquet data reader in HugeCTR also calling `rmm::mr::set_current_device_resource()`, and the resource becomes visible to other libraries in the same process. Refer to GitHub issue #356 for more information. As a workaround, you can set the environment variable `HCTR_RMM_SETTABLE` to `0` to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called by client code other than HugeCTR. But be cautious because the setting can reduce the performance of Parquet reading.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and GitHub issue #243.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 4.2
Important
In January 2023, the HugeCTR team plans to deprecate semantic versioning, such as `v4.2`. Afterward, the library will use calendar versioning only, such as `v23.12`.
Change to HPS with Redis or Kafka: This release includes a change to the Hierarchical Parameter Server that affects deployments that use `RedisClusterBackend` or model parameter streaming with Kafka. A third-party library that was used for the HPS partition selection algorithm is replaced to improve performance. The new algorithm can produce different partition assignments for volatile databases. As a result, volatile database backends that retain data between application startups, such as the `RedisClusterBackend`, must be reinitialized. Model streaming with Kafka is equally affected. To avoid issues with updates, reset all respective queue offsets to the `end_offset` before you reinitialize the `RedisClusterBackend`.
Enhancements to the Sparse Operation Kit in DeepRec: This release includes updates to the Sparse Operation Kit to improve the performance of the embedding variable lookup operation in DeepRec. The API for the `lookup_sparse()` function is changed to remove the `hotness` argument. The `lookup_sparse()` function is enhanced to calculate the number of non-zero elements dynamically. For more information, refer to the sparse_operation_kit directory of the DeepRec repository in GitHub.
Enhancements to 3G Embedding: This release includes the following enhancements to 3G embedding:
The API is changed. The `EmbeddingPlanner` class is replaced with the `EmbeddingCollectionConfig` class. For examples of the API, see the tests in the test/embedding_collection_test directory of the repository in GitHub.
The API is enhanced to support dumping and loading weights during the training process. The methods are `Model.embedding_dump(path: str, table_names: list[str])` and `Model.embedding_load(path: str, table_names: list[str])`. The `path` argument is a directory in the file system that you can dump weights to or load weights from. The `table_names` argument is a list of embedding table names as strings.
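A minimal sketch of the new dump and load calls, using the signatures quoted above; the model object, directory, and table names are placeholders:

```python
# Assuming "model" is a built hugectr.Model and the named tables exist.
model.embedding_dump("./embedding_weights", ["table0", "table1"])  # dump weights during training
model.embedding_load("./embedding_weights", ["table0", "table1"])  # load them back later
```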
New Volatile Database Type for HPS: This release adds a `db_type` value of `multi_process_hash_map` to the Hierarchical Parameter Server. This database type supports sharing embeddings across process boundaries by using shared memory and the `/dev/shm` device file. Multiple processes running HPS can read and write to the same hash map. For an example, refer to the Hierarchical Parameter Server Demo notebook.
Enhancements to the HPS Redis Backend: In this release, the Hierarchical Parameter Server can open multiple connections in parallel to each Redis node. This enhancement enables HPS to take advantage of overlapped processing optimizations in the I/O module of Redis servers. In addition, HPS can now take advantage of Redis hash tags to co-locate embedding values and metadata. This enhancement can reduce the number of accesses to Redis nodes and the number of per-node round trip communications that are needed to complete transactions. As a result, the enhancement increases the insertion performance.
MLPLayer is New: This release adds an MLP layer with the `hugectr.Layer_t.MLP` class. This layer is very flexible and makes it easier to use a group of fused fully-connected layers and enable the related optimizations. For each fused fully-connected layer in `MLPLayer`, the output dimension, bias, and activation function are all adjustable. MLPLayer supports FP32, FP16, and TF32 data types. For an example, refer to dgx_a100_mlp.py in the `samples/dlrm` directory of the GitHub repository to learn how to use the layer; a sketch also follows this list.
Sparse Operation Kit installable from PyPI: Version `1.1.4` of the Sparse Operation Kit is installable from PyPI in the merlin-sok package.
Multi-task Model Support added to the ONNX Model Converter: This release adds support for multi-task models to the ONNX converter. This release also includes an enhancement to the preprocess_census.py script in the `samples/mmoe` directory of the GitHub repository.
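As referenced above, a hedged sketch of instantiating the MLP layer; the tensor names and layer sizes are placeholders, and the keyword arguments should be checked against the HugeCTR layer documentation:

```python
import hugectr

# Hypothetical fused MLP: three fully-connected layers with adjustable output
# dimensions; the names and sizes are illustrative only.
mlp = hugectr.DenseLayer(
    layer_type=hugectr.Layer_t.MLP,
    bottom_names=["dense_input"],
    top_names=["mlp_out"],
    num_outputs=[1024, 512, 256],        # one entry per fused fully-connected layer
    act_type=hugectr.Activation_t.Relu,  # activation applied in each layer
)
```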
Issues Fixed:
Using the HPS Plugin for TensorFlow with `MirroredStrategy` and running the Hierarchical Parameter Server Demo notebook triggered an issue with ReplicaContext and caused a crash. The issue is fixed and resolves GitHub issue #362.
The 4_nvt_process.py sample in the `samples/din/utils` directory of the GitHub repository is updated to use the latest NVTabular API. This update resolves GitHub issue #364.
An illegal memory access related to 3G embedding and the dgx_a100_ib_nvlink.py sample in the `samples/dlrm` directory of the GitHub repository is fixed.
An error in HPS with the `lookup_fromdlpack()` method is fixed. The error was related to calculating the number of keys and vectors from the corresponding DLPack tensors.
An error in the HugeCTR backend for Triton Inference Server is fixed. A crash was triggered when the initial size of the embedding cache is smaller than the allowed minimum size.
An error related to using a ReLU layer with an odd input size in mixed precision mode could trigger a crash. The issue is fixed.
An error related to using an asynchronous reader with the AsyncParam class and specifying an `io_alignment` value that is smaller than the block device sector size is fixed. Now, if the specified `io_alignment` value is smaller than the block device sector size, `io_alignment` is automatically set to the block device sector size.
Unreported memory leaks in the GRU layer and collectives are fixed.
Several broken documentation links related to HPS are fixed.
Known Issues:
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
What’s New in Version 4.1
Simplified Interface for 3G Embedding Table Placement Strategy: 3G embedding now provides an easier way for you to configure an embedding table placement strategy. Instead of using JSON, you can configure the embedding table placement strategy by using function arguments. You only need to provide the `shard_matrix`, `table_group_strategy`, and `table_placement_strategy` arguments. With these arguments, 3G embedding can group different tables together and place them according to the `shard_matrix` argument. For an example, refer to the dlrm_train.py file in the `test/embedding_collection_test` directory of the repository on GitHub. For comparison, refer to the same file from the v4.0 branch of the repository.
New MMoE and Shared-Bottom Samples: This release includes a new shared-bottom model, an example program, preprocessing scripts, and updates to documentation. For more information, refer to the `README.md`, `mmoe_parquet.py`, and other files in the `samples/mmoe` directory of the repository on GitHub. This release also includes a fix to the calculation and reporting of AUC for multi-task models, such as MMoE.
Support for the AWS S3 File System: The Parquet DataReader can now read datasets from the Amazon Web Services S3 file system. You can also load and dump models from and to S3 during training. The documentation for the `DataSourceParams` class is updated. To view sample code, refer to the HugeCTR Training with Remote File System Example.
Simplification for File System Usage: You no longer need to pass `DataSourceParams` for model loading and dumping. The `FileSystem` class automatically infers the correct file system type, local, HDFS, or S3, based on the path URI that you specified when you built the model. For example, the path `hdfs://localhost:9000/` is inferred as an HDFS file system and the path `https://mybucket.s3.us-east-1.amazonaws.com/` is inferred as an S3 file system.
Support for Loading Models from Remote File Systems to HPS: This release enables you to load models from HDFS and S3 remote file systems to HPS during inference. To use the new feature, specify an HDFS or S3 path URI in `InferenceParams`.
Support for Exporting Intermediate Tensor Values into a Numpy Array: This release adds the function `check_out_tensor` to `Model` and `InferenceModel`. You can use this function to check out the intermediate tensor values using the Python interface. This function is especially helpful for debugging. For more information, refer to `Model.check_out_tensor` and `InferenceModel.check_out_tensor`. A sketch follows this list.
On-Device Input Keys for HPS Lookup: The HPS lookup supports input embedding keys that are on GPU memory during inference. This enhancement removes a host-to-device copy by using the DLPack `lookup_fromdlpack()` interface. By using the interface, the input DLPack capsule of the embedding key can be a GPU tensor.
Documentation Enhancements:
The graphic for the Hierarchical Parameter Server library that shows the relationship to other software packages is enhanced.
The sample notebook for Deploy SavedModel using HPS with Triton TensorFlow Backend is added to the documentation.
Style updates to the Hierarchical Parameter Server API documentation.
Issues Fixed:
The `InteractionLayer` class is fixed so that it works correctly with `num_feas > 30`.
The cuBLASLt configuration is corrected by increasing the workspace size and adding the epilogue mask.
The NVTabular-based preprocessing script for our samples that demonstrate feature crossing is fixed.
The async data reader is fixed. Previously, it would hang and cause a corruption issue due to an improper I/O block size and an I/O alignment problem. The `AsyncParam` class is changed to implement the fix. The `io_block_size` argument is replaced by the `max_nr_request` argument, and the actual I/O block size that the async reader uses is computed accordingly. For more information, refer to the `AsyncParam` class documentation.
Known Issues:
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
Dumping Adam optimizer states to AWS S3 is not supported.
Dumping to remote file systems when MPI is enabled is not supported.
What’s New in Version 4.0
3G Embedding Stabilization: Since the introduction of the next generation of HugeCTR embedding in v3.7, several updates and enhancements were made, including code refactoring to improve usability. The enhancements for this release are as follows:
Optimized the performance for sparse lookup in terms of inter-warp load imbalance. Sparse Operation Kit (SOK) takes advantage of the enhancement to improve performance.
This release includes a fix for determining the maximum embedding vector size in the `GlobalEmbeddingData` and `LocalEmbeddingData` classes.
Version 1.1.4 of the Sparse Operation Kit can be installed with Pip and includes the enhancements mentioned in the preceding bullets.
Embedding Cache Initialization with Configurable Ratio: In previous releases, the default value for the `cache_refresh_percentage_per_iteration` parameter of the InferenceParams was `0.1`. In this release, the default value is `0.0` and the parameter provides an additional purpose. If you set the parameter to a value greater than `0.0` and also set `use_gpu_embedding_cache` to `True` for a model, when the Hierarchical Parameter Server (HPS) starts, HPS initializes the embedding cache for the model on the GPU by loading a subset of the embedding vectors from the sparse files for the model. When embedding cache initialization is used, HPS creates log records at the INFO level when it starts. The logging records are similar to `EC initialization for model: "<model-name>", num_tables: <int>` and `EC initialization on device: <int>`. This enhancement reduces the duration of the warm-up phase.
Lazy Initialization of the HPS Plugin for TensorFlow: In this release, when you deploy a `SavedModel` of TensorFlow with Triton Inference Server, HPS is implicitly initialized when the loaded model is executed for the first time. In previous releases, you needed to run `hps.Init(ps_config_file, global_batch_size)` explicitly. For more information, see the API documentation for `hierarchical_parameter_server.Init`.
Enhancements to the HDFS Backend:
The HDFS Backend is now called IO::HadoopFileSystem.
This release includes fixes for memory leaks.
This release includes refactoring to generalize the interface for HDFS and S3 as remote filesystems.
For more information, see `hadoop_filesystem.hpp` in the `include/io` directory of the repository on GitHub.
Dependency Clarification for Protobuf and Hadoop: Hadoop and Protobuf are true `third_party` modules now. Developers can now avoid unnecessary and frequent cloning and deletion.
Finer-Granularity Control for Overlap Behavior: We deprecated the old `overlapped_pipeline` knob and introduced four new knobs, `train_intra_iteration_overlap`, `train_inter_iteration_overlap`, `eval_intra_iteration_overlap`, and `eval_inter_iteration_overlap`, to help users better control the overlap behavior; a sketch follows this list. For more information, see the API documentation for `Solver.CreateSolver`.
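As referenced above, a hedged sketch of setting the new overlap knobs; the knob names come from this release note, while the remaining solver arguments are illustrative and the defaults should be verified against the `Solver.CreateSolver` documentation:

```python
import hugectr

# Knob names come from this release note; the other arguments are placeholders.
solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=1024,
    batchsize=1024,
    lr=0.001,
    train_intra_iteration_overlap=True,   # overlap work within a training iteration
    train_inter_iteration_overlap=True,   # overlap work across training iterations
    eval_intra_iteration_overlap=False,
    eval_inter_iteration_overlap=False,
)
```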
Documentation Improvements:
Removed two deprecated tutorials, `triton_tf_deploy` and `dump_to_tf`.
Previously, the graphics in the Performance page did not appear. This issue is fixed in this release.
Previously, the API documentation for the HPS Plugin for TensorFlow did not show the class information. This issue is fixed in this release.
Issues Fixed:
Fixed a build error that was triggered in debug mode. The error was caused by the newly introduced 3G embedding unit tests.
When using the Parquet DataReader, if a Parquet dataset file specified in `metadata.json` does not exist, HugeCTR no longer crashes. The new behavior is to skip the missing file and display a warning message. This change relates to GitHub issue 321.
Known Issues:
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
What’s New in Version 3.9
Updates to 3G Embedding:
Sparse Operation Kit (SOK) is updated to use the HugeCTR 3G embedding as a developer preview feature. For more information, refer to the Python programs in the sparse_operation_kit/experiment/benchmark/dlrm directory of the repository on GitHub.
Dynamic embedding table mode is added. The mode is based on the cuCollections library with some functionality enhancements. A dynamic embedding table grows its size when the table is full so that you no longer need to configure the memory usage information for embedding. For more information, refer to the embedding_storage/dynamic_embedding_storage directory of the repository on GitHub.
Enhancements to the HPS Plugin for TensorFlow: This release includes improvements to the interoperability of SOK and HPS. The plugin now supports the sparse lookup layer. The documentation for the HPS plugin is enhanced as follows:
An introduction to the plugin is new.
New notebooks demonstrate how to use the HPS plugin are added.
API documentation for the plugin is new.
Enhancements to the HPS Backend for Triton Inference Server: This release adds support for integrating the HPS Backend and the TensorFlow Backend through the ensemble mode with Triton Inference Server. The enhancement enables deploying a TensorFlow model with large embedding tables with Triton by leveraging HPS. For more information, refer to the sample programs in the hps-triton-ensemble directory of the HugeCTR Backend repository in GitHub.
New Multi-Node Tutorial: The multi-node training tutorial is new. The additions show how to use HugeCTR to train a model with multiple nodes and are based on our most recent Docker container. The tutorial should be useful to users who do not have a cluster with a job scheduler installed, such as Slurm Workload Manager. The update addresses an issue that was first reported in GitHub issue 305.
Support Offline Inference for MMoE: This release includes MMoE offline inference where both per-class AUC and average AUC are provided. When the number of class AUCs is greater than one, the output includes a line like the following example:
[HCTR][08:52:59.254][INFO][RK0][main]: Evaluation, AUC: {0.482141, 0.440781}, macro-averaging AUC: 0.46146124601364136
Enhancements to the API for the HPS Database Backend: This release includes several enhancements to the API for the `DatabaseBackend` class. For more information, see `database_backend.hpp` and the header files for other database backends in the `HugeCTR/include/hps` directory of the repository. The enhancements are as follows:
You can now specify a maximum time budget, in nanoseconds, for queries so that you can build an application that must operate within strict latency limits. Fetch queries return execution control to the caller if the time budget is exhausted. The unprocessed entries are indicated to the caller through a callback function.
The `dump` and `load_dump` methods are new. These methods support saving and loading embedding tables from disk. The methods support a custom binary format and the RocksDB SST table file format. These methods enable you to import and export embedding table data between your custom tools and HugeCTR.
The `find_tables` method is new. The method enables you to discover all table data that is currently stored for a model in a `DatabaseBackend` instance. A new overloaded method for `evict` is added that can process the results from `find_tables` to quickly and simply drop all the stored information that is related to a model.
Documentation Enhancements
The documentation for the `max_all_to_all_bandwidth` parameter of the `HybridEmbeddingParam` class is clarified to indicate that the bandwidth unit is per-GPU. Previously, the unit was not specified.
Issues Fixed:
Hybrid embedding with `IB_NVLINK` as the `communication_type` of the `HybridEmbeddingParam` is fixed in this release.
Training performance is affected by a GPU routine that checks whether an input key can be out of the embedding table. If you can guarantee that the input keys can work with the specified `workspace_size_per_gpu_in_mb`, we have a workaround to disable the routine by setting the environment variable `HUGECTR_DISABLE_OVERFLOW_CHECK=1`. The workaround restores the training performance.
Engineering discovered and fixed a correctness issue with the Softmax layer.
Engineering removed an inline profiler that was rarely used or updated. This change relates to GitHub issue 340.
Known Issues:
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
What’s New in Version 3.8
Sample Notebook to Demonstrate 3G Embedding: This release includes a sample notebook that introduces the Python API of the embedding collection and the key concepts for using 3G embedding. You can view HugeCTR Embedding Collection from the documentation or access the `embedding_collection.ipynb` file from the `notebooks` directory of the repository.
DLPack Python API for Hierarchical Parameter Server Lookup: This release introduces support for embedding lookup from the Hierarchical Parameter Server (HPS) using the DLPack Python API. The new method is `lookup_fromdlpack()`. For sample usage, see the Lookup the Embedding Vector from DLPack heading in the “Hierarchical Parameter Server Demo” notebook.
Read Parquet Datasets from HDFS with the Python API: This release enhances the `DataReaderParams` class with a `data_source_params` argument. You can use the argument to specify the data source configuration, such as the host name of the Hadoop NameNode and the NameNode port number, to read from HDFS; a sketch follows below.
Logging Performance Improvements: This release includes a performance enhancement that reduces the performance impact of logging.
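As referenced above, a hedged sketch of pointing the Parquet reader at HDFS; the NameNode host, port, and dataset paths are placeholders, and the exact fields and enum names should be verified against the API documentation:

```python
import hugectr

# Placeholder NameNode host/port; "data_source_params" is the new argument
# described in this release note, and the enum name is an assumption.
data_source_params = hugectr.DataSourceParams(
    source=hugectr.DataSourceType_t.HDFS,
    server="localhost",
    port=9000,
)
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["/hdfs/path/train/_file_list.txt"],
    eval_source="/hdfs/path/val/_file_list.txt",
    check_type=hugectr.Check_t.Non,
    data_source_params=data_source_params,
)
```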
Enhancements to Layer Classes:
The `FullyConnected` layer now supports 3D inputs.
The `MatrixMultiply` layer now supports 4D inputs.
Documentation Enhancements:
An automatically generated table of contents is added to the top of most pages in the web documentation. The goal is to provide a better experience for navigating long pages such as the HugeCTR Layer Classes and Methods page.
URLs to the Criteo 1TB click logs dataset are updated. For an example, see the HugeCTR Wide and Deep Model with Criteo notebook.
Issues Fixed:
The data generator for the Parquet file type is fixed and produces consistent file names between the `_metadata.json` file and the actual dataset files. Previously, running the data generator to create synthetic data resulted in a core dump. This issue was first reported in GitHub issue 321.
Fixed a memory crash that occurred during AUC warm-up when running a large model on multiple GPUs.
Fixed the issue of keyset generation in the ETC notebook. Refer to GitHub issue 332 for more details.
Fixed an inference build error that occurred when building in debug mode.
Fixed an issue where multi-node training printed duplicate messages.
Known Issues:
Hybrid embedding with `IB_NVLINK` as the `communication_type` of the `HybridEmbeddingParam` class does not work currently. We are working on fixing it. The other communication types have no issues.
HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
What’s New in Version 3.7
3G Embedding Developer Preview: The 3.7 version introduces the next generation of embedding as a developer preview feature. We call it 3G embedding because it is the new update to the HugeCTR embedding interface and implementation, following the unified embedding in v3.1, which was the second generation. Compared with the previous embedding, there are three main changes in the embedding collection.
First, it allows users to fuse embedding tables with different embedding vector sizes. The previous embedding can only fuse embedding tables with the same embedding vector size. The enhancement boosts both flexibility and performance.
Second, it extends the functionality of embedding by supporting the `concat` combiner and supporting different slot lookups on the same embedding table.
Finally, the embedding collection is powerful enough to support arbitrary embedding table placement, including data parallel and model parallel. By providing a plan JSON file, you can configure the table placement strategy as you specify. See the `dlrm_train.py` file in the embedding_collection_test directory of the repository for a more detailed usage example.
HPS Performance Improvements:
Kafka: Model parameters are now stored in Kafka in a bandwidth-saving multiplexed data format. This data format vastly increases throughput. In our lab, we measured transfer speeds up to 1.1 Gbps for each Kafka broker.
HashMap backend: Parallel and single-threaded hashmap implementations have been replaced by a new unified implementation. This new implementation uses a new memory-pool based allocation method that vastly increases upsert performance without diminishing recall performance. Compared with the previous implementation, you can expect a 4x speed improvement for large-batch insertion operations.
Suppressed and simplified logs: Most log messages related to HPS have had their log level changed to `TRACE` rather than `INFO` or `DEBUG` to reduce logging verbosity.
Offline Inference Usability Enhancements:
The thread pool size is configurable in the Python interface, which is useful for studying the embedding cache performance in scenarios of asynchronous update. Previously it was set to the minimum of 16 and `std::thread::hardware_concurrency()`. For more information, please refer to Hierarchical Parameter Server Configuration.
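For instance, a minimal sketch of setting the pool size from Python might look like the following; the `thread_pool_size` keyword and the other argument values are assumptions for illustration, so consult the configuration documentation for the authoritative names:

```python
from hugectr.inference import InferenceParams

# Hedged sketch -- thread_pool_size is an assumed keyword for the newly
# configurable pool size; the remaining argument values are illustrative.
inference_params = InferenceParams(
    model_name="dlrm",
    max_batchsize=1024,
    hit_rate_threshold=0.9,
    dense_model_file="/models/dlrm/_dense_1000.model",
    sparse_model_files=["/models/dlrm/0_sparse_1000.model"],
    device_id=0,
    use_gpu_embedding_cache=True,
    cache_size_percentage=0.5,
    i64_input_key=True,
    thread_pool_size=16,  # previously min(16, std::thread::hardware_concurrency())
)
```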
DataGenerator Performance Improvements: You can specify the `num_threads` parameter to parallelize `Norm` dataset generation; see the sketch after the metric list below.
Evaluation Metric Improvements:
Average loss performance improvement in multi-node environments.
AUC performance optimization and safer memory management.
Addition of NDCG and SMAPE.
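As referenced above, here is a hedged sketch of parallel Norm dataset generation; paths and sizing values are illustrative, and aside from `num_threads` (named in this release) the parameters follow the Data Generator API:

```python
import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

# Hedged sketch: generate a synthetic Norm dataset with multiple generator
# threads. Paths and sizes are illustrative assumptions.
params = DataGeneratorParams(
    format=hugectr.DataReaderType_t.Norm,
    label_dim=1,
    dense_dim=13,
    num_slot=26,
    i64_input_key=False,
    source="./norm_data/file_list.txt",
    eval_source="./norm_data/file_list_test.txt",
    check_type=hugectr.Check_t.Sum,
    num_threads=8,  # the new knob: generate files in parallel
)
DataGenerator(params).generate()
```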
Embedding Training Cache Parquet Demo: Created a keyset extractor script to generate keyset files for Parquet datasets. Provided users with an end-to-end demo of how to train a Parquet dataset using the embedding cache mode. See the Embedding Training Cache Example notebook.
Documentation Enhancements: The documentation details for HugeCTR Hierarchical Parameter Server Database Backend are updated for consistency and clarity.
Issues Fixed:
If `slot_size_array` is specified, `workspace_size_per_gpu_in_mb` is no longer required.
If you build and install HugeCTR from scratch, you can specify the `CMAKE_INSTALL_PREFIX` CMake variable to identify the installation directory for HugeCTR.
Fixed the SOK hang issue when calling `sok.Init()` with a large number of GPUs. See GitHub issues 261 and 302 for more details.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable. Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator. For more information, see the Getting Started section of the repository README file.
The data generator for the Parquet format produces inconsistent file names between `_metadata.json` and the actual dataset files, which results in a core dump when the synthetic dataset is used.
What’s New in Version 3.6
Concat 3D Layer: In previous releases, the `Concat` layer could handle two-dimensional (2D) input tensors only. Now, the input can be three-dimensional (3D) and you can concatenate the inputs along axis 1 or 2. For more information, see the API documentation for the Concat Layer, and the sketch below.
Dense Column List Support in Parquet DataReader: In previous releases, HugeCTR assumed that each dense feature has a single value and that it must be the scalar data type `float32`. Now, you can mix `float32` and `list[float32]` for dense columns. This enhancement means that each dense feature can have more than one value. For more information, see the API documentation for the Parquet dataset format.
Support for HDFS is Re-enabled in Merlin Containers: Support for HDFS in Merlin containers is now an optional dependency. For more information, see HDFS Support.
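Following up on the Concat 3D item above, here is a hedged sketch of constructing such a layer; the tensor names and shapes are illustrative, and the layer would be added to an already-constructed `hugectr.Model` via `model.add()`:

```python
import hugectr

# Hedged sketch: concatenate two 3D tensors along axis 2. The `axis`
# keyword reflects the new 3.6 capability; names/shapes are illustrative.
concat_layer = hugectr.DenseLayer(
    layer_type=hugectr.Layer_t.Concat,
    bottom_names=["item_emb", "context_emb"],  # each (batch, slots, emb_dim)
    top_names=["concat_emb"],                  # (batch, slots, 2 * emb_dim)
    axis=2,
)
# model.add(concat_layer)  # on an already-constructed hugectr.Model
```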
Evaluation Metric Enhancements: In previous releases, HugeCTR computed AUC for binary classification only. Now, HugeCTR supports AUC for multi-label classification. The implementation is inspired by sklearn.metrics.roc_auc_score and performs the unweighted macro-averaging strategy that is the default for scikit-learn. You can specify a value for the `label_dim` parameter of the input layer to enable multi-label classification and HugeCTR will compute the multi-label AUC. A reference sketch of the averaging behavior follows below.
Log Output Format Change: The default log format now includes milliseconds.
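For reference, the macro-averaging strategy that the multi-label AUC follows matches scikit-learn's default, as in this small example (the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Unweighted macro averaging: compute AUC per label column, then average.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])                  # label_dim = 2
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.7, 0.6], [0.1, 0.4]])
print(roc_auc_score(y_true, y_score, average="macro"))
```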
Documentation Enhancements:
These release notes are included in the documentation and are available at https://nvidia-merlin.github.io/HugeCTR/v3.6/release_notes.html.
The Configuration section of the Hierarchical Parameter Server information is updated with more information about the parameters in the configuration file.
The example notebooks that demonstrate how to work with multi-modal data are reorganized in the navigation. The notebooks are now available under the heading Multi-Modal Example Notebooks. This change is intended to make it easier to find the notebooks.
The documentation in the sparse_operation_kit directory of the repository on GitHub is updated with several clarifications about SOK.
Issues Fixed:
The `dlrm_kaggle_fp32.py` file in the `samples/dlrm/` directory of the repository is updated to show the correct number of samples. The `num_samples` value is now set to `36672493`. This fixes GitHub issue 301.
Hierarchical Parameter Server (HPS) would produce a runtime error when the GPU cache was turned off. This issue is now fixed.
Known Issues:
HugeCTR uses NCCL to share data between ranks and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
Joint loss training with a regularizer is not supported.
The Criteo 1 TB click logs dataset that is used with many HugeCTR sample programs and notebooks is currently unavailable. Until the dataset becomes downloadable again, you can run those samples based on our synthetic dataset generator. For more information, see the Getting Started section of the repository README file.
What’s New in Version 3.5
HPS interface encapsulation and exporting as library: We encapsulate the Hierarchical Parameter Server (HPS) interfaces and deliver them as a standalone library. In addition, we provide HPS Python APIs and demonstrate the usage with a notebook. For more information, please refer to Hierarchical Parameter Server and HPS Demo.
Hierarchical Parameter Server Triton Backend: The HPS Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and embedding cache from the end-to-end inference pipeline of the deep recommendation model. For more information, please refer to the samples directory of the HugeCTR backend for Triton Inference Server repository.
SOK pip release: SOK is released on https://pypi.org/project/merlin-sok/. Users can now install SOK via `pip install merlin-sok`.
Joint loss and multi-task training support: We support joint loss in training so that users can train with multiple labels and tasks with different weights. See the MMoE sample in the samples/mmoe directory of the repository to learn the usage.
HugeCTR documentation web page: Users can now visit our web documentation.
ONNX converter enhancement: We enable converting `MultiCrossEntropyLoss` and `CrossEntropyLoss` layers to ONNX to support multi-label inference. For more information, please refer to the HugeCTR to ONNX Converter information in the onnx_converter directory of the repository.
HDFS Python API enhancement:
Simplified `DataSourceParams` so that users do not need to provide all the paths before they are really necessary. Users now only have to pass `DataSourceParams` once when creating a solver.
Later paths will be automatically regarded as local paths or HDFS paths depending on the `DataSourceParams` setting. See the notebook for usage, and the sketch below.
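A minimal sketch of this flow might look as follows; the field names on `DataSourceParams` and the solver keyword are assumptions for illustration, so refer to the notebook for the real usage:

```python
import hugectr

# Hedged sketch (field and keyword names are assumptions): pass the HDFS
# settings once; later paths resolve as HDFS or local automatically.
data_source_params = hugectr.DataSourceParams(
    source=hugectr.DataSourceType_t.HDFS,   # assumed enum value
    server="namenode.example.com",
    port=9000,
)
solver = hugectr.CreateSolver(
    batchsize=1024,
    batchsize_eval=1024,
    max_eval_batches=300,
    lr=0.001,
    vvgpu=[[0]],
    data_source_params=data_source_params,  # assumed keyword
)
```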
HPS performance optimization: We use a better method to determine the partition number for the database backends in HPS.
Issues Fixed: The HugeCTR input layer can now take a `dense_dim` greater than 1000.
What’s New in Version 3.4.1
Support mixed precision inference for datasets with multiple labels: We enable FP16 for the `Softmax` layer and support mixed precision for multi-label inference. For more information, please refer to Inference API.
Support multi-GPU offline inference with Python API: We support multi-GPU offline inference with the Python interface, which can leverage the Hierarchical Parameter Server and enable concurrent execution on multiple devices. For more information, please refer to Inference API and the Multi-GPU Offline Inference Notebook.
Introduction to metadata.json: We add an introduction to `_metadata.json` for Parquet datasets. For more information, please refer to Parquet.
Documentation and tool for workspace size per GPU estimation: We add a tool named `embedding_workspace_calculator` to help calculate the `workspace_size_per_gpu_in_mb` value that is required by hugectr.SparseEmbedding. For more information, please refer to the README.md file in the tools/embedding_workspace_calculator directory of the repository and QA 24 in the documentation.
Improved Debugging Capability: The old logging system, which was flagged as deprecated for some time, has been removed. All remaining log messages and outputs have been revised and migrated to the new logging system (core23/logging.hpp/cpp). During this revision, we also adjusted log levels for log messages throughout the entire codebase to improve the visibility of relevant information.
Support HDFS Parameter Server in Training:
Decoupled HDFS in Merlin containers to make the HDFS support more flexible. Users can now optionally compile HDFS-related functionality.
Now supports loading and dumping models and optimizer states from HDFS.
Added a notebook to show how to use HugeCTR with HDFS.
Support Multi-hot Inference on HugeCTR Backend: We support categorical input in multi-hot format for HugeCTR Backend inference.
Multi-label inference with mixed precision: Mixed precision training is enabled for the softmax layer.
Python script and documentation demonstrating how to analyze model files: In this release, we provide a script to retrieve vocabulary information from a model file. For more details, refer to the README in the tools/model_analyzer directory of the repository.
Issues fixed:
Mirror strategy bug in SOK (see https://github.com/NVIDIA-Merlin/HugeCTR/issues/291).
Can't import SparseOperationKit in nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.04 (see https://github.com/NVIDIA-Merlin/HugeCTR/issues/296).
HPS: Fixed access violation that can occur during initialization when not configuring a volatile DB.
What’s New in Version 3.4
Support for Building HugeCTR with the Unified Merlin Container: HugeCTR can now be built using our unified Merlin container. For more information, refer to our Contributor Guide.
Hierarchical Parameter Server (HPS) Enhancements:
New Missing Key (Embedding Table Entries) Insertion Feature: Using a simple flag, it is now possible to configure HugeCTR to insert missing keys (embedding table entries). During lookup, these missing keys will automatically be inserted into volatile database layers, such as the Redis and Hashmap backends.
Asynchronous Timestamp Refresh: To allow time-based eviction to take place, it is now possible to enable timestamp refreshing for frequently used embeddings. Once enabled, refreshing is handled asynchronously using background threads, which won’t block your inference jobs. For most applications, the associated performance impact from enabling this feature is barely noticeable.
HDFS (Hadoop Distributed File System) Parameter Server Support During Training:
We’re introducing a new DataSourceParams API, which is a python API that can be used to specify the file system and paths to data and model files.
We’ve added support for loading data from HDFS to the local file system for HugeCTR training.
We’ve added support for dumping trained model and optimizer states into HDFS.
New Load API Capabilities: In addition to being able to deploy new models, the HugeCTR Backend’s Load API can now be used to update the dense parameters for models and corresponding embedding inference cache online.
Sparse Operation Kit (SOK) Enhancements:
Mixed Precision Training: Mixed precision training can now be enabled using TensorFlow's pattern to enhance training performance and reduce memory usage.
DLRM Benchmark: DLRM is a standard benchmark for recommendation model training, so we added a new notebook. Refer to the sparse_operation_kit/documents/tutorials/DLRM_Benchmark directory of the repository. The notebook shows how SOK performs on this benchmark.
Uint32_t / int64_t key dtype Support in SOK: Int64 or uint32 can now be used as the key data type for SOK’s embedding. Int64 is the default.
TensorFlow Initializers Support: We now support TensorFlow native initializers within SOK, such as `sok.All2AllDenseEmbedding(embedding_initializer=tf.keras.initializers.RandomUniform())`. For more information, refer to All2All Dense Embedding, and see the sketch below.
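A hedged sketch of passing a native initializer follows; the sizing arguments are illustrative assumptions, so check the All2All Dense Embedding documentation for the authoritative signature:

```python
import tensorflow as tf
import sparse_operation_kit as sok

# Hedged sketch: a SOK embedding layer initialized with a native TF
# initializer. Sizing arguments are illustrative assumptions.
emb_layer = sok.All2AllDenseEmbedding(
    max_vocabulary_size_per_gpu=8192,
    embedding_vec_size=16,
    slot_num=10,
    nnz_per_slot=1,
    embedding_initializer=tf.keras.initializers.RandomUniform(),
)
```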
Documentation Enhancements
We’ve revised several of our notebooks and readme files to improve readability and accessibility.
We’ve revised the SOK docker setup instructions to indicate that HugeCTR setup issues can be resolved using the
--shm-size
setting within docker.Although HugeCTR is designed for scalability, having a robust machine is not necessary for smaller workloads and testing. We’ve documented the required specifications for notebook testing environments. For more information, refer to our README for HugeCTR Jupyter Demo Notebooks.
Inference Enhancements: We now support HugeCTR inference for managing multiple tasks. When the label dimension is the number of binary classification tasks and `MultiCrossEntropyLoss` is employed during training, the shape of the inference results will be `(batch_size * num_batches, label_dim)`. For more information, refer to Inference API.
Embedding Cache Issue Resolution: The embedding cache issue for very small embedding tables has been resolved.
What’s New in Version 3.3.1
Hierarchical Parameter Server (HPS) Enhancements:
HugeCTR Backend Enhancements: The HugeCTR Backend is now fully compatible with the Triton model control protocol, so new model configurations can be simply added to the HPS configuration file. The HugeCTR Backend will continue to support online deployments of new models using the Triton Load API. However, with this enhancement, old models can be recycled online using the Triton Unload API.
Simplified Database Backend: Multi-nodes, single-node, and all other kinds of volatile database backends can now be configured using the same configuration object.
Multi-Threaded Optimization of Redis Code: Thanks to this optimization, HugeCTR version 3.3.1 is 2.3 times faster than HugeCTR version 3.3.
Additional HPS Enhancements and Fixes:
You can now build the HPS test environment and implement unit tests for each component.
You’ll no longer encounter the access violation issue when updating Apache Kafka online.
The parquet data reader no longer incorrectly parses the index of categorical features when multiple embedded tables are being used.
The HPS Redis Backend overflow is now invoked during single insertions.
New GroupDenseLayer: We’re introducing a new GroupDenseLayer. It can be used to group fused fully connected layers when constructing the model graph. A simplified Python interface is provided for adjusting the number of layers and specifying the output dimensions in each layer, which makes it easy to leverage the highly-optimized fused fully connected layers in HugeCTR. For more information, refer to GroupDenseLayer.
Global Fixes:
A warning message now appears when attempting to launch a multi-process job before importing MPI.
When running with embedding training cache, a massive log is no longer generated.
Legacy conda information has been removed from the HugeCTR documentation.
What’s New in Version 3.3
Hierarchical Parameter Server (HPS) Enhancements:
Support for Incremental Model Updates: HPS now supports incremental model updates via Apache Kafka (a distributed event streaming platform) message queues. With this enhancement, HugeCTR can now be connected with Apache Kafka deployments to update models in real time during training and inference. For more information, refer to the Demo Notebook.
Improvements to the Memory Management: The Redis cluster and CPU memory database backends are now the primary sources for memory management. When performing incremental model updates, these memory database backends will automatically evict infrequently used embeddings as training progresses. The performance of the Redis cluster and CPU memory database backends has also been improved.
New Asynchronous Refresh Mechanism: Support for asynchronous refreshing of incremental embedding keys into the embedding cache has been added. The Refresh operation will be triggered when completing the model version iteration or outputting incremental parameters from online training. The Distributed Database and Persistent Database will be updated by Apache Kafka. The GPU embedding cache will then refresh the values of the existing embedding keys and replace them with the latest incremental embedding vectors. For more information, refer to the HPS README.
Configurable Backend Implementations for Databases: Backend implementations for databases are now fully configurable.
Improvements to the JSON Interface Parser: The JSON interface parser can now handle inaccurate parameterization.
More Meaningful Jabber: As requested, we’ve revised the log levels throughout the entire API database backend of the HPS. Selected configuration options are now printed entirely and uniformly to the log. Errors provide more verbose information about pending issues.
Sparse Operation Kit (SOK) Enhancements:
TensorFlow (TF) 1.15 Support: SOK can now be used with TensorFlow 1.15. For more information, refer to README.
Dedicated CUDA Stream: A dedicated CUDA stream is now used for SOK’s Ops, so this may help to eliminate kernel interleaving.
New pip Installation Option: SOK can now be installed using the `pip install SparseOperationKit` command. See more in our instructions. With this installation option, root access to compile SOK is no longer required and Python scripts don't need to be copied.
Visible Device Configuration Support: `tf.config.set_visible_devices` can now be used to set visible GPUs for each process. `CUDA_VISIBLE_DEVICES` can also be used. When `tf.distribute.Strategy` is used, the `tf.config.set_visible_devices` argument shouldn't be set.
Hybrid-embedding indices pre-computing: The indices needed for hybrid embedding are pre-computed ahead of time and are overlapped with previous iterations.
Cached evaluation indices: The hybrid-embedding indices for evaluation are cached when applicable, hence eliminating the re-computation of the indices at every evaluation iteration.
MLP weight/data gradients calculation overlap: The weight gradients of the MLP are calculated asynchronously with respect to the data gradients, enabling overlap between these two computations.
Better compute-communication overlap: Better overlap between compute and communication has been enabled to improve training throughput.
Fused weight conversion: The FP32-to-FP16 conversion of the weights is now fused into the SGD optimizer, saving trips to memory.
GraphScheduler: GraphScheduler was added to control the timing of cudaGraph launching. With GraphScheduler, the gap between adjacent cudaGraphs is eliminated.
Multi-Node Training Support Enhancements: You can now perform multi-node training on a cluster with non-RDMA hardware by setting the `AllReduceAlgo.NCCL` value for the `all_reduce_algo` argument. For more information, refer to the details for the `all_reduce_algo` argument in the CreateSolver API.
Support for Model Naming During Model Dumping: You can now specify names for models with the `CreateSolver` training API, which will be dumped to the JSON configuration file with the `Model.graph_to_json` API. This will facilitate the Triton deployment of saved HugeCTR models, as well as help to distinguish between models when Apache Kafka sends parameters from training to inference.
Fine-Grained Control Accessibility Enhancements for Embedding Layers: We've added fine-grained control accessibility to embedding layers. Using the `Model.freeze_embedding` and `Model.unfreeze_embedding` APIs, embedding layer weights can be frozen and unfrozen. Additionally, weights for multiple embedding layers can be loaded independently, making it possible to load pre-trained embeddings for a particular layer. For more information, refer to Model API and Section 3.4 of the HugeCTR Criteo Notebook. A short sketch follows.
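A hedged fragment of that workflow; the embedding name is illustrative, `model` is an already-constructed `hugectr.Model`, and whether these APIs accept a per-embedding name argument is an assumption:

```python
# Hedged fragment: freeze one embedding's weights, train, then unfreeze.
# `model` is an already-constructed hugectr.Model; the name is illustrative.
model.freeze_embedding("sparse_embedding1")    # weights stay fixed during training
# ... run training while the embedding is frozen ...
model.unfreeze_embedding("sparse_embedding1")  # make the weights trainable again
```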
What’s New in Version 3.2.1
GPU Embedding Cache Optimization: The performance of the GPU embedding cache for the standalone module has been optimized. With this enhancement, the performance of small to medium batch sizes has improved significantly. We're not introducing any changes to the interface for the GPU embedding cache, so don't worry about making changes to any existing code that uses this standalone module. For more information, refer to the `ReadMe.md` file in the gpu_cache directory of the repository.
Model Oversubscription Enhancements: We're introducing a new host memory cache (HMEM-Cache) component for the model oversubscription feature. When configured properly, incremental training can be efficiently performed on models with large embedding tables that exceed the host memory. For more information, refer to Host Memory Cache in MOS. Additionally, we've enhanced the Python interface for model oversubscription by replacing the `use_host_memory_ps` parameter with a `ps_types` parameter and adding a `sparse_models` parameter. For more information about these changes, refer to HugeCTR Python Interface.
Debugging Enhancements: We're introducing new debugging features such as multi-level logging, as well as kernel debugging functions. We're also making our error messages more informative so that users know exactly how to resolve issues related to their training and inference code. For more information, refer to the comments in the header files, which are available at HugeCTR/include/base/debug.
Enhancements to the Embedding Key Insertion Mechanism for the Embedding Cache: Missing embedding keys can now be asynchronously inserted into the embedding cache. To enable this automatically, set the hit rate threshold within the configuration file. When the actual hit rate of the embedding cache is higher than the threshold that the user set, the embedding cache will insert the missing embedding keys asynchronously.
Parameter Server Enhancements: We’re introducing a new “in memory” database that utilizes the local CPU memory for storing and recalling embeddings and uses multi-threading to accelerate lookup and storage. You can now also use the combined CPU-accessible memory of your Redis cluster to store embeddings. We improved the performance for the “persistent” storage and retrieving embeddings from RocksDB using structured column families, as well as added support for creating hierarchical storage such as Redis as distributed cache. You don’t have to worry about updating your Parameter Server configurations to take advantage of these enhancements.
Slice Layer Internalization Enhancements: The Slice layer for the branch topology can now be abstracted away in the Python interface. A model graph analysis will be conducted to resolve the tensor dependency and the Slice layer will be internally inserted if the same tensor is consumed more than once to form the branch topology. For more information about how to construct a model graph using branches without the Slice layer, refer to the Getting Started section of the repository README and the Slice Layer information.
What’s New in Version 3.2
New HugeCTR to ONNX Converter: We're introducing a new HugeCTR to ONNX converter in the form of a Python package. The graph configuration file and model weights are required as inputs. You can specify where you want to save the converted ONNX model. You can also convert sparse embedding models. For more information, refer to the HugeCTR to ONNX Converter information in the onnx_converter directory and the HugeCTR2ONNX Demo Notebook.
New Hierarchical Storage Mechanism on the Parameter Server (POC): We've implemented a hierarchical storage mechanism between local SSDs and CPU memory. As a result, embedding tables no longer have to be stored in the local CPU memory. The distributed Redis cluster acts as a CPU cache to store larger embedding tables and interact with the GPU embedding cache directly. The local RocksDB serves as a query engine to back up the complete embedding table on the local SSDs and assist the Redis cluster with looking up missing embedding keys. For more information about how this works, refer to our HugeCTR Backend documentation.
Parquet Format Support within the Data Generator: The HugeCTR data generator now supports the parquet format, which can be configured easily using the Python API. For more information, refer to Data Generator API.
Python Interface Support for the Data Generator: The data generator has been enabled within the HugeCTR Python interface. The parameters associated with the data generator have been encapsulated into the `DataGeneratorParams` struct, which is required to initialize the `DataGenerator` instance. You can use the data generator's Python APIs to easily generate the Norm, Parquet, or Raw dataset formats with the desired distribution of sparse keys. For more information, refer to Data Generator API and the data generator samples in the tools/data_generator directory of the repository.
Improvements to the Formula of the Power Law Simulator within the Data Generator: We've modified the formula of the power law simulator within the data generator so that a positive alpha value is always produced, which is needed for most use cases. The alpha values for `Long`, `Medium`, and `Short` within the power law distribution are 0.9, 1.1, and 1.3, respectively. For more information, refer to Data Generator API.
Support for Arbitrary Input and Output Tensors in the Concat and Slice Layers: The Concat and Slice layers now support any number of input and output tensors. Previously, these layers were limited to a maximum of four tensors.
New Continuous Training Notebook: We’ve added a new notebook to demonstrate how to perform continuous training using the embedding training cache (also referred to as Embedding Training Cache) feature. For more information, refer to HugeCTR Continuous Training.
New HugeCTR Contributor Guide: We’ve added a new HugeCTR Contributor Guide that explains how to contribute to HugeCTR, which may involve reporting and fixing a bug, introducing a new feature, or implementing a new or pending feature.
Sparse Operation Kit (SOK) Enhancements: SOK now supports TensorFlow 2.5 and 2.6. We also added support for identity hashing, dynamic input, and Horovod within SOK. Lastly, we added a new SOK docs set to help you get started with SOK.
What’s New in Version 3.1
Hybrid Embedding: Hybrid embedding is designed to overcome the bandwidth constraint imposed by the embedding part of the embedding training workload by algorithmically reducing the traffic over the network. Requirements: The input dataset has only one-hot feature items and the model uses the SGD optimizer.
FusedReluBiasFullyConnectedLayer: FusedReluBiasFullyConnectedLayer is one of the major optimizations applied to dense layers. It fuses the ReLU, bias, and fully connected layers to reduce memory accesses on HBM. Requirements: The model uses a layer with separate data/gradient tensors as the bottom layer.
Overlapped Pipeline: The computation in the dense input data path is overlapped with the hybrid embedding computation. Requirements: The data reader is asynchronous, hybrid embedding is used, and the model has a feature interaction layer.
Holistic CUDA Graph: Everything inside a training iteration is packed into a single CUDA graph. Limitations: This option works only if `use_cuda_graph` is turned off and `use_overlapped_pipeline` is turned on.
Python Interface Enhancements: We’ve enhanced the Python interface for HugeCTR so that you no longer have to manually create a JSON configuration file. Our Python APIs can now be used to create the computation graph. They can also be used to dump the model graph as a JSON object and save the model weights as binary files so that continuous training and inference can take place. We’ve added an Inference API that takes Norm or Parquet datasets as input to facilitate the inference process. For more information, refer to HugeCTR Python Interface and HugeCTR Criteo Notebook.
New Interface for Unified Embedding: We're introducing a new interface to simplify the use of embeddings and data readers. To help you specify the number of keys in each slot, we added `nnz_per_slot` and `is_fixed_length`. You can now directly configure how much memory usage you need by specifying `workspace_size_per_gpu_in_mb` instead of `max_vocabulary_size_per_gpu`. For convenience, `mean`/`sum` is used in combiners instead of 0 and 1. In cases where you don't know which embedding type you should use, you can specify `use_hash_table` and let HugeCTR automatically select the embedding type based on your configuration. For more information, refer to HugeCTR Python Interface, and see the sketch after the next item.
Multi-Node Support for Embedding Training Cache (ETC): We've enabled multi-node support for the embedding training cache. You can now train a model with a terabyte-size embedding table using one node or multiple nodes even if the entire embedding table can't fit into the GPU memory. We're also introducing the host memory (HMEM) based parameter server (PS) along with its SSD-based counterpart. If the sparse model can fit into the host memory of each training node, the optimized HMEM-based PS can provide better model loading and dumping performance with a more effective bandwidth. For more information, refer to HugeCTR Python Interface.
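As referenced in the unified embedding item above, here is a hedged sketch of the new interface; the values are illustrative and the surrounding model setup is omitted:

```python
import hugectr

# Hedged sketch: nnz_per_slot and is_fixed_length on the data reader side...
sparse_param = hugectr.DataReaderSparseParam("data1", 2, False, 26)
# name, nnz_per_slot=2, is_fixed_length=False, slot_num=26

# ...and workspace sizing plus a string combiner on the embedding side.
embedding = hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=64,   # replaces max_vocabulary_size_per_gpu
    embedding_vec_size=16,
    combiner="sum",                    # "mean"/"sum" instead of 0/1
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam),
)
```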
Enhancements to the Multi-Nodes TensorFlow Plugin: The Multi-Nodes TensorFlow Plugin now supports multi-node synchronized training via tf.distribute.MultiWorkerMirroredStrategy. With minimal code changes, you can now easily scale your single-GPU training to multi-node, multi-GPU training. The Multi-Nodes TensorFlow Plugin also supports multi-node synchronized training via Horovod. The inputs for embedding plugins are now data parallel, so the data reader no longer needs to preprocess data for different GPUs based on concrete embedding algorithms. For more information, see the `sparse_operation_kit_demo.ipynb` notebook in the sparse_operation_kit/notebooks directory of the repository.
NCF Model Support: We've added support for the NCF model, as well as the GMF and NeuMF variant models. With this enhancement, we're introducing a new element-wise multiplication layer and a HitRate evaluation metric. Sample code was added that demonstrates how to preprocess user-item interaction data and train an NCF model with it. New examples have also been added that demonstrate how to train NCF models using MovieLens datasets.
DIN and DIEN Model Support: All of our layers support the DIN model. The following layers support the DIEN model: FusedReshapeConcat, FusedReshapeConcatGeneral, Gather, GRU, PReLUDice, ReduceMean, Scale, Softmax, and Sub. We also added sample code to demonstrate how to use the Amazon dataset to train the DIN model. See our sample programs in the samples/din directory of the repository.
Multi-Hot Support for Parquet Datasets: We've added multi-hot support for Parquet datasets, so you can now train models with a Parquet dataset that contains both one-hot and multi-hot slots.
Mixed Precision (FP16) Support in More Layers: The MultiCross layer now supports mixed precision (FP16). All layers now support FP16.
Mixed Precision (FP16) Support in Inference: We’ve added FP16 support for the inference pipeline. Therefore, dense layers can now adopt FP16 during inference.
Optimizer State Enhancements for Continuous Training: You can now store optimizer states that are updated during continuous training as files, such as the Adam optimizer's first moment (m) and second moment (v). By default, the optimizer states are initialized with zeros, but you can specify a set of optimizer state files to recover their previous values. For more information about `dense_opt_states_file` and `sparse_opt_states_file`, refer to Python Interface.
New Library File for GPU Embedding Cache Data: We've moved the header/source code of the GPU embedding cache data structure into a stand-alone folder. It has been compiled into a stand-alone library file. Similar to HugeCTR, your application programs can now be directly linked against this new library file for future use. For more information, refer to the `ReadMe.md` file in the gpu_cache directory of the repository.
Embedding Plugin Enhancements: We've moved all the embedding plugin files into a stand-alone folder. The embedding plugin can be used as a stand-alone Python module, and works with TensorFlow to accelerate the embedding training process.
Adagrad Support: Adagrad can now be used to optimize your embedding and network. To use it, change the optimizer type in the Optimizer layer and set the corresponding parameters, as in the sketch below.
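A hedged sketch via the Python interface; the hyperparameter names and values are illustrative assumptions:

```python
import hugectr

# Hedged sketch: select AdaGrad and set its parameters. Values are
# illustrative; check the optimizer documentation for the full list.
optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.AdaGrad,
    update_type=hugectr.Update_t.Global,
    initial_accu_value=0.0,
    epsilon=1e-7,
)
```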
What’s New in Version 3.0.1
New DLRM Inference Benchmark: We’ve added two detailed Jupyter notebooks to demonstrate how to train, deploy, and benchmark the performance of a deep learning recommendation model (DLRM) with HugeCTR. For more information, refer to our HugeCTR Inference Notebooks.
FP16 Optimization: We've optimized the DotProduct, ELU, and Sigmoid layers based on `__half2` vectorized loads and stores, improving their device memory bandwidth utilization. MultiCross, FmOrder2, ReduceSum, and Multiply are the only layers that still need to be optimized for FP16.
Synthetic Data Generator Enhancements: We've enhanced our synthetic data generator so that it can generate uniformly distributed datasets, as well as power-law based datasets. You can now specify the `vocabulary_size` and `max_nnz` per categorical feature instead of across all categorical features. For more information, refer to our user guide.
Reduced Memory Allocation for Trained Model Exportation: To prevent the "Out of Memory" error message from displaying when exporting a trained model, which may include a very large embedding table, the amount of memory allocated by the related functions has been significantly reduced.
Dropout Layer Enhancement: The Dropout layer is now compatible with CUDA Graph. It uses cuDNN by default so that it can be used with CUDA Graph.
What’s New in Version 3.0
Inference Support: To streamline the recommender system workflow, we’ve implemented a custom HugeCTR backend on the NVIDIA Triton Inference Server. The HugeCTR backend leverages the embedding cache and parameter server to efficiently manage embeddings of different sizes and models in a hierarchical manner. For more information, refer to our inference repository.
New High-Level API: You can now also construct and train your models using the Python interface with our new high-level API. For more information, refer to our preview example code in the `samples/preview` directory to grasp how this new API works.
FP16 Support in More Layers: All the layers except `MultiCross` support mixed precision mode. We've also optimized some of the FP16 layer implementations based on vectorized loads and stores.
Enhanced TensorFlow Embedding Plugin: Our embedding plugin now supports `LocalizedSlotSparseEmbeddingHash` mode. With this enhancement, the DNN model no longer needs to be split into two parts since it now connects with the embedding op through `MirroredStrategy` within the embedding layer. For more information, see the `notebooks/embedding_plugin.ipynb` notebook.
Extended Embedding Training Cache: We've extended the embedding training cache feature to support `LocalizedSlotSparseEmbeddingHash` and `LocalizedSlotSparseEmbeddingHashOneHot`.
Epoch-Based Training Enhancements: The `num_epochs` option in the Solver clause can now be used with the `Raw` dataset format.
Deprecation of the `eval_batches` Parameter: The `eval_batches` parameter has been deprecated and replaced with the `max_eval_batches` and `max_eval_samples` parameters. In epoch mode, these parameters control the maximum number of evaluations. An error message will appear when attempting to use the `eval_batches` parameter. A sketch of the replacement parameters follows.
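For instance, a hedged sketch using the later Python interface (argument values are illustrative):

```python
import hugectr

# Hedged sketch: the replacement parameters in the solver. Values are
# illustrative; max_eval_batches supersedes the removed eval_batches.
solver = hugectr.CreateSolver(
    max_eval_batches=100,   # replaces the deprecated eval_batches
    batchsize_eval=1024,
    batchsize=1024,
    lr=0.001,
    vvgpu=[[0]],
)
```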
MultiplyLayer Renamed: To clarify what the `MultiplyLayer` does, it was renamed to `WeightMultiplyLayer`.
Optimized Initialization Time: HugeCTR's initialization time, which includes the GEMM algorithm search and parameter initialization, was significantly reduced.
Sample Enhancements: Our samples now rely upon the Criteo 1TB Click Logs dataset instead of the Kaggle Display Advertising Challenge dataset. Our preprocessing scripts (Perl, Pandas, and NVTabular) have also been unified and simplified.
Configurable DataReader Worker: You can now specify the number of data reader workers, which run in parallel, with the `num_workers` parameter. Its default value is 12. However, if you are using the Parquet data reader, you can't configure the `num_workers` parameter since it always corresponds to the number of active GPUs. A sketch follows.
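A hedged sketch using the Python-interface equivalent (paths and values are illustrative):

```python
import hugectr

# Hedged sketch: set the number of parallel data reader workers. Paths are
# illustrative; num_workers is ignored (GPU-count-bound) for Parquet.
reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Norm,
    source=["./data/file_list.txt"],
    eval_source="./data/file_list_test.txt",
    check_type=hugectr.Check_t.Sum,
    num_workers=12,  # the default value
)
```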
What’s New in Version 2.3
New Python Interface: To enhance the interoperability with NVTabular and other Python-based libraries, we’re introducing a new Python interface for HugeCTR.
HugeCTR Embedding with TensorFlow: To help users easily integrate HugeCTR's optimized embedding into their TensorFlow workflow, we now offer the HugeCTR embedding layer as a TensorFlow plugin. To better understand how to install, use, and verify it, see our Jupyter notebook tutorial in the `notebooks/embedding_plugin.ipynb` file. The notebook also demonstrates how you can create a new Keras layer, `EmbeddingLayer`, based on the `hugectr.py` file in the `tools/embedding_plugin/python` directory with the helper code that we provide.
Embedding Training Cache: To enable a model with large embedding tables that exceed the single GPU's memory limit, we've added a new embedding training cache feature, giving you the ability to load a subset of an embedding table into the GPU in a coarse-grained, on-demand manner during the training stage.
TF32 Support: We've added support for TensorFloat-32 (TF32), a new math mode for the third generation of Tensor Cores on Ampere. TF32 uses the same 10-bit mantissa as FP16 to ensure accuracy while providing the same range as FP32 by using an 8-bit exponent. Since TF32 is an internal data type that accelerates FP32 GEMM computations with Tensor Cores, you can simply turn it on with a newly added configuration option. For more information, refer to Solver.
Enhanced AUC Implementation: To enhance the performance of our AUC computation on multi-node environments, we’ve redesigned our AUC implementation to improve how the computational load gets distributed across nodes.
Epoch-Based Training: In addition to the `max_iter` parameter, you can now set the `num_epochs` parameter in the Solver clause within the configuration file. This mode can currently only be used with `Norm` dataset formats and their corresponding file lists. All dataset formats will be supported in the future.
New Multi-Node Training Tutorial: To better support multi-node training use cases, we've added a new step-by-step tutorial to the tutorial/multinode-training directory of our GitHub repository.
Power Law Distribution Support with Data Generator: Because of the increased need for generating a random dataset whose categorical features follow the power-law distribution, we've revised our data generation tool to support this use case. For additional information, refer to the `--long-tail` description in the Generating Synthetic Data and Benchmarks section of the docs/hugectr_user_guide.md file in the repository.
Multi-GPU Preprocessing Script for Criteo Samples: Multiple GPUs can now be used when preparing the dataset for the programs in the samples directory of our GitHub repository. For more information, see how the `preprocess_nvt.py` program in the tools/criteo_script directory of the repository is used to preprocess the Criteo dataset for the DCN, DeepFM, and W&D samples.
Known Issues
HugeCTR uses NCCL to share data between ranks, and NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing:
--shm-size=1g --ulimit memlock=-1
See also the NCCL known issue and the GitHub issue.
`KafkaProducers` startup will succeed even if the target Kafka broker is unresponsive. In order to avoid data loss in conjunction with streaming model updates from Kafka, you have to make sure that a sufficient number of Kafka brokers are up, working properly, and reachable from the node where you run HugeCTR.
The number of data files in the file list should be no less than the number of data reader workers. Otherwise, different workers will be mapped to the same file and data loading does not progress as expected.
Joint Loss training hasn’t been supported with regularizer.