
HugeCTR training with HDFS example

Overview

In version 3.4, we introduced support for HDFS. Users can now move their data and model files from HDFS to the local filesystem through our API for HugeCTR training, and after training they can choose to dump the trained parameters and optimizer states back into HDFS. In this example notebook, we demonstrate the end-to-end procedure of training with HDFS.
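
The HDFS connection is described to HugeCTR through a DataSourceParams object; the training script later in this notebook passes the same parameters to CreateSolver. A minimal sketch, assuming a NameNode reachable at localhost:9000 (the values must match the fs.defaultFS setting configured below):

from hugectr.data import DataSourceParams

# Describe where the HDFS NameNode lives; these placeholder values must
# match the fs.defaultFS URI configured in core-site.xml later in this notebook.
data_source_params = DataSourceParams(
    use_hdfs = True,        # read and write model files through HDFS
    namenode = 'localhost', # HDFS NameNode IP address or hostname
    port = 9000,            # HDFS NameNode port
)
# The object is then passed to hugectr.CreateSolver(..., data_source_params = data_source_params)
# so that model dumps and loads go through HDFS instead of the local filesystem.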

Get HugeCTR from NGC

The HugeCTR Python module is preinstalled in the Merlin Training container, version 22.04 and later: nvcr.io/nvidia/merlin/merlin-training:22.04.

You can check the existence of required libraries by running the following Python code after launching the container.

$ python3 -c "import hugectr"

If you prefer to build HugeCTR from the source code instead of using the NGC container, refer to the How to Start Your Development documentation.

Hadoop Installation and Configuration

Download and Install Hadoop

  1. Download a JDK:

    wget https://download.java.net/java/GA/jdk16.0.2/d4a915d82b4c4fbb9bde534da945d746/7/GPL/openjdk-16.0.2_linux-x64_bin.tar.gz
    tar -zxvf openjdk-16.0.2_linux-x64_bin.tar.gz
    mv jdk-16.0.2 /usr/local
    
  2. Set Java environmental variables:

    export JAVA_HOME=/usr/local/jdk-16.0.2
    export JRE_HOME=${JAVA_HOME}/jre
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=.:${JAVA_HOME}/bin:$PATH
    
  3. Download and install Hadoop:

    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
    tar -zxvf hadoop-3.3.1.tar.gz
    mv hadoop-3.3.1 /usr/local/hadoop
    
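To sanity-check the installation before configuring Hadoop, you can query both versions from Python. This is an optional check, assuming the archives were unpacked under /usr/local as shown above:

import subprocess

# Print the Java and Hadoop versions to confirm the installs from the steps above.
# `java -version` writes to stderr while `hadoop version` writes to stdout.
for cmd in (["/usr/local/jdk-16.0.2/bin/java", "-version"],
            ["/usr/local/hadoop/bin/hadoop", "version"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(" ".join(cmd), "->", (result.stdout or result.stderr).splitlines()[0])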

Hadoop Configuration

Set Hadoop environment variables:

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
echo 'export JAVA_HOME=/usr/local/jdk-16.0.2' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh

core-site.xml config:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
</property>
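
The host and port in the fs.defaultFS URI are the values that the training script later passes to DataSourceParams as namenode and port. An optional check that reads the setting back, assuming the configuration lives under /usr/local/hadoop/etc/hadoop:

import xml.etree.ElementTree as ET

# Read fs.defaultFS back from core-site.xml; the host and port of this URI
# are what train_with_hdfs.py passes as `namenode` and `port`.
conf = ET.parse('/usr/local/hadoop/etc/hadoop/core-site.xml').getroot()
for prop in conf.findall('property'):
    if prop.findtext('name') == 'fs.defaultFS':
        print('fs.defaultFS =', prop.findtext('value'))  # e.g. hdfs://namenode:9000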

hdfs-site.xml for the NameNode:

<property>
     <name>dfs.replication</name>
     <value>4</value>
</property>
<property>
     <name>dfs.namenode.name.dir</name>
     <value>file:/usr/local/hadoop/hdfs/name</value>
</property>
<property>
     <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>         
     <value>true</value>
</property>
<property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
</property>

hdfs-site.xml for the DataNodes:

<property>
     <name>dfs.replication</name>
     <value>4</value>
</property>
<property>
     <name>dfs.datanode.data.dir</name>
     <value>file:/usr/local/hadoop/hdfs/data</value>
</property>
<property>
     <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>         
     <value>true</value>
</property>
<property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
</property>

workers file for all nodes:

worker1
worker2
worker3
worker4

Start HDFS

  1. Enable passwordless SSH connections:

    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    /etc/init.d/ssh start
    
  2. Format the NameNode:

    /usr/local/hadoop/bin/hdfs namenode -format
    
  3. Format the DataNodes:

    /usr/local/hadoop/bin/hdfs datanode -format
    
  4. Start HDFS from the NameNode:

    /usr/local/hadoop/sbin/start-dfs.sh
    
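Once start-dfs.sh completes, you can confirm that the NameNode and DataNodes are up before moving on. A quick check through the hdfs CLI, a sketch assuming the install path used above:

import subprocess

# Report live DataNodes and list the HDFS root to confirm the cluster is running.
hdfs = '/usr/local/hadoop/bin/hdfs'
print(subprocess.run([hdfs, 'dfsadmin', '-report'], capture_output=True, text=True).stdout)
print(subprocess.run([hdfs, 'dfs', '-ls', '/'], capture_output=True, text=True).stdout)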

Wide and Deep Model

In the Docker container, nvcr.io/nvidia/merlin/merlin-training:22.04, make sure that you have installed Hadoop and set the proper environment variables as described in the preceding sections.

If you chose to compile HugeCTR from source, make sure that you set ENABLE_HDFS to ON (that is, pass -DENABLE_HDFS=ON to CMake).

  • Run export CLASSPATH=$(hadoop classpath --glob) first so that the required Hadoop JAR files can be found.

  • Make sure that the model files are on your Hadoop cluster and that the script points to the correct paths for them (see the upload sketch after this list).

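If the pre-trained Wide and Deep files are still on the local filesystem, you can upload them to HDFS first. The sketch below assumes a hypothetical local staging directory ./wdl that holds the files (_dense_1000.model, 0_sparse_1000.model, and so on); /model/wdl is the HDFS path that the script below loads from and writes snapshots to:

import subprocess

# Upload the pre-trained model files to HDFS. The local ./wdl directory is a
# hypothetical staging location; /model/wdl matches the paths used in train_with_hdfs.py.
hdfs = '/usr/local/hadoop/bin/hdfs'
subprocess.run([hdfs, 'dfs', '-mkdir', '-p', '/model'], check=True)
subprocess.run([hdfs, 'dfs', '-put', '-f', './wdl', '/model/'], check=True)
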
Now you can run the following sample.

%%writefile train_with_hdfs.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSource, DataSourceParams

data_source_params = DataSourceParams(
    use_hdfs = True, # whether to use HDFS to load and save model files
    namenode = 'localhost', # HDFS NameNode IP address or hostname
    port = 9000, # HDFS NameNode port
)

solver = hugectr.CreateSolver(max_eval_batches = 1280,
                              batchsize_eval = 1024,
                              batchsize = 1024,
                              lr = 0.001,
                              vvgpu = [[0]],
                              repeat_dataset = True,
                              data_source_params = data_source_params)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ['./wdl_norm/file_list.txt'],
                                  eval_source = './wdl_norm/file_list_test.txt',
                                  check_type = hugectr.Check_t.Sum)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.Adam,
                                    update_type = hugectr.Update_t.Global,
                                    beta1 = 0.9,
                                    beta2 = 0.999,
                                    epsilon = 0.0000001)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
                        dense_dim = 13, dense_name = "dense",
                        data_reader_sparse_param_array = 
                        # the total number of slots should be equal to data_generator_params.num_slot
                        [hugectr.DataReaderSparseParam("wide_data", 2, True, 1),
                        hugectr.DataReaderSparseParam("deep_data", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 69,
                            embedding_vec_size = 1,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding2",
                            bottom_name = "wide_data",
                            optimizer = optimizer))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash, 
                            workspace_size_per_gpu_in_mb = 1074,
                            embedding_vec_size = 16,
                            combiner = "sum",
                            sparse_embedding_name = "sparse_embedding1",
                            bottom_name = "deep_data",
                            optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding1"],
                            top_names = ["reshape1"],
                            leading_dim=416))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Reshape,
                            bottom_names = ["sparse_embedding2"],
                            top_names = ["reshape2"],
                            leading_dim=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Concat,
                            bottom_names = ["reshape1", "dense"],
                            top_names = ["concat1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["concat1"],
                            top_names = ["fc1"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc1"],
                            top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu1"],
                            top_names = ["dropout1"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout1"],
                            top_names = ["fc2"],
                            num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
                            bottom_names = ["fc2"],
                            top_names = ["relu2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Dropout,
                            bottom_names = ["relu2"],
                            top_names = ["dropout2"],
                            dropout_rate=0.5))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
                            bottom_names = ["dropout2"],
                            top_names = ["fc3"],
                            num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Add,
                            bottom_names = ["fc3", "reshape2"],
                            top_names = ["add1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
                            bottom_names = ["add1", "label"],
                            top_names = ["loss"]))
model.compile()
model.summary()

model.load_dense_weights('/model/wdl/_dense_1000.model')
model.load_dense_optimizer_states('/model/wdl/_opt_dense_1000.model')
model.load_sparse_weights(['/model/wdl/0_sparse_1000.model', '/model/wdl/1_sparse_1000.model'])
model.load_sparse_optimizer_states(['/model/wdl/0_opt_sparse_1000.model', '/model/wdl/1_opt_sparse_1000.model'])

model.fit(max_iter = 1020, display = 200, eval_interval = 500, snapshot = 1000, snapshot_prefix = "/model/wdl/")
Overwriting train_with_hdfs.py
!python train_with_hdfs.py
HugeCTR Version: 3.3
====================================================Model Init=====================================================
[HCTR][09:00:54][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][09:00:54][INFO][RK0][main]: Global seed is 1285686508
[HCTR][09:00:55][INFO][RK0][main]: Device to NUMA mapping:
  GPU 0 ->  node 0
[HCTR][09:00:56][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][09:00:56][INFO][RK0][main]: Start all2all warmup
[HCTR][09:00:56][INFO][RK0][main]: End all2all warmup
[HCTR][09:00:56][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][09:00:56][INFO][RK0][main]: Device 0: Tesla V100-PCIE-32GB
[HCTR][09:00:56][INFO][RK0][main]: num of DataReader workers: 12
[HCTR][09:00:56][INFO][RK0][main]: max_vocabulary_size_per_gpu_=6029312
[HCTR][09:00:56][INFO][RK0][main]: max_vocabulary_size_per_gpu_=5865472
[HCTR][09:00:56][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][09:01:00][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][09:01:00][INFO][RK0][main]: gpu0 init embedding done
[HCTR][09:01:00][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][09:01:00][INFO][RK0][main]: gpu0 init embedding done
[HCTR][09:01:00][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][09:01:00][INFO][RK0][main]: Warm-up done
[HCTR][09:01:00][INFO][RK0][main]: ===================================================Model Summary===================================================
label                                   Dense                         Sparse                        
label                                   dense                          wide_data,deep_data           
(None, 1)                               (None, 13)                              
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type                              Input Name                    Output Name                   Output Shape                  
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
LocalizedSlotSparseEmbeddingHash        wide_data                     sparse_embedding2             (None, 1, 1)                  
------------------------------------------------------------------------------------------------------------------
LocalizedSlotSparseEmbeddingHash        deep_data                     sparse_embedding1             (None, 26, 16)                
------------------------------------------------------------------------------------------------------------------
Reshape                                 sparse_embedding1             reshape1                      (None, 416)                   
------------------------------------------------------------------------------------------------------------------
Reshape                                 sparse_embedding2             reshape2                      (None, 1)                     
------------------------------------------------------------------------------------------------------------------
Concat                                  reshape1                      concat1                       (None, 429)                   
                                        dense                                                                                     
------------------------------------------------------------------------------------------------------------------
InnerProduct                            concat1                       fc1                           (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
ReLU                                    fc1                           relu1                         (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
Dropout                                 relu1                         dropout1                      (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
InnerProduct                            dropout1                      fc2                           (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
ReLU                                    fc2                           relu2                         (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
Dropout                                 relu2                         dropout2                      (None, 1024)                  
------------------------------------------------------------------------------------------------------------------
InnerProduct                            dropout2                      fc3                           (None, 1)                     
------------------------------------------------------------------------------------------------------------------
Add                                     fc3                           add1                          (None, 1)                     
                                        reshape2                                                                                  
------------------------------------------------------------------------------------------------------------------
BinaryCrossEntropyLoss                  add1                          loss                                                        
                                        label                                                                                     
------------------------------------------------------------------------------------------------------------------
2022-02-23 09:01:00,548 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[HDFS][INFO]: Read file /model/wdl/_dense_1000.model successfully!
[HDFS][INFO]: Read file /model/wdl/_opt_dense_1000.model successfully!
[HCTR][09:01:01][INFO][RK0][main]: Loading dense opt states: 
[HCTR][09:01:01][INFO][RK0][main]: Loading sparse model: /model/wdl/0_sparse_1000.model
[HDFS][INFO]: Read file /model/wdl/0_sparse_1000.model/key successfully!
[HDFS][INFO]: Read file /model/wdl/0_sparse_1000.model/slot_id successfully!
[HDFS][INFO]: Read file /model/wdl/0_sparse_1000.model/emb_vector successfully!
[HCTR][09:01:01][INFO][RK0][main]: Start to upload embedding table file to GPUs, total loop_num: 128
[HCTR][09:01:01][INFO][RK0][main]: Done
[HCTR][09:01:01][INFO][RK0][main]: Loading sparse model: /model/wdl/1_sparse_1000.model
[HDFS][INFO]: Read file /model/wdl/1_sparse_1000.model/key successfully!
[HDFS][INFO]: Read file /model/wdl/1_sparse_1000.model/slot_id successfully!
[HDFS][INFO]: Read file /model/wdl/1_sparse_1000.model/emb_vector successfully!
[HCTR][09:01:01][INFO][RK0][main]: Start to upload embedding table file to GPUs, total loop_num: 518
[HCTR][09:01:01][INFO][RK0][main]: Done
[HCTR][09:01:01][INFO][RK0][main]: Loading sparse optimizer states: /model/wdl/0_opt_sparse_1000.model
[HCTR][09:01:01][INFO][RK0][main]: Rank0: Read optimzer state from file
[HDFS][INFO]: Read file /model/wdl/0_opt_sparse_1000.model successfully!
[HCTR][09:01:01][INFO][RK0][main]: Done
[HCTR][09:01:01][INFO][RK0][main]: Rank0: Read optimzer state from file
[HDFS][INFO]: Read file /model/wdl/0_opt_sparse_1000.model successfully!
[HCTR][09:01:01][INFO][RK0][main]: Done
[HCTR][09:01:01][INFO][RK0][main]: Loading sparse optimizer states: /model/wdl/1_opt_sparse_1000.model
[HCTR][09:01:01][INFO][RK0][main]: Rank0: Read optimzer state from file
[HDFS][INFO]: Read file /model/wdl/1_opt_sparse_1000.model successfully!
[HCTR][09:01:02][INFO][RK0][main]: Done
[HCTR][09:01:02][INFO][RK0][main]: Rank0: Read optimzer state from file
[HDFS][INFO]: Read file /model/wdl/1_opt_sparse_1000.model successfully!
[HCTR][09:01:02][INFO][RK0][main]: Done
=====================================================Model Fit=====================================================
[HCTR][09:01:02][INFO][RK0][main]: Use non-epoch mode with number of iterations: 1020
[HCTR][09:01:02][INFO][RK0][main]: Training batchsize: 1024, evaluation batchsize: 1024
[HCTR][09:01:02][INFO][RK0][main]: Evaluation interval: 500, snapshot interval: 1000
[HCTR][09:01:02][INFO][RK0][main]: Dense network trainable: True
[HCTR][09:01:02][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][09:01:02][INFO][RK0][main]: Sparse embedding sparse_embedding2 trainable: True
[HCTR][09:01:02][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][09:01:02][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][09:01:02][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][09:01:02][INFO][RK0][main]: Training source file: ./wdl_norm/file_list.txt
[HCTR][09:01:02][INFO][RK0][main]: Evaluation source file: ./wdl_norm/file_list_test.txt
[HCTR][09:01:04][INFO][RK0][main]: Iter: 200 Time(200 iters): 1.12465s Loss: 0.632464 lr:0.001
[HCTR][09:01:05][INFO][RK0][main]: Iter: 400 Time(200 iters): 1.03567s Loss: 0.612515 lr:0.001
[HCTR][09:01:06][INFO][RK0][main]: Evaluation, AUC: 0.499877
[HCTR][09:01:06][INFO][RK0][main]: Eval Time for 1280 iters: 0.647875s
[HCTR][09:01:06][INFO][RK0][main]: Iter: 600 Time(200 iters): 1.68717s Loss: 0.625102 lr:0.001
[HCTR][09:01:07][INFO][RK0][main]: Iter: 800 Time(200 iters): 1.03752s Loss: 0.608092 lr:0.001
[HCTR][09:01:08][INFO][RK0][main]: Iter: 1000 Time(200 iters): 1.03691s Loss: 0.688194 lr:0.001
[HCTR][09:01:09][INFO][RK0][main]: Evaluation, AUC: 0.500383
[HCTR][09:01:09][INFO][RK0][main]: Eval Time for 1280 iters: 0.650671s
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Dump hash table from GPU0
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Write hash table <key,value> pairs to file
[HDFS][INFO]: Write to HDFS /model/wdl/0_sparse_1000.model/key successfully!
[HDFS][INFO]: Write to HDFS /model/wdl/0_sparse_1000.model/slot_id successfully!
[HDFS][INFO]: Write to HDFS /model/wdl/0_sparse_1000.model/emb_vector successfully!
[HCTR][09:01:09][INFO][RK0][main]: Done
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Dump hash table from GPU0
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Write hash table <key,value> pairs to file
[HDFS][INFO]: Write to HDFS /model/wdl/1_sparse_1000.model/key successfully!
[HDFS][INFO]: Write to HDFS /model/wdl/1_sparse_1000.model/slot_id successfully!
[HDFS][INFO]: Write to HDFS /model/wdl/1_sparse_1000.model/emb_vector successfully!
[HCTR][09:01:09][INFO][RK0][main]: Done
[HCTR][09:01:09][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Write optimzer state to file
[HDFS][INFO]: Write to HDFS /model/wdl/0_opt_sparse_1000.model successfully!
[HCTR][09:01:09][INFO][RK0][main]: Done
[HCTR][09:01:09][INFO][RK0][main]: Rank0: Write optimzer state to file
[HDFS][INFO]: Write to HDFS /model/wdl/0_opt_sparse_1000.model successfully!
[HCTR][09:01:10][INFO][RK0][main]: Done
[HCTR][09:01:10][INFO][RK0][main]: Rank0: Write optimzer state to file
[HDFS][INFO]: Write to HDFS /model/wdl/1_opt_sparse_1000.model successfully!
[HCTR][09:01:11][INFO][RK0][main]: Done
[HCTR][09:01:11][INFO][RK0][main]: Rank0: Write optimzer state to file
[HDFS][INFO]: Write to HDFS /model/wdl/1_opt_sparse_1000.model successfully!
[HCTR][09:01:12][INFO][RK0][main]: Done
[HCTR][09:01:12][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HDFS][INFO]: Write to HDFS /model/wdl/_dense_1000.model successfully!
[HCTR][09:01:12][INFO][RK0][main]: Dumping dense weights to HDFS, successful
[HDFS][INFO]: Write to HDFS /model/wdl/_opt_dense_1000.model successfully!
[HCTR][09:01:12][INFO][RK0][main]: Dumping dense optimizer states to HDFS, successful
[HCTR][09:01:12][INFO][RK0][main]: Finish 1020 iterations with batchsize: 1024 in 9.82s.
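
After training finishes, the snapshot dumped at iteration 1000 can be inspected directly in HDFS, for example by listing the model directory (a sketch using the hdfs CLI):

import subprocess

# Recursively list the HDFS model directory to confirm the dumped dense and
# sparse weights and optimizer states from the snapshot at iteration 1000.
subprocess.run(['/usr/local/hadoop/bin/hdfs', 'dfs', '-ls', '-R', '/model/wdl'], check=True)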