HugeCTR training with Remote File System example
Overview
HugeCTR supports reading Parquet data from, and loading and saving models to, remote file systems such as HDFS and AWS S3. You can train directly on data stored in these file systems and, after training, dump the trained parameters and optimizer states back to them. This notebook demonstrates the end-to-end procedure of training with HDFS and AWS S3.
Get HugeCTR from NGC
The HugeCTR Python module is preinstalled in the 22.10 and later Merlin Training Container: nvcr.io/nvidia/merlin/merlin-hugectr:22.10.
You can check the existence of required libraries by running the following Python code after launching the container.
$ python3 -c "import hugectr"
If you prefer to build HugeCTR from the source code instead of using the NGC container, refer to the How to Start Your Development documentation.
Training with HDFS Example
Hadoop is not pre-installed in the Merlin Training Container. To help you build and install HDFS, we provide build and install scripts here. Please build and install Hadoop using these scripts, then make sure Hadoop is available in your container by running the following:
!hadoop version
Hadoop 3.3.2
Source code repository https://github.com/apache/hadoop.git -r 0bcb014209e219273cb6fd4152df7df713cbac61
Compiled by root on 2022-07-25T09:53Z
Compiled with protoc 3.7.1
From source with checksum 4b40fff8bb27201ba07b6fa5651217fb
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.2.jar
Data Preparation
Users can use DataSourceParams to set up the file system configuration. Currently, we support Local, HDFS, and S3.
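For example, a minimal HDFS configuration mirroring the one used in the training script later in this notebook looks like the code below; replace the namenode IP and port with those of your own cluster:

import hugectr
from hugectr.data import DataSourceParams

# File system configuration used by the Parquet reader and the model dumper.
# The namenode address below is the one used throughout this notebook.
data_source_params = DataSourceParams(
    source = hugectr.DataSourceType_t.HDFS,  # use HDFS
    server = '10.19.172.76',                 # your HDFS namenode IP
    port = 9000,                             # your HDFS namenode port
)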
Firstly, we want to make sure that we have train and validation datasets ready:
!hdfs dfs -ls hdfs://10.19.172.76:9000/dlrm_parquet/train
Found 8 items
-rw-r--r-- 1 root supergroup 112247365 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_0.parquet
-rw-r--r-- 1 root supergroup 112243637 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_1.parquet
-rw-r--r-- 1 root supergroup 112251207 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_2.parquet
-rw-r--r-- 1 root supergroup 112241764 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_3.parquet
-rw-r--r-- 1 root supergroup 112247838 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_4.parquet
-rw-r--r-- 1 root supergroup 112244076 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_5.parquet
-rw-r--r-- 1 root supergroup 112253553 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_6.parquet
-rw-r--r-- 1 root supergroup 112249557 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_7.parquet
!hdfs dfs -ls hdfs://10.19.172.76:9000/dlrm_parquet/val
Found 2 items
-rw-r--r-- 1 root supergroup 112239093 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_0.parquet
-rw-r--r-- 1 root supergroup 112249156 2022-07-27 06:19 hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_1.parquet
Secondly, create file_list.txt and file_list_test.txt:
!mkdir /dlrm_parquet
!mkdir /dlrm_parquet/train
!mkdir /dlrm_parquet/val
%%writefile /dlrm_parquet/file_list.txt
8
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_0.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_1.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_2.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_3.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_4.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_5.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_6.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_7.parquet
Overwriting /dlrm_parquet/file_list.txt
%%writefile /dlrm_parquet/file_list_test.txt
2
hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_0.parquet
hdfs://10.19.172.76:9000/dlrm_parquet/val/gen_1.parquet
Overwriting /dlrm_parquet/file_list_test.txt
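The file lists above use a simple format: the first line is the number of files, and each subsequent line is one file path. If you have many Parquet files, you can also generate the lists with a short Python helper; a minimal sketch (the helper name is illustrative, and the paths are the ones used in this example):

# Hypothetical helper: writes a HugeCTR file list
# (file count on the first line, then one file path per line).
def write_file_list(list_path, file_paths):
    with open(list_path, "w") as f:
        f.write(f"{len(file_paths)}\n")
        f.write("\n".join(file_paths) + "\n")

train_files = [
    f"hdfs://10.19.172.76:9000/dlrm_parquet/train/gen_{i}.parquet" for i in range(8)
]
write_file_list("/dlrm_parquet/file_list.txt", train_files)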
Lastly, create _metadata.json for both the train and validation datasets to specify the feature information of your dataset:
%%writefile /dlrm_parquet/train/_metadata.json
{ "file_stats": [{"file_name": "./dlrm_parquet/train/gen_0.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_1.parquet", "num_rows":1000000},
{"file_name": "./dlrm_parquet/train/gen_2.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_3.parquet", "num_rows":1000000},
{"file_name": "./dlrm_parquet/train/gen_4.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_5.parquet", "num_rows":1000000},
{"file_name": "./dlrm_parquet/train/gen_6.parquet", "num_rows":1000000}, {"file_name": "./dlrm_parquet/train/gen_7.parquet", "num_rows":1000000} ],
"labels": [{"col_name": "label0", "index":0} ],
"conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3},
{"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6},
{"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9},
{"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12},
{"col_name": "C13", "index":13} ],
"cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16},
{"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, {"col_name": "C19", "index":19},
{"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22},
{"col_name": "C23", "index":23}, {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25},
{"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28},
{"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31},
{"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, {"col_name": "C34", "index":34},
{"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37},
{"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }
Writing /dlrm_parquet/train/_metadata.json
%%writefile /dlrm_parquet/val/_metadata.json
{ "file_stats": [{"file_name": "./dlrm_parquet/val/gen_0.parquet", "num_rows":1000000},
{"file_name": "./dlrm_parquet/val/gen_1.parquet", "num_rows":1000000} ],
"labels": [{"col_name": "label0", "index":0} ],
"conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3},
{"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6},
{"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9},
{"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12},
{"col_name": "C13", "index":13} ],
"cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16},
{"col_name": "C17", "index":17}, {"col_name": "C18", "index":18}, {"col_name": "C19", "index":19},
{"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22},
{"col_name": "C23", "index":23}, {"col_name": "C24", "index":24}, {"col_name": "C25", "index":25},
{"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28},
{"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31},
{"col_name": "C32", "index":32}, {"col_name": "C33", "index":33}, {"col_name": "C34", "index":34},
{"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37},
{"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }
Writing /dlrm_parquet/val/_metadata.json
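Writing these JSON files by hand is error prone; the same content can also be generated with a few lines of Python. A minimal sketch, assuming the column layout used in this example (1 label, 13 continuous columns C1-C13, 26 categorical columns C14-C39):

import json

# Train-set metadata for the layout shown above; adjust the file names,
# row counts, and column lists for your own dataset.
metadata = {
    "file_stats": [
        {"file_name": f"./dlrm_parquet/train/gen_{i}.parquet", "num_rows": 1000000}
        for i in range(8)
    ],
    "labels": [{"col_name": "label0", "index": 0}],
    "conts": [{"col_name": f"C{i}", "index": i} for i in range(1, 14)],
    "cats": [{"col_name": f"C{i}", "index": i} for i in range(14, 40)],
}
with open("/dlrm_parquet/train/_metadata.json", "w") as f:
    json.dump(metadata, f)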
Training a DLRM model
%%writefile train_with_hdfs.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSourceParams
# Create a file system configuration
data_source_params = DataSourceParams(
source = hugectr.DataSourceType_t.HDFS, #use HDFS
server = '10.19.172.76', #your HDFS namenode IP
port = 9000, #your HDFS namenode port
)
# DLRM train
solver = hugectr.CreateSolver(max_eval_batches = 1280,
batchsize_eval = 1024,
batchsize = 1024,
lr = 0.01,
vvgpu = [[1]],
i64_input_key = True,
use_mixed_precision = False,
repeat_dataset = True,
use_cuda_graph = False)
reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
source = ["/dlrm_parquet/file_list.txt"],
eval_source = "/dlrm_parquet/file_list_test.txt",
slot_size_array = [405274, 72550, 55008, 222734, 316071, 156265, 220243, 200179, 234566, 335625, 278726, 263070, 312542, 203773, 145859, 117421, 78140, 3648, 156308, 94562, 357703, 386976, 238046, 230917, 292, 156382],
data_source_params = data_source_params, #file system config for data reading
check_type = hugectr.Check_t.Non)
optimizer = hugectr.CreateOptimizer(optimizer_type = hugectr.Optimizer_t.SGD,
update_type = hugectr.Update_t.Local,
atomic_update = True)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim = 1, label_name = "label",
dense_dim = 13, dense_name = "dense",
data_reader_sparse_param_array =
[hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(embedding_type = hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
workspace_size_per_gpu_in_mb = 10720,
embedding_vec_size = 128,
combiner = "sum",
sparse_embedding_name = "sparse_embedding1",
bottom_name = "data1",
optimizer = optimizer))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["dense"],
top_names = ["fc1"],
num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc1"],
top_names = ["relu1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu1"],
top_names = ["fc2"],
num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc2"],
top_names = ["relu2"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu2"],
top_names = ["fc3"],
num_output=128))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc3"],
top_names = ["relu3"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.Interaction,
bottom_names = ["relu3","sparse_embedding1"],
top_names = ["interaction1"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["interaction1"],
top_names = ["fc4"],
num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc4"],
top_names = ["relu4"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu4"],
top_names = ["fc5"],
num_output=1024))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc5"],
top_names = ["relu5"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu5"],
top_names = ["fc6"],
num_output=512))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc6"],
top_names = ["relu6"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu6"],
top_names = ["fc7"],
num_output=256))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.ReLU,
bottom_names = ["fc7"],
top_names = ["relu7"]))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.InnerProduct,
bottom_names = ["relu7"],
top_names = ["fc8"],
num_output=1))
model.add(hugectr.DenseLayer(layer_type = hugectr.Layer_t.BinaryCrossEntropyLoss,
bottom_names = ["fc8", "label"],
top_names = ["loss"]))
model.compile()
model.summary()
model.fit(max_iter = 2020, display = 200, eval_interval = 1000, snapshot = 2000, snapshot_prefix = "hdfs://10.19.172.76:9000/model/dlrm/")
Overwriting train_with_hdfs.py
!python train_with_hdfs.py
HugeCTR Version: 3.8
====================================================Model Init=====================================================
[HCTR][07:51:52.502][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][07:51:52.502][INFO][RK0][main]: Global seed is 3218787045
[HCTR][07:51:52.505][INFO][RK0][main]: Device to NUMA mapping:
GPU 1 -> node 0
[HCTR][07:51:55.607][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][07:51:55.607][INFO][RK0][main]: Start all2all warmup
[HCTR][07:51:55.609][INFO][RK0][main]: End all2all warmup
[HCTR][07:51:56.529][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][07:51:56.530][INFO][RK0][main]: Device 1: NVIDIA A10
[HCTR][07:51:56.531][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][07:51:56.531][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][07:51:57.695][INFO][RK0][main]: Using Hadoop Cluster 10.19.172.76:9000
[HCTR][07:51:57.740][INFO][RK0][main]: Using Hadoop Cluster 10.19.172.76:9000
[HCTR][07:51:57.740][INFO][RK0][main]: Vocabulary size: 5242880
[HCTR][07:51:57.741][INFO][RK0][main]: max_vocabulary_size_per_gpu_=21954560
[HCTR][07:51:57.755][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][07:52:04.336][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][07:52:04.411][INFO][RK0][main]: gpu0 init embedding done
[HCTR][07:52:04.413][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][07:52:04.415][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][07:52:04.415][INFO][RK0][main]: label Dense Sparse
label dense data1
(None, 1) (None, 13)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type Input Name Output Name Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash data1 sparse_embedding1 (None, 26, 128)
------------------------------------------------------------------------------------------------------------------
InnerProduct dense fc1 (None, 512)
------------------------------------------------------------------------------------------------------------------
ReLU fc1 relu1 (None, 512)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu1 fc2 (None, 256)
------------------------------------------------------------------------------------------------------------------
ReLU fc2 relu2 (None, 256)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu2 fc3 (None, 128)
------------------------------------------------------------------------------------------------------------------
ReLU fc3 relu3 (None, 128)
------------------------------------------------------------------------------------------------------------------
Interaction relu3 interaction1 (None, 480)
sparse_embedding1
------------------------------------------------------------------------------------------------------------------
InnerProduct interaction1 fc4 (None, 1024)
------------------------------------------------------------------------------------------------------------------
ReLU fc4 relu4 (None, 1024)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu4 fc5 (None, 1024)
------------------------------------------------------------------------------------------------------------------
ReLU fc5 relu5 (None, 1024)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu5 fc6 (None, 512)
------------------------------------------------------------------------------------------------------------------
ReLU fc6 relu6 (None, 512)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu6 fc7 (None, 256)
------------------------------------------------------------------------------------------------------------------
ReLU fc7 relu7 (None, 256)
------------------------------------------------------------------------------------------------------------------
InnerProduct relu7 fc8 (None, 1)
------------------------------------------------------------------------------------------------------------------
BinaryCrossEntropyLoss fc8 loss
label
------------------------------------------------------------------------------------------------------------------
=====================================================Model Fit=====================================================
[HCTR][07:52:04.415][INFO][RK0][main]: Use non-epoch mode with number of iterations: 2020
[HCTR][07:52:04.415][INFO][RK0][main]: Training batchsize: 1024, evaluation batchsize: 1024
[HCTR][07:52:04.415][INFO][RK0][main]: Evaluation interval: 1000, snapshot interval: 2000
[HCTR][07:52:04.415][INFO][RK0][main]: Dense network trainable: True
[HCTR][07:52:04.415][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][07:52:04.415][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: False
[HCTR][07:52:04.415][INFO][RK0][main]: lr: 0.010000, warmup_steps: 1, end_lr: 0.000000
[HCTR][07:52:04.415][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][07:52:04.415][INFO][RK0][main]: Training source file: /dlrm_parquet/file_list.txt
[HCTR][07:52:04.415][INFO][RK0][main]: Evaluation source file: /dlrm_parquet/file_list_test.txt
[HCTR][07:52:05.134][INFO][RK0][main]: Iter: 200 Time(200 iters): 0.716815s Loss: 0.69327 lr:0.01
[HCTR][07:52:05.856][INFO][RK0][main]: Iter: 400 Time(200 iters): 0.719486s Loss: 0.693207 lr:0.01
[HCTR][07:52:06.608][INFO][RK0][main]: Iter: 600 Time(200 iters): 0.750294s Loss: 0.693568 lr:0.01
[HCTR][07:52:07.331][INFO][RK0][main]: Iter: 800 Time(200 iters): 0.721128s Loss: 0.693352 lr:0.01
[HCTR][07:52:09.118][INFO][RK0][main]: Iter: 1000 Time(200 iters): 1.78435s Loss: 0.693352 lr:0.01
[HCTR][07:52:11.667][INFO][RK0][main]: Evaluation, AUC: 0.499891
[HCTR][07:52:11.668][INFO][RK0][main]: Eval Time for 1280 iters: 2.5486s
[HCTR][07:52:12.393][INFO][RK0][main]: Iter: 1200 Time(200 iters): 3.2728s Loss: 0.693178 lr:0.01
[HCTR][07:52:13.116][INFO][RK0][main]: Iter: 1400 Time(200 iters): 0.720984s Loss: 0.693292 lr:0.01
[HCTR][07:52:13.875][INFO][RK0][main]: Iter: 1600 Time(200 iters): 0.756448s Loss: 0.693053 lr:0.01
[HCTR][07:52:14.603][INFO][RK0][main]: Iter: 1800 Time(200 iters): 0.725832s Loss: 0.693433 lr:0.01
[HCTR][07:52:16.382][INFO][RK0][main]: Iter: 2000 Time(200 iters): 1.77763s Loss: 0.693193 lr:0.01
[HCTR][07:52:18.959][INFO][RK0][main]: Evaluation, AUC: 0.500092
[HCTR][07:52:18.959][INFO][RK0][main]: Eval Time for 1280 iters: 2.57548s
[HCTR][07:52:19.575][INFO][RK0][main]: Rank0: Write hash table to file
[HDFS][INFO]: Write to HDFS /model/dlrm/0_sparse_2000.model/key successfully!
[HDFS][INFO]: Write to HDFS /model/dlrm/0_sparse_2000.model/emb_vector successfully!
[HCTR][07:52:31.132][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][07:52:31.132][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HDFS][INFO]: Write to HDFS /model/dlrm/_dense_2000.model successfully!
[HCTR][07:52:31.307][INFO][RK0][main]: Dumping dense weights to HDFS, successful
[HDFS][INFO]: Write to HDFS /model/dlrm/_opt_dense_2000.model successfully!
[HCTR][07:52:31.365][INFO][RK0][main]: Dumping dense optimizer states to HDFS, successful
[HCTR][07:52:31.430][INFO][RK0][main]: Finish 2020 iterations with batchsize: 1024 in 27.02s.
Check that our model files are saved in HDFS:
!hdfs dfs -ls hdfs://10.19.172.76:9000/model/dlrm
Found 3 items
drwxr-xr-x - root supergroup 0 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/0_sparse_2000.model
-rw-r--r-- 3 root supergroup 9479684 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/_dense_2000.model
-rw-r--r-- 3 root supergroup 0 2022-07-27 07:52 hdfs://10.19.172.76:9000/model/dlrm/_opt_dense_2000.model
Training a DCN model with AWS S3
Data preparation
Create file_list.txt and file_list_test.txt:
!mkdir -p /hugectr-io-test/data/dcn_parquet/train
!mkdir -p /hugectr-io-test/data/dcn_parquet/val
%%writefile /hugectr-io-test/data/dcn_parquet/file_list.txt
16
s3://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet
s3://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet
Overwriting /hugectr-io-test/data/dcn_parquet/file_list.txt
%%writefile /hugectr-io-test/data/dcn_parquet/file_list_test.txt
4
s3://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet
s3://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet
Overwriting /hugectr-io-test/data/dcn_parquet/file_list_test.txt
%%writefile /hugectr-io-test/data/dcn_parquet/train/_metadata.json
{ "file_stats": [{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_0.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_1.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_2.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_3.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_4.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_5.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_6.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_7.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_8.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_9.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_10.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_11.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_12.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_13.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_14.parquet", "num_rows":40960}, {"file_name": "s3://hugectr-io-test/data/dcn_parquet/train/gen_15.parquet", "num_rows":40960}],
"labels": [{"col_name": "label0", "index":0} ],
"conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6},
{"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12},
{"col_name": "C13", "index":13} ],
"cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18},
{"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23},
{"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28},
{"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33},
{"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }
Overwriting /hugectr-io-test/data/dcn_parquet/train/_metadata.json
%%writefile /hugectr-io-test/data/dcn_parquet/val/_metadata.json
{ "file_stats": [{"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_0.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_1.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_2.parquet", "num_rows":40960},
{"file_name": "s3://hugectr-io-test/data/dcn_parquet/val/gen_3.parquet", "num_rows":40960}],
"labels": [{"col_name": "label0", "index":0} ],
"conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6},
{"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10}, {"col_name": "C11", "index":11}, {"col_name": "C12", "index":12},
{"col_name": "C13", "index":13} ],
"cats": [{"col_name": "C14", "index":14}, {"col_name": "C15", "index":15}, {"col_name": "C16", "index":16}, {"col_name": "C17", "index":17}, {"col_name": "C18", "index":18},
{"col_name": "C19", "index":19}, {"col_name": "C20", "index":20}, {"col_name": "C21", "index":21}, {"col_name": "C22", "index":22}, {"col_name": "C23", "index":23},
{"col_name": "C24", "index":24}, {"col_name": "C25", "index":25}, {"col_name": "C26", "index":26}, {"col_name": "C27", "index":27}, {"col_name": "C28", "index":28},
{"col_name": "C29", "index":29}, {"col_name": "C30", "index":30}, {"col_name": "C31", "index":31}, {"col_name": "C32", "index":32}, {"col_name": "C33", "index":33},
{"col_name": "C34", "index":34}, {"col_name": "C35", "index":35}, {"col_name": "C36", "index":36}, {"col_name": "C37", "index":37}, {"col_name": "C38", "index":38}, {"col_name": "C39", "index":39} ] }
Overwriting /hugectr-io-test/data/dcn_parquet/val/_metadata.json
Training
%%writefile train_with_s3.py
import hugectr
from mpi4py import MPI
from hugectr.data import DataSourceParams
# Create a file system configuration for data reading
data_source_params = DataSourceParams(
source = hugectr.FileSystemType_t.S3, #use AWS S3
server = 'us-east-1', #your AWS region
port = 9000, #will be ignored
)
solver = hugectr.CreateSolver(
max_eval_batches=1280,
batchsize_eval=1024,
batchsize=1024,
lr=0.001,
vvgpu=[[0]],
i64_input_key=True,
repeat_dataset=False,
)
reader = hugectr.DataReaderParams(
data_reader_type=hugectr.DataReaderType_t.Parquet,
source=["/hugectr-io-test/data/dcn_parquet/file_list.txt"],
eval_source="/hugectr-io-test/data/dcn_parquet/file_list_test.txt",
slot_size_array=[39884,39043,17289,7420,20263,3,7120,1543,39884,39043,17289,7420,20263,3,7120,1543,63,63,39884,39043,17289,7420,20263,3,7120,1543],
data_source_params=data_source_params, # Using the S3 configurations
check_type=hugectr.Check_t.Non,
)
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.SGD)
model = hugectr.Model(solver, reader, optimizer)
model.add(
hugectr.Input(
label_dim=1,
label_name="label",
dense_dim=13,
dense_name="dense",
data_reader_sparse_param_array=[
hugectr.DataReaderSparseParam("data1", 1, True, 26)
],
)
)
model.add(
hugectr.SparseEmbedding(
embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
workspace_size_per_gpu_in_mb=150,
embedding_vec_size=16,
combiner="sum",
sparse_embedding_name="sparse_embedding1",
bottom_name="data1",
optimizer=optimizer,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Reshape,
bottom_names=["sparse_embedding1"],
top_names=["reshape1"],
leading_dim=416,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Slice,
bottom_names=["concat1"],
top_names=["slice11", "slice12"],
ranges=[(0, 429), (0, 429)],
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.MultiCross,
bottom_names=["slice11"],
top_names=["multicross1"],
num_layers=6,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["slice12"],
top_names=["fc1"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu1"],
top_names=["dropout1"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Concat,
bottom_names=["dropout1", "multicross1"],
top_names=["concat2"],
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["concat2"],
top_names=["fc2"],
num_output=1,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
bottom_names=["fc2", "label"],
top_names=["loss"],
)
)
model.compile()
model.summary()
model.fit(num_epochs = 1, display = 100, eval_interval = 500)
model.save_params_to_files("https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/test")
Overwriting train_with_s3.py
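Before running the script, make sure the process can authenticate to the target S3 bucket. HugeCTR's S3 backend typically resolves credentials through the standard AWS mechanisms (environment variables, ~/.aws/credentials, or an attached IAM role). A minimal sketch using environment variables, with placeholder key values; commands run via `!` inherit the notebook's environment:

import os

# Placeholder credentials: replace with keys that can access your bucket,
# or rely on ~/.aws/credentials / an instance role instead.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # matches the region used in train_with_s3.py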
!python train_with_s3.py
HugeCTR Version: 4.0
====================================================Model Init=====================================================
[HCTR][10:20:27.878][WARNING][RK0][main]: The model name is not specified when creating the solver.
[HCTR][10:20:27.878][INFO][RK0][main]: Global seed is 1453804877
[HCTR][10:20:27.880][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][10:20:29.757][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][10:20:29.757][INFO][RK0][main]: Start all2all warmup
[HCTR][10:20:29.757][INFO][RK0][main]: End all2all warmup
[HCTR][10:20:29.757][INFO][RK0][main]: Using All-reduce algorithm: NCCL
[HCTR][10:20:29.759][INFO][RK0][main]: Device 0: Tesla V100-SXM2-32GB
[HCTR][10:20:29.759][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][10:20:29.759][INFO][RK0][main]: num of DataReader workers for eval: 1
[HCTR][10:20:29.760][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:31.802][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:33.806][INFO][RK0][main]: Vocabulary size: 397821
[HCTR][10:20:33.807][INFO][RK0][main]: max_vocabulary_size_per_gpu_=2457600
[HCTR][10:20:33.810][INFO][RK0][main]: Graph analysis to resolve tensor dependency
===================================================Model Compile===================================================
[HCTR][10:20:35.435][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][10:20:35.436][INFO][RK0][main]: gpu0 init embedding done
[HCTR][10:20:35.437][INFO][RK0][main]: Starting AUC NCCL warm-up
[HCTR][10:20:35.439][INFO][RK0][main]: Warm-up done
===================================================Model Summary===================================================
[HCTR][10:20:35.440][INFO][RK0][main]: label Dense Sparse
label dense data1
(1024,1) (1024,13)
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
Layer Type Input Name Output Name Output Shape
——————————————————————————————————————————————————————————————————————————————————————————————————————————————————
DistributedSlotSparseEmbeddingHash data1 sparse_embedding1 (1024,26,16)
------------------------------------------------------------------------------------------------------------------
Reshape sparse_embedding1 reshape1 (1024,416)
------------------------------------------------------------------------------------------------------------------
Concat reshape1 concat1 (1024,429)
dense
------------------------------------------------------------------------------------------------------------------
Slice concat1 slice11 (1024,429)
slice12 (1024,429)
------------------------------------------------------------------------------------------------------------------
MultiCross slice11 multicross1 (1024,429)
------------------------------------------------------------------------------------------------------------------
InnerProduct slice12 fc1 (1024,1024)
------------------------------------------------------------------------------------------------------------------
ReLU fc1 relu1 (1024,1024)
------------------------------------------------------------------------------------------------------------------
Dropout relu1 dropout1 (1024,1024)
------------------------------------------------------------------------------------------------------------------
Concat dropout1 concat2 (1024,1453)
multicross1
------------------------------------------------------------------------------------------------------------------
InnerProduct concat2 fc2 (1024,1)
------------------------------------------------------------------------------------------------------------------
BinaryCrossEntropyLoss fc2 loss
label
------------------------------------------------------------------------------------------------------------------
=====================================================Model Fit=====================================================
[HCTR][10:20:35.440][INFO][RK0][main]: Use epoch mode with number of epochs: 1
[HCTR][10:20:35.440][INFO][RK0][main]: Training batchsize: 1024, evaluation batchsize: 1024
[HCTR][10:20:35.440][INFO][RK0][main]: Evaluation interval: 500, snapshot interval: 10000
[HCTR][10:20:35.440][INFO][RK0][main]: Dense network trainable: True
[HCTR][10:20:35.440][INFO][RK0][main]: Sparse embedding sparse_embedding1 trainable: True
[HCTR][10:20:35.440][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: True
[HCTR][10:20:35.440][INFO][RK0][main]: lr: 0.001000, warmup_steps: 1, end_lr: 0.000000
[HCTR][10:20:35.440][INFO][RK0][main]: decay_start: 0, decay_steps: 1, decay_power: 2.000000
[HCTR][10:20:35.440][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:37.444][INFO][RK0][main]: Training source file: /hugectr-io-test/data/dcn_parquet/file_list.txt
[HCTR][10:20:37.444][INFO][RK0][main]: Evaluation source file: /hugectr-io-test/data/dcn_parquet/file_list_test.txt
[HCTR][10:20:37.444][INFO][RK0][main]: -----------------------------------Epoch 0-----------------------------------
[HCTR][10:20:37.444][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:41.825][INFO][RK0][main]: Iter: 100 Time(100 iters): 6.38401s Loss: 0.705401 lr:0.001
[HCTR][10:20:43.615][INFO][RK0][main]: Iter: 200 Time(100 iters): 1.78939s Loss: 0.696282 lr:0.001
[HCTR][10:20:44.823][INFO][RK0][main]: Iter: 300 Time(100 iters): 1.20686s Loss: 0.694805 lr:0.001
[HCTR][10:20:46.391][INFO][RK0][main]: Iter: 400 Time(100 iters): 1.56753s Loss: 0.697866 lr:0.001
[HCTR][10:20:47.468][INFO][RK0][main]: Iter: 500 Time(100 iters): 1.07658s Loss: 0.69365 lr:0.001
[HCTR][10:20:49.335][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:51.342][INFO][RK0][main]: Evaluation, AUC: 0.497726
[HCTR][10:20:51.342][INFO][RK0][main]: Eval Time for 1280 iters: 3.87204s
[HCTR][10:20:52.845][INFO][RK0][main]: Iter: 600 Time(100 iters): 5.37563s Loss: 0.695273 lr:0.001
[HCTR][10:20:52.898][INFO][RK0][main]: Finish 1 epochs 641 global iterations with batchsize 1024 in 17.46s.
[HCTR][10:20:52.914][INFO][RK0][main]: Rank0: Write hash table to file
[HCTR][10:20:52.914][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:20:56.138][DEBUG][RK0][main]: Successfully write to AWS S3 location: https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/test0_sparse_0.model/key
[HCTR][10:21:01.654][DEBUG][RK0][main]: Successfully write to AWS S3 location: https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/test0_sparse_0.model/emb_vector
[HCTR][10:21:01.663][INFO][RK0][main]: Dumping sparse weights to files, successful
[HCTR][10:21:01.663][INFO][RK0][main]: Dumping sparse optimzer states to files, successful
[HCTR][10:21:01.664][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:21:04.832][DEBUG][RK0][main]: Successfully write to AWS S3 location: https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/test_dense_0.model
[HCTR][10:21:04.834][INFO][RK0][main]: Dumping dense weights to file, successful
[HCTR][10:21:04.834][INFO][RK0][main]: Using S3 file system backend.
[HCTR][10:21:07.183][DEBUG][RK0][main]: Successfully write to AWS S3 location: https://s3.us-east-1.amazonaws.com/hugectr-io-test/pipeline_test/test_opt_dense_0.model
[HCTR][10:21:07.185][INFO][RK0][main]: Dumping dense optimizer states to file, successful
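As in the HDFS example, you can verify that the model files landed in S3. The snippet below uses boto3, which is not necessarily installed in the container; assuming it is available and credentials are configured as above, it lists the uploaded artifacts (the bucket and prefix are the ones used by save_params_to_files):

import boto3

# List the model artifacts written by save_params_to_files.
s3 = boto3.client("s3", region_name="us-east-1")
response = s3.list_objects_v2(Bucket="hugectr-io-test", Prefix="pipeline_test/test")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])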