Multi-GPU data-parallel training using the
To train models faster, users can use Data-Parallel training when using
transformers4rec.Trainer for training. Data-parallel multi-GPU training distributes train data between GPUs to speedup training and support larger batch sizes at each step.
Trainer class supports both
DistributedDataParallel built-in features of PyTorch. Here we explain how each of these training modes can be used in Transformers4Rec.
DataParallel mode is used, the following happens for each training step:
GPU-0 reads a batch then evenly distributes it among available GPUs
The latest model will be copied to all GPUs
A Python thread is created for each GPU to run
forward()step and the partial loss will be sent to GPU-0 to compute the global loss
Computed global loss is broadcasted to all GPU threads to run
Gradients from each GPU are sent to GPU-0 and their average is computed
As we see, parallelism in
DataParallel mode is implemented through Python threads which will be blocked by GIL (Global Interepreter Lock) so
DataParallel is not the preferred method. Users are advised to use
DistributedDataParallel instead as it uses multi-processing and is better maintained. Also, some model types such as
transfoxl can not be used with
DataParallel. To learn more about
DataParallel refer to PyTorch documentation.
To use the
DataParallel mode training, user just needs to make sure
CUDA_VISIBLE_DEVICES is set. For example when 2 GPUs are available:
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"to the script
export CUDA_VISIBLE_DEVICES=0,1in the terminal
Do not try wrapping the model with
torch.nn.DataParallel yourself because that will break automatic wrapping done by
Note: When using
DataParallel the dataloader generates one batch on each train step then the batch will be divided between GPUs so the
per_device_train_batch_size argument represents the total batch size in this mode, not size of the batch each GPU receives.
This is the suggested and more efficient method. When a model is trained using the
A separate process will be assigned to each GPU in the beginning and GPU-0 will replicate the model on each GPU
On each step each GPU receives a different mini-batch produced by the dataloader
On the backward pass the gradient from GPUs will be averaged for accumulation
To learn more about
DistributedDataParallel see the PyTorch Documentation.
To train using the
DistributedDataParallel mode user should use PyTorch distributed launcher to run the script:
python -m torch.distributed.launch --nproc_per_node N_GPU your_script.py --your_arguments
To have one process per GPU, replace
N_GPU with the number of GPUs you want to use. Optionally, you can also set CUDA_VISIBLE_DEVICES accordingly.
Note: When using
DistributedDataParallel, our data loader splits data between the GPUs based on dataset partitions. For that reason, the number of partitions of the dataset must be equal to or an integer multiple of the number of processes. If the parquet file has a small number of row groups (partitions), try repartitioning and saving it again using cudf or pandas before training. The dataloader checks
dataloader.dataset.npartitions and will repartition if needed but we advise users to repartition the dataset and save it for better efficiency. Use pandas or cudf for repartitioning. Example of repartitioning a parquet file with cudf:
row_group_size_rows such that
With pandas, use the argument as
Please make sure that the resulting number of partitions is divisible by number of GPUs to be used (eg. 2 or 4 partitions, if 2 GPUs will be used).
We trained and evaluated a number of models using single GPU,
DistributedDataParallel training modes and the results are shown in the table below. To reproduce, use the models included in ci/test_integration.sh.
These experiments used a machine with 2 Tesla V100-SXM2-32GB-LS GPUs.