Accelerated Training with PyTorch
In many PyTorch training pipelines, the dataloader cannot prepare sequential batches fast enough, so the GPU is not fully utilized. To address this bottleneck, we've developed a highly customized tabular dataloader, TorchAsyncItr, to accelerate existing PyTorch pipelines. The NVTabular dataloader is capable of:
- removing bottlenecks from dataloading by processing large chunks of data at a time instead of item by item
- processing datasets that don't fit in GPU or CPU memory by streaming from disk
- reading data directly into GPU memory, removing CPU-to-GPU communication
- preparing batches asynchronously on the GPU to avoid CPU-GPU communication
- supporting commonly used formats such as Parquet
- integrating easily into existing PyTorch training pipelines by using an API similar to the native PyTorch dataloader
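The chunk-based idea in the first point can be sketched in plain Python. This is a simplified illustration of the concept, not the actual NVTabular implementation; the function name and data are hypothetical:

```python
# Sketch: slice large chunks into batches instead of collecting items
# one by one. NVTabular applies the same idea to buffers in GPU memory.
def batches_from_chunks(rows, chunk_size, batch_size):
    # Yield batches by slicing each chunk, never touching single items.
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        for offset in range(0, len(chunk), batch_size):
            yield chunk[offset:offset + batch_size]

batches = list(batches_from_chunks(list(range(10)), chunk_size=4, batch_size=2))
# batches == [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Because each batch is a cheap slice of an already-materialized chunk, per-item overhead (and, on the GPU, per-item transfers) is avoided.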
Accelerating a PyTorch training pipeline with TorchAsyncItr involves the following steps:
The required libraries are imported.

```python
import glob

import torch
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader
```
The TorchAsyncItr iterator is initialized. The input is an NVTabular dataset created from a list of file names. The NVTabular dataset is an abstraction layer that iterates over the full dataset in chunks. The dataset schema is defined by cats (the column names of the categorical input features), conts (the column names of the continuous input features), and labels (the column names of the targets). Each parameter should be formatted as a list of strings. The batch size is also specified.

```python
TRAIN_PATHS = glob.glob("./train/*.parquet")
train_dataset = TorchAsyncItr(
    nvt.Dataset(TRAIN_PATHS),
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
    batch_size=BATCH_SIZE
)
```
TorchAsyncItr is wrapped as a DLDataLoader.

```python
train_loader = DLDataLoader(
    train_dataset,
    batch_size=None,
    collate_fn=collate_fn,
    pin_memory=False,
    num_workers=0
)
```
If a torch.nn.Module model was created, train_loader can be used in the same way as the native PyTorch dataloader.

```python
...
model = get_model()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for x_cat, x_cont, y in iter(train_loader):
    y_pred = model(x_cat, x_cont)
    loss = loss_func(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
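The loop above assumes each iteration yields an (x_cat, x_cont, y) triple. The same structure can be sketched without torch, fitting a single weight by hand-written gradient descent over a tiny dummy loader; all data and names below are illustrative, not part of the NVTabular API:

```python
# Dummy loader yielding (x_cat, x_cont, y) triples, mimicking the shape
# of the batches produced by the real dataloader (values are made up).
dummy_loader = [([0], [1.0], [2.0]), ([0], [2.0], [4.0]), ([0], [3.0], [6.0])]

w = 0.0    # single weight applied to the continuous feature
lr = 0.05  # learning rate
for _ in range(200):  # epochs
    for x_cat, x_cont, y in dummy_loader:
        y_pred = w * x_cont[0]                   # "model" forward pass
        grad = 2 * (y_pred - y[0]) * x_cont[0]   # d/dw of (y_pred - y)^2
        w -= lr * grad                           # "optimizer" step

# w converges to 2.0, since y = 2 * x_cont in the dummy data
```

The body mirrors the real loop one-to-one: forward pass, loss gradient, and parameter update per batch.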
The TorchAsyncItr dataloader can be initialized for the validation dataset using the same structure.
You can find additional examples, such as MovieLens and Criteo, in our repository.