Speed up training with multiple GPUs using PyTorch Lightning

PyTorch Lightning is one of the PyTorch frameworks widely used for AI research. It can model network architectures and adapt to complex models, and it is used prominently by AI researchers and machine learning engineers because of its scalability and the performance it gets out of models. The framework has many features; in this article, let's see how to use PyTorch Lightning to train a model on multiple GPUs.

Table of Contents

  1. Introduction to PyTorch Lightning
  2. Benefits of PyTorch Lightning
  3. Benefits of Using Multi-GPU Training
  4. Training with multiple GPUs using PyTorch Lightning
  5. Summary

Introduction to PyTorch Lightning

PyTorch Lightning is a wrapper framework around PyTorch that is used to scale the training process of complex models. The framework supports a wide range of functionality; this article focuses on training models on multiple GPUs. PyTorch Lightning speeds up the research process and separates the actual modeling from the engineering.

Training heavy PyTorch models on platforms with limited accelerators can be time-consuming, and PyTorch Lightning can be used to accelerate it. The framework follows a standard workflow, which is given below.

  • The workflow starts when the model architecture is instantiated in the working environment. The PyTorch code that includes the pipelines, the training and test evaluation, and all other parameters used for the model is also organized within the framework.
  • The data pipelines used by the model or network are configured according to PyTorch Lightning conventions. The framework's DataModule reorganizes the pipeline into a reusable format.
  • Trainer instances can then be instantiated in the working environment. The Trainer instance can be configured according to the accelerators available in the working environment.
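The last step of the workflow, configuring the Trainer for the accelerators that are available, can be sketched as a small helper. This is only an illustration: trainer_kwargs is a hypothetical function and not part of PyTorch Lightning, and the choice of "ddp" for several GPUs is an assumption. Only the accelerator, devices and strategy argument names are taken from the Trainer calls in this article.

```python
def trainer_kwargs(num_gpus: int) -> dict:
    """Build keyword arguments for a Lightning Trainer from the GPU count.

    Hypothetical helper for illustration; it is not part of PyTorch Lightning.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    if num_gpus == 0:
        # No GPU available: fall back to a single CPU device.
        return {"accelerator": "cpu", "devices": 1}
    if num_gpus == 1:
        # One GPU: no distributed strategy is needed.
        return {"accelerator": "gpu", "devices": 1}
    # Several GPUs: request all of them and pick distributed data parallel.
    return {"accelerator": "gpu", "devices": num_gpus, "strategy": "ddp"}


for n in (0, 1, 4):
    print(n, trainer_kwargs(n))
```

In a real script the result would typically be unpacked into the Trainer, for example Trainer(**trainer_kwargs(torch.cuda.device_count())), so that the same model code runs unchanged on a laptop and on a multi-GPU server.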

PyTorch Lightning can integrate with complex models and components such as various optimizers and transformers, and it helps AI researchers accelerate their research work. The framework can also be integrated with cloud-based platforms and with effective training techniques used for state-of-the-art (SOTA) models. It also has the flexibility to apply standard functionality to complex models, such as early stopping, which terminates model training if there is no improvement in model performance after a certain threshold. Pre-trained (transfer learning) models can also be used within the framework to support the training of other models.
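To make the early stopping rule mentioned above concrete, here is a minimal plain-Python sketch of the idea. It is not Lightning's implementation; the function name early_stopping_epoch is hypothetical, and the patience and min_delta parameters simply mirror the names used by Lightning's EarlyStopping callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on the best
    value seen so far by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best value: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69]
print(early_stopping_epoch(losses, patience=3))  # 5
```

In this example the loss stops improving after epoch 2, so with a patience of 3 training ends at epoch 5 and the last value is never reached.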

Benefits of PyTorch Lightning

Some of the benefits of using PyTorch Lightning are mentioned below.

  • PyTorch Lightning models are largely hardware agnostic. The same model can be trained on a single GPU or on resources with multiple GPUs.
  • PyTorch Lightning supports model execution on a variety of platforms. A Lightning Trainer instance can be set up specifically for running Lightning models in environments such as Google Colab or Jupyter.
  • Lightning models are easy to read and highly reproducible across different platforms, which increases their adoption.
  • High flexibility and the ability to adapt to a variety of devices and high-end resources.
  • Parallel training across multiple GPUs is supported for Lightning models to speed up the training process.
  • Integration with TensorBoard makes it easy to monitor model convergence and simplifies model evaluation.

Benefits of Using Multi-GPU Training

Large models usually involve training with large batch sizes and high-dimensional data. Splitting this data becomes necessary to reduce the peak memory usage of accelerators such as GPUs. By using multiple GPUs, parallel processing can be employed, which reduces the total time spent on model training. Aggressive memory-saving configurations often slow training down, but this can be handled efficiently by using multiple GPUs.

The use of multiple GPUs also enables sharding, which in turn speeds up the training process. Lightning provides a strategy for using more than one GPU in a working environment, named distributed data parallel. The total trained model size and batch size do not change with the number of GPUs, but Lightning can automatically apply certain strategies so that batches of data are shared optimally across the GPUs defined in the Trainer instance.

As mentioned earlier, multiple GPUs also enable sharded training, which is very useful for fast training. Sharded training has several benefits, such as a reduction in peak memory usage, support for larger batch sizes on a single accelerator, near-linear scaling of the model, and many more.

Training with multiple GPUs using PyTorch Lightning

Multi-GPU training is set up in PyTorch Lightning by choosing a strategy. There are mainly four types of strategies that can be used for multi-GPU training. Let us look at how each strategy works.

Data Parallel (DP)

Data Parallel is responsible for splitting each batch into sub-batches for multiple GPUs. Consider a batch size of 64 and 4 GPUs responsible for processing the data: each GPU will have 64 / 4 = 16 samples to process. To use Data Parallel, we need to specify it in the Trainer instance as shown below.

trainer = Trainer(accelerator="gpu", devices=2, strategy="dp")

Here "dp" is the value passed to the strategy parameter to select the data parallel strategy. The root node combines the gradients from all GPUs after the backward pass and updates the weights.
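The batch arithmetic behind Data Parallel can be checked with a few lines of plain Python. The sketch below is a simplified illustration only: split_batch is a hypothetical helper, and the way it distributes leftover samples for uneven batch sizes is an assumption that may differ from what the framework does internally.

```python
def split_batch(batch, num_gpus):
    """Split a batch (a list of samples) into contiguous sub-batches, one per GPU."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, extra = divmod(len(batch), num_gpus)
    chunks = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional sample when the
        # batch size is not divisible by the number of GPUs.
        size = base + (1 if gpu < extra else 0)
        chunks.append(batch[start:start + size])
        start += size
    return chunks


batch = list(range(64))           # a batch of 64 samples
chunks = split_batch(batch, 4)    # spread over 4 GPUs
print([len(c) for c in chunks])   # [16, 16, 16, 16]
```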

Distributed Data-Parallel (DDP)

In DDP, each GPU runs in its own process. Each GPU works on its own subset of the overall dataset. The gradients are synced across the GPUs, and the trained model parameters can then be used for further evaluation.

trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")

Here "ddp" is the value passed to the strategy parameter to select the distributed data parallel strategy. The gradients are averaged across all processes after each backward pass, so every GPU applies the same weight update.
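The idea that each process sees only its own subset of the dataset can be sketched in plain Python. This is a simplified illustration: shard_indices is a hypothetical helper that uses a strided split, and it leaves out the shuffling and padding that PyTorch's DistributedSampler applies by default.

```python
def shard_indices(dataset_size, world_size, rank):
    """Return the sample indices that one process (rank) handles in an epoch."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided split: rank r takes samples r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, dataset_size, world_size))


world_size = 4  # one process per GPU
shards = [shard_indices(10, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With 10 samples and 4 processes the shards are disjoint and together cover every sample exactly once, which is what lets the processes work in parallel without duplicating effort.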

There are also other strategies related to DDP, known as DDP-2 and DDP-Spawn. The overall operation of these two strategies is similar to DDP, but they differ in the weight update process. The way the data is split and the way the training processes are started also differ from the original DDP.

trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp2") ## DDP-2
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn") ## DDP-Spawn

Horovod multi-GPU training

Horovod is a framework for using the same training script across multiple GPUs. Like DDP, each process operates on a single GPU with its own subset of the data for faster processing. The worker processes in the setup are configured by a driver application.

So a PyTorch Lightning model can be configured to use Horovod as shown in the code below.

trainer = Trainer(strategy="horovod", accelerator="gpu", devices=1)


Bagua

Bagua is a deep learning framework used to accelerate the training process and extend support for distributed training algorithms. Among these algorithms, Bagua offers gradient AllReduce, which establishes synchronous communication between devices and averages the gradients among all the workers.

Below is sample code for using Bagua's GradientAllReduce algorithm in the working environment.

trainer = Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"), accelerator="gpu", devices=2)
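What a gradient AllReduce computes can be shown with a small plain-Python simulation. This is only a sketch of the averaging step, not Bagua's implementation: allreduce_mean is a hypothetical function, the gradients are plain lists of floats, and the communication between devices is not modelled at all.

```python
def allreduce_mean(worker_gradients):
    """Average gradients element-wise across workers.

    Takes one gradient list per worker and returns one list per worker;
    after the operation every worker holds the same averaged gradient.
    """
    num_workers = len(worker_gradients)
    if num_workers == 0:
        raise ValueError("at least one worker is required")
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all workers must hold gradients of the same length")
    averaged = [
        sum(g[i] for g in worker_gradients) / num_workers
        for i in range(length)
    ]
    # Every worker receives its own copy of the averaged gradient.
    return [list(averaged) for _ in range(num_workers)]


grads = [[1.0, 2.0], [3.0, 4.0]]   # gradients computed on two workers
print(allreduce_mean(grads))       # [[2.0, 3.0], [2.0, 3.0]]
```

Because all workers end up with identical gradients, they apply identical weight updates and their model copies stay in sync.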

Using multiple GPUs for training not only speeds up the training process but also considerably reduces the wall-clock time needed to train the model. The appropriate PyTorch Lightning strategy can therefore be chosen to make use of multiple GPUs and train on the data accordingly.


Summary

PyTorch Lightning is one of the frameworks built on PyTorch, with extensive capabilities and benefits for simplifying complex models. Among its various functionalities, in this article we saw how to train a model on multiple GPUs for faster training. It mainly uses strategies that split the data based on the batch size and move that data to multiple GPUs. This allows complex models and data to be trained in a shorter duration and also helps accelerate the research work of AI researchers and ML engineers.

