Training large models – distributed training


A model can be too large to train on one machine in two ways:

  • Space – Too many weights to fit in memory
  • Time – Too many training samples to iterate


Distributed training has two broad types

  • Model parallelism (harder)
    • When model parameters do not fit on one device
    • Separate the model into different machines/accelerators (GPU/TPU)
    • Breaking up a model to train efficiently on multiple devices is still an active area of research and can be quite challenging
    • Achieved via pipeline parallelism
      • GPipe – Google’s version of pipeline parallelism
      • pipeline parallelism
        • Legend: F is forward pass and B is backward pass
        • Top figure: when each layer of the model is naively split across different devices, the devices still have to wait for each other because the forward and backward passes are sequential
        • Bottom figure: when the mini-batch is further split into micro-batches, each device can work on one micro-batch at a time and does not need to wait for the entire mini-batch. For example, Device 0 works on the second micro-batch while Device 1 works on the first micro-batch in parallel. There is still some overhead because Device 1 has to wait for Device 0 to finish the first micro-batch.
        • In both figures: the update is only computed once the entire mini-batch (all micro-batches) has finished forward and backward propagation
        • When tested on AmoebaNet-D (4, 512) with 8 billion parameters, pipeline parallelism with GPipe is able to achieve close to linear speedup
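The benefit of micro-batching can be quantified: with K pipeline stages and M micro-batches, each device sits idle for a "bubble" fraction of roughly (K − 1)/(M + K − 1) of the schedule, the standard result for GPipe-style pipelines. A small sketch of the arithmetic (function name is my own, for illustration):

```python
def pipeline_bubble_fraction(num_devices, num_micro_batches):
    """Idle ("bubble") fraction of a GPipe-style pipeline schedule.

    With K pipeline stages and M micro-batches, one forward/backward
    pass occupies M + K - 1 time slots, of which K - 1 per device are
    spent waiting on neighbouring stages.
    """
    k, m = num_devices, num_micro_batches
    return (k - 1) / (m + k - 1)

# Naive layer-wise splitting is the M = 1 case: devices idle 3/4 of the time.
assert pipeline_bubble_fraction(4, 1) == 0.75
# Splitting the mini-batch into 8 micro-batches shrinks the bubble to ~27%.
assert round(pipeline_bubble_fraction(4, 8), 3) == 0.273
```

As M grows relative to K, the bubble fraction approaches zero, which is why GPipe can approach linear speedup.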
  • Data parallelism (easier)
    • Synchronous
      • Separate the data into partitions
      • Model is replicated into different workers (a machine or accelerator)
      • Each worker performs forward and backward propagation to compute the error and gradients for its partition of the data in each epoch
      • Each worker updates its model and sends its update to all other workers so they can apply the same updates
      • Keeps the model in sync in each epoch
        • Workers need to be synchronized and wait for each other’s updates
      • Communication
        • All-reduce algorithm
          • Every worker sends its updates to every other worker
          • Requires large bandwidth, and the information shared is replicated
        • Ring all-reduce algorithm
          • Workers form a ring and each passes updates to the next worker in the ring; every worker has the full information after the updates travel around the ring
          • Each worker only sends a portion of the data at a time, so all workers use the available bandwidth simultaneously
          • Reduces bandwidth because each worker accumulates the gradients it receives and sends the partial sum to the next worker, so that after one round the total is obtained
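The scatter-reduce-then-all-gather structure of ring all-reduce can be sketched in plain Python. This is a single-process simulation (worker "sends" are just list copies), not a real multi-node implementation, and all names are my own:

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce: each worker ends with the elementwise sum.

    grads: list of equal-length gradient lists, one per worker. The vector
    is split into one chunk per worker. A scatter-reduce phase accumulates
    partial sums around the ring, then an all-gather phase circulates the
    completed chunks. Each worker only ever sends 1/N of the data per step.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "assume the vector splits evenly into n chunks"
    chunk = size // n
    bufs = [list(g) for g in grads]

    # Scatter-reduce: after n-1 steps, worker w holds the complete sum
    # of chunk (w + 1) % n.
    for step in range(n - 1):
        sends = []  # snapshot so all "sends" happen simultaneously
        for w in range(n):
            c = (w - step) % n
            lo = c * chunk
            sends.append((w, c, bufs[w][lo:lo + chunk]))
        for w, c, data in sends:
            dst = (w + 1) % n
            lo = c * chunk
            for i, v in enumerate(data):
                bufs[dst][lo + i] += v  # accumulate the partial sum

    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for w in range(n):
            c = (w + 1 - step) % n
            lo = c * chunk
            sends.append((w, c, bufs[w][lo:lo + chunk]))
        for w, c, data in sends:
            dst = (w + 1) % n
            lo = c * chunk
            bufs[dst][lo:lo + chunk] = data
    return bufs

workers = [[1.0] * 8, [2.0] * 8, [3.0] * 8, [4.0] * 8]
result = ring_all_reduce(workers)
assert all(buf == [10.0] * 8 for buf in result)
```

Note that the full sum actually takes two passes around the ring (2(N − 1) steps), but since each step moves only 1/N of the data, total traffic per worker stays near the optimal 2× the gradient size regardless of N.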
    • Asynchronous
      • Parameter server
        • TFJob (a component of Kubeflow) roles:
          • Chief: orchestrates training and performs checkpointing
          • PS: parameter servers provide a distributed data store
          • Worker: actually does the work; worker 0 might also act as the chief
          • Evaluator: computes evaluation metrics during training
        • Each worker has its own piece of data
        • Each worker sends update to the parameter server asynchronously
        • More efficient – no waiting time for the workers
        • Different workers run at different pace
        • Slower to converge (measured in the number of epochs) and lower accuracy
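The staleness trade-off behind that slower convergence can be illustrated with a toy, stdlib-only simulation (everything here, including `async_sgd` and its parameters, is hypothetical and for illustration only):

```python
import random

def async_sgd(num_workers=3, steps=20, lr=0.1, seed=0):
    """Toy asynchronous parameter-server loop minimizing f(w) = w**2.

    On each tick a randomly chosen worker pushes a gradient computed
    from the parameter value it last pulled, so updates may be based on
    stale parameters -- the source of slower convergence compared with
    synchronous data parallelism.
    """
    rng = random.Random(seed)
    w = 5.0                      # value held by the parameter server
    stale = [w] * num_workers    # each worker's last-pulled copy
    for _ in range(steps):
        k = rng.randrange(num_workers)  # workers run at different paces
        grad = 2.0 * stale[k]           # gradient of w**2 at the stale copy
        w -= lr * grad                  # server applies the update immediately
        stale[k] = w                    # worker pulls the fresh value
    return w

final = async_sgd()
assert abs(final) < 5.0  # still makes progress toward the minimum at 0
```

No worker ever waits for another, but because gradients are computed against out-of-date parameters, each individual update is less useful than in the synchronous case.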


Distributed training in TensorFlow

  • Uses the Keras/Estimator APIs
  • TensorFlow tf.distribute.Strategy API
    • One Device Strategy – no distribution, used for testing your distributed training code
    • Mirrored Strategy – synchronous training on multiple GPUs on one machine
      • Variables are mirrored across all replicas of the model on all GPUs
      • Conceptually a single variable, called a mirrored variable, is kept in sync using a communication algorithm such as all-reduce
    • Parameter Server Strategy
      • Parameter server worker(s) – stores the parameters and maintains all updates
      • Computation workers – feed data into the model
      • Asynchronous by default, but can be done synchronously (although it would be less efficient)
    • Multi-worker Mirrored Strategy
      • Synchronous distributed training across multiple workers, each with potentially multiple GPUs
      • Similar to MirroredStrategy, it creates copies of all variables in the model on each device across all workers
    • Central Storage Strategy
      • Synchronous but variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs.
    • TPU Strategy
      • To take advantage of TPUs and TPU Pods, which are optimized for larger models and are more cost-effective than GPUs
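Switching strategies mostly means changing which strategy object the model is built under; the Keras training calls stay the same. A minimal sketch using MirroredStrategy (assuming TensorFlow 2.x; the layer sizes are arbitrary):

```python
import tensorflow as tf

# Synchronous data parallelism across the local GPUs; with no GPU
# available, MirroredStrategy falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope become mirrored variables,
    # kept in sync across replicas with an all-reduce after each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(...) would then run each step on all replicas in sync.
```

Swapping in `tf.distribute.TPUStrategy` or `tf.distribute.experimental.ParameterServerStrategy` follows the same pattern, which is the main appeal of the API.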
  • Fault tolerance
    • Preserve the training state (in a distributed file system) when a worker fails
    • Keras BackupAndRestore callback
  • Don’t let the accelerator (GPU/TPU) idle
    • The input pipeline needs to be efficient to keep the accelerator busy, or else you might not get much benefit from the accelerator
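The idea of keeping the accelerator fed can be simulated with a bounded prefetch queue, which is roughly what tf.data's `prefetch` does. A stdlib-only sketch (all names are my own):

```python
import queue
import threading
import time

def run_pipeline(num_batches=8, prep_time=0.01, step_time=0.01, prefetch=2):
    """Overlap input preparation with 'accelerator' compute.

    A background thread prepares batches into a bounded queue so the
    training step never waits on CPU-side preprocessing once the buffer
    is warm -- analogous to tf.data's prefetch.
    """
    q = queue.Queue(maxsize=prefetch)

    def producer():
        for i in range(num_batches):
            time.sleep(prep_time)   # simulate CPU-side preprocessing
            q.put(i)
        q.put(None)                 # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    trained = []
    while (batch := q.get()) is not None:
        time.sleep(step_time)       # simulate the accelerator training step
        trained.append(batch)
    return trained

assert run_pipeline() == list(range(8))
```

With the overlap, total wall time approaches the preparation of the first batch plus the sum of the step times, instead of the sum of both for every batch.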
  • Memory constraint of large model
    • The number of parameters is so large that a model cannot fit on a single machine (billions of parameters)
    • Gradient accumulation
      • Split the mini-batch into smaller micro-batches, run forward and backward passes per micro-batch, and only apply the weight update after the entire mini-batch has been processed
    • Memory swap
      • Copy activation states back and forth between accelerator memory and CPU (host) memory
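Gradient accumulation works because the mini-batch gradient is just the average of the per-example gradients, so summing micro-batch gradients and dividing once gives the same update while only one micro-batch's activations need to live in memory at a time. A sketch for a 1-D linear model (function names are hypothetical):

```python
def grad(w, x, y):
    """Mean-squared-error gradient for the model y_hat = w * x,
    computed over the whole mini-batch at once."""
    n = len(x)
    return sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / n

def accumulated_grad(w, x, y, micro_size):
    """Same gradient computed micro-batch by micro-batch: accumulate the
    per-example gradient sums and divide once at the end, so the update
    matches full mini-batch backprop."""
    total = 0.0
    for i in range(0, len(x), micro_size):
        xs, ys = x[i:i + micro_size], y[i:i + micro_size]
        total += sum(2 * (w * xi - yi) * xi for xi, yi in zip(xs, ys))
    return total / len(x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
# Accumulating over micro-batches of 2 reproduces the full-batch gradient.
assert abs(grad(1.5, x, y) - accumulated_grad(1.5, x, y, micro_size=2)) < 1e-12
```

In a real framework the accumulation happens on the gradient tensors between `backward` calls, with the optimizer step deferred until the last micro-batch.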

