Don’t let the accelerator (GPU/TPU) idle


When training a large model with multiple workers/accelerators, the accelerators can sit idle whenever data ingestion cannot keep up with training:

  • Accelerators are expensive, so idle time is wasted money
  • Transformations that run on the CPU can be slow
  • The accelerator can be left waiting for data when the dataset does not fit in memory and must be streamed from disk


We should look into ways to make the data ingestion pipeline performant enough to keep the accelerators busy:

  • Input pipeline
    • View the pipeline as ETL:
      • Extract – read data from a hard drive or a distributed file system
      • Transform (map) – shuffle, augment, decompress, batch
      • Load – move the transformed data onto the accelerator (GPU/TPU)
    • The input pipeline takes time, and we don’t want the accelerators to wait for it
    • Without pipelining, the CPU and the accelerator take turns sitting idle (we don’t want this)
    • Pipelining lets the CPU prepare the next batch of data in parallel while the GPU/TPU computes on the current one
  • Prefetching
    • fetch the data for the next training step while the accelerator is training on the current step
    • hides input latency, so the data is already prepared the moment the accelerator needs it
  • Parallel extraction and transformation
    • open/read multiple data samples concurrently
    • apply transformations on multiple CPU cores concurrently
    • improve extraction throughput by reading multiple files at once
    • interleave the extraction and transformation steps
  • Caching – read and transform the data once during the first epoch, and serve the cached, transformed input on later epochs
  • Reduce memory footprint – the order of operations affects how much memory is needed to buffer records in the pipeline, so choose carefully how much each stage produces and consumes at each step
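The pipelining and prefetching ideas above can be sketched in plain Python (the helper names `make_batches` and `prefetch` are invented for illustration; in TensorFlow's tf.data API the equivalent is `Dataset.prefetch`): a background thread keeps preparing the next batch while the consumer, standing in for the accelerator, works on the current one.

```python
import queue
import threading
import time

def make_batches(n):
    """Simulate slow data ingestion (extract + transform on the CPU)."""
    for i in range(n):
        time.sleep(0.01)  # pretend I/O + preprocessing cost
        yield [i] * 4     # a "batch"

def prefetch(generator, buffer_size=2):
    """Run the generator in a background thread so up to `buffer_size`
    batches are prepared ahead of the consumer (the accelerator)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)
        q.put(sentinel)  # signal end of data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# The consumer overlaps its "training step" with the producer's ingestion.
batches = list(prefetch(make_batches(5)))
print(len(batches))  # 5
```

The `buffer_size` bound matters: an unbounded queue would minimize waiting but let the producer run arbitrarily far ahead, growing the memory footprint.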
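Caching can be sketched as follows (the `CachedDataset` class is a hypothetical illustration; tf.data exposes this as `Dataset.cache`): the expensive read-and-transform work is paid once, during the first epoch, and later epochs read the transformed records straight from memory.

```python
import time

class CachedDataset:
    """Read + transform each record once (first epoch), then serve the
    transformed result from an in-memory cache on later epochs."""

    def __init__(self, records):
        self.records = records
        self.cache = {}

    def get(self, i):
        if i not in self.cache:
            time.sleep(0.005)  # pretend expensive read + transform
            self.cache[i] = self.records[i].upper()
        return self.cache[i]

ds = CachedDataset(["cat", "dog"])
epoch1 = [ds.get(i) for i in range(2)]  # pays the read + transform cost
epoch2 = [ds.get(i) for i in range(2)]  # served from the cache
print(epoch1, epoch2)  # ['CAT', 'DOG'] ['CAT', 'DOG']
```

Note this only helps when the transformed data fits in memory, and only deterministic transformations should be cached; random augmentations must still be applied after the cache so they differ across epochs.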

