The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misleading optimization direction, and we address this problem by accumulating the optimization direction locally. We also empirically demonstrate that most sparse gradients do not overlap, and thus show that sparsification is comparable to an asynchronous update. Our experiments with classification and segmentation tasks show that our method maintains the correct optimization direction in distributed training even under highly sparse updates.
Chen, Y & Deligiannis, N 2023, LOCALLY ACCUMULATED ADAM FOR DISTRIBUTED TRAINING WITH SPARSE UPDATES. in 2023 IEEE International Conference on Image Processing. IEEE International Conference on Image Processing, IEEE, pp. 2395-2399, 2023 IEEE International Conference on Image Processing, Kuala Lumpur, Malaysia, 8/10/23. https://doi.org/10.1109/ICIP49359.2023.10222032
Chen, Y., & Deligiannis, N. (2023). LOCALLY ACCUMULATED ADAM FOR DISTRIBUTED TRAINING WITH SPARSE UPDATES. In 2023 IEEE International Conference on Image Processing (pp. 2395-2399). (IEEE International Conference on Image Processing). IEEE. https://doi.org/10.1109/ICIP49359.2023.10222032
@inproceedings{577d8ad16db946f88c6c9a8389f049f6,
title = "LOCALLY ACCUMULATED ADAM FOR DISTRIBUTED TRAINING WITH SPARSE UPDATES",
abstract = "The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misleading optimization direction, and we address this problem by accumulating the optimization direction locally. We also empirically demonstrate that most sparse gradients do not overlap, and thus show that sparsification is comparable to an asynchronous update. Our experiments with classification and segmentation tasks show that our method maintains the correct optimization direction in distributed training even under highly sparse updates.",
keywords = "Distributed Learning, Vision Transformer, Gradient Compression, Optimization",
author = "Yiming Chen and Nikos Deligiannis",
year = "2023",
doi = "10.1109/ICIP49359.2023.10222032",
language = "English",
isbn = "978-1-7281-9836-1",
series = "IEEE International Conference on Image Processing",
publisher = "IEEE",
pages = "2395--2399",
booktitle = "2023 IEEE International Conference on Image Processing",
note = "2023 IEEE International Conference on Image Processing, ICIP 2023; Conference date: 08-10-2023 Through 11-10-2023",
url = "https://2023.ieeeicip.org/",
}