Knowledge Sharing Seminar on Federated Learning

Peter Richtárik, Professor of Computer Science at King Abdullah University of Science and Technology (KAUST), gives a lecture followed by a discussion on the topic: "Is Going Asynchronous the Right Way to Handle Device Heterogeneity in Federated Learning? (The First Optimal Parallel SGD in the Presence of Data, Compute, and Communication Heterogeneity)".

Lecture abstract

The design of efficient parallel/distributed optimization methods and the tight analysis of their theoretical properties are important research endeavors. While minimax complexities are known for sequential optimization methods, the theory of parallel optimization methods is, surprisingly, much less explored, especially in the presence of data, compute, and communication heterogeneity. In the first part of the talk [1], we establish the first optimal time complexities for parallel optimization methods (Rennala SGD and Malenia SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, under the assumption that the workers compute stochastic gradients at different speeds, i.e., we assume compute heterogeneity. We prove lower bounds and develop optimal algorithms that attain them, in both the data-homogeneous and data-heterogeneous regimes. In the second part of the talk [2], we establish the first optimal time complexities for parallel optimization methods (Shadowheart SGD) that have access to an unbiased stochastic gradient oracle with bounded variance, again under the assumption that the workers compute stochastic gradients at different speeds, but with the further assumption that the worker-to-server communication times are nonzero and heterogeneous. We prove lower bounds and develop optimal algorithms that attain them, in the data-homogeneous regime only. Time permitting, I may briefly outline some further recent results [3, 4, 5].
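
For readers who want the assumptions above in symbols: "unbiased stochastic gradient oracle with bounded variance" is the standard condition sketched below, and compute heterogeneity is often formalized by giving each worker its own compute time per gradient. The symbols σ and τ_i are generic notation added here only for illustration, not a verbatim statement of the models in [1, 2].

```latex
% Standard oracle assumptions named in the abstract (data-homogeneous case):
% a call to any worker's oracle at a point x returns g(x) with
\mathbb{E}\,[g(x)] = \nabla f(x)
\qquad \text{and} \qquad
\mathbb{E}\,\bigl\| g(x) - \nabla f(x) \bigr\|^2 \le \sigma^2 .
% (In the data-heterogeneous regime, worker i's oracle is instead unbiased
% for its own local objective f_i.)
% One common way to formalize compute heterogeneity: worker i needs
% \tau_i seconds to produce one stochastic gradient, and the \tau_i
% may differ arbitrarily across workers.
```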

Our results have surprising consequences for the literature on asynchronous optimization methods: in contrast with prior attempts to tame compute heterogeneity via "complete/wild" compute and update asynchronicity, our methods alternate fast asynchronous computation of a minibatch of stochastic gradients with infrequent synchronous update steps.
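
A minimal, hypothetical sketch (in Python, not the authors' code) of the pattern described in the previous paragraph: gradients for one minibatch are gathered asynchronously from a pool of workers with differing speeds, and only the model update is synchronous. The quadratic objective, noise level, step size, batch size, and simulated compute delays are all illustrative assumptions, chosen only to make the control flow concrete.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def stochastic_grad(x, delay, seed):
    """Unbiased, bounded-variance gradient oracle for f(x) = 0.5 * ||x||^2,
    with a simulated (heterogeneous) compute time."""
    rng = np.random.default_rng(seed)
    time.sleep(delay)
    return x + rng.normal(scale=0.1, size=x.shape)

def async_batch_then_sync_update(x0, n_workers=4, batch_size=8, lr=0.5, n_steps=10):
    x = x0.copy()
    rng = np.random.default_rng(0)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for step in range(n_steps):
            # Asynchronous phase: queue batch_size gradient jobs at the current x.
            # Worker threads pick up jobs as they become free; there is no
            # per-gradient synchronization within the batch.
            delays = rng.uniform(0.0, 0.01, size=batch_size)
            futures = [pool.submit(stochastic_grad, x, delays[i], step * batch_size + i)
                       for i in range(batch_size)]
            grads = [f.result() for f in futures]
            # Synchronous phase: one aggregated update step; the new x is then
            # used by all workers in the next round.
            x = x - lr * np.mean(grads, axis=0)
    return x

print(async_batch_then_sync_update(np.ones(3)))
```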

[1] Tyurin, A., Richtárik, P. Optimal time complexities of parallel stochastic optimization methods under a fixed computation model, NeurIPS 2023.

[2] Tyurin, A., Pozzi, M., Ilin, I., Richtárik, P. Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity, NeurIPS 2024.

[3] Tyurin, A., Gruntkowska, K., Richtárik, P. Freya PAGE: First optimal time complexity for large-scale nonconvex finite-sum optimization with heterogeneous asynchronous computations, NeurIPS 2024.

[4] Tyurin, A., Richtárik, P. On the optimal time complexities in decentralized stochastic asynchronous optimization, NeurIPS 2024.

[5] Maranjyan, A., Shaikh, O., Richtárik, P. MindFlayer: Efficient asynchronous parallel SGD in the presence of heterogeneous and random worker compute times, NeurIPS 2024 Workshop: Optimization for Machine Learning (OPT 2024).