oai:arXiv.org:2403.15654
Computer Science
2024
9/18/2024
We revisit two fundamental decentralized optimization methods, Decentralized Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple local updates.
We consider two settings and demonstrate that incorporating $K > 1$ local update steps can reduce communication complexity.
Specifically, for $\mu$-strongly convex and $L$-smooth loss functions, we proved that local DGT achieves communication complexity $\tilde{\mathcal{O}} \Big(\frac{L}{\mu K} + \frac{\delta}{\mu (1 - \rho)} + \frac{\rho }{(1 - \rho)^2} \cdot \frac{L+ \delta}{\mu}\Big)$, where $\rho$ measures the network connectivity and $\delta$ measures the second-order heterogeneity of the local loss.
Our result reveals the tradeoff between communication and computation and shows increasing $K$ can effectively reduce communication costs when the data heterogeneity is low and the network is well-connected.
We then consider the over-parameterization regime where the local losses share the same minimums, we proved that employing local updates in DGD, even without gradient correction, can yield a similar effect as DGT in reducing communication complexity.
Numerical experiments validate our theoretical results.
Wu, Tongle,Sun, Ying, 2024, The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity