PAPER DIGEST
Most Influential SIGCOMM 2024 Paper · 2026-03 edition

Rethinking Machine Learning Collective Communication As A Multi-Commodity Flow Problem

Xuting Liu, Behnaz Arzani, Siva Kesava Reddy Kakarla, Liangyu Zhao, Vincent Liu, Miguel Castro, Srikanth Kandula, Luke Marshall

Venue
ACM SIGCOMM Conference (SIGCOMM) 2024
Recognition
Most Influential SIGCOMM 2024 Paper (Rank No. 5)
Edition
2026-03
Impact factor
3
Certificate ID
280a23e93cfaf144

Abstract

Cloud operators utilize collective communication optimizers to enhance the efficiency of the single-tenant, centrally managed training clusters they manage. However, current optimizers struggle to scale for such use cases and often compromise solution quality for scalability. Our solution, TE-CCL, adopts a traffic-engineering-based approach to collective communication. Compared to a state-of-the-art optimizer, TACCL, TE-CCL produced schedules with 2× better performance on topologies TACCL supports (and its solver took a similar amount of time as TACCL's heuristic-based approach). TE-CCL additionally scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by 2.14× and RCCL by 3.18× in terms of algorithm bandwidth.
