Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai

Venue: ACM SIGCOMM Conference (SIGCOMM) 2024
Recognition: Most Influential SIGCOMM 2024 Paper (Rank No. 6)
Edition: 2026-03
Impact factor: 3
Certificate ID: 7a9acd7d635d4780

Abstract

Deep learning training (DLT), e.g., large language model (LLM) training, has become one of the most important services in multitenant cloud computing. By studying in-production DLT jobs in depth, we observe that communication contention among DLT jobs seriously degrades overall GPU computation utilization, lowering the efficiency of the training cluster. In this paper, we present Crux, a communication scheduler that aims to maximize GPU computation utilization by mitigating communication contention among DLT jobs. Maximizing GPU computation utilization for DLT, however, is NP-complete; we therefore formulate and prove a novel theorem that lets us approach this goal through GPU intensity-aware communication scheduling. Building on it, we propose an approach that prioritizes DLT flows with high GPU computation intensity, reducing potential communication contention. Our 96-GPU testbed experiments show that Crux improves GPU computation utilization by 8.3% to 14.8%. Large-scale simulation based on production traces further shows that Crux increases GPU computation utilization by up to 23% compared with alternatives including Sincronia, TACCL, and CASSINI.
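
The idea of prioritizing flows from GPU-intensive jobs can be illustrated with a minimal sketch. The abstract does not define GPU intensity or describe the priority-assignment algorithm, so everything here is an assumption for illustration: intensity is taken to be GPU work per unit of communication time, and the names `Job`, `gpu_work`, `comm_time`, `gpu_intensity`, and `assign_priorities` are hypothetical. Jobs are ranked by this ratio and mapped onto a limited number of priority levels, with 0 as the highest.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpu_work: float   # hypothetical: GPU computation per iteration (e.g., GPU-seconds)
    comm_time: float  # hypothetical: communication time per iteration (seconds)


def gpu_intensity(job: Job) -> float:
    """Assumed metric: GPU work per unit of communication time."""
    if job.comm_time <= 0:
        raise ValueError("comm_time must be positive")
    return job.gpu_work / job.comm_time


def assign_priorities(jobs: list[Job], levels: int) -> dict[str, int]:
    """Rank jobs by descending intensity; 0 is the highest priority.

    Jobs ranked beyond the available levels share the lowest level.
    """
    ranked = sorted(jobs, key=gpu_intensity, reverse=True)
    return {job.name: min(rank, levels - 1) for rank, job in enumerate(ranked)}


jobs = [
    Job("llm", gpu_work=64.0, comm_time=2.0),
    Job("vision", gpu_work=8.0, comm_time=1.0),
    Job("recsys", gpu_work=4.0, comm_time=2.0),
]
priorities = assign_priorities(jobs, levels=2)
print(priorities)
```

With only two priority levels, the most intensive job gets level 0 and the other two share level 1. A real scheduler would also have to account for which jobs actually share network links, which this sketch ignores.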
