Revisiting a Classic: Bruck Algorithm for Non-uniform All-to-all Communication
Sidharth Kumar, University of Alabama at Birmingham
Abstract: As supercomputers enter the exascale era, scientific applications that rely heavily on Message Passing Interface (MPI) collective communication operations for data exchange and global synchronization find their communication costs increasingly dominant. Among the MPI collectives, MPI_Alltoall and MPI_Alltoallv play especially important roles, allowing every process involved in the communication to exchange data with every other process. MPI_Alltoallv generalizes MPI_Alltoall to support the exchange of non-uniform distributions of data. However, the latency of standard MPI_Alltoallv implementations is O(P) (linear in the number of processes), whereas MPI_Alltoall achieves O(log P). Such linear complexity performs poorly when applications are deployed on thousands of processes, especially for short-message communication, which is dominated by latency.
In this talk, I will present variants of the Bruck algorithm, a classic uniform all-to-all communication algorithm with logarithmic latency in P. I will then present two techniques, padded Bruck and two-phase Bruck, that efficiently generalize Bruck to non-uniform all-to-all data exchange, making it possible to perform non-uniform exchanges with logarithmic latency. We empirically validate the techniques on three supercomputers (Theta, Cori, and Stampede) using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform both weak and strong scaling studies across a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% across workloads and scales.
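The abstract hinges on the classic Bruck algorithm's ceil(log2 P) communication rounds, in contrast to the P-1 rounds of a naive pairwise exchange. The sketch below is not the speaker's implementation; it is a minimal pure-Python simulation of the classic uniform Bruck algorithm, with each process's buffer modeled as a list of blocks, showing the three standard phases (local rotation, log-round bitwise exchange, final inversion).

```python
def bruck_alltoall(send, P):
    """Simulate the classic uniform Bruck all-to-all on P processes.

    send[p][q] is the block process p wants to deliver to process q.
    Returns recv, where recv[p][q] is the block p received from q.
    Data moves in ceil(log2(P)) exchange rounds rather than P-1.
    """
    # Phase 1: local rotation -- process p shifts its blocks so that
    # the block destined for itself sits at index 0.
    buf = [send[p][p:] + send[p][:p] for p in range(P)]

    # Phase 2: one exchange round per bit of P. In round k, process p
    # sends every block whose current index has bit k set to rank
    # (p + k) mod P, and receives the matching blocks from (p - k) mod P.
    k = 1
    while k < P:
        incoming = [[] for _ in range(P)]
        for p in range(P):
            dst = (p + k) % P
            incoming[dst] = [(i, buf[p][i]) for i in range(P) if i & k]
        for p in range(P):
            for i, block in incoming[p]:
                buf[p][i] = block
        k <<= 1

    # Phase 3: final local inversion -- block i on process p now holds
    # the data sent by rank (p - i) mod P; reorder blocks by source rank.
    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        for i in range(P):
            recv[p][(p - i) % P] = buf[p][i]
    return recv
```

For example, with P = 5 the loop in phase 2 performs only 3 rounds (k = 1, 2, 4), yet every process ends up with exactly the blocks addressed to it. The non-uniform case is harder precisely because block sizes differ across pairs, which is what the padded and two-phase variants address.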