Revisiting a Classic: Bruck Algorithm for Non-uniform All-to-all Communication
Sidharth Kumar, University of Alabama at Birmingham
Abstract: As supercomputers enter the exascale era, scientific applications that rely heavily on Message Passing Interface (MPI) collective communication operations for data exchange and global synchronization find their communication costs increasingly dominant. Among the MPI collectives, MPI_Alltoall and MPI_Alltoallv play especially important roles, allowing every process involved in the communication to exchange data with every other process. MPI_Alltoallv generalizes MPI_Alltoall to support the exchange of non-uniform distributions of data. However, the latency of standard MPI_Alltoallv implementations is O(P) (linear in the number of processes), whereas MPI_Alltoall achieves O(log P). Such linear complexity performs poorly when applications are deployed on thousands of processes, especially for short-message communication, which is dominated by latency.
In this talk, I will present variants of the Bruck algorithm, a classic uniform all-to-all communication algorithm with logarithmic latency in P. I will then present two techniques, padded Bruck and two-phase Bruck, that efficiently generalize Bruck to non-uniform all-to-all data exchange, making it possible to perform non-uniform exchanges with logarithmic latency. We empirically validate the techniques on three supercomputers (Theta, Cori, and Stampede) using both microbenchmarks and two real-world applications: graph mining and program analysis. We perform both weak and strong scaling studies across a range of average message sizes, degrees of imbalance, and distribution schemes, and demonstrate that our techniques outperform Cray's vendor-optimized MPI_Alltoallv by as much as 50% across workloads and scales.
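The abstract hinges on the classic Bruck algorithm's ceil(log2 P) communication rounds, in contrast to the P-1 rounds of a naive pairwise exchange. The sketch below is not the speaker's implementation; it is a minimal pure-Python simulation of the classic uniform Bruck algorithm, with each process's buffer modeled as a list of blocks, showing the three standard phases (local rotation, log-round bitwise exchange, final inversion).

```python
def bruck_alltoall(send, P):
    """Simulate the classic uniform Bruck all-to-all on P processes.

    send[p][q] is the block process p wants to deliver to process q.
    Returns recv, where recv[p][q] is the block p received from q.
    Data moves in ceil(log2(P)) exchange rounds rather than P-1.
    """
    # Phase 1: local rotation -- process p shifts its blocks so that
    # the block destined for itself sits at index 0.
    buf = [send[p][p:] + send[p][:p] for p in range(P)]

    # Phase 2: one exchange round per bit of P. In round k, process p
    # sends every block whose current index has bit k set to rank
    # (p + k) mod P, and receives the matching blocks from (p - k) mod P.
    k = 1
    while k < P:
        incoming = [[] for _ in range(P)]
        for p in range(P):
            dst = (p + k) % P
            incoming[dst] = [(i, buf[p][i]) for i in range(P) if i & k]
        for p in range(P):
            for i, block in incoming[p]:
                buf[p][i] = block
        k <<= 1

    # Phase 3: final local inversion -- block i on process p now holds
    # the data sent by rank (p - i) mod P; reorder blocks by source rank.
    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        for i in range(P):
            recv[p][(p - i) % P] = buf[p][i]
    return recv
```

For example, with P = 5 the loop in phase 2 performs only 3 rounds (k = 1, 2, 4), yet every process ends up with exactly the blocks addressed to it. The non-uniform case is harder precisely because block sizes differ across pairs, which is what the padded and two-phase variants address.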