Abstract
Scalable parallel algorithms are considered for linear algebra applications.
A bottleneck in these algorithms is the mapping of matrix elements
to processors. Wrapping a block mapping in both rows and columns
of the matrix is called the torus-wrap mapping. Its generalization
is the block-torus-wrap, which assigns each block to a single processor
in such a way that the distribution of blocks mirrors the distribution
of elements in a torus-wrap mapping. It is proved that this assignment
scheme leads to dense matrix algorithms that achieve the lower bound
on interprocessor communication under reasonable conditions. Theoretical
and experimental results are compared with those obtained from more
traditional mappings.
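
As an illustration of the two mappings described above, the following is a minimal sketch and not the authors' implementation. It assumes the processors form a logical grid of `proc_rows` by `proc_cols`, that indices are zero-based, and that blocks are square with side `block_size`. All function and parameter names are hypothetical and chosen only for this example.

```python
def torus_wrap_owner(i, j, proc_rows, proc_cols):
    """Return the (row, col) grid coordinates of the processor owning element (i, j).

    Assumed reading of the torus-wrap mapping: indices wrap around the
    processor grid in both the row and the column direction.
    """
    return (i % proc_rows, j % proc_cols)


def block_torus_wrap_owner(i, j, block_size, proc_rows, proc_cols):
    """Return the owner of element (i, j) under an assumed block-torus-wrap mapping.

    The matrix is cut into square blocks; each block goes to one processor,
    and the blocks themselves are distributed the way single elements are
    distributed by torus_wrap_owner.
    """
    block_row = i // block_size
    block_col = j // block_size
    return torus_wrap_owner(block_row, block_col, proc_rows, proc_cols)


def elements_per_processor(n, block_size, proc_rows, proc_cols):
    """Count how many elements of an n-by-n matrix each processor owns."""
    counts = {}
    for i in range(n):
        for j in range(n):
            owner = block_torus_wrap_owner(i, j, block_size, proc_rows, proc_cols)
            counts[owner] = counts.get(owner, 0) + 1
    return counts
```

With `block_size = 1` the block variant reduces to the plain torus-wrap mapping, which is one way to read the statement that the distribution of blocks mirrors the distribution of elements.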