Direct Solution of Linear Systems
Providing factorization-based sparse solvers and preconditioners on large-scale parallel machines.
We are developing scalable sparse direct linear solvers and factorization-based algebraic preconditioners, which can handle the most challenging linear systems, those often too difficult for iterative methods. This work is applicable to accelerator design, earthquake simulation, fluid dynamics, fusion energy, materials science, quantum mechanics, and many other engineering and science domains.
Our effort is focused on the development of three types of linear solvers. The first type, encapsulated in the software symPACK and SuperLU, comprises pure direct solvers that use LU factorization followed by triangular solution. The second type, encapsulated in the software STRUMPACK, comprises quasi-linear-complexity preconditioners that use low-rank approximate factorization of dense submatrices, employing hierarchical matrix algebra such as the Hierarchically Semi-Separable (HSS) matrix format. The third type, encapsulated in the software PDSLin, is a Schur-complement-based nonoverlapping domain decomposition method that blends direct solvers for the subdomain solutions with preconditioned iterative solvers for the Schur complement system.
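The low-rank idea underlying the HSS format can be illustrated with a minimal NumPy sketch (not STRUMPACK's API; the kernel, block size, and tolerance below are illustrative assumptions): off-diagonal blocks that couple well-separated parts of a problem are often numerically low-rank, so a truncated factorization represents them with far fewer entries.

```python
# A minimal NumPy sketch (not STRUMPACK's API) of the low-rank idea behind
# HSS compression; the kernel, sizes, and tolerance here are illustrative.
import numpy as np

n = 128
x = np.linspace(0.0, 1.0, n)
# Smooth kernel coupling two well-separated clusters of points: such
# off-diagonal blocks are typically numerically low-rank.
K = 1.0 / (1.0 + np.abs(x[:, None] - (2.0 + x[None, :])))

U, s, Vt = np.linalg.svd(K, full_matrices=False)
r = int(np.sum(s > 1e-8 * s[0]))          # numerical rank at tolerance 1e-8
K_r = (U[:, :r] * s[:r]) @ Vt[:r]         # rank-r approximation
print(f"block {K.shape}, numerical rank {r}, relative error "
      f"{np.linalg.norm(K - K_r) / np.linalg.norm(K):.1e}")
```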
Sparse direct linear solvers are usually organized into three distinct phases: a preprocessing (commonly called symbolic) phase, a numerical factorization phase, and finally a triangular solution phase. When a sparse matrix is factored, some zero entries become nonzero, increasing both memory and computational costs. These new nonzero entries are referred to as fill-in entries. By applying good permutations to reorder the equations (rows) and the variables (columns), it is possible to reduce the amount of fill-in. Computing good permutations is referred to as the ordering problem and is usually handled during the symbolic phase. Our work covers all of these areas.
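To make the fill-in and ordering discussion concrete, the following sketch uses SciPy's splu, which wraps the sequential SuperLU; the test matrix and the choice of orderings are illustrative, not from the text. Comparing the natural ordering with a fill-reducing one shows how much the permutation matters.

```python
# An illustrative sketch of fill-in and ordering using SciPy's splu, which
# wraps the sequential SuperLU; the test matrix is a standard 2D Laplacian.
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 50
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], (n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsc()  # 5-point stencil

for order in ("NATURAL", "COLAMD"):       # no reordering vs. fill-reducing
    lu = splu(A, permc_spec=order)
    # Fill-in: nonzeros created beyond those of A (diagonal double-counted).
    fill = lu.L.nnz + lu.U.nnz - A.nnz
    print(f"{order:8s} nnz(L)+nnz(U) = {lu.L.nnz + lu.U.nnz:7d}  fill-in ~ {fill}")
```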
Sparse Symmetric Solver. FASTMath has developed and released a new direct linear solver for sparse symmetric positive definite matrices, symPACK (www.sympack.org). By exploiting the symmetry explicitly, symPACK achieves low storage costs. This novel solver implements the Fan-Both method on distributed-memory platforms using the UPC++ framework, which enables an asynchronous pull communication strategy coupled with event-driven task scheduling. On a wide collection of problems, symPACK often outperforms state-of-the-art solvers. It has been integrated into the Pole EXpansion and Selected Inversion (PEXSI) library, developed under two BES SciDAC-3 projects on computational chemistry and computational materials science.
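As a conceptual illustration only (sequential, single-process Python, not symPACK's or UPC++'s API), the sketch below captures the event-driven scheduling idea: a task becomes ready only after completion events arrive from all of its prerequisites, much as symPACK's tasks pull remote data as it becomes available.

```python
# Hypothetical sketch of event-driven task scheduling (not symPACK's API):
# a task runs once completion events from all its prerequisites arrive.
from collections import deque

def run_event_driven(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}."""
    waiting_on = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {}
    for t, prereqs in waiting_on.items():
        for p in prereqs:
            dependents.setdefault(p, set()).add(t)
    ready = deque(t for t, w in waiting_on.items() if not w)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                       # task pulls its inputs and executes
        order.append(t)
        for s in dependents.get(t, ()):  # deliver completion events
            waiting_on[s].discard(t)
            if not waiting_on[s]:
                ready.append(s)          # all inputs available: schedule
    return order

# Toy elimination-tree-like dependencies: children factored before parents.
tasks = {t: (lambda t=t: print("factor", t)) for t in "abcde"}
print(run_event_driven(tasks, {"c": {"a", "b"}, "e": {"c", "d"}}))
```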
Providing memory-scalable implementations of various steps in the symbolic phase is a challenging but crucial task. A significant accomplishment of FASTMath in this area is the implementation of a scalable distributed-memory Reverse Cuthill-McKee (RCM) ordering algorithm for reducing the profile of a sparse matrix. This is the first distributed-memory implementation of the RCM ordering algorithm.
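For a flavor of what RCM does, the sketch below uses SciPy's sequential reverse_cuthill_mckee on a random symmetric pattern (bandwidth is used as a simple proxy for the profile; the distributed-memory FASTMath implementation is not exercised here).

```python
# A single-node sketch using SciPy's sequential RCM (the FASTMath work is a
# distributed-memory version); bandwidth serves as a simple proxy for profile.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    C = M.tocoo()
    return int(np.max(np.abs(C.row - C.col)))

A = sp.random(200, 200, density=0.02, random_state=0)
A = (A + A.T).tocsr()                     # symmetrize the sparsity pattern

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm, :][:, perm]                   # symmetric permutation of A
print("bandwidth before:", bandwidth(A), "after:", bandwidth(B))
```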
Sparse Unsymmetric Solver. For the SuperLU sparse direct solver library, we developed new algorithms to make effective use of GPUs and Intel Xeon Phi. These include aggregating large dense blocks for BLAS calls, overlapping GPU and CPU computations, and using OpenMP tasking for irregular nested loops to avoid load imbalance. SuperLU achieves a 3x speedup and 2-5x memory savings on hundreds of GPU nodes of Titan at ORNL, and strong-scales to over 64 KNL nodes on Cori at NERSC. SuperLU is used in a wide variety of applications, in particular the SciDAC fusion codes M3D-C1 and NIMROD, the HEP ACE3P code, and the BES AMIS and DGDFT codes.
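The payoff of block aggregation can be seen even in a hypothetical single-node sketch, with NumPy matrix products standing in for the batched GPU BLAS calls SuperLU actually issues: one large GEMM amortizes per-call overhead relative to many small ones.

```python
# A hypothetical sketch of the block-aggregation idea, with NumPy standing in
# for the GPU BLAS calls SuperLU actually makes: one large GEMM amortizes
# per-call overhead relative to many small ones.
import time
import numpy as np

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((32, 32)) for _ in range(512)]
B = rng.standard_normal((32, 32))

t0 = time.perf_counter()
small = [Ai @ B for Ai in blocks]         # many small GEMMs
t1 = time.perf_counter()
big = np.vstack(blocks) @ B               # one aggregated GEMM, same result
t2 = time.perf_counter()

assert np.allclose(np.vstack(small), big)
print(f"512 small GEMMs: {t1 - t0:.4f}s  one aggregated GEMM: {t2 - t1:.4f}s")
```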
Recently we developed a communication-avoiding 3D sparse LU factorization in SuperLU_DIST, which reduces latency by a factor of O(log n) for planar graphs and O(n^(1/3)) for non-planar graphs. The new 3D code achieves speedups of up to 27x for planar graphs and up to 2.5x for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of Edison, a Cray XC30 at NERSC.