A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations.
Azzam Haidar, Ahmad Abdelfattah, Mawussi Zounon, Stanimire Tomov, Jack J. Dongarra
Published in: IEEE Trans. Parallel Distributed Syst. (2018)