Better parallelization for LowerTrs and UpperTrs (OpenMP backend)? #1122
learning-chip started this discussion in Ideas
Replies: 1 comment
-
Thank you for the report. Some of our OpenMP kernels are not yet optimized. The CUDA and HIP backends do have optimized triangular-solve kernels, either from the vendor libraries (cuSPARSE and hipSPARSE) or from our own implementations. We can definitely improve our OpenMP kernels, but unfortunately this is not a high priority for us right now. If you would like to contribute, we would be glad to help you navigate any issues you may face (with Ginkgo or with OpenMP).
-
I was comparing the efficiency of the ISAI approximation (`gko::preconditioner::Isai`) against an exact triangular solve (`gko::solver::LowerTrs`) in the context of an ILU/IC preconditioner. While ISAI parallelizes well (it is just SpMVs), the baseline `LowerTrs` doesn't seem to get any parallel speed-up, which makes the comparison somewhat unfair.

Looking into the OpenMP kernel of lower_trs:
ginkgo/omp/solver/lower_trs_kernels.cpp
Lines 101 to 119 in 693bc4f
Only the outer loop is parallelized, and it only iterates over multiple right-hand sides; there is no parallelism within the solve for a single RHS.
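To make the structure concrete, here is a simplified sketch of that parallelization pattern (an illustration only, not Ginkgo's actual kernel): forward substitution on a CSR lower-triangular matrix where the only `parallel for` is over right-hand sides, so each individual RHS is solved strictly sequentially. The function name and CSR layout (diagonal entry stored last in each row) are assumptions for the example.

```cpp
#include <cassert>
#include <vector>

// Sketch of the pattern described above: forward substitution on a CSR
// lower-triangular matrix. Only the loop over right-hand sides is
// parallelized; the row loop for a single RHS is inherently sequential.
void lower_trs_per_rhs(const std::vector<int>& row_ptrs,
                       const std::vector<int>& col_idxs,
                       const std::vector<double>& vals,
                       const std::vector<double>& b, std::vector<double>& x,
                       int n, int num_rhs)
{
#pragma omp parallel for  // parallelism only across right-hand sides
    for (int j = 0; j < num_rhs; ++j) {
        for (int row = 0; row < n; ++row) {  // sequential per RHS
            double sum = b[row * num_rhs + j];
            int diag = row_ptrs[row + 1] - 1;  // diagonal stored last (assumed)
            for (int k = row_ptrs[row]; k < diag; ++k) {
                sum -= vals[k] * x[col_idxs[k] * num_rhs + j];
            }
            x[row * num_rhs + j] = sum / vals[diag];
        }
    }
}
```

With a single RHS, `num_rhs == 1` and the `parallel for` has exactly one iteration, so all threads but one are idle.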
There are fast, parallel implementations of sparse triangular solve for both CPUs and GPUs that might be worth adding.
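One well-known technique from that literature is level scheduling: rows are grouped into levels such that every row in a level depends only on rows in earlier levels, so all rows within one level can be solved in parallel even for a single RHS. The sketch below is a minimal illustration of the idea under assumed names and a CSR layout with the diagonal stored last per row; it is not Ginkgo code, and a real implementation would cache the level analysis between solves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Level analysis: level(row) = 1 + max level of the rows it depends on.
struct LevelSchedule {
    std::vector<int> level_of_row;
    int num_levels;
};

LevelSchedule analyze_levels(const std::vector<int>& row_ptrs,
                             const std::vector<int>& col_idxs, int n)
{
    LevelSchedule s{std::vector<int>(n, 0), 0};
    for (int row = 0; row < n; ++row) {
        int lvl = 0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            int col = col_idxs[k];
            if (col < row) lvl = std::max(lvl, s.level_of_row[col] + 1);
        }
        s.level_of_row[row] = lvl;
        s.num_levels = std::max(s.num_levels, lvl + 1);
    }
    return s;
}

// Level-scheduled solve for a single RHS: sequential over levels,
// parallel over the independent rows inside each level.
void lower_trs_levels(const std::vector<int>& row_ptrs,
                      const std::vector<int>& col_idxs,
                      const std::vector<double>& vals,
                      const std::vector<double>& b, std::vector<double>& x,
                      int n)
{
    auto sched = analyze_levels(row_ptrs, col_idxs, n);
    std::vector<std::vector<int>> rows_in_level(sched.num_levels);
    for (int row = 0; row < n; ++row) {
        rows_in_level[sched.level_of_row[row]].push_back(row);
    }
    for (int lvl = 0; lvl < sched.num_levels; ++lvl) {
        const auto& rows = rows_in_level[lvl];
#pragma omp parallel for  // rows within a level are independent
        for (std::size_t i = 0; i < rows.size(); ++i) {
            int row = rows[i];
            double sum = b[row];
            int diag = row_ptrs[row + 1] - 1;  // diagonal stored last (assumed)
            for (int k = row_ptrs[row]; k < diag; ++k) {
                sum -= vals[k] * x[col_idxs[k]];
            }
            x[row] = sum / vals[diag];
        }
    }
}
```

How much speed-up this gives depends on the sparsity pattern: a matrix with few, wide levels parallelizes well, while a bidiagonal matrix degenerates to one row per level and stays sequential.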