Better parallelization for LowerTrs and UpperTrs (OpenMP backend)? #1122
learning-chip started this discussion in Ideas
Replies: 1 comment
-
Thank you for the report. Some of our OpenMP kernels are not yet optimized. The CUDA and HIP backends do have optimized triangular-solve kernels, either from the vendor libraries (cuSPARSE and hipSPARSE) or from our own implementations. We can definitely improve our OpenMP kernels, but unfortunately this is not a high priority for us right now. If you would like to contribute, we would be glad to help you navigate any issues you may face (with Ginkgo or with OpenMP).
-
I was comparing the efficiency of the ISAI approximation (`gko::preconditioner::Isai`) against an exact triangular solve (`gko::solver::LowerTrs`) in the context of an ILU/IC preconditioner. While ISAI parallelizes well (it is just SpMVs), the baseline `LowerTrs` doesn't seem to get any parallel speed-up, which makes the comparison somewhat unfair.

Looking into the OpenMP kernel of lower_trs:
ginkgo/omp/solver/lower_trs_kernels.cpp
Lines 101 to 119 in 693bc4f
Only the outer loop is parallelized, and it only iterates over multiple right-hand sides; there is no parallelism within the solve for a single RHS.
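To make the structure concrete, here is a simplified sketch of that parallelization pattern (an illustration only, not Ginkgo's actual kernel): forward substitution on a CSR lower-triangular matrix where the only `parallel for` is over right-hand sides, so each individual RHS is solved strictly sequentially. The function name and CSR layout (diagonal entry stored last in each row) are assumptions for the example.

```cpp
#include <cassert>
#include <vector>

// Sketch of the pattern described above: forward substitution on a CSR
// lower-triangular matrix. Only the loop over right-hand sides is
// parallelized; the row loop for a single RHS is inherently sequential.
void lower_trs_per_rhs(const std::vector<int>& row_ptrs,
                       const std::vector<int>& col_idxs,
                       const std::vector<double>& vals,
                       const std::vector<double>& b, std::vector<double>& x,
                       int n, int num_rhs)
{
#pragma omp parallel for  // parallelism only across right-hand sides
    for (int j = 0; j < num_rhs; ++j) {
        for (int row = 0; row < n; ++row) {  // sequential per RHS
            double sum = b[row * num_rhs + j];
            int diag = row_ptrs[row + 1] - 1;  // diagonal stored last (assumed)
            for (int k = row_ptrs[row]; k < diag; ++k) {
                sum -= vals[k] * x[col_idxs[k] * num_rhs + j];
            }
            x[row * num_rhs + j] = sum / vals[diag];
        }
    }
}
```

With a single RHS, `num_rhs == 1` and the `parallel for` has exactly one iteration, so all threads but one are idle.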
There are fast, parallel implementations of sparse triangular solve for both CPUs and GPUs that might be worth adding.
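One well-known technique from that literature is level scheduling: rows are grouped into levels such that every row in a level depends only on rows in earlier levels, so all rows within one level can be solved in parallel even for a single RHS. The sketch below is a minimal illustration of the idea under assumed names and a CSR layout with the diagonal stored last per row; it is not Ginkgo code, and a real implementation would cache the level analysis between solves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Level analysis: level(row) = 1 + max level of the rows it depends on.
struct LevelSchedule {
    std::vector<int> level_of_row;
    int num_levels;
};

LevelSchedule analyze_levels(const std::vector<int>& row_ptrs,
                             const std::vector<int>& col_idxs, int n)
{
    LevelSchedule s{std::vector<int>(n, 0), 0};
    for (int row = 0; row < n; ++row) {
        int lvl = 0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            int col = col_idxs[k];
            if (col < row) lvl = std::max(lvl, s.level_of_row[col] + 1);
        }
        s.level_of_row[row] = lvl;
        s.num_levels = std::max(s.num_levels, lvl + 1);
    }
    return s;
}

// Level-scheduled solve for a single RHS: sequential over levels,
// parallel over the independent rows inside each level.
void lower_trs_levels(const std::vector<int>& row_ptrs,
                      const std::vector<int>& col_idxs,
                      const std::vector<double>& vals,
                      const std::vector<double>& b, std::vector<double>& x,
                      int n)
{
    auto sched = analyze_levels(row_ptrs, col_idxs, n);
    std::vector<std::vector<int>> rows_in_level(sched.num_levels);
    for (int row = 0; row < n; ++row) {
        rows_in_level[sched.level_of_row[row]].push_back(row);
    }
    for (int lvl = 0; lvl < sched.num_levels; ++lvl) {
        const auto& rows = rows_in_level[lvl];
#pragma omp parallel for  // rows within a level are independent
        for (std::size_t i = 0; i < rows.size(); ++i) {
            int row = rows[i];
            double sum = b[row];
            int diag = row_ptrs[row + 1] - 1;  // diagonal stored last (assumed)
            for (int k = row_ptrs[row]; k < diag; ++k) {
                sum -= vals[k] * x[col_idxs[k]];
            }
            x[row] = sum / vals[diag];
        }
    }
}
```

How much speed-up this gives depends on the sparsity pattern: a matrix with few, wide levels parallelizes well, while a bidiagonal matrix degenerates to one row per level and stays sequential.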