In FFTW3 (or MKL) the only thread-safe functions are the `fftw_execute_...()` family.
This sometimes constrains the structure of multithreaded code, where each thread may need to perform FFTs with a different `fftw_plan`.
This simple wrapper around FFTW3 aims at making things easier by maintaining a global hash table of `fftw_plan`, where each thread may insert new plans and read already allocated plans.
The hash table is locked such that multiple readers can access it simultaneously, but only one thread at a time can insert a new entry. This allows multiple threads to reuse already allocated plans simultaneously, without allocating a new plan every time. I believe this approach has some performance and design advantages, because allocating a new plan for every FFT on every thread would require a mutex lock so that only one thread creates a plan at a time.
This simple wrapper fits the case where a large number of FFTs must be processed, but the total number of distinct FFT plans is not that large, and it may also be hard to preallocate all possible plans before running any FFTs. In other words, the size of the hash map of plans is expected to be much smaller than the total number of FFTs to run, so that reusing a plan benefits performance. In general I expect the size of the hash map to be on the order of 10 to 1000.
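As an illustration, here is a minimal sketch of that read-mostly lookup, specialized to 1-D complex transforms keyed by length. `RWLock` stands for a multiple-reader, single-writer lock (one possible OpenMP-based implementation is sketched after the next paragraph); the function name and signature here are hypothetical, not the actual SafeFFT internals.

```cpp
#include <fftw3.h>
#include <unordered_map>

// Hypothetical sketch: look up a cached plan under a read lock; on a miss,
// take the write lock and create the plan exactly once.
fftw_plan getPlan(int n, fftw_complex *in, fftw_complex *out, RWLock &lock,
                  std::unordered_map<int, fftw_plan> &planMap) {
    lock.readLock(); // many threads may search the map concurrently
    auto it = planMap.find(n);
    if (it != planMap.end()) {
        fftw_plan p = it->second;
        lock.readUnlock();
        return p; // common case: reuse an already allocated plan
    }
    lock.readUnlock();

    lock.writeLock();     // cache miss: only one thread may insert
    it = planMap.find(n); // re-check; another thread may have inserted
    if (it == planMap.end()) {
        // plan creation is NOT thread safe, hence the exclusive lock;
        // FFTW_ESTIMATE avoids overwriting in/out during planning
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        it = planMap.emplace(n, p).first;
    }
    fftw_plan p = it->second;
    lock.writeUnlock();
    return p; // execute later with the thread-safe fftw_execute_dft()
}
```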
The hash map is implemented with a plain `std::unordered_map`, guarded by a customized and admittedly naive multiple-reader, single-writer locking scheme built on `omp_lock`. It is naively implemented because OpenMP offers nothing like `std::shared_mutex`, and it is not clear to me whether mixing OpenMP threads with facilities from pthreads, `std::thread`, or `boost::thread` is safe. For the same reason, I am not using a concurrent hash map like the one from Intel TBB or other concurrent containers like Junction.
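For concreteness, below is one possible naive reader-writer lock built only from OpenMP primitives, in the spirit described above. It is a sketch under my own assumptions, not the exact `RWLock` shipped in `SafeFFT.hpp`: readers register in an atomically updated counter, while a writer holds the lock and spins until in-flight readers drain.

```cpp
#include <omp.h>

// Naive multiple-reader, single-writer lock on top of omp_lock_t.
class RWLock {
    omp_lock_t writerLock_;
    int readers_ = 0;

  public:
    RWLock() { omp_init_lock(&writerLock_); }
    ~RWLock() { omp_destroy_lock(&writerLock_); }

    void readLock() {
        omp_set_lock(&writerLock_); // blocks while a writer is active
#pragma omp atomic
        ++readers_;
        omp_unset_lock(&writerLock_); // readers then proceed concurrently
    }
    void readUnlock() {
#pragma omp atomic
        --readers_;
    }
    void writeLock() {
        omp_set_lock(&writerLock_); // exclude new readers and other writers
        int r;
        do { // wait for readers already inside to finish
#pragma omp atomic read
            r = readers_;
        } while (r != 0);
    }
    void writeUnlock() { omp_unset_lock(&writerLock_); }
};
```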
- All threads share one SafeFFT object to process many FFTs in parallel (see `test/TestForwardBackward.cpp` and the sketch after this list).
- Every object can have its own SafeFFT object to process its FFTs, and then many such objects are partitioned through `#pragma omp parallel for` to execute FFTs simultaneously (see `test/Test1DSmall.cpp` and others).
- In either case, the member functions of `SafeFFT` are supposed to be called from the root-level OpenMP thread team. Otherwise, the thread id returned by `omp_get_thread_num()` may not be meaningful, which may cause problems when locating the per-thread buffers and internal locks.
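A hypothetical sketch of the first (shared-object) pattern is below. The fields of the plan struct and the exact signatures of `fitBuffer()` and `runFFT()` are my assumptions for illustration; see `test/TestForwardBackward.cpp` for the real API.

```cpp
#include "SafeFFT.hpp"

int main() {
    safefft::init(); // once, before any runFFT() or fitBuffer() call

    safefft::SafeFFT fft; // one object shared by the whole thread team
#pragma omp parallel for
    for (int i = 0; i < 10000; ++i) {
        safefft::PlanFFT plan;      // equal plans hit the same cached
        plan.n0 = 64 * (1 + i % 8); // entry in the shared hash table
        safefft::ComplexT *in = nullptr, *out = nullptr;
        fft.fitBuffer(plan, in, out); // assumed to bind in/out to this
                                      // thread's aligned buffers
        // ... fill in[0 .. plan.n0) ...
        fft.runFFT(plan, in, out); // safe to call concurrently
    }
    return 0;
}
```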
- It is header only. Include `SafeFFT.hpp` in your code. Call `safefft::init()` once before calling `runFFT()` or `fitBuffer()`.
- All SafeFFT objects share one Runner object, which does the real work (sketched after this list). Only one Runner instance should exist throughout the program.
- The SafeFFT object stores only one pointer. Its memory footprint is small, and it can be declared as a member of the user's class.
- Each thread maintains its own aligned in/out buffer memory.
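The Runner and per-thread buffer notes above can be pictured roughly as follows; all names and members here are hypothetical, not the actual internals.

```cpp
#include <fftw3.h>
#include <omp.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one global Runner owns the plan table, the locks,
// and one aligned buffer slot per thread; every SafeFFT object holds only
// a pointer to it.
class Runner {
    std::vector<fftw_complex *> inBuf_, outBuf_; // one slot per thread
  public:
    explicit Runner(int nThreads)
        : inBuf_(nThreads, nullptr), outBuf_(nThreads, nullptr) {}
    // Each thread touches only its own slot, indexed by omp_get_thread_num();
    // this is why calls must come from the root-level thread team.
    fftw_complex *fitInBuffer(std::size_t n) {
        const int tid = omp_get_thread_num();
        if (inBuf_[tid])
            fftw_free(inBuf_[tid]);
        inBuf_[tid] = fftw_alloc_complex(n); // SIMD-aligned allocation
        return inBuf_[tid];
    }
    // ... the plan hash table and the reader-writer lock live here too ...
};

inline Runner *globalRunner = nullptr; // the single Runner instance

class SafeFFT {
    Runner *runner_ = globalRunner; // only one pointer: tiny footprint,
                                    // cheap to embed in user classes
  public:
    // member functions forward to *runner_ ...
};
```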
- Link with `libfftw3_threads`, `libfftw3_omp`, or `mkl`.
- If using MKL, define the macro `FFTW3_MKL` for threading control.
- If using MKL, calling `runFFT()` from multiple threads needs careful threading control. The only working setting I have found is the following (sketched in code after this list): a. Set `OMP_NUM_THREADS` to control the number of threads calling `runFFT()`. b. Call `mkl_set_num_threads_local(plan_.nThreads);` inside `runFFT()` to control the number of threads executing each plan. c. Set `OMP_NESTED=true` and `MKL_DYNAMIC=false`. DO NOT set `MKL_NUM_THREADS`.
- Nested threading is implemented for flexibility. For small FFTs (even those in Test1DLarge.cpp), sticking to `plan.nThreads=1` gives better performance. On my 12-core Xeon, the program Test1DLarge shows around 95% efficiency with 10 cores. It is the user's duty to tune this and get good performance. `fftw3_threads` uses pthreads rather than OpenMP threads (at least in my understanding); in my tests on a Mac the nested threading control still works well.
- The code is internally implemented with the FFTW Guru interface. Parameters in the `PlanGuruFFT` struct are passed directly to FFTW, so any functionality of FFTW should be supported. Note that MKL does not implement all FFTW functions; read the documentation first.
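To make the MKL settings from the list concrete, here is a sketch that applies the same configuration programmatically instead of via environment variables. The per-plan thread count is a placeholder here; in SafeFFT it would come from `plan_.nThreads` inside `runFFT()`.

```cpp
#include <mkl.h>
#include <omp.h>
#include <cstdio>

// Equivalent in code to OMP_NESTED=true and MKL_DYNAMIC=false,
// with MKL_NUM_THREADS left unset.
int main() {
    omp_set_nested(1);  // allow MKL's inner team inside our outer team
    mkl_set_dynamic(0); // forbid MKL from shrinking the inner team
    // OMP_NUM_THREADS (or omp_set_num_threads) decides how many threads
    // call runFFT() concurrently in the outer team.
#pragma omp parallel
    {
        // inside runFFT(), SafeFFT calls this with plan_.nThreads;
        // the value 2 here is just a placeholder for the sketch
        mkl_set_num_threads_local(2);
        std::printf("outer thread %d ready\n", omp_get_thread_num());
    }
    return 0;
}
```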
- I cannot guarantee the naive implementation of RWLock with omp_lock is optimal or even correct, although it works fine in my tests.
- The effect of mixing with other thread models is unknown.
- The Guru interface is not fully tested. Please perform your own tests before using it, and please report any bugs you find.
- The code is probably ugly and I cannot guarantee it is bug free. Comments, forks, and bug reports are welcome.