@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
@chapter Multi-threaded FFTW
@cindex parallel transform

In this chapter we document the parallel FFTW routines for shared-memory parallel hardware.  These routines, which support parallel one- and multi-dimensional transforms of both real and complex data, are the easiest way to take advantage of multiple processors with FFTW.  They work just like the corresponding uniprocessor transform routines, except that you have an extra initialization routine to call, and there is a routine to set the number of threads to employ.  Any program that uses the uniprocessor FFTW can therefore be trivially modified to use the multi-threaded FFTW.

A shared-memory machine is one in which all CPUs can directly access the same main memory, and such machines are now common due to the ubiquity of multi-core CPUs.  FFTW's multi-threading support allows you to utilize these additional CPUs transparently from a single program.  However, this does not necessarily translate into performance gains---when multiple threads/CPUs are employed, there is an overhead required for synchronization that may outweigh the computational parallelism.  Therefore, you can only benefit from threads if your problem is sufficiently large.
@cindex shared-memory
@cindex threads

@menu
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Thread safety::
@end menu

@c ------------------------------------------------------------
@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
@section Installation and Supported Hardware/Software

All of the FFTW threads code is located in the @code{threads} subdirectory of the FFTW package.
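For example, on a Unix-like system, a build that enables both threading flavors might look like the following sketch (the configure flags are described in this section; your source location, installation prefix, and privileges may differ):

```shell
# From the unpacked FFTW source directory:
./configure --enable-threads --enable-openmp
make
sudo make install   # installs libfftw3, libfftw3_threads, and libfftw3_omp
```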
On Unix systems, the FFTW threads libraries and header files can be automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including @code{--enable-threads} in the flags to the @code{configure} script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use @uref{http://www.openmp.org,OpenMP} threads.
@fpindex configure
@cindex portability
@cindex OpenMP

The threads routines require your operating system to have some sort of shared-memory threads support.  Specifically, the FFTW threads package works with POSIX threads (available on most Unix variants, from GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which are supported in many common compilers (e.g. gcc), are also supported, and may give better performance on some systems.  (OpenMP threads are also useful if you are employing OpenMP in your own code, in order to minimize conflicts between threading models.)  If you have a shared-memory machine that uses a different threads API, it should be a simple matter of programming to include support for it; see the file @code{threads/threads.c} for more detail.

You can compile FFTW with @emph{both} @code{--enable-threads} and @code{--enable-openmp} at the same time, since they install libraries with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as described below).  However, your programs may only link to @emph{one} of these two libraries at a time.

Ideally, of course, you should also have multiple processors in order to get any benefit from the threaded transforms.

@c ------------------------------------------------------------
@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
@section Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the multi-threaded routines.

@cindex OpenMP
First, programs using the parallel complex transforms should be linked with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp -lfftw3 -lm} if you compiled with OpenMP.  You will also need to link with whatever library is responsible for threads on your system (e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag enables OpenMP (e.g. @code{-fopenmp} with gcc).
@cindex linking on Unix

Second, before calling @emph{any} FFTW routines, you should call the function:

@example
int fftw_init_threads(void);
@end example
@findex fftw_init_threads

This function, which need only be called once, performs any one-time initialization required to use threads on your system.  It returns zero if there was some error (which should not happen under normal circumstances) and a non-zero value otherwise.

Third, before creating a plan that you want to parallelize, you should call:

@example
void fftw_plan_with_nthreads(int nthreads);
@end example
@findex fftw_plan_with_nthreads

The @code{nthreads} argument indicates the number of threads you want FFTW to use (or actually, the maximum number).  All plans subsequently created with any planner routine will use that many threads.  You can call @code{fftw_plan_with_nthreads}, create some plans, call @code{fftw_plan_with_nthreads} again with a different argument, and create some more plans for a new number of threads.  Plans already created before a call to @code{fftw_plan_with_nthreads} are unaffected.  If you pass an @code{nthreads} argument of @code{1} (the default), threads are disabled for subsequent plans.
You can determine the current number of threads that the planner can use by calling:

@example
int fftw_planner_nthreads(void);
@end example
@findex fftw_planner_nthreads

@cindex OpenMP
With OpenMP, to configure FFTW to use all of the currently running OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the @code{OMP_NUM_THREADS} environment variable), you can do: @code{fftw_plan_with_nthreads(omp_get_max_threads())}.  (The @samp{omp_} OpenMP functions are declared via @code{#include <omp.h>}.)

@cindex thread safety
Given a plan, you then execute it as usual with @code{fftw_execute(plan)}, and the execution will use the number of threads specified when the plan was created.  When done, you destroy it as usual with @code{fftw_destroy_plan}.  As described in @ref{Thread safety}, plan @emph{execution} is thread-safe, but plan creation and destruction are @emph{not}: you should create/destroy plans only from a single thread, but can safely execute multiple plans in parallel.

There is one additional routine: if you want to get rid of all memory and other resources allocated internally by FFTW, you can call:

@example
void fftw_cleanup_threads(void);
@end example
@findex fftw_cleanup_threads

which is much like the @code{fftw_cleanup()} function except that it also gets rid of threads-related data.  You must @emph{not} execute any previously created plans after calling this function.

We should also mention one other restriction: if you save wisdom from a program using the multi-threaded FFTW, that wisdom @emph{cannot be used} by a program using only the single-threaded FFTW (i.e. not calling @code{fftw_init_threads}).  @xref{Words of Wisdom-Saving Plans}.
Finally, FFTW provides an optional callback interface that allows you to replace its parallel threading backend at runtime:

@example
void fftw_threads_set_callback(
  void (*parallel_loop)(void *(*work)(char *), char *jobdata,
                        size_t elsize, int njobs, void *data),
  void *data);
@end example
@findex fftw_threads_set_callback

This routine (which is @emph{not} thread-safe and should generally be called before creating any FFTW plans) allows you to provide a function @code{parallel_loop} that executes parallel work for FFTW: it should call the function @code{work(jobdata + elsize*i)} for @code{i} from @code{0} to @code{njobs-1}, possibly in parallel.  (The @code{data} pointer supplied to @code{fftw_threads_set_callback} is passed through to your @code{parallel_loop} function.)  For example, if you link to an FFTW threads library built to use POSIX threads, but you want it to use OpenMP instead (because you are using OpenMP elsewhere in your program and want to avoid competing threads), you can call @code{fftw_threads_set_callback} with the callback function:

@example
void parallel_loop(void *(*work)(char *), char *jobdata,
                   size_t elsize, int njobs, void *data)
@{
  #pragma omp parallel for
  for (int i = 0; i < njobs; ++i)
    work(jobdata + elsize * i);
@}
@end example

The same mechanism could be used in order to make FFTW use a threading backend implemented via Intel TBB, Apple GCD, or Cilk, for example.

@c ------------------------------------------------------------
@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
@section How Many Threads to Use?
@cindex number of threads

There is a fair amount of overhead involved in synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have.  As a general rule, you don't want to use more threads than you have processors.  (Using more threads will work, but there will be extra overhead with no benefit.)
In fact, if the problem size is too small, you may want to use fewer threads than you have processors.  You will have to experiment with your system to see what level of parallelization is best for your problem size.  Typically, the problem will have to involve at least a few thousand data points before threads become beneficial.  If you plan with @code{FFTW_PATIENT}, it will automatically disable threads for sizes that don't benefit from parallelization.
@ctindex FFTW_PATIENT

@c ------------------------------------------------------------
@node Thread safety, , How Many Threads to Use?, Multi-threaded FFTW
@section Thread safety
@cindex threads
@cindex OpenMP
@cindex thread safety

Users writing multi-threaded programs (including OpenMP) must concern themselves with the @dfn{thread safety} of the libraries they use---that is, whether it is safe to call routines in parallel from multiple threads.  FFTW can be used in such an environment, but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe routine in FFTW is @code{fftw_execute} (and the new-array variants thereof).  All other routines (e.g. the planner) should only be called from one thread at a time.  So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread.  We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.

Note also that, since the plan is not modified by @code{fftw_execute}, it is safe to execute the @emph{same plan} in parallel by multiple threads.
However, since a given plan operates by default on a fixed array, you need to use one of the new-array execute functions (@pxref{New-array Execute Functions}) so that different threads compute the transform of different data.

(Users should note that these comments only apply to programs using shared-memory threads or OpenMP.  Parallelism using MPI or forked processes involves a separate address space and global variables for each process, and is not susceptible to problems of this sort.)

The FFTW planner is intended to be called from a single thread.  If you really must call it from multiple threads, you are expected to grab whatever lock makes sense for your application, with the understanding that you may be holding that lock for a long time, which is undesirable.

Neither strategy works, however, in the following situation.  The ``application'' is structured as a set of ``plugins'' which are unaware of each other, and for whatever reason the ``plugins'' cannot coordinate on grabbing the lock.  (This is not a technical problem, but an organizational one.  The ``plugins'' are written by independent agents, and from the perspective of each plugin's author, each plugin is using FFTW correctly from a single thread.)  To cope with this situation, starting from FFTW-3.3.5, FFTW supports an API to make the planner thread-safe:

@example
void fftw_make_planner_thread_safe(void);
@end example
@findex fftw_make_planner_thread_safe

This call operates by brute force: it just installs a hook that wraps a lock (chosen by us) around all planner calls.  So there is no magic and you get the worst of all worlds.  The planner is still single-threaded, but you cannot choose which lock to use.  The planner still holds the lock for a long time, but you cannot impose a timeout on lock acquisition.  As of FFTW-3.3.5 and FFTW-3.3.6, this call does not work when using OpenMP as threading substrate.  (Suggestions on what to do about this bug are welcome.)
@emph{Do not use @code{fftw_make_planner_thread_safe} unless there is no other choice,} such as in the application/plugin situation.