mirror of
https://github.com/tildearrow/furnace.git
synced 2024-11-30 16:33:01 +00:00
54e93db207
not reliable yet
282 lines
13 KiB
Text
282 lines
13 KiB
Text
@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
|
|
@chapter Multi-threaded FFTW
|
|
|
|
@cindex parallel transform
|
|
In this chapter we document the parallel FFTW routines for
|
|
shared-memory parallel hardware. These routines, which support
|
|
parallel one- and multi-dimensional transforms of both real and
|
|
complex data, are the easiest way to take advantage of multiple
|
|
processors with FFTW. They work just like the corresponding
|
|
uniprocessor transform routines, except that you have an extra
|
|
initialization routine to call, and there is a routine to set the
|
|
number of threads to employ. Any program that uses the uniprocessor
|
|
FFTW can therefore be trivially modified to use the multi-threaded
|
|
FFTW.
|
|
|
|
A shared-memory machine is one in which all CPUs can directly access
|
|
the same main memory, and such machines are now common due to the
|
|
ubiquity of multi-core CPUs. FFTW's multi-threading support allows
|
|
you to utilize these additional CPUs transparently from a single
|
|
program. However, this does not necessarily translate into
|
|
performance gains---when multiple threads/CPUs are employed, there is
|
|
an overhead required for synchronization that may outweigh the
|
|
computatational parallelism. Therefore, you can only benefit from
|
|
threads if your problem is sufficiently large.
|
|
@cindex shared-memory
|
|
@cindex threads
|
|
|
|
@menu
|
|
* Installation and Supported Hardware/Software::
|
|
* Usage of Multi-threaded FFTW::
|
|
* How Many Threads to Use?::
|
|
* Thread safety::
|
|
@end menu
|
|
|
|
@c ------------------------------------------------------------
|
|
@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
|
|
@section Installation and Supported Hardware/Software
|
|
|
|
All of the FFTW threads code is located in the @code{threads}
|
|
subdirectory of the FFTW package. On Unix systems, the FFTW threads
|
|
libraries and header files can be automatically configured, compiled,
|
|
and installed along with the uniprocessor FFTW libraries simply by
|
|
including @code{--enable-threads} in the flags to the @code{configure}
|
|
script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use
|
|
@uref{http://www.openmp.org,OpenMP} threads.
|
|
@fpindex configure
|
|
|
|
|
|
@cindex portability
|
|
@cindex OpenMP
|
|
The threads routines require your operating system to have some sort
|
|
of shared-memory threads support. Specifically, the FFTW threads
|
|
package works with POSIX threads (available on most Unix variants,
|
|
from GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which
|
|
are supported in many common compilers (e.g. gcc) are also supported,
|
|
and may give better performance on some systems. (OpenMP threads are
|
|
also useful if you are employing OpenMP in your own code, in order to
|
|
minimize conflicts between threading models.) If you have a
|
|
shared-memory machine that uses a different threads API, it should be
|
|
a simple matter of programming to include support for it; see the file
|
|
@code{threads/threads.c} for more detail.
|
|
|
|
You can compile FFTW with @emph{both} @code{--enable-threads} and
|
|
@code{--enable-openmp} at the same time, since they install libraries
|
|
with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as
|
|
described below). However, your programs may only link to @emph{one}
|
|
of these two libraries at a time.
|
|
|
|
Ideally, of course, you should also have multiple processors in order to
|
|
get any benefit from the threaded transforms.
|
|
|
|
@c ------------------------------------------------------------
|
|
@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
|
|
@section Usage of Multi-threaded FFTW
|
|
|
|
Here, it is assumed that the reader is already familiar with the usage
|
|
of the uniprocessor FFTW routines, described elsewhere in this manual.
|
|
We only describe what one has to change in order to use the
|
|
multi-threaded routines.
|
|
|
|
@cindex OpenMP
|
|
First, programs using the parallel complex transforms should be linked
|
|
with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp
|
|
-lfftw3 -lm} if you compiled with OpenMP. You will also need to link
|
|
with whatever library is responsible for threads on your system
|
|
(e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag
|
|
enables OpenMP (e.g. @code{-fopenmp} with gcc).
|
|
@cindex linking on Unix
|
|
|
|
|
|
Second, before calling @emph{any} FFTW routines, you should call the
|
|
function:
|
|
|
|
@example
|
|
int fftw_init_threads(void);
|
|
@end example
|
|
@findex fftw_init_threads
|
|
|
|
This function, which need only be called once, performs any one-time
|
|
initialization required to use threads on your system. It returns zero
|
|
if there was some error (which should not happen under normal
|
|
circumstances) and a non-zero value otherwise.
|
|
|
|
Third, before creating a plan that you want to parallelize, you should
|
|
call:
|
|
|
|
@example
|
|
void fftw_plan_with_nthreads(int nthreads);
|
|
@end example
|
|
@findex fftw_plan_with_nthreads
|
|
|
|
The @code{nthreads} argument indicates the number of threads you want
|
|
FFTW to use (or actually, the maximum number). All plans subsequently
|
|
created with any planner routine will use that many threads. You can
|
|
call @code{fftw_plan_with_nthreads}, create some plans, call
|
|
@code{fftw_plan_with_nthreads} again with a different argument, and
|
|
create some more plans for a new number of threads. Plans already created
|
|
before a call to @code{fftw_plan_with_nthreads} are unaffected. If you
|
|
pass an @code{nthreads} argument of @code{1} (the default), threads are
|
|
disabled for subsequent plans.
|
|
|
|
You can determine the current number of threads that the planner can
|
|
use by calling:
|
|
|
|
@example
|
|
int fftw_planner_nthreads(void);
|
|
@end example
|
|
@findex fftw_planner_nthreads
|
|
|
|
@cindex OpenMP
|
|
With OpenMP, to configure FFTW to use all of the currently running
|
|
OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the
|
|
@code{OMP_NUM_THREADS} environment variable), you can do:
|
|
@code{fftw_plan_with_nthreads(omp_get_max_threads())}. (The @samp{omp_}
|
|
OpenMP functions are declared via @code{#include <omp.h>}.)
|
|
|
|
@cindex thread safety
|
|
Given a plan, you then execute it as usual with
|
|
@code{fftw_execute(plan)}, and the execution will use the number of
|
|
threads specified when the plan was created. When done, you destroy
|
|
it as usual with @code{fftw_destroy_plan}. As described in
|
|
@ref{Thread safety}, plan @emph{execution} is thread-safe, but plan
|
|
creation and destruction are @emph{not}: you should create/destroy
|
|
plans only from a single thread, but can safely execute multiple plans
|
|
in parallel.
|
|
|
|
There is one additional routine: if you want to get rid of all memory
|
|
and other resources allocated internally by FFTW, you can call:
|
|
|
|
@example
|
|
void fftw_cleanup_threads(void);
|
|
@end example
|
|
@findex fftw_cleanup_threads
|
|
|
|
which is much like the @code{fftw_cleanup()} function except that it
|
|
also gets rid of threads-related data. You must @emph{not} execute any
|
|
previously created plans after calling this function.
|
|
|
|
We should also mention one other restriction: if you save wisdom from a
|
|
program using the multi-threaded FFTW, that wisdom @emph{cannot be used}
|
|
by a program using only the single-threaded FFTW (i.e. not calling
|
|
@code{fftw_init_threads}). @xref{Words of Wisdom-Saving Plans}.
|
|
|
|
Finally, FFTW provides a optional callback interface that allows you to
|
|
replace its parallel threading backend at runtime:
|
|
|
|
@example
|
|
void fftw_threads_set_callback(
|
|
void (*parallel_loop)(void *(*work)(void *), char *jobdata, size_t elsize, int njobs, void *data),
|
|
void *data);
|
|
@end example
|
|
@findex fftw_threads_set_callback
|
|
|
|
This routine (which is @emph{not} threadsafe and should generally be called before creating
|
|
any FFTW plans) allows you to provide a function @code{parallel_loop} that executes
|
|
parallel work for FFTW: it should call the function @code{work(jobdata + elsize*i)} for
|
|
@code{i} from @code{0} to @code{njobs-1}, possibly in parallel. (The `data` pointer
|
|
supplied to @code{fftw_threads_set_callback} is passed through to your @code{parallel_loop}
|
|
function.) For example, if you link to an FFTW threads library built to use POSIX threads,
|
|
but you want it to use OpenMP instead (because you are using OpenMP elsewhere in your program
|
|
and want to avoid competing threads), you can call @code{fftw_threads_set_callback} with
|
|
the callback function:
|
|
|
|
@example
|
|
void parallel_loop(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data)
|
|
@{
|
|
#pragma omp parallel for
|
|
for (int i = 0; i < njobs; ++i)
|
|
work(jobdata + elsize * i);
|
|
@}
|
|
@end example
|
|
|
|
The same mechanism could be used in order to make FFTW use a threading backend
|
|
implemented via Intel TBB, Apple GCD, or Cilk, for example.
|
|
|
|
|
|
@c ------------------------------------------------------------
|
|
@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
|
|
@section How Many Threads to Use?
|
|
|
|
@cindex number of threads
|
|
There is a fair amount of overhead involved in synchronizing threads,
|
|
so the optimal number of threads to use depends upon the size of the
|
|
transform as well as on the number of processors you have.
|
|
|
|
As a general rule, you don't want to use more threads than you have
|
|
processors. (Using more threads will work, but there will be extra
|
|
overhead with no benefit.) In fact, if the problem size is too small,
|
|
you may want to use fewer threads than you have processors.
|
|
|
|
You will have to experiment with your system to see what level of
|
|
parallelization is best for your problem size. Typically, the problem
|
|
will have to involve at least a few thousand data points before threads
|
|
become beneficial. If you plan with @code{FFTW_PATIENT}, it will
|
|
automatically disable threads for sizes that don't benefit from
|
|
parallelization.
|
|
@ctindex FFTW_PATIENT
|
|
|
|
@c ------------------------------------------------------------
|
|
@node Thread safety, , How Many Threads to Use?, Multi-threaded FFTW
|
|
@section Thread safety
|
|
|
|
@cindex threads
|
|
@cindex OpenMP
|
|
@cindex thread safety
|
|
Users writing multi-threaded programs (including OpenMP) must concern
|
|
themselves with the @dfn{thread safety} of the libraries they
|
|
use---that is, whether it is safe to call routines in parallel from
|
|
multiple threads. FFTW can be used in such an environment, but some
|
|
care must be taken because the planner routines share data
|
|
(e.g. wisdom and trigonometric tables) between calls and plans.
|
|
|
|
The upshot is that the only thread-safe routine in FFTW is
|
|
@code{fftw_execute} (and the new-array variants thereof). All other routines
|
|
(e.g. the planner) should only be called from one thread at a time. So,
|
|
for example, you can wrap a semaphore lock around any calls to the
|
|
planner; even more simply, you can just create all of your plans from
|
|
one thread. We do not think this should be an important restriction
|
|
(FFTW is designed for the situation where the only performance-sensitive
|
|
code is the actual execution of the transform), and the benefits of
|
|
shared data between plans are great.
|
|
|
|
Note also that, since the plan is not modified by @code{fftw_execute},
|
|
it is safe to execute the @emph{same plan} in parallel by multiple
|
|
threads. However, since a given plan operates by default on a fixed
|
|
array, you need to use one of the new-array execute functions (@pxref{New-array Execute Functions}) so that different threads compute the transform of different data.
|
|
|
|
(Users should note that these comments only apply to programs using
|
|
shared-memory threads or OpenMP. Parallelism using MPI or forked processes
|
|
involves a separate address-space and global variables for each process,
|
|
and is not susceptible to problems of this sort.)
|
|
|
|
The FFTW planner is intended to be called from a single thread. If you
|
|
really must call it from multiple threads, you are expected to grab
|
|
whatever lock makes sense for your application, with the understanding
|
|
that you may be holding that lock for a long time, which is undesirable.
|
|
|
|
Neither strategy works, however, in the following situation. The
|
|
``application'' is structured as a set of ``plugins'' which are unaware
|
|
of each other, and for whatever reason the ``plugins'' cannot coordinate
|
|
on grabbing the lock. (This is not a technical problem, but an
|
|
organizational one. The ``plugins'' are written by independent agents,
|
|
and from the perspective of each plugin's author, each plugin is using
|
|
FFTW correctly from a single thread.) To cope with this situation,
|
|
starting from FFTW-3.3.5, FFTW supports an API to make the planner
|
|
thread-safe:
|
|
|
|
@example
|
|
void fftw_make_planner_thread_safe(void);
|
|
@end example
|
|
@findex fftw_make_planner_thread_safe
|
|
|
|
This call operates by brute force: It just installs a hook that wraps a
|
|
lock (chosen by us) around all planner calls. So there is no magic and
|
|
you get the worst of all worlds. The planner is still single-threaded,
|
|
but you cannot choose which lock to use. The planner still holds the
|
|
lock for a long time, but you cannot impose a timeout on lock
|
|
acquisition. As of FFTW-3.3.5 and FFTW-3.3.6, this call does not work
|
|
when using OpenMP as threading substrate. (Suggestions on what to do
|
|
about this bug are welcome.) @emph{Do not use
|
|
@code{fftw_make_planner_thread_safe} unless there is no other choice,}
|
|
such as in the application/plugin situation.
|