@node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top @chapter Distributed-memory FFTW with MPI @cindex MPI @cindex parallel transform In this chapter we document the parallel FFTW routines for parallel systems supporting the MPI message-passing interface. Unlike the shared-memory threads described in the previous chapter, MPI allows you to use @emph{distributed-memory} parallelism, where each CPU has its own separate memory, and which can scale up to clusters of many thousands of processors. This capability comes at a price, however: each process only stores a @emph{portion} of the data to be transformed, which means that the data structures and programming-interface are quite different from the serial or threads versions of FFTW. @cindex data distribution Distributed-memory parallelism is especially useful when you are transforming arrays so large that they do not fit into the memory of a single processor. The storage per-process required by FFTW's MPI routines is proportional to the total array size divided by the number of processes. Conversely, distributed-memory parallelism can easily pose an unacceptably high communications overhead for small problems; the threshold problem size for which parallelism becomes advantageous will depend on the precise problem you are interested in, your hardware, and your MPI implementation. A note on terminology: in MPI, you divide the data among a set of ``processes'' which each run in their own memory address space. Generally, each process runs on a different physical processor, but this is not required. A set of processes in MPI is described by an opaque data structure called a ``communicator,'' the most common of which is the predefined communicator @code{MPI_COMM_WORLD} which refers to @emph{all} processes. For more information on these and other concepts common to all MPI programs, we refer the reader to the documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home page}. @cindex MPI communicator @ctindex MPI_COMM_WORLD We assume in this chapter that the reader is familiar with the usage of the serial (uniprocessor) FFTW, and focus only on the concepts new to the MPI interface. @menu * FFTW MPI Installation:: * Linking and Initializing MPI FFTW:: * 2d MPI example:: * MPI Data Distribution:: * Multi-dimensional MPI DFTs of Real Data:: * Other Multi-dimensional Real-data MPI Transforms:: * FFTW MPI Transposes:: * FFTW MPI Wisdom:: * Avoiding MPI Deadlocks:: * FFTW MPI Performance Tips:: * Combining MPI and Threads:: * FFTW MPI Reference:: * FFTW MPI Fortran Interface:: @end menu @c ------------------------------------------------------------ @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI @section FFTW MPI Installation All of the FFTW MPI code is located in the @code{mpi} subdirectory of the FFTW package. On Unix systems, the FFTW MPI libraries and header files are automatically configured, compiled, and installed along with the uniprocessor FFTW libraries simply by including @code{--enable-mpi} in the flags to the @code{configure} script (@pxref{Installation on Unix}). @fpindex configure Any implementation of the MPI standard, version 1 or later, should work with FFTW. The @code{configure} script will attempt to automatically detect how to compile and link code using your MPI implementation. 
In some cases, especially if you have multiple different MPI implementations installed or have an unusual MPI software package, you may need to provide this information explicitly. Most commonly, one compiles MPI code by invoking a special compiler command, typically @code{mpicc} for C code. The @code{configure} script knows the most common names for this command, but you can specify the MPI compilation command explicitly by setting the @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}. @fpindex mpicc If, instead of a special compiler command, you need to link a certain library, you can specify the link command via the @code{MPILIBS} variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if your MPI library is installed in a non-standard location (one the compiler does not know about by default), you may also have to specify the location of the library and header files via @code{LDFLAGS} and @code{CPPFLAGS} variables, respectively, as in @samp{./configure LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}. @c ------------------------------------------------------------ @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI @section Linking and Initializing MPI FFTW Programs using the MPI FFTW routines should be linked with @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision, @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on (@pxref{Precision}). You will also need to link with whatever library is responsible for MPI on your system; in most MPI implementations, there is a special compiler alias named @code{mpicc} to compile and link MPI code. @fpindex mpicc @cindex linking on Unix @cindex precision @findex fftw_init_threads Before calling any FFTW routines except possibly @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling @code{MPI_Init}, you should call the function: @example void fftw_mpi_init(void); @end example @findex fftw_mpi_init If, at the end of your program, you want to get rid of all memory and other resources allocated internally by FFTW, for both the serial and MPI routines, you can call: @example void fftw_mpi_cleanup(void); @end example @findex fftw_mpi_cleanup which is much like the @code{fftw_cleanup()} function except that it also gets rid of FFTW's MPI-related data. You must @emph{not} execute any previously created plans after calling this function. @c ------------------------------------------------------------ @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI @section 2d MPI example Before we document the FFTW MPI interface in detail, we begin with a simple example outlining how one would perform a two-dimensional @code{N0} by @code{N1} complex DFT. 
@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t N0 = ..., N1 = ...;
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);

    /* @r{create plan for in-place forward DFT} */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
       data[i*N1 + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example
As can be seen above, the MPI interface follows the same basic style of allocate/plan/execute/destroy as the serial FFTW routines. All of the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead of @samp{fftw_}. There are a few important differences, however: First, we must call @code{fftw_mpi_init()} after calling @code{MPI_Init} (required in all MPI programs) and before calling any other @samp{fftw_mpi_} routine. @findex MPI_Init @findex fftw_mpi_init Second, when we create the plan with @code{fftw_mpi_plan_dft_2d}, analogous to @code{fftw_plan_dft_2d}, we pass an additional argument: the communicator, indicating which processes will participate in the transform (here @code{MPI_COMM_WORLD}, indicating all processes). Whenever you create, execute, or destroy a plan for an MPI transform, you must call the corresponding FFTW routine on @emph{all} processes in the communicator for that transform. (That is, these are @emph{collective} calls.) Note that the plan for the MPI transform uses the standard @code{fftw_execute} and @code{fftw_destroy_plan} routines (on the other hand, there are MPI-specific new-array execute functions documented below). @cindex collective function @findex fftw_mpi_plan_dft_2d @ctindex MPI_COMM_WORLD Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a standard C integer type which is (at least) 32 bits wide on a 32-bit machine and 64 bits wide on a 64-bit machine. This is to make it easy to specify very large parallel transforms on a 64-bit machine. (You can specify 64-bit transform sizes in the serial FFTW, too, but only by using the @samp{guru64} planner interface. @xref{64-bit Guru Interface}.) @tindex ptrdiff_t @cindex 64-bit architecture Fourth, and most importantly, you don't allocate the entire two-dimensional array on each process. Instead, you call @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the array resides on each processor, and how much space to allocate. Here, the portion of the array on each process is a @code{local_n0} by @code{N1} slice of the total array, starting at index @code{local_0_start}. The total number of @code{fftw_complex} numbers to allocate is given by the @code{alloc_local} return value, which @emph{may} be greater than @code{local_n0 * N1} (in case some intermediate calculations require additional storage). The data distribution in FFTW's MPI interface is described in more detail by the next section.
@findex fftw_mpi_local_size_2d @cindex data distribution Given the portion of the array that resides on the local process, it is straightforward to initialize the data (here to a function @code{my_function}) and otherwise manipulate it. Of course, at the end of your program you may want to output the data somehow, but synchronizing this output is up to you and is beyond the scope of this manual. (One good way to output a large multi-dimensional distributed array in MPI to a portable binary file is to use the free HDF5 library; see the @uref{http://www.hdfgroup.org/, HDF home page}.) @cindex HDF5 @cindex MPI I/O @c ------------------------------------------------------------ @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI @section MPI Data Distribution @cindex data distribution The most important concept to understand in using FFTW's MPI interface is the data distribution. With a serial or multithreaded FFT, all of the inputs and outputs are stored as a single contiguous chunk of memory. With a distributed-memory FFT, the inputs and outputs are broken into disjoint blocks, one per process. In particular, FFTW uses a @emph{1d block distribution} of the data, distributed along the @emph{first dimension}. For example, if you want to perform a @twodims{100,200} complex DFT, distributed over 4 processes, each process will get a @twodims{25,200} slice of the data. That is, process 0 will get rows 0 through 24, process 1 will get rows 25 through 49, process 2 will get rows 50 through 74, and process 3 will get rows 75 through 99. If you take the same array but distribute it over 3 processes, then it is not evenly divisible so the different processes will have unequal chunks. FFTW's default choice in this case is to assign 34 rows to processes 0 and 1, and 32 rows to process 2. @cindex block distribution FFTW provides several @samp{fftw_mpi_local_size} routines that you can call to find out what portion of an array is stored on the current process. In most cases, you should use the default block sizes picked by FFTW, but it is also possible to specify your own block size. For example, with a @twodims{100,200} array on three processes, you can tell FFTW to use a block size of 40, which would assign 40 rows to processes 0 and 1, and 20 rows to process 2. FFTW's default is to divide the data equally among the processes if possible, and as best it can otherwise. The rows are always assigned in ``rank order,'' i.e. process 0 gets the first block of rows, then process 1, and so on. (You can change this by using @code{MPI_Comm_split} to create a new communicator with re-ordered processes.) However, you should always call the @samp{fftw_mpi_local_size} routines, if possible, rather than trying to predict FFTW's distribution choices. In particular, it is critical that you allocate the storage size that is returned by @samp{fftw_mpi_local_size}, which is @emph{not} necessarily the size of the local slice of the array. The reason is that intermediate steps of FFTW's algorithms involve transposing the array and redistributing the data, so at these intermediate steps FFTW may require more local storage space (albeit always proportional to the total size divided by the number of processes). The @samp{fftw_mpi_local_size} functions know how much storage is required for these intermediate steps and tell you the correct amount to allocate.
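To make the block distribution concrete, here is a small test program (a sketch for illustration, not part of the FFTW distribution) that simply asks FFTW for the local portion of the @twodims{100,200} array from the example above and prints it from every process:

@example
#include <fftw3-mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t n0 = 100, n1 = 200;
    ptrdiff_t alloc_local, local_n0, local_0_start;
    int rank;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{ask FFTW which rows of the n0 by n1 array this process owns} */
    alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("process %d: rows %d to %d, allocate %d complex numbers\n",
           rank, (int) local_0_start,
           (int) (local_0_start + local_n0 - 1), (int) alloc_local);

    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
@}
@end example

Run on 4 processes, each rank should report 25 rows; on 3 processes, ranks 0 and 1 should report 34 rows and rank 2 should report 32, matching the distribution described above.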
@menu * Basic and advanced distribution interfaces:: * Load balancing:: * Transposed distributions:: * One-dimensional distributions:: @end menu @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution @subsection Basic and advanced distribution interfaces As with the planner interface, the @samp{fftw_mpi_local_size} distribution interface is broken into basic and advanced (@samp{_many}) interfaces, where the latter allows you to specify the block size manually and also to request block sizes when computing multiple transforms simultaneously. These functions are documented more exhaustively by the FFTW MPI Reference, but we summarize the basic ideas here using a couple of two-dimensional examples. For the @twodims{100,200} complex-DFT example, above, we would find the distribution by calling the following function in the basic interface:
@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_2d Given the total size of the data to be transformed (here, @code{n0 = 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this function provides three numbers. First, it describes the shape of the local data: the current process should store a @code{local_n0} by @code{n1} slice of the overall dataset, in row-major order (@code{n1} dimension contiguous), starting at index @code{local_0_start}. That is, if the total dataset is viewed as a @code{n0} by @code{n1} matrix, the current process should store the rows @code{local_0_start} to @code{local_0_start+local_n0-1}. Obviously, if you are running with only a single MPI process, that process will store the entire array: @code{local_0_start} will be zero and @code{local_n0} will be @code{n0}. @xref{Row-major Format}. @cindex row-major Second, the return value is the total number of data elements (e.g., complex numbers for a complex DFT) that should be allocated for the input and output arrays on the current process (ideally with @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal alignment). It might seem that this should always be equal to @code{local_n0 * n1}, but this is @emph{not} the case. FFTW's distributed FFT algorithms require data redistributions at intermediate stages of the transform, and in some circumstances this may require slightly larger local storage. This is discussed in more detail below, under @ref{Load balancing}. @findex fftw_malloc @findex fftw_alloc_complex @cindex advanced interface The advanced-interface @samp{local_size} function for multidimensional transforms returns the same three things (@code{local_n0}, @code{local_0_start}, and the total number of elements to allocate), but takes more inputs:
@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                   ptrdiff_t howmany,
                                   ptrdiff_t block0,
                                   MPI_Comm comm,
                                   ptrdiff_t *local_n0,
                                   ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_many The two-dimensional case above corresponds to @code{rnk = 2} and an array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}. This routine is for any @code{rnk > 1}; one-dimensional transforms have their own interface because they work slightly differently, as discussed below. First, the advanced interface allows you to perform multiple transforms at once, of interleaved data, as specified by the @code{howmany} parameter. (@code{howmany} is 1 for a single transform.)
Second, here you can specify your desired block size in the @code{n0} dimension, @code{block0}. To use FFTW's default block size, pass @code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on @code{P} processes, FFTW will return @code{local_n0} equal to @code{block0} on the first @code{n0 / block0} processes (rounded down), return @code{local_n0} equal to @code{n0 - block0 * (n0 / block0)} on the next process, and @code{local_n0} equal to zero on any remaining processes. In general, we recommend using the default block size (which corresponds to @code{n0 / P}, rounded up). @ctindex FFTW_MPI_DEFAULT_BLOCK @cindex block distribution For example, suppose you have @code{P = 4} processes and @code{n0 = 21}. The default will be a block size of @code{6}, which will give @code{local_n0 = 6} on the first three processes and @code{local_n0 = 3} on the last process. Instead, however, you could specify @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5} on processes 0 to 2, @code{local_n0 = 6} on process 3. (This choice, while it may look superficially more ``balanced,'' has the same critical path as FFTW's default but requires more communications.) @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution @subsection Load balancing @cindex load balancing Ideally, when you parallelize a transform over some @math{P} processes, each process should end up with work that takes equal time. Otherwise, all of the processes end up waiting on whichever process is slowest. This goal is known as ``load balancing.'' In this section, we describe the circumstances under which FFTW is able to load-balance well, and in particular how you should choose your transform size in order to load balance. Load balancing is especially difficult when you are parallelizing over heterogeneous machines; for example, if one of your processors is an old 486 and another is a Pentium IV, obviously you should give the Pentium more work to do than the 486 since the latter is much slower. FFTW does not deal with this problem, however---it assumes that your processes run on hardware of comparable speed, and that the goal is therefore to divide the problem as equally as possible. For a multi-dimensional complex DFT, FFTW can divide the problem equally among the processes if: (i) the @emph{first} dimension @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of the subsequent dimensions is divisible by @math{P}. (For the advanced interface, where you can specify multiple simultaneous transforms via some ``vector'' length @code{howmany}, a factor of @code{howmany} is included in the product of the subsequent dimensions.) For a one-dimensional complex DFT, the length @code{N} of the data should be divisible by @math{P} @emph{squared} to be able to divide the problem equally among the processes. @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution @subsection Transposed distributions Internally, FFTW's MPI transform algorithms work by first computing transforms of the data local to each process, then by globally @emph{transposing} the data in some fashion to redistribute the data among the processes, transforming the new data local to each process, and transposing back.
For example, a two-dimensional @code{n0} by @code{n1} array, distributed across the @code{n0} dimension, is transformed by: (i) transforming the @code{n1} dimension, which is local to each process; (ii) transposing to an @code{n1} by @code{n0} array, distributed across the @code{n1} dimension; (iii) transforming the @code{n0} dimension, which is now local to each process; (iv) transposing back. @cindex transpose However, in many applications it is acceptable to compute a multidimensional DFT whose results are produced in transposed order (e.g., @code{n1} by @code{n0} in two dimensions). This provides a significant performance advantage, because it means that the final transposition step can be omitted. FFTW supports this optimization, which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT} to the planner routines. To compute the inverse transform of transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell it that the input is transposed. In this section, we explain how to interpret the output format of such a transform. @ctindex FFTW_MPI_TRANSPOSED_OUT @ctindex FFTW_MPI_TRANSPOSED_IN Suppose you are transforming multi-dimensional data with (at least two) dimensions @ndims{}. As always, it is distributed along the first dimension @dimk{0}. Now, if we compute its DFT with the @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored with the first @emph{two} dimensions transposed: @ndimstrans{}, distributed along the @dimk{1} dimension. Conversely, if we take the @ndimstrans{} data and transform it with the @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the original @ndims{} array. There are two ways to find the portion of the transposed array that resides on the current process. First, you can simply call the appropriate @samp{local_size} function, passing @ndimstrans{} (the transposed dimensions). This would mean calling the @samp{local_size} function twice, once for the transposed and once for the non-transposed dimensions. Alternatively, you can call one of the @samp{local_size_transposed} functions, which returns both the non-transposed and transposed data distribution from a single call. For example, for a 3d transform with transposed output (or input), you might call:
@example
ptrdiff_t fftw_mpi_local_size_3d_transposed(
                ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_3d_transposed Here, @code{local_n0} and @code{local_0_start} give the size and starting index of the @code{n0} dimension for the @emph{non}-transposed data, as in the previous sections. For @emph{transposed} data (e.g. the output for @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and @code{local_1_start} give the size and starting index of the @code{n1} dimension, which is the first dimension of the transposed data (@code{n1} by @code{n0} by @code{n2}). (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two dimensions to the planner in reverse order, or vice versa. If you pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the first two dimensions passed to the planner and passing @emph{neither} flag.)
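For example, a sketch (an illustration along the lines of the earlier 2d example, not code from the FFTW distribution) of planning an in-place 3d complex DFT with transposed output might look like:

@example
const ptrdiff_t n0 = ..., n1 = ..., n2 = ...;
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan plan;

/* @r{distribution of the n0 x n1 x n2 input and its n1 x n0 x n2 transpose} */
alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

/* @r{in-place forward DFT, leaving the output in n1 x n0 x n2 order} */
plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MPI_TRANSPOSED_OUT | FFTW_MEASURE);

fftw_execute(plan);
/* @r{output on this process: the local_n1 x n0 x n2 slice starting at row local_1_start} */
@end example

After execution, the local portion of the transposed output is described by @code{local_n1} and @code{local_1_start}, exactly as explained above.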
@node One-dimensional distributions, , Transposed distributions, MPI Data Distribution @subsection One-dimensional distributions For one-dimensional distributed DFTs using FFTW, matters are slightly more complicated because the data distribution is more closely tied to how the algorithm works. In particular, you can no longer pass an arbitrary block size and must accept FFTW's default; also, the block sizes may be different for input and output. Also, the data distribution depends on the flags and transform direction, in order for forward and backward transforms to work correctly. @example ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, ptrdiff_t *local_ni, ptrdiff_t *local_i_start, ptrdiff_t *local_no, ptrdiff_t *local_o_start); @end example @findex fftw_mpi_local_size_1d This function computes the data distribution for a 1d transform of size @code{n0} with the given transform @code{sign} and @code{flags}. Both input and output data use block distributions. The input on the current process will consist of @code{local_ni} numbers starting at index @code{local_i_start}; e.g. if only a single process is used, then @code{local_ni} will be @code{n0} and @code{local_i_start} will be @code{0}. Similarly for the output, with @code{local_no} numbers starting at index @code{local_o_start}. The return value of @code{fftw_mpi_local_size_1d} will be the total number of elements to allocate on the current process (which might be slightly larger than the local size due to intermediate steps in the algorithm). As mentioned above (@pxref{Load balancing}), the data will be divided equally among the processes if @code{n0} is divisible by the @emph{square} of the number of processes. In this case, @code{local_ni} will equal @code{local_no}. Otherwise, they may be different. For some applications, such as convolutions, the order of the output data is irrelevant. In this case, performance can be improved by specifying that the output data be stored in an FFTW-defined ``scrambled'' format. (In particular, this is the analogue of transposed output in the multidimensional case: scrambled output saves a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in the flags, then the output is stored in this (undocumented) scrambled order. Conversely, to perform the inverse transform of data in scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag. @ctindex FFTW_MPI_SCRAMBLED_OUT @ctindex FFTW_MPI_SCRAMBLED_IN In MPI FFTW, only composite sizes @code{n0} can be parallelized; we have not yet implemented a parallel algorithm for large prime sizes. @c ------------------------------------------------------------ @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI @section Multi-dimensional MPI DFTs of Real Data FFTW's MPI interface also supports multi-dimensional DFTs of real data, similar to the serial r2c and c2r interfaces. (Parallel one-dimensional real-data DFTs are not currently supported; you must use a complex transform and set the imaginary parts of the inputs to zero.) The key points to understand for r2c and c2r MPI transforms (compared to the MPI complex DFTs or the serial r2c/c2r transforms), are: @itemize @bullet @item Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real data to/from @ndimshalf{} complex data: the last dimension of the complex data is cut in half (rounded down), plus one. 
As for the serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data. @item @cindex padding Although the real data is @emph{conceptually} @ndims{}, it is @emph{physically} stored as an @ndimspad{} array, where the last dimension has been @emph{padded} to make it the same size as the complex output. This is much like the in-place serial r2c/c2r interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that in MPI the padding is required even for out-of-place data. The extra padding numbers are ignored by FFTW (they are @emph{not} like zero-padding the transform to a larger size); they are only used to determine the data layout. @item @cindex data distribution The data distribution in MPI for @emph{both} the real and complex data is determined by the shape of the @emph{complex} data. That is, you call the appropriate @samp{local_size} function for the @ndimshalf{} complex data, and then use the @emph{same} distribution for the real data except that the last complex dimension is replaced by a (padded) real dimension of twice the length. @end itemize For example, suppose we are performing an out-of-place r2c transform of @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}], resulting in @threedims{L,M,N/2+1} complex data. Similar to the example in @ref{2d MPI example}, we might do something like:
@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    const ptrdiff_t L = ..., M = ..., N = ...;
    fftw_plan plan;
    double *rin;
    fftw_complex *cout;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    rin = fftw_alloc_real(2 * alloc_local);
    cout = fftw_alloc_complex(alloc_local);

    /* @r{create plan for out-of-place r2c DFT} */
    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                    FFTW_MEASURE);

    /* @r{initialize rin to some function} my_func(x,y,z) */
    for (i = 0; i < local_n0; ++i)
       for (j = 0; j < M; ++j)
         for (k = 0; k < N; ++k)
           rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);

    /* @r{compute transforms as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example
@findex fftw_alloc_real @cindex row-major Note that we allocated @code{rin} using @code{fftw_alloc_real} with an argument of @code{2 * alloc_local}: since @code{alloc_local} is the number of @emph{complex} values to allocate, the number of @emph{real} values is twice as many. The @code{rin} array is then @threedims{local_n0,M,2(N/2+1)} in row-major order, so its @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) + k} (@pxref{Multi-dimensional Array Format}). @cindex transpose @ctindex FFTW_MPI_TRANSPOSED_OUT @ctindex FFTW_MPI_TRANSPOSED_IN As for the complex transforms, improved performance can be obtained by specifying that the output is the transpose of the input or vice versa (@pxref{Transposed distributions}). In our @threedims{L,M,N} r2c example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that the input would be a padded @threedims{L,M,2(N/2+1)} real array distributed over the @code{L} dimension, while the output would be an @threedims{M,L,N/2+1} complex array distributed over the @code{M} dimension. To perform the inverse c2r transform with the same data distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
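As a sketch (reusing the @code{L}, @code{M}, @code{N} sizes from the example above; this is an illustration, not code from the manual), a matched pair of transposed-output r2c and transposed-input c2r plans could be created as follows:

@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *rin;
fftw_complex *cout;
fftw_plan fplan, bplan;

/* @r{distribution of the L x M x (N/2+1) complex data and its M x L x (N/2+1) transpose} */
alloc_local = fftw_mpi_local_size_3d_transposed(L, M, N/2+1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);

/* @r{forward r2c, producing M x L x (N/2+1) transposed complex output} */
fplan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                 FFTW_MPI_TRANSPOSED_OUT | FFTW_MEASURE);

/* @r{inverse c2r, consuming the transposed complex data (results scaled by L*M*N)} */
bplan = fftw_mpi_plan_dft_c2r_3d(L, M, N, cout, rin, MPI_COMM_WORLD,
                                 FFTW_MPI_TRANSPOSED_IN | FFTW_MEASURE);
@end example

Since planning with @code{FFTW_MEASURE} overwrites the arrays, plans like these should be created before the data is initialized.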
@c ------------------------------------------------------------ @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI @section Other Multi-dimensional Real-data MPI Transforms @cindex r2r FFTW's MPI interface also supports multi-dimensional @samp{r2r} transforms of all kinds supported by the serial interface (e.g. discrete cosine and sine transforms, discrete Hartley transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not one-dimensional transforms, are currently parallelized. @tindex fftw_r2r_kind These are used much like the multidimensional complex DFTs discussed above, except that the data is real rather than complex, and one needs to pass an r2r transform kind (@code{fftw_r2r_kind}) for each dimension as in the serial FFTW (@pxref{More DFTs of Real Data}). For example, one might perform a two-dimensional @twodims{L,M} transform that is an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in the second dimension with code like:
@example
    const ptrdiff_t L = ..., M = ...;
    fftw_plan plan;
    double *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_real(alloc_local);

    /* @r{create plan for in-place REDFT10 x RODFT10} */
    plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
                                FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
       data[i*M + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);
@end example
@findex fftw_alloc_real Notice that we use the same @samp{local_size} functions as we did for complex data, only now we interpret the sizes in terms of real rather than complex values, and correspondingly use @code{fftw_alloc_real}. @c ------------------------------------------------------------ @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI @section FFTW MPI Transposes @cindex transpose FFTW's MPI Fourier transforms rely on one or more @emph{global transposition} steps for their communications. For example, the multidimensional transforms work by transforming along some dimensions, then transposing to make the first dimension local and transforming that, then transposing back. Because global transposition of a block-distributed matrix has many other potential uses besides FFTs, FFTW's transpose routines can be called directly, as documented in this section. @menu * Basic distributed-transpose interface:: * Advanced distributed-transpose interface:: * An improved replacement for MPI_Alltoall:: @end menu @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes @subsection Basic distributed-transpose interface In particular, suppose that we have an @code{n0} by @code{n1} array in row-major order, block-distributed across the @code{n0} dimension.
To transpose this into an @code{n1} by @code{n0} array block-distributed across the @code{n1} dimension, we would create a plan by calling the following function: @example fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, double *in, double *out, MPI_Comm comm, unsigned flags); @end example @findex fftw_mpi_plan_transpose The input and output arrays (@code{in} and @code{out}) can be the same. The transpose is actually executed by calling @code{fftw_execute} on the plan, as usual. @findex fftw_execute The @code{flags} are the usual FFTW planner flags, but support two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or @code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for transpose plans, is that the output and/or input, respectively, are @emph{locally} transposed. That is, on each process input data is normally stored as a @code{local_n0} by @code{n1} array in row-major order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is stored as @code{n1} by @code{local_n0} in row-major order. Similarly, @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by @code{local_n1} instead of @code{local_n1} by @code{n0}. @ctindex FFTW_MPI_TRANSPOSED_OUT @ctindex FFTW_MPI_TRANSPOSED_IN To determine the local size of the array on each process before and after the transpose, as well as the amount of storage that must be allocated, one should call @code{fftw_mpi_local_size_2d_transposed}, just as for a 2d DFT as described in the previous section: @cindex data distribution @example ptrdiff_t fftw_mpi_local_size_2d_transposed (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start, ptrdiff_t *local_n1, ptrdiff_t *local_1_start); @end example @findex fftw_mpi_local_size_2d_transposed Again, the return value is the local storage to allocate, which in this case is the number of @emph{real} (@code{double}) values rather than complex numbers as in the previous examples. @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes @subsection Advanced distributed-transpose interface The above routines are for a transpose of a matrix of numbers (of type @code{double}), using FFTW's default block sizes. More generally, one can perform transposes of @emph{tuples} of numbers, with user-specified block sizes for the input and output: @example fftw_plan fftw_mpi_plan_many_transpose (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, ptrdiff_t block0, ptrdiff_t block1, double *in, double *out, MPI_Comm comm, unsigned flags); @end example @findex fftw_mpi_plan_many_transpose In this case, one is transposing an @code{n0} by @code{n1} matrix of @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers). The input is distributed along the @code{n0} dimension with block size @code{block0}, and the @code{n1} by @code{n0} output is distributed along the @code{n1} dimension with block size @code{block1}. If @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW uses its default block size. To get the local size of the data on each process, you should then call @code{fftw_mpi_local_size_many_transposed}. 
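As a sketch of how this advanced interface might be used (assumptions: an @code{n0} by @code{n1} matrix of complex numbers treated as @code{howmany = 2} doubles per element, FFTW's default block sizes, and the @code{fftw_mpi_local_size_many_transposed} call documented in the reference below), one could write:

@example
const ptrdiff_t nsize[2] = @{n0, n1@};
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan tplan;

/* @r{local storage (in doubles) before and after the transpose} */
alloc_local = fftw_mpi_local_size_many_transposed(2, nsize, 2,
                     FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                     MPI_COMM_WORLD, &local_n0, &local_0_start,
                     &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* @r{in-place transpose of 2-tuples (re,im) with default block sizes} */
tplan = fftw_mpi_plan_many_transpose(n0, n1, 2,
                     FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                     data, data, MPI_COMM_WORLD, FFTW_MEASURE);
fftw_execute(tplan);
fftw_destroy_plan(tplan);
@end example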
@ctindex FFTW_MPI_DEFAULT_BLOCK @findex fftw_mpi_local_size_many_transposed @node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes @subsection An improved replacement for MPI_Alltoall We close this section by noting that FFTW's MPI transpose routines can be thought of as a generalization of the @code{MPI_Alltoall} function (albeit only for floating-point types), and in some circumstances can function as an improved replacement. @findex MPI_Alltoall @code{MPI_Alltoall} is defined by the MPI standard as:
@example
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                 MPI_Comm comm);
@end example
In particular, for @code{double*} arrays @code{in} and @code{out}, consider the call:
@example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
@end example
This is completely equivalent to:
@example
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example
That is, computing a @twodims{P,P} transpose on @code{P} processes, with a block size of 1, is just a standard all-to-all communication. However, using the FFTW routine instead of @code{MPI_Alltoall} may have certain advantages. First of all, FFTW's routine can operate in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only operate out-of-place. @cindex in-place Second, even for out-of-place plans, FFTW's routine may be faster, especially if you need to perform the all-to-all communication many times and can afford to use @code{FFTW_MEASURE} or @code{FFTW_PATIENT}. It should certainly be no slower, not including the time to create the plan, since one of the possible algorithms that FFTW uses for an out-of-place transpose @emph{is} simply to call @code{MPI_Alltoall}. However, FFTW also considers several other possible algorithms that, depending on your MPI implementation and your hardware, may be faster. @ctindex FFTW_MEASURE @ctindex FFTW_PATIENT @c ------------------------------------------------------------ @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI @section FFTW MPI Wisdom @cindex wisdom @cindex saving plans to disk FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can be used to save MPI plans as well as to save uniprocessor plans. However, for MPI there are several unavoidable complications. @cindex MPI I/O First, the MPI standard does not guarantee that every process can perform file I/O (at least, not using C stdio routines)---in general, we may only assume that process 0 is capable of I/O.@footnote{In fact, even this assumption is not technically guaranteed by the standard, although it seems to be universal in actual MPI implementations and is widely assumed by MPI-using software. Technically, you need to query the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with @code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can perform I/O. Otherwise, it is the rank of a process that can perform I/O ... but since it is not guaranteed to yield the @emph{same} rank on all processes, you have to do an @code{MPI_Allreduce} of some kind if you want all processes to agree about which is going to do I/O. And even then, the standard only guarantees that this process can perform output, but not input. See e.g.
@cite{Parallel Programming with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our experience virtually no MPI programmers worry about this.} So, if we want to export the wisdom from a single process to a file, we must first export the wisdom to a string, then send it to process 0, then write it to a file. Second, in principle we may want to have separate wisdom for every process, since in general the processes may run on different hardware even for a single MPI program. However, in practice FFTW's MPI code is designed for the case of homogeneous hardware (@pxref{Load balancing}), and in this case it is convenient to use the same wisdom for every process. Thus, we need a mechanism to synchronize the wisdom. To address both of these problems, FFTW provides the following two functions: @example void fftw_mpi_broadcast_wisdom(MPI_Comm comm); void fftw_mpi_gather_wisdom(MPI_Comm comm); @end example @findex fftw_mpi_gather_wisdom @findex fftw_mpi_broadcast_wisdom Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom} will broadcast the wisdom from process 0 to all other processes. Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all processes onto process 0. (If the plans created for the same problem by different processes are not the same, @code{fftw_mpi_gather_wisdom} will arbitrarily choose one of the plans.) Both of these functions may result in suboptimal plans for different processes if the processes are running on non-identical hardware. Both of these functions are @emph{collective} calls, which means that they must be executed by all processes in the communicator. @cindex collective function So, for example, a typical code snippet to import wisdom from a file and use it on all processes would be: @example @{ int rank; fftw_mpi_init(); MPI_Comm_rank(MPI_COMM_WORLD, &rank); if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); @} @end example (Note that we must call @code{fftw_mpi_init} before importing any wisdom that might contain MPI plans.) Similarly, a typical code snippet to export wisdom from all processes to a file is: @findex fftw_mpi_init @example @{ int rank; fftw_mpi_gather_wisdom(MPI_COMM_WORLD); MPI_Comm_rank(MPI_COMM_WORLD, &rank); if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); @} @end example @c ------------------------------------------------------------ @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI @section Avoiding MPI Deadlocks @cindex deadlock An MPI program can @emph{deadlock} if one process is waiting for a message from another process that never gets sent. To avoid deadlocks when using FFTW's MPI routines, it is important to know which functions are @emph{collective}: that is, which functions must @emph{always} be called in the @emph{same order} from @emph{every} process in a given communicator. (For example, @code{MPI_Barrier} is the canonical example of a collective function in the MPI standard.) @cindex collective function @findex MPI_Barrier The functions in FFTW that are @emph{always} collective are: every function beginning with @samp{fftw_mpi_plan}, as well as @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}. Also, the following functions from the ordinary FFTW interface are collective when they are applied to a plan created by an @samp{fftw_mpi_plan} function: @code{fftw_execute}, @code{fftw_destroy_plan}, and @code{fftw_flops}. 
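As a minimal illustration (a hypothetical snippet, not taken from the manual), the following pattern hangs because only one process makes a collective call, while the corrected version executes the plan on every process in its communicator:

@example
/* @r{WRONG: fftw_execute on an MPI plan is collective, so this deadlocks} */
if (rank == 0)
     fftw_execute(plan);

/* @r{RIGHT: every process in the plan's communicator executes the plan} */
fftw_execute(plan);
@end example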
@findex fftw_execute @findex fftw_destroy_plan @findex fftw_flops @c ------------------------------------------------------------ @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI @section FFTW MPI Performance Tips In this section, we collect a few tips on getting the best performance out of FFTW's MPI transforms. First, because of the 1d block distribution, FFTW's parallelization is currently limited by the size of the first dimension. (Multidimensional block distributions may be supported by a future version.) More generally, you should ideally arrange the dimensions so that FFTW can divide them equally among the processes. @xref{Load balancing}. @cindex block distribution @cindex load balancing Second, if it is not too inconvenient, you should consider working with transposed output for multidimensional plans, as this saves a considerable amount of communications. @xref{Transposed distributions}. @cindex transpose Third, the fastest choices are generally either an in-place transform or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag (which allows the input array to be used as scratch space). In-place is especially beneficial if the amount of data per process is large. @ctindex FFTW_DESTROY_INPUT Fourth, if you have multiple arrays to transform at once, rather than calling FFTW's MPI transforms several times it usually seems to be faster to interleave the data and use the advanced interface. (This groups the communications together instead of requiring separate messages for each transform.) @c ------------------------------------------------------------ @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI @section Combining MPI and Threads @cindex threads In certain cases, it may be advantageous to combine MPI (distributed-memory) and threads (shared-memory) parallelization. FFTW supports this, with certain caveats. For example, if you have a cluster of 4-processor shared-memory nodes, you may want to use threads within the nodes and MPI between the nodes, instead of MPI for all parallelization. In particular, it is possible to seamlessly combine the MPI FFTW routines with the multi-threaded FFTW routines (@pxref{Multi-threaded FFTW}). However, some care must be taken in the initialization code, which should look something like this:
@example
int threads_ok;

int main(int argc, char **argv)
@{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    threads_ok = provided >= MPI_THREAD_FUNNELED;

    if (threads_ok) threads_ok = fftw_init_threads();
    fftw_mpi_init();

    ...
    if (threads_ok) fftw_plan_with_nthreads(...);
    ...

    MPI_Finalize();
@}
@end example
@findex fftw_mpi_init @findex fftw_init_threads @findex fftw_plan_with_nthreads First, note that instead of calling @code{MPI_Init}, you should call @code{MPI_Init_thread}, which is the initialization routine defined by the MPI-2 standard to indicate to MPI that your program will be multithreaded. We pass @code{MPI_THREAD_FUNNELED}, which indicates that we will only call MPI routines from the main thread. (FFTW will launch additional threads internally, but the extra threads will not call MPI code.) (You may also pass @code{MPI_THREAD_SERIALIZED} or @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading support from the MPI implementation, but this is not required by FFTW.)
The @code{provided} parameter returns what level of threads support is actually supported by your MPI implementation; this @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call the FFTW threads routines, so we define a global variable @code{threads_ok} to record this. You should only call @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if @code{threads_ok} is true. For more information on thread safety in MPI, see the @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and Threads} section of the MPI-2 standard. @cindex thread safety Second, we must call @code{fftw_init_threads} @emph{before} @code{fftw_mpi_init}. This is critical for technical reasons having to do with how FFTW initializes its list of algorithms. Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI process will launch (up to) @code{N} threads to parallelize its transforms. For example, in the hypothetical cluster of 4-processor nodes, you might wish to launch only a single MPI process per node, and then call @code{fftw_plan_with_nthreads(4)} on each process to use all processors in the nodes. This may or may not be faster than simply using as many MPI processes as you have processors, however. On the one hand, using threads within a node eliminates the need for explicit message passing within the node. On the other hand, FFTW's transpose routines are not multi-threaded, and this means that the communications that do take place will not benefit from parallelization within the node. Moreover, many MPI implementations already have optimizations to exploit shared memory when it is available, so adding the multithreaded FFTW on top of this may be superfluous. @cindex transpose @c ------------------------------------------------------------ @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI @section FFTW MPI Reference This chapter provides a complete reference to all FFTW MPI functions, datatypes, and constants. See also @ref{FFTW Reference} for information on functions and types in common with the serial interface. @menu * MPI Files and Data Types:: * MPI Initialization:: * Using MPI Plans:: * MPI Data Distribution Functions:: * MPI Plan Creation:: * MPI Wisdom Communication:: @end menu @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference @subsection MPI Files and Data Types All programs using FFTW's MPI support should include its header file:
@example
#include <fftw3-mpi.h>
@end example
Note that this header file includes the serial-FFTW @code{fftw3.h} header file, and also the @code{mpi.h} header file for MPI, so you need not include those files separately. You must also link to @emph{both} the FFTW MPI library and to the serial FFTW library. On Unix, this means adding @code{-lfftw3_mpi -lfftw3 -lm} at the end of the link command. @cindex precision Different precisions are handled as in the serial interface: @xref{Precision}. That is, @samp{fftw_} functions become @code{fftwf_} (in single precision) etcetera, and the libraries become @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix. Long-double precision is supported in MPI, but quad precision (@samp{fftwq_}) is not due to the lack of MPI support for this type.
@node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference @subsection MPI Initialization Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and before importing any wisdom for MPI problems, you must call: @findex fftw_mpi_init
@example
void fftw_mpi_init(void);
@end example
@findex fftw_init_threads If FFTW threads support is used, however, @code{fftw_mpi_init} should be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI and Threads}). Calling @code{fftw_mpi_init} additional times (before @code{fftw_mpi_cleanup}) has no effect. If you want to deallocate all persistent data and reset FFTW to the pristine state it was in when you started your program, you can call: @findex fftw_mpi_cleanup
@example
void fftw_mpi_cleanup(void);
@end example
@findex fftw_cleanup (This calls @code{fftw_cleanup}, so you need not call the serial cleanup routine too, although it is safe to do so.) After calling @code{fftw_mpi_cleanup}, all existing plans become undefined, and you should not attempt to execute or destroy them. You must call @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want to resume using the MPI FFTW routines. @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference @subsection Using MPI Plans Once an MPI plan is created, you can execute and destroy it using @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions in the serial interface that operate on generic plans (@pxref{Using Plans}). @cindex collective function @cindex MPI communicator The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to MPI plans, are @emph{collective} calls: they must be called for all processes in the communicator that was used to create the plan. @cindex new-array execution You must @emph{not} use the serial new-array plan-execution functions @code{fftw_execute_dft} and so on (@pxref{New-array Execute Functions}) with MPI plans. Such functions are specialized to the problem type, and there are specific new-array execute functions for MPI plans: @findex fftw_mpi_execute_dft @findex fftw_mpi_execute_dft_r2c @findex fftw_mpi_execute_dft_c2r @findex fftw_mpi_execute_r2r
@example
void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
@end example
@cindex alignment @findex fftw_malloc These functions have the same restrictions as those of the serial new-array execute functions. They are @emph{always} safe to apply to the @emph{same} @code{in} and @code{out} arrays that were used to create the plan. They can only be applied to new arrays if those arrays have the same types, dimensions, in-placeness, and alignment as the original arrays, where the best way to ensure the same alignment is to use FFTW's @code{fftw_malloc} and related allocation functions for all arrays (@pxref{Memory Allocation}). Note that distributed transposes (@pxref{FFTW MPI Transposes}) use @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans from FFTW's perspective.
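For example (a sketch, assuming @code{plan} and @code{alloc_local} come from an earlier in-place complex-DFT plan on the same communicator), a second array with the same size, in-placeness, and alignment can be transformed with the existing plan like so:

@example
fftw_complex *data2 = fftw_alloc_complex(alloc_local);
/* @r{... initialize data2 ...} */
fftw_mpi_execute_dft(plan, data2, data2);  /* @r{collective: call on every process} */
@end example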
@node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference @subsection MPI Data Distribution Functions @cindex data distribution As described above (@pxref{MPI Data Distribution}), in order to allocate your arrays, @emph{before} creating a plan, you must first call one of the following routines to determine the required allocation size and the portion of the array locally stored on a given process. The @code{MPI_Comm} communicator passed here must be equivalent to the communicator used below for plan creation. The basic interface for multidimensional transforms consists of the functions: @findex fftw_mpi_local_size_2d @findex fftw_mpi_local_size_3d @findex fftw_mpi_local_size @findex fftw_mpi_local_size_2d_transposed @findex fftw_mpi_local_size_3d_transposed @findex fftw_mpi_local_size_transposed @example ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start); ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start); ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start); ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start, ptrdiff_t *local_n1, ptrdiff_t *local_1_start); ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start, ptrdiff_t *local_n1, ptrdiff_t *local_1_start); ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start, ptrdiff_t *local_n1, ptrdiff_t *local_1_start); @end example These functions return the number of elements to allocate (complex numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the @code{local_n0} and @code{local_0_start} return the portion (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the first dimension of an @ndims{} array that is stored on the local process. @xref{Basic and advanced distribution interfaces}. For @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants are useful in order to also return the local portion of the first dimension in the @ndimstrans{} transposed output. @xref{Transposed distributions}. The advanced interface for multidimensional transforms is: @cindex advanced interface @findex fftw_mpi_local_size_many @findex fftw_mpi_local_size_many_transposed @example ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, ptrdiff_t block0, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start); ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, ptrdiff_t *local_n0, ptrdiff_t *local_0_start, ptrdiff_t *local_n1, ptrdiff_t *local_1_start); @end example These differ from the basic interface in only two ways. First, they allow you to specify block sizes @code{block0} and @code{block1} (the latter for the transposed output); you can pass @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in the basic interface. Second, you can pass a @code{howmany} parameter, corresponding to the advanced planning interface below: this is for transforms of contiguous @code{howmany}-tuples of numbers (@code{howmany = 1} in the basic interface). 
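For instance, a sketch (with hypothetical sizes @code{n0} and @code{n1}) of allocating and planning @code{howmany = 2} interleaved complex transforms with the advanced interface, using the advanced planner described below, might look like:

@example
const ptrdiff_t nsize[2] = @{n0, n1@};
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;
fftw_plan plan;

/* @r{allocation size for 2 interleaved n0 x n1 transforms (howmany = 2)} */
alloc_local = fftw_mpi_local_size_many(2, nsize, 2, FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_many_dft(2, nsize, 2,
                              FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                              data, data, MPI_COMM_WORLD,
                              FFTW_FORWARD, FFTW_MEASURE);
@end example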
The corresponding basic and advanced routines for one-dimensional transforms (currently only complex DFTs) are:

@findex fftw_mpi_local_size_1d
@findex fftw_mpi_local_size_many_1d
@example
ptrdiff_t fftw_mpi_local_size_1d(
             ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t fftw_mpi_local_size_many_1d(
             ptrdiff_t n0, ptrdiff_t howmany,
             MPI_Comm comm, int sign, unsigned flags,
             ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
             ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example

@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
As above, the return value is the number of elements to allocate (complex numbers, for complex DFTs).  The @code{local_ni} and @code{local_i_start} arguments return the portion (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the 1d array that is stored on this process for the transform @emph{input}, and @code{local_no} and @code{local_o_start} are the corresponding quantities for the transform @emph{output}.  The @code{sign} (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must match the arguments passed when creating a plan.  Although the inputs and outputs have different data distributions in general, it is guaranteed that the @emph{output} data distribution of an @code{FFTW_FORWARD} plan will match the @emph{input} data distribution of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags.  @xref{One-dimensional distributions}.

@node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference
@subsection MPI Plan Creation

@subsubheading Complex-data MPI DFTs

Plans for complex-data DFTs (@pxref{2d MPI example}) are created by:

@findex fftw_mpi_plan_dft_1d
@findex fftw_mpi_plan_dft_2d
@findex fftw_mpi_plan_dft_3d
@findex fftw_mpi_plan_dft
@findex fftw_mpi_plan_many_dft
@example
fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               fftw_complex *in, fftw_complex *out,
                               MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
                            fftw_complex *in, fftw_complex *out,
                            MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany, ptrdiff_t block,
                                 ptrdiff_t tblock, fftw_complex *in,
                                 fftw_complex *out, MPI_Comm comm,
                                 int sign, unsigned flags);
@end example

@cindex MPI communicator
@cindex collective function
These are similar to their serial counterparts (@pxref{Complex DFTs}) in specifying the dimensions, sign, and flags of the transform.  The @code{comm} argument gives an MPI communicator that specifies the set of processes to participate in the transform; plan creation is a collective function that must be called for all processes in the communicator.  The @code{in} and @code{out} pointers refer only to a portion of the overall transform data (@pxref{MPI Data Distribution}) as specified by the @samp{local_size} functions in the previous section.  Unless @code{flags} contains @code{FFTW_ESTIMATE}, these arrays are overwritten during plan creation as for the serial interface.
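For the one-dimensional case in particular, here is a minimal sketch tying @code{fftw_mpi_local_size_1d} to @code{fftw_mpi_plan_dft_1d} (not from the FFTW distribution; the in-place layout, the function name, and the @code{FFTW_ESTIMATE} flag are arbitrary choices).  Note that the same @code{sign} and @code{flags} are passed to both calls:

@example
#include <mpi.h>
#include <fftw3-mpi.h>

void transform_1d(ptrdiff_t n0)   /* n0 should be composite */
@{
     ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
     fftw_complex *data;
     fftw_plan plan;

     alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE,
                                          &local_ni, &local_i_start,
                                          &local_no, &local_o_start);
     data = fftw_alloc_complex(alloc_local);

     plan = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD,
                                 FFTW_FORWARD, FFTW_ESTIMATE);

     /* ... set data[j] to input element local_i_start + j,
            for 0 <= j < local_ni ... */
     fftw_execute(plan);
     /* data[j] now holds output element local_o_start + j,
        for 0 <= j < local_no */

     fftw_destroy_plan(plan);
     fftw_free(data);
@}
@end example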
For multi-dimensional transforms, any dimensions @code{> 1} are supported; for one-dimensional transforms, only composite (non-prime) @code{n0} are currently supported (unlike the serial FFTW).  Requesting an unsupported transform size will yield a @code{NULL} plan.  (As in the serial interface, highly composite sizes generally yield the best performance.)

@cindex advanced interface
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex stride
The advanced-interface @code{fftw_mpi_plan_many_dft} additionally allows you to specify the block sizes for the first dimension (@code{block}) of the @ndims{} input data and the first dimension (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate steps of the transform, and for the output if @code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}).  These must be the same block sizes as were passed to the corresponding @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in the basic interface.  Also, the @code{howmany} parameter specifies that the transform is of contiguous @code{howmany}-tuples rather than individual complex numbers; this corresponds to the same parameter in the serial advanced interface (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and @code{dist = 1}.

@subsubheading MPI flags

The @code{flags} can be any of those for the serial FFTW (@pxref{Planner Flags}), and in addition may include one or more of the following MPI-specific flags, which improve performance at the cost of changing the output or input data formats.

@itemize @bullet

@item
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
@code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for 1d transforms only, these flags indicate that the output/input of the transform are in an undocumented ``scrambled'' order.  A forward @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization).  @xref{One-dimensional distributions}.

@item
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
@code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid for multidimensional (@code{rnk > 1}) transforms only, these flags specify that the output or input of an @ndims{} transform is transposed to @ndimstrans{}.  @xref{Transposed distributions}.
@end itemize

@subsubheading Real-data MPI DFTs

@cindex r2c
Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional MPI DFTs of Real Data}) are created by:

@findex fftw_mpi_plan_dft_r2c_2d
@findex fftw_mpi_plan_dft_r2c_3d
@findex fftw_mpi_plan_dft_r2c
@findex fftw_mpi_plan_dft_c2r_2d
@findex fftw_mpi_plan_dft_c2r_3d
@findex fftw_mpi_plan_dft_c2r
@example
fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   double *in, fftw_complex *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
                                double *in, fftw_complex *out,
                                MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                                   fftw_complex *in, double *out,
                                   MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
                                fftw_complex *in, double *out,
                                MPI_Comm comm, unsigned flags);
@end example

Similar to the serial interface (@pxref{Real-data DFTs}), these transform logically @ndims{} real data to/from @ndimshalf{} complex data, representing the non-redundant half of the conjugate-symmetry output of a real-input DFT (@pxref{Multi-dimensional Transforms}).  However, the real array must be stored within a padded @ndimspad{} array (much like the in-place serial r2c transforms, but here for out-of-place transforms as well).  Currently, only multi-dimensional (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan for @code{rnk = 1} will yield @code{NULL}).  As explained above (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data distribution of both the real and complex arrays is given by the @samp{local_size} function called for the dimensions of the @emph{complex} array.  Similar to the other planning functions, the input and output arrays are overwritten when the plan is created except in @code{FFTW_ESTIMATE} mode.
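As a concrete illustration, here is a minimal sketch of an in-place 2d r2c transform (not from the FFTW distribution; the in-place layout, the function name, and the use of @code{FFTW_ESTIMATE} are arbitrary choices).  The @samp{local_size} call uses the @emph{complex} dimensions @code{N0} by @code{N1/2+1}, and the real data are addressed with the padded row length @code{2*(N1/2+1)}:

@example
#include <mpi.h>
#include <fftw3-mpi.h>

void transform_r2c(ptrdiff_t N0, ptrdiff_t N1)
@{
     ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
     fftw_complex *cdata;
     double *rdata;
     fftw_plan plan;

     /* local size of the complex N0 x (N1/2+1) array */
     alloc_local = fftw_mpi_local_size_2d(N0, N1/2 + 1, MPI_COMM_WORLD,
                                          &local_n0, &local_0_start);
     cdata = fftw_alloc_complex(alloc_local);
     rdata = (double *) cdata;   /* in-place: real data in the same buffer */

     plan = fftw_mpi_plan_dft_r2c_2d(N0, N1, rdata, cdata,
                                     MPI_COMM_WORLD, FFTW_ESTIMATE);

     for (i = 0; i < local_n0; ++i)        /* note padded row length */
          for (j = 0; j < N1; ++j)
               rdata[i * (2*(N1/2+1)) + j] = 0.0; /* f(local_0_start + i, j) */

     fftw_execute(plan);

     fftw_destroy_plan(plan);
     fftw_free(cdata);
@}
@end example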
As for the complex DFTs above, there is an advanced interface that allows you to manually specify block sizes and to transform contiguous @code{howmany}-tuples of real/complex numbers:

@findex fftw_mpi_plan_many_dft_r2c
@findex fftw_mpi_plan_many_dft_c2r
@example
fftw_plan fftw_mpi_plan_many_dft_r2c
              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
               ptrdiff_t iblock, ptrdiff_t oblock,
               double *in, fftw_complex *out,
               MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft_c2r
              (int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
               ptrdiff_t iblock, ptrdiff_t oblock,
               fftw_complex *in, double *out,
               MPI_Comm comm, unsigned flags);
@end example

@subsubheading MPI r2r transforms

@cindex r2r
There are corresponding plan-creation routines for r2r transforms (@pxref{More DFTs of Real Data}), currently supporting multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will yield a @code{NULL} plan):

@findex fftw_mpi_plan_r2r_2d
@findex fftw_mpi_plan_r2r_3d
@findex fftw_mpi_plan_r2r
@findex fftw_mpi_plan_many_r2r
@example
fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                               unsigned flags);
fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
                               double *in, double *out,
                               MPI_Comm comm,
                               fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                               fftw_r2r_kind kind2,
                               unsigned flags);
fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
                            double *in, double *out,
                            MPI_Comm comm, const fftw_r2r_kind *kind,
                            unsigned flags);
fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n,
                                 ptrdiff_t howmany,
                                 ptrdiff_t iblock, ptrdiff_t oblock,
                                 double *in, double *out,
                                 MPI_Comm comm, const fftw_r2r_kind *kind,
                                 unsigned flags);
@end example

The parameters are much the same as for the complex DFTs above, except that the arrays are of real numbers (and hence the outputs of the @samp{local_size} data-distribution functions should be interpreted as counts of real rather than complex numbers).  Also, the @code{kind} parameters specify the r2r kinds along each dimension as for the serial interface (@pxref{Real-to-Real Transform Kinds}).  @xref{Other Multi-dimensional Real-data MPI Transforms}.

@subsubheading MPI transposition

@cindex transpose
FFTW also provides routines to plan a transpose of a distributed @code{n0} by @code{n1} array of real numbers, or an array of @code{howmany}-tuples of real numbers with specified block sizes (@pxref{FFTW MPI Transposes}):

@findex fftw_mpi_plan_transpose
@findex fftw_mpi_plan_many_transpose
@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_transpose
              (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
               ptrdiff_t block0, ptrdiff_t block1,
               double *in, double *out, MPI_Comm comm, unsigned flags);
@end example

@cindex new-array execution
@findex fftw_mpi_execute_r2r
These plans are used with the @code{fftw_mpi_execute_r2r} new-array execute function (@pxref{Using MPI Plans}), since they count as (rank zero) r2r plans from FFTW's perspective.
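For example, here is a minimal sketch of an out-of-place transpose of an @code{N0} by @code{N1} array of real numbers (not from the FFTW distribution; the out-of-place layout, the function name, and @code{FFTW_ESTIMATE} are arbitrary choices).  The allocation size comes from @code{fftw_mpi_local_size_2d_transposed}, whose return value is interpreted as a count of real numbers since a transpose counts as an r2r plan:

@example
#include <mpi.h>
#include <fftw3-mpi.h>

void transpose_matrix(ptrdiff_t N0, ptrdiff_t N1)
@{
     ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
     double *in, *out;
     fftw_plan plan;

     alloc_local = fftw_mpi_local_size_2d_transposed(N0, N1, MPI_COMM_WORLD,
                                                     &local_n0, &local_0_start,
                                                     &local_n1, &local_1_start);
     in = fftw_alloc_real(alloc_local);
     out = fftw_alloc_real(alloc_local);

     plan = fftw_mpi_plan_transpose(N0, N1, in, out,
                                    MPI_COMM_WORLD, FFTW_ESTIMATE);

     /* ... set in[i*N1 + j], 0 <= i < local_n0, 0 <= j < N1, to element
            (local_0_start + i, j) of the N0 x N1 matrix ... */
     fftw_execute(plan);
     /* out[k*N0 + m], 0 <= k < local_n1, 0 <= m < N0, now holds element
        (local_1_start + k, m) of the transposed N1 x N0 matrix */

     fftw_destroy_plan(plan);
     fftw_free(in);
     fftw_free(out);
@}
@end example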
@node MPI Wisdom Communication, , MPI Plan Creation, FFTW MPI Reference
@subsection MPI Wisdom Communication

To facilitate synchronizing wisdom among the different MPI processes, we provide two functions:

@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom
@example
void fftw_mpi_gather_wisdom(MPI_Comm comm);
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
@end example

The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the given communicator @code{comm} to the process of rank 0 in the communicator: that process obtains the union of all wisdom on all the processes.  As a side effect, some other processes will gain additional wisdom from other processes, but only process 0 will gain the complete union.

The @code{fftw_mpi_broadcast_wisdom} function does the reverse: it exports wisdom from process 0 in @code{comm} to all other processes in the communicator, replacing any wisdom they currently have.  @xref{FFTW MPI Wisdom}.

@c ------------------------------------------------------------
@node FFTW MPI Fortran Interface, , FFTW MPI Reference, Distributed-memory FFTW with MPI
@section FFTW MPI Fortran Interface
@cindex Fortran interface

@cindex iso_c_binding
The FFTW MPI interface is callable from modern Fortran compilers supporting the Fortran 2003 @code{iso_c_binding} standard for calling C functions.  As described in @ref{Calling FFTW from Modern Fortran}, this means that you can directly call FFTW's C interface from Fortran with only minor changes in syntax.  There are, however, a few things specific to the MPI interface to keep in mind:

@itemize @bullet

@item
Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran interface}, you should @code{include 'fftw3-mpi.f03'} (after @code{use, intrinsic :: iso_c_binding} as before).  The @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should @emph{not} @code{include} them both yourself.  (You will also want to include the MPI header file, usually via @code{include 'mpif.h'} or similar, although this is not needed by @code{fftw3-mpi.f03} @i{per se}.)  (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}.  @xref{Extended and quadruple precision in Fortran}.)

@item
Because of the different storage conventions between C and Fortran, you reverse the order of your array dimensions when passing them to FFTW (@pxref{Reversing array dimensions}).  This is merely a difference in notation and incurs no performance overhead.  However, it means that, whereas in C the @emph{first} dimension is distributed, in Fortran the @emph{last} dimension of your array is distributed.

@item
@cindex MPI communicator
In Fortran, communicators are stored as @code{integer} types; there is no @code{MPI_Comm} type, nor is there any way to access a C @code{MPI_Comm}.  Fortunately, this is taken care of for you by the FFTW Fortran interface: whenever the C interface expects an @code{MPI_Comm} type, you should pass the Fortran communicator as an @code{integer}.@footnote{Technically, this is because you aren't actually calling the C functions directly.  You are calling wrapper functions that translate the communicator with @code{MPI_Comm_f2c} before calling the ordinary C interface.
This is all done transparently, however, since the @code{fftw3-mpi.f03} interface file renames the wrappers so that they are called in Fortran with the same names as the C interface functions.}

@item
Because you need to call the @samp{local_size} function to find out how much space to allocate, and this may be @emph{larger} than the local portion of the array (@pxref{MPI Data Distribution}), you should @emph{always} allocate your arrays dynamically using FFTW's allocation routines as described in @ref{Allocating aligned memory in Fortran}.  (Coincidentally, this also provides the best performance by guaranteeing proper data alignment.)

@item
Because all sizes in the MPI FFTW interface are declared as @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in Fortran (@pxref{FFTW Fortran type reference}).

@item
@findex fftw_execute_dft
@findex fftw_mpi_execute_dft
@cindex new-array execution
In Fortran, because of the language semantics, we generally recommend using the new-array execute functions for all plans, even in the common case where you are executing the plan on the same arrays for which the plan was created (@pxref{Plan execution in Fortran}).  However, note that in the MPI interface these functions are changed: @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft}, etcetera.  @xref{Using MPI Plans}.

@end itemize

For example, here is a Fortran code snippet to perform a distributed @twodims{L,M} complex DFT in-place.  (This assumes you have already initialized MPI with @code{MPI_Init} and have also performed @code{call fftw_mpi_init}.)

@example
  use, intrinsic :: iso_c_binding
  include 'fftw3-mpi.f03'
  integer(C_INTPTR_T), parameter :: L = ...
  integer(C_INTPTR_T), parameter :: M = ...
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
  integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset

!   @r{get local data size and allocate (note dimension reversal)}
  alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
                                       local_M, local_j_offset)
  cdata = fftw_alloc_complex(alloc_local)
  call c_f_pointer(cdata, data, [L,local_M])

!   @r{create MPI plan for in-place forward DFT (note dimension reversal)}
  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
                              FFTW_FORWARD, FFTW_MEASURE)

! @r{initialize data to some function} my_function(i,j)
  do j = 1, local_M
    do i = 1, L
      data(i, j) = my_function(i, j + local_j_offset)
    end do
  end do

! @r{compute transform (as many times as desired)}
  call fftw_mpi_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
@end example

Note that we called @code{fftw_mpi_local_size_2d} and @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order, since a @twodims{L,M} Fortran array is viewed by FFTW in C as a @twodims{M,L} array.  This means that the array was distributed over the @code{M} dimension, the local portion of which is a @twodims{L,local_M} array in Fortran.  (You must @emph{not} use an @code{allocate} statement to allocate an @twodims{L,local_M} array, however; you must allocate @code{alloc_local} complex numbers, which may be greater than @code{L * local_M}, in order to reserve space for intermediate steps of the transform.)  Finally, we mention that because C's array indices are zero-based, the @code{local_j_offset} argument can conveniently be interpreted as an offset in the 1-based @code{j} index (rather than as a starting index as in C).
If instead you had used the @code{ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a transposed @twodims{M,local_L} array, associated with the @emph{same} @code{cdata} allocation (since the transform is in-place), and which you could declare with:

@example
  complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
  ...
  call c_f_pointer(cdata, tdata, [M,local_L])
@end example

where @code{local_L} would have been obtained by changing the @code{fftw_mpi_local_size_2d} call to:

@example
  alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
                         local_M, local_j_offset, local_L, local_i_offset)
@end example