This is fftw3.info, produced by makeinfo version 6.7 from fftw3.texi.
This manual is for FFTW (version 3.3.10, 10 December 2020).
Copyright (C) 2003 Matteo Frigo.
Copyright (C) 2003 Massachusetts Institute of Technology.
Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission
notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided
that the entire resulting derived work is distributed under the
terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for
modified versions, except that this permission notice may be stated
in a translation approved by the Free Software Foundation.
INFO-DIR-SECTION Development
START-INFO-DIR-ENTRY
* fftw3: (fftw3). FFTW User's Manual.
END-INFO-DIR-ENTRY

File: fftw3.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir)
FFTW User Manual
****************
Welcome to FFTW, the Fastest Fourier Transform in the West. FFTW is a
collection of fast C routines to compute the discrete Fourier transform.
This manual documents FFTW version 3.3.10.
* Menu:
* Introduction::
* Tutorial::
* Other Important Topics::
* FFTW Reference::
* Multi-threaded FFTW::
* Distributed-memory FFTW with MPI::
* Calling FFTW from Modern Fortran::
* Calling FFTW from Legacy Fortran::
* Upgrading from FFTW version 2::
* Installation and Customization::
* Acknowledgments::
* License and Copyright::
* Concept Index::
* Library Index::

File: fftw3.info, Node: Introduction, Next: Tutorial, Prev: Top, Up: Top
1 Introduction
**************
This manual documents version 3.3.10 of FFTW, the _Fastest Fourier
Transform in the West_. FFTW is a comprehensive collection of fast C
routines for computing the discrete Fourier transform (DFT) and various
special cases thereof.
* FFTW computes the DFT of complex data, real data, even- or
odd-symmetric real data (these symmetric transforms are usually
known as the discrete cosine or sine transform, respectively), and
the discrete Hartley transform (DHT) of real data.
* The input data can have arbitrary length. FFTW employs O(n log n)
algorithms for all lengths, including prime numbers.
* FFTW supports arbitrary multi-dimensional data.
* FFTW supports the SSE, SSE2, AVX, AVX2, AVX512, KCVI, Altivec, VSX,
and NEON vector instruction sets.
* FFTW includes parallel (multi-threaded) transforms for
shared-memory systems.
* Starting with version 3.3, FFTW includes distributed-memory
parallel transforms using MPI.
We assume herein that you are familiar with the properties and uses
of the DFT that are relevant to your application. Otherwise, see e.g.
'The Fast Fourier Transform and Its Applications' by E. O. Brigham
(Prentice-Hall, Englewood Cliffs, NJ, 1988). Our web page
(http://www.fftw.org) also has links to FFT-related information online.
In order to use FFTW effectively, you need to learn one basic concept
of FFTW's internal structure: FFTW does not use a fixed algorithm for
computing the transform, but instead it adapts the DFT algorithm to
details of the underlying hardware in order to maximize performance.
Hence, the computation of the transform is split into two phases.
First, FFTW's "planner" "learns" the fastest way to compute the
transform on your machine. The planner produces a data structure called
a "plan" that contains this information. Subsequently, the plan is
"executed" to transform the array of input data as dictated by the plan.
The plan can be reused as many times as needed. In typical
high-performance applications, many transforms of the same size are
computed and, consequently, a relatively expensive initialization of
this sort is acceptable. On the other hand, if you need a single
transform of a given size, the one-time cost of the planner becomes
significant. For this case, FFTW provides fast planners based on
heuristics or on previously computed plans.
FFTW supports transforms of data with arbitrary length, rank,
multiplicity, and a general memory layout. In simple cases, however,
this generality may be unnecessary and confusing. Consequently, we
organized the interface to FFTW into three levels of increasing
generality.
* The "basic interface" computes a single transform of contiguous
data.
* The "advanced interface" computes transforms of multiple or strided
arrays.
* The "guru interface" supports the most general data layouts,
multiplicities, and strides.
We expect that most users will be best served by the basic interface,
whereas the guru interface requires careful attention to the
documentation to avoid problems.
Besides the automatic performance adaptation performed by the
planner, it is also possible for advanced users to customize FFTW
manually. For example, if code space is a concern, we provide a tool
that links only the subset of FFTW needed by your application.
Conversely, you may need to extend FFTW because the standard
distribution is not sufficient for your needs. For example, the
standard FFTW distribution works most efficiently for arrays whose size
can be factored into small primes (2, 3, 5, and 7), and otherwise it
uses a slower general-purpose routine. If you need efficient transforms
of other sizes, you can use FFTW's code generator, which produces fast C
programs ("codelets") for any particular array size you may care about.
For example, if you need transforms of size 513 = 19 x 3^3, you can
customize FFTW to support the factor 19 efficiently.
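As a rough illustration of the size guidance above, the following sketch checks whether a transform size factors entirely into the primes 2, 3, 5, and 7. The helper name 'has_only_small_factors' is hypothetical and is not part of FFTW; it only mirrors the rule of thumb stated in the text.

```c
#include <assert.h>

/* Hypothetical helper (not part of the FFTW API): returns 1 if n > 0
   factors entirely into the primes 2, 3, 5, and 7, and 0 otherwise. */
int has_only_small_factors(int n)
{
    static const int primes[] = {2, 3, 5, 7};
    int i;

    assert(n > 0);
    for (i = 0; i < 4; ++i)
        while (n % primes[i] == 0)
            n /= primes[i];   /* strip every occurrence of this prime */
    return n == 1;            /* anything left over is a larger factor */
}
```

For instance, a size of 512 passes this check, whereas 513 does not because of its factor 19.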
For more information regarding FFTW, see the paper, "The Design and
Implementation of FFTW3," by M. Frigo and S. G. Johnson, which was an
invited paper in 'Proc. IEEE' 93 (2), p. 216 (2005). The code
generator is described in the paper "A fast Fourier transform compiler",
by M. Frigo, in the 'Proceedings of the 1999 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), Atlanta, Georgia,
May 1999'. These papers, along with the latest version of FFTW, the
FAQ, benchmarks, and other links, are available at the FFTW home page
(http://www.fftw.org).
The current version of FFTW incorporates many good ideas from the
past thirty years of FFT literature. In one way or another, FFTW uses
the Cooley-Tukey algorithm, the prime factor algorithm, Rader's
algorithm for prime sizes, and a split-radix algorithm (with a
"conjugate-pair" variation pointed out to us by Dan Bernstein). FFTW's
code generator also produces new algorithms that we do not completely
understand. The reader is referred to the cited papers for the
appropriate references.
The rest of this manual is organized as follows. We first discuss
the sequential (single-processor) implementation. We start by
describing the basic interface/features of FFTW in *note Tutorial::.
Next, *note Other Important Topics:: discusses data alignment (*note
SIMD alignment and fftw_malloc::), the storage scheme of
multi-dimensional arrays (*note Multi-dimensional Array Format::), and
FFTW's mechanism for storing plans on disk (*note Words of Wisdom-Saving
Plans::). Next, *note FFTW Reference:: provides comprehensive
documentation of all FFTW's features. Parallel transforms are discussed
in their own chapters: *note Multi-threaded FFTW:: and *note
Distributed-memory FFTW with MPI::. Fortran programmers can also use
FFTW, as described in *note Calling FFTW from Legacy Fortran:: and *note
Calling FFTW from Modern Fortran::. *note Installation and
Customization:: explains how to install FFTW on your computer system and
how to adapt FFTW to your needs. License and copyright information is
given in *note License and Copyright::. Finally, we thank all the
people who helped us in *note Acknowledgments::.

File: fftw3.info, Node: Tutorial, Next: Other Important Topics, Prev: Introduction, Up: Top
2 Tutorial
**********
* Menu:
* Complex One-Dimensional DFTs::
* Complex Multi-Dimensional DFTs::
* One-Dimensional DFTs of Real Data::
* Multi-Dimensional DFTs of Real Data::
* More DFTs of Real Data::
This chapter describes the basic usage of FFTW, i.e., how to compute the
Fourier transform of a single array. This chapter tells the truth, but
not the _whole_ truth. Specifically, FFTW implements additional
routines and flags that are not documented here, although in many cases
we try to indicate where added capabilities exist. For more complete
information, see *note FFTW Reference::. (Note that you need to compile
and install FFTW before you can use it in a program. For the details of
the installation, see *note Installation and Customization::.)
We recommend that you read this tutorial in order.(1) At the least,
read the first section (*note Complex One-Dimensional DFTs::) before
reading any of the others, even if your main interest lies in one of the
other transform types.
Users of FFTW version 2 and earlier may also want to read *note
Upgrading from FFTW version 2::.
---------- Footnotes ----------
(1) You can read the tutorial in bit-reversed order after computing
your first transform.

File: fftw3.info, Node: Complex One-Dimensional DFTs, Next: Complex Multi-Dimensional DFTs, Prev: Tutorial, Up: Tutorial
2.1 Complex One-Dimensional DFTs
================================
Plan: To bother about the best method of accomplishing an
accidental result. [Ambrose Bierce, 'The Enlarged Devil's
Dictionary'.]
The basic usage of FFTW to compute a one-dimensional DFT of size 'N'
is simple, and it typically looks something like this code:
     #include <fftw3.h>
     ...
     {
         fftw_complex *in, *out;
         fftw_plan p;
         ...
         in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
         p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
         ...
         fftw_execute(p); /* repeat as needed */
         ...
         fftw_destroy_plan(p);
         fftw_free(in); fftw_free(out);
     }
You must link this code with the 'fftw3' library. On Unix systems,
link with '-lfftw3 -lm'.
The example code first allocates the input and output arrays. You
can allocate them in any way that you like, but we recommend using
'fftw_malloc', which behaves like 'malloc' except that it properly
aligns the array when SIMD instructions (such as SSE and Altivec) are
available (*note SIMD alignment and fftw_malloc::). [Alternatively, we
provide a convenient wrapper function 'fftw_alloc_complex(N)' which has
the same effect.]
The data is an array of type 'fftw_complex', which is by default a
'double[2]' composed of the real ('in[i][0]') and imaginary ('in[i][1]')
parts of a complex number.
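To make the 'double[2]' layout concrete, here is a minimal sketch of complex multiplication on that representation. The type 'cplx' is a local stand-in for the default 'fftw_complex' so that the fragment compiles without '<fftw3.h>', and the helper 'cplx_mul' is hypothetical, not an FFTW routine.

```c
#include <assert.h>

/* Stand-in for the default fftw_complex layout described above:
   element [0] is the real part, element [1] is the imaginary part. */
typedef double cplx[2];

/* Hypothetical helper: out = a * b.  'out' must not alias 'a' or 'b',
   because both inputs are read after out[0] has been written. */
void cplx_mul(const cplx a, const cplx b, cplx out)
{
    assert(out != a && out != b);
    out[0] = a[0] * b[0] - a[1] * b[1];   /* real part      */
    out[1] = a[0] * b[1] + a[1] * b[0];   /* imaginary part */
}
```

With this layout, (1 + 2i) * (3 + 4i) yields a real part of -5 and an imaginary part of 10.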
The next step is to create a "plan", which is an object that contains
all the data that FFTW needs to compute the FFT. This function creates
the plan:
     fftw_plan fftw_plan_dft_1d(int n, fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);
The first argument, 'n', is the size of the transform you are trying
to compute. The size 'n' can be any positive integer, but sizes that
are products of small factors are transformed most efficiently (although
prime sizes still use an O(n log n) algorithm).
The next two arguments are pointers to the input and output arrays of
the transform. These pointers can be equal, indicating an "in-place"
transform.
The fourth argument, 'sign', can be either 'FFTW_FORWARD' ('-1') or
'FFTW_BACKWARD' ('+1'), and indicates the direction of the transform you
are interested in; technically, it is the sign of the exponent in the
transform.
The 'flags' argument is usually either 'FFTW_MEASURE' or
'FFTW_ESTIMATE'. 'FFTW_MEASURE' instructs FFTW to run and measure the
execution time of several FFTs in order to find the best way to compute
the transform of size 'n'. This process takes some time (usually a few
seconds), depending on your machine and on the size of the transform.
'FFTW_ESTIMATE', on the contrary, does not run any computation and just
builds a reasonable plan that is probably sub-optimal. In short, if
your program performs many transforms of the same size and
initialization time is not important, use 'FFTW_MEASURE'; otherwise use
'FFTW_ESTIMATE'.
_You must create the plan before initializing the input_, because
'FFTW_MEASURE' overwrites the 'in'/'out' arrays. (Technically,
'FFTW_ESTIMATE' does not touch your arrays, but you should always create
plans first just to be sure.)
Once the plan has been created, you can use it as many times as you
like for transforms on the specified 'in'/'out' arrays, computing the
actual transforms via 'fftw_execute(plan)':
     void fftw_execute(const fftw_plan plan);
The DFT results are stored in-order in the array 'out', with the
zero-frequency (DC) component in 'out[0]'. If 'in != out', the
transform is "out-of-place" and the input array 'in' is not modified.
Otherwise, the input array is overwritten with the transform.
If you want to transform a _different_ array of the same size, you
can create a new plan with 'fftw_plan_dft_1d' and FFTW automatically
reuses the information from the previous plan, if possible.
Alternatively, with the "guru" interface you can apply a given plan to a
different array, if you are careful. *Note FFTW Reference::.
When you are done with the plan, you deallocate it by calling
'fftw_destroy_plan(plan)':
     void fftw_destroy_plan(fftw_plan plan);
If you allocate an array with 'fftw_malloc()' you must deallocate it
with 'fftw_free()'. Do not use 'free()' or, heaven forbid, 'delete'.
FFTW computes an _unnormalized_ DFT. Thus, computing a forward
followed by a backward transform (or vice versa) results in the original
array scaled by 'n'. For the definition of the DFT, see *note What FFTW
Really Computes::.
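The scaling by 'n' can be checked by hand on a tiny case. The sketch below is a naive size-4 DFT written from the usual definition, with the sign of the exponent passed as '-1' or '+1' as for 'FFTW_FORWARD' and 'FFTW_BACKWARD'. It is not how FFTW computes anything; the names 'cplx', 'mul_pow_i', and 'dft4' are hypothetical, and size 4 is chosen only because its twiddle factors are exact powers of i.

```c
#include <assert.h>

typedef double cplx[2];  /* stand-in for fftw_complex: {re, im} */

/* out = z * i^p, computed exactly (no trigonometry needed). */
static void mul_pow_i(const cplx z, int p, cplx out)
{
    double re = z[0], im = z[1];

    switch (((p % 4) + 4) % 4) {          /* reduce p to 0..3 */
    case 0:  out[0] =  re; out[1] =  im; break;
    case 1:  out[0] = -im; out[1] =  re; break;
    case 2:  out[0] = -re; out[1] = -im; break;
    default: out[0] =  im; out[1] = -re; break;
    }
}

/* Naive unnormalized size-4 DFT:
     out[k] = sum_j in[j] * exp(sign * 2*pi*i * j*k / 4),
   where exp(sign * 2*pi*i * j*k / 4) equals i^(sign*j*k).
   'in' and 'out' must not overlap. */
void dft4(cplx *in, cplx *out, int sign)
{
    int j, k;

    assert(sign == -1 || sign == 1);
    for (k = 0; k < 4; ++k) {
        out[k][0] = out[k][1] = 0.0;
        for (j = 0; j < 4; ++j) {
            cplx t;
            mul_pow_i(in[j], sign * j * k, t);
            out[k][0] += t[0];
            out[k][1] += t[1];
        }
    }
}
```

Applying 'dft4' with sign '-1' and then with sign '+1' to the input 1, 2, 3, 4 returns 4, 8, 12, 16, that is, the original data multiplied by n = 4. Dividing by 'n' afterwards is left to the caller.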
If you have a C compiler, such as 'gcc', that supports the C99
standard, and you '#include <complex.h>' _before_ '<fftw3.h>', then
'fftw_complex' is the native double-precision complex type and you can
manipulate it with ordinary arithmetic. Otherwise, FFTW defines its own
complex type, which is bit-compatible with the C99 complex type. *Note
Complex numbers::. (The C++ '<complex>' template class may also be
usable via a typecast.)
To use single or long-double precision versions of FFTW, replace the
'fftw_' prefix by 'fftwf_' or 'fftwl_' and link with '-lfftw3f' or
'-lfftw3l', but use the _same_ '<fftw3.h>' header file.
Many more flags exist besides 'FFTW_MEASURE' and 'FFTW_ESTIMATE'.
For example, use 'FFTW_PATIENT' if you're willing to wait even longer
for a possibly even faster plan (*note FFTW Reference::). You can also
save plans for future use, as described by *note Words of Wisdom-Saving
Plans::.

File: fftw3.info, Node: Complex Multi-Dimensional DFTs, Next: One-Dimensional DFTs of Real Data, Prev: Complex One-Dimensional DFTs, Up: Tutorial
2.2 Complex Multi-Dimensional DFTs
==================================
Multi-dimensional transforms work much the same way as one-dimensional
transforms: you allocate arrays of 'fftw_complex' (preferably using
'fftw_malloc'), create an 'fftw_plan', execute it as many times as you
want with 'fftw_execute(plan)', and clean up with
'fftw_destroy_plan(plan)' (and 'fftw_free').
FFTW provides two routines for creating plans for 2d and 3d
transforms, and one routine for creating plans of arbitrary
dimensionality. The 2d and 3d routines have the following signature:
     fftw_plan fftw_plan_dft_2d(int n0, int n1,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);
     fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
                                fftw_complex *in, fftw_complex *out,
                                int sign, unsigned flags);
These routines create plans for 'n0' by 'n1' two-dimensional (2d)
transforms and 'n0' by 'n1' by 'n2' 3d transforms, respectively. All of
these transforms operate on contiguous arrays in the C-standard
"row-major" order, so that the last dimension has the fastest-varying
index in the array. This layout is described further in *note
Multi-dimensional Array Format::.
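As a small sketch of what row-major order means for indexing, the helper below computes the flat offset of one element of a contiguous 'n0' by 'n1' by 'n2' array. The name 'rowmajor_index_3d' is hypothetical and not part of FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flat offset of element (i0, i1, i2) in a
   contiguous row-major n0 x n1 x n2 array.  The last index varies
   fastest; n0 only bounds i0 and does not enter the offset. */
size_t rowmajor_index_3d(int i0, int i1, int i2, int n1, int n2)
{
    assert(i0 >= 0 && i1 >= 0 && i1 < n1 && i2 >= 0 && i2 < n2);
    return ((size_t)i0 * (size_t)n1 + (size_t)i1) * (size_t)n2 + (size_t)i2;
}
```

For a 2 by 3 by 4 array, stepping the last index moves by 1 element, stepping the middle index moves by 4, and stepping the first index moves by 12.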
FFTW can also compute transforms of higher dimensionality. In order
to avoid confusion between the various meanings of the word
"dimension", we use the term _rank_ to denote the number of independent
indices in an array.(1) For example, we say that a 2d transform has
rank 2, a 3d transform has rank 3, and so on. You can plan transforms
of arbitrary rank by means of the following function:
     fftw_plan fftw_plan_dft(int rank, const int *n,
                             fftw_complex *in, fftw_complex *out,
                             int sign, unsigned flags);
Here, 'n' is a pointer to an array 'n[rank]' denoting an 'n[0]' by
'n[1]' by ... by 'n[rank-1]' transform. Thus, for example, the call
     fftw_plan_dft_2d(n0, n1, in, out, sign, flags);
is equivalent to the following code fragment:
     int n[2];
     n[0] = n0;
     n[1] = n1;
     fftw_plan_dft(2, n, in, out, sign, flags);
'fftw_plan_dft' is not restricted to 2d and 3d transforms, however; it
can plan transforms of arbitrary rank.
You may have noticed that all the planner routines described so far
have overlapping functionality. For example, you can plan a 1d or 2d
transform by using 'fftw_plan_dft' with a 'rank' of '1' or '2', or even
by calling 'fftw_plan_dft_3d' with 'n0' and/or 'n1' equal to '1' (with
no loss in efficiency). This pattern continues, and FFTW's planning
routines in general form a "partial order," sequences of interfaces with
strictly increasing generality but correspondingly greater complexity.
'fftw_plan_dft' is the most general complex-DFT routine that we
describe in this tutorial, but there are also the advanced and guru
interfaces, which allow one to efficiently combine multiple/strided
transforms into a single FFTW plan, transform a subset of a larger
multi-dimensional array, and/or to handle more general complex-number
formats. For more information, see *note FFTW Reference::.
---------- Footnotes ----------
(1) The term "rank" is commonly used in the APL, FORTRAN, and Common
Lisp traditions, although it is not so common in the C world.

File: fftw3.info, Node: One-Dimensional DFTs of Real Data, Next: Multi-Dimensional DFTs of Real Data, Prev: Complex Multi-Dimensional DFTs, Up: Tutorial
2.3 One-Dimensional DFTs of Real Data
=====================================
In many practical applications, the input data 'in[i]' are purely real
numbers, in which case the DFT output satisfies the "Hermitian"
redundancy: 'out[i]' is the conjugate of 'out[n-i]'. It is possible to
take advantage of these circumstances in order to achieve roughly a
factor of two improvement in both speed and memory usage.
In exchange for these speed and space advantages, the user sacrifices
some of the simplicity of FFTW's complex transforms. First of all, the
input and output arrays are of _different sizes and types_: the input is
'n' real numbers, while the output is 'n/2+1' complex numbers (the
non-redundant outputs); this also requires slight "padding" of the input
array for in-place transforms. Second, the inverse transform (complex
to real) has the side-effect of _overwriting its input array_, by
default. Neither of these inconveniences should pose a serious problem
for users, but it is important to be aware of them.
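If the full complex spectrum is ever needed, it can be rebuilt from the 'n/2+1' non-redundant outputs using the Hermitian relation stated above. The following sketch does this for a 1d transform. The type 'cplx' is a local stand-in for 'fftw_complex' and 'expand_hermitian' is a hypothetical helper, not an FFTW routine.

```c
#include <assert.h>

typedef double cplx[2];  /* stand-in for fftw_complex: {re, im} */

/* Hypothetical helper: rebuild all n outputs of the DFT of real data
   from the n/2+1 non-redundant outputs in 'half', using the relation
   full[k] = conj(full[n-k]).  'full' must hold n elements. */
void expand_hermitian(int n, cplx *half, cplx *full)
{
    int k;

    assert(n > 0);
    for (k = 0; k <= n / 2; ++k) {        /* stored half: copy as-is */
        full[k][0] = half[k][0];
        full[k][1] = half[k][1];
    }
    for (k = n / 2 + 1; k < n; ++k) {     /* redundant half: conjugate */
        full[k][0] =  half[n - k][0];
        full[k][1] = -half[n - k][1];
    }
}
```

For n = 4 and stored outputs 10, -2+2i, -2, the rebuilt fourth output is -2-2i.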
The routines to perform real-data transforms are almost the same as
those for complex transforms: you allocate arrays of 'double' and/or
'fftw_complex' (preferably using 'fftw_malloc' or 'fftw_alloc_complex'),
create an 'fftw_plan', execute it as many times as you want with
'fftw_execute(plan)', and clean up with 'fftw_destroy_plan(plan)' (and
'fftw_free'). The only differences are that the input (or output) is of
type 'double' and there are new routines to create the plan. In one
dimension:
     fftw_plan fftw_plan_dft_r2c_1d(int n, double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_c2r_1d(int n, fftw_complex *in, double *out,
                                    unsigned flags);
for the real input to complex-Hermitian output ("r2c") and
complex-Hermitian input to real output ("c2r") transforms. Unlike the
complex DFT planner, there is no 'sign' argument. Instead, r2c DFTs are
always 'FFTW_FORWARD' and c2r DFTs are always 'FFTW_BACKWARD'. (For
single/long-double precision 'fftwf' and 'fftwl', 'double' should be
replaced by 'float' and 'long double', respectively.)
Here, 'n' is the "logical" size of the DFT, not necessarily the
physical size of the array. In particular, the real ('double') array
has 'n' elements, while the complex ('fftw_complex') array has 'n/2+1'
elements (where the division is rounded down). For an in-place
transform, 'in' and 'out' are aliased to the same array, which must be
big enough to hold both; so, the real array would actually have
'2*(n/2+1)' elements, where the elements beyond the first 'n' are unused
padding. (Note that this is very different from the concept of
"zero-padding" a transform to a larger length, which changes the logical
size of the DFT by actually adding new input data.) The kth element of
the complex array is exactly the same as the kth element of the
corresponding complex DFT. All positive 'n' are supported; products of
small factors are most efficient, but an O(n log n) algorithm is used
even for prime sizes.
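The array sizes in the preceding paragraph are easy to get wrong by one, so a sketch of the arithmetic may help. The helper names below are hypothetical; they only encode the 'n/2+1' and '2*(n/2+1)' rules from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex elements in the r2c output (or c2r input) for a
   logical size n; the division rounds down. */
size_t r2c_complex_len(int n)
{
    assert(n > 0);
    return (size_t)n / 2 + 1;
}

/* Number of doubles the shared array must hold for an in-place
   transform: 2*(n/2+1), of which only the first n carry real data. */
size_t r2c_inplace_real_len(int n)
{
    return 2 * r2c_complex_len(n);
}
```

For n = 8 this gives 5 complex outputs and an in-place real array of 10 doubles (2 of padding); for n = 7 it gives 4 complex outputs and 8 doubles (1 of padding).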
As noted above, the c2r transform destroys its input array even for
out-of-place transforms. This can be prevented, if necessary, by
including 'FFTW_PRESERVE_INPUT' in the 'flags', with unfortunately some
sacrifice in performance. This flag is also not currently supported for
multi-dimensional real DFTs (next section).
Readers familiar with DFTs of real data will recall that the 0th (the
"DC") and 'n/2'-th (the "Nyquist" frequency, when 'n' is even) elements
of the complex output are purely real. Some implementations therefore
store the Nyquist element where the DC imaginary part would go, in order
to make the input and output arrays the same size. Such packing,
however, does not generalize well to multi-dimensional transforms, and
the space savings are minuscule in any case; FFTW does not support it.
An alternative interface for one-dimensional r2c and c2r DFTs can be
found in the 'r2r' interface (*note The Halfcomplex-format DFT::), with
"halfcomplex"-format output that _is_ the same size (and type) as the
input array. That interface, although it is not very useful for
multi-dimensional transforms, may sometimes yield better performance.

File: fftw3.info, Node: Multi-Dimensional DFTs of Real Data, Next: More DFTs of Real Data, Prev: One-Dimensional DFTs of Real Data, Up: Tutorial
2.4 Multi-Dimensional DFTs of Real Data
=======================================
Multi-dimensional DFTs of real data use the following planner routines:
     fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
                                    double *in, fftw_complex *out,
                                    unsigned flags);
     fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
                                 double *in, fftw_complex *out,
                                 unsigned flags);
as well as the corresponding 'c2r' routines with the input/output
types swapped. These routines work similarly to their complex
analogues, except for the fact that here the complex output array is cut
roughly in half and the real array requires padding for in-place
transforms (as in 1d, above).
As before, 'n' is the logical size of the array, and the consequences
of this for the format of the complex arrays deserve careful
attention. Suppose that the real data has dimensions n[0] x n[1] x n[2]
x ... x n[d-1] (in row-major order). Then, after an r2c transform, the
output is an n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) array of
'fftw_complex' values in row-major order, corresponding to slightly over
half of the output of the corresponding complex DFT. (The division is
rounded down.) The ordering of the data is otherwise exactly the same
as in the complex-DFT case.
For out-of-place transforms, this is the end of the story: the real
data is stored as a row-major array of size n[0] x n[1] x n[2] x ... x
n[d-1] and the complex data is stored as a row-major array of size n[0]
x n[1] x n[2] x ... x (n[d-1]/2 + 1) .
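For out-of-place transforms, the two element counts can be sketched as follows. The helper names are hypothetical and the functions only restate the sizes given above: the real array has the product of all logical dimensions, and the complex array replaces the last dimension by 'n[d-1]/2 + 1'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of doubles in the real array of an
   out-of-place r2c/c2r transform with logical dimensions n[0..rank-1]. */
size_t r2c_real_count(int rank, const int *n)
{
    size_t total = 1;
    int i;

    assert(rank > 0);
    for (i = 0; i < rank; ++i)
        total *= (size_t)n[i];
    return total;
}

/* Hypothetical helper: number of complex elements in the matching
   complex array; only the last dimension is cut to n[rank-1]/2 + 1. */
size_t r2c_complex_count(int rank, const int *n)
{
    size_t total = 1;
    int i;

    assert(rank > 0);
    for (i = 0; i < rank - 1; ++i)
        total *= (size_t)n[i];
    return total * ((size_t)n[rank - 1] / 2 + 1);
}
```

For a 4 by 6 real array this gives 24 real values and 4 x 4 = 16 complex values.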
For in-place transforms, however, extra padding of the real-data
array is necessary because the complex array is larger than the real
array, and the two arrays share the same memory locations. Thus, for
in-place transforms, the final dimension of the real-data array must be
padded with extra values to accommodate the size of the complex
data--two values if the last dimension is even and one if it is odd.
That is, the last dimension of the real data must physically contain 2 *
(n[d-1]/2+1) 'double' values (exactly enough to hold the complex data).
This physical array size does not, however, change the _logical_ array
size--only n[d-1] values are actually stored in the last dimension, and
n[d-1] is the last dimension passed to the plan-creation routine.
For example, consider the transform of a two-dimensional real array
of size 'n0' by 'n1'. The output of the r2c transform is a
two-dimensional complex array of size 'n0' by 'n1/2+1', where the 'y'
dimension has been cut nearly in half because of redundancies in the
output. Because 'fftw_complex' is twice the size of 'double', the
output array is slightly bigger than the input array. Thus, if we want
to compute the transform in place, we must _pad_ the input array so that
it is of size 'n0' by '2*(n1/2+1)'. If 'n1' is even, then there are two
padding elements at the end of each row (which need not be initialized,
as they are only used for output).
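In code, the padding shows up as a row stride that differs from the logical row length. The sketch below computes the offset of real element (i, j) in such a padded two-dimensional in-place array. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Physical number of doubles per row of the padded real array for an
   in-place 2d r2c transform with logical last dimension n1. */
size_t padded_row_len(int n1)
{
    assert(n1 > 0);
    return 2 * ((size_t)n1 / 2 + 1);
}

/* Hypothetical helper: offset of logical real element (i, j), with
   0 <= j < n1, inside the padded array.  Padding slots are skipped
   simply by using the padded row length as the stride. */
size_t padded_index(int i, int j, int n1)
{
    assert(i >= 0 && j >= 0 && j < n1);
    return (size_t)i * padded_row_len(n1) + (size_t)j;
}
```

With n1 = 4 each row occupies 6 doubles, so element (1, 0) sits at offset 6 rather than 4. With n1 = 5 each row also occupies 6 doubles, with a single padding element.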
These transforms are unnormalized, so an r2c followed by a c2r
transform (or vice versa) will result in the original data scaled by the
number of real data elements--that is, the product of the (logical)
dimensions of the real data.
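A caller who wants the original data back after an r2c/c2r round trip can divide by that product. The sketch below assumes an unpadded (out-of-place) real array in row-major order; 'normalize_real' is a hypothetical helper, not an FFTW routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: divide every element of an unpadded real array
   by the product of its logical dimensions n[0..rank-1], undoing the
   scaling left by an unnormalized forward/backward pair. */
void normalize_real(double *data, int rank, const int *n)
{
    size_t total = 1, k;
    int i;

    assert(rank > 0);
    for (i = 0; i < rank; ++i)
        total *= (size_t)n[i];
    for (k = 0; k < total; ++k)
        data[k] /= (double)total;
}
```

For a padded in-place array, the same scale factor applies, but the loop would need to use the padded row length as its stride.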
(Because the last dimension is treated specially, if it is equal to
'1' the transform is _not_ equivalent to a lower-dimensional r2c/c2r
transform. In that case, the last complex dimension also has size '1'
('=1/2+1'), and no advantage is gained over the complex transforms.)

File: fftw3.info, Node: More DFTs of Real Data, Prev: Multi-Dimensional DFTs of Real Data, Up: Tutorial
2.5 More DFTs of Real Data
==========================
* Menu:
* The Halfcomplex-format DFT::
* Real even/odd DFTs (cosine/sine transforms)::
* The Discrete Hartley Transform::
FFTW supports several other transform types via a unified "r2r"
(real-to-real) interface, so called because it takes a real ('double')
array and outputs a real array of the same size. These r2r transforms
currently fall into three categories: DFTs of real input and
complex-Hermitian output in halfcomplex format, DFTs of real input with
even/odd symmetry (a.k.a. discrete cosine/sine transforms, DCTs/DSTs),
and discrete Hartley transforms (DHTs), all described in more detail by
the following sections.
The r2r transforms follow the by now familiar interface of creating
an 'fftw_plan', executing it with 'fftw_execute(plan)', and destroying
it with 'fftw_destroy_plan(plan)'. Furthermore, all r2r transforms
share the same planner interface:
     fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
                                fftw_r2r_kind kind, unsigned flags);
     fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
                                fftw_r2r_kind kind0, fftw_r2r_kind kind1,
                                unsigned flags);
     fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
                                double *in, double *out,
                                fftw_r2r_kind kind0,
                                fftw_r2r_kind kind1,
                                fftw_r2r_kind kind2,
                                unsigned flags);
     fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
                             const fftw_r2r_kind *kind, unsigned flags);
Just as for the complex DFT, these plan 1d/2d/3d/multi-dimensional
transforms for contiguous arrays in row-major order, transforming (real)
input to output of the same size, where 'n' specifies the _physical_
dimensions of the arrays. All positive 'n' are supported (with the
exception of 'n=1' for the 'FFTW_REDFT00' kind, noted in the real-even
subsection below); products of small factors are most efficient
(factorizing 'n-1' and 'n+1' for 'FFTW_REDFT00' and 'FFTW_RODFT00'
kinds, described below), but an O(n log n) algorithm is used even for
prime sizes.
Each dimension has a "kind" parameter, of type 'fftw_r2r_kind',
specifying the kind of r2r transform to be used for that dimension. (In
the case of 'fftw_plan_r2r', this is an array 'kind[rank]' where
'kind[i]' is the transform kind for the dimension 'n[i]'.) The kind can
be one of a set of predefined constants, defined in the following
subsections.
In other words, FFTW computes the separable product of the specified
r2r transforms over each dimension, which can be used e.g. for partial
differential equations with mixed boundary conditions. (For some r2r
kinds, notably the halfcomplex DFT and the DHT, such a separable product
is somewhat problematic in more than one dimension, however, as is
described below.)
In the current version of FFTW, all r2r transforms except for the
halfcomplex type are computed via pre- or post-processing of halfcomplex
transforms, and they are therefore not as fast as they could be. Since
most other general DCT/DST codes employ a similar algorithm, however,
FFTW's implementation should provide at least competitive performance.

File: fftw3.info, Node: The Halfcomplex-format DFT, Next: Real even/odd DFTs (cosine/sine transforms), Prev: More DFTs of Real Data, Up: More DFTs of Real Data
2.5.1 The Halfcomplex-format DFT
--------------------------------
An r2r kind of 'FFTW_R2HC' ("r2hc") corresponds to an r2c DFT (*note
One-Dimensional DFTs of Real Data::) but with "halfcomplex" format
output, and may sometimes be faster and/or more convenient than the
latter. The inverse "hc2r" transform is of kind 'FFTW_HC2R'. The
halfcomplex format consists of the non-redundant half of the complex
output for a 1d real-input DFT of size 'n', stored as a sequence of 'n'
real numbers ('double') in the format:
     r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1
Here, rk is the real part of the kth output, and ik is the imaginary
part. (Division by 2 is rounded down.) For a halfcomplex array
'hc[n]', the kth component thus has its real part in 'hc[k]' and its
imaginary part in 'hc[n-k]', with the exception of 'k == 0' or 'k == n/2'
(the latter only if 'n' is even)--in these two cases, the imaginary part
is zero due to symmetries of the real-input DFT, and is not stored.
Thus, the r2hc transform of 'n' real values is a halfcomplex array of
length 'n', and vice versa for hc2r.
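The indexing rule above can be wrapped in two small accessors. This is only a sketch; 'hc_real' and 'hc_imag' are hypothetical names and are not provided by FFTW.

```c
#include <assert.h>

/* Real part of the kth output (0 <= k <= n/2) of a halfcomplex array. */
double hc_real(const double *hc, int n, int k)
{
    assert(n > 0 && k >= 0 && k <= n / 2);
    return hc[k];
}

/* Imaginary part of the kth output (0 <= k <= n/2).  For k == 0, and
   for k == n/2 when n is even, it is zero and is not stored. */
double hc_imag(const double *hc, int n, int k)
{
    assert(n > 0 && k >= 0 && k <= n / 2);
    if (k == 0 || (n % 2 == 0 && k == n / 2))
        return 0.0;
    return hc[n - k];
}
```

For n = 4 the array holds r0, r1, r2, i1, and for n = 5 it holds r0, r1, r2, i2, i1, so 'hc_imag(hc, 5, 2)' reads 'hc[3]'.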
Aside from the differing format, the output of
'FFTW_R2HC'/'FFTW_HC2R' is otherwise exactly the same as for the
corresponding 1d r2c/c2r transform (i.e. 'FFTW_FORWARD'/'FFTW_BACKWARD'
transforms, respectively). Recall that these transforms are
unnormalized, so r2hc followed by hc2r will result in the original data
multiplied by 'n'. Furthermore, like the c2r transform, an out-of-place
hc2r transform will _destroy its input_ array.
Although these halfcomplex transforms can be used with the
multi-dimensional r2r interface, the interpretation of such a separable
product of transforms along each dimension is problematic. For example,
consider a two-dimensional 'n0' by 'n1', r2hc by r2hc transform planned
by 'fftw_plan_r2r_2d(n0, n1, in, out, FFTW_R2HC, FFTW_R2HC,
FFTW_MEASURE)'. Conceptually, FFTW first transforms the rows (of size
'n1') to produce halfcomplex rows, and then transforms the columns (of
size 'n0'). Half of these column transforms, however, are of imaginary
parts, and should therefore be multiplied by i and combined with the
r2hc transforms of the real columns to produce the 2d DFT amplitudes;
FFTW's r2r transform does _not_ perform this combination for you. Thus,
if a multi-dimensional real-input/output DFT is required, we recommend
using the ordinary r2c/c2r interface (*note Multi-Dimensional DFTs of
Real Data::).

File: fftw3.info, Node: Real even/odd DFTs (cosine/sine transforms), Next: The Discrete Hartley Transform, Prev: The Halfcomplex-format DFT, Up: More DFTs of Real Data
2.5.2 Real even/odd DFTs (cosine/sine transforms)
-------------------------------------------------
The Fourier transform of a real-even function f(-x) = f(x) is real-even,
and i times the Fourier transform of a real-odd function f(-x) = -f(x)
is real-odd. Similar results hold for a discrete Fourier transform, and
thus for these symmetries the need for complex inputs/outputs is
entirely eliminated. Moreover, one gains a factor of two in speed/space
from the fact that the data are real, and an additional factor of two
from the even/odd symmetry: only the non-redundant (first) half of the
array need be stored. The result is the real-even DFT ("REDFT") and the
real-odd DFT ("RODFT"), also known as the discrete cosine and sine
transforms ("DCT" and "DST"), respectively.
(In this section, we describe the 1d transforms; multi-dimensional
transforms are just a separable product of these transforms operating
along each dimension.)
Because of the discrete sampling, one has an additional choice: is
the data even/odd around a sampling point, or around the point halfway
between two samples? The latter corresponds to _shifting_ the samples
by _half_ an interval, and gives rise to several transform variants
denoted by REDFTab and RODFTab: a and b are 0 or 1, and indicate whether
the input (a) and/or output (b) are shifted by half a sample (1 means it
is shifted). These are also known as types I-IV of the DCT and DST, and
all four types are supported by FFTW's r2r interface.(1)
The r2r kinds for the various REDFT and RODFT types supported by
FFTW, along with the boundary conditions at both ends of the _input_
array ('n' real numbers 'in[j=0..n-1]'), are:
* 'FFTW_REDFT00' (DCT-I): even around j=0 and even around j=n-1.
* 'FFTW_REDFT10' (DCT-II, "the" DCT): even around j=-0.5 and even
around j=n-0.5.
* 'FFTW_REDFT01' (DCT-III, "the" IDCT): even around j=0 and odd
around j=n.
* 'FFTW_REDFT11' (DCT-IV): even around j=-0.5 and odd around j=n-0.5.
* 'FFTW_RODFT00' (DST-I): odd around j=-1 and odd around j=n.
* 'FFTW_RODFT10' (DST-II): odd around j=-0.5 and odd around j=n-0.5.
* 'FFTW_RODFT01' (DST-III): odd around j=-1 and even around j=n-1.
* 'FFTW_RODFT11' (DST-IV): odd around j=-0.5 and even around j=n-0.5.
Note that these symmetries apply to the "logical" array being
transformed; *there are no constraints on your physical input data*.
So, for example, if you specify a size-5 REDFT00 (DCT-I) of the data
abcde, it corresponds to the DFT of the logical even array abcdedcb of
size 8. A size-4 REDFT10 (DCT-II) of the data abcd corresponds to the
size-8 logical DFT of the even array abcddcba, shifted by half a sample.
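The two examples above can be spelled out in plain C. The helpers below are hypothetical and are not FFTW functions; they only construct the symmetric extension that the text describes, which you never need to build yourself when calling FFTW.

```c
#include <stddef.h>

/* REDFT00 (DCT-I): extend n physical samples to the logical even array
   of size 2(n-1), mirrored around j = 0 and j = n-1.  Requires n >= 2;
   'logical' must have room for 2(n-1) values. */
void redft00_logical(const double *in, size_t n, double *logical)
{
    size_t N = 2 * (n - 1);
    for (size_t j = 0; j < n; ++j)
        logical[j] = in[j];
    for (size_t j = n; j < N; ++j)
        logical[j] = in[N - j];          /* abcde -> abcde dcb */
}

/* REDFT10 (DCT-II): extend n physical samples to the logical even array
   of size 2n, mirrored around j = n-0.5.  The half-sample shift is a
   property of the transform, not of this array.  'logical' must have
   room for 2n values. */
void redft10_logical(const double *in, size_t n, double *logical)
{
    for (size_t j = 0; j < n; ++j) {
        logical[j] = in[j];
        logical[2 * n - 1 - j] = in[j];  /* abcd -> abcd dcba */
    }
}
```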
All of these transforms are invertible. The inverse of R*DFT00 is
R*DFT00; of R*DFT10 is R*DFT01 and vice versa (these are often called
simply "the" DCT and IDCT, respectively); and of R*DFT11 is R*DFT11.
However, the transforms computed by FFTW are unnormalized, exactly like
the corresponding real and complex DFTs, so computing a transform
followed by its inverse yields the original array scaled by N, where N
is the _logical_ DFT size. For REDFT00, N=2(n-1); for RODFT00,
N=2(n+1); otherwise, N=2n.
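A small sketch of this bookkeeping follows. The enum and function names are our own stand-ins (they are not FFTW's 'fftw_r2r_kind' type or any FFTW routine), and the code assumes 'n >= 1'.

```c
#include <stddef.h>

/* Hypothetical labels for the eight kinds discussed in this section. */
typedef enum {
    KIND_REDFT00, KIND_REDFT10, KIND_REDFT01, KIND_REDFT11,
    KIND_RODFT00, KIND_RODFT10, KIND_RODFT01, KIND_RODFT11
} r2r_kind_sketch;

/* Logical DFT size N for a physical size n, as stated in the text. */
size_t logical_dft_size(r2r_kind_sketch kind, size_t n)
{
    switch (kind) {
    case KIND_REDFT00: return 2 * (n - 1);
    case KIND_RODFT00: return 2 * (n + 1);
    default:           return 2 * n;
    }
}

/* Undo the factor of N left by a transform followed by its inverse. */
void normalize_round_trip(double *data, size_t n, r2r_kind_sketch kind)
{
    size_t N = logical_dft_size(kind, n);
    if (N == 0)
        return;                      /* n == 1 REDFT00: nothing to scale */
    for (size_t j = 0; j < n; ++j)
        data[j] /= (double)N;
}
```

Note that the divisor is N and not the physical size 'n'; dividing by 'n' would leave the data off by a factor of about two.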
Note that the boundary conditions of the transform output array are
given by the input boundary conditions of the inverse transform. Thus,
the above transforms are all inequivalent in terms of input/output
boundary conditions, even neglecting the 0.5 shift difference.
FFTW is most efficient when N is a product of small factors; note
that this _differs_ from the factorization of the physical size 'n' for
REDFT00 and RODFT00! There is another oddity: 'n=1' REDFT00 transforms
correspond to N=0, and so are _not defined_ (the planner will return
'NULL'). Otherwise, any positive 'n' is supported.
For the precise mathematical definitions of these transforms as used
by FFTW, see *note What FFTW Really Computes::. (For people accustomed
to the DCT/DST, FFTW's definitions have a coefficient of 2 in front of
the cos/sin functions so that they correspond precisely to an even/odd
DFT of size N. Some authors also include additional multiplicative
factors of sqrt(2) for selected inputs and outputs; this makes the
transform orthogonal, but sacrifices the direct equivalence to a
symmetric DFT.)
Which type do you need?
.......................
Since the required flavor of even/odd DFT depends upon your problem, you
are the best judge of this choice, but we can make a few comments on
relative efficiency to help you in your selection. In particular,
R*DFT01 and R*DFT10 tend to be slightly faster than R*DFT11 (especially
for odd sizes), while the R*DFT00 transforms are sometimes significantly
slower (especially for even sizes).(2)
Thus, if only the boundary conditions on the transform inputs are
specified, we generally recommend R*DFT10 over R*DFT00 and R*DFT01 over
R*DFT11 (unless the half-sample shift or the self-inverse property is
significant for your problem).
If performance is important to you and you are using only small sizes
(say n<200), e.g. for multi-dimensional transforms, then you might
consider generating hard-coded transforms of those sizes and types that
you are interested in (*note Generating your own code::).
We are interested in hearing what types of symmetric transforms you
find most useful.
---------- Footnotes ----------
(1) There are also type V-VIII transforms, which correspond to a
logical DFT of _odd_ size N, independent of whether the physical size
'n' is odd, but we do not support these variants.
(2) R*DFT00 is sometimes slower in FFTW because we discovered that
the standard algorithm for computing this by a pre/post-processed real
DFT--the algorithm used in FFTPACK, Numerical Recipes, and other sources
for decades now--has serious numerical problems: it already loses
several decimal places of accuracy for 16k sizes. There seem to be only
two alternatives in the literature that do not suffer similarly: a
recursive decomposition into smaller DCTs, which would require a large
set of codelets for efficiency and generality, or sacrificing a factor
of 2 in speed to use a real DFT of twice the size. We currently employ
the latter technique for general n, as well as a limited form of the
former method: a split-radix decomposition when n is odd (N a multiple
of 4). For N containing many factors of 2, the split-radix method seems
to recover most of the speed of the standard algorithm without the
accuracy tradeoff.

File: fftw3.info, Node: The Discrete Hartley Transform, Prev: Real even/odd DFTs (cosine/sine transforms), Up: More DFTs of Real Data
2.5.3 The Discrete Hartley Transform
------------------------------------
If you are planning to use the DHT because you've heard that it is
"faster" than the DFT (FFT), *stop here*. The DHT is not faster than
the DFT. That story is an old but enduring misconception that was
debunked in 1987.
The discrete Hartley transform (DHT) is an invertible linear
transform closely related to the DFT. In the DFT, one multiplies each
input by cos - i * sin (a complex exponential), whereas in the DHT each
input is multiplied by simply cos + sin. Thus, the DHT transforms 'n'
real numbers to 'n' real numbers, and has the convenient property of
being its own inverse. In FFTW, a DHT (of any positive 'n') can be
specified by an r2r kind of 'FFTW_DHT'.
Like the DFT, in FFTW the DHT is unnormalized, so computing a DHT of
size 'n' followed by another DHT of the same size will result in the
original array multiplied by 'n'.
The DHT was originally proposed as a more efficient alternative to
the DFT for real data, but it was subsequently shown that a specialized
DFT (such as FFTW's r2hc or r2c transforms) could be just as fast. In
FFTW, the DHT is actually computed by post-processing an r2hc transform,
so there is ordinarily no reason to prefer it from a performance
perspective.(1) However, we have heard rumors that the DHT might be the
most appropriate transform in its own right for certain applications,
and we would be very interested to hear from anyone who finds it useful.
If 'FFTW_DHT' is specified for multiple dimensions of a
multi-dimensional transform, FFTW computes the separable product of 1d
DHTs along each dimension. Unfortunately, this is not quite the same
thing as a true multi-dimensional DHT; you can compute the latter, if
necessary, with at most 'rank-1' post-processing passes [see e.g. H.
Hao and R. N. Bracewell, Proc. IEEE 75, 264-266 (1987)].
For the precise mathematical definition of the DHT as used by FFTW,
see *note What FFTW Really Computes::.
---------- Footnotes ----------
(1) We provide the DHT mainly as a byproduct of some internal
algorithms. FFTW computes a real input/output DFT of _prime_ size by
re-expressing it as a DHT plus post/pre-processing and then using
Rader's prime-DFT algorithm adapted to the DHT.

File: fftw3.info, Node: Other Important Topics, Next: FFTW Reference, Prev: Tutorial, Up: Top
3 Other Important Topics
************************
* Menu:
* SIMD alignment and fftw_malloc::
* Multi-dimensional Array Format::
* Words of Wisdom-Saving Plans::
* Caveats in Using Wisdom::

File: fftw3.info, Node: SIMD alignment and fftw_malloc, Next: Multi-dimensional Array Format, Prev: Other Important Topics, Up: Other Important Topics
3.1 SIMD alignment and fftw_malloc
==================================
SIMD, which stands for "Single Instruction Multiple Data," is a set of
special operations supported by some processors to perform a single
operation on several numbers (usually 2 or 4) simultaneously. SIMD
floating-point instructions are available on several popular CPUs:
SSE/SSE2/AVX/AVX2/AVX512/KCVI on some x86/x86-64 processors, AltiVec and
VSX on some POWER/PowerPCs, NEON on some ARM models. FFTW can be
compiled to support the SIMD instructions on any of these systems.
A program linking to an FFTW library compiled with SIMD support can
obtain a nonnegligible speedup for most complex and r2c/c2r transforms.
In order to obtain this speedup, however, the arrays of complex (or
real) data passed to FFTW must be specially aligned in memory (typically
16-byte aligned), and often this alignment is more stringent than that
provided by the usual 'malloc' (etc.) allocation routines.
In order to guarantee proper alignment for SIMD, therefore, in case
your program is ever linked against a SIMD-using FFTW, we recommend
allocating your transform data with 'fftw_malloc' and de-allocating it
with 'fftw_free'. These have exactly the same interface and behavior as
'malloc'/'free', except that for a SIMD FFTW they ensure that the
returned pointer has the necessary alignment (by calling 'memalign' or
its equivalent on your OS).
You are not _required_ to use 'fftw_malloc'. You can allocate your
data in any way that you like, from 'malloc' to 'new' (in C++) to a
fixed-size array declaration. If the array happens not to be properly
aligned, FFTW will not use the SIMD extensions.
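If you allocate data yourself and want to know whether it meets a given alignment, a check along the following lines may help. This is only a sketch: 'is_aligned' is our own helper, not an FFTW function, and the 16-byte figure used in the comment is the typical value quoted above rather than a guarantee for every SIMD build.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of 'alignment' bytes.
   'alignment' must be a power of two, e.g. 16. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

A pointer that is 8 bytes past a 16-byte boundary, such as the second 'double' of an aligned array, fails the 16-byte check.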
Since 'fftw_malloc' only ever needs to be used for real and complex
arrays, we provide two convenient wrapper routines 'fftw_alloc_real(N)'
and 'fftw_alloc_complex(N)' that are equivalent to
'(double*)fftw_malloc(sizeof(double) * N)' and
'(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively (or
their equivalents in other precisions).

File: fftw3.info, Node: Multi-dimensional Array Format, Next: Words of Wisdom-Saving Plans, Prev: SIMD alignment and fftw_malloc, Up: Other Important Topics
3.2 Multi-dimensional Array Format
==================================
This section describes the format in which multi-dimensional arrays are
stored in FFTW. Because several different formats are in common use,
this topic is often a source of confusion, so we discuss it in detail.
* Menu:
* Row-major Format::
* Column-major Format::
* Fixed-size Arrays in C::
* Dynamic Arrays in C::
* Dynamic Arrays in C-The Wrong Way::

File: fftw3.info, Node: Row-major Format, Next: Column-major Format, Prev: Multi-dimensional Array Format, Up: Multi-dimensional Array Format
3.2.1 Row-major Format
----------------------
The multi-dimensional arrays passed to 'fftw_plan_dft' etcetera are
expected to be stored as a single contiguous block in "row-major" order
(sometimes called "C order"). Basically, this means that as you step
through adjacent memory locations, the first dimension's index varies
most slowly and the last dimension's index varies most quickly.
To be more explicit, let us consider an array of rank d whose
dimensions are n[0] x n[1] x n[2] x ... x n[d-1]. Now, we specify a
location in the array by a sequence of d (zero-based) indices, one for
each dimension: (i[0], i[1], ..., i[d-1]). If the array is stored in
row-major order, then this element is located at the position i[d-1] +
n[d-1] * (i[d-2] + n[d-2] * (... + n[1] * i[0])).
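The position formula above can be evaluated with a short loop. The function below is an illustrative sketch ('row_major_index' is our name, not an FFTW routine); it assumes every index satisfies 0 <= i[d] < n[d].

```c
#include <stddef.h>

/* Row-major position of element (i[0], ..., i[rank-1]) in an array of
   dimensions n[0] x ... x n[rank-1]. */
size_t row_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    for (int d = 0; d < rank; ++d)
        pos = pos * (size_t)n[d] + (size_t)i[d];   /* last index varies fastest */
    return pos;
}
```

For a 5 x 12 x 27 array, element (2, 3, 4) lands at position 4 + 27 * (3 + 12 * 2) = 733.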
Note that, for the ordinary complex DFT, each element of the array
must be of type 'fftw_complex'; i.e. a (real, imaginary) pair of
(double-precision) numbers.
In the advanced FFTW interface, the physical dimensions n from which
the indices are computed can be different from (larger than) the logical
dimensions of the transform to be computed, in order to transform a
subset of a larger array. Note also that, in the advanced interface,
the expression above is multiplied by a "stride" to get the actual array
index--this is useful in situations where each element of the
multi-dimensional array is actually a data structure (or another array),
and you just want to transform a single field. In the basic interface,
however, the stride is 1.

File: fftw3.info, Node: Column-major Format, Next: Fixed-size Arrays in C, Prev: Row-major Format, Up: Multi-dimensional Array Format
3.2.2 Column-major Format
-------------------------
Readers from the Fortran world are used to arrays stored in
"column-major" order (sometimes called "Fortran order"). This is
essentially the exact opposite of row-major order in that, here, the
_first_ dimension's index varies most quickly.
If you have an array stored in column-major order and wish to
transform it using FFTW, it is quite easy to do. When creating the
plan, simply pass the dimensions of the array to the planner in _reverse
order_. For example, if your array is a rank three 'N x M x L' matrix
in column-major order, you should pass the dimensions of the array as if
it were an 'L x M x N' matrix (which it is, from the perspective of
FFTW). This is done for you _automatically_ by the FFTW legacy-Fortran
interface (*note Calling FFTW from Legacy Fortran::), but you must do it
manually with the modern Fortran interface (*note Reversing array
dimensions::).
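To see why reversing the dimensions works, compare the two storage orders directly. In the sketch below (both helper names are ours, not FFTW's), the column-major position of element (i0, i1, i2) in an 'N x M x L' array equals the row-major position of element (i2, i1, i0) in an 'L x M x N' array.

```c
#include <stddef.h>

/* Column-major position: the first index varies fastest. */
size_t col_major_index(int rank, const int *n, const int *i)
{
    size_t pos = 0;
    for (int d = rank - 1; d >= 0; --d)
        pos = pos * (size_t)n[d] + (size_t)i[d];
    return pos;
}

/* Reverse a dimension array in place, e.g. before handing the
   dimensions of a column-major array to the planner. */
void reverse_dims(int rank, int *n)
{
    for (int a = 0, b = rank - 1; a < b; ++a, --b) {
        int tmp = n[a];
        n[a] = n[b];
        n[b] = tmp;
    }
}
```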

File: fftw3.info, Node: Fixed-size Arrays in C, Next: Dynamic Arrays in C, Prev: Column-major Format, Up: Multi-dimensional Array Format
3.2.3 Fixed-size Arrays in C
----------------------------
A multi-dimensional array whose size is declared at compile time in C is
_already_ in row-major order. You don't have to do anything special to
transform it. For example:
     {
          fftw_complex data[N0][N1][N2];
          fftw_plan plan;
          ...
          plan = fftw_plan_dft_3d(N0, N1, N2, &data[0][0][0], &data[0][0][0],
                                  FFTW_FORWARD, FFTW_ESTIMATE);
          ...
     }
This will plan a 3d in-place transform of size 'N0 x N1 x N2'.
Notice how we took the address of the zero-th element to pass to the
planner (we could also have used a typecast).
However, we tend to _discourage_ users from declaring their arrays in
this way, for two reasons. First, this allocates the array on the stack
("automatic" storage), which has a very limited size on most operating
systems (declaring an array with more than a few thousand elements will
often cause a crash). (You can get around this limitation on many
systems by declaring the array as 'static' and/or global, but that has
its own drawbacks.) Second, it may not optimally align the array for
use with a SIMD FFTW (*note SIMD alignment and fftw_malloc::). Instead,
we recommend using 'fftw_malloc', as described below.

File: fftw3.info, Node: Dynamic Arrays in C, Next: Dynamic Arrays in C-The Wrong Way, Prev: Fixed-size Arrays in C, Up: Multi-dimensional Array Format
3.2.4 Dynamic Arrays in C
-------------------------
We recommend allocating most arrays dynamically, with 'fftw_malloc'.
This isn't too hard to do, although it is not as straightforward for
multi-dimensional arrays as it is for one-dimensional arrays.
Creating the array is simple: using a dynamic-allocation routine like
'fftw_malloc', allocate an array big enough to store N 'fftw_complex'
values (for a complex DFT), where N is the product of the sizes of the
array dimensions (i.e. the total number of complex values in the
array). For example, here is code to allocate a 5 x 12 x 27 rank-3
array:
fftw_complex *an_array;
an_array = (fftw_complex*) fftw_malloc(5*12*27 * sizeof(fftw_complex));
Accessing the array elements, however, is trickier--you can't
simply use multiple applications of the '[]' operator as you could for
fixed-size arrays. Instead, you have to explicitly compute the offset
into the array using the formula given earlier for row-major arrays.
For example, to reference the (i,j,k)-th element of the array allocated
above, you would use the expression 'an_array[k + 27 * (j + 12 * i)]'.
This pain can be alleviated somewhat by defining appropriate macros,
or, in C++, creating a class and overloading the '()' operator. The
C99 standard provides a way to reinterpret the dynamic array as a
"variable-length" multi-dimensional array amenable to '[]', but this
feature became optional in C11 and is not supported by every compiler.
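Both approaches might look like the following sketch. The type 'cplx_sketch' is a stand-in with the same '[2]' layout the manual gives for 'fftw_complex', and 'IDX3' and 'read_via_vla' are hypothetical names of our own. The second half compiles only where the compiler supports variable-length array types.

```c
#include <stddef.h>

/* Stand-in for fftw_complex: a (real, imaginary) pair of doubles. */
typedef double cplx_sketch[2];

/* Row-major offset of element (i,j,k) in an n0 x n1 x n2 array. */
#define IDX3(i, j, k, n1, n2) \
    ((size_t)(k) + (size_t)(n2) * ((size_t)(j) + (size_t)(n1) * (size_t)(i)))

/* View the same contiguous block through a pointer to a variable-length
   array type, so that '[]' can be applied once per dimension.
   Returns the real part of element (i,j,k). */
double read_via_vla(cplx_sketch *block, size_t n1, size_t n2,
                    size_t i, size_t j, size_t k)
{
    cplx_sketch (*view)[n1][n2] = (cplx_sketch (*)[n1][n2])block;
    return view[i][j][k][0];
}
```

With the macro, the element written as 'block[IDX3(i, j, k, 12, 27)]' is the same one that 'view[i][j][k]' names; no data is copied in either case.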

File: fftw3.info, Node: Dynamic Arrays in C-The Wrong Way, Prev: Dynamic Arrays in C, Up: Multi-dimensional Array Format
3.2.5 Dynamic Arrays in C--The Wrong Way
----------------------------------------
A different method for allocating multi-dimensional arrays in C is often
suggested that is incompatible with FFTW: _using it will cause FFTW to
die a painful death_. We discuss the technique here, however, because
it is so commonly known and used. This method is to create arrays of
pointers to arrays of pointers to ... etcetera. For example, the
analogue in this method to the example above is:
     int i,j;
     fftw_complex ***a_bad_array;  /* another way to make a 5x12x27 array */

     a_bad_array = (fftw_complex ***) malloc(5 * sizeof(fftw_complex **));
     for (i = 0; i < 5; ++i) {
          a_bad_array[i] =
             (fftw_complex **) malloc(12 * sizeof(fftw_complex *));
          for (j = 0; j < 12; ++j)
               a_bad_array[i][j] =
                     (fftw_complex *) malloc(27 * sizeof(fftw_complex));
     }
As you can see, this sort of array is inconvenient to allocate (and
deallocate). On the other hand, it has the advantage that the
(i,j,k)-th element can be referenced simply by 'a_bad_array[i][j][k]'.
If you like this technique and want to maximize convenience in
accessing the array, but still want to pass the array to FFTW, you can
use a hybrid method. Allocate the array as one contiguous block, but
also declare an array of arrays of pointers that point to appropriate
places in the block. That sort of trick is beyond the scope of this
documentation; for more information on multi-dimensional arrays in C,
see the 'comp.lang.c' FAQ (http://c-faq.com/aryptr/dynmuldimary.html).
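For readers who want a starting point anyway, here is one possible sketch of the hybrid method for a rank-3 array. Everything in it is an assumption made for illustration: 'cplx_sketch' stands in for 'fftw_complex', the struct and function names are ours, plain 'malloc' is used where real code would more likely call 'fftw_malloc' for the data block, and all dimensions are assumed to be at least 1.

```c
#include <stdlib.h>

typedef double cplx_sketch[2];   /* stand-in for fftw_complex */

typedef struct {
    cplx_sketch *data;    /* one contiguous block: this is what FFTW would see */
    cplx_sketch ***idx;   /* idx[i][j][k] aliases data[k + n2*(j + n1*i)] */
} hybrid3d;

/* Returns nonzero on success. */
int hybrid3d_alloc(hybrid3d *h, size_t n0, size_t n1, size_t n2)
{
    cplx_sketch **rows = malloc(n0 * n1 * sizeof *rows);
    h->data = malloc(n0 * n1 * n2 * sizeof *h->data);
    h->idx = malloc(n0 * sizeof *h->idx);
    if (!rows || !h->data || !h->idx) {
        free(rows);
        free(h->data);
        free(h->idx);
        h->data = NULL;
        h->idx = NULL;
        return 0;
    }
    for (size_t i = 0; i < n0; ++i) {
        h->idx[i] = rows + i * n1;
        for (size_t j = 0; j < n1; ++j)
            h->idx[i][j] = h->data + n2 * (j + n1 * i);
    }
    return 1;
}

void hybrid3d_free(hybrid3d *h)
{
    if (h->idx)
        free(h->idx[0]);   /* the single block of row pointers */
    free(h->idx);
    free(h->data);
}
```

Only three allocations are made regardless of the array size, and 'h.data' remains a valid row-major block in the sense of the earlier sections.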

File: fftw3.info, Node: Words of Wisdom-Saving Plans, Next: Caveats in Using Wisdom, Prev: Multi-dimensional Array Format, Up: Other Important Topics
3.3 Words of Wisdom--Saving Plans
=================================
FFTW implements a method for saving plans to disk and restoring them.
In fact, what FFTW does is more general than just saving and loading
plans. The mechanism is called "wisdom". Here, we describe this
feature at a high level. *Note FFTW Reference::, for a less casual but
more complete discussion of how to use wisdom in FFTW.
Plans created with the 'FFTW_MEASURE', 'FFTW_PATIENT', or
'FFTW_EXHAUSTIVE' options produce near-optimal FFT performance, but may
require a long time to compute because FFTW must measure the runtime of
many possible plans and select the best one. This setup is designed for
the situations where so many transforms of the same size must be
computed that the start-up time is irrelevant. For short initialization
times, but slower transforms, we have provided 'FFTW_ESTIMATE'. The
'wisdom' mechanism is a way to get the best of both worlds: you compute
a good plan once, save it to disk, and later reload it as many times as
necessary. The wisdom mechanism can actually save and reload many plans
at once, not just one.
Whenever you create a plan, the FFTW planner accumulates wisdom,
which is information sufficient to reconstruct the plan. After
planning, you can save this information to disk by means of the
function:
int fftw_export_wisdom_to_filename(const char *filename);
(This function returns non-zero on success.)
The next time you run the program, you can restore the wisdom with
'fftw_import_wisdom_from_filename' (which also returns non-zero on
success), and then recreate the plan using the same flags as before.
int fftw_import_wisdom_from_filename(const char *filename);
Wisdom is automatically used for any size to which it is applicable,
as long as the planner flags are not more "patient" than those with
which the wisdom was created. For example, wisdom created with
'FFTW_MEASURE' can be used if you later plan with 'FFTW_ESTIMATE' or
'FFTW_MEASURE', but not with 'FFTW_PATIENT'.
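The rule can be modeled in a few lines. This is a toy model of the stated behavior, not FFTW's implementation: the enum below is hypothetical and simply orders the four planner options from least to most patient.

```c
/* Hypothetical ordering of the planner options, least patient first. */
typedef enum {
    LEVEL_ESTIMATE,
    LEVEL_MEASURE,
    LEVEL_PATIENT,
    LEVEL_EXHAUSTIVE
} patience_sketch;

/* Nonzero if wisdom created at level 'created_with' is usable when
   planning at level 'planning_with': the new plan must not ask for
   more patience than the wisdom was created with. */
int wisdom_applies(patience_sketch created_with, patience_sketch planning_with)
{
    return planning_with <= created_with;
}
```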
The 'wisdom' is cumulative, and is stored in a global, private data
structure managed internally by FFTW. The storage space required is
minimal, proportional to the logarithm of the sizes the wisdom was
generated from. If memory usage is a concern, however, the wisdom can
be forgotten and its associated memory freed by calling:
void fftw_forget_wisdom(void);
Wisdom can be exported to a file, a string, or any other medium. For
details, see *note Wisdom::.

File: fftw3.info, Node: Caveats in Using Wisdom, Prev: Words of Wisdom-Saving Plans, Up: Other Important Topics
3.4 Caveats in Using Wisdom
===========================
For in much wisdom is much grief, and he that increaseth knowledge
increaseth sorrow. [Ecclesiastes 1:18]
There are pitfalls to using wisdom, in that it can negate FFTW's
ability to adapt to changing hardware and other conditions. For
example, it would be perfectly possible to export wisdom from a program
running on one processor and import it into a program running on another
processor. Doing so, however, would mean that the second program would
use plans optimized for the first processor, instead of the one it is
running on.
It should be safe to reuse wisdom as long as the hardware and program
binaries remain unchanged. (Actually, the optimal plan may change even
between runs of the same binary on identical hardware, due to
differences in the virtual memory environment, etcetera. Users
seriously interested in performance should worry about this problem,
too.) It is likely that, if the same wisdom is used for two different
program binaries, even running on the same machine, the plans may be
sub-optimal because of differing code alignments. It is therefore wise
to recreate wisdom every time an application is recompiled. The more
the underlying hardware and software changes between the creation of
wisdom and its use, the greater grows the risk of sub-optimal plans.
Nevertheless, if the choice is between using 'FFTW_ESTIMATE' or using
possibly-suboptimal wisdom (created on the same machine, but for a
different binary), the wisdom is likely to be better. For this reason,
we provide a function to import wisdom from a standard system-wide
location ('/etc/fftw/wisdom' on Unix):
int fftw_import_system_wisdom(void);
FFTW also provides a standalone program, 'fftw-wisdom' (described by
its own 'man' page on Unix) with which users can create wisdom, e.g.
for a canonical set of sizes to store in the system wisdom file. *Note
Wisdom Utilities::.

File: fftw3.info, Node: FFTW Reference, Next: Multi-threaded FFTW, Prev: Other Important Topics, Up: Top
4 FFTW Reference
****************
This chapter provides a complete reference for all sequential (i.e.,
one-processor) FFTW functions. Parallel transforms are described in
later chapters.
* Menu:
* Data Types and Files::
* Using Plans::
* Basic Interface::
* Advanced Interface::
* Guru Interface::
* New-array Execute Functions::
* Wisdom::
* What FFTW Really Computes::

File: fftw3.info, Node: Data Types and Files, Next: Using Plans, Prev: FFTW Reference, Up: FFTW Reference
4.1 Data Types and Files
========================
All programs using FFTW should include its header file:
#include <fftw3.h>
You must also link to the FFTW library. On Unix, this means adding
'-lfftw3 -lm' at the _end_ of the link command.
* Menu:
* Complex numbers::
* Precision::
* Memory Allocation::

File: fftw3.info, Node: Complex numbers, Next: Precision, Prev: Data Types and Files, Up: Data Types and Files
4.1.1 Complex numbers
---------------------
The default FFTW interface uses 'double' precision for all
floating-point numbers, and defines a 'fftw_complex' type to hold
complex numbers as:
typedef double fftw_complex[2];
Here, the '[0]' element holds the real part and the '[1]' element
holds the imaginary part.
Alternatively, if you have a C compiler (such as 'gcc') that supports
the C99 revision of the ANSI C standard, you can use C's new native
complex type (which is binary-compatible with the typedef above). In
particular, if you '#include <complex.h>' _before_ '<fftw3.h>', then
'fftw_complex' is defined to be the native complex type and you can
manipulate it with ordinary arithmetic (e.g. 'x = y * (3+4*I)', where
'x' and 'y' are 'fftw_complex' and 'I' is the standard symbol for the
imaginary unit).
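The binary compatibility mentioned above can be observed without FFTW at all. The sketch below uses only standard C99; 'complex_to_pair' is our own helper name. It copies the bytes of a native complex value into a two-element array of the kind used by the typedef above.

```c
#include <complex.h>
#include <string.h>

/* Copy a C99 complex value, byte for byte, into a (real, imaginary)
   pair: pair[0] receives the real part and pair[1] the imaginary part. */
void complex_to_pair(double complex z, double pair[2])
{
    memcpy(pair, &z, sizeof z);
}
```

Called with '3.0 + 4.0*I', this leaves 3.0 in 'pair[0]' and 4.0 in 'pair[1]'.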
C++ has its own 'complex<T>' template class, defined in the standard
'<complex>' header file. Reportedly, the C++ standards committee has
recently agreed to mandate that the storage format used for this type be
binary-compatible with the C99 type, i.e. an array 'T[2]' with
consecutive real '[0]' and imaginary '[1]' parts. (See report
<http://www.open-std.org/jtc1/sc22/WG21/docs/papers/2002/n1388.pdf
WG21/N1388>.) Although not part of the official standard as of this
writing, the proposal stated that: "This solution has been tested with
all current major implementations of the standard library and shown to
be working." To the extent that this is true, if you have a variable
'complex<double> *x', you can pass it directly to FFTW via
'reinterpret_cast<fftw_complex*>(x)'.

File: fftw3.info, Node: Precision, Next: Memory Allocation, Prev: Complex numbers, Up: Data Types and Files
4.1.2 Precision
---------------
You can install single and long-double precision versions of FFTW, which
replace 'double' with 'float' and 'long double', respectively (*note
Installation and Customization::). To use these interfaces, you:
* Link to the single/long-double libraries; on Unix, '-lfftw3f' or
'-lfftw3l' instead of (or in addition to) '-lfftw3'. (You can link
to the different-precision libraries simultaneously.)
* Include the _same_ '<fftw3.h>' header file.
* Replace all lowercase instances of 'fftw_' with 'fftwf_' or
'fftwl_' for single or long-double precision, respectively.
('fftw_complex' becomes 'fftwf_complex', 'fftw_execute' becomes
'fftwf_execute', etcetera.)
* Uppercase names, i.e. names beginning with 'FFTW_', remain the
same.
* Replace 'double' with 'float' or 'long double' for subroutine
parameters.
Depending upon your compiler and/or hardware, 'long double' may not
be any more precise than 'double' (or may not be supported at all,
although it is standard in C99).
We also support using the nonstandard '__float128'
quadruple-precision type provided by recent versions of 'gcc' on 32- and
64-bit x86 hardware (*note Installation and Customization::). To use
this type, link with '-lfftw3q -lquadmath -lm' (the 'libquadmath'
library provided by 'gcc' is needed for quadruple-precision
trigonometric functions) and use 'fftwq_' identifiers.

File: fftw3.info, Node: Memory Allocation, Prev: Precision, Up: Data Types and Files
4.1.3 Memory Allocation
-----------------------
void *fftw_malloc(size_t n);
void fftw_free(void *p);
These are functions that behave identically to 'malloc' and 'free',
except that they guarantee that the returned pointer obeys any special
alignment restrictions imposed by any algorithm in FFTW (e.g. for SIMD
acceleration). *Note SIMD alignment and fftw_malloc::.
Data allocated by 'fftw_malloc' _must_ be deallocated by 'fftw_free'
and not by the ordinary 'free'.
These routines simply call through to your operating system's
'malloc' or, if necessary, its aligned equivalent (e.g. 'memalign'), so
you normally need not worry about any significant time or space
overhead. You are _not required_ to use them to allocate your data, but
we strongly recommend it.
Note: in C++, just as with ordinary 'malloc', you must typecast the
output of 'fftw_malloc' to whatever pointer type you are allocating.
We also provide the following two convenience functions to allocate
real and complex arrays with 'n' elements, which are equivalent to
'(double *) fftw_malloc(sizeof(double) * n)' and '(fftw_complex *)
fftw_malloc(sizeof(fftw_complex) * n)', respectively:
double *fftw_alloc_real(size_t n);
fftw_complex *fftw_alloc_complex(size_t n);
The equivalent functions in other precisions allocate arrays of 'n'
elements in that precision; e.g. 'fftwf_alloc_real(n)' is equivalent
to '(float *) fftwf_malloc(sizeof(float) * n)'.

File: fftw3.info, Node: Using Plans, Next: Basic Interface, Prev: Data Types and Files, Up: FFTW Reference
4.2 Using Plans
===============
Plans for all transform types in FFTW are stored as type 'fftw_plan' (an
opaque pointer type), and are created by one of the various planning
routines described in the following sections. An 'fftw_plan' contains
all information necessary to compute the transform, including the
pointers to the input and output arrays.
void fftw_execute(const fftw_plan plan);
This executes the 'plan', to compute the corresponding transform on
the arrays for which it was planned (which must still exist). The plan
is not modified, and 'fftw_execute' can be called as many times as
desired.
To apply a given plan to a different array, you can use the new-array
execute interface. *Note New-array Execute Functions::.
'fftw_execute' (and equivalents) is the only function in FFTW
guaranteed to be thread-safe; see *note Thread safety::.
This function:
void fftw_destroy_plan(fftw_plan plan);
deallocates the 'plan' and all its associated data.
FFTW's planner saves some other persistent data, such as the
accumulated wisdom and a list of algorithms available in the current
configuration. If you want to deallocate all of that and reset FFTW to
the pristine state it was in when you started your program, you can
call:
void fftw_cleanup(void);
After calling 'fftw_cleanup', all existing plans become undefined,
and you should not attempt to execute them nor to destroy them. You can
however create and execute/destroy new plans, in which case FFTW starts
accumulating wisdom information again.
'fftw_cleanup' does not deallocate your plans, however. To prevent
memory leaks, you must still call 'fftw_destroy_plan' before executing
'fftw_cleanup'.
Occasionally, it may be useful to know FFTW's internal "cost" metric
that it uses to compare plans to one another; this cost is proportional
to an execution time of the plan, in undocumented units, if the plan was
created with the 'FFTW_MEASURE' or other timing-based options, or
alternatively is a heuristic cost function for 'FFTW_ESTIMATE' plans.
(The cost values of measured and estimated plans are not comparable,
being in different units. Also, costs from different FFTW versions or
the same version compiled differently may not be in the same units.
Plans created from wisdom have a cost of 0 since no timing measurement
is performed for them. Finally, certain problems for which only one
top-level algorithm was possible may have required no measurements of
the cost of the whole plan, in which case 'fftw_cost' will also return
0.) The cost metric for a given plan is returned by:
double fftw_cost(const fftw_plan plan);
The following two routines are provided purely for academic purposes
(that is, for entertainment).
void fftw_flops(const fftw_plan plan,
double *add, double *mul, double *fma);
Given a 'plan', set 'add', 'mul', and 'fma' to an exact count of the
number of floating-point additions, multiplications, and fused
multiply-add operations involved in the plan's execution. The total
number of floating-point operations (flops) is 'add + mul + 2*fma', or
'add + mul + fma' if the hardware supports fused multiply-add
instructions (although the number of FMA operations is only approximate
because of compiler voodoo). (The number of operations should be an
integer, but we use 'double' to avoid overflowing 'int' for large
transforms; the arguments are of type 'double' even for single and
long-double precision versions of FFTW.)
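The two totals described above can be formed from the three counters as in the following minimal sketch. The helper 'total_flops' and its 'hardware_has_fma' parameter are hypothetical names used for illustration; they are not part of FFTW.

```c
/* Hypothetical helper (not part of FFTW): combine the three counters
   reported by fftw_flops into a single operation count. */
double total_flops(double add, double mul, double fma, int hardware_has_fma)
{
    /* Without FMA hardware, each fused multiply-add counts as two
       operations (one multiplication and one addition). */
    return add + mul + (hardware_has_fma ? fma : 2.0 * fma);
}
```

For instance, counters of 10 additions, 4 multiplications, and 3 fused multiply-adds give 20 operations without FMA hardware and 17 with it.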
void fftw_fprint_plan(const fftw_plan plan, FILE *output_file);
void fftw_print_plan(const fftw_plan plan);
char *fftw_sprint_plan(const fftw_plan plan);
This outputs a "nerd-readable" representation of the 'plan' to the
given file, to 'stdout', or to a newly allocated NUL-terminated string
(which the caller is responsible for deallocating with 'free'),
respectively.

File: fftw3.info, Node: Basic Interface, Next: Advanced Interface, Prev: Using Plans, Up: FFTW Reference
4.3 Basic Interface
===================
Recall that the FFTW API is divided into three parts(1): the "basic
interface" computes a single transform of contiguous data, the "advanced
interface" computes transforms of multiple or strided arrays, and the
"guru interface" supports the most general data layouts, multiplicities,
and strides. This section describes the basic interface, which we
expect to satisfy the needs of most users.
* Menu:
* Complex DFTs::
* Planner Flags::
* Real-data DFTs::
* Real-data DFT Array Format::
* Real-to-Real Transforms::
* Real-to-Real Transform Kinds::
---------- Footnotes ----------
(1) Gallia est omnis divisa in partes tres (Julius Caesar).

File: fftw3.info, Node: Complex DFTs, Next: Planner Flags, Prev: Basic Interface, Up: Basic Interface
4.3.1 Complex DFTs
------------------
fftw_plan fftw_plan_dft_1d(int n0,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
fftw_plan fftw_plan_dft_2d(int n0, int n1,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
fftw_plan fftw_plan_dft_3d(int n0, int n1, int n2,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
fftw_plan fftw_plan_dft(int rank, const int *n,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
Plan a complex input/output discrete Fourier transform (DFT) in zero
or more dimensions, returning an 'fftw_plan' (*note Using Plans::).
Once you have created a plan for a certain transform type and
parameters, then creating another plan of the same type and parameters,
but for different arrays, is fast and shares constant data with the
first plan (if it still exists).
The planner returns 'NULL' if the plan cannot be created. In the
standard FFTW distribution, the basic interface is guaranteed to return
a non-'NULL' plan. A plan may be 'NULL', however, if you are using a
customized FFTW configuration supporting a restricted set of transforms.
Arguments
.........
* 'rank' is the rank of the transform (it should be the size of the
array '*n'), and can be any non-negative integer. (*Note Complex
Multi-Dimensional DFTs::, for the definition of "rank".) The
'_1d', '_2d', and '_3d' planners correspond to a 'rank' of '1',
'2', and '3', respectively. The rank may be zero, which is
equivalent to a rank-1 transform of size 1, i.e. a copy of one
number from input to output.
* 'n0', 'n1', 'n2', or 'n[0..rank-1]' (as appropriate for each
routine) specify the size of the transform dimensions. They can be
any positive integer.
- Multi-dimensional arrays are stored in row-major order with
dimensions: 'n0' x 'n1'; or 'n0' x 'n1' x 'n2'; or 'n[0]' x
'n[1]' x ... x 'n[rank-1]'. *Note Multi-dimensional Array
Format::.
- FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
11^e 13^f, where e+f is either 0 or 1, and the other exponents
are arbitrary. Other sizes are computed by means of a slow,
general-purpose algorithm (which nevertheless retains O(n log
n) performance even for prime sizes). It is possible to
customize FFTW for different array sizes; see *note
Installation and Customization::. Transforms whose sizes are
powers of 2 are especially fast.
* 'in' and 'out' point to the input and output arrays of the
transform, which may be the same (yielding an in-place transform).
These arrays are overwritten during planning, unless
'FFTW_ESTIMATE' is used in the flags. (The arrays need not be
initialized, but they must be allocated.)
If 'in == out', the transform is "in-place" and the input array is
overwritten. If 'in != out', the two arrays must not overlap (but
FFTW does not check for this condition).
* 'sign' is the sign of the exponent in the formula that defines the
Fourier transform. It can be -1 (= 'FFTW_FORWARD') or +1 (=
'FFTW_BACKWARD').
* 'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.
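The size rule in the list above (sizes of the form 2^a 3^b 5^c 7^d 11^e 13^f with e+f equal to 0 or 1) can be checked before planning. The following is a minimal sketch; 'is_fftw_friendly_size' is a hypothetical helper, not an FFTW function, and it only tests the factorization stated in the text.

```c
/* Hypothetical helper (not part of FFTW): return 1 if n has the form
   2^a 3^b 5^c 7^d 11^e 13^f with e + f equal to 0 or 1, and 0 otherwise. */
int is_fftw_friendly_size(int n)
{
    static const int small_primes[] = {2, 3, 5, 7};
    int large_factors = 0; /* combined number of factors 11 and 13 */
    int i;

    if (n <= 0)
        return 0;
    /* Strip every factor of 2, 3, 5, and 7; their exponents are arbitrary. */
    for (i = 0; i < 4; ++i)
        while (n % small_primes[i] == 0)
            n /= small_primes[i];
    /* At most one factor of 11 or 13 is allowed in total. */
    while (n % 11 == 0) { n /= 11; ++large_factors; }
    while (n % 13 == 0) { n /= 13; ++large_factors; }
    return n == 1 && large_factors <= 1;
}
```

Sizes that fail this test are still valid transform sizes; they are merely handled by the slower general-purpose algorithm.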
FFTW computes an unnormalized transform: computing a forward followed
by a backward transform (or vice versa) will result in the original data
multiplied by the size of the transform (the product of the dimensions).
For more information, see *note What FFTW Really Computes::.
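Because the transform is unnormalized, code that performs a forward and then a backward transform usually divides the result by the transform size. The following is a minimal sketch of that scaling step; 'undo_roundtrip_scaling' is a hypothetical helper, not part of FFTW, and it treats the data as a flat array of 'double' values.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): divide every value of a
   round-tripped array by the transform size to recover the original data.
   `values` holds `count` doubles; for complex data, pass the interleaved
   real and imaginary parts (count = 2 * number of complex elements).
   `rank` and `n` are the dimensions that were given to the planner. */
void undo_roundtrip_scaling(double *values, size_t count,
                            int rank, const int *n)
{
    double size = 1.0; /* product of the dimensions; 1 for rank 0 */
    size_t i;
    int d;

    for (d = 0; d < rank; ++d)
        size *= (double)n[d];
    for (i = 0; i < count; ++i)
        values[i] /= size;
}
```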

File: fftw3.info, Node: Planner Flags, Next: Real-data DFTs, Prev: Complex DFTs, Up: Basic Interface
4.3.2 Planner Flags
-------------------
All of the planner routines in FFTW accept an integer 'flags' argument,
which is a bitwise OR ('|') of zero or more of the flag constants
defined below. These flags control the rigor (and time) of the planning
process, and can also impose (or lift) restrictions on the type of
transform algorithm that is employed.
_Important:_ the planner overwrites the input array during planning
unless a saved plan (*note Wisdom::) is available for that problem, so
you should initialize your input data after creating the plan. The only
exceptions to this are the 'FFTW_ESTIMATE' and 'FFTW_WISDOM_ONLY' flags,
as mentioned below.
In all cases, if wisdom is available for the given problem that was
created with equal-or-greater planning rigor, then the more rigorous
wisdom is used. For example, in 'FFTW_ESTIMATE' mode any available
wisdom is used, whereas in 'FFTW_PATIENT' mode only wisdom created in
patient or exhaustive mode can be used. *Note Words of Wisdom-Saving
Plans::.
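The rule above amounts to a comparison of rigor levels. The following sketch models it with a hypothetical enumeration; the names and numeric values are illustrative only and are not FFTW's flag constants, nor a description of FFTW's internals.

```c
/* Illustrative ordering of planning rigor. These are NOT the values of
   the FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT, or FFTW_EXHAUSTIVE flags. */
enum rigor_sketch {
    RIGOR_ESTIMATE = 0,
    RIGOR_MEASURE = 1,
    RIGOR_PATIENT = 2,
    RIGOR_EXHAUSTIVE = 3
};

/* Return 1 if wisdom recorded at `wisdom_rigor` may be used for a plan
   requested at `requested_rigor`: the wisdom must have been created with
   equal or greater rigor. */
int wisdom_is_usable(enum rigor_sketch requested_rigor,
                     enum rigor_sketch wisdom_rigor)
{
    return wisdom_rigor >= requested_rigor;
}
```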
Planning-rigor flags
....................
* 'FFTW_ESTIMATE' specifies that, instead of actual measurements of
different algorithms, a simple heuristic is used to pick a
(probably sub-optimal) plan quickly. With this flag, the
input/output arrays are not overwritten during planning.
* 'FFTW_MEASURE' tells FFTW to find an optimized plan by actually
_computing_ several FFTs and measuring their execution time.
Depending on your machine, this can take some time (often a few
seconds). 'FFTW_MEASURE' is the default planning option.
* 'FFTW_PATIENT' is like 'FFTW_MEASURE', but considers a wider range
of algorithms and often produces a "more optimal" plan (especially
for large transforms), but at the expense of several times longer
planning time (especially for large transforms).
* 'FFTW_EXHAUSTIVE' is like 'FFTW_PATIENT', but considers an even
wider range of algorithms, including many that we think are
unlikely to be fast, to produce the most optimal plan but with a
substantially increased planning time.
* 'FFTW_WISDOM_ONLY' is a special planning mode in which the plan is
only created if wisdom is available for the given problem, and
otherwise a 'NULL' plan is returned. This can be combined with
other flags, e.g. 'FFTW_WISDOM_ONLY | FFTW_PATIENT' creates a plan
only if wisdom is available that was created in 'FFTW_PATIENT' or
'FFTW_EXHAUSTIVE' mode. The 'FFTW_WISDOM_ONLY' flag is intended
for users who need to detect whether wisdom is available; for
example, if wisdom is not available one may wish to allocate new
arrays for planning so that user data is not overwritten.
Algorithm-restriction flags
...........................
* 'FFTW_DESTROY_INPUT' specifies that an out-of-place transform is
allowed to _overwrite its input_ array with arbitrary data; this
can sometimes allow more efficient algorithms to be employed.
* 'FFTW_PRESERVE_INPUT' specifies that an out-of-place transform must
_not change its input_ array. This is ordinarily the _default_,
except for c2r and hc2r (i.e. complex-to-real) transforms for
which 'FFTW_DESTROY_INPUT' is the default. In the latter cases,
passing 'FFTW_PRESERVE_INPUT' will attempt to use algorithms that
do not destroy the input, at the expense of worse performance; for
multi-dimensional c2r transforms, however, no input-preserving
algorithms are implemented and the planner will return 'NULL' if
one is requested.
* 'FFTW_UNALIGNED' specifies that the algorithm may not impose any
unusual alignment requirements on the input/output arrays (i.e. no
SIMD may be used). This flag is normally _not necessary_, since
the planner automatically detects misaligned arrays. The only use
for this flag is if you want to use the new-array execute interface
to execute a given plan on a different array that may not be
aligned like the original. (Using 'fftw_malloc' makes this flag
unnecessary even then. You can also use 'fftw_alignment_of' to
detect whether two arrays are equivalently aligned.)
Limiting planning time
......................
extern void fftw_set_timelimit(double seconds);
This function instructs FFTW to spend at most 'seconds' seconds
(approximately) in the planner. If 'seconds == FFTW_NO_TIMELIMIT' (the
default value, which is negative), then planning time is unbounded.
Otherwise, FFTW plans with a progressively wider range of algorithms
until the given time limit is reached or the given range of algorithms
is explored, returning the best available plan.
For example, specifying 'FFTW_PATIENT' first plans in 'FFTW_ESTIMATE'
mode, then in 'FFTW_MEASURE' mode, then finally (time permitting) in
'FFTW_PATIENT'. If 'FFTW_EXHAUSTIVE' is specified instead, the planner
will further progress to 'FFTW_EXHAUSTIVE' mode.
Note that the 'seconds' argument specifies only a rough limit; in
practice, the planner may use somewhat more time if the time limit is
reached when the planner is in the middle of an operation that cannot be
interrupted. At the very least, the planner will complete planning in
'FFTW_ESTIMATE' mode (which is thus equivalent to a time limit of 0).

File: fftw3.info, Node: Real-data DFTs, Next: Real-data DFT Array Format, Prev: Planner Flags, Up: Basic Interface
4.3.3 Real-data DFTs
--------------------
fftw_plan fftw_plan_dft_r2c_1d(int n0,
double *in, fftw_complex *out,
unsigned flags);
fftw_plan fftw_plan_dft_r2c_2d(int n0, int n1,
double *in, fftw_complex *out,
unsigned flags);
fftw_plan fftw_plan_dft_r2c_3d(int n0, int n1, int n2,
double *in, fftw_complex *out,
unsigned flags);
fftw_plan fftw_plan_dft_r2c(int rank, const int *n,
double *in, fftw_complex *out,
unsigned flags);
Plan a real-input/complex-output discrete Fourier transform (DFT) in
zero or more dimensions, returning an 'fftw_plan' (*note Using Plans::).
Once you have created a plan for a certain transform type and
parameters, then creating another plan of the same type and parameters,
but for different arrays, is fast and shares constant data with the
first plan (if it still exists).
The planner returns 'NULL' if the plan cannot be created. A
non-'NULL' plan is always returned by the basic interface unless you are
using a customized FFTW configuration supporting a restricted set of
transforms, or if you use the 'FFTW_PRESERVE_INPUT' flag with a
multi-dimensional out-of-place c2r transform (see below).
Arguments
.........
* 'rank' is the rank of the transform (it should be the size of the
array '*n'), and can be any non-negative integer. (*Note Complex
Multi-Dimensional DFTs::, for the definition of "rank".) The
'_1d', '_2d', and '_3d' planners correspond to a 'rank' of '1',
'2', and '3', respectively. The rank may be zero, which is
equivalent to a rank-1 transform of size 1, i.e. a copy of one
real number (with zero imaginary part) from input to output.
* 'n0', 'n1', 'n2', or 'n[0..rank-1]' (as appropriate for each
routine) specify the size of the transform dimensions. They can be
any positive integer. This is different in general from the
_physical_ array dimensions, which are described in *note Real-data
DFT Array Format::.
- FFTW is best at handling sizes of the form 2^a 3^b 5^c 7^d
11^e 13^f, where e+f is either 0 or 1, and the other exponents
are arbitrary. Other sizes are computed by means of a slow,
general-purpose algorithm (which nevertheless retains O(n log
n) performance even for prime sizes). (It is possible to
customize FFTW for different array sizes; see *note
Installation and Customization::.) Transforms whose sizes are
powers of 2 are especially fast, and it is generally
beneficial for the _last_ dimension of an r2c/c2r transform to
be _even_.
* 'in' and 'out' point to the input and output arrays of the
transform, which may be the same (yielding an in-place transform).
These arrays are overwritten during planning, unless
'FFTW_ESTIMATE' is used in the flags. (The arrays need not be
initialized, but they must be allocated.) For an in-place
transform, it is important to remember that the real array will
require padding, described in *note Real-data DFT Array Format::.
* 'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.
The inverse transforms, taking complex input (storing the
non-redundant half of a logically Hermitian array) to real output, are
given by:
fftw_plan fftw_plan_dft_c2r_1d(int n0,
fftw_complex *in, double *out,
unsigned flags);
fftw_plan fftw_plan_dft_c2r_2d(int n0, int n1,
fftw_complex *in, double *out,
unsigned flags);
fftw_plan fftw_plan_dft_c2r_3d(int n0, int n1, int n2,
fftw_complex *in, double *out,
unsigned flags);
fftw_plan fftw_plan_dft_c2r(int rank, const int *n,
fftw_complex *in, double *out,
unsigned flags);
The arguments are the same as for the r2c transforms, except that the
input and output data formats are reversed.
FFTW computes an unnormalized transform: computing an r2c followed by
a c2r transform (or vice versa) will result in the original data
multiplied by the size of the transform (the product of the logical
dimensions). An r2c transform produces the same output as a
'FFTW_FORWARD' complex DFT of the same input, and a c2r transform is
correspondingly equivalent to 'FFTW_BACKWARD'. For more information,
see *note What FFTW Really Computes::.

File: fftw3.info, Node: Real-data DFT Array Format, Next: Real-to-Real Transforms, Prev: Real-data DFTs, Up: Basic Interface
4.3.4 Real-data DFT Array Format
--------------------------------
The output of a DFT of real data (r2c) contains symmetries that, in
principle, make half of the outputs redundant (*note What FFTW Really
Computes::). (Similarly for the input of an inverse c2r transform.) In
practice, it is not possible to entirely realize these savings in an
efficient and understandable format that generalizes to
multi-dimensional transforms. Instead, the output of the r2c transforms
is _slightly_ over half of the output of the corresponding complex
transform. We do not "pack" the data in any way, but store it as an
ordinary array of 'fftw_complex' values. In fact, this data is simply a
subsection of what would be the array in the corresponding complex
transform.
Specifically, for a real transform of d (= 'rank') dimensions n[0] x
n[1] x n[2] x ... x n[d-1] , the complex data is an n[0] x n[1] x n[2]
x ... x (n[d-1]/2 + 1) array of 'fftw_complex' values in row-major
order (with the division rounded down). That is, we only store the
_lower_ half (non-negative frequencies), plus one element, of the last
dimension of the data from the ordinary complex transform. (We could
have instead taken half of any other dimension, but implementation turns
out to be simpler if the last, contiguous, dimension is used.)
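When allocating the complex array, the element count follows directly from the dimensions given above. The following is a minimal sketch, assuming 'rank' is at least 1; 'r2c_complex_count' is a hypothetical helper, not an FFTW function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): number of fftw_complex elements
   in the r2c output (or c2r input) for logical dimensions n[0..rank-1].
   Assumes rank >= 1. */
size_t r2c_complex_count(int rank, const int *n)
{
    size_t count = 1;
    int d;

    /* All dimensions except the last keep their full size. */
    for (d = 0; d < rank - 1; ++d)
        count *= (size_t)n[d];
    /* Integer division rounds down, as the text requires. */
    count *= (size_t)(n[rank - 1] / 2 + 1);
    return count;
}
```

For a 4 x 6 real transform this gives 4 * (6/2 + 1) = 16 complex values.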
For an out-of-place transform, the real data is simply an array with
physical dimensions n[0] x n[1] x n[2] x ... x n[d-1] in row-major
order.
For an in-place transform, some complications arise since the complex
data is slightly larger than the real data. In this case, the final
dimension of the real data must be _padded_ with extra values to
accommodate the size of the complex data--two extra if the last
dimension is even and one if it is odd. That is, the last dimension of
the real data must physically contain 2 * (n[d-1]/2+1) 'double' values
(exactly enough to hold the complex data). This physical array size
does not, however, change the _logical_ array size--only n[d-1] values
are actually stored in the last dimension, and n[d-1] is the last
dimension passed to the planner.
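The padding rule for in-place transforms can be written as a short sketch. Both helpers below are hypothetical (not part of FFTW); the second shows, for the two-dimensional case, how the padded row length affects the location of a logical real element.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): physical length, in doubles, of
   the last dimension of the real array for an in-place r2c/c2r transform
   whose logical last dimension is n_last. */
int padded_last_dim(int n_last)
{
    return 2 * (n_last / 2 + 1);
}

/* Hypothetical helper: offset, in doubles, of logical real element (i, j)
   in an in-place two-dimensional array with logical dimensions n0 x n1.
   Consecutive rows are padded_last_dim(n1) doubles apart, not n1. */
size_t inplace_real_offset_2d(int i, int j, int n1)
{
    return (size_t)i * (size_t)padded_last_dim(n1) + (size_t)j;
}
```

A last dimension of 6 is padded to 8 (two extra values), and a last dimension of 7 is also padded to 8 (one extra value).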

File: fftw3.info, Node: Real-to-Real Transforms, Next: Real-to-Real Transform Kinds, Prev: Real-data DFT Array Format, Up: Basic Interface
4.3.5 Real-to-Real Transforms
-----------------------------
fftw_plan fftw_plan_r2r_1d(int n, double *in, double *out,
fftw_r2r_kind kind, unsigned flags);
fftw_plan fftw_plan_r2r_2d(int n0, int n1, double *in, double *out,
fftw_r2r_kind kind0, fftw_r2r_kind kind1,
unsigned flags);
fftw_plan fftw_plan_r2r_3d(int n0, int n1, int n2,
double *in, double *out,
fftw_r2r_kind kind0,
fftw_r2r_kind kind1,
fftw_r2r_kind kind2,
unsigned flags);
fftw_plan fftw_plan_r2r(int rank, const int *n, double *in, double *out,
const fftw_r2r_kind *kind, unsigned flags);
Plan a real input/output (r2r) transform of various kinds in zero or
more dimensions, returning an 'fftw_plan' (*note Using Plans::).
Once you have created a plan for a certain transform type and
parameters, then creating another plan of the same type and parameters,
but for different arrays, is fast and shares constant data with the
first plan (if it still exists).
The planner returns 'NULL' if the plan cannot be created. A
non-'NULL' plan is always returned by the basic interface unless you are
using a customized FFTW configuration supporting a restricted set of
transforms, or for size-1 'FFTW_REDFT00' kinds (which are not defined).
Arguments
.........
* 'rank' is the dimensionality of the transform (it should be the
size of the arrays '*n' and '*kind'), and can be any non-negative
integer. The '_1d', '_2d', and '_3d' planners correspond to a
'rank' of '1', '2', and '3', respectively. A 'rank' of zero is
equivalent to a copy of one number from input to output.
* 'n', or 'n0'/'n1'/'n2', or 'n[rank]', respectively, gives the
(physical) size of the transform dimensions. They can be any
positive integer.
- Multi-dimensional arrays are stored in row-major order with
dimensions: 'n0' x 'n1'; or 'n0' x 'n1' x 'n2'; or 'n[0]' x
'n[1]' x ... x 'n[rank-1]'. *Note Multi-dimensional Array
Format::.
- FFTW is generally best at handling sizes of the form 2^a 3^b
5^c 7^d 11^e 13^f, where e+f is either 0 or 1, and the other
exponents are arbitrary. Other sizes are computed by means of
a slow, general-purpose algorithm (which nevertheless retains
O(n log n) performance even for prime sizes). (It is possible
to customize FFTW for different array sizes; see *note
Installation and Customization::.) Transforms whose sizes are
powers of 2 are especially fast.
- For a 'REDFT00' or 'RODFT00' transform kind in a dimension of
size n, it is n-1 or n+1, respectively, that should be
factorizable in the above form.
* 'in' and 'out' point to the input and output arrays of the
transform, which may be the same (yielding an in-place transform).
These arrays are overwritten during planning, unless
'FFTW_ESTIMATE' is used in the flags. (The arrays need not be
initialized, but they must be allocated.)
* 'kind', or 'kind0'/'kind1'/'kind2', or 'kind[rank]', is the kind of
r2r transform used for the corresponding dimension. The valid kind
constants are described in *note Real-to-Real Transform Kinds::.
In a multi-dimensional transform, what is computed is the separable
product formed by taking each transform kind along the
corresponding dimension, one dimension after another.
* 'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.

File: fftw3.info, Node: Real-to-Real Transform Kinds, Prev: Real-to-Real Transforms, Up: Basic Interface
4.3.6 Real-to-Real Transform Kinds
----------------------------------
FFTW currently supports 11 different r2r transform kinds, specified by
one of the constants below. For the precise definitions of these
transforms, see *note What FFTW Really Computes::. For a more
colloquial introduction to these transform kinds, see *note More DFTs of
Real Data::.
For a dimension of size 'n', there is a corresponding "logical"
dimension 'N' that determines the normalization (and the optimal
factorization); the formula for 'N' is given for each kind below. Also,
with each transform kind is listed its corresponding inverse transform.
FFTW computes unnormalized transforms: a transform followed by its
inverse will result in the original data multiplied by 'N' (or the
product of the 'N''s for each dimension, in multi-dimensions).
* 'FFTW_R2HC' computes a real-input DFT with output in "halfcomplex"
format, i.e. real and imaginary parts for a transform of size 'n'
stored as: r0, r1, r2, ..., r(n/2), i((n+1)/2-1), ..., i2, i1. (Logical
'N=n', inverse is 'FFTW_HC2R'.)
* 'FFTW_HC2R' computes the reverse of 'FFTW_R2HC', above. (Logical
'N=n', inverse is 'FFTW_R2HC'.)
* 'FFTW_DHT' computes a discrete Hartley transform. (Logical 'N=n',
inverse is 'FFTW_DHT'.)
* 'FFTW_REDFT00' computes an REDFT00 transform, i.e. a DCT-I.
(Logical 'N=2*(n-1)', inverse is 'FFTW_REDFT00'.)
* 'FFTW_REDFT10' computes an REDFT10 transform, i.e. a DCT-II
(sometimes called "the" DCT). (Logical 'N=2*n', inverse is
'FFTW_REDFT01'.)
* 'FFTW_REDFT01' computes an REDFT01 transform, i.e. a DCT-III
(sometimes called "the" IDCT, being the inverse of DCT-II).
(Logical 'N=2*n', inverse is 'FFTW_REDFT10'.)
* 'FFTW_REDFT11' computes an REDFT11 transform, i.e. a DCT-IV.
(Logical 'N=2*n', inverse is 'FFTW_REDFT11'.)
* 'FFTW_RODFT00' computes an RODFT00 transform, i.e. a DST-I.
(Logical 'N=2*(n+1)', inverse is 'FFTW_RODFT00'.)
* 'FFTW_RODFT10' computes an RODFT10 transform, i.e. a DST-II.
(Logical 'N=2*n', inverse is 'FFTW_RODFT01'.)
* 'FFTW_RODFT01' computes an RODFT01 transform, i.e. a DST-III.
(Logical 'N=2*n', inverse is 'FFTW_RODFT10'.)
* 'FFTW_RODFT11' computes an RODFT11 transform, i.e. a DST-IV.
(Logical 'N=2*n', inverse is 'FFTW_RODFT11'.)
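The logical sizes listed above, and the position of the imaginary parts in the halfcomplex format, can be summarized in a minimal sketch. The enumeration 'r2r_kind_sketch' is a local stand-in so that the example compiles on its own; FFTW's own constants are the 'FFTW_*' values of type 'fftw_r2r_kind'. Both functions are hypothetical helpers, not part of FFTW.

```c
/* Local stand-in for illustration only (not FFTW's fftw_r2r_kind). */
enum r2r_kind_sketch {
    KIND_R2HC, KIND_HC2R, KIND_DHT,
    KIND_REDFT00, KIND_REDFT10, KIND_REDFT01, KIND_REDFT11,
    KIND_RODFT00, KIND_RODFT10, KIND_RODFT01, KIND_RODFT11
};

/* Logical size N for one dimension of physical size n. */
int r2r_logical_size(enum r2r_kind_sketch kind, int n)
{
    switch (kind) {
    case KIND_R2HC:
    case KIND_HC2R:
    case KIND_DHT:
        return n;
    case KIND_REDFT00:
        return 2 * (n - 1);
    case KIND_RODFT00:
        return 2 * (n + 1);
    default: /* REDFT10, REDFT01, REDFT11, RODFT10, RODFT01, RODFT11 */
        return 2 * n;
    }
}

/* Halfcomplex layout for size n: the real part of frequency k is stored
   at index k, and the imaginary part (for 0 < k < (n+1)/2) at index n-k. */
int hc_imag_index(int n, int k)
{
    return n - k;
}
```

A round trip through a transform and its inverse multiplies the data by the value that 'r2r_logical_size' returns (or by the product of these values over all dimensions).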

File: fftw3.info, Node: Advanced Interface, Next: Guru Interface, Prev: Basic Interface, Up: FFTW Reference
4.4 Advanced Interface
======================
FFTW's "advanced" interface supplements the basic interface with four
new planner routines, providing a new level of flexibility: you can plan
a transform of multiple arrays simultaneously, operate on non-contiguous
(strided) data, and transform a subset of a larger multi-dimensional
array. Other than these additional features, the planner operates in
the same fashion as in the basic interface, and the resulting
'fftw_plan' is used in the same way (*note Using Plans::).
* Menu:
* Advanced Complex DFTs::
* Advanced Real-data DFTs::
* Advanced Real-to-real Transforms::

File: fftw3.info, Node: Advanced Complex DFTs, Next: Advanced Real-data DFTs, Prev: Advanced Interface, Up: Advanced Interface
4.4.1 Advanced Complex DFTs
---------------------------
fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany,
fftw_complex *in, const int *inembed,
int istride, int idist,
fftw_complex *out, const int *onembed,
int ostride, int odist,
int sign, unsigned flags);
This routine plans multiple multidimensional complex DFTs, and it
extends the 'fftw_plan_dft' routine (*note Complex DFTs::) to compute
'howmany' transforms, each having rank 'rank' and size 'n'. In
addition, the transform data need not be contiguous, but it may be laid
out in memory with an arbitrary stride. To account for these
possibilities, 'fftw_plan_many_dft' adds the new parameters 'howmany',
{'i','o'}'nembed', {'i','o'}'stride', and {'i','o'}'dist'. The FFTW
basic interface (*note Complex DFTs::) provides routines specialized for
ranks 1, 2, and 3, but the advanced interface handles only the
general-rank case.
'howmany' is the (nonnegative) number of transforms to compute. The
resulting plan computes 'howmany' transforms, where the input of the
'k'-th transform is at location 'in+k*idist' (in C pointer arithmetic),
and its output is at location 'out+k*odist'. Plans obtained in this way
can often be faster than calling FFTW multiple times for the individual
transforms. The basic 'fftw_plan_dft' interface corresponds to
'howmany=1' (in which case the 'dist' parameters are ignored).
Each of the 'howmany' transforms has rank 'rank' and size 'n', as in
the basic interface. In addition, the advanced interface allows the
input and output arrays of each transform to be row-major subarrays of
larger rank-'rank' arrays, described by 'inembed' and 'onembed'
parameters, respectively. {'i','o'}'nembed' must be arrays of length
'rank', and 'n' should be elementwise less than or equal to
{'i','o'}'nembed'. Passing 'NULL' for an 'nembed' parameter is
equivalent to passing 'n' (i.e. same physical and logical dimensions,
as in the basic interface).
The 'stride' parameters indicate that the 'j'-th element of the input
or output arrays is located at 'j*istride' or 'j*ostride', respectively.
(For a multi-dimensional array, 'j' is the ordinary row-major index.)
When combined with the 'k'-th transform in a 'howmany' loop, from above,
this means that the ('j','k')-th element is at 'j*stride+k*dist'. (The
basic 'fftw_plan_dft' interface corresponds to a stride of 1.)
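The addressing rule above can be sketched as a one-line function. 'many_offset' is a hypothetical helper, not part of FFTW; the offset it returns is in units of array elements, to be added to the 'in' or 'out' pointer.

```c
#include <stddef.h>

/* Hypothetical helper (not part of FFTW): offset, in array elements, of
   the j-th element of the k-th transform for the given stride and dist. */
ptrdiff_t many_offset(ptrdiff_t j, ptrdiff_t k, int stride, int dist)
{
    return j * stride + k * dist;
}
```

For example, when transforming the columns of a row-major array with 10 rows and 3 columns (stride 3, dist 1), element 4 of column 2 is at offset 4*3 + 2*1 = 14, which is the row-major index of row 4, column 2.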
For in-place transforms, the input and output 'stride' and 'dist'
parameters should be the same; otherwise, the planner may return 'NULL'.
Arrays 'n', 'inembed', and 'onembed' are not used after this function
returns. You can safely free or reuse them.
*Examples*: One transform of one 5 by 6 array contiguous in memory:
int rank = 2;
int n[] = {5, 6};
int howmany = 1;
int idist = 0, odist = 0; /* unused because howmany = 1 */
int istride = 1, ostride = 1; /* array is contiguous in memory */
int *inembed = n, *onembed = n;
Transform of three 5 by 6 arrays, each contiguous in memory, stored
in memory one after another:
int rank = 2;
int n[] = {5, 6};
int howmany = 3;
int idist = n[0]*n[1], odist = n[0]*n[1]; /* = 30, the distance in memory
between the first element
of the first array and the
first element of the second array */
int istride = 1, ostride = 1; /* array is contiguous in memory */
int *inembed = n, *onembed = n;
Transform each column of a 2d array with 10 rows and 3 columns:
int rank = 1; /* not 2: we are computing 1d transforms */
int n[] = {10}; /* 1d transforms of length 10 */
int howmany = 3;
int idist = 1, odist = 1;
int istride = 3, ostride = 3; /* distance between two elements in
the same column */
int *inembed = n, *onembed = n;

File: fftw3.info, Node: Advanced Real-data DFTs, Next: Advanced Real-to-real Transforms, Prev: Advanced Complex DFTs, Up: Advanced Interface
4.4.2 Advanced Real-data DFTs
-----------------------------
fftw_plan fftw_plan_many_dft_r2c(int rank, const int *n, int howmany,
double *in, const int *inembed,
int istride, int idist,
fftw_complex *out, const int *onembed,
int ostride, int odist,
unsigned flags);
fftw_plan fftw_plan_many_dft_c2r(int rank, const int *n, int howmany,
fftw_complex *in, const int *inembed,
int istride, int idist,
double *out, const int *onembed,
int ostride, int odist,
unsigned flags);
Like 'fftw_plan_many_dft', these two functions add 'howmany',
'nembed', 'stride', and 'dist' parameters to the 'fftw_plan_dft_r2c' and
'fftw_plan_dft_c2r' functions, but otherwise behave the same as the
basic interface.
The interpretation of 'howmany', 'stride', and 'dist' is the same as
for 'fftw_plan_many_dft', above. Note that the 'stride' and 'dist' for
the real array are in units of 'double', and for the complex array are
in units of 'fftw_complex'.
If an 'nembed' parameter is 'NULL', it is interpreted as what it
would be in the basic interface, as described in *note Real-data DFT
Array Format::. That is, for the complex array the size is assumed to
be the same as 'n', but with the last dimension cut roughly in half.
For the real array, the size is assumed to be 'n' if the transform is
out-of-place, or 'n' with the last dimension "padded" if the transform
is in-place.
If an 'nembed' parameter is non-'NULL', it is interpreted as the
physical size of the corresponding array, in row-major order, just as
for 'fftw_plan_many_dft'. In this case, each dimension of 'nembed'
should be '>=' what it would be in the basic interface (e.g. the halved
or padded 'n').
Arrays 'n', 'inembed', and 'onembed' are not used after this function
returns. You can safely free or reuse them.

File: fftw3.info, Node: Advanced Real-to-real Transforms, Prev: Advanced Real-data DFTs, Up: Advanced Interface
4.4.3 Advanced Real-to-real Transforms
--------------------------------------
fftw_plan fftw_plan_many_r2r(int rank, const int *n, int howmany,
double *in, const int *inembed,
int istride, int idist,
double *out, const int *onembed,
int ostride, int odist,
const fftw_r2r_kind *kind, unsigned flags);
Like 'fftw_plan_many_dft', this function adds 'howmany', 'nembed',
'stride', and 'dist' parameters to the 'fftw_plan_r2r' function, but
otherwise behaves the same as the basic interface. The interpretation of
those additional parameters is the same as for 'fftw_plan_many_dft'.
(Of course, the 'stride' and 'dist' parameters are now in units of
'double', not 'fftw_complex'.)
Arrays 'n', 'inembed', 'onembed', and 'kind' are not used after this
function returns. You can safely free or reuse them.

File: fftw3.info, Node: Guru Interface, Next: New-array Execute Functions, Prev: Advanced Interface, Up: FFTW Reference
4.5 Guru Interface
==================
The "guru" interface to FFTW is intended to expose as much as possible
of the flexibility in the underlying FFTW architecture. It allows one
to compute multi-dimensional "vectors" (loops) of multi-dimensional
transforms, where each vector/transform dimension has an independent
size and stride. One can also use more general complex-number formats,
e.g. separate real and imaginary arrays.
For those users who require the flexibility of the guru interface, it
is important that they pay special attention to the documentation lest
they shoot themselves in the foot.
* Menu:
* Interleaved and split arrays::
* Guru vector and transform sizes::
* Guru Complex DFTs::
* Guru Real-data DFTs::
* Guru Real-to-real Transforms::
* 64-bit Guru Interface::

File: fftw3.info, Node: Interleaved and split arrays, Next: Guru vector and transform sizes, Prev: Guru Interface, Up: Guru Interface
4.5.1 Interleaved and split arrays
----------------------------------
The guru interface supports two representations of complex numbers,
which we call the interleaved and the split format.
The "interleaved" format is the same one used by the basic and
advanced interfaces, and it is documented in *note Complex numbers::.
In the interleaved format, you provide pointers to the real part of a
complex number, and the imaginary part is understood to be stored in the
next memory location.
The "split" format allows separate pointers to the real and imaginary
parts of a complex array.
Technically, the interleaved format is redundant, because you can
always express an interleaved array in terms of a split array with
appropriate pointers and strides. On the other hand, the interleaved
format is simpler to use, and it is common in practice. Hence, FFTW
supports it as a special case.
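The redundancy mentioned above can be illustrated with a minimal sketch that describes a contiguous interleaved array in split terms. 'complex_pair' is a local stand-in that assumes the default layout of 'fftw_complex' (two adjacent 'double' values), and 'split_view' is a hypothetical structure, not an FFTW type.

```c
/* Local stand-in, assuming the default layout of fftw_complex: two
   adjacent doubles, real part first, then imaginary part. */
typedef double complex_pair[2];

/* Hypothetical description of a split-format view (not an FFTW type). */
struct split_view {
    double *re;  /* pointer to the first real part */
    double *im;  /* pointer to the first imaginary part */
    int stride;  /* distance between consecutive elements, in doubles */
};

/* Express a contiguous interleaved array as a split array. */
struct split_view split_view_of_interleaved(complex_pair *data)
{
    struct split_view v;
    v.re = (double *)data;      /* real part of element 0 */
    v.im = (double *)data + 1;  /* imaginary part is the next double */
    v.stride = 2;               /* each complex number spans two doubles */
    return v;
}
```

Element 'j' of the array then has its real part at 're[j * stride]' and its imaginary part at 'im[j * stride]'.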

File: fftw3.info, Node: Guru vector and transform sizes, Next: Guru Complex DFTs, Prev: Interleaved and split arrays, Up: Guru Interface
4.5.2 Guru vector and transform sizes
-------------------------------------
The guru interface introduces one basic new data structure,
'fftw_iodim', that is used to specify sizes and strides for
multi-dimensional transforms and vectors:
     typedef struct {
          int n;
          int is;
          int os;
     } fftw_iodim;
Here, 'n' is the size of the dimension, and 'is' and 'os' are the
strides of that dimension for the input and output arrays. (The stride
is the separation of consecutive elements along this dimension.)
The meaning of the stride parameter depends on the type of the array
that the stride refers to. _If the array is interleaved complex,
strides are expressed in units of complex numbers ('fftw_complex'). If
the array is split complex or real, strides are expressed in units of
real numbers ('double')._ This convention is consistent with the usual
pointer arithmetic in the C language. An interleaved array is denoted
by a pointer 'p' to 'fftw_complex', so that 'p+1' points to the next
complex number. Split arrays are denoted by pointers to 'double', in
which case pointer arithmetic operates in units of 'sizeof(double)'.
The guru planner interfaces all take a ('rank', 'dims[rank]') pair
describing the transform size, and a ('howmany_rank',
'howmany_dims[howmany_rank]') pair describing the "vector" size (a
multi-dimensional loop of transforms to perform), where 'dims' and
'howmany_dims' are arrays of 'fftw_iodim'. Each 'n' field must be
positive for 'dims' and nonnegative for 'howmany_dims', while both
'rank' and 'howmany_rank' must be nonnegative.
For example, the 'howmany' parameter in the advanced complex-DFT
interface corresponds to 'howmany_rank' = 1, 'howmany_dims[0].n' =
'howmany', 'howmany_dims[0].is' = 'idist', and 'howmany_dims[0].os' =
'odist'. (To compute a single transform, you can just use
'howmany_rank' = 0.)
A row-major multidimensional array with dimensions 'n[rank]' (*note
Row-major Format::) corresponds to 'dims[i].n' = 'n[i]' and the
recurrence 'dims[i].is' = 'n[i+1] * dims[i+1].is' (similarly for 'os').
The stride of the last ('i=rank-1') dimension is the overall stride of
the array. For example, to be equivalent to the advanced complex-DFT
interface, you would have 'dims[rank-1].is' = 'istride' and
'dims[rank-1].os' = 'ostride'.
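The recurrence above can be sketched in a few lines of C. This is only
an illustration: 'iodim_sketch' is a hypothetical stand-in with the same
three fields as 'fftw_iodim' (so that the sketch compiles without
'<fftw3.h>'), and 'row_major_dims' is a helper name invented here, not
part of FFTW.

```c
#include <assert.h>

/* Hypothetical stand-in with the same three fields as fftw_iodim,
   so that this sketch compiles without <fftw3.h>. */
typedef struct {
    int n;   /* size of the dimension */
    int is;  /* input stride */
    int os;  /* output stride */
} iodim_sketch;

/* Fill dims[0..rank-1] for a row-major array of shape n[0..rank-1].
   istride and ostride are the strides of the last dimension; the
   other strides follow from dims[i].is = n[i+1] * dims[i+1].is
   (and likewise for os). */
void row_major_dims(int rank, const int *n,
                    int istride, int ostride, iodim_sketch *dims)
{
    assert(rank >= 1);
    dims[rank - 1].n = n[rank - 1];
    dims[rank - 1].is = istride;
    dims[rank - 1].os = ostride;
    for (int i = rank - 2; i >= 0; --i) {
        dims[i].n = n[i];
        dims[i].is = n[i + 1] * dims[i + 1].is;
        dims[i].os = n[i + 1] * dims[i + 1].os;
    }
}
```

For a 4 x 5 x 6 array with 'istride' = 1, this gives input strides of
30, 6, and 1 for the three dimensions.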
In general, we only guarantee that FFTW returns a non-'NULL' plan if
the vector and transform dimensions correspond to a set of distinct
indices and, for in-place transforms, the input and output strides are
the same.

File: fftw3.info, Node: Guru Complex DFTs, Next: Guru Real-data DFTs, Prev: Guru vector and transform sizes, Up: Guru Interface
4.5.3 Guru Complex DFTs
-----------------------
     fftw_plan fftw_plan_guru_dft(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          fftw_complex *in, fftw_complex *out,
          int sign, unsigned flags);
     fftw_plan fftw_plan_guru_split_dft(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          double *ri, double *ii, double *ro, double *io,
          unsigned flags);
These two functions plan a complex-data, multi-dimensional DFT for
the interleaved and split format, respectively. Transform dimensions
are given by ('rank', 'dims') over a multi-dimensional vector (loop) of
dimensions ('howmany_rank', 'howmany_dims'). 'dims' and 'howmany_dims'
should point to 'fftw_iodim' arrays of length 'rank' and 'howmany_rank',
respectively.
'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.
In the 'fftw_plan_guru_dft' function, the pointers 'in' and 'out'
point to the interleaved input and output arrays, respectively. The
sign can be either -1 (= 'FFTW_FORWARD') or +1 (= 'FFTW_BACKWARD'). If
the pointers are equal, the transform is in-place.
In the 'fftw_plan_guru_split_dft' function, 'ri' and 'ii' point to
the real and imaginary input arrays, and 'ro' and 'io' point to the real
and imaginary output arrays. The input and output pointers may be the
same, indicating an in-place transform. For example, for 'fftw_complex'
pointers 'in' and 'out', the corresponding parameters are:
     ri = (double *) in;
     ii = (double *) in + 1;
     ro = (double *) out;
     io = (double *) out + 1;
Because 'fftw_plan_guru_split_dft' accepts split arrays, strides are
expressed in units of 'double'. For a contiguous 'fftw_complex' array,
the overall stride of the transform should be 2, the distance between
consecutive real parts or between consecutive imaginary parts; see *note
Guru vector and transform sizes::. Note that the dimension strides are
applied equally to the real and imaginary parts; real and imaginary
arrays with different strides are not supported.
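To make the stride of 2 concrete, here is a small sketch that reads
element 'k' of an interleaved array through split-style pointers. It
assumes FFTW's default 'fftw_complex' layout of two adjacent doubles;
'cplx_sketch', 'split_re', and 'split_im' are hypothetical names used
only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the default fftw_complex layout:
   real part first, imaginary part in the next double. */
typedef double cplx_sketch[2];

/* Real part of element k, read through a split-style pointer.
   Consecutive real parts are 2 doubles apart. */
double split_re(cplx_sketch *in, int k)
{
    const double *ri = (const double *) in;
    assert(k >= 0);
    return ri[2 * k];
}

/* Imaginary part of element k: same stride, pointer offset by 1. */
double split_im(cplx_sketch *in, int k)
{
    const double *ii = (const double *) in + 1;
    assert(k >= 0);
    return ii[2 * k];
}
```

The same stride of 2 is what one would place in the 'is' and 'os'
fields of the last transform dimension for such an array.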
There is no 'sign' parameter in 'fftw_plan_guru_split_dft'. This
function always plans for an 'FFTW_FORWARD' transform. To plan for an
'FFTW_BACKWARD' transform, you can exploit the identity that the
backward DFT is equal to the forward DFT with the real and imaginary
parts swapped in both the input and the output. For example, in the
case of the 'fftw_complex' arrays above, the 'FFTW_BACKWARD' transform
is computed by the parameters:
     ri = (double *) in + 1;
     ii = (double *) in;
     ro = (double *) out + 1;
     io = (double *) out;

File: fftw3.info, Node: Guru Real-data DFTs, Next: Guru Real-to-real Transforms, Prev: Guru Complex DFTs, Up: Guru Interface
4.5.4 Guru Real-data DFTs
-------------------------
     fftw_plan fftw_plan_guru_dft_r2c(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          double *in, fftw_complex *out,
          unsigned flags);
     fftw_plan fftw_plan_guru_split_dft_r2c(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          double *in, double *ro, double *io,
          unsigned flags);
     fftw_plan fftw_plan_guru_dft_c2r(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          fftw_complex *in, double *out,
          unsigned flags);
     fftw_plan fftw_plan_guru_split_dft_c2r(
          int rank, const fftw_iodim *dims,
          int howmany_rank, const fftw_iodim *howmany_dims,
          double *ri, double *ii, double *out,
          unsigned flags);
Plan a real-input (r2c) or real-output (c2r), multi-dimensional DFT
with transform dimensions given by ('rank', 'dims') over a
multi-dimensional vector (loop) of dimensions ('howmany_rank',
'howmany_dims'). 'dims' and 'howmany_dims' should point to 'fftw_iodim'
arrays of length 'rank' and 'howmany_rank', respectively. As for the
basic and advanced interfaces, an r2c transform is 'FFTW_FORWARD' and a
c2r transform is 'FFTW_BACKWARD'.
The _last_ dimension of 'dims' is interpreted specially: that
dimension of the real array has size 'dims[rank-1].n', but that
dimension of the complex array has size 'dims[rank-1].n/2+1' (division
rounded down). The strides, on the other hand, are taken to be exactly
as specified. It is up to the user to specify the strides appropriately
for the peculiar dimensions of the data, and we do not guarantee that
the planner will succeed (return non-'NULL') for any dimensions other
than those described in *note Real-data DFT Array Format:: and
generalized in *note Advanced Real-data DFTs::. (That is, for an
in-place transform, each individual dimension should be able to operate
in place.)
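As an illustration of what "specifying the strides appropriately" can
look like, the following sketch fills in sizes and strides for an
out-of-place r2c transform between a contiguous row-major real array
and a contiguous row-major interleaved complex array. It is only one
possible layout, stated as an assumption; 'iodim_r2c_sketch' is a
hypothetical stand-in for 'fftw_iodim', and 'r2c_row_major_dims' is a
name invented here.

```c
#include <assert.h>

/* Hypothetical stand-in with the same fields as fftw_iodim. */
typedef struct {
    int n;   /* logical (real-array) size of the dimension */
    int is;  /* input stride, in units of double */
    int os;  /* output stride, in units of complex numbers */
} iodim_r2c_sketch;

/* Out-of-place r2c, both arrays contiguous and row-major.  The n
   fields hold the real sizes, but the complex array has only
   n[rank-1]/2 + 1 elements along the last dimension, so the output
   strides of the outer dimensions are built from that shorter
   length. */
void r2c_row_major_dims(int rank, const int *n, iodim_r2c_sketch *dims)
{
    int is = 1, os = 1;
    assert(rank >= 1);
    for (int i = rank - 1; i >= 0; --i) {
        dims[i].n = n[i];
        dims[i].is = is;
        dims[i].os = os;
        is *= n[i];
        os *= (i == rank - 1) ? n[i] / 2 + 1 : n[i];
    }
}
```

For a 3 x 4 x 7 real array, this yields input strides 28, 7, 1 and
output strides 16, 4, 1, because the last complex dimension has length
7/2 + 1 = 4.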
'in' and 'out' point to the input and output arrays for r2c and c2r
transforms, respectively. For split arrays, 'ri' and 'ii' point to the
real and imaginary input arrays for a c2r transform, and 'ro' and 'io'
point to the real and imaginary output arrays for an r2c transform.
'in' and 'ro' or 'ri' and 'out' may be the same, indicating an in-place
transform. (In-place transforms where 'in' and 'io' or 'ii' and 'out'
are the same are not currently supported.)
'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.
In-place transforms of rank greater than 1 are currently only
supported for interleaved arrays. For split arrays, the planner will
return 'NULL'.

File: fftw3.info, Node: Guru Real-to-real Transforms, Next: 64-bit Guru Interface, Prev: Guru Real-data DFTs, Up: Guru Interface
4.5.5 Guru Real-to-real Transforms
----------------------------------
     fftw_plan fftw_plan_guru_r2r(int rank, const fftw_iodim *dims,
                                  int howmany_rank,
                                  const fftw_iodim *howmany_dims,
                                  double *in, double *out,
                                  const fftw_r2r_kind *kind,
                                  unsigned flags);
Plan a real-to-real (r2r) multi-dimensional 'FFTW_FORWARD' transform
with transform dimensions given by ('rank', 'dims') over a
multi-dimensional vector (loop) of dimensions ('howmany_rank',
'howmany_dims'). 'dims' and 'howmany_dims' should point to 'fftw_iodim'
arrays of length 'rank' and 'howmany_rank', respectively.
The transform kind of each dimension is given by the 'kind'
parameter, which should point to an array of length 'rank'. Valid
'fftw_r2r_kind' constants are given in *note Real-to-Real Transform
Kinds::.
'in' and 'out' point to the real input and output arrays; they may be
the same, indicating an in-place transform.
'flags' is a bitwise OR ('|') of zero or more planner flags, as
defined in *note Planner Flags::.

File: fftw3.info, Node: 64-bit Guru Interface, Prev: Guru Real-to-real Transforms, Up: Guru Interface
4.5.6 64-bit Guru Interface
---------------------------
When compiled in 64-bit mode on a 64-bit architecture (where addresses
are 64 bits wide), FFTW uses 64-bit quantities internally for all
transform sizes, strides, and so on--you don't have to do anything
special to exploit this. However, in the ordinary FFTW interfaces, you
specify the transform size by an 'int' quantity, which is normally only
32 bits wide. This means that, even though FFTW is using 64-bit sizes
internally, you cannot specify a single transform dimension larger than
2^31-1 numbers.
We expect that few users will require transforms larger than this,
but, for those who do, we provide a 64-bit version of the guru interface
in which all sizes are specified as integers of type 'ptrdiff_t' instead
of 'int'. ('ptrdiff_t' is a signed integer type defined by the C
standard to be wide enough to represent address differences, and thus
must be at least 64 bits wide on a 64-bit machine.) We stress that
there is _no performance advantage_ to using this interface--the same
internal FFTW code is employed regardless--and it is only necessary if
you want to specify very large transform sizes.
In particular, the 64-bit guru interface is a set of planner routines
that are exactly the same as the guru planner routines, except that they
are named with 'guru64' instead of 'guru' and they take arguments of
type 'fftw_iodim64' instead of 'fftw_iodim'. For example, instead of
'fftw_plan_guru_dft', we have 'fftw_plan_guru64_dft'.
     fftw_plan fftw_plan_guru64_dft(
          int rank, const fftw_iodim64 *dims,
          int howmany_rank, const fftw_iodim64 *howmany_dims,
          fftw_complex *in, fftw_complex *out,
          int sign, unsigned flags);
The 'fftw_iodim64' type is similar to 'fftw_iodim', with the same
interpretation, except that it uses type 'ptrdiff_t' instead of type
'int'.
     typedef struct {
          ptrdiff_t n;
          ptrdiff_t is;
          ptrdiff_t os;
     } fftw_iodim64;
Every other 'fftw_plan_guru' function also has a 'fftw_plan_guru64'
equivalent, but we do not repeat their documentation here since they are
identical to the 32-bit versions except as noted above.
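One way to decide whether the 64-bit interface is needed is to check
whether every size and stride still fits in an 'int'. The sketch below
does this for a contiguous row-major array; it is an illustration under
that assumption, and 'row_major_fits_in_int' is a hypothetical helper,
not an FFTW function.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Return 1 if every size n[i] and every row-major stride of an array
   of shape n[0..rank-1] fits in an int (so the int-based iodim
   fields can hold them), and 0 if the 64-bit variant would be
   needed. */
int row_major_fits_in_int(int rank, const int64_t *n)
{
    int64_t stride = 1;   /* stride of the dimension being examined */
    assert(rank >= 1);
    for (int i = rank - 1; i >= 0; --i) {
        if (n[i] > INT_MAX || stride > INT_MAX)
            return 0;
        stride *= n[i];   /* becomes the stride of dimension i-1 */
    }
    return 1;
}
```

For example, a 4 x 65536 x 65536 array has a first-dimension stride of
2^32, which does not fit in a 32-bit 'int' even though each individual
size does.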

File: fftw3.info, Node: New-array Execute Functions, Next: Wisdom, Prev: Guru Interface, Up: FFTW Reference
4.6 New-array Execute Functions
===============================
Normally, one executes a plan for the arrays with which the plan was
created, by calling 'fftw_execute(plan)' as described in *note Using
Plans::. However, it is possible for sophisticated users to apply a
given plan to a _different_ array using the "new-array execute"
functions detailed below, provided that the following conditions are
met:
* The array size, strides, etcetera are the same (since those are set
by the plan).
* The input and output arrays are the same (in-place) or different
(out-of-place) if the plan was originally created to be in-place or
out-of-place, respectively.
* For split arrays, the separations between the real and imaginary
parts, 'ii-ri' and 'io-ro', are the same as they were for the input
and output arrays when the plan was created. (This condition is
automatically satisfied for interleaved arrays.)
* The "alignment" of the new input/output arrays is the same as that
of the input/output arrays when the plan was created, unless the
plan was created with the 'FFTW_UNALIGNED' flag. Here, the
alignment is a platform-dependent quantity (for example, it is the
address modulo 16 if SSE SIMD instructions are used, but the
address modulo 4 for non-SIMD single-precision FFTW on the same
machine). In general, only arrays allocated with 'fftw_malloc' are
guaranteed to be equally aligned (*note SIMD alignment and
fftw_malloc::).
The alignment issue is especially critical, because if you don't use
'fftw_malloc' then you may have little control over the alignment of
arrays in memory. For example, neither the C++ 'new' operator nor the
Fortran 'allocate' statement provides strong enough guarantees about data
alignment. If you don't use 'fftw_malloc', therefore, you probably have
to use 'FFTW_UNALIGNED' (which disables most SIMD support). If
possible, it is probably better for you to simply create multiple plans
(creating a new plan is quick once one exists for a given size), or
better yet re-use the same array for your transforms.
For rare circumstances in which you cannot control the alignment of
allocated memory, but wish to determine whether a given array is aligned
like the original array for which a plan was created, you can use the
'fftw_alignment_of' function:
     int fftw_alignment_of(double *p);
Two arrays have equivalent alignment (for the purposes of applying a
plan) if and only if 'fftw_alignment_of' returns the same value for the
corresponding pointers to their data (typecast to 'double*' if
necessary).
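To see why two pointers into the same buffer can differ in alignment,
the following sketch computes the address modulo 16, which is the
example quantity mentioned above for SSE builds. It is an illustration
of the notion only, under that assumption; real code should compare the
values returned by 'fftw_alignment_of', since the actual quantity is
platform-dependent. 'address_mod16' is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: the address modulo 16.  This is not a
   substitute for fftw_alignment_of, whose result depends on how
   FFTW was built. */
int address_mod16(const void *p)
{
    assert(p != 0);
    return (int) ((uintptr_t) p % 16);
}
```

With 8-byte doubles, 'buf' and 'buf + 2' always agree under this
measure, while 'buf' and 'buf + 1' never do, so a plan created for
'buf' could not safely be applied to 'buf + 1' in such a build.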
If you are tempted to use the new-array execute interface because you
want to transform a known bunch of arrays of the same size, you should
probably use the advanced interface instead (*note Advanced
Interface::).
The new-array execute functions are:
     void fftw_execute_dft(
          const fftw_plan p,
          fftw_complex *in, fftw_complex *out);
     void fftw_execute_split_dft(
          const fftw_plan p,
          double *ri, double *ii, double *ro, double *io);
     void fftw_execute_dft_r2c(
          const fftw_plan p,
          double *in, fftw_complex *out);
     void fftw_execute_split_dft_r2c(
          const fftw_plan p,
          double *in, double *ro, double *io);
     void fftw_execute_dft_c2r(
          const fftw_plan p,
          fftw_complex *in, double *out);
     void fftw_execute_split_dft_c2r(
          const fftw_plan p,
          double *ri, double *ii, double *out);
     void fftw_execute_r2r(
          const fftw_plan p,
          double *in, double *out);
These execute the 'plan' to compute the corresponding transform on
the input/output arrays specified by the subsequent arguments. The
input/output array arguments have the same meanings as the ones passed
to the guru planner routines in the preceding sections. The 'plan' is
not modified, and these routines can be called as many times as desired,
or intermixed with calls to the ordinary 'fftw_execute'.
The 'plan' _must_ have been created for the transform type
corresponding to the execute function, e.g. it must be a complex-DFT
plan for 'fftw_execute_dft'. Any of the planner routines for that
transform type, from the basic to the guru interface, could have been
used to create the plan, however.

File: fftw3.info, Node: Wisdom, Next: What FFTW Really Computes, Prev: New-array Execute Functions, Up: FFTW Reference
4.7 Wisdom
==========
This section documents the FFTW mechanism for saving and restoring plans
from disk. This mechanism is called "wisdom".
* Menu:
* Wisdom Export::
* Wisdom Import::
* Forgetting Wisdom::
* Wisdom Utilities::

File: fftw3.info, Node: Wisdom Export, Next: Wisdom Import, Prev: Wisdom, Up: Wisdom
4.7.1 Wisdom Export
-------------------
     int fftw_export_wisdom_to_filename(const char *filename);
     void fftw_export_wisdom_to_file(FILE *output_file);
     char *fftw_export_wisdom_to_string(void);
     void fftw_export_wisdom(void (*write_char)(char c, void *), void *data);
These functions allow you to export all currently accumulated wisdom
in a form from which it can be later imported and restored, even during
a separate run of the program. (*Note Words of Wisdom-Saving Plans::.)
The current store of wisdom is not affected by calling any of these
routines.
'fftw_export_wisdom' exports the wisdom to any output medium, as
specified by the callback function 'write_char'. 'write_char' is a
'putc'-like function that writes the character 'c' to some output; its
second parameter is the 'data' pointer passed to 'fftw_export_wisdom'.
For convenience, the following three "wrapper" routines are provided:
'fftw_export_wisdom_to_filename' writes wisdom to a file named
'filename' (which is created or overwritten), returning '1' on success
and '0' on failure. A lower-level function, which requires you to open
and close the file yourself (e.g. if you want to write wisdom to a
portion of a larger file) is 'fftw_export_wisdom_to_file'. This writes
the wisdom to the current position in 'output_file', which should be
open with write permission; upon exit, the file remains open and is
positioned at the end of the wisdom data.
'fftw_export_wisdom_to_string' returns a pointer to a
NUL-terminated string holding the wisdom data. This string is
dynamically allocated, and it is the responsibility of the caller to
deallocate it with 'free' when it is no longer needed.
All of these routines export the wisdom in the same format, which we
will not document here except to say that it is LISP-like ASCII text
that is insensitive to white space.
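As a sketch of the kind of callback that 'fftw_export_wisdom' expects,
the following 'putc'-like function appends each character to a growable
in-memory buffer passed through the 'data' pointer. The names
'wisdom_buf' and 'wisdom_buf_putc' are hypothetical; only the callback
signature is taken from the prototype above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer, passed as the 'data' argument. */
typedef struct {
    char *buf;    /* collected characters (not NUL-terminated) */
    size_t len;   /* number of characters stored */
    size_t cap;   /* allocated capacity */
    int failed;   /* set to 1 if an allocation failed */
} wisdom_buf;

/* Matches void (*write_char)(char c, void *): append c to the
   buffer, doubling the capacity when it is full. */
void wisdom_buf_putc(char c, void *data)
{
    wisdom_buf *b = (wisdom_buf *) data;
    assert(b != NULL);
    if (b->failed)
        return;
    if (b->len == b->cap) {
        size_t ncap = b->cap ? 2 * b->cap : 64;
        char *nbuf = realloc(b->buf, ncap);
        if (nbuf == NULL) {
            b->failed = 1;
            return;
        }
        b->buf = nbuf;
        b->cap = ncap;
    }
    b->buf[b->len++] = c;
}
```

With a zero-initialized 'wisdom_buf b', one would then call
'fftw_export_wisdom(wisdom_buf_putc, &b)', check 'b.failed', and
release 'b.buf' with 'free' when done.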

File: fftw3.info, Node: Wisdom Import, Next: Forgetting Wisdom, Prev: Wisdom Export, Up: Wisdom
4.7.2 Wisdom Import
-------------------
     int fftw_import_system_wisdom(void);
     int fftw_import_wisdom_from_filename(const char *filename);
     int fftw_import_wisdom_from_file(FILE *input_file);
     int fftw_import_wisdom_from_string(const char *input_string);
     int fftw_import_wisdom(int (*read_char)(void *), void *data);
These functions import wisdom into a program from data stored by the
'fftw_export_wisdom' functions above. (*Note Words of Wisdom-Saving
Plans::.) The imported wisdom replaces any wisdom already accumulated
by the running program.
'fftw_import_wisdom' imports wisdom from any input medium, as
specified by the callback function 'read_char'. 'read_char' is a
'getc'-like function that returns the next character in the input; its
parameter is the 'data' pointer passed to 'fftw_import_wisdom'. If the
end of the input data is reached (which should never happen for valid
data), 'read_char' should return 'EOF' (as defined in '<stdio.h>'). For
convenience, the following three "wrapper" routines are provided:
'fftw_import_wisdom_from_filename' reads wisdom from a file named
'filename'. A lower-level function, which requires you to open and
close the file yourself (e.g. if you want to read wisdom from a portion
of a larger file) is 'fftw_import_wisdom_from_file'. This reads wisdom
from the current position in 'input_file' (which should be open with
read permission); upon exit, the file remains open, but the position of
the read pointer is unspecified.
'fftw_import_wisdom_from_string' reads wisdom from the
NUL-terminated string 'input_string'.
'fftw_import_system_wisdom' reads wisdom from an
implementation-defined standard file ('/etc/fftw/wisdom' on Unix and GNU
systems).
The return value of these import routines is '1' if the wisdom was
read successfully and '0' otherwise. Note that, in all of these
functions, any data in the input stream past the end of the wisdom data
is simply ignored.
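A 'read_char' callback for 'fftw_import_wisdom' can be sketched as a
'getc'-like reader over an in-memory buffer. The names 'wisdom_reader'
and 'wisdom_reader_getc' are hypothetical; the callback signature and
the 'EOF' convention are the ones described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical reader state, passed as the 'data' argument. */
typedef struct {
    const char *buf;  /* wisdom data held in memory */
    size_t len;       /* number of valid characters in buf */
    size_t pos;       /* index of the next character to return */
} wisdom_reader;

/* Matches int (*read_char)(void *): return the next character, or
   EOF once the buffer is exhausted. */
int wisdom_reader_getc(void *data)
{
    wisdom_reader *r = (wisdom_reader *) data;
    assert(r != NULL);
    if (r->pos >= r->len)
        return EOF;
    /* convert via unsigned char so that no data byte equals EOF */
    return (unsigned char) r->buf[r->pos++];
}
```

One would then call 'fftw_import_wisdom(wisdom_reader_getc, &r)' and
check that it returns '1'.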

File: fftw3.info, Node: Forgetting Wisdom, Next: Wisdom Utilities, Prev: Wisdom Import, Up: Wisdom
4.7.3 Forgetting Wisdom
-----------------------
     void fftw_forget_wisdom(void);
Calling 'fftw_forget_wisdom' causes all accumulated 'wisdom' to be
discarded and its associated memory to be freed. (New 'wisdom' can
still be gathered subsequently, however.)

File: fftw3.info, Node: Wisdom Utilities, Prev: Forgetting Wisdom, Up: Wisdom
4.7.4 Wisdom Utilities
----------------------
FFTW includes two standalone utility programs that deal with wisdom. We
merely summarize them here, since they come with their own 'man' pages
for Unix and GNU systems (with HTML versions on our web site).
The first program is 'fftw-wisdom' (or 'fftwf-wisdom' in single
precision, etcetera), which can be used to create a wisdom file
containing plans for any of the transform sizes and types supported by
FFTW. It is preferable to create wisdom directly from your executable
(*note Caveats in Using Wisdom::), but this program is useful for
creating global wisdom files for 'fftw_import_system_wisdom'.
The second program is 'fftw-wisdom-to-conf', which takes a wisdom
file as input and produces a "configuration routine" as output. The
latter is a C subroutine that you can compile and link into your
program, replacing a routine of the same name in the FFTW library, that
determines which parts of FFTW are callable by your program.
'fftw-wisdom-to-conf' produces a configuration routine that links to
only those parts of FFTW needed by the saved plans in the wisdom,
greatly reducing the size of statically linked executables (which should
only attempt to create plans corresponding to those in the wisdom,
however).

File: fftw3.info, Node: What FFTW Really Computes, Prev: Wisdom, Up: FFTW Reference
4.8 What FFTW Really Computes
=============================
In this section, we provide precise mathematical definitions for the
transforms that FFTW computes. These transform definitions are fairly
standard, but some authors follow slightly different conventions for the
normalization of the transform (the constant factor in front) and the
sign of the complex exponent. We begin by presenting the
one-dimensional (1d) transform definitions, and then give the
straightforward extension to multi-dimensional transforms.
* Menu:
* The 1d Discrete Fourier Transform (DFT)::
* The 1d Real-data DFT::
* 1d Real-even DFTs (DCTs)::
* 1d Real-odd DFTs (DSTs)::
* 1d Discrete Hartley Transforms (DHTs)::
* Multi-dimensional Transforms::

File: fftw3.info, Node: The 1d Discrete Fourier Transform (DFT), Next: The 1d Real-data DFT, Prev: What FFTW Really Computes, Up: What FFTW Really Computes
4.8.1 The 1d Discrete Fourier Transform (DFT)
---------------------------------------------
The forward ('FFTW_FORWARD') discrete Fourier transform (DFT) of a 1d
complex array X of size n computes an array Y, where:
Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
The backward ('FFTW_BACKWARD') DFT computes:
Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
FFTW computes an unnormalized transform, in that there is no
coefficient in front of the summation in the DFT. In other words,
applying the forward and then the backward transform will multiply the
input by n.
From above, an 'FFTW_FORWARD' transform corresponds to a sign of -1
in the exponent of the DFT. Note also that we use the standard
"in-order" output ordering--the k-th output corresponds to the frequency
k/n (or k/T, where T is your total sampling period). For those who like
to think in terms of positive and negative frequencies, this means that
the positive frequencies are stored in the first half of the output and
the negative frequencies are stored in backwards order in the second
half of the output. (The frequency -k/n is the same as the frequency
(n-k)/n.)
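The in-order layout can be turned into a signed frequency index with a
one-line mapping. In this sketch, 'signed_frequency_index' is a
hypothetical helper; for even n it assigns the k = n/2 output to the
positive side, which is a choice made here, since that frequency is
equivalent to its negative.

```c
#include <assert.h>

/* Signed frequency index m such that output k of a size-n DFT
   corresponds to the frequency m/n: outputs 0..n/2 map to
   themselves, the remaining outputs map to k - n. */
int signed_frequency_index(int n, int k)
{
    assert(n > 0 && k >= 0 && k < n);
    return (k <= n / 2) ? k : k - n;
}
```

For n = 8, outputs 5, 6, and 7 correspond to the frequencies -3/8,
-2/8, and -1/8.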

File: fftw3.info, Node: The 1d Real-data DFT, Next: 1d Real-even DFTs (DCTs), Prev: The 1d Discrete Fourier Transform (DFT), Up: What FFTW Really Computes
4.8.2 The 1d Real-data DFT
--------------------------
The real-input (r2c) DFT in FFTW computes the _forward_ transform Y of
the size 'n' real array X, exactly as defined above, i.e.
Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(-2 pi j k sqrt(-1)/n) .
This output array Y can easily be shown to possess the "Hermitian"
symmetry Y[k] = Y[n-k]*, where we take Y to be periodic so that Y[n] =
Y[0].
As a result of this symmetry, half of the output Y is redundant
(being the complex conjugate of the other half), and so the 1d r2c
transforms only output elements 0...n/2 of Y (n/2+1 complex numbers),
where the division by 2 is rounded down.
Moreover, the Hermitian symmetry implies that Y[0] and, if n is even,
the Y[n/2] element, are purely real. So, for the 'R2HC' r2r transform,
the halfcomplex format does not store the imaginary parts of these
elements.
The c2r and 'HC2R' r2r transforms compute the backward DFT of the
_complex_ array X with Hermitian symmetry, stored in the r2c/'R2HC'
output formats, respectively, where the backward transform is defined
exactly as for the complex case:
Y[k] = sum for j = 0 to (n - 1) of X[j] * exp(2 pi j k sqrt(-1)/n) .
The outputs 'Y' of this transform can easily be seen to be purely
real, and are stored as an array of real numbers.
Like FFTW's complex DFT, these transforms are unnormalized. In other
words, applying the real-to-complex (forward) and then the
complex-to-real (backward) transform will multiply the input by n.

File: fftw3.info, Node: 1d Real-even DFTs (DCTs), Next: 1d Real-odd DFTs (DSTs), Prev: The 1d Real-data DFT, Up: What FFTW Really Computes
4.8.3 1d Real-even DFTs (DCTs)
------------------------------
The real-even symmetry DFTs in FFTW are exactly equivalent to the
unnormalized forward (and backward) DFTs as defined above, where the
input array X of length N is purely real and also has "even" symmetry.
In this case, the output array is likewise real and has even symmetry.
For the case of 'REDFT00', this even symmetry means that X[j] =
X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of
this redundancy, only the first n real numbers are actually stored,
where N = 2(n-1).
The proper definition of even symmetry for 'REDFT10', 'REDFT01', and
'REDFT11' transforms is somewhat more intricate because of the shifts by
1/2 of the input and/or output, although the corresponding boundary
conditions are given in *note Real even/odd DFTs (cosine/sine
transforms)::. Because of the even symmetry, however, the sine terms in
the DFT all cancel and the remaining cosine terms are written explicitly
below. This formulation often leads people to call such a transform a
"discrete cosine transform" (DCT), although it is really just a special
case of the DFT.
In each of the definitions below, we transform a real array X of
length n to a real array Y of length n:
REDFT00 (DCT-I)
...............
An 'REDFT00' transform (type-I DCT) in FFTW is defined by: Y[k] = X[0] +
(-1)^k X[n-1] + 2 (sum for j = 1 to n-2 of X[j] cos(pi jk /(n-1))).
Note that this transform is not defined for n=1. For n=2, the summation
term above is dropped as you might expect.
REDFT10 (DCT-II)
................
An 'REDFT10' transform (type-II DCT, sometimes called "the" DCT) in FFTW
is defined by: Y[k] = 2 (sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) k /
n)).
REDFT01 (DCT-III)
.................
An 'REDFT01' transform (type-III DCT) in FFTW is defined by: Y[k] = X[0]
+ 2 (sum for j = 1 to n-1 of X[j] cos(pi j (k+1/2) / n)). In the case
of n=1, this reduces to Y[0] = X[0]. Up to a scale factor (see below),
this is the inverse of 'REDFT10' ("the" DCT), and so the 'REDFT01'
(DCT-III) is sometimes called the "IDCT".
REDFT11 (DCT-IV)
................
An 'REDFT11' transform (type-IV DCT) in FFTW is defined by: Y[k] = 2
(sum for j = 0 to n-1 of X[j] cos(pi (j+1/2) (k+1/2) / n)).
Inverses and Normalization
..........................
These definitions correspond directly to the unnormalized DFTs used
elsewhere in FFTW (hence the factors of 2 in front of the summations).
The unnormalized inverse of 'REDFT00' is 'REDFT00', of 'REDFT10' is
'REDFT01' and vice versa, and of 'REDFT11' is 'REDFT11'. Each
unnormalized inverse results in the original array multiplied by N,
where N is the _logical_ DFT size. For 'REDFT00', N=2(n-1) (note that
n=1 is not defined); otherwise, N=2n.
In defining the discrete cosine transform, some authors also include
additional factors of sqrt(2) (or its inverse) multiplying selected
inputs and/or outputs. This is a mostly cosmetic change that makes the
transform orthogonal, but sacrifices the direct equivalence to a
symmetric DFT.

File: fftw3.info, Node: 1d Real-odd DFTs (DSTs), Next: 1d Discrete Hartley Transforms (DHTs), Prev: 1d Real-even DFTs (DCTs), Up: What FFTW Really Computes
4.8.4 1d Real-odd DFTs (DSTs)
-----------------------------
The real-odd symmetry DFTs in FFTW are exactly equivalent to the
unnormalized forward (and backward) DFTs as defined above, where the
input array X of length N is purely real and also has "odd" symmetry. In
this case, the output has odd symmetry and is purely imaginary.
For the case of 'RODFT00', this odd symmetry means that X[j] =
-X[N-j], where we take X to be periodic so that X[N] = X[0]. Because of
this redundancy, only the first n real numbers starting at j=1 are
actually stored (the j=0 element is zero), where N = 2(n+1).
The proper definition of odd symmetry for 'RODFT10', 'RODFT01', and
'RODFT11' transforms is somewhat more intricate because of the shifts by
1/2 of the input and/or output, although the corresponding boundary
conditions are given in *note Real even/odd DFTs (cosine/sine
transforms)::. Because of the odd symmetry, however, the cosine terms
in the DFT all cancel and the remaining sine terms are written
explicitly below. This formulation often leads people to call such a
transform a "discrete sine transform" (DST), although it is really just
a special case of the DFT.
In each of the definitions below, we transform a real array X of
length n to a real array Y of length n:
RODFT00 (DST-I)
...............
An 'RODFT00' transform (type-I DST) in FFTW is defined by: Y[k] = 2 (sum
for j = 0 to n-1 of X[j] sin(pi (j+1)(k+1) / (n+1))).
RODFT10 (DST-II)
................
An 'RODFT10' transform (type-II DST) in FFTW is defined by: Y[k] = 2
(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1) / n)).
RODFT01 (DST-III)
.................
An 'RODFT01' transform (type-III DST) in FFTW is defined by: Y[k] =
(-1)^k X[n-1] + 2 (sum for j = 0 to n-2 of X[j] sin(pi (j+1) (k+1/2) /
n)). In the case of n=1, this reduces to Y[0] = X[0].
RODFT11 (DST-IV)
................
An 'RODFT11' transform (type-IV DST) in FFTW is defined by: Y[k] = 2
(sum for j = 0 to n-1 of X[j] sin(pi (j+1/2) (k+1/2) / n)).
Inverses and Normalization
..........................
These definitions correspond directly to the unnormalized DFTs used
elsewhere in FFTW (hence the factors of 2 in front of the summations).
The unnormalized inverse of 'RODFT00' is 'RODFT00', of 'RODFT10' is
'RODFT01' and vice versa, and of 'RODFT11' is 'RODFT11'. Each
unnormalized inverse results in the original array multiplied by N,
where N is the _logical_ DFT size. For 'RODFT00', N=2(n+1); otherwise,
N=2n.
In defining the discrete sine transform, some authors also include
additional factors of sqrt(2) (or its inverse) multiplying selected
inputs and/or outputs. This is a mostly cosmetic change that makes the
transform orthogonal, but sacrifices the direct equivalence to an
antisymmetric DFT.

File: fftw3.info, Node: 1d Discrete Hartley Transforms (DHTs), Next: Multi-dimensional Transforms, Prev: 1d Real-odd DFTs (DSTs), Up: What FFTW Really Computes
4.8.5 1d Discrete Hartley Transforms (DHTs)
-------------------------------------------
The discrete Hartley transform (DHT) of a 1d real array X of size n
computes a real array Y of the same size, where:
Y[k] = sum for j = 0 to (n - 1) of X[j] * [cos(2 pi j k / n) + sin(2 pi j k / n)].
FFTW computes an unnormalized transform, in that there is no
coefficient in front of the summation in the DHT. In other words,
applying the transform twice (the DHT is its own inverse) will multiply
the input by n.

File: fftw3.info, Node: Multi-dimensional Transforms, Prev: 1d Discrete Hartley Transforms (DHTs), Up: What FFTW Really Computes
4.8.6 Multi-dimensional Transforms
----------------------------------
The multi-dimensional transforms of FFTW, in general, compute simply the
separable product of the given 1d transform along each dimension of the
array. Since each of these transforms is unnormalized, computing the
forward followed by the backward/inverse multi-dimensional transform
will result in the original array scaled by the product of the
normalization factors for each dimension (e.g. the product of the
dimension sizes, for a multi-dimensional DFT).
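As a small illustration of this rule, the sketch below multiplies together the logical sizes stated in this chapter: n for a DFT or DHT dimension, 2(n+1) for 'RODFT00', and 2n for the other RODFT kinds. The enum tags and function names are hypothetical and exist only for this example; they are not FFTW's own constants.

```c
#include <stddef.h>

/* Hypothetical tags for this sketch only (not FFTW constants). */
enum sketch_kind {
    SKETCH_DFT,         /* DFT dimension: N = n                 */
    SKETCH_DHT,         /* discrete Hartley transform: N = n    */
    SKETCH_RODFT00,     /* DST-I: N = 2(n+1)                    */
    SKETCH_RODFT_OTHER  /* RODFT01, RODFT10, RODFT11: N = 2n    */
};

/* Logical DFT size N of a single dimension of physical size n. */
size_t logical_size(enum sketch_kind kind, size_t n)
{
    switch (kind) {
    case SKETCH_RODFT00:     return 2 * (n + 1);
    case SKETCH_RODFT_OTHER: return 2 * n;
    case SKETCH_DFT:
    case SKETCH_DHT:
    default:                 return n;
    }
}

/* Factor by which a forward transform followed by its unnormalized
   inverse scales the array: the product of the per-dimension N. */
size_t roundtrip_scale(int rank, const enum sketch_kind *kinds, const size_t *n)
{
    size_t scale = 1;
    for (int d = 0; d < rank; ++d)
        scale *= logical_size(kinds[d], n[d]);
    return scale;
}
```

Dividing the round-trip result by this factor recovers the original data.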
The definition of FFTW's multi-dimensional DFT of real data (r2c)
deserves special attention. In this case, we logically compute the full
multi-dimensional DFT of the input data; since the input data are purely
real, the output data have the Hermitian symmetry and therefore only one
non-redundant half need be stored. More specifically, for an n[0] x
n[1] x n[2] x ... x n[d-1] multi-dimensional real-input DFT, the full
(logical) complex output array Y[k[0], k[1], ..., k[d-1]] has the
symmetry: Y[k[0], k[1], ..., k[d-1]] = Y[n[0] - k[0], n[1] - k[1], ...,
n[d-1] - k[d-1]]* (where each dimension is periodic). Because of this
symmetry, we only store the k[d-1] = 0...n[d-1]/2 elements of the _last_
dimension (division by 2 is rounded down). (We could instead have cut
any other dimension in half, but the last dimension proved
computationally convenient.) This results in the peculiar array format
described in more detail by *note Real-data DFT Array Format::.
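The storage rule can be sketched as a short helper that counts how many complex numbers are kept when the last dimension is cut to n[d-1]/2 + 1 elements. The function name is hypothetical, and the sketch assumes a rank of at least 1.

```c
#include <stddef.h>

/* Number of complex outputs stored for an n[0] x ... x n[rank-1]
   real-input DFT when only k[rank-1] = 0 ... n[rank-1]/2 of the last
   dimension is kept (integer division rounds down). */
size_t r2c_stored_outputs(int rank, const size_t *n)
{
    size_t count = 1;
    for (int d = 0; d < rank - 1; ++d)
        count *= n[d];
    return count * (n[rank - 1] / 2 + 1);
}
```

For example, a 4 x 6 real input stores 4 x 4 complex outputs, and a 5 x 7 real input stores 5 x 4.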
The multi-dimensional c2r transform is simply the unnormalized
inverse of the r2c transform; i.e., it is the same as FFTW's complex
backward multi-dimensional DFT, operating on a Hermitian input array in
the peculiar format mentioned above and outputting a real array (since
the DFT output is purely real).
We should remind the user that the separable product of 1d transforms
along each dimension, as computed by FFTW, is not always the same thing
as the usual multi-dimensional transform. A multi-dimensional 'R2HC'
(or 'HC2R') transform is not identical to the multi-dimensional DFT,
requiring some post-processing to combine the requisite real and
imaginary parts, as was described in *note The Halfcomplex-format DFT::.
Likewise, FFTW's multidimensional 'FFTW_DHT' r2r transform is not the
same thing as the logical multi-dimensional discrete Hartley transform
defined in the literature, as discussed in *note The Discrete Hartley
Transform::.

File: fftw3.info, Node: Multi-threaded FFTW, Next: Distributed-memory FFTW with MPI, Prev: FFTW Reference, Up: Top
5 Multi-threaded FFTW
*********************
In this chapter we document the parallel FFTW routines for shared-memory
parallel hardware. These routines, which support parallel one- and
multi-dimensional transforms of both real and complex data, are the
easiest way to take advantage of multiple processors with FFTW. They
work just like the corresponding uniprocessor transform routines, except
that you have an extra initialization routine to call, and there is a
routine to set the number of threads to employ. Any program that uses
the uniprocessor FFTW can therefore be trivially modified to use the
multi-threaded FFTW.
A shared-memory machine is one in which all CPUs can directly access
the same main memory, and such machines are now common due to the
ubiquity of multi-core CPUs. FFTW's multi-threading support allows you
to utilize these additional CPUs transparently from a single program.
However, this does not necessarily translate into performance
gains--when multiple threads/CPUs are employed, there is an overhead
required for synchronization that may outweigh the computational
parallelism. Therefore, you can only benefit from threads if your
problem is sufficiently large.
* Menu:
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Thread safety::

File: fftw3.info, Node: Installation and Supported Hardware/Software, Next: Usage of Multi-threaded FFTW, Prev: Multi-threaded FFTW, Up: Multi-threaded FFTW
5.1 Installation and Supported Hardware/Software
================================================
All of the FFTW threads code is located in the 'threads' subdirectory of
the FFTW package. On Unix systems, the FFTW threads libraries and
header files can be automatically configured, compiled, and installed
along with the uniprocessor FFTW libraries simply by including
'--enable-threads' in the flags to the 'configure' script (*note
Installation on Unix::), or '--enable-openmp' to use OpenMP
(http://www.openmp.org) threads.
The threads routines require your operating system to have some sort
of shared-memory threads support. Specifically, the FFTW threads
package works with POSIX threads (available on most Unix variants, from
GNU/Linux to MacOS X) and Win32 threads. OpenMP threads, which are
supported in many common compilers (e.g. gcc), are also supported, and
may give better performance on some systems. (OpenMP threads are also
useful if you are employing OpenMP in your own code, in order to
minimize conflicts between threading models.) If you have a
shared-memory machine that uses a different threads API, it should be a
simple matter of programming to include support for it; see the file
'threads/threads.c' for more detail.
You can compile FFTW with _both_ '--enable-threads' and
'--enable-openmp' at the same time, since they install libraries with
different names ('fftw3_threads' and 'fftw3_omp', as described below).
However, your programs may only link to _one_ of these two libraries at
a time.
Ideally, of course, you should also have multiple processors in order
to get any benefit from the threaded transforms.

File: fftw3.info, Node: Usage of Multi-threaded FFTW, Next: How Many Threads to Use?, Prev: Installation and Supported Hardware/Software, Up: Multi-threaded FFTW
5.2 Usage of Multi-threaded FFTW
================================
Here, it is assumed that the reader is already familiar with the usage
of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the
multi-threaded routines.
First, programs using the parallel complex transforms should be
linked with '-lfftw3_threads -lfftw3 -lm' on Unix, or '-lfftw3_omp
-lfftw3 -lm' if you compiled with OpenMP. You will also need to link
with whatever library is responsible for threads on your system (e.g.
'-lpthread' on GNU/Linux) or include whatever compiler flag enables
OpenMP (e.g. '-fopenmp' with gcc).
Second, before calling _any_ FFTW routines, you should call the
function:
int fftw_init_threads(void);
This function, which need only be called once, performs any one-time
initialization required to use threads on your system. It returns zero
if there was some error (which should not happen under normal
circumstances) and a non-zero value otherwise.
Third, before creating a plan that you want to parallelize, you
should call:
void fftw_plan_with_nthreads(int nthreads);
The 'nthreads' argument indicates the number of threads you want FFTW
to use (or actually, the maximum number). All plans subsequently
created with any planner routine will use that many threads. You can
call 'fftw_plan_with_nthreads', create some plans, call
'fftw_plan_with_nthreads' again with a different argument, and create
some more plans for a new number of threads. Plans already created
before a call to 'fftw_plan_with_nthreads' are unaffected. If you pass
an 'nthreads' argument of '1' (the default), threads are disabled for
subsequent plans.
You can determine the current number of threads that the planner can
use by calling:
int fftw_planner_nthreads(void);
With OpenMP, to configure FFTW to use all of the currently running
OpenMP threads (set by 'omp_set_num_threads(nthreads)' or by the
'OMP_NUM_THREADS' environment variable), you can do:
'fftw_plan_with_nthreads(omp_get_max_threads())'. (The 'omp_' OpenMP
functions are declared via '#include <omp.h>'.)
Given a plan, you then execute it as usual with 'fftw_execute(plan)',
and the execution will use the number of threads specified when the plan
was created. When done, you destroy it as usual with
'fftw_destroy_plan'. As described in *note Thread safety::, plan
_execution_ is thread-safe, but plan creation and destruction are _not_:
you should create/destroy plans only from a single thread, but can
safely execute multiple plans in parallel.
There is one additional routine: if you want to get rid of all memory
and other resources allocated internally by FFTW, you can call:
void fftw_cleanup_threads(void);
which is much like the 'fftw_cleanup()' function except that it also
gets rid of threads-related data. You must _not_ execute any previously
created plans after calling this function.
We should also mention one other restriction: if you save wisdom from
a program using the multi-threaded FFTW, that wisdom _cannot be used_ by
a program using only the single-threaded FFTW (i.e. not calling
'fftw_init_threads'). *Note Words of Wisdom-Saving Plans::.
Finally, FFTW provides an optional callback interface that allows you
to replace its parallel threading backend at runtime:
void fftw_threads_set_callback(
    void (*parallel_loop)(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data),
    void *data);
This routine (which is _not_ threadsafe and should generally be
called before creating any FFTW plans) allows you to provide a function
'parallel_loop' that executes parallel work for FFTW: it should call the
function 'work(jobdata + elsize*i)' for 'i' from '0' to 'njobs-1',
possibly in parallel. (The 'data' pointer supplied to
'fftw_threads_set_callback' is passed through to your 'parallel_loop'
function.) For example, if you link to an FFTW threads library built to
use POSIX threads, but you want it to use OpenMP instead (because you
are using OpenMP elsewhere in your program and want to avoid competing
threads), you can call 'fftw_threads_set_callback' with the callback
function:
void parallel_loop(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data)
{
#pragma omp parallel for
    for (int i = 0; i < njobs; ++i)
        work(jobdata + elsize * i);
}
The same mechanism could be used in order to make FFTW use a
threading backend implemented via Intel TBB, Apple GCD, or Cilk, for
example.
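To make the callback contract concrete without linking against FFTW, the sketch below pairs a purely serial loop, which has the same parameter list as the OpenMP callback above, with a hypothetical job record and work function of our own. In a real program the loop function would be handed to 'fftw_threads_set_callback' and FFTW would supply 'work' and 'jobdata'; the stand-ins here only show that 'work' is called once per job, on 'jobdata + elsize*i'.

```c
#include <stddef.h>

/* A serial stand-in for a parallel backend: it calls work() once per
   job, in order, on the calling thread. */
void serial_parallel_loop(void *(*work)(char *), char *jobdata,
                          size_t elsize, int njobs, void *data)
{
    (void)data; /* unused in this sketch */
    for (int i = 0; i < njobs; ++i)
        work(jobdata + elsize * i);
}

/* Hypothetical job record and work function, standing in for FFTW's
   internal ones so that the loop can be exercised on its own. */
struct demo_job {
    int input;
    int output;
};

void *demo_work(char *job)
{
    struct demo_job *j = (struct demo_job *)job;
    j->output = 2 * j->input;
    return NULL;
}
```

A serial loop of this kind can be handy when debugging, because it removes all concurrency while leaving the rest of the program unchanged.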

File: fftw3.info, Node: How Many Threads to Use?, Next: Thread safety, Prev: Usage of Multi-threaded FFTW, Up: Multi-threaded FFTW
5.3 How Many Threads to Use?
============================
There is a fair amount of overhead involved in synchronizing threads, so
the optimal number of threads to use depends upon the size of the
transform as well as on the number of processors you have.
As a general rule, you don't want to use more threads than you have
processors. (Using more threads will work, but there will be extra
overhead with no benefit.) In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.
You will have to experiment with your system to see what level of
parallelization is best for your problem size. Typically, the problem
will have to involve at least a few thousand data points before threads
become beneficial. If you plan with 'FFTW_PATIENT', it will
automatically disable threads for sizes that don't benefit from
parallelization.

File: fftw3.info, Node: Thread safety, Prev: How Many Threads to Use?, Up: Multi-threaded FFTW
5.4 Thread safety
=================
Users writing multi-threaded programs (including OpenMP) must concern
themselves with the "thread safety" of the libraries they use--that is,
whether it is safe to call routines in parallel from multiple threads.
FFTW can be used in such an environment, but some care must be taken
because the planner routines share data (e.g. wisdom and trigonometric
tables) between calls and plans.
The upshot is that the only thread-safe routine in FFTW is
'fftw_execute' (and the new-array variants thereof). All other routines
(e.g. the planner) should only be called from one thread at a time.
So, for example, you can wrap a semaphore lock around any calls to the
planner; even more simply, you can just create all of your plans from
one thread. We do not think this should be an important restriction
(FFTW is designed for the situation where the only performance-sensitive
code is the actual execution of the transform), and the benefits of
shared data between plans are great.
Note also that, since the plan is not modified by 'fftw_execute', it
is safe to execute the _same plan_ in parallel by multiple threads.
However, since a given plan operates by default on a fixed array, you
need to use one of the new-array execute functions (*note New-array
Execute Functions::) so that different threads compute the transform of
different data.
(Users should note that these comments only apply to programs using
shared-memory threads or OpenMP. Parallelism using MPI or forked
processes involves a separate address-space and global variables for
each process, and is not susceptible to problems of this sort.)
The FFTW planner is intended to be called from a single thread. If
you really must call it from multiple threads, you are expected to grab
whatever lock makes sense for your application, with the understanding
that you may be holding that lock for a long time, which is undesirable.
Neither strategy works, however, in the following situation. The
"application" is structured as a set of "plugins" which are unaware of
each other, and for whatever reason the "plugins" cannot coordinate on
grabbing the lock. (This is not a technical problem, but an
organizational one. The "plugins" are written by independent agents,
and from the perspective of each plugin's author, each plugin is using
FFTW correctly from a single thread.) To cope with this situation,
starting from FFTW-3.3.5, FFTW supports an API to make the planner
thread-safe:
void fftw_make_planner_thread_safe(void);
This call operates by brute force: It just installs a hook that wraps
a lock (chosen by us) around all planner calls. So there is no magic
and you get the worst of all worlds. The planner is still
single-threaded, but you cannot choose which lock to use. The planner
still holds the lock for a long time, but you cannot impose a timeout on
lock acquisition. As of FFTW-3.3.5 and FFTW-3.3.6, this call does not
work when using OpenMP as threading substrate. (Suggestions on what to
do about this bug are welcome.) _Do not use
'fftw_make_planner_thread_safe' unless there is no other choice,_ such
as in the application/plugin situation.

File: fftw3.info, Node: Distributed-memory FFTW with MPI, Next: Calling FFTW from Modern Fortran, Prev: Multi-threaded FFTW, Up: Top
6 Distributed-memory FFTW with MPI
**********************************
In this chapter we document the parallel FFTW routines for parallel
systems supporting the MPI message-passing interface. Unlike the
shared-memory threads described in the previous chapter, MPI allows you
to use _distributed-memory_ parallelism, where each CPU has its own
separate memory, and which can scale up to clusters of many thousands of
processors. This capability comes at a price, however: each process
only stores a _portion_ of the data to be transformed, which means that
the data structures and programming-interface are quite different from
the serial or threads versions of FFTW.
Distributed-memory parallelism is especially useful when you are
transforming arrays so large that they do not fit into the memory of a
single processor. The storage per-process required by FFTW's MPI
routines is proportional to the total array size divided by the number
of processes. Conversely, distributed-memory parallelism can easily
pose an unacceptably high communications overhead for small problems;
the threshold problem size for which parallelism becomes advantageous
will depend on the precise problem you are interested in, your hardware,
and your MPI implementation.
A note on terminology: in MPI, you divide the data among a set of
"processes" which each run in their own memory address space.
Generally, each process runs on a different physical processor, but this
is not required. A set of processes in MPI is described by an opaque
data structure called a "communicator," the most common of which is the
predefined communicator 'MPI_COMM_WORLD' which refers to _all_
processes. For more information on these and other concepts common to
all MPI programs, we refer the reader to the documentation at the MPI
home page (http://www.mcs.anl.gov/research/projects/mpi/).
We assume in this chapter that the reader is familiar with the usage
of the serial (uniprocessor) FFTW, and focus only on the concepts new to
the MPI interface.
* Menu:
* FFTW MPI Installation::
* Linking and Initializing MPI FFTW::
* 2d MPI example::
* MPI Data Distribution::
* Multi-dimensional MPI DFTs of Real Data::
* Other Multi-dimensional Real-data MPI Transforms::
* FFTW MPI Transposes::
* FFTW MPI Wisdom::
* Avoiding MPI Deadlocks::
* FFTW MPI Performance Tips::
* Combining MPI and Threads::
* FFTW MPI Reference::
* FFTW MPI Fortran Interface::

File: fftw3.info, Node: FFTW MPI Installation, Next: Linking and Initializing MPI FFTW, Prev: Distributed-memory FFTW with MPI, Up: Distributed-memory FFTW with MPI
6.1 FFTW MPI Installation
=========================
All of the FFTW MPI code is located in the 'mpi' subdirectory of the
FFTW package. On Unix systems, the FFTW MPI libraries and header files
are automatically configured, compiled, and installed along with the
uniprocessor FFTW libraries simply by including '--enable-mpi' in the
flags to the 'configure' script (*note Installation on Unix::).
Any implementation of the MPI standard, version 1 or later, should
work with FFTW. The 'configure' script will attempt to automatically
detect how to compile and link code using your MPI implementation. In
some cases, especially if you have multiple different MPI
implementations installed or have an unusual MPI software package, you
may need to provide this information explicitly.
Most commonly, one compiles MPI code by invoking a special compiler
command, typically 'mpicc' for C code. The 'configure' script knows the
most common names for this command, but you can specify the MPI
compilation command explicitly by setting the 'MPICC' variable, as in
'./configure MPICC=mpicc ...'.
If, instead of a special compiler command, you need to link a certain
library, you can specify the link command via the 'MPILIBS' variable, as
in './configure MPILIBS=-lmpi ...'. Note that if your MPI library is
installed in a non-standard location (one the compiler does not know
about by default), you may also have to specify the location of the
library and header files via 'LDFLAGS' and 'CPPFLAGS' variables,
respectively, as in './configure LDFLAGS=-L/path/to/mpi/libs
CPPFLAGS=-I/path/to/mpi/include ...'.

File: fftw3.info, Node: Linking and Initializing MPI FFTW, Next: 2d MPI example, Prev: FFTW MPI Installation, Up: Distributed-memory FFTW with MPI
6.2 Linking and Initializing MPI FFTW
=====================================
Programs using the MPI FFTW routines should be linked with '-lfftw3_mpi
-lfftw3 -lm' on Unix in double precision, '-lfftw3f_mpi -lfftw3f -lm' in
single precision, and so on (*note Precision::). You will also need to
link with whatever library is responsible for MPI on your system; in
most MPI implementations, there is a special compiler alias named
'mpicc' to compile and link MPI code.
Before calling any FFTW routines except possibly 'fftw_init_threads'
(*note Combining MPI and Threads::), but after calling 'MPI_Init', you
should call the function:
void fftw_mpi_init(void);
If, at the end of your program, you want to get rid of all memory and
other resources allocated internally by FFTW, for both the serial and
MPI routines, you can call:
void fftw_mpi_cleanup(void);
which is much like the 'fftw_cleanup()' function except that it also
gets rid of FFTW's MPI-related data. You must _not_ execute any
previously created plans after calling this function.

File: fftw3.info, Node: 2d MPI example, Next: MPI Data Distribution, Prev: Linking and Initializing MPI FFTW, Up: Distributed-memory FFTW with MPI
6.3 2d MPI example
==================
Before we document the FFTW MPI interface in detail, we begin with a
simple example outlining how one would perform a two-dimensional 'N0' by
'N1' complex DFT.
#include <fftw3-mpi.h>
int main(int argc, char **argv)
{
    const ptrdiff_t N0 = ..., N1 = ...;
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
    MPI_Init(&argc, &argv);
    fftw_mpi_init();
    /* get local data size and allocate */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);
    /* create plan for in-place forward DFT */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);
    /* initialize data to some function my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
        data[i*N1 + j] = my_function(local_0_start + i, j);
    /* compute transforms, in-place, as many times as desired */
    fftw_execute(plan);
    fftw_destroy_plan(plan);
    MPI_Finalize();
}
As can be seen above, the MPI interface follows the same basic style
of allocate/plan/execute/destroy as the serial FFTW routines. All of
the MPI-specific routines are prefixed with 'fftw_mpi_' instead of
'fftw_'. There are a few important differences, however:
First, we must call 'fftw_mpi_init()' after calling 'MPI_Init'
(required in all MPI programs) and before calling any other 'fftw_mpi_'
routine.
Second, when we create the plan with 'fftw_mpi_plan_dft_2d',
analogous to 'fftw_plan_dft_2d', we pass an additional argument: the
communicator, indicating which processes will participate in the
transform (here 'MPI_COMM_WORLD', indicating all processes). Whenever
you create, execute, or destroy a plan for an MPI transform, you must
call the corresponding FFTW routine on _all_ processes in the
communicator for that transform. (That is, these are _collective_
calls.) Note that the plan for the MPI transform uses the standard
'fftw_execute' and 'fftw_destroy_plan' routines (on the other hand, there are
MPI-specific new-array execute functions documented below).
Third, all of the FFTW MPI routines take 'ptrdiff_t' arguments
instead of 'int' as for the serial FFTW. 'ptrdiff_t' is a standard C
integer type which is (at least) 32 bits wide on a 32-bit machine and 64
bits wide on a 64-bit machine. This is to make it easy to specify very
large parallel transforms on a 64-bit machine. (You can specify 64-bit
transform sizes in the serial FFTW, too, but only by using the 'guru64'
planner interface. *Note 64-bit Guru Interface::.)
Fourth, and most importantly, you don't allocate the entire
two-dimensional array on each process. Instead, you call
'fftw_mpi_local_size_2d' to find out what _portion_ of the array resides
on each process, and how much space to allocate. Here, the portion of
the array on each process is a 'local_n0' by 'N1' slice of the total
array, starting at index 'local_0_start'. The total number of
'fftw_complex' numbers to allocate is given by the 'alloc_local' return
value, which _may_ be greater than 'local_n0 * N1' (in case some
intermediate calculations require additional storage). The data
distribution in FFTW's MPI interface is described in more detail by the
next section.
Given the portion of the array that resides on the local process, it
is straightforward to initialize the data (here to a function
'my_function') and otherwise manipulate it. Of course, at the end of the
program you may want to output the data somehow, but synchronizing this
output is up to you and is beyond the scope of this manual. (One good
way to output a large multi-dimensional distributed array in MPI to a
portable binary file is to use the free HDF5 library; see the HDF home
page (http://www.hdfgroup.org/).)

File: fftw3.info, Node: MPI Data Distribution, Next: Multi-dimensional MPI DFTs of Real Data, Prev: 2d MPI example, Up: Distributed-memory FFTW with MPI
6.4 MPI Data Distribution
=========================
The most important concept to understand in using FFTW's MPI interface
is the data distribution. With a serial or multithreaded FFT, all of
the inputs and outputs are stored as a single contiguous chunk of
memory. With a distributed-memory FFT, the inputs and outputs are
broken into disjoint blocks, one per process.
In particular, FFTW uses a _1d block distribution_ of the data,
distributed along the _first dimension_. For example, if you want to
perform a 100 x 200 complex DFT, distributed over 4 processes, each
process will get a 25 x 200 slice of the data. That is, process 0 will
get rows 0 through 24, process 1 will get rows 25 through 49, process 2
will get rows 50 through 74, and process 3 will get rows 75 through 99.
If you take the same array but distribute it over 3 processes, then it
is not evenly divisible so the different processes will have unequal
chunks. FFTW's default choice in this case is to assign 34 rows to
processes 0 and 1, and 32 rows to process 2.
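Both examples above are consistent with a default block size of n0 / P rounded up. Assuming that rule, the sketch below finds which process holds a given row. The function names are hypothetical, and in a real program you should still ask FFTW for the distribution rather than compute it yourself.

```c
#include <stddef.h>

/* Assumed default block size: n0 / P, rounded up. */
ptrdiff_t default_block(ptrdiff_t n0, int nprocs)
{
    return (n0 + nprocs - 1) / nprocs;
}

/* Rank of the process that holds a given global row (0-based) under
   the assumed default distribution. */
int owner_of_row(ptrdiff_t row, ptrdiff_t n0, int nprocs)
{
    return (int)(row / default_block(n0, nprocs));
}
```

With 100 rows on 3 processes the block size is 34, so rows 0 through 33 fall on process 0, rows 34 through 67 on process 1, and rows 68 through 99 on process 2.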
FFTW provides several 'fftw_mpi_local_size' routines that you can
call to find out what portion of an array is stored on the current
process. In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size. For
example, with a 100 x 200 array on three processes, you can tell FFTW to
use a block size of 40, which would assign 40 rows to processes 0 and 1,
and 20 rows to process 2. FFTW's default is to divide the data equally
among the processes if possible, and as best it can otherwise. The rows
are always assigned in "rank order," i.e. process 0 gets the first
block of rows, then process 1, and so on. (You can change this by using
'MPI_Comm_split' to create a new communicator with re-ordered
processes.) However, you should always call the 'fftw_mpi_local_size'
routines, if possible, rather than trying to predict FFTW's distribution
choices.
In particular, it is critical that you allocate the storage size that
is returned by 'fftw_mpi_local_size', which is _not_ necessarily the
size of the local slice of the array. The reason is that intermediate
steps of FFTW's algorithms involve transposing the array and
redistributing the data, so at these intermediate steps FFTW may require
more local storage space (albeit always proportional to the total size
divided by the number of processes). The 'fftw_mpi_local_size'
functions know how much storage is required for these intermediate steps
and tell you the correct amount to allocate.
* Menu:
* Basic and advanced distribution interfaces::
* Load balancing::
* Transposed distributions::
* One-dimensional distributions::

File: fftw3.info, Node: Basic and advanced distribution interfaces, Next: Load balancing, Prev: MPI Data Distribution, Up: MPI Data Distribution
6.4.1 Basic and advanced distribution interfaces
------------------------------------------------
As with the planner interface, the 'fftw_mpi_local_size' distribution
interface is broken into basic and advanced ('_many') interfaces, where
the latter allows you to specify the block size manually and also to
request block sizes when computing multiple transforms simultaneously.
These functions are documented more exhaustively by the FFTW MPI
Reference, but we summarize the basic ideas here using a couple of
two-dimensional examples.
For the 100 x 200 complex-DFT example, above, we would find the
distribution by calling the following function in the basic interface:
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
Given the total size of the data to be transformed (here, 'n0 = 100'
and 'n1 = 200') and an MPI communicator ('comm'), this function provides
three numbers.
First, it describes the shape of the local data: the current process
should store a 'local_n0' by 'n1' slice of the overall dataset, in
row-major order ('n1' dimension contiguous), starting at index
'local_0_start'. That is, if the total dataset is viewed as a 'n0' by
'n1' matrix, the current process should store the rows 'local_0_start'
to 'local_0_start+local_n0-1'. Obviously, if you are running with only
a single MPI process, that process will store the entire array:
'local_0_start' will be zero and 'local_n0' will be 'n0'. *Note
Row-major Format::.
Second, the return value is the total number of data elements (e.g.,
complex numbers for a complex DFT) that should be allocated for the
input and output arrays on the current process (ideally with
'fftw_malloc' or an 'fftw_alloc' function, to ensure optimal alignment).
It might seem that this should always be equal to 'local_n0 * n1', but
this is _not_ the case. FFTW's distributed FFT algorithms require data
redistributions at intermediate stages of the transform, and in some
circumstances this may require slightly larger local storage. This is
discussed in more detail below, under *note Load balancing::.
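Given 'local_n0' and 'local_0_start', translating a global (row, column) position into an offset within the local array is a matter of row-major arithmetic. The helper below is a sketch under a hypothetical name; it returns -1 when the requested row is stored on another process.

```c
#include <stddef.h>

/* Offset of global element (row, col) inside the local row-major slice
   of local_n0 rows by n1 columns, or -1 if that row is not stored on
   this process. */
ptrdiff_t local_offset(ptrdiff_t row, ptrdiff_t col, ptrdiff_t n1,
                       ptrdiff_t local_n0, ptrdiff_t local_0_start)
{
    if (row < local_0_start || row >= local_0_start + local_n0)
        return -1;
    return (row - local_0_start) * n1 + col;
}
```

For the 100 x 200 example on 4 processes, process 1 has 'local_0_start = 25' and 'local_n0 = 25', so global element (26, 3) sits at local offset 203.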
The advanced-interface 'local_size' function for multidimensional
transforms returns the same three things ('local_n0', 'local_0_start',
and the total number of elements to allocate), but takes more inputs:
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
ptrdiff_t howmany,
ptrdiff_t block0,
MPI_Comm comm,
ptrdiff_t *local_n0,
ptrdiff_t *local_0_start);
The two-dimensional case above corresponds to 'rnk = 2' and an array
'n' of length 2 with 'n[0] = n0' and 'n[1] = n1'. This routine is for
any 'rnk > 1'; one-dimensional transforms have their own interface
because they work slightly differently, as discussed below.
First, the advanced interface allows you to perform multiple
transforms at once, of interleaved data, as specified by the 'howmany'
parameter. ('howmany' is 1 for a single transform.)
Second, here you can specify your desired block size in the 'n0'
dimension, 'block0'. To use FFTW's default block size, pass
'FFTW_MPI_DEFAULT_BLOCK' (0) for 'block0'. Otherwise, on 'P' processes,
FFTW will return 'local_n0' equal to 'block0' on the first 'n0 / block0'
processes (rounded down), return 'local_n0' equal to 'n0 - block0 * (n0 /
block0)' on the next process, and 'local_n0' equal to zero on any
remaining processes. In general, we recommend using the default block
size (which corresponds to 'n0 / P', rounded up).
For example, suppose you have 'P = 4' processes and 'n0 = 21'. The
default will be a block size of '6', which will give 'local_n0 = 6' on
the first three processes and 'local_n0 = 3' on the last process.
Instead, however, you could specify 'block0 = 5' if you wanted, which
would give 'local_n0 = 5' on processes 0 to 2, 'local_n0 = 6' on process
3. (This choice, while it may look superficially more "balanced," has
the same critical path as FFTW's default but requires more
communications.)
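The effect of a given block size can be sketched as follows: each process takes 'block0' consecutive rows in rank order until the rows run out. The function name is hypothetical, and the sketch assumes that 'block0' times the number of processes is at least 'n0', so that every row is assigned to some process.

```c
#include <stddef.h>

/* Rows held by `rank` when rows are handed out in blocks of block0,
   in rank order.  Assumes block0 > 0 and block0 * P >= n0. */
ptrdiff_t rows_on_rank(ptrdiff_t n0, ptrdiff_t block0, ptrdiff_t rank)
{
    ptrdiff_t start = block0 * rank;   /* first row this rank would own */
    if (start >= n0)
        return 0;                      /* all rows already assigned */
    ptrdiff_t remaining = n0 - start;
    return remaining < block0 ? remaining : block0;
}
```

With 100 rows and a block size of 40, three processes receive 40, 40, and 20 rows; with 21 rows and the default block size of 6, four processes receive 6, 6, 6, and 3 rows.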

File: fftw3.info, Node: Load balancing, Next: Transposed distributions, Prev: Basic and advanced distribution interfaces, Up: MPI Data Distribution
6.4.2 Load balancing
--------------------
Ideally, when you parallelize a transform over some P processes, each
process should end up with work that takes equal time. Otherwise, all
of the processes end up waiting on whichever process is slowest. This
goal is known as "load balancing." In this section, we describe the
circumstances under which FFTW is able to load-balance well, and in
particular how you should choose your transform size in order to load
balance.
Load balancing is especially difficult when you are parallelizing
over heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
Pentium more work to do than the 486 since the latter is much slower.
FFTW does not deal with this problem, however--it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.
For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the _first_ dimension 'n0' is
divisible by P; and (ii), the _product_ of the subsequent dimensions is
divisible by P. (For the advanced interface, where you can specify
multiple simultaneous transforms via some "vector" length 'howmany', a
factor of 'howmany' is included in the product of the subsequent
dimensions.)
For a one-dimensional complex DFT, the length 'N' of the data should
be divisible by P _squared_ to be able to divide the problem equally
among the processes.
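As a minimal sketch, the divisibility conditions above can be checked before choosing a transform size. The function names below are hypothetical (they are not FFTW routines), and the check only restates the conditions in this section; it says nothing about how FFTW behaves when they fail.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a multi-dimensional (rnk >= 2) complex DFT with
   dimensions n[0..rnk-1] and vector length howmany satisfies both
   conditions from the text for P processes, else 0. */
int balanced_multidim(int rnk, const ptrdiff_t *n,
                      ptrdiff_t howmany, ptrdiff_t P)
{
    ptrdiff_t rest;
    int i;
    assert(rnk >= 2 && howmany > 0 && P > 0);
    /* product of howmany and n[1..rnk-1], reduced modulo P as we go */
    rest = howmany % P;
    for (i = 1; i < rnk; ++i)
        rest = (rest * (n[i] % P)) % P;
    return n[0] % P == 0 && rest == 0;
}

/* Returns 1 if a one-dimensional length N is divisible by P squared. */
int balanced_1d(ptrdiff_t N, ptrdiff_t P)
{
    assert(N > 0 && P > 0);
    return N % (P * P) == 0;
}
```

For instance, a 16 x 3 x 5 transform on 4 processes fails the second condition (15 is not divisible by 4), but passes once 'howmany = 4' is included in the product.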

File: fftw3.info, Node: Transposed distributions, Next: One-dimensional distributions, Prev: Load balancing, Up: MPI Data Distribution
6.4.3 Transposed distributions
------------------------------
Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then by globally
_transposing_ the data in some fashion to redistribute the data among
the processes, transforming the new data local to each process, and
transposing back. For example, a two-dimensional 'n0' by 'n1' array,
distributed across the 'n0' dimension, is transformed by: (i)
transforming the 'n1' dimension, which is local to each process; (ii)
transposing to an 'n1' by 'n0' array, distributed across the 'n1'
dimension; (iii) transforming the 'n0' dimension, which is now local to
each process; (iv) transposing back.
However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., 'n1' by 'n0' in two dimensions). This provides a significant
performance advantage, because it means that the final transposition
step can be omitted. FFTW supports this optimization, which you specify
by passing the flag 'FFTW_MPI_TRANSPOSED_OUT' to the planner routines.
To compute the inverse transform of transposed output, you specify
'FFTW_MPI_TRANSPOSED_IN' to tell it that the input is transposed. In
this section, we explain how to interpret the output format of such a
transform.
Suppose you are transforming multi-dimensional data with (at
least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always,
it is distributed along the first dimension n[0] . Now, if we compute
its DFT with the 'FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output
data are stored with the first _two_ dimensions transposed: n[1] x n[0]
x n[2] x ... x n[d-1] , distributed along the n[1] dimension.
Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and
transform it with the 'FFTW_MPI_TRANSPOSED_IN' flag, then the format
goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array.
There are two ways to find the portion of the transposed array that
resides on the current process. First, you can simply call the
appropriate 'local_size' function, passing n[1] x n[0] x n[2] x ... x
n[d-1] (the transposed dimensions). This would mean calling the
'local_size' function twice, once for the transposed and once for the
non-transposed dimensions. Alternatively, you can call one of the
'local_size_transposed' functions, which return both the non-transposed
and transposed data distribution from a single call. For example, for a
3d transform with transposed output (or input), you might call:
ptrdiff_t fftw_mpi_local_size_3d_transposed(
ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Here, 'local_n0' and 'local_0_start' give the size and starting index
of the 'n0' dimension for the _non_-transposed data, as in the previous
sections. For _transposed_ data (e.g. the output for
'FFTW_MPI_TRANSPOSED_OUT'), 'local_n1' and 'local_1_start' give the size
and starting index of the 'n1' dimension, which is the first dimension
of the transposed data ('n1' by 'n0' by 'n2').
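To make the transposed layout concrete, here is a minimal sketch of where a given global element lands in the local part of the transposed 'n1' by 'n0' by 'n2' array, assuming the usual row-major order. The helper 'transposed_offset_3d' is hypothetical and not part of FFTW; 'local_n1' and 'local_1_start' are the values returned by the 'local_size_transposed' call.

```c
#include <assert.h>
#include <stddef.h>

/* Local offset (in elements) of global element (i0, i1, i2) in the
   transposed n1 x n0 x n2 layout, on the process that owns rows
   [local_1_start, local_1_start + local_n1) of the n1 dimension.
   Returns -1 if index i1 is not stored on this process. */
ptrdiff_t transposed_offset_3d(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                               ptrdiff_t n0, ptrdiff_t n2,
                               ptrdiff_t local_n1, ptrdiff_t local_1_start)
{
    assert(i0 >= 0 && i0 < n0 && i2 >= 0 && i2 < n2);
    if (i1 < local_1_start || i1 >= local_1_start + local_n1)
        return -1;
    /* n1 is now the slowest-varying (distributed) dimension */
    return ((i1 - local_1_start) * n0 + i0) * n2 + i2;
}
```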
(Note that 'FFTW_MPI_TRANSPOSED_IN' is completely equivalent to
performing 'FFTW_MPI_TRANSPOSED_OUT' and passing the first two
dimensions to the planner in reverse order, or vice versa. If you pass
_both_ the 'FFTW_MPI_TRANSPOSED_IN' and 'FFTW_MPI_TRANSPOSED_OUT' flags,
it is equivalent to swapping the first two dimensions passed to the
planner and passing _neither_ flag.)

File: fftw3.info, Node: One-dimensional distributions, Prev: Transposed distributions, Up: MPI Data Distribution
6.4.4 One-dimensional distributions
-----------------------------------
For one-dimensional distributed DFTs using FFTW, matters are slightly
more complicated because the data distribution is more closely tied to
how the algorithm works. In particular, you can no longer pass an
arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output. Also, the data
distribution depends on the flags and transform direction, in order for
forward and backward transforms to work correctly.
ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
int sign, unsigned flags,
ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
ptrdiff_t *local_no, ptrdiff_t *local_o_start);
This function computes the data distribution for a 1d transform of
size 'n0' with the given transform 'sign' and 'flags'. Both input and
output data use block distributions. The input on the current process
will consist of 'local_ni' numbers starting at index 'local_i_start';
e.g. if only a single process is used, then 'local_ni' will be 'n0' and
'local_i_start' will be '0'. Similarly for the output, with 'local_no'
numbers starting at index 'local_o_start'. The return value of
'fftw_mpi_local_size_1d' will be the total number of elements to
allocate on the current process (which might be slightly larger than the
local size due to intermediate steps in the algorithm).
As mentioned above (*note Load balancing::), the data will be divided
equally among the processes if 'n0' is divisible by the _square_ of the
number of processes. In this case, 'local_ni' will equal 'local_no'.
Otherwise, they may be different.
For some applications, such as convolutions, the order of the output
data is irrelevant. In this case, performance can be improved by
specifying that the output data be stored in an FFTW-defined "scrambled"
format. (In particular, this is the analogue of transposed output in
the multidimensional case: scrambled output saves a communications
step.) If you pass 'FFTW_MPI_SCRAMBLED_OUT' in the flags, then the
output is stored in this (undocumented) scrambled order. Conversely, to
perform the inverse transform of data in scrambled order, pass the
'FFTW_MPI_SCRAMBLED_IN' flag.
In MPI FFTW, only composite sizes 'n0' can be parallelized; we have
not yet implemented a parallel algorithm for large prime sizes.
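Since only composite sizes are parallelized, a caller may want to detect a prime 'n0' before planning. The following is a small sketch using trial division; 'is_composite' is a hypothetical helper, not an FFTW function, and what FFTW does when given a prime size is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if n0 has a divisor other than 1 and itself, else 0.
   Trial division up to the square root of n0. */
int is_composite(ptrdiff_t n0)
{
    ptrdiff_t d;
    assert(n0 > 0);
    for (d = 2; d <= n0 / d; ++d)
        if (n0 % d == 0) return 1;
    return 0;
}
```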

File: fftw3.info, Node: Multi-dimensional MPI DFTs of Real Data, Next: Other Multi-dimensional Real-data MPI Transforms, Prev: MPI Data Distribution, Up: Distributed-memory FFTW with MPI
6.5 Multi-dimensional MPI DFTs of Real Data
===========================================
FFTW's MPI interface also supports multi-dimensional DFTs of real data,
similar to the serial r2c and c2r interfaces. (Parallel one-dimensional
real-data DFTs are not currently supported; you must use a complex
transform and set the imaginary parts of the inputs to zero.)
The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms), are:
* Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1] x
n[2] x ... x n[d-1] real data to/from n[0] x n[1] x n[2] x ... x
(n[d-1]/2 + 1) complex data: the last dimension of the complex data
is cut in half (rounded down), plus one. As for the serial
transforms, the sizes you pass to the 'plan_dft_r2c' and
'plan_dft_c2r' are the n[0] x n[1] x n[2] x ... x n[d-1]
dimensions of the real data.
* Although the real data is _conceptually_ n[0] x n[1] x n[2] x ...
x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x ...
x [2 (n[d-1]/2 + 1)] array, where the last dimension has been
_padded_ to make it the same size as the complex output. This is
much like the in-place serial r2c/c2r interface (*note
Multi-Dimensional DFTs of Real Data::), except that in MPI the
padding is required even for out-of-place data. The extra padding
numbers are ignored by FFTW (they are _not_ like zero-padding the
transform to a larger size); they are only used to determine the
data layout.
* The data distribution in MPI for _both_ the real and complex data
is determined by the shape of the _complex_ data. That is, you
call the appropriate 'local_size' function for the n[0] x n[1] x
n[2] x ... x (n[d-1]/2 + 1) complex data, and then use the _same_
distribution for the real data except that the last complex
dimension is replaced by a (padded) real dimension of twice the
length.
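The size arithmetic in the points above can be sketched as follows. The helper names are hypothetical (not FFTW functions); they only restate the 'n/2 + 1' and '2 (n/2 + 1)' formulas using C integer division, which rounds down for positive values.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the last dimension of the complex data: n/2 + 1,
   with integer (rounded-down) division. */
ptrdiff_t r2c_complex_last(ptrdiff_t n_last)
{
    assert(n_last > 0);
    return n_last / 2 + 1;
}

/* Length of the padded last dimension of the real data:
   twice the complex length. */
ptrdiff_t r2c_padded_real_last(ptrdiff_t n_last)
{
    return 2 * r2c_complex_last(n_last);
}
```

For a last dimension of 8, the complex length is 5 and the padded real length is 10 (two padding elements per row); for 7, they are 4 and 8 (one padding element per row).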
For example, suppose we are performing an out-of-place r2c transform
of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x M
x (N/2+1) complex data. Similar to the example in *note 2d MPI example::,
we might do something like:
#include <fftw3-mpi.h>
int main(int argc, char **argv)
{
const ptrdiff_t L = ..., M = ..., N = ...;
fftw_plan plan;
double *rin;
fftw_complex *cout;
ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
MPI_Init(&argc, &argv);
fftw_mpi_init();
/* get local data size and allocate */
alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
&local_n0, &local_0_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);
/* create plan for out-of-place r2c DFT */
plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
FFTW_MEASURE);
/* initialize rin to some function my_func(x,y,z) */
for (i = 0; i < local_n0; ++i)
for (j = 0; j < M; ++j)
for (k = 0; k < N; ++k)
rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
/* compute transforms as many times as desired */
fftw_execute(plan);
fftw_destroy_plan(plan);
MPI_Finalize();
}
Note that we allocated 'rin' using 'fftw_alloc_real' with an argument
of '2 * alloc_local': since 'alloc_local' is the number of _complex_
values to allocate, the number of _real_ values is twice as many. The
'rin' array is then local_n0 x M x 2(N/2+1) in row-major order, so its
'(i,j,k)' element is at the index '(i*M + j) * (2*(N/2+1)) + k' (*note
Multi-dimensional Array Format::).
As for the complex transforms, improved performance can be obtained
by specifying that the output is the transpose of the input or vice
versa (*note Transposed distributions::). In our L x M x N r2c example,
including 'FFTW_MPI_TRANSPOSED_OUT' in the flags means that the input
would be a padded L x M x 2(N/2+1) real array distributed over the 'L'
dimension, while the output would be an M x L x (N/2+1) complex array
distributed over the 'M' dimension. To perform the inverse c2r
transform with the same data distributions, you would use the
'FFTW_MPI_TRANSPOSED_IN' flag.

File: fftw3.info, Node: Other Multi-dimensional Real-data MPI Transforms, Next: FFTW MPI Transposes, Prev: Multi-dimensional MPI DFTs of Real Data, Up: Distributed-memory FFTW with MPI
6.6 Other multi-dimensional Real-Data MPI Transforms
====================================================
FFTW's MPI interface also supports multi-dimensional 'r2r' transforms of
all kinds supported by the serial interface (e.g. discrete cosine and
sine transforms, discrete Hartley transforms, etc.). Only
multi-dimensional 'r2r' transforms, not one-dimensional transforms, are
currently parallelized.
These are used much like the multidimensional complex DFTs discussed
above, except that the data is real rather than complex, and one needs
to pass an r2r transform kind ('fftw_r2r_kind') for each dimension as in
the serial FFTW (*note More DFTs of Real Data::).
For example, one might perform a two-dimensional L x M transform that
is an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
the second dimension with code like:
const ptrdiff_t L = ..., M = ...;
fftw_plan plan;
double *data;
ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
/* get local data size and allocate */
alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
&local_n0, &local_0_start);
data = fftw_alloc_real(alloc_local);
/* create plan for in-place REDFT10 x RODFT10 */
plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);
/* initialize data to some function my_function(x,y) */
for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
data[i*M + j] = my_function(local_0_start + i, j);
/* compute transforms, in-place, as many times as desired */
fftw_execute(plan);
fftw_destroy_plan(plan);
Notice that we use the same 'local_size' functions as we did for
complex data, only now we interpret the sizes in terms of real rather
than complex values, and correspondingly use 'fftw_alloc_real'.

File: fftw3.info, Node: FFTW MPI Transposes, Next: FFTW MPI Wisdom, Prev: Other Multi-dimensional Real-data MPI Transforms, Up: Distributed-memory FFTW with MPI
6.7 FFTW MPI Transposes
=======================
FFTW's MPI Fourier transforms rely on one or more _global
transposition_ steps for their communications. For example, the
multidimensional transforms work by transforming along some dimensions,
then transposing to make the first dimension local and transforming
that, then transposing back. Because global transposition of a
block-distributed matrix has many other potential uses besides FFTs,
FFTW's transpose routines can be called directly, as documented in this
section.
* Menu:
* Basic distributed-transpose interface::
* Advanced distributed-transpose interface::
* An improved replacement for MPI_Alltoall::

File: fftw3.info, Node: Basic distributed-transpose interface, Next: Advanced distributed-transpose interface, Prev: FFTW MPI Transposes, Up: FFTW MPI Transposes
6.7.1 Basic distributed-transpose interface
-------------------------------------------
In particular, suppose that we have an 'n0' by 'n1' array in row-major
order, block-distributed across the 'n0' dimension. To transpose this
into an 'n1' by 'n0' array block-distributed across the 'n1' dimension,
we would create a plan by calling the following function:
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
double *in, double *out,
MPI_Comm comm, unsigned flags);
The input and output arrays ('in' and 'out') can be the same. The
transpose is actually executed by calling 'fftw_execute' on the plan, as
usual.
The 'flags' are the usual FFTW planner flags, but support two
additional flags: 'FFTW_MPI_TRANSPOSED_OUT' and/or
'FFTW_MPI_TRANSPOSED_IN'. What these flags indicate, for transpose
plans, is that the output and/or input, respectively, are _locally_
transposed. That is, on each process input data is normally stored as a
'local_n0' by 'n1' array in row-major order, but for an
'FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as 'n1' by
'local_n0' in row-major order. Similarly, 'FFTW_MPI_TRANSPOSED_OUT'
means that the output is 'n0' by 'local_n1' instead of 'local_n1' by
'n0'.
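The two local input layouts described above differ only in which index varies fastest. As a minimal sketch (the helper 'transpose_input_offset' is hypothetical, not an FFTW function), the offset of local element '(i, j)' in the input array is:

```c
#include <assert.h>
#include <stddef.h>

/* Offset of local element (i, j), 0 <= i < local_n0, 0 <= j < n1, in
   the input array of a transpose plan.  transposed_in is nonzero when
   the plan was created with FFTW_MPI_TRANSPOSED_IN. */
ptrdiff_t transpose_input_offset(ptrdiff_t i, ptrdiff_t j,
                                 ptrdiff_t local_n0, ptrdiff_t n1,
                                 int transposed_in)
{
    assert(i >= 0 && i < local_n0 && j >= 0 && j < n1);
    return transposed_in ? j * local_n0 + i   /* n1 by local_n0 */
                         : i * n1 + j;        /* local_n0 by n1 */
}
```

The same reasoning applies to the output with 'FFTW_MPI_TRANSPOSED_OUT', with 'local_n1' and 'n0' in place of 'local_n0' and 'n1'.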
To determine the local size of the array on each process before and
after the transpose, as well as the amount of storage that must be
allocated, one should call 'fftw_mpi_local_size_2d_transposed', just as
for a 2d DFT as described in the previous section:
ptrdiff_t fftw_mpi_local_size_2d_transposed
(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Again, the return value is the local storage to allocate, which in
this case is the number of _real_ ('double') values rather than complex
numbers as in the previous examples.

File: fftw3.info, Node: Advanced distributed-transpose interface, Next: An improved replacement for MPI_Alltoall, Prev: Basic distributed-transpose interface, Up: FFTW MPI Transposes
6.7.2 Advanced distributed-transpose interface
----------------------------------------------
The above routines are for a transpose of a matrix of numbers (of type
'double'), using FFTW's default block sizes. More generally, one can
perform transposes of _tuples_ of numbers, with user-specified block
sizes for the input and output:
fftw_plan fftw_mpi_plan_many_transpose
(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
ptrdiff_t block0, ptrdiff_t block1,
double *in, double *out, MPI_Comm comm, unsigned flags);
In this case, one is transposing an 'n0' by 'n1' matrix of
'howmany'-tuples (e.g. 'howmany = 2' for complex numbers). The input
is distributed along the 'n0' dimension with block size 'block0', and
the 'n1' by 'n0' output is distributed along the 'n1' dimension with
block size 'block1'. If 'FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a
block size then FFTW uses its default block size. To get the local size
of the data on each process, you should then call
'fftw_mpi_local_size_many_transposed'.
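As an illustration of the tuple layout, the sketch below computes the offset of one component of a tuple in the local (non-transposed) input. It assumes that the 'howmany' components of each tuple are stored contiguously, as the real and imaginary parts of a complex number are; 'tuple_offset' is a hypothetical helper, not an FFTW function.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in doubles) of component c of the tuple at local row i and
   column j of a local_n0 x n1 row-major matrix of howmany-tuples. */
ptrdiff_t tuple_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t c,
                       ptrdiff_t n1, ptrdiff_t howmany)
{
    assert(i >= 0 && j >= 0 && j < n1 && c >= 0 && c < howmany);
    return (i * n1 + j) * howmany + c;
}
```

With 'howmany = 2', component 0 and component 1 of each tuple would correspond to the real and imaginary parts of a complex entry.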

File: fftw3.info, Node: An improved replacement for MPI_Alltoall, Prev: Advanced distributed-transpose interface, Up: FFTW MPI Transposes
6.7.3 An improved replacement for MPI_Alltoall
----------------------------------------------
We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the 'MPI_Alltoall' function
(albeit only for floating-point types), and in some circumstances can
function as an improved replacement.
'MPI_Alltoall' is defined by the MPI standard as:
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcnt, MPI_Datatype recvtype,
MPI_Comm comm);
In particular, for 'double*' arrays 'in' and 'out', consider the
call:
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
This is completely equivalent to:
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
That is, computing a P x P transpose on 'P' processes, with a block
size of 1, is just a standard all-to-all communication.
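The equivalence can be seen in a serial model that involves no MPI at all. In the sketch below, the buffers of all 'P' processes are laid end to end in one array, so that process 'p' owns row 'p' of a P x P matrix of 'howmany'-tuples; 'model_alltoall' is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Serial model of an all-to-all exchange with `howmany` doubles per
   block.  send and recv each hold P * P * howmany doubles.  Block q of
   process p's send buffer becomes block p of process q's receive
   buffer, which is exactly a P x P transpose of howmany-tuples. */
void model_alltoall(ptrdiff_t P, ptrdiff_t howmany,
                    const double *send, double *recv)
{
    ptrdiff_t p, q, c;
    assert(P > 0 && howmany > 0 && send != recv);
    for (p = 0; p < P; ++p)
        for (q = 0; q < P; ++q)
            for (c = 0; c < howmany; ++c)
                recv[(q * P + p) * howmany + c] =
                    send[(p * P + q) * howmany + c];
}
```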
However, using the FFTW routine instead of 'MPI_Alltoall' may have
certain advantages. First of all, FFTW's routine can operate in-place
('in == out'), whereas 'MPI_Alltoall' could only operate out-of-place
before MPI-2.2 introduced the 'MPI_IN_PLACE' option for it.
Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use 'FFTW_MEASURE' or 'FFTW_PATIENT'. It should
certainly be no slower, not including the time to create the plan, since
one of the possible algorithms that FFTW uses for an out-of-place
transpose _is_ simply to call 'MPI_Alltoall'. However, FFTW also
considers several other possible algorithms that, depending on your MPI
implementation and your hardware, may be faster.

File: fftw3.info, Node: FFTW MPI Wisdom, Next: Avoiding MPI Deadlocks, Prev: FFTW MPI Transposes, Up: Distributed-memory FFTW with MPI
6.8 FFTW MPI Wisdom
===================
FFTW's "wisdom" facility (*note Words of Wisdom-Saving Plans::) can be
used to save MPI plans as well as to save uniprocessor plans. However,
for MPI there are several unavoidable complications.
First, the MPI standard does not guarantee that every process can
perform file I/O (at least, not using C stdio routines)--in general, we
may only assume that process 0 is capable of I/O.(1) So, if we want to
export the wisdom from a single process to a file, we must first export
the wisdom to a string, then send it to process 0, then write it to a
file.
Second, in principle we may want to have separate wisdom for every
process, since in general the processes may run on different hardware
even for a single MPI program. However, in practice FFTW's MPI code is
designed for the case of homogeneous hardware (*note Load balancing::),
and in this case it is convenient to use the same wisdom for every
process. Thus, we need a mechanism to synchronize the wisdom.
To address both of these problems, FFTW provides the following two
functions:
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
void fftw_mpi_gather_wisdom(MPI_Comm comm);
Given a communicator 'comm', 'fftw_mpi_broadcast_wisdom' will
broadcast the wisdom from process 0 to all other processes. Conversely,
'fftw_mpi_gather_wisdom' will collect wisdom from all processes onto
process 0. (If the plans created for the same problem by different
processes are not the same, 'fftw_mpi_gather_wisdom' will arbitrarily
choose one of the plans.) Both of these functions may result in
suboptimal plans for different processes if the processes are running on
non-identical hardware. Both of these functions are _collective_ calls,
which means that they must be executed by all processes in the
communicator.
So, for example, a typical code snippet to import wisdom from a file
and use it on all processes would be:
{
int rank;
fftw_mpi_init();
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
}
(Note that we must call 'fftw_mpi_init' before importing any wisdom
that might contain MPI plans.) Similarly, a typical code snippet to
export wisdom from all processes to a file is:
{
int rank;
fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
}
---------- Footnotes ----------
(1) In fact, even this assumption is not technically guaranteed by
the standard, although it seems to be universal in actual MPI
implementations and is widely assumed by MPI-using software.
Technically, you need to query the 'MPI_IO' attribute of
'MPI_COMM_WORLD' with 'MPI_Attr_get'. If this attribute is
'MPI_PROC_NULL', no I/O is possible. If it is 'MPI_ANY_SOURCE', any
process can perform I/O. Otherwise, it is the rank of a process that can
perform I/O ... but since it is not guaranteed to yield the _same_ rank
on all processes, you have to do an 'MPI_Allreduce' of some kind if you
want all processes to agree about which is going to do I/O. And even
then, the standard only guarantees that this process can perform output,
but not input. See e.g. 'Parallel Programming with MPI' by P. S.
Pacheco, section 8.1.3. Needless to say, in our experience virtually no
MPI programmers worry about this.

File: fftw3.info, Node: Avoiding MPI Deadlocks, Next: FFTW MPI Performance Tips, Prev: FFTW MPI Wisdom, Up: Distributed-memory FFTW with MPI
6.9 Avoiding MPI Deadlocks
==========================
An MPI program can _deadlock_ if one process is waiting for a message
from another process that never gets sent. To avoid deadlocks when
using FFTW's MPI routines, it is important to know which functions are
_collective_: that is, which functions must _always_ be called in the
_same order_ from _every_ process in a given communicator.
('MPI_Barrier' is the canonical example of a collective function in the
MPI standard.)
The functions in FFTW that are _always_ collective are: every
function beginning with 'fftw_mpi_plan', as well as
'fftw_mpi_broadcast_wisdom' and 'fftw_mpi_gather_wisdom'. Also, the
following functions from the ordinary FFTW interface are collective when
they are applied to a plan created by an 'fftw_mpi_plan' function:
'fftw_execute', 'fftw_destroy_plan', and 'fftw_flops'.

File: fftw3.info, Node: FFTW MPI Performance Tips, Next: Combining MPI and Threads, Prev: Avoiding MPI Deadlocks, Up: Distributed-memory FFTW with MPI
6.10 FFTW MPI Performance Tips
==============================
In this section, we collect a few tips on getting the best performance
out of FFTW's MPI transforms.
First, because of the 1d block distribution, FFTW's parallelization
is currently limited by the size of the first dimension.
(Multidimensional block distributions may be supported by a future
version.) More generally, you should ideally arrange the dimensions so
that FFTW can divide them equally among the processes. *Note Load
balancing::.
Second, if it is not too inconvenient, you should consider working
with transposed output for multidimensional plans, as this saves a
considerable amount of communications. *Note Transposed
distributions::.
Third, the fastest choices are generally either an in-place transform
or an out-of-place transform with the 'FFTW_DESTROY_INPUT' flag (which
allows the input array to be used as scratch space). In-place is
especially beneficial if the amount of data per process is large.
Fourth, if you have multiple arrays to transform at once, rather than
calling FFTW's MPI transforms several times it usually seems to be
faster to interleave the data and use the advanced interface. (This
groups the communications together instead of requiring separate
messages for each transform.)
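The interleaving mentioned in the fourth tip can be sketched as a simple copy. This assumes the tuple layout used by the advanced interface, in which element 'e' of array 'a' is stored at position 'e * howmany + a'; the function 'interleave' is a hypothetical helper shown for real data, and the same pattern would apply to arrays of 'fftw_complex'.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `howmany` separate arrays of n doubles each into one
   interleaved array, so that element e of array a lands at
   dst[e * howmany + a]. */
void interleave(ptrdiff_t howmany, ptrdiff_t n,
                const double *const *src, double *dst)
{
    ptrdiff_t a, e;
    assert(howmany > 0 && n >= 0);
    for (e = 0; e < n; ++e)
        for (a = 0; a < howmany; ++a)
            dst[e * howmany + a] = src[a][e];
}
```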

File: fftw3.info, Node: Combining MPI and Threads, Next: FFTW MPI Reference, Prev: FFTW MPI Performance Tips, Up: Distributed-memory FFTW with MPI
6.11 Combining MPI and Threads
==============================
In certain cases, it may be advantageous to combine MPI
(distributed-memory) and threads (shared-memory) parallelization. FFTW
supports this, with certain caveats. For example, if you have a cluster
of 4-processor shared-memory nodes, you may want to use threads within
the nodes and MPI between the nodes, instead of MPI for all
parallelization.
In particular, it is possible to seamlessly combine the MPI FFTW
routines with the multi-threaded FFTW routines (*note Multi-threaded
FFTW::). However, some care must be taken in the initialization code,
which should look something like this:
int threads_ok;
int main(int argc, char **argv)
{
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
threads_ok = provided >= MPI_THREAD_FUNNELED;
if (threads_ok) threads_ok = fftw_init_threads();
fftw_mpi_init();
...
if (threads_ok) fftw_plan_with_nthreads(...);
...
MPI_Finalize();
}
First, note that instead of calling 'MPI_Init', you should call
'MPI_Init_thread', which is the initialization routine defined by the
MPI-2 standard to indicate to MPI that your program will be
multithreaded. We pass 'MPI_THREAD_FUNNELED', which indicates that we
will only call MPI routines from the main thread. (FFTW will launch
additional threads internally, but the extra threads will not call MPI
code.) (You may also pass 'MPI_THREAD_SERIALIZED' or
'MPI_THREAD_MULTIPLE', which requests additional multithreading support
from the MPI implementation, but this is not required by FFTW.) The
'provided' parameter returns the level of thread support that is
actually provided by your MPI implementation; this _must_ be at least
'MPI_THREAD_FUNNELED' if you want to call the FFTW threads routines, so
we define a global variable 'threads_ok' to record this. You should
only call 'fftw_init_threads' or 'fftw_plan_with_nthreads' if
'threads_ok' is true. For more information on thread safety in MPI, see
the MPI and Threads
(http://www.mpi-forum.org/docs/mpi-20-html/node162.htm) section of the
MPI-2 standard.
Second, we must call 'fftw_init_threads' _before_ 'fftw_mpi_init'.
This is critical for technical reasons having to do with how FFTW
initializes its list of algorithms.
Then, if you call 'fftw_plan_with_nthreads(N)', _every_ MPI process
will launch (up to) 'N' threads to parallelize its transforms.
For example, in the hypothetical cluster of 4-processor nodes, you
might wish to launch only a single MPI process per node, and then call
'fftw_plan_with_nthreads(4)' on each process to use all processors in
the nodes.
This may or may not be faster than simply using as many MPI processes
as you have processors, however. On the one hand, using threads within
a node eliminates the need for explicit message passing within the node.
On the other hand, FFTW's transpose routines are not multi-threaded, and
this means that the communications that do take place will not benefit
from parallelization within the node. Moreover, many MPI
implementations already have optimizations to exploit shared memory when
it is available, so adding the multithreaded FFTW on top of this may be
superfluous.

File: fftw3.info, Node: FFTW MPI Reference, Next: FFTW MPI Fortran Interface, Prev: Combining MPI and Threads, Up: Distributed-memory FFTW with MPI
6.12 FFTW MPI Reference
=======================
This section provides a complete reference to all FFTW MPI functions,
datatypes, and constants. See also *note FFTW Reference:: for
information on functions and types in common with the serial interface.
* Menu:
* MPI Files and Data Types::
* MPI Initialization::
* Using MPI Plans::
* MPI Data Distribution Functions::
* MPI Plan Creation::
* MPI Wisdom Communication::

File: fftw3.info, Node: MPI Files and Data Types, Next: MPI Initialization, Prev: FFTW MPI Reference, Up: FFTW MPI Reference
6.12.1 MPI Files and Data Types
-------------------------------
All programs using FFTW's MPI support should include its header file:
#include <fftw3-mpi.h>
Note that this header file includes the serial-FFTW 'fftw3.h' header
file, and also the 'mpi.h' header file for MPI, so you need not include
those files separately.
You must also link to _both_ the FFTW MPI library and to the serial
FFTW library. On Unix, this means adding '-lfftw3_mpi -lfftw3 -lm' at
the end of the link command.
Different precisions are handled as in the serial interface: *Note
Precision::. That is, 'fftw_' functions become 'fftwf_' (in single
precision) etcetera, and the libraries become '-lfftw3f_mpi -lfftw3f
-lm' etcetera on Unix. Long-double precision is supported in MPI, but
quad precision ('fftwq_') is not due to the lack of MPI support for this
type.

File: fftw3.info, Node: MPI Initialization, Next: Using MPI Plans, Prev: MPI Files and Data Types, Up: FFTW MPI Reference
6.12.2 MPI Initialization
-------------------------
Before calling any other FFTW MPI ('fftw_mpi_') function, and before
importing any wisdom for MPI problems, you must call:
void fftw_mpi_init(void);
If FFTW threads support is used, however, 'fftw_mpi_init' should be
called _after_ 'fftw_init_threads' (*note Combining MPI and Threads::).
Calling 'fftw_mpi_init' additional times (before 'fftw_mpi_cleanup') has
no effect.
If you want to deallocate all persistent data and reset FFTW to the
pristine state it was in when you started your program, you can call:
void fftw_mpi_cleanup(void);
(This calls 'fftw_cleanup', so you need not call the serial cleanup
routine too, although it is safe to do so.) After calling
'fftw_mpi_cleanup', all existing plans become undefined, and you should
not attempt to execute or destroy them. You must call 'fftw_mpi_init'
again after 'fftw_mpi_cleanup' if you want to resume using the MPI FFTW
routines.

File: fftw3.info, Node: Using MPI Plans, Next: MPI Data Distribution Functions, Prev: MPI Initialization, Up: FFTW MPI Reference
6.12.3 Using MPI Plans
----------------------
Once an MPI plan is created, you can execute and destroy it using
'fftw_execute', 'fftw_destroy_plan', and the other functions in the
serial interface that operate on generic plans (*note Using Plans::).
The 'fftw_execute' and 'fftw_destroy_plan' functions, applied to MPI
plans, are _collective_ calls: they must be called for all processes in
the communicator that was used to create the plan.
You must _not_ use the serial new-array plan-execution functions
'fftw_execute_dft' and so on (*note New-array Execute Functions::) with
MPI plans. Such functions are specialized to the problem type, and
there are specific new-array execute functions for MPI plans:
void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out);
void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out);
void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out);
void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out);
These functions have the same restrictions as those of the serial
new-array execute functions. They are _always_ safe to apply to the
_same_ 'in' and 'out' arrays that were used to create the plan. They
can only be applied to new arrays if those arrays have the same types,
dimensions, in-placeness, and alignment as the original arrays, where
the best way to ensure the same alignment is to use FFTW's 'fftw_malloc'
and related allocation functions for all arrays (*note Memory
Allocation::). Note that distributed transposes (*note FFTW MPI
Transposes::) use 'fftw_mpi_execute_r2r', since they count as rank-zero
r2r plans from FFTW's perspective.
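The in-placeness and alignment conditions above can be sketched as a
small pre-flight check. This is only an illustration in plain C: the
helper names are hypothetical (they are not FFTW functions), the
boundary passed in is an assumption that depends on how FFTW was built,
and the types and dimensions of the arrays remain the caller's
responsibility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of FFTW): do two pointers sit at the
   same offset relative to a power-of-two boundary? */
int same_alignment(const void *a, const void *b, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)a & (boundary - 1)) == ((uintptr_t)b & (boundary - 1));
}

/* Hypothetical helper: could (in2, out2) stand in for the arrays
   (in1, out1) that the plan was created with?  Checks only
   in-placeness and alignment, nothing else. */
int compatible_arrays(const void *in1, const void *out1,
                      const void *in2, const void *out2, size_t boundary)
{
    int inplace1 = (in1 == out1);
    int inplace2 = (in2 == out2);
    return inplace1 == inplace2
        && same_alignment(in1, in2, boundary)
        && same_alignment(out1, out2, boundary);
}
```

In practice, allocating every array with 'fftw_malloc' or its relatives
makes such a check unnecessary; the serial API also provides
'fftw_alignment_of' for comparing alignments (*note New-array Execute
Functions::).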

File: fftw3.info, Node: MPI Data Distribution Functions, Next: MPI Plan Creation, Prev: Using MPI Plans, Up: FFTW MPI Reference
6.12.4 MPI Data Distribution Functions
--------------------------------------
As described above (*note MPI Data Distribution::), in order to allocate
your arrays, _before_ creating a plan, you must first call one of the
following routines to determine the required allocation size and the
portion of the array locally stored on a given process. The 'MPI_Comm'
communicator passed here must be equivalent to the communicator used
below for plan creation.
The basic interface for multidimensional transforms consists of the
functions:
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
These functions return the number of elements to allocate (complex
numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas the
'local_n0' and 'local_0_start' arguments return the portion ('local_0_start' to
'local_0_start + local_n0 - 1') of the first dimension of an n[0] x n[1]
x n[2] x ... x n[d-1] array that is stored on the local process. *Note
Basic and advanced distribution interfaces::. For
'FFTW_MPI_TRANSPOSED_OUT' plans, the '_transposed' variants are useful
in order to also return the local portion of the first dimension in the
n[1] x n[0] x n[2] x ... x n[d-1] transposed output. *Note Transposed
distributions::. The advanced interface for multidimensional transforms
is:
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
ptrdiff_t block0, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm,
ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
These differ from the basic interface in only two ways. First, they
allow you to specify block sizes 'block0' and 'block1' (the latter for
the transposed output); you can pass 'FFTW_MPI_DEFAULT_BLOCK' to use
FFTW's default block size as in the basic interface. Second, you can
pass a 'howmany' parameter, corresponding to the advanced planning
interface below: this is for transforms of contiguous 'howmany'-tuples
of numbers ('howmany = 1' in the basic interface).
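To make the meaning of 'local_n0' and 'local_0_start' concrete, here is
a rough sketch of a plain block distribution of the first dimension.
It assumes a default block size of 'n0' divided by the number of
processes, rounded up; the helper names are hypothetical, and the
convention used for processes that own no rows is a choice made for
this sketch only. The real values, and in particular the allocation
size returned by the 'local_size' functions (which is not modelled
here), must always come from FFTW itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): block size obtained by
   dividing n0 rows among nprocs processes and rounding up. */
ptrdiff_t sketch_default_block(ptrdiff_t n0, int nprocs)
{
    assert(n0 > 0 && nprocs > 0);
    return (n0 + nprocs - 1) / nprocs;
}

/* Hypothetical helper: rows [*start, *start + *count) of the first
   dimension owned by `rank` when each process holds `block`
   consecutive rows.  Ranks past the end of the data own zero rows. */
void sketch_local_rows(ptrdiff_t n0, ptrdiff_t block, int rank,
                       ptrdiff_t *count, ptrdiff_t *start)
{
    assert(n0 > 0 && block > 0 && rank >= 0);
    ptrdiff_t first = block * rank;
    if (first >= n0) {
        *start = n0;   /* convention chosen for this sketch only */
        *count = 0;
        return;
    }
    *start = first;
    *count = (n0 - first < block) ? (n0 - first) : block;
}
```

For instance, 100 rows over 3 processes give a block size of 34 in this
sketch, so the last process holds only the remaining 32 rows.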
The corresponding basic and advanced routines for one-dimensional
transforms (currently only complex DFTs) are:
ptrdiff_t fftw_mpi_local_size_1d(
ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags,
ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
ptrdiff_t *local_no, ptrdiff_t *local_o_start);
ptrdiff_t fftw_mpi_local_size_many_1d(
ptrdiff_t n0, ptrdiff_t howmany,
MPI_Comm comm, int sign, unsigned flags,
ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
ptrdiff_t *local_no, ptrdiff_t *local_o_start);
As above, the return value is the number of elements to allocate
(complex numbers, for complex DFTs). The 'local_ni' and 'local_i_start'
arguments return the portion ('local_i_start' to 'local_i_start +
local_ni - 1') of the 1d array that is stored on this process for the
transform _input_, and 'local_no' and 'local_o_start' are the
corresponding quantities for the _output_. The 'sign' ('FFTW_FORWARD' or
'FFTW_BACKWARD') and 'flags' must match the arguments passed when
creating a plan. Although the inputs and outputs have different data
distributions in general, it is guaranteed that the _output_ data
distribution of an 'FFTW_FORWARD' plan will match the _input_ data
distribution of an 'FFTW_BACKWARD' plan and vice versa; similarly for
the 'FFTW_MPI_SCRAMBLED_OUT' and 'FFTW_MPI_SCRAMBLED_IN' flags. *Note
One-dimensional distributions::.

File: fftw3.info, Node: MPI Plan Creation, Next: MPI Wisdom Communication, Prev: MPI Data Distribution Functions, Up: FFTW MPI Reference
6.12.5 MPI Plan Creation
------------------------
Complex-data MPI DFTs
.....................
Plans for complex-data DFTs (*note 2d MPI example::) are created by:
fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out,
MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1,
fftw_complex *in, fftw_complex *out,
MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
fftw_complex *in, fftw_complex *out,
MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,
fftw_complex *in, fftw_complex *out,
MPI_Comm comm, int sign, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n,
ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock,
fftw_complex *in, fftw_complex *out,
MPI_Comm comm, int sign, unsigned flags);
These are similar to their serial counterparts (*note Complex DFTs::)
in specifying the dimensions, sign, and flags of the transform. The
'comm' argument gives an MPI communicator that specifies the set of
processes to participate in the transform; plan creation is a collective
function that must be called for all processes in the communicator. The
'in' and 'out' pointers refer only to a portion of the overall transform
data (*note MPI Data Distribution::) as specified by the 'local_size'
functions in the previous section. Unless 'flags' contains
'FFTW_ESTIMATE', these arrays are overwritten during plan creation as
for the serial interface. For multi-dimensional transforms, any
dimensions '> 1' are supported; for one-dimensional transforms, only
composite (non-prime) 'n0' are currently supported (unlike the serial
FFTW). Requesting an unsupported transform size will yield a 'NULL'
plan. (As in the serial interface, highly composite sizes generally
yield the best performance.)
The advanced-interface 'fftw_mpi_plan_many_dft' additionally allows
you to specify the block sizes for the first dimension ('block') of the
n[0] x n[1] x n[2] x ... x n[d-1] input data and the first dimension
('tblock') of the n[1] x n[0] x n[2] x ... x n[d-1] transposed data (at
intermediate steps of the transform, and for the output if
'FFTW_MPI_TRANSPOSED_OUT' is specified in 'flags'). These must be the same
block sizes as were passed to the corresponding 'local_size' function;
you can pass 'FFTW_MPI_DEFAULT_BLOCK' to use FFTW's default block size
as in the basic interface. Also, the 'howmany' parameter specifies that
the transform is of contiguous 'howmany'-tuples rather than individual
complex numbers; this corresponds to the same parameter in the serial
advanced interface (*note Advanced Complex DFTs::) with 'stride =
howmany' and 'dist = 1'.
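The 'howmany'-tuple layout can be illustrated with a short index
calculation. This is a sketch only; 'tuple_offset' is a hypothetical
helper, not an FFTW function, and 'idx' stands for the row-major
position of a tuple within the local array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linear offset of component k of the tuple at
   row-major position idx, when tuples of `howmany` numbers are stored
   contiguously.  This equals k*dist + idx*stride with
   stride = howmany and dist = 1. */
ptrdiff_t tuple_offset(ptrdiff_t idx, ptrdiff_t k, ptrdiff_t howmany)
{
    assert(howmany > 0 && idx >= 0 && k >= 0 && k < howmany);
    return idx * howmany + k;
}
```

With 'howmany = 3', for example, the three components of the tuple at
position 2 occupy offsets 6, 7, and 8, and transform number 1 reads
offsets 1, 4, 7, and so on.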
MPI flags
.........
The 'flags' can be any of those for the serial FFTW (*note Planner
Flags::), and in addition may include one or more of the following
MPI-specific flags, which improve performance at the cost of changing
the output or input data formats.
* 'FFTW_MPI_SCRAMBLED_OUT', 'FFTW_MPI_SCRAMBLED_IN': valid for 1d
transforms only, these flags indicate that the output/input of the
transform are in an undocumented "scrambled" order. A forward
'FFTW_MPI_SCRAMBLED_OUT' transform can be inverted by a backward
'FFTW_MPI_SCRAMBLED_IN' (times the usual 1/N normalization). *Note
One-dimensional distributions::.
* 'FFTW_MPI_TRANSPOSED_OUT', 'FFTW_MPI_TRANSPOSED_IN': valid for
multidimensional ('rnk > 1') transforms only, these flags specify
that the output or input of an n[0] x n[1] x n[2] x ... x n[d-1]
transform is transposed to n[1] x n[0] x n[2] x ... x n[d-1] .
*Note Transposed distributions::.
Real-data MPI DFTs
..................
Plans for real-input/output (r2c/c2r) DFTs (*note Multi-dimensional MPI
DFTs of Real Data::) are created by:
fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,
double *in, fftw_complex *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
double *in, fftw_complex *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n,
double *in, fftw_complex *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,
fftw_complex *in, double *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
fftw_complex *in, double *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n,
fftw_complex *in, double *out,
MPI_Comm comm, unsigned flags);
Similar to the serial interface (*note Real-data DFTs::), these
transform logically n[0] x n[1] x n[2] x ... x n[d-1] real data to/from
n[0] x n[1] x n[2] x ... x (n[d-1]/2 + 1) complex data, representing
the non-redundant half of the conjugate-symmetry output of a real-input
DFT (*note Multi-dimensional Transforms::). However, the real array
must be stored within a padded n[0] x n[1] x n[2] x ... x [2 (n[d-1]/2
+ 1)] array (much like the in-place serial r2c transforms, but here for
out-of-place transforms as well). Currently, only multi-dimensional
('rnk > 1') r2c/c2r transforms are supported (requesting a plan for 'rnk
= 1' will yield 'NULL'). As explained above (*note Multi-dimensional
MPI DFTs of Real Data::), the data distribution of both the real and
complex arrays is given by the 'local_size' function called for the
dimensions of the _complex_ array. Similar to the other planning
functions, the input and output arrays are overwritten when the plan is
created except in 'FFTW_ESTIMATE' mode.
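The padded layout of the real array can be made concrete with a little
index arithmetic. The following is a sketch for the two-dimensional
case; the function names are hypothetical helpers rather than FFTW
routines, and in an MPI program 'i' would be a row index within the
local portion of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex entries kept along the last dimension. */
ptrdiff_t complex_last_dim(ptrdiff_t n_last)
{
    assert(n_last > 0);
    return n_last / 2 + 1;
}

/* Padded length, counted in reals, of the last dimension. */
ptrdiff_t padded_last_dim(ptrdiff_t n_last)
{
    return 2 * complex_last_dim(n_last);
}

/* Row-major offset of real element (i, j) of an n0 x n1 array stored
   with the padding above: consecutive rows are padded_last_dim(n1)
   reals apart, not n1. */
ptrdiff_t padded_real_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t n1)
{
    assert(i >= 0 && j >= 0 && j < n1);
    return i * padded_last_dim(n1) + j;
}
```

For 'n1 = 8' each row therefore occupies 10 reals, of which the last
two are padding; for 'n1 = 7' each row occupies 8 reals with one
padding entry.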
As for the complex DFTs above, there is an advanced interface that
allows you to manually specify block sizes and to transform contiguous
'howmany'-tuples of real/complex numbers:
fftw_plan fftw_mpi_plan_many_dft_r2c
(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
ptrdiff_t iblock, ptrdiff_t oblock,
double *in, fftw_complex *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_dft_c2r
(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
ptrdiff_t iblock, ptrdiff_t oblock,
fftw_complex *in, double *out,
MPI_Comm comm, unsigned flags);
MPI r2r transforms
..................
There are corresponding plan-creation routines for r2r transforms (*note
More DFTs of Real Data::), currently supporting multidimensional ('rnk >
1') transforms only ('rnk = 1' will yield a 'NULL' plan):
fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1,
double *in, double *out,
MPI_Comm comm,
fftw_r2r_kind kind0, fftw_r2r_kind kind1,
unsigned flags);
fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2,
double *in, double *out,
MPI_Comm comm,
fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2,
unsigned flags);
fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n,
double *in, double *out,
MPI_Comm comm, const fftw_r2r_kind *kind,
unsigned flags);
fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, ptrdiff_t howmany,
ptrdiff_t iblock, ptrdiff_t oblock,
double *in, double *out,
MPI_Comm comm, const fftw_r2r_kind *kind,
unsigned flags);
The parameters are much the same as for the complex DFTs above,
except that the arrays are of real numbers (and hence the outputs of the
'local_size' data-distribution functions should be interpreted as counts
of real rather than complex numbers). Also, the 'kind' parameters
specify the r2r kinds along each dimension as for the serial interface
(*note Real-to-Real Transform Kinds::). *Note Other Multi-dimensional
Real-data MPI Transforms::.
MPI transposition
.................
FFTW also provides routines to plan a transpose of a distributed 'n0' by
'n1' array of real numbers, or an array of 'howmany'-tuples of real
numbers with specified block sizes (*note FFTW MPI Transposes::):
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
double *in, double *out,
MPI_Comm comm, unsigned flags);
fftw_plan fftw_mpi_plan_many_transpose
(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
ptrdiff_t block0, ptrdiff_t block1,
double *in, double *out, MPI_Comm comm, unsigned flags);
These plans are used with the 'fftw_mpi_execute_r2r' new-array
execute function (*note Using MPI Plans::), since they count as (rank
zero) r2r plans from FFTW's perspective.

File: fftw3.info, Node: MPI Wisdom Communication, Prev: MPI Plan Creation, Up: FFTW MPI Reference
6.12.6 MPI Wisdom Communication
-------------------------------
To facilitate synchronizing wisdom among the different MPI processes, we
provide two functions:
void fftw_mpi_gather_wisdom(MPI_Comm comm);
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
The 'fftw_mpi_gather_wisdom' function gathers all wisdom in the given
communicator 'comm' to the process of rank 0 in the communicator: that
process obtains the union of all wisdom on all the processes. As a side
effect, some other processes will gain additional wisdom from other
processes, but only process 0 will gain the complete union.
The 'fftw_mpi_broadcast_wisdom' function does the reverse: it exports wisdom
from process 0 in 'comm' to all other processes in the communicator,
replacing any wisdom they currently have.
*Note FFTW MPI Wisdom::.

File: fftw3.info, Node: FFTW MPI Fortran Interface, Prev: FFTW MPI Reference, Up: Distributed-memory FFTW with MPI
6.13 FFTW MPI Fortran Interface
===============================
The FFTW MPI interface is callable from modern Fortran compilers
supporting the Fortran 2003 'iso_c_binding' standard for calling C
functions. As described in *note Calling FFTW from Modern Fortran::,
this means that you can directly call FFTW's C interface from Fortran
with only minor changes in syntax. There are, however, a few things
specific to the MPI interface to keep in mind:
* Instead of including 'fftw3.f03' as in *note Overview of Fortran
interface::, you should 'include 'fftw3-mpi.f03'' (after 'use,
intrinsic :: iso_c_binding' as before). The 'fftw3-mpi.f03' file
includes 'fftw3.f03', so you should _not_ 'include' them both
yourself. (You will also want to include the MPI header file,
usually via 'include 'mpif.h'' or similar, although this is
not needed by 'fftw3-mpi.f03' per se.) (To use the 'fftwl_' 'long
double' extended-precision routines in supporting compilers, you
should include 'fftw3l-mpi.f03' in _addition_ to 'fftw3-mpi.f03'.
*Note Extended and quadruple precision in Fortran::.)
* Because of the different storage conventions between C and Fortran,
you reverse the order of your array dimensions when passing them to
FFTW (*note Reversing array dimensions::). This is merely a
difference in notation and incurs no performance overhead.
However, it means that, whereas in C the _first_ dimension is
distributed, in Fortran the _last_ dimension of your array is
distributed.
* In Fortran, communicators are stored as 'integer' types; there is
no 'MPI_Comm' type, nor is there any way to access a C 'MPI_Comm'.
Fortunately, this is taken care of for you by the FFTW Fortran
interface: whenever the C interface expects an 'MPI_Comm' type, you
should pass the Fortran communicator as an 'integer'.(1)
* Because you need to call the 'local_size' function to find out how
much space to allocate, and this may be _larger_ than the local
portion of the array (*note MPI Data Distribution::), you should
_always_ allocate your arrays dynamically using FFTW's allocation
routines as described in *note Allocating aligned memory in
Fortran::. (Coincidentally, this also provides the best
performance by guaranteeing proper data alignment.)
* Because all sizes in the MPI FFTW interface are declared as
'ptrdiff_t' in C, you should use 'integer(C_INTPTR_T)' in Fortran
(*note FFTW Fortran type reference::).
* In Fortran, because of the language semantics, we generally
recommend using the new-array execute functions for all plans, even
in the common case where you are executing the plan on the same
arrays for which the plan was created (*note Plan execution in
Fortran::). However, note that in the MPI interface these
functions are changed: 'fftw_execute_dft' becomes
'fftw_mpi_execute_dft', etcetera. *Note Using MPI Plans::.
For example, here is a Fortran code snippet to perform a distributed
L x M complex DFT in-place. (This assumes you have already initialized
MPI with 'MPI_init' and have also performed 'call fftw_mpi_init'.)
use, intrinsic :: iso_c_binding
include 'fftw3-mpi.f03'
integer(C_INTPTR_T), parameter :: L = ...
integer(C_INTPTR_T), parameter :: M = ...
type(C_PTR) :: plan, cdata
complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset
! get local data size and allocate (note dimension reversal)
alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
local_M, local_j_offset)
cdata = fftw_alloc_complex(alloc_local)
call c_f_pointer(cdata, data, [L,local_M])
! create MPI plan for in-place forward DFT (note dimension reversal)
plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
FFTW_FORWARD, FFTW_MEASURE)
! initialize data to some function my_function(i,j)
do j = 1, local_M
do i = 1, L
data(i, j) = my_function(i, j + local_j_offset)
end do
end do
! compute transform (as many times as desired)
call fftw_mpi_execute_dft(plan, data, data)
call fftw_destroy_plan(plan)
call fftw_free(cdata)
Note that we called 'fftw_mpi_local_size_2d' and
'fftw_mpi_plan_dft_2d' with the dimensions in reversed order, since an L
x M Fortran array is viewed by FFTW in C as an M x L array. This means
that the array was distributed over the 'M' dimension, the local portion
of which is an L x local_M array in Fortran. (You must _not_ use an
'allocate' statement to allocate an L x local_M array, however; you must
allocate 'alloc_local' complex numbers, which may be greater than 'L *
local_M', in order to reserve space for intermediate steps of the
transform.) Finally, we mention that because C's array indices are
zero-based, the 'local_j_offset' argument can conveniently be
interpreted as an offset in the 1-based 'j' index (rather than as a
starting index as in C).
If instead you had used the 'ior(FFTW_MEASURE,
FFTW_MPI_TRANSPOSED_OUT)' flag, the output of the transform would be a
transposed M x local_L array, associated with the _same_ 'cdata'
allocation (since the transform is in-place), and which you could
declare with:
complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:)
...
call c_f_pointer(cdata, tdata, [M,local_L])
where 'local_L' would have been obtained by changing the
'fftw_mpi_local_size_2d' call to:
alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, &
local_M, local_j_offset, local_L, local_i_offset)
---------- Footnotes ----------
(1) Technically, this is because you aren't actually calling the C
functions directly. You are calling wrapper functions that translate
the communicator with 'MPI_Comm_f2c' before calling the ordinary C
interface. This is all done transparently, however, since the
'fftw3-mpi.f03' interface file renames the wrappers so that they are
called in Fortran with the same names as the C interface functions.

File: fftw3.info, Node: Calling FFTW from Modern Fortran, Next: Calling FFTW from Legacy Fortran, Prev: Distributed-memory FFTW with MPI, Up: Top
7 Calling FFTW from Modern Fortran
**********************************
Fortran 2003 standardized ways for Fortran code to call C libraries, and
this allows us to support a direct translation of the FFTW C API into
Fortran. Compared to the legacy Fortran 77 interface (*note Calling
FFTW from Legacy Fortran::), this direct interface offers many
advantages, especially compile-time type-checking and aligned memory
allocation. As of this writing, support for these C interoperability
features seems widespread, having been implemented in nearly all major
Fortran compilers (e.g. GNU, Intel, IBM, Oracle/Solaris, Portland
Group, NAG).
This chapter documents that interface. For the most part, since this
interface allows Fortran to call the C interface directly, the usage is
identical to C translated to Fortran syntax. However, there are a few
subtle points such as memory allocation, wisdom, and data types that
deserve closer attention.
* Menu:
* Overview of Fortran interface::
* Reversing array dimensions::
* FFTW Fortran type reference::
* Plan execution in Fortran::
* Allocating aligned memory in Fortran::
* Accessing the wisdom API from Fortran::
* Defining an FFTW module::

File: fftw3.info, Node: Overview of Fortran interface, Next: Reversing array dimensions, Prev: Calling FFTW from Modern Fortran, Up: Calling FFTW from Modern Fortran
7.1 Overview of Fortran interface
=================================
FFTW provides a file 'fftw3.f03' that defines Fortran 2003 interfaces
for all of its C routines, except for the MPI routines described
elsewhere. This file can be found in the same directory as 'fftw3.h'
(the C header file). In any Fortran subroutine where you want to use FFTW
functions, you should begin with:
use, intrinsic :: iso_c_binding
include 'fftw3.f03'
This includes the interface definitions and the standard
'iso_c_binding' module (which defines the equivalents of C types). You
can also put the FFTW functions into a module if you prefer (*note
Defining an FFTW module::).
At this point, you can now call anything in the FFTW C interface
directly, almost exactly as in C other than minor changes in syntax.
For example:
type(C_PTR) :: plan
complex(C_DOUBLE_COMPLEX), dimension(1024,1000) :: in, out
plan = fftw_plan_dft_2d(1000,1024, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
...
call fftw_execute_dft(plan, in, out)
...
call fftw_destroy_plan(plan)
A few important things to keep in mind are:
* FFTW plans are 'type(C_PTR)'. Other C types are mapped in the
obvious way via the 'iso_c_binding' standard: 'int' turns into
'integer(C_INT)', 'fftw_complex' turns into
'complex(C_DOUBLE_COMPLEX)', 'double' turns into 'real(C_DOUBLE)',
and so on. *Note FFTW Fortran type reference::.
* Functions in C become functions in Fortran if they have a return
value, and subroutines in Fortran otherwise.
* The ordering of the Fortran array dimensions must be _reversed_
when they are passed to the FFTW plan creation, thanks to
differences in array indexing conventions (*note Multi-dimensional
Array Format::). This is _unlike_ the legacy Fortran interface
(*note Fortran-interface routines::), which reversed the dimensions
for you. *Note Reversing array dimensions::.
* Using ordinary Fortran array declarations like this works, but may
yield suboptimal performance because the data may not be
aligned to exploit SIMD instructions on modern processors (*note
SIMD alignment and fftw_malloc::). Better performance will often
be obtained by allocating with 'fftw_alloc'. *Note Allocating
aligned memory in Fortran::.
* Similar to the legacy Fortran interface (*note FFTW Execution in
Fortran::), we currently recommend _not_ using 'fftw_execute' but
rather using the more specialized functions like 'fftw_execute_dft'
(*note New-array Execute Functions::). However, you should execute
the plan on the 'same arrays' as the ones for which you created the
plan, unless you are especially careful. *Note Plan execution in
Fortran::. To prevent you from using 'fftw_execute' by mistake,
the 'fftw3.f03' file does not provide an 'fftw_execute' interface
declaration.
* Multiple planner flags are combined with 'ior' (equivalent to '|'
in C); e.g. 'FFTW_MEASURE | FFTW_DESTROY_INPUT' becomes
'ior(FFTW_MEASURE, FFTW_DESTROY_INPUT)'. (You can also use '+' as
long as you don't try to include a given flag more than once.)
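The remark about '+' in the last item can be checked with a short
sketch. The flag values below are made up for illustration and are not
FFTW's constants; the point is only that addition and bitwise OR agree
when the operands share no bits, and disagree as soon as a flag is
repeated.

```c
#include <assert.h>

/* Made-up single-bit flags for illustration only; NOT FFTW's constants. */
#define DEMO_FLAG_A (1u << 0)
#define DEMO_FLAG_B (1u << 5)

/* Bitwise OR: the C spelling of Fortran's ior(a, b). */
unsigned combine_or(unsigned a, unsigned b)
{
    return a | b;
}

/* Addition gives the same result as OR only when a and b share no
   bits, so this sketch refuses overlapping operands. */
unsigned combine_add_checked(unsigned a, unsigned b)
{
    assert((a & b) == 0u);
    return a + b;
}
```

Repeating a flag is harmless with OR ('a | a' is still 'a') but not
with addition ('a + a' is a different value), which is why 'ior' is the
safer habit.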
* Menu:
* Extended and quadruple precision in Fortran::

File: fftw3.info, Node: Extended and quadruple precision in Fortran, Prev: Overview of Fortran interface, Up: Overview of Fortran interface
7.1.1 Extended and quadruple precision in Fortran
-------------------------------------------------
If FFTW is compiled in 'long double' (extended) precision (*note
Installation and Customization::), you may be able to call the resulting
'fftwl_' routines (*note Precision::) from Fortran if your compiler
supports the 'C_LONG_DOUBLE_COMPLEX' type code.
Because some Fortran compilers do not support
'C_LONG_DOUBLE_COMPLEX', the 'fftwl_' declarations are segregated into a
separate interface file 'fftw3l.f03', which you should include _in
addition_ to 'fftw3.f03' (which declares precision-independent 'FFTW_'
constants):
use, intrinsic :: iso_c_binding
include 'fftw3.f03'
include 'fftw3l.f03'
We also support using the nonstandard '__float128'
quadruple-precision type provided by recent versions of 'gcc' on 32- and
64-bit x86 hardware (*note Installation and Customization::), using the
corresponding 'real(16)' and 'complex(16)' types supported by
'gfortran'. The quadruple-precision 'fftwq_' functions (*note
Precision::) are declared in a 'fftw3q.f03' interface file, which should
be included in addition to 'fftw3.f03', as above. You should also link
with '-lfftw3q -lquadmath -lm' as in C.

File: fftw3.info, Node: Reversing array dimensions, Next: FFTW Fortran type reference, Prev: Overview of Fortran interface, Up: Calling FFTW from Modern Fortran
7.2 Reversing array dimensions
==============================
A minor annoyance in calling FFTW from Fortran is that FFTW's array
dimensions are defined in the C convention (row-major order), while
Fortran's array dimensions are the opposite convention (column-major
order). *Note Multi-dimensional Array Format::. This is just a
bookkeeping difference, with no effect on performance. The only
consequence of this is that, whenever you create an FFTW plan for a
multi-dimensional transform, you must always _reverse the ordering of
the dimensions_.
For example, consider the three-dimensional (L x M x N ) arrays:
complex(C_DOUBLE_COMPLEX), dimension(L,M,N) :: in, out
To plan a DFT for these arrays using 'fftw_plan_dft_3d', you could
do:
plan = fftw_plan_dft_3d(N,M,L, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
That is, from FFTW's perspective this is an N x M x L array. _No data
transposition need occur_, as this is _only notation_. Similarly, to
use the more generic routine 'fftw_plan_dft' with the same arrays, you
could do:
integer(C_INT), dimension(3) :: dims = [N,M,L]
plan = fftw_plan_dft(3, dims, in,out, FFTW_FORWARD,FFTW_ESTIMATE)
Note, by the way, that this is different from the legacy Fortran
interface (*note Fortran-interface routines::), which automatically
reverses the order of the array dimension for you. Here, you are
calling the C interface directly, so there is no "translation" layer.
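The claim that reversing the dimensions is pure notation can be
verified with a little index arithmetic. The sketch below, written in
C with hypothetical helper names and 0-based indices, computes the
memory offset of an element under each convention; element (i, j, k) of
a column-major L x M x N array lands at the same offset as element
[k][j][i] of a row-major N x M x L array.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of 0-based element (i, j, k) in a column-major
   (Fortran-order) array of shape L x M x N: the first index varies
   fastest in memory. */
ptrdiff_t column_major_offset(ptrdiff_t i, ptrdiff_t j, ptrdiff_t k,
                              ptrdiff_t L, ptrdiff_t M, ptrdiff_t N)
{
    assert(i >= 0 && i < L && j >= 0 && j < M && k >= 0 && k < N);
    return i + L * (j + M * k);
}

/* Offset of element [a][b][c] in a row-major (C-order) array of shape
   n0 x n1 x n2: the last index varies fastest in memory. */
ptrdiff_t row_major_offset(ptrdiff_t a, ptrdiff_t b, ptrdiff_t c,
                           ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2)
{
    assert(a >= 0 && a < n0 && b >= 0 && b < n1 && c >= 0 && c < n2);
    return c + n2 * (b + n1 * a);
}
```

Since the two offsets coincide for every element, the data never has to
move; only the order in which the sizes are written down changes.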
An important thing to keep in mind is the implication of this for
multidimensional real-to-complex transforms (*note Multi-Dimensional
DFTs of Real Data::). In C, a multidimensional real-to-complex DFT
chops the last dimension roughly in half (N x M x L real input goes to N
x M x (L/2+1) complex output). In Fortran, because the array dimension
notation is reversed, the _first_ dimension of the complex data is
chopped roughly in half. For example, consider the 'r2c' transform of L
x M x N real input in Fortran:
type(C_PTR) :: plan
real(C_DOUBLE), dimension(L,M,N) :: in
complex(C_DOUBLE_COMPLEX), dimension(L/2+1,M,N) :: out
plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
...
call fftw_execute_dft_r2c(plan, in, out)
Alternatively, for an in-place r2c transform, as described in the C
documentation we must _pad_ the _first_ dimension of the real input with
extra entries (two if L is even, one if L is odd; FFTW ignores them) so
as to leave enough space for the complex output. The input is
_allocated_ as a 2[L/2+1] x
M x N array, even though only L x M x N of it is actually used. In this
example, we will allocate the array as a pointer type, using
'fftw_alloc' to ensure aligned memory for maximum performance (*note
Allocating aligned memory in Fortran::); this also makes it easy to
reference the same memory as both a real array and a complex array.
real(C_DOUBLE), pointer :: in(:,:,:)
complex(C_DOUBLE_COMPLEX), pointer :: out(:,:,:)
type(C_PTR) :: plan, data
data = fftw_alloc_complex(int((L/2+1) * M * N, C_SIZE_T))
call c_f_pointer(data, in, [2*(L/2+1),M,N])
call c_f_pointer(data, out, [L/2+1,M,N])
plan = fftw_plan_dft_r2c_3d(N,M,L, in,out, FFTW_ESTIMATE)
...
call fftw_execute_dft_r2c(plan, in, out)
...
call fftw_destroy_plan(plan)
call fftw_free(data)

File: fftw3.info, Node: FFTW Fortran type reference, Next: Plan execution in Fortran, Prev: Reversing array dimensions, Up: Calling FFTW from Modern Fortran
7.3 FFTW Fortran type reference
===============================
The following are the most important type correspondences between the C
interface and Fortran:
* Plans ('fftw_plan' and variants) are 'type(C_PTR)' (i.e. an opaque
pointer).
* The C floating-point types 'double', 'float', and 'long double'
correspond to 'real(C_DOUBLE)', 'real(C_FLOAT)', and
'real(C_LONG_DOUBLE)', respectively. The C complex types
'fftw_complex', 'fftwf_complex', and 'fftwl_complex' correspond in
Fortran to 'complex(C_DOUBLE_COMPLEX)', 'complex(C_FLOAT_COMPLEX)',
and 'complex(C_LONG_DOUBLE_COMPLEX)', respectively. Just as in C
(*note Precision::), the FFTW subroutines and types are prefixed
with 'fftw_', 'fftwf_', and 'fftwl_' for the different precisions,
and link to different libraries ('-lfftw3', '-lfftw3f', and
'-lfftw3l' on Unix), but use the _same_ include file 'fftw3.f03'
and the _same_ constants (all of which begin with 'FFTW_'). The
exception is 'long double' precision, for which you should _also_
include 'fftw3l.f03' (*note Extended and quadruple precision in
Fortran::).
* The C integer types 'int' and 'unsigned' (used for planner flags)
become 'integer(C_INT)'. The C integer type 'ptrdiff_t' (e.g. in
the *note 64-bit Guru Interface::) becomes 'integer(C_INTPTR_T)',
and 'size_t' (in 'fftw_malloc' etc.) becomes 'integer(C_SIZE_T)'.
* The 'fftw_r2r_kind' type (*note Real-to-Real Transform Kinds::)
becomes 'integer(C_FFTW_R2R_KIND)'. The various constant values of
the C enumerated type ('FFTW_R2HC' etc.) become simply integer
constants of the same names in Fortran.
* Numeric array pointer arguments (e.g. 'double *') become
'dimension(*), intent(out)' arrays of the same type, or
'dimension(*), intent(in)' if they are pointers to constant data
(e.g. 'const int *'). There are a few exceptions where numeric
pointers refer to scalar outputs (e.g. for 'fftw_flops'), in which
case they are 'intent(out)' scalar arguments in Fortran too. For
the new-array execute functions (*note New-array Execute
Functions::), the input arrays are declared 'dimension(*),
intent(inout)', since they can be modified in the case of in-place
or 'FFTW_DESTROY_INPUT' transforms.
* Pointer _return_ values (e.g. 'double *') become 'type(C_PTR)'. (If
they are pointers to arrays, as for 'fftw_alloc_real', you can
convert them back to Fortran array pointers with the standard
intrinsic function 'c_f_pointer'.)
* The 'fftw_iodim' type in the guru interface (*note Guru vector and
transform sizes::) becomes 'type(fftw_iodim)' in Fortran, a derived
data type (the Fortran analogue of C's 'struct') with three
'integer(C_INT)' components: 'n', 'is', and 'os', with the same
meanings as in C. The 'fftw_iodim64' type in the 64-bit guru
interface (*note 64-bit Guru Interface::) is the same, except that
its components are of type 'integer(C_INTPTR_T)'.
* Using the wisdom import/export functions from Fortran is a bit
tricky, and is discussed in *note Accessing the wisdom API from
Fortran::. In brief, the 'FILE *' arguments map to 'type(C_PTR)',
'const char *' to 'character(C_CHAR), dimension(*), intent(in)'
(null-terminated!), and the generic read-char/write-char functions
map to 'type(C_FUNPTR)'.
You may be wondering if you need to search-and-replace
'real(kind(0.0d0))' (or whatever your favorite Fortran spelling of
"double precision" is) with 'real(C_DOUBLE)' everywhere in your program,
and similarly for 'complex' and 'integer' types. The answer is no; you
can still use your existing types. As long as these types match their C
counterparts, things should work without a hitch. The worst that can
happen, e.g. in the (unlikely) event of a system where
'real(kind(0.0d0))' is different from 'real(C_DOUBLE)', is that the
compiler will give you a type-mismatch error. That is, if you don't use
the 'iso_c_binding' kinds you need to accept at least the theoretical
possibility of having to change your code in response to compiler errors
on some future machine, but you don't need to worry about silently
compiling incorrect code that yields runtime errors.

File: fftw3.info, Node: Plan execution in Fortran, Next: Allocating aligned memory in Fortran, Prev: FFTW Fortran type reference, Up: Calling FFTW from Modern Fortran
7.4 Plan execution in Fortran
=============================
In C, in order to use a plan, one normally calls 'fftw_execute', which
executes the plan to perform the transform on the input/output arrays
passed when the plan was created (*note Using Plans::). The
corresponding subroutine call in modern Fortran is:
call fftw_execute(plan)
However, we have had reports that this causes problems with some
recent optimizing Fortran compilers. The problem is, because the
input/output arrays are not passed as explicit arguments to
'fftw_execute', the semantics of Fortran (unlike C) allow the compiler
to assume that the input/output arrays are not changed by
'fftw_execute'. As a consequence, certain compilers end up
repositioning the call to 'fftw_execute', assuming incorrectly that it
does nothing to the arrays.
There are various workarounds to this, but the safest and simplest
thing is to not use 'fftw_execute' in Fortran. Instead, use the
functions described in *note New-array Execute Functions::, which take
the input/output arrays as explicit arguments. For example, if the plan
is for a complex-data DFT and was created for the arrays 'in' and 'out',
you would do:
call fftw_execute_dft(plan, in, out)
There are a few things to be careful of, however:
* You must use the correct type of execute function, matching the way
the plan was created. Complex DFT plans should use
'fftw_execute_dft', real-input (r2c) DFT plans should use
'fftw_execute_dft_r2c', and real-output (c2r) DFT plans should use
'fftw_execute_dft_c2r'. The various r2r plans should use
'fftw_execute_r2r'. Fortunately, if you use the wrong one you will
get a compile-time type-mismatch error (unlike legacy Fortran).
* You should normally pass the same input/output arrays that were
used when creating the plan. This is always safe.
* _If_ you pass _different_ input/output arrays compared to those
used when creating the plan, you must abide by all the restrictions
of the new-array execute functions (*note New-array Execute
Functions::). The most tricky of these is the requirement that the
new arrays have the same alignment as the original arrays; the best
(and possibly only) way to guarantee this is to use the
'fftw_alloc' functions to allocate your arrays (*note Allocating
aligned memory in Fortran::). Alternatively, you can use the
'FFTW_UNALIGNED' flag when creating the plan, in which case the
plan does not depend on the alignment, but this may sacrifice
substantial performance on architectures (like x86) with SIMD
instructions (*note SIMD alignment and fftw_malloc::).
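The "same alignment" requirement can be made concrete with a short C
sketch. It is only an illustration: the helper names are hypothetical,
and the 16-byte boundary is an assumption chosen for the example rather
than a value guaranteed by the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of FFTW): byte offset of p from the
   previous 16-byte boundary. The 16-byte figure is an assumption made
   for illustration; the boundary that matters depends on the SIMD
   instruction set the library was built for. */
static unsigned alignment_offset_sketch(const void *p)
{
    assert(p != NULL);
    return (unsigned)((uintptr_t)p % 16u);
}

/* Nonzero if two arrays sit at the same offset, which is the property
   the new-array execute functions rely on. */
static int same_alignment_sketch(const void *planned, const void *other)
{
    return alignment_offset_sketch(planned) == alignment_offset_sketch(other);
}
```

Under these assumptions, an array and the same array shifted by one
'double' (8 bytes) report different offsets, which is why a sub-array
that starts at an odd element is not automatically a valid substitute
for the array the plan was created with.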

File: fftw3.info, Node: Allocating aligned memory in Fortran, Next: Accessing the wisdom API from Fortran, Prev: Plan execution in Fortran, Up: Calling FFTW from Modern Fortran
7.5 Allocating aligned memory in Fortran
========================================
In order to obtain maximum performance in FFTW, you should store your
data in arrays that have been specially aligned in memory (*note SIMD
alignment and fftw_malloc::). Enforcing alignment also permits you to
safely use the new-array execute functions (*note New-array Execute
Functions::) to apply a given plan to more than one pair of in/out
arrays. Unfortunately, standard Fortran arrays do _not_ provide any
alignment guarantees. The _only_ way to allocate aligned memory in
standard Fortran is to allocate it with an external C function, like the
'fftw_alloc_real' and 'fftw_alloc_complex' functions. Fortunately,
Fortran 2003 provides a simple way to associate such allocated memory
with a standard Fortran array pointer that you can then use normally.
We therefore recommend allocating all your input/output arrays using
the following technique:
1. Declare a 'pointer', 'arr', to your array of the desired type and
dimensions. For example, 'real(C_DOUBLE), pointer :: arr(:,:)' for a
2d real array, or 'complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:)'
for a 3d complex array.
2. The number of elements to allocate must be an 'integer(C_SIZE_T)'.
You can either declare a variable of this type, e.g.
'integer(C_SIZE_T) :: sz', to store the number of elements to
allocate, or you can use the 'int(..., C_SIZE_T)' intrinsic
function, e.g. set 'sz = L * M * N' or use 'int(L * M * N,
C_SIZE_T)' for an L x M x N array.
3. Declare a 'type(C_PTR) :: p' to hold the return value from FFTW's
allocation routine. Set 'p = fftw_alloc_real(sz)' for a real
array, or 'p = fftw_alloc_complex(sz)' for a complex array.
4. Associate your pointer 'arr' with the allocated memory 'p' using
the standard 'c_f_pointer' subroutine: 'call c_f_pointer(p, arr,
[...dimensions...])', where '[...dimensions...]' is an array of
the dimensions of the array (in the usual Fortran order), e.g.
'call c_f_pointer(p, arr, [L,M,N])' for an L x M x N array.
(Alternatively, you can omit the dimensions argument if you
specified the shape explicitly when declaring 'arr'.) You can now
use 'arr' as a usual multidimensional array.
5. When you are done using the array, deallocate the memory with
'call fftw_free(p)'.
For example, here is how we would allocate an L x M 2d real array:
real(C_DOUBLE), pointer :: arr(:,:)
type(C_PTR) :: p
p = fftw_alloc_real(int(L * M, C_SIZE_T))
call c_f_pointer(p, arr, [L,M])
_...use arr and arr(i,j) as usual..._
call fftw_free(p)
and here is an L x M x N 3d complex array:
complex(C_DOUBLE_COMPLEX), pointer :: arr(:,:,:)
type(C_PTR) :: p
p = fftw_alloc_complex(int(L * M * N, C_SIZE_T))
call c_f_pointer(p, arr, [L,M,N])
_...use arr and arr(i,j,k) as usual..._
call fftw_free(p)
See *note Reversing array dimensions:: for an example allocating a
single array and associating both real and complex array pointers with
it, for in-place real-to-complex transforms.
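One detail of the element count is worth spelling out: a product is
evaluated in the type of its operands, and a conversion applied to the
finished product comes too late to widen it. The following C sketch
(the function name is hypothetical) converts each factor first, so that
the multiplication itself is carried out in 'size_t':

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element count of an L x M x N array, computed
   entirely in size_t. Converting the factors before multiplying keeps
   the product out of plain int arithmetic. */
static size_t element_count_sketch(int L, int M, int N)
{
    assert(L >= 0 && M >= 0 && N >= 0);
    return (size_t)L * (size_t)M * (size_t)N;
}
```

For the small arrays in the examples above the distinction does not
matter; it only becomes relevant once the total number of elements
approaches the range of the default integer type.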

File: fftw3.info, Node: Accessing the wisdom API from Fortran, Next: Defining an FFTW module, Prev: Allocating aligned memory in Fortran, Up: Calling FFTW from Modern Fortran
7.6 Accessing the wisdom API from Fortran
=========================================
As explained in *note Words of Wisdom-Saving Plans::, FFTW provides a
"wisdom" API for saving plans to disk so that they can be recreated
quickly. The C API for exporting (*note Wisdom Export::) and importing
(*note Wisdom Import::) wisdom is somewhat tricky to use from Fortran,
however, because of differences in file I/O and string types between C
and Fortran.
* Menu:
* Wisdom File Export/Import from Fortran::
* Wisdom String Export/Import from Fortran::
* Wisdom Generic Export/Import from Fortran::

File: fftw3.info, Node: Wisdom File Export/Import from Fortran, Next: Wisdom String Export/Import from Fortran, Prev: Accessing the wisdom API from Fortran, Up: Accessing the wisdom API from Fortran
7.6.1 Wisdom File Export/Import from Fortran
--------------------------------------------
The easiest way to export and import wisdom is to do so using
'fftw_export_wisdom_to_filename' and 'fftw_import_wisdom_from_filename'. The
only trick is that these require you to pass a C string, which is an
array of type 'CHARACTER(C_CHAR)' that is terminated by 'C_NULL_CHAR'.
You can call them like this:
integer(C_INT) :: ret
ret = fftw_export_wisdom_to_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
if (ret .eq. 0) stop 'error exporting wisdom to file'
ret = fftw_import_wisdom_from_filename(C_CHAR_'my_wisdom.dat' // C_NULL_CHAR)
if (ret .eq. 0) stop 'error importing wisdom from file'
Note that prepending 'C_CHAR_' is needed to specify that the literal
string is of kind 'C_CHAR', and we null-terminate the string by
appending '// C_NULL_CHAR'. These functions return an 'integer(C_INT)'
('ret') which is '0' if an error occurred during export/import and
nonzero otherwise.
It is also possible to use the lower-level routines
'fftw_export_wisdom_to_file' and 'fftw_import_wisdom_from_file', which
accept parameters of the C type 'FILE*', expressed in Fortran as
'type(C_PTR)'. However, you are then responsible for creating the
'FILE*' yourself. You can do this by using 'iso_c_binding' to define
Fortran interfaces for the C library functions 'fopen' and 'fclose',
which is a bit strange in Fortran but workable.

File: fftw3.info, Node: Wisdom String Export/Import from Fortran, Next: Wisdom Generic Export/Import from Fortran, Prev: Wisdom File Export/Import from Fortran, Up: Accessing the wisdom API from Fortran
7.6.2 Wisdom String Export/Import from Fortran
----------------------------------------------
Dealing with FFTW's C string export/import is a bit more painful. In
particular, the 'fftw_export_wisdom_to_string' function requires you to
deal with a dynamically allocated C string. To get its length, you must
define an interface to the C 'strlen' function, and to deallocate it you
must define an interface to C 'free':
use, intrinsic :: iso_c_binding
interface
integer(C_SIZE_T) function strlen(s) bind(C, name='strlen')
import
type(C_PTR), value :: s
end function strlen
subroutine free(p) bind(C, name='free')
import
type(C_PTR), value :: p
end subroutine free
end interface
Given these definitions, you can then export wisdom to a Fortran
character array:
character(C_CHAR), pointer :: s(:)
integer(C_SIZE_T) :: slen
type(C_PTR) :: p
p = fftw_export_wisdom_to_string()
if (.not. c_associated(p)) stop 'error exporting wisdom'
slen = strlen(p)
call c_f_pointer(p, s, [slen+1])
...
call free(p)
Note that 'slen' is the length of the C string, but the length of the
array is 'slen+1' because it includes the terminating null character.
(You can omit the '+1' if you don't want Fortran to know about the null
character.) The standard 'c_associated' function checks whether 'p' is
a null pointer, which is returned by 'fftw_export_wisdom_to_string' if
there was an error.
To import wisdom from a string, use 'fftw_import_wisdom_from_string'
as usual; note that the argument of this function must be a
'character(C_CHAR)' array that is terminated by the 'C_NULL_CHAR' character,
like the 's' array above.
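The ownership and length rules for the exported string may be easier to
see from the C side. This is a minimal sketch, not FFTW code:
'export_to_string_sketch' is a hypothetical stand-in for the real export
routine, and the text it returns is a placeholder rather than actual
wisdom.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for fftw_export_wisdom_to_string: returns a
   heap-allocated, null-terminated string that the caller must free,
   or NULL on failure. The contents are placeholder text, not real
   wisdom. */
static char *export_to_string_sketch(void)
{
    const char *text = "(placeholder wisdom)";
    char *s = malloc(strlen(text) + 1);   /* +1 for the terminator */
    if (s != NULL)
        strcpy(s, text);
    return s;
}

/* Size of the array a caller needs in order to view the whole string,
   terminator included: the 'slen+1' of the Fortran example. */
static size_t array_length_sketch(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;   /* strlen returns size_t, not int */
}
```

The caller checks for a null pointer, measures the string, and finally
releases it with 'free', which is the same sequence the Fortran code
performs through its interfaces.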

File: fftw3.info, Node: Wisdom Generic Export/Import from Fortran, Prev: Wisdom String Export/Import from Fortran, Up: Accessing the wisdom API from Fortran
7.6.3 Wisdom Generic Export/Import from Fortran
-----------------------------------------------
The most generic wisdom export/import functions allow you to provide an
arbitrary callback function to read/write one character at a time in any
way you want. However, your callback function must be written in a
special way, using the 'bind(C)' attribute to be passed to a C
interface.
In particular, to call the generic wisdom export function
'fftw_export_wisdom', you would write a callback subroutine of the form:
subroutine my_write_char(c, p) bind(C)
use, intrinsic :: iso_c_binding
character(C_CHAR), value :: c
type(C_PTR), value :: p
_...write c..._
end subroutine my_write_char
Given such a subroutine (along with the corresponding interface
definition), you could then export wisdom using:
call fftw_export_wisdom(c_funloc(my_write_char), p)
The standard 'c_funloc' intrinsic converts a Fortran 'bind(C)'
subroutine into a C function pointer. The parameter 'p' is a
'type(C_PTR)' to any arbitrary data that you want to pass to
'my_write_char' (or 'C_NULL_PTR' if none). (Note that you can get a C
pointer to Fortran data using the intrinsic 'c_loc', and convert it back
to a Fortran pointer in 'my_write_char' using 'c_f_pointer'.)
Similarly, to use the generic 'fftw_import_wisdom', you would define
a callback function of the form:
integer(C_INT) function my_read_char(p) bind(C)
use, intrinsic :: iso_c_binding
type(C_PTR), value :: p
character :: c
_...read a character c..._
my_read_char = ichar(c, C_INT)
end function my_read_char
....
integer(C_INT) :: ret
ret = fftw_import_wisdom(c_funloc(my_read_char), p)
if (ret .eq. 0) stop 'error importing wisdom'
Your function can return '-1' if the end of the input is reached.
Again, 'p' is an arbitrary 'type(C_PTR)' that is passed through to your
function. 'fftw_import_wisdom' returns '0' if an error occurred and
nonzero otherwise.
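For comparison, here is a sketch of what a pair of callbacks with the
same shape might look like in C, reading from and writing to an
in-memory buffer. The 'buffer_sketch' structure and its fixed capacity
are assumptions made for the example; only the argument and return
conventions mirror the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user data passed through the 'p' argument. */
struct buffer_sketch {
    char   data[256];
    size_t len;   /* bytes written so far */
    size_t pos;   /* next byte to read    */
};

/* Write callback: receives one character plus the user pointer.
   Characters beyond the fixed capacity are silently dropped. */
static void my_write_char(char c, void *p)
{
    struct buffer_sketch *b = p;
    assert(b != NULL);
    if (b->len < sizeof b->data)
        b->data[b->len++] = c;
}

/* Read callback: returns the next character as a nonnegative int,
   or -1 once the end of the input is reached. */
static int my_read_char(void *p)
{
    struct buffer_sketch *b = p;
    assert(b != NULL);
    if (b->pos >= b->len)
        return -1;
    return (unsigned char)b->data[b->pos++];
}
```

A real callback would normally grow its buffer or report overflow
instead of dropping characters; the fixed array merely keeps the sketch
short.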

File: fftw3.info, Node: Defining an FFTW module, Prev: Accessing the wisdom API from Fortran, Up: Calling FFTW from Modern Fortran
7.7 Defining an FFTW module
===========================
Rather than using the 'include' statement to include the 'fftw3.f03'
interface file in any subroutine where you want to use FFTW, you might
prefer to define an FFTW Fortran module. FFTW does not install itself
as a module, primarily because 'fftw3.f03' can be shared between
different Fortran compilers while modules (in general) cannot. However,
it is trivial to define your own FFTW module if you want. Just create a
file containing:
module FFTW3
use, intrinsic :: iso_c_binding
include 'fftw3.f03'
end module
Compile this file into a module as usual for your compiler (e.g.
with 'gfortran -c' you will get a file 'fftw3.mod'). Now, instead of
'include 'fftw3.f03'', whenever you want to use FFTW routines you can
just do:
use FFTW3
as usual for Fortran modules. (You still need to link to the FFTW
library, of course.)

File: fftw3.info, Node: Calling FFTW from Legacy Fortran, Next: Upgrading from FFTW version 2, Prev: Calling FFTW from Modern Fortran, Up: Top
8 Calling FFTW from Legacy Fortran
**********************************
This chapter describes the interface to FFTW callable by Fortran code in
older compilers not supporting the Fortran 2003 C interoperability
features (*note Calling FFTW from Modern Fortran::). This interface has
the major disadvantage that it is not type-checked, so if you mistake
the argument types or ordering then your program will not have any
compiler errors, and will likely crash at runtime. So, greater care is
needed. Also, technically interfacing older Fortran versions to C is
nonstandard, but in practice we have found that the techniques used in
this chapter have worked with all known Fortran compilers for many
years.
The legacy Fortran interface differs from the C interface only in the
prefix ('dfftw_' instead of 'fftw_' in double precision) and a few other
minor details. This Fortran interface is included in the FFTW libraries
by default, unless a Fortran compiler isn't found on your system or
'--disable-fortran' is included in the 'configure' flags. We assume
here that the reader is already familiar with the usage of FFTW in C, as
described elsewhere in this manual.
The MPI parallel interface to FFTW is _not_ currently available to
legacy Fortran.
* Menu:
* Fortran-interface routines::
* FFTW Constants in Fortran::
* FFTW Execution in Fortran::
* Fortran Examples::
* Wisdom of Fortran?::

File: fftw3.info, Node: Fortran-interface routines, Next: FFTW Constants in Fortran, Prev: Calling FFTW from Legacy Fortran, Up: Calling FFTW from Legacy Fortran
8.1 Fortran-interface routines
==============================
Nearly all of the FFTW functions have Fortran-callable equivalents. The
name of the legacy Fortran routine is the same as that of the
corresponding C routine, but with the 'fftw_' prefix replaced by
'dfftw_'.(1) The single and long-double precision versions use 'sfftw_'
and 'lfftw_', respectively, instead of 'fftwf_' and 'fftwl_'; quadruple
precision ('real*16') is available on some systems as 'fftwq_' (*note
Precision::). (Note that 'long double' on x86 hardware is usually at
most 80-bit extended precision, _not_ quadruple precision.)
For the most part, all of the arguments to the functions are the
same, with the following exceptions:
* 'plan' variables (what would be of type 'fftw_plan' in C), must be
declared as a type that is at least as big as a pointer (address)
on your machine. We recommend using 'integer*8' everywhere, since
this should always be big enough.
* Any function that returns a value (e.g. 'fftw_plan_dft') is
converted into a _subroutine_. The return value is converted into
an additional _first_ parameter of this subroutine.(2)
* The Fortran routines expect multi-dimensional arrays to be in
_column-major_ order, which is the ordinary format of Fortran
arrays (*note Multi-dimensional Array Format::). They do this
transparently and costlessly simply by reversing the order of the
dimensions passed to FFTW, but this has one important consequence
for multi-dimensional real-complex transforms, discussed below.
* Wisdom import and export is somewhat more tricky because one cannot
easily pass files or strings between C and Fortran; see *note
Wisdom of Fortran?::.
* Legacy Fortran cannot use the 'fftw_malloc' dynamic-allocation
routine. If you want to exploit the SIMD FFTW (*note SIMD
alignment and fftw_malloc::), you'll need to figure out some other
way to ensure that your arrays are at least 16-byte aligned.
* Since Fortran 77 does not have data structures, the 'fftw_iodim'
structure from the guru interface (*note Guru vector and transform
sizes::) must be split into separate arguments. In particular, any
'fftw_iodim' array arguments in the C guru interface become three
integer array arguments ('n', 'is', and 'os') in the Fortran guru
interface, all of whose lengths should be equal to the
corresponding 'rank' argument.
* The guru planner interface in Fortran does _not_ do any automatic
translation between column-major and row-major; you are responsible
for setting the strides etcetera to correspond to your Fortran
arrays. However, as a slight bug that we are preserving for
backwards compatibility, the 'plan_guru_r2r' in Fortran _does_
reverse the order of its 'kind' array parameter, so the 'kind'
array of that routine should be in the reverse of the order of the
iodim arrays (see above).
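Since the guru interface leaves the strides to you, it may help to see
how they follow from a contiguous column-major layout. The C sketch
below uses a local structure that merely mirrors the 'n', 'is', and 'os'
fields named above; the structure and function names are hypothetical,
and identical input and output layouts are assumed.

```c
#include <assert.h>

/* Local stand-in mirroring the three fields named above (n, is, os);
   it is not the library's own declaration. */
struct iodim_sketch {
    int n;    /* size of the dimension      */
    int is;   /* input stride, in elements  */
    int os;   /* output stride, in elements */
};

/* Fill dims[] for a contiguous column-major array whose Fortran shape
   is shape[0] x shape[1] x ... x shape[rank-1]. The first Fortran
   index varies fastest, so its stride is 1. */
static void column_major_dims_sketch(int rank, const int *shape,
                                     struct iodim_sketch *dims)
{
    int stride = 1;
    assert(rank >= 0);
    for (int i = 0; i < rank; ++i) {
        dims[i].n  = shape[i];
        dims[i].is = stride;
        dims[i].os = stride;
        stride *= shape[i];
    }
}
```

For a 4 x 3 x 2 Fortran array this yields strides 1, 4, and 12. In
legacy Fortran the same three columns of numbers would be passed as the
separate integer arrays 'n', 'is', and 'os'.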
In general, you should take care to use Fortran data types that
correspond to (i.e. are the same size as) the C types used by FFTW. In
practice, this correspondence is usually straightforward (i.e.
'integer' corresponds to 'int', 'real' corresponds to 'float',
etcetera). The native Fortran double/single-precision complex type
should be compatible with 'fftw_complex'/'fftwf_complex'. Such simple
correspondences are assumed in the examples below.
---------- Footnotes ----------
(1) Technically, Fortran 77 identifiers are not allowed to have more
than 6 characters, nor may they contain underscores. Any compiler that
enforces this limitation doesn't deserve to link to FFTW.
(2) The reason for this is that some Fortran implementations seem to
have trouble with C function return values, and vice versa.

File: fftw3.info, Node: FFTW Constants in Fortran, Next: FFTW Execution in Fortran, Prev: Fortran-interface routines, Up: Calling FFTW from Legacy Fortran
8.2 FFTW Constants in Fortran
=============================
When creating plans in FFTW, a number of constants are used to specify
options, such as 'FFTW_MEASURE' or 'FFTW_ESTIMATE'. The same constants
must be used with the wrapper routines, but of course the C header files
where the constants are defined can't be incorporated directly into
Fortran code.
Instead, we have placed Fortran equivalents of the FFTW constant
definitions in the file 'fftw3.f', which can be found in the same
directory as 'fftw3.h'. If your Fortran compiler supports a
preprocessor of some sort, you should be able to 'include' or '#include'
this file; otherwise, you can paste it directly into your code.
In C, you combine different flags (like 'FFTW_PRESERVE_INPUT' and
'FFTW_MEASURE') using the ''|'' operator; in Fortran you should just use
''+''. (Take care not to add in the same flag more than once, though.
Alternatively, you can use the 'ior' intrinsic function standardized in
Fortran 90.)
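The reason for the caveat about repeated flags is that addition and
bitwise OR agree only when the operands have no bits in common. A small
C sketch illustrates this; the two flag values are illustrative
single-bit constants, not the definitions from 'fftw3.h' or 'fftw3.f'.

```c
#include <assert.h>

/* Illustrative single-bit flag values; the real definitions live in
   fftw3.h and fftw3.f and should be taken from there. */
#define FLAG_A_SKETCH (1U << 4)
#define FLAG_B_SKETCH (1U << 6)

/* Bitwise OR: repeating a flag is harmless. */
static unsigned combine_or_sketch(unsigned x, unsigned y)
{
    return x | y;
}

/* Addition: agrees with OR only while x and y share no set bits. */
static unsigned combine_add_sketch(unsigned x, unsigned y)
{
    assert((x & y) == 0);   /* guard against adding a flag twice */
    return x + y;
}
```

With these values, OR-ing a flag with itself leaves it unchanged, while
adding it to itself produces a different bit altogether, which the
planner would interpret as some other option.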

File: fftw3.info, Node: FFTW Execution in Fortran, Next: Fortran Examples, Prev: FFTW Constants in Fortran, Up: Calling FFTW from Legacy Fortran
8.3 FFTW Execution in Fortran
=============================
In C, in order to use a plan, one normally calls 'fftw_execute', which
executes the plan to perform the transform on the input/output arrays
passed when the plan was created (*note Using Plans::). The
corresponding subroutine call in legacy Fortran is:
call dfftw_execute(plan)
However, we have had reports that this causes problems with some
recent optimizing Fortran compilers. The problem is, because the
input/output arrays are not passed as explicit arguments to
'dfftw_execute', the semantics of Fortran (unlike C) allow the compiler
to assume that the input/output arrays are not changed by
'dfftw_execute'. As a consequence, certain compilers end up optimizing
out or repositioning the call to 'dfftw_execute', assuming incorrectly
that it does nothing.
There are various workarounds to this, but the safest and simplest
thing is to not use 'dfftw_execute' in Fortran. Instead, use the
functions described in *note New-array Execute Functions::, which take
the input/output arrays as explicit arguments. For example, if the plan
is for a complex-data DFT and was created for the arrays 'in' and 'out',
you would do:
call dfftw_execute_dft(plan, in, out)
There are a few things to be careful of, however:
* You must use the correct type of execute function, matching the way
the plan was created. Complex DFT plans should use
'dfftw_execute_dft', real-input (r2c) DFT plans should use
'dfftw_execute_dft_r2c', and real-output (c2r) DFT plans should use
'dfftw_execute_dft_c2r'. The various r2r plans should use
'dfftw_execute_r2r'.
* You should normally pass the same input/output arrays that were
used when creating the plan. This is always safe.
* _If_ you pass _different_ input/output arrays compared to those
used when creating the plan, you must abide by all the restrictions
of the new-array execute functions (*note New-array Execute
Functions::). The most difficult of these, in Fortran, is the
requirement that the new arrays have the same alignment as the
original arrays, because there seems to be no way in legacy Fortran
to obtain guaranteed-aligned arrays (analogous to 'fftw_malloc' in
C). You can, of course, use the 'FFTW_UNALIGNED' flag when creating
the plan, in which case the plan does not depend on the alignment,
but this may sacrifice substantial performance on architectures
(like x86) with SIMD instructions (*note SIMD alignment and
fftw_malloc::).

File: fftw3.info, Node: Fortran Examples, Next: Wisdom of Fortran?, Prev: FFTW Execution in Fortran, Up: Calling FFTW from Legacy Fortran
8.4 Fortran Examples
====================
In C, you might have something like the following to transform a
one-dimensional complex array:
fftw_complex in[N], out[N];
fftw_plan plan;
plan = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
In Fortran, you would use the following to accomplish the same thing:
double complex in, out
dimension in(N), out(N)
integer*8 plan
call dfftw_plan_dft_1d(plan,N,in,out,FFTW_FORWARD,FFTW_ESTIMATE)
call dfftw_execute_dft(plan, in, out)
call dfftw_destroy_plan(plan)
Notice how all routines are called as Fortran subroutines, and the
plan is returned via the first argument to 'dfftw_plan_dft_1d'. Notice
also that we changed 'fftw_execute' to 'dfftw_execute_dft' (*note FFTW
Execution in Fortran::). To do the same thing, but using 8 threads in
parallel (*note Multi-threaded FFTW::), you would simply prefix these
calls with:
integer iret
call dfftw_init_threads(iret)
call dfftw_plan_with_nthreads(8)
(You might want to check the value of 'iret': if it is zero, it
indicates an unlikely error during thread initialization.)
To check the number of threads currently being used by the planner,
you can do the following:
integer iret
call dfftw_planner_nthreads(iret)
To transform a three-dimensional array in-place with C, you might do:
fftw_complex arr[L][M][N];
fftw_plan plan;
plan = fftw_plan_dft_3d(L,M,N, arr,arr,
FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
In Fortran, you would use this instead:
double complex arr
dimension arr(L,M,N)
integer*8 plan
call dfftw_plan_dft_3d(plan, L,M,N, arr,arr,
& FFTW_FORWARD, FFTW_ESTIMATE)
call dfftw_execute_dft(plan, arr, arr)
call dfftw_destroy_plan(plan)
Note that we pass the array dimensions in the "natural" order in both
C and Fortran.
To transform a one-dimensional real array in Fortran, you might do:
double precision in
dimension in(N)
double complex out
dimension out(N/2 + 1)
integer*8 plan
call dfftw_plan_dft_r2c_1d(plan,N,in,out,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan, in, out)
call dfftw_destroy_plan(plan)
To transform a two-dimensional real array, out of place, you might
use the following:
double precision in
dimension in(M,N)
double complex out
dimension out(M/2 + 1, N)
integer*8 plan
call dfftw_plan_dft_r2c_2d(plan,M,N,in,out,FFTW_ESTIMATE)
call dfftw_execute_dft_r2c(plan, in, out)
call dfftw_destroy_plan(plan)
*Important:* Notice that it is the _first_ dimension of the complex
output array that is cut in half in Fortran, rather than the last
dimension as in C. This is a consequence of the interface routines
reversing the order of the array dimensions passed to FFTW so that the
Fortran program can use its ordinary column-major order.
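The equivalence behind this reversal can be checked with a few lines of
index arithmetic. The following C sketch (all names hypothetical)
computes the linear offset of an element under each convention, along
with the length of the halved dimension:

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of Fortran element a(i,j) (1-based) in an M x N
   column-major array. */
static size_t fortran_offset_sketch(int i, int j, int M)
{
    assert(i >= 1 && j >= 1 && i <= M);
    return (size_t)(i - 1) + (size_t)M * (size_t)(j - 1);
}

/* Linear offset of C element a[r][c] (0-based) in a row-major array
   with 'cols' columns. */
static size_t c_offset_sketch(int r, int c, int cols)
{
    assert(r >= 0 && c >= 0 && c < cols);
    return (size_t)r * (size_t)cols + (size_t)c;
}

/* Length of the halved dimension of an r2c output: n/2 + 1, using
   integer division. */
static int r2c_length_sketch(int n)
{
    assert(n > 0);
    return n / 2 + 1;
}
```

Fortran element 'a(i,j)' of an M x N array lands at the same offset as C
element 'a[j-1][i-1]' of an N x M array. The dimension of length M is
therefore last in C's view and first in Fortran's, and it is that
dimension which shrinks to 'M/2 + 1' in the complex output.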

File: fftw3.info, Node: Wisdom of Fortran?, Prev: Fortran Examples, Up: Calling FFTW from Legacy Fortran
8.5 Wisdom of Fortran?
======================
In this section, we discuss how one can import/export FFTW wisdom (saved
plans) to/from a Fortran program; we assume that the reader is already
familiar with wisdom, as described in *note Words of Wisdom-Saving
Plans::.
The basic problem is that it is difficult to (portably) pass files and
strings between Fortran and C, so we cannot provide a direct Fortran
equivalent to the 'fftw_export_wisdom_to_file', etcetera, functions.
Fortran interfaces _are_ provided for the functions that do not take
file/string arguments, however: 'dfftw_import_system_wisdom',
'dfftw_import_wisdom', 'dfftw_export_wisdom', and 'dfftw_forget_wisdom'.
So, for example, to import the system-wide wisdom, you would do:
integer isuccess
call dfftw_import_system_wisdom(isuccess)
As usual, the C return value is turned into a first parameter;
'isuccess' is non-zero on success and zero on failure (e.g. if there is
no system wisdom installed).
If you want to import/export wisdom from/to an arbitrary file or
elsewhere, you can employ the generic 'dfftw_import_wisdom' and
'dfftw_export_wisdom' functions, for which you must supply a subroutine
to read/write one character at a time. The FFTW package contains an
example file 'doc/f77_wisdom.f' demonstrating how to implement
'import_wisdom_from_file' and 'export_wisdom_to_file' subroutines in
this way. (These routines cannot be compiled into the FFTW library
itself, lest all FFTW-using programs be required to link with the
Fortran I/O library.)

File: fftw3.info, Node: Upgrading from FFTW version 2, Next: Installation and Customization, Prev: Calling FFTW from Legacy Fortran, Up: Top
9 Upgrading from FFTW version 2
*******************************
In this chapter, we outline the process for updating codes designed for
the older FFTW 2 interface to work with FFTW 3. The interface for FFTW
3 is not backwards-compatible with the interface for FFTW 2 and earlier
versions; codes written to use those versions will fail to link with
FFTW 3. Nor is it possible to write "compatibility wrappers" to bridge
the gap (at least not efficiently), because FFTW 3 has different
semantics from previous versions. However, upgrading should be a
straightforward process because the data formats are identical and the
overall style of planning/execution is essentially the same.
Unlike FFTW 2, there are no separate header files for real and
complex transforms (or even for different precisions) in FFTW 3; all
interfaces are defined in the '<fftw3.h>' header file.
Numeric Types
=============
The main difference in data types is that 'fftw_complex' in FFTW 2 was
defined as a 'struct' with macros 'c_re' and 'c_im' for accessing the
real/imaginary parts. (This is binary-compatible with FFTW 3 on any
machine except perhaps for some older Crays in single precision.) The
equivalent macros for FFTW 3 are:
#define c_re(c) ((c)[0])
#define c_im(c) ((c)[1])
This does not work if you are using the C99 complex type, however,
unless you insert a 'double*' typecast into the above macros (*note
Complex numbers::).
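One possible reading of that typecast is sketched here. The macro names
carry a '_SKETCH' suffix to mark them as illustrative, and taking the
address of the argument before casting is our assumption about how the
cast would be inserted; it relies on a C99 complex value having the same
layout as an array of two reals.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Variants of the macros above with the 'double*' cast inserted, so
   that they also accept a C99 'double complex' lvalue. The _SKETCH
   names are illustrative. */
#define C_RE_SKETCH(c) (((double *)&(c))[0])
#define C_IM_SKETCH(c) (((double *)&(c))[1])

/* Set both parts of a C99 complex value through the macros. */
static void set_parts_sketch(double complex *z, double re, double im)
{
    assert(z != NULL);
    C_RE_SKETCH(*z) = re;
    C_IM_SKETCH(*z) = im;
}
```

Written this way, the same macros also work on a plain two-element
'double' array, so code can be converted gradually.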
Also, FFTW 2 had an 'fftw_real' typedef that was an alias for
'double' (in double precision). In FFTW 3 you should just use 'double'
(or whatever precision you are employing).
Plans
=====
The major difference between FFTW 2 and FFTW 3 is in the
planning/execution division of labor. In FFTW 2, plans were found for a
given transform size and type, and then could be applied to _any_ arrays
and for _any_ multiplicity/stride parameters. In FFTW 3, you specify
the particular arrays, stride parameters, etcetera when creating the
plan, and the plan is then executed for _those_ arrays (unless the guru
interface is used) and _those_ parameters _only_. (FFTW 2 had "specific
planner" routines that planned for a particular array and stride, but
the plan could still be used for other arrays and strides.) That is,
much of the information that was formerly specified at execution time is
now specified at planning time.
Like FFTW 2's specific planner routines, the FFTW 3 planner
overwrites the input/output arrays unless you use 'FFTW_ESTIMATE'.
FFTW 2 had separate data types 'fftw_plan', 'fftwnd_plan',
'rfftw_plan', and 'rfftwnd_plan' for complex and real one- and
multi-dimensional transforms, and each type had its own 'destroy'
function. In FFTW 3, all plans are of type 'fftw_plan' and all are
destroyed by 'fftw_destroy_plan(plan)'.
Where you formerly used 'fftw_create_plan' and 'fftw_one' to plan and
compute a single 1d transform, you would now use 'fftw_plan_dft_1d' to
plan the transform. If you used the generic 'fftw' function to execute
the transform with multiplicity ('howmany') and stride parameters, you
would now use the advanced interface 'fftw_plan_many_dft' to specify
those parameters. The plans are now executed with 'fftw_execute(plan)',
which takes all of its parameters (including the input/output arrays)
from the plan.
In-place transforms no longer interpret their output argument as
scratch space, nor is there an 'FFTW_IN_PLACE' flag. You simply pass
the same pointer for both the input and output arguments. (Previously,
the output 'ostride' and 'odist' parameters were ignored for in-place
transforms; now, if they are specified via the advanced interface, they
are significant even in the in-place case, although they should normally
equal the corresponding input parameters.)
The 'FFTW_ESTIMATE' and 'FFTW_MEASURE' flags have the same meaning as
before, although the planning time will differ. You may also consider
using 'FFTW_PATIENT', which is like 'FFTW_MEASURE' except that it takes
more time in order to consider a wider variety of algorithms.
For multi-dimensional complex DFTs, instead of 'fftwnd_create_plan'
(or 'fftw2d_create_plan' or 'fftw3d_create_plan'), followed by
'fftwnd_one', you would use 'fftw_plan_dft' (or 'fftw_plan_dft_2d' or
'fftw_plan_dft_3d'), followed by 'fftw_execute'. If you used 'fftwnd'
to specify strides etcetera, you would instead specify these via
'fftw_plan_many_dft'.
The analogues to 'rfftw_create_plan' and 'rfftw_one' with
'FFTW_REAL_TO_COMPLEX' or 'FFTW_COMPLEX_TO_REAL' directions are
'fftw_plan_r2r_1d' with kind 'FFTW_R2HC' or 'FFTW_HC2R', followed by
'fftw_execute'. The stride etcetera arguments of 'rfftw' are now in
'fftw_plan_many_r2r'.
Instead of 'rfftwnd_create_plan' (or 'rfftw2d_create_plan' or
'rfftw3d_create_plan') followed by 'rfftwnd_one_real_to_complex' or
'rfftwnd_one_complex_to_real', you now use 'fftw_plan_dft_r2c' (or
'fftw_plan_dft_r2c_2d' or 'fftw_plan_dft_r2c_3d') or 'fftw_plan_dft_c2r'
(or 'fftw_plan_dft_c2r_2d' or 'fftw_plan_dft_c2r_3d'), respectively,
followed by 'fftw_execute'. As usual, the strides etcetera of
'rfftwnd_real_to_complex' or 'rfftwnd_complex_to_real' are now specified
in the advanced planner routines, 'fftw_plan_many_dft_r2c' or
'fftw_plan_many_dft_c2r'.
Wisdom
======
In FFTW 2, you had to supply the 'FFTW_USE_WISDOM' flag in order to use
wisdom; in FFTW 3, wisdom is always used. (You could simulate the FFTW
2 wisdom-less behavior by calling 'fftw_forget_wisdom' after every
planner call.)
The FFTW 3 wisdom import/export routines are almost the same as
before (although the storage format is entirely different). There is
one significant difference, however. In FFTW 2, the import routines
would never read past the end of the wisdom, so you could store extra
data beyond the wisdom in the same file, for example. In FFTW 3, the
file-import routine may read up to a few hundred bytes past the end of
the wisdom, so you cannot store other data just beyond it.(1)
Wisdom has been enhanced by additional humility in FFTW 3: whereas
FFTW 2 would re-use wisdom for a given transform size regardless of the
stride etc., in FFTW 3 wisdom is only used with the strides etc. for
which it was created. Unfortunately, this means FFTW 3 has to create
new plans from scratch more often than FFTW 2 (in FFTW 2, planning e.g.
one transform of size 1024 also created wisdom for all smaller powers of
2, but this no longer occurs).
FFTW 3 also has the new routine 'fftw_import_system_wisdom' to import
wisdom from a standard system-wide location.
Memory allocation
=================
In FFTW 3, we recommend allocating your arrays with 'fftw_malloc' and
deallocating them with 'fftw_free'; this is not required, but allows
optimal performance when SIMD acceleration is used. (Those two
functions actually existed in FFTW 2, and worked the same way, but were
not documented.)
In FFTW 2, there were 'fftw_malloc_hook' and 'fftw_free_hook'
functions that allowed the user to replace FFTW's memory-allocation
routines (e.g. to implement different error-handling, since by default
FFTW prints an error message and calls 'exit' to abort the program if
'malloc' returns 'NULL'). These hooks are not supported in FFTW 3;
those few users who require this functionality can just directly modify
the memory-allocation routines in FFTW (they are defined in
'kernel/alloc.c').
Fortran interface
=================
In FFTW 2, the subroutine names were obtained by replacing 'fftw_' with
'fftw_f77'; in FFTW 3, you replace 'fftw_' with 'dfftw_' (or 'sfftw_' or
'lfftw_', depending upon the precision).
In FFTW 3, we have begun recommending that you always declare the
type used to store plans as 'integer*8'. (Too many people didn't notice
our instruction to switch from 'integer' to 'integer*8' for 64-bit
machines.)
In FFTW 3, we provide a 'fftw3.f' "header file" to include in your
code (and which is officially installed on Unix systems). (In FFTW 2,
we supplied a 'fftw_f77.i' file, but it was not installed.)
Otherwise, the C-Fortran interface relationship is much the same as
it was before (e.g. return values become initial parameters, and
multi-dimensional arrays are in column-major order). Unlike FFTW 2, we
do provide some support for wisdom import/export in Fortran (*note
Wisdom of Fortran?::).
Threads
=======
Like FFTW 2, only the execution routines are thread-safe. All planner
routines, etcetera, should be called by only a single thread at a time
(*note Thread safety::). _Unlike_ FFTW 2, there is no special
'FFTW_THREADSAFE' flag for the planner to allow a given plan to be
usable by multiple threads in parallel; this is now the case by default.
The multi-threaded version of FFTW 2 required you to pass the number
of threads each time you execute the transform. The number of threads
is now stored in the plan, and is specified before the planner is called
by 'fftw_plan_with_nthreads'. The threads initialization routine used
to be called 'fftw_threads_init' and would return zero on success; the
new routine is called 'fftw_init_threads' and returns zero on failure.
The current number of threads used by the planner can be checked with
'fftw_planner_nthreads'. *Note Multi-threaded FFTW::.
There is no separate threads header file in FFTW 3; all the function
prototypes are in '<fftw3.h>'. However, you still have to link to a
separate library ('-lfftw3_threads -lfftw3 -lm' on Unix), as well as to
the threading library (e.g. POSIX threads on Unix).
---------- Footnotes ----------
(1) We do our own buffering because GNU libc I/O routines are
horribly slow for single-character I/O, apparently for thread-safety
reasons (whether you are using threads or not).

File: fftw3.info, Node: Installation and Customization, Next: Acknowledgments, Prev: Upgrading from FFTW version 2, Up: Top
10 Installation and Customization
*********************************
This chapter describes the installation and customization of FFTW, the
latest version of which may be downloaded from the FFTW home page
(http://www.fftw.org).
In principle, FFTW should work on any system with an ANSI C compiler
('gcc' is fine). However, planner time is drastically reduced if FFTW
can exploit a hardware cycle counter; FFTW comes with cycle-counter
support for all modern general-purpose CPUs, but you may need to add a
couple of lines of code if your compiler is not yet supported (*note
Cycle Counters::). (On Unix, there will be a warning at the end of the
'configure' output if no cycle counter is found.)
Installation of FFTW is simplest if you have a Unix or a GNU system,
such as GNU/Linux, and we describe this case in the first section below,
including the use of special configuration options to e.g. install
different precisions or exploit optimizations for particular
architectures (e.g. SIMD). Compilation on non-Unix systems is a more
manual process, but we outline the procedure in the second section. It
is also likely that pre-compiled binaries will be available for popular
systems.
Finally, we describe how you can customize FFTW for particular needs
by generating _codelets_ for fast transforms of sizes not supported
efficiently by the standard FFTW distribution.
* Menu:
* Installation on Unix::
* Installation on non-Unix systems::
* Cycle Counters::
* Generating your own code::

File: fftw3.info, Node: Installation on Unix, Next: Installation on non-Unix systems, Prev: Installation and Customization, Up: Installation and Customization
10.1 Installation on Unix
=========================
FFTW comes with a 'configure' program in the GNU style. Installation
can be as simple as:
     ./configure
     make
     make install
This will build the uniprocessor FFTW library (complex and real
transforms) along with the test programs. (We recommend that you use
GNU 'make' if it is available; on some systems it is called 'gmake'.)
The "'make install'" command installs the FFTW library in standard
places, and typically requires root privileges (unless you specify a
different install directory with the '--prefix' flag to 'configure').
You can also type "'make check'" to put the FFTW test programs through
their paces. If you have problems during configuration or compilation,
you may want to run "'make distclean'" before trying again; this ensures
that you don't have any stale files left over from previous compilation
attempts.
The 'configure' script chooses the 'gcc' compiler by default, if it
is available; you can select some other compiler with:
     ./configure CC="<the name of your C compiler>"
The 'configure' script knows good 'CFLAGS' (C compiler flags) for a
few systems. If your system is not known, the 'configure' script will
print out a warning. In this case, you should re-configure FFTW with
the command
     ./configure CFLAGS="<write your CFLAGS here>"
and then compile as usual. If you do find an optimal set of 'CFLAGS'
for your system, please let us know what they are (along with the output
of 'config.guess') so that we can include them in future releases.
'configure' supports all the standard flags defined by the GNU Coding
Standards; see the 'INSTALL' file in FFTW or the GNU web page
(http://www.gnu.org/prep/standards/html_node/index.html). Note
especially '--help' to list all flags and '--enable-shared' to create
shared, rather than static, libraries. 'configure' also accepts a few
FFTW-specific flags, particularly:
* '--enable-float': Produces a single-precision version of FFTW
('float') instead of the default double-precision ('double').
*Note Precision::.
* '--enable-long-double': Produces a long-double precision version of
FFTW ('long double') instead of the default double-precision
('double'). The 'configure' script will halt with an error message
if 'long double' is the same size as 'double' on your
machine/compiler. *Note Precision::.
* '--enable-quad-precision': Produces a quadruple-precision version
of FFTW using the nonstandard '__float128' type provided by 'gcc'
4.6 or later on x86, x86-64, and Itanium architectures, instead of
the default double-precision ('double'). The 'configure' script
will halt with an error message if the compiler is not 'gcc'
version 4.6 or later or if 'gcc''s 'libquadmath' library is not
installed. *Note Precision::.
* '--enable-threads': Enables compilation and installation of the
FFTW threads library (*note Multi-threaded FFTW::), which provides
a simple interface to parallel transforms for SMP systems. By
default, the threads routines are not compiled.
* '--enable-openmp': Like '--enable-threads', but using OpenMP
compiler directives in order to induce parallelism rather than
spawning its own threads directly, and installing an 'fftw3_omp'
library rather than an 'fftw3_threads' library (*note
Multi-threaded FFTW::). You can use both '--enable-openmp' and
'--enable-threads' since they compile/install libraries with
different names. By default, the OpenMP routines are not compiled.
* '--with-combined-threads': By default, if '--enable-threads' is
used, the threads support is compiled into a separate library that
must be linked in addition to the main FFTW library. This is so
that users of the serial library do not need to link the system
threads libraries. If '--with-combined-threads' is specified,
however, then no separate threads library is created, and threads
are included in the main FFTW library. This is mainly useful under
Windows, where no system threads library is required and
inter-library dependencies are problematic.
* '--enable-mpi': Enables compilation and installation of the FFTW
MPI library (*note Distributed-memory FFTW with MPI::), which
provides parallel transforms for distributed-memory systems with
MPI. (By default, the MPI routines are not compiled.) *Note FFTW
MPI Installation::.
* '--disable-fortran': Disables inclusion of legacy-Fortran wrapper
routines (*note Calling FFTW from Legacy Fortran::) in the standard
FFTW libraries. These wrapper routines increase the library size
by only a negligible amount, so they are included by default as
long as the 'configure' script finds a Fortran compiler on your
system. (To specify a particular Fortran compiler foo, pass
'F77='foo to 'configure'.)
* '--with-g77-wrappers': By default, when Fortran wrappers are
included, the wrappers employ the linking conventions of the
Fortran compiler detected by the 'configure' script. If this
compiler is GNU 'g77', however, then _two_ versions of the wrappers
are included: one with 'g77''s idiosyncratic convention of
appending two underscores to identifiers, and one with the more
common convention of appending only a single underscore. This way,
the same FFTW library will work with both 'g77' and other Fortran
compilers, such as GNU 'gfortran'. However, the converse is not
true: if you configure with a different compiler, then the
'g77'-compatible wrappers are not included. By specifying
'--with-g77-wrappers', the 'g77'-compatible wrappers are included
in addition to wrappers for whatever Fortran compiler 'configure'
finds.
* '--with-slow-timer': Disables the use of hardware cycle counters,
and falls back on 'gettimeofday' or 'clock'. This makes planning
much slower, and should generally not be used (unless you don't
have a cycle counter but still really want an optimized plan
regardless of the time). *Note Cycle Counters::.
* '--enable-sse' (single precision), '--enable-sse2' (single,
double), '--enable-avx' (single, double), '--enable-avx2' (single,
double), '--enable-avx512' (single, double),
'--enable-avx-128-fma', '--enable-kcvi' (single),
'--enable-altivec' (single), '--enable-vsx' (single, double),
'--enable-neon' (single, double on aarch64),
'--enable-generic-simd128', and '--enable-generic-simd256':
Enable various SIMD instruction sets. You need a compiler that
supports the given SIMD extensions, but FFTW will try to detect at
runtime whether the CPU supports these extensions. That is, you
can compile with '--enable-avx' and the code will still run on a CPU
without AVX support.
- These options require a compiler supporting SIMD extensions,
and compiler support is always a bit flaky: see the FFTW FAQ
for a list of compiler versions that have problems compiling
FFTW.
- Because of the large variety of ARM processors and ABIs, FFTW
does not attempt to guess the correct 'gcc' flags for
generating NEON code. In general, you will have to provide
them on the command line. This command line is known to have
worked at least once:
     ./configure --with-slow-timer --host=arm-linux-gnueabi \
       --enable-single --enable-neon \
       "CC=arm-linux-gnueabi-gcc -march=armv7-a -mfloat-abi=softfp"
To force 'configure' to use a particular C compiler foo (instead of
the default, usually 'gcc'), pass 'CC='foo to the 'configure' script;
you may also need to set the flags via the variable 'CFLAGS' as
described above.

File: fftw3.info, Node: Installation on non-Unix systems, Next: Cycle Counters, Prev: Installation on Unix, Up: Installation and Customization
10.2 Installation on non-Unix systems
=====================================
It should be relatively straightforward to compile FFTW even on non-Unix
systems lacking the niceties of a 'configure' script. Basically, you
need to edit the 'config.h' header (copy it from 'config.h.in') to
'#define' the various options and compiler characteristics, and then
compile all the '.c' files in the relevant directories.
The 'config.h' header contains about 100 options to set, each one
initially an '#undef', each documented with a comment, and most of them
fairly obvious. For most of the options, you should simply '#define'
them to '1' if they are applicable, although a few options require a
particular value (e.g. 'SIZEOF_LONG_LONG' should be defined to the size
of the 'long long' type, in bytes, or zero if it is not supported). We
will likely post some sample 'config.h' files for various operating
systems and compilers for you to use (at least as a starting point).
Please let us know if you have to hand-create a configuration file
(and/or a pre-compiled binary) that you want to share.
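A hand-edited excerpt of 'config.h' might look roughly like this. The option names are the ones mentioned in this chapter, but the values are assumptions for one hypothetical 64-bit x86 system; the excerpt is far from a complete file, and the final sanity check is our own addition rather than part of FFTW's 'config.h'.

```c
/* config.h: hand-edited excerpt (illustrative values only) */

/* Size of `long long' in bytes, or 0 if the type is not supported.
 * Assumption: 8 bytes on this system. */
#define SIZEOF_LONG_LONG 8

/* Define to 1 only if you compile the SIMD directories with SSE2 support. */
#define HAVE_SSE2 1

/* Last resort when no cycle counter is available: */
/* #define WITH_SLOW_TIMER 1 */

/* Our own compile-time check (not part of FFTW): the array size becomes
 * negative, and compilation fails, if the value above is wrong. */
typedef char sizeof_long_long_check[
    (SIZEOF_LONG_LONG == (int) sizeof(long long)) ? 1 : -1];
```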
To create the FFTW library, you will then need to compile all of the
'.c' files in the 'kernel', 'dft', 'dft/scalar', 'dft/scalar/codelets',
'rdft', 'rdft/scalar', 'rdft/scalar/r2cf', 'rdft/scalar/r2cb',
'rdft/scalar/r2r', 'reodft', and 'api' directories. If you are
compiling with SIMD support (e.g. you defined 'HAVE_SSE2' in
'config.h'), then you also need to compile the '.c' files in the
'simd-support', '{dft,rdft}/simd', and '{dft,rdft}/simd/*' directories.
Once these files are all compiled, link them into a library, or a
shared library, or directly into your program.
To compile the FFTW test program, additionally compile the code in
the 'libbench2/' directory, and link it into a library. Then compile
the code in the 'tests/' directory and link it to the 'libbench2' and
FFTW libraries. To compile the 'fftw-wisdom' (command-line) tool (*note
Wisdom Utilities::), compile 'tools/fftw-wisdom.c' and link it to the
'libbench2' and FFTW libraries.

File: fftw3.info, Node: Cycle Counters, Next: Generating your own code, Prev: Installation on non-Unix systems, Up: Installation and Customization
10.3 Cycle Counters
===================
FFTW's planner actually executes and times different possible FFT
algorithms in order to pick the fastest plan for a given n. In order to
do this in as short a time as possible, however, the timer must have a
very high resolution, and to accomplish this we employ the hardware
"cycle counters" that are available on most CPUs. Currently, FFTW
supports the cycle counters on x86, PowerPC/POWER, Alpha, UltraSPARC
(SPARC v9), IA64, PA-RISC, and MIPS processors.
Access to the cycle counters, unfortunately, is a compiler and/or
operating-system dependent task, often requiring inline assembly
language, and it may be that your compiler is not supported. If you are
_not_ supported, FFTW will by default fall back on its estimator
(effectively using 'FFTW_ESTIMATE' for all plans).
You can add support by editing the file 'kernel/cycle.h'; normally,
this will involve adapting one of the examples already present in order
to use the inline-assembler syntax for your C compiler, and will only
require a couple of lines of code. Anyone adding support for a new
system to 'cycle.h' is encouraged to email us at <fftw@fftw.org>.
If a cycle counter is not available on your system (e.g. some
embedded processor), and you don't want to use estimated plans, as a
last resort you can use the '--with-slow-timer' option to 'configure'
(on Unix) or '#define WITH_SLOW_TIMER' in 'config.h' (elsewhere). This
will use the much lower-resolution 'gettimeofday' function, or even
'clock' if the former is unavailable, and planning will be extremely
slow.
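To see why a coarse timer makes planning so slow, consider the following sketch. It is not FFTW's timing code, and all names in it are ours: with only 'clock' available, a short routine has to be repeated until the total run is long enough to measure at all, and a planner has to pay that cost for every candidate algorithm.

```c
#include <time.h>

static volatile double sink;   /* keeps the compiler from removing the work */

/* Sample workload standing in for one candidate transform (illustrative). */
void busy_work(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000; i++)
        s += 1.0 / i;
    sink = s;
}

/* Estimate the seconds taken by one call of fn using the coarse clock()
 * timer: double the repetition count until a run lasts at least
 * min_seconds (which should be positive), then divide. */
double seconds_per_call(void (*fn)(void), double min_seconds)
{
    unsigned long reps = 1;
    for (;;) {
        clock_t start = clock();
        for (unsigned long r = 0; r < reps; r++)
            fn();
        double elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;
        if (elapsed >= min_seconds)
            return elapsed / (double) reps;
        reps *= 2;
    }
}
```

A cycle counter can resolve a single short run directly, which is why it shortens planning so much.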

File: fftw3.info, Node: Generating your own code, Prev: Cycle Counters, Up: Installation and Customization
10.4 Generating your own code
=============================
The directory 'genfft' contains the programs that were used to generate
FFTW's "codelets," which are hard-coded transforms of small sizes. We
do not expect casual users to employ the generator, which is a rather
sophisticated program that generates directed acyclic graphs of FFT
algorithms and performs algebraic simplifications on them. It was
written in Objective Caml, a dialect of ML, which is available at
<http://caml.inria.fr/ocaml/index.en.html>.
If you have Objective Caml installed (along with recent versions of
GNU 'autoconf', 'automake', and 'libtool'), then you can change the set
of codelets that are generated or play with the generation options. The
set of generated codelets is specified by the
'{dft,rdft}/{codelets,simd}/*/Makefile.am' files. For example, you can
add efficient REDFT codelets of small sizes by modifying
'rdft/codelets/r2r/Makefile.am'. After you modify any 'Makefile.am'
files, you can type 'sh bootstrap.sh' in the top-level directory
followed by 'make' to re-generate the files.
We do not provide more details about the code-generation process,
since we do not expect that most users will need to generate their own
code. However, feel free to contact us at <fftw@fftw.org> if you are
interested in the subject.
You might find it interesting to learn Caml and/or some modern
programming techniques that we used in the generator (including monadic
programming), especially if you heard the rumor that Java and
object-oriented programming are the latest advancement in the field.
The internal operation of the codelet generator is described in the
paper, "A Fast Fourier Transform Compiler," by M. Frigo, which is
available from the FFTW home page (http://www.fftw.org) and also
appeared in the 'Proceedings of the 1999 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI)'.

File: fftw3.info, Node: Acknowledgments, Next: License and Copyright, Prev: Installation and Customization, Up: Top
11 Acknowledgments
******************
Matteo Frigo was supported in part by the Special Research Program SFB
F011 "AURORA" of the Austrian Science Fund FWF and by MIT Lincoln
Laboratory. For previous versions of FFTW, he was supported in part by
the Defense Advanced Research Projects Agency (DARPA), under Grants
N00014-94-1-0985 and F30602-97-1-0270, and by a Digital Equipment
Corporation Fellowship.
Steven G. Johnson was supported in part by a Dept. of Defense NDSEG
Fellowship, an MIT Karl Taylor Compton Fellowship, and by the Materials
Research Science and Engineering Center program of the National Science
Foundation under award DMR-9400334.
Code for the Cell Broadband Engine was graciously donated to the FFTW
project by the IBM Austin Research Lab and included in fftw-3.2. (This
code was removed in fftw-3.3.)
Code for the MIPS paired-single SIMD support was graciously donated
to the FFTW project by CodeSourcery, Inc.
We are grateful to Sun Microsystems Inc. for its donation of a
cluster of 9 8-processor Ultra HPC 5000 SMPs (24 Gflops peak). These
machines served as the primary platform for the development of early
versions of FFTW.
We thank Intel Corporation for donating a four-processor Pentium Pro
machine. We thank the GNU/Linux community for giving us a decent OS to
run on that machine.
We are thankful to the AMD corporation for donating an AMD Athlon XP
1700+ computer to the FFTW project.
We thank the Compaq/HP testdrive program and VA Software Corporation
(SourceForge.net) for providing remote access to machines that were used
to test FFTW.
The 'genfft' suite of code generators was written using Objective
Caml, a dialect of ML. Objective Caml is a small and elegant language
developed by Xavier Leroy. The implementation is available from
<http://caml.inria.fr/>. In previous releases
of FFTW, 'genfft' was written in Caml Light, by the same authors. An
even earlier implementation of 'genfft' was written in Scheme, but Caml
is definitely better for this kind of application.
FFTW uses many tools from the GNU project, including 'automake',
'texinfo', and 'libtool'.
Prof. Charles E. Leiserson of MIT provided continuous support and
encouragement. This program would not exist without him. Charles also
proposed the name "codelets" for the basic FFT blocks.
Prof. John D. Joannopoulos of MIT demonstrated continuing tolerance
of Steven's "extra-curricular" computer-science activities, as well as
remarkable creativity in working them into his grant proposals.
Steven's physics degree would not exist without him.
Franz Franchetti wrote SIMD extensions to FFTW 2, which eventually
led to the SIMD support in FFTW 3.
Stefan Kral wrote most of the K7 code generator distributed with FFTW
3.0.x and 3.1.x.
Andrew Sterian contributed the Windows timing code in FFTW 2.
Didier Miras reported a bug in the test procedure used in FFTW 1.2.
We now use a completely different test algorithm by Funda Ergun that
does not require a separate FFT program to compare against.
Wolfgang Reimer contributed the Pentium cycle counter and a few fixes
that help portability.
Ming-Chang Liu uncovered a well-hidden bug in the complex transforms
of FFTW 2.0 and supplied a patch to correct it.
The FFTW FAQ was written in 'bfnn' (Bizarre Format With No Name) and
formatted using the tools developed by Ian Jackson for the Linux FAQ.
_We are especially thankful to all of our users for their continuing
support, feedback, and interest during our development of FFTW._

File: fftw3.info, Node: License and Copyright, Next: Concept Index, Prev: Acknowledgments, Up: Top
12 License and Copyright
************************
FFTW is Copyright (C) 2003, 2007-11 Matteo Frigo, Copyright (C) 2003,
2007-11 Massachusetts Institute of Technology.
FFTW is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. You can
also find the GPL on the GNU web site
(http://www.gnu.org/licenses/gpl-2.0.html).
In addition, we kindly ask you to acknowledge FFTW and its authors in
any program or publication in which you use FFTW. (You are not
_required_ to do so; it is up to your common sense to decide whether you
want to comply with this request or not.) For general publications, we
suggest referencing: Matteo Frigo and Steven G. Johnson, "The design and
implementation of FFTW3," Proc. IEEE 93 (2), 216-231 (2005).
Non-free versions of FFTW are available under terms different from
those of the General Public License (e.g. they do not require you to
accompany any object code using FFTW with the corresponding source
code). For these alternative terms you must purchase a license from
MIT's Technology Licensing Office. Users interested in such a license
should contact us (<fftw@fftw.org>) for more information.