<span id="Load-balancing-1"></span><h4 class="subsection">6.4.2 Load balancing</h4>
<span id="index-load-balancing"></span>
<p>Ideally, when you parallelize a transform over some <em>P</em>
processes, each process should end up with work that takes equal time.
Otherwise, all of the processes end up waiting on whichever process is
slowest. This goal is known as &ldquo;load balancing.&rdquo; In this section,
we describe the circumstances under which FFTW is able to load-balance
well, and in particular how you should choose your transform size in
order to load-balance.
<p>Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
Pentium more work to do than the 486, since the latter is much slower.
FFTW does not deal with this problem, however: it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.
<p>For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the <em>first</em> dimension
<code>n0</code> is divisible by <em>P</em>; and (ii) the <em>product</em> of
the subsequent dimensions is divisible by <em>P</em>. (For the advanced
interface, where you can specify multiple simultaneous transforms via
some &ldquo;vector&rdquo; length <code>howmany</code>, a factor of <code>howmany</code> is
included in the product of the subsequent dimensions.)
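<p>The two conditions above can be checked before creating a plan. The
following is a minimal sketch, not part of the FFTW API: the function
name <code>can_balance_nd</code> and its argument layout are hypothetical,
chosen only for illustration. It takes the dimensions as an array
<code>n</code> of length <code>rnk</code>, the vector length
<code>howmany</code> (pass 1 for a single transform), and the number of
processes <code>P</code>.

```c
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): returns 1 if a
   multi-dimensional complex DFT of size n[0] x n[1] x ... x n[rnk-1],
   with howmany simultaneous transforms, satisfies both divisibility
   conditions for P processes, and 0 otherwise. */
int can_balance_nd(int rnk, const ptrdiff_t *n,
                   ptrdiff_t howmany, ptrdiff_t P)
{
    ptrdiff_t rem;
    int i;

    if (rnk < 2 || P < 1 || howmany < 1)
        return 0;

    /* Condition (i): the first dimension is divisible by P. */
    if (n[0] % P != 0)
        return 0;

    /* Condition (ii): the product of the subsequent dimensions, times
       howmany, is divisible by P. The product is tracked modulo P so
       that large dimensions do not overflow. */
    rem = howmany % P;
    for (i = 1; i < rnk; ++i)
        rem = (rem * (n[i] % P)) % P;

    return rem == 0;
}
```

<p>For example, under these assumptions a 128&times;100&times;60 transform
on 16 processes passes both checks, whereas 100&times;128&times;64 fails
the first one because 100 is not divisible by 16.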
<p>For a one-dimensional complex DFT, the length <code>N</code> of the data
should be divisible by <em>P</em> <em>squared</em> to be able to divide
the problem equally among the processes.
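<p>As an illustration of the one-dimensional rule, the sketch below tests
whether a length <code>N</code> is divisible by <em>P</em> squared and,
if your application is free to choose the transform size, finds the
smallest such length that is not less than <code>N</code>. Both function
names are hypothetical and are not part of FFTW; note also that a
multiple of <em>P</em> squared is not necessarily a size that FFTW
transforms quickly.

```c
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): returns 1 if a
   one-dimensional complex DFT of length N can be divided equally
   among P processes, i.e. if N is divisible by P*P. */
int can_balance_1d(ptrdiff_t N, ptrdiff_t P)
{
    if (N < 1 || P < 1)
        return 0;
    return N % (P * P) == 0;
}

/* Hypothetical helper: smallest multiple of P*P that is >= N.
   Assumes N >= 1 and P >= 1, and that P*P does not overflow. */
ptrdiff_t next_balanced_1d(ptrdiff_t N, ptrdiff_t P)
{
    ptrdiff_t p2 = P * P;
    return ((N + p2 - 1) / p2) * p2;
}
```

<p>With <em>P</em> = 4, for instance, a length of 1024 is divisible by 16
and so can be divided equally, while a length of 1000 cannot; the next
candidate length is 1008.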