A fast multipole method for stellar dynamics

The approximate computation of all gravitational forces between $N$ interacting particles via the fast multipole method (FMM) can be made as accurate as direct summation, but requires less than $\mathcal{O}(N)$ operations. FMM groups particles into spatially bounded cells and uses cell-cell interactions to approximate the force at any position within the sink cell by a Taylor expansion obtained from the multipole expansion of the source cell. By employing a novel estimate for the errors incurred in this process, I minimise the computational effort required for a given accuracy and obtain a well-behaved distribution of force errors. For relative force errors of $\sim10^{-7}$, the computational costs exhibit an empirical scaling of $\propto N^{0.87}$. My implementation (running on a 16-core node) outperforms a GPU-based direct summation with comparable force errors for $N\gtrsim10^5$.


Background
The computation of the mutual gravitational forces at every time step dominates the computational costs of all N-body simulations. When simulating collisionless stellar dynamics, the N-body model is merely a Monte Carlo representation of a smooth phase-space distribution, and the N-body force is only ever an estimate for the smooth force field of the continuous system modelled (see also Dehnen and Read). In particular, the N-body force unavoidably carries an estimation error. This motivates the use of approximate methods for computing the N-body force, such as the Barnes and Hut tree code, as long as the approximation errors are small compared to the estimation errors.
N-body simulations of collisional stellar dynamics are of a completely different nature. Here, the particles represent individual stars and the N-body force carries no estimation error. Consequently, the (negative) gravitational potential and its derivative, the acceleration, must be calculated with high accuracy. This is typically achieved by direct summation, when equation (1) is translated into computer code and the only errors are owed to finite computational precision. This computation incurs a cost of O(N) for a single particle and thus O(N²) per unit time for running a full simulation. As a consequence, realistic simulations of globular clusters and galactic centres are still very challenging and large parameter studies impossible. Measures employed to ameliorate this situation include the usage of powerful special-purpose hardware (Makino and Taiji) or graphical processing units (GPUs; Gaburov et al.), as well as separating the highly fluctuating forces due to close neighbours, in order to reduce the frequency of expensive far-field force computations (Ahmad and Cohen).
While these measures substantially reduce the effective costs, the O(N²) complexity remains. The alternative of using approximate methods also for collisional stellar dynamics is so far untested. The requirements for such a method differ from those of collisionless N-body methods in two important aspects: (i) there is no gravitational softening; and (ii) to preserve the validity of the N-body model, the approximation errors must be much smaller than what is common in collisionless N-body simulations.
A straightforward approach is to use the tree code with a small opening angle and/or high expansion order, resulting in a scheme with O(N ln N) costs. A more efficient approach is the fast multipole method (FMM; Greengard and Rokhlin; Cheng et al.), which has costs of only O(N). An initial attempt by Capuzzo-Dolcetta and Miocchi to port this technique from its original realm of molecular dynamics to astrophysics failed to obtain better practical efficiency than the tree code. However, when adapting FMM to the inhomogeneity of stellar systems and to the low force accuracy required in collisionless dynamics (by using a hierarchical tree data structure and a flexible opening angle), it is substantially faster than the tree code (Dehnen).
The critical question here is whether FMM can be tuned to be more efficient than direct summation at the force accuracies and particle numbers required by collisional N-body techniques. The goal of this study is to address this question by tuning FMM for the application to collisional N-body simulations, investigating the resulting dependence of computational costs and numerical accuracy on the various numerical parameters, and assessing its practical efficiency.
This paper is organised as follows. The next section and Appendix A derive and lay down the mathematical (and algorithmic) foundations of FMM. The following section (and Appendix B) introduces and motivates my approach for quantifying the resulting acceleration errors; subsequent sections provide useful estimates for the errors of individual FMM interactions, deal with optimising the multipole-acceptance criterion, and tune the method to obtain a force-accuracy target with minimal computational effort. Finally, possible extensions and applications are discussed before the paper concludes.

FMM basics
The tree code approximates the sum (1) by first dividing the source particles a into groups bounded by geometric cells, each of which is well-separated from the sink position x_b, and then computing the force of each source cell from its multipole moments. This corresponds to Taylor expanding the Green's function ψ(x_b − x_a) about the distance to an appropriate centre z of each source cell.
The essence of the fast multipole method is to Taylor expand the Green's function not only at the source positions x_a, but also at the sink positions x_b. The latter amounts to approximating (a contribution to) the gravitational field within each sink cell by its local Taylor expansion about some appropriate potential expansion centre s. Obviously, this approach is beneficial only if the forces for a large fraction of the sinks within a cell are computed simultaneously.

Mathematical background
The FMM relations are most easily derived using Cartesian coordinates. However, for Newtonian gravity, ψ = |r|^{-1}, the resulting relations are inefficient. Instead, exploiting that this Green's function satisfies ∇²ψ = 0 for r ≠ 0 naturally leads to spherical harmonics. Cheng et al. have already given (without derivation) the corresponding FMM relations, but in a form ill-suited for computer code. In Appendix A, I derive equivalent but much more compact and computationally convenient relations, which are summarised here.
Let r = (x, y, z) with spherical polar coordinates (r, θ, φ); then

Θ_n^m(r) = (−1)^m (n−m)! P_n^m(cos θ) e^{imφ} / r^{n+1},  (2a)
Υ_n^m(r) = (−1)^m r^n P_n^m(cos θ) e^{imφ} / (n+m)!,  (2b)

with integer indices 0 ≤ |m| ≤ n, are (complex-valued) harmonic functions, i.e. ∇²Υ_n^m = 0 for all r and ∇²Θ_n^m = 0 for all r ≠ 0. The Υ_n^m are homogeneous polynomials of total degree n in x, y, and z (they are defined in Appendix A without reference to polar coordinates; see also Table 1). With these definitions, the FMM relations (3) give the potential due to all particles within a source cell A at any position x_b within a sink cell B; here p is the expansion order and δΨ_{A→B} the error of the approximated potential, which decreases with increasing p. Other important relations are those for the multipoles M_n^m with respect to another expansion centre, for the field tensors F_n^m of the local expansion with respect to another expansion centre, for the computation of the acceleration a from the local expansion, and for the gravity generated by a source distribution with given multipoles. Relations (3b), (3d), and (3e) are equivalent to the much more complicated equations of Cheng et al., given there without derivation. There are (p+1)² independent real-valued numbers F_n^m (as well as M_n^m, see also Appendix A), and their computation via equations (3b), (3d), and (3e) requires O(p⁴) operations. These operation counts can be reduced to O(p³) by rotating r into the z direction (see Appendix A). Figure 1 plots the time required per interaction computation as a function of expansion order p, showing an effective scaling of the computational costs at low p that is shallower than the O(p³) asymptote.
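As a sanity check, the two families of solid harmonics can be evaluated directly from equation (2). The sketch below is purely illustrative; in particular, the associated-Legendre phase convention (here: no Condon-Shortley phase) is an assumption, since the paper defines P_n^m in Appendix A.

```python
import cmath
import math

def assoc_legendre(n, m, x):
    """P_n^m(x) via standard recurrences, WITHOUT the Condon-Shortley
    phase -- an assumed convention, which may differ from the paper's."""
    pmm = 1.0
    s = math.sqrt(max(0.0, 1.0 - x * x))
    for k in range(1, m + 1):            # P_m^m = (2m-1)!! (1-x^2)^(m/2)
        pmm *= (2 * k - 1) * s
    if n == m:
        return pmm
    pnm = x * (2 * m + 1) * pmm          # P_{m+1}^m
    for k in range(m + 2, n + 1):        # upward recurrence in n
        pmm, pnm = pnm, (x * (2 * k - 1) * pnm - (k + m - 1) * pmm) / (k - m)
    return pnm

def solid_harmonics(n, m, pos):
    """Theta_n^m (singular) and Upsilon_n^m (regular) of equation (2)."""
    x, y, z = pos
    r = math.sqrt(x * x + y * y + z * z)
    P = assoc_legendre(n, m, z / r)
    e = cmath.exp(1j * m * math.atan2(y, x))
    theta = (-1) ** m * math.factorial(n - m) * P * e / r ** (n + 1)
    upsilon = (-1) ** m * r ** n * P * e / math.factorial(n + m)
    return theta, upsilon
```

With these conventions Υ_0^0 = 1, Υ_1^0(r) = z, and Θ_0^0(r) = 1/r, and a finite-difference Laplacian of either family vanishes away from the origin, which is convention-independent and therefore a useful check.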

Algorithmic approach
The tree code: walking the tree
Let us first consider the tree code, which also uses the multipole expansion but is algorithmically simpler than FMM. The basic data structure is a hierarchical tree of spatial cells, which are either cubic with eight daughter cells (oct-tree) or cuboidal with two daughters (binary tree). In a first step, the multipoles M_n^m have to be computed for each cell from those of its daughter cells, using the M2M kernel (equation (3d), see also Table 1), or (in case of final cells) from those of its particles, using the P2M kernel (equation (3c)).
Next, the force for each sink position is computed using a separate tree walk starting with the root cell.

Table 1  The FMM kernels

  Name  Meaning                              Equation
  P2P   particle to particle                 (1)
  P2M   particle to multipole                (3c)
  M2M   multipole to multipole               (3d)
  M2P   multipole to particle                (3g)
  M2L   multipole to local expansion         (3b)
  P2L   particle to local expansion          see table legend
  L2L   local expansion to local expansion   (3e)
  L2P   local expansion to particle          (3f)

The tree code replaces direct summation (P2P) with P2M-M2M-M2P, while FMM uses P2M-M2M-M2L-L2L-L2P; see also Figure 2.

The force generated by a cell C is computed via its multipole expansion, using the M2P kernel (equation (3g)), if a multipole-acceptance criterion is met, i.e. if the cell is considered to be well-separated from the sink position. Otherwise, the cell is opened: the force is computed as the sum of the forces generated by the daughter cells (recursing if necessary). Thus, the tree code replaces direct summation's P2P kernel with the P2M, M2M, and M2P kernels; see the left panel of Figure 2 for a schematic view.

FMM: the dual tree walk
An adaptive FMM algorithm also uses a hierarchical tree data structure. As with the tree code, the cell multipoles M_n^m have to be pre-computed for every cell in a first step. Next, the forces for all sink positions and generated by all source particles are approximated using a single dual tree walk (Dehnen). This algorithm considers cell → cell interactions and starts with the root → root interaction. If the interacting cells are well separated, the interaction is approximated using the M2L kernel (equation (3b)), which computes and accumulates the local field tensors F_n^m(s_B) for the expansion of gravity within the sink cell B due to all sources within the source cell A (in a mutual version of the algorithm, the interactions A → B and B → A are considered simultaneously). Otherwise, the interaction is split, typically into those between the daughters of the larger of the two interacting cells and the smaller cell.
Finally, the local field tensors F_n^m(s) are passed down the tree using the L2L kernel, and the local expansions are evaluated at the sink positions using the L2P kernel. Thus, FMM replaces the M2P kernel of the tree code with the M2L, L2L and L2P kernels; see also Figure 2.
Of course, in both tree code and FMM, direct summation (the P2P kernel) is used whenever computationally preferable, i.e. for interactions involving only a few sources and sinks.
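The dual tree walk can be sketched in a few lines. The Python fragment below uses illustrative names (not the paper's actual code) and shows the recursion over cell pairs: direct summation for small interactions, M2L when the acceptance criterion holds, and otherwise splitting the larger cell. Correctness follows from the recursion partitioning the set of source-sink pairs A × B.

```python
def split_larger(A, B):
    """Split the larger of two cells (fall back to the other if it is a leaf)."""
    if A.daughters and (A.radius >= B.radius or not B.daughters):
        return [(d, B) for d in A.daughters]
    if B.daughters:
        return [(A, d) for d in B.daughters]
    return None                                   # both cells are leaves

def dual_tree_walk(A, B, accept, m2l, p2p, n_direct=64):
    """Approximate forces on sinks in B due to sources in A (illustrative)."""
    if A.count * B.count <= n_direct:             # few pairs: direct summation
        p2p(A, B)
    elif accept(A, B):                            # multipole-acceptance criterion
        m2l(A, B)                                 # one M2L call serves the whole pair
    else:
        pairs = split_larger(A, B)
        if pairs is None:                         # two leaves, not well separated
            p2p(A, B)
        else:
            for a, b in pairs:
                dual_tree_walk(a, b, accept, m2l, p2p, n_direct)
```

Starting with `dual_tree_walk(root, root, ...)` then covers all interactions exactly once; the P2M/M2M upward pass and the L2L/L2P downward pass happen before and after this walk. (A real P2P kernel would also skip the self-interaction a = b.)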

Quantifying the approximation accuracy
Before the method can be optimised for accuracy, a sensible quantitative measure for this accuracy is needed, as well as an acceptable value for this measure.

Figure 2 Schematic view of the tree code (left) and FMM (right). The tree code approximates the force from a source particle (blue star) at a sink position (red star) using the P2M and M2M kernels (green arrows) to compute the multipoles at the cell centres (blue circles), followed by the M2P kernel (pink arrow). The P2M and M2M kernels are called once per source particle and cell, respectively, while the M2P kernel is called many times for each sink position. FMM replaces many calls to the M2P kernel by a single call to the M2L kernel (red arrow) followed by the L2L and L2P kernels (green arrows). Again, the L2L and L2P kernels are called once per sink cell and particle, respectively, but a single M2L kernel replaces many M2P kernels of the tree code, because it accounts for all sink positions within the sink cell.
With direct summation, the accuracy is limited only by the finite precision of computer arithmetic (round-off error). If double (64-bit) precision is not used throughout, it is customary to use the conservation of the total energy for quality control (e.g. Gaburov et al.). However, as shown in Appendix B, the relative energy error is much smaller than the typical relative force error, simply because it is an average over many force errors. Even worse, the computation of the total energy, required for measuring its error, typically incurs a larger error. Thus, any measured non-conservation of the total energy is dominated by measurement error rather than by true non-conservation due to acceleration errors.
With the tree code and FMM, the situation is subtly different, as discussed in Appendix B. Here, the measured non-conservation of energy actually reflects the amplitude of the acceleration errors in an average sense. However, an average measure for the effect of approximation errors cannot reflect their effect on the correctness of the simulation. For example, a single large force error has hardly any effect on the energy conservation but may seriously affect the validity of the simulation. While this latter goal is difficult to quantify, it is certainly better to consider the whole distribution of acceleration errors, and to pay particular attention to large-error outliers, than to merely monitor an average.

Scaling acceleration errors
Obviously, the absolute errors δa = |a_computed − a_true| are not very useful by themselves and must be normalised to be meaningful. One option is to divide δa by some mean field strength ā. While this makes sense for the average particle, it fails for those in the outskirts of the stellar system, where the field strength diminishes well below its mean.
To overcome such issues, a natural choice is the relative error δa/a. However, this is still problematic in the centre of a stellar system, where forces from the outward-lying parts largely cancel. In such a situation, a can be small and hence the relative error large, even if each individual pairwise force has been computed with high accuracy. One option for avoiding this problem is, in analogy to the error estimate of numerical quadrature in case of an integrand oscillating around zero, to normalise δa with the sum of the absolute values of all pairwise accelerations,

f_b ≡ Σ_{a≠b} G m_a / |x_b − x_a|².  (4)

In general f_b ≥ a_b, while in the outskirts of a stellar system f → a ≈ GM/r², such that the scaled error δa/f approaches the relative error δa/a as desired. Conversely, in the centre f ≫ a (for a Plummer sphere, for example, f → GM/r_s² as r → 0 in the continuum limit) and δa/f behaves sensibly even if a → 0.
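A direct-summation reference that returns both the acceleration a_b and the error scale f_b (the sum of the magnitudes of all pairwise accelerations) might look as follows; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def accel_and_error_scale(pos, mass, G=1.0):
    """Direct summation of a_b and of f_b = sum_a G m_a / |x_b - x_a|^2,
    the sum of the magnitudes of all pairwise accelerations."""
    N = len(mass)
    a = np.zeros((N, 3))
    f = np.zeros(N)
    for b in range(N):
        d = pos - pos[b]                        # separation vectors x_a - x_b
        r2 = np.einsum('ij,ij->i', d, d)
        r2[b] = np.inf                          # exclude the self-interaction
        g = G * mass / r2                       # pairwise acceleration magnitudes
        a[b] = np.sum(g[:, None] * d / np.sqrt(r2)[:, None], axis=0)
        f[b] = g.sum()
    return a, f
```

For a particle midway between two equal masses, a vanishes while f stays finite, so δa/f remains a sensible error measure exactly where δa/a blows up.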

The acceleration errors of direct summation
In order to assess the errors currently tolerated in collisional N-body simulations, the GPU-based direct-summation library sapporo (Gaburov et al.) was applied to two sets of, respectively, N = 10^5 and N = 10^6 equal-mass particles, drawn randomly from a Plummer sphere (without any outer truncation). Figure 3 shows the resulting distributions of acceleration errors as compared to direct summation in double (64-bit) precision. As expected, the typical relative (or scaled) error is ∼10^{-7}, comparable to the relative round-off error of single-precision floating-point arithmetic. However, there is a clear tail of large relative errors (middle panel). This is due to particles at small radii, whose acceleration is small, because the pairwise forces with other particles mostly cancel out, while the (round-off) errors accumulate.
There is a significant increase in the error amplitude with particle number N: the errors for N = 10^6 are on average ∼√10 larger than for N = 10^5. This worrying property suggests that the fidelity of simulations using sapporo diminishes with N, implying that using this library at much larger N is not advisable.

Figure 3 Acceleration errors from the GPU. Distribution of acceleration errors δa_i for N = 10^5 (blue) and N = 10^6 (black) particles drawn from a Plummer sphere when using the state-of-the-art GPU-based direct-summation library sapporo (version 1.6), as compared to direct summation in double (64-bit) precision. The top, middle, and bottom panels refer to, respectively, the normalised (by the mean acceleration ā), relative, and scaled (by f defined in equation (4)) acceleration errors. The thin vertical lines indicate the rms error (dashed) as well as the median and the 99 and 99.9 percentiles (solid). Bins are 0.1 dex wide.
From this exercise I conclude that, in practice, relative (or scaled) acceleration errors with an rms value of a few 10^{-7}, and maximum errors considerably larger, are accepted in N-body simulations of collisional stellar dynamics.
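The distribution measures used above (rms, median, and tail percentiles rather than a single average) are straightforward to compute; a small helper, purely illustrative:

```python
import numpy as np

def error_summary(err):
    """rms, median, and tail percentiles of a sample of (relative or
    scaled) acceleration errors, as indicated in Figure 3."""
    err = np.asarray(err, dtype=float)
    return {
        "rms": float(np.sqrt(np.mean(err ** 2))),
        "median": float(np.median(err)),
        "99%": float(np.percentile(err, 99.0)),
        "99.9%": float(np.percentile(err, 99.9)),
    }
```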

Assessing the approximation errors
In order to optimise any implementation of FMM for high accuracy and low computational costs, a good understanding of and accurate estimates for the errors incurred by each individual FMM interaction are required. To this end, I now perform some numerical experiments.
I create a Plummer sphere of N particles and build an oct-tree. For each cell, the centre z_ses of the smallest sphere enclosing all its particles is found (see below). I use z = s = z_ses for each cell and pre-compute the cells' multipole moments M_n^m(z). Finally, the dual tree walk is performed using the multipole-acceptance criterion with the opening angle

θ ≡ (ρ_{z,A} + ρ_{s,B}) / |s_B − z_A|.  (5)

Here, ρ_z and ρ_s are, for each cell C, (approximations for) the radii of the smallest spheres centred on z and s and containing all sources and sinks, respectively. In the experiments of this section ρ_z = ρ_s for each cell, because z = s and because all particles are sources and sinks simultaneously, but in general ρ_z and ρ_s may differ.
With the simple criterion (5) the multipole expansion is guaranteed to converge and have bounded errors. Cell → cell interactions with N_A N_B below a p-dependent threshold, as well as cell → particle and particle → cell interactions with too few particles, are ignored, because for them direct summation is faster than FMM and will be preferred in a practical application. For the remaining well-separated interactions, the accelerations of all particles within the sink cell due to all particles within the source cell are calculated in 64-bit precision using both FMM and direct summation. I then evaluate for each sink particle the acceleration error

δa ≡ |a_fmm − a_true|  (6)

with a_true obtained by direct summation.
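The multipole-acceptance criterion reduces to a few lines; the placement of symbols below is my reconstruction of criterion (5) from the text (sum of source and sink radii over the distance between the expansion centres):

```python
import math

def opening_angle(z_A, rho_zA, s_B, rho_sB):
    """theta of criterion (5): (rho_z,A + rho_s,B) / |s_B - z_A|
    (my reconstruction of the symbol placement)."""
    return (rho_zA + rho_sB) / math.dist(z_A, s_B)

def well_separated(z_A, rho_zA, s_B, rho_sB, theta_crit):
    """Accept the interaction A -> B if the opening angle is small enough."""
    return opening_angle(z_A, rho_zA, s_B, rho_sB) < theta_crit
```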

Cell-cell interactions
Cell-cell interactions involve the M2L kernel of the P2M + [M2M] + M2L + [L2L] + L2P chain of kernels. They are by far the most common and most important of all interactions encountered in the dual tree walk. For a random subset of cell-cell interactions generated by my experiments, the top panel of Figure 4 plots the maximum (over all particles within the sink cell) of δa normalised by the average acceleration M_A/r² against θ, while the bottom panel plots the maximum relative force error δa/a. As expected, the errors decrease with smaller θ and increasing p, though there is substantial scatter at any given θ and p. At θ ∼ 1, the expansion order has little effect on the errors, implying that θ well below unity is required for small errors.

Comparing with simple error estimates
The approximation error from a single FMM interaction with θ < 1 has a theoretical strict upper bound (Dehnen; equation (7)), which is plotted as thin curves in the top panel of Figure 4. Obviously, this upper bound is satisfied, but it typically exceeds the actual largest error substantially. Moreover, equation (7) predicts diverging errors for θ → 1, while the actual errors behave much more benignly. This is presumably because diverging errors only occur for rare sink positions combined with extreme source distributions (such as all particles concentrated near one point at the edge of the source sphere), which are not realised in these experiments.
Figure  also shows as dashed lines the simple power laws θ p , which give closer, though not strict, bounds δa θ p M A /r  and δa/a θ p () to the actual errors.

Better error estimates
The simple error estimates (8) are still quite inaccurate: the maximal error is often much smaller (see also the dashed histograms in the right panels of Figure 4). The offsets of θ^p from the actual errors increase with p (see the left panels of Figure 4). This effect vanishes if the same limit for N_A N_B is used for all p, suggesting that it is caused by smoother distributions for larger numbers N_A of sources. Indeed, if I simply divide the estimates (8) by √N_A, the scatter of the residuals is much reduced, but a systematic trend with p remains.
However, there is more information about the distribution of sources than merely their number: their multipole moments M_n^m for n ≤ p. In order to incorporate this information into an error estimate, I first compute for each cell the multipole power P_n (equation (9)), a rotationally invariant measure for the magnitude of the multipoles of order n. By design these (i) satisfy P_{n,A} ≤ M_A ρ^n_{z,A} for any distribution of sources; (ii) are invariant under rotation (of the coordinate system) and hence independent of the interaction direction; and (iii) provide an upper bound for the amplitude of the multipoles: |M_n^m(z)| ≤ P_n/n!. Having computed P_n for each source cell, one can evaluate the improved error estimates (10). In the right panels of Figure 4, these new error estimates are compared with the simple estimates (8) of the last subsection by displaying the distributions of the ratio of the actual maximum error to these estimates. The main difference between the two sets of estimators is their accuracy: there is much less scatter for the new (solid histograms) than for the old estimators (dashed). Consequently, there are hardly any interactions for which the force error is overestimated by more than a factor of ten, while the simple estimators (8) overestimated the force error by more than that for many interactions, in particular at large p.

Figure 6 Errors of the P2L kernel. As Figure 4 but for particle → cell interactions, see Section 4.2.2.
Another remarkable property of the new error estimator is its consistency with respect to expansion order: there is no systematic drift with p. The number of underestimated force errors (abscissa > 1 in the right panels of Figure 4) is small, but there is a clear tail of underestimated absolute errors (top panel). As this is not present for the relative errors, it must be caused by the deviation of the acceleration from the mean M_A/r². Indeed, the maximum error is expected to occur on the side of the sink towards the source, where the acceleration is larger, about M_A/(r − ρ_{s,B})². When accounting for this by simply replacing r in the estimates (10) with r − ρ_{s,B}, the tail of underestimated force errors is diminished, but the overall distribution widens and a tail of overestimated errors appears.

Particle-cell interactions
Just occasionally, the dual tree walk algorithm encounters particle-cell interactions. Most of them will be computed using direct summation, leaving only the few with populous cells for the FMM approximation.
For particle → cell and cell → particle interactions the FMM approximation uses the P2L and M2P kernels, respectively. Because these kernels correspond to the M2L kernel in the limits ρ_{z,A} → 0 and ρ_{s,B} → 0, respectively, all the algebra developed in the previous subsection still applies. Figure 5 is equivalent to Figure 4 for cell → particle interactions (which dominate in the tree code). The most notable difference to Figure 4 is the streaky nature of the relations in the left panels, implying a multi-modal distribution of errors at any given θ and p, as also evident from the dashed histograms in the right panels. The cause for this is simply that in an oct-tree the cell size is quantised. In fact, the improved error estimates (10) account for this effect, resulting in narrow mono-modal distributions of error offsets. Figure 6 is equivalent to Figure 4 for particle → cell interactions. Clearly, at any given θ and p, the errors are larger than for any other type of interaction and are in fact approaching the theoretical limit (solid curves in the top left panel). What is more, not much can be done about this in terms of error estimates: since the source is just a particle without inner structure, the improved estimates (10) are simply a rescaling of the simple power laws by a constant factor (a simple shift between the dashed and solid histograms in the right panels). They are nonetheless as accurate as for the cell → cell interactions and suffer from a similar level of force underestimation (for a few per cent of interactions, and by less than a factor of two).

Optimising the multipole-acceptance criterion
With the improved error estimates in hand, the practical implementation of FMM for high accuracy can finally be considered. The main questions arising in this context are:
• What to pick for the expansion centres z and s?
• When to consider two cells well-separated?
• What expansion order p to use?
The possible answers to these questions affect both the computational cost and the approximation accuracy. Hence, for a given accuracy target, there exists an optimal choice for all these parameters, in the sense of minimal CPU time (and memory) consumption. This section deals with the algorithmic aspects of this problem, i.e. the choice for z and s and the functional form of the multipole-acceptance criterion. The tuning of the parameters (of the multipole-acceptance criterion as well as the expansion order) with the aim of minimal computational effort for a given accuracy is the subject of the next section.
Astonishingly, this issue of the optimal choice for z and s and the multipole-acceptance criterion has not been much investigated. Instead, implementations of multipole methods often employ either of two simple strategies. The tree code generally uses a fixed order p and an expansion centred on the cells' centres of mass, while two cells are considered well-separated if the simple geometric multipole-acceptance criterion (5) is satisfied, such that θ_crit controls the accuracy.
With traditional FMM, on the other hand, the expansion centres z and s are both taken to be the geometric cell centres, and two cells are deemed well-separated as soon as the expansion converges, corresponding to θ_crit = 1. When using hierarchical cubic grids (instead of an adaptive tree), this is implemented by interacting only between non-neighbouring cells on the same grid level whose parent cells are neighbours (e.g. Cheng et al.). The accuracy is then only controlled by the expansion order p.

Choice of expansion centres z and s
As far as I am aware, all existing FMM implementations use the same position for the multipole and potential expansion centres, i.e. z = s, for each cell. For traditional FMM, these are equal to the geometric cell centres. This has the benefit of a finite number of possible interaction directions, in particular when θ_crit = 1, for which the coefficients Θ_n^m(r) could be pre-computed. However, the computation of these coefficients on the fly is often faster than a table look-up. Moreover, in view of Figure 4, θ_crit = 1 appears ill-suited for high accuracy.
In fact, the restriction z = s reduces the freedom and hence the potential for optimising the method. Nonetheless, when aiming for low accuracy, choosing z = s = z_com, the cells' centres of mass, has some advantages. First, the dipoles vanish and the low-order multipoles tend to be near-minimal. Second, if using a mutual version of the algorithm (when the interactions A → B and B → A are done simultaneously), the computational costs are reduced and the approximated forces satisfy Newton's third law exactly, i.e. F_ab + F_ba = 0 (Dehnen).
However, in practice there is no benefit from such an exact obedience of Newton's third law, as the total momentum is not exactly conserved anyway, because of integration errors arising from the fact that the particles have individual time steps. Moreover, the degree of deviation from exact momentum conservation in such a case does not reflect the true accumulated force errors. In a more general method, the approximated forces will deviate from the ideal F_ab + F_ba = 0 by an amount comparable to their actual force errors, and the non-conservation of total momentum is somewhat indicative of the accumulated effect of the force errors (see also Appendix B).

Choice of the potential expansion centre s
The results of the previous section, in particular the functional form of the error estimate E_{A→B} in equation (10), suggest choosing the potential expansion centres s such that the resulting sink radii ρ_s, and hence the estimated interaction errors, are minimal. Thus, s = z_ses, the centre of the smallest enclosing sphere. Finding the smallest enclosing sphere for a set of n points has complexity O(n). Doing this for every sink cell would incur a total cost of O(N ln N) and be prohibitively expensive.
Instead, I use an accurate approximation by finding for each cell the smallest sphere enclosing the spheres of its grand-daughter cells. This incurs a total cost of O(N) and is implemented via the Computational Geometry Algorithms Library (www.cgal.org, Fischer et al.), using an algorithm of Matoušek et al.

Choice of the multipole expansion centre z
As already mentioned above, setting z = z_com has some virtue for low expansion orders p. However, for high expansion orders, the high-order multipoles become ever more important, suggesting that z = z_ses may be a better choice. In order to assess the relative merits of these choices, I repeated the experiments of the previous section for both and compared the resulting maximum absolute and relative force errors incurred for the same cell → cell interactions (for which the two methods give different θ).
I found that the errors for the two methods are very similar, with a small rms deviation and a very small mean deviation. At low p there is a trend of more accurate forces for z = z_com, while at high p smaller errors are obtained with z = z_ses. This trend is simply a consequence of P_k being smaller for z = z_com than for z = z_ses at low k and larger at high k. Together with the improved error estimates (10), this also explains why (for an interaction A → B) z = z_com tends to give more accurate forces if ρ_{z,A} < ρ_{s,B}, while z = z_ses tends to be more accurate if ρ_{z,A} > ρ_{s,B}.
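The grow-to-fit approximation for a sphere enclosing a set of spheres can be sketched as follows. This is a simple Ritter-style pass, an illustrative stand-in for the exact CGAL algorithm (Fischer et al.; Matoušek et al.) that the text actually uses; it is only approximately minimal.

```python
import math

def enclosing_sphere(spheres):
    """Approximate smallest sphere enclosing a list of (centre, radius)
    spheres: a single grow-to-fit pass, NOT the exact algorithm."""
    (cx, cy, cz), r = spheres[0]
    for (x, y, z), ri in spheres[1:]:
        d = math.dist((cx, cy, cz), (x, y, z))
        if d + ri <= r:
            continue                        # sphere i already inside
        if d + r <= ri:
            (cx, cy, cz), r = (x, y, z), ri # current sphere inside sphere i
            continue
        new_r = 0.5 * (r + d + ri)          # smallest sphere covering both
        t = (new_r - r) / d                 # shift centre towards the outlier
        cx += (x - cx) * t
        cy += (y - cy) * t
        cz += (z - cz) * t
        r = new_r
    return (cx, cy, cz), r
```

Applied to a cell's (grand-)daughter spheres, this yields the approximate z_ses and the radii ρ used in the multipole-acceptance criterion.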

A simple FMM implementation
Let us first experiment with an implementation that uses the simple multipole-acceptance criterion (5) and a fixed expansion order p. This is the standard choice for the tree code and as such implemented in many gravity solvers used in astrophysics. The computational costs of such an implementation roughly scale as p^α/θ³_crit, because the number of interactions increases as θ⁻³_crit for large N, while the cost of a single interaction grows like p^α (with α the effective exponent measured in Figure 1). Together with the simple error estimate (8), this means that if one aims for each FMM interaction to satisfy δa/a < ε, the minimum cost for fixed ε occurs for θ_crit = e^{−α/3}.
Thus, the optimal opening angle is independent of p and of ε. The accuracy is then controlled by the expansion order, requiring large p for small ε (according to Figure 4). The computational costs rise roughly like |ln ε|^α with decreasing ε.
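The optimal opening angle follows from a short minimisation; the sketch below combines the cost model stated above (interaction count ∝ θ_crit^{-3}, per-interaction cost ∝ p^α) with the simple estimate δa/a ∼ θ^p:

```latex
C(p,\theta)\propto p^{\alpha}\,\theta^{-3},\qquad
\epsilon\sim\theta^{p}\;\Rightarrow\;\theta=\epsilon^{1/p},\qquad
C(p)\propto p^{\alpha}\,\epsilon^{-3/p};
\frac{\mathrm{d}\ln C}{\mathrm{d}p}
   =\frac{\alpha}{p}+\frac{3\ln\epsilon}{p^{2}}=0
\;\Rightarrow\;
p=-\frac{3\ln\epsilon}{\alpha},\qquad
\theta_{\mathrm{crit}}=\epsilon^{1/p}=\mathrm{e}^{-\alpha/3},\qquad
C\propto|\ln\epsilon|^{\alpha}.
```

This reproduces both the independence of the optimal opening angle from p and ε and the |ln ε|^α cost scaling quoted in the text.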
I applied the FMM method with z = z_ses, a fixed expansion order p, and a fixed opening angle θ_crit to N = 10^7 equal-mass particles drawn from a Plummer sphere. Figure 7 plots the resulting distributions of absolute (top), relative (middle), and scaled (bottom) acceleration errors. All three distributions are mono-modal but very wide, much wider than those obtained from GPU-based direct summation (Figure ). In particular, there are extended tails towards very large relative or scaled errors, containing only a small percentage of the particles but reaching many times the median error. These tails are due to particles at large radii and, for the relative errors only, also at small radii (see the discussion in Section ).
There are two main effects responsible for these properties of the error distributions. First, the errors from a single FMM interaction follow a broad distribution: the maximum errors reported in Section  occur only for particles near the edges and corners of the sink cell, while most particles have smaller errors. Moreover, the force errors due to FMM interactions of the same sink cell with source cells in opposing directions tend to partially cancel rather than add up. Both effects explain why the median errors reported in Figure 7 are much smaller than the maximum relative error incurred by a single cell → cell interaction (cf. Figure ). More important is a second effect: the final force errors are not the sum of the relative errors of the individual FMM interactions, which are controlled by the simple multipole-acceptance criterion, but of their absolute errors δa. Since, according to equation (), δa ∼ θ^{p+1} M_A/r^2 ∼ θ^{p+3} M_A/ρ_{z,A}^2, the FMM interactions with cells of large surface density M/ρ_z^2 dominate the error budget. In fact, the particles at very large radii have δa/a ≈ δa/f at exactly the level expected from a few FMM interactions with near-maximal errors.

Towards better multipole-acceptance criteria
This discussion suggests that multipole-acceptance criteria which balance the absolute force errors of individual FMM interactions are preferable. When working with the simple estimates () or the error bound (), this leads to critical opening angles which depend on the properties of the interacting cells, such as their mass or surface density.
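For concreteness, such a criterion might look as follows. This is a hypothetical sketch using the simple estimate δa ∼ θ^{p+1} M_A/r^2 quoted above, not the multipole-power based estimator actually used in this study:

```python
def accept(M_A, rho_A, dist, rho_B, p, eps, a_min):
    """Error-balancing multipole-acceptance test (illustrative).

    Accept the cell-cell interaction A -> B if the estimated
    *absolute* acceleration error stays below eps times the smallest
    sink acceleration a_min; otherwise the pair must be split.
    """
    theta = (rho_A + rho_B) / dist
    if theta >= 1.0:                       # overlapping cells: split
        return False
    delta_a = theta**(p + 1) * M_A / dist**2
    return delta_a < eps * a_min
```

Note how the same geometric configuration is accepted or rejected depending on the sink's acceleration scale, which is precisely what a purely geometric opening angle cannot do.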
Such an approach can indeed be made to work (Dehnen ), but the aim here is to go beyond that and use the improved error estimates (). This results in the multipole-acceptance criteria (16a) and (16b), with the aim to obtain δa/a ≤ ε and δa/f ≤ ε, respectively. The black histograms in Figure 8 show the error distributions resulting from these criteria, when the values for a_b and f_b used in equation (16) have been taken from the direct-summation comparison run. The distributions for δa/a in the left and δa/f in both panels are remarkably narrow, with a median error ∼ ε as targeted, a steep truncation towards large errors, and a maximum error only slightly above the target. The tail of large δa/a in the right panel is due to particles at small radii, for which a ≪ f, such that criterion (16b) allows large δa/a.

Figure 8 Acceleration errors for improved FMM. Same as Figure 7, but for the multipole-acceptance criterion (16a) with ε = 2 × 10^-7 (left) or (16b) with ε = 10^-7 (right). The values for a and f are either taken from the direct-summation run (black), or obtained by low-order FMM (red, see Section 5.4.2). In all cases the computational effort is similar to that of the FMM run shown in Figure 7 (since a ≤ f, criterion (16a) gives a tighter constraint than (16b) and hence requires a larger ε for the same computational effort).
The difference between these error distributions and those shown in Figure 7, which result from the simple geometric multipole-acceptance criterion (), is remarkable. While the median errors are comparable, the criteria (16) do not produce extended tails of large errors in the quantity controlled (δa/a in the left and δa/f in the right panels of Figure 8), and the maximum errors are several orders of magnitude smaller. What is more, the tails towards small errors have also been somewhat reduced, indicating that the improved criterion avoids overly accurate individual FMM interactions.
This improvement has been achieved without increasing the overall computational effort, but by carefully considering the error contribution from each approximated interaction.

Practical multipole-acceptance criteria
In a real application one has, of course, no a priori knowledge of a_b or f_b for any particle and must instead use something else in the multipole-acceptance criteria (). In some situations, a suitable scale can be gleaned from the properties of the system modelled. For example, if simulating a star cluster of known mass profile M(r) and centre x_0, one may simply use a_b = GM(r_b)/r_b^2 with r_b = |x_b - x_0|.

Using accelerations from the previous time step
Employing the accelerations a_b from the previous time step in equation (16a) requires no extra computations. However, it means that the gravity solver is not self-contained, but requires some starter to obtain the initial accelerations.
Also, using information from the previous time step subtly introduces an artificial arrow of time into the simulation, because δa_new < ε a_old implies δa_new/a_new < ε a_old/a_new. Hence, a particle moving in a direction of increasing acceleration has, on average, smaller δa/a than when moving in the opposite direction, or in reversed time. However, the time integration methods currently employed almost exclusively in N-body simulations of collisional stellar dynamics are irreversible and introduce their own arrow of time. This suggests that the additional breach of time symmetry by the magnitude (not the direction) of the force error may not be a serious problem in practice.

Estimating a_b or f_b using low-order FMM
As Section  has shown, the error estimate Ẽ_{A→B} used in the multipole-acceptance criteria () still has significant uncertainty, and using highly accurate values for a_b or f_b in equation () is unnecessary. Instead, rough estimates should suffice. Such estimates can be obtained via a low-order FMM. This amounts to running the FMM twice: once with a simple multipole-acceptance criterion to obtain rough estimates for a_b or f_b, and then again using the sophisticated criteria () employing the results of the first run.
The acceleration scale f (defined in equation ()) is similar to the gravitational potential (), except that its Greens function is |r|^{-2}. This implies that it too can be estimated using FMM, albeit not using an explicitly harmonic formulation.
I implemented both options, estimating a or f via FMM, using the lowest possible order (p = 0 for f and p = 1 for gravity - recall that a = ∇Ψ is approximated at one order lower than the potential Ψ) and a simple opening-angle multipole-acceptance criterion. To this end, I use s = z = z_com and a mutual version of the dual tree walk. The resulting estimates for f or a = |a| have rms relative errors of only a few per cent. The additional computational effort is still much smaller than that of the high-accuracy approximation of gravity itself; estimating f is faster still, because it is a scalar rather than a vector and because no square root needs to be calculated.
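To make the two quantities concrete, here is a direct-summation reference (an illustrative sketch with G = 1 and my own function names) computing a_b = |Σ_a m_a r/|r|^3| and the scale f_b = Σ_a m_a/|r|^2; the latter bounds the former, since f sums magnitudes rather than vectors:

```python
import math

def accel_and_scale(pos, mass, b):
    """Direct sums for sink b: acceleration magnitude a_b and the
    scale f_b built from the |r|**-2 Greens function (G = 1)."""
    ax = ay = az = f = 0.0
    xb, yb, zb = pos[b]
    for i, (x, y, z) in enumerate(pos):
        if i == b:
            continue
        dx, dy, dz = x - xb, y - yb, z - zb
        r2 = dx * dx + dy * dy + dz * dz
        f += mass[i] / r2                  # scalar sum: no cancellation
        w = mass[i] / (r2 * math.sqrt(r2))
        ax += w * dx; ay += w * dy; az += w * dz
    return math.sqrt(ax * ax + ay * ay + az * az), f

# midpoint between two equal masses: the forces cancel, the scale does not
a0, f0 = accel_and_scale([(0, 0, 0), (1, 0, 0), (-1, 0, 0)],
                         [1.0, 1.0, 1.0], 0)
```

The example also illustrates why a ≤ f can be a large margin at special locations (e.g. near a cluster centre), which is the origin of the δa/a tail discussed above.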
The distributions of acceleration errors resulting from using these estimates in equation (16) are shown in red in Figure 8. They are only very slightly worse than those shown in black, which have been obtained using the exact values of a_b and f_b in equation (16).

Optimising adaptive FMM
The previous section provided answers to the first two questions asked at its beginning, but not to that regarding the optimal expansion order p. To answer this question I now report on some experiments, which also provide the actual computational costs for a given required force accuracy.
All experiments are run on a single 16-core compute node with Intel Xeon E5 CPUs, which support the AVX instruction set (see below), using code generated by the gcc compiler.

Implementation details
The FMM relations of Section  and Appendix A (using the rotation-accelerated ML kernel of Appendix A. when faster) have been implemented in computer code. The code employs a one-sided version of the dual tree walk, which considers the interactions A → B and B → A independently. The code is written in the C++ programming language and has been tested using various compilers and hardware. The implementation employs vectorisation and shared-memory parallelism as outlined below.

.. Vectorisation
Most current CPUs support vector sizes of 16 (SSE), 32 (AVX), or 64 (MIC) bytes, allowing K = 2, 4, or 8 identical simultaneous double-precision floating-point operations (or twice as many in single precision). Because the FMM kernels do not (usually) relate adjacent elements, their efficient vectorisation is not straightforward (and well beyond compiler optimisation). I explicitly implement a method computing K ML kernels simultaneously. To this end, the multipole moments of the K source cells are loaded into a properly aligned buffer (similar to transposing a matrix) beforehand, and afterwards the K field tensors are added from their vector buffer to the sink cells' field tensors. Unfortunately, this loading and storing (which cannot be vectorised) reduces the speed-up obtained by the simultaneous kernel computations.
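The buffering step amounts to a transposition from cell-major to coefficient-major storage; schematically (a pure-Python stand-in for the aligned SIMD buffer, with assumed names):

```python
def lanes(moment_sets):
    """Regroup the multipole coefficients of K source cells so that
    the i-th coefficient of all K cells sits in one contiguous 'lane',
    ready for a K-wide vector operation per lane."""
    return [tuple(cell[i] for cell in moment_sets)
            for i in range(len(moment_sets[0]))]
```

Each resulting lane can then be processed by a single vector instruction; the transposition itself is the scalar overhead mentioned above.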
Conversely, direct summation is perfectly suitable for vectorisation and a speed-up of a factor K is achievable. The code prefers direct summation whenever this is deemed to be faster, based on a threshold for the number of particle-particle interactions 'caught' in a given cell-cell interaction.
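The switch can be modelled by a simple cost comparison; the constants below are illustrative assumptions (in the actual code the threshold is an empirically tuned parameter):

```python
def prefer_direct(n_src, n_snk, p, c_pp=1.0, c_m2l=50.0):
    """Return True if directly summing the n_src*n_snk particle pairs
    'caught' by a cell-cell interaction is estimated to be cheaper
    than one multipole interaction of order p (toy cost constants:
    c_pp per vectorised pair, c_m2l*p**2 per multipole kernel)."""
    return c_pp * n_src * n_snk < c_m2l * p * p
```

For small cells the pair count is tiny and the perfectly vectorisable direct sum wins; for large cells the multipole interaction is cheaper despite its p-dependent cost.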

.. Multi-threading
All parts of the implementation use multi-threading and benefit from multi-core architectures. This is done via hierarchical task-based parallelism implemented via threading building blocks (tbb, Reinders ), an open source task parallel library with a work-stealing scheduler. The algorithms for multi-threaded tree building and dual tree walk are quite similar to those described by Taura et al. () and I refrain from giving details here.
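Stripped of the task-parallel machinery, the dual tree walk itself has the familiar recursive shape below (a sequential sketch with assumed names; the actual code spawns the recursive calls as tbb tasks):

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    size: float
    children: list = field(default_factory=list)

    @property
    def is_leaf(self):
        return not self.children

def dual_tree_walk(A, B, mac, interact):
    """Process the interaction A -> B: approximate it if the
    multipole-acceptance criterion mac(A, B) holds, fall back to
    direct summation for leaf pairs, otherwise split the larger cell."""
    if mac(A, B):
        interact(A, B)
    elif A.is_leaf and B.is_leaf:
        interact(A, B, direct=True)
    elif B.is_leaf or (not A.is_leaf and A.size >= B.size):
        for a in A.children:
            dual_tree_walk(a, B, mac, interact)
    else:
        for b in B.children:
            dual_tree_walk(A, b, mac, interact)
```

In a task-based version, the loop bodies become independent tasks; the work-stealing scheduler then balances them across cores without explicit partitioning.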

.. Precision and expansion order
This study reports only on one particular implementation aimed at high accuracy. It uses double-precision (64-bit) floating-point arithmetic throughout, z = z_ses, and expansion orders up to a fixed maximum p.

Wall-clock time versus accuracy
I applied my implementation with criteria (16a) and (16b) to N = 10^7 particles drawn from a Plummer sphere, using low-order estimates for a_b and f_b in equation (16). I varied the expansion order p and the accuracy parameter ε, and for each run plot in Figure 9 the total wall-clock time against the rms and the 99.9th-percentile acceleration errors.

Figure 9 Error-cost relation. Wall-clock time versus relative (top) and scaled (bottom) acceleration error for N = 10^7 particles drawn from a Plummer sphere. The top panel reports runs using multipole-acceptance criterion (16a) with a low-order-FMM estimate for a_b, while the bottom panel reports runs using multipole-acceptance criterion (16b) with a low-order-FMM estimate for f_b. Each pair of open and filled symbols (of same colour and ordinate) refers to another FMM run with expansion order p as indicated and a different value for the parameter ε in equation (16). The timings include all phases of the computation, including tree building and the low-order estimation of a_b or f_b; for comparison, the direct-summation calculation for obtaining the 'true' accelerations took 25k seconds on the same hardware. The thin dotted and dashed lines are power laws with exponents -0.16 and -0.2, respectively.
The rms error is always about ten times smaller than the 99.9th percentile, implying the absence of extended large-error tails. For any fixed expansion order p, the relation between time and error can be approximated by a constant plus a power law that becomes flatter for larger p. At any given error, there is an optimal expansion order p in the sense of providing the fastest approximation. When using this optimal expansion order, the fastest FMM computation for a given error scales very nearly like a power law with exponent between -0.16 and -0.2 (the dotted and dashed lines in Figure 9). Thus, when reducing the error by a factor of ten, the computational costs rise only by a factor of ∼1.5.
Constraining the relative error (top panel of Figure 9) is slightly more costly than constraining the scaled error (bottom panel). This is largely because f > a, as discussed in the caption to Figure 8, but also because estimating f is easier and faster than estimating a. Of course, the estimation of a can easily be avoided in practice by using the accelerations from the previous time step.

Figure 10 Acceleration error versus the accuracy parameter ε, for the same runs as in Figure 9 (using the same colour coding). Full symbols indicate that the expansion order is optimal, i.e. obtained minimal wall-clock time for the given error measure (lowest line in Figure 9).

Accuracy versus parameter ε
In any practical application there is, of course, no possibility to check the actual error, so it is important to test how well it is reflected by the parameter ε. As can be seen from Figure 10, the rms value of the respective error (δa/a if using criterion (16a) and δa/f if using criterion (16b)) is typically slightly less than ε for the optimal expansion order p. At intermediate values of ε the error is actually a factor of a few smaller. The 99.9th percentile of the errors is typically a factor of ten larger.

Complexity: scaling with the number N of particles
The overall cost of my high-accuracy FMM implementation is dominated by the computation of all node-node interactions during the dual tree walk. All other phases (establishing the hierarchical tree structure; computing z, s, and M^m_n for each cell; passing down F^m_n and evaluating gravity at each sink position) contribute much less (see Table 2). When using a simple geometric multipole-acceptance criterion, such as equation (), the FMM is well known to have complexity O(N) (e.g. Cheng et al. ). This is because distant interactions contribute less than O(N), so that the overall costs are dominated by the local interactions only (Dehnen ).
I am not aware of theoretical estimates of the complexity for the case of more sophisticated multipole-acceptance criteria, but Dehnen () reports an empirical, slightly sublinear scaling with N for his approach of a mass-dependent opening angle. Table 2 and Figure 11 present the timings obtained with my implementation using p = 10, ε = 10^{-6.25}, and low-order FMM estimates of f_b in equation (16b). With these settings, the acceleration errors are comparable to those generated via the sapporo library on a GPU (the current state-of-the-art force solver for collisional N-body simulations), as reported in Section .

Table 2 note: The runs used p = 10 and ε = 10^{-6.25}. The timings are given in seconds and refer to, respectively, the tree building; the estimation of f via low-order FMM; the passing up of z, s, min{f}, and M^m_n; the dual tree walk; and the passing down of F^m_n and the evaluation of gravity. See also Figure 11.

Figure 11 Scaling of computational costs with N. Wall-clock time for the computation of the mutual gravitational forces between N particles drawn from a Plummer sphere. The FMM (full squares) is parameterised (see Table 2) to yield acceleration errors very similar to those of direct summation on GPUs using the sapporo library (open triangles, using a NVIDIA K20M GPU accelerator). The direct summation on 16 CPU cores (open squares) uses double precision and besides the accelerations also computes the gravitational potential and the scale f (equation (4)).
From Table , it can be seen that the costs for tree building grow faster than linearly with N (N ln N is expected), those for the upward and downward passes roughly linearly with N (as expected), but those for the FMM estimation of f and the dual tree walk less than linearly. As a result, the total computational costs are very well fit by the power law N . for N >   , see Figure . Figure  also shows the timings for a (double-precision) direct-summation on the same hardware (yielding much more accurate accelerations) and for a mixed-precision direct-summation on a GPU using the sapporo library (yielding comparably accurate accelerations). At large (but realistic) N FMM out-performs direct summation, even if accelerated using a GPU.

Scaling with the number of CPUs
Figure  plots the strong scaling factor t  /nt n for my multi-threaded implementation. The scaling drops to % for  cores, which is not untypical for multi-threaded programs. This drop is presumably caused by imbalances at synchronisation points, of which the implementation has many. Most of these are not algorithmically required, but allow for a much easier implementation. Clearly, any massively parallel implementation needs to address this issue to retain good scaling for large numbers of processors.

Beyond simple gravity approximation
So far, I have considered the approximate computation of the unsoftened gravitational potential and acceleration at all particle positions with equal relative (or scaled) accuracy. However, the fast multipole method can be easily modified or extended beyond that.
For example, one may want to have individual accuracy parameters ε_b instead of a global ε. This is easily accommodated by replacing min_{b∈B}{a_b} in criterion (16a) with min_{b∈B}{ε_b a_b} and analogously for criterion (16b).
When using individual b , but also in general, it may be beneficial to adapt the expansion order p to the accuracy actually required for a given cell → cell interaction. This could be implemented by using the lowest p ≤ p max for which the multipole-acceptance criterion is satisfied.
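Under the same toy error model as before (an illustration, not the estimator of this study), adaptive order selection is a simple loop over p:

```python
def adaptive_order(M_A, rho_A, dist, rho_B, eps, a_min, p_max=20):
    """Pick the lowest expansion order whose estimated absolute error
    theta**(p+1) * M_A / dist**2 meets the tolerance eps*a_min.
    Returns None if even p_max fails, signalling that the cell pair
    must be split instead."""
    theta = (rho_A + rho_B) / dist
    if theta >= 1.0:                      # overlapping cells: split
        return None
    for p in range(1, p_max + 1):
        if theta**(p + 1) * M_A / dist**2 < eps * a_min:
            return p
    return None
```

Well-separated pairs are thus served with cheap low orders, while borderline pairs automatically receive higher orders or are split.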

Force computation for a subset of particles
Most N -body codes employ adaptive individual time steps for each particle. The standard technique is Makino's () block-step scheme, where the forces of all active particles are computed synchronously. Active are those particles with time step smaller than some threshold (which varies from one force computation to the next).
When using FMM in such a situation, only interactions with sink cells that contain at least one active particle need be considered. If the fraction of active particles in such cells is small (but non-zero), FMM becomes much less efficient per force computation. Fortunately, however, active particles are typically spatially correlated (because the time steps of adjacent particles are similar), such that the fraction of active particles is either zero or large.
I performed some practical tests, where only particles within some distance from the origin of the system were considered active. Figure 13 plots the wall-clock time versus the number N_a of active particles for N = 10^7. As expected, the costs for the preparation phase (tree building and upward pass) are largely independent of N_a (the slight increase of the red curve at large N_a arises because s and ρ_s are computed as part of the upward pass, but only for cells with active particles).

Figure 13 Computational costs for a subset of sinks. Wall-clock time for the computation of gravity for the innermost N_a of N = 10^7 particles (of a Plummer sphere), using the same parameters (p and ε) as in Figure 11.
The costs for the interaction and downward passes, on the other hand, decrease roughly like a power law in N_a at large N_a. The net effect is that for small fractions N_a/N, the costs are almost completely dominated by the preparation phase, and hence independent of N_a. The precise point of this transition depends on N and the FMM parameters. For smaller N and/or more accurate forces, the relative contribution of the tree-walk phase increases and the transition occurs at smaller N_a.
There is certainly some room for improvement by, e.g., using a smaller expansion order p than is optimal for N_a = N and/or re-cycling the tree structure from the previous time step. Both measures reduce the costs of the preparation phase and increase those of the interaction phase (at given ε), but should reduce the overall costs if N_a ≪ N.

Softened gravity or far-field force
Gravitational softening amounts to replacing the Newtonian Greens function ψ = |r|^{-1} by a softened version ψ = h^{-1} ϕ(|r|/h) with softening length h and softening kernel ϕ(q) → q^{-1} as q → ∞ (Dehnen ). This corresponds to replacing each source point by a smooth mass distribution. The softened Greens function is no longer harmonic, and harmonic FMM cannot be used. One obvious option is to use the more general Cartesian FMM of Appendix A.1 (Dehnen ). The computational costs of this approach grow faster with expansion order p, such that small approximation errors (requiring high p) become significantly more expensive. However, small approximation errors are hardly required in situations where gravitational softening is employed. Alternatively, if softening is restricted to a finite region, i.e. if the smoothed density vanishes for |r| ≥ h, harmonic FMM can still be used to compute gravity from all sources at distances |r| ≥ h, while direct summation can be used for neighbours, i.e. sources at |r| < h. This approach is sensible only if the number of neighbours is sufficiently bounded (so that the cost incurred by the direct summation remains small). This is the case, in particular, if the number of neighbours is kept (nearly) constant by adapting the individual softening lengths h_i in order to adapt the numerical resolution (Price and Monaghan ). In practice, this requires carrying with each cell the radius h_z > ρ_z of the smallest sphere centred on z which contains all softening spheres of its sources, and allowing an FMM interaction A → B only if |z_A - s_B| > h_{z,A} + ρ_{s,B}.
The same technique can be used to restrict the FMM approximation to the far field for each particle, i.e. the force generated by all sources outside of a sphere of known radius h_b around x_b.
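The quoted admission condition is a one-liner; the sketch below (my own helper name) simply encodes that the sink sphere must lie entirely outside every softening sphere of the source cell:

```python
import math

def far_field_ok(z_A, h_A, s_B, rho_B):
    """Allow a harmonic FMM interaction A -> B with softened sources
    only if the sink sphere (centre s_B, radius rho_B) lies outside
    all softening spheres of A, guaranteed by |z_A - s_B| > h_A + rho_B,
    where h_A is the radius of the smallest sphere around z_A that
    contains all softening spheres of A's sources."""
    return math.dist(z_A, s_B) > h_A + rho_B
```

Interactions failing this test are either split further or handed to the direct-summation neighbour treatment.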

Jerk, snap, crackle, and pop
The jerk is the total time derivative ȧ of the acceleration. The simplest way to estimate it using FMM is to not allow the expansion centres to have any velocity (ż = ṡ = 0), such that differentiating the FMM relations () w.r.t. time gives Ψ̇, and the jerk follows from it in the same way as the acceleration follows from Ψ. Since ż = ṡ = 0, the MM and LL kernels (equations (d) and (e)) also work for the time derivatives Ṁ^m_n and Ḟ^m_n of the multipoles and field tensors, respectively. The relations for the next order, the snap s = ä, can be derived by differentiating yet again.
With each additional order (jerk, snap, crackle, pop, ...), the computational cost of the combined ML kernels is no more than the corresponding multiple of the ordinary ML kernel (i.e. acceleration plus jerk are twice as costly as acceleration alone). This is a direct consequence of not allowing cell-centre velocities, which prevents the terms depending on z or s in equation () from carrying any time dependence. In contrast, the computational costs of the PM and LP kernels grow quadratically with the order of the time derivative. This is not really a problem, since those kernels are needed only once per particle, while the ML kernel is typically used far more often.
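Schematically, with K denoting the generic M2L kernel (a condensed restatement in simplified notation, not the paper's full index conventions): because the kernel depends only on the constant separation s − z, time differentiation passes straight through to the multipoles,

```latex
F^m_n(s) \;=\; \sum_{k,l} M^l_k(z)\, K^{m,l}_{n,k}(s-z)
\quad\xrightarrow{\ \dot z \,=\, \dot s \,=\, 0\ }\quad
\dot F^m_n(s) \;=\; \sum_{k,l} \dot M^l_k(z)\, K^{m,l}_{n,k}(s-z),
\qquad
\dot M^l_k(z) \;=\; \sum_{a} m_a\, \dot x_a \!\cdot\! \nabla\, \Upsilon^l_k(x_a - z).
```

Each further time derivative adds one more application of the same constant kernel, which is why the combined ML cost grows only linearly with the number of orders carried.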

The tidal field
FMM can also be used to approximate the Hessian T of the potential, which is given by the components of Ψ^m_2 (in particular, tr(T) = 0 as expected). Note, however, that the accuracy of this approximation is lower than that for the acceleration. T is of particular interest in collisionless N-body modelling, where a time step τ ∝ η |T|^{-1/2} with dimensionless parameter η has been suggested as a criterion for individual particle time steps (Dehnen and Read ). The matrix norm of T may be directly computed from the Ψ^m_2.
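In components (a condensed restatement; the norm built from the Ψ^m_2 coefficients is equivalent to the Frobenius norm shown here):

```latex
T_{ij} \;=\; \partial_i \partial_j \Psi, \qquad
\operatorname{tr}(T) \;=\; \nabla^2 \Psi \;=\; 0
\quad (\text{harmonic away from sources}), \qquad
\|T\|^2 \;=\; \sum_{i,j} T_{ij}^2 .
```

The vanishing trace is thus a built-in consistency check on the approximated tidal field.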

Discussion and conclusions
The fast multipole method (FMM) approximates the computation of the mutual forces between N particles. I have derived the relevant mathematical background, giving much simpler formulae than the existing literature, for the case of unsoftened gravity, when the harmonicity of the Greens function allows a significant reduction of the computational complexity. Like the tree code, my FMM implementation uses a hierarchical tree of spatial cells. Unlike the tree code, FMM uses cell → cell interactions, which account for all interactions between sources in the first cell and sinks in the second. Almost all distant particle → particle interactions are 'caught' by fewer than O(N) cell → cell interactions, such that local interactions, requiring O(N) computations, dominate the overall workload (Dehnen ). With the tree code, the situation is reversed: the distant interactions require O(N ln N) computations and dominate the overall work. This implies that FMM has the best complexity of all known force solvers. What is more, the predominance of local as opposed to distant interactions makes FMM ideally suited for applications on super-computers, where communications (required by distant interactions) are increasingly more costly than computations. However, FMM is inherently difficult to parallelise, and this study considered only a multi-threaded implementation with a task-parallel dual tree walk (the core of FMM).

Most previous implementations of FMM considered simple choices for the cells' multipole- and force-expansion centres and for the multipole-acceptance criterion which decides whether a given cell → cell interaction shall be processed via the multipole expansion or be split into daughter interactions. Traditionally, a simple opening-angle based multipole-acceptance criterion has been used, with cell centres equal to either the cell's geometric centre or its centre of mass.
These choices, which presumably were based on computational convenience and intuition, inevitably result in a wide distribution of individual relative force errors with extended tails reaching many times the median.
The main goal of this study was to avoid such extended tails of large force errors and to minimise the computational effort at a given force accuracy. The key to achieving this goal is a reasonably accurate estimate, based on the multipole power of the source cell and the size of the sink cell, of the actual force error incurred by individual cell → cell interactions. Based on the insight from this estimate, I set the cells' force-expansion centres to (an approximation of) the centre of the smallest sphere enclosing all their particles, for which the cell size and hence the error estimates are minimal. I also use the new estimates in the multipole-acceptance criterion, such that each cell → cell interaction is considered on the merit of the error it likely incurs. This results in very well-behaved distributions of the relative force errors, provided an initial estimate for the forces is at hand. This can either be taken from the previous time step or obtained via low-accuracy FMM.
After these improvements, the method has only two free parameters: the expansion order p and a tolerance parameter ε for the relative force error. Experiments showed that the actual rms relative force error is typically somewhat less than ε, while for any given ε there is an optimum p at which the computational costs are minimal. For ε = 10^{-6.25}, for example, p = 10 is optimal and the acceleration errors are comparable to those of direct summation on a GPU (the current state-of-the-art method for collisional N-body simulations). With these parameter settings, the computational costs scale like N^0.87 for large N and the method out-performs any direct-summation implementation for N ≳ 10^5. When computing the forces for only N_a < N of the N particles, the costs roughly follow a power law in N_a at large N_a/N, but become independent of N_a below some fraction (where the costs for tree building dominate). For large N, this is still significantly faster than direct summation.
An implementation of the FMM on a GPU accelerator should yield a further significant speed-up compared to my CPU-based implementation, though this is certainly a challenging task, given that FMM is algorithmically more complex than direct summation or a tree code (both of which have been successfully ported to the GPU). Presumably a somewhat lesser challenge is a massively parallel implementation of the method, which can be run on a supercomputer.
A practical application of FMM in an actual collisional N -body simulation would be very interesting. Since the force between close neighbours is always computed directly (in double precision) as explained earlier, close encounters can be treated essentially in the same fashion as with existing techniques. However, an unfortunate hindrance to an application of the presented techniques originates from the long marriage of existing collisional N -body techniques with direct summation. Methods, such as the Ahmad-Cohen neighbour scheme, to reduce the need for the costly far-field force summations are not necessary with FMM, and the existing N -body tools are not well suited for an immediate application of FMM.

Appendix A: Derivation of the FMM relations
Here, the FMM relations given in Section  are derived and motivated. Differently from the main text, the multipole and force expansion centres, z and s, are not explicitly distinguished, and instead z is used for either. The general case z ≠ s is a trivial generalisation.

A.1 Cartesian FMM
The distance vector x_b - x_a between two particles residing in two well-separated cells A and B, respectively, can be decomposed into three components (see also Figure ): the separation r between the expansion centres and the offsets r_a and r_b of each particle from its cell's centre. The Taylor series of the Greens function in these offsets converges (the remainder R_p → 0 as p → ∞) if |r_a + r_b| < |r|. Inserting this expansion into the expression for the (negative) potential due to all source points in cell A at any sink position x_b in cell B, one obtains after re-arranging the FMM relations (), with the derivatives D_n(r) ≡ ∇^n ψ(r). The FMM algorithm essentially works these equations backwards: in a first step, the multipoles M_m(z) are computed for each cell via (c), utilising those of the daughter cells via a shifting formula. Second, for each cell the field tensors F_n(z) of all its interactions are computed via (b) and added up. Finally, the field tensors are passed down the tree, utilising another shifting formula, and the potential (and its derivative, the acceleration) is evaluated via (a) at each sink position. Equations () are the basis of Cartesian FMM, such as implemented in Dehnen's (, ) falcON algorithm. At each order n = |n|, there are (n+1)(n+2)/2 coefficients F_n (as well as M_n and D_n), and the total number of coefficients up to order p is (p+3 choose 3). The computational effort of the resulting algorithm is dominated by the computation of the field tensors in (b), such that at large p a straightforward application of this method approaches an operation count of O(p^6). The computation (b) of the field tensors is essentially a convolution in index space and hence can be accelerated using a fast Fourier technique with costs O(p^3 ln p) (but see endnote b).
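The lowest-order instance of this machinery is the familiar monopole approximation; the toy check below (G = 1, my own helper names) verifies that expanding about the centre of mass kills the dipole, leaving a quadrupole-order error that falls off rapidly with distance:

```python
import math

def direct_potential(sources, masses, x):
    """Exact (negative) potential at x by direct summation, G = 1."""
    return sum(m / math.dist(s, x) for s, m in zip(sources, masses))

def monopole_potential(sources, masses, x):
    """p = 0 multipole approximation about the centre of mass:
    Psi ~ M/|x - z_com|.  Expanding about z_com removes the dipole,
    so the leading error is of quadrupole order."""
    M = sum(masses)
    zc = tuple(sum(m * s[i] for s, m in zip(sources, masses)) / M
               for i in range(3))
    return M / math.dist(zc, x)

src = [(0.3, 0.0, 0.0), (-0.3, 0.0, 0.0)]
mas = [1.0, 1.0]
far, near = (20.0, 0.0, 0.0), (5.0, 0.0, 0.0)
err_far = abs(monopole_potential(src, mas, far) - direct_potential(src, mas, far))
err_near = abs(monopole_potential(src, mas, near) - direct_potential(src, mas, near))
```

For this on-axis configuration the residual is the quadrupole term ≈ 2d²/r³ (d = 0.3), so quadrupling the distance reduces the error by a factor of 64.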

A.2 Harmonic tensors
For the important case ψ = |r|^{-1}, corresponding to gravitational and electrostatic forces, the above method can be improved by exploiting the fact that this Greens function is harmonic, i.e. ∇²ψ = 0 for |r| > 0. As a consequence, the D_n = ∇^n ψ are harmonic too and satisfy

∇²D_k = D_{k+(2,0,0)} + D_{k+(0,2,0)} + D_{k+(0,0,2)} = 0. ()

In other words: D_n is traceless. At given degree n = |k| + 2, equation () gives n(n-1)/2 constraints, such that of the (n+1)(n+2)/2 terms only 2n+1 are truly independent. In inner products, a traceless tensor only 'sees' the traceless part of its co-operand; I denote by the 'reduced' tensor Â_n the traceless part of A_n. Furthermore, the reduced tensor r̂^n is related to D_n via

D_n(r) = (-1)^n (2n-1)!! r̂^n / r^{2n+1}. ()
With these relations, the Taylor series of the harmonic Greens function () follows for |r| > |x|. While at each order n there are only 2n+1 truly independent terms, the expansion () still carries all (n+1)(n+2)/2 terms, amounting to a total of (p+3 choose 3) terms in an expansion up to order p. The equivalent spherical-harmonic expansion () carries only 2n+1 terms per order, amounting to a total of (p+1)², and is hence much preferable at large p.
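The bookkeeping above is easy to verify with a two-line check of the term counts quoted in the text:

```python
from math import comb

def cartesian_terms(p):
    """Cartesian tensor coefficients up to order p: (n+1)(n+2)/2 per
    order n, summing to the binomial coefficient C(p+3, 3)."""
    return sum((n + 1) * (n + 2) // 2 for n in range(p + 1))

def spherical_terms(p):
    """Spherical-harmonic coefficients: 2n+1 independent terms per
    order n, summing to (p+1)**2 -- far fewer at large p."""
    return sum(2 * n + 1 for n in range(p + 1))
```

At p = 10, for instance, the Cartesian expansion carries 286 coefficients against 121 spherical-harmonic ones, and the gap widens cubically versus quadratically in p.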
The number of terms actually used can be reduced to 2(n+1) per order, for example by omitting all terms with n_z > 1 and recovering their contribution via recursive application of

D_{k+(0,0,2)} = −D_{k+(2,0,0)} − D_{k+(0,2,0)}   ()

(Applequist ; Hinsen and Felderhof ). However, the resulting algebraic challenges are considerable; though the overall computational effort could well be reduced to O(p^4) operations (Joachim Stadel, private communication), I am not aware of a systematic demonstration.

http://www.comp-astrophys-cosmol.com/content/1/1/1

A.3 Spherical harmonics
The algebraic complications with obtaining an efficient Cartesian FMM stem from the fact that the Laplace operator involves three terms, such that the resulting recovery relation () has two terms instead of one on the right-hand side. This problem can be avoided by Taylor expanding in coordinates other than Cartesian, for which the Laplace operator involves only two instead of three terms. The simplest possibility is a linear combination of the Cartesian coordinates with complex coefficients. The standard FMM relations emerge from replacing x and y with

ξ = (x + iy)/2,   η = −(x − iy)/2,   ()

while keeping z. Then ∂_ξ = ∂_x − i∂_y and ∂_η = −∂_x − i∂_y, such that ∂²_x + ∂²_y = −∂_ξ ∂_η and hence for harmonic functions

∂_ξ ∂_η D = ∂²_z D,   or   D_{k+(1,1,0)} = D_{k+(0,0,2)},   ()

in place of equation (). With this relation one can eliminate all mixed ξ-η derivatives in favour of z-derivatives. This in turn allows a reduction in the number of indices from three to two, by using the total number n of derivatives and the number |m| of ξ-derivatives (for m < 0) or η-derivatives (for m > 0). Somewhat surprisingly, the relations required for FMM are hardly covered by the rich literature on spherical harmonics (and FMM). To derive the relevant formulae, I follow the ideas of Maxwell (, see also James ) and define the differential operator

D_n^m ≡ ∂_z^{n−|m|} ∂_η^m (for m ≥ 0),   D_n^m ≡ ∂_z^{n−|m|} ∂_ξ^{|m|} (for m < 0).   ()

When applied to harmonic functions, this operator satisfies

D_n^m D_k^l = D_{n+k}^{m+l},   ()

which can be shown via equation (). The functions

Θ_n^m(r) ≡ D_n^m ψ(r),   ()

being derivatives of the harmonic ψ = 1/|r|, are harmonic too. Moreover, the functions Θ_n^m(r) are homogeneous of degree −(n+1), i.e. Θ_n^m(αr) = α^{−(n+1)} Θ_n^m(r). I also define the solid spherical harmonic of degree n as

Υ_n^m(r) ≡ r^{2n+1} Θ_n^m(r).   ()

That Υ_n^m is harmonic follows from the fact that if f(r) is harmonic, then so is r^{−1} f(r/r²) (Hobson ). Note that Υ_n^m(r) is just a homogeneous polynomial of total degree n in x, y and z. These harmonics are related to the usual normalised surface spherical harmonics via

Υ_n^m(r) = [(n−m)!(n+m)!]^{−1/2} r^n Y_n^m(r̂).   ()
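Hobson's result can be checked numerically. This sketch (helper names are mine) applies the Kelvin transform r → r/r², scaled by 1/r, to the simple harmonic function f(r) = x and verifies via finite differences that the result is still harmonic:

```python
import math

def fd_laplacian(f, x, y, z, h=1e-4):
    """Second-order central-difference Laplacian."""
    return ((f(x+h, y, z) + f(x-h, y, z)
           + f(x, y+h, z) + f(x, y-h, z)
           + f(x, y, z+h) + f(x, y, z-h) - 6.0 * f(x, y, z)) / h**2)

def f(x, y, z):
    return x                      # a simple harmonic function

def kelvin_f(x, y, z):
    """r^-1 f(r / r^2), the Kelvin transform of f; here equal to x/r^3."""
    r2 = x*x + y*y + z*z
    return f(x / r2, y / r2, z / r2) / math.sqrt(r2)

residual = fd_laplacian(kelvin_f, 0.8, -0.5, 1.1)
print(abs(residual))  # small: the Kelvin transform of a harmonic function is harmonic
```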

A.6 Accelerating FMM relations
The FMM kernels ML, MM, and LL (equations (b), (d), (e)) all require O(p  ) operations. However, if the interactions or translations are along the z-axis, the costs are only O(p  ) because Υ m n (ẑ) = δ m /n!. One method to exploit this is to first translate along the z-axis and then perpendicular to the z-axis. For a vector r ⊥ perpendicular to the z-axis, Υ m n (r ⊥ ) vanishes whenever n + m is even. This implies that a translation along r ⊥ can be done faster than a general translation (in the limit of p → ∞, twice as fast).
This splitting method cannot be applied to the ML kernel (b) (because it is not a translation), which moreover occurs many more times in the FMM algorithm than the MM and LL kernels. To accelerate the ML kernel, one can instead exploit that a rotation also costs only O(p^3) operations. Thus, if one first rotates into a frame in which the interaction is along the z-axis, applies the ML kernel in the rotated frame, and finally rotates back into the original frame, the total cost is still O(p^3).

A.. Fast rotations
Since the spherical harmonics are homogeneous, a rotation (as opposed to a translation) does not mix between different orders n, and consequently the operation count is O(p^3). Thus, a general rotation is of the form

Y_n^m(r̄) = Σ_{l=−n}^{n} Γ_n^{ml} Y_n^l(r),   ()

where r̄ denotes the vector r in the rotated frame. Unfortunately, the matrices Γ_n, also known as Wigner functions, are generally dense and non-trivial functions of the Euler angles. However, a rotation by angle α around the z-axis is simple,

Y_n^m(r̄) = e^{−imα} Y_n^m(r),   ()

with an operation count of only O(p^2). With this one can build a general rotation by first rotating around the z-axis, then swapping z and x, rotating again about the z-axis (the x-axis of the original frame), swapping z and x again, and performing a final rotation around the z-axis. Like rotations, swapping coordinate axes does not mix between different orders n and can be represented as

Θ_n^m(r̄) = Σ_{l=−n}^{n} B_n^{ml} Θ_n^l(r),   ()

where now r̄ denotes the vector r in the frame obtained by swapping two Cartesian coordinates. The important difference between equations () and () is that the matrices B_n are constants. Recursive relations for these swap matrices can be derived via the operator algebra of Section A.3; for swapping x and z one finds a recursion in which it is understood that B_n^{ml} = 0 for |l| > n. A similar exercise for swapping y and z reveals that those swap matrices are given by i^{m−l} B_n^{ml}, while the corresponding swap matrices for Υ_n^m are given by the transpose (because these matrices are orthonormal and the product () is invariant under coordinate swaps). Whereas the matrices B_n are dense, the corresponding matrices for the real-valued harmonics (equations ()) are not (Pinchon and Hoggan ): at low degree, the swap matrices for the real-valued harmonics contain only few non-zero entries.
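The cheap z-rotation () is worth sketching explicitly; all names below are mine. Each coefficient C_n^m (0 ≤ n ≤ p, −n ≤ m ≤ n) merely picks up a phase, one multiplication per coefficient, hence O(p^2) in total. As a sanity check, two successive z-rotations must equal a single rotation by the summed angle:

```python
import cmath

def rotate_z(coeffs, alpha):
    """Apply Y_n^m -> e^{-i m alpha} Y_n^m to a table coeffs[n][m + n]."""
    return [[c * cmath.exp(-1j * m * alpha)
             for m, c in zip(range(-n, n + 1), row)]
            for n, row in enumerate(coeffs)]

p = 3
coeffs = [[complex(n + 1, m) for m in range(-n, n + 1)] for n in range(p + 1)]

twice = rotate_z(rotate_z(coeffs, 0.3), 0.4)   # two rotations ...
once  = rotate_z(coeffs, 0.7)                  # ... versus one by the summed angle
err = max(abs(a - b) for ra, rb in zip(twice, once) for a, b in zip(ra, rb))
print(err)
```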
Thus, this method of achieving a general rotation not only avoids the (recursive) computation of the Wigner functions Γ_n (which itself costs O(p^3) operations), but also benefits from the facts that the swap matrices B_n have considerably fewer non-zero entries than the Γ_n and are known a priori, such that they can be 'hard-wired' into computer code.

A.. A fast ML kernel
With these preliminaries, one can finally put together an accelerated O(p^3) version of the ML kernel (b). Let (x, y, z) = r; one first rotates the multipoles M_k^l around the z-axis by the angle α_z = arctan(y/x), swaps x and z, rotates by α_x = arctan(√(x² + y²)/z), and swaps x and z back. The resulting multipoles M̃_k^l refer to a frame whose z-axis is aligned with the interaction direction, such that the ML kernel can be performed via the z-aligned form of (b), yielding field tensors F̃_n^m. Finally, one must rotate the F̃_n^m back to the original frame by first swapping x and z, rotating by −α_x, swapping x and z again, followed by a final rotation by −α_z.
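The net effect of this sequence is to align the interaction vector with the z-axis, which can be checked on the vector itself, independently of any harmonics. In this sketch (names are mine) the middle swap-rotate-swap step is replaced by an explicit passive rotation about the y-axis, which has the same effect on the vector, and the quadrant-aware atan2 stands in for arctan(y/x):

```python
import math

def rot_z(v, a):
    """Components of v in a frame rotated by angle a about the z-axis."""
    x, y, z = v
    return (x*math.cos(a) + y*math.sin(a), -x*math.sin(a) + y*math.cos(a), z)

def rot_y(v, a):
    """Components of v in a frame rotated by angle a about the y-axis."""
    x, y, z = v
    return (x*math.cos(a) - z*math.sin(a), y, x*math.sin(a) + z*math.cos(a))

r = (3.0, -2.0, 6.0)                                 # hypothetical interaction vector
alpha_z = math.atan2(r[1], r[0])                     # arctan(y/x), quadrant-aware
alpha_x = math.atan2(math.hypot(r[0], r[1]), r[2])   # arctan(sqrt(x^2+y^2)/z)

aligned = rot_y(rot_z(r, alpha_z), alpha_x)
print(aligned)  # -> (0, 0, |r|) up to rounding; here |r| = 7
```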
These rotations and swaps can be accelerated further by exploiting that in () only the multipoles M̃_n^m with |m| ≤ min{n, p − n} are needed and, similarly, that F̃_n^m = 0 for |m| > min{n, p − n}. As Figure  demonstrates, the overhead incurred by the rotations pays off already at small p.