A fast multipole method for stellar dynamics
© Dehnen; licensee Springer 2014
Received: 30 January 2014
Accepted: 30 April 2014
Published: 11 September 2014
The approximate computation of all gravitational forces between N interacting particles via the fast multipole method (FMM) can be made as accurate as direct summation, but requires less than operations. FMM groups particles into spatially bounded cells and uses cell-cell interactions to approximate the force at any position within the sink cell by a Taylor expansion obtained from the multipole expansion of the source cell. By employing a novel estimate for the errors incurred in this process, I minimise the computational effort required for a given accuracy and obtain a well-behaved distribution of force errors. For relative force errors of ∼10−7, the computational costs exhibit an empirical scaling of . My implementation (running on a 16 core node) out-performs a GPU-based direct summation with comparable force errors for .
The computation of the mutual gravitational forces at every time step dominates the computational costs of all N-body simulations. When simulating collisionless stellar dynamics, the N-body model is merely a Monte-Carlo representation of a smooth phase-space distribution and the N-body force is only ever an estimate for the smooth force field of the continuous system modelled (see also Dehnen and Read ). In particular, the N-body force unavoidably carries an estimation error. This motivates the use of approximate methods for computing the N-body force, such as the Barnes and Hut () tree code, as long as the approximation errors are small compared to the estimation errors.
and its derivative, the acceleration, must be calculated with high accuracy. This is typically achieved by direct summation, when equation (1) is translated into computer code and the only errors are owed to finite computational precision.
This computation incurs a cost of for a single particle and thus per unit time for running a full simulation. As a consequence, realistic simulations with for globular clusters and galactic centres are still very challenging and large parameter studies impossible. Measures employed to ameliorate this situation include the usage of powerful special-purpose hardware devices (Makino and Taiji ) or graphical processing units (GPUs, Gaburov et al. ), as well as separating the highly fluctuating forces due to close neighbours, in order to reduce the frequency of expensive far-field force computations (Ahmad and Cohen ).
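For reference, direct summation can be sketched in a few lines. The sketch below assumes units with G = 1 and unsoftened Newtonian gravity; the function name and data layout are illustrative and not taken from any particular N-body code.

```python
import math

def direct_accelerations(masses, positions, G=1.0):
    """Accelerations of all particles by direct summation over all pairs.

    The inner loop visits every other particle, so the work is O(N) per
    particle and O(N^2) for all N particles. No softening is applied.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = [positions[j][k] - xi[k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * dx[k] * inv_r3
    return acc
```

Apart from round-off, the result is exact, which is why this double loop serves as the accuracy reference for any approximate method.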
While these measures substantially reduce the effective costs, the complexity of remains. The alternative of using approximate methods also for collisional stellar dynamics is so far untested. The requirements for such a method differ from those in collisionless N-body methods in two important aspects: (i) there is no gravitational softening and (ii) to preserve the validity of the N-body model, the approximation errors must be much smaller than what is common in collisionless N-body simulations.
A straightforward approach is to use the tree code with a small opening angle and/or high expansion order, resulting in a scheme with costs. A more efficient approach is to use the fast multipole method (FMM; Greengard and Rokhlin ; Cheng et al. ) which has costs of only . An initial attempt by Capuzzo-Dolcetta and Miocchi () to port this technique from its original realm of molecular dynamics to astrophysics failed to obtain better practical efficiency than the tree code. However, when adapting the FMM to the inhomogeneity of stellar systems and the low force accuracy required in collisionless dynamics (by using a hierarchical tree data structure and a flexible opening angle), it is substantially faster than the tree code (Dehnen , ).
The critical question here is whether FMM can be tuned to be more efficient than direct summation at force accuracies and particle numbers required by collisional N-body techniques. The goal of this study is to address this question by tuning FMM for the application to collisional N-body simulations, investigating the resulting dependence of computational costs and numerical accuracy on the various numerical parameters, and assessing its practical efficiency.
This paper is organised as follows. In Section 2 and Appendix A, the mathematical (and algorithmic) foundations of FMM are derived and laid down. Section 3 (and Appendix B) introduces and motivates my approach for quantifying the resulting acceleration errors; Section 4 provides useful estimates for the errors of individual FMM interactions; Section 5 deals with optimising the multipole-acceptance criterion; and in Section 6 the method is tuned to obtain a force accuracy target with minimal computational effort. Finally, in Section 7 possible extensions and applications are discussed, and Section 8 concludes.
2 FMM basics
The tree code approximates the sum (1) by first dividing source particles a into groups bounded by geometric cells, each of which is well-separated from the sink position , and then computing the forces of each source cell from its multipole moments. This corresponds to Taylor expanding the Green's function about the distance to an appropriate centre z of each source cell.
The essence of the fast multipole method is to Taylor expand the Green's function not only at the source positions , but also at the sink positions . The latter amounts to approximating (a contribution to) the gravitational field within each sink cell by its local Taylor expansion about some appropriate potential expansion centre s. Obviously, this approach is beneficial only if the forces for a large fraction of the sinks within a cell are to be computed simultaneously.
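The idea of expanding at both ends can be illustrated with a deliberately low-order toy, which is not the spherical-harmonic formulation used in this paper. The sketch assumes G = 1, represents the source cell by its monopole about the centre of mass z (so the dipole vanishes), and Taylor expands the resulting potential to first order about a sink centre s. All function names are hypothetical.

```python
import math

def direct_potential(x, sources):
    """Exact potential at x from a list of (mass, position) sources, G = 1."""
    return -sum(m / math.dist(x, y) for m, y in sources)

def monopole_local_expansion(sources, s):
    """Source side: monopole about the centre of mass z.
    Sink side: value and gradient of -M/|x - z| at the expansion centre s."""
    M = sum(m for m, _ in sources)
    z = tuple(sum(m * y[k] for m, y in sources) / M for k in range(3))
    r = tuple(s[k] - z[k] for k in range(3))
    rn = math.sqrt(sum(c * c for c in r))
    phi0 = -M / rn
    grad = tuple(M * c / rn ** 3 for c in r)  # gradient of -M/|x - z| at x = s
    return phi0, grad

def evaluate_local(phi0, grad, s, x):
    """First-order Taylor series about s, evaluated at sink position x."""
    return phi0 + sum(g * (xk - sk) for g, xk, sk in zip(grad, x, s))
```

One call to `monopole_local_expansion` serves every sink in the cell, which is where the saving comes from. The error of this toy shrinks rapidly as the separation between the two cells grows relative to their sizes.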
2.1 Mathematical background
The FMM relations are most easily derived using Cartesian coordinates. However, for Newtonian gravity, , the resulting relations are inefficient. Instead, exploiting that this Green's function satisfies for naturally leads to spherical harmonics. Cheng et al. () have already given (without derivation) the corresponding FMM relations, but in a form ill-suited for computer code. In Appendix A, I derive equivalent but much more compact and computationally convenient relations. These are summarised here.
Here, p is the expansion order and the error of the approximated potential. This expansion converges with increasing p if with .
2.2 Algorithmic approach
2.2.1 The tree code: walking the tree
Table 1: The FMM kernels
P2P: particle to particle
P2M: particle to multipole
M2M: multipole to multipole
M2P: multipole to particle
M2L: multipole to local expansion
P2L: particle to local expansion (see table legend)
L2L: local expansion to local expansion
L2P: local expansion to particle
2.2.2 FMM: the dual tree walk
An adaptive FMM algorithm also uses a hierarchical tree data structure. As with the tree code, the cell multipoles have to be precomputed for every cell in a first step.
Next, the forces for all sink positions and generated by all source particles are approximated using a single dual tree walk (Dehnen ). This algorithm considers cell → cell interactions and starts with the root → root interaction. If the interacting cells are well separated, the interaction is approximated using the M2L kernel (equation (3b)), which computes and accumulates the local field tensors for the expansion of gravity within the sink cell B and due to all sources within the source cell A (in a mutual version of the algorithm, the interactions and are considered simultaneously). Otherwise, the interaction is split, typically into those between the daughters of the larger of the two interacting cells with the smaller.
Finally, the local field tensors are passed down the tree using the L2L kernel, and the local expansions are evaluated at the sink positions using the L2P kernel. Thus, the FMM replaces the M2P kernel of the tree code with the M2L, L2L and L2P kernels, see also Figure 2.
Of course, in both tree code and FMM, direct summation (P2P kernel) is used whenever computationally preferable, i.e. for interactions involving only a few sources and sinks.
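The control flow of the dual tree walk can be sketched as follows. This is a toy under several assumptions: a one-dimensional binary tree, geometric cell centres as expansion centres, and a simple geometric acceptance test (sum of cell radii smaller than theta times the centre distance). The walk only records which cell pairs would be handled by M2L (`far`) and which by P2P (`near`); no kernels are evaluated, and all names are hypothetical.

```python
class Cell:
    """Node of a 1D binary tree; `points` is a list of (index, x) sorted by x."""

    def __init__(self, points, leaf_size=4):
        self.points = points
        lo, hi = points[0][1], points[-1][1]
        self.centre = 0.5 * (lo + hi)  # geometric centre, used as expansion centre
        self.radius = 0.5 * (hi - lo)  # half-width of the interval holding all points
        self.children = []
        if len(points) > leaf_size:
            mid = len(points) // 2
            self.children = [Cell(points[:mid], leaf_size),
                             Cell(points[mid:], leaf_size)]

def dual_tree_walk(sink, source, theta, far, near):
    """One-sided walk: approximate source -> sink if well separated,
    use direct summation for leaf pairs, otherwise split the larger cell."""
    dist = abs(sink.centre - source.centre)
    if sink.radius + source.radius < theta * dist:
        far.append((sink, source))   # would be an M2L interaction
    elif not sink.children and not source.children:
        near.append((sink, source))  # would be a P2P interaction
    elif sink.children and (not source.children or sink.radius >= source.radius):
        for child in sink.children:
            dual_tree_walk(child, source, theta, far, near)
    else:
        for child in source.children:
            dual_tree_walk(sink, child, theta, far, near)
```

Starting from `dual_tree_walk(root, root, theta, far, near)`, every ordered (sink, source) particle pair ends up in exactly one recorded interaction, because each split partitions the pairs of the parent interaction. Self-pairs land in `near` and would be skipped by the P2P kernel.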
3 Quantifying the approximation accuracy
Before the method can be optimised for accuracy, a sensible quantitative measure for this accuracy is needed as well as an acceptable value for this measure.
With direct summation, the accuracy is limited only by the finite precision of computer arithmetic (round-off error). If double (64-bit) precision is not used throughout, it is customary to use the conservation of the total energy for quality control (e.g. Gaburov et al. ). However, as shown in Appendix B, the relative energy error is much smaller than the typical relative force error, simply because it is an average over many force errors. Even worse, the computation of the total energy, required for measuring its error, typically incurs a larger error. Thus, any measured non-conservation of the total energy is dominated by measurement error rather than true non-conservation due to acceleration errors.
With the tree code and FMM, the situation is subtly different, as discussed in Appendix B.3. Here, the measured non-conservation of energy actually reflects the amplitude of the acceleration errors in an average sense. However, an average measure for the effect of approximation errors cannot reflect their effect on the correctness of the simulation. For example, a single large force error has hardly any effect on the energy conservation but may seriously affect the validity of the simulation. While this latter goal is difficult to quantify, it is certainly better to consider the whole distribution of acceleration errors and pay particular attention to large-error outliers, than merely monitor an average.
3.1 Scaling acceleration errors
Obviously, the absolute errors are not very useful by themselves and must be normalised to be meaningful. One option is to divide δa by some mean field strength . While this makes sense for the average particle, it fails for those in the outskirts of the stellar system, where the field strength diminishes well below its mean.
of the absolute values of all pair-wise accelerations. In general , while in the outskirts of a stellar system such that the scaled error approaches the relative error as desired. Conversely, in the centre (for a Plummer sphere, for example, as in the continuum limit) and behaves sensibly if .
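The difference between the relative and the scaled error can be made concrete with a small sketch. It assumes G = 1 and takes the acceleration scale f of a particle to be the sum of the magnitudes of all pair-wise accelerations acting on it, as described above; the function names are hypothetical.

```python
import math

def acceleration_and_scale(masses, positions, G=1.0):
    """Return (acc, scale): the vector acceleration of each particle and the
    sum of the magnitudes G m_j / r_ij^2 of all its pair-wise accelerations."""
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    scale = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in dx)
            r = math.sqrt(r2)
            scale[i] += G * masses[j] / r2
            for k in range(3):
                acc[i][k] += G * masses[j] * dx[k] / (r2 * r)
    return acc, scale

def scaled_error(delta_a, f):
    """Magnitude of the acceleration error delta_a divided by the scale f."""
    return math.sqrt(sum(c * c for c in delta_a)) / f
```

For a particle midway between two equal masses the acceleration vanishes, so a relative error is undefined, whereas the scale f stays finite and the scaled error remains meaningful. By the triangle inequality the magnitude of the acceleration never exceeds f.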
3.2 The acceleration errors of direct summation
There is a significant increase in the error amplitude with particle number N: the errors for are on average larger than for . This worrying property suggests that the fidelity of simulations using sapporo diminishes with N, implying that using this library with is not advisable.
From this exercise I conclude that in practice relative (or scaled) acceleration errors with an rms value of a few times 10−7 and a maximum ∼10 times larger are accepted in N-body simulations of collisional stellar dynamics.
4 Assessing the approximation errors
In order to optimise any implementation of FMM for high accuracy and low computational costs, a good understanding of and accurate estimates for the errors incurred by each individual FMM interaction are required. To this end, I now perform some numerical experiments.
are (approximations for) the radii of the smallest spheres centred on z and s and containing all sources and sinks, respectively. In the experiments of this section for each cell, because and because all particles are source and sink simultaneously, but in general and may differ.
with obtained by direct summation.
4.1 Cell-cell interactions
4.1.1 Comparing with simple error estimates
which is plotted as thin curves in the top panel of Figure 4. Obviously, this upper bound is satisfied, but typically it is 10-100 times larger than the actual largest error.
Moreover, equation (10) predicts diverging errors for , while the actual errors are much better behaved. This is presumably because diverging errors only occur for rare sink positions combined with extreme source distributions (such as all particles concentrated near one point at the edge of the source sphere), which are not realised in these experiments.
to the actual errors.
4.1.2 Better error estimates
The simple error estimates (11) are still quite inaccurate: the maximal error is often much smaller (see also the dashed histograms in the right panels of Figure 4). The offsets of from the actual errors increase with p (see left panels of Figure 4). This effect vanishes if the same limit for is used for all p, suggesting that it is caused by smoother distributions for larger numbers of sources. Indeed, if I simply divide the estimates (11) by the scatter of the residuals is much reduced, but a systematic trend with p remains.
In the right panels of Figure 4, these new error estimates are compared with the simple estimates (11) of the last subsection by displaying the distributions of the ratio of the actual maximum error to these estimates. The main difference between the two sets of estimators is their accuracy: there is much less scatter for the new (solid histograms) than for the old estimators (dashed). Consequently, there are hardly any interactions for which the force error is overestimated by more than a factor ten, while the simple estimators (11) overestimated the force error by more than that for many interactions, in particular at large p. Another remarkable property of the new error estimator is its consistency with respect to expansion order: there is no systematic drift with expansion order.
The number of underestimated force errors (abscissa >1 in the right panels of Figure 4) is small but there is a clear tail of underestimated absolute errors (top panel). As this is not present for the relative errors, it must be caused by the deviation of the acceleration from the mean . Indeed, the maximum error is expected to occur on the side of the sink towards the source, where the acceleration is larger, about . When accounting for this by simply replacing r in (14) with , the tail of underestimated force errors is diminished, but the overall distribution widens and a tail of overestimated errors appears.
4.2 Particle-cell interactions
Only occasionally does the dual tree walk algorithm encounter particle-cell interactions. Most of them will be computed using direct summation, leaving only the few with populous cells for the FMM approximation.
For particle → cell and cell → particle interactions the FMM approximation uses the P2L and M2P kernels, respectively. Because these kernels correspond to the M2L kernel in the limits of and , respectively, all the algebra developed in the previous sub-section still applies.
4.2.1 Cell → particle interactions
4.2.2 Particle → cell interactions
5 Optimising the multipole-acceptance criterion
With the improved error estimates in hand, the practical implementation of FMM for high accuracy can finally be considered. The main questions arising in this context are:
What to pick for the expansion centres z and s?
When to consider two cells well-separated?
What expansion order p to use?
The possible answers to these questions affect both the computational cost and the approximation accuracy. Hence, for a given accuracy target, there exists an optimal choice for all these parameters, in the sense of minimal CPU time (and memory) consumption. This section deals with the algorithmic aspects of this problem, i.e. the choice for z and s and the functional form of the multipole-acceptance criterion. The tuning of the parameters (of the multipole-acceptance criterion as well as the expansion order) with the aim of minimal computational effort for a given accuracy is the subject of the next section.
Astonishingly, this issue of optimal choice for z and s and the multipole-acceptance criterion has not been much investigated. Instead, implementations of multipole methods often employ either of two simple strategies. The tree code generally uses a fixed order p and an expansion centred on the cells’ centres of mass, while two cells are considered well-separated if the simple geometric multipole-acceptance criterion (5) is satisfied, such that controls the accuracy.
With traditional FMM, on the other hand, the expansion centres z and s are both taken to be the geometric cell centres and two cells are deemed well-separated as soon as the expansion converges, corresponding to . When using hierarchical cubic grids (instead of an adaptive tree), this is implemented by interacting only between non-neighbouring cells on the same grid level whose parent cells are neighbours (e.g. Cheng et al. ). The accuracy is then only controlled by the expansion order p.
5.1 Choice of expansion centres z and s
As far as I am aware, all existing FMM implementations use the same position for the multipole and potential expansion centres, i.e. , for each cell. For traditional FMM, these are equal to the geometric cell centres. This has the benefit of a finite number of possible interaction directions , in particular when , for which the coefficients could be pre-computed. However, the computation of these coefficients on the fly is often faster than a table look-up. Moreover, in view of Figure 4 appears ill-suited for high accuracy.
In fact, the restriction reduces the freedom and hence the potential for optimising the method. Nonetheless, when aiming for low accuracy, choosing , the cells’ centres of mass, has some advantages. First, the dipoles vanish and the low-order multipoles tend to be near-minimal. Second, if using a mutual version of the algorithm (when the interactions and are done simultaneously), the computational costs are reduced and the approximated forces satisfy Newton’s third law exactly, i.e. (Dehnen ).
However, in practice there is no benefit from such an exact obedience of Newton’s law, as the total momentum is not exactly conserved, because of integration errors arising from the fact that the particles have individual time steps. Moreover, the degree of deviation from exact momentum conservation in such a case does not reflect the true accumulated force errors. In a more general method, the approximated forces will deviate from the ideal by an amount comparable to their actual force errors and the non-conservation of total momentum is somewhat indicative of the accumulated effect of the force errors (see also Appendix B.3).
5.1.1 Choice of the potential expansion centre s
The results of Section 4, in particular the functional form of in equation (13), suggest choosing the potential expansion centres s such that the resulting sink radii , and hence the estimated interaction errors, are minimal. Thus, , the centre of the smallest enclosing sphere. Finding the smallest enclosing sphere for a set of n points has complexity . Doing this for every sink cell would incur a total cost of and be prohibitively expensive.
Instead, I use an accurate approximation by finding for each cell the smallest sphere enclosing the spheres of its grand-daughter cells. This incurs a total cost of and is implemented via the Computational Geometry Algorithms Library (http://www.cgal.org, Fischer et al. ), using an algorithm of Matoušek et al. ().
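To give a feel for what an enclosing-sphere computation involves, the sketch below uses Ritter's heuristic, a cheap single-pass approximation. This is explicitly not the CGAL algorithm used here: Ritter's sphere is guaranteed to enclose all points but is in general somewhat larger than the smallest enclosing sphere. The function name is hypothetical.

```python
import math

def ritter_bounding_sphere(points):
    """Approximate enclosing sphere of a list of 3D points (Ritter's heuristic).

    Returns (centre, radius). The sphere contains every point but is not
    necessarily the smallest such sphere.
    """
    p0 = points[0]
    p1 = max(points, key=lambda q: math.dist(p0, q))  # far from an arbitrary start
    p2 = max(points, key=lambda q: math.dist(p1, q))  # far from p1
    centre = [0.5 * (a + b) for a, b in zip(p1, p2)]
    radius = 0.5 * math.dist(p1, p2)
    for q in points:
        d = math.dist(centre, q)
        if d > radius:
            # grow the sphere just enough to contain both the old sphere and q
            new_radius = 0.5 * (radius + d)
            shift = (d - new_radius) / d
            centre = [c + shift * (x - c) for c, x in zip(centre, q)]
            radius = new_radius
    return centre, radius
```

Using the centre of such a sphere as s, rather than the centre of mass, directly targets a small sink radius, which is the quantity entering the error estimates.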
5.1.2 Choice of the multipole expansion centre z
As already mentioned above, setting has some virtue for low expansion orders p. However, for high expansion orders, the high-order multipoles become ever more important, suggesting that may be a better choice. In order to assess the relative merits of these methods, I repeated the experiments of Section 4 for both methods and compared the resulting maximum absolute and relative force errors incurred for the same cell → cell interactions (for which the two methods give different θ).
I found that the errors for the two methods are very similar with an rms deviation of ∼0.15 dex, but a very small mean deviation. At there is a trend of more accurate forces for , while at smaller errors are obtained with . This trend is simply a consequence of being smaller for than for at low k and larger at high k. This together with the improved error estimates (14) also explains that (for an interaction ) tends to give more accurate forces if , while tends to be more accurate if .
5.2 A simple FMM implementation
Thus, the optimal opening angle is independent of p. The accuracy is then controlled by the expansion order, requiring for (according to Figure 4). The computational costs rise roughly like with decreasing ϵ.
There are two main effects responsible for these properties of the error distributions. First, errors from a single FMM interaction follow a distribution with variance of 1-2 dex. The maximum errors reported in Section 4 only occur for particles near the edges and corners of the sink cell, while most have smaller errors. Moreover, the force errors due to FMM interactions of the same sink cell with source cells in opposing directions tend to partially cancel rather than add up. Both explain why the median errors reported in Figure 7 are much smaller than the maximum relative error incurred by a single cell → cell interaction, which according to Figure 4 is ∼10−4.
More important is a second effect: the final force errors are not the sum of the relative errors of individual FMM interactions, which are controlled by the simple multipole-acceptance criterion, but of their absolute errors δa. Since, according to equation (11), , the FMM interactions with cells of large surface density dominate the error budget. In fact, the particles at very large radii have , exactly as expected from a few FMM interactions with near maximal errors.
5.3 Towards better multipole-acceptance criteria
This discussion suggests that multipole-acceptance criteria which balance the absolute force errors of individual FMM interactions are preferable. When working with the simple estimators (11) or the error bound (10), this leads to critical opening angles which depend on the properties of the interacting cells, such as their mass or surface density.
with the aim to obtain and , respectively.
The difference between these error distributions and those shown in Figure 7 and resulting from the simple geometric multipole-acceptance criterion (5) is remarkable. While the median errors are comparable, the criteria (16) do not produce extended tails of large errors of the quantity controlled ( in left and in the right panels of Figure 8), and the maximum errors are more than 2 orders of magnitude smaller. What is more, the tails towards small errors have also been somewhat reduced, indicating that the improved criterion avoids overly accurate individual FMM interactions.
This improvement has been achieved without increasing the overall computational effort, but by carefully considering the error contribution from each approximated interaction.
5.4 Practical multipole-acceptance criteria
In a real application one has, of course, no a priori knowledge of or for any particle and must instead use something else in the multipole-acceptance criteria (16). In some situations, a suitable scale can be gleaned from the properties of the system modelled. For example, if simulating a star cluster of known mass profile and centre , one may simply use with . I now consider other options.
5.4.1 Using accelerations from the previous time step
Employing the accelerations from the previous time step in equation (16a) requires no extra computations. However, it means that the gravity solver is not self-contained, but requires some starter to get the initial accelerations.
Also, using information from the previous time step subtly introduces an artificial arrow of time into the simulation, because implies . Hence, a particle moving in a direction of increasing acceleration has, on average, smaller than when moving in the opposite direction, or in reversed time. However, the time integration methods currently employed almost exclusively in N-body simulations of collisional stellar dynamics are irreversible and introduce their own arrow of time. This suggests that the additional breach of time symmetry by the magnitude (not the direction) of the force error may not be a serious problem in practice.
5.4.2 Estimating or using low-order FMM
As Section 4 has shown, the error estimate used in the multipole-acceptance criteria (16) still has significant uncertainty, and using highly accurate values for or in equation (16) is unnecessary. Instead, rough estimates should suffice. Such estimates can be obtained via a low-order FMM. This amounts to running the FMM twice: once with a simple multipole-acceptance criterion to obtain rough estimates for or , and then again using the sophisticated criteria (16) employing the results of the first run.
The acceleration scale f (defined in equation (4)) is similar to the gravitational potential (1), except that its Green's function is . This implies that it too can be estimated using FMM, albeit not using an explicitly harmonic formulation.
I implemented both options, estimating a or f via FMM, using the lowest possible order ( for f and for gravity - recall that is approximated at one order lower than the potential Ψ) and multipole-acceptance criterion . To this end, I use and a mutual version of the dual tree walk. The resulting estimates for f or have rms relative errors of ∼15%. The additional computational effort is still much smaller than that of the high-accuracy approximation of gravity itself, though estimating f is faster because it is a scalar rather than a vector and because no square-root needs to be calculated.
The distributions of acceleration errors resulting from using these estimates in equation (16) are shown in red in Figure 8. They are only very slightly worse than those in black, which have been obtained using the exact values of and in equation (16).
6 Optimising adaptive FMM
The previous section provided answers to the first two questions asked at its beginning, but not to the one after the optimal expansion order p. To answer this question I now report on some experiments, which also provide the actual computational costs for a given required force accuracy.
All experiments are run on a single compute node with 16 Intel Xeon E5-2670 CPU cores, which support the AVX instruction set (see below), and using code generated by the gcc compiler (version 4.8.2).
6.1 Implementation details
The FMM relations of Section 2 and Appendix A (using the rotation-accelerated M2L kernel of Appendix A.6 when faster) have been implemented in computer code. The code employs a one-sided version of the dual tree walk, which considers the interactions and independently. The code is written in the C++ programming language and has been tested using various compilers and hardware. The implementation employs vectorisation and shared-memory parallelism as outlined below.
Most current CPUs support vector sizes of 16 (SSE), 32 (AVX), or 64 (MIC) bytes, allowing identical simultaneous double-precision floating-point operations (or twice as many in single precision). Because the FMM kernels do not (usually) relate adjacent elements, their efficient vectorisation is not straightforward (and well beyond compiler optimisation). I explicitly implement a method computing K M2L kernels simultaneously. To this end, the multipole moments of the K source cells are loaded into a properly aligned buffer (similar to transposing a matrix) before, and afterwards the K field tensors are added from their vector-buffer to the sink cells’ field tensors. Unfortunately, this loading and storing (which cannot be vectorised) reduces the speed-up obtained by the simultaneous kernel computations.
Conversely, direct summation is perfectly suitable for vectorisation and a speed-up of a factor K is achievable. The code prefers direct summation whenever this is deemed to be faster, based on a threshold for the number of particle-particle interactions ‘caught’ in a given cell-cell interaction.
All parts of the implementation use multi-threading and benefit from multi-core architectures. This is done via hierarchical task-based parallelism implemented via threading building blocks (tbb, Reinders ), an open source task parallel library with a work-stealing scheduler. The algorithms for multi-threaded tree building and dual tree walk are quite similar to those described by Taura et al. () and I refrain from giving details here.
6.1.3 Precision and expansion order
This study reports only on one particular implementation aimed at high accuracy. It uses double precision (64 bits) floating-point arithmetic throughout, , and expansion orders .
6.2 Wall-clock time versus accuracy
The rms error is always ten times smaller than the 99.99 percentile, implying the absence of extended large-error tails. For any fixed expansion order p, the relation between time and error can be approximated by a constant plus a power law that becomes flatter for larger p. At any given error, there is an optimal expansion order p in the sense of providing the fastest approximation. When using this optimal expansion order, the fastest FMM computation for a given error scales very nearly like a power law with exponent . Thus when reducing the error by a factor ten, the computational costs rise only by a factor ∼1.5.
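The quoted cost factor per decade in accuracy translates into a power-law slope by a short calculation. Here t_min denotes the wall-clock time at the optimal expansion order, δ the error, and α the magnitude of the slope; these symbols are introduced for this illustration only.

```latex
t_{\min}(\delta) \propto \delta^{-\alpha}
\quad\Longrightarrow\quad
\frac{t_{\min}(\delta/10)}{t_{\min}(\delta)} = 10^{\alpha} \approx 1.5
\quad\Longrightarrow\quad
\alpha \approx \log_{10} 1.5 \approx 0.18 .
```

A slope this shallow means that accuracy is cheap once the optimal expansion order is used: two further decades in accuracy cost only about a factor 1.5² ≈ 2.3 in time.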
Constraining the relative error (top panel of Figure 9) is slightly more costly than constraining the scaled error (bottom panel). This is largely because as discussed in the caption to Figure 8, but also because estimating f is easier and faster than estimating a. Of course, the estimation of a can be easily avoided in practice by using the accelerations from the previous time step.
6.3 Accuracy versus parameter ϵ
6.4 Complexity: scaling with the number N of particles
Table 2: Timings and errors for FMM runs for different N
Surviving entries, in source order: 2.81 × 10−7, 1.32 × 10−6; 3.61 × 10−7, 2.51 × 10−6; 3.85 × 10−7, 3.32 × 10−6; 4.05 × 10−7, 2.20 × 10−6
From Table 2, it can be seen that the costs for tree building grow faster than linearly with N ( is expected), those for the upward and downward passes roughly linearly with N (as expected), but those for the FMM estimation of f and the dual tree walk less than linearly. As a result, the total computational costs are very well fit by the power law for , see Figure 11.
Figure 11 also shows the timings for a (double-precision) direct-summation on the same hardware (yielding much more accurate accelerations) and for a mixed-precision direct-summation on a GPU using the sapporo library (yielding comparably accurate accelerations). At large (but realistic) N FMM out-performs direct summation, even if accelerated using a GPU.
6.5 Scaling with the number of CPUs
7 Beyond simple gravity approximation
So far, I have considered the approximate computation of the unsoftened gravitational potential and acceleration at all particle positions with equal relative (or scaled) accuracy. However, the fast multipole method can be easily modified or extended beyond that.
For example, one may want to have individual accuracy parameters instead of a global one. This is easily accommodated by replacing in criterion (16a) with and analogously for criterion (16b).
When using individual , but also in general, it may be beneficial to adapt the expansion order p to the accuracy actually required for a given cell → cell interaction. This could be implemented by using the lowest for which the multipole-acceptance criterion is satisfied.
7.1 Force computation for a subset of particles
Most N-body codes employ adaptive individual time steps for each particle. The standard technique is Makino’s () block-step scheme, where the forces of all active particles are computed synchronously. Active are those particles with time step smaller than some threshold (which varies from one force computation to the next).
When using FMM in such a situation, only interactions with sink cells containing at least one active particle must be considered. If the fraction of active particles in such cells is small (but non-zero), FMM becomes much less efficient per force computation. Fortunately, however, active particles are typically spatially correlated (because the time steps of adjacent particles are similar), such that the fraction of active particles is either zero or large.
The costs for the interaction and downward pass, on the other hand, decrease roughly like for . The net effect is that for , the costs are almost completely dominated by the preparation phase, and hence independent of . The precise point of this transition depends on N and the FMM parameters. For smaller N and/or more accurate forces, the relative contribution of the tree walk phase increases and the transition occurs at smaller .
There is certainly some room for improvement by, e.g. using a smaller expansion order p than is optimal for and/or re-cycling the tree structure from the previous time step. Both measures reduce the costs of the preparation phase and increase that of the interaction phase (at given ϵ), but shall reduce the overall costs if .
7.2 Softened gravity or far-field force
This Green's function (17) is no longer harmonic, so harmonic FMM cannot be used. One obvious option is to use the more general Cartesian FMM of Appendix A.1 (Dehnen ). The computational costs of this approach grow faster with expansion order p, such that small approximation errors (requiring high p) become significantly more expensive. However, small approximation errors are hardly required in situations where gravitational softening is employed. Alternatively, if softening is restricted to a finite region, i.e. if for , harmonic FMM can still be used to compute gravity from all sources at distances , while direct summation could be used for neighbours, i.e. sources at . This approach is sensible only if the number of neighbours is sufficiently bounded (so that the cost incurred by the direct summation remains small). This is the case, in particular, if the number of neighbours is kept (nearly) constant by adapting the individual softening lengths in order to adapt the numerical resolution (Price and Monaghan ).
In practice, this requires carrying with each cell the radius of the smallest sphere centred on z which contains all softening spheres of its sources, and allowing an FMM interaction only if .
The same technique can be used to restrict the FMM approximation to the far field for each particle, i.e. the force generated by all sources outside of a sphere of known radius around .
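The radius carried by each cell follows from elementary geometry: a sphere centred on z contains the softening sphere of a source at position x with softening length h exactly if its radius is at least |x − z| + h. The sketch below computes this radius directly from the particles. The acceptance test `outside_softening` is only one plausible form of the condition and is an assumption, as are all names.

```python
import math

def softening_radius(z, positions, softening_lengths):
    """Radius of the smallest sphere centred on z that contains all
    softening spheres (centre x_i, radius h_i) of the given sources."""
    return max(math.dist(z, x) + h
               for x, h in zip(positions, softening_lengths))

def outside_softening(z_source, r_soft, z_sink, sink_radius):
    """Assumed acceptance test: the whole sink sphere lies outside the
    sphere of radius r_soft around the source-cell centre."""
    return math.dist(z_source, z_sink) > r_soft + sink_radius

z = (0.0, 0.0, 0.0)
positions = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
softening_lengths = [0.5, 0.1]
r_soft = softening_radius(z, positions, softening_lengths)
```

In a tree code this radius would more likely be accumulated from the daughter cells during the upward pass, which yields an upper bound rather than the exact minimum.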
7.3 Jerk, snap, crackle, and pop
and the jerk follows from . Since , the M2M and L2L kernels (equations (3d) and (3e)) work also for the time derivatives and of the multipoles and field tensors, respectively. The relations for the next order, the snap , can be derived by differentiating yet again.
With each additional order (jerk, snap, crackle, pop, …), the computational cost of the combined M2L kernels is not more than the corresponding multiple of the ordinary M2L kernel (i.e. acceleration plus jerk are twice as costly as just acceleration). This is a direct consequence of not allowing cell-centre velocities, which prevents the terms depending on z or s in equation (3) from carrying any time dependence. In contrast, the computational costs of the P2M and L2P kernels grow quadratically with the order of the time derivative. This is not really a problem, since those kernels are only needed once per particle, while the M2L kernel is typically used ≳100 times more often.
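For reference, the quantities that the FMM approximates here are the pairwise acceleration and its time derivative. For the direct (particle → particle) part they follow from the standard expressions a = G m r/|r|³ and j = G m [v/|r|³ − 3 (r·v) r/|r|⁵], with r and v the relative position and velocity of the source. A minimal sketch, in which the function name and the default G = 1 are illustrative assumptions:

```python
def acc_jerk(r, v, m, G=1.0):
    """Acceleration and jerk at a sink due to one source of mass m.

    r, v: position and velocity of the source relative to the sink.
    """
    r2 = sum(c * c for c in r)                  # |r|^2
    rv = sum(a * b for a, b in zip(r, v))       # r . v
    inv_r3 = r2 ** -1.5                         # 1 / |r|^3
    acc = [G * m * rc * inv_r3 for rc in r]
    jerk = [G * m * (vc - 3.0 * rv / r2 * rc) * inv_r3
            for rc, vc in zip(r, v)]
    return acc, jerk

acc, jerk = acc_jerk((1.0, 2.0, 2.0), (0.1, -0.2, 0.3), m=1.0)
```

The jerk can be checked against a central finite difference of the acceleration along the relative motion r ± v δt.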
7.4 The tidal field
8 Discussion and conclusions
The fast multipole method (FMM) approximates the computation of the mutual forces between N particles. I have derived the relevant mathematical background, giving much simpler formulæ than the existing literature, for the case of unsoftened gravity, when the harmonicity of the Green's function allows a significant reduction of the computational complexity.
Like the tree code, my FMM implementation uses a hierarchical tree of spatial cells. Unlike the tree code, FMM uses cell → cell interactions, which account for all interactions between sources in the first cell and sinks in the second. Almost all distant particle → particle interactions are ‘caught’ by fewer than cell → cell interactions, such that local interactions, requiring computations, dominate the overall workload (Dehnen ). With the tree code, the situation is reversed: the distant interactions require computations and dominate the overall work. This implies that FMM has the best complexity of all known force solvers. What is more, the predominance of local as opposed to distant interactions makes FMM ideally suited for applications on super-computers, where communications (required by distant interactions) are increasingly more costly than computations. However, FMM is inherently difficult to parallelise and this study considered only a multi-threaded implementation with a task-parallel dual tree walk (the core of FMM).
Most previous implementations of FMM considered simple choices for the cell’s multipole- and force-expansion centres and the multipole-acceptance criterion which decides whether a given cell → cell interaction shall be processed via the multipole expansion or be split into daughter interactions. Traditionally, a simple opening-angle based multipole-acceptance criterion has been used and cell centres equal to either the cell’s geometric centre or its centre of mass. These choices, which presumably were based on computational convenience and intuition, inevitably result in a wide distribution of individual relative force errors with extended tails reaching ∼1000 times the median.
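The structure of a dual tree walk with the traditional opening-angle criterion can be sketched in one dimension. This is a toy illustration, not the error-controlled criterion of this study: the binary tree, the cell centres (geometric midpoints), the splitting rule and all names are assumptions. The walk only records which cell → cell pairs would be handled by the multipole expansion (`far`) and which by direct summation (`near`).

```python
import random

class Cell:
    """Binary-tree cell over the sorted positions xs[lo:hi]."""
    def __init__(self, xs, lo, hi, leaf_size=4):
        self.lo, self.hi = lo, hi
        self.centre = 0.5 * (xs[lo] + xs[hi - 1])
        self.radius = 0.5 * (xs[hi - 1] - xs[lo])
        self.children = []
        if hi - lo > leaf_size:
            mid = (lo + hi) // 2
            self.children = [Cell(xs, lo, mid, leaf_size),
                             Cell(xs, mid, hi, leaf_size)]

def well_separated(a, b, theta):
    """Opening-angle test: cell sizes small compared to their distance."""
    return abs(a.centre - b.centre) * theta > a.radius + b.radius

def dual_walk(sink, src, theta, far, near):
    """Cover all (sink particle, source particle) pairs exactly once."""
    if sink is not src and well_separated(sink, src, theta):
        far.append((sink, src))            # handled by one expansion
    elif not sink.children and not src.children:
        near.append((sink, src))           # leaf-leaf: direct summation
    elif sink is src:
        for a in sink.children:            # self-interaction: all daughter pairs
            for b in src.children:
                dual_walk(a, b, theta, far, near)
    elif src.children and (not sink.children or src.radius >= sink.radius):
        for b in src.children:             # split the larger (source) cell
            dual_walk(sink, b, theta, far, near)
    else:
        for a in sink.children:            # split the sink cell
            dual_walk(a, src, theta, far, near)

random.seed(0)
xs = sorted(random.random() for _ in range(64))
root = Cell(xs, 0, len(xs))
far, near = [], []
dual_walk(root, root, 0.5, far, near)
```

Every ordered particle pair ends up in exactly one recorded interaction, because each recursion step partitions the product of sink and source particles. Replacing `well_separated` by an error-based test changes which pairs are accepted but not this structure.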
The main goal of this study was to avoid such extended tails of large force errors and to minimise the computational effort at a given force accuracy. The key for achieving this goal is a reasonably accurate estimate, based on the multipole power of the source cell and the size of the sink cell, for the actual force error incurred by individual cell → cell interactions. Based on the insight from this estimate, I set the cell’s force-expansion centres to (an approximation of) the centre of the smallest sphere enclosing all its particles, for which the cell size and hence the error estimates are minimal. I also use the new estimates in the multipole-acceptance criterion, such that each cell → cell interaction is considered on the merit of the error it likely incurs. This results in very well behaved distributions of the relative force errors, provided an initial estimate for the forces is at hand. This can either be taken from the previous time step or obtained via low-accuracy FMM.
After these improvements, the method has only two free parameters: the expansion order p and a parameter ϵ for the relative force error. Experiments showed that the actual rms relative force error is typically somewhat less than ϵ, while for any given ϵ there is an optimum p at which the computational costs are minimal. For , for example, is optimal and the acceleration errors are comparable to those of direct summation on a GPU (the current state-of-the-art method for collisional N-body simulations). With these parameter settings, the computational costs scale like for large N and the method out-performs any direct-summation implementation for . When computing only the forces for of N particles, the costs are roughly proportional to for , but become independent of below that (where the costs for tree building dominate). For large N, this is still significantly faster than direct summation.
An implementation of the FMM on a GPU accelerator should yield a further significant speed-up compared to my CPU-based implementation, though this is certainly a challenging task, given that FMM is algorithmically more complex than direct summation or a tree code (both of which have been successfully ported to the GPU). Presumably a somewhat lesser challenge is a massively parallel implementation of the method, which can be run on a super-computer.
A practical application of FMM in an actual collisional N-body simulation would be very interesting. Since the force between close neighbours is always computed directly (in double precision) as explained earlier, close encounters can be treated essentially in the same fashion as with existing techniques. However, an unfortunate hindrance to an application of the presented techniques originates from the long marriage of existing collisional N-body techniques with direct summation. Methods, such as the Ahmad-Cohen neighbour scheme, to reduce the need for the costly far-field force summations are not necessary with FMM, and the existing N-body tools are not well suited for an immediate application of FMM.
Appendix 1: Derivation of the FMM relations
Here, the FMM relations given in Section 2 are derived and motivated. Differently from the main text, the multipole and force expansion centres, z and s, are not explicitly distinguished and instead z is used for either. The general case is a trivial generalisation.
A.1 Cartesian FMM
At each order , there are coefficients (as well as and ), and the total number of coefficients up to order p is . The computational effort of the resulting algorithm is dominated by their computation in (28b), which requires about multiplications. Thus at large p a straightforward application of this method approaches an operation count of . The computation (28b) of the field tensors is essentially a convolution in index space and hence can be accelerated using a fast Fourier technique with costs (but see endnote b).
A.2 Harmonic tensors
(see equation (42) for a definition of ). While at each order n there are only truly independent terms, the expansion (32) still carries all terms, amounting to a total of terms in an expansion up to order p. The equivalent spherical harmonic expansion (33) only carries terms per order, amounting to a total of , i.e. at large p is much preferable.
(Applequist ; Hinsen and Felderhof ). However, the resulting algebraic challenges are considerable, though the overall computational effort could well be reduced to operations (Joachim Stadel, private communication), but I am not aware of a systematic demonstration.
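The term counts of the two kinds of expansion can be checked by brute-force enumeration. The closed forms used for comparison, (n+1)(n+2)/2 Cartesian terms and 2n+1 harmonic terms per order, are the standard combinatorial results for three dimensions; the function names are illustrative assumptions.

```python
def cartesian_terms(n):
    """Number of triples (nx, ny, nz) of non-negative integers with
    nx + ny + nz == n (nz is fixed once nx and ny are chosen)."""
    return sum(1 for nx in range(n + 1) for ny in range(n + 1 - nx))

def harmonic_terms(n):
    """Number of spherical-harmonic indices m with -n <= m <= n."""
    return len(range(-n, n + 1))

p = 8
total_cartesian = sum(cartesian_terms(n) for n in range(p + 1))
total_harmonic = sum(harmonic_terms(n) for n in range(p + 1))
```

For p = 8 the Cartesian expansion carries roughly twice as many terms as the harmonic one, and the ratio grows linearly with p.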
A.3 Spherical harmonics
The algebraic complications with obtaining an efficient Cartesian FMM stem from the fact that the Laplace operator involves three terms, such that the resulting recovery relation (34) has two terms instead of one on the right-hand side. This problem can be avoided by Taylor expanding in other than Cartesian coordinates where the Laplace operator involves only two instead of three terms.
or in place of equation (34). With this relation one can eliminate all mixed ξ-η derivatives in favour of z derivatives. This in turn allows a reduction in the number of indices from three to two by using the total number n of derivatives and the number of ξ (for ) or η derivatives (for ).
Table: The real-valued functions for , with rows indexed by m and columns by n (table body not recovered).
A.4 Spherical-harmonic FMM
As the Cartesian FMM relations (28) were based on equation (26), the spherical harmonic FMM relations (3) are based on equation (47), which for is completely equivalent but computationally more efficient.
A.5 Implementation details
A.5.1 Recursive evaluation of spherical harmonics
as well as their counterparts for , allow for an efficient and stable evaluation of and .
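The flavour of such recursions can be illustrated with the textbook three-term recurrence for the associated Legendre functions, which underlie the spherical harmonics. This is a swapped-in standard technique, not the recursion for the solid harmonics used in this study; the function name and the dictionary layout are assumptions.

```python
import math

def assoc_legendre(n_max, x):
    """Associated Legendre functions P_n^m(x), 0 <= m <= n <= n_max,
    including the Condon-Shortley phase. Returns a dict keyed by (n, m)."""
    P = {}
    s = math.sqrt(max(0.0, 1.0 - x * x))
    pmm = 1.0
    for m in range(n_max + 1):
        if m > 0:
            pmm *= -(2 * m - 1) * s          # P_m^m from P_{m-1}^{m-1}
        P[(m, m)] = pmm
        if m < n_max:
            P[(m + 1, m)] = x * (2 * m + 1) * pmm
        for n in range(m + 2, n_max + 1):    # upward recursion in n
            P[(n, m)] = (x * (2 * n - 1) * P[(n - 1, m)]
                         - (n + m - 1) * P[(n - 2, m)]) / (n - m)
    return P

P = assoc_legendre(3, 0.5)
```

The upward recursion in n at fixed m is the numerically stable direction, which is why it is the usual choice.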
A.5.2 Real-valued spherical harmonics
The relevant relations for these real-valued spherical harmonics are best directly transcribed from the corresponding complex relations.
A.6 Accelerating FMM relations
The FMM kernels M2L, M2M, and L2L (equations (3b), (3d), (3e)) all require operations. However, if the interactions or translations are along the z-axis, the costs are only because .
One method to exploit this is to first translate along the z-axis and then perpendicular to the z-axis. For a vector perpendicular to the z-axis, vanishes whenever is even. This implies that a translation along can be done faster than a general translation (in the limit of , twice as fast).
This splitting method cannot be applied to the M2L kernel (3b) (because it is not a translation), which occurs many more times in the FMM algorithm than the M2M and L2L kernels. To accelerate the M2L kernel, one can exploit that a rotation only costs operations, too. Thus, if one first rotates into a frame in which the interaction is along the z-axis, applies the M2L kernel in the rotated frame, and finally rotates back into the original frame, the total costs are still .
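The geometric part of this procedure, finding the rotation that brings the interaction vector onto the z-axis, can be sketched with ordinary 3 × 3 matrices. The sketch uses a rotation about z followed by one about y for brevity; the rotation of the multipole coefficients themselves is not shown, and all names are assumptions.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(b):
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def align_with_z(r):
    """Rotation matrix R such that R r = (0, 0, |r|)."""
    x, y, z = r
    phi = math.atan2(y, x)                    # azimuth of r
    theta = math.atan2(math.hypot(x, y), z)   # polar angle of r
    return matmul(rot_y(-theta), rot_z(-phi))

r = (1.0, 2.0, 2.0)
R = align_with_z(r)
r_rot = matvec(R, r)
```

Since R is orthogonal, the rotation back into the original frame is given by its transpose.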
A.6.1 Fast rotations
respectively. Thus, this method of achieving a general rotation not only avoids the (recursive) computation of the Wigner functions (which itself costs operations), but also benefits from the facts that the swap matrices have ≈4 times fewer non-zero entries than the and are known a priori, such that they can be ‘hard-wired’ into computer code.
A.6.2 A fast M2L kernel
Finally, one must rotate back to the original frame by first swapping x and z, rotating by , swapping x and z again, followed by a final rotation by .
These rotations and swaps can be accelerated further by exploiting that in (66) only multipoles with are needed and, similarly, that for . As Figure 1 demonstrates, the overhead due to the rotations pays off already for .
Appendix 2: The energy error of a simulation
The gravitational forces (and potentials) used in N-body simulations always carry some error. When using direct summation, this is solely due to round-off errors, while for approximate methods the approximation error should dominate round-off. Here, I investigate the consequences of these errors for the non-conservation of the total energy.
B.1 The energy error due to force errors
Thus, the relative energy error resulting from the force errors alone is much smaller than ε, simply because it is some average over many force errors.
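The effect of averaging can be illustrated numerically under the simplifying assumption (made here, not taken from the derivation) that the individual relative errors are independent, zero-mean and Gaussian with amplitude ε. Their mean then has a typical magnitude of ε/√N. All names and values are illustrative.

```python
import math
import random

def mean_relative_error(n, eps, rng):
    """Mean of n independent Gaussian errors with standard deviation eps."""
    return sum(rng.gauss(0.0, eps) for _ in range(n)) / n

rng = random.Random(1)      # fixed seed for reproducibility
eps = 1e-7                  # assumed amplitude of the individual errors
n = 10_000                  # assumed number of contributing errors
avg = mean_relative_error(n, eps, rng)
expected_scale = eps / math.sqrt(n)
```

Correlated or biased errors would average out less efficiently than in this idealised case.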
B.2 The measurement error
If the same precision ε is used for computing the particle potentials and accelerations, this is much larger than the energy error (69) due to force errors.
B.3 Approximate gravity solvers
The situation is different for approximative methods, such as the tree code, FMM, and mesh-based techniques. All of these approximate the true potential, but use the exact derivatives of the approximated potential for the accelerations. Therefore, the total approximated energy should be conserved (modulo round-off errors), even if the approximation is poor.
For the FMM and the tree code the situation is actually different, because the approximated potential is not globally continuous but only piece-wise. This is because the concrete form of the approximation used for a given particle depends on its position (which determines how FMM approximates each pair-wise force). A particle crossing a boundary between such continuous regions suffers a jump in the (approximated) potential, and hence energy, while the corresponding kick in velocity (to conserve energy) is ignored. These discontinuities are part of the approximation error, and their amplitudes are proportional to it. The implication is that for the tree code and FMM energy is not conserved (even for accurate time integration) and the degree of non-conservation actually reflects the amplitude of the approximation errors in an average sense.
Cheng et al.’s () expressions are quite cumbersome because they are given in terms of the surface spherical harmonics in polar coordinates and because they contain phase-factors like owing to their unconventional definition for the which implies instead of .
Expressions like for the operation count relate to the asymptotic behaviour at large expansion orders p. While this is straightforward to specify, it is not necessarily very relevant, since in the range up to , as required in practice, the actual costs usually grow more slowly than implied by the asymptotic behaviour (see Figure 1 for a typical example) and because the numerical implementation may be data-dominated rather than computation dominated.
The situation is different for N-body simulations of collisionless stellar dynamics, where reversible integrators are used and the accepted force errors, and thus their time asymmetries, are significantly larger.
The increase of this ratio to ≈20 towards small errors may well be caused by inaccuracies of the direct summation used for calculating the errors.
The timings for the sapporo library also include additional computations (nearest neighbour finding and neighbour listing). These contribute negligibly at large N, but at small N they are, together with latency on the GPU, responsible for the deviation of the observed complexity from .
Using multi-index notation with , such that the first sum in (26) is over non-negative integer triples n with . Furthermore and .
The author thanks Joachim Stadel for many helpful discussions and the suggestion to allow , Alessia Gualandris for running sapporo to provide the data for Figure 3, and Simon Portegies Zwart and Jeroen Bédorf for providing the timings for sapporo 2 in Figure 11. This work was supported by STFC consolidated grant ST/K001000/1.
- Ahmad A, Cohen L: A numerical integration scheme for the N-body gravitational problem. J. Comput. Phys. 1973, 12: 389–402. 10.1016/0021-9991(73)90160-5
- Applequist J: Traceless Cartesian tensor forms for spherical harmonic functions: new theorems and applications to electrostatics of dielectric media. J. Phys. A, Math. Gen. 1989, 22: 4303–4330. 10.1088/0305-4470/22/20/011
- Barnes J, Hut P: A hierarchical O(N log N) force-calculation algorithm. Nature 1986, 324: 446–449. 10.1038/324446a0
- Capuzzo-Dolcetta R, Miocchi P: A comparison between the fast multipole algorithm and the tree-code to evaluate gravitational forces in 3-D. J. Comput. Phys. 1998, 143: 29–48. 10.1006/jcph.1998.5949
- Cheng H, Greengard L, Rokhlin V: A fast adaptive multipole algorithm in three dimensions. J. Comput. Phys. 1999, 155: 468–498. 10.1006/jcph.1999.6355
- Dehnen W: A very fast and momentum-conserving tree code. Astrophys. J. 2000, 536: L39–L42. 10.1086/312724
- Dehnen W: Towards optimal softening in three-dimensional N-body codes - I. Minimizing the force error. Mon. Not. R. Astron. Soc. 2001, 324: 273–291. 10.1046/j.1365-8711.2001.04237.x
- Dehnen W: A hierarchical O(N) force calculation algorithm. J. Comput. Phys. 2002, 179: 27–42. 10.1006/jcph.2002.7026
- Dehnen W, Read JI: N-body simulations of gravitational dynamics. Eur. Phys. J. Plus 2011, 126: Article ID 55. 10.1140/epjp/i2011-11055-3
- Fischer K, Gärtner B, Herrmann T, Hoffmann M, Schönherr S: Bounding volumes. In: CGAL User and Reference Manual, 4.2 edn., CGAL Editorial Board (2013). http://www.cgal.org/Manual/4.2
- Gaburov E, Harfst S, Portegies Zwart S: SAPPORO: a way to turn your graphics cards into a GRAPE-6. New Astron. 2009, 14(7): 630–637. 10.1016/j.newast.2009.03.002
- Gradshteyn IS, Ryzhik I: Table of Integrals, Series, and Products. 5th edition. Academic Press, London; 1994.
- Greengard L, Rokhlin V: A fast algorithm for particle simulations. J. Comput. Phys. 1987, 73: 325–348. 10.1016/0021-9991(87)90140-9
- Hinsen K, Felderhof BU: Reduced description of electric multipole potential in Cartesian coordinates. J. Math. Phys. 1992, 33: 3731–3735. 10.1063/1.529869
- Hobson EW: The Theory of Spherical and Ellipsoidal Harmonics. Cambridge University Press, Cambridge; 1931.
- James RW: Transformation of spherical harmonics under change of reference frame. Geophys. J. Int. 1969, 17: 305–316. 10.1111/j.1365-246X.1969.tb00239.x
- Makino J: Optimal order and time-step criterion for Aarseth-type N-body integrators. Astrophys. J. 1991, 369: 200–212. 10.1086/169751
- Makino J, Taiji M: Scientific Simulations with Special-Purpose Computers: The GRAPE Systems. Wiley, New York; 1998.
- Matoušek J, Sharir M, Welzl E: A subexponential bound for linear programming. Algorithmica 1996, 16: 498–516. 10.1007/BF01940877
- Maxwell JC: Treatise on Electricity and Magnetism. Oxford University Press, Oxford; 1892.
- Pinchon D, Hoggan PE: Rotation matrices for real spherical harmonics: general rotations of atomic orbitals in space-fixed axes. J. Phys. A, Math. Theor. 2007, 40: 1597–1610. 10.1088/1751-8113/40/7/011
- Plummer HC: On the problem of distribution in globular star clusters. Mon. Not. R. Astron. Soc. 1911, 71: 460–470. 10.1093/mnras/71.5.460
- Price DJ, Monaghan JJ: An energy-conserving formalism for adaptive gravitational force softening in smoothed particle hydrodynamics and N-body codes. Mon. Not. R. Astron. Soc. 2007, 374: 1347–1358. 10.1111/j.1365-2966.2006.11241.x
- Reinders J: Intel Threading Building Blocks. O’Reilly Media, Sebastopol; 2007.
- Salmon JK, Warren MS: Skeletons from the treecode closet. J. Comput. Phys. 1994, 111: 136–155. 10.1006/jcph.1994.1050
- Taura K, Nakashima J, Yokota R, Maruyama N: A task parallel implementation of fast multipole methods. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis 2012, 617–625. 10.1109/SC.Companion.2012.86
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.