Sapporo2: A versatile direct $N$-body library

Astrophysical direct $N$-body methods have been one of the first production algorithms to be implemented using NVIDIA's CUDA architecture. Now, almost seven years later, the GPU is the most used accelerator device in astronomy for simulating stellar systems. In this paper we present the implementation of the Sapporo2 $N$-body library, which allows researchers to use the GPU for $N$-body simulations with little to no effort. The first version, released five years ago, is actively used, but lacks advanced features and versatility in numerical precision and support for higher order integrators. In this updated version we have rebuilt the code from scratch and added support for OpenCL, multi-precision and higher order integrators. We show how to tune these codes for different GPU architectures and present how to continue utilizing the GPU optimal even when only a small number of particles ($N<100$) is integrated. This careful tuning allows Sapporo2 to be faster than Sapporo1 even with the added options and double precision data loads. The code runs on a range of NVIDIA and AMD GPUs in single and double precision accuracy. With the addition of OpenCL support the library is also able to run on CPUs and other accelerators that support OpenCL.


Background
The class of algorithms, commonly referred to as direct N -body algorithms is still one of the most commonly used methods for simulations in astrophysics. These algorithms are relatively simple in concept, but can be applied to a wide range of problems. From the simulation of few body problems, such as planetary stability to star-clusters and even small scale galaxy simulations. However, these algorithms are also computationally expensive as they scale as O(N 2 ). This makes the method unsuitable for large N (> 10 6 ), for these large N simulations one usually resorts to a lower precision method like the Barnes-Hut tree-code method [1] or the Particle Mesh method that both scale as O(N log N ) (e.g. [2,3]). These methods, although faster, are also notably less accurate and not suitable for simulations that rely on the high accuracy that direct summation, coupled with higher order integrators, offer. On the other end of the spectrum you can find even higher accuracy methods which uses arbitrary precision [4]. The work of [4] indicates that the accuracy offered by the default (double precision) direct N -body methods is sufficient for most scientific problems.
The direct N -body algorithm is deceivingly simple, in the fundamental form it performs N 2 gravitational computations, which is a parallel problem that can be efficiently implemented on almost any computer architecture with a limited amount of code lines. A number of good examples can be found on the Nbabel.org website.
This site contains examples of a simple N -body simulation code implemented in a wide range of programming languages. However, in practice there are many variations of the algorithms in use, with up to eighth order integrations [5], algorithmic extensions such as block time-stepping [6], neighbour-schemes [7], see [8] and references therein for more examples. These variations transform the simple O(N 2 ) shared time-step implementation in a complex method, where the amount of parallelism can differ per time-step. Especially the dynamic block time-stepping method adds complexity to the algorithm, since the number of particles that participate in the computations changes with each integration step. This variable number of particles involved in computing forces requires different parallelisation strategies. In the worst case, there is only one particle integrated, which eliminates most of the standard parallelisation methods for N 2 algorithms. There is extensive literature on high performance direct N -body methods with the first being described in 1963 [9]. The method has been efficiently implemented on parallel machines [6], vector machines [10] and dedicated hardware such as the GRAPE's [11]. For an overview we refer the interested reader to the following reviews [8,12,13]. Furthermore, there has been extensive work on accelerating N -body methods using GPUs. There have been several N -body libraries to ease the development of N -body integrators that use the GPU. The first library that offered support for the GRAPE API was Kirin [14], however this library only supports single precision and is therefore less accurate than the GRAPE. With the introduction of the Yebisu library [15] there was support for double-single precision 1 , which achieved accuracy comparable to the GRAPE. The library also featured support for fourth and sixth order Hermite integrators in combination with minimized data send by performing the prediction on the GPU. This library, however, is not compatible with the GRAPE API and only supports a single GPU. In our previous work Sapporo1 [16], we added support for multiple GPUs in combination with the GRAPE API and double-single precision. Apart from libraries there are also N -body integrators that come with built-in support for GPU hardware. For example in [17], the authors combine Yebisu and phiGRAPE [18] in the new phiGPU code. This code is able to run on multiple GPUs and supports up to eighth order accuracy. In [19,20], the authors introduce the HiGPUs N -body code. This standalone code contains a sixth order integrator, and supports CUDA, OpenCL and IEEE-754 double precision accuracy. Finally, there is NBODY6 which uses GPU acceleration together with an Ahmad-Cohen neighbour scheme [7,21].
In this paper we present our direct N -body library, Sapporo2, since we focus on the library we will not make a full comparison with the standalone software packages mentioned above. The library contains built-in support for the second order leap-frog (GRAPE-5), fourth order Hermite (GRAPE-6) and sixth order Hermite integrators. The numerical precision can be specified at run time and depends on requirements for performance and accuracy. Furthermore, the library can keep track of the nearest neighbours by returning a list containing all particles within a certain radius. Depending on the available hardware the library operates with CUDA and OpenCL, and has the option to run on multiple-GPUs, if installed in the same compute node. The library computes the gravitational force on particles that are integrated with block time-step algorithms. However, the library can trivially be applied to any other O(N 2 ) particle method by replacing the force equations. For example, methods that compute the Coulomb interactions [22] or molecular dynamics [23] use similar methods as presented in this work.

Methods
With Graphic Processing Units (GPUs) being readily available in the computational astrophysics community for over 5 years we will defer a full description of their specifics and peculiarities [8,14,24,25]. Here we only give a short overview to stage the context for the following sections. In GPU enabled programs we distinguish two parts of code. The 'host' code, used to control the GPU, is executed on the CPU; whereas the 'device' code, performing the majority of the computations, is executed on the GPU. Each GPU consists of a set of multiprocessors and each of these multiprocessors contains a set of computational units. We send work to the GPU in blocks for further processing by the multiprocessors. In general a GPU requires a large amount of these blocks to saturate the device in order to hide most of the latencies that originate from communication with the off-chip memory. These blocks contain a number of threads that perform computations. These threads are grouped together in 'warps' for NVIDIA machines or 'wavefronts' on AMD machines. Threads that are grouped together share the same execution path and program counter. The smaller the number of threads that are grouped the smaller the impact of thread divergence. On current devices a warp consists of 32 threads and a wavefront contains 64 threads. This difference in size has effects on the performance (see Section 3).

Parallelisation method
To solve the mutual forces for an N -body system the forces exerted by the jparticles (sources) onto the i-particles (sinks) have to be computed. Depending on the used algorithm the sources and sinks can either belong to the same or a completely different particle set. Neither is it required that these sets have the same dimensions. In worst case situations this algorithm scales as O(N 2 ), but since each sink particle can be computed independently it is trivial to parallelise within a single time-step. The amount of parallelism, however, depends on the number of sink particles. For example, in high precision gravitational direct N -body algorithms that employ block time-stepping the number of sink particles ranges between 1 and N . In general the number of sinks is smaller than the number of sources, because only the particles of which the position and velocity require an update are integrated [6]. As a consequence the amount of available parallelism in this algorithm is very diverse, and depends directly on the number of active sink particles.
Currently there are two commonly used methods for solving N 2 like algorithms using GPUs. The first performs parallelisation over the sink particles [26,14,24] which launches a separate compute thread for each sink particle. This is efficient when the number of sinks is large (> 10 4 ), because then the number of compute threads is sufficiently high to saturate the GPU. However, when the number of sink particles is small (≤ 10 4 ) there are not enough active compute threads to hide the memory and instruction latencies. As a result, the GPU will be under utilized and only reaches a fraction of the available peak performance. We expect that future devices require an even larger number of running threads to reach peak performance, in which case the number of sink particles has to be even larger to continuously saturate the device. However, adjusting the number of sink particles to keep parallel efficiency is not ideal, because then one artificially increases the amount of work (and run time) in favor of efficiency. Therefore, a second method was introduced in Sapporo1 [16] which takes a slightly different approach. In Sapporo1 we parallelize over the source particles and keep the number of sink particles that is concurrently integrated fixed to a certain number. The source particles are split into subsets, each of which forms the input against which a set of sink particles is integrated. The smaller the number of sink particles the more subsets of source particles we can make. It is possible to saturate the GPU with enough subsets, so if the product of the number of sink and source particles is large enough 2 you can reach high performance even if the number of sinks or sources is small.
Of the two parallelisation methods the first one is most efficient when using a shared-time step algorithm, because fewer steps are involved in computing the gravity. However, the Sapporo1 method is more suitable for block time-stepping algorithms commonly used in high precision gravitational N -body simulations. Even though this method requires an extra step to combine the partial results from the different subsets. The Sapporo1 method is also applied in this work. With Sapporo1 being around for 5 years we completely rewrote it and renamed it to Sapporo2, which is compatible with current hardware and is easy to tune for future generation accelerator devices and algorithms using the supplied test scripts. The next set of paragraphs describe the implementation and the choices we made.

CUDA and OpenCL
When NVIDIA introduced the CUDA framework in 2007 it came with compilers, run time libraries and examples. CUDA is an extension to the C programming language and as such came with language changes. These extensions are part of the device and, more importantly, part of the host code 3 . The use of these extensions requires that the host code is compiled using the compiler supplied by NVIDIA. With the introduction of the 'driver API' 4 this was no longer required. The 'driver API' does not require modifications to the C language for the host code. However, writing CUDA programs with the 'driver API' is more involved than with the 'run time API', since actions that were previously done by the NVIDIA compiler now have to be performed by the programmer.
When the OpenCL programming language was introduced in 2009 it came with a set of extensions to the C language to be used in the device code. There are no changes to the language used for writing the host code, instead OpenCL comes with a specification of functions to interact with the device. This specification is very similar to the specification used in the CUDA driver API and follows the same program flow.
In order to support both OpenCL and CUDA in Sapporo2 we exploited the similarity between the CUDA driver API and the OpenCL API. We developed a set of C++ classes on top of these APIs which offer an unified interface for the host code. The classes encapsulate a subset of the OpenCL and CUDA functions for creating device contexts, memory buffers (including functions to copy data) and kernel operations (loading, compiling, launching). Then, depending on which class is included at compile time the code is executed using OpenCL or CUDA. The classes have no support for the more advanced CUDA features such as OpenGL and Direct3D interoperability.
Kernel-code With the wrapper classes the host-code is language independent. For the device code this is not the case, even though the languages are based on similar principles, the support for advanced features like C++ templates, printing and debugging functionality in CUDA makes it much more convenient to develop in pure CUDA. Once we have a working CUDA version we convert this to OpenCL. The use of templates in particular reduces the amount of code. In the CUDA version all possible kernel combinations are implemented using a single file with templates. For OpenCL a separate file has to be written for each combination of integrator and numerical precision. The method used to compute the gravitational force is comparable to the method used in Sapporo1 with only minor changes to allow double precision data loads/stores and more efficient loop execution.

Numerical Accuracy
During the development of Sapporo1 (before the GT200 chips) GPUs lacked support for IEEE-754 double precision computations and therefore all the compute work was done in either single or double-single precision. The resulting force computation had similar precision as the, at that time, commonly used GRAPE hardware [11,16]. This level of accuracy is sufficient for the fourth order Hermite integration scheme [27,28]. Currently, however there are integrators that accurately solve the equations of motions of stars around black-holes, planets around stars and similar systems that encounter high mass ratios. For these kind of simulations one often prefers IEEE-754 double precision to solve the equations of motion. The current generation of GPUs support IEEE-754, which enables computations that require this high level of accuracy. The data in Sapporo2 is, therefore, always stored in double precision. The advantage is that we can easily add additional higher order integrators that require double precision accuracy computations, without having to rewrite major parts of the host code. Examples of such integrators are the sixth and eighth order Hermite integrators [5]. The performance impact of double precision storage on algorithms that do not require double precision computations is limited. Before the actual computations are executed the particle properties are converted to either float or double-single and the precision therefore does not influence the computational performance. The penalty for loading and storing double the amount of data is relatively small as can be seen in the result section where Sapporo1 is compared to Sapporo2.

multiple GPUs
Our new N -body library can distribute the computational work over multiple GPUs, as long as they are installed in the same system. While in Sapporo1 this was implemented using the boost threading library, this is now handled using OpenMP.
The multi-GPU parallelisation is achieved by parallelising over the source particles. In Sapporo1 each GPU contained a copy of all source particles (as in [18]), but in Sapporo2 the source particles are distributed over the devices using the roundrobin method. Each GPU now only holds a subset of the source particles (similar to PhiGPU, HiGPU and NBODY6) which reduces memory requirements, transfer time and the time to execute the prediction step on the source particles. However, the order of the particle distribution and therefore, the order in which the addition is executed is changed when comparing Sapporo1 and Sapporo2. This in turn can lead to differences in the least significant digit when comparing the computed force of Sapporo1 to Sapporo2.

Other differences
The final difference between Sapporo1 and Sapporo2 is the way the partial results of the parallelisation blocks are combined. Sapporo1 contains two computational kernels to solve the gravitational forces. The first computes the partial forces for the individual blocks of source particles, and the second sums the partial results. With the use of atomic operators these two kernels can be combined, which reduces the complexity of maintaining two compute kernels when adding new functionality, at a minimal performance impact. The expectation is that future devices require more active threads to saturate the GPU, but at the same time offer improved atomic performance. The single kernel method that we introduced here will automatically scale to future devices and offers less overhead than launching a separate reduction kernel. This reduced overhead results in slightly better performance (few %) on current architectures compared to the original two kernel method. In total we now require three GPU kernels to compute gravity, one copy kernel to move particles from CPU buffers to GPU buffers, one kernel to predict the particles to the new time-step and finally, the gravity kernel to compute the forces.

Results
In astrophysics the current most commonly used integration method is the fourth order Hermite [27] integrator. This integrator requires the velocity, the acceleration and the first time derivative of the acceleration (jerk) to be computed. The integrator furthermore requires information of the nearest neighbouring particle, this to determine collisional events or binary formation. Finally, the more advanced integrators such as NBODY4 [29] and Kira [30] require a list of particles within a given radius from each particle to determine the perturber list. All this is what Sapporo1 computes and how the GRAPE hardware operates [11]. The used numerical precision in this method is the double-single variant. In order to compare the new implementation with the results of Sapporo1, all results in this section, unless indicated otherwise, refer to the double-single fourth order Hermite integrator. Furthermore, we have enabled the computation of the nearest neighbour and the list of nearby particles, as has Sapporo1. However if the user does not require this information it can be disabled by changing a template parameter in the code.
For the performance tests we used different machines, depending on which GPU was used. All the machines with NVIDIA GPUs have CUDA 5.5 toolkit and drivers installed. For the machine with the AMD card we used version 2.8.1.0 of the APP-SDK toolkit and driver version 13.4.
The full list of used GPUs can be found in Tab. 1, the table shows properties such as clock speed and number of cores. In order to compare the various GPUs we also show the theoretical performance, relative with respect to the GTX480. Since, theoretical performance is not always reachable we also show the relative practical performance as computed with a simple single precision N -body kernel that is designed for shared-time steps, similar to the N -body example in the CUDA SDK [24].

Thread-block configuration
Sapporo2 is designed around the concept of processing a fixed number of sink particles for a block time-step algorithm (see Section 2.1). Therefore, the first thing to determine is the smallest number of sink particles that gives full GPU performance. To achieve full performance the computation units on the GPUs have to be saturated with work. The GPU consists of a number of multiprocessors and the computation units are spread over these multiprocessors. When the host code sends work to the GPU this is done in sets of thread-blocks. Each thread-block is executed on a multiprocessor. The blocks contain a (configurable) number of threads that can work together, while the blocks themselves are treated as independent units of work. In this section we determine the optimal number of blocks and the number of threads per block to saturate the GPU when performing the gravity computations. We test a range of configurations where we vary the number of blocks per multiprocessor and the number of threads per block. The results for four different GPU architectures are presented in Fig. 1. In this figure each line represents a certain number of blocks per multi-processor, N blocks . The x-axis indicates the number of threads in a thread-block, N threads . The range of this axis depends on the hardware. For the HD7970 architecture we cannot launch more than N threads = 256, and for the GTX480 the limit is N threads = 576. For the two Kepler devices 680GTX and K20m we can launch up to N threads = 1024 giving these last two devices the largest set of configuration options. The y-axis shows the required wall-clock time to compute the forces using the indicated configuration, the bottom line indicates the most optimal configuration.
For the 680GTX and the K20m the N blocks configurations reach similar performance when N threads > 512. This indicates that at that point there are so many active threads per multi-processor, that there are not enough resources (registers and/or shared-memory) to accommodate multiple thread-blocks per multi-processor at the same time. To make the code suitable for block time-steps the configuration with the least number of threads, that gives the highest performance would be the most ideal. For the HD7970 this is N threads = 256 while for the Kepler architectures N threads = 512 gives a slightly lower execution time than N threads = 256 and N threads = 1024. However, we chose to use N threads = 256 for all configurations and use 2D threadblocks on the Kepler devices to launch 512 or 1024 threads. When we talk about 2D thread-blocks it means that we launch multiple threads per i-particle whereby each thread computes a part of the j-particles. This way we increase the number of total threads which the hardware can schedule in order to hide the memory latencies. Especially when the number of active i particles is ≤ 128 this helps to improve the performance and is discussed in more detail in the next section. For each architecture the default configuration is indicated with the circles in Fig. 1.

Block-size / active-particles
Now we inspect the performance of Sapporo2 in combination with a block time-step algorithm. We measured the time to compute the gravitational forces using either the NVIDIA GPU Profiler or the built-in event timings of OpenCL. The number of active sink particles, N active , is varied between 1 and the optimal N threads as specified in the previous paragraph. The results are averaged over 100 runs and presented in Fig. 2. We used 131072 source particles which is enough to saturate the GPU and is currently the average number of particles used in direct N -body simulations that employ a block time-step integration method.
The straight striped lines in Fig. 2 indicate the theoretical linear scaling from (0, 0) to (256, X) where X is the execution time of the indicated GPU when N active = 256. Visible in the figure are the jumps in the execution time that coincide with the warp (wavefront) size of 32 (64). For NVIDIA devices we can start 2D thread-blocks for all values of N active , since the maximum number of threads that can be active on the device is ≥ 512. The effect of this is visible in the more responsive execution times of the NVIDIA devices when decreasing N active compared to the AMD device. Each time N active drops below a multiple of the maximum number of active threads, the execution time will also decrease. When N active decreases from N active < ∼ 64 the execution time goes down linearly, because of the multiple blocks that can be started for any value of N active . The lines indicated with '1D' in the legend show the execution time, if we would not subdivide the work further using 2D threadblocks. This will under-utilize the GPU and results in increased execution times for N active < 128.
The performance difference between CUDA and OpenCL is minimal, which indicates that the compute part of both implementations inhabits similar behavior. For most values of N active the timings of Sapporo1 and Sapporo2 are comparable. Only for N active < 64 we see a slight advantage for Sapporo1 where the larger data loads of Sapporo2 result in a slightly longer execution time. However, the improvements made in Sapporo2 result in higher performance and a more responsive execution time compared to Sapporo1 when 128 ≥ N active < 160. For the HD7970, there is barely any improvement when N active decreases from 256 to 128. There is a slight drop in the execution time at N active = 192, which coincides with one less active wavefront compared to N active = 256. When N active ≤ 128 we can launch 2D blocks and the performance improves again and approaches that of the NVIDIA hardware, but the larger wavefront size compared to the warp size causes the the execution times to be less responsive to changes of N active .

Range of N
Now that we selected the thread-block configuration we continue with testing the performance when computing the gravitational forces using N sink and N source particles, resulting in N sink × N source force computations (we set N sink = N source ). The results are presented in the left panel of Fig. 3. This figure shows the results for the five GPUs using CUDA, OpenCL, Sapporo1 and Sapporo2. The execution time includes the time required to send the input data and retrieve the results from the device.
The difference between Sapporo1 and Sapporo2 (both the CUDA and OpenCL versions) on the K20m GPU are negligible. Sapporo1 is slightly faster for N < 10 4 , because of the increased data-transfer sizes in Sapporo2, which influence the performance more when the number of computations is relatively small. Sapporo2 is slightly faster than Sapporo1 when N ≥ 10 4 , because of the various optimisations added to the new version. The difference between the GTX680, K20m and the HD7970 configurations is relatively small. While the GTX Titan is almost 1.5× faster and the GTX480 almost 2× slower than these three cards. These numbers are not unexpected when inspecting their theoretical performance (see Tab. 1). For N < 10 5 we further see that the performance of the HD7970 is lower than for the NVIDIA cards. This difference is caused by slower data transfer rates between the host and device for the HD7970. Something similar can be seen when we compare the OpenCL version of the K20m with the CUDA version. Close inspection of the timings indicate that this difference is caused by longer CPU-GPU transfer times in the OpenCL version when transferring small amounts of data (< 100KB) which, for small N , forms a larger part of the total execution time.

Double precision vs Double-single precision
As mentioned in Section 2.2.2 the higher order integrators require the use of double precision computations. Therefore, we test the performance impact when using full native double precision instead of double-single precision. For this test we use the GTX680, K20m and the HD7970. The theoretical peak performance when using double precision computations is lower than the peak performance when using single precision computations. The double precision performance of the K20m is one third that of the single precision performance. For the GTX680 this is 1 24 th and for the HD7970 this is one fourth. As in the previous section we use the wall-clock time required to perform N 2 force computations (including the data send and receive time) to compare the devices. The results are presented in the right panel of Fig. 3, here the double precision timings are indicated with the open symbols and the double-single timings with the filled symbols.
As in the previous paragraph, when using double-single precision the performance is comparable for all three devices. However, when using double-precision the differences become more clear. As expected, based on the theoretical numbers, the GTX680 is slower than the other two devices. The performance of the K20m and the HD7970 are comparable for N > 10 4 . For smaller N the performance is more influenced by the transfer rates between the host and the device than by its actual compute speed.
Taking a closer look at the differences we see that the performance of the GTX680 in full double precision is about ∼ 10× lower than when using double-single precision. For the other two cards the double precision performance is roughly ∼ 2.8× lower. For all the devices this is roughly a factor of 2 difference from what can be expected based on the specifications. This difference can be explained by the knowledge that the number of operations is not exactly the same for the two versions 5 and even in the double single method we use the special operation units to compute the rsqrt 6 . Another reason for the discrepancy between the practical and theoretical numbers is that we keep track of the nearest neighbours which requires the same operations for the double-single and the double precision implementation. Combining this with the knowledge that we already execute a number of double precision operations to perform atomic additions and data reads, results in the observed difference between the theoretical and empirically found performance numbers.

Sixth order performance
The reason to use sixth order integrators compared to lower order integrators is that, on average, they are able to take larger time-steps. They are also better in handling systems that contain large mass ratios (for example when the system contains a supermassive black-hole). The larger time-step results in more active particles per block-step which improves the GPU efficiency. However, it also requires more operations than a fourth order integrator, something which is discussed in detail in [5]. Previous work [15,19,20] indicates that double-single accuracy is sufficient for a sixth order integrator. However, to give the user the choice we implemented both a double-single and a double precision version of this method. The performance results of these versions are presented in Fig. 4. As in the previous figures we present the time to compute N 2 forces. Presented are the performance of the sixth and fourth order kernels using double precision and using double-single precision. As expected, the sixth order requires more time than the fourth order as it executes the most operations. The difference between the fourth order in double-single precision and the sixth order in double-single precision is about a factor 2. When we use double precision instead of double-single precision for the sixth order method then the execution time goes up by another factor of 2. The difference between the double precision fourth order and the double precision sixth order is about a factor of 1.4. The factor 2 difference in performance is relatively small and expected from the operation count. Therefore, if the sixth order allows you to take time-step that are two or more times larger than when using a fourth order your total execution time will go down when using a sixth order integrator. This combined with the benefits of the sixth order integrator such as being able to integrate high mass ratios, where high accuracy is required to trace tight orbits, makes the sixth order method a viable solution for N -body methods.
3.6 Multi-GPU As described in Section 2, Sapporo2 supports multiple GPUs in parallel. The parallelised parts are the force computation, data transfer and prediction of the source particles. The transfer of particle properties to the device and the transfer of the force computation results from the device are serial operations. These operations have a small but constant overhead, independent of the number of GPUs. For the measurements in this section we use the total wall-clock time required to compute the forces on N particles (as in Section 3.3). The speed-up compared to 1 GPU is presented in Fig. 5. The timings are from the K20m GPUs which have enough memory to store up to 8 × 10 6 particles. We use shared-time steps for these timings. For N > 10 4 it is efficient to use all available GPUs in the system and for N ≤ 10 4 all multi-GPU configurations show similar performance. The only exception here is when N = 10 3 at which point the overhead of using 4 GPUs is larger than the gain in compute power. For large enough N the scaling is near perfect (T single-GPU /T multi-GPU ), since the execution time is dominated by the computation of the gravitational interactions. Note that for these experiments we have to transfer the full data-sets to the GPU, this is why the scaling for small N is less than perfect as it takes time to transfer the data over a PCI-Express bus. For block time-step simulations the number of particles being transferred, per time-step, will be smaller. However, the compute time is also smaller as less particles will have to integrated. Therefore, the scaling for small N will stay less than perfect in all situations.

Block time-step simulations
To test the performance of the multi-GPU implementation for block time-step simulations with Sapporo2 we use a sixth order Hermite integrator with block timesteps [31,5]. We perform simulations of Plummer [32] spheres using 1 and 4 GPUs with double-single (DS) and full double precision (DP) accuracy. The number of particles used ranges from 16k up to 512k particles. For each simulation we record the execution time, the energy error, the average number of active particles per block-step and the speed-up of using 4 GPUs over 1 GPU.
The chosen time-step criteria is critical when performing block time-step simulations. For fourth order Hermite the method most commonly used is the Aarseth method [33]. For the sixth order a generalized version of the Aarseth criterion can be used as, described in [5]. However, this generalized version is unstable when the force computation is not accurate enough 7 . Specifically, rounding errors in the jerk and snap computation can cause the time-step to go to zero. Before running production simulations one should carefully consider which accuracy and time-step method to use, however a full analysis of the best time-step method for these situations is beyond the scope of this work. In [34] the authors work around this time-step problem by taking the average of the Aarseth fourth order method and the sixth order extension to compute the time-step (their Eq. 8). In order to compare the timing and accuracy of our simulations we use this average method for both our DS and DP simulations. Note that using the sixth order time-step computation together with DS force computation may result in a time-step that approaches zero. While the sixth order time-step combined with full DP force computation will work without problems.
For these simulations we set η 4 = 0.01 and η 6 = 0.1 and simulate the model for one N -body time-unit. The presented execution times cover the full execution from the start to the end of a simulation. The time therefore, includes all required operations on the GPU side (predict, gravity, particle copy) as well as on the host side (corrections, time-step computation, particle copies). During the simulation the size of N active varies between 1 and N .
The resulting data for the simulations are presented in Fig. 6. There are a number of other things we can see in the figures. First of all we can see that the full double precision simulations run faster than the double-single simulations. Eventhough the compute work is faster for the double-single version (as we saw in Fig. 5), the reduced accuracy forces the integrator to take more smaller time-steps. This can be seen by the average number of particles per block which is smaller for the DS simulations than for the DP simulations. Another thing to note is that the results of the single GPU DS simulations are slightly different than the four GPU DS simulations. This is another consequence of the reduced accuracy, the changed addition order when running on more than a single GPU results in rounding differences. For DP the results for single and multi GPU simulations are so similar that the differences are not visible in the figures. The DP simulations are not only faster, they also produce an enery error that is almost two orders of magnitude smaller than that of the DS simulations. The energy error for the DP simulations is around 10 −12 and that of the DS simulations around 10 −10 .
In Fig. 5 we saw that the speed-up when going from 1 to 4 GPUs scales from a factor 1 to 4x when the number of particles increases. A similar effect we see occuring in the bottom right panel; when the number of active particles increases the speed-up also increases. The jump in speed-up for the DS when going from 256k particles to 512k particles is caused by the increase of N active between 256k and 512k.
These simulations show that the benefit of using more than a single GPU depends on the dataset size, the used accuracy as well as on the average size of N active . It is therefore important that one knows these numbers when performing many simulations. Especially, when using a sixth order integrator, as we did here, it is critical that one chooses a time-step method that is suitable for the used accuracy.

CPU
With the availability of CPUs with 8 or more cores that support advanced vector instructions there is the recurring question if it is not faster to compute the gravity on the CPU than on the GPU. Especially since there is no need to transfer data between the host and the device, an operation which can be relatively costly when the number of particles is ≤ 1024. To test exactly for which number of particles the CPU is faster than the GPU we added a CPU implementation to Sapporo2. This CPU version uses SSE2 vector instructions and OpenMP parallelisation and can be run in single or in double precision. The only kernel implemented is the fourth order integrator, including support for neighbour lists and nearest neighbours (particle-ID and distance). Because the performance of the GPU depends on the combination of sink and source particles we test a grid of combinations for the number of sink and source particles when measuring the time to compute the gravitational forces. The results for the CPU (a Xeon E5620 @ 2.4Ghz), using a single core, are presented in Fig. 7a. In this figure (and all the following figures) the x-axis indicates the number of sinks and the y-axis the number of sources. The execution time is indicated by the colour from blue (fastest) to red (slowest). The smooth transition from blue to red from the bottom left corner to the top right indicates that the performance does not preferentially depend on either the source or sink particles, but rather on the combined number of interactions. This matches our expectations, because the parallelisation granularity on the CPU is as small as the vector width, which is 4. On the GPU this granularity is much higher, as presented in Fig. 7b, here we see bands of different colour every 256 particles. Which corresponds to the number of threads used in a thread-block (N threads ). With 256 sink particles we have the most optimal performance of a block, however, if we would have 257 sink particles we process the first 256 sinks using optimal settings while the 257th sink particle is processed relatively inefficiently. This granularity becomes less obvious when we increase the number of interactions as presented in Fig. 7c. Here we see the same effect appearing as with the CPU (Fig. 7a), where the granularity becomes less visible once we saturate the device and use completely filled thread-blocks for most of the particles. The final panel, Fig. 7d, indicates per combination of source and sink particles which CPU or GPU configuration is the fastest. For the CPU we measured the execution time when using 1,2,4 or 8 cores. In this panel the colours indicate the method which gives the shortest execution times. Furthermore does it indicate if and by how much the GPU is faster than the 8 cores of the CPU.
When either the number of sinks or the number of sources is relative small (≤ 100) the CPU implementation performs best. However, when the number of sinks or sources is > 100 the GPU outperforms the CPU. When using a CPU implementation that uses the AVX or AVX2 instruction sets the borders of these regions would shift slightly upwards. The CPU would then be faster for a larger number of source/sink particles, but that would only be at most for a factor of 2 to 4 more particles. The data of Fig. 7 confirms that our choice to implement the Sapporo2 library for the GPU is an efficient method for realistic data-set sizes. Although our implementation uses SSE2 instructions it is not as advanced as the implementation of [35]. For example, we use intrinsic functions while they use the assembly operations directly. This is also visible when we compare their performance with our implementation. The implementation we tested here reaches about 60% of their performance, however they do not compute the nearest neighbour particle and do not keep track of the neighbourlist, both of which have a significant impact on the performance as they cause divergence in the execution stream.

XeonPhi
Because the Sapporo2 library can be built with OpenCL it should, theoretically, be possible to run on any device that supports OpenCL. To put this to the test, we compiled the library with the Intel OpenCL implementation. However, although the code compiled without problems it did not produce correct results. We tested the library both on an Intel CPU and the Intel XeonPhi accelerator. Neither the CPU, nor the XeonPhi produced correct results. Furthermore, the performance of the XeonPhi was about 100× smaller than what can be expected from its theoretical peak performance. We made some changes to the configuration parameters such as N threads and N blocks , however this did not result in any presentable performance. We suspect that the Intel OpenCL implementation, especially for XeonPhi, contains a number of limitations that causes it to generate bad performing and/or incorrect code. Therefore, the Sapporo2 library is not portable to Intel architectures with their current OpenCL implementation 8 . This does not imply that the XeonPhi has bad performance in general, since it is possible to achieve good performance on N -body codes that is comparable to GPUs. However, this requires code that is specifically tuned to the XeonPhi architecture (K. Nitadori, private communication 9 ).

Conclusion
The here presented Sapporo2 library makes it easy to enable GPU acceleration for direct N -body codes. We have seen that the difference between the CUDA and OpenCL implementation is minimal, when there are enough particles to make the simulation compute limited. However, if many small data transfers are required, for example when the integrator takes very small time-steps with few active particles, the CUDA implementation will be faster. Apart from the here presented fourth and sixth order integrators the library also contains a second order implementation. And because of the storage of data in double precision it can be trivially expanded with an eighth order integrator. The performance gain when using multiple GPUs implies that it is efficient to configure GPU machines that contain more than 1 GPU. This will improve the time to solution for simulations with more than 10 4 particles.
The OpenCL support and built-in tuning methods allow easy extension to other OpenCL supported devices. However, this would require a mature OpenCL library and matching hardware that supports atomic operations and double precision data types. For the CUDA devices this is not a problem since the current CUDA libraries already have mature support for the used operations and we expect that the library automatically scales to future architectures. The only property that has to be set is the number of thread-blocks per multiprocessor and this can be easily identified using the figures as presented in Section 3.1.
The library is freely available either as part of the AMUSE software package [36], which can be downloaded from: http://wwww.amusecode.org. or as standalone library from: https://github.com/treecode/sapporo2/.    Figure 2 Performance for different numbers of active sink particles. The x-axis indicates the number of active particles and the y-axis the required time to compute the gravitational force using 131072 source particles (N active × N gravity computations). The presented time only includes the time required to compute the gravity, the data transfer times are not included. In both panels the linear striped line shows the ideal scaling from the most optimal configuration with 256 active particles to the worst case situation with 1 active particle for one of the shown devices. The left panel shows the effect on the performance when using 1D thread-blocks instead of 2D on AMD and NVIDIA hardware. It also we shows the effect of using OpenCL instead of CUDA on NVIDIA hardware. When using 1D thread-blocks the GPU becomes underutilized when N active becomes smaller than ∼ 128. This is visible as the execution time increases while N active becomes smaller. The right panel compares the performance of the five different GPUs as indicated. Furthermore, it shows that the performance of Sapporo2 is comparable to that of Sapporo1. Table 1 GPUs used in this work. The first column indicates the GPU, followed by three columns that show the memory properties. The clock-speed in Mhz in the second, the bus width in bits in the third and the product of the two, the bandwidth in GB/s in the fourth. The fifth column contains the number of compute cores and the sixth their clock-speed in Mhz. The next two columns indicate the theoretical performance in TFlop/s, the single precision performance is in the seventh column and the double precision in the eight column. The next two columns gives the relative performance of each GPU where we set the GTX480 to 1. For the ninth column these numbers are determined using the theoretical peak single precision performance (TPP) of the chips. The tenth column indicates the relative practical single precision peak performance (PPP) which is determined using a simple embarrassingly parallel N -body code.      In all the subplots the x-axis indicates the number of sink particles and the y-axis the number of source particles used. For subplots a,b and c the raw execution times are presented and indicated with the colours. Plot d does not present the execution time but rather which of the configuration gives the best performance. If the GPU is faster than the 8 cores of the CPU we indicate by how much faster the GPU performs. To use more than a single CPU core we use OpenMP. Note that the colours are scaled per plot and are not comparable between the different subplots. All the GPU times include the time required to copy data between the host and device.