Adaptive Techniques for Clustered N-Body Cosmological Simulations

ChaNGa is an N-body cosmology simulation application implemented using Charm++. In this paper, we present the parallel design of ChaNGa and address many challenges arising due to the high dynamic ranges of clustered datasets. We focus on optimizations based on adaptive techniques for scaling to more than 128K cores. We demonstrate strong scaling on up to 512K cores of Blue Waters evolving 12 and 24 billion particles. We also show strong scaling of highly clustered datasets on up to 128K cores.


Introduction
Simulating the process of cosmological structure formation with enough resolution to determine galaxy morphologies requires an enormous dynamic range in space and time. Star formation (SF) is concentrated in dense gas clouds the size of just a few parsecs, while the assembly of galaxies happens over billion of years, driven by large scale structures extending over megaparsecs.
Constraints on cosmology are tightest on scales of tens of megaparsecs and larger due to observations of the Cosmic Microwave Background, giving us detailed initial conditions [1]; however our knowledge of the non-linear evolution of the Universe and of the properties of galaxies is still imperfect, as the detailed properties of Dark Matter [2] and of SF [3] remain only partially understood. On the other hand, simulations of large volumes of the Universe [4,5], and of individual galaxies at high resolution [6,7] have been fundamental in putting the standard hierarchical, Cold Dark Matter dominated model, on a robust footing [8]. Further understanding requires numerical simulations of increasing dynamical range, mass and spatial resolution and physical complexity, providing a powerful incentive to develop ever more sophisticated parallel codes [9,10].
Scaling such codes to large processor count requires overcoming not only spatial resolution challenges, but also large ranges in timescales. Ignoring these multiple timescales leads to the use of uniformly small time steps, and thus a lot more work, although it is easier to parallelize. Using different time steps for different particles is potentially more efficient but leads to an algorithm that is significantly harder to parallelize effectively. This paper presents the design of ChaNGa, a parallel n-body+SPH cosmology simulation program for the simulations of astrophysical systems on a wide range arXiv:1409.1929v1 [astro-ph.IM] 5 Sep 2014 of spatial and time scales. Most of the physical modules of ChaNGa have been imported from the well established tree+SPH code GASOLINE and we refer the readers to the existing literature [11,12,13] for more details.
In this paper we focus on the optimizations implemented in it to scale to large numbers of processors, and to deal with the challenges brought on by the high dynamic ranges of clustered datasets. However, we will begin with an overview of the field and place the approach taken by ChaNGa in the context of published material. We then briefly summarize some specific features of ChaNGa (some imported from GASOLINE), including force softening, smooth particle hydrodynamics, star formation, and multi-stepping. The parallel design of ChaNGa, based on overdecomposition of work, allowing a parallel run-time system to dynamically balance it, is presented next, along with descriptions of the phases of the computation. To set the context, and a baseline, for the optimizations presented, we first describe the single-stepping performance on relatively uniform data-sets. The clustered data-sets are then introduced, and a series of performance challenges along with strategies and optimizations developed to overcome them are described. These are accompanied by detailed performance analysis using the Projections performance visualization tool [14]. As of Spring 2014 our performance evaluation runs demonstrate good speedups to over 131,000 processor cores on NCSA's Blue Waters and up to a 3x speedup over the single-stepping algorithm 1 .

Current State of the Art
Because of the computational challenge and the non-trivial algorithms involved, cosmological N-body simulations have been an extensively studied topic over the years. In order to frame our work in ChaNGa, we review some of the recent successes in scaling cosmological simulations on the current generations of supercomputers. However, direct comparison of the absolute performance among different codes is difficult. Different choices of accuracy criteria for the force evaluations and the time integration will have a big impact on performance, and the choices for these criteria will be determined by the various scientific goals of the simulation. For example, understanding the development of structures at very high redshift (e.g. [15]) will present different parameter and algorithm choices than simulations that model the observations of current large scale structure (e.g. [16]).
2HOT [17] is an improved version of the HOT code which has been developed over the past two decades. It uses an Oct-tree for gravity and its gravity algorithm is very similar to that of ChaNGa. This code demonstrates near perfect strong scaling out to 262 thousand cores on Jaguar with a 128 billion particle simulation, implying 500,000 particles per core at the largest core count. The actual size of the scaling simulation (in Gigaparsecs) was not reported, but can be presumed to be a box of order 1 Gigaparsec based on the other simulations presented in [17]. HOT2 does implement a multi-step time-stepping algorithm, although it is not clear whether particles have individual time steps, and performance for the multi-step method was not presented.
The HACC [16] framework scales to millions of cores on a diverse set of architectures. It uses a modified TreePM algorithm: an FFT based particle-mesh on the large scales, a tree algorithm on intermediate scales and particle-particle on the smallest scales. HACC has been demonstrated to scale with near perfect parallel efficiency out to 16384 nodes on Titan with 1.1 trillion particles, and out to 1.6 million cores on Sequoia with 3.6 trillion particles. These are weak scaling results, typically with millions of particles per core. They also demonstrated strong scaling out to 8182 nodes on Titan and 16384 cores on Sequoia.
The GreeM code [15] demonstrates scaling of a trillion particle simulation to 82944 nodes (663522 cores) of the K computer, implying 1.5 million particles per core. This code also uses a TreePM algorithm with a hand-optimized particle force loop and a novel method to parallelize the FFT. They report that despite the new parallelization method, the FFT remains the bottleneck in their TreePM code. They also employ a multi-step method that splits the PM and particle forces, but the particles do not have individual time steps.
The Gadget-3 TreePM code was used to perform a large scale structure, DM-only simulation (the "Millenium XXL") on 12288 cores using 303 billion particles [18]. With over 16 million particles per core, special effort was needed to optimize the memory usage of the code because the simulation was limited by memory resources.
Most of these cosmological N-body codes with published performance data scale to millions of cores with almost perfect parallel efficiency, given very large problem size (typically trillions of particles). However, it becomes even more challenging to simulate a relatively smaller problem size with higher resolution using large number of cores. This is due to the fact that the distribution of clusters of particles in the simulated system tends to become more non-uniform as resolution increases, leading to load imbalance and making it hard to scale. The addition of hydrodynamics and cooling only exacerbates this problem. Recent projects that coupled gravity with hydrodynamics in galaxy formation simulations and scaled past a few thousand cores include EAGLE and Illustris. The codes used (GADGET-3 and AREPO) share many of the features of ChaNGa that are necessary for galaxy formation, including individual time steps for particles, gas dynamics, and star formation/feedback prescriptions [5,19]. While some codes handle non-uniform distributions well (e.g. Gadget-3) they have not been shown yet to scale to large (100,000 cores or greater) core counts. Hence, to our knowledge, ChaNGa is the first code to explicitly tackle both the uniform and highly clustered simulations with extremely large scaling. This is achieved by several techniques including multi-stepping and large scale dynamic load balancing described in Section 8.

ChaNGa
The N-body/Smooth Particle Hydrodynamics (SPH) code ChaNGa [20,21], is an application implemented using Charm++. ChaNGa includes a number of features appropriate for the simulation of cosmological structure formation, including high force accuracy, periodic boundary conditions, evolution in comoving coordinates, adaptive time-stepping, equation of state solvers and subgrid recipes for star formation and supernovae feedback. The code is also being compared with similar codes in the AGORA comparison project [22]. Cosmology research based on ChaNGa includes modeling the impact of a dwarf galaxy on the Milky Way [23], modeling the intracluster gas properties in merging galaxy clusters [24] and distinguishing the role of Warm Dark Matter in dwarf galaxy formation and structure [13]. In this section we describe the features of ChaNGa, particularly as they relate to cosmological structure formation. In addition to the physics features described below, ChaNGa has a number of usability features required for pushing a large simulation through a production system, such as the ability to efficiently checkpoint and restart on a different number of processors.

Gravitational Force Calculation
The gravitational force calculation is based on a modified version of the classic Barnes-Hut algorithm [25]. Details of our modifications are described in section 4, and many of our optimizations are taken from PKDGRAV [26], upon which our gravity calculation is based. As in PKDGRAV, we choose to expand to hexadecapole order the multipoles used for evaluating the far field due to a mass distribution within a node. For the force accuracies required for cosmological simulations, better than 1 percent [27,28], this higher order expansion is more efficient [29].

Force Softening
When simulating dark matter and stars, the goal is to understand the evolution of a smooth distribution function that closely approaches a Boltzmann collisionless fluid. As the N-body code is sampling this distribution using particles, a more accurate representation of the underlying mass distribution is obtained if the particles are not treated as point masses, but instead have their potential softened [30]. Softened forces are also of practical use since they limit the magnitude of the inter-particle force. Typically, the softening length is set at the inter-particle separation at the center of DM (Dark Matter) halos [27].
Calculating the non-Newtonian forces introduced by softening adds a complication to the multipole calculation: Newtonian forces have symmetries which greatly reduce the complexity of higher order multipoles, and the number of components of the multipole moments that need to be stored. ChaNGa implements softening using a cubic spline kernel, whose compact support means this complexity is not needed beyond a specified separation (convergence with Newtonian gravity is formally achieved at two softening lengths). Furthermore, rather than evaluating the more complex multipoles when softening is involved, ChaNGa evaluates all forces involving softening using only the monopole moments, using a stricter opening criterion to maintain accuracy.

Periodic boundary conditions
In order to efficiently and accurately simulate a portion of an infinite Universe, we perform the calculation assuming periodic boundary conditions. Because of the long range nature of gravity, the sum over the infinite number of periodic replicas converges very slowly. ChaNGa accelerates this convergence using Ewald summation [31], implemented similarly to [32] as more fully described in [26]. This technique has the advantage that the non-periodic force calculated from the tree-walk is not modified, and therefore is simple and fast.

Multi-stepping
In order to efficiently handle the wide range of timescales in a non-uniform cosmological simulation, ChaNGa allows each particle to have its own time step. In order to amortize overheads associated with the force calculation, such as tree building, the time steps are restricted to be power-of-two subdivisions of the base time step. Details of this scheme, including how to integrate the equations of motion in coordinates that follow the expansion of the Universe, are described in [33].

Smooth Particle Hydrodynamics
Despite being a small fraction of the energy density of the Universe, baryons play a significant role in the evolution of structure. Not only are they the means by which we can measure structure (e.g. via star light), they can also directly influence the structure of the dark matter via gravitational coupling [34]. Therefore following the physics of the baryonic gas is essential for accurate modeling of structure formation.
ChaNGa uses Smooth Particle Hydrodynamics (SPH) to solve the Euler equations with an implementation that closely follows [11]. Since SPH is based on particles, implementing it is a natural extension of the algorithms to calculate gravity on a set of particles. In particular, the tree structure used for the Barnes-Hut algorithm is used to find the near neighbor particles needed for the SPH kernel sums. SPH is also relatively communication intensive compared to gravity, so we utilize the Charm++ runtime system to adaptively overlap the communication latencies from SPH with the floating point operations needed by gravity. The current implementation of SPH in ChaNGa closely follows techniques already published by independent groups and includes an updated treatment of entropy and thermal diffusion [12,35], pressure gradients 2 and timestepping [36]. This last features ensures that sudden changes in the particle internal energy, e.g. caused by feedback, are captured and propagated to neighboring particles by shortening their time step. These improvements leads to a marked improvements in the treatment of shocks (as in the Sedov-Taylor blastwave test), and cold-hot gas instabilities. A qualitative example is shown in figure 1, where the classic "blob" test compares ChaNGa with GADGET-2.
As this paper focuses specifically on the scaling performance of ChaNGa we refer to existing works [11,13] and Wadsley et al. (in prep.) for SPH tests of this implementation.

Star Formation and Feedback
Again, a necessary component of simulating structure formation is predicting the light distribution. Hence we need a prescription for where the stars are forming. Furthermore, it is clear that star formation is a self-regulating process due to the injection of energy from supernova, ionizing radiation and stellar winds into the starforming gas. These processes are all happening well below the resolution scale of even the highest resolution cosmological simulations, so a sub-grid model is needed to include their effects. ChaNGa includes the physics of metal lines and molecular hydrogen cooling [35,37] and feedback from supernovae (SNe). In ChaNGa, we have implemented the "blast-wave" and "superbubbles" feedback models described in [38] and [39] respectively. In both models SF occurs in high gas density regions and the time distance scale for energy injection into the gas is then determined by physically motivated models. The "blastwave" prescription follows an analytic model of the Sedov blast wave and it has allowed us to successfully model a number of trends in galaxy populations including the Tully-Fisher relation [40], the massmetallicity relation [41], the stellar mass-halo mass relation [42] and the formation of DM cores in dwarf galaxies [43]. ChaNGa (top) vs GADGET-2 (bottom). The color density map shows how with the new SPH formulation of pressure gradients, artificial surface tension is suppressed and instabilities rapidly mix the "blob" with the surrounding medium, while poor handling of contact discontinuities preserve the blob in the now obsolete SPH implementation of GADGET-2 (GADGET-2 was originally introduced in 2001, see [44]). We have verified that ChaNGa gives results quite similar to alternative hydro codes, as the adaptive mesh refinement code ENZO [45]. This figure was produced with Pynbody [46].

Parallelization Approach
In ChaNGa, the particle distribution in space is represented in the form of a hierarchical tree structure where each node represents a portion of the 3D space containing the particles in that volume. The root node represents the entire simulation space and the children represent sub-regions. The leaf nodes of the tree are buckets containing a small set of particles.

Domain Decomposition
During domain decomposition, particles are divided among objects called tree pieces (or chares in the context of Charm++) which are mapped onto processors by the runtime system. Typically, there are more tree pieces than the number of processors to benefit from the overlap of communication with computation and the load balancing features enabled with over-decomposition.
ChaNGa supports various domain decomposition techniques, which have been evaluated previously [47]. We used space-filling curve (SFC) decomposition for the results in this paper as that is the method currently used for most scientific studies with ChaNGa.
The goal of this scheme is to identify a set of splitting points (splitters) along the space filling curve such that each range contains approximately equal number of particles. The algorithm used to identify the splitter keys is similar to the parallel histogram sort [48]. First, a single master object calculates a set of splitters along the SFC that partition the simulation domain into disjoint areas of roughly equal volume. It then broadcasts the splitter keys to all the tree pieces. The tree pieces evaluate the count of particles for each bin, which is reduced across all tree pieces back to the master process. The candidate keys are then adjusted based on the contributions received, and new splitters are broadcast for any bins that are not sufficiently close to an optimal partition. This process is repeated until a suitable set of splitter keys is determined such that all tree pieces have roughly equal number of particles. After the splitter keys are identified, particles are globally distributed to tree pieces according to the splitters, where each bin corresponds to one tree piece.

Tree Build
After particles have migrated and domain decomposition is finished, each tree piece builds its tree independently. The tree build is done in a top-down manner. The algorithm starts from the root, which contains the entire simulation space, and proceeds downwards to the leaves, which are buckets containing a small number of particles, typically 8 to 12. A tree piece has information about the extent of the domain held by other tree pieces; this information is used in the tree building process. A spatial binary tree is then constructed by bisecting the bounding box containing particles in the given volume. The tree building process bisects each node, starting at the root, into children, which represent sub-regions within the space, until a leaf node is constructed. If a node in the tree held by a tree piece contains particles in another tree piece, then that node becomes a boundary node.
We also take advantage of the fact that a tree piece can access other tree pieces within the same address space. All the tree pieces within the same address space are merged. After the merge, each tree piece has read-only access to the tree datastructure that is constructed by merging multiple tree pieces. For additional details, we refer the reader to [20].

Tree Traversal and Gravity
The object of tree traversal is to identify for each bucket of particles in the tree the list of nodes and particles whose information is needed for the gravity calculation. These interaction lists are constructed on a per bucket basis to amortize the overhead of the tree traversal.
Another optimization that is implemented in ChaNGa to improve the performance of the gravity phase is based on the observation that nearby buckets tend to have similar interaction lists [26]. The algorithm constructs the interaction list of a parent node before proceeding to the children, and maintains a checklist, passed down the tree, that reduces the number of nodes that need to be evaluated at each level.
Tree traversal results in remote access of nodes which are part of tree pieces on other processors. To optimize this remote data access, we have implemented a software cache, as shown in Figure 2. The Cache Manager serves node and particle requests made by a tree piece. If a node request is missed in the cache, then a request is sent to the corresponding tree piece. If there is already an outstanding request in the cache, no additional request is sent. When the response arrives, the requestors are informed and the walk resumes. This improves the performance by hiding the latency of remote requests by improving the chance of a requested node being already present in cache as well as reducing the number of messages sent and received for the remote node. To further reduce cache misses, we also perform a prefetch walk which obtains remote node information.
To effectively overlap communication and computation, we divide the tree traversal into local and remote. A local traversal is done on the portion of the tree which is within the local address space whereas a remote traversal is done on the remaining part of the tree which requires communication between the tree pieces. We use prioritization to give precedence to the remote traversal, which requires communication, over the computation-dominated local traversal. When the remote walk has sent out requests for the node and is waiting for the response, the local walk can be done. This enables overlap of communication with local computation and helps mask message latency. Figure 2 diagrams the gravity calculation in ChaNGa with a software cache.
Sequential code in ChaNGa is also well optimized. In particular, we take advantage of single-instruction, multiple-data (SIMD) parallelism inherent in the force calculation to accelerate that part of the computation using FMA or SSE vector instructions.

Datasets and Systems
We first describe the datasets used for our experiments and their characteristics. We have two large, uniform (Poisson distributed) datasets with 12 and 24 billion particles. Other than having periodic boundary conditions these two datasets are not particularly interesting for cosmology. We include them here to demonstrate the scaling of ChaNGa to large core counts. cosmo25 is the more challenging dataset: it is a 2 billion particle snapshot taken from the end (i.e. representing the current, very clustered, structure of the Universe) of a dark matter simulation of a 25 Megaparsec cube in a ΛCDM Universe. The force softening is 340 parsecs, and the simulation represents a challenge for load balancing. The version of this  simulation with gas dynamics and star formation is able to resolve the disks of spiral galaxies within this volume [Anderson et al, in preparation]. dwarf is our most challenging dataset: while it contains only 52 million particles spread throughout a 28.5 megaparsec volume, most of the particles are in a single high resolution region in which a dwarf galaxy is forming. The mass resolution in this region is equivalent to having 230 billion particles in the entire volume, and the force resolution within this region is 52 parsecs. This is a high resolution version of the DWF1 simulation discussed in [40].
We show the performance of ChaNGa on Blue Waters. Blue Waters is a hybrid Cray XE/XK system located at the National Center for Supercomputing Applications (NCSA). It contains 22, 640 Cray XE6 nodes and 4, 224 Cray XK7 nodes that include NVIDIA GPUs. Each dual-socket XE6 compute nodes contains two AMD Interlagos 6276 processors with a clock speed of 2.3 GHz and 64 GB of RAM.

Single Stepping
We now describe essential optimizations required for scaling the simpler datasets that are not highly clustered, and evaluate their performance. Later sections will describe optimizations for clustered datasets.

Single Stepping Improvements
We observed that from-scratch domain decomposition is not required at every step, especially for datasets which are not highly clustered. After the initial domain decomposition, it needs to be performed only when there is an imbalance in the load of tree pieces. By reusing the previously determined splitters, we reduce the overhead incurred in finding the splitters as well as the number of particle migrations. We use an adaptive mechanism to determine when to perform the domain decomposition. In this approach, load statistics of the tree pieces are collected and domain decomposition is only performed if an imbalance is detected. Otherwise, only particle migration is done based on the previous splitters. We use the quiescence detection [49] mechanism implemented in Charm++ to determine when all the migrations are finished.
In the unoptimized version of the code, the tree build requires all tree pieces to send the information about the first and the last particle in their domain, subject to the SFC. This information is used to determine ownership of nodes in the tree but requires heavy communication. We avoid this by using the boundary information to determine a set of candidate tree pieces which may have information about the node. One of them is then queried and in case that tree piece does not have the information, it forwards it to the appropriate tree piece.
Since load balancing incurs overhead, it should be done sparingly. We use the MetaBalancer [50] framework in Charm++ to determine when to invoke the load balancer. MetaBalancer monitors the application characteristics and invokes the load balancer only when needed. Figure 3 shows strong scaling results on up to 512K cores on Blue Waters evolving 12 and 24 billion particles. Our application exhibits almost perfect scaling up to the maximum number of cores. Each iteration consists of domain decomposition, load balancing, tree build and force calculation. Table 1 shows the break down of the time per step into the different phases. For the simulation evolving 12 billion particles, we achieve 93% parallel efficiency at 512K cores with the time per step Remote$Work$ Ewald$ Local$Work$ Figure 4: Time profile graph which shows processor utilization over time for 16K cores on Blue Waters for 12 billion particles. This shows overlap of communication with computation to achieve high utilization. being 2.6 seconds. For the 24 billion particles, we achieve 93.8% parallel efficiency with a time per step of 5.1 seconds. The efficiency is calculated with respect to 16K cores and 32K cores for 12 and 24 billion respectively.

Performance
The good scaling of the gravity phase is due to the overlap of communication and computation, the improved tree walk algorithm using an interaction list, the software request cache, prefetching, and other optimizations. The time for domain decomposition also scales with the increase in number of cores. Table 1 shows, for the 12 billion particles at 512K cores on Blue Waters, that domain decomposition takes on average 73 ms per step. At 128K cores the domain decomposition is 9 times faster in comparison to the unoptimized version. This is due to the use of the adaptive technique to determine when to perform full domain decomposition. The tree build time also scales well and takes 34 ms at 512K cores. At 128K cores, the tree build is approximately 6 times faster than the unoptimized version. Similar trends are seen in the 24 billion particle simulation. Table 2 contains the breakdown of the total time per step for the unoptimized version of the code. Comparing the results with table 1, for the 12 billion particle simulation, we reduce the total time by 15 to 49%. For the 24 billion particle simulation, we reduce the total time per step by 22 to 43%. The reduction in time occurs for all phases of the application. Figure 4 shows the time profile graph obtained using Projections [14]. This shows the average processor utilization over the course of one time step evolving 12 billion particles on 16K cores of Blue Waters. We can see that the local work, which is given a lower priority, overlaps with the communication needed for the higherpriority remote work, resulting in close to 100% processor utilization.

Clustered Dataset Challenges
For datasets such as dwarf, the particle distribution is concentrated at the center and therefore highly clustered. This creates many challenges in scaling, of which one of the most significant is communication imbalance. During the gravity phase, remote requests are sent for tree nodes that are not present in the local cache. In a   clustered dataset, some tree nodes are requested many more times than others. This results in the tree pieces owning those tree nodes receiving a large volume of node request messages. Figure 5a shows the number of requests received by processors for the dwarf simulation at 8K cores on Blue Waters. We can see that a handful of processors receive as many as 30K messages. Even though there is overlap of communication with computation, this causes significant performance degradation. This is because, at this scale, there is not enough local computation to overlap seconds of delay in receiving messages. One way to mitigate this problem is to replicate the information that is being requested to prevent a few processors from being the bottleneck.
We replicate the information about the tree nodes on multiple processors ensuring that no single processor becomes overloaded. Before the gravity phase begins, tree pieces send their node information to a set of tree piece proxies on other processors. The responsibility of the tree piece proxy is to store the node information sent to it and handle requests for those nodes. When a tree piece needs to request for a remote node, it chooses randomly one of the tree piece proxies to send the request to. Figure 5b shows the number of messages received by the processors when four tree piece proxies are created for each tree piece. For 8K core run on Blue Waters, replication reduces the maximum number of messages received from 32K to 4.2K   Figure 6 shows the time-profile graph where the x-axis is the time and y-axis is the processor utilization. Here, yellow regions constitute the local work, blue the ewald and maroon the remote work. Note the idle time, in figure 6a, before the remote work begins which is due to the delay in receiving messages and with no local work overlap. Figure 6b shows the impact of replication. The remote work can start earlier due to a smaller delay in request messages. The local work overlaps with the communication until remote work is ready to start. This is a very good example to show prioritization of remote work over local work and the overlap of communication with computation. Figure 7 shows the strong scaling performance for this dataset on core counts ranging from 1K to 16K. We compare the time for the gravity phase because the rest of the phases are the same in both cases. The gravity time is improved from 2.4 seconds to 1.7 seconds for 8K cores and from 2.1 seconds to 0.99 seconds on 16K cores.

Multi-stepping Challenges
A wide variation in mass densities can result in particles having dynamical times that vary by a large factor. In a single-stepping mode, good accuracy can only be achieved by performing the force calculation and particle position and velocity updates at the smallest timescale. However, hierarchical time stepping schemes can be used for a large dynamic range in densities at a low additional cost. We use adaptive time scales where forces are evaluated only on relevant particles instead of evaluating forces on all the particles at the smallest time scale. In a multi-step simulation, particles are assigned to time step rungs corresponding to the shortest time scale required for accurate simulation. Rungs corresponding to short time scales are evaluated more frequently than those for long time scales. Using multi-stepping for clustered datasets introduces a variety of challenges. The irregular distribution of particles in the simulation space as well as the division of particles into rungs creates severe load imbalance. In general, the challenge is higher for datasets with fewer particles. We discuss various optimizations that enable ChaNGa to scale a medium-sized 2 billion particle clustered dataset, cosmo25, on up to 128K cores on Blue Waters. Reaching this level of performance required overcoming challenges related to load imbalance, communication overhead with a decrease in computation per processor as well as the scalability of other phases of the simulation. Strong scaling of this nature will be required to run clustered cosmological simulations on future machines with hundreds of Petaflop/s performance, and presents a realistic proving ground for parallel strategy innovations.

Optimizations for the Gravity Phase
In a multiple time step simulation, the number of particles active in the fastest rung is typically only a fraction of the total number of particles being simulated. These active particles tend to be clustered, and therefore the distribution of particles among the tree pieces is highly imbalanced. One may consider performing from-scratch domain decomposition based on the active set of particles for these time steps but that results in large jumps of the domain boundaries. To prevent such sudden large variations of the boundaries, we perform from-scratch domain decomposition only when there is a significant number of particles active for that time step. But as one can imagine, this will result in tree pieces with a large variation in active particles and load. Figure 8 shows the distribution of the load on tree pieces for the fastest rung (rung 4) and the slowest rung (rung 0) of the cosmo25 dataset. The slowest rung has tree pieces with loads distributed around the mean. But the fastest rung has only 2405 tree pieces with active particles and some of them have a load which is 3000 times the average load of tree pieces and 40 times the average load of the system. Even though periodic load balancing is performed to distribute the load, the maximum load of the system will be limited by the most overloaded processor which in this case is the one having the most loaded tree piece. At larger scales of 128K cores there is not enough work to be distributed among all the cores which results in significant degradation of performance. We propose two adaptive strategies to overcome this problem.

Intra-node Work Pushing
We use the SMP mode of Charm++ to take advantage of the shared memory multiprocessor nodes used in HPC systems [51]. The SMP mode supports multithreading, where one Charm++ process is assigned per SMP node, with a single thread mapped to each physical core. One thread within a node is normally assigned as a communication thread responsible for internode communication, while the rest are used as worker threads that implement processing elements (PEs).
Within a Charm++ SMP process, data can be shared via pointers. The load balancing strategy works in a hierarchical fashion. Details are given in Section 8.4 but in essence it first tries to achieve load balance among the SMP processes and then balances the load among cores within the SMP process.
LBManager , which is an object present on each PE, has information about the average load of the system and load of other PEs on the same SMP process. The LBManager , on identifying that a PE is overloaded, instructs overloaded tree pieces at that PE to distribute the work among other less loaded PEs within the SMP process. A tree piece is responsible for calculating forces on a set of particles in its domain, grouped into buckets. We consider the bucket to be the smallest entity of work that can be distributed. PEs receiving a foreign bucket have access to the tree and all the data structures of the owner tree piece so that they can perform the tree traversal and gravity force calculations for the foreign bucket. Once the force calculations are done, the foreign bucket is marked as complete and the original PE is informed. Once all the foreign and local buckets are completed, the tree piece is done with the gravity calculations.
This work pushing adaptive strategy reaps the most benefit for time steps where the fastest rung is active. Figure 9a shows the time-line view from the projections tool [14] for rung 4 (the fastest rung). Here, each line corresponds to a PE and colored bars indicate busy time while white shows idle time. This plot is for a 32K run on Blue Waters and we have chosen the PE and the corresponding SMP process with the maximum load. We can see that the most loaded PE, which also contains the most loaded tree piece, is busy for about 2 seconds while other PEs are idle. Figure 9b shows the time line for the work pushing strategy for a set of PEs in the SMP process where one of the PE is assigned the most loaded tree piece. With the work pushing strategy, we are able to successfully distribute the work load among other PEs within the node. This results in a reduction of the gravity time from 2.3 seconds to 0.3 seconds for the fastest rung.

Intra-node Dynamic Rebalancing
For clustered datasets, it is often the case at the trailing end of the gravity calculation that some of the PEs are idle while others are busy. This could be due to misprediction of load or inability of the load balancer to balance the load perfectly. Figure 10a shows the Projections time-line view for this scenario where the colored bars indicate busy work while the white shows idle time. We found that such slight load imbalance in the application can be mitigated by more fine-grained parallelism within the SMP process. We use an intra-node dynamic rebalancing scheme where the idle PEs within the node pick work from the busy ones. The scheme is implemented using the CkLoop library [51] in Charm++, which enables fine-grained parallelism within a SMP process. As with the work-pushing scheme, buckets are the smallest entity of work that can be reassigned.
If all the tree pieces residing on a PE have finished their work, then the PE becomes idle. At each PE, the LBManager maintains a PE-private variable which keeps track of its status. Since the memory address is shared among the PEs on a SMP process, the LBManager can access the status variable of all the PEs within the SMP process. Whenever there is a significant number of idle PEs, the dynamic rebalancing scheme kicks in. Tree pieces then create chunks of work out of the unfinished buckets and add these to the node-level queue. The idle processors access the node-level queue and pick up work to execute. Due to the overhead associated with the node-level queue we only use the work-stealing scheme adaptively for the trailing end of the computation. Figure 10a shows the time-line for the slowest rung, rung 0, of cosmo25 dataset simulation for a 32K run on Blue Waters. We pick a subset of PEs to show this problem. We can see that the load is almost balanced but towards the end of the step there are some PEs which are idle while others are busy. Figure 10b shows the time-line with dynamic rebalancing. It is able to successfully handle small amounts of load imbalance and reduce the gravity time from 9.8 seconds to 8.5 seconds for rung 0.

SMP Request Cache
Data reuse can be critical in determining the performance of tree-based algorithms [52]. Modern SMP-based supercomputers offer several levels at which data sharing can be effective. Requests for the same remote elements from two traversals on a core can be merged. The fetched data can then be reused by all traversals on the core. Similarly, cores in the same SMP domain can share remotely fetched data. In the following we describe a two-level caching scheme that enables the data reuse across traversals on a core, as well as across cores on an SMP processor. This caching mechanism is transparent to the traversal code.
Each core on the SMP has a private cache, which stores pointers to remotely fetched data. There also exists one cache at the level of the SMP that is shared by all cores in the SMP. The shared cache contains the union of all the entries in the private caches of these PEs.
Briefly, the algorithm funnels all requests for remote data through the cache. If the data are found in the private cache, then they are immediately passed into the requesting traversal's visitor code. If the data are not found on the PE, we check whether some other piece on the PE has requested them previously. If so, a lightweight continuation is created to resume the traversal at the requested node upon its receipt. Otherwise, the more expensive, SMP-wide table lookup is performed.
We devised a scheme to manage concurrent accesses of the shared, SMP-wide cache table, where all requests for remote data generated by traversals on the SMP processor are funneled through a single core, which is termed the fetcher for that SMP processor. Cheap, intra-node messaging between PEs is used for efficiency.

Domain Decomposition
Simulations of datasets with nonuniform distributions are characterized by extensive movement of particles across tree piece boundaries over time. When unchecked, this leads to an increasingly nonuniform distribution of particles across tree pieces and eventually precludes a good balance of load across processors. In such scenarios, it becomes useful to repeat the full domain decomposition more frequently.
The first stage of domain decomposition, as described in Section 4, involves a series of histogramming steps to determine a set of splitters that partition the simulation domain into tree pieces of roughly uniform particle count. This is implemented in terms of broadcasts from a single sorter object, which refines the splitters, to the tree pieces, and reductions of particle counts for each bin back to the sorter process. In strong scaling scenarios for highly clustered datasets, domain decomposition may become a performance bottleneck, as the number of splitters generally depends on the number of processors used in the run. As such, we implemented a number of optimizations aimed at improving SFC domain decomposition performance. First, we replaced the broadcast of SFC keys from the sorter object with the broadcast of a bit vector indicating which of the bins evaluated in the previous step need further refinement. From the bit vector, the set of splitters to evaluate is determined once at each SMP node, and delivered to all tree pieces at that node for evaluation. This optimization greatly reduced the size of the buffers being broadcast. Secondly, we noticed that some histogramming steps were much more expensive than others, due to involving more splitters. This was particularly true for the first and last steps. The first histogramming step involved a full set of splitters due to none having been finalized yet. For this step, we were able to remove the broadcast of splitters by having tree pieces reuse the splitters determined the last time domain decomposition was done. We were also able to eliminate the last histogramming step in the original algorithm, in which the final set of splitters was broadcast to the tree pieces to collect a full histogram of particle counts. Instead, we modified the sorter object to preserve particle counts for all previously finalized splitters, so as to have the full set of counts at the end.
These optimizations significantly improved domain decomposition performance. For runs of the cosmo25 dataset on Blue Waters, the time for a full domain decomposition was reduced from 3.22 s to 1.52 s on 1024 nodes, a speedup of 2.1.

Hierarchical Multistep Load Balancer
Even if domain decomposition assigns almost equal number of particles to tree pieces, density variations in different regions of the simulated space can result in load imbalance. We experimented with domain decomposition based on load but the basic approach was not ideal for multi-stepping simulations as it led to large jumps in boundaries and significant movement of particles. Since execution time is determined by the most loaded processor, it becomes important to address the load imbalance problem without significant additional overhead.
Load balancing in Charm++ applications like ChaNGa is normally achieved by over-decomposing the problem into many more objects than processors and letting the Charm++ dynamic load balancing framework balance the load by mapping the objects to processors [53]. The framework can automatically instrument the computation load and communication pattern of tree pieces and other objects and store it in a distributed database. This information is then used by the load balancing strategies, which we optimized for ChaNGa, to map the objects to processors. Once the decision has been made, the load balancing framework migrates the objects to newly assigned processors. Alternatively, the load of the objects and their communication pattern can be determined using a model based on a priori knowledge. But for ChaNGa, we find that determining the load based on a heuristic called the principle of persistence is more accurate. Based on this heuristic we use recent history to determine the load of near-future iterations. This scheme works well for single-stepping simulations at a relatively small scale. However, multi-stepping simulations at very large scale impose several new challenges.
First, multi-stepped execution introduces some challenges in the measurement based load balancing to obtain accurate load information. Substeps within a big step in a multi-step run have selected number of active particles. Predicting the load of a tree piece based on the preceding substep will result in discrepancy between the expected load and the actual load. Therefore, we instrument and store the load of the tree pieces for different substeps/rungs separately. Whenever particles migrate from one tree piece to another, they carry a fraction of their load for the corresponding rungs for which they were active and contribute that to the new tree piece. This enables us to achieve very accurate prediction of the load of a tree piece for each substep even with migrations and multi-stepping.
Secondly, it is very challenging to collect communication pattern information in ChaNGa, even at small core count, due to a very large number of messages in the simulation, which may incur significant overhead on memory when performing load balancing. Therefore, we used an alternate strategy to implicitly take communication into account during load balancing by using an ORB-based (Orthogonal Recursive Bipartitioning) strategy, which preserves the communication locality. Lastly, in extremely large scale simulations, load balancing itself becomes a severe bottleneck. The original centralized load balancing strategies, where load balancing decision is made on one central processor, do not scale beyond a few hundred processors, which makes them infeasible for large scale simulations. To overcome this challenge, we implemented a scalable load balancing strategy suitable for multistepped execution based on the hierarchical load balancing framework [53,54] in Charm++ runtime. This new load balancing strategy performs ORB to distribute the tree pieces among processors. The processors are divided into independent groups organized in hierarchical fashion. Each group consists of 512 processors. At each level of the hierarchy, the root performs the load balancing strategy for the processors in its sub-tree. We found that two levels of hierarchy is enough to achieve good load balance with little overhead. At higher levels of the hierarchy refinement based load balancing strategy, which minimizes the migration by considering the current assignment of tasks, is used. At the lowest level of the hierarchy we use ORB to partition the tree pieces among the processors in that sub-group. The load balancer collects the centroid information of tree pieces along with their load. Taking the centroids into account, the tree pieces are spatially partitioned into two sets along the longest dimension. Similarly, at each stage of partitioning, the processors are also partitioned. During partitioning, tree pieces are divided into two partitions such that the loads of the partitions are almost equal. This is done recursively until one processor remains which is assigned the corresponding partition containing the tree pieces.
Another optimization to further reduce the overhead of load balancing is to combine the node level global load balancing with the intra-node load balancing strategies described in Section 8.1. We implemented such a two-level load balancing strategy, where the load is first balanced across SMP nodes, and then balanced inside each SMP node. The ORB algorithm described above is done for nodes rather than processors. Once the tree pieces are assigned to SMP nodes, they are further distributed among the PEs in the SMP node using a greedy strategy. This ensures that load is equally distributed among the SMP nodes. We perform an additional step of refinement to further improve the load balance for the rare cases when the load is not evenly balanced.  Step time 16 Step

Performance Evaluation
We now present the scaling performance of the cosmo25 simulation. Figure 11a shows the average time per iteration for this simulation with single-stepping and figure 11b shows the average time per iteration with multi-stepping. In a multistepping run, 16 substeps constitute a big step. To compare the time for singlestepping and multi-stepping, a single big multi-step covers the same dynamical time as 16 single steps. Table 3 gives a break down of the time taken for different phases for single-stepping and multi-stepping. We can see that at 8K cores the singlestepping simulation takes more than 3 times the time taken by multi-stepping and at 128K it takes twice as long. Note that the gravity time for multi-stepping is 4.5 times faster than single stepping. Due to sufficient sequential work to overlap communication and relatively balanced tree pieces, we are able to achieve 80% efficiency for single-stepping at 128K cores with an average step time of 2.7 seconds.
As described in section 8.1 the multi-stepping run has many challenges due to irregular distribution of particles in faster rungs. Incorporating the improvements mentioned above, we are able to scale to 128K cores with an efficiency of 48% with respect to 8K cores with a time step of 1.4 seconds. Note that if we consider the gravity force calculation time, we achieve an efficiency of 60% and the gravity time is 3 times faster in multi-stepping in comparison to the single-stepping run.

Conclusion
In this paper, we have described the design and features of our highly scalable parallel gravity code ChaNGa and went into the details of scaling challenges for clustered multiple time-stepping datasets. We have presented strong scaling results for uniform datasets on up to 512K cores on Blue Waters evolving 12 and 24 billion particles. We also present strong scaling results for cosmo25 and dwarf datasets, which are more challenging due to their highly clustered nature. We obtain good performance on up to 128K cores of Blue Waters and also show up to a 3 fold improvement in time with multi-stepping over single-stepping. Many features of the Charm++ runtime system were used to achieve these results. Starting with the standard load balancing and overlap of communication and computation enabled by the over-decomposition strategy, we employed a number of Charm++'s features. Of particular importance were features that allowed us to replace parts of our algorithm that scaled as the number of cores, such as quiescence detection for particle movement and the hierarchical load balancer. Also of importance were features such as CkLoop, SMP Cache and node level load balancing, that exploited SMP features of almost all modern supercomputers. With these features, we can bring to bear the computational resources of many 100s of thousands of processor cores on the highly clustered, large dynamic range simulations that are necessary for understanding the formation of galaxies in the context of of large scale structure.