PHEW: a parallel segmentation algorithm for three-dimensional AMR datasets - application to structure detection in self-gravitating flows

We introduce PHEW (Parallel HiErarchical Watershed), a new segmentation algorithm to detect structures in astrophysical fluid simulations, and its implementation into the adaptive mesh refinement (AMR) code ramses. PHEW works on the density field defined on the adaptive mesh, and can thus be used on the gas density or the dark matter density after a projection of the particles onto the grid. The algorithm is based on a"watershed"segmentation of the computational volume into dense regions, followed by a merging of the segmented patches based on the saddle point topology of the density field. PHEW is capable of automatically detecting connected regions above the adopted density threshold, as well as the entire set of substructures within. Our algorithm is fully parallel and uses the MPI library. We describe in great detail the parallel algorithm and perform a scaling experiment which proves the capability of phew to run efficiently on massively parallel systems.


Introduction
Over the last decades, computer simulations have become an indispensable tool for studying the formation of structure on all scales in our universe. The common feature of those simulations is the clustering of matter due to self gravity. This clustering is of fractal nature in the sense that -as long as gravity is the dominant force -aggregations of matter turn out to have internal substructures, which are themselves gravitational bound, and may even contain sub-substructures. A crucial tasks in the analysis of simulations is therefore the identification of overdense regions and, ideally, their entire hierarchy of substructure.
First algorithms to perform this task have been invented in the very early days of computer simulations in Astronomy and Astrophysics. A halo finder based on spherical overdensities (SO) was described already four decades ago by Press & Schechter (1974) who used it to find structure in their simulation of 1000 particles. Subsequently, the SO method has become one of the standard methods for halo finding. It consists in growing spherical regions around density peaks and assigning particles inside the spheres to the respective peak based on physical arguments. The also very popular friends-of-friends (FOF) method was introduced to halo finding by Davis et al. (1985). If two particles are separated by less than a user defined linking length, the particles are assigned to the same group. This results in groups of connected particles, the so-called FOF groups. On top of those two methods, a large variety of algorithms has been built over the last two decades: a recent halo finder comparison paper (Knebe et al. 2013) listed 38 different halo finders. For more detailed information about the halo finders which are on the market today, we refer to the series of papers that has emerged from the halo finding comparison project (Knebe et al. 2011;Onions et al. 2013;Knebe et al. 2013;Pujol et al. 2014).
Structure finding is not restricted exclusively to the computational cosmology community. Observers, for ex-ample, entered the field when they started to automatically identify clumps in position-position-velocity (PPV) cubes resulting from radio observation of molecular clouds. Stutzki & Guesten (1990) tried to fit the data by sums of triaxial Gaussian-shaped clumps and Williams et al. (1994) identified structure by contouring the dataset at evenly spaced levels without assuming an a priori shape for the clumps. More recently, Rosolowsky et al. (2008) showed how dendrograms can be used to exploit the hierarchy that naturally arises from contouring a PPV cube at multiple emission levels and used this technique to define substructures in molecular clouds.
With such a large choice of astrophysical structure finding tools at hand, one might ask the question why there needs to be yet another one. The trigger for the development of a new analysis tool was our need for "on-the-fly" structure finding in the astrophysical simulation code (Teyssier 2002), in order to locate gas and/or dark matter clumps while the simulation is running. As pointed out in Knebe et al. (2013) there is a general trend towards "on-the-fly" analysis for many reasons: most modern astrophysical simulations are performed on large computational infrastructure with distributed memory. The sizes of those simulations often exceed the total memory present in commonly used shared memory machines. The structure finding is therefore preferentially performed on the same machine that is running the simulation. Beyond that, the sizes of one single output of such simulations can quickly reach hundreds of GBs, up to several TBs. Storing many outputs for later post-processing is often not possible due to limited disk space, so that keeping only a catalogue of structure is the only viable solution.
Another reason for detecting structures while the simulation is advancing, is the possibility to couple the results of the halo decomposition to the simulation itself. In Bleuler & Teyssier (2014), for example, we have described a new algorithm for creation of sink particles, based on the properties of gas clumps detected "on-thefly". This application requires an extremely high frequency at which structure finding must be performed. It must therefore make efficient use of the parallel infrastructure, and deliver good scaling properties for increasing numbers of MPI tasks, up to the number of CPUs the simulation is running on. Otherwise it will unacceptably slow down the simulation.
These requirements resulted in the development of phew (Parallel HiErarchical Watershed), a new structure finding algorithm and its implementation into ram-ses 1 . While phew is not based on any pre-existing algorithm, it combines various concepts that have been used in other astrophysical structure finding tools before.
First, phew falls into the category of "watershedbased" algorithms. These algorithms assign particles or cells to density peaks by following the steepest gradient, resulting in the so-called "watershed segmentation" (see Section 2.1) of the negative density field. Other members of this category are denmax (Bertschinger & Gelb 1991), hop (Eisenstein & Hut 1998), skid (Stadel 2001), adaptahop (Aubert et al. 2004. Note that in contrast to the aforementioned codes which work on the particles directly, we use a mesh to define the density field 2 . Second, region merging in phew is based on the topological properties of saddle surfaces. This is the case as well for hop, adaptahop and subfind (Springel et al. 2001). As in the ahf halo finder (Knollmann & Knebe 2009), phew works on the density field deriving from particles that were previously projected onto the AMR mesh. In contrast to ahf, however, we do not use the AMR grid as a way of contouring the density field. A low density region which -for whatever reason -is refined to a high level does not compromise our results. Thus, in the landscape of existing halo finders, phew can be seen as filling the gap between p-hop (Skory et al. 2010) which does not find substructures but is a MPI-parallel version of hop, and adaptahop, a multithreaded software that does find substructures, but has not been yet MPI-parallelized.
The aim of this paper is to present a new structure finding algorithm that: 1-can be applied to any density field defined on an adaptive grid, 2-is capable of detecting substructure, 3-is parallelized using the MPI library on distributed memory systems, and 4-is fast enough to be run at every time step of a simulation without significantly slowing down the calculation. As briefly mentioned above, a previous version of phew has already been presented in Bleuler & Teyssier (2014). The algorithm described here differs from the previous one in the sense that it is now fully parallelized. This allows the algorithm to run now efficiently on thousands of CPUs and handle a complex topography with millions of density peaks and a rich hierarchy of substructures.
The article is organised as follows: in Section 2 we describe the serial version of the phew algorithm. In Section 3 we focus on the parallel implementation of the steps presented in Section 2. Section 4 contains scaling experiments which demonstrate the efficiency of the parallelization. Finally, we summarise and discuss our results, presenting an outlook on possible future work in Section 5.

The phew algorithm
In this section we describe the serial algorithm. As a starting point, we assume that we have a 3d density field on a AMR grid, particles have been projected onto the grid beforehand. The algorithm can be broken down in four main steps: In the first step, we assign every cell above a user defined density threshold to a local density maximum by ascending along the steepest gradient. This results in a primary segmentation of the computational volume into "peak patches": regions associated to certain density peak. We establish the connectivity between the peaks by identifying the saddle points. We eliminate the peaks with a low density contrast to the background by merging them to a neighbour through their densest saddle point. The structure surviving the noise removal is considered the finest (sub)-structure. In a last step, we recursively merge the substructure to form larger and larger composite objects.

Watersheds in image processing
Before we start with a more detailed description of the algorithm, we take a quick look over the fence into the field of mathematical morphology and its application to image processing. There, watershed algorithms are a well known and extensively studied tool for image segmentation. The basic idea is that a grayscale image can be thought of as a topographic relief. A drop of water that falls somewhere onto this relief will follow the line of steepest descent until it reaches a local minimum. All points that connect to the same local minimum in that manner form a catchment basin. The watershed algorithm therefore segments the picture into catchment basins. The boundaries of the catchment basins are the actual watersheds. This technique is usually applied to the magnitude of the images gradient. In this way, the watershed lines trace regions of high gradients and segment the original image it into connected regions of small gradients. An excellent overview of the watershed techniques used in image processing is given by Roerdink & Meijster (2000).
When comparing to watershed algorithms used for image segmentation, we have to consider a few aspects where our use case differs from the above one: some watershed algorithms are based on the image pixels being accessed in groups according to their gray level. While an 8 bit image contains only 256 gray levels, our density field is usually represented by an 8 byte float. Looping over all possible gray levels is clearly impossible in our case. Related to that, the limited number of grayscale levels introduces the problem of locally flat regions which do not contain a minimum. Since we use an almost continuous representation of densities we may safely ignore this issue.
Another distinguishing feature of watershed algorithms is whether and how they construct watershed pixels as the boundaries between adjacent catchment basins. This is not important for us as we want to assign every boundary cell to one or the other catchment basin. In this philosophy, the cell surfaces are the actual watersheds which are constructed from the segmented density field after the actual watershed algorithm has finished.
Another important difference is the cost for checking all neighbours of a cell/pixel. Working in 3 dimensions naturally increases the number of neighbours. Using an AMR grid further increases the number of possible neighbours since one has to consider possible neighbours at the same level as the original cell as well as one level above and below. Most importantly, the data structure in an AMR grid is very different from the one of a flat 2d array. The location of neighbouring cells in memory needs to be constructed before a neighbour can be checked for its density. Our main interest lies therefore in reducing the number of neighbours that have to be accessed. These aspects influence the choice of watershed algorithm for our purpose.

Watershed segmentation
In a first step, all cells above the density threshold are marked. We call those cells "test cells". For every test cell the densest neighbouring cell is identified and stored. If a cell has no denser neighbour, it is a local density peak. The peak obtains a peak ID which is written into the PPatch label of the corresponding cell. The test cells are sorted by decreasing density. Once sorted, every cell copies the PPatch label from its densest neighbour. The previous sorting ensures that the densest neighbour has been accessed before and has  Fig. 1: Visualization of the main steps of phew on a 1d density field (first panel). The segmentation into peak patches is shown in the second panel. Based on the relevance of a peak (peak-to-saddle ratio) we decide whether a peak represents "noise" or substructure. Irrelevant peaks are merged through their highest saddle points (third panel). The surviving objects are labeled as Level 0 clumps and denote the finest level of substructure. The substructure is merged based on a saddle threshold (third panel) into parent structure (fourth panel). therefore already obtained its PPatch label. Thus, every cell is assigned to a peak after this one pass. All cells marked with the same PPatch label form a peak patch (see Figure 1, second panel). Note that our peak patches correspond to the catchment basins introduced in Section 2.1. Since we are working on peaks rather than minima, we introduce this new terminology to avoid the cumbersome notion of an "inverted catchment basin". Note that this procedure is very similar to the hill climbing method described in Roerdink & Meijster (2000) which was introduced by Meyer (1994).

Saddle point search
Before we can merge peak patches, we have to establish the connectivity between them. All test cells are checked for neighbouring cells that belong to a different peak patch. If such a neighbouring cell is found, the average density of the starting cell and its neighbour is considered as the density at the common surface of the two bordering peak patches. The maximum density on the connecting surface is the saddle between the two peaks and stored. At the end of this step, each peak has its list of neighbouring peaks together with the corresponding saddle point densities. We denote the maximum saddle point of a peak as the "key saddle" and the corresponding neighbour as "key neighbour".

Noise removal
A known problem of the watershed method is oversegmentation. The presence of a huge number of local minima -for example due to random particle noise or transient gas density fluctuations -causes segmentation into as many catchment basins as there are local minima. Generally speaking, there are two possible strategies to deal with this problem: not creating the over-segmentation in the first place or merging oversegmented regions. Preventing over-segmentation the can be obtained using markers to preselect allowed minima (e.g. Moga & Gabbouj 1998). This usually requires a human intervention, which in our case is not possible. Another way is to use the so-called hierarchical watershed algorithm 3 (Beucher 1994). Hierarchical watershed algorithms merge artificial catchment basins to 3 Note that more modern approaches to region merging in image segmentation use the original image for merging while the watershed is computed on the gradient image (e.g., Peng & Zhang 2011). Using the watershed on the gradient image results in regions of similar gray values, where the densities inside our peak patches are very inhomogeneous. Approaches to region merging are thus fundamentally different in image processing than they are in our case. more important ones based on some criteria. What we will describe in the following turns our watershed algorithm into a hierarchical algorithm in the Beucher (1994) sense, where our merging criterion is inspired by the notion of a signal-to-noise ratio.
After having previously identified the saddle points, we classify the peaks based on their contrast to the background. We define the contrast as the ratio of the peak density to the key saddle density and name it "relevance". This is sketched in the second panel of Figure  1. Every peak is assigned a NewPeak label which is initialized to the peaks own peak ID. The peaks are sorted by decreasing peak density. For each peak, the key saddle is determined from the list of saddle points and the relevance is computed. Peaks with a relevance below a relevance threshold are considered noise 4 . If the peak is relevant, it is not touched. For an irrelevant peak, we check whether its key saddle links it to a denser peak. If this is the case, it will inherit the NewPeak label from this key peak. As in the watershed segmentation, the previous sorting makes sure that the NewPeak labels can propagate through long chains of connected peaks in just one loop. If a peak is both isolated and irrelevant, it is discarded.
When two peaks merge, their lists of saddle points are merged as well. If both peaks used to have a connection to the same third peak, the maximum of the two saddles is kept. Now, we iterate the procedure: from the updated lists of saddle points, the key saddles are determined. Peaks are accessed in the order of decreasing peak density and irrelevant peaks are merged. After an iteration without any mergers, all irrelevant peaks have been merged or discarded and the noise removal is finished. Note that the described merging process follows exactly the same principle as the watershed segmentation. We have simply replaced cells by peaks, densest neighbour cells with key neighbours and the PPatch label by the NewPeak label. We call the structures which survive the noise removal Level 0 clumps. They constitute the finest structure (see Figure 1, third panel) in our hierarchy.

Saddle threshold merging
If desired, the remaining peaks and their associated clumps can be merged further to form composite clumps.
This happens by exactly repeating the previous merging process with a different merging criterion. We have implemented a density threshold for the key saddle as a criterion. If the key saddle density is above that threshold, a peak is merged to its key neighbour (see Figure  1, fourth panel). Another possible criterion is the repeated use of the relevance threshold, this time with a higher value.

A hierarchy of saddle points
We have seen in Section 2.4 that saddle points are removed in groups or levels by merging through them. All key saddles which link their peak to a denser one are removed at once. Through the merging, other saddle points become key saddles and the next level of saddle points is removed. By repeating this process, a natural hierarchy of saddle points and clumps is produced. In Figure 2 we illustrate the construction of this hierarchy. We start with the Level 0 clumps after the noise removal (no substructure except for noise) and assume that the saddle threshold for merging is below any of the saddles depicted in Figure 2. The Level 1 saddle points are identified and used for merging. The resulting objects are Level 1 clumps as they have one level of substructure. In general, a Level n clump is formed through a merger which removes a Level n saddle point and contains n levels of substructure. This produces a very natural hierarchy of saddle points and clumps based on the levels of substructure. Note that the level of a saddle point does not reflect its density. A more traditional way of grouping substructure based on the density of the saddle that connects two substructure objects as it is for example produced by adaptahop can easily be recovered from this hierarchy.

Merging order
We will see in Section 3 that we have to drop the idea of sorting the peaks globally when we parallelize phew. This will alter the order in which peaks are merged in an unpredictable way. It is therefore crucial that the phew allows the order of mergers to change without causing different results. This not true in general. Yet, as we will show in this section, it is the case when we respect the three merging rules: 1. A peak is only merged to a denser one (upward). 2. A peak is only merged through its key saddle. 3. The density of the key saddle or the relevance are used as merging criterion. The result of the merging procedure is uniquely determined by the set of saddle points that is used for merging. This is a subset of all saddle points. In order to affect the outcome of the merging process, changing the order of mergers therefore has to change the set of used saddle points. Let us consider a peak n connected to its key neighbour m through the key saddle s nm at the very beginning of the merging process. The peak density of m is higher than that of n, m > n. There are three possible types of mergers related to n or m that can happen before n is considered for merging. We will show that none of them can change the fate of n.
1. A third peak might be merged into m. Due to upward merging, this cannot change the peak density of m and therefore decision if n will be merged into m is not influenced. 2. Peak m might merge into another peak m . The saddle s nm will still exist, now linking n to m . Due to upward merging we have m > m > n which means that n is still the lower of the two peaks connected by s nm . The decision whether n is merged through s nm is unaltered. 3. A third peak i might be merged into n. The peak density of n cannot change due to that since it would mean that peak i had a higher density than n which contradicts the upward merging. The key saddle cannot change because this would mean that peak i had a saddle point s ij higher than s nm . This would im-ply that the saddle point s ni through which i was merged into n was even higher, s ni > s ij otherwise s ni had not been the key saddle of peak i. Yet, s ni > s ij > s nm contradicts that s nm is the key saddle of peak n. The peak density of n and its key saddle are thus unchanged, therefore the relevance of n is not changed either.
This shows that we can arbitrarily delay the moment when we consider a peak for merging as long as we respect the three merging rules. The mergers happening in the mean time cannot change the properties deciding if and through which saddle this peak will be merged. A possible way to prevent violation of merging rule (ii) is to consider all peaks for merging until no further mergers are possible before any new key saddle of the merged peaks is computed. This results in using the saddle points for merging on a "level-by-level" basis. This is a key to the parallelization of phew since it will allow performing a big number of operations (mergers), in between each round of communication (finding new key saddles). Note that this line of argumentation breaks when we violate merging rule (iii) and use for example the clump mass as merging criterion. The mass is a property that changes with every merger. Therefore, altering the merging order does change the mass of a clump at the moment it is considered for merging and can thus change the decision whether the clumps should be merged or not.

Parallel implementation
We now turn to the implementation of the previously described steps in a parallel, distributed-memory framework. Where a detailed description of an algorithmic block in words would prevent readability of the paper, we refer the interested reader to a corresponding block written in pseudocode located in Appendix B. We assume that the computational domain has been previously decomposed into non-overlapping spatial domains, each domain containing a partition of the AMR mesh on which the density field is defined. In every MPI task, the local partition of the mesh is referred to as the "active cells". They are wrapped by a thin layer of cells that belong to other tasks. These ghost cells are referred to as belonging to the "virtual boundaries". These virtual boundaries are updated through MPI communication before phew is called to make sure that the densities in the virtual boundary cells are equal to the densities in the corresponding active cells hosted by other MPI tasks.

Parallel watershed
The watershed segmentation is non-local by nature. This can easily be understood by imagining a mountain ridge. Two drops of water falling onto both sides of the ridge will initially move away into different directions. They might flow into different rivers which flow into different lakes, or they might as well end up in two rivers which join before reaching a lake. The two situations cannot be distinguished based on local properties. Parallelization of the watershed algorithm is therefore a non-trivial task. In the literature, one finds various approaches to parallelization for the different watershed algorithms (see e.g. Roerdink & Meijster 2000). Our technique is very close to the technique described in Moga (1997) and called "hill climbing by locally ordered queues". Each task performs a loop over all its active cells, in order to identify first the test cells (cells above the density threshold). For faster access, the indices of all test cells are stored in an array. A loop over all test cells is performed where the densities of all neighbouring cells are checked. The index of the densest neighbouring cell is stored for each test cell, since it will be used several times during the algorithm. Note that the densest neighbour of a cell can lie inside the virtual boundary, while test cells are always inside the active domain.

Virtual peak boundary
As we have already described in Section 2, our peak patch merging step is analogous to the segmentation step. The patches now take the role of the cells, the PPatch label is replaced by the NewPeak label and the densest neighbouring cell is replaced by the key neighbouring patch. As explained before, the parallelization of the peak patch segmentation is exploiting the virtual boundaries surrounding each MPI domain. If we want to use the same strategy to parallelize the merging process, we need the analog of the virtual mesh boundary: a virtual peak boundary. In contrast to our usual virtual mesh boundary, the virtual peak boundary does not represent a fixed region in space. As the merging process advances, new connections appear and new peaks have to be introduced in the virtual peak boundary. Our virtual peak boundary is therefore more dynamic than our virtual mesh boundary. Figure 3 shows a possible layout of peaks in memory. Note the distinction between a peaks global ID and its local index. The latter of the two is the position of the peak in local memory. The peaks that are located inside a tasks MPI domain are called active peaks. They take the first N active places in memory. The active peaks are followed by the ghost peaks that belong to the virtual peak boundary. Since it is unknown in advance how much space for ghost peaks will be necessary, we set as a default value that can be modified by the user. The preset N max is mostly a large overestimation of the effectively used space in memory for peaks (see, fourth row in Table 2), designed to be sufficient for all setups we have tested. However, the memory consumption for peak properties is still negligible compared to the necessary space for the AMR grid. All peak properties such as the peak density are allocated up to N max . Since every task is aware of its starting number of global peak IDs, switching from global peak ID to local peak index and vice versa is trivial for active peaks. To recover a boundary peaks global ID from its local index, we simply store the global ID in memory at the position of its local index. For the opposite direction we use a hash table that contains the local peak index for a given global peak ID (hash key) 5 . Whenever we introduce a new boundary peak into the virtual peak boundary, it obtains the local peak index corresponding to the first free space in memory. The global peak ID is stored and a hash key is computed. Which peaks need to be present in the virtual peak boundary depends on the connectivity of peaks. The initial state of the virtual boundary will thus be constructed while searching for saddle points that connect the peaks.

The peak communicator
By introducing a peak into the virtual peak boundary, it only obtains a local peak index. No properties except the global peak ID of a newly introduced boundary peak are present at this stage. We now describe how information is transferred from the MPI task which hosts a peak (the "owner" of that peak) into the virtual peak boundaries of other tasks and vice versa. There are two types of communication: inward communication (collect, red arrows in Figure 3) from all processes which have a certain peak inside their peak boundary to the owner of the peak, and outward communication (scatter, green arrows in Figure 3) to update the peak properties in the virtual boundaries. When performing a collect communication, one has to specify whether one is computing a sum, minimum or maximum of the incoming values belonging to the same peak. When a scatter communication is performed, the peak properties of boundary peaks are overwritten with their equivalent from the peaks owner. A typical communication 5 We use a simple hash function based on the remainder of a division of the peak ID by a prime number chosen according to the maximum size of the virtual peak boundary. Collisions are dealt with by chaining in the form of a linked list (Knuth 1998). We found this to be sufficient for our purpose (see Table 2). pattern for a peak property is therefore a collect communication followed by a scatter communication.
Before this communication can effectively be performed, we need to build a communication structure which we refer to as the "peak communicator". We describe the construction of the peak communicator for the situation with 3 MPI tasks depicted in Figure 3. We allocate a matrix C of size N task × N task . The entry c ij is the number of peaks inside the virtual peak boundary of task i that are owned by task j. Each task builds its line of C in a loop over the boundary peaks by looking at their global peak IDs. Through MPI communication, the lines of C are shared between all task and C is complete. The resulting communication matrix is depicted in Figure 5.
We will first consider a collect operation. In this case, the lines in C are the send counter vectors and the columns are the receive counter vectors. These counters contain the amount of information that has to be sent/received to/from another MPI task. From those counters, we create the vectors of corresponding offsets. The n-th entry in the send/receive offset vector is simply the sum of the first n − 1 elements of the send/receive counter vector. The send and receive counters for our example are shown in Figure 5. The sum off all entries in the send/receive counters give the length of the send/receive buffer. These buffers are allocated and the send buffer is filled with the peak property to be communicated by looping over all boundary peaks. The receive buffers are filled with the corresponding values through calling MPI ALLTOALLV 6 . This command efficiently sends each package in the send buffer to the right location in the recipients receive buffers.
In order to complete the setup of the peak communicator, we use the established structure to perform a collect communication of the global peak ID. This information allows the identification of a position in the receive buffer with an active peak which completes the communication structure. The peak communicator is rebuilt whenever new peaks have potentially been added to the virtual peak boundary of any MPI task.
The communication structure for the scatter communication is identical to the one for the collect operation described above. Send and receive buffers, offsets and counters just switch their role.

The saddle point matrix
To keep track of the saddle points, we establish a symmetric saddle matrix M , where the entry m ij is the density of the saddle point connecting the peaks i and j. As The construction of the sparse matrices is performed locally the way described in Section 2.3. Whenever a connection is found to a peak that is not yet present in the virtual peak boundary, the given peak is introduced by assigning it a local index. See Algorithm 1 for the pseudocode describing the saddle point search on each task.

Communication of saddle points
We could now use a collect communication on the saddle points for every peak in the entire computational box. As a result of that, every task would have access to all saddle points of all his active peaks. The global key saddle and key neighbour could then be determined by every MPI task for his active peaks. However, this approach would introduce a lot of communication and unnecessarily fill the sparse saddle matrices. The only necessary information to perform one iteration in the merging process is the (global) key saddle density of a peak and the corresponding key neighbour. This global maximum saddle can be found by comparing the local maxima of each MPI task. We thus minimise communication by performing a collect communication only on the local maximum of each row in the saddle point matrix. Together with the local maximum saddle density, we collect the global peak ID that denotes the local key neighbour. The owner of a peak can now compute the global key saddle for a given peak by comparing all the local maxima. The global peak ID that was received from the MPI task which hosts the global key saddle is the key neighbour of the peak. If not already present, the key neighbour is introduced into the virtual peak boundary of the owner task and the key saddle density is written into the sparse saddle matrix of the owner. Every MPI task can now perform a complete iteration in the merging process without any further communication of saddle point densities.

Merging in parallel
We are now set for the actual merging of the peaks. We introduce two new peak properties: a logical variable called alive which is initialised to true and set to false when a peak is merged into another one, and the NewPeak label which is initialised to the global peak ID for all active peaks. These two new properties and the peak density are updated in the virtual peak boundaries using a scatter communication. A permutation which sorts the active peaks by decreasing density is computed. Now we propagate the NewPeak label through the key saddles in a level-by-level fashion. On each level, we iterate until no NewPeak label is moved, while the virtual boundaries are updated after every iteration. This is perfectly analogous to the parallel watershed segmentation. After every level of saddle points we update the alive variable, the saddle point matrices and the virtual boundaries. The merger routine is described in Algorithm 2 in pseudocode. The substructure merging is performed in exactly the same way, we just re-place the relevance threshold by the saddle density threshold.

scaling test
We use a previously run cosmological dark matter simulation with 512 3 particles for a scaling experiment. We restart the simulation from the output corresponding to redshift z = 0 using various numbers of MPI tasks. Before phew can run, we project the particle density onto the AMR grid using the CIC (Cloud-In-Cell, Hockney & Eastwood 1981) algorithm. Once we have constructed the grid-based density field, we run phew with a density threshold of 80 times the cosmological critical density (noted ρ crit ) and a relevance threshold of 3. After merging the peak patches into Level 0 clumps (sub-haloes), we merge to form haloes by applying a saddle threshold of 200ρ crit . The first column in Table 1 summarizes parameters and runtime statistics obtained for 1024 tasks. We see a rich hierarchy of saddle points spread over many levels. The numbers of iterations necessary show that there is structure extending over several domain boundaries at every stage of the process (peak patches, clumps, haloes). Note that we phew finds exactly the same structures, independent of the number of MPI tasks that have been used. This empirically confirms what we have described in Section 2.7. It is also worth mentioning that the iteration pattern looks surprisingly similar for the other 512 3 runs in our scaling experiment. The total number of necessary iterations increases from 35 to 45 when going from 32 to 2048 tasks while it would be only 3 when for the serial algorithm. An example of the hierarchical structure that is found by phew is shown in Figure 8 which depicts a halo with four levels of substructure taken from our scaling experiment.
In our numerical experiment, phew was run five times in a row, for five main simulation time steps following the restart. We measure the total runtime of each call to phew as well as the time spent on the different algorithmic steps. We find the variance of the runtimes to be negligible and conclude that the timings are stable. Note that the preliminary construction of the density field is performed inside the watershed segmentation block. However, the CIC algorithm is quick compared to the watershed segmentation. We also measure the amount of time necessary for each MPI task to write the properties of the structure inside its domain to disk.
The runtimes for the various numbers of MPI tasks are plotted in the top two panels of Figure 6. The top panel shows an excellent scaling of the algorithm up to 1024 MPI tasks which is four times the numbers of The merging in our test scales well up to ∼ 256 MPI tasks. The bottom panel shows the maximum number of sparse matrix elements over all MPI tasks compared to 1/N tasks and rescaled to one at 32 MPI tasks. The increase seen in this number for of tasks is due to the growing load imbalance in terms of peaks per task and the increase in the surface to volume ratio of the domain segmentation. It explains the increase of the scaled runtime of the noise removal very well up to 512 tasks. The overall scaling of the algorithm is satisfactory up to 1024 MPI tasks which is four times the number of CPUs the original simulation was run on. tasks that were used to perform the original simulation. In this regime, the total runtime of phew is dominated by the watershed segmentation and the saddle point search. The most costly operations inside those two blocks are the construction and access of neighbouring cells. The total workload of those blocks thus scales linearly with the number of test cells per MPI task.
The second panel shows that the runtime of those two blocks does actually scale over the entire range of numbers of tasks that we have tested. The second panel in Figure 6 shows that the merging procedures scale well up to 256 tasks. The scaling of the merging process in this region is mainly controlled by two effects: with a growing number of tasks, the load imbalance of the peaks between the different MPI tasks increases. This is unavoidable as the domain decomposition is optimised for all AMR cells, not for the test cells only, and even less for the peak patches. The second reason is the growing ratio of surface to volume as the computational box is divided in smaller parts. This results in more ghost peaks per active peak which causes a higher workload per active peak. Those two effects are quantified in the first two rows of Table 2.
The solid line in the bottom panel of Figure 6 is a result of both effects mentioned above. It depicts max{N sparse }, the maximum number of used sparse matrix elements over all MPI tasks. In perfect scaling conditions, this number would decrease as 1/N tasks . We thus multiply max{N sparse } by N tasks and rescale to one at 32 tasks. We compare this to the runtime of the noise removal (also scaled). We observe that this "worst case" number of entries in the sparse saddle point matrix does explain the scaling of the merging process up to 512 tasks. Beyond that, we believe that MPI communications become the performance bottleneck.
In Table 2 we also show the maximum ratio of ghost peaks to active peaks. For 2048 tasks we have a value of 24%. This shows that the number N max defined in Equation 1 is an overestimation of the effectively used memory for ghost peaks for this setup. In the same table, we also list the number of hast table collisions.
There are very few collisions as the hash table is far from filling up and we conclude that the relatively simple hash function that we use is good enough for our purpose. Another fact worth mentioning is the relatively constant ratio of non-zero entries in saddle point matrix to the number of peaks seen in the third line of Table 2. Divided by two (due to the symmetry of the saddle point matrix), this number gives a good idea of the effective number of neighbours per peak.
As a second test we perform a "weak scaling" comparison of our 1024 task run with another 1024 task run but this time on a larger, 1024 3 particle box. The second column of Table 1 lists the statistics of that run. The numbers of test cells, peaks, clumps and haloes all increase by the expected factor of ≈ 8. We thus divide the runtimes of phew for this setup by 8 and compare to the runtime of the 1024 task run on the 512 3 box. This comparison is plotted in Figure 7. The figure shows that the runtime per data decreases for all parts of phew by increasing the size of the data. Especially the efficiency of merging routines benefit a lot from the increased size of the dataset. We thus conclude that we can enlarge the range of N task where phew scales well, by increasing the size of the simulation.

Conclusions
We have presented phew, a new structure finding algorithm and its MPI parallel implementation into the AMR code ramses. phew finds density peaks and their associated regions in a 3D density field by performing a watershed segmentation. The merging is based on the saddle point topology. We have described a two-step approach to merging. In a first step, we merge irrelevant density fluctuations which we consider as noise. In a second step we merge the finest substructure hierarchically, into large, connected regions above the adopted density threshold. This merging process naturally re-  We show a small sub-volume of the 512 3 particle box used in our scaling experiment. The coordinates indicate the fraction of the box size. The sub-volume contains ≈ 2 × 10 6 particles. The objects that emerge after the noise removal (Level 0 clumps) are indicated in the second panel, where all particles belonging to the same object share a color. Every subsequent panel shows the status after a further round of merging as it is described in Section 2.6.
sults in a tree-like representation of substructure similar to the dendrograms presented by Rosolowsky et al. (2008). The main focus of this article is on the parallel implementation of the algorithm which we have described in detail. Our implementation is truly parallel, meaning that it produces exactly the same results for varying numbers of MPI tasks. To test the parallelization of phew, we have performed a scaling experiment on a snapshot from a cosmological dark matter simulation. We have found excellent scaling in the relevant range of MPI tasks. When using the same number of MPI tasks that was used for the actual simulation, the runtime of phew ∼ 10% the time it takes to advance the simulation by one time step. This allows for frequent usage of phew on-the-fly and thus more fine-grained information about how matter assembles in simulations.
phew has similarities with already existing watershed based halo finders, such as denmax (Bertschinger & Gelb 1991), hop (Eisenstein & Hut 1998), skid (Stadel  et al., in prep), but these are either not yet parallelized, do not find substructure or work only on particles. On a first sight, it looks like our approaches to define substructure or parallelization cannot be applied to particle-based data structures since we operate on a mesh-defined density field, while the other codes work on the particle distribution directly. However, the only two concepts that we use which are naturally provided by the grid, namely a local density and the notion of a neighbour, can be also defined for other data structures that do not rely on a grid. Once these properties are defined, the algorithm presented in this paper can be applied to particle data in the same way as we apply it to grid data. At the current stage, our implementation of phew is a topological tool only, meaning that it identifies regions in space disregarding physical properties such as the kinetic or gravitational energy of the matter in that volume. For the application of phew as a genuine halo finder, we need to develop an unbinding procedure, which removes dark matter particles from regions they are not gravitationally bound to. We will exploit our hierarchical decomposition into substructure, to pass unbound particle to larger and larger regions, until the particles remain bound. This will unambiguously define the parent halo (or sub-halo) of the particles.