PHEW: a parallel segmentation algorithm for three-dimensional AMR datasets
 Andreas Bleuler^{1},
 Romain Teyssier^{1},
 Sébastien Carassou^{1, 2} and
 Davide Martizzi^{1, 3}
https://doi.org/10.1186/s40668-015-0009-7
© Bleuler et al. 2015
Received: 22 October 2014
Accepted: 6 April 2015
Published: 9 June 2015
Abstract
We introduce phew (Parallel HiErarchical Watershed), a new segmentation algorithm to detect structures in astrophysical fluid simulations, and its implementation into the adaptive mesh refinement (AMR) code ramses. phew works on the density field defined on the adaptive mesh, and can thus be used on the gas density or the dark matter density after a projection of the particles onto the grid. The algorithm is based on a ‘watershed’ segmentation of the computational volume into dense regions, followed by a merging of the segmented patches based on the saddle point topology of the density field. phew is capable of automatically detecting connected regions above the adopted density threshold, as well as the entire set of substructures within. Our algorithm is fully parallel and uses the MPI library. We describe the parallel algorithm in great detail and perform a scaling experiment which demonstrates the capability of phew to run efficiently on massively parallel systems. Future work will add a particle unbinding procedure and the calculation of halo properties on top of our segmentation algorithm, thus expanding the scope of phew to genuine halo finding.
1 Introduction
Over the last decades, computer simulations have become an indispensable tool for studying the formation of structure on all scales in our universe. The common feature of these simulations is the clustering of matter due to self-gravity. This clustering is of a fractal nature in the sense that, as long as gravity is the dominant force, aggregations of matter turn out to have internal substructures, which are themselves gravitationally bound and may even contain sub-substructures. A crucial task in the analysis of simulations is therefore the identification of overdense regions and, ideally, their entire hierarchy of substructure.
The first algorithms to perform this task were invented in the very early days of computer simulations in astronomy and astrophysics. A halo finder based on spherical overdensities (SO) was described four decades ago by Press and Schechter (1974), who used it to find structure in their simulation of 1,000 particles. Subsequently, the SO method has become one of the standard methods for halo finding. It consists of growing spherical regions around density peaks and assigning particles inside the spheres to the respective peak based on physical arguments. The equally popular friends-of-friends (FOF) method was introduced to halo finding by Davis et al. (1985): if two particles are separated by less than a user-defined linking length, they are assigned to the same group. This results in groups of connected particles, the so-called FOF groups. On top of these two methods, a large variety of algorithms has been built over the last two decades: a recent halo finder comparison paper (Knebe et al. 2013) listed 38 different halo finders. For more detailed information about the halo finders on the market today, we refer to the series of papers that has emerged from the halo finding comparison project (Knebe et al. 2011; Onions et al. 2013; Knebe et al. 2013; Pujol et al. 2014).
On even larger scales, the identification and characterization of cosmic voids is an important task. Similar to haloes, voids assemble into a hierarchical structure of voids and sub-voids, which can be found in observational and simulation data alike. Way et al. (2011) and Way et al. (2014) give an overview of void finding techniques and their relation to the identification of overdensities.
Automatic detection of structure is also performed on galactic scales. For example, astronomers performing radio observations of molecular clouds entered the field when they started to identify clumps in position-position-velocity (PPV) data cubes. Stutzki and Guesten (1990) fitted the data with sums of triaxial Gaussian-shaped clumps, and Williams et al. (1994) identified structure by contouring the dataset at evenly spaced levels without assuming an a priori shape for the clumps. More recently, Rosolowsky et al. (2008) showed how dendrograms can be used to exploit the hierarchy that naturally arises from contouring a PPV cube at multiple emission levels, and used this technique to define substructures in molecular clouds.
With such a large choice of astrophysical structure finding tools at hand, one might ask why there needs to be yet another one. The trigger for the development of a new analysis tool was our need for ‘on-the-fly’ structure finding in the astrophysical simulation code ramses (Teyssier 2002), in order to locate gas and/or dark matter clumps while the simulation is running. As pointed out in Knebe et al. (2013), there is a general trend towards ‘on-the-fly’ analysis, for several reasons: most modern astrophysical simulations are performed on large computing infrastructure with distributed memory. The sizes of these simulations often exceed the total memory present in commonly used shared-memory machines. Structure finding is therefore preferentially performed on the same machine that runs the simulation. Beyond that, the size of a single output of such a simulation can quickly reach hundreds of GB, up to several TB. Storing many outputs for later post-processing is often not possible due to limited disk space, so that keeping only a catalogue of structures is the only viable solution.
Another reason for detecting structures while the simulation is advancing is the possibility to couple the results of the halo decomposition to the simulation itself. In Bleuler and Teyssier (2014), for example, we described a new algorithm for the creation of sink particles, based on the properties of gas clumps detected ‘on-the-fly’. This application requires structure finding to be performed at an extremely high frequency. The finder must therefore make efficient use of the parallel infrastructure and deliver good scaling for increasing numbers of MPI tasks, up to the number of CPUs the simulation is running on; otherwise it will slow down the simulation unacceptably.
These requirements resulted in the development of phew (Parallel HiErarchical Watershed), a new structure finding algorithm, and its implementation into ramses.^{1} While phew is not based on any pre-existing algorithm, it combines various concepts that have been used in other astrophysical structure finding tools before.
First, phew falls into the category of ‘watershed-based’ algorithms. These algorithms assign particles or cells to density peaks by following the steepest gradient, resulting in the so-called ‘watershed segmentation’ (see Section 2.1) of the negative density field. Other members of this category are denmax (Bertschinger and Gelb 1991), hop (Eisenstein and Hut 1998), skid (Stadel 2001), adaptahop (Aubert et al. 2004) and grasshopper (Potter and Stadel, in prep.). Note that in contrast to the aforementioned codes, which work on the particles directly, we use a mesh to define the density field.^{2} Void finding is typically performed using watershed-based algorithms too (e.g., Platen et al. 2007; Aragón-Calvo et al. 2010; Sutter et al. 2015).
Second, region merging in phew is based on the topological properties of saddle surfaces, as is the case for hop, adaptahop and subfind (Springel et al. 2001). As in the ahf halo finder (Knollmann and Knebe 2009), phew works on a density field derived from particles that were previously projected onto the AMR mesh. In contrast to ahf, however, we do not use the AMR grid as a way of contouring the density field: a low-density region which, for whatever reason, is refined to a high level does not compromise our results. Thus, in the landscape of existing halo finders, phew can be seen as filling the gap between phop (Skory et al. 2010), an MPI-parallel version of hop that does not find substructures, and adaptahop, a multithreaded code that does find substructures but has not yet been MPI-parallelized.
The aim of this paper is to present a new structure finding algorithm that: (1) can be applied to any density field defined on an adaptive grid, (2) is capable of detecting substructure, (3) is parallelized using the MPI library on distributed memory systems, and (4) is fast enough to be run at every time step of a simulation without significantly slowing down the calculation. Not discussed in the present paper are an unbinding procedure for particles that are located inside the volume occupied by a certain halo but are not gravitationally bound to it, and the subsequent computation of halo properties. These functionalities will be added to phew in the future. As briefly mentioned above, a previous version of phew was already presented in Bleuler and Teyssier (2014). The algorithm described here differs from the previous one in that it is now fully parallelized. This allows the algorithm to run efficiently on thousands of CPUs and to handle a complex topography with millions of density peaks and a rich hierarchy of substructures.
The article is organised as follows: in Section 2 we describe the serial version of the phew algorithm. In Section 3 we focus on the parallel implementation of the steps presented in Section 2. Section 4 contains scaling experiments which demonstrate the efficiency of the parallelization. Finally, we summarise and discuss our results, presenting an outlook on possible future work in Section 5.
2 The phew algorithm

The algorithm can be decomposed into four main steps:

1. Watershed segmentation;

2. Saddle point search;

3. Noise removal;

4. Substructure merging.
2.1 Watersheds in image processing
Before we start with a more detailed description of the algorithm, we take a quick look over the fence into the field of mathematical morphology and its application to image processing. There, watershed algorithms are a well-known and extensively studied tool for image segmentation. The basic idea is that a grayscale image can be thought of as a topographic relief. A drop of water that falls somewhere onto this relief will follow the line of steepest descent until it reaches a local minimum. All points that connect to the same local minimum in this manner form a catchment basin. The watershed algorithm therefore segments the picture into catchment basins; the boundaries of the catchment basins are the actual watersheds. This technique is usually applied to the magnitude of the image’s gradient. In this way, the watershed lines trace regions of high gradients and segment the original image into connected regions of small gradients. An excellent overview of the watershed techniques used in image processing is given by Roerdink and Meijster (2000).
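To make the picture concrete, the following minimal sketch (in Python, purely illustrative; phew itself is implemented in Fortran inside ramses, and the function name, toy array and 8-neighbour convention are our own assumptions) segments a small 2D ‘relief’ into catchment basins by steepest descent:

```python
import numpy as np

def steepest_descent_watershed(relief):
    """Assign every pixel to the local minimum reached by steepest
    descent; pixels sharing a label form one catchment basin.
    A toy serial illustration of the watershed idea, not the phew
    implementation (which works on a 3D AMR mesh in parallel)."""
    ny, nx = relief.shape
    labels = -np.ones((ny, nx), dtype=int)
    next_label = 0

    def lowest_neighbour(i, j):
        """Strictly lower 8-neighbour of (i, j), or None at a minimum."""
        best, best_ij = relief[i, j], None
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < ny and 0 <= jj < nx and relief[ii, jj] < best:
                    best, best_ij = relief[ii, jj], (ii, jj)
        return best_ij

    for i in range(ny):
        for j in range(nx):
            path = [(i, j)]                        # walk downhill ...
            while labels[path[-1]] < 0:
                nxt = lowest_neighbour(*path[-1])
                if nxt is None:                    # unlabelled local minimum:
                    labels[path[-1]] = next_label  # open a new basin
                    next_label += 1
                    break
                path.append(nxt)
            for p in path:                         # the whole path drains
                labels[p] = labels[path[-1]]       # into the same basin
    return labels

relief = np.array([[3., 2., 3., 4.],
                   [2., 1., 2., 3.],
                   [3., 2., 3., 2.],
                   [4., 3., 2., 1.]])
print(steepest_descent_watershed(relief))   # two basins, with minima
                                            # at (1, 1) and (3, 3)
```

Note that this naive per-pixel walk revisits cells many times; phew instead sorts the cells by density first, so that labels propagate in a single sweep, and, as discussed below, caches each cell’s densest neighbour to minimise neighbour accesses.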
An important difference from the watershed algorithms used for image segmentation lies in the computational cost of checking all neighbours of a cell/pixel. Working in 3D naturally increases the number of neighbours. Using an AMR grid increases it further, since one has to consider possible neighbours at the same level as the original cell, as well as one level above and below. Most importantly, the data structure of an AMR grid is very different from that of a flat 2D array: the location of neighbouring cells in memory needs to be constructed before a neighbour can be checked for its density. Our main interest therefore lies in reducing the number of neighbours that have to be accessed. This aspect influences the choice of watershed algorithm for our purpose.
2.2 Watershed segmentation
2.3 Saddle point search
Before we can merge peak patches, we have to establish the connectivity between them. All test cells are checked for neighbouring cells that belong to a different peak patch. If such a neighbouring cell is found, the average density of the starting cell and its neighbour is taken as the density at the common surface of the two bordering peak patches. The maximum density on the connecting surface is the density of the saddle between the two peaks and is stored. At the end of this step, each peak has a list of neighbouring peaks together with the corresponding saddle point densities. We denote the maximum saddle point of a peak as the ‘key saddle’ and the corresponding neighbour as the ‘key neighbour’.
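A compact serial sketch of this step, written for a uniform grid for simplicity (phew performs the equivalent cell-by-cell search on the AMR mesh; the data layout and names below are our own):

```python
import numpy as np

def find_saddles(density, labels):
    """Saddle point search between peak patches (serial sketch).

    density: 3D array of cell densities
    labels:  3D array with the peak patch ID of every cell (-1 below
             the density threshold)
    Returns saddle_lists: peak_id -> {neighbour peak_id: saddle density},
    where the saddle density is the maximum, over the common surface,
    of the averaged densities of the two touching cells.
    """
    saddle_lists = {}
    for axis in range(3):                 # the three forward face directions
        lo, hi = [slice(None)] * 3, [slice(None)] * 3
        lo[axis], hi[axis] = slice(0, -1), slice(1, None)
        la, lb = labels[tuple(lo)], labels[tuple(hi)]
        da, db = density[tuple(lo)], density[tuple(hi)]
        touch = (la != lb) & (la >= 0) & (lb >= 0)   # faces between patches
        for a, b, d in zip(la[touch], lb[touch], 0.5 * (da[touch] + db[touch])):
            for p, q in ((int(a), int(b)), (int(b), int(a))):
                row = saddle_lists.setdefault(p, {})
                row[q] = max(row.get(q, 0.0), float(d))
    return saddle_lists
```

The key saddle of a peak p is then simply the maximum entry of saddle_lists[p], and the corresponding neighbour is its key neighbour.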
2.4 Noise removal
A known problem of the watershed method is over-segmentation. The presence of a huge number of local minima, for example due to random particle noise or transient gas density fluctuations, causes segmentation into as many catchment basins as there are local minima. Generally speaking, there are two possible strategies to deal with this problem: not creating the over-segmentation in the first place, or merging over-segmented regions. Preventing over-segmentation can be achieved using markers to preselect allowed minima (e.g., Moga and Gabbouj 1998). This usually requires human intervention, which in our case is not possible. Another way is to use a so-called hierarchical watershed algorithm^{3} (Beucher 1994), which merges artificial catchment basins into more important ones based on some criterion. What we describe in the following turns our watershed algorithm into a hierarchical algorithm in the Beucher (1994) sense.
After having identified the saddle points, we classify the peaks based on their contrast to the background. We define this contrast as the ratio of the peak density to the key saddle density and name it the ‘relevance’. This is sketched in the second panel of Figure 1. Every peak is assigned a ‘final peak’ label, which is initialized to the peak’s own peak ID and updated whenever a peak is merged into another one. The peaks are sorted by decreasing peak density. For each peak, the key saddle is determined from the list of saddle points and the relevance is computed. Peaks with a relevance below a relevance threshold are considered noise.^{4} A relevant peak is not touched. For an irrelevant peak, we check whether its key saddle links it to a denser peak; if so, it inherits the final peak label from this key neighbour. As in the watershed segmentation, the previous sorting ensures that the final peak labels can propagate through long chains of connected peaks in just one loop. If a peak is both isolated and irrelevant, it is discarded.
When two peaks merge, their lists of saddle points are merged as well. If both peaks used to have a connection to the same third peak, the maximum of the two saddles is kept.
Now we iterate the procedure: from the updated lists of saddle points, the key saddles are determined; the peaks are accessed in order of decreasing peak density, and irrelevant peaks are merged. After an iteration without any mergers, all irrelevant peaks have been merged or discarded and the noise removal is finished. Note that the described merging process follows exactly the same principle as the watershed segmentation: we have simply replaced cells by peaks, densest neighbour cells by key neighbours, and the peak patch label by the final peak label. We call the structures which survive the noise removal Level 0 clumps. They constitute the finest structures (see Figure 1, third panel) in our hierarchy.
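Under the same data layout as the saddle search sketch above, one possible serial rendering of this noise removal loop reads as follows (the default threshold of 1.5 is the standard choice for gas clumps mentioned in the footnotes; the discarding of isolated irrelevant peaks is omitted for brevity):

```python
def remove_noise(peak_density, saddle_lists, relevance_threshold=1.5):
    """Merge irrelevant peaks into their denser key neighbours (sketch).

    peak_density: dict peak_id -> density at the peak
    saddle_lists: dict peak_id -> {neighbour_id: saddle density}
    Returns final_peak: dict peak_id -> label of the surviving peak.
    """
    final_peak = {p: p for p in peak_density}
    merged = True
    while merged:                               # iterate until no mergers
        merged = False
        # visit peaks by decreasing density, as in the watershed step
        for p in sorted(peak_density, key=peak_density.get, reverse=True):
            if final_peak[p] != p or not saddle_lists.get(p):
                continue                        # merged away, or isolated
            nb, sad = max(saddle_lists[p].items(), key=lambda kv: kv[1])
            if peak_density[p] / sad >= relevance_threshold:
                continue                        # relevant: not touched
            if peak_density[nb] > peak_density[p]:   # upward merging only
                final_peak[p] = final_peak[nb]
                # merge the saddle lists; duplicates keep the maximum
                for q, s in saddle_lists.pop(p).items():
                    saddle_lists[q].pop(p, None)
                    if q != nb:
                        m = max(saddle_lists[nb].get(q, 0.0), s)
                        saddle_lists[nb][q] = m
                        saddle_lists[q][nb] = m
                merged = True
    return final_peak
```

The label propagation through final_peak[nb] is what lets long chains of irrelevant peaks collapse onto one relevant peak within a single sweep, exactly as described above.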
Using the relevance as a merging criterion results in a definition of a clump similar to that obtained by algorithms which contour the dataset at evenly spaced levels in log-space (e.g., Williams et al. 1994). There, a peak-to-saddle ratio above a given value guarantees that a contour level will fall between the peak density and the key saddle density, and thus that the corresponding clump is detected as an individual object. However, a contour level can coincidentally lie between the peak density and the key saddle density of an object with a very low peak-to-saddle ratio, resulting in the detection of an irrelevant density fluctuation as a clump. Merging based on the relevance removes this randomness from the analysis.
For a density field obtained from an underlying particle distribution, the relevance criterion can be interpreted as a signal-to-noise criterion on the basis of an individual cell. We assume a roughly constant number of particles per cell, as this number is often used as a refinement criterion for dark matter simulations in ramses. A large relevance thus translates into a small probability that the peak density is simply drawn from a Poisson distribution whose mean equals the saddle point density. A true signal-to-noise criterion would consider the probability that the entire peak patch is consistent with being randomly drawn from the density at the saddle point; we would expect such a criterion to distinguish noise from physical structure more reliably. However, it is not compatible with our parallelization strategy for the merging procedure, as it includes quantities that are ‘additive’ under a merger, such as the size or the total mass of a peak patch, in the merging criterion. As we will describe in Section 2.7, this would make the outcome of the merging process depend on the exact order in which the peaks are considered for merging.
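To make this interpretation concrete (in our own notation; an illustrative sketch under the stated assumptions, not a criterion used by phew): let \(N_{s}\) be the expected number of particles per cell at the saddle point density and \(r\) the relevance threshold, so that a peak cell contains at least \(rN_{s}\) particles. A standard bound on the upper tail of the Poisson distribution then gives
\[P(N_{p} \ge r N_{s} \mid \lambda = N_{s}) = \sum_{k \ge r N_{s}} \frac{N_{s}^{k}\, e^{-N_{s}}}{k!} \le \exp\left[-\frac{(r-1)^{2} N_{s}}{2r}\right].\]
For example, \(N_{s} = 10\) particles per cell and a relevance threshold \(r = 3\) (the value used in Section 4) bound the probability of a spurious peak cell by \(e^{-20/3} \approx 10^{-3}\).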
2.5 Saddle threshold merging
If desired, the remaining peaks and their associated clumps can be merged further to form composite clumps. This happens by repeating exactly the previous merging process with a different merging criterion. We have implemented a density threshold for the key saddle as a criterion: if the key saddle density is above that threshold, a peak is merged into its key neighbour (see Figure 1, fourth panel). Another possible criterion is the repeated use of the relevance threshold, this time with a higher value.
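Since the merging machinery stays the same and only the criterion changes, the criterion can be viewed as an interchangeable predicate. A small sketch (names are ours; the numerical values are those used in the scaling test of Section 4, with densities in units of \(\rho_{\mathrm{crit}}\)):

```python
# The two merging criteria described above, written as interchangeable
# predicates over (peak density, key saddle density). They can be passed
# directly to a merging driver such as the one sketched in Section 2.7.

def relevance_criterion(peak_rho, saddle_rho, relevance_threshold=3.0):
    """Merge if the peak's contrast to its key saddle is too low
    (noise removal, Section 2.4)."""
    return peak_rho / saddle_rho < relevance_threshold

def saddle_threshold_criterion(peak_rho, saddle_rho, saddle_threshold=200.0):
    """Merge if the key saddle lies above the density threshold
    (substructure merging; 200 rho_crit in the Section 4 runs)."""
    return saddle_rho > saddle_threshold
```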
2.6 A hierarchy of saddle points
2.7 Merging order
The merging process described above obeys three rules:

(i) A peak is only merged into a denser one (upward merging).

(ii) A peak is only merged through its key saddle.

(iii) The density of the key saddle or the relevance is used as the merging criterion.
To see why the outcome does not depend on the merging order, consider a peak n whose key saddle \(s_{nm}\) connects it to a denser peak m, and suppose the decision to merge n is delayed while other mergers proceed. Three things can happen in the meantime:

1. A third peak might be merged into m. Due to upward merging, this cannot change the peak density of m, and therefore the decision whether n will be merged into m is not influenced.

2. Peak m might merge into another peak \(m'\). The saddle \(s_{nm}\) will still exist, now linking n to \(m'\) as \(s_{nm'}\). Due to upward merging we have \(m' > m > n\) in peak density, which means that n is still the less dense of the two peaks connected by \(s_{nm'}\). The decision whether n is merged through this saddle is unaltered.

3. A third peak i might be merged into n. The peak density of n cannot change, since a change would mean that peak i had a higher density than n, which contradicts upward merging. The key saddle cannot change either, because this would mean that peak i had a saddle point \(s_{ij}\) higher than \(s_{nm}\). This would imply that the saddle point \(s_{ni}\), through which i was merged into n, was at least as high, \(s_{ni} \ge s_{ij}\); otherwise \(s_{ni}\) would not have been the key saddle of peak i. Yet \(s_{ni} \ge s_{ij} > s_{nm}\) contradicts the fact that \(s_{nm}\) is the key saddle of peak n. The peak density of n and its key saddle are thus unchanged, and therefore the relevance of n is unchanged as well.
This shows that we can arbitrarily delay the moment when we consider a peak for merging, as long as we respect the three merging rules: the mergers happening in the meantime cannot change the properties that decide if, and through which saddle, this peak will be merged. A possible way to prevent violation of merging rule (ii) is to consider all peaks for merging until no further mergers are possible, before any new key saddle of the merged peaks is computed. This amounts to using the saddle points for merging on a ‘level-by-level’ basis. This is a key to the parallelization of phew, since it allows a large number of operations (mergers) to be performed between each round of communication (finding new key saddles). Note that this line of argument breaks down when we violate merging rule (iii) and use, for example, the clump mass as merging criterion. The mass is a property that changes with every merger; altering the merging order therefore changes the mass of a clump at the moment it is considered for merging, and can thus change the decision whether two clumps should be merged.
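In code, the level-by-level strategy amounts to separating the recomputation of key saddles (the communication-heavy step in the parallel version) from the purely local merging sweeps. A structural sketch, reusing the dictionary layout of the earlier examples (should_merge is one of the predicates from Section 2.5; all names are ours):

```python
def merge_level_by_level(peak_density, saddle_lists, should_merge):
    """Level-by-level merging (serial sketch over the dicts used above).
    All mergers allowed by the current key saddles are performed before
    any key saddle is recomputed; in the parallel code of Section 3 the
    recomputation is the step that requires communication."""
    final_peak = {p: p for p in peak_density}
    while True:
        # freeze the key saddles of all live peaks: one 'level'
        key_saddle = {p: max(nbs.items(), key=lambda kv: kv[1])
                      for p, nbs in saddle_lists.items() if nbs}
        mergers = 0
        # local phase: visit peaks by decreasing density, so that each
        # merge target is already final on this level (rule (i))
        for p in sorted(key_saddle, key=peak_density.get, reverse=True):
            nb, sad = key_saddle[p]
            if (final_peak[p] == p and should_merge(peak_density[p], sad)
                    and peak_density[nb] > peak_density[p]):
                final_peak[p] = final_peak[nb]
                mergers += 1
        if mergers == 0:
            # resolve label chains accumulated over the levels
            for p in final_peak:
                while final_peak[p] != final_peak[final_peak[p]]:
                    final_peak[p] = final_peak[final_peak[p]]
            return final_peak
        rewire_saddle_lists(saddle_lists, final_peak)

def rewire_saddle_lists(saddle_lists, final_peak):
    """Redirect the saddles of merged peaks to their final peaks;
    duplicate connections keep the maximum saddle density (Section 2.4)."""
    old = dict(saddle_lists)
    saddle_lists.clear()
    for p, nbs in old.items():
        for q, s in nbs.items():
            fp, fq = final_peak[p], final_peak[q]
            if fp != fq:
                row = saddle_lists.setdefault(fp, {})
                row[fq] = max(row.get(fq, 0.0), s)
```

Calling this driver with relevance_criterion mirrors the noise removal of Section 2.4; calling it again with saddle_threshold_criterion mirrors the substructure merging of Section 2.5.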
3 Parallel implementation
3.1 Parallel watershed
The watershed segmentation is non-local by nature. This can easily be understood by imagining a mountain ridge: two drops of water falling onto the two sides of the ridge will initially move away in different directions. They might flow into different rivers which end in different lakes, or they might end up in two rivers which join before reaching a lake. The two situations cannot be distinguished based on local properties. Parallelization of the watershed algorithm is therefore a non-trivial task. In the literature, one finds various approaches to parallelization for the different watershed algorithms (see e.g., Roerdink and Meijster 2000). Our technique is very close to the one described in Moga (1997), called ‘hill climbing by locally ordered queues’.
Each task performs a loop over all its active cells in order to identify the test cells (cells above the density threshold). For faster access, the indices of all test cells are stored in an array. A loop over all test cells is then performed in which the densities of all neighbouring cells are checked. The index of the densest neighbouring cell is stored for each test cell, since it will be used several times during the algorithm. Note that the densest neighbour of a cell can lie inside the virtual boundary, while test cells are always inside the active domain.
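A vectorised sketch of this bookkeeping on a flattened grid (the precomputed neighbour_indices array is our own stand-in for the AMR neighbour construction):

```python
import numpy as np

def prepare_test_cells(density, threshold, neighbour_indices):
    """Identify test cells and cache the densest neighbour of each
    (a serial sketch; in phew, test cells are restricted to the active
    domain while their densest neighbour may lie in the virtual boundary).

    density:           1D array of cell densities
    neighbour_indices: (n_cells, n_neigh) array of neighbour cell indices
    Returns the test cell indices and, per test cell, the index of its
    densest neighbouring cell, both reused throughout the algorithm.
    """
    test_cells = np.flatnonzero(density > threshold)
    neigh_rho = density[neighbour_indices[test_cells]]   # one lookup only
    densest = neighbour_indices[test_cells, np.argmax(neigh_rho, axis=1)]
    return test_cells, densest
```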
3.2 Virtual peak boundary
As we have already described in Section 2, our peak patch merging step is analogous to the segmentation step: the patches take the role of the cells, the peak patch label is replaced by the final peak label, and the densest neighbouring cell is replaced by the key neighbouring patch. As explained before, the parallelization of the peak patch segmentation exploits the virtual boundaries surrounding each MPI domain. If we want to use the same strategy to parallelize the merging process, we need the analogue of the virtual mesh boundary: a virtual peak boundary. In contrast to the virtual mesh boundary, the virtual peak boundary does not represent a fixed region in space: as the merging process advances, new connections appear and new peaks have to be introduced into the virtual peak boundary. The virtual peak boundary is therefore more dynamic than the virtual mesh boundary.
Since every task is aware of its starting number in the range of global peak IDs, switching from global peak ID to local peak index and vice versa is trivial for active peaks. To recover a boundary peak’s global ID from its local index, we simply store the global ID in memory at the position of its local index. For the opposite direction, we use a hash table that contains the local peak index for a given global peak ID (the hash key).^{6} Whenever we introduce a new boundary peak into the virtual peak boundary, it obtains the local peak index corresponding to the first free space in memory; the global peak ID is stored and a hash key is computed. Which peaks need to be present in the virtual peak boundary depends on the connectivity of the peaks. The initial state of the virtual boundary is thus constructed while searching for the saddle points that connect the peaks.
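A sketch of this two-way mapping (the class, its sizing and the Python lists are our own illustration; the hash function mirrors the one described in the footnotes, a division remainder with chaining):

```python
class PeakIdTable:
    """Two-way mapping between global peak IDs and local peak indices
    for the virtual peak boundary (a sketch)."""

    def __init__(self, n_slots=10007):       # a prime, chosen according to
        self.buckets = [[] for _ in range(n_slots)]  # the boundary's max size
        self.global_id = []                   # local index -> global peak ID

    def add(self, gid):
        """Introduce a boundary peak; it gets the first free local index."""
        local = len(self.global_id)
        self.global_id.append(gid)
        self.buckets[gid % len(self.buckets)].append((gid, local))
        return local

    def lookup(self, gid):
        """Local index for a global peak ID, or None if not present;
        collisions are resolved by walking the chain."""
        for g, local in self.buckets[gid % len(self.buckets)]:
            if g == gid:
                return local
        return None
```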
3.3 The peak communicator
By introducing a peak into the virtual peak boundary, it only obtains a local peak index; no properties except the global peak ID of a newly introduced boundary peak are present at this stage. We now describe how information is transferred from the MPI task which hosts a peak (the ‘owner’ of that peak) into the virtual peak boundaries of other tasks, and vice versa. There are two types of communication: inward communication (‘collect’, red arrows in Figure 4) from all processes which have a certain peak inside their peak boundary to the owner of the peak, and outward communication (‘scatter’, green arrows in Figure 4) to update the peak properties in the virtual boundaries. When performing a collect communication, one has to specify whether the sum, minimum or maximum of the incoming values belonging to the same peak is computed. When a scatter communication is performed, the peak properties of boundary peaks are overwritten with their equivalent from the peak’s owner. A typical communication pattern for a peak property is therefore a collect communication followed by a scatter communication.
Before this communication can be performed, we need to build a communication structure which we refer to as the ‘peak communicator’. We allocate a matrix C of size \(N_{\mathrm{task}} \times N_{\mathrm{task}}\). The entry \(c_{ij}\) is the number of peaks inside the virtual peak boundary of task i that are owned by task j. Each task builds its row of C in a loop over its boundary peaks by looking at their global peak IDs. Through MPI communication, the rows of C are shared between all tasks to complete C.^{7} The entries in the matrix C determine the amount of data that is sent to, or received from, another MPI task. This information is used to allocate send and receive buffers and to direct each entry in a task’s send buffer to the correct MPI task in a round of all-to-all communication. In order to complete the setup of the peak communicator, we use the established structure to perform a collect communication of the global peak ID. This information allows the identification of a position in the receive buffer (or in the send buffer in the case of a scatter communication) with an active peak. This completes the buildup of the communication structure. The peak communicator needs to be rebuilt whenever new peaks have potentially been added to the virtual peak boundary of any MPI task.
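A sketch of this setup with mpi4py (phew itself is implemented in Fortran with MPI; the function, variable names and int64 dtype below are ours). Run under mpirun, each rank contributes its own row of C:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, ntask = comm.Get_rank(), comm.Get_size()

def build_peak_communicator(boundary_owner):
    """Build the communication matrix C of the 'peak communicator'.

    boundary_owner[k] is the rank owning the k-th boundary peak on this
    task. Returns C plus the send/receive counts this rank needs for
    the subsequent all-to-all exchanges.
    """
    # this rank's row of C: how many of my boundary peaks each task owns
    my_row = np.bincount(boundary_owner, minlength=ntask).astype(np.int64)
    # share the rows so that every task holds the complete matrix C
    C = np.empty((ntask, ntask), dtype=np.int64)
    comm.Allgather(my_row, C)
    send_counts = C[rank, :].copy()   # collect: what I send to each owner
    recv_counts = C[:, rank].copy()   # collect: what each task sends to me
    return C, send_counts, recv_counts
```

With these counts, each collect or scatter is a single all-to-all exchange (e.g., Alltoallv); as noted above, the whole structure must be rebuilt whenever new peaks may have entered a virtual peak boundary.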
3.4 The saddle point matrix
To keep track of the saddle points, we establish a symmetric saddle matrix M, where the entry \(m_{ij}\) is the density of the saddle point connecting peaks i and j. As most peak patches are not touching each other, we use a sparse representation of M. Note that the indices i, j are the local peak indices, which makes M a sparse matrix of virtual size \(N_{\max} \times N_{\max}\). Since, when it comes to merging, we are interested in the maximum entry of each row and the column where it is located, we keep track of those two values when adding new entries to M. The maximum and its column need to be recomputed, by checking each nonzero element of a row, only after values have been removed from the given row of M; this reduces the number of necessary accesses to the sparse matrix.
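The row bookkeeping can be sketched as follows (a Python dict stands in for the sparse storage used in phew; the caching policy follows the description above):

```python
class SparseSaddleRow:
    """One row of the sparse saddle matrix M, with the row maximum
    (the key saddle) and its column cached so that a full rescan is
    needed only after removals (a sketch)."""

    def __init__(self):
        self.entries = {}              # column (local peak index) -> density
        self.max_col, self.max_val = None, float('-inf')

    def add(self, col, val):
        """Insert a saddle density; duplicates keep the maximum."""
        if val > self.entries.get(col, float('-inf')):
            self.entries[col] = val
            if val > self.max_val:     # cheap update on insertion
                self.max_col, self.max_val = col, val

    def remove(self, col):
        """Remove an entry; rescan the row only if the maximum is hit."""
        self.entries.pop(col, None)
        if col == self.max_col:
            if self.entries:
                self.max_col = max(self.entries, key=self.entries.get)
                self.max_val = self.entries[self.max_col]
            else:
                self.max_col, self.max_val = None, float('-inf')
```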
The construction of the sparse matrices is performed locally, in the way described in Section 2.3. Whenever a connection to a peak that is not yet present in the virtual peak boundary is found, the given peak is introduced by assigning it a local index. See Algorithm 1 for the pseudocode describing the saddle point search on each task.
3.5 Communication of saddle points
We could now use a collect communication on the saddle points of every peak in the entire computational box. As a result, every task would have access to all saddle points of all its active peaks, and the global key saddle and key neighbour could then be determined by every MPI task for its active peaks. However, this approach would introduce a lot of communication and unnecessarily fill the sparse saddle matrices. The only information necessary to perform one iteration of the merging process is the (global) key saddle density of a peak and the corresponding key neighbour. This global maximum saddle can be found by comparing the local maxima of the MPI tasks. We thus minimise communication by performing a collect communication only on the local maximum of each row of the saddle point matrix. Together with the local maximum saddle density, we collect the global peak ID of the local key neighbour. The owner of a peak can then compute the global key saddle for the peak by comparing all local maxima. The global peak ID received from the MPI task which hosts the global key saddle is the key neighbour of the peak. If not already present, the key neighbour is introduced into the virtual peak boundary of the owner task, and the key saddle density is written into the owner’s sparse saddle matrix. Every MPI task can now perform a complete iteration of the merging process without any further communication of saddle point densities.
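The owner-side reduction described here can be sketched as follows (the tuple layout and names are our own; the incoming list stands for the receive buffer of a collect communication):

```python
def reduce_key_saddles(incoming):
    """Determine the global key saddle of each owned peak from the
    per-task local maxima (a sketch of the collect reduction).

    incoming: tuples (peak_gid, local_max_saddle_rho, neighbour_gid),
    one per task that borders the peak, including the owner itself.
    Returns peak_gid -> (global key saddle density, key neighbour gid).
    """
    key = {}
    for gid, rho, nb in incoming:
        if gid not in key or rho > key[gid][0]:
            key[gid] = (rho, nb)
    return key
```

The resulting key neighbour IDs are exactly the peaks that may still need to be introduced into the owner’s virtual peak boundary before the next merging iteration.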
3.6 Merging in parallel
We are now set for the actual merging of the peaks. We introduce two new peak properties: a logical variable called alive, which is initialised to ‘true’ and set to ‘false’ when a peak is merged into another one, and the final peak label, which is initialised to the global peak ID for all active peaks. These two new properties and the peak density are updated in the virtual peak boundaries using a scatter communication. A permutation which sorts the active peaks by decreasing density is computed. Now we propagate the final peak labels through the key saddles in a level-by-level fashion. On each level, we iterate until no final peak label moves, while the virtual boundaries are updated after every iteration. This is perfectly analogous to the parallel watershed segmentation. After every level of saddle points, we update the alive variable, the saddle point matrices and the virtual boundaries. The merger routine is described in Algorithm 2 in pseudocode. The substructure merging is performed in exactly the same way; we simply replace the relevance threshold by the saddle density threshold.
4 Scaling test
Table 1 Parameters and some runtime statistics for the 1,024-task runs of the experiment
\(N_{\mathrm{parts}}\)  512^{3}  1,024^{3} 
\(N_{\mathrm{tasks}}\)  1,024  1,024 
Density threshold  \(80 \rho_{\mathrm{crit}}\)  \(80 \rho_{\mathrm{crit}}\) 
Relevance threshold  3  3 
Saddle threshold  \(200 \rho_{\mathrm{crit}}\)  \(200 \rho_{\mathrm{crit}}\) 
Number of test cells  104,360,968  835,609,288 
Number of density peaks  6,714,764  53,994,995 
Number of relevant clumps  1,311,208  10,612,079 
Number of haloes^{*}  521,185  4,234,746 
Runtime  8.0 s  38.9 s 
Number of iterations for…  
…watershed segmentation  7  9 
…noise removal  
Level 1  7  7 
Level 2  5  6 
Level 3  4  4 
Level 4  2  3 
Level 5  1  2 
Level 6  1  1 
Level 7  1  1 
Level 8  1  
…substructure merging  
Level 1  4  3 
Level 2  3  4 
Level 3  3  3 
Level 4  2  2 
Level 5  1  2 
Level 6  1  1 
Level 7  1 
In our numerical experiment, phew was run five times in a row, for five main simulation time steps following the restart. We measure the total runtime of each call to phew, as well as the time spent on the different algorithmic steps. We find the variance of the runtimes to be negligible and conclude that the timings are stable. Note that the preliminary construction of the density field is performed inside the watershed segmentation block; however, the cloud-in-cell (CIC) projection is quick compared to the watershed segmentation. We also measure the amount of time necessary for each MPI task to write the properties of the structures inside its domain to disk.
Table 2 Runtime diagnostics for the parallelization of phew for various numbers of MPI tasks. \(\pmb{N_{\mathrm{active}}}\) and \(\pmb{N_{\mathrm{ghost}}}\) are the numbers of active peaks and ghost peaks, respectively, and \(\pmb{N_{\mathrm{tot}}=N_{\mathrm{active}}+N_{\mathrm{ghost}}}\) denotes the total number of peaks per MPI task. \(\pmb{N_{\mathrm{sparse}}}\) is the number of entries in the sparse saddle matrix and \(\pmb{N_{\mathrm{collisions}}}\) gives the number of hash table collisions. Sums, maxima and averages are taken over all MPI tasks
\(\boldsymbol {N}_{\mathbf{tasks}}\)  32  64  128  256  512  1,024  2,048 

Load imbalance \((\frac{\max\{ N_{\mathrm{tot}} \}}{ \mathrm{avg} \{ N_{\mathrm{tot}}\}} )\)  1.4  1.5  1.8  2.4  2.8  3.3  3.9 
Surface effect \((\frac{\sum N_{\mathrm{ghost}}}{\sum N_{\mathrm{active}}} )\)  0.0087  0.012  0.016  0.021  0.030  0.040  0.055 
Connectivity \((\frac{\sum N_{\mathrm{sparse}}}{\sum N_{\mathrm{tot}}} )\)  9.4  9.4  9.4  9.3  9.3  9.3  9.2 
\(\max \{ \frac{N_{\mathrm{ghost}}}{N_{\mathrm{active}}} \}\)  0.012  0.017  0.044  0.064  0.10  0.15  0.24 
\(\max\{N_{\mathrm{tot}}\}\)  3.0 × 10^{5}  1.6 × 10^{5}  9.6 × 10^{4}  6.4 × 10^{4}  3.8 × 10^{4}  2.2 × 10^{4}  1.3 × 10^{4} 
\(\max\{N_{\mathrm{sparse}}\}\)  3.3 × 10^{6}  1.8 × 10^{6}  1.2 × 10^{6}  8.7 × 10^{5}  6.3 × 10^{5}  4.7 × 10^{5}  3.0 × 10^{5} 
\(\max\{N_{\mathrm{collisions}}\}\)  4  3  2  3  16  17  13 
The solid line in the bottom panel of Figure 6 is the result of both effects mentioned above. It depicts \(\max\{N_{\mathrm{sparse}}\}\), the maximum number of used sparse matrix elements over all MPI tasks. Under perfect scaling, this number would decrease as \(1/N_{\mathrm{tasks}}\). We thus multiply \(\max\{N_{\mathrm{sparse}}\}\) by \(N_{\mathrm{tasks}}\) and rescale to one at 32 tasks. We compare this to the runtime of the noise removal (also rescaled). We observe that this ‘worst case’ number of entries in the sparse saddle point matrix explains the scaling of the merging process up to 512 tasks. Beyond that, we believe that MPI communication becomes the performance bottleneck.
In Table 2 we also show the maximum ratio of ghost peaks to active peaks. For 2,048 tasks we find a value of 24%. This shows that the number \(N_{\max}\) defined in Equation (1) is an overestimate of the memory effectively used for ghost peaks in this setup. In the same table, we also list the number of hash table collisions. There are very few collisions, as the hash table is far from filling up, and we conclude that the relatively simple hash function we use is good enough for our purpose. Another fact worth mentioning is the relatively constant ratio of nonzero entries in the saddle point matrix to the number of peaks, seen in the third line of Table 2. Divided by two (due to the symmetry of the saddle point matrix), this number gives a good idea of the effective number of neighbours per peak.
5 Conclusions
We have presented phew, a new structure finding algorithm, and its MPI-parallel implementation into the AMR code ramses. phew finds density peaks and their associated regions in a 3D density field by performing a watershed segmentation. The merging is based on the saddle point topology. We have described a two-step approach to merging: in a first step, we merge irrelevant density fluctuations, which we consider as noise; in a second step, we merge the finest substructures hierarchically into large, connected regions above the adopted density threshold. This merging process naturally results in a tree-like representation of substructure, similar to the dendrograms presented by Rosolowsky et al. (2008).
The main focus of this article is the parallel implementation of the algorithm, which we have described in detail. Our implementation is truly parallel, meaning that it produces exactly the same results for varying numbers of MPI tasks. To test the parallelization of phew, we have performed a scaling experiment on a snapshot from a cosmological dark matter simulation and found excellent scaling in the relevant range of MPI tasks. When using the same number of MPI tasks that was used for the actual simulation, the runtime of phew is \({\sim}10\%\) of the time it takes to advance the simulation by one time step. This allows for frequent use of phew on-the-fly, and thus more fine-grained information about how matter assembles in simulations.
ramses has recently been demonstrated to scale well up to 38,016 MPI tasks (Alimi et al. 2012) when used to simulate a very large cosmological volume. Even the largest haloes that phew would identify in such a simulation cover only a small fraction of the computational volume. This essentially turns such a setup into a weak scaling experiment for phew, where the scalability is determined by the domain decomposition of ramses. Without having applied phew to such a large setup, we therefore expect the algorithm to show scaling properties in this range similar to those of the ramses code itself. A more challenging situation for the phew algorithm is posed by high-resolution zoom simulations of a single halo. In such a situation, the parent halo is spread over almost all MPI tasks, leading to MPI communication across the entire computational domain during the merging process, and therefore to slightly less favourable scaling properties.
phew has similarities with existing watershed-based halo finders, such as denmax (Bertschinger and Gelb 1991), hop (Eisenstein and Hut 1998), skid (Stadel 2001), adaptahop (Aubert et al. 2004) and grasshopper (Potter and Stadel, in prep.), but these are either not yet parallelized, do not find substructure, or work only on particles. At first sight, it looks as if our approaches to defining substructure and to parallelization cannot be applied to particle-based data structures, since we operate on a mesh-defined density field while the other codes work on the particle distribution directly. However, the only two concepts we use that are naturally provided by the grid, namely a local density and the notion of a neighbour, can also be defined for data structures that do not rely on a grid. Once these properties are defined, the algorithm presented in this paper can be applied to particle data in the same way as we apply it to grid data.
At the current stage, our implementation of phew is a purely topological tool, meaning that it identifies regions in space while disregarding physical properties such as the kinetic or gravitational energy of the matter in that volume. For the application of phew as a genuine halo finder, we need to develop an unbinding procedure which removes dark matter particles from regions they are not gravitationally bound to. We will exploit our hierarchical decomposition into substructure to pass unbound particles to larger and larger regions, until the particles remain bound. This will unambiguously define the parent halo (or subhalo) of each particle.
Footnotes

1. The ramses code, including phew, is publicly available and can be downloaded from http://www.bitbucket.org/rteyssie/ramses.

2. denmax can be considered an in-between case, since it uses a uniform grid to compute the density gradient, which is then used to directly assign particles to peaks.

3. Note that more modern approaches to region merging in image segmentation use the original image for merging, while the watershed is computed on the gradient image (e.g., Peng and Zhang 2011). Using the watershed on the gradient image results in regions of similar gray values, whereas the densities inside our peak patches are very inhomogeneous. Approaches to region merging in image processing are thus fundamentally different from ours.

4. The relevance threshold is a user parameter that can be adapted to the setup; 1.5 is our standard choice for identifying gas clumps in ramses simulations. For identifying dark matter haloes, the value can be picked according to the expected number of dark matter particles per cell and the resulting Poisson noise in the density.

5. For situations where the memory consumption due to the given estimate for \(N_{\max}\) becomes prohibitive, one could start with a lower number and, for example, double the size of the allocation on-the-fly whenever all available space for ghost peaks is occupied. However, we have not yet encountered a situation where this strategy was necessary.

6. We use a simple hash function based on the remainder of a division of the peak ID by a prime number, chosen according to the maximum size of the virtual peak boundary. Collisions are dealt with by chaining in the form of a linked list (Knuth 1998). We found this to be sufficient for our purpose (see Table 2).

7. The introduction of an \(N_{\mathrm{task}}^{2}\)-sized matrix can become problematic when the number of MPI tasks is increased beyond the numbers we have tested for this publication, especially on supercomputers with relatively little memory per core (Blue Gene architecture). In order to apply phew to even larger problems, one can drop the construction of the global matrix C by exploiting the fact that the nth MPI process only needs to be aware of the nth row and the nth column of C, not of the entire matrix. Considering that the rows/columns of C are sparse, one can thus replace the \(N_{\mathrm{task}}^{2}\)-sized matrix by a fully scalable representation of the information contained in C.
Declarations
Acknowledgements
The authors want to thank Stephane Colombi for his advice on substructure merging. Furthermore, the authors thank Doug Potter for helpful discussions about programming techniques. The computations leading to this publication have been performed on the zBox4 and Schroedinger supercomputers at the University of Zurich and at the Swiss National Supercomputing Centre (CSCS) in Lugano. This work has been supported by the Swiss National Science Foundation SNF under the project ‘Computational Astrophysics’ and the PASC co-design project ‘Particles and Fields’.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Alimi, JM, et al.: First-ever full observable universe simulation. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 73. IEEE Computer Society Press, Washington (2012)
 Aragón-Calvo, MA, Platen, E, van de Weygaert, R, Szalay, AS: The spine of the cosmic web. Astrophys. J. 723, 364-382 (2010). doi:10.1088/0004-637X/723/1/364
 Aubert, D, Pichon, C, Colombi, S: The origin and implications of dark matter anisotropic cosmic infall on \({\approx} L_{\star}\) haloes. Mon. Not. R. Astron. Soc. 352, 376-398 (2004). doi:10.1111/j.1365-2966.2004.07883.x
 Bertschinger, E, Gelb, JM: Cosmological N-body simulations. Comput. Phys. 5, 164-175 (1991)
 Beucher, S: In: Serra, J, Soille, P (eds.) Mathematical Morphology and Its Applications to Image Processing (1994)
 Bleuler, A, Teyssier, R: Towards a more realistic sink particle algorithm for the ramses code. Mon. Not. R. Astron. Soc. 445(4), 4015-4036 (2014)
 Davis, M, Efstathiou, G, Frenk, CS, White, SDM: The evolution of large-scale structure in a universe dominated by cold dark matter. Astrophys. J. 292, 371-394 (1985)
 Eisenstein, DJ, Hut, P: HOP: a new group-finding algorithm for N-body simulations. Astrophys. J. 498, 137 (1998). doi:10.1086/305535
 Hockney, RW, Eastwood, JW: Computer Simulation Using Particles. McGraw-Hill, New York (1981)
 Knebe, A, et al.: Haloes gone MAD: the halo-finder comparison project. Mon. Not. R. Astron. Soc. 415, 2293-2318 (2011). doi:10.1111/j.1365-2966.2011.18858.x
 Knebe, A, et al.: Structure finding in cosmological simulations: the state of affairs. Mon. Not. R. Astron. Soc. 435, 1618-1658 (2013). doi:10.1093/mnras/stt1403
 Knollmann, SR, Knebe, A: AHF: Amiga’s halo finder. Astrophys. J. Suppl. Ser. 182(2), 608 (2009)
 Knuth, DE: Sorting and Searching. The Art of Computer Programming, vol. 3. Addison-Wesley, Reading (1998)
 Meyer, F: Topographic distance and watershed lines. Signal Process. 38(1), 113-125 (1994)
 Moga, A: Parallel Watershed Algorithms for Image Segmentation. Tampere University of Technology, Tampere (1997)
 Moga, AN, Gabbouj, M: Parallel marker-based image segmentation with watershed transformation. J. Parallel Distrib. Comput. 51(1), 27-45 (1998)
 Onions, J, et al.: Subhaloes gone Notts: spin across subhaloes and finders. Mon. Not. R. Astron. Soc. 429, 2739-2747 (2013). doi:10.1093/mnras/sts549
 Peng, B, Zhang, D: Automatic image segmentation by dynamic region merging. IEEE Trans. Image Process. 20(12), 3592-3605 (2011)
 Platen, E, van de Weygaert, R, Jones, BJT: A cosmic watershed: the WVF void detection technique. Mon. Not. R. Astron. Soc. 380, 551-570 (2007). doi:10.1111/j.1365-2966.2007.12125.x
 Potter, D, Stadel, J: GRASSHOPPER (in prep.)
 Press, WH, Schechter, P: Formation of galaxies and clusters of galaxies by self-similar gravitational condensation. Astrophys. J. 187, 425-438 (1974). doi:10.1086/152650
 Press, WH, Teukolsky, SA, Vetterling, WT, Flannery, BP: Numerical Recipes: The Art of Scientific Computing, 3rd edn. Cambridge University Press, New York (2007)
 Pujol, A, et al.: Subhaloes gone Notts: the clustering properties of subhaloes. Mon. Not. R. Astron. Soc. 438, 3205-3221 (2014). doi:10.1093/mnras/stt2446
 Roerdink, JBTM, Meijster, A: The watershed transform: definitions, algorithms and parallelization strategies. Fundam. Inform. 41(1-2), 187-228 (2000)
 Rosolowsky, EW, Pineda, JE, Kauffmann, J, Goodman, AA: Structural analysis of molecular clouds: dendrograms. Astrophys. J. 679(2), 1338 (2008)
 Skory, S, Turk, MJ, Norman, ML, Coil, AL: Parallel HOP: a scalable halo finder for massive cosmological data sets. Astrophys. J. Suppl. Ser. 191(1), 43 (2010)
 Springel, V, White, SDM, Tormen, G, Kauffmann, G: Populating a cluster of galaxies - I. Results at \(z=0\). Mon. Not. R. Astron. Soc. 328, 726-750 (2001). doi:10.1046/j.1365-8711.2001.04912.x
 Stadel, JG: Cosmological N-body simulations and their analysis. PhD thesis, University of Washington (2001)
 Stutzki, J, Guesten, R: High spatial resolution isotopic CO and CS observations of M17 SW: the clumpy structure of the molecular cloud core. Astrophys. J. 356, 513-533 (1990). doi:10.1086/168859
 Sutter, PM, et al.: VIDE: the Void IDentification and Examination toolkit. Astron. Comput. 9, 1-9 (2015)
 Teyssier, R: Cosmological hydrodynamics with adaptive mesh refinement. A new high resolution code called RAMSES. Astron. Astrophys. 385, 337-364 (2002). doi:10.1051/0004-6361:20011817
 Way, MJ, Gazis, PR, Scargle, JD: Structure in the 3D galaxy distribution. II. Voids and watersheds of local maxima and minima (2014). arXiv:1406.6111
 Way, MJ, Gazis, PR, Scargle, JD: Structure in the three-dimensional galaxy distribution. I. Methods and example results. Astrophys. J. 727, 48 (2011). doi:10.1088/0004-637X/727/1/48
 Williams, JP, de Geus, EJ, Blitz, L: Determining structure in molecular clouds. Astrophys. J. 428, 693-712 (1994). doi:10.1086/174279