In situ and in-transit analysis of cosmological simulations
 Brian Friesen^{1},
 Ann Almgren^{2},
 Zarija Lukić^{3},
 Gunther Weber^{4},
 Dmitriy Morozov^{5},
 Vincent Beckner^{2} and
 Marcus Day^{2}
DOI: 10.1186/s40668-016-0017-2
© Friesen et al. 2016
Received: 8 May 2016
Accepted: 17 August 2016
Published: 24 August 2016
Abstract
Modern cosmological simulations have reached the trillion-element scale, rendering data storage and subsequent analysis formidable tasks. To address this circumstance, we present a new MPI-parallel approach for analysis of simulation data while the simulation runs, as an alternative to the traditional workflow consisting of periodically saving large data sets to disk for subsequent ‘offline’ analysis. We demonstrate this approach in the compressible gas dynamics/N-body code Nyx, a hybrid \(\mbox{MPI}+\mbox{OpenMP}\) code based on the BoxLib framework, used for large-scale cosmological simulations. We have enabled on-the-fly workflows in two different ways: one is a straightforward approach consisting of all MPI processes periodically halting the main simulation and analyzing each component of data that they own (‘in situ’). The other consists of partitioning processes into disjoint MPI groups, with one performing the simulation and periodically sending data to the other ‘sidecar’ group, which post-processes it while the simulation continues (‘in-transit’). The two groups execute their tasks asynchronously, stopping only to synchronize when a new set of simulation data needs to be analyzed. For both the in situ and in-transit approaches, we experiment with two different analysis suites with distinct performance behavior: one which finds dark matter halos in the simulation using merge trees to calculate the mass contained within isodensity contours, and another which calculates probability distribution functions and power spectra of various fields in the simulation. Both are common analysis tasks for cosmology, and both result in summary statistics significantly smaller than the original data set. We study the behavior of each type of analysis in each workflow in order to determine the optimal configuration for the different data analysis algorithms.
Keywords
cosmology; post-processing; halo finding; power spectra; in situ; in-transit
1 Introduction
Data analysis and visualization are critical components of large-scale scientific computing (Ross et al. 2008; Agranovsky et al. 2014; Nouanesengsy et al. 2014; Bleuler et al. 2015; Sewell et al. 2015). Historically such workflows have consisted of running each simulation on a static compute partition and periodically writing raw simulation data to disk for ‘post-processing.’ Common tasks include visualization and size reduction of data, e.g., calculating statistics, field moments, etc. Other tasks can be domain-specific: for example, evolving large nuclear reaction networks on passively advected tracer particles in supernova simulations (Thielemann et al. 1986; Travaglio et al. 2004; Röpke et al. 2006). Often the data footprint of the output from these post-processing tasks is much smaller than that of the original simulation data (Heitmann et al. 2014; Sewell et al. 2015).
As simulations grow larger, however, this approach becomes less feasible due to disk bandwidth constraints as well as limited disk capacity. Data analysis requirements are outpacing the performance of parallel file systems, and, without modifications to either workflows or hardware (or both), the current disk-based data management infrastructure will limit scientific productivity (Ross et al. 2008; Bennett et al. 2012). One way to avoid exposure to the increasingly disparate performance of disk I/O vs. inter- and intra-node bandwidth is to limit the volume of data which is written to disk. This strategy can be realized in different ways; one approach is simply to write data relatively infrequently, e.g., only once every many time steps when evolving time-dependent problems. However, limiting the number of time steps at which grid data is saved in order to conserve disk space also discards simulation data by ‘coarsening’ in the temporal dimension (Nouanesengsy et al. 2014). For example, in order to produce a mock galaxy catalog from an N-body simulation, it is essential to create halo merger trees, which describe the hierarchical mass assembly of dark matter halos (Mo et al. 2010). It is, however, well recognized that in order to recover converged merger trees, identification of halos with high temporal resolution is needed (Srisawat et al. 2013).
A second strategy for addressing the I/O problem is to shift data analysis from executing ‘offline’ (on disk; Sewell et al. 2015) to running while the simulation data is still in memory. Such a workflow can be expressed in a myriad of ways, two of which we explore in this work. One approach consists of all MPI processes periodically halting the simulation and executing analysis routines on the data in memory (‘in situ’). The second method consists of dividing the processes into two disjoint groups; one group evolves the simulation and periodically sends its data to the other, which performs the analysis while the simulation continues asynchronously (‘in-transit’; e.g., Bennett et al. 2012). While in situ approaches have long been recognized as efficient ways of avoiding I/O, less attention has been devoted to in-transit methods.
In situ and in-transit methods each have potential strengths and weaknesses. The former method requires no data movement beyond what is inherent to the analysis being performed. Its implementation is also relatively non-invasive to existing code bases, often consisting of adding a few strategically placed function calls at the end of a ‘main loop.’ However, if the analysis and simulation algorithms exhibit disparate scaling behavior, the performance of the entire code may suffer, since all MPI processes are required to execute both algorithms. In-transit methods, on the other hand, lead to more complex workflows and more invasive code changes, which may be undesirable (Sewell et al. 2015). They also require significant data movement, either across an interconnect or perhaps via a specialized I/O accelerator (‘burst buffer’). However, they can be favorable in cases where the analysis code scales differently than the main simulation: since the analysis can run on a small, separate partition of MPI processes, the remaining processes can continue with the simulation asynchronously. This feature may become especially salient as execution workflows of modern codes become more heterogeneous, since different components will likely exhibit different scaling behavior.
The nature of the post-processing analysis codes themselves also plays a role in the effectiveness of in-transit implementations. In many scientific computing workflows, the ‘main’ simulation code performs a well-defined task of evolving a physical system in time, e.g., solving a system of partial differential equations. As a result, its performance characteristics and science goals are relatively stationary. Analysis codes, in contrast, are implementations of a zoo of ideas for extracting scientific content from simulations. Being exploratory in nature, their goals are more ephemeral and heterogeneous than those of the simulation itself, which in general leads to more diverse performance behavior. The in-transit framework presented here provides the ability for analysis codes to be run together with the simulation, but without a strict requirement of being able to scale to a large number of cores. It is therefore useful to think of this in-transit capability as adding ‘sidecars’ to the main vehicle: in addition to resources allocated exclusively for running the simulation, we allocate a set of resources (often much smaller) for auxiliary analysis tasks.
In this work we explore both in situ and in-transit data analysis workflows within the context of cosmological simulations which track the evolution of structure in the universe. Specifically, we have implemented both of these workflows in the BoxLib framework, and applied them to the compressible gas dynamics/N-body code Nyx, used for simulating large-scale cosmological phenomena (Almgren et al. 2013; Lukić et al. 2015). We test each of these workflows on two different analysis codes which operate on Nyx data sets, one which locates dark matter halos, and another which calculates probability distribution functions (PDFs) and power spectra of various scalar fields in the simulation. In Section 2 we describe the scientific backdrop and motivation for the data analysis implementations which we have tested. In Section 3 we provide the details of our implementation of the in situ and in-transit workflows in the BoxLib framework. Section 4 contains a description of the two analysis codes which we explored using both in situ and in-transit methods. We discuss the performance of the two codes in each of the two analysis modes in Section 5, and we discuss prospects and future work in Section 6.
2 Cosmological simulations
Cosmological models attempt to link the observed distribution and evolution of matter in the universe with fundamental physical parameters. Some of these parameters serve as initial conditions for the universe, while others characterize the governing physics at the largest known scales (Davis et al. 1985; Almgren et al. 2013). Numerical formulations of these models occupy a curious space in the data analysis landscape: on one hand, each cosmology simulation can be ‘scientifically rich’ (Sewell et al. 2015): exploratory analysis of the simulation may lead to new insights into the governing physical model which could be lost if the raw data is reduced in memory and discarded. On the other hand, many models exhibit a highly nonlinear response to the initial perturbations imposed at high redshift, in which case isolated ‘heroic’ simulations may not capture all of the features of interest which arise from such nonlinear behavior (Davis et al. 1985; Almgren et al. 2013). Instead, one may wish to perform many such simulations and vary the initial conditions of each in order to capture the nuanced behavior of the models; such behavior can often be expressed even in a highly reduced set of data (e.g., a density power spectrum). We emphasize, then, that the data analysis methods presented here represent only a subset of the techniques which will be required to manage simulation data sets at the future scales of computational cosmology.
2.1 Formalism
Nyx is based on BoxLib, an \(\mbox{MPI}+\mbox{OpenMP}\) parallelized, block-structured, adaptive mesh refinement (AMR) framework (BoxLib 2016). It evolves the two-component (hydrogen and helium) baryonic gas equations with a second-order accurate piecewise parabolic method, using an unsplit Godunov scheme with full corner coupling (Colella 1990; Almgren et al. 2010; Almgren et al. 2013). It solves the Riemann problem iteratively using a two-shock approximation (Colella and Glaz 1985). The dark matter particles interact with the AMR hierarchy using a ‘cloud-in-cell’ method (Hockney and Eastwood 1988). The details of the numerical methods used to solve the above equations are provided in Almgren et al. (2013).
The Nyx code is fully implemented with AMR capabilities, including subcycling in time. The effectiveness of AMR has been validated in simulations of pure dark matter, as well as the ‘Santa Barbara cluster’ problem (Frenk et al. 1999; Almgren et al. 2013). However, the simulations presented in this work focus on the Lyman-α forest, which consists of large-scale systems in the intergalactic medium (IGM) which absorb radiation, preferentially Lyman-α, from distant quasars (Lukić et al. 2015). The IGM spans virtually the entire problem domain in these simulations, and as a result the signals of interest (optical depth, fluxes, etc.) are rarely localized to a specific region. It is therefore generally impractical from a code performance perspective to use AMR at all in Lyman-α simulations, and as a result, all of our simulations here use a single level, with no AMR.
2.2 Simulation data and postprocessing
A typical Nyx simulation evolves a cosmology from high redshift (\(z > 100\)) to the present (\(z = 0\)) in many thousands of time steps. The size of each time step is limited by the standard CFL condition for the baryonic fluid, as well as by a quasi-CFL constraint imposed on the dark matter particles, and additionally by the evolution of the scale factor a in the Friedmann equation; details of these constraints are described in Almgren et al. (2013). Currently the largest Nyx simulations are run on a \(4\text{,}096^{3}\) grid, and at this scale each plotfile at a single time step is \({\sim}4~\mathrm{TiB}\), with checkpoint files being even larger; a complete data set for a single simulation therefore reaches well into the petascale regime. A single simulation can fill up a typical user allocation of scratch disk space (\(\mathcal{O}(10)~\mathrm{TiB}\)) in just a few time steps. We see, then, that modern simulation data sets represent a daunting challenge for both analysis and storage using current supercomputing technologies.
Nyx simulation data lends itself to a variety of popular post-processing cosmological analysis tasks. For example, in galaxies and galaxy clusters, observations have indicated that dark matter is distributed in roughly spherical ‘halos’ that surround visible baryonic matter (Davis et al. 1985). These halos provide insight into the formation of the largest and earliest gravitationally bound cosmological structures. Thus a common task performed on cosmological simulation data sets is determining the distribution, sizes, and merger histories of dark matter halos, which are identified as regions in simulations where the dark matter density is higher than some prescribed threshold. A recent review (Knebe et al. 2013) enumerates 38 different algorithms commonly used to find halos. To process data from Nyx (an Eulerian code), we use a topological technique based on isodensity contours, as discussed in Section 4.1. The approach produces results similar to the ‘friends-of-friends’ (FOF) algorithm used for particle data (Davis et al. 1985).
A second common data post-processing task in cosmological simulations is calculating statistical moments of different fields, like matter density or Lyman-α flux. The first two moments, the PDF and the power spectrum, are often of most interest in cosmological simulation analysis. Indeed, it is fair to say that modern cosmology is essentially the study of the statistics of density fluctuations, whether probed by photons or by more massive tracers, such as galaxies. The power spectrum of these fluctuations is the most commonly used statistical measure for constraining cosmological parameters (Palanque-Delabrouille et al. 2013; Anderson et al. 2014; Planck Collaboration et al. 2014), and is one of the primary targets for numerical simulations. In addition to cosmology, one may be interested in predictions for astrophysical effects from these simulations, like the relationship between baryonic density \(\rho_{b}\) and temperature T in the intergalactic medium, or details of galaxy formation.
3 In situ vs. in-transit
Having established the scientific motivation for data post-processing in cosmological simulations, we now turn to the two methods we have implemented for performing on-the-fly data analysis in BoxLib codes.
3.1 In situ
To implement a simulation analysis tool in situ in BoxLib codes such as Nyx, one appends to the function Amr::CoarseTimeStep() ^{1} a call to the desired analysis routine. All MPI processes which participate in the simulation execute the data analysis code at the conclusion of each time step, operating only on their own sets of grid data. As discussed earlier, the advantages of this execution model are that it is minimally invasive to the existing code base, and that it requires no data movement (except that inherent to the analysis itself). One potential disadvantage of in situ analysis is that if the analysis algorithm does not scale as well as the simulation itself, the execution time of the entire code (\(\mathrm{simulation} + \mathrm{analysis}\)) will suffer. Indeed, we encounter exactly this bottleneck when calculating power spectra, which we discuss in Section 4.2.
3.2 In-transit
The receipt of FABs (FArrayBoxes, BoxLib’s grid data containers) onto a single MPI process is an inherently serial process. This property can affect code performance adversely if a large number of compute processes send their data to a small number of sidecar processes, because a compute process cannot continue with the simulation until all of its FABs have been sent, and each sidecar process can receive only one FAB at a time. In the example shown in Figure 1, process 5 receives FABs from processes 1 and 3; if, by coincidence, process 5 receives all four of process 3’s FABs in order, process 3 can continue with its next task before process 1; however, process 5 cannot continue until it has received all FABs from both processes 1 and 3. In this example the ratio of sending to receiving processes, \(R \equiv N_{s} / N_{r} = 2\), is relatively small; the serialization of this data transfer will have only a minor effect on aggregate code performance. However, if \(R \sim\mathcal{O}(100)\) or \(\mathcal{O}(1\text{,}000)\), the effect will be more pronounced.
3.3 Task scheduling
In both the in situ and in-transit workflows in BoxLib, we have added a simple queue-based, first-in-first-out scheduler which governs the order of data analysis tasks being performed. As we generally have a small number of tasks to perform during analysis, this approach is quite satisfactory. If the number of analysis tasks grows larger (a trend which we expect), then the workloads of each of these tasks will become more complex, and the variability in scaling behavior of each may be large as well. In this case a more sophisticated scheduling system, in particular one which accounts for a heuristic describing the scalability of each task and allocates sidecar partitions accordingly, may become more useful.
4 Cosmological simulation analysis tools
Many types of cosmological simulations can be broadly characterized by a small set of quantities which are derived (and highly reduced) from the original simulation data set. Such quantities include the distribution and sizes of dark matter halos, PDFs and power spectra of baryon density, dark matter density, temperature, Lyman-α optical depths, etc. (Lukić et al. 2015). We obtain these quantities from Nyx simulation data using two companion codes: Reeber, which uses topological methods to compute dark matter halo sizes and locations; and Gimlet, which computes statistical data of various fields. Because the algorithms in these codes are very different, the codes themselves exhibit different performance and scaling behavior. Therefore, they together span a useful parameter space for evaluating the utility of in situ and in-transit methods. In addition to operating on data in memory, both Reeber and Gimlet are capable of running ‘offline,’ in the traditional post-processing workflow described in Section 1. We describe each of these codes below.
4.1 Reeber
Reeber is a topological analysis code which constructs merge trees of scalar fields. A merge tree describes the relationship among the components of superlevel sets, i.e., regions of the data set with values above a given threshold. The leaves of a merge tree represent maxima; its internal nodes correspond to saddles where different components merge; its root corresponds to the global minimum (Morozov and Weber 2013). Formally, the merge tree of a function f records the values at which distinct points x and y of the domain first fall into the same superlevel-set component. In the case of Reeber operating on Nyx data sets, the points x and y correspond to distinct grid points \(\mathbf {r}_{1}\) and \(\mathbf {r}_{2}\), and the function f corresponds to the density \(\rho(\mathbf {r})\).
Given that so many models and methods for generating halo mass functions already exist (Knebe et al. 2013), it is prudent to validate the isodensity approach used by Reeber by comparing with others. We have found close agreement with the FOF halo mass function (see Figure 3), although the two approaches for finding halos are quite different. Even more useful for comparison would be analytic or semi-analytic halo mass function results, such as the Press-Schechter formalism (Press and Schechter 1974); however, as shown in Lukić et al. (2009), the FOF model and the spherical collapse model used in the Press-Schechter function are incompatible.
Implementing a scalable representation of merge trees on distributed memory systems is a challenging but critical endeavor. Because the merge tree is a global representation of the entire field, traditional approaches for distributing the tree across independent processes inevitably require a communication-intensive reduction to construct the final tree, which in turn leads to poor scalability. Furthermore, modern simulations operate on data sets that are too large for all topological information to fit on a single compute node. Reeber’s ‘local-global’ representation addresses this problem by distributing the merge tree, so that each node stores detailed information about its local data, together with information about how the local data fits into the global merge tree. The overhead from the extra information is minimal, yet it allows individual processes to globally identify components of superlevel sets without any communication (Morozov and Weber 2013). As a result, the merge trees can be queried in a distributed way, where each process is responsible for answering the query with respect to its local data, and a simple reduction is sufficient to add up contributions from different processes. A detailed description of merge trees, contour trees, and their ‘local-global’ representation, which allows Reeber to scale efficiently on distributed memory systems, is given in Morozov and Weber (2013) and Morozov and Weber (2014), and its application to halo finding in Morozov et al. (in preparation).
4.2 Gimlet
Gimlet computes a variety of reduced quantities from Nyx simulation data, including:
- optical depth and flux of Lyman-α radiation along each axis
- mean Lyman-α flux along each axis
- 2D probability distribution function (PDF) of temperature vs. baryon density
- PDF and power spectrum of Lyman-α flux along each axis
- PDF and power spectrum of each of:
  - baryon density
  - dark matter density
  - total matter density
  - neutral hydrogen density
The most interesting quantities calculated by Gimlet are power spectra of the Lyman-α flux and matter density. Gimlet uses FFTW3 (Frigo and Johnson 2005) to calculate the discrete Fourier transforms (DFTs) of the grid data required for power spectra. We also note that FFTW3’s domain decomposition strategy for 3D DFTs has implications which are especially relevant for a study of in situ and in-transit analysis performance. We discuss these in Section 5. Gimlet calculates power spectra in two different ways:
 1.
It divides the grid into one-cell-thick columns which span the length of the entire problem domain. All columns are aligned along one of the 3 axes. It then computes the 1D DFT along each column individually, accumulating the results into a power spectrum for the entire grid. This approach captures line-of-sight effects of the cosmology simulation, as discussed in Lukić et al. (2015). Each DFT can be executed without MPI or domain decomposition, since the memory footprint of each column is small, even for large problem domains.
 2.
It computes the 3D DFT and power spectrum of the entire grid. This requires exploiting FFTW3’s domain decomposition features enabled with MPI.
The two basic types of calculations Gimlet performs, PDFs and power spectra, exhibit quite different performance behavior. PDFs scale well, since each MPI process bins only local data. A single MPI_Reduce() accumulates the data bins onto a single process, which then writes them to disk. Power spectra, on the other hand, require calculating communication-intensive DFTs. The need to reorganize data to fit the domain decomposition required by the FFTW3 library, which differs from the native Nyx decomposition, incurs additional expense. Gimlet’s overall scalability, therefore, is a convolution of these two extremes.
5 Performance
In this section we examine detailed performance figures for the in situ and in-transit workflows described above. All of these tests were performed on Edison, a Cray XC30 supercomputing system at the National Energy Research Scientific Computing Center (NERSC). Edison’s compute partition consists of 5,576 nodes, each configured with two 12-core Intel Xeon ‘Ivy Bridge’ processors at 2.4 GHz, and 64 GB of DDR3 memory at 1,866 MHz. Compute nodes communicate using a Cray Aries interconnect which has a ‘dragonfly’ topology. The high-performance scratch file system is powered by Lustre, and has a total capacity of 2.1 PB (1.87 PiB) and a peak bandwidth of 48 GB/s (44.7 GiB/s), distributed across 96 object storage targets (OSTs).
First we present results of some synthetic performance benchmarks, in order to establish baselines against which to compare the performance of real analyses performed by Reeber and Gimlet. Then we present the results from the analysis codes themselves. We note that, although Nyx and other BoxLib-based codes use OpenMP to express intra-node parallelism, we disabled OpenMP for all performance tests presented below (in Nyx, Gimlet, and Reeber), since the focus of this work is primarily on inter-node relationships. Furthermore, because both the Box size and the total problem domain size of Nyx simulations are typically powers of 2, we use only up to 8 MPI processes per socket on an Edison compute node, always with only 1 MPI process per core. The extra 4 cores on each socket generally do not expedite the computation significantly, due both to load imbalance (12 and 24 rarely divide evenly into the problem domain size) and to the fact that, even with only 8 of 12 cores active on a given socket, we already saturate most of the available memory bandwidth because of the low arithmetic intensity of the finite-volume algorithms in Nyx (Williams et al. 2009). Again, we applied this power-of-2 rule to the Gimlet and Reeber codes as well.
5.1 Lustre file write performance
In situ and in-transit methods attempt to circumvent the limitations of not only disk capacity, but also disk bandwidth. It is therefore of interest to this study to measure the time a code would normally spend saving the required data to disk in the traditional post-processing workflow. To do so, we measured the time to write grids of various sizes, each containing 10 state variables,^{2} to the Lustre file system on Edison. BoxLib writes simulation data in parallel using std::ostream::write to individual files; it does not write to shared files. The user specifies the number of files over which to distribute the simulation data, and BoxLib in turn divides those files among its MPI processes. Each process writes only its own data, and only one process writes to a given file at a time, although a single file may ultimately contain data from multiple processes. The maximum number of files allowable is the number of processes, such that each process writes its own data to a separate file.
We varied the size of the simulation grid from 128^{3} to \(2\text{,}048^{3}\), divided among Boxes of size 32^{3}. This is a small Box size for Lyman-α simulations, which typically do not use mesh refinement; however, it is entirely typical for many other BoxLib-based applications which perform AMR. The maximum number of processes for each test was 8,192, although for tests which had fewer than 8,192 total Boxes (namely, those with 128^{3} and 256^{3} simulation domains), we set the number of processes such that each process owned at least 1 Box. We also varied the number of total files used, from 32 up to 8,192, except in cases where there were fewer than 8,192 total Boxes.
5.2 Intransit MPI performance
We see from this figure that the fastest bandwidth we achieve across the interconnect is \({\sim}58~\mathrm{GiB}\mbox{/}\mathrm{s}\). The peak bandwidth for the entire Aries interconnect on Edison is 23.7 TB/s (\({\sim}21.6~\mathrm{TiB}\mbox{/}\mathrm{s}\)) distributed across 5,576 compute nodes, indicating a peak bandwidth per node of \({\sim}4.0~\mathrm{GiB}\mbox{/}\mathrm{s}\). From our test, the highest bandwidth per node was \({\sim}116~\mathrm{MiB}\mbox{/}\mathrm{s}\), only about 3% of peak. Since several different configurations in this test appear to plateau at the same bandwidth, this limit may be due to high latency costs associated with sending and receiving so many small Boxes. Were the Boxes much larger, as they often are in Lymanα simulations, the MPI traffic would consist of fewer and larger messages, which may achieve higher pernode bandwidth. However, even performing so far below the peak, the bandwidth of moving Boxes across the interconnect for the 512^{3} grid is still a factor of 10 faster than the fastest bandwidth achieved saving it to disk on the Edison Lustre file system (\({\sim}45\%\) of peak; cf. Figure 5).
From this simple study we can estimate that, in terms of pure transfer speed (neglecting analysis or simulation scaling behavior), the optimal ratio of compute to analysis on Edison is \(R \sim9\). For analysis algorithms which strong scale efficiently, this should be a useful metric. The poor bandwidth for the small grids in Figure 7 arises because there are more MPI processes than Boxes, so many processes do not participate in the transfer process; the simulation therefore has access to a smaller portion of the aggregate available bandwidth on the interconnect. In all cases, the total transfer time is on the order of seconds. If analysis is performed infrequently, this comprises a small component of the total run time of most Nyx simulations, which often take \({\gtrsim}\mathcal{O}(10^{5})~\mathrm{s}\). However as the frequency of analysis increases, the total time spent moving data can become considerable, and an in situ approach may become more attractive.
The complexity of in-transit analysis presents a number of factors one should consider in order to optimize code performance. For example, the cost of moving large volumes of data across MPI groups via an interconnect is significant, but can nevertheless be dwarfed by the cost of the analysis computation itself. Additionally, some analysis algorithms scale poorly (significantly worse than the simulation), in which case incurring the penalty for moving data to a small set of sidecar processes so that the simulation can continue may lead to the best overall performance of the code. On the other hand, the data movement penalty makes in-transit data processing impractical for applications which are already inexpensive to calculate, or which scale very well, or both. These types of analysis may be more amenable to in situ approaches.
In-transit analysis introduces an additional layer of load balancing complexity, in that an optimally performant simulation maximizes the asynchrony of computation and analysis. To illustrate this point, suppose a simulation for evolving a system forward in time reserves C MPI processes for computation and A for analysis. Denote by \(\tau_{c}\) the time required for the compute processes to complete one time step of the simulation (without analysis), and \(\tau_{a}\) the time for the analysis processes to complete analysis for one set of data. If the user requests that the analysis execute every n time steps, then an optimal configuration of compute and analysis groups would have \(n \tau_{c} \simeq\tau_{a}\). If the former is larger, then the analysis group finishes too early and has no work to do while it waits for the next analysis signal; the converse is true if the latter is larger. If one finds that \(n \tau_{c} > \tau_{a}\), then one option is to decrease A, the number of analysis processes. This will increase \(\tau_{a}\) but will simultaneously decrease \(\tau_{c}\), since we assume that \(C + A = \mathrm{const}\). A second option to equilibrate the two time scales is to decrease n, although this option may not be useful in all cases, since n is likely driven by scientific constraints and not code performance. If one relaxes the restriction that \(C + A = \mathrm{const.}\), then one can adjust the size of the analysis group arbitrarily while keeping the compute group fixed in order to balance the two time scales \(\tau_{c}\) and \(\tau_{a}\).
The risk of load imbalance described above traces its roots to the static roles of compute and analysis processes which we have assumed for this example, and which have traditionally been characteristic of large-scale simulations. Much of this imbalance could be ameliorated using dynamic simulation ‘steering,’ in which the simulation periodically analyzes its own load balance and adjusts the roles of its various processes accordingly. Returning to the above example, one may request an additional task from the analysis group: every m time steps, it measures the time spent in MPI calls between the compute and analysis groups (such calls remain unmatched while one group is still working). If \(n \tau_{c} > \tau_{a}\), then before the next time step the simulation resizes the MPI groups to increase C and decrease A, by an amount commensurate with the time spent waiting for the two groups to synchronize. An even more exotic solution would be to subsume all resources into the compute group until the nth time step, at which point the simulation spawns the analysis group on the spot, performs the analysis, and reassimilates those processes after the analysis is complete. Another possibility would be to decompose the tasks of simulation evolution and postprocessing within a node, using OpenMP, for example. This may alleviate the data movement penalty, requiring data to move only across NUMA domains within a compute node (or perhaps not at all), which would be significantly faster than moving across the interconnect. The disadvantage of this approach is that all codes involved in the simulation/postprocessing workflow would need to contain a hybrid \(\mbox{MPI}+\mbox{OpenMP}\) implementation; for codes which are old or large, or which have a large ecosystem of existing ‘companion’ codes for postprocessing, this can be a laborious process.
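A steering heuristic of the kind just described might look like the following sketch (our own illustration; the proportionality constant and the clamp on the step size are assumptions, not taken from Nyx or BoxLib): every m steps, the measured wait times of the two groups are compared, and processes are shifted from the side that waits to the side that is busy.

```python
# Hedged sketch of the steering heuristic described above (our own
# illustration; the 0.1 proportionality constant and max_shift clamp are
# assumptions, not taken from Nyx/BoxLib).

def resteer(C, A, wait_compute, wait_analysis, window_time, max_shift=64):
    """Return the new (C, A) split. wait_* are the seconds each group
    spent blocked in unmatched inter-group MPI calls over the window."""
    gap = wait_analysis - wait_compute        # > 0: analysis group idles
    shift = int((gap / window_time) * (C + A) * 0.1)
    shift = max(-max_shift, min(max_shift, shift))
    # Keep at least one process in each group.
    shift = min(shift, A - 1) if shift > 0 else max(shift, 1 - C)
    return C + shift, A - shift

# Example: the analysis group waited 50 s out of a 200 s window, so the
# compute group grows at the analysis group's expense.
new_C, new_A = resteer(2048, 256, wait_compute=0.0, wait_analysis=50.0,
                       window_time=200.0)
```

The design choice here is a damped proportional controller: shifting only a fraction of the apparent imbalance per window avoids oscillating between over- and under-sized analysis groups.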
If OpenMP is not an option for controlling process placement on compute nodes, some MPI libraries provide mechanisms for placing processes on nodes by hand, although we are unaware of any truly portable method of doing so. The heuristics used to adjust the simulation configuration will likely be problem-dependent and will need to draw from a statistical sample of simulation performance data; we are actively pursuing this line of research.
5.3 Problem setup
Having established some synthetic benchmarks for disk and interconnect bandwidths, we now turn to the application of these workflows to science problems in Nyx. Here we present an exploratory performance analysis of in situ and in-transit implementations of both Reeber and Gimlet. We evolved a Lyman-α forest problem (Lukić et al. 2015) on a uniform grid (no mesh refinement) for 30 time steps, resuming a previous simulation which had stopped at redshift \(z \sim 3\). We chose this cosmological epoch because it is often when the most ‘postprocessing’ is performed, due to the wealth of observational data available for comparison (Heitmann et al. 2015; Lukić et al. 2015). The domain boundaries were periodic in all three dimensions. We ran the simulation in three different configurations: with no analysis, with Reeber, and with Gimlet.^{3} In simulations which perform analysis, the analysis code executed every 5 time steps, starting at step 1. We chose this configuration such that the last time step during which analysis is performed is number 26; this gives the sidecar group a ‘complete’ window of 5 simulation time steps (26-30) in which to perform the last analysis. Finally, we repeated this calculation on two different grid sizes, 512^{3} and \(1\text{,}024^{3}\), in order to explore the performance of these algorithms at different scales.
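The schedule above can be reproduced by a one-line helper (illustration only): with an interval of 5 steps starting at step 1, analysis fires at steps 1, 6, …, 26 of the 30-step run, which leaves a full 5-step window (steps 26-30) for the final analysis to overlap.

```python
# Illustration of the analysis schedule used in these runs: analysis
# fires every `interval` steps, beginning at step `start`.

def analysis_steps(total_steps, interval, start=1):
    return [s for s in range(start, total_steps + 1)
            if (s - start) % interval == 0]

steps = analysis_steps(30, 5)   # [1, 6, 11, 16, 21, 26]
```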
Summary of problem configurations used for performance analysis of the postprocessing implementation in BoxLib:

Grid size                     512^{3}                  1,024^{3}
Problem size                  10 Mpc                   20 Mpc
Resolution                    \({\sim}20~\mbox{kpc}\)  \({\sim}20~\mbox{kpc}\)
Box size                      64^{3}                   128^{3}
# Boxes                       4,096                    4,096
# MPI procs (in situ)         2,048                    4,096
# MPI procs (CT)              2,048                    4,096
# MPI simulation procs (CS)   2,048                    4,096
In all simulations, we ran Nyx, Reeber, and Gimlet using pure MPI, with no OpenMP. We used version 3.3.4.6 of FFTW3, and compiled all codes with the GNU compiler suite, version 5.2.0, with ‘-O3’ optimization. In all DFT calculations, we used FFTW3’s default ‘plan’ flag, FFTW_MEASURE, to determine the optimal FFTW3 execution strategy.
5.4 Results
The times to postprocess (dashed lines in each of the four figures) illustrate the strong scaling behavior of both analysis codes. The interplay among the different scalabilities presented here (those of the analysis suite, of the simulation code itself, and of the combination of the two) is complex and requires a nuanced interpretation. In Figure 8 we see that Reeber strong-scales efficiently, although its performance begins to plateau at large numbers of MPI processes, due to the relatively small problem size (a 512^{3} problem domain). We also note that Reeber running in situ on 2,048 processes is actually slower than on 512 or 1,024 processes in-transit, which is due to having too many processes operating on too small a problem.
Figure 8 also illustrates the relationship between the scalability of the analysis code alone and that of the entire code suite (simulation+analysis). In particular, we see that for ≤64 sidecar processes performing analysis, a single Reeber analysis call takes longer than the 5 intervening simulation time steps. As a result, the strong scaling behavior of the complete simulation+analysis suite mirrors that of Reeber almost exactly. However, for ≥128 sidecar processes, Reeber is equal to or faster than the 5 simulation time steps, such that the scalability of the entire code suite begins to decouple from that of Reeber alone. This behavior has different effects for the CS and CT in-transit configurations. For the CS configuration, the scalability of the code becomes asymptotically flat for ≥128 sidecar processes, because Reeber completes before the 5 simulation time steps are done. Any number of additional sidecar processes above ∼256 is wasted in this mode. For the CT configuration, the end-to-end wall clock time begins to increase for ≥128 sidecar processes, because even though Reeber executes faster and faster, increasing the number of sidecar processes decreases the number of simulation processes, which slows the entire code down.
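The qualitative CS-versus-CT behavior just described can be captured by a toy cost model (our own, with idealized 1/P scaling and hypothetical work units, not measured Nyx data): per analysis window of n steps, the wall-clock time is the larger of the simulation time and the asynchronous analysis time.

```python
# Toy model (our own; idealized 1/P scaling, hypothetical work units) of
# the CS vs. CT trends: per analysis window of n steps, wall-clock time
# is the larger of the simulation time and the asynchronous analysis time.

def window_time(C, A, n, work_c=1.0, work_a=0.3125):
    tau_c = work_c / C          # per-step simulation time on C processes
    tau_a = work_a / A          # analysis time on A sidecar processes
    return max(n * tau_c, tau_a)

n, P = 5, 2048
cs = {A: window_time(2048, A, n) for A in (64, 128, 256, 512)}   # C fixed
ct = {A: window_time(P - A, A, n) for A in (64, 128, 256, 512)}  # C + A = P
```

With these assumed work units the crossover sits near A = 128: the CS curve flattens once the analysis finishes inside the window, while the CT curve worsens as sidecars are taken from the simulation, mirroring the trends in Figure 8.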
In light of this behavior, it is critical to find the configuration which yields the best performance for in-transit analysis. Deviating from this configuration leads either to wasted compute resources, or to a degradation of overall code performance. We note that the optimal setup for Reeber in both the CS and CT configurations for this particular Nyx simulation (using 128 to 256 sidecar processes for Reeber and the remainder for simulation) is faster than running the entire code suite in situ. In particular, the CS mode with \(2\text{,}048+256\) processes is faster than the in situ mode by \({\sim}30\%\) (\({\sim}70~\mathrm{s}\)). The CT mode is especially appealing, as it uses the same number of MPI resources as the in situ mode (2,048 processes total), and fewer than the corresponding CS configuration (2,176 to 2,304 processes).
Finally, we note that even the fastest postprocessing configuration (256 sidecars in the ‘CS’ in-transit configuration) is still \({\sim}35\%\) (\({\sim}40~\mathrm{s}\)) slower than running Nyx with no analysis at all. We identify at least two factors which contribute to this overhead cost. First, the workload between the simulation MPI partition and the postprocessing partition is not perfectly balanced; one is always waiting for the other. Second, when we run in-transit, we do not control the physical placement of MPI processes on the machine, whereas when running with no analysis, all processes do the same work and are more likely to be optimally placed together. Ideally, when postprocessing, one would gather the compute partition as closely together as possible, and similarly for the postprocessing partition. However, this is not guaranteed to occur, and the data suggest that it indeed does not.
The aggregate performance of Gimlet therefore represents a convolution of the scaling properties of both PDFs and power spectra. Although the scaling of the power spectrum calculation quickly saturates, as discussed above, we expect nearly ideal strong scaling for the calculation of PDFs. Therefore, when we see in Figure 9 that the time for analysis increases between 256 and 512 analysis processes, this is due to the DFT calculations. In particular, when calculating the DFT of a 512^{3} grid on 512 MPI processes, each process owns exactly one x-y plane of data. The ratio of work to communication may be low for such an extreme decomposition, leading to worse performance with 512 processes than with 256. Despite the jump in analysis time between 256 and 512 processes, however, the time decreases once again between 512 and 1,024 processes. When executing the DFT on 1,024 processes, we leave 512 idle, so the time for that component is likely exactly the same as it was with 512; in fact, the DFT calculation time will never decrease for >512 processes.^{4} The decrease in total analysis time is instead due to the continued strong scaling of the PDF calculations.
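The plane-per-process limit follows directly from a one-dimensional ‘slab’ decomposition of the kind FFTW3's MPI transforms use (sketched below in simplified form; FFTW's actual block assignment may differ in detail): N planes are dealt out contiguously over P ranks, so with a 512^{3} grid on 512 ranks every rank owns exactly one x-y plane, and on 1,024 ranks half the ranks own nothing and idle during the DFT.

```python
# Simplified sketch of a 1-D ("slab") decomposition: deal n_planes
# contiguous planes out over n_ranks ranks as evenly as possible.

def slab_counts(n_planes, n_ranks):
    base, extra = divmod(n_planes, n_ranks)
    return [base + (1 if r < extra else 0) for r in range(n_ranks)]

on_512 = slab_counts(512, 512)     # one x-y plane per rank
on_1024 = slab_counts(512, 1024)   # ranks 512..1023 own zero planes
```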
The relationship between the scalability of Gimlet and that of the entire code suite is different from the case of Reeber, chiefly because Gimlet takes significantly longer to execute than does Reeber. For almost any number of sidecar processes in Figure 9, the time to execute Gimlet is longer than the corresponding 5 simulation time steps. (For 256 sidecar processes the two times are roughly equivalent.) As a result, Gimlet dominates the total code execution time, and the simulation+analysis strong scaling behavior follows that of Gimlet alone almost exactly.
As was the case with Reeber, we see in Figure 9 that some CS and CT configurations lead to faster end-to-end run times than in situ. Coincidentally, the threshold at which CT and CS runs become faster is between 64 and 128 sidecar processes, the same as for Reeber. When using 256 sidecar processes, the CS configuration is \({\sim}65\%\) faster than the in situ run, a significantly larger speedup than the 35% seen with the CS mode when running Reeber. These values are functions both of the scalability of the analysis codes being used and of the total time they take to execute. The key point is that for both analysis codes, one can construct an in-transit MPI configuration which is significantly faster than running in situ.
We note also that for these larger problems the gap between the fastest postprocessing configuration and the case with no postprocessing is wider than for the 512^{3} problem. For the 512^{3} problems it was \({\sim}35\%\), but for these \(1\text{,}024^{3}\) problems it is \({\sim}55\%\) (\({\sim}500~\mathrm{s}\)). This widening is likely due to exacerbation of the effects described above, in particular that the compute and postprocessing MPI partitions are now spread even more widely across the machine, leading to larger MPI communication costs.
6 Summary and prospects
In situ and in-transit data postprocessing workflows represent a promising subset of the capabilities which we believe will be required in order to be scientifically productive on current and future generations of computing platforms. They avoid the constraints of limited disk capacity and bandwidth by shifting the requisite data movement ‘upward’ in the architectural hierarchy, that is, from disk to compute node memory. One can imagine continuing this trend to even finer levels of granularity, shifting from data movement between compute nodes across an interconnect to movement across NUMA domains within a compute node. This could be implemented in several different ways; one would be to use the ‘thread teams’ introduced in OpenMP 4.0 (which is already widely implemented in modern compilers) to delegate simulation and analysis tasks within a compute node. A second approach would be to extend the MPI implementation we have introduced in this work to incorporate the shared memory ‘windows’ developed in MPI-3.
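As a hedged sketch of this intra-node handoff (not BoxLib code; the Python standard library's shared memory stands in here for MPI-3 shared-memory windows), the ‘simulation’ side can publish a field into a node-local shared block, and the ‘analysis’ side can attach by name and read it with no traffic over the interconnect:

```python
# Hedged sketch (not BoxLib code) of an intra-node handoff: the
# "simulation" publishes a field into node-local shared memory, and the
# "analysis" side attaches to the same block by name and computes a
# summary statistic from it directly.

from multiprocessing import shared_memory
import struct

values = [float(i) for i in range(8)]          # stand-in density field
payload = struct.pack("8d", *values)

# Simulation side: create the shared block and write the field into it.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Analysis side: attach by name and compute a summary (here, the mean)
# straight from the shared buffer, without a second copy of the data.
view = shared_memory.SharedMemory(name=shm.name)
field = struct.unpack_from("8d", view.buf)
mean = sum(field) / len(field)

view.close()
shm.close()
shm.unlink()
```

In a real code the two sides would be separate processes on the same node; the essential property, that only a name (not the data) crosses the process boundary, is the same.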
We have demonstrated these new in situ and in-transit capabilities in BoxLib by running two analysis codes in each of the two workflows. Although small in number, this sample of analyses (finding halos and calculating power spectra) is highly representative of postprocessing tasks performed on cosmological simulation data. We caution, however, that the results presented in Section 5.4 are highly problem-dependent; although we found in-transit configurations which yield faster overall performance than in situ for both analysis codes, the situation may be different when running simulations on larger grids, using larger numbers of processes, running analysis algorithms with different scaling behavior, using a different frequency of analysis, etc. Furthermore, we highlight the caveat that in some situations, on-the-fly data postprocessing is not a useful tool, namely in exploratory calculations, which are and will continue to be critical components of numerical simulations. In these cases, other techniques will be more useful, including on-disk data compression. We anticipate, then, that a variety of tools and techniques will be required to solve these data-centric challenges in HPC.
Besides raw performance gains, in situ and intransit workflows can improve the ‘time to science’ for numerical simulations in other ways as well. For example, by running both simulation and analysis at the same time, one eliminates an extra step in the postprocessing pipeline. This reduces the chance for human error which can arise when one must compile and run two separate codes with two separate sets of inputs, parameters, etc.
The in situ and in-transit workflows we have discussed are not limited purely to floating-point applications, even though that has been our focus in this work. Another critical component of simulation is visualization, and recently both the ParaView and VisIt frameworks have implemented functionality for performing visualization on data which resides in memory (‘ParaView Catalyst’ (2016) and ‘libsim’ (2016)).
Our implementations of in situ and in-transit postprocessing show that these types of workflows are efficient on current supercomputing systems: the expense of data movement via MPI in the latter workflow is, in the cases examined here, small compared to the total time spent performing simulation or analysis. Therefore, the penalty for trading disk space for CPU-hours (both of which are limited commodities) is not severe. While we have examined only two analysis codes in this work, in the future we will be able to evaluate the myriad other analysis workflows which are critical components of other BoxLib codes. Because we have built these capabilities into BoxLib itself, rather than into Nyx specifically, these workflows will support a wide variety of applications. The infrastructure described here will provide scientists working with BoxLib-based codes in astrophysics, subsurface flow, combustion, and porous media an efficient way to manage and analyze the increasingly large datasets generated by their simulations.
The Amr::CoarseTimeStep() function is the ‘outermost loop’ in the time-stepping algorithm, which uses subcycling: the coarsest grids take the largest time steps (since they have the most lenient CFL condition), and the refined grids must take smaller time steps to satisfy their stricter CFL conditions. The procedure is hierarchical; each grid may take many more time steps than its coarser ‘neighbor,’ but ultimately all grids must synchronize at the end of the large time step taken by the coarsest level. If subcycling is disabled, then all grids, regardless of their level of refinement, take the same time step together. In the case of Lyman-α simulations, we do not use AMR, and so there is only one time step for all grids in each call to Amr::CoarseTimeStep(). The subcycling procedure in Nyx simulations which do use AMR is discussed in Almgren et al. (2013).
In production runs, one may wish to execute both analysis codes in a single simulation, but for the purposes of this performance study, we do not consider this case, as it obscures the performance behavior we seek to study.
For simulations in which the number of MPI processes calling FFTW3 is equal to or larger than the number of available chunks, we could select an arbitrarily smaller number of processes to perform the FFTW3 call, since it seems that using the maximum possible number of chunks and MPI processes does not lead to the fastest performance of the DFT. However, the optimal number of processes will likely be problem-dependent.
Declarations
Acknowledgements
We thank Brian Perry for valuable discussions. We also thank the anonymous reviewers for insightful comments which improved the quality of this work. Most figures in this work were generated with matplotlib (Hunter 2007). This work was in part supported by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. DOE under Contract No. DE-AC02-05CH11231 to the Lawrence Berkeley National Laboratory. ZL acknowledges the support by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy Office of Advanced Scientific Computing Research and the Office of High Energy Physics. DM and GHW acknowledge the support by the Scientific Data Management, Analysis and Visualization at Extreme Scale program funded by the U.S. Department of Energy Office of Advanced Scientific Computing Research, program manager Lucy Nowell. Support for improvements to BoxLib to support the Nyx code and others was provided through the SciDAC FASTMath Institute. Calculations presented in this paper used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work made extensive use of the NASA Astrophysics Data System and of the astro-ph preprint archive at arXiv.org.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Agranovsky, A, et al.: Improved post hoc flow analysis via Lagrangian representations. In: 2014 IEEE 4th Symposium on Large Data Analysis and Visualization (LDAV), pp. 67-75 (2014). doi:10.1109/LDAV.2014.7013206
 Almgren, AS, et al.: CASTRO: a new compressible astrophysical solver. I. Hydrodynamics and self-gravity. Astrophys. J. 715(2), 1221-1238 (2010). http://stacks.iop.org/0004-637X/715/i=2/a=1221
 Almgren, AS, et al.: Nyx: a massively parallel AMR code for computational cosmology. Astrophys. J. 765(1), 39 (2013). http://stacks.iop.org/0004-637X/765/i=1/a=39
 Anderson, L, et al.: The clustering of galaxies in the SDSS-III baryon oscillation spectroscopic survey: baryon acoustic oscillations in the data releases 10 and 11 galaxy samples. Mon. Not. R. Astron. Soc. 441(1), 24-62 (2014). doi:10.1093/mnras/stu523
 Bennett, JC, et al.: Combining in-situ and in-transit processing to enable extreme-scale scientific analysis. In: SC ’12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 49:1-49:9. IEEE Comput. Soc., Los Alamitos (2012). http://dl.acm.org/citation.cfm?id=2388996.2389063
 Bleuler, A, et al.: Comput. Astrophys. Cosmol. 2(1), 5 (2015). doi:10.1186/s40668-015-0009-7
 BoxLib (2016). https://ccse.lbl.gov/BoxLib/index.html
 Bryan, GL, Norman, ML, Stone, JM, Cen, R, Ostriker, JP: A piecewise parabolic method for cosmological hydrodynamics. Comput. Phys. Commun. 89, 149-168 (1995). doi:10.1016/0010-4655(94)00191-4
 Colella, P: Multidimensional upwind methods for hyperbolic conservation laws. J. Comput. Phys. 87(1), 171-200 (1990). doi:10.1016/0021-9991(90)90233-Q
 Colella, P, Glaz, HM: Efficient solution algorithms for the Riemann problem for real gases. J. Comput. Phys. 59(2), 264-289 (1985). doi:10.1016/0021-9991(85)90146-9
 Davis, M, et al.: The evolution of large-scale structure in a universe dominated by cold dark matter. Astrophys. J. 292, 371-394 (1985). doi:10.1086/163168
 Frenk, CS, White, SDM, Bode, P, Bond, JR, Bryan, GL, Cen, R, Couchman, HMP, Evrard, AE, Gnedin, N, Jenkins, A, Khokhlov, AM, Klypin, A, Navarro, JF, Norman, ML, Ostriker, JP, Owen, JM, Pearce, FR, Pen, UL, Steinmetz, M, Thomas, PA, Villumsen, JV, Wadsley, JW, Warren, MS, Xu, G, Yepes, G: The Santa Barbara cluster comparison project: a comparison of cosmological hydrodynamics solutions. Astrophys. J. 525, 554-582 (1999). doi:10.1086/307908
 Frigo, M, Johnson, S: The design and implementation of FFTW3. Proc. IEEE 93(2), 216-231 (2005). doi:10.1109/JPROC.2004.840301
 Habib, S, et al.: The universe at extreme scale: multi-petaflop sky simulation on the BG/Q (2012). arXiv:1211.4864
 Habib, S, et al.: HACC: simulating sky surveys on state-of-the-art supercomputing architectures. New Astron. 42, 49-65 (2016). doi:10.1016/j.newast.2015.06.003
 Haardt, F, Madau, P: Radiative transfer in a clumpy universe. IV. New synthesis models of the cosmic UV/X-ray background. Astrophys. J. 746, 125 (2012). doi:10.1088/0004-637X/746/2/125
 Heitmann, K, et al.: Large-scale simulations of sky surveys. Comput. Sci. Eng. 16(5), 14-23 (2014). doi:10.1109/MCSE.2014.49
 Heitmann, K, et al.: The Q continuum simulation: harnessing the power of GPU accelerated supercomputers. Astrophys. J. Suppl. 219(2), 34 (2015). http://stacks.iop.org/0067-0049/219/i=2/a=34
 Hockney, RW, Eastwood, JW: Computer Simulation Using Particles. CRC Press, Boca Raton (1988)
 Hunter, JD: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3), 90-95 (2007). doi:10.1109/MCSE.2007.55
 Knebe, A, et al.: Structure finding in cosmological simulations: the state of affairs. Mon. Not. R. Astron. Soc. 435(2), 1618-1658 (2013). doi:10.1093/mnras/stt1403
 Lukić, Z, et al.: The Lyman α forest in optically thin hydrodynamical simulations. Mon. Not. R. Astron. Soc. 446(4), 3697-3724 (2015). doi:10.1093/mnras/stu2377
 Lukić, Z, Reed, D, Habib, S, Heitmann, K: The structure of halos: implications for group and cluster cosmology. Astrophys. J. 692, 217-228 (2009). doi:10.1088/0004-637X/692/1/217
 Mihalas, D: Stellar Atmospheres, 2nd edn. Freeman, New York (1978)
 Mo, H, van den Bosch, FC, White, S: Galaxy Formation and Evolution. Cambridge University Press, Cambridge (2010)
 Morozov, D, et al.: IsoFind: halo finding using topological persistence (in preparation)
 Morozov, D, Weber, GH: Distributed merge trees. In: PPoPP ’13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 93-102. ACM, New York (2013). doi:10.1145/2442516.2442526
 Morozov, D, Weber, GH: Distributed contour trees. In: Bremer, PT, Hotz, I, Pascucci, V, Peikert, R (eds.) Topological Methods in Data Analysis and Visualization III, pp. 89-102. Springer, Berlin (2014)
 Nouanesengsy, B, et al.: ADR visualization: a generalized framework for ranking large-scale scientific data using analysis-driven refinement. In: 2014 IEEE 4th Symposium on Large Data Analysis and Visualization (LDAV), pp. 43-50 (2014). doi:10.1109/LDAV.2014.7013203
 Palanque-Delabrouille, N, et al.: The one-dimensional Lyα forest power spectrum from BOSS. Astron. Astrophys. 559, A85 (2013). doi:10.1051/0004-6361/201322130
 ParaView Catalyst for in situ analysis (2016). http://www.paraview.org/in-situ/
 Planck Collaboration, et al.: Planck 2013 results. XVI. Cosmological parameters. Astron. Astrophys. 571, A16 (2014). doi:10.1051/0004-6361/201321591
 Press, WH, Schechter, P: Formation of galaxies and clusters of galaxies by self-similar gravitational condensation. Astrophys. J. 187, 425-438 (1974). doi:10.1086/152650
 Röpke, FK, et al.: Type Ia supernova diversity in three-dimensional models. Astron. Astrophys. 453(1), 203-217 (2006). doi:10.1051/0004-6361:20053430
 Ross, RB, et al.: Visualization and parallel I/O at extreme scale. J. Phys. Conf. Ser. 125(1), 012099 (2008). http://stacks.iop.org/1742-6596/125/i=1/a=012099
 Sewell, C, et al.: Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach. In: SC ’15 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 50:1-50:11. ACM, New York (2015). doi:10.1145/2807591.2807663
 Srisawat, C, et al.: Sussing merger trees: the merger trees comparison project. Mon. Not. R. Astron. Soc. 436(1), 150-162 (2013). doi:10.1093/mnras/stt1545
 Thielemann, FK, Nomoto, K, Yokoi, K: Explosive nucleosynthesis in carbon deflagration models of type I supernovae. Astron. Astrophys. 158, 17-33 (1986)
 Travaglio, C, et al.: Nucleosynthesis in multi-dimensional SN Ia explosions. Astron. Astrophys. 425(3), 1029-1040 (2004). doi:10.1051/0004-6361:20041108
 Viel, M, et al.: Warm dark matter as a solution to the small scale crisis: new constraints from high redshift Lyman-α forest data. Phys. Rev. D 88(4), 043502 (2013). doi:10.1103/PhysRevD.88.043502
 VisIt tutorial in situ (2016). http://www.visitusers.org/index.php?title=VisIt-tutorial-in-situ
 Williams, S, Waterman, A, Patterson, D: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65-76 (2009). doi:10.1145/1498765.1498785