Cosmological N-body simulations: a challenge for scalable generative models

Deep generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAs) have been demonstrated to produce images of high visual quality. However, the existing hardware on which these models are trained severely limits the size of the images that can be generated. The rapid growth of high dimensional data in many ﬁelds of science therefore poses a signiﬁcant challenge for generative models. In cosmology, the large-scale, three-dimensional matter distribution, modeled with N-body simulations , plays a crucial role in understanding of evolution of structures in the universe. As these simulations are computationally very expensive, GANs have recently generated interest as a possible method to emulate these datasets, but they have been, so far, mostly limited to two dimensional data. In this work, we introduce a new benchmark for the generation of three dimensional N-body simulations, in order to stimulate new ideas in the machine learning community and move closer to the practical use of generative models in cosmology. As a ﬁrst benchmark result, we propose a scalable GAN approach for training a generator of N-body three-dimensional cubes. Our technique relies on two key building blocks, (i) splitting the generation of the high-dimensional data into smaller parts, and (ii) using a multi-scale approach that eﬃciently captures global image features that might otherwise be lost in the splitting process. We evaluate the performance of our model for the generation of N-body samples using various statistical measures commonly used in cosmology. Our results show that the proposed model produces samples of high visual quality, although the statistical analysis reveals that capturing rare features in the data poses signiﬁcant problems for the generative models. We make the data, quality evaluation routines, and the proposed GAN architecture publicly available at https://github.com/nperraud/3DcosmoGAN .


Introduction
The recent advances in the field of deep learning have initiated a new era for generative models. Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have become a very popular approach by demonstrating their ability to learn complicated representations to produce highresolution images (Karras et al. 2018). In the field of cosmology, high-resolution simulations of matter distribution Perraudin et al. Computational Astrophysics and Cosmology (2019) 6:5 Page 2 of 17 N -body simulations represent the matter in a cosmological volume, typically between 0.1-10 Gpc, as a set of particles, typically between 100 3 to 2000 3 . The initial 3D positions of the particles are typically drawn from a Gaussian random field with a specific power spectrum. Then, the particles are displaced over time according to the laws of gravity, properties of dark energy, and other physical effects included in the simulations. During this evolution, the field is becoming increasingly non-Gaussian, and displays characteristic features, such as halos, filaments, sheets, and voids (Bond et al. 1996;Dietrich et al. 2012).
N -body simulations that consist only of dark matter effectively solve the Poisson's equation numerically. This process is computationally expensive, as the forces must be recalculated in short time intervals to retain the precision of the approximation. This leads to the need for frequent updates of the particle positions. The speed of these simulations is a large computational bottleneck for cosmological experiments, such as the Dark Energy Survey, a Euclid, b or LSST. c Recently, GANs have been proposed for emulating the matter distributions in two dimensions (Rodríguez et al. 2018;Mustafa et al. 2019). These approaches have been successful in generating data of high visual quality, and almost indistinguishable from the real simulations to experts. Moreover, several summary statistics often used in cosmology, such as power spectra and density histograms, also revealed good levels of performance. Some challenges still remain when comparing sets of generated samples. In both works, the properties of sets of generated images did not match exactly; the covariance matrix of power spectra of the generated maps differed by order of 10% with the real maps.
While these results are encouraging, a significant difficulty remains in scaling these models to generate threedimensional data, which include several orders of magnitude more pixels for a single data instance. We address this problem in this work. We present a publicly available dataset of N -body cubes, consisting of 30 independent instances. Due to the fact that the dark matter distribution is homogeneous and isotropic, and that the simulations are made using periodic boundary condition, the data can be easily augmented through shifts, rotations, and flips. The data is in the form of a list of particles with spatial positions x, y, z. It can be pixelised into 3D histogram cubes, where the matter distribution is represented in density voxels. Each voxel contains the count of particles falling into it. If the resolution of the voxel cube is high enough, the particle-and voxel-based representations should be able to be used interchangeably for most of the applications. Approaches to generate the matter distribution in the particle-based representation could also be designed; in this work, however, we focus on the voxel-based representation. By publishing the N -body data and the accompanying codes we aim to encourage the development of large scale generative models capable of handling such data volumes.
We present a benchmark GAN system to generate 3D N -body voxel cubes. Our design of the novel GAN architecture scales to volumes of 256 3 voxels. Our proposed solution relies on two key building blocks. First, we split the generation of the high-dimensional data into smaller patches. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. Although splitting the generation process into patches provides a scalable solution to generate images of arbitrary size, it also limits the field of view of the generator, reducing its ability to learn global image features. The second core idea of our method addresses this problem by relying on a multi-scale approach that efficiently captures global dependencies that might otherwise be lost in the splitting process.
Our results constitute a baseline solution to the challenge. While the obtained statistical accuracy is currently insufficient for a real cosmological use case, we achieve two goals: (i) we demonstrate that the project is tractable by GAN architectures, and (ii) we provide a framework for evaluating the performance of new algorithms in the future.

Related work
Generative models that produce novel representative samples from high-dimensional data distributions are increasingly becoming popular in various fields such as image-toimage translation (Zhu et al. 2017), or image in-painting (Iizuka, Simo-Serra and Ishikawa 2017) to name a few. There are many different deep learning approaches to generative models. The most popular ones are Variational Auto-Encoders (VAE) (Kingma and Welling 2014), Autoregressive models such as PixelCNN (van den Oord et al. 2016), and Generative Adversarial Networks (GAN) (Goodfellow et al. 2014). Regarding prior work for generating 3D images or volumes, two main types of architectures -in particular GANs -have been proposed. The first type (Achlioptas et al. 2018;Fan et al. 2017) generates 3D point clouds with a 1D convolutional architecture by producing a list of 3D point positions. This type of models does not scale to cases where billion of points are present in a simulation, posing an important concern given the size of current and future N -body simulations. The second type of approaches, including Wu et al. (2016), Mosser et al. (2017), directly uses 3D convolutions to produce a volume. Although the computation and memory cost is independent of the number of particles, it scales with the number of voxels of the desired volume, which grows cubically with the resolution. While recursive models such as PixelCNN (van den Oord et al. 2016) can scale to some extent, they are slow to generate samples, as they build the output image pixel-by-pixel in a sequential manner. We take inspiration from PixelCNN to design a patch-by-patch approach, (2019) 6:5 Page 3 of 17 rather than a pixel-by-pixel approach, which significantly speeds up the generation of new samples. As mentioned above, splitting the generation process into patches reduces the ability of the generator to learn global image features. Some partial solutions to this problem can already be found in the literature, such as the Laplacian pyramid GAN (Denton et al. 2015) that provides a mechanism to learn at different scales for high quality sample generation, but this approach is not scalable as the sample image size is still limited. Similar techniques are used in the problem of super-resolution (Ledig et al. 2017;Lai et al. 2017;Wang et al. 2018). Recently, progressive growing of GANs (Karras et al. 2018) has been proposed to improve the quality of the generated samples and stabilize the training of GANs. The size of the samples produced by the generator is progressively increased by adding layers at the end of the generator and at the beginning of the discriminator. In the same direction, Brock et al. (2019), Lučić et al. (2019) achieved impressive quality in the generation of large images by leveraging better optimization. Problematically, the limitations of the hardware on which the model is trained occur after a certain increase in size and all of these approaches will eventually fail to offer the scalability we are after.
GANs were proposed for generating matter distributions in 2D. A generative model for the projected matter distribution, also called a mass map, was introduced by Mustafa et al. (2019). Mass maps are cosmological observables, as they are reconstructed by techniques such as, for example, gravitational lensing (Chang et al. 2018). Mass maps arise through integration of the matter density over the radial dimension with a specific, distancedependent kernel. The generative model presented in Mustafa et al. (2019) achieved very good agreement with the real data several important non-Gaussian summary statistics: power spectra, density histograms, and Minkowski functionals (Schmalzing et al. 1996). The distributions of these summaries between sets of generated and real data also agreed well. However, the covariance matrix of power spectra within the generated and real sets did not match perfectly, differing by the order of 10%.
A generative model working on 2D slices from N -body simulations was developed by Rodríguez et al. (2018). N -body slices have much more complex features, such as filaments and sheets, as they are not averaged out in projection. Moreover, the dynamic range of pixel values spans several orders of magnitude. GANs presented by Rodríguez et al. (2018) also achieved good performance, but only for larger cosmological volumes of 500 Mpc. Some mismatch in the power spectrum covariance was also observed.
Alternative approaches to emulating cosmological matter distributions using deep learning have been recently been proposed. Deep Displacement Model (He et al. 2018) uses a U-shaped neural network that learns how to modify the positions of the particles from initial conditions to a given time in the history of the universe.
Generative models have also been proposed for solving other problems in cosmology, such as generation of galaxies (Regier et al. 2015), adding baryonic effects to the dark matter distribution (Tröster et al. 2019), recovery of certain features from noisy astrophysical images (Schawinski et al. 2017

Cosmological N-body simulations
The distribution of matter, dark matter and other particles in the universe at large scale, under the influence of gravity, forms a convoluted network-like structure called the cosmic web (Bond et al. 1996;Coles and Chiang 2000;Forero-Romero et al. 2009;Dietrich et al. 2012). This distribution contains information vital to the study of dark matter, dark energy, and the very laws of gravity (Abbott et al. 2018;Hildebrandt et al. 2017;Joudaki et al. 2017). Simulations of these various computational cosmological models (Springel 2005;Potter et al. 2017) lead to understanding of the fundamentals of cosmological measurements (Fosalba et al. 2015;Busha et al. 2013), and other properties of the universe . These simulations are done using N -body techniques. N -body techniques simulate the cosmic web using a set of particles in three dimensional space, and evolve their positions with time. This evolution is governed by the underlying cosmological model and the laws of gravity. The end result of an N -body simulation is the position of billions of particles in space, as depicted in Fig. 1. Unfortunately, N -body simulations are extremely resource intensive, as they require days, or even weeks of computation to produce them (Teyssier et al. 2009;Boylan-Kolchin et al. 2009). Moreover, a large number of these N -body simulations is needed to obtain good statistical accuracies, which further increases the computational requirements.
This computational bottleneck opens up a leeway for deep learning and generative models to offer an alternative solution to the problem. Generative models have the potential to be able to learn the underlying data distribution of the N -body simulations using a seed set of N -body samples to train on.
There are multiple techniques for running N -body simulations, which agree well for large scales, but start to diverge for small scales, around wavenumber k = 1 Mpc -1 (Schneider et al. 2016). Moreover, baryonic feedback can also affect the small scale matter distribution (Mead et al. 2015;Huang et al. 2019;Barreira et al. 2019), and large uncertainty remains for these scales. . The density is represented by particle positions in 3D. In this work, the generative models use the representation of this distributon that is based on 3D voxel histogram of the particle positions

Data preprocessing
We produce samples of the cosmic web using standard Nbody simulation techniques. We used L- PICOLA Howlett et al. (2015) to create 30 independent simulations. The cosmological model used was ΛCDM with Hubble constant H 0 = 500h = 350 km/s/Mpc, d dark energy density Ω Λ = 0.724 and matter density Ω m = 0.276. We used the particle distribution at redshift z = 0. The output of the simulator is a list of particles 1024 3 3D positions. To obtain the matter distribution, we first convert it to a standard 256 3 3D cube using histogram binning. We consider these cubes as the raw data for our challenge and can be downloaded at https://zenodo.org/record/1464832. The goal is to build a generative model able to produce new 3D histograms. While 30 samples might seem as a low number of samples to train a deep neural network, each sample contains a large number of voxels. One can also expand the training data by relying on data augmentation, using various rotations and circular translations as described in Appendix 2. Problematically, the dynamic range of this raw data spans several orders of magnitude and the distribution is skewed towards smaller values, with a very elongated tail towards the larger values. Empirically, we find that this very skewed distribution makes learning a generative model difficult. Therefore, we first transform the data using a logarithm-based function, as described in Appendix 1. Note that this transform needs to be inverted before the evaluation procedure.

Evaluation procedure
The evaluation of a GAN is not a simple task Borji (2019), Grnarova et al. (2019). Fortunately, following Rodríguez et al. (2018), we can evaluate the quality of the generated samples with three summary statistics commonly used in the field of cosmology.
1. Mass histogram is the average (normalized) histogram of the pixel value in the image. Note that the pixel value is proportional to the matter density. 2. Power spectrum density (or PSD) is the amplitude of the Fourier modes as a function of their frequency (the phase is ignored). Practically, a set of bins is defined and the modes of similar frequency are averaged together. 3. Peak histogram is the distribution of maxima in the density distribution, often called "peak statistics", or "mass function". Peaks capture the non-Gaussian features present in the cosmic web. This statistic is commonly used on weak lensing data (Kacprzak et al. 2016;Martinet et al. 2017). A peak is defined as a pixel greater than every pixel in its 2-pixels neighbourhood (24 pixels for 2D and 124 for 3D). Other statistics such as Minkowski functionals, three point correlation functions, or phase distributions, could be considered. Nevertheless, we find that the three aforementioned statistics are currently sufficient to compare different generative models.

Distance between statistics
We define a score that reflects the agreement of the 3 aforementioned cosmological measures. Problematically, the scalars forming the 3 vectors representing them have very different scales and their metrics should represent the relative error instead of the absolute one. For this reason, we first compute the logarithm (in base 10) of the computed statistics s. As all statistics are positive but not strictly positive, we add a small value before computing the logarithm, i.e., s log = log 10 (s + ). is set to the maximum value of the statistic averaged over all real samples divided by 10 5 .
At this point, the relative difference connects to the difference of the real and fake s log , i.e.: s r logs f log ≈ log 10 s r s f . One could quantify the error using a norm, i.e.: Es r log -Es f log . However, such a distance does not take into account second-order moments. Rather, we take inspiration from the Fréchet Inception Distance (Heusel et al. 2017). We start by modeling the real and fake log statistics s r log , s f log as two multivariate Gaussian distributions. This allows us to compute the Fréchet Distance (FD) between the two distributions (Fréchet 1957), which is also the Wasserstein-2. The FD between two Gaussian distribution with mean and covariance (m r , C r ) and (m f , C f ) is given by Dowson and Landau (1982): Note that Dowson and Landau (1982) also proves that Tr(C r + C f -2C r C f ) is a metric for the space of covariance matrices. We choose the FD over the Kullback-Leibler (KL) divergence for two reasons: (a) the Wasserstein distance is still an appropriate distance when distributions have non-overlapping support and (b) the KL is computationally more unstable since the covariance matrices need to be inverted.

N-Body mass map generation challenge
Using the FD, we define a score for each statistic as where m, C are computed on the log statistics, i.e: m r = Es r log , m f = Es f log . Along with this manuscript, we release our dataset and our evaluation procedure in the hope that further contributions will improve our solution. All information can be found at https://github.com/nperraud/ 3DcosmoGAN. We hope that this dataset and evaluation procedure can be a tool for the evaluation of GANs in general. Practically, these cosmological statistics are very sensitive. We observed two important properties of this problem. First, a small variation in the generated images still has an impact on the statistics. The statistics can be highly affected by high density regions of the N -body data, and these regions are also the most rare in the training set. Second, while mode collapse may not directly affect the mean of the statistics, it can affect their second order moment significantly. We observed that obtaining a good statistical agreement (and hence a high score), is much more difficult than obtaining generated images that are indistinguishable for the human eye, especially for the 2-dimensional case. We found that the problems of high data volume, large dynamic range of the data, and strict requirement on good agreement in statistical properties of real and generated samples, pose significant challenge for generative models.
Interpretation of the S * scores We report the S * scores for the 3D case in Sect. 4 and the web page hosting the source code also includes baseline scores for the 2D case. We would like to emphasize that these scores are mostly suitable to compare two generative models applied to the same distribution. It is a priori unclear whether these scores provide a meaningful comparison when considering two generative models trained on different distributions. We therefore refrain from relying on such scores to compare the generative models trained on 2D and 3D data since the latter gives access to correlations along the third dimension that are absent from the 2D data. Finally, while in theory, the S * score is unbounded and can be arbitrarily large as d 2 → 0, its empirical evaluation is, in practice, limited by the estimation error of d 2 that depends on the number of samples used to estimate the mean vectors and the covariance matrices (m r , C r ) as well as the moments of the estimated statistics. In the case where the training set contains a large number of samples, one would expect the training score to be indicative of the score of the true distribution. One can see that the estimation error of the first term of (1) depends mostly on the variance of the statistic, while for the second term, it depends on the moments of order 3 and 4. Hence, the estimated score will reflect the variance of the estimated statistics given the number of samples used. A high score therefore means less variance within the corresponding statistic.

Sequential generative approach
We propose a novel approach to efficiently learn a Generative Adversarial Network model (see Sect. 3.1) for 2D and 3D images of arbitrary size. Our method relies on two building blocks: (1) a multi-scale model that improves the quality of the generated samples, both visually and quantitatively, by learning unique features at different scales (see Sect. 3.2), and (2) a training strategy that enables learning images of arbitrary size, that we call "conditioning on neighborhood" (see Sect. 3.3).

Generative adversarial networks (GANs)
Generative Adversarial Networks (GAN) rely on two competing neural networks that are trained simultaneously: the generator G, which produces new samples, and the discriminator D, which attempts to distinguish them from the real ones. During training, it is the generator's objective to fool the discriminator, while the discriminator resists by learning to accurately discriminate real and fake data. Eventually, if the optimization process is carried out successfully, the generator should improve to the point that its generated samples become indistinguishable from the real one. In practice, this optimization process is challenging and numerous variants of the original GAN approach have been proposed, many of them aiming to improve stability including e.g. Roth et al. (2017), Gulrajani et al. (2017), Arjovsky et al. (2017). In our work, we rely on the improved Wasserstein GAN (WGAN) approach introduced in Gulrajani et al. (2017). The latter optimizes the Wasserstein distance instead of the Jensen-Shannon divergence and penalizes the norm of gradient of the critic instead of using a hard clipping as in the original WGAN (Arjovsky et al. 2017). The resulting objective function is where P r is the data distribution and P g is the generator distribution implicitly defined by x x x = G(z), z ∼ p(z).
The latent variable z is sampled from a prior distribution p, typically a uniform or a Gaussian distribution. Eventually, Px x x is defined implicitly by sampling uniformly along Figure 2 Multiscale approach using multiple intermediary GANs, each learning features at different scales and all trained independently from each other. The same approach can in principle be extended to produce higher-resolution images, potentially adding more intermediary GANs if necessary straight lines between pair of points sampled from the true data distribution P r and the generator distribution P g . The weight λ is the penalty coefficient.

Multi-scale model
Our multi-scale approach is inspired by the Laplacian pyramid GAN (Denton et al. 2015). We refer to three image types of different sizes, namely I 3 = 32 × 32 × 32, I 2 = 64 × 64 × 64, I 1 = 256 × 256 × 256 e pixels, where I 2 is a down-scaled version of I 1 and I 3 is a down-scaled version of I 2 . The multi-scale approach is shown in Fig. 2 and uses three different GANs, M 1 , M 2 and M 3 , all trained independently from each other, and can therefore be trained in parallel. We train GAN M 3 to learn the data distribution of images I 3 , while the GANs M 2 and M 1 are conditioned on the images produced by M 3 and M 2 , respectively. In our implementation, we take M 3 to be a normal Deep Convolution GAN (DCGAN) that learns to produce down-scaled samples of size I 3 . We design GANs M 2 and M 1 to have the following properties: (1) they produce outputs using a sequential patch-by-patch approach and (2) the output of each patch is conditioned on the neighboring patches. This procedure allows handling of high data volume, while preserving the long-range features in the data. Moreover, different GANs learn salient features at different scales, which contribute to an overall improved quality of the samples produced the final GAN M 1 . Further details regarding the implementation details are provided in Appendix 3.

Conditioning on neighborhoods
The limited amount of memory available to train a GAN generator makes it impractical to directly produce large image samples. Using current modern GPUs with 16 GB of RAM and a state-of-the-art network architecture, the maximum sample size we were allowed to use was 32 3 , which is far from our target taken to be 256 3 . In order to circumvent this limitation, we propose a new approach that produces the full image (of size 256 3 ) patch-by-patch, each patch being of smaller size (32 3 in our case). This approach is reminiscent of the Pixel-CNN (van den Oord et al. 2016), where 2D images are generated pixel-by-pixel, rather than the entire picture being generated at once. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. The generation process is done using a raster-scan order, which implies that a patch depends on the neighboring patches produced before the current patch. The process illustrated in Fig. 3 is for the 2D case with three neighboring patches; the generalization to three dimensions is straightforward as it simply requires seven neighboring 3D patches.
In the generator, the neighborhood information, i.e. the borders (3 patches for 2D, 7 cubes for 3D), is first encoded using several convolutional layers. Then it is concatenated with the latent variable, is inputed to a fully connected layer before being reshaped into a 3D image. At this point, the down-sampled version of the image is concatenated. After a few extra convolutional layers, the generator produces the lowest rightmost patch with the aim of making it indistinguishable to the patch from the real data. The generator architecture is detailed in Fig. 4. In the case where there is no neighborhood information available, such as at the boundary of a cube, we pad with zeros. The discriminator is given images containing four patches where the lower right patch is either the real data or the fake data generated by the generator. The generator only produces patches of size 32 3 , irrespective of the size of the original image. This way this method can scale to any image size, which is a great advantage. The discriminator only has access to a limited part of the image and ignores the long-range dependencies. This issue, however, is handled by the multi-scale approach described in the previous section. We actually tried a model only conditioned on the neighborhoods as detailed in Appendix 4.3. It ended up performing significantly worse than using the multi-scale approach.

Experimental results
Our approach relies on a recursive approach that progressively builds a 3D cube, starting from a low-resolution and upsampling it at every step. We detail and analyze each step separately in order to understand the impact of each of these steps on the final performance of the full model. We also compare the results of our multi-scale approach to a simpler uni-scale model. Additional details regarding the network architectures and various implementation choices are available in Appendix 3.

Scale by scale analysis of the pipeline
In the following, we describe our model that relies on three different GANs, namely M 1 , M 2 and M 3 , to generate samples at distinct resolutions. We detail each step of the generation process below.
Step 1: low-scale generation (latent code to I 3 ) The low scale generation of a sample of size 32 3 is performed by the WGAN M 3 . The architecture of both the generator and the discriminator is composed a 5 3D convolutional layers with kernels of size 4 × 4 × 4. We use leaky ReLu non-linearity. Further details can be found in Table 2 of Appendix 3. Figure 5 shows the middle slice from 16 3D samples I 3 drawn from the generator of M 3 compared to real samples. In Fig. 6, we additionally plot our evaluation statistics for the 30 samples corresponding to the total number of N -body simulations used to build the dataset. The generated samples drawn from M 3 are generally similar to the true data, both from a visual and from a statistical point of view, although one can observe slight disagreements at higher frequencies. Note that the low number of samples and the size of each of them does not allow us to compute very accurate statistics, hence limiting our evaluation.
Step 2: up-scale (I 3 to I 2 ) Upscaling the sample size from 32 3 to 64 3 is performed by the WGAN M 2 . The architecture is similar to M 3 (see Table 3 in Appendix 3), except that the border patches are first encoded using three 3D convolutional layers and then concatenated with the latent variables before being fed to the generator.
In order to visualize the quality of up-scaling achieved by this first up-sampling step independently from the rest of the pipeline, we first down-scale the real 256 3 samples to 32 3 , and then provide them as input I 3 to the WGAN M 2 . We then observe the result of the up-scaling to 64 3 . Figure 7 shows slices from some generated I 2 samples, as well as the real down-scaled I 3 image and the real I 2 image. We observe a clear resemblance between the up-scaled fake samples and the real samples. The statistics for this step of the generation process are shown in Fig. 8. We observe more significant discrepancies for features that rarely occur in the training set such as for large peaks. This is however not surprising as learning from few examples is in general a difficult problem.
Step 3: up-scale (I 2 to I 1 ) The final upscaling from 64 3 to 256 3 is performed by the WGAN M 1 . The architecture of both the generator and the discriminator of M 1 is composed of eight 3D convolutional layers with inception modules Szegedy et al. (2014). The inception modules have filters of three different sizes: 1 × 1 × 1, 2 × 2 × 2 and 4 × 4 × 4. The input tensor is convolved with all three types of filters, using padding to keep the output shape the same as the input shape. Eventually, the three outputs are summed to recover the desired number of output channels.

Perraudin et al. Computational Astrophysics and Cosmology
To visualize the quality of up-scaling achieved by the final up sampling step, we down-scale the real 256 3 samples to 64 3 , and then provide them as inputs I 2 to the WGAN M 1 . Figure 9 shows the middle slices of two real and fake samples I 1 . Although, the up-sampling factor is 4, the WGAN M 1 is able to produce convincing samples even in terms of high frequency components.

Full multi-scale pipeline
Sample generation The generation process used to produce new samples proceeds as follows. First, a latent variable is randomly drawn from the prior distribution and is fed to M 3 which produces a 32 3 low-resolution cube. The latter is then upsampled by M 2 . At first, all border patches shown in Fig. 3 are set to zero. Then the 64 3 cube is built recursively (in 2 3 = 8 steps) where at each step, the previously generated patches are re-fed as borders into the generator. The generation of the full 256 3 cube is done in a similar fashion by M 1 . Note that this last step requires the generation of 8 3 = 512 patches/smaller cubes. An illustration, adapted to the 2D case for simplicity, is shown in Fig. 10. The full generation process takes around 7 seconds to produce one sample of size 256 3 using a single GPU node with Perraudin et al. Computational Astrophysics and Cosmology (2019) 6:5 Page 10 of 17

Figure 10
Up-scaling patches in 3-steps, using 3 different WGANs 24 cores compared to approximately 6-8 hours for a fast and approximate L-Picola (Howlett et al. 2015) simulator running on two identical nodes. Figure 11 shows a few slices from a 3D fake I 1 image generated using the full pipeline, alongside a random real I 1 image. Figure 12 shows the summary statistics of 500 GAN generated samples, compared to that of real samples. The visual agreement between the real and generated samples is good, although a trained human eye can still distinguish between real and fake samples. In particular, a careful visualization reveals that the transitions between the different patches are still imperfect and that the generated samples have less long range filaments than the true samples. The summary statistics agree very well for the middle range of mass density, with slight disagreements at the extremes. The shape of the power spec-  Table 1 Scores computed for the 30 GAN-generated 256 3 images using the full multi-scale pipeline. We refer to (2) and (1)  trum density (PSD) matches well, but the overall amplitude is too high for most of the k range. Naturally, we would expect the error of the multi-scale pipeline to be the result of the accumulation of errors from the three upsampling steps. In practice, we observe a lot of similarities between the statistics shown in Fig. 13 (from the last upscaling step) and in Fig. 12 (from the full pipeline). Finally, Table 1 presents the scores obtained for the 30 cubes of size 256 3 using the full multi-scale pipeline as generated by the full GAN sequence. As explained in Sect. 2.4, the reference score gives an indication of the variability within the training set, i.e, how similar are two independent sample sets from the training set. The reference mass histogram score is much higher than the PSD and the Peak histogram due to the fact that this statistic has in general much less variance. For that reason, it is probably easier to estimate, as indicated by the score of our pipeline. Cosmological analyses typically require the precision of less than few percent on these summary statistics, which is achieved by the GAN method only for specific scales, peak and density values. The scores of multiscale approach are much higher than the ones of the simpler single-scale approach described in the following section.

Comparison to the single scale model
In order to evaluate the effectiveness of the multi-scale approach, we compare our model to a uni-scale variant that is not conditioned on the down-sampled image. Here each patch is generated using only its neighboring 7 border patches and a latent variable that is drawn from the prior distribution. More simply, it is a direct implementation of the principle displayed in Fig. 3 left in 3D, where the discriminator is not conditioned on the down-sampled image. An important issue with the uni-scale model is that the discriminator never s more than 64 3 pixels at once. Hence, it is likely to fail to capture long-range correlations. Practically, we observed that the training process is unstable and subject to mode collapse. At generation time. the recursive structure of the model often lead to repeating pattern. The resulting models were of bad quality as shown in Fig. 14  and 15. This translates to the score presented in Table 1. This experiment demonstrates that conditioning on data at lower scales does play an important role in generating samples of good quality.

Conclusion
In this work we introduced a new benchmark for the generation of 3D N -body simulations using deep generative models. The dataset is made publicly available and contains matter density distributions represented as cubes of 256 × 256 × 256 voxels. While the performance of the generative model can be measured by visual inspection of the generated samples, as commonly done on datasets of natural images, we also offer a more principled alternative based on a number of summary statistics that are commonly used in cosmology. Our investigation into this problem has revealed that several factors make this task challenging, including: (i) the sheer volume of each data sample, which is not straightforwardly tractable using conventional GAN architectures, (ii) the large dynamic range of the data that spans several orders of magnitude; which requires a custom-designed transformation of the voxel values, and (iii) the need for high accuracy required for the model to be practically usable for a real cosmological study. Adding to the difficulties of (i) and (ii), this also requires accurately capturing features that are rare in the training set.
As a first baseline result for the newly introduced benchmark, we proposed a new method to train a deep generative model on 3D images. We split the generation process into the generation of smaller patches as well as condition on neighboring patches. We also apply a multi-Page 12 of 17 We find that the proposed baseline produces N -body cubes with good visual quality compared to the training data, but significant differences can still be perceived. Overall, the summary statistics between real and generated data match well, but notable differences are present for high voxel values in the mass and peak histograms. The power spectrum has the expected shape, with amplitude that is too high for most of the k values. The overall level of agreement is promising, but can not yet be considered as sufficient for practical applications in cosmology. Further development will be needed to achieve this goal; in order to encourage it, we have made the dataset and the code publicly available. In our current model, the discriminator only has access to a partial view of the final image. The dependencies at small scale that may exist between distant patches are therefore not captured by the discriminator. Extending this model to allow the discriminator to have a more global view would be the next logical extension of this work. We have also observed empirically that the extreme right tail of the histogram is often not fully captured by the generator. Designing architectures that would help the generative model to handle large dynamic range in the data could further improve performance. One could also get further inspiration from the literature on generative models for video data, such as Vondrick et al. (2016), Xiong et al. (2018), Saito et al. (2017). Given the observations made in our experiments, one might for instance expect Perraudin et al. Computational Astrophysics and Cosmology (2019) 6:5 Page 13 of 17 that the two stage approach suggested in Saito et al. (2017) could address some of the problem seen with the right tail of the distribution. Another interesting research direction would be to condition the generation process on cosmic time or cosmological parameters. One could for instance rely on a conditional GAN model such as Mirza and Osindero (2014).

Appendix 1: Input data transformation
The general practice in machine learning is that the input data is first standardized before giving it to any machine learning model. This preprocessing step is important as many neural networks and optimization blocks are scaledependent, resulting is most architectures working optimally only when the data is appropriately scaled. Problematically, because of the physical law of gravity, most of the universe is empty, while most of the matter is concentrated in a few small areas and filaments. The dataset had the minimum value of 0 and the maximum value of 185,874, with most of the voxels concentrated close to zero, and significantly skewed towards the smaller values and has an elongated tail towards the larger ones. Even with standardization, it is difficult for a generative model to learn very sparse and skewed distributions. Hence we transform the data using a special function, in a slightly different way than (Rodríguez et al. 2018).
In order to preserve the sensitivity of the smaller values, a logarithm-based transformation function is a good candidate. Nevertheless, to maintain the sensitivity to the large values, we should favor a linear function. In our attempt to coincide the best of the two regimes, we design a function that is logarithmic for lower values and linear for large one, i.e after a specified cutoff value. The exact forward transformation function, y = f (x, c, s) is defined as: where f (x, c) = 3 log e (x + 1) if x ≤ c, 3(log e (c + 1) + x c -1) otherwise.
Here c and s are selected hyper-parameters. For our experiments we found c = 20,000 and s = 3 to be good candidates. After the forward transformation, the distribution of the data becomes similar to a one-sided Laplacian distribution. We always invert this transformation once new data is generated, before calculating the summary statistics.

Appendix 2: N-Body training set augmentation
As N -body simulations are very expensive, we need to make sure to use all the information available using an optimal dataset augmentation. To augment the training set, the cubes are randomly rotated by multiples of 90 degrees and randomly translated along one of the 3 axes. The cosmological principle states that the spatial distribution of matter in the universe is homogeneous and isotropic when viewed on a large enough scale. As a consequence there should be no observable irregularities in the large scale structure over the course of evolution of the matter field that was initially laid down by the Big Bang (Dodelson 2003). Hence, the rotational and translational augmentations do not alter the data distribution that we are trying to model in any way. Moreover, we note that use circular translation in our augmentation scheme. This is possible because N -body simulations are created using the periodic boundary condition: a particle exiting the box on one side enters it immediately on the opposite side. Forces follow the same principle. This prevents the particles from collapsing to the middle of the box under gravity. These augmentations are important given that we only have 30 Nbody cubes in our training set.

Appendix 3: Architecture & implementation details
Implementation details We used Python and Tensorflow to code the models which are trained on GPUs with 16 GB of memory. All the GANs are WGANs with a Gaussian prior distribution. Using a single GPU, it takes around 7 seconds to produce one sample of size 256 3 compared to approximately 30 hours for a precise N -body simulator running on two nodes with 24 cores and a GPU, such as PkdGrav3 (Potter et al. 2017). In this project, a fast and approximate L-Picola (Howlett et al. 2015) simulator was used, with approximately 6 hours of runtime on two nodes. The batch size is set to 8 for all the experiments. All Wasserstein GANs were trained with using a gradient penalty loss with γ GP = 10, as described in Gulrajani et al. (2017). We use RMSprop with a learning rate 3 · 10 -5 for both the generator and discriminator. The discriminator was updated 5 times per generator update.