 Research
 Open Access
On the parallelization of stellar evolution codes
Computational Astrophysics and Cosmology, volume 5, Article number: 3 (2018)
Abstract
Multidimensional nucleosynthesis studies with hundreds of nuclei linked through thousands of nuclear processes are still computationally prohibitive. To date, most nucleosynthesis studies rely either on hydrostatic/hydrodynamic simulations in spherical symmetry, or on post-processing simulations using temperature and density versus time profiles directly linked to huge nuclear reaction networks.
Parallel computing has been regarded as the main enabling factor of computationally intensive simulations. This paper explores the different pros and cons of the parallelization of stellar codes, providing recommendations on when and how parallelization may help to improve the performance of a code for astrophysical applications.
We report on different parallelization strategies successfully applied to the spherically symmetric, Lagrangian, implicit hydrodynamic code SHIVA, extensively used in the modeling of classical novae and type I X-ray bursts.
When only the matrix buildup and inversion processes in the nucleosynthesis subroutines are parallelized (a suitable approach for post-processing calculations), the huge amount of time spent on communications between cores, together with the small problem size (limited by the number of isotopes of the nuclear network), results in a much worse performance of the parallel application compared to the 1-core, sequential version of the code. Parallelization of the matrix buildup and inversion processes in the nucleosynthesis subroutines is therefore not recommended unless the number of isotopes adopted largely exceeds 10,000.
In sharp contrast, speedup factors of 26 and 35 have been obtained with a parallelized version of SHIVA, in a 200-shell simulation of a type I X-ray burst carried out with two nuclear reaction networks: a reduced one, consisting of 324 isotopes and 1392 reactions, and a more extended network with 606 nuclides and 3551 nuclear interactions. Maximum speedups of ∼41 (324-isotope network) and ∼85 (606-isotope network) are also predicted for 200 cores, stressing that the number of shells of the computational domain constitutes an effective upper limit on the maximum number of cores that can be used in a parallel application.
Introduction
Computational astrophysics has revolutionized our knowledge of the physics of stars. In parallel with the progress achieved in observational astrophysics (through high-resolution spectroscopy and photometry, sometimes including multiwavelength observations with space-borne and ground-based observatories), cosmochemistry (isotopic abundance determinations in presolar meteoritic grains) and nuclear physics (determination of nuclear cross sections at or close to stellar energies), computers have provided astrophysicists with the appropriate arena in which complex physical processes operating in stars (e.g., rotation, convection and mixing, mass loss...) can be properly modeled (see, e.g., Ref. Bodenheimer et al. 2006).
Stellar evolution models are becoming increasingly sophisticated and complex. The dawn of supercomputing and multicore machines has made it possible to (partially) overcome the limitations imposed by the assumption of spherical symmetry. The payoff, however, is still very expensive. Two- and, especially, three-dimensional simulations are so computationally demanding that other simplifications, such as the use of truncated nuclear reaction networks, large enough to account for the energetics of the star, must be adopted. Multidimensional nucleosynthesis studies with hundreds of nuclear species linked through thousands of nuclear processes are still prohibitive. Accordingly, most of our understanding of element synthesis in stars relies either on hydrostatic/hydrodynamic simulations in spherical symmetry (1D), or on post-processing simulations using temperature and density versus time profiles extracted from stellar evolution models, and directly linked to huge nuclear reaction networks. Even such post-processing calculations can sometimes become computationally very intensive: for instance, the sensitivity study of the effect of nuclear uncertainties on X-ray burst nucleosynthesis performed by Parikh et al. (2008), requiring 50,000 post-processing calculations with a network containing 600 species (from H to ^{113}Xe) and more than 3500 nuclear reactions, took about 9 CPU months on a single-core computer.
In the 1D codes used in the modeling of a wide range of astrophysical scenarios, such as classical novae, X-ray bursts, supernovae, or asymptotic giant branch (AGB) stars (e.g., FRANEC, Limongi and Chieffi 2003, Chieffi and Limongi 2013; MESA, Paxton et al. 2011, 2013; SHIVA, José and Hernanz 1998, José 2016), stars are divided into \({\sim}100\text{s}\)–\(1000\text{s}\) of concentric shells. They also incorporate a similar number of nuclear processes, which link hundreds of nuclear species. The subroutines that handle the suite of different nuclear processes and the associated nucleosynthesis are often the most time-consuming components of a stellar evolution code (unless very small nuclear reaction networks are used). Different strategies have been adopted to reduce the computational cost of such simulations, thereby improving the performance of a code. One possibility relies on the use of more efficient numerical techniques to handle the integration of large nuclear networks (Timmes 1999, Longland et al. 2014). Another involves parallelization of the stellar code, so that the high computational cost can be split and handled by different cores working cooperatively.
Parallel computing has been regarded as the main enabling factor of more precise, computationally intensive simulations. Indeed, most of the existing multidimensional stellar codes have been parallelized. Naively, parallelization simply amounts to applying several cores to the solution of a single problem, so that speedups are accomplished by executing independent, non-sequential portions of the code concurrently. In practice, however, parallelization comes at a high cost in both engineering and programming effort. On top of that, for specific applications it may turn out that parallelization does not pay off at all. Therefore, the main goal of this paper is to explore the advantages (and disadvantages) associated with the parallelization of stellar codes, outlining recommendations on when and how parallelization may help to improve the performance of a code for astrophysical applications. We report speedup factors ranging between 26 and 35, which allow the execution of hydrodynamic simulations coupled to large nuclear reaction networks in affordable times.
The structure of this paper is as follows: the different strategies in the parallelization of a stellar evolution code (and of the matrix buildup and inversion processes in the nucleosynthesis subroutines) are described in Sects. 2 and 3. Special emphasis is devoted to the expected speedups as a function of the size of the nuclear reaction network and the number of cores involved in the simulation. The performance of the parallelized version of the SHIVA code is qualitatively compared with that of other codes, with similar or different architectures, in Sect. 4. The main results and conclusions of this work, together with a list of open issues, are also summarized in Sect. 4.
Parallelization of a stellar code with a decoupled, time-explicit treatment of the nucleosynthesis subroutines
The different strategies in the parallelization of a stellar evolution code described in this paper rely on the Message Passing Interface (MPI) communication protocol, and have been directly applied to SHIVA, a one-dimensional (spherically symmetric), hydrodynamic code, in Lagrangian formulation, built originally to model classical nova outbursts (see Refs. José and Hernanz 1998, José 2016, for details). The code uses a comoving (Lagrangian) coordinate system, such that all physical variables (i.e., luminosity, L, velocity, u, distance to the stellar center, r, density, ρ, and temperature, T) are evaluated in a number of grid points directly attached to the fluid. In essence, this corresponds to a system of 5N variables (unknowns), where N is the overall number of shells of the computational domain. SHIVA’s computational flow is depicted in Fig. 1.
At each timestep, the set of 5N unknowns is determined from a system of 5N linearized equations (i.e., conservation of mass, momentum and energy, the definition of the Lagrangian velocity, and an equation that accounts for energy transport), which is solved by means of an iterative technique, Henyey’s method (Henyey et al. 1964). The basic set of stellar structure equations, supplemented by a suitable equation of state (EOS, which includes radiation, ions, and electrons with different degrees of degeneracy), opacities and a nuclear reaction network, constitutes the building blocks of any stellar evolution code. In SHIVA, convection and nuclear energy production are decoupled from the set of hydrodynamic equations and handled by means of a time-explicit scheme. In general, partial differential equations involving time derivatives can be discretized in terms of variables evaluated (i.e., known) at the previous timestep (explicit schemes) or at the current timestep (implicit schemes). Explicit schemes are usually easier to implement than implicit schemes. However, in explicit schemes the timestep is limited by the Courant–Friedrichs–Lewy condition, which prevents any disturbance traveling at the sonic speed from traversing more than one numerical cell per timestep; violating this condition leads to unphysical results. Implicit schemes allow larger timesteps, with no such precondition, but they require an iterative procedure to solve the system at each step. In SHIVA, all compositional changes driven by nuclear processes or convective transport are evaluated at the end of the iterative procedure,^{Footnote 1} once the temperature, density and the other physical variables have been determined at each computational shell. In particular, SHIVA implements a two-step, time-explicit scheme to calculate the new chemical composition at each timestep (see Ref. Wagoner 1969).
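As a minimal illustration of the Courant–Friedrichs–Lewy constraint discussed above, the following sketch computes the largest stable timestep for an explicit scheme; the shell widths, sound speeds and safety factor are hypothetical values chosen for illustration, not taken from SHIVA.

```python
# Hypothetical sketch of the CFL constraint on an explicit scheme: the
# timestep must not let a sound wave cross more than one numerical cell.
# Shell widths (cm) and sound speeds (cm/s) below are illustrative only.
def cfl_timestep(widths, sound_speeds, safety=0.5):
    """Largest stable explicit timestep over all shells."""
    return safety * min(w / c for w, c in zip(widths, sound_speeds))

dt = cfl_timestep(widths=[1.0e5, 2.0e5], sound_speeds=[1.0e8, 1.0e8])
# dt = 0.5 * min(1.0e-3, 2.0e-3) = 5.0e-4 s
```

An implicit scheme is free of this bound, at the price of the iterative (Henyey) solution described above.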
While such decoupling of the nucleosynthesis subroutines from the hydrodynamic equations has a minor effect on the results, it has a huge impact on the speedup factors that can be obtained after parallelization (see Sect. 4, for a more detailed discussion).
Parallelization strategy
The maximum theoretical speedup accomplished by a parallel application is defined as the ratio of the total execution times of the serial application, \(T_{\mathrm{S}}\), and the parallel application, \(T_{\mathrm{P}}\):

$$ S(N_{\mathrm{P}}) = \frac{T_{\mathrm{S}}}{T_{\mathrm{P}}} = \frac{1}{(1-p) + p/N_{\mathrm{P}} + (T_{\mathrm{comm}} + T_{\mathrm{in}} + T_{\mathrm{out}})/T_{\mathrm{S}}}, $$
(1)
where \(N_{\mathrm{P}}\) is the number of processes participating in the parallel computation, \(T_{\mathrm{comm}}\) the time devoted to communications and message passing amongst cores, \(T_{\mathrm{in}}\) and \(T_{\mathrm{out}}\) are the initialization and output times, and \(p = T_{\mathrm{pp}}/T_{\mathrm{S}}\) is the so-called parallel content, or ratio of the serial execution time of the potentially parallel portion of the code (e.g., a subroutine), \(T_{\mathrm{pp}}\), to that of the overall application, \(T_{\mathrm{S}}\). The maximum attainable speedup^{Footnote 2} is, therefore, determined by the ratio between \(T_{\mathrm{comm}}\) and \(T_{\mathrm{S}}\). For \(T_{\mathrm{comm}} = 0\), Eq. (1) reduces to the well-known Amdahl’s law, which provides an estimate of the theoretical speedup as a function of the parallel content and the number of cores used (Amdahl 1967). If the processes need to communicate frequently, the cost of communication takes a heavy toll on the total execution time. In this situation, even speedups below unity are possible (i.e., the parallel application runs slower than its sequential counterpart), an outcome that must obviously be avoided.
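The speedup model of Eq. (1) can be sketched as a small function; this is a hedged reconstruction in which the variable names are ours and the overhead term lumps \(T_{\mathrm{comm}}\), \(T_{\mathrm{in}}\) and \(T_{\mathrm{out}}\) together as a fraction of \(T_{\mathrm{S}}\).

```python
# Sketch of the speedup model of Eq. (1). With overhead_frac = 0 it
# reduces to Amdahl's law. p is the parallel content, n_p the number of
# cores, and overhead_frac = (T_comm + T_in + T_out) / T_S.
def theoretical_speedup(p, n_p, overhead_frac=0.0):
    return 1.0 / ((1.0 - p) + p / n_p + overhead_frac)

theoretical_speedup(p=1.0, n_p=8)   # ideal scaling: 8.0
theoretical_speedup(p=0.5, n_p=8)   # Amdahl-limited: ~1.78
```

Note how, for \(p = 0.5\), even infinitely many cores could not push the speedup beyond 2: the serial fraction dominates.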
A first analysis of SHIVA’s architecture suggests two main points where parallelization might be exploited: the solution of the linearized system of equations for the determination of the physical variables (i.e., Henyey’s method), and the multizone calculation of the nuclear energy generation rate and nucleosynthesis. The first one relies on the parallel solution of a system of 5N linear equations, where N is the number of shells adopted in the simulation. For a typical astrophysical application, \(N \sim 100\)–\(1000\). However, as will be discussed later (see Sect. 3.3), such a parallel approach only achieves acceptable performance for ≥10,000 equations. Very modest speedup factors are obtained otherwise (i.e., less than a factor of 2), which do not justify the effort. In contrast, the multizone calculation of nuclear energy generation and nucleosynthesis is computed independently at each shell, and can yield large speedup factors if parallelized. This is the specific parallelization strategy adopted hereafter, and presented in this Section. Each core goes redundantly through almost all processing stages. However, with regard to the nucleosynthesis part, each core performs the computation on a non-overlapping subset of shells. After this, each core broadcasts its (partial) results, and from this stage onward, the simulation proceeds again on all cores redundantly. In the parallelization strategy adopted, there are only two points of communication: at the beginning of the simulation (where the root process broadcasts all the initial information and parameters to the rest of the processes), and repeatedly at each (successful) iteration, after the distributed computation of the nucleosynthesis has been performed. This choice maximizes parallel performance by keeping communication points to a minimum or, in other words, by maximizing the computation-to-communication ratio (McKenney 2011).
In order to obtain equivalent workloads on all cores, the total number of shells of the computational domain must be split up into approximately equally sized groups. The shells assigned to each core are consecutive, so that the different cores compute energy and nucleosynthesis for shells \(1 \ldots j\), \(j+1 \ldots i\), \(i+1 \ldots m\), and so on. The last core is assigned shells \(m+1\) to N.
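The contiguous, balanced split described above can be sketched as follows; the function name and return format are illustrative, not taken from SHIVA.

```python
# Sketch: split N consecutive shells into nearly equal contiguous blocks,
# one per core. Any remainder shells are spread over the first cores so
# that block sizes differ by at most one.
def partition_shells(n_shells, n_cores):
    base, extra = divmod(n_shells, n_cores)
    blocks, start = [], 1
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        blocks.append((start, start + size - 1))  # inclusive shell range
        start += size
    return blocks

partition_shells(200, 3)  # [(1, 67), (68, 134), (135, 200)]
```

With 200 shells and 42 cores (the configuration reported in Sect. 2.3), each core receives 4 or 5 consecutive shells.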
Performance prediction
At each iteration, each core broadcasts the new abundances obtained in the computation of its assigned shells. This represents an ALLGATHER communication procedure (Graham 2012), where all processes get the data sent by the other processing cores. The information is thereafter distributed by means of a ring algorithm where, in the first step, each core i sends its contribution to core \(i+1\) and receives the contribution from core \(i-1\) (with wraparound). Subsequently, each core i forwards to core \(i+1\) the data received from core \(i-1\) in the previous step (Pacheco 1997). The communication time taken by this algorithm is given by Thakur and Gropp (2003):

$$ T_{\mathrm{comm}} = (N_{\mathrm{P}} - 1)\,\alpha + \frac{N_{\mathrm{P}} - 1}{N_{\mathrm{P}}}\, n\,\beta, $$
(2)
where n is the total data size received by any core from all other cores, α is the latency or startup time per message (which is independent of the message size), and β is the transfer time per byte. Actual values of α and β obtained in the simulations performed with the SHIVA code are given in Sect. 2.3. Note that both the latency and the transfer time depend specifically on the speed of the network and of the communications of the computer cluster (or multicore computer) where the parallel application is being executed. They also depend on the heterogeneity of the cores (e.g., workstations with different processing power, or different Operating Systems), and ultimately on how finely the cluster has been tuned to optimize data transfer and communications. Such quantities are difficult to estimate analytically, and are frequently measured using real data, extrapolating communication times from observations (Foster 1995).
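The ring ALLGATHER cost model above can be sketched as a one-line function; this is a reconstruction of the Thakur and Gropp expression, and the default α and β are the values measured for SHIVA (quoted in Sect. 2.3), so they should be re-measured on any other cluster.

```python
# Sketch of Eq. (2): cost of the ring ALLGATHER. Each of the N_P - 1
# steps pays one message latency (alpha), and in total each core receives
# (N_P - 1)/N_P of the n bytes at beta seconds per byte.
def t_comm_ring(n_bytes, n_cores, alpha=1.0e-5, beta=5.0e-8):
    return (n_cores - 1) * alpha + (n_cores - 1) / n_cores * n_bytes * beta
```

Both terms grow monotonically with the number of cores, which is why \(T_{\mathrm{comm}}\) eventually caps the attainable speedup.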
Results
Figure 2 shows the excellent speedup factors accomplished in a parallel simulation of a type I X-ray burst performed with SHIVA, with \(N=200\) shells. Parallel execution times are compared with the serial execution time obtained on a single core. Simulations have been carried out with two different nuclear reaction networks: a reduced one, consisting of 324 isotopes and 1392 reactions (hereafter, Model 1), and a more extended network with 606 nuclides and 3551 nuclear interactions (Model 2; see Ref. José et al. 2010). Speedup factors of 26 and 35 are achieved in Models 1 and 2, respectively, when 42 cores are used in parallel to execute the application. Figure 2 also highlights the nonlinear scaling of the speedup factor with the number of cores adopted in the parallel execution. Both \(T_{\mathrm{comm}}\) and the overhead time vary with the number of cores adopted. This variation depends critically on the type of communication (e.g., all-to-all, broadcast, point-to-point sends and receives, gather, allgather, etc.^{Footnote 3}), but at any rate both \(T_{\mathrm{comm}}\) and the overhead time increase monotonically with the number of cores adopted, with a much more pronounced dependence of \(T_{\mathrm{comm}}\) on \(N_{\mathrm{P}}\) (Thakur and Gropp 2003).
The results obtained approach the performance of a perfect parallel application: the computation-to-communication ratio is large enough that the processing work can be distributed in an extremely efficient way amongst cores. Accordingly, larger speedups are expected if the number of cores used in the parallel execution is increased. Figure 2 also displays the theoretical speedups expected for both simulations, as given by Eq. (1). Such theoretical estimates do not take into account the communication or synchronization times, and as a result, the observed performance always falls somewhat short of the theoretical, ideal speedup.
As expected, higher speedups are obtained when the problem size is increased by using a nuclear reaction network with 606 isotopes and 3551 reactions (i.e., Model 2). The speedup accomplished in this simulation exceeds by approximately 34% that of the execution with the reduced nuclear network (speedup factors of 35 versus 26, respectively). This is a direct consequence of increasing the problem size, which is essentially equivalent to increasing the amount of parallelizable computation (that is, the nucleosynthesis calculation); the parallel content therefore also increases (\(p = 0.99127\) for Model 1, whereas \(p = 0.99738\) for the simulation with the larger nuclear reaction network, i.e., Model 2). This, in turn, raises the curve of the modelled, theoretical speedup, diminishing the gap with respect to an ideal speedup.
The theoretical performance of the parallelized SHIVA code, based on Eq. (1) and Eq. (2), and taking into account the communication time between cores, can be expressed as:

$$ S(N_{\mathrm{P}}) = \frac{1}{(1-p) + p/N_{\mathrm{P}} + \bigl[(N_{\mathrm{P}}-1)\,\alpha + \frac{N_{\mathrm{P}}-1}{N_{\mathrm{P}}}\, n\,\beta \bigr]/T_{\mathrm{S}}}, $$
(3)
where n and \(T_{\mathrm{S}}\) are specific to the simulation being executed, and the latency α and the transfer time per byte β depend solely on the communications infrastructure. Numerical experiments^{Footnote 4} yield \(\alpha= 1 \times10^{-5}\) s and \(\beta= 5 \times10^{-8}\) s. At the end of each iteration, all cores gather the nucleosynthesis results, together with the overall nuclear energy released and the values predicted for the new timestep, Δt (e.g., based on the variation of the most abundant isotopes, as in Wagoner’s method), from all shells. Taking all this into account, the total amount of bytes being transmitted works out as:
for Models 1 and 2, respectively. The expected performance of the parallel SHIVA code (Eq. (3)) for up to 200 cores is shown in Fig. 3, together with the experimental values obtained up to \(N_{\mathrm{P}}=42\) cores on the Hyperion cluster. It is interesting to note that there is still room for improvement: indeed, maximum speedups of ∼41 and ∼85 are predicted when using 200 cores, for Models 1 and 2, respectively. The scaling efficiency (i.e., the ratio of actual to ideal scaling) is 21% for Model 1 (speedup of 41 on 200 cores) and 43% for Model 2 (speedup of 85 on 200 cores). At this point, it is important to stress that, as a result of the parallelization strategy adopted, the number of shells of the computational domain constitutes an effective upper limit on the maximum number of cores that can be used in the parallel application. Moreover, the expected performance of the parallel SHIVA code, and in general of any stellar evolution code, is limited both by the number of shells adopted and by the potentially parallel portion of the code.
It is also important to note that the performance model presented here is valid for the execution environment discussed, and cannot be directly extrapolated to other clusters, which may have different latencies and communication bandwidths. That said, the model can be taken as a reference for the capabilities of a parallelized application, and can be used to decide whether access time at a supercomputing facility, where latencies and transmission bandwidths are highly optimized for parallel executions, should be requested. On those platforms, even better speedup factors can be expected.
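The full performance model can be sketched as a single function; this is a hedged reconstruction combining the speedup expression of Eq. (1) with the ring ALLGATHER cost of Eq. (2). The α and β defaults are the values measured for SHIVA quoted above, while n_bytes and t_serial are simulation-specific inputs that must be measured for each run.

```python
# Sketch of Eq. (3): speedup including the ring ALLGATHER cost.
# n_bytes is the (simulation-specific) data volume exchanged, t_serial
# the serial execution time T_S; alpha and beta as measured in Sect. 2.3.
def modelled_speedup(n_p, p, n_bytes, t_serial, alpha=1.0e-5, beta=5.0e-8):
    t_comm = (n_p - 1) * alpha + (n_p - 1) / n_p * n_bytes * beta
    return 1.0 / ((1.0 - p) + p / n_p + t_comm / t_serial)
```

With zero communication cost the function reduces to Amdahl's law, and the communication term explains why the measured curves in Fig. 3 bend away from the ideal ones as \(N_{\mathrm{P}}\) grows.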
Parallelization of the nuclear energy generation and nucleosynthesis subroutines
In this section, we report on the expected speedups resulting from parallelization of the matrix buildup and inversion processes in the nucleosynthesis subroutines, for different sizes of the adopted nuclear reaction networks. This is a completely different parallelization approach from the one described in Sect. 2. In the strategy described for SHIVA, the method of solving the system of equations was not modified, but executed in parallel on non-overlapping subsets of shells. Here, it is the buildup and inversion of the matrix containing the rates of the different nuclear interactions (i.e., the solution of the system of equations) that is parallelized. The strategy adopted in this section is of interest for stellar evolution models that rely on reasonably large nuclear reaction networks, and also for post-processing nucleosynthesis calculations, in which temperature and density versus time profiles (frequently extracted from stellar models) are directly coupled to huge nuclear networks.
Numerical treatment of nuclear abundances
The time evolution of the chemical composition of a star relies on a set of differential equations that take into account all possible creation and destruction channels for the species included in the network. After linearization (e.g., by finite differences), the overall system of equations can be written in matrix form as:

$$ \mathbf{A}\,\mathbf{X} = \mathbf{X_{0}}, $$
(5)
where \(\mathbf{X_{0}}\) is the matrix containing the set of abundances of the previous (or initial) step, A is the matrix containing the rates of the different nuclear interactions, and X is the matrix with the new (unknown) abundances.
Different methods have been reported to solve Eq. (5), such as Wagoner’s two-step linearization technique (Wagoner 1969), Bader–Deuflhard’s semi-implicit method (Bader and Deuflhard 1983), or Gear’s backward differentiation technique (Gear 1971). The performance of these different integration methods for stellar nucleosynthesis calculations has been analyzed in a number of studies (see Refs. Timmes 1999, Longland et al. 2014, and references therein). Here, we will explore the gain in performance driven by parallelization of one particular method: Wagoner’s. As described in Ref. Prantzos et al. (1987), Wagoner’s two-step linearization procedure exploits the special properties of matrix A, which consists of an upper left square matrix, an upper horizontal band, a left vertical band, and a diagonal band. The sparse nature of matrix A results from the fact that the different isotopes, when ordered in terms of increasing atomic number, are only linked with close neighbors through nuclear interactions that usually involve light particles^{Footnote 5} (e.g., n, p, α).
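The sparsity structure just described can be made concrete with a small sketch; the sizes, the number of "light" species, and the bandwidth are made-up illustrative parameters, not the actual dimensions of any network used in this work.

```python
# Illustrative sketch of the sparsity pattern of matrix A: an upper-left
# square block plus an upper horizontal band and a left vertical band
# (light particles such as n, p, alpha couple to everything), and a
# diagonal band (reactions linking close neighbours in atomic number).
def sparsity_pattern(n_iso, n_light=3, bandwidth=4):
    pat = [[0] * n_iso for _ in range(n_iso)]
    for i in range(n_iso):
        for j in range(n_iso):
            if i < n_light or j < n_light or abs(i - j) <= bandwidth:
                pat[i][j] = 1
    return pat
```

For realistic network sizes the fraction of nonzero entries is small, which is precisely what sparse solvers such as MUMPS exploit.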
Parallelization strategy
A typical nucleosynthesis calculation consists of the following main processing steps:

1. Interpolation (calculation) of reaction rates from tables (analytic fits), for the specific temperature and density of each shell, at a given time.
2. Assembly of matrices \(\mathbf{X_{0}}\) and A.
3. Solution of Eq. (5), for the new abundances of all chemical species at each shell.
4. Convergence check; determination of the new timestep, Δt.
5. Determination of the overall nuclear energy released at each shell.
Stages 2 and 3 are by far the most time-consuming parts of a simulation (97% of the execution time in the simulations reported in Sect. 3.3). Consequently, the parallelization strategy adopted in this work focuses on providing the most efficient partitioning of matrix A, as required by the parallel solver for the solution of the system of equations.
Reaction-rate determinations are partitioned amongst cores, such that at each iteration step each core performs the interpolation (calculation) of only those reaction rates that are strictly needed for the construction of the local partition of matrix A (Eq. (5)). Given a typical nuclear reaction of the form \(i(j,k)l\), there are 8 possible combinations contributing to matrix A: \(\mathbf{A}(i,i)\), \(\mathbf{A}(i,j)\), \(\mathbf{A}(j,j)\), \(\mathbf{A}(j,i)\), \(\mathbf{A}(k,i)\), \(\mathbf{A}(k,j)\), \(\mathbf{A}(l,i)\), and \(\mathbf{A}(l,j)\), according to the linearization technique described in Ref. Wagoner (1969). The parallel solution of the system of equations is obtained using MUMPS^{Footnote 6} (Amestoy et al. 2001, 2004), a widely used software package for the solution of large sparse systems of linear algebraic equations, of the form \(\mathbf{A}\mathbf{x}=\mathbf{b}\), on distributed-memory (parallel) computers.
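The eight contributions of a reaction \(i(j,k)l\) listed above can be sketched as an accumulation into a sparse (dictionary-based) matrix; note that the actual signs and weights of each term in Wagoner's linearization are deliberately omitted here, so this illustrates only which entries of A a single reaction touches.

```python
# Sketch: accumulate the eight entries of matrix A touched by a reaction
# i(j,k)l, per the combinations listed in the text. The sign/weight
# conventions of Wagoner's linearization are simplified for illustration.
def add_reaction(A, i, j, k, l, rate):
    for row, col in [(i, i), (i, j), (j, j), (j, i),
                     (k, i), (k, j), (l, i), (l, j)]:
        A[(row, col)] = A.get((row, col), 0.0) + rate

A = {}
add_reaction(A, i=0, j=1, k=2, l=3, rate=2.5)
len(A)  # 8 distinct entries touched
```

Partitioning the reaction list amongst cores thus lets each core assemble only its local rows of A, as described above.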
The right hand side of Eq. (5) is centralized in the root process. This requires that the complete solution from the previous iteration has to be gathered by the root process at some time during the simulation. In contrast, the solution of the system of equations is kept distributed, so that after solving the system of equations each of the cores holds a nonoverlapping subset of elements of the solution (i.e., a subset of the new abundances). At this point, the solution must be exploited in its distributed form, which requires that subsequent processing stages (e.g., convergence and accuracy) must be executed independently between cores.
After solving Eq. (5), each core checks convergence and accuracy^{Footnote 7} of its part of the solution. Finally, the overall nuclear energy released at the specific timestep is obtained by summing the energy generated by all interactions. This stage is parallelized by having each core compute the partial nuclear energy released by a subset of reactions. The above parallelization strategy requires that the cores communicate at four specific steps during the simulation:

1. During the parallel solution of the system of equations (MUMPS).
2. Once the system of equations is solved, to share the distributed solution amongst all cores.
3. To check convergence and accuracy of the solution.
4. To sum up the energy contributions from the distributed reactions; every core computes only the energy released by a subset of reactions.
These communication requirements are considerable, as shown by the performance results reported in the following section.
Results
The heavy communication between cores demanded by the parallelization of the nucleosynthesis subroutines makes the parallel application actually take longer to complete than its 1-core counterpart (see Fig. 4, where the reference value (the sequential version) corresponds to \(N_{\mathrm{P}}=1\), and the total execution time is depicted as the ratio between parallel and sequential execution times, \(t(N_{\mathrm{P}})/t(1)\)).
The execution time increases when physically separated cores (i.e., on different workstations) participate in the simulation. In sharp contrast, when the parallel application is run using cores within the same machine, the execution time is kept at bay with respect to the sequential version, and small speedups are even obtained on a quad-core machine with two, three and four cores. Figure 5 shows the partial execution times spent on the determination of reaction rates (panel (a)), matrix assembly (b), convergence check (c), and determination of the overall nuclear energy released (d). The parallelization strategy adopted for these stages clearly works well: for instance, the matrix assembly runs almost 5 times faster than the sequential version when using 5 cores, and almost 7 times faster when using 10 cores. The convergence and accuracy check and the nuclear energy computation also gain in performance, both running consistently faster in the parallel version than in the sequential application. Performance results for the matrix buildup and inversion time are shown in panel (e). They reveal that the solution of the system of equations consistently takes longer when executed in parallel, for any number of cores used in the computation. Note that for the matrix inversion, not even the small improvements seen when cores physically located on the same machine are used are obtained. Even though the execution time remains more or less controlled up to four cores (for a simulation run on a quad-core machine), the performance plummets dramatically with a larger number of cores. The dramatic loss in performance is therefore caused by the parallel solution of the system of equations. The relative time spent on communications is depicted in panel (f); it clearly exhibits the same pattern underlined for the matrix inversion and total execution times.
While the communication time increases slightly from one to four cores, it soars rapidly whenever physically separated cores are incorporated into the parallel execution. We conclude that the high communication costs, together with a relatively limited computation time, are responsible for the loss in performance.
One final aspect that deserves further discussion is why the gains in performance found in the other stages (see Fig. 5) do not make up for the increase in communication times. Figure 6 shows the percentage of the total simulation time devoted to initialization, global communications (not including MUMPS internal communications during the solution of the system of equations), reaction rate calculations, matrix assembly, convergence checks, determination of the nuclear energy released, and matrix inversion (i.e., the solver). The sequential execution spends most of the time inverting the matrix (82%) and building the system of equations (15%). The calculation of the overall nuclear energy released accounts for only 1% of the total computation time. The relative time spent on the interpolation of reaction rates is just 0.44% of the total execution time, whereas only 0.06% is spent on convergence checks. As the number of cores participating in the simulation increases, the time spent on global communications and on the solution of the system of equations gradually tends to account for nearly all the computation time. This is why improvements in performance in the other stages have no major effect on the overall execution time.
Given such a loss in performance associated with the solution of the system of equations, it is necessary to analyze whether the selection of MUMPS as a solver was appropriate. MUMPS represents one of the few professional and supported public-domain implementations of the multifrontal method. Amestoy et al. (2001) have shown that the performance of the MUMPS solver for large matrices is excellent. For matrices of order ≥100,000, very good speedups are accomplished (e.g., between 2.8 and 3.7 with 4 cores, and between 7.1 and 10.6 with 16 cores). Note that speedups increase with the matrix size, as the computation-to-communication ratio increases. For matrices of order between 10,000 and 100,000, moderate speedups are accomplished with MUMPS (e.g., 2.4–3.1 with 4 cores, and 7.2–8.4 with 16 cores; Amestoy et al. 2001). Finally, not much data is available for matrices of order ≤10,000. This is due to the fact that, as the problem dimension shrinks, the distributed computation time is also reduced, whilst the communication time diminishes much less noticeably; accordingly, the resulting speedups are dramatically reduced. For instance, Fox (2007), in solving a system with 5535 elements with the MUMPS solver, reports a speedup of 1 (i.e., no speedup at all) with 4 cores, and a speedup of 1.8 with 16 cores. It seems clear that the poor performance reported in this work is mostly due to the size (order) of the nucleosynthesis matrix, too small to maximize the ratio between computation and communication times. The efficient parallelization of the matrix buildup and inversion processes in the nucleosynthesis subroutines is therefore not possible unless ≥10,000 nuclear interactions are included.
Conclusions
This paper reports on several parallelization strategies that can be applied to stellar evolution codes, providing recommendations on when and how parallelization may help in improving the performance of a code for astrophysical applications. Parallelization frequently forces one to think about a program in new ways and may require a partial, or even total, rewrite of the serial code. It is therefore important to understand the potential benefits and risks beforehand, since parallelized codes may sometimes perform even worse than their sequential counterparts.
To this end, two different parallelization strategies have been reported in this work. With regard to the nucleosynthesis part, efforts have focused on the parallelization of the solution of the system of equations (that is, the buildup and inversion of the matrix containing the rates of the different nuclear interactions). In Wagoner’s two-step linearization technique, the integration method for stellar nucleosynthesis calculations discussed in this work, the iterative procedure places this application in the worst possible category for parallelization: all cores have to participate throughout the iteration, exchanging intermediate results on a regular basis. The huge amount of time spent on communications between cores, together with the small problem size (limited by the number of isotopes of the nuclear network), results in a much worse performance of the parallel application than of the 1-core, sequential version of the code. This stems from the fact that the communication and message-passing times between processes largely outgrow the time spent on computation. It is therefore not advisable to parallelize the nucleosynthesis portion of a stellar code (or, by extension, a postprocessing code) unless the number of isotopes adopted largely exceeds 10,000.
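This communication-bound regime follows directly from the timing model of footnote 2, \(T_{\mathrm{S}} = T_{\mathrm{in}}+T_{\mathrm{pp}}+T_{\mathrm{out}}\) and \(T_{\mathrm{P}} = T_{\mathrm{in}}+T_{\mathrm{pp}}/N_{\mathrm{P}}+T_{\mathrm{comm}}+T_{\mathrm{out}}\). The numbers in the sketch below are illustrative only; they show that once \(T_{\mathrm{comm}}\) exceeds the parallelizable work \(T_{\mathrm{pp}}\), no core count can recover a speedup above 1:

```python
# Sketch of the footnote-2 timing model.  All timings are arbitrary units
# chosen for illustration, not SHIVA measurements.

def speedup(t_in, t_pp, t_out, t_comm, n_cores):
    t_s = t_in + t_pp + t_out                       # sequential time T_S
    t_p = t_in + t_pp / n_cores + t_comm + t_out    # parallel time T_P
    return t_s / t_p

# Communication-bound regime typical of a small nucleosynthesis matrix,
# where T_comm (20) exceeds the parallelizable work T_pp (10):
for n in (4, 16, 64):
    print(f"{n:2d} cores: speedup = {speedup(1.0, 10.0, 1.0, 20.0, n):.2f}")
```

Every core count yields a speedup below unity: the parallel code is slower than the sequential one, exactly the behavior observed for the nucleosynthesis subroutines.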
With regard to the parallelization of a complete stellar evolution code, efforts have focused on the spherically symmetric, Lagrangian, implicit hydrodynamic code SHIVA (José and Hernanz 1998; José 2016), in the framework of a 200-shell simulation of a typical type I X-ray burst. Two different nuclear reaction networks have been considered: a reduced one, consisting of 324 isotopes and 1392 reactions, and a more extended network, with 606 nuclides and 3551 nuclear interactions. The performance of the parallelized version of SHIVA turned out to be excellent: speedup factors of 26 and 35 have been obtained for the reduced (Model 1) and extended (Model 2) networks, respectively, when 42 cores were used. These results, however, fall short of the maximum values expected for a perfectly parallel application (i.e., one in which the computation-to-communication ratio is large enough that the processing work can be distributed with near-perfect efficiency amongst processes). To put these results into context, in our execution environment a parallel simulation using 42 cores took ∼5.7 hr to compute 200,000 timesteps with the reduced nuclear network (cf. 6.1 days in its sequential version). The computation time increased to ∼20 hr when the extended network (with 606 nuclides and 3551 nuclear reactions) was used, for the same number of timesteps (cf. 28.6 days in its sequential version). Such results fully justify the time invested in the parallelization of the code. Moreover, maximum speedups of ∼41 and ∼85 have been predicted by the performance model when using 200 cores, for the reduced and extended nuclear networks, respectively.
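The quoted wall-clock times and speedup factors can be checked against each other directly (all figures are taken from the text):

```python
# Arithmetic consistency check: 5.7 hr vs 6.1 days (reduced network) and
# 20 hr vs 28.6 days (extended network) should roughly reproduce the stated
# speedup factors of 26 and 35 on 42 cores.

def observed_speedup(sequential_days, parallel_hours):
    return sequential_days * 24.0 / parallel_hours

print(f"Model 1 (324 isotopes): {observed_speedup(6.1, 5.7):.1f}")
print(f"Model 2 (606 isotopes): {observed_speedup(28.6, 20.0):.1f}")
```

The ratios come out at roughly 26 and 34, consistent with the rounded speedups of 26 and 35 reported above.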
A key ingredient in achieving the large speedup factors reported above is the decoupling of the nucleosynthesis subroutines from the set of hydrodynamic/structure equations adopted in SHIVA. This approach, while having a minor effect on the expected energetics and chemical composition of the star, is essential to justify a parallelization effort. In sharp contrast, efforts to parallelize FRANEC (see Refs. Limongi and Chieffi 2003, Chieffi and Limongi 2013, and references therein), another Henyey-type code in which the nucleosynthesis and structure equations are solved simultaneously by means of a time-implicit scheme,^{Footnote 8} yielded very poor speedup factors (A. Chieffi, private comm.).
In summary, parallelization of a fully coupled, time-implicit code can only result in large speedup factors if the most time-consuming parts of the code (e.g., the nucleosynthesis subroutines) are decoupled from the hydrodynamic equations and can therefore be handled in a time-explicit way. Most multidimensional stellar evolution codes available to date (e.g., PROMETHEUS, Fryxell et al. 1989; FLASH, Fryxell et al. 2000; DJEHUTY, Dearborn et al. 2005, 2006; GADGET-2, Springel 2005) are (time-)explicit. While, in general, explicit schemes are easier to implement than implicit ones, the real payoff is the huge speedup factors achievable when parallelized, compared with their 1-core, sequential versions.
Notes
 1.
As with the nuclear energy production and nucleosynthesis, neutrino losses are also implemented explicitly in the SHIVA code. However, since they do not require intense computational effort, the subroutines handling neutrino losses have not been parallelized in this work.
 2.
Note that \(T_{\mathrm{S}} \equiv T_{\mathrm{in}}+T_{\mathrm{pp}}+T_{\mathrm{out}}\) and \(T_{\mathrm{P}} \equiv T_{\mathrm{in}}+T_{\mathrm{pp}}/N_{\mathrm{P}}+T_{\mathrm{comm}}+T_{\mathrm{out}}\).
 3.
See, e.g., https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
 4.
All simulations reported in this paper have been executed on the 42-core Hyperion cluster of the Astronomy and Astrophysics Group at UPC.
 5.
A few exceptions involve reactions such as ^{12}C + ^{12}C, ^{16}O + ^{16}O, ^{20}Ne + ^{20}Ne, which take place during some stages of the evolution of stars. See Refs. Iliadis (2015) and José (2016) for details.
 6.
MUltifrontal Massively Parallel Sparse direct Solver; see http://mumps.enseeiht.fr/.
 7.
The SHIVA code uses a number of convergence and accuracy criteria to guarantee, for instance, that the new solution satisfies the mass, momentum and energy conservation equations.
 8.
More recent versions of the FRANEC code, known as FUNS, contain several solver schemes in which the equations of nucleosynthesis, mixing and structure can be handled in a coupled or decoupled way (O. Straniero, private comm.). The extensively used MESA code (Paxton et al. 2011, 2013) also solves the nucleosynthesis and composition equations directly coupled to the structure equations. Note, however, that MESA contains a number of explicit modules that can be computed in parallel using OpenMP (Paxton et al. 2011).
Abbreviations
 AGB star: Asymptotic Giant Branch star
 MPI: Message Passing Interface
 EOS: Equation of State
 MUMPS: MUltifrontal Massively Parallel Sparse direct Solver
References
Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, pp. 483–485. ACM Publ., New York (1967)
Amestoy, P., Guermouche, A., L’Excellent, J.Y., Pralet, S.: Hybrid scheduling for the parallel solution of linear systems. CERFACS, Tech. Rep., Toulouse, France (2004)
Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Koster, J.: A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 23, 15–41 (2001)
Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Li, X.S.: Performance and tuning of two distributed memory sparse solvers. In: Meza, J., Koelbel, C. (eds.) Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial & Applied Mathematics, Portsmouth (2001)
Bader, G., Deuflhard, P.: A semi-implicit midpoint rule for stiff systems of ordinary differential equations. Numer. Math. 41, 373–398 (1983)
Bodenheimer, P., Laughlin, G.P., Rózyczka, M., Yorke, H.W.: Numerical Methods in Astrophysics: An Introduction. CRC/Taylor and Francis, Boca Raton (2006)
Chieffi, A., Limongi, M.: Presupernova evolution of rotating solar metallicity stars in the mass range 13–120 M_{⊙} and their explosive yields. Astrophys. J. 764, 21–36 (2013)
Dearborn, D.S.P., Lattanzio, J.C., Eggleton, P.P.: Threedimensional numerical experimentation on the core helium flash of lowmass red giants. Astrophys. J. 639, 405–415 (2006)
Dearborn, D.S.P., Wilson, J.R., Mathews, G.J.: Relativistically compressed exploding white dwarf model for Sagittarius A East. Astrophys. J. 630, 309–320 (2005)
Foster, I.: Designing and Building Parallel Programs. Addison-Wesley, Boston (1995)
Fox, J.: Fully kinetic PIC simulations for Hall-effect thrusters. PhD thesis, Massachusetts Institute of Technology (2007)
Fryxell, B., Müller, E., Arnett, W.D.: Hydrodynamics and nuclear burning. Max-Planck Inst. for Astrophysics. Rep. 449, Garching, Germany (1989)
Fryxell, B., Olson, K., Ricker, P., Timmes, F.X., Zingale, M., Lamb, D.Q., MacNeice, P., Rosner, R., Truran, J.W., Tufo, H.: FLASH: an adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. Astrophys. J. Suppl. Ser. 131, 273–334 (2000)
Gear, C.W.: The automatic integration of ordinary differential equations. Commun. ACM 14, 176–179 (1971)
Graham, R.: MPI: a message-passing interface standard, V3.0. University of Tennessee, Tech. Rep., Knoxville (2012)
Henyey, L.G., Forbes, J.E., Gould, N.L.: A new method of automatic computation of stellar evolution. Astrophys. J. 139, 306–317 (1964)
Iliadis, C.: Nuclear Physics of Stars, 2nd edn. Wiley-VCH Verlag, Weinheim (2015)
José, J.: Stellar Explosions: Hydrodynamics and Nucleosynthesis. CRC/Taylor and Francis, Boca Raton (2016)
José, J., Hernanz, M.: Nucleosynthesis in classical novae: CO versus ONe white dwarfs. Astrophys. J. 494, 680–690 (1998)
José, J., Moreno, F., Parikh, A., Iliadis, C.: Hydrodynamic models of type I Xray bursts: metallicity effects. Astrophys. J. Suppl. Ser. 189, 204–239 (2010)
Limongi, M., Chieffi, A.: Evolution, explosion, and nucleosynthesis of core-collapse supernovae. Astrophys. J. 592, 404–433 (2003)
Longland, R., Martin, D., José, J.: Performance improvements for nuclear reaction network integration. Astron. Astrophys. 563, 67–113 (2014)
McKenney, P.E.: Is Parallel Programming Hard, and, if so, What Can You Do About It? Linux Technology Center, IBM (2011)
Pacheco, P.: Parallel Programming with MPI. Morgan Kaufmann Publ., San Francisco (1997)
Parikh, A., José, J., Moreno, F., Iliadis, C.: The effects of variations in nuclear processes on type I Xray burst nucleosynthesis. Astrophys. J. Suppl. Ser. 178, 110–136 (2008)
Paxton, B., Bildsten, L., Dotter, A., Herwig, F., Lesaffre, P., Timmes, F.: Modules for experiments in stellar astrophysics (MESA). Astrophys. J. Suppl. Ser. 192, 3–35 (2011)
Paxton, B., Cantiello, M., Arras, P., Bildsten, L., Brown, E.F., Dotter, A., Mankovich, C., Montgomery, M.H., Stello, D., Timmes, F.X., Townsend, R.: Modules for experiments in stellar astrophysics (MESA): planets, oscillations, rotation, and massive stars. Astrophys. J. Suppl. Ser. 208, 4–42 (2013)
Prantzos, N., Arnould, M., Arcoragi, J.P.: Neutron capture nucleosynthesis during core helium burning in massive stars. Astrophys. J. 315, 209–228 (1987)
Springel, V.: The cosmological simulation code GADGET-2. Mon. Not. R. Astron. Soc. 364, 1105–1134 (2005)
Thakur, R., Gropp, W.: Improving the performance of MPI collective communication on switched networks. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 257–276. Springer, Berlin (2003)
Timmes, F.X.: Integration of nuclear reaction networks for stellar hydrodynamics. Astrophys. J. Suppl. Ser. 124, 241–263 (1999)
Wagoner, R.V.: Synthesis of the elements within objects exploding from very high temperatures. Astrophys. J. Suppl. Ser. 18, 247–295 (1969)
Acknowledgements
This article benefited from discussions within the “ChETEC” COST Action (CA16117).
Availability of data and materials
A simplified version of the SHIVA code, freefall.f, is available at http://www.fen.upc.edu/users/jjose/CRCDownloads.html. The code applies Henyey’s method to simulate the freefall collapse of a homogeneous sphere. See Ref. José (2016), for details.
Funding
This work has been partially supported by the Spanish MINECO grant AYA2017–86274–P, by the E.U. FEDER funds, and by the AGAUR/Generalitat de Catalunya grant SGR661/2017.
Author information
Contributions
All authors have equally contributed to this work. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Jordi José.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Numerical methods
 Hydrodynamics
 Parallel computing
 Nuclear reactions
 Nucleosynthesis
 Abundances
 Stellar evolution
 Stellar explosions: classical novae
 Stellar explosions: X-ray bursts