As a preliminary note to all reviewers, CVODE was included to provide a baseline comparison against a commonly used implicit integration technique on the CPU. CVODE was not implemented on the GPU, primarily because of its variable-order formulation.
Reviewer 1:
The overall suggestion in comments 1 & 7 is a change in the direction of the manuscript that we agree with and intend to implement in revisions. This paper represents the current state of the art of GPU-based chemical kinetic integration, and as such should focus on directing the future efforts of the combustion community. To this end, we propose to modify the introduction and discussion as follows. First, the introduction will specifically highlight the strengths and weaknesses of explicit vs. implicit algorithms on the GPU, particularly as they relate to thread divergence, the one-block and one-thread approaches, adaptive time-stepping, and other key implementation details; the most promising future directions will also be outlined. Second, the performance comparison between the GPU and CPU algorithms will be revised to be less focused on details and instead give a higher-level summary of the advantages and disadvantages of each based on problem type. Finally, we will discuss the viability of GPU-based integration in large-scale simulations, given that many supercomputers and computing clusters already have GPU capability installed. In conjunction, the presentation of the performance graphs will be updated as suggested, error analysis will be added, the absolute tolerance relaxed, and the adaptive time-stepping procedures used by the algorithms outlined.
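For context, the thread-divergence issue can be made concrete with a minimal sketch of the one-thread-per-ODE mapping; the stepper, controller constants, and initial step size below are illustrative assumptions, not the solvers used in the manuscript.

// Hypothetical embedded stepper: attempts one step of size h and returns a
// scaled error estimate (err <= 1 means accept). A stand-in body is given
// so the sketch compiles; an actual explicit/implicit stepper would go here.
__device__ double try_step(double* y, double t, double h,
                           double atol, double rtol)
{
    return 0.5;
}

// One-thread-per-ODE mapping: each thread owns one kinetic system and its
// own adaptive step size, so threads in a warp diverge whenever their
// error controllers choose different step sequences.
__global__ void integrate_per_thread(double* y, int n_odes, int n_spec,
                                     double t_end, double atol, double rtol)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_odes) return;

    double* y_local = &y[tid * n_spec];  // this thread's state vector
    double t = 0.0;
    double h = 1.0e-10;                  // initial step-size guess

    while (t < t_end) {
        double err = try_step(y_local, t, h, atol, rtol);
        if (err <= 1.0)
            t += h;                      // accepted: this thread advances
        // Standard controller; a rejected step retries with a smaller h.
        h *= fmin(2.0, fmax(0.2, 0.9 * pow(err, -0.5)));
        h = fmin(h, t_end - t);
    }
}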
While we agree that a breakdown of the computational costs of the various algorithms is an important consideration, it would be difficult to include in this work, both due to space constraints and due to the difficulty of profiling individual methods inside a CUDA kernel, which would require a complete rewrite of the solvers. This effort would be best pursued in future work.
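To illustrate the intrusiveness involved, per-method timing inside a kernel would mean wrapping every internal call site in device-side timers along the lines of the sketch below; clock64() and atomicAdd() are real CUDA primitives, while the solver stages shown are placeholders.

// Placeholder solver stages standing in for the RHS and Jacobian code.
__device__ void eval_rhs()      { }
__device__ void eval_jacobian() { }

// Per-stage cycle counters, accumulated across all threads.
__device__ unsigned long long cycles_rhs = 0, cycles_jac = 0;

__global__ void instrumented_step()
{
    // Every internal call site needs this wrapping, which is why
    // per-method profiling would amount to a rewrite of the solvers.
    unsigned long long t0 = clock64();
    eval_rhs();
    atomicAdd(&cycles_rhs, (unsigned long long)(clock64() - t0));

    t0 = clock64();
    eval_jacobian();
    atomicAdd(&cycles_jac, (unsigned long long)(clock64() - t0));
}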
Reviewer 3:
The relative performance comparison between the finite-difference Jacobian and the analytic Jacobian on the same hardware (i.e., an FD Jacobian on the GPU vs. an analytic Jacobian on the GPU) is a key point and will be added to this work. Relatedly, the comparison to the performance in [15] will be removed.
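For reference, the forward-difference Jacobian being compared against the analytic one costs one extra RHS evaluation per column, as in the sketch below; the dydt() interface and dense row-major storage are illustrative assumptions, not the manuscript's implementation.

// Stand-in RHS, dy/dt = f(t, y): simple linear decay so the sketch
// compiles; the real routine would evaluate the chemical source terms.
__device__ void dydt(int n, double t, const double* y, double* f)
{
    for (int i = 0; i < n; ++i) f[i] = -y[i];
}

// First-order forward-difference Jacobian: column j costs one extra RHS
// evaluation. J is dense, row-major, with J[i*n + j] = df_i/dy_j;
// f0 holds the unperturbed RHS, y_pert/f_pert are scratch arrays.
__device__ void fd_jacobian(int n, double t, const double* y,
                            const double* f0, double* J,
                            double* y_pert, double* f_pert)
{
    const double eps = 1.4901161193847656e-8;  // ~sqrt(machine epsilon)
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) y_pert[i] = y[i];
        double h = eps * fmax(fabs(y[j]), 1.0); // perturbation scaled to y_j
        y_pert[j] += h;
        dydt(n, t, y_pert, f_pert);
        for (int i = 0; i < n; ++i)
            J[i * n + j] = (f_pert[i] - f0[i]) / h;
    }
}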
The reviewer's point about the per-thread and per-block performance of Stone and Davis' GPU-CVODE is well taken; however, we maintain that the one-thread basis is the more promising future direction. First, while the one-block approach does reach its limiting speedup faster (e.g., at < 10^4 ODEs), this is not particularly relevant for large-scale reacting-flow simulations. Second, the work of Stone et al. demonstrated that, in the limit of eliminating thread divergence for the per-thread approach (i.e., using identical initial ODE conditions), it reached a maximum speedup of ~29x over a CPU-DVODE implementation, while the per-block approach achieved only a ~7x speedup. We also know that the chemical similarity of the initial conditions strongly affects thread divergence in a one-thread approach [14]. We therefore see potential to greatly improve the performance of a per-thread approach by investigating thread-divergence reduction methods, whereas no such option is apparent for a per-block approach. The discussion of this topic has been updated accordingly.
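As one concrete example of the kind of divergence-reduction method we have in mind (a hedged sketch, not the approach of [14]): chemically similar systems can be grouped before launch so that neighbouring threads, and hence warps, follow similar step-size histories. Below, initial temperature is used as a simple similarity proxy via Thrust's sort_by_key.

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Reorder the ODE systems by initial temperature so that adjacent threads
// start from similar states and take similar adaptive-step sequences.
void order_by_temperature(thrust::device_vector<double>& T0,
                          thrust::device_vector<int>& ode_index)
{
    // After the sort, ode_index[k] gives which system thread k integrates.
    thrust::sort_by_key(T0.begin(), T0.end(), ode_index.begin());
}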
Additionally, the reviewer's comment on a species-concentration-based Jacobian highlighted a mistake in the manuscript: this should have referred to the sparse mole/volume-based constant-pressure Jacobian proposed by Schwer et al., and will be corrected. The reviewer's comments on Section 3.3, the caching algorithm, and Eq. 1 will be implemented. Additionally, it will be clearly noted that all GPU solvers are implemented on a per-thread basis.
Reviewer 2:
The introduction will be reworked in accordance with this reviewer's (and other reviewers') comments.
