Abstract
General-purpose computation on graphics processing units (GPUs) is
rapidly entering various scientific and engineering fields.
Many applications are being ported onto GPUs for better
performance. Various optimizations, frameworks, and tools are
being developed for effective GPU programming. As part of
communication and computation optimizations for GPUs, this paper
proposes and implements an optimization method called kernel
coalesce that further enhances GPU performance and also optimizes
CPU-to-GPU communication time. With the kernel coalesce methods
proposed in this paper, kernel launch overheads are reduced
by coalescing the concurrent kernels, and data transfers are
reduced in case of intermediate data generated and used among
kernels. Computation optimization on a device (GPU) is performed
by tuning the number of blocks and threads launched to the
device architecture. The block-level kernel coalesce method
resulted in a prominent performance improvement on a device without
support for concurrent kernels. The thread-level kernel coalesce
method is better than the block-level kernel coalesce method when the
design of the grid structure (i.e., number of blocks and threads)
is not optimal for the device architecture, which leads to
underutilization of the device resources. Both methods
perform similarly when the number of threads per block is
approximately the same in different kernels and the total number
of threads across blocks fills the streaming multiprocessor (SM)
capacity of the device. The thread multi-clock cycle coalesce method
can be chosen if the programmer wants to coalesce more than two
concurrent kernels that together or individually exceed the
thread capacity of the device. If the kernels have lightweight
thread computations, the multi-clock cycle kernel coalesce method
gives better performance than the thread-level and block-level kernel
coalesce methods. If the kernels to be coalesced are a
combination of compute-intensive and memory-intensive kernels,
warp interleaving gives higher device occupancy and improves
performance. The multi-clock cycle kernel coalesce method for
micro-benchmark1 considered in this paper resulted in 10--40\%
and 80--92\% improvement compared with separate kernel launch,
without and with shared input and intermediate data among the
kernels, respectively, on a Fermi architecture device, namely the
GTX 470. A nearest neighbor (NN) kernel from the Rodinia benchmark
is coalesced with itself using the thread-level kernel coalesce
method and warp interleaving, giving 131.9\% and 152.3\%
improvement compared with separate kernel launch and 39.5\% and
36.8\% improvement compared with the block-level kernel coalesce
method, respectively.
Copyright \copyright 2013 John Wiley & Sons, Ltd.
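
The block-level kernel coalesce idea described above can be illustrated with a small CPU-side emulation in C++. This is only a sketch under stated assumptions, not the paper's implementation: the names `Grid`, `scale_kernel`, `offset_kernel`, `coalesced_kernel`, and `launch_coalesced` are hypothetical, the nested loops stand in for a single GPU launch, and the two toy kernels are placeholders for any pair of independent kernels. The point it shows is the dispatch: one grid holds the blocks of both kernels, and each block picks its kernel from its block index.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical launch shape of one kernel: blocks and threads per block.
struct Grid {
    std::size_t blocks;
    std::size_t threads_per_block;
};

// Toy kernel A (assumption): doubles each element of a.
void scale_kernel(std::size_t block, std::size_t thread, const Grid& g,
                  std::vector<float>& a) {
    std::size_t i = block * g.threads_per_block + thread;
    if (i < a.size()) a[i] *= 2.0f;
}

// Toy kernel B (assumption): adds one to each element of b.
void offset_kernel(std::size_t block, std::size_t thread, const Grid& g,
                   std::vector<float>& b) {
    std::size_t i = block * g.threads_per_block + thread;
    if (i < b.size()) b[i] += 1.0f;
}

// Block-level coalesce: blocks [0, ga.blocks) run kernel A, the remaining
// blocks run kernel B with a rebased block index. Threads beyond a kernel's
// own block size stay idle, because the merged launch uses the larger size.
void coalesced_kernel(std::size_t block, std::size_t thread,
                      const Grid& ga, const Grid& gb,
                      std::vector<float>& a, std::vector<float>& b) {
    if (block < ga.blocks) {
        if (thread < ga.threads_per_block) scale_kernel(block, thread, ga, a);
    } else {
        if (thread < gb.threads_per_block)
            offset_kernel(block - ga.blocks, thread, gb, b);
    }
}

// Serial emulation of ONE launch covering both kernels.
// Returns the number of blocks in the merged grid.
std::size_t launch_coalesced(const Grid& ga, const Grid& gb,
                             std::vector<float>& a, std::vector<float>& b) {
    std::size_t total_blocks = ga.blocks + gb.blocks;
    std::size_t threads = std::max(ga.threads_per_block, gb.threads_per_block);
    for (std::size_t block = 0; block < total_blocks; ++block)
        for (std::size_t thread = 0; thread < threads; ++thread)
            coalesced_kernel(block, thread, ga, gb, a, b);
    return total_blocks;
}
```

Calling `launch_coalesced(Grid{2, 4}, Grid{1, 3}, a, b)` processes both arrays in one merged grid of three blocks instead of two separate launches. On a real device the idle threads in the smaller kernel's blocks are one plausible reason why a grid structure that is poorly matched to the architecture can underutilize resources, as the abstract notes.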