Article

Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU

B. Neelima, G. Ram Mohana Reddy, and P. Raghavendra.
Concurr. Comput., 27 (1): 47--68 (2015)

Abstract

General-purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields. Many applications are being ported to GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce, which further enhances GPU performance and also optimizes CPU-to-GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing the concurrent kernels, and data transfers are reduced in the case of intermediate data generated and used among kernels. Computation optimization on a device (GPU) is performed by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalesce method resulted in a prominent performance improvement on a device without support for concurrent kernels. The thread-level kernel coalesce method is better than the block-level kernel coalesce method when the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture, which leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same in the different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The thread multi-clock-cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock-cycle kernel coalesce method gives better performance than the thread-level and block-level kernel coalesce methods. If the kernels to be coalesced are a combination of compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. The multi-clock-cycle kernel coalesce method for micro-benchmark1 considered in this paper resulted in 10–40% and 80–92% improvement compared with separate kernel launch, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device (GTX 470). A nearest neighbor (NN) kernel from the Rodinia benchmark is coalesced with itself using the thread-level kernel coalesce method and warp interleaving, giving 131.9% and 152.3% improvement compared with separate kernel launch and 39.5% and 36.8% improvement compared with the block-level kernel coalesce method, respectively. Copyright © 2013 John Wiley & Sons, Ltd.
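The abstract names block-level and thread-level kernel coalescing but does not spell out how work is mapped onto the merged launch. The sketch below is one plausible reading, written as a CPU-side Python emulation of the index arithmetic rather than as CUDA code. Every name in it (`kernel_a`, `kernel_b`, `launch`, `block_level_coalesce`, `thread_level_coalesce`) is hypothetical and is not taken from the paper. Under this assumption, block-level coalescing assigns whole blocks of a single merged grid to one kernel or the other, while thread-level coalescing splits the threads inside each block between the two kernels.

```python
# CPU emulation of kernel-coalescing index arithmetic (illustrative only).
# Assumption: this mapping is a guess at what the abstract describes; the
# paper's actual CUDA implementation is not shown in this record.

def kernel_a(block, thread, threads_per_block, data):
    """Stand-in for kernel A: double one element."""
    i = block * threads_per_block + thread
    if i < len(data):
        data[i] *= 2


def kernel_b(block, thread, threads_per_block, data):
    """Stand-in for kernel B: increment one element."""
    i = block * threads_per_block + thread
    if i < len(data):
        data[i] += 1


def launch(body, num_blocks, threads_per_block):
    """Emulate one kernel launch by visiting every (block, thread) pair in turn."""
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            body(block, thread)


def block_level_coalesce(blocks_a, blocks_b, tpb, data_a, data_b):
    """Single launch in which each block runs wholly as kernel A or kernel B."""
    def body(block, thread):
        if block < blocks_a:
            kernel_a(block, thread, tpb, data_a)
        else:
            # Shift the block index so kernel B sees blocks 0..blocks_b-1.
            kernel_b(block - blocks_a, thread, tpb, data_b)
    launch(body, blocks_a + blocks_b, tpb)


def thread_level_coalesce(num_blocks, tpb_a, tpb_b, data_a, data_b):
    """Single launch in which each block's threads are split between A and B."""
    def body(block, thread):
        if thread < tpb_a:
            kernel_a(block, thread, tpb_a, data_a)
        else:
            # Shift the thread index so kernel B sees threads 0..tpb_b-1.
            kernel_b(block, thread - tpb_a, tpb_b, data_b)
    launch(body, num_blocks, tpb_a + tpb_b)
```

In both variants the two stand-in kernels run under one emulated launch instead of two, which is the launch-overhead saving the abstract refers to. The thread-level variant also permits different thread counts per kernel (`tpb_a` and `tpb_b`), which matches the abstract's remark that it helps when the original grid design underuses the device. The emulation says nothing about real occupancy or timing.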

BibTeX key: Neelima2015-nr
entry type: article
year: 2015
journal: Concurr. Comput.
number: 1
pages: 47--68
volume: 27


Cite this publication

@article{Neelima2015-nr,
  abstract = {General purpose computation on graphics processing unit (GPU) is rapidly entering into various scientific and engineering fields. Many applications are being ported onto GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective programming of GPU. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called as kernel coalesce that further enhances GPU performance and also optimizes CPU to GPU communication time. With kernel coalesce methods, proposed in this paper, the kernel launch overheads are reduced by coalescing the concurrent kernels and data transfers are reduced incase of intermediate data generated and used among kernels. Computation optimization on a device (GPU) is performed by optimizing the number of blocks and threads launched by tuning it to the architecture. Block level kernel coalesce method resulted in prominent performance improvement on a device without the support for concurrent kernels. Thread level kernel coalesce method is better than block level kernel coalesce method when the design of a grid structure (i.e., number of blocks and threads) is not optimal to the device architecture that leads to underutilization of the device resources. Both the methods perform similar when the number of threads per block is approximately the same in different kernels, and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. Thread multi-clock cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have light weight thread computations, multi clock cycle kernel coalesce method gives better performance than thread and block level kernel coalesce methods. If the kernels to be coalesced are a combination of compute intensive and memory intensive kernels, warp interleaving gives higher device occupancy and improves the performance. Multi clock cycle kernel coalesce method for micro-benchmark1 considered in this paper resulted in 10--40\% and 80--92\% improvement compared with separate kernel launch, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device, that is, GTX 470. A nearest neighbor (NN) kernel from Rodinia benchmark is coalesced to itself using thread level kernel coalesce method and warp interleaving giving 131.9\% and 152.3\% improvement compared with separate kernel launch and 39.5\% and 36.8\% improvement compared with block level kernel coalesce method, respectively. Copyright \copyright{} 2013 John Wiley \& Sons, Ltd.},
  added-at = {2015-11-23T17:06:12.000+0100},
  author = {Neelima, B and Ram Mohana Reddy, G and Raghavendra, Prakash S},
  biburl = {https://www.bibsonomy.org/bibtex/2d6ab5871ce862c2d8812ffcf104ce1e0/christophv},
  interhash = {72d390cf21a7190aa8c89f27632177d7},
  intrahash = {d6ab5871ce862c2d8812ffcf104ce1e0},
  journal = {Concurr. Comput.},
  keywords = {Expose Fermi_architecture Fusion GPU block_level_kernel_coalesce_method blocks_and_threads communication_and_computation_optimization general_purpose_computation_on_graphics_processing_unit_(GPGPU) graphics_processing_unit_(GPU) kernel_coalesce multi-clock_cycle_kernel_coalesce_method thread_level_kernel_coalesce_method warp_interleaving},
  number = 1,
  pages = {47--68},
  timestamp = {2016-01-04T14:22:08.000+0100},
  title = {Communication and computation optimization of concurrent kernels using kernel coalesce on a {GPU}},
  volume = 27,
  year = 2015
}
