Hundreds of cores per chip and support for fine-grain
multithreading have made GPUs a central player in today's HPC
world. For many applications, however, achieving a high fraction
of peak on current GPUs still requires significant programmer
effort. A key consideration for optimizing GPU code is
determining a suitable amount of work to be performed by each
thread. Thread granularity not only has a direct impact on
occupancy but can also influence data locality at the register
and shared-memory levels. This paper describes a software
framework to analyze dependencies in parallel GPU threads and
perform source-level restructuring to obtain GPU kernels with
varying thread granularity. The framework supports specification
of coarsening factors through source-code annotation and also
implements a heuristic based on estimated register pressure that
automatically recommends coarsening factors for improved memory
performance. We present preliminary experimental results on a
select set of CUDA kernels. The results show that the proposed
strategy is generally able to select profitable coarsening
factors. More importantly, the results demonstrate a clear need
for automatic control of thread granularity at the software
level for achieving higher performance.
%0 Book Section
%1 Unkule2012-gp
%A Unkule, Swapneela
%A Shaltz, Christopher
%A Qasem, Apan
%B Compiler Construction
%D 2012
%I Springer Berlin Heidelberg
%K Expose
%P 21--40
%T Automatic Restructuring of GPU Kernels for Exploiting
Inter-thread Data Locality
%X Hundreds of cores per chip and support for fine-grain
multithreading have made GPUs a central player in today's HPC
world. For many applications, however, achieving a high fraction
of peak on current GPUs still requires significant programmer
effort. A key consideration for optimizing GPU code is
determining a suitable amount of work to be performed by each
thread. Thread granularity not only has a direct impact on
occupancy but can also influence data locality at the register
and shared-memory levels. This paper describes a software
framework to analyze dependencies in parallel GPU threads and
perform source-level restructuring to obtain GPU kernels with
varying thread granularity. The framework supports specification
of coarsening factors through source-code annotation and also
implements a heuristic based on estimated register pressure that
automatically recommends coarsening factors for improved memory
performance. We present preliminary experimental results on a
select set of CUDA kernels. The results show that the proposed
strategy is generally able to select profitable coarsening
factors. More importantly, the results demonstrate a clear need
for automatic control of thread granularity at the software
level for achieving higher performance.
@incollection{Unkule2012-gp,
abstract = {Hundreds of cores per chip and support for fine-grain
multithreading have made GPUs a central player in today's HPC
world. For many applications, however, achieving a high fraction
of peak on current GPUs still requires significant programmer
effort. A key consideration for optimizing GPU code is
determining a suitable amount of work to be performed by each
thread. Thread granularity not only has a direct impact on
occupancy but can also influence data locality at the register
and shared-memory levels. This paper describes a software
framework to analyze dependencies in parallel GPU threads and
perform source-level restructuring to obtain GPU kernels with
varying thread granularity. The framework supports specification
of coarsening factors through source-code annotation and also
implements a heuristic based on estimated register pressure that
automatically recommends coarsening factors for improved memory
performance. We present preliminary experimental results on a
select set of CUDA kernels. The results show that the proposed
strategy is generally able to select profitable coarsening
factors. More importantly, the results demonstrate a clear need
for automatic control of thread granularity at the software
level for achieving higher performance.},
added-at = {2015-06-08T14:28:31.000+0200},
author = {Unkule, Swapneela and Shaltz, Christopher and Qasem, Apan},
biburl = {https://www.bibsonomy.org/bibtex/27bb8eb311c9352a74921f80da17ae762/christophv},
booktitle = {Compiler Construction},
interhash = {068b436f834ec7b9694b4f14b8a76380},
intrahash = {7bb8eb311c9352a74921f80da17ae762},
keywords = {Expose},
pages = {21--40},
publisher = {Springer Berlin Heidelberg},
series = {Lecture Notes in Computer Science},
timestamp = {2016-01-04T14:22:08.000+0100},
title = {Automatic Restructuring of {GPU} Kernels for Exploiting
Inter-thread Data Locality},
year = 2012
}