@incollection{Thoman2011-zy,
abstract = {The OpenCL standard allows targeting a large variety of CPU, GPU
and accelerator architectures using a single unified programming
interface and language. While the standard guarantees
portability of functionality for complying applications and
platforms, performance portability on such a diverse set of
hardware is limited. Devices may vary significantly in memory
architecture as well as type, number and complexity of
computational units. To characterize and compare the OpenCL
performance of existing and future devices we propose a suite of
microbenchmarks, uCLbench. We present measurements for eight
hardware architectures -- four GPUs, three CPUs and one
accelerator -- and illustrate how the results accurately reflect
unique characteristics of the respective platform. In addition
to measuring quantities traditionally benchmarked on CPUs like
arithmetic throughput or the bandwidth and latency of various
address spaces, the suite also includes code designed to
determine parameters unique to OpenCL like the dynamic branching
penalties prevalent on GPUs. We demonstrate how our results can
be used to guide algorithm design and optimization for any given
platform on an example kernel that represents the key
computation of a linear multigrid solver. Guided manual
optimization of this kernel results in an average improvement of
61\% across the eight platforms tested.},
added-at = {2015-04-10T18:02:47.000+0200},
author = {Thoman, Peter and Kofler, Klaus and Studt, Heiko and Thomson, John and Fahringer, Thomas},
biburl = {https://www.bibsonomy.org/bibtex/239d85d51ec4f231d0f887bbdb9ecc9ac/christophv},
booktitle = {{Euro-Par} 2011 Parallel Processing},
interhash = {4021e837d3c19680f1683f4b62c3ef9d},
intrahash = {39d85d51ec4f231d0f887bbdb9ecc9ac},
keywords = {Expose OpenCL},
pages = {438--452},
publisher = {Springer Berlin Heidelberg},
series = {Lecture Notes in Computer Science},
timestamp = {2016-01-04T14:22:08.000+0100},
title = {Automatic {OpenCL} Device Characterization: Guiding Optimized
Kernel Design},
year = 2011
}