Abstract
Determining the best set of optimizations to apply to a kernel
to be executed on the graphics processing unit (GPU) is a
challenging problem. There are large sets of possible
optimization configurations that can be applied, and many
applications have multiple kernels. Each kernel may require a
specific configuration to achieve the best performance, and
moving an application to new hardware often requires a new
optimization configuration for each kernel. In this work, we
apply optimizations to GPU code using HMPP, a high-level
directive-based language and source-to-source compiler that can
generate CUDA/OpenCL code. However, programming in a
high-level language can mean a loss of performance compared to
using a low-level language. Our work shows that it is possible
to improve the performance of code written in a high-level
language by using auto-tuning. We perform auto-tuning over a
large optimization space for GPU kernels, focusing on loop
permutation, loop
unrolling, tiling, and specifying which loop(s) to parallelize,
and show results on convolution kernels, codes in the PolyBench
suite, and an implementation of belief propagation for stereo
vision. The results show that our auto-tuned HMPP-generated
implementations are significantly faster than the default HMPP
implementation and can meet or exceed the performance of
manually coded CUDA/OpenCL implementations.
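The auto-tuning loop described above can be sketched as an exhaustive search over a configuration space. This is a minimal illustrative sketch, not the paper's actual tool: the parameter ranges, the configuration names, and the timing function are hypothetical placeholders, and a real tuner would compile and run each HMPP-generated GPU kernel instead of evaluating a synthetic cost.

```python
import itertools

# Hypothetical optimization knobs of the kind the abstract mentions:
# tile sizes, unroll factors, and which loop ordering to use.
TILE_SIZES = [8, 16, 32]
UNROLL_FACTORS = [1, 2, 4]
LOOP_ORDERS = ["ij", "ji"]

def measure_time(tile, unroll, order):
    """Stand-in for compiling and timing one generated kernel variant.

    Returns a synthetic cost so the sketch is runnable; a real
    auto-tuner would build the variant and time it on the GPU.
    """
    return abs(tile - 16) + abs(unroll - 4) + (0 if order == "ij" else 1)

def autotune():
    # Enumerate the full cross-product of configurations and keep
    # the one with the lowest measured time.
    space = itertools.product(TILE_SIZES, UNROLL_FACTORS, LOOP_ORDERS)
    return min(space, key=lambda cfg: measure_time(*cfg))

best = autotune()
```

In practice the space grows multiplicatively with each knob and each kernel, which is why the abstract emphasizes that every kernel, and every hardware target, may need its own configuration.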