@christophv

Performance Traps in OpenCL for CPUs

, , , and . Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, page 38--45. Washington, DC, USA, IEEE Computer Society, (February 2013)

Abstract

With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: ``OpenCL is not performance portable!'' or ``Why using OpenCL for CPUs after all?!''. We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform-parallelism granularity and the memory model-we identify eight such performance ``traps'' that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.

Links and resources

Tags

community