
Orchestrating Data Transfer for the Cell/B.E. Processor

Proceedings of the 22nd annual international conference on Supercomputing, pages 289-298. New York, NY, USA, ACM, (2008)
DOI: 10.1145/1375527.1375570

Abstract

In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into the local memory when the total data set is too large to fit in the local memory. The data can be transferred through either a software controlled cache or a direct buffer. Such a software cache can maintain correctness and exploit reuse among references, especially when complicated aliasing or data dependences exist. However, the software cache introduces the extra overhead of cache lookup. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is desirable to judiciously use both methods, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and the direct buffer, coherence problems occur.

In this paper, we propose a solution which provides compile time analysis and runtime maintenance to address this coherence issue. We use compiler analysis to guarantee that there is no access to software cache within the local live range of a direct buffer, and rely on runtime support to update values from or to software cache at the entry or exit of the direct buffer. Further, we present a global data flow analysis design to eliminate redundant coherence maintenance, and overlap computation and DMA accesses to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell, and have conducted experiments with the NAS OpenMP benchmarks. The results show that our method maintains correctness while keeping most of the opportunities for direct buffering. The execution performance can increase more than 3x compared to approaches using only the software cache. Furthermore, compile time analysis can reduce 90% of the runtime updates, thereby improving performance by 20% further.
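
The coherence rule described in the abstract (update the direct buffer from the software cache at entry, update the software cache from the direct buffer at exit) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's compiler or runtime: `SoftwareCache`, `buffer_entry`, `buffer_exit`, and the line size `LINE` are hypothetical names chosen for illustration, plain Python lists stand in for global memory and DMA transfers, and the global memory size is assumed to be a multiple of `LINE`.

```python
LINE = 4  # elements per software-cache line (hypothetical size)


class SoftwareCache:
    """Tiny write-back cache keyed by line index; every access pays a lookup."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}      # line index -> list of LINE values
        self.dirty = set()   # line indices modified since they were loaded
        self.lookups = 0     # counts the lookup overhead the abstract mentions

    def _line(self, addr):
        self.lookups += 1
        idx = addr // LINE
        if idx not in self.lines:  # miss: fetch the line from global memory
            self.lines[idx] = self.memory[idx * LINE:(idx + 1) * LINE]
        return idx

    def read(self, addr):
        idx = self._line(addr)
        return self.lines[idx][addr % LINE]

    def write(self, addr, value):
        idx = self._line(addr)
        self.lines[idx][addr % LINE] = value
        self.dirty.add(idx)


def buffer_entry(cache, memory, start, length):
    """Fill a direct buffer, then patch in newer values held dirty in the cache."""
    buf = memory[start:start + length]          # stands in for a DMA get
    for i in range(length):
        idx = (start + i) // LINE
        if idx in cache.dirty:
            buf[i] = cache.lines[idx][(start + i) % LINE]
    return buf


def buffer_exit(cache, memory, start, buf):
    """Write the buffer back and refresh any cached copies of the same data."""
    memory[start:start + len(buf)] = buf        # stands in for a DMA put
    for i, value in enumerate(buf):
        idx = (start + i) // LINE
        if idx in cache.lines:
            cache.lines[idx][(start + i) % LINE] = value


memory = list(range(16))
cache = SoftwareCache(memory)
cache.write(5, 500)                       # irregular access goes through the cache
buf = buffer_entry(cache, memory, 4, 4)   # regular loop over elements 4..7
for i in range(len(buf)):
    buf[i] += 1                           # no cache lookups inside the live range
buffer_exit(cache, memory, 4, buf)
```

Without the patch step in `buffer_entry`, the buffer would read the stale value 5 from global memory instead of the cached 500; without the refresh in `buffer_exit`, a later `cache.read(5)` would return the outdated 500. The redundancy elimination and DMA overlap mentioned in the abstract are not modeled here.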

