Orchestrating Data Transfer for the Cell/B.E. Processor

T. Chen, H. Lin, and T. Zhang. Proceedings of the 22nd annual international conference on Supercomputing, page 289--298. New York, NY, USA, ACM, (2008)
DOI: 10.1145/1375527.1375570

Abstract

In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into the local memory when the total data set is too large to fit in the local memory. The data can be transferred through either a software controlled cache or a direct buffer. Such a software cache can maintain correctness and exploit reuse among references, especially when complicated aliasing or data dependences exist. However, the software cache introduces the extra overhead of cache lookup. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is desirable to judiciously use both methods, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and the direct buffer, coherence problems occur. In this paper, we propose a solution which provides compile time analysis and runtime maintenance to address this coherence issue. We use compiler analysis to guarantee that there is no access to software cache within the local live range of a direct buffer, and rely on runtime support to update values from or to software cache at the entry or exit of the direct buffer. Further, we present a global data flow analysis design to eliminate redundant coherence maintenance, and overlap computation and DMA accesses to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell, and have conducted experiments with the NAS OpenMP benchmarks. The results show that our method maintains correctness while keeping most of the opportunities for direct buffering. The execution performance can increase more than 3x compared to approaches using only the software cache. Furthermore, compile time analysis can reduce 90% of the runtime updates, thereby improving performance by 20% further.
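
The coherence problem the abstract describes can be illustrated with a small simulation. The sketch below is not the paper's compiler or runtime: the class SoftwareCache, the function scale_with_direct_buffer, the line size, and the choice to write back and invalidate overlapping cache lines at buffer entry are all assumptions made for illustration, and plain list slices stand in for DMA transfers. It only shows why a direct buffer must be synchronized with the software cache when the same datum can live in both.

```python
LINE = 4  # elements per cache line (hypothetical size)


class SoftwareCache:
    """Toy direct-mapped, write-back software cache over a global array.

    Assumes len(global_mem) is a multiple of LINE.
    """

    def __init__(self, global_mem, num_lines=8):
        self.mem = global_mem
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.data = [[0] * LINE for _ in range(num_lines)]
        self.dirty = [False] * num_lines

    def _evict(self, idx):
        # Write a dirty line back to global memory, then invalidate it.
        if self.tags[idx] is not None and self.dirty[idx]:
            base = self.tags[idx] * LINE
            self.mem[base:base + LINE] = self.data[idx]
        self.tags[idx] = None
        self.dirty[idx] = False

    def _lookup(self, addr):
        # This lookup is the per-access overhead the abstract mentions.
        tag, off = divmod(addr, LINE)
        idx = tag % self.num_lines
        if self.tags[idx] != tag:
            self._evict(idx)
            base = tag * LINE
            self.data[idx] = self.mem[base:base + LINE]
            self.tags[idx] = tag
        return idx, off

    def read(self, addr):
        idx, off = self._lookup(addr)
        return self.data[idx][off]

    def write(self, addr, value):
        idx, off = self._lookup(addr)
        self.data[idx][off] = value
        self.dirty[idx] = True

    def flush_range(self, start, end):
        # Write back and invalidate every cached line overlapping [start, end).
        for idx, tag in enumerate(self.tags):
            if tag is not None and tag * LINE < end and (tag + 1) * LINE > start:
                self._evict(idx)


def scale_with_direct_buffer(cache, start, end, factor):
    """Process a regular access range through a direct buffer."""
    cache.flush_range(start, end)      # entry: make global memory current
    buf = cache.mem[start:end]         # stands in for a DMA get
    buf = [x * factor for x in buf]    # regular accesses, no cache lookup
    cache.mem[start:end] = buf         # exit: stands in for a DMA put


mem = list(range(16))
cache = SoftwareCache(mem)
cache.write(5, 100)                    # irregular access goes through the cache
scale_with_direct_buffer(cache, 4, 12, 2)
print(mem[5], cache.read(5))
```

Without the flush_range call, the direct buffer would be filled with the stale value of element 5 from global memory, and the dirty cache line could later overwrite the buffer's result. The sketch flushes unconditionally on every entry; the abstract's global data flow analysis is aimed at removing exactly this kind of redundant maintenance.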

Links and resources

BibTeX key: Chen:2008:ODT:1375527.1375570
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 22nd annual international conference on Supercomputing
year: 2008
pages: 289--298
publisher: ACM
series: ICS '08
location: Island of Kos, Greece
acmid: 1375570
isbn: 978-1-60558-158-3
numpages: 10
DOI: 10.1145/1375527.1375570
url: http://doi.acm.org/10.1145/1375527.1375570

Cite this publication

%0 Conference Paper
%1 Chen:2008:ODT:1375527.1375570
%A Chen, Tong
%A Lin, Haibo
%A Zhang, Tao
%B Proceedings of the 22nd annual international conference on Supercomputing
%C New York, NY, USA
%D 2008
%I ACM
%K Cell Replication
%P 289--298
%R 10.1145/1375527.1375570
%T Orchestrating Data Transfer for the Cell/B.E. Processor
%U http://doi.acm.org/10.1145/1375527.1375570
%X In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into the local memory when the total data set is too large to fit in the local memory. The data can be transferred through either a software controlled cache or a direct buffer. Such a software cache can maintain correctness and exploit reuse among references, especially when complicated aliasing or data dependences exist. However, the software cache introduces the extra overhead of cache lookup. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is desirable to judiciously use both methods, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and the direct buffer, coherence problems occur. In this paper, we propose a solution which provides compile time analysis and runtime maintenance to address this coherence issue. We use compiler analysis to guarantee that there is no access to software cache within the local live range of a direct buffer, and rely on runtime support to update values from or to software cache at the entry or exit of the direct buffer. Further, we present a global data flow analysis design to eliminate redundant coherence maintenance, and overlap computation and DMA accesses to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell, and have conducted experiments with the NAS OpenMP benchmarks. The results show that our method maintains correctness while keeping most of the opportunities for direct buffering. The execution performance can increase more than 3x compared to approaches using only the software cache. Furthermore, compile time analysis can reduce 90% of the runtime updates, thereby improving performance by 20% further.
%@ 978-1-60558-158-3

@inproceedings{Chen:2008:ODT:1375527.1375570,
  abstract = {In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into the local memory when the total data set is too large to fit in the local memory. The data can be transferred through either a software controlled cache or a direct buffer. Such a software cache can maintain correctness and exploit reuse among references, especially when complicated aliasing or data dependences exist. However, the software cache introduces the extra overhead of cache lookup. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is desirable to judiciously use both methods, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and the direct buffer, coherence problems occur. In this paper, we propose a solution which provides compile time analysis and runtime maintenance to address this coherence issue. We use compiler analysis to guarantee that there is no access to software cache within the local live range of a direct buffer, and rely on runtime support to update values from or to software cache at the entry or exit of the direct buffer. Further, we present a global data flow analysis design to eliminate redundant coherence maintenance, and overlap computation and DMA accesses to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell, and have conducted experiments with the NAS OpenMP benchmarks. The results show that our method maintains correctness while keeping most of the opportunities for direct buffering. The execution performance can increase more than 3x compared to approaches using only the software cache. Furthermore, compile time analysis can reduce 90% of the runtime updates, thereby improving performance by 20% further.},
  acmid = {1375570},
  added-at = {2012-10-13T20:53:53.000+0200},
  address = {New York, NY, USA},
  author = {Chen, Tong and Lin, Haibo and Zhang, Tao},
  biburl = {https://www.bibsonomy.org/bibtex/2b5ddf23041f09ba601501b0368b1e39f/gron},
  booktitle = {Proceedings of the 22nd annual international conference on Supercomputing},
  description = {Orchestrating data transfer for the cell/B.E. processor},
  doi = {10.1145/1375527.1375570},
  interhash = {cc5486d1e39d40251a1cff19b17fd7c3},
  intrahash = {b5ddf23041f09ba601501b0368b1e39f},
  isbn = {978-1-60558-158-3},
  keywords = {Cell Replication},
  location = {Island of Kos, Greece},
  numpages = {10},
  pages = {289--298},
  publisher = {ACM},
  series = {ICS '08},
  timestamp = {2012-10-13T20:53:53.000+0200},
  title = {Orchestrating Data Transfer for the Cell/B.E. Processor},
  url = {http://doi.acm.org/10.1145/1375527.1375570},
  year = 2008
}
