
Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster

Xian Wang and Takayuki Aoki. Parallel Computing, 37 (9): 521–535 (2011)
DOI: https://doi.org/10.1016/j.parco.2011.02.007

Abstract

GPGPU has drawn much attention for accelerating non-graphics applications. A simulation based on the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of the Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load among the GPUs, and GPU-to-GPU data transfer becomes a severe overhead for the total performance. Comparisons and analyses were made among the parallel results obtained with 1D, 2D and 3D domain partitionings. As a result, with a 384×384×384 mesh system and 96 GPUs, the performance with 3D partitioning is about 3–4 times higher than that with 1D partitioning. The performance curve deviates from the ideal line due to the long communication time between GPUs. In order to hide the communication time, we introduced an overlapping technique between computation and communication, in which the data transfer and the computation are carried out simultaneously in two streams. Using 8–96 GPUs, the performance increases by a factor of about 1.1–1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of the flow around a sphere at Re = 13,000 was carried out successfully using a 2000×1000×1000 mesh system and 100 GPUs. For such a computation with 2 giga lattice nodes, 6.0 h were required to process 100,000 time steps; under this condition, the computation time (2.79 h) and the data communication time (3.06 h) are almost the same.
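As a rough illustration of the overlapping technique described in the abstract, the sketch below updates the interior of a slab-partitioned subdomain in one CUDA stream while a second stream stages the boundary planes for a non-blocking MPI halo exchange. It is a minimal toy, not the authors' code: a single scalar field stands in for the 19 distribution functions of D3Q19, the partitioning is 1D rather than 3D, and all names and sizes are illustrative assumptions.

    // Sketch (not the authors' code): overlap of computation and communication
    // using two CUDA streams and non-blocking MPI, for a 1D slab partition of
    // a single scalar field. The paper applies the same idea to the 19
    // distribution functions of D3Q19; all sizes here are assumptions.
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NX 128           // local subdomain size per GPU (assumed)
    #define NY 128
    #define NZ 128
    #define PLANE (NX * NY)  // one z-plane: the halo exchanged with a neighbor

    // Stand-in for the LBM update on interior planes z = 2 .. NZ-3, which
    // need no neighbor data and can run while the halos are in flight.
    __global__ void update_interior(float *dst, const float *src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < PLANE * (NZ - 4))
            dst[2 * PLANE + i] = 0.99f * src[2 * PLANE + i];  // dummy update
    }

    // Stand-in for the update of one boundary plane once its halo arrived.
    __global__ void update_boundary(float *dst, const float *src, int z) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < PLANE)
            dst[z * PLANE + i] = 0.99f * src[z * PLANE + i];
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int lo = (rank + nprocs - 1) % nprocs;  // periodic neighbors, for brevity
        int hi = (rank + 1) % nprocs;

        size_t bytes = (size_t)NX * NY * NZ * sizeof(float);
        float *d_src, *d_dst;
        cudaMalloc(&d_src, bytes);  cudaMemset(d_src, 0, bytes);
        cudaMalloc(&d_dst, bytes);  cudaMemset(d_dst, 0, bytes);

        // Pinned host buffers so cudaMemcpyAsync can overlap with kernels.
        float *s_lo, *s_hi, *r_lo, *r_hi;
        cudaMallocHost(&s_lo, PLANE * sizeof(float));
        cudaMallocHost(&s_hi, PLANE * sizeof(float));
        cudaMallocHost(&r_lo, PLANE * sizeof(float));
        cudaMallocHost(&r_hi, PLANE * sizeof(float));

        cudaStream_t compute, comm;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&comm);

        for (int step = 0; step < 100; ++step) {
            // Stream 1: interior update, independent of neighbor data.
            update_interior<<<(PLANE * (NZ - 4) + 255) / 256, 256, 0, compute>>>(
                d_dst, d_src);

            // Stream 2: stage the two boundary planes to the host, then
            // exchange them while the interior kernel keeps running.
            cudaMemcpyAsync(s_lo, d_src + PLANE, PLANE * sizeof(float),
                            cudaMemcpyDeviceToHost, comm);
            cudaMemcpyAsync(s_hi, d_src + (size_t)(NZ - 2) * PLANE,
                            PLANE * sizeof(float), cudaMemcpyDeviceToHost, comm);
            cudaStreamSynchronize(comm);

            MPI_Request req[4];
            MPI_Irecv(r_lo, PLANE, MPI_FLOAT, lo, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(r_hi, PLANE, MPI_FLOAT, hi, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(s_lo, PLANE, MPI_FLOAT, lo, 1, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(s_hi, PLANE, MPI_FLOAT, hi, 0, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

            // Copy received halos into the ghost planes, finish the boundary.
            cudaMemcpyAsync(d_src, r_lo, PLANE * sizeof(float),
                            cudaMemcpyHostToDevice, comm);
            cudaMemcpyAsync(d_src + (size_t)(NZ - 1) * PLANE, r_hi,
                            PLANE * sizeof(float), cudaMemcpyHostToDevice, comm);
            update_boundary<<<(PLANE + 255) / 256, 256, 0, comm>>>(d_dst, d_src, 1);
            update_boundary<<<(PLANE + 255) / 256, 256, 0, comm>>>(d_dst, d_src, NZ - 2);

            cudaDeviceSynchronize();  // join both streams before swapping fields
            float *t = d_src; d_src = d_dst; d_dst = t;
        }

        cudaStreamDestroy(compute); cudaStreamDestroy(comm);
        cudaFreeHost(s_lo); cudaFreeHost(s_hi); cudaFreeHost(r_lo); cudaFreeHost(r_hi);
        cudaFree(d_src); cudaFree(d_dst);
        MPI_Finalize();
        return 0;
    }

The hiding only helps when the interior work takes at least as long as the staging and exchange, which is consistent with the modest 1.1–1.3× gain quoted above. With the 2D and 3D partitionings studied in the paper, the exchanged faces are non-contiguous in memory and must be packed by an extra kernel, but the smaller halo area per GPU is what makes the 3D decomposition scale better.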
