Efficient Selection of Vector Instructions Using Dynamic
Programming
R. Barik, J. Zhao, and V. Sarkar. Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM
International Symposium on, pp. 201--212. IEEE, (December 2010)
Abstract
Accelerating program performance via SIMD vector units is very
common in modern processors, as evidenced by the use of SSE,
MMX, VSE, and VSX SIMD instructions in multimedia, scientific,
and embedded applications. To take full advantage of the vector
capabilities, a compiler needs to generate efficient vector code
automatically. However, most commercial and open-source
compilers fall short of using the full potential of vector
units, and only generate vector code for simple innermost loops.
In this paper, we present the design and implementation of an
auto-vectorization framework in the back-end of a dynamic
compiler that not only generates optimized vector code but is
also well integrated with the instruction scheduler and register
allocator. The framework includes a novel compile-time efficient
dynamic programming-based vector instruction selection algorithm
for straight-line code that expands opportunities for
vectorization in the following ways: (1) scalar packing explores
opportunities of packing multiple scalar variables into short
vectors; (2) judicious use of shuffle and horizontal vector
operations, when possible; and (3) algebraic reassociation
expands opportunities for vectorization by algebraic
simplification. We report performance results on the impact of
autovectorization on a set of standard numerical benchmarks
using the Jikes RVM dynamic compilation environment. Our results
show performance improvement of up to 57.71\% on an Intel Xeon
processor, compared to non-vectorized execution, with a modest
increase in compile-time in the range from 0.87\% to 9.992\%. An
investigation of the SIMD parallelization performed by v11.1 of
the Intel Fortran Compiler (IFC) on three benchmarks shows that
our system achieves speedup with vectorization in all three
cases and IFC does not. Finally, a comparison of our approach
with an implementation of the Superword Level Parallelization
(SLP) algorithm shows that our approach yields a
performance improvement of up to 13.78\% relative to SLP.
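The dynamic programming-based vector instruction selection described in the abstract can be illustrated with a toy cost model (a hedged sketch, not the paper's actual algorithm: the `best_cover` helper and all cost values below are invented for illustration). For a run of independent scalar operations, the DP chooses between covering each operation with a scalar instruction or covering the last k operations with one vector instruction plus a packing penalty for gathering scalar operands, mirroring the scalar-packing trade-off the abstract mentions.

```python
def best_cover(n, scalar_cost=1.0, vector_cost=1.0, pack_cost=0.3, width=4):
    """Minimal cost to cover n independent scalar operations,
    choosing per position between scalar and vector coverage."""
    # best[i] = minimal cost to cover the first i operations
    best = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: emit operation i-1 as a plain scalar instruction.
        best[i] = best[i - 1] + scalar_cost
        # Option 2: cover the last k operations with one vector
        # instruction of width k, paying a packing penalty per lane
        # for gathering the scalar operands into a vector register.
        for k in range(2, min(width, i) + 1):
            cand = best[i - k] + vector_cost + pack_cost * k
            best[i] = min(best[i], cand)
    return best[n]

# Four independent adds: all-scalar costs 4.0, but one 4-wide vector
# op plus packing costs 1.0 + 4 * 0.3 = 2.2, so the DP picks packing.
print(best_cover(4))  # 2.2
```

With a higher packing penalty (e.g. `pack_cost=2.0`) the same DP falls back to scalar code, which is the kind of judicious, cost-driven choice between scalar, shuffle, and vector covers the abstract describes.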
@inproceedings{Barik2010-jw,
abstract = {Accelerating program performance via SIMD vector units is very
common in modern processors, as evidenced by the use of SSE,
MMX, VSE, and VSX SIMD instructions in multimedia, scientific,
and embedded applications. To take full advantage of the vector
capabilities, a compiler needs to generate efficient vector code
automatically. However, most commercial and open-source
compilers fall short of using the full potential of vector
units, and only generate vector code for simple innermost loops.
In this paper, we present the design and implementation of an
auto-vectorization framework in the back-end of a dynamic
compiler that not only generates optimized vector code but is
also well integrated with the instruction scheduler and register
allocator. The framework includes a novel compile-time efficient
dynamic programming-based vector instruction selection algorithm
for straight-line code that expands opportunities for
vectorization in the following ways: (1) scalar packing explores
opportunities of packing multiple scalar variables into short
vectors; (2) judicious use of shuffle and horizontal vector
operations, when possible; and (3) algebraic reassociation
expands opportunities for vectorization by algebraic
simplification. We report performance results on the impact of
autovectorization on a set of standard numerical benchmarks
using the Jikes RVM dynamic compilation environment. Our results
show performance improvement of up to 57.71\% on an Intel Xeon
processor, compared to non-vectorized execution, with a modest
increase in compile-time in the range from 0.87\% to 9.992\%. An
investigation of the SIMD parallelization performed by v11.1 of
the Intel Fortran Compiler (IFC) on three benchmarks shows that
our system achieves speedup with vectorization in all three
cases and IFC does not. Finally, a comparison of our approach
with an implementation of the Superword Level Parallelization
(SLP) algorithm shows that our approach yields a
performance improvement of up to 13.78\% relative to SLP.},
added-at = {2015-09-11T11:20:45.000+0200},
author = {Barik, Rajkishore and Zhao, Jisheng and Sarkar, Vivek},
biburl = {https://www.bibsonomy.org/bibtex/26565f852c682a6ed908c57db3723dc99/christophv},
booktitle = {Microarchitecture ({MICRO}), 2010 43rd Annual {IEEE/ACM}
International Symposium on},
interhash = {cd4f612d5da02ebfd640c16d9bd5d119},
intrahash = {6565f852c682a6ed908c57db3723dc99},
keywords = {Dynamic_Optimization Dynamic_Programming Instruction_Selection Intel_Fortran_compiler MMX RVM_dynamic_compilation SIMD_parallelization SIMD_vector SSE To_Read_3 VSE Vectorization algebraic_reassociation algebraic_simplification auto_vectorization_framework dynamic_compiler dynamic_programming instruction_scheduler instruction_sets parallel_processing program_compilers register_allocator scalar_packing straight_line_code vector_code_generation vector_instruction},
month = dec,
pages = {201--212},
publisher = {IEEE},
timestamp = {2015-09-11T11:20:45.000+0200},
title = {Efficient Selection of Vector Instructions Using Dynamic
Programming},
year = 2010
}