The softmax (also called softargmax) function is widely used in machine
learning models to normalize real-valued scores into a probability
distribution. To avoid floating-point overflow, the softmax function is
conventionally implemented in three passes: the first pass to compute the
normalization constant, and two other passes to compute outputs from normalized
inputs. We analyze two variants of the Three-Pass algorithm and demonstrate
that, in a well-optimized implementation on HPC-class processors, the performance
of all three passes is limited by memory bandwidth. We then present a novel
algorithm for softmax computation in just two passes. The proposed Two-Pass
algorithm avoids both numerical overflow and the extra normalization pass by
employing an exotic representation for intermediate values, where each value is
represented as a pair of floating-point numbers: one representing the
"mantissa" and another representing the "exponent". Performance evaluation
demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the
new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up
to 28% in the AVX512 implementation and by up to 18% in the AVX2 implementation. The
proposed Two-Pass algorithm also outperforms the traditional Three-Pass
algorithm on Intel Broadwell and AMD Zen 2 processors. To foster
reproducibility, we released an open-source implementation of the new Two-Pass
Softmax algorithm, together with the other experiments in this paper, as part of the XNNPACK
library at GitHub.com/google/XNNPACK.
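
The Three-Pass algorithm the abstract refers to can be made concrete with a minimal scalar C sketch (the function name and scalar form are this sketch's assumptions; the paper's implementations are vectorized AVX2/AVX512 kernels). Each of the three loops streams over the full array, which is why, on out-of-cache inputs, every pass is bounded by memory bandwidth:

```c
#include <math.h>
#include <stddef.h>

/* Three-pass softmax, scalar sketch (assumes n >= 1).
 * The maximum is subtracted before exponentiation so expf() cannot overflow. */
void softmax_three_pass(const float* x, float* y, size_t n) {
  /* Pass 1: find the maximum input. */
  float max = x[0];
  for (size_t i = 1; i < n; i++) {
    if (x[i] > max) max = x[i];
  }
  /* Pass 2: compute shifted exponentials and accumulate their sum. */
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++) {
    y[i] = expf(x[i] - max);
    sum += y[i];
  }
  /* Pass 3: scale each value by the reciprocal of the sum. */
  const float scale = 1.0f / sum;
  for (size_t i = 0; i < n; i++) {
    y[i] *= scale;
  }
}
```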
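The Two-Pass idea can be sketched the same way. The sketch below is one plausible scalar rendering of the abstract's (mantissa, exponent) pair representation, not the paper's actual XNNPACK kernel: it recomputes the pairs in the second pass instead of storing them, ignores edge cases such as empty or all-tiny inputs, and assumes inputs of moderate magnitude (as with ML logits) so the integer exponent fits in an int:

```c
#include <math.h>
#include <stddef.h>

/* Two-pass softmax, scalar sketch of the pair representation.
 * exp(x[i]) is kept as m * 2^e with m = exp2(f), f in [-0.5, 0.5],
 * so the "mantissa" m never overflows regardless of x[i]. */
void softmax_two_pass(const float* x, float* y, size_t n) {
  const float log2e = 1.442695041f; /* 1 / ln(2) */
  /* Pass 1: accumulate the sum of exp(x[i]) as a (mantissa, exponent) pair. */
  float sum_m = 0.0f;
  int sum_e = -10000; /* exp2f(-10000) flushes to 0: an "empty" sum */
  for (size_t i = 0; i < n; i++) {
    const float t = x[i] * log2e;
    const int e = (int) lrintf(t);          /* "exponent" part */
    const float m = exp2f(t - (float) e);   /* "mantissa" in [2^-0.5, 2^0.5] */
    /* sum_m * 2^sum_e += m * 2^e, rescaled to the larger exponent. */
    if (e > sum_e) {
      sum_m = sum_m * exp2f((float) (sum_e - e)) + m;
      sum_e = e;
    } else {
      sum_m += m * exp2f((float) (e - sum_e));
    }
  }
  /* Pass 2: recompute each pair and scale: y[i] = m * 2^(e - sum_e) / sum_m.
   * sum_e is the largest exponent seen in pass 1, so 2^(e - sum_e) <= 1
   * and the scaling cannot overflow. */
  const float inv_sum_m = 1.0f / sum_m;
  for (size_t i = 0; i < n; i++) {
    const float t = x[i] * log2e;
    const int e = (int) lrintf(t);
    const float m = exp2f(t - (float) e);
    y[i] = m * inv_sum_m * exp2f((float) (e - sum_e));
  }
}
```

In this form the input is streamed twice and the output written once, one full pass over memory fewer than the Three-Pass variants, which is consistent with the bandwidth-bound speedups the abstract reports on out-of-cache inputs.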
@misc{dukhan2020twopass,
author = {Dukhan, Marat and Ablavatski, Artsiom},
keywords = {machinelearning optimization softmax},
note = {arXiv:2001.04438},
title = {The Two-Pass Softmax Algorithm},
url = {http://arxiv.org/abs/2001.04438},
year = 2020
}