The Transformer is widely used in natural language processing tasks. To train
a Transformer however, one usually needs a carefully designed learning rate
warm-up stage, which is shown to be crucial to the final performance but will
slow down the optimization and bring more hyper-parameter tunings. In this
paper, we first study theoretically why the learning rate warm-up stage is
essential and show that the location of layer normalization matters.
Specifically, we prove with mean field theory that at initialization, for the
original-designed Post-LN Transformer, which places the layer normalization
between the residual blocks, the expected gradients of the parameters near the
output layer are large. Therefore, using a large learning rate on those
gradients makes the training unstable. The warm-up stage is practically helpful
for avoiding this problem. On the other hand, our theory also shows that if the
layer normalization is put inside the residual blocks (recently proposed as
Pre-LN Transformer), the gradients are well-behaved at initialization. This
motivates us to remove the warm-up stage for the training of Pre-LN
Transformers. We show in our experiments that Pre-LN Transformers without the
warm-up stage can reach comparable results with baselines while requiring
significantly less training time and hyper-parameter tuning on a wide range of
applications.
Description
On Layer Normalization in the Transformer Architecture
%0 Generic
%1 xiong2020layer
%A Xiong, Ruibin
%A Yang, Yunchang
%A He, Di
%A Zheng, Kai
%A Zheng, Shuxin
%A Xing, Chen
%A Zhang, Huishuai
%A Lan, Yanyan
%A Wang, Liwei
%A Liu, Tie-Yan
%D 2020
%K deeplearning transformer
%T On Layer Normalization in the Transformer Architecture
%U http://arxiv.org/abs/2002.04745
%X The Transformer is widely used in natural language processing tasks. To train
a Transformer however, one usually needs a carefully designed learning rate
warm-up stage, which is shown to be crucial to the final performance but will
slow down the optimization and bring more hyper-parameter tunings. In this
paper, we first study theoretically why the learning rate warm-up stage is
essential and show that the location of layer normalization matters.
Specifically, we prove with mean field theory that at initialization, for the
original-designed Post-LN Transformer, which places the layer normalization
between the residual blocks, the expected gradients of the parameters near the
output layer are large. Therefore, using a large learning rate on those
gradients makes the training unstable. The warm-up stage is practically helpful
for avoiding this problem. On the other hand, our theory also shows that if the
layer normalization is put inside the residual blocks (recently proposed as
Pre-LN Transformer), the gradients are well-behaved at initialization. This
motivates us to remove the warm-up stage for the training of Pre-LN
Transformers. We show in our experiments that Pre-LN Transformers without the
warm-up stage can reach comparable results with baselines while requiring
significantly less training time and hyper-parameter tuning on a wide range of
applications.
@misc{xiong2020layer,
abstract = {The Transformer is widely used in natural language processing tasks. To train
a Transformer however, one usually needs a carefully designed learning rate
warm-up stage, which is shown to be crucial to the final performance but will
slow down the optimization and bring more hyper-parameter tunings. In this
paper, we first study theoretically why the learning rate warm-up stage is
essential and show that the location of layer normalization matters.
Specifically, we prove with mean field theory that at initialization, for the
original-designed Post-LN Transformer, which places the layer normalization
between the residual blocks, the expected gradients of the parameters near the
output layer are large. Therefore, using a large learning rate on those
gradients makes the training unstable. The warm-up stage is practically helpful
for avoiding this problem. On the other hand, our theory also shows that if the
layer normalization is put inside the residual blocks (recently proposed as
Pre-LN Transformer), the gradients are well-behaved at initialization. This
motivates us to remove the warm-up stage for the training of Pre-LN
Transformers. We show in our experiments that Pre-LN Transformers without the
warm-up stage can reach comparable results with baselines while requiring
significantly less training time and hyper-parameter tuning on a wide range of
applications.},
added-at = {2023-08-23T21:05:33.000+0200},
author = {Xiong, Ruibin and Yang, Yunchang and He, Di and Zheng, Kai and Zheng, Shuxin and Xing, Chen and Zhang, Huishuai and Lan, Yanyan and Wang, Liwei and Liu, Tie-Yan},
biburl = {https://www.bibsonomy.org/bibtex/20e7e589f20732b9e618449260b1ba5d1/arthi706},
description = {On Layer Normalization in the Transformer Architecture},
interhash = {4c0ba8cd34fcfaaad43dd0bb6c7c72ac},
intrahash = {0e7e589f20732b9e618449260b1ba5d1},
keywords = {deeplearning transformer},
note = {cite arxiv:2002.04745},
timestamp = {2023-08-23T21:05:33.000+0200},
title = {On Layer Normalization in the Transformer Architecture},
url = {http://arxiv.org/abs/2002.04745},
year = 2020
}