Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, Mike Lewis (2021). http://arxiv.org/abs/2108.12409

Since the introduction of the transformer model by Vaswani et al. (2017), a
fundamental question has yet to be answered: how does a model achieve
extrapolation at inference time for sequences that are longer than it saw
during training? We first show that extrapolation can be enabled by simply
changing the position representation method, though we find that current
methods do not allow for efficient extrapolation. We therefore introduce a
simpler and more efficient position method, Attention with Linear Biases
(ALiBi). ALiBi does not add positional embeddings to word embeddings; instead,
it biases query-key attention scores with a penalty that is proportional to
their distance. We show that this method trains a 1.3 billion parameter model
on input sequences of length 1024 that extrapolates to input sequences of
length 2048, achieving the same perplexity as a sinusoidal position embedding
model trained on inputs of length 2048 but training 11% faster and using 11%
less memory. ALiBi's inductive bias towards recency also leads it to outperform
multiple strong position methods on the WikiText-103 benchmark.
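
To make the mechanism concrete, here is a minimal sketch of the bias computation in PyTorch. This is not the authors' released code: the helper names (alibi_slopes, alibi_bias) are illustrative, and the slope schedule shown assumes the number of heads is a power of two, the case for which the paper gives a closed-form geometric sequence.

import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence starting at 2^(-8/num_heads),
    # following the paper's recipe for head counts that are powers of two.
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # relative[i, j] = j - i, so each past key (j < i) receives a negative
    # value whose magnitude grows linearly with its distance from the query.
    positions = torch.arange(seq_len)
    relative = positions[None, :] - positions[:, None]
    # Scale by each head's slope; result has shape (num_heads, seq_len, seq_len).
    return alibi_slopes(num_heads)[:, None, None] * relative[None, :, :]

# The bias is added to the query-key attention logits before the softmax,
# alongside the usual causal mask and in place of positional embeddings:
#   scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim)
#   scores = scores + alibi_bias(num_heads, seq_len)

Note that the causal mask is still required: entries above the diagonal (future keys, j > i) come out positive in this sketch and are masked to negative infinity as usual, so the linear penalty only reshapes attention over the permitted past positions.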
@misc{press2021train,
  author = {Press, Ofir and Smith, Noah A. and Lewis, Mike},
  title  = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
  year   = {2021},
  url    = {http://arxiv.org/abs/2108.12409},
  note   = {arXiv:2108.12409}
}