D. Bolya, C. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token Merging: Your ViT But Faster. arXiv:2210.09461, 2022. Accepted at ICLR 2023 as an Oral (top 5%); the final v2 includes stable diffusion experiments. Code: https://github.com/facebookresearch/ToMe.
Abstract
We introduce Token Merging (ToMe), a simple method to increase the throughput
of existing ViT models without needing to train. ToMe gradually combines
similar tokens in a transformer using a general and light-weight matching
algorithm that is as fast as pruning while being more accurate. Off-the-shelf,
ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518
models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3%
accuracy drop in each case. ToMe can also easily be applied during training,
improving in practice training speed up to 2x for MAE fine-tuning on video.
Training with ToMe further minimizes accuracy drop, leading to 2x the
throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find
that ToMe merges object parts into one token, even over multiple frames of
video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art
on images, video, and audio.
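The matching step the abstract alludes to is simple enough to sketch. Below is a minimal, illustrative PyTorch sketch of bipartite soft matching under simplifying assumptions (raw token features as the similarity metric rather than attention keys, no class-token protection, no size tracking for proportional attention); the function name and signature are hypothetical, and the linked repository holds the reference implementation:

import torch

def bipartite_soft_matching_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce x ([batch, tokens, dim]) to [batch, tokens - r, dim] by merging
    the r most similar token pairs. Illustrative sketch, not the ToMe code."""
    b_sz, _, dim = x.shape
    metric = x / x.norm(dim=-1, keepdim=True)          # cosine-normalize features
    a, b = metric[:, ::2], metric[:, 1::2]             # alternating bipartite split
    r = min(r, a.shape[1])                             # cannot merge more than |A| tokens
    scores = a @ b.transpose(-1, -2)                   # similarity of every A token to every B token

    node_max, node_idx = scores.max(dim=-1)            # best B partner for each A token
    order = node_max.argsort(dim=-1, descending=True)
    merged_idx, kept_idx = order[:, :r], order[:, r:]  # merge only the r strongest pairs

    src, dst = x[:, ::2], x[:, 1::2]
    batch = torch.arange(b_sz, device=x.device)[:, None]

    # Average each merged A token into its matched B token; scatter_reduce
    # handles the case where several A tokens pick the same B token.
    target = node_idx.gather(1, merged_idx)
    dst = dst.scatter_reduce(
        -2,
        target.unsqueeze(-1).expand(-1, -1, dim),
        src[batch, merged_idx],
        reduce="mean",
    )

    # Surviving A tokens plus the (partially merged) B set: N - r tokens out.
    return torch.cat([src[batch, kept_idx], dst], dim=1)

Run once per transformer block with a small fixed r, a step like this removes r tokens at every layer, which is how the abstract's "gradually combines similar tokens" plays out in practice.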
@misc{bolya2022token,
author = {Bolya, Daniel and Fu, Cheng-Yang and Dai, Xiaoliang and Zhang, Peizhao and Feichtenhofer, Christoph and Hoffman, Judy},
keywords = {data_augmentation},
note = {arXiv:2210.09461. Accepted at ICLR 2023 as an Oral (top 5\%); the final v2 includes stable diffusion experiments. See code at https://github.com/facebookresearch/ToMe},
title = {Token Merging: Your ViT But Faster},
url = {http://arxiv.org/abs/2210.09461},
year = 2022
}