Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

A. Baevski, A. Babu, W. Hsu, and M. Auli.
https://ai.facebook.com/blog/ai-self-supervised-learning-data2vec/ (2022). arXiv:2212.07525.

Abstract

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.

BibTeX key: baevski2022efficient
entry type: misc
year: 2022
howpublished: https://ai.facebook.com/blog/ai-self-supervised-learning-data2vec/
url: http://arxiv.org/abs/2212.07525
note: cite arxiv:2212.07525
