Abstract
Current self-supervised learning algorithms are often modality-specific and
require large amounts of computational resources. To address these issues, we
increase the training efficiency of data2vec, a learning objective that
generalizes across several modalities. We do not encode masked tokens, we use a
fast convolutional decoder, and we amortize the effort to build teacher
representations. data2vec 2.0 benefits from the rich contextualized target
representations introduced in data2vec which enable a fast self-supervised
learner. Experiments on ImageNet-1K image classification show that data2vec 2.0
matches the accuracy of Masked Autoencoders in 16.4x less pre-training time,
on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x
less time, and on GLUE natural language understanding it matches a retrained
RoBERTa model in half the time. Trading some speed for accuracy results in
ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
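
Two of the efficiency ideas named in the abstract, skipping masked tokens in the encoder and amortizing the teacher representation over several masked versions of the same sample, can be illustrated with a rough token-count sketch. This is not the authors' implementation: the function names, the random masking scheme, and the numbers (196 tokens, a 0.75 mask ratio, 4 masks per sample) are illustrative assumptions, and the count of tokens passed through each encoder is only a crude proxy for compute.

```python
import random


def make_mask(num_tokens, mask_ratio, rng):
    """Pick masked token indices uniformly at random (an assumed masking scheme)."""
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def encoder_token_cost(num_tokens, mask_ratio, num_masks, skip_masked, share_teacher):
    """Count tokens passed through the teacher and student encoders for one sample.

    skip_masked:   the student encodes only the unmasked tokens.
    share_teacher: the teacher target is built once and reused for every mask.
    """
    rng = random.Random(0)

    # The teacher always sees the full, unmasked input.
    teacher_passes = 1 if share_teacher else num_masks
    teacher_cost = teacher_passes * num_tokens

    student_cost = 0
    for _ in range(num_masks):
        masked = make_mask(num_tokens, mask_ratio, rng)
        visible = num_tokens - len(masked)
        student_cost += visible if skip_masked else num_tokens

    return teacher_cost, student_cost


# Hypothetical settings: 196 tokens per sample, 75% masked, 4 masks per sample.
baseline = encoder_token_cost(196, 0.75, 4, skip_masked=False, share_teacher=False)
efficient = encoder_token_cost(196, 0.75, 4, skip_masked=True, share_teacher=True)

print("baseline  (teacher, student) tokens:", baseline)
print("efficient (teacher, student) tokens:", efficient)
```

Under these assumed settings, both token counts drop by a factor of four. The sketch ignores the decoder and any non-encoder cost, so it should not be read as a reproduction of the speedups reported above.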