CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation
Y. Wang, W. Wang, S. Joty, and S. Hoi. (2021). arXiv:2109.00859. Accepted to EMNLP 2021. 13 pages.
Abstract
Pre-trained models for Natural Languages (NL) like BERT and GPT have been
recently shown to transfer well to Programming Languages (PL) and largely
benefit a broad set of code-related tasks. Despite their success, most current
methods either rely on an encoder-only (or decoder-only) pre-training that is
suboptimal for generation (resp. understanding) tasks or process the code
snippet in the same way as NL, neglecting the special characteristics of PL
such as token types. We present CodeT5, a unified pre-trained encoder-decoder
Transformer model that better leverages the code semantics conveyed from the
developer-assigned identifiers. Our model employs a unified framework to
seamlessly support both code understanding and generation tasks and allows for
multi-task learning. Besides, we propose a novel identifier-aware pre-training
task that enables the model to distinguish which code tokens are identifiers
and to recover them when they are masked. Furthermore, we propose to exploit
the user-written code comments with a bimodal dual generation task for better
NL-PL alignment. Comprehensive experiments show that CodeT5 significantly
outperforms prior methods on understanding tasks such as code defect detection
and clone detection, and generation tasks across various directions including
PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better
capture semantic information from code. Our code and pre-trained models are
released at https://github.com/salesforce/CodeT5.
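
The identifier-aware denoising objective described in the abstract can be tried directly with the released checkpoints. The following sketch assumes the Hugging Face Transformers library and the publicly released Salesforce/codet5-base checkpoint; it replaces a span of code with a T5-style sentinel token and asks the pre-trained model to fill it in. The decoding settings are illustrative, not the paper's evaluation setup.

    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    # CodeT5 pairs a RoBERTa-style code tokenizer with a T5 encoder-decoder.
    tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

    # Mask a span with a T5 sentinel token; the denoising pre-training
    # lets the model propose a plausible completion for it.
    text = "def greet(user): print(f'hello <extra_id_0>!')"
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    generated_ids = model.generate(input_ids, max_length=10)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))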
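The bimodal dual generation task can likewise be pictured as simple data construction: each user-written comment paired with its code snippet yields two training examples, one per direction. A minimal sketch follows, with hypothetical task prefixes (the released pre-training code may format prompts differently):

    def dual_generation_examples(comment: str, code: str):
        # One bimodal (comment, code) pair becomes two dual examples:
        # NL->PL code generation and PL->NL summarization, which is
        # intended to tighten NL-PL alignment during pre-training.
        nl_to_pl = {"source": "Generate code: " + comment, "target": code}
        pl_to_nl = {"source": "Summarize: " + code, "target": comment}
        return [nl_to_pl, pl_to_nl]

    pair = ("Return the maximum of two numbers.",
            "def max2(a, b): return a if a >= b else b")
    for ex in dual_generation_examples(*pair):
        print(ex["source"], "->", ex["target"])
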
@misc{wang2021codet5,
abstract = {Pre-trained models for Natural Languages (NL) like BERT and GPT have been
recently shown to transfer well to Programming Languages (PL) and largely
benefit a broad set of code-related tasks. Despite their success, most current
methods either rely on an encoder-only (or decoder-only) pre-training that is
suboptimal for generation (resp. understanding) tasks or process the code
snippet in the same way as NL, neglecting the special characteristics of PL
such as token types. We present CodeT5, a unified pre-trained encoder-decoder
Transformer model that better leverages the code semantics conveyed from the
developer-assigned identifiers. Our model employs a unified framework to
seamlessly support both code understanding and generation tasks and allows for
multi-task learning. Besides, we propose a novel identifier-aware pre-training
task that enables the model to distinguish which code tokens are identifiers
and to recover them when they are masked. Furthermore, we propose to exploit
the user-written code comments with a bimodal dual generation task for better
NL-PL alignment. Comprehensive experiments show that CodeT5 significantly
outperforms prior methods on understanding tasks such as code defect detection
and clone detection, and generation tasks across various directions including
PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better
capture semantic information from code. Our code and pre-trained models are
released at https://github.com/salesforce/CodeT5.},
author = {Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C. H.},
keywords = {code},
note = {arXiv:2109.00859. Accepted to EMNLP 2021. 13 pages},
title = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation},
url = {http://arxiv.org/abs/2109.00859},
year = 2021
}