Inproceedings

Improving Cross-Lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

, , and .
Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 1048–1056. New York, NY, USA, Association for Computing Machinery, (2023)
DOI: 10.1145/3539597.3570468

Abstract

Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models provides great support for designing neural cross-lingual retrieval models. However, due to unbalanced pre-training data across languages, multilingual language models have already shown a performance gap between high- and low-resource languages in many downstream tasks, and cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes it more challenging to train cross-lingual retrieval models. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL forms the cross-lingual token alignment task as an optimal transport problem to learn from a well-trained monolingual retrieval model. By separating the cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.
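
The abstract frames cross-lingual token alignment as an optimal transport problem so that a multilingual student can be distilled from a monolingual teacher using only bitext. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the paper's actual implementation: a log-domain Sinkhorn iteration computes a soft transport plan between the teacher's English token embeddings and the student's low-resource-language token embeddings for one bitext pair, and the plan is used to align student tokens to teacher tokens for an embedding-level distillation loss. The function names (sinkhorn_plan, ot_distillation_loss), the cosine-distance cost, the uniform marginals, and the MSE objective are all illustrative assumptions.

    import math

    import torch
    import torch.nn.functional as F


    def sinkhorn_plan(cost, eps=0.05, n_iters=50):
        """Entropy-regularized optimal transport via log-domain Sinkhorn iterations.

        cost: (m, n) pairwise cost between teacher tokens and student tokens.
        Returns a soft transport plan of shape (m, n); uniform marginals on both
        sides are an illustrative assumption, not necessarily the paper's choice.
        """
        m, n = cost.shape
        log_a = torch.full((m,), -math.log(m), dtype=cost.dtype)
        log_b = torch.full((n,), -math.log(n), dtype=cost.dtype)
        f = torch.zeros(m, dtype=cost.dtype)
        g = torch.zeros(n, dtype=cost.dtype)
        for _ in range(n_iters):
            M = (-cost + f[:, None] + g[None, :]) / eps
            f = f + eps * (log_a - torch.logsumexp(M, dim=1))
            M = (-cost + f[:, None] + g[None, :]) / eps
            g = g + eps * (log_b - torch.logsumexp(M, dim=0))
        return torch.exp((-cost + f[:, None] + g[None, :]) / eps)


    def ot_distillation_loss(teacher_tok, student_tok):
        """Hypothetical token-level distillation through an OT alignment.

        teacher_tok: (m, d) token embeddings from a well-trained monolingual
                     (e.g. English) retrieval model, one side of a bitext pair.
        student_tok: (n, d) token embeddings from the multilingual student for
                     the low-resource-language side of the same pair.
        """
        # Cosine-distance cost between every teacher/student token pair.
        t = F.normalize(teacher_tok, dim=-1)
        s = F.normalize(student_tok, dim=-1)
        cost = 1.0 - t @ s.T                          # (m, n)
        with torch.no_grad():
            plan = sinkhorn_plan(cost)                # soft cross-lingual alignment
        # Barycentric projection: a plan-weighted mixture of student tokens that
        # "translates" each teacher token into the student's embedding space.
        aligned = (plan / plan.sum(dim=1, keepdim=True)) @ student_tok
        # Pull the aligned student representations toward the frozen teacher's.
        return F.mse_loss(aligned, teacher_tok.detach())


    if __name__ == "__main__":
        torch.manual_seed(0)
        teacher = torch.randn(7, 32)                        # 7 English tokens, toy 32-dim embeddings
        student = torch.randn(9, 32, requires_grad=True)    # 9 target-language tokens
        loss = ot_distillation_loss(teacher, student)
        loss.backward()                 # gradients reach the student embeddings
        print(float(loss))

In this sketch the transport plan is detached, so each step treats the alignment as a fixed soft mapping and the distillation reduces to a plain regression from aligned student tokens to teacher tokens, which is one simple way to keep the cross-lingual alignment separate from the query-document matching knowledge the teacher already encodes.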
