Compression-Based Parts-of-Speech Tagger for The Arabic Language
I. Alkhazi and W. Teahan. International Journal of Computational Linguistics (IJCL), 10 (1):
1-15 (April 2019)
Abstract
This paper explores the use of Compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger, the first models were trained using a silver-standard data from two different POS Arabic taggers, and the second model utilised the BAAC corpus, which is a 50K term manually annotated MSA corpus, where the PPM tagger achieved an accuracy of 93.07%. Also, the tag-based models were utilised to evaluate the performance of the new tagger by first tagging different Classical Arabic corpora and Modern Standard Arabic corpora then compressing the text using tag-based compression models. The results show that the use of silver-standard models has led to a reduction in the quality of the tag-based compression by an average of 0.43%, whereas the use of the gold-standard model has increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
%0 Journal Article
%1 alkhazi2019compressionbased
%A Alkhazi, Ibrahim S.
%A Teahan, William J.
%D 2019
%J International Journal of Computational Linguistics (IJCL)
%K Arabic Hidden Language Markov Model Model, Natural Part-of-Speech Processing, Statistical Tagger,
%N 1
%P 1-15
%T Compression-Based Parts-of-Speech Tagger for The Arabic Language
%U http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-95
%V 10
%X This paper explores the use of Compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger, the first models were trained using a silver-standard data from two different POS Arabic taggers, and the second model utilised the BAAC corpus, which is a 50K term manually annotated MSA corpus, where the PPM tagger achieved an accuracy of 93.07%. Also, the tag-based models were utilised to evaluate the performance of the new tagger by first tagging different Classical Arabic corpora and Modern Standard Arabic corpora then compressing the text using tag-based compression models. The results show that the use of silver-standard models has led to a reduction in the quality of the tag-based compression by an average of 0.43%, whereas the use of the gold-standard model has increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
@article{alkhazi2019compressionbased,
abstract = {This paper explores the use of Compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger, the first models were trained using a silver-standard data from two different POS Arabic taggers, and the second model utilised the BAAC corpus, which is a 50K term manually annotated MSA corpus, where the PPM tagger achieved an accuracy of 93.07%. Also, the tag-based models were utilised to evaluate the performance of the new tagger by first tagging different Classical Arabic corpora and Modern Standard Arabic corpora then compressing the text using tag-based compression models. The results show that the use of silver-standard models has led to a reduction in the quality of the tag-based compression by an average of 0.43%, whereas the use of the gold-standard model has increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.},
added-at = {2019-10-09T19:41:03.000+0200},
author = {Alkhazi, Ibrahim S. and Teahan, William J.},
biburl = {https://www.bibsonomy.org/bibtex/284a0fe941073f98cb103964f07dbbf0a/cscjournals},
interhash = {1b588ebe5032c86c0bcf333f17dc6712},
intrahash = {84a0fe941073f98cb103964f07dbbf0a},
issn = {2180-1266},
journal = {International Journal of Computational Linguistics (IJCL)},
keywords = {Arabic Hidden Language Markov Model Model, Natural Part-of-Speech Processing, Statistical Tagger,},
language = {English},
month = {April},
number = 1,
pages = {1--15},
timestamp = {2019-10-09T19:41:03.000+0200},
title = {Compression-Based Parts-of-Speech Tagger for The Arabic Language},
url = {http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-95},
volume = 10,
year = 2019
}