Article,

Compression-Based Parts-of-Speech Tagger for The Arabic Language

I. Alkhazi and W. Teahan.
International Journal of Computational Linguistics (IJCL), 10 (1): 1 - 15 (April 2019)

Abstract

This paper explores the use of compression-based models to train a part-of-speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction by Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger. The first models were trained on silver-standard data from two different Arabic POS taggers. The second model used the BAAC corpus, a 50K-term manually annotated Modern Standard Arabic (MSA) corpus, on which the PPM tagger achieved an accuracy of 93.07%. The tag-based models were also used to evaluate the performance of the new tagger by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the silver-standard models reduced the quality of the tag-based compression by an average of 0.43%, whereas the gold-standard model increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.

BibTeX key: alkhazi2019compressionbased
entry type: article
year: 2019
month: April
journal: International Journal of Computational Linguistics (IJCL)
number: 1
pages: 1 - 15
volume: 10
language: English
issn: 2180-1266
url: http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-95
