Exploring Twitter as a Source of an Arabic Dialect Corpus
A. Alshutayri and E. Atwell. International Journal of Computational Linguistics (IJCL), 8 (2):
37-44 (June 2017)
Abstract
Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so such text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
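The abstract describes labelling tweets with one of five dialect groups according to the sender's geographic location. As a minimal, hypothetical sketch of that idea (the country-to-group mapping below is an assumption based only on the five groups named in the abstract, not the paper's actual assignment):

```python
# Hypothetical sketch: assign a dialect-group label to each tweet based on
# the sender's country code. The mapping is an illustrative assumption.
COUNTRY_TO_DIALECT = {
    "SA": "Gulf", "KW": "Gulf", "AE": "Gulf",
    "IQ": "Iraqi",
    "EG": "Egyptian",
    "SY": "Levantine", "LB": "Levantine", "JO": "Levantine",
    "MA": "North African", "DZ": "North African", "TN": "North African",
}

def label_by_location(tweets):
    """Keep tweets whose sender country maps to a known dialect group,
    attaching the group label; tweets from unmapped countries are skipped."""
    labelled = []
    for tweet in tweets:
        group = COUNTRY_TO_DIALECT.get(tweet.get("country_code"))
        if group is not None:
            labelled.append({"text": tweet["text"], "dialect": group})
    return labelled

sample = [
    {"text": "tweet one", "country_code": "EG"},
    {"text": "tweet two", "country_code": "XX"},  # unknown country: skipped
]
print(label_by_location(sample))
```

The labelled output could then be exported (e.g. as ARFF) for classification in WEKA, as the paper does with its own pipeline.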
%0 Journal Article
%1 alshutayri2017exploring
%A Alshutayri, Areej Odah
%A Atwell, Eric
%D 2017
%J International Journal of Computational Linguistics (IJCL)
%K Arabic, Dialect, Dialectal, Media, Multi, Phonological, Social, Tweet, Twitter, Variations
%N 2
%P 37-44
%T Exploring Twitter as a Source of an Arabic Dialect Corpus
%U http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-83
%V 8
%X Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so such text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
@article{alshutayri2017exploring,
abstract = {Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so such text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.},
added-at = {2018-12-14T08:22:33.000+0100},
author = {Alshutayri, Areej Odah and Atwell, Eric},
biburl = {https://www.bibsonomy.org/bibtex/2c198832c136904662020ec9c5629af16/cscjournals},
interhash = {0db6ebc04be575509d4b0729910a468d},
intrahash = {c198832c136904662020ec9c5629af16},
issn = {2180-1266},
journal = {International Journal of Computational Linguistics (IJCL)},
keywords = {Arabic, Dialect, Dialectal, Media, Multi, Phonological, Social, Tweet, Twitter, Variations},
language = {English},
month = {June},
number = 2,
pages = {37--44},
timestamp = {2018-12-14T08:22:33.000+0100},
title = {Exploring Twitter as a Source of an Arabic Dialect Corpus},
url = {http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-83},
volume = 8,
year = 2017
}