Article

Designing A Rule Based Stemming Algorithm for Kambaata Language Text

J. Sumamo, and S. Teferra.
International Journal of Computational Linguistics (IJCL), 9 (2): 41-54 (June 2018)

Abstract

Stemming is the process of reducing the inflectional and derivational variants of a word to its stem, and it is important in several natural language processing applications. In this research, a rule-based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is single-pass, context-sensitive, and longest-match, and was designed by adapting the rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology whose word formation relies mostly on suffixation, although it also involves infixation, compounding, and reduplication. To evaluate the stemmer's effectiveness, the error counting method was applied to a test set of 2425 distinct words. Of these, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over-stemmed, and 13 words (0.54%) were under-stemmed. A dictionary reduction of 65.86% was also achieved during evaluation. The main source of stemming errors is the language's rich and complex morphology. A number of errors could therefore be corrected by adding more rules, but it is difficult to eliminate them completely because the morphology makes use of concatenated suffixes and of irregularities through infixation, compounding, blending, and reduplication of affixes.
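As a rough illustration of the ideas named in the abstract, the following Python sketch shows a single-pass, longest-match suffix stripper with a simple context condition, together with the error-counting percentages. It is not the authors' algorithm: the suffix list is made of English placeholders rather than Kambaata rules, and the names `SUFFIXES`, `MIN_STEM_LEN`, `stem`, `error_percentages`, and `dictionary_reduction` are hypothetical. The dictionary-reduction formula (share of distinct words removed once they are conflated into distinct stems) is an assumed, commonly used definition that the abstract does not spell out.

```python
from typing import Dict, Iterable, Sequence

# Placeholder suffixes for illustration only; NOT Kambaata suffixes
# and NOT the rule set from the paper.
SUFFIXES = ["ations", "ation", "ing", "es", "s"]

# Assumed context condition: a suffix is stripped only if at least
# this many characters remain in the stem.
MIN_STEM_LEN = 3


def stem(word: str,
         suffixes: Iterable[str] = SUFFIXES,
         min_stem_len: int = MIN_STEM_LEN) -> str:
    """Strip the longest matching suffix once (single pass)."""
    # Trying longer suffixes first gives longest-match behaviour.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


def error_percentages(correct: int, over: int, under: int) -> Dict[str, float]:
    """Error counting: share of correct, over-stemmed, and under-stemmed words."""
    total = correct + over + under
    counts = {"correct": correct, "over": over, "under": under}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


def dictionary_reduction(words: Sequence[str], stems: Sequence[str]) -> float:
    """Assumed definition: 100 * (distinct words - distinct stems) / distinct words."""
    n_words = len(set(words))
    n_stems = len(set(stems))
    return round(100 * (n_words - n_stems) / n_words, 2)


if __name__ == "__main__":
    sample = ["walks", "walking", "walk", "talks"]
    print([stem(w) for w in sample])
    # Counts taken from the abstract: 2349 correct, 63 over-stemmed, 13 under-stemmed.
    print(error_percentages(2349, 63, 13))
    print(dictionary_reduction(sample, [stem(w) for w in sample]))
```

Sorting the candidate suffixes by length before matching is what makes the sketch longest-match, and returning after the first successful strip keeps it single-pass. Applied to the counts reported in the abstract, `error_percentages` reproduces the stated 96.87%, 2.60%, and 0.54%.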

BibTeX key: sumamo2018designing
entry type: article
year: 2018
month: June
journal: International Journal of Computational Linguistics (IJCL)
number: 2
pages: 41-54
volume: 9
language: English
issn: 2180-1266
url: http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-93


Cite this publication

@article{sumamo2018designing,
  abstract = {Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single pass, context-sensitive, and longest-matching designed by adapting rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology and word formations mostly relying on suffixation; even though its word formation involves infixation, compounding and reduplication as well. The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation. The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.},
  added-at = {2018-12-12T05:38:16.000+0100},
  author = {Sumamo, Jonathan Samuel and Teferra, Solomon},
  biburl = {https://www.bibsonomy.org/bibtex/2a6b2e37a19bb2dee85c8ef2285beeec9/cscjournals},
  interhash = {19eb5fd50f98b1c01f0fc3afc896cbfe},
  intrahash = {a6b2e37a19bb2dee85c8ef2285beeec9},
  issn = {2180-1266},
  journal = {International Journal of Computational Linguistics (IJCL)},
  keywords = {Algorithm, Kambaata Language, Rule-Based Stemmer, Stemming},
  language = {English},
  month = {June},
  number = 2,
  pages = {41-54},
  timestamp = {2018-12-12T05:38:16.000+0100},
  title = {Designing A Rule Based Stemming Algorithm for Kambaata Language Text},
  url = {http://www.cscjournals.org/library/manuscriptinfo.php?mc=IJCL-93},
  volume = 9,
  year = 2018
}
