Article

Brill's Rule-based Part of Speech Tagger for Kadazan

Int. J. on Recent Trends in Engineering and Technology, 10 (1): 8 (January 2014)

Abstract

This paper presents a part-of-speech (POS) tagger for the Kadazan language that implements Brill's approach, also known as transformation-based error-driven learning. Kadazan was chosen because no POS tagger has yet been developed for it. The study therefore develops a POS tagger specifically for Kadazan that can tag a Kadazan corpus systematically, help reduce ambiguity, and serve as a language-learning tool. The main objective is to automate the tagging process for Kadazan. Brill's approach is an enhanced version of the original rule-based approach: it applies a set of predefined rules to transform wrong tags in the corpus into correct ones. To reach the main goal, two supporting objectives were set: to create lexical and contextual rules specific to Kadazan by applying Brill's approach, and to evaluate the effectiveness of Kadazan POS tagging with Brill's approach. The tagging process is divided into four main phases. In the first phase, new untagged text is input into the system. In the second phase, the initial-state annotator assigns each word in the input its most likely tag, producing a temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus to detect any errors. In the last phase, the rules are applied to reduce those errors and correct the temporary corpus. The tagger was trained on two Kadazan children's story books containing 2069 words. Evaluation compares the tags produced by Brill's approach with manual tagging. The Kadazan POS tagger achieved around 93% accuracy. This study shows how Brill's tagging approach can be used to identify tags for the Kadazan language.
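
The four phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the corpus, the tag names (DT, NN, VB, VBD, TO), the single rule template ("change tag A to B when the previous tag is C"), and all function names are assumptions chosen for illustration. The toy sentences are English stand-ins, since the abstract does not show any Kadazan data, and only contextual rules are sketched (lexical rules are omitted).

```python
from collections import Counter, defaultdict
from itertools import product

# Toy goal corpus (manually tagged). English stand-ins, NOT Kadazan data.
GOAL_CORPUS = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
]

def build_lexicon(goal_corpus):
    """Map each word to its most frequent tag in the goal corpus."""
    counts = defaultdict(Counter)
    for sentence in goal_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def initial_tags(words, lexicon, default_tag="NN"):
    """Phase 2: initial-state annotator (most likely tag per word)."""
    return [lexicon.get(w, default_tag) for w in words]

def apply_rule(tags, rule):
    """Apply one contextual rule: from_tag -> to_tag if previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(tagged, goal_tags):
    """Phase 3: count positions where the temporary corpus differs from the goal."""
    return sum(t != g for ts, gs in zip(tagged, goal_tags) for t, g in zip(ts, gs))

def learn_rules(goal_corpus, lexicon, max_rules=10):
    """Greedily pick the rule that removes the most errors, until none helps."""
    words = [[w for w, _ in s] for s in goal_corpus]
    goal_tags = [[t for _, t in s] for s in goal_corpus]
    tagset = sorted({t for s in goal_tags for t in s})
    current = [initial_tags(ws, lexicon) for ws in words]
    rules = []
    while len(rules) < max_rules:
        base = count_errors(current, goal_tags)
        best_rule, best_gain = None, 0
        for rule in product(tagset, repeat=3):  # (from_tag, to_tag, prev_tag)
            if rule[0] == rule[1]:
                continue
            candidate = [apply_rule(ts, rule) for ts in current]
            gain = base - count_errors(candidate, goal_tags)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break
        current = [apply_rule(ts, best_rule) for ts in current]
        rules.append(best_rule)
    return rules

def tag_text(words, lexicon, rules):
    """Phases 1, 2 and 4: take untagged text, tag it, then apply learned rules in order."""
    tags = initial_tags(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

def accuracy(predicted, goal_tags):
    """Share of tokens whose predicted tag matches the manual tag."""
    total = sum(len(gs) for gs in goal_tags)
    return 1 - count_errors(predicted, goal_tags) / total

LEXICON = build_lexicon(GOAL_CORPUS)
RULES = learn_rules(GOAL_CORPUS, LEXICON)
```

In this toy setting, the initial annotator tags "race" as NN everywhere because that is its most frequent tag, and the learner recovers one correcting rule, `("NN", "VB", "TO")`. Calling `tag_text(["to", "race"], LEXICON, RULES)` then yields the corrected tags, and `accuracy` mirrors the evaluation against manual tagging that the abstract describes.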
