Inproceedings,

Seeing more than whitespace—Tokenisation and disambiguation in a North Sámi grammar checker

Workshop on the Use of Computational Methods in the Study of Endangered Languages, 1, page 46. ComputEL, (2019)

Abstract

Communities of lesser-resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback depends on well-tokenised input, but traditional tokenisation as shallow preprocessing is inadequate to solve the challenges of real-world language usage. We present an alternative where tokenisation remains ambiguous until we have linguistic context information available. This lets us accurately detect sentence boundaries, multiwords and compound errors. We describe a North Sámi grammar checker with such a tokenisation system, and show the results of its evaluation.
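The idea of keeping tokenisation ambiguous until context is available can be illustrated with a minimal sketch. It is a toy, not the system described in the paper: the abbreviation list, the `Reading` class, and the single context rule (a full stop after a known abbreviation ends the sentence only if the next chunk is capitalised or the text ends) are all hypothetical, and the example text is English for readability.

```python
from dataclasses import dataclass

# Hypothetical toy abbreviation list; not taken from the paper.
ABBREVIATIONS = {"nr.", "etc."}


@dataclass(frozen=True)
class Reading:
    tokens: tuple        # surface tokens this reading splits the chunk into
    ends_sentence: bool  # True if this reading closes a sentence


def ambiguous_tokenise(text):
    """Split on whitespace, but keep every plausible reading of each chunk."""
    cohorts = []
    for chunk in text.split():
        readings = []
        if chunk.endswith(".") and len(chunk) > 1:
            # Reading 1: a word followed by a sentence-final full stop.
            readings.append(Reading((chunk[:-1], "."), True))
            if chunk.lower() in ABBREVIATIONS:
                # Reading 2: the full stop belongs to an abbreviation.
                readings.append(Reading((chunk,), False))
        else:
            readings.append(Reading((chunk,), False))
        cohorts.append(readings)
    return cohorts


def disambiguate(cohorts):
    """Pick one reading per chunk, using the following chunk as context."""
    chosen = []
    for i, readings in enumerate(cohorts):
        if len(readings) == 1:
            chosen.append(readings[0])
            continue
        nxt = cohorts[i + 1][0].tokens[0] if i + 1 < len(cohorts) else None
        # Toy rule: treat the full stop as a sentence boundary only if the
        # text ends here or the next chunk starts with an uppercase letter.
        boundary = nxt is None or nxt[0].isupper()
        chosen.append(next(r for r in readings if r.ends_sentence == boundary))
    return chosen


def sentences(text):
    """Group the disambiguated tokens into sentences."""
    out, current = [], []
    for reading in disambiguate(ambiguous_tokenise(text)):
        current.extend(reading.tokens)
        if reading.ends_sentence:
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out
```

Calling `sentences("See nr. 5 now. Then stop.")` keeps `nr.` as one token because a digit follows it, while the full stops after `now` and `stop` are split off as sentence boundaries. A shallow tokeniser that commits to one reading before looking at context would have to guess at `nr.`.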
