Abstract
Communities of lesser-resourced languages like North Sámi benefit
from language tools such as spell checkers and grammar checkers to
improve literacy. Accurate error feedback depends on
well-tokenised input, but traditional tokenisation as shallow
preprocessing is inadequate to solve the challenges of real-world
language usage. We present an alternative where tokenisation
remains ambiguous until we have linguistic context information
available. This lets us accurately detect sentence boundaries,
multiwords and compound errors. We describe a North Sámi
grammar checker with such a tokenisation system, and show the
results of its evaluation.
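As a toy illustration of the idea (not the paper's implementation; all names and the multiword lexicon below are hypothetical), a tokeniser can keep every segmentation of a potential multiword as a live candidate and only commit once context information is available:

```python
# Toy sketch: tokenisation stays ambiguous until context disambiguates.
# The input sentence, the multiword lexicon and the context rule are
# invented for illustration only.

def candidate_tokenisations(words, multiwords):
    """Yield every segmentation, keeping multiword readings ambiguous."""
    if not words:
        yield []
        return
    # Reading 1: treat the first word as a single token.
    for rest in candidate_tokenisations(words[1:], multiwords):
        yield [words[0]] + rest
    # Reading 2: treat the first two words as one multiword token.
    if len(words) > 1 and (words[0], words[1]) in multiwords:
        for rest in candidate_tokenisations(words[2:], multiwords):
            yield [" ".join(words[:2])] + rest

def disambiguate(candidates, context_prefers_multiword):
    """Pick one reading once (simulated) context information is known."""
    for cand in candidates:
        has_multiword = any(" " in tok for tok in cand)
        if has_multiword == context_prefers_multiword:
            return cand
    return candidates[0]

words = ["we", "visited", "New", "York"]   # hypothetical input
multiwords = {("New", "York")}             # hypothetical multiword lexicon
cands = list(candidate_tokenisations(words, multiwords))
print(disambiguate(cands, context_prefers_multiword=True))
# → ['we', 'visited', 'New York']
```

The point of the sketch is the ordering: segmentation alternatives are enumerated first and the choice between them is deferred, rather than being fixed by shallow preprocessing before any linguistic analysis runs.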