Seeing more than whitespace—Tokenisation and disambiguation in a North Sámi grammar checker
L. Wiechetek, K. Unhammer, and S. Moshagen. Workshop on the Use of Computational Methods in the Study of Endangered Languages, 1, page 46. ComputEL, (2019)
Abstract
Communities of lesser resourced languages like North Sámi benefit
from language tools such as spell checkers and grammar checkers to
improve literacy. Accurate error feedback is dependent on
well-tokenised input, but traditional tokenisation as shallow
preprocessing is inadequate to solve the challenges of real-world
language usage. We present an alternative where tokenisation
remains ambiguous until we have linguistic context information
available. This lets us accurately detect sentence boundaries,
multiwords and compound errors. We describe a North Sámi
grammar checker with such a tokenisation system, and show the
results of its evaluation.
%0 Conference Paper
%1 wiechetek2019seeing
%A Wiechetek, Linda
%A Unhammer, Kevin Brubeck
%A Moshagen, Sjur Nørstebø
%B Workshop on the Use of Computational Methods in the Study of Endangered Languages
%D 2019
%K disambiguation grammar hfst multiwords mwe myown tokenisation
%P 46
%T Seeing more than whitespace—Tokenisation and disambiguation in a North Sámi grammar checker
%U https://computel-workshop.org/wp-content/uploads/2019/02/CEL3_book_papers_draft.pdf#page=58
%V 1
%X Communities of lesser resourced languages like North Sámi benefit
from language tools such as spell checkers and grammar checkers to
improve literacy. Accurate error feedback is dependent on
well-tokenised input, but traditional tokenisation as shallow
preprocessing is inadequate to solve the challenges of real-world
language usage. We present an alternative where tokenisation
remains ambiguous until we have linguistic context information
available. This lets us accurately detect sentence boundaries,
multiwords and compound errors. We describe a North Sámi
grammar checker with such a tokenisation system, and show the
results of its evaluation.
@inproceedings{wiechetek2019seeing,
abstract = {
Communities of lesser resourced languages like North Sámi benefit
from language tools such as spell checkers and grammar checkers to
improve literacy. Accurate error feedback is dependent on
well-tokenised input, but traditional tokenisation as shallow
preprocessing is inadequate to solve the challenges of real-world
language usage. We present an alternative where tokenisation
remains ambiguous until we have linguistic context information
available. This lets us accurately detect sentence boundaries,
multiwords and compound errors. We describe a North Sámi
grammar checker with such a tokenisation system, and show the
results of its evaluation.},
added-at = {2019-06-07T10:00:43.000+0200},
author = {Wiechetek, Linda and Unhammer, Kevin Brubeck and Moshagen, Sjur N{\o}rsteb{\o}},
biburl = {https://www.bibsonomy.org/bibtex/23a50b54e053bb6312df12676551c9be3/unhammer},
booktitle = {Workshop on the Use of Computational Methods in the Study of Endangered Languages},
interhash = {dd27f7926035ebec8001fc2b6ee13b11},
intrahash = {3a50b54e053bb6312df12676551c9be3},
keywords = {disambiguation grammar hfst multiwords mwe myown tokenisation},
organization = {ComputEL},
pages = 46,
timestamp = {2019-06-07T10:00:43.000+0200},
title = {Seeing more than whitespace---Tokenisation and disambiguation in a {North S{\'a}mi} grammar checker},
url = {https://computel-workshop.org/wp-content/uploads/2019/02/CEL3_book_papers_draft.pdf#page=58},
venue = {Honolulu, Hawai’i},
volume = 1,
year = 2019
}