Article

Do birds of a feather really flock together, or how to choose training samples for authorship attribution

Maciej Eder and Jan Rybicki.
Literary and Linguistic Computing, 28 (2): 229-236 (2013)
DOI: 10.1093/llc/fqs036

Abstract

This study investigates the problem of choosing appropriate texts for the training set in machine-learning classification techniques. Although intuition suggests picking the most typical texts (whatever ‘typical’ means) by the authors studied, any arbitrary choice might substantially affect the final results. Thus, to eschew cherry-picking, we introduce a method for verifying the choice of ‘typical’ samples, inspired by k-fold cross-validation procedures. Namely, we use a bootstrap-like approach to randomly choose, in 500 iterations, the samples for the training and test sets. Next, we examine the obtained 500 attribution accuracy scores: if their density function is widely dispersed, the corpus is assumed to be highly sensitive to permutations of the training set. To test this methodology empirically, we selected roughly similar corpora in five languages: English, French, German, Italian, and Polish. The results show considerable resistance of the English corpus to permutations, while the other corpora turned out to be more dependent on the choice of samples; the Polish corpus yields both accuracy and consistency below any acceptable standard.
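The bootstrap-like verification loop described in the abstract is straightforward to prototype. The sketch below is illustrative only: the 500 iterations follow the abstract, but the word-frequency features, the 50/50 per-author split, the Burrows-Delta-style nearest-neighbour classifier, and all names (bootstrap_accuracies, freqs, authors) are assumptions for demonstration, not the authors' exact implementation.

```python
# Minimal sketch of the bootstrap-like verification: repeatedly permute the
# training/test split and record the attribution accuracy of a simple
# Burrows-Delta-style classifier (an assumption; not necessarily the
# classifier used in the paper).
import numpy as np

def bootstrap_accuracies(freqs, authors, n_iter=500, train_frac=0.5, seed=0):
    """freqs: (n_texts, n_words) relative word frequencies;
    authors: author label for each text (at least two texts per author
    are needed for a meaningful score).
    Returns n_iter accuracy scores, one per random training-set choice."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    authors = np.asarray(authors)
    scores = np.empty(n_iter)
    for i in range(n_iter):
        train, test = [], []
        # draw a random training subset per author; the rest form the test set
        for a in np.unique(authors):
            idx = rng.permutation(np.flatnonzero(authors == a))
            cut = max(1, int(round(len(idx) * train_frac)))
            train.extend(idx[:cut])
            test.extend(idx[cut:])
        train, test = np.array(train), np.array(test)
        # z-score word frequencies with training-set statistics
        mu = freqs[train].mean(axis=0)
        sd = freqs[train].std(axis=0) + 1e-12
        ztr = (freqs[train] - mu) / sd
        zte = (freqs[test] - mu) / sd
        # nearest training text by mean absolute difference of z-scores
        # (the classic Burrows Delta distance)
        dists = np.abs(zte[:, None, :] - ztr[None, :, :]).mean(axis=2)
        pred = authors[train][dists.argmin(axis=1)]
        scores[i] = np.mean(pred == authors[test])
    return scores

# A widely dispersed histogram or density estimate of the returned scores
# signals a corpus that is sensitive to the choice of training samples.
```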

Users

  • @katharina.

Comments and Reviews

  • @katharina.
    9 years ago
    Many informative diagrams on the selection of literary corpora. Not in the final paper, as there was no space.