Article

Do birds of a feather really flock together, or how to choose training samples for authorship attribution

Maciej Eder and Jan Rybicki.
Literary and Linguistic Computing, 28 (2): 229-236 (2013)
DOI: 10.1093/llc/fqs036

Abstract

This study investigates the problem of choosing appropriate texts for the training set in machine-learning classification techniques. Although intuition suggests picking the most typical texts (whatever ‘typical’ means) by the authors studied, any arbitrary choice might substantially affect the final results. Thus, to eschew cherry-picking, we introduce a method for verifying the choice of ‘typical’ samples, inspired by k-fold cross-validation procedures. Namely, we use a bootstrap-like approach to randomly choose, in 500 iterations, the samples for the training and test sets. Next, we examine the obtained 500 attribution accuracy scores: if their density function is widely dispersed, the corpus is assumed to be highly sensitive to permutations of the training set. To test this methodology empirically, we selected roughly similar corpora in five languages: English, French, German, Italian, and Polish. The results show considerable resistance of the English corpus to permutations, while the other corpora turned out to be more dependent on the choice of samples; the Polish corpus yields both accuracy and consistency below any acceptable standard.
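The bootstrap-like verification loop described in the abstract is straightforward to prototype. The sketch below is illustrative only: the 500 iterations follow the abstract, but the word-frequency features, the 50/50 per-author split, the Burrows-Delta-style nearest-neighbour classifier, and all names (bootstrap_accuracies, freqs, authors) are assumptions for demonstration, not the authors' exact implementation.

```python
# Minimal sketch of the bootstrap-like verification: repeatedly permute the
# training/test split and record the attribution accuracy of a simple
# Burrows-Delta-style classifier (an assumption; not necessarily the
# classifier used in the paper).
import numpy as np

def bootstrap_accuracies(freqs, authors, n_iter=500, train_frac=0.5, seed=0):
    """freqs: (n_texts, n_words) relative word frequencies;
    authors: author label for each text (at least two texts per author
    are needed for a meaningful score).
    Returns n_iter accuracy scores, one per random training-set choice."""
    rng = np.random.default_rng(seed)
    freqs = np.asarray(freqs, dtype=float)
    authors = np.asarray(authors)
    scores = np.empty(n_iter)
    for i in range(n_iter):
        train, test = [], []
        # draw a random training subset per author; the rest form the test set
        for a in np.unique(authors):
            idx = rng.permutation(np.flatnonzero(authors == a))
            cut = max(1, int(round(len(idx) * train_frac)))
            train.extend(idx[:cut])
            test.extend(idx[cut:])
        train, test = np.array(train), np.array(test)
        # z-score word frequencies with training-set statistics
        mu = freqs[train].mean(axis=0)
        sd = freqs[train].std(axis=0) + 1e-12
        ztr = (freqs[train] - mu) / sd
        zte = (freqs[test] - mu) / sd
        # nearest training text by mean absolute difference of z-scores
        # (the classic Burrows Delta distance)
        dists = np.abs(zte[:, None, :] - ztr[None, :, :]).mean(axis=2)
        pred = authors[train][dists.argmin(axis=1)]
        scores[i] = np.mean(pred == authors[test])
    return scores

# A widely dispersed histogram or density estimate of the returned scores
# signals a corpus that is sensitive to the choice of training samples.
```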

Users

  • @katharina.

Comments and Reviews

  • @katharina.
    9 years ago
    Many informative diagrams on the selection of literary corpora. Not in the final paper, as there was no space.