Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set
A. Jacobs and A. Kinder. (2020). arXiv:2010.10801. Comment: 18 pages, 3 tables.
Abstract
The Gutenberg Literary English Corpus (GLEC) provides a rich source of
textual data for research in digital humanities, computational linguistics or
neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg
English Poetry Corpus, has been submitted to quantitative text analyses
providing predictions for scientific studies of literature. Here we show that
in the entire GLEC quasi error-free text classification and authorship
recognition is possible with a method using the same set of five style and five
content features, computed via style and sentiment analysis, in both tasks. Our
results identify two standard and two novel features (i.e., type-token ratio,
frequency, sonority score, surprise) as most diagnostic in these tasks. By
providing a simple tool applicable to both short poems and long novels
generating quantitative predictions about features that co-determine the
cognitive and affective processing of specific text categories or authors, our
data pave the way for many future computational and empirical studies of
literature or experiments in reading psychology.
Description
[2010.10801] Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set
%0 Generic
%1 jacobs2020quasi
%A Jacobs, Arthur M.
%A Kinder, Annette
%D 2020
%K author classification dh identification literature recognition sentiment style text
%T Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set
%U http://arxiv.org/abs/2010.10801
%X The Gutenberg Literary English Corpus (GLEC) provides a rich source of
textual data for research in digital humanities, computational linguistics or
neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg
English Poetry Corpus, has been submitted to quantitative text analyses
providing predictions for scientific studies of literature. Here we show that
in the entire GLEC quasi error-free text classification and authorship
recognition is possible with a method using the same set of five style and five
content features, computed via style and sentiment analysis, in both tasks. Our
results identify two standard and two novel features (i.e., type-token ratio,
frequency, sonority score, surprise) as most diagnostic in these tasks. By
providing a simple tool applicable to both short poems and long novels
generating quantitative predictions about features that co-determine the
cognitive and affective processing of specific text categories or authors, our
data pave the way for many future computational and empirical studies of
literature or experiments in reading psychology.
@misc{jacobs2020quasi,
abstract = {The Gutenberg Literary English Corpus (GLEC) provides a rich source of
textual data for research in digital humanities, computational linguistics or
neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg
English Poetry Corpus, has been submitted to quantitative text analyses
providing predictions for scientific studies of literature. Here we show that
in the entire GLEC quasi error-free text classification and authorship
recognition is possible with a method using the same set of five style and five
content features, computed via style and sentiment analysis, in both tasks. Our
results identify two standard and two novel features (i.e., type-token ratio,
frequency, sonority score, surprise) as most diagnostic in these tasks. By
providing a simple tool applicable to both short poems and long novels
generating quantitative predictions about features that co-determine the
cognitive and affective processing of specific text categories or authors, our
data pave the way for many future computational and empirical studies of
literature or experiments in reading psychology.},
added-at = {2021-06-04T16:40:50.000+0200},
author = {Jacobs, Arthur M. and Kinder, Annette},
biburl = {https://www.bibsonomy.org/bibtex/2a91b29a4a22348914a4d50c3fea80da5/jaeschke},
description = {[2010.10801] Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set},
interhash = {f8714743e79ba96e36c5a83c516d7acd},
intrahash = {a91b29a4a22348914a4d50c3fea80da5},
keywords = {author classification dh identification literature recognition sentiment style text},
note = {cite arxiv:2010.10801. Comment: 18 pages, 3 tables},
timestamp = {2021-06-04T16:40:50.000+0200},
title = {Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set},
url = {http://arxiv.org/abs/2010.10801},
year = 2020
}