C. Pethe, A. Kim, and S. Skiena. Chapter Captor: Text Segmentation in Novels. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8373-8383. Online, Association for Computational Linguistics (November 2020)
DOI: 10.18653/v1/2020.emnlp-main.672
Abstract
Books are typically segmented into chapters and sections, representing coherent sub-narratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
%0 Conference Paper
%1 pethe-etal-2020-chapter
%A Pethe, Charuta
%A Kim, Allen
%A Skiena, Steve
%B Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
%C Online
%D 2020
%I Association for Computational Linguistics
%K events nlp scenedetection scenes
%P 8373--8383
%R 10.18653/v1/2020.emnlp-main.672
%T Chapter Captor: Text Segmentation in Novels
%U https://aclanthology.org/2020.emnlp-main.672
%X Books are typically segmented into chapters and sections, representing coherent sub-narratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
@inproceedings{pethe-etal-2020-chapter,
abstract = {Books are typically segmented into chapters and sections, representing coherent sub-narratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.},
added-at = {2021-07-12T16:14:45.000+0200},
address = {Online},
author = {Pethe, Charuta and Kim, Allen and Skiena, Steve},
biburl = {https://www.bibsonomy.org/bibtex/2885d0b2ef872348d36876c328c647b9d/albinzehe},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
doi = {10.18653/v1/2020.emnlp-main.672},
interhash = {c2ec26b72708960eef8ff522db7eebaa},
intrahash = {885d0b2ef872348d36876c328c647b9d},
keywords = {events nlp scenedetection scenes},
month = nov,
pages = {8373--8383},
publisher = {Association for Computational Linguistics},
timestamp = {2021-07-12T16:14:52.000+0200},
title = {{C}hapter {C}aptor: {T}ext {S}egmentation in {N}ovels},
url = {https://aclanthology.org/2020.emnlp-main.672},
year = 2020
}