Non-Expert Evaluation of Summarization Systems is Risky
D. Gillick and Y. Liu. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 148–151. (2010)
Abstract
We provide evidence that intrinsic evaluation of summaries using Amazon’s Mechanical Turk is quite difficult. Experiments mirroring evaluation at the Text Analysis Conference’s summarization track show that non-expert judges are not able to recover system rankings derived from experts.
@inproceedings{Gillick:2010,
  abstract = {We provide evidence that intrinsic evaluation of summaries using Amazon’s Mechanical Turk is quite difficult. Experiments mirroring evaluation at the Text Analysis Conference’s summarization track show that non-expert judges are not able to recover system rankings derived from experts.},
  author = {Gillick, Dan and Liu, Yang},
  booktitle = {Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk},
  keywords = {summarisation mechanical_turk},
  pages = {148--151},
  title = {Non-Expert Evaluation of Summarization Systems is Risky},
  year = {2010}
}