<rdf:RDF xmlns:community="http://www.bibsonomy.org/ontologies/2008/05/community#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:admin="http://webns.net/mvcb/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:cc="http://web.resource.org/cc/" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:swrc="http://swrc.ontoware.org/ontology#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xml:base="http://www.bibsonomy.org/user/diego_ma/evaluation"><owl:Ontology rdf:about=""><rdfs:comment>BibSonomy publications for /user/diego_ma/evaluation</rdfs:comment><owl:imports rdf:resource="http://swrc.ontoware.org/ontology/portal"/></owl:Ontology><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2995820d5ba4c489baf7d524acbafe365/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2995820d5ba4c489baf7d524acbafe365/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://www.aclweb.org/anthology-new/W/W09/#1300"/><swrc:date>Wed Oct 21 23:50:43 CEST 2009</swrc:date><swrc:booktitle>Proc BioNLP 2009</swrc:booktitle><swrc:pages>171-178</swrc:pages><swrc:title>Evaluation of the Clinical Question Answering Presentation.</swrc:title><swrc:year>2009</swrc:year><swrc:keywords>biomedical question_answering evaluation </swrc:keywords><swrc:abstract>Question answering is different from information retrieval in that it attempts to answer questions by providing summaries from numerous retrieved documents rather than by simply providing a list of documents that requires users to do additional work. 
However, the quality of answers that question answering provides has not been investigated extensively, and the practical approach to presenting question answers still needs more study. In addition to factoid answering using phrases or entities, most question answering systems use a sentence-based approach for generating answers. However, many sentences are often only meaningful or understandable in their context, and a passage-based presentation can often provide richer, more coherent context. However, passage-based presentations may introduce additional noise that places greater burden on users. In this study, we performed a quantitative evaluation on the two kinds of presentation produced by our online clinical question answering system, AskHERMES (http://www.AskHERMES.org). The overall finding is that, although irrelevant context can hurt the quality of an answer, the passage-based approach is generally more effective in that it provides richer context and matching across sentences.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Cao, Yong-gang"/></rdf:_1><rdf:_2><swrc:Person swrc:name="John Ely"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Lamont Antieau"/></rdf:_3><rdf:_4><swrc:Person swrc:name="Hong Yu"/></rdf:_4></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/286832ff190062733a50f6134151fd996/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/286832ff190062733a50f6134151fd996/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://www.uwm.edu/~hongyu/publications.html"/><swrc:date>Fri Aug 14 09:50:03 CEST 2009</swrc:date><swrc:booktitle>Proc. 
Pacific Symposium on Biocomputing</swrc:booktitle><swrc:pages>328-339</swrc:pages><swrc:title>A Cognitive Evaluation of Four Online Search Engines for Answering Definitional Questions Posed By Physicians</swrc:title><swrc:year>2007</swrc:year><swrc:keywords>search inf-retr biomedical evaluation </swrc:keywords><swrc:abstract>The Internet is having a profound impact on physicians&#039; medical decision making. One recent survey of 277 physicians showed that 72\% of physicians regularly used the Internet to research medical information and 51\% admitted that information from web sites influenced their clinical decisions. This paper describes the first cognitive evaluation of four state-of-the-art Internet search engines: Google (i.e., Google and Scholar.Google), MedQA, Onelook, and PubMed for answering definitional questions (i.e., questions with the format of &#034;What is X?&#034;) posed by physicians. Onelook is a portal for online definitions, and MedQA is a question answering system that automatically generates short texts to answer specific biomedical questions. Our evaluation criteria include quality of answer, ease of use, time spent, and number of actions taken. Our results show that MedQA outperforms Onelook and PubMed in most of the criteria, and that MedQA surpasses Google in time spent and number of actions, two important efficiency criteria. Our results show that Google is the best system for quality of answer and ease of use. 
We conclude that Google is an effective search engine for medical definitions, and that MedQA exceeds the other search engines in that it provides users direct answers to their questions; while the users of the other search engines have to visit several sites before finding all of the pertinent information.</swrc:abstract><swrc:hasExtraField><swrc:Field swrc:value="Web (August 2009)" swrc:key="library"/></swrc:hasExtraField><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Yu, Hong"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Kaufman, David"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2c01cf8a5b8b361762c6653d5070a9627/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2c01cf8a5b8b361762c6653d5070a9627/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Article"/><owl:sameAs rdf:resource="http://dx.doi.org/10.1016/j.ijmedinf.2005.06.009"/><swrc:date>Fri May 15 06:50:52 CEST 2009</swrc:date><swrc:booktitle>Recent Advances in Natural Language Processing for Biomedical Applications Special Issue</swrc:booktitle><swrc:journal>International Journal of Medical Informatics</swrc:journal><swrc:month>June</swrc:month><swrc:number>6</swrc:number><swrc:pages>430--442</swrc:pages><swrc:title>Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions</swrc:title><swrc:volume>75</swrc:volume><swrc:year>2006</swrc:year><swrc:keywords>parsers biomedicine evaluation </swrc:keywords><swrc:abstract>We present an evaluation of Link Grammar and Connexor Machinese Syntax, two major broad-coverage dependency parsers, on a custom hand-annotated corpus consisting of sentences regarding protein-protein interactions. In the evaluation, we apply the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. 
We measure the performance of the parsers for recovery of individual dependencies, fully correct parses, and interaction subgraphs. For Link Grammar, an open system that can be inspected in detail, we further perform a comprehensive failure analysis, report specific causes of error, and suggest potential modifications to the grammar. We find that both parsers perform worse on biomedical English than previously reported on general English. While Connexor Machinese Syntax significantly outperforms Link Grammar, the failure analysis suggests specific ways in which the latter could be modified for better performance in the domain.</swrc:abstract><swrc:hasExtraField><swrc:Field swrc:value="2063654" swrc:key="id"/></swrc:hasExtraField><swrc:hasExtraField><swrc:Field swrc:value="2" swrc:key="priority"/></swrc:hasExtraField><swrc:hasExtraField><swrc:Field swrc:value="Bibsonomy (May 2009)" swrc:key="library"/></swrc:hasExtraField><swrc:hasExtraField><swrc:Field swrc:value="2009-05-15 05:45:34" swrc:key="at"/></swrc:hasExtraField><swrc:hasExtraField><swrc:Field swrc:value="10.1016/j.ijmedinf.2005.06.009" swrc:key="doi"/></swrc:hasExtraField><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Pyysalo, Sampo"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Ginter, Filip"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Pahikkala, Tapio"/></rdf:_3><rdf:_4><swrc:Person swrc:name="Boberg, Jorma"/></rdf:_4><rdf:_5><swrc:Person swrc:name="Jarvinen, Jouni"/></rdf:_5><rdf:_6><swrc:Person swrc:name="Salakoski, Tapio"/></rdf:_6></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2cafa3f81913c1ecb25362d7cb6369a30/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2cafa3f81913c1ecb25362d7cb6369a30/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://www.cs.cmu.edu/~sagae/docs/geaf07miyaoetal.pdf"/><swrc:date>Fri Jan 16 00:48:33 CET 
2009</swrc:date><swrc:booktitle>Proceedings of the GEAF 2007 Workshop</swrc:booktitle><swrc:pages>21 pages</swrc:pages><swrc:publisher><swrc:Organization swrc:name="CSLI Publications"/></swrc:publisher><swrc:series>CSLI Studies in Computational Linguistics Online</swrc:series><swrc:title>Towards Framework-Independent Evaluation of Deep Linguistic Parsers</swrc:title><swrc:year>2007</swrc:year><swrc:keywords>parsers evaluation </swrc:keywords><swrc:abstract>This paper describes practical issues in the framework-independent evaluation of deep and shallow parsers. We focus on the use of two dependency-based syntactic representation formats in parser evaluation, namely, Carroll et al. (1998)&#039;s Grammatical Relations and de Marneffe et al. (2006)&#039;s Stanford Dependency scheme. Our approach is to convert the output of parsers into these two formats, and measure the accuracy of the resulting converted output. Through the evaluation of an HPSG parser and Penn Treebank phrase structure parsers, we found that mapping between different representation schemes is a non-trivial task that results in lossy conversions that may obscure important differences between different parsing approaches. 
We discuss sources of disagreements in the representation of syntactic structures in the two dependency-based formats, indicating possible directions for improved framework-independent parser evaluation.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Yusuke Miyao"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Kenji Sagae"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Jun&#039;ichi Tsujii"/></rdf:_3></rdf:Seq></swrc:author><swrc:editor><rdf:Seq><rdf:_1><swrc:Person swrc:name="Ann Copestake"/></rdf:_1></rdf:Seq></swrc:editor></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/276f93e2113ded678a5b0e2afce098269/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/276f93e2113ded678a5b0e2afce098269/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://research.microsoft.com/~cyl/download/papers/WAS2004.pdf"/><swrc:date>Wed Mar 12 04:43:46 CET 2008</swrc:date><swrc:booktitle>Proc. ACL workshop on Text Summarization Branches Out</swrc:booktitle><swrc:pages>10</swrc:pages><swrc:title>ROUGE: A Package for Automatic Evaluation of summaries</swrc:title><swrc:year>2004</swrc:year><swrc:keywords>summarisation evaluation COMP448 </swrc:keywords><swrc:abstract>ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S included in the ROUGE summarization evaluation package and their evaluations. 
Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Chin-Yew Lin"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/223b316a5ac5cb3159a24e55b5c6e564d/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/223b316a5ac5cb3159a24e55b5c6e564d/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://research.microsoft.com/~cyl/download/papers/NAACL2003.pdf"/><swrc:date>Wed Mar 12 04:42:41 CET 2008</swrc:date><swrc:booktitle>Proc. HLT-NAACL</swrc:booktitle><swrc:pages>8 pages</swrc:pages><swrc:title>Automatic Evaluation of Summaries Using N-Gram Co-occurrence Statistics</swrc:title><swrc:year>2003</swrc:year><swrc:keywords>summarisation evaluation COMP448 </swrc:keywords><swrc:abstract>Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. 
The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprising well with human evaluations, based on various statistical metrics; while direct application of the BLEU evaluation procedure does not always give good results.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Chin-Yew Lin"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Eduard Hovy"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2449b4548c23384e9a02234898bdc9715/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2449b4548c23384e9a02234898bdc9715/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><swrc:date>Tue Jan 29 09:03:47 CET 2008</swrc:date><swrc:crossref>ZZZ-Brill:2000</swrc:crossref><swrc:pages>20-27</swrc:pages><swrc:title>Answer Extraction -- Towards Better Evaluations of {NLP} Systems</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>answer_extraction evaluation molla_publication </swrc:keywords><swrc:abstract>We argue that reading comprehension tests are not particularly suited for the evaluation of NLP systems. Reading comprehension tests are specifically designed to evaluate human reading skills, and these require vast amounts of world knowledge and common-sense reasoning capabilities. Experience has shown that this kind of full-fledged question answering (QA) over texts from a wide range of domains is so difficult for machines as to be far beyond the present state of the art of NLP. To advance the field we propose a much more modest evaluation set-up, viz. Answer Extraction (AE) over texts from highly restricted domains. AE aims at retrieving those sentences from documents that contain the explicit answer to a user query. AE is less ambitious than full-fledged QA but has a number of important advantages over QA. 
It relies mainly on linguistic knowledge and needs only a very limited amount of world knowledge and few inference rules. However, it requires the solution of a number of key linguistic problems. This makes AE a suitable task to advance NLP techniques in a measurable way. Finally, there is a real demand for working AE systems in technical domains. We outline how evaluation procedures for AE systems over real world domains might look like and discuss their feasibility.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Rolf Schwitter"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Diego Moll{\&#039;a}"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Rachel Fournier"/></rdf:_3><rdf:_4><swrc:Person swrc:name="Michael Hess"/></rdf:_4></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2a9ae1ef031141fe460eaa6deea6036d7/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2a9ae1ef031141fe460eaa6deea6036d7/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><swrc:date>Tue Jan 29 08:19:28 CET 2008</swrc:date><swrc:address>Budapest</swrc:address><swrc:booktitle>Proc. European Association for Computational Linguistics (EACL), workshop on Evaluation Initiatives in Natural Language Processing</swrc:booktitle><swrc:month>April</swrc:month><swrc:organization><swrc:Organization swrc:name="Association for Computational Linguistics"/></swrc:organization><swrc:pages>43-50</swrc:pages><swrc:publisher><swrc:Organization swrc:name="ACL"/></swrc:publisher><swrc:title>Intrinsic versus Extrinsic Evaluations of Parsing Systems</swrc:title><swrc:year>2003</swrc:year><swrc:keywords>parsers evaluation AnswerFinder gram_rels molla_publication </swrc:keywords><swrc:abstract>A wide range of parser and/or grammar evaluation methods have been reported in the literature. 
However, in most cases these evaluations take the parsers independently (intrinsic evaluations), and only in a few cases has the effect of different parsers in real applications been measured (extrinsic evaluations). This paper compares two evaluations of the Link Grammar parser and the Conexor Functional Dependency Grammar parser. The parsing systems, despite both being dependency-based, return different types of dependencies, making a direct comparison impossible. In the intrinsic evaluation, the accuracy of the parsers is compared independently by converting the dependencies into grammatical relations and using the methodology of \newcite{Carroll:1998} for parser comparison. In the extrinsic evaluation, the parsers&#039; impact in a practical application is compared within the context of answer extraction. The differences in the results are significant.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Diego Moll{\&#039;a}"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Ben Hutchinson"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2ddf167520fb423a651e0c5dcb062f1f2/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2ddf167520fb423a651e0c5dcb062f1f2/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Unpublished"/><swrc:date>Tue Jan 29 08:16:55 CET 2008</swrc:date><swrc:note>In preparation</swrc:note><swrc:title>In Vitro and In Vivo Evaluations of Parsing Systems Within the Context of Answer Extraction</swrc:title><swrc:year>2002</swrc:year><swrc:keywords>AnswerFinder parsers evaluation gram_rels molla_publication </swrc:keywords><swrc:abstract>A wide variety of parser and/or grammar evaluation methods have been reported in the literature. 
However, in most cases these evaluations take the parsers independently (\emph{in vitro} evaluations), and only in a few cases has the effect of different parsers in real applications been measured (\emph{in vivo} evaluations). This paper compares two evaluations of the Link Grammar parser and the Conexor Functional Dependency Grammar parser. The parsing systems, despite both being dependency-based, return different types of dependencies, making a direct comparison impossible. In the first evaluation, the accuracy of the parsers is compared \emph{in vitro} by converting the dependencies into grammatical relations and using the methodology of \newcite{Carroll:1998} for parser comparison. In the second evaluation, the parsers&#039; impact in a practical application is compared \emph{in vivo} within the context of answer extraction. The differences in the results are significant and raise questions on the usefulness of purely \emph{in vitro} evaluations.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Diego Moll{\&#039;a}"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Ben Hutchinson"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/24fb0f326a00848adbad66d4c583c203b/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/24fb0f326a00848adbad66d4c583c203b/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Proceedings"/><swrc:date>Fri Dec 14 02:49:13 CET 2007</swrc:date><swrc:address>Seattle, WA</swrc:address><swrc:booktitle>Proc. {ANLP/NAACL} 2000 Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems</swrc:booktitle><swrc:organization><swrc:Organization swrc:name="ACL"/></swrc:organization><swrc:title>Proc. 
{ANLP/NAACL} 2000 Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>evaluation answer_extraction question_answering </swrc:keywords><swrc:editor><rdf:Seq><rdf:_1><swrc:Person swrc:name="Eric Brill"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Eugene Charniak"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Mary Harper"/></rdf:_3><rdf:_4><swrc:Person swrc:name="Marc Light"/></rdf:_4><rdf:_5><swrc:Person swrc:name="Ellen Riloff"/></rdf:_5><rdf:_6><swrc:Person swrc:name="Ellen Voorhees"/></rdf:_6></rdf:Seq></swrc:editor></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2c623d0ee5bd845e2ff2a001f8d919dda/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2c623d0ee5bd845e2ff2a001f8d919dda/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Proceedings"/><swrc:date>Fri Dec 14 02:49:12 CET 2007</swrc:date><swrc:address>Seattle, WA</swrc:address><swrc:booktitle>Proc. {ANLP/NAACL} 2000 Workshop on Syntactic and Semantic Complexity in Natural Language Processing Systems</swrc:booktitle><swrc:organization><swrc:Organization swrc:name="ACL"/></swrc:organization><swrc:title>Proc. 
{ANLP/NAACL} 2000 Workshop on Syntactic and Semantic Complexity in Natural Language Processing Systems</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>evaluation </swrc:keywords><swrc:editor><rdf:Seq><rdf:_1><swrc:Person swrc:name="Amit Bagga"/></rdf:_1><rdf:_2><swrc:Person swrc:name="James Pustejovsky"/></rdf:_2><rdf:_3><swrc:Person swrc:name="Wlodek Zadrozny"/></rdf:_3></rdf:Seq></swrc:editor></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2383a87c4ecb2f1cfb810e0496e8e130a/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2383a87c4ecb2f1cfb810e0496e8e130a/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Misc"/><owl:sameAs rdf:resource="http://www.gte.com/AboutGTE/gto/anlp-naacl2000/comprehension.html"/><swrc:date>Fri Dec 14 02:48:25 CET 2007</swrc:date><swrc:address>Seattle, WA</swrc:address><swrc:note>\myurl{http://www.gte.com/AboutGTE/gto/anlp-naacl2000/comprehension.html}</swrc:note><swrc:title>Workshop on Reading Comprehension Texts as Evaluation for Computer-Based Language Understanding Systems</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>evaluation </swrc:keywords><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name=" WRC"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/204adb0d4ddf8dd9b0a85406564e8083c/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/204adb0d4ddf8dd9b0a85406564e8083c/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://cvu.strath.ac.uk/dave/publications/caa99.html"/><swrc:date>Fri Dec 14 02:48:05 CET 2007</swrc:date><swrc:address>Loughborough University</swrc:address><swrc:booktitle>Proc. 
Third Annual Computer Assisted Assessment Conference</swrc:booktitle><swrc:pages>207-219</swrc:pages><swrc:title>Approaches to the Computerized Assessment of Free Text Responses</swrc:title><swrc:year>1999</swrc:year><swrc:keywords>evaluation question_answering </swrc:keywords><swrc:abstract>The automated assessment of student&#039;s essays is regarded by many as the Holy Grail of computer aided assessment. Whilst a few people search for the grail, many more deny its existence. This paper describes the various approaches that have been taken over the last 40 years in an attempt to solve the problems involved with the computerized assessment of free text. The earliest approaches were founded in simple style analysis. Systems, such as Project Essay Grade (PEG), were developed upon the idea that certain surface features of an essay could be manipulated in such a way as to predict the grade that a human examiner would assign to an essay. Other methods, such as Latent Semantic Analysis (LSA), also take a statistical approach to marking, but focus on actual textual content, analyzing groupings and context. The Educational Testing Service (ETS) originally attempted to tackle the problem from a classification point of view whilst more recent work by ETS bears similarities to PEG in its statistical approach. Most of the methods currently being developed have been shown to be capable of generating essay scores that correlate with a human grader&#039;s scores at least as well as two human graders correlate with each other. A novel approach being adopted by the authors to allow the comparison of students&#039; essays against a model answer involves the use of theories developed for inter-lingual machine translation. A few different methods for knowledge representation, and their current uses in machine translation are presented. Panlingua, an idea for knowledge representation developed by Chaumont Devin, is based on semantic network research. 
Using a system of nodes over four layers it attempts to model how the brain might translate from sensory patterns it sees or hears at the top level, through syntactic and semantic levels, to a representation of understanding at the deepest level. Another approach to machine translation involves work done by Bonnie Dorr at the University of Maryland based on Lexical Conceptual Structure (LCS) theory. This allows the knowledge represented in a text to be translated into a language independent data structure. These theories provide a way of representing knowledge that is not reliant on the surface syntax of the text representing the knowledge. This will hopefully allow &#039;fuzzy matching&#039; of sentences which have different syntactic structures but similar semantic meaning. The authors will be looking at the possibility of grading essays via a comparison of these data structures.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Dave Whittington"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Helen Hunt"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2d542f2135662a9d19701681af401bceb/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2d542f2135662a9d19701681af401bceb/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#TechnicalReport"/><owl:sameAs rdf:resource="http://members.rogers.com/peter.turney/ml_text_keys.html"/><swrc:date>Fri Dec 14 02:47:27 CET 2007</swrc:date><swrc:institution><swrc:Organization swrc:name="National Research Council of Canada"/></swrc:institution><swrc:number>ERB-1051</swrc:number><swrc:title>Extraction of Keyphrases from Text: Evaluation of Four Algorithms</swrc:title><swrc:year>1997</swrc:year><swrc:keywords>inf_retrieval evaluation </swrc:keywords><swrc:abstract>This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. 
The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the...</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Peter Turney"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/230d08b19997804a63847dece1d45e94c/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/230d08b19997804a63847dece1d45e94c/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><owl:sameAs rdf:resource="http://www.site.uottawa.ca/\~{}scarlett/"/><swrc:date>Fri Dec 14 02:46:13 CET 2007</swrc:date><swrc:address>Montreal</swrc:address><swrc:booktitle>Proc. Thirteenth Canadian Conference on Artificial Intelligence</swrc:booktitle><swrc:title>The Power of the {TSNLP}: Lessons from a Diagnostic Evaluation of a Broad-Coverage Parser</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>parsers evaluation </swrc:keywords><swrc:abstract>We show a diagnostic evaluation of DIPETT, a broad-coverage parser of English sentences. We consider the TSNLP suite as a diagnostic tool, and propose an alternative broader-coverage test suite of test sentences extracted from Quirk et al. We compare the diagnostic effectiveness of the two suites, and draw a few general conclusions. 
The evaluation results were used to make significant improvements to DIPETT.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Elizabeth Scarlett"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Stan Szpakowicz"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/29bce113e9807b86273add0de1371b1bf/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/29bce113e9807b86273add0de1371b1bf/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Misc"/><owl:sameAs rdf:resource="http://www.site.uottawa.ca/\~{}scarlett/"/><swrc:date>Fri Dec 14 02:46:12 CET 2007</swrc:date><swrc:school><swrc:University swrc:name="University of Ottawa"/></swrc:school><swrc:title>An Evaluation of a Rule-Based Parser of English Sentences</swrc:title><swrc:year>2000</swrc:year><swrc:keywords>parsers evaluation </swrc:keywords><swrc:abstract>... The thesis argues that a test suite for a broad coverage natural language parser must necessarily be systematic, broad in its coverage of phenomena tested, and corpus-like in its coverage of phenomenon interaction. A test suite of example sentences extracted from Quirk et al.&#039;s comprehensive English grammar is proposed, and the results of evaluating DIPETT on that suite are compared with the evaluation results on a publicly available test suite, TSNLP (Test Suites for Natural Language Processing).</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Elizabeth Scarlett"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/282a0e0c21ee324f677ac9dc34ef64812/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/282a0e0c21ee324f677ac9dc34ef64812/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><swrc:date>Fri Dec 14 02:42:33 CET 2007</swrc:date><swrc:booktitle>Proc. 
SIGIR&#039;05</swrc:booktitle><swrc:title>Evaluation of Resources for Question Answering Evaluation</swrc:title><swrc:year>2005</swrc:year><swrc:keywords>question_answering evaluation resources </swrc:keywords><swrc:abstract>In contrast to traditional information retrieval systems, which return ranked lists of documents that users must manually browse through, a question answering system attempts to directly answer natural language questions posed by the user. Although such systems possess language processing capabilities, they still rely on traditional document retrieval techniques to generate an initial candidate set of documents. In this paper, we argue that document retrieval for question answering represents a different task than retrieving documents in response to more general retrospective information needs. Thus, to guide future system development, specialized question answering test collections must be constructed. We have shown that the current evaluation resources have major shortcomings, and to remedy the situation, we have manually created a small, reusable question answering test collection for research purposes. 
This article describes our methodology for building this test collection and discusses issues we encountered along the way regarding the notion of &quot;answer correctness&quot;.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Jimmy Lin"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2c54b2d5b84d2ca7473e72f1c3b5245a3/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2c54b2d5b84d2ca7473e72f1c3b5245a3/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Article"/><swrc:date>Fri Dec 14 02:41:36 CET 2007</swrc:date><swrc:journal>Communications of the ACM</swrc:journal><swrc:number>1</swrc:number><swrc:pages>73-79</swrc:pages><swrc:title>Evaluating Natural Language Processing Systems</swrc:title><swrc:volume>39</swrc:volume><swrc:year>1996</swrc:year><swrc:keywords>evaluation NLP </swrc:keywords><swrc:abstract>Designing customized methods for testing various NLP systems may be costly and complex, but the resulting data is invaluable. 
Researchers now wonder if there are ways to share evaluation methodologies.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Margaret King"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/28c1106ad80cab949d4cb39929876dec8/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/28c1106ad80cab949d4cb39929876dec8/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#InProceedings"/><swrc:date>Fri Dec 14 02:40:57 CET 2007</swrc:date><swrc:crossref>ZZZ-LREC:2002</swrc:crossref><swrc:title>Cooperation between Black Box and Glass Box Approaches for the Evaluation of a Question Answering System</swrc:title><swrc:pages>577-584</swrc:pages><swrc:keywords>evaluation question_answering </swrc:keywords><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Martine Hurault-Plantet"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Laura Monceaux"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2e2b2b1d5f4e232b87382c539cd30b1a8/diego_ma"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2e2b2b1d5f4e232b87382c539cd30b1a8/diego_ma"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#TechnicalReport"/><owl:sameAs rdf:resource="http://citeseer.nj.nec.com/galliers93evaluating.html"/><swrc:date>Fri Dec 14 02:39:10 CET 2007</swrc:date><swrc:institution><swrc:Organization swrc:name="Computer Laboratory, University of Cambridge"/></swrc:institution><swrc:number>TR-291</swrc:number><swrc:title>Evaluating Natural Language Processing Systems</swrc:title><swrc:year>1993</swrc:year><swrc:keywords>evaluation NLP </swrc:keywords><swrc:abstract>This report presents a detailed analysis and review of NLP evaluation, in principle and in practice. Part 1 examines evaluation concepts and establishes a framework for NLP system evaluation. 
This makes use of experience in the related area of information retrieval and the analysis also refers to evaluation in speech processing. Part 2 surveys significant evaluation work done so far, for instance in machine translation, and discusses the particular problems of generic system evaluation. The conclusion is that evaluation strategies and techniques for NLP need much more development, in particular to take proper account of the influence of system tasks and settings. Part 3 develops a general approach to NLP evaluation, aimed at methodologically sound strategies for test and evaluation motivated by comprehensive performance factor identification. The analysis throughout the report is supported by extensive illustrative examples.</swrc:abstract><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="Julia R. Galliers"/></rdf:_1><rdf:_2><swrc:Person swrc:name="Karen Sparck Jones"/></rdf:_2></rdf:Seq></swrc:author></rdf:Description></rdf:RDF>