The WaCky wide web: a collection of very large linguistically processed web-crawled corpora

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. Language Resources and Evaluation, 43(3): 209–226 (Sep 1, 2009)
DOI: 10.1007/s10579-009-9081-4

Abstract

This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
