a corpus of dialect speech from the Tyneside area of North-East England. DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades. The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
Developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology, and coreference).
a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain, using the Yahoo! and Google search engines.
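A minimal sketch of how such queries could be generated (an illustration, not the collectors' actual script; the wordlist path and the PDF file type are assumptions):

    import random

    WORDLIST = "/usr/share/dict/words"   # the "Unix dictionary"; the path varies by system

    with open(WORDLIST) as f:
        words = [w.strip() for w in f if w.strip()]

    def random_query() -> str:
        """Build one query of the three kinds described above."""
        kind = random.choice(["word", "number", "combination"])
        if kind == "word":
            term = random.choice(words)
        elif kind == "number":
            term = str(random.randint(1, 1_000_000))
        else:
            term = f"{random.choice(words)} {random.randint(1, 1_000_000)}"
        # Restrict hits to web servers in the .gov domain and to one file type (here PDF).
        return f"{term} site:.gov filetype:pdf"

    for _ in range(5):
        print(random_query())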
LAUDATIO aims to build an open-access research data repository for historical linguistic data that meets the requirements of historical corpus linguistics. For the access and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible and appropriate documentation schema based on a subset of TEI customized with TEI ODD.
The University of Glasgow took over the distribution of the WT2g/WT10g/.GOV/.GOV2 Web Research Collections from CSIRO (Commonwealth Scientific and Industrial Research Organisation), which had been distributing the Web Research collections to organizations and individuals engaged in research and development of natural language processing, information retrieval or document understanding systems, strictly for research purposes. These collections have been used in the TREC Web & Terabyte tracks. In addition, as part of the TREC Blog track, the University of Glasgow is currently distributing the Blogs06 & Blogs08 test collections. The University of Glasgow also handles access to these test collections (including .GOV, .GOV2, Blogs06, and Blogs08).
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. LDC's Catalog contains hundreds of corpora of language data, including the Santa Barbara Corpus of Spoken American English.
consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
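The per-person figures follow directly from the totals; a quick check of the arithmetic:

    bloggers = 19_320
    posts = 681_288
    words = 140_000_000              # "over 140 million words"

    print(round(posts / bloggers))   # ~35 posts per person
    print(round(words / bloggers))   # ~7246 words per person, i.e. roughly 7250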
an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. It includes a search across text archives.
Wired Magazine issue 16.07, "Data Deluge": crop predictions, quark, data mining, tracking news, watching the skies, scanning skeletons, airfares, voting, epidemics, Google events, terrorism, visualizing big data.
We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, and index and search them.
PIE incorporates a database derived from the second ("World") edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface for studying words and phrases up to eight words long, appropriate for both experienced researchers and novice users.
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
Online repository of large data sets for researchers in knowledge discovery and data mining. It includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
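As an illustration of that kind of use (not part of the data set itself), scikit-learn ships a loader for 20 Newsgroups, so a baseline text classifier can be trained in a few lines; the feature and model choices below are just one common setup:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score

    # Load the standard train/test split, stripping metadata that makes the task too easy.
    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

    # TF-IDF features + multinomial naive Bayes: a common text-classification baseline.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train.data, train.target)

    print("accuracy:", accuracy_score(test.target, model.predict(test.data)))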
The Moby lexicon project is complete and has been placed into the public domain. Use, sell, rework, excerpt and use in any way on any platform. 610,000+ words and phrases. The largest word list in the world and more.
The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Hurricane Katrina disaster. We also collect pages from government Web sites on a regular basis.
DigitalCorpora.org is a website of digital corpora for use in computer forensics education research. All of the disk images, memory dumps, and network packet captures available on this website are freely available and may be used without prior authorization or IRB approval. We also have available a research corpus of real data acquired from around the world.
OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus.
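Many OPUS corpora can be downloaded as plain-text, line-aligned file pairs (the "Moses" format), where line N of the source file translates line N of the target file. A minimal sketch of reading such a pair, with hypothetical file names:

    from itertools import islice

    def read_parallel(src_path, tgt_path):
        """Yield (source, target) sentence pairs from two line-aligned files."""
        with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
            for s, t in zip(src, tgt):
                yield s.strip(), t.strip()

    # Print the first five sentence pairs of a hypothetical English-German corpus.
    for en, de in islice(read_parallel("corpus.en", "corpus.de"), 5):
        print(en, "|||", de)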
The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. We also give advice on the creation and use of these resources, and are involved in the development of standards and infrastructure for electronic language resources.
The new TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14 (2,871,075,221 bytes). It is initially made available to participants of the blog track at the Text Retrieval Conference (TREC) to supplement the BLOG08 corpus (the results of a large blog crawl carried out at the University of Glasgow), which is the main corpus used at the TREC Blog Track.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
This site contains what we believe is the most accurate frequency data for English. It contains word frequency lists of the top 60,000 words (lemmas) in English, collocates lists (looking at nearby words to see word meaning and use), and n-grams (the frequency of all two- and three-word sequences in the corpora).
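The n-gram lists are simply frequency counts of contiguous word sequences; a toy sketch of the counting idea (the real lists are of course built from far larger corpora):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all contiguous n-word sequences in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "the cat sat on the mat and the cat slept".split()

    print(ngram_counts(tokens, 2).most_common(3))   # top bigrams, e.g. (('the', 'cat'), 2)
    print(ngram_counts(tokens, 3).most_common(3))   # top trigrams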
The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University in 2008, and it is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created or modified, including the British National Corpus (our architecture and interface), the 100 million word TIME Corpus (1920s-2000s), and the new 400 million word Corpus of Historical American English (COHA; 1810-2009). The corpus contains more than 425 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2011, and the corpus is updated once or twice a year.
Corpex lets you swiftly browse through all the words of Wikipedia. The system shows you two statistics in four graphs. Corpex is also available as a RESTful web service. Corpex is still very much under development: the currently extracted data is still very noisy, and we are working on better extraction and filtering approaches. The source code is fully open source, and all the data is also freely available.
the Google Books corpus of American English, 155 billion words in size. Use is limited to what you can do via the website at Brigham Young University. The easy thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface allows you to look for collocates (words that go with other words) and view charts showing relative word frequency in the corpus by decade; it handles parts of speech and gives you various limits and display options. Other kinds of analysis that might be done with text corpora can’t be done through the interface.