In this post, I want to show how I use NLTK for preprocessing and tokenization, but then apply machine learning techniques (e.g. building a linear SVM using stochastic gradient descent) using Scikit-Learn.
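A minimal sketch of that workflow, assuming NLTK and Scikit-Learn are installed. The toy texts, labels, and the `tokenize` helper are hypothetical illustrations and are not taken from the post. Tokenization is done with NLTK's regex-based `wordpunct_tokenize` (which needs no downloaded models), and the linear SVM is approximated by `SGDClassifier` with hinge loss.

```python
from nltk.tokenize import wordpunct_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def tokenize(text):
    """Hypothetical helper: NLTK tokenization, lowercased, alphabetic tokens only."""
    return [tok.lower() for tok in wordpunct_tokenize(text) if tok.isalpha()]


# Toy data, invented for illustration only.
train_texts = [
    "The team won the match in the final minute",
    "A late goal decided the football game",
    "The new graphics card renders frames faster",
    "This processor and motherboard need a driver update",
]
train_labels = ["sport", "sport", "hardware", "hardware"]

# NLTK handles preprocessing; Scikit-Learn handles features and the classifier.
# loss="hinge" makes SGDClassifier train a linear SVM via stochastic gradient descent.
model = make_pipeline(
    TfidfVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["The goal came late in the game",
                             "Install the driver for the graphics card"])
print(list(predictions))
```

Passing the tokenizer into the vectorizer keeps the two libraries loosely coupled: any other NLTK tokenizer or stemmer could be swapped in without touching the classifier.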
Our flagship collection, under development since 1987, covers the history, literature and culture of the Greco-Roman world. We are applying what we have learned from Classics to other subjects within the humanities and beyond.
From the first book printed in English by William Caxton, through the age of Spenser and Shakespeare and the tumult of the English Civil War, Early English Books Online (EEBO) will contain over 125,000 titles listed in Pollard and Redgrave's Short-Title Catalogue (1475-1640), Wing's Short-Title Catalogue (1641-1700), the Thomason Tracts (1640-1661), and the Early English Tract Supplement - all in full digital facsimile from the Early English Books microfilm collection.
Welcome to the Networked Digital Library of Theses and Dissertations (NDLTD), an international organization dedicated to promoting the adoption, creation, use, dissemination, and preservation of electronic theses and dissertations (ETDs). We support electronic publishing and open access to scholarship in order to enhance the sharing of knowledge worldwide. Our website includes resources for university administrators, librarians, faculty, students, and the general public.
The Fabian Society collection includes pamphlets published as part of the Fabian Tracts series, 1884-2000; minutes of Executive Committee meetings and other key committee meetings, 1884-1954; and pamphlets published as part of the Young Fabian pamphlet series, 1961-2009. The London School of Economics and Political Science
The project aims to put some Project Gutenberg ebooks into GitHub so people can fix problems in the files, and to use GitHub to open up the PG corpus to maintenance and use by libraries and librarians. The result will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. A version-controlled fork-and-merge workflow, combined with a change-triggered back-end build environment, will allow scalable, distributed maintenance of the greatest works of our literary heritage. 43,000 books and their metadata have been moved to the git version control software.
The English Short Title Catalogue (ESTC) lists over 460,000 items published between 1473 and 1800, mainly in the British Isles and North America and mainly, but not exclusively, in English, drawn from the collections of the British Library and over 2,000 other libraries.
Deutscher Wortschatz contains data generated from newspapers and web resources that are publicly available. The data were collected per language and encompass statistics about co-occurrences of words in randomly selected sentences.
The Incunabula Short Title Catalogue is the international database of 15th-century European printing created by the British Library with contributions from institutions worldwide.
The UK Reading Experience Database (UK RED) is an open access database and research project housed in the English Department of the Open University. It is the largest resource of its kind anywhere recording the experiences of readers. UK RED has amassed over 30,000 records of reading experiences of British subjects, both at home and abroad, and of visitors to the British Isles, between 1450 and 1945. These include both famous and anonymous readers. It is both an open access resource and open to unsolicited public contributions.
The Open Utopia is a complete edition of Thomas More’s Utopia that honors the primary precept of Utopia itself: that all property is common property. But Utopia is more than the story of a far-off land with no private property. It’s a text that instructs us how to approach texts, be they literary or political, in an open manner: open to criticism, open to participation, and open to re-creation.
Handwritten annotations in books are an important key to understanding how historical readers used their books. ABO aims to bring these books together. It is a digital library that reveals the variety of traces that readers left in their books. These examples were previously dispersed over many different libraries around the world. Yet it is also a digital laboratory, where visitors can work together: ABO has tools to enrich the early modern annotations with transcriptions and translations. ABO seeks to encourage collaboration.
TIR 2010
7th International Workshop on Text-based Information Retrieval
in conjunction with DEXA 2010
University of Deusto
Bilbao, Spain
30 August - 3 September 2010
20 Newsgroups
Abstract
This data set consists of 20,000 messages taken from 20 Usenet newsgroups.
Information files:
description of the data
Data files:
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz A subset composed of 100 articles from each newsgroup. (1.9M; 6.2M uncompressed)
CiteXplore combines literature search with text mining tools for biology.
Search results are cross referenced to EBI applications based on publication identifiers.
Links to full text versions are provided where available.
Y. Kim. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, page 1746--1751. (2014)
M. Hearst. Proceedings of the 14th Conference on Computational Linguistics - Volume 2, page 539--545. Stroudsburg, PA, USA, Association for Computational Linguistics, (1992)
G. Barbieri, F. Pachet, P. Roy, and M. Esposti. Proceedings of the 20th European Conference on Artificial Intelligence, page 115--120. Amsterdam, The Netherlands, IOS Press, (2012)
T. Kenter, and M. de Rijke. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, page 1411--1420. New York, NY, USA, ACM, (2015)
D. Nguyen, N. Smith, and C. Rosé. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, page 115--123. Stroudsburg, PA, USA, Association for Computational Linguistics, (2011)
X. Zhang, and Y. LeCun. arXiv:1502.01710, (2015). Comment: This technical report is superseded by a paper entitled "Character-level Convolutional Networks for Text Classification", arXiv:1509.01626. It has considerably more experimental results and a rewritten introduction.
C. Kohlschütter, P. Fankhauser, and W. Nejdl. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM 2010). New York City, NY, USA, (2010)
C. Zhai, A. Velivelli, and B. Yu. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 743--748. New York, NY, USA, ACM, (2004)
W. Cavnar, and J. Trenkle. Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, page 161--175. Las Vegas, US, (1994)
C. Luo, Y. Li, and S. Chung. Data & Knowledge Engineering, 68 (11): 1271--1288, (2009). Including Special Section: Conference on Privacy in Statistical Databases (PSD 2008) - Six selected and extended papers on Database Privacy.