The Datawrangling blog was put on the back burner last May while I focused on my startup. Now that I have some bandwidth again, I am getting back to work on several pet projects (including the Amazon EC2 Cluster).
DataSift provides very granular and modular ‘sifting’ functions from a wide range of social and web input feeds, augmenting them with sentiment analysis, storage and analytics to offer an unrivalled service platform which leverages the cloud and scales infinitely. The world is moving to streams, and consumers will consume and curate their own news. DataSift follows this paradigm shift and seeks to become the platform of choice for stream curation, consumption, and ultimately monetization. The end visualizations are unlimited and bounded only by your imagination.
This work is in the general area of sentiment analysis, opinion extraction or opinion mining, and feature-based opinion summarization from the user-generated content or user-generated media on the Web, e.g., reviews, forum and group discussions, and blogs. The area is also closely related to sentiment classification.
GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota. Datasets include MovieLens, Wikilens, Book-Crossing, Jester Joke, and EachMovie.
The Digging into Data Challenge is an international grant competition sponsored by four leading research agencies: the Joint Information Systems Committee (JISC) from the United Kingdom, the National Endowment for the Humanities (NEH) from the United States, the National Science Foundation (NSF) from the United States, and the Social Sciences and Humanities Research Council (SSHRC) from Canada.
The data here is useful for testing classification / clustering, and the accuracy of indexing techniques. However, the datasets are too small to support claims about the efficiency of indexing.
The Software Environment for the Advancement of Scholarly Research (SEASR), funded by the Andrew W. Mellon Foundation, provides a research and development environment capable of powering leading-edge digital humanities initiatives.
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
Baker provides us with a fascinating guide to the world of "The Numerati" who use the data we produce every day (click web pages, flip channels, drive through automatic toll booths, shop with credit cards, and make cell phone calls) to profile us as workers, shoppers, patients, voters, potential terrorists, and lovers.
Even in the most wildly optimistic projections, data mining isn't tenable for uncovering future terrorist plots. We're not trading privacy for security; we're giving up privacy and getting no security in return.
The Model Organism Databases (MODs) are working with the InterMine group to enable faster comparative studies and develop tools that make analysis accessible to the wider scientific community.
R. Agrawal, S. Gollapudi, A. Kannan, and K. Kenthapadi. Proceedings of the 20th International Conference Companion on World Wide Web, pp. 483--492. New York, NY, USA, ACM, (2011)