A collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
The collection is a collaboration between LabROSA and The Echo Nest. More details, background, and instructions on how to use the datasets can be found at LabROSA’s site. The goal of sharing this data on Infochimps is to provide a large dataset for research and to encourage the development of algorithms that scale to data of this size.
There is one dataset for each letter of the alphabet (A-Z) containing data for all songs that start with that letter, one dataset of additional files, and a small sample dataset.
Each per-letter dataset consists of song files in the HDF5 format.
Most of the data is licensed under the same terms as The Echo Nest’s API. The code is released under the GNU General Public License.
Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians.
Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere, i.e., both important and spam tweets are included.
S-Match is an open source Java framework for semantic matching. It contains semantic matching, minimal semantic matching and structure preserving semantic matching algorithm implementations.
A. Dulny, A. Hotho, and A. Krause. Machine Learning and Knowledge Discovery in Databases: Research Track, pp. 438--455. Cham, Springer Nature Switzerland, (2023)
Y. Song, L. Zhang, and C. Giles. CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pp. 93--102. New York, NY, USA, ACM, (2008)