This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
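Because the records are sorted by user ID and arranged sequentially, each user's queries can be collected in a single streaming pass without loading the whole log into memory. The following is a minimal sketch of that idea, not the collection's own tooling: the two-field `(user_id, query)` row layout and the name `sessions_by_user` are assumptions made for illustration, and the sample rows are invented.

```python
from itertools import groupby
from typing import Iterable, Iterator


def sessions_by_user(
    rows: Iterable[tuple[str, str]],
) -> Iterator[tuple[str, list[str]]]:
    """Yield (user_id, queries) for each run of consecutive rows sharing a user ID.

    Relies on the input already being sorted by user ID, so only one
    user's queries are held in memory at a time.
    """
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        yield user_id, [query for _, query in group]


# Invented sample rows in the assumed (user_id, query) layout.
sample_rows = [
    ("101", "weather berlin"),
    ("101", "berlin noise map"),
    ("205", "java semantic matching"),
]
result = list(sessions_by_user(sample_rows))
```

Note that `groupby` only merges adjacent rows, so this approach depends on the sort order stated above; on unsorted input the same user would appear in several groups.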
S-Match is an open-source Java framework for semantic matching. It contains implementations of the semantic matching, minimal semantic matching, and structure-preserving semantic matching algorithms.
Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.
Berlin wird leiser: aktiv gegen Verkehrslärm (Berlin is getting quieter: taking action against traffic noise). The Berlin Senate Department for Urban Development and the Environment wants to involve residents in drawing up its noise action plan. All residents can report where they find Berlin too loud and which measures could help. On this basis, the city develops measures to make Berlin quieter.
20 Newsgroups
Abstract
This data set consists of 20,000 messages taken from 20 Usenet newsgroups.
Information files:
description of the data
Data files:
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz A subset composed of 100 articles from each newsgroup. (1.9M; 6.2M uncompressed)
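A quick way to sanity-check a downloaded archive is to count the messages per newsgroup without unpacking it. The sketch below assumes the archive stores one file per message under a `<top-level directory>/<newsgroup>/<message>` layout; that layout and the function name `count_messages_per_group` are assumptions, not something stated in the description above.

```python
import tarfile
from collections import Counter


def count_messages_per_group(archive_path: str) -> Counter:
    """Count regular files per newsgroup directory in a .tar.gz archive.

    Assumes member paths look like <top>/<newsgroup>/<message>; the
    second-to-last path component is taken as the newsgroup name.
    """
    counts: Counter = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other special entries
            parts = member.name.split("/")
            if len(parts) >= 3:
                counts[parts[-2]] += 1
    return counts
```

Calling `count_messages_per_group("mini_newsgroups.tar.gz")` on a local copy should, if the layout assumption holds, report 100 messages for each of the 20 newsgroups.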
Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians.
Take free online classes from 120+ top universities and educational organizations. We partner with schools like Stanford, Yale, Princeton, and others to offer courses in dozens of topics, from computer science to teaching and beyond. Whether you are pursuing a passion or looking to advance your career, Coursera provides open, free education for everyone.
How big is the Internet? An unknown hacker has now answered this question with effective but illegal means: he gained access to hundreds of thousands of routers and used them as a research probe. The result is a unique snapshot of today's Internet.
A. Dulny, A. Hotho, and A. Krause. Machine Learning and Knowledge Discovery in Databases: Research Track, pages 438--455. Cham, Springer Nature Switzerland, (2023)
Y. Song, L. Zhang, and C. Giles. CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, pages 93--102. New York, NY, USA, ACM, (2008)