German government agencies and universities hold vast amounts of data and great stores of knowledge. Not all parties are equally willing to make these records accessible to the public.
Zanran helps you find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs, tables and charts. For example, the data could be a graph in a PDF report, a table in an Excel spreadsheet, or a bar chart shown as an image in an HTML page. Put more simply: Zanran is Google for data. At present, we extract tables and images from HTML, PDF and Excel files, and we will be processing PowerPoint and Word documents in the near future.
This dataset is released by Signal Media to facilitate conducting research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.
The articles in the dataset were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles, mainly in English, but it also includes non-English and multilingual articles. Sources range from major outlets, such as Reuters, to local news sources and blogs.
B. Berendt, A. Hotho, and G. Stumme. Web Semantics: Science, Services and Agents on the World Wide Web, 8 (2-3): 95-96 (2010). Bridging the Gap--Data Mining and Social Network Analysis for Integrating Semantic Web and Web 2.0; The Future of Knowledge Dissemination: The Elsevier Grand Challenge for the Life Sciences.