iCrawl: An integrated focused crawling toolbox for Web Science
G. Gossen, E. Demidova, and T. Risse. GESIS Computational Social Science Winter Symposium, Cologne, Germany, GESIS, (December 2014). Poster abstract.
Interest in using Web content for research is growing within the scientific community. The Social Web is especially attractive for the social sciences and humanities, as it provides direct access to the opinions of many people about politics, popular topics and events. Documenting activities on the Web and Social Web in Web archives facilitates a better understanding of public perception. However, state-of-the-art Web archive harvesters such as Heritrix have significant limitations in usability, functionality and maintenance with regard to the needs of the scientific community. The iCrawl project aims to provide an integrated crawling toolbox that offers an intuitive, flexible and extensible set of Web crawling components.
In interviews with researchers in history, the social sciences and law, we identified several core requirements for creating Web archives for research. Primary among these are topic- and entity-focused crawling, support for integrated crawling of Social Media, interactive crawl specification and monitoring, and enrichment with semantic metadata. Current state-of-the-art crawlers support at most a subset of these requirements. This holds especially for integrated crawling: to our knowledge, there is no Web crawler that can collect content from Web documents and Social Media APIs together, i.e., that can follow links from Social Media posts to Web documents and back again as part of the crawling process. This is, however, necessary to achieve a high-quality Web archive, as content linked from Social Media posts often disappears from the Web after a very short time and is then no longer available for analysis or archiving.
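The idea of integrated crawling can be illustrated with a minimal sketch. It is not the iCrawl implementation: the in-memory link table, the `post:` identifier prefix and the functions `fetch_post`, `fetch_document` and `integrated_crawl` are all assumptions made for illustration. The sketch only shows a single crawl queue that holds both Social Media posts and Web documents, so that links can be followed in either direction.

```python
from collections import deque

# Toy in-memory "Web": keys are resource identifiers, values are outgoing links.
# Identifiers starting with "post:" stand for Social Media posts that would be
# fetched through an API; "http" identifiers stand for ordinary Web documents.
# All data here is made up for illustration.
LINKS = {
    "post:1": ["http://example.org/article"],
    "http://example.org/article": ["post:2", "http://example.org/about"],
    "post:2": [],
    "http://example.org/about": [],
}


def fetch_post(post_id):
    """Hypothetical stand-in for a Social Media API call; returns outgoing links."""
    return LINKS.get(post_id, [])


def fetch_document(url):
    """Hypothetical stand-in for an HTTP fetch plus link extraction."""
    return LINKS.get(url, [])


def integrated_crawl(seeds, max_items=100):
    """Breadth-first crawl over posts and documents using one shared queue."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_items:
        item = queue.popleft()
        # Dispatch on the resource type, but keep one frontier for both.
        fetch = fetch_post if item.startswith("post:") else fetch_document
        visited.append(item)
        for link in fetch(item):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Starting from the seed `post:1`, this sketch reaches the linked article and, from there, a second post and a further document, which is the "post to document and back again" path described above.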
Crawled documents are semantically analyzed to extract mentions of named entities such as persons, organizations or locations. The extracted metadata is used to improve the relevance of the Web archive by steering the crawler towards documents containing relevant entities and omitting irrelevant documents.
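A minimal sketch of entity-based crawl steering follows. It is an illustration under stated assumptions, not the method used in iCrawl: plain substring matching stands in for real named entity recognition, and the entity list, the scoring function `entity_relevance` and the `Frontier` class are hypothetical names introduced here.

```python
import heapq
import re

# Hypothetical target entities for a crawl; chosen for illustration only.
TARGET_ENTITIES = {"european parliament", "angela merkel"}


def entity_relevance(text, entities=TARGET_ENTITIES):
    """Return the fraction of target entities mentioned in the text.

    Substring matching is a simplification; a real system would use a
    named entity recognizer.
    """
    if not entities:
        return 0.0
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(1 for entity in entities if entity in normalized)
    return hits / len(entities)


class Frontier:
    """Priority queue of URLs that returns the most relevant URL first."""

    def __init__(self, threshold=0.0):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves insertion order
        self.threshold = threshold

    def add(self, url, score):
        """Queue a URL unless it was seen before or its score is too low."""
        if url in self._seen or score <= self.threshold:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In such a setup, links found in a document would be added to the frontier with the relevance score of the document they came from, so that the crawler visits pages near relevant content first and skips links from documents that mention none of the target entities.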
Finally, the crawled content can be exported in the standard WARC format.
The iCrawl platform will be made publicly available as open source software.