Inproceedings

iCrawl: An integrated focused crawling toolbox for Web Science

GESIS Computational Social Science Winter Symposium, Cologne, Germany, GESIS, December 2014. Poster abstract.

Abstract

Within the scientific community an increasing interest in using Web content for research can be observed. The Social Web in particular is attractive for the Social Sciences and other humanities disciplines, as it provides direct access to the opinions of many people about politics, popular topics and events. Documenting activities on the Web and Social Web in Web archives facilitates a better understanding of public perception. However, state-of-the-art Web archive harvesters such as Heritrix have significant limitations in usability, functionality and maintenance with regard to the needs of the scientific community. The iCrawl project aims to provide an integrated crawling toolbox: an intuitive, flexible and extensible set of Web crawling components. In interviews with historical, social science and law researchers we identified several core requirements for creating Web archives for research. Primary among these are topic- and entity-focused crawling, support for integrated crawling of Social Media, interactive crawl specification and monitoring, and enrichment with semantic metadata. Current state-of-the-art crawlers support at most a subset of these requirements. This is especially true of integrated crawling: to our knowledge there is no Web crawler that can collect Web documents and content from Social Media APIs together, i.e. that can follow links from Social Media posts to Web documents and back again as part of the crawling process. This is, however, necessary for a high-quality Web archive, as links shared in Social Media posts often disappear from the Web after a very short time and are then no longer available for analysis or archiving. Crawled documents are semantically analyzed to extract mentions of named entities such as persons, organizations or locations. The extracted metadata is used to improve the relevance of the Web archive by steering the crawler towards documents containing relevant entities and away from irrelevant documents. Finally, the crawled content can be exported in the standard WARC format. The iCrawl platform will be made publicly available as open source software.
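The abstract describes steering the crawler towards pages that mention relevant entities. A minimal sketch of that idea, assuming a priority-queue frontier and a simple string-match relevance score (the names `Frontier`, `score_page` and `TARGET_ENTITIES` are illustrative, not iCrawl's actual API):

```python
import heapq

# Example target entities a researcher might specify for a topical crawl.
TARGET_ENTITIES = {"Cologne", "GESIS", "Web Science"}

def score_page(text, targets=TARGET_ENTITIES):
    """Relevance = fraction of target entities mentioned in the page text.
    (A real system would use a named-entity recognizer, not substring match.)"""
    mentioned = sum(1 for e in targets if e in text)
    return mentioned / len(targets)

class Frontier:
    """Max-priority queue: URLs from the most relevant pages are fetched first."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order stable

    def add(self, url, relevance):
        # heapq is a min-heap, so negate the score for max-first ordering.
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2]

frontier = Frontier()
frontier.add("http://example.org/sports", score_page("football results"))
frontier.add("http://example.org/wss14",
             score_page("GESIS Winter Symposium on Web Science in Cologne"))
print(frontier.next_url())  # → http://example.org/wss14
```

Links found on low-scoring pages are not dropped outright but simply sink in the queue, so the crawler degrades gracefully when relevance estimates are noisy.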
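The WARC export mentioned at the end of the abstract can be illustrated with a hand-rolled record writer; this is a sketch of the WARC 1.0 record layout (ISO 28500), not iCrawl's exporter, and `write_warc_resource` is a hypothetical helper:

```python
import io
import uuid
from datetime import datetime, timezone

def write_warc_resource(out, uri, payload, content_type="text/html"):
    """Write one minimal WARC 1.0 'resource' record to a binary stream.

    A WARC record is a block of named header fields, a blank line,
    the payload, and two trailing CRLFs separating it from the next record.
    """
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    out.write(headers.encode("ascii"))
    out.write(payload)
    out.write(b"\r\n\r\n")

buf = io.BytesIO()
write_warc_resource(buf, "http://example.org/", b"<html>hello</html>")
print(buf.getvalue().decode().splitlines()[0])  # → WARC/1.0
```

In practice a library such as warcio would be used instead; the point here is only that the archive format is a simple, self-describing sequence of records.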

Users

  • @gerhardgossen
