Scalable, Generic, and Adaptive Systems for Focused Crawling

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Proceedings of the 25th ACM Conference on Hypertext and Social Media (HT '14), pages 35--45. New York, NY, USA, ACM, 2014.
DOI: 10.1145/2631775.2631795

Abstract

Focused crawling is the process of exploring a graph iteratively while focusing on the parts of the graph relevant to a given topic. It occurs in many situations, such as a company collecting data on its competitors, a journalist surfing the Web to investigate a political scandal, or an archivist recording the activity of influential Twitter users during a presidential election. In all these applications, users explore a graph (e.g., the Web or a social network), nodes are discovered one by one, the total number of exploration steps is constrained, some nodes are more valuable than others, and the objective is to maximize the total value of the crawled subgraph. In this article, we introduce scalable, generic, and adaptive systems for focused crawling. Our first effort is to define an abstraction of focused crawling applicable to a large class of real-world scenarios. We then propose a generic algorithm, which allows us to identify and optimize the relevant subsystems. We prove the intractability of finding an optimal exploration, even when all the information is available. Taking this intractability into account, we investigate how the crawler can be steered in several experimental graphs. We show that a greedy strategy performs well and that re-estimating the crawling frontier at each step is important. We then discuss this estimation through heuristics, self-trained regression, and multi-armed bandits. Finally, we investigate their scalability and efficiency in different real-world scenarios, comparing them with state-of-the-art systems.
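
The greedy, budget-constrained exploration described in the abstract can be sketched as follows. This is a minimal illustrative sketch in Python, not the paper's implementation: neighbors, node_value, and estimate_value are hypothetical callables standing in for application-specific graph access, the value observed once a node is crawled, and the frontier estimator (a heuristic, a self-trained regressor, or a multi-armed bandit score).

    def greedy_focused_crawl(seeds, neighbors, node_value, estimate_value, budget):
        """Greedy focused crawling under a fixed step budget (illustrative sketch).

        seeds          -- iterable of starting nodes
        neighbors      -- node -> iterable of adjacent nodes, revealed when crawled
        node_value     -- node -> true value, observable only after crawling
        estimate_value -- node -> estimated value of a frontier node (heuristic,
                          self-trained regressor, or multi-armed bandit score)
        budget         -- maximum number of crawl steps
        """
        crawled = set()
        frontier = set(seeds)
        total_value = 0.0

        for _ in range(budget):
            if not frontier:
                break
            # Re-estimate the whole frontier at every step: estimates change as the
            # crawled subgraph grows and the estimator is retrained on observed values.
            best = max(frontier, key=estimate_value)
            frontier.discard(best)
            crawled.add(best)
            total_value += node_value(best)
            # Crawling a node reveals its out-neighbours; uncrawled ones join the frontier.
            for n in neighbors(best):
                if n not in crawled:
                    frontier.add(n)

        return crawled, total_value

For instance, a simple heuristic estimate_value could score a frontier node by the average observed value of its already-crawled in-neighbours; the bandit-based estimators mentioned in the abstract would instead learn, during the crawl, which of several such heuristics to trust.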

