
Detecting semantic cloaking on the web

Baoning Wu and Brian D. Davison. WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 819--828. New York, NY, USA, ACM, 2006.
DOI: http://doi.acm.org/10.1145/1135777.1135901

Abstract

By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
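The two-step pipeline described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation; it only shows the shape of the filtering step, assuming a simple heuristic (a page becomes a candidate when the copy served to the crawler and the copy served to the browser differ in more than a threshold number of terms). The function names and the threshold are hypothetical.

```python
import re
from collections import Counter

def terms(html: str) -> Counter:
    """Tokenize a page into lowercase word counts, stripping markup naively."""
    text = re.sub(r"<[^>]+>", " ", html)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def filter_candidate(crawler_copy: str, browser_copy: str, threshold: int = 3) -> bool:
    """Filtering step (hypothetical heuristic): flag a page as a candidate
    for the classifier when the two copies differ in more than `threshold`
    term occurrences."""
    a, b = terms(crawler_copy), terms(browser_copy)
    diff = sum(((a - b) + (b - a)).values())  # symmetric term-count difference
    return diff > threshold

# Identical copies pass the filter; divergent copies become candidates
# that the second-step classifier would then examine.
same = "<html><body>cheap flights to paris</body></html>"
cloaked = "<html><body>viagra casino pills poker loans</body></html>"
print(filter_candidate(same, same))     # False
print(filter_candidate(same, cloaked))  # True
```

In the paper's actual design, candidates surviving this cheap filter are passed to a trained classifier, which is what achieves the reported 93% precision and 85% recall.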
