Visual Adjacency Multigraphs – a Novel Approach for a Web Page Classification

Proceedings of the Workshop on Statistical Approaches to Web Mining (SAWM), (2004)

Abstract

Standard techniques for web page classification usually take a simple text-based approach, in which most of the information provided by the visual layout of a page is discarded. In our work we propose a new classification approach based on visual layout analysis, conducted before standard classification techniques are applied. A page is represented as a hierarchical structure, the Visual Adjacency Multigraph, in which nodes represent simple HTML objects (text, images) while directed edges reflect the spatial relations ‘immediately before’, ‘immediately after’, ‘immediately left’ and ‘immediately right’ on the browser screen. Using the visual information contained in the multigraph, one is able to define heuristics for recognizing common page entities such as vertical and horizontal link lists, titles and subtitles, and paragraphs of text. Visual analysis results in a more accurate method for representing the page contents, which splits the text features into different subsets according to the groups they belong to. Finally, we introduce a classification system which, taking into account the proposed layout analysis, clearly outperforms a standard bag-of-words approach.
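
The abstract gives no implementation details, so the following is only a minimal sketch of how such a multigraph and one layout heuristic might look, not the authors' method. The class and function names, the node kinds ("text", "link", "image"), the reading of an "after" edge as "the target sits immediately below the source", and the minimum chain length of three are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

# The four spatial relations named in the abstract (shortened labels are hypothetical).
RELATIONS = ("before", "after", "left", "right")


@dataclass
class Node:
    node_id: str
    kind: str       # assumed labels: "text", "link" or "image"
    text: str = ""


@dataclass
class VisualAdjacencyMultigraph:
    nodes: dict = field(default_factory=dict)
    # (source id, relation) -> list of target ids; several edges per pair are allowed
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[(src, relation)].append(dst)

    def neighbors(self, src, relation):
        return self.edges.get((src, relation), [])


def vertical_link_lists(graph, min_len=3):
    """Assumed heuristic: a vertical link list is a chain of at least
    `min_len` link nodes joined by consecutive 'after' edges."""
    def is_link(nid):
        return graph.nodes[nid].kind == "link"

    # Link nodes that directly follow another link cannot start a chain.
    has_link_pred = {
        dst
        for (src, rel), dsts in graph.edges.items()
        if rel == "after" and is_link(src)
        for dst in dsts
        if is_link(dst)
    }
    chains = []
    for nid in graph.nodes:
        if not is_link(nid) or nid in has_link_pred:
            continue
        chain, seen = [nid], {nid}
        while True:
            nxt = [d for d in graph.neighbors(chain[-1], "after")
                   if is_link(d) and d not in seen]
            if not nxt:
                break
            chain.append(nxt[0])
            seen.add(nxt[0])
        if len(chain) >= min_len:
            chains.append(chain)
    return chains


def grouped_bag_of_words(graph, groups):
    """Split word counts into one bag per recognized page entity."""
    return {
        name: Counter(w for nid in ids for w in graph.nodes[nid].text.lower().split())
        for name, ids in groups.items()
    }


# Toy page: a title followed by three stacked links.
g = VisualAdjacencyMultigraph()
g.add_node(Node("t", "text", "Department News"))
for i, label in enumerate(["Home", "Research", "Contact"]):
    g.add_node(Node(f"l{i}", "link", label))
g.add_edge("t", "after", "l0")
g.add_edge("l0", "after", "l1")
g.add_edge("l1", "after", "l2")

lists = vertical_link_lists(g)
features = grouped_bag_of_words(g, {"title": ["t"], "link_list": lists[0]})
```

In this toy page the three links form a single chain, and the words of the title and of the link list end up in separate bags. That separation is the kind of feature splitting the abstract contrasts with a flat bag-of-words representation.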
