Effective Web Data Extraction with Standard XML Technologies

by: Jussi Myllymaki
In: Proc. WWW10 (2001).

Abstract

We discuss the problem of Web data extraction and describe an XML-based methodology whose goal extends far beyond simple "screen scraping." An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages, and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, and with data integration problems such as incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable...
