
Reverse Engineering for Web Data: From Visual to Semantic Structures

C. Chung, M. Gertz, and N. Sundaresan. Data Engineering, International Conference on, (2002)
DOI: http://dx.doi.org/10.1109/ICDE.2002.994697

Abstract

Despite the advancement of XML, the majority of documents on the Web is still marked up with HTML for visual rendering purposes only, thus building a huge amount of "legacy" data. In order to facilitate querying Web based data in a way more efficient and effective than just keyword based retrieval, enriching such Web documents with both structure and semantics is necessary. This paper describes a novel approach to the integration of topic specific HTML documents into a repository of XML documents. In particular, we describe how topic specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. We finally demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.

Links and resources

BibTeX key: Chung2002Reverse
entry type: article
address: Los Alamitos, CA, USA
year: 2002
journal: Data Engineering, International Conference on
publisher: IEEE Computer Society
volume: 0
posted-at: 2009-02-03 15:48:59
citeulike-article-id: 4001267
priority: 0
comment:
* Input: XML obtained from structural analysis of HTML documents
* Output: frequent paths representable as a tree (starting from the root of the XML documents), which is then transformed into a DTD by analysing the ordering and repetition of elements
* Techniques:
  # Support: the frequency of a path within one XML document, averaged over all the XML documents.
  # Support ratio: the support of a path that is lower in the tree (closer to the leaves) is naturally smaller; the support ratio is the ratio of the support of the path's last element to the support of its prefix (the rest of the path).
* Limitations:
  # does not consider semantic heterogeneity
  # assumes the XML documents are from one domain
  # uses arbitrary thresholds
DOI: http://dx.doi.org/10.1109/ICDE.2002.994697
url: http://dx.doi.org/10.1109/ICDE.2002.994697
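
The support and support ratio measures named in the comment can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the per-document frequency of a path is simple presence (0 or 1), so support becomes the fraction of documents containing the path, and that the support ratio is the support of a path divided by the support of its prefix. The function names, the sample documents, and the threshold values are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_paths(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def path_support(docs):
    """Support of each path: fraction of documents containing it (assumed reading)."""
    counts = Counter()
    for doc in docs:
        counts.update(root_paths(doc))
    return {path: count / len(docs) for path, count in counts.items()}


def support_ratio(support, path):
    """Support of a path relative to the support of its prefix (its parent path)."""
    if len(path) == 1:
        return support[path]  # the root path has no prefix
    return support[path] / support[path[:-1]]


def frequent_paths(support, min_support, min_ratio):
    """Keep the paths that pass both (arbitrary, user-chosen) thresholds."""
    return sorted(
        path
        for path in support
        if support[path] >= min_support
        and support_ratio(support, path) >= min_ratio
    )


# Hypothetical resume-like documents, invented for illustration only.
docs = [
    "<resume><education><degree/></education><contact/></resume>",
    "<resume><education><degree/><date/></education></resume>",
    "<resume><education/><hobby/></resume>",
    "<resume><contact/></resume>",
]

support = path_support(docs)
majority = frequent_paths(support, min_support=0.5, min_ratio=0.5)
for path in majority:
    print("/".join(path), support[path], round(support_ratio(support, path), 3))
```

The surviving paths form a tree rooted at `resume`, which is the kind of structure the comment says is later turned into a DTD. The sketch stops before that step because the ordering and repetition analysis is not described in enough detail here.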


Cite this publication

@article{Chung2002Reverse,
  abstract = {Despite the advancement of XML, the majority of documents on the Web is still marked up with HTML for visual rendering purposes only, thus building a huge amount of "legacy" data. In order to facilitate querying Web based data in a way more efficient and effective than just keyword based retrieval, enriching such Web documents with both structure and semantics is necessary. This paper describes a novel approach to the integration of topic specific HTML documents into a repository of XML documents. In particular, we describe how topic specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. We finally demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler.},
  added-at = {2009-03-12T15:42:50.000+0100},
  address = {Los Alamitos, CA, USA},
  author = {Chung, Christina Y. and Gertz, Michael and Sundaresan, Neel},
  biburl = {https://www.bibsonomy.org/bibtex/2efcf4b95a9f674dac40d5eab86c71a31/lillejul},
  citeulike-article-id = {4001267},
  comment = {* Input: XML obtained from structural analysis of HTML documents * Output: frequent paths representable as a tree (from the root of xml docs), transformed then into DTD by analysing ordering and repetition of elements * Techniques: \# Support: average frequency (over one xml doc) of a path over all the XML docs. \# Support ratio: the support of a path which is lower in the tree (closer to leaves) is naturally smaller, the support ratio is the ratio between the support of the last element of the path over the support of the prefix (rest of the path). * Limitations: \# doesn't consider semantic heterogeneity \# assumes xml docs are from one domain \# uses arbitrary thresholds},
  doi = {http://dx.doi.org/10.1109/ICDE.2002.994697},
  interhash = {5299f1e9825d19eecc2b9a43f45ec017},
  intrahash = {efcf4b95a9f674dac40d5eab86c71a31},
  journal = {Data Engineering, International Conference on},
  keywords = {entityguides information_extraction schema xml},
  posted-at = {2009-02-03 15:48:59},
  priority = {0},
  publisher = {IEEE Computer Society},
  timestamp = {2009-04-22T10:29:37.000+0200},
  title = {Reverse Engineering for Web Data: From Visual to Semantic Structures},
  url = {http://dx.doi.org/10.1109/ICDE.2002.994697},
  volume = 0,
  year = 2002
}
