Learning deterministic regular expressions for the inference of schemas from XML data

G. Bex, W. Gelade, F. Neven, and S. Vansummeren. WWW '08 Proceeding of the 17th international conference on World Wide Web (2008)

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
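To make the k-ORE definition concrete, the following is a minimal sketch that counts how often each alphabet symbol occurs in an expression and reports the smallest k for which the expression is a k-ORE. It is an illustration of the definition only, not the learning algorithm from the paper. The function names `occurrence_bound` and `is_k_ore`, and the assumption that symbols are written as DTD-style element names separated by the operators `, | * + ? ( )`, are hypothetical choices made for this example.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that expr is a k-ORE.

    Assumption: alphabet symbols are element names (letters, digits,
    '_', '.', '-'), and everything else is an operator or whitespace.
    """
    symbols = re.findall(r"[A-Za-z_][\w.-]*", expr)
    counts = Counter(symbols)
    # An expression with no symbols trivially has bound 0.
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check that every alphabet symbol occurs at most k times in expr."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    # Every symbol occurs once: a 1-ORE (single occurrence expression).
    print(occurrence_bound("(title, author+, year?)"))
    # The symbol 'a' occurs twice: a 2-ORE, but not a 1-ORE.
    print(occurrence_bound("a, (b | c)*, a?"))
    print(is_k_ore("a, (b | c)*, a?", 1))
```

The check is purely syntactic: it says nothing about whether the expression is deterministic, which is a separate requirement on DTD and XSD content models.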

Links and resources

BibTeX key: BGN08
entry type: article
year: 2008
journal: WWW '08 Proceeding of the 17th international conference on World Wide Web
pages: 825-834
url: http://portal.acm.org/citation.cfm?id=1367497.1367609

Cite this publication

@article{BGN08,
  abstract = {Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.},
  added-at = {2010-11-03T15:39:14.000+0100},
  author = {Bex, Geert Jan and Gelade, Wouter and Neven, Frank and Vansummeren, Stijn},
  biburl = {https://www.bibsonomy.org/bibtex/2309bd97f2d6cc8e676cc1fb9e4dec5b2/malte.wunsch},
  description = {Learning deterministic regular expressions for the inference of schemas from XML data},
  interhash = {66a0277629dd7f937a91cb072c16da87},
  intrahash = {309bd97f2d6cc8e676cc1fb9e4dec5b2},
  journal = {WWW '08 Proceeding of the 17th international conference on World Wide Web},
  keywords = {data database deterministic expressions regular xml},
  pages = {825-834},
  timestamp = {2010-11-03T15:43:29.000+0100},
  title = {Learning deterministic regular expressions for the inference of schemas from XML data},
  url = {http://portal.acm.org/citation.cfm?id=1367497.1367609},
  year = 2008
}
