copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Extracting schema from semistructured data

S. Nestorov, S. Abiteboul, and R. Motwani. SIGMOD Rec., 27 (2): 295--306 (1998)
DOI: http://dx.doi.org/10.1145/276305.276331

Abstract

Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.

Links and resources

BibTeX key: Nestorov1998Extracting
entry type: article
address: New York, NY, USA
year: 1998
journal: SIGMOD Rec.
number: 2
pages: 295--306
publisher: ACM
volume: 27
issn: 0163-5808
posted-at: 2009-02-03 15:55:13
citeulike-article-id: 4001285
priority: 0
comment: * Purpose: presenting and querying * Input: OEM, not a tree, didn't see cycles, but not excluded explicitly... * Output: Several "home types" describing the data with defects (either too many links or too few). The size of the summary is reduced by allowing an object to have several roles (i.e. types). * Principle: 1) Create perfect typing. 2) Reduce the number of types (by clustering), i.e., distance between types. 3) Recasting: for each object, find other types in which it could fit. * Limitations: unambiguous semi-structured semantics.
DOI: http://dx.doi.org/10.1145/276305.276331
url: http://dx.doi.org/10.1145/276305.276331

@lillejul's tags highlighted

Cite this publication

@article{Nestorov1998Extracting, abstract = {Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.}, added-at = {2009-03-12T15:42:50.000+0100}, address = {New York, NY, USA}, author = {Nestorov, Svetlozar and Abiteboul, Serge and Motwani, Rajeev}, biburl = {https://www.bibsonomy.org/bibtex/2174aa3d956c654b8921e368c20492cfa/lillejul}, citeulike-article-id = {4001285}, comment = {* Purpose: presenting and querying * Input: OEM, not a tree, didn't see cycles, but not excluded explicitly... * Output: Several "home types" describing the data with defects (either too many links or too few). The size of the summary is reduced by allowing an object to have several roles (i.e. types). * Principle: 1) Create perfect typing. 2) Reduce the number of types (by clustering), i.e., distance between types. 3) Recasting: for each object, find other types in which it could fit. * Limitations: unambiguous semi-structured semantics.}, doi = {http://dx.doi.org/10.1145/276305.276331}, interhash = {593dac96a60921f5e77428f564af3a80}, intrahash = {174aa3d956c654b8921e368c20492cfa}, issn = {0163-5808}, journal = {SIGMOD Rec.}, keywords = {entityguides schema xml}, number = 2, pages = {295--306}, posted-at = {2009-02-03 15:55:13}, priority = {0}, publisher = {ACM}, timestamp = {2009-03-12T15:42:50.000+0100}, title = {Extracting schema from semistructured data}, url = {http://dx.doi.org/10.1145/276305.276331}, volume = 27, year = 1998 }

BibSonomy

copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Extracting schema from semistructured data

Abstract

Links and resources

Tags

community

Cite this publication

More citation styles

search on

Meta data

Comments and Reviews
(0)

BibSonomy

copydeleteadd this publication to your clipboardcommunity posthistory of this postURLDOIBibTeXEndNoteAPAChicagoDIN 1505HarvardMSOffice XML Extracting schema from semistructured data

Abstract

Links and resources

Tags

community

Cite this publication

More citation styles

search on

Meta data

Comments and Reviews (0)

copy delete add this publication to your clipboard
community post
history of this post
URL
DOI
BibTeX
EndNote
APA
Chicago
DIN 1505
Harvard
MSOffice XML

Extracting schema from semistructured data

Comments and Reviews
(0)