Using Provenance for Personalized Quality Ranking of Scientific Datasets

International Journal of Computers and Their Applications (IJCA), 18(3): 180–195 (September 2011)

Abstract

The rapid growth of eScience has led to an explosion in the creation and availability of scientific datasets, including raw instrument data and datasets derived from model simulations. A large number of these datasets are surfacing online in public and private catalogs, often annotated with XML metadata, as part of community efforts to foster open research. With this rapid expansion comes the challenge of filtering and selecting the datasets that best match the needs of scientists. We address a key aspect of the scientific data discovery process by ranking search results according to a personalized data quality score, based on a declarative quality profile, to help scientists select the most suitable data for their applications. Our quality model is resilient to missing metadata through a novel strategy that substitutes provenance in its absence. Intuitively, our premise is that the quality score for a dataset depends on its provenance – the scientific task and its inputs that created the dataset – and that it is possible to define a quality function based on provenance metadata that predicts the same quality score as one evaluated using the user’s quality profile over the complete metadata. Here, we present a model and architecture for data quality scoring, apply machine learning techniques to construct a quality function that uses provenance as a proxy for missing metadata, and empirically test the predictive power of our quality function. Our results show that for some scientific tasks, quality scores based on provenance closely track the quality scores based on complete metadata properties, with error margins between 1% and 29%.
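The premise above – a declarative quality profile scored over metadata, with a learned provenance-based function standing in when metadata is missing – can be sketched as follows. This is an illustrative assumption of how such scoring might look, not the paper's actual model; all property names, weights, and coefficients here are hypothetical.

```python
# Hypothetical sketch: a user's quality profile scores a dataset from its
# metadata properties; when metadata is missing, a function learned offline
# over provenance features (the creating task and its inputs) predicts the
# same score. Names and values are illustrative, not from the paper.

def profile_score(metadata, profile):
    """Weighted sum of normalized metadata properties (declarative quality profile)."""
    return sum(w * metadata[prop] for prop, w in profile.items() if prop in metadata)

def provenance_score(provenance, model):
    """Predicted quality from provenance features, via coefficients fit offline
    on datasets that do have complete metadata."""
    return model["bias"] + sum(model["coef"].get(f, 0.0) * v
                               for f, v in provenance.items())

# Scoring with complete metadata available
profile = {"completeness": 0.5, "timeliness": 0.3, "resolution": 0.2}
full_meta = {"completeness": 0.9, "timeliness": 0.6, "resolution": 0.8}
print(round(profile_score(full_meta, profile), 2))  # 0.79

# Dataset with missing metadata: fall back to the provenance-based predictor
prov = {"task_version": 1.0, "input_quality": 0.75}
model = {"bias": 0.1, "coef": {"task_version": 0.2, "input_quality": 0.6}}
print(round(provenance_score(prov, model), 2))  # 0.75
```

In the paper's terms, the coefficients of `provenance_score` would be produced by the machine learning step, trained so that its predictions track the profile-based scores on datasets where both are computable.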

Tags

community
