Using Provenance for Personalized Quality Ranking of Scientific Datasets

International Journal of Computers and Their Applications (IJCA), 18(3): 180–195 (September 2011)

Abstract

The rapid growth of eScience has led to an explosion in the creation and availability of scientific datasets, including raw instrument data and datasets derived from model simulations. A large number of these datasets are surfacing online in public and private catalogs, often annotated with XML metadata, as part of community efforts to foster open research. With this rapid expansion comes the challenge of filtering and selecting the datasets that best match the needs of scientists. We address a key aspect of the scientific data discovery process by ranking search results according to a personalized data quality score, based on a declarative quality profile, to help scientists select the most suitable data for their applications. Our quality model is resilient to missing metadata through a novel strategy that substitutes provenance in its absence. Intuitively, our premise is that the quality score for a dataset depends on its provenance – the scientific task and its inputs that created the dataset – and that it is possible to define a quality function based on provenance metadata that predicts the same quality score as one evaluated using the user’s quality profile over the complete metadata. Here, we present a model and architecture for data quality scoring, apply machine learning techniques to construct a quality function that uses provenance as a proxy for missing metadata, and empirically test the predictive power of our quality function. Our results show that for some scientific tasks, quality scores based on provenance closely track the quality scores based on complete metadata properties, with error margins between 1% and 29%.
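The premise above – a declarative quality profile scored over metadata, with a learned provenance-based function standing in when metadata is missing – can be sketched as follows. This is an illustrative assumption of how such scoring might look, not the paper's actual model; all property names, weights, and coefficients here are hypothetical.

```python
# Hypothetical sketch: a user's quality profile scores a dataset from its
# metadata properties; when metadata is missing, a function learned offline
# over provenance features (the creating task and its inputs) predicts the
# same score. Names and values are illustrative, not from the paper.

def profile_score(metadata, profile):
    """Weighted sum of normalized metadata properties (declarative quality profile)."""
    return sum(w * metadata[prop] for prop, w in profile.items() if prop in metadata)

def provenance_score(provenance, model):
    """Predicted quality from provenance features, via coefficients fit offline
    on datasets that do have complete metadata."""
    return model["bias"] + sum(model["coef"].get(f, 0.0) * v
                               for f, v in provenance.items())

# Scoring with complete metadata available
profile = {"completeness": 0.5, "timeliness": 0.3, "resolution": 0.2}
full_meta = {"completeness": 0.9, "timeliness": 0.6, "resolution": 0.8}
print(round(profile_score(full_meta, profile), 2))  # 0.79

# Dataset with missing metadata: fall back to the provenance-based predictor
prov = {"task_version": 1.0, "input_quality": 0.75}
model = {"bias": 0.1, "coef": {"task_version": 0.2, "input_quality": 0.6}}
print(round(provenance_score(prov, model), 2))  # 0.75
```

In the paper's terms, the coefficients of `provenance_score` would be produced by the machine learning step, trained so that its predictions track the profile-based scores on datasets where both are computable.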

Tags

community
