E. Marx, S. Shekarpour, S. Auer, and A.-C. Ngonga Ngomo. Large-scale RDF Dataset Slicing. 7th IEEE International Conference on Semantic Computing, September 16-18, 2013, Irvine, California, USA, (2013)
Abstract
In recent years, an increasing amount of structured data has been published
on the Web as Linked Open Data (LOD). Despite recent advances, consuming
and using Linked Open Data within an organization remains a substantial
challenge. Many LOD datasets are quite large, and despite progress
in RDF data management, loading and querying them within a triple
store is extremely time-consuming and resource-demanding. To overcome
this consumption obstacle, we propose a process inspired by the classical
Extract-Transform-Load (ETL) paradigm. In this article, we focus
particularly on the selection and extraction steps of this process.
We devise a fragment of SPARQL, dubbed SliceSPARQL, which enables
the selection of well-defined slices of datasets fulfilling typical
information needs. SliceSPARQL supports graph patterns in which
each connected subgraph pattern involves at most one variable
or IRI in its join conditions. This restriction guarantees efficient
processing of the query against a sequential dataset dump stream.
Our evaluation shows that dataset slices can be generated
an order of magnitude faster than with the conventional approach
of loading the whole dataset into a triple store and retrieving the
slice by executing the query against the triple store's SPARQL endpoint.
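The restriction described in the abstract (at most one variable or IRI in the join conditions of each connected subgraph pattern) is what makes processing against a sequential dump stream feasible: every join can be resolved by remembering a single set of bindings between passes. The following is a minimal illustrative sketch, not the authors' implementation; it slices a dump using a hypothetical seed pattern `?s rdf:type <Class>` in two sequential passes, joining only on the subject variable `?s`.

```python
# Illustrative two-pass streaming slice. The only join variable is the
# subject ?s, so pass 1 collects its bindings and pass 2 emits the slice;
# neither pass needs random access or a loaded triple store.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def slice_by_type(triples, class_iri):
    """Return the slice of `triples` about subjects of type `class_iri`.

    `triples` is a re-iterable sequence of (subject, predicate, object)
    tuples, standing in for a streamed N-Triples dump.
    """
    # Pass 1: bind ?s from the seed pattern  ?s rdf:type <class_iri> .
    seeds = {s for s, p, o in triples if p == RDF_TYPE and o == class_iri}
    # Pass 2: emit every triple whose subject is one of the seed bindings.
    return [t for t in triples if t[0] in seeds]

dump = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", RDF_TYPE, "ex:Place"),
]
print(slice_by_type(dump, "ex:Person"))
# → [('ex:alice', RDF_TYPE, 'ex:Person'), ('ex:alice', 'ex:knows', 'ex:bob')]
```

With more than one join variable per connected pattern, intermediate bindings could depend on each other across the stream, which is exactly the case the SliceSPARQL fragment rules out.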
@inproceedings{Marx2013,
author = {Marx, Edgard and Shekarpour, Saeedeh and Auer, S\"oren and Ngomo, Axel-Cyrille Ngonga},
biburl = {https://www.bibsonomy.org/bibtex/2ccbd5123f1a902c683038bd751e89505/soeren},
booktitle = {7th IEEE International Conference on Semantic Computing, September 16-18, 2013, Irvine, California, USA},
keywords = {2013 auer event_ICSC group_aksw lod2page marx ngonga shekarpour},
title = {Large-scale RDF Dataset Slicing},
url = {http://svn.aksw.org/papers/2013/ICSC_SLICE/public.pdf},
year = 2013
}