ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

H. Holzmann, V. Goel, and A. Anand. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pages 83--92. New York, NY, USA, ACM, (2016)
DOI: 10.1145/2910896.2910902

Abstract

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores while improving usability by seamlessly integrating queries and derivations with external tools.

Links and resources

BibTeX key: Holzmann:2016:AEW:2910896.2910902
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
year: 2016
pages: 83--92
publisher: ACM
series: JCDL '16
acmid: 2910902
isbn: 978-1-4503-4229-2
location: Newark, New Jersey, USA
numpages: 10
DOI: 10.1145/2910896.2910902
url: https://arxiv.org/abs/1702.01015
