jaeschke > regio | BibSonomy

bookmarks (1)

Summary GAW | Zenodo
The dataset was created to establish a knowledge base on the "German Academic Web" (GAW). Since 2012, semi-annual focused crawls of the web pages of universities and research institutes in Germany have been performed using Heritrix, the open-source, archival-quality web crawler of the Internet Archive. Starting from a list of given seeds, Heritrix follows newly discovered hyperlinks and stores the content it sees in the standardised WARC file format. For each crawl, Heritrix was initialised with a conceptually invariant seed list of, on average, 150 domains of all German academic institutions with the right to award doctorates. The seed list is extracted from the current entries on https://de.wikipedia.org/wiki/Liste_der_Hochschulen_in_Deutschland. The crawler follows a breadth-first policy on each host, thereby collecting all available pages reachable by links from the homepage. The scope was limited to pages from the seed domains, and certain file types (mainly audio, video, and compressed files) were excluded using regular expressions. During the crawl, the URL queues were monitored via a web UI. Hosts that appeared to be undesirable, such as e-learning systems or repositories, were "retired", that is, their URLs were no longer crawled. However, previously harvested URLs from retired hosts were not removed. Most crawls were stopped manually after roughly 100 million pages had been collected (according to Heritrix's control console), which took roughly two weeks per crawl on average. The present dataset gives an overview of the size of the GAW.
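The crawl policy described above (breadth-first traversal, scope restricted to seed domains, file types excluded by regular expression, and hosts retired during the crawl) can be illustrated with a minimal Python sketch. This is not Heritrix configuration and not the authors' setup: the seed domains, the exclusion pattern, and the function names `in_scope` and `crawl_order` are all hypothetical, and the link graph is a plain dictionary rather than live HTTP fetching.

```python
import re
from collections import deque
from urllib.parse import urlparse

# Hypothetical seed domains and exclusion pattern; the actual seed list and
# regular expressions used for the GAW crawls are not given in the description.
SEED_DOMAINS = {"uni-example.de", "tu-example.de"}
EXCLUDED = re.compile(r"\.(mp3|wav|mp4|avi|zip|gz|tar)$", re.IGNORECASE)


def in_scope(url, retired_hosts=frozenset()):
    """True if url is on a seed domain, its host is not retired,
    and its path does not end in an excluded file extension."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_seed = any(host == d or host.endswith("." + d) for d in SEED_DOMAINS)
    return (
        on_seed
        and host not in retired_hosts
        and not EXCLUDED.search(parsed.path)
    )


def crawl_order(seed, links, retired_hosts=frozenset()):
    """Breadth-first visit order over a link graph (dict: url -> list of urls),
    following only links that pass the scope check."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen and in_scope(target, retired_hosts):
                seen.add(target)
                queue.append(target)
    return order
```

In this sketch a retired host is simply skipped when new links are discovered, while URLs already visited stay in the result, which mirrors the statement that previously harvested URLs from retired hosts were not removed.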
3 years ago by @jaeschke
tags: dataset, gaw, myown, regio, zenodo

publications (3)

Evaluating dataset creation heuristics for concept detection in web pages using BERT
M. Paris and R. Jäschke. Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management, volume 12816 of Lecture Notes in Artificial Intelligence, pages 1-14. Springer, (2021)
3 years ago by @jaeschke
tags: 2021, archive, bert, classification, data, deeplearning, embedding, gaw, learning, machine, ml, myown, network, neural, regio, web
Proximity dimensions and the emergence of collaboration: a HypTrails study on German AI research
T. Koopmann, M. Stubbemann, M. Kapa, M. Paris, G. Buenstorf, T. Hanika, A. Hotho, R. Jäschke, and G. Stumme. Scientometrics, (March 2021)
4 years ago by @jaeschke
tags: 2021, ai, collaboration, myown, proximity, regio, scientometrics
How to Assess the Exhaustiveness of Longitudinal Web Archives: A Case Study of the German Academic Web
M. Paris and R. Jäschke. Proceedings of the 31st ACM Conference on Hypertext and Social Media, New York, NY, USA, ACM, (2020)
4 years ago by @jaeschke
tags: 2020, academic, archive, crawl, exhaustiveness, gaw, german, longitudinal, myown, regio, web