Inproceedings,

Semantic Clustering of the Website Based on its Hypertext Structure

V. Salin, M. Slastihina, I. Ermilov, R. Speck, S. Auer, and S. Papshev.
6th International Conference on Knowledge Engineering and Semantic Web, (2015)

Abstract

The volume of unstructured information presented on the Internet is constantly increasing, together with the total number of websites and their contents. To process this vast amount of information, it is important to distinguish different clusters of related webpages. Such clusters are used, for example, for knowledge extraction, named entity recognition, and recommendation algorithms. A variety of applications (such as semantic analysis systems, crawlers, and search engines) utilize semantic clustering algorithms to recognize thematically connected webpages. The majority of them rely on text analysis of the web documents' content, which leads to certain limitations, such as long processing time, the need for representative text content, or the vagueness of natural language. In this article, we present a framework for unsupervised, domain- and language-independent semantic clustering of a website, which utilizes its internal hypertext structure and does not require text analysis. As a basis, we represent the hypertext structure as a graph and apply known flow simulation clustering algorithms to the graph to produce a set of webpage clusters. We assume these clusters contain thematically connected webpages. We evaluate our clustering approach with a corpus of real-world webpages and compare the approach with well-known text document clustering algorithms.
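
The graph-based step described in the abstract can be illustrated with a minimal sketch. It assumes Markov Clustering (MCL) as the flow simulation algorithm, which the abstract does not name, and it is not the authors' implementation. The link graph, the page paths, the function names, and the parameter values are all hypothetical.

```python
# Sketch only: MCL is an assumed stand-in for the "flow simulation
# clustering algorithms" mentioned in the abstract.

def normalize_columns(m):
    """Scale every column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(links, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster pages from their hyperlinks alone; no page text is read."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    # Undirected adjacency matrix with self-loops on the diagonal.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for src, targets in links.items():
        for dst in targets:
            m[idx[src]][idx[dst]] = m[idx[dst]][idx[src]] = 1.0
    normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: flow spreads along paths
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strong flows are boosted, weak flows fade
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow is an attractor; its columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(pages[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical site: two densely linked sections joined by a single link.
links = {
    "/blog": ["/blog/a", "/blog/b", "/shop"],
    "/blog/a": ["/blog/b"],
    "/shop": ["/shop/x", "/shop/y"],
    "/shop/x": ["/shop/y"],
}
clusters = mcl(links)
print(clusters)
```

In this toy graph the blog pages and the shop pages are each tightly interlinked, while only one link joins the two sections, so the flow simulation is expected to separate them. The inflation parameter controls granularity: higher values tend to produce more, smaller clusters.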

BibTeX key: salin2015
entry type: inproceedings
booktitle: 6th International Conference on Knowledge Engineering and Semantic Web
year: 2015
owner: iermilov
bdsk-url-1: http://svn.aksw.org/papers/2015/KESW_SemanticClustering/public.pdf
Document: http://svn.aksw.org/papers/2015/KESW_SemanticClustering/public.pdf

Cite this publication

@inproceedings{salin2015,
  abstract = {The volume of unstructured information presented on the Internet is constantly increasing, together with the total number of websites and their contents. To process this vast amount of information, it is important to distinguish different clusters of related webpages. Such clusters are used, for example, for knowledge extraction, named entity recognition, and recommendation algorithms. A variety of applications (such as semantic analysis systems, crawlers, and search engines) utilize semantic clustering algorithms to recognize thematically connected webpages. The majority of them rely on text analysis of the web documents' content, which leads to certain limitations, such as long processing time, the need for representative text content, or the vagueness of natural language. In this article, we present a framework for unsupervised, domain- and language-independent semantic clustering of a website, which utilizes its internal hypertext structure and does not require text analysis. As a basis, we represent the hypertext structure as a graph and apply known flow simulation clustering algorithms to the graph to produce a set of webpage clusters. We assume these clusters contain thematically connected webpages. We evaluate our clustering approach with a corpus of real-world webpages and compare the approach with well-known text document clustering algorithms.},
  added-at = {2024-03-04T14:15:02.000+0100},
  author = {Salin, Vladimir and Slastihina, Maria and Ermilov, Ivan and Speck, Ren{\'e} and Auer, S{\"o}ren and Papshev, Sergey},
  bdsk-url-1 = {http://svn.aksw.org/papers/2015/KESW_SemanticClustering/public.pdf},
  biburl = {https://www.bibsonomy.org/bibtex/2986d28c139ce8e4a5ef8e4ee98c05f2c/aksw},
  booktitle = {6th International Conference on Knowledge Engineering and Semantic Web},
  editor = {Klinov, Pavel and Mouromtsev, Dmitry},
  interhash = {cda4eac6b8032e0081775dd21345c644},
  intrahash = {986d28c139ce8e4a5ef8e4ee98c05f2c},
  keywords = {2015 auer group_aksw iermilov speck},
  owner = {iermilov},
  timestamp = {2024-03-04T14:15:02.000+0100},
  title = {Semantic Clustering of the Website Based on its Hypertext Structure},
  url = {http://svn.aksw.org/papers/2015/KESW_SemanticClustering/public.pdf},
  year = 2015
}
