Segmenting HTML pages using visual and semantic information

G. Petasis, P. Fragkou, A. Theodorakos, V. Karkaletsis, and C. Spyropoulos. (June 2008). Proceedings: The 4th Web as Corpus: Can we do better than Google? http://www.lrec-conf.org/proceedings/lrec2008/workshops/W19_Proceedings.pdf
DOI: 10.1109/SPCA.2006.297506

Abstract

The information explosion of the Web aggravates the problem of effective information retrieval. Linguistic approaches found in the literature perform linguistic annotation by creating metadata in the form of tokens, lemmas or part-of-speech tags, but this process is insufficient: such linguistic metadata do not exploit the actual content of the page, which creates the need for semantic annotation based on a predefined semantic model. This paper proposes a new learning approach for performing automatic semantic annotation. The approach is a two-step procedure: the first step partitions a web page into blocks based on its visual layout, while the second performs further partitioning based on the appearance of specific types of entities denoting the semantic category, together with a number of simple heuristics. Preliminary experiments performed on a manually annotated corpus about athletics proved to be very promising.
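The two-step procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how visual layout is analysed or which heuristics are used, so the sketch assumes that block-level HTML tags stand in for visual layout cues and that a small hand-written gazetteer stands in for entity recognition. The names `BLOCK_TAGS`, `GAZETTEER`, `visual_blocks`, `semantic_segments` and `segment_page` are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: block-level tags approximate the visual layout of the page.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "table", "ul"}

# Hypothetical gazetteer mapping entity mentions to semantic categories.
GAZETTEER = {
    "marathon": "event",
    "100m": "event",
    "bolt": "athlete",
    "gebrselassie": "athlete",
}


class BlockSplitter(HTMLParser):
    """Step 1: collect text, closing a block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Normalise whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def visual_blocks(html):
    """Return the text of each layout block, in document order."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def semantic_segments(block):
    """Step 2: split a block where the entity category changes."""
    segments = []
    tokens, category = [], None
    for token in block.split():
        found = GAZETTEER.get(token.lower().strip(".,;:"))
        if found and category and found != category:
            # A different entity type appears: close the current segment.
            segments.append((category, " ".join(tokens)))
            tokens, category = [], None
        tokens.append(token)
        category = found or category
    if tokens:
        segments.append((category, " ".join(tokens)))
    return segments


def segment_page(html):
    """Run both steps: layout blocks first, then semantic sub-segments."""
    return [semantic_segments(block) for block in visual_blocks(html)]
```

Calling `segment_page` on a small page returns one list of `(category, text)` pairs per layout block; blocks that mention no known entity keep the category `None`. The split point in step 2 is deliberately crude (it cuts at the first token of a new entity type), which is the kind of place where the paper's heuristics would presumably do better.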

Links and resources

BibTeX key: citeulike:5663452
entry type: conference
address: Marrakech, Morocco
booktitle: Proceedings of the 4th Web as a Corpus Workshop (WAC-4), 6th Language Resources and Evaluation Conference (LREC 2008)
year: 2008
month: June 1
journal: 4th Web as Corpus Workshop (WAC-4)
pages: 18--24
DOI: 10.1109/SPCA.2006.297506
Document: http://www.ellogon.org/petasis/bibliography/LREC2008/LREC-2008-SemanticSegmentation-Submitted.pdf
note: Proceedings: The 4th Web as Corpus: Can we do better than Google? http://www.lrec-conf.org/proceedings/lrec2008/workshops/W19_Proceedings.pdf
