Techreport

Introduction to OXPath

R. Fayzrakhmanov, C. Michels, and M. Neumann.
(2018)

Abstract

Contemporary web pages with increasingly sophisticated interfaces rival traditional desktop applications for interface complexity and are often called web applications or RIA (Rich Internet Applications). They often require the execution of JavaScript in a web browser and can call AJAX requests to dynamically generate the content, reacting to user interaction. From the automatic data acquisition point of view, thus, it is essential to be able to correctly render web pages and mimic user actions to obtain relevant data from the web page content. Briefly, to obtain data through existing Web interfaces and transform it into structured form, contemporary wrappers should be able to: 1) interact with sophisticated interfaces of web applications; 2) precisely acquire relevant data; 3) scale with the number of crawled web pages or states of web application; 4) have an embeddable programming API for integration with existing web technologies. OXPath is a state-of-the-art technology, which is compliant with these requirements and demonstrated its efficiency in comprehensive experiments. OXPath integrates Firefox for correct rendering of web pages and extends XPath 1.0 for the DOM node selection, interaction, and extraction. It provides means for converting extracted data into different formats, such as XML, JSON, CSV, and saving data into relational databases. This tutorial explains main features of the OXPath language and the setup of a suitable working environment. The guidelines for using OXPath are provided in the form of prototypical examples.

BibTeX key: fayzrakhmanov2018introduction
entry type: techreport
year: 2018
pdf: https://arxiv.org/pdf/1806.10899.pdf
url: http://arxiv.org/abs/1806.10899
note: cite arxiv:1806.10899 Comment: 63 pages

Cite this publication

@techreport{fayzrakhmanov2018introduction,
  abstract = {Contemporary web pages with increasingly sophisticated interfaces rival traditional desktop applications for interface complexity and are often called web applications or RIA (Rich Internet Applications). They often require the execution of JavaScript in a web browser and can call AJAX requests to dynamically generate the content, reacting to user interaction. From the automatic data acquisition point of view, thus, it is essential to be able to correctly render web pages and mimic user actions to obtain relevant data from the web page content. Briefly, to obtain data through existing Web interfaces and transform it into structured form, contemporary wrappers should be able to: 1) interact with sophisticated interfaces of web applications; 2) precisely acquire relevant data; 3) scale with the number of crawled web pages or states of web application; 4) have an embeddable programming API for integration with existing web technologies. OXPath is a state-of-the-art technology, which is compliant with these requirements and demonstrated its efficiency in comprehensive experiments. OXPath integrates Firefox for correct rendering of web pages and extends XPath 1.0 for the DOM node selection, interaction, and extraction. It provides means for converting extracted data into different formats, such as XML, JSON, CSV, and saving data into relational databases. This tutorial explains main features of the OXPath language and the setup of a suitable working environment. The guidelines for using OXPath are provided in the form of prototypical examples.},
  added-at = {2018-08-16T11:07:16.000+0200},
  author = {Fayzrakhmanov, Ruslan R. and Michels, Christopher and Neumann, Mandy},
  biburl = {https://www.bibsonomy.org/bibtex/2de437ce1b203d68df9e6dfc2d11b214b/irgroup_thkoeln},
  description = {[1806.10899] Introduction to OXPath},
  interhash = {a9a305b376bf483b5a4c2ef07c200599},
  intrahash = {de437ce1b203d68df9e6dfc2d11b214b},
  keywords = {myown neumannm sh2},
  note = {cite arxiv:1806.10899 Comment: 63 pages},
  pdf = {https://arxiv.org/pdf/1806.10899.pdf},
  timestamp = {2023-10-26T12:36:08.000+0200},
  title = {Introduction to OXPath},
  url = {http://arxiv.org/abs/1806.10899},
  year = 2018
}
