Data pre-processing evaluation for text mining : transaction/sequence model

D. Munková, M. Munk, and M. Vozár. Procedia Computer Science, 18 (0): 1198--1207 (2013)
DOI: 10.1016/j.procs.2013.05.286

Abstract

Data pre-processing is the most time-consuming phase of the whole knowledge discovery process. The complexity of data pre-processing depends on the data sources used. The aim of this work is to determine to what extent it is necessary to carry out time-consuming data pre-processing when discovering sequential patterns in e-documents. We used the transaction/sequence model for text representation and sequence rule analysis as the modelling method. We compare four datasets of different quality, obtained from texts and pre-processed in different ways: data with identified paragraph sequences, data with identified sentence sequences, data with identified paragraph sequences without stop words, and data with identified sentence sequences without stop words. We try to assess the impact of these advanced data pre-processing techniques on the quantity and quality of the extracted rules. The results confirm some initial assumptions, but they also show that stop-word removal has a substantial impact on the quantity and quality of extracted rules in the case of paragraph sequence identification. In contrast, in the case of sentence sequence identification, removing stop words has no significant impact on the quantity and quality of extracted rules.
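
The four datasets named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the blank-line paragraph rule, the punctuation-based sentence splitter, and the tiny stop-word list are all assumptions made for illustration, and the paper's actual tooling is not described here.

```python
import re

# Hypothetical stop-word list for illustration only; the paper's list is not given.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}


def tokenize(text):
    """Lowercase a text unit and return its word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_units(document, unit):
    """Split a document into paragraphs or sentences (assumed, simple rules)."""
    if unit == "paragraph":
        # Assumption: paragraphs are separated by blank lines.
        parts = re.split(r"\n\s*\n", document)
    elif unit == "sentence":
        # Assumption: a sentence ends with ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", document)
    else:
        raise ValueError("unit must be 'paragraph' or 'sentence'")
    return [part.strip() for part in parts if part.strip()]


def build_sequences(document, unit, remove_stop_words=False):
    """Return one ordered word sequence per paragraph or sentence."""
    sequences = []
    for part in split_units(document, unit):
        words = tokenize(part)
        if remove_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        if words:
            sequences.append(words)
    return sequences


def build_datasets(document):
    """Build the four dataset variants compared in the abstract."""
    return {
        "paragraph": build_sequences(document, "paragraph"),
        "sentence": build_sequences(document, "sentence"),
        "paragraph_no_stop": build_sequences(document, "paragraph", True),
        "sentence_no_stop": build_sequences(document, "sentence", True),
    }
```

Each variant is a list of ordered word sequences, which is the kind of input a sequence rule miner would consume; the mining step itself is outside this sketch.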

Tags

text_mining

Cite this publication

@article{munkova_data_2013,
  abstract = {Data pre-processing is the most time-consuming phase of the whole knowledge discovery process. The complexity of data pre-processing depends on the data sources used. The aim of this work is to determine to what extent it is necessary to carry out time-consuming data pre-processing when discovering sequential patterns in e-documents. We used the transaction/sequence model for text representation and sequence rule analysis as the modelling method. We compare four datasets of different quality, obtained from texts and pre-processed in different ways: data with identified paragraph sequences, data with identified sentence sequences, data with identified paragraph sequences without stop words, and data with identified sentence sequences without stop words. We try to assess the impact of these advanced data pre-processing techniques on the quantity and quality of the extracted rules. The results confirm some initial assumptions, but they also show that stop-word removal has a substantial impact on the quantity and quality of extracted rules in the case of paragraph sequence identification. In contrast, in the case of sentence sequence identification, removing stop words has no significant impact on the quantity and quality of extracted rules.},
  added-at = {2018-11-04T17:02:36.000+0100},
  author = {Munková, Dav and Munk, Michal and Vozár, Martin},
  biburl = {https://www.bibsonomy.org/bibtex/28855cc66f8023965792bb18ad7201fcd/lepsky},
  doi = {10.1016/j.procs.2013.05.286},
  interhash = {685aff6d1da55c33a8443dea351e677b},
  intrahash = {8855cc66f8023965792bb18ad7201fcd},
  journal = {Procedia Computer Science},
  keywords = {text_mining},
  number = 0,
  pages = {1198--1207},
  timestamp = {2018-11-07T09:15:37.000+0100},
  title = {Data pre-processing evaluation for text mining : transaction/sequence model},
  url = {http://dx.doi.org/10.1016/j.procs.2013.05.286},
  volume = 18,
  year = 2013
}

BibSonomy
