
Data pre-processing evaluation for text mining: transaction/sequence model

Procedia Computer Science, 18 (0): 1198--1207 (2013)
DOI: 10.1016/j.procs.2013.05.286

Abstract

Data pre-processing is the most time-consuming phase in the whole process of knowledge discovery. The complexity of data pre-processing depends on the data sources used. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in e-documents. We used the transaction/sequence model for text representation and sequence rule analysis as the modelling method. We compare four datasets of different quality, obtained from texts and pre-processed in different ways: data with identified paragraph sequences, data with identified sentence sequences, data with identified paragraph sequences without stop words, and data with identified sentence sequences without stop words. We assess the impact of these advanced data pre-processing techniques on the quantity and quality of the extracted rules. The results confirm some initial assumptions, but they also show that stop-word removal has a substantial impact on the quantity and quality of extracted rules in the case of paragraph sequence identification. In contrast, in the case of sentence sequence identification, removing stop words has no significant impact on the quantity and quality of extracted rules.
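The four dataset variants named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the tiny stop-word list, the blank-line paragraph rule and the punctuation-based sentence rule are all assumptions made for illustration only. Each text unit (paragraph or sentence) is treated as one sequence of word tokens.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption: the abstract
# does not say which list the authors used).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def split_units(text, unit):
    """Split text into paragraphs (blank-line separated) or sentences."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError("unit must be 'paragraph' or 'sentence'")
    return [p.strip() for p in parts if p.strip()]


def to_sequences(text, unit, remove_stop_words=False):
    """Turn each unit into one sequence (list) of lower-cased word tokens."""
    sequences = []
    for part in split_units(text, unit):
        words = re.findall(r"[a-z0-9]+", part.lower())
        if remove_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        if words:
            sequences.append(words)
    return sequences


def build_datasets(text):
    """Return the four variants, keyed by (unit, stop_words_removed)."""
    return {
        (unit, stop): to_sequences(text, unit, remove_stop_words=stop)
        for unit in ("paragraph", "sentence")
        for stop in (False, True)
    }
```

Calling `build_datasets(text)` on a document returns all four variants at once, so the same sequence rule miner could be run on each and the number of extracted rules compared.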
