
Search Based Training Data Selection For Cross Project Defect Prediction

S. Hosseini, B. Turhan, and M. Mäntylä. Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 3:1-3:10. New York, NY, USA, ACM, (2016)
DOI: 10.1145/2972958.2972964

Abstract

Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP.

Aim: We aim at utilising the Nearest Neighbor (NN)-filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels.

Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean on a validation set generated by the NN-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from the PROMISE repository to compare the performance of GIS with benchmark CPDP methods, namely the NN-filter and naive CPDP, as well as with within project defect prediction (WPDP).

Results: Our results show that GIS is significantly better than the NN-filter in terms of F-Measure (p-value < 0.001, Cohen's d = 0.697) and GMean (p-value < 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value < 0.001, Cohen's d = 0.753) and GMean (p-value < 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value < 0.001, Cohen's d = 0.227) and GMean (p-value < 0.001, Cohen's d = 0.595) values.

Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.
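The two building blocks named in the Method part of the abstract, the NN-filter and a combined F-Measure/GMean objective, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the two measures are combined, so the product used in `fitness` is an assumption, as are the Euclidean distance, the default `k = 10`, the GMean definition (geometric mean of recall and specificity), and all function names.

```python
import math
from typing import List, Sequence, Tuple

# An instance is (feature vector, label), where label 1 means "defective".
Instance = Tuple[Sequence[float], int]


def f_measure_and_gmean(actual: Sequence[int],
                        predicted: Sequence[int]) -> Tuple[float, float]:
    """Compute F-Measure and GMean from binary labels (1 = defective)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Assumed definition: geometric mean of recall and specificity.
    gmean = math.sqrt(recall * specificity)
    return f_measure, gmean


def fitness(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Hypothetical combined objective: product of F-Measure and GMean."""
    f_measure, gmean = f_measure_and_gmean(actual, predicted)
    return f_measure * gmean


def nn_filter(train: List[Instance],
              test_features: List[Sequence[float]],
              k: int = 10) -> List[Instance]:
    """Keep the union of the k nearest training instances of each test row."""
    selected = set()
    for x in test_features:
        ranked = sorted(range(len(train)),
                        key=lambda i: math.dist(train[i][0], x))
        selected.update(ranked[:k])
    # Return each selected instance once, in original training order.
    return [train[i] for i in sorted(selected)]


if __name__ == "__main__":
    # One true positive, one false negative, one true negative, one false positive.
    print(fitness([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.25
```

In a genetic search, a function like `fitness` would score each candidate training subset by training a classifier on it and evaluating the predictions on the NN-filter validation set; the classifier and the genetic operators are omitted here because the abstract does not specify them.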


Links and resources

BibTeX key: Hosseini:2016:SBT:2972958.2972964
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering
year: 2016
pages: 3:1--3:10
publisher: ACM
series: PROMISE 2016
acmid: 2972964
isbn: 978-1-4503-4772-3
location: Ciudad Real, Spain
numpages: 10
articleno: 3
DOI: 10.1145/2972958.2972964
url: http://doi.acm.org/10.1145/2972958.2972964

Cite this publication

%0 Conference Paper
%1 Hosseini:2016:SBT:2972958.2972964
%A Hosseini, Seyedrebvar
%A Turhan, Burak
%A Mäntylä, Mika
%B Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering
%C New York, NY, USA
%D 2016
%I ACM
%K myown
%P 3:1--3:10
%R 10.1145/2972958.2972964
%T Search Based Training Data Selection For Cross Project Defect Prediction
%U http://doi.acm.org/10.1145/2972958.2972964
%X Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean, on a validation set generated by (NN)-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from PROMISE repository in order to compare the performance of GIS with benchmark CPDP methods, namely (NN)-filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than (NN)-Filter in terms of F-Measure (p-value < 0.001, Cohen's d = 0.697) and GMean (p-value < 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value < 0.001, Cohen's d = 0.753) and GMean (p-value < 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value < 0.001, Cohen's d = 0.227) and GMean (p-value < 0.001, Cohen's d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. Especially, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.
%@ 978-1-4503-4772-3

@inproceedings{Hosseini:2016:SBT:2972958.2972964,
  abstract = {Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean, on a validation set generated by (NN)-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from PROMISE repository in order to compare the performance of GIS with benchmark CPDP methods, namely (NN)-filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than (NN)-Filter in terms of F-Measure (p-value < 0.001, Cohen's d = 0.697) and GMean (p-value < 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value < 0.001, Cohen's d = 0.753) and GMean (p-value < 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value < 0.001, Cohen's d = 0.227) and GMean (p-value < 0.001, Cohen's d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. Especially, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.},
  acmid = {2972964},
  added-at = {2016-11-16T19:16:34.000+0100},
  address = {New York, NY, USA},
  articleno = {3},
  author = {Hosseini, Seyedrebvar and Turhan, Burak and M\"{a}ntyl\"{a}, Mika},
  biburl = {https://www.bibsonomy.org/bibtex/224187e884ce3b83df822e9fbe5cfbe2b/burak.turhan},
  booktitle = {Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering},
  description = {Search Based Training Data Selection For Cross Project Defect Prediction},
  doi = {10.1145/2972958.2972964},
  interhash = {141967ca11c83faf539c09437733fc99},
  intrahash = {24187e884ce3b83df822e9fbe5cfbe2b},
  isbn = {978-1-4503-4772-3},
  keywords = {myown},
  location = {Ciudad Real, Spain},
  numpages = {10},
  pages = {3:1--3:10},
  publisher = {ACM},
  series = {PROMISE 2016},
  timestamp = {2016-11-16T19:16:34.000+0100},
  title = {Search Based Training Data Selection For Cross Project Defect Prediction},
  url = {http://doi.acm.org/10.1145/2972958.2972964},
  year = 2016
}
