The Effect of Recursive Feature Elimination with Cross-Validation Method on Classification Performance with Different Sizes of Datasets



Akkaya B.

4th International Conference on Data Science & Applications, İstanbul, Türkiye, 4 - 6 June 2021, pp. 142-152

  • Publication Type: Conference Paper / Full-Text Paper
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Page Numbers: pp. 142-152
  • İstanbul University Affiliated: Yes

Abstract

The high-dimensionality problem, one of the problems encountered in classification tasks, arises when a dataset contains too many features. It degrades the success of classification models and increases training time. Feature selection is one of the methods used to mitigate the high-dimensionality problem. It is defined as the selection of the best subset of features that can represent the original dataset: the goal is to reduce the size of the data by keeping only the features most useful and important for the problem at hand. In this study, the performances of various classification algorithms on datasets of different sizes were compared using recursive feature elimination with cross-validation, one of the feature selection methods. Recursive feature elimination with cross-validation seeks the most accurate result by iteratively eliminating the least important variables under cross-validation. The study used balanced binary classification datasets. Accuracy, ROC-AUC score, and fit time were used as evaluation metrics, while Logistic Regression, Support Vector Machines, Naive Bayes, k-Nearest Neighbors, Stochastic Gradient Descent, Decision Tree, AdaBoost, Multilayer Perceptron, and XGBoost classifiers were used as classification algorithms. After recursive feature elimination with cross-validation, accuracy increased by 5% on average, the ROC-AUC score increased by 5.3% on average, and fit time decreased by about 5.1 seconds on average. Naive Bayes and Multilayer Perceptron proved the most sensitive to feature selection, as their classification performance improved the most after it.
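The recursive feature elimination with cross-validation procedure the abstract describes can be sketched with scikit-learn's `RFECV`. The dataset, estimator, and parameter choices below are illustrative assumptions, not the study's actual experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic balanced binary-classification dataset (a hypothetical
# stand-in for the datasets used in the study): 20 features, of which
# only 5 are informative and 10 are redundant.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=10, weights=[0.5, 0.5], random_state=0,
)

# RFECV: repeatedly drop the least important feature (here, by
# coefficient magnitude of the logistic regression), keeping the
# feature count that maximizes cross-validated accuracy.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                      # eliminate one feature per iteration
    cv=StratifiedKFold(5),       # 5-fold stratified cross-validation
    scoring="accuracy",
)
selector.fit(X, y)

print("Selected feature count:", selector.n_features_)
print("Feature support mask:", selector.support_)
```

The `support_` mask can then be used to reduce the original dataset (`X[:, selector.support_]`) before fitting the final classifiers, which is where the accuracy and fit-time gains reported in the abstract would be measured.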