BazEkon - The Main Library of the Cracow University of Economics

Author
Djafar Nur Mutmainnah (Universitas Islam Indonesia, Indonesia), Fauzan Achmad (Universitas Islam Indonesia, Indonesia)
Title
Implementation of K-Nearest Neighbor Using the Oversampling Technique on Mixed Data for the Classification of Household Welfare Status
Source
Statistics in Transition, 2024, vol. 25, no. 1, pp. 109-124, tables, charts, 40 bibliographic references
Keyword
Social welfare, Poverty indicators, Statistical models, Statistics
Note
Includes a summary.
Country
Indonesia
Abstract
Welfare is closely related to poverty and to socio-economic disparities in a society. According to data from the Central Bureau of Statistics, Kulon Progo had the highest poverty rate in the province of the Special Region of Yogyakarta, Indonesia, and the rate increased every year from 2019 to 2021. Kulon Progo also had a low poverty line (after Gunung Kidul) compared to other regencies/cities in this province. This study aimed to classify household welfare status in Kulon Progo in March 2021 using the K-Nearest Neighbor (KNN) method. Because the poor and non-poor classes were imbalanced, and imbalanced data affect classification, particularly the predicted class labels, oversampling was applied. Three oversampling techniques were compared: Random Oversampling (RO), Adaptive Synthetic sampling (ADASYN), and the Synthetic Minority Oversampling Technique (SMOTE). Of the three, RO with k = 5 performed best, with sensitivity, specificity, G-mean, and accuracy of 0.643, 0.805, 0.719, and 78.873%, respectively. The authors conclude that the classification model performed well enough to classify household welfare status, especially among the poor (the minority class). (original abstract)
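The reported G-mean can be checked against the reported sensitivity and specificity. The sketch below assumes the usual definition of the G-mean as the square root of sensitivity times specificity; the abstract does not state the formula. It also illustrates random oversampling in its simplest form, duplicating randomly chosen minority-class rows until the classes are balanced. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class is as large as the largest one. Original rows are kept first."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Values reported in the abstract for RO with k = 5.
print(round(g_mean(0.643, 0.805), 3))  # 0.719

# Hypothetical toy data: three non-poor households, one poor household.
toy_rows = [[1], [2], [3], [4]]
toy_labels = ["non-poor", "non-poor", "non-poor", "poor"]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```

Under that definition, the reported sensitivity (0.643) and specificity (0.805) give a G-mean of 0.719, which matches the value in the abstract. In practice, oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.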
Accessibility
The Main Library of the Cracow University of Economics
ISSN
1234-7655
Language
eng
URI / DOI
http://dx.doi.org/10.59170/stattrans-2024-007