- Author
- Djafar Nur Mutmainnah (Universitas Islam Indonesia, Indonesia), Fauzan Achmad (Universitas Islam Indonesia, Indonesia)
- Title
- Implementation of K-Nearest Neighbor Using the Oversampling Technique on Mixed Data for the Classification of Household Welfare Status
- Source
- Statistics in Transition, 2024, vol. 25, no. 1, pp. 109-124, tables, charts, 40 bibliography items
- Keyword
- Social welfare, Poverty indicators, Statistical models, Statistics
- Note
- Summary
- Country
- Indonesia
- Abstract
- Welfare is closely related to poverty and the socio-economic disparities in a society. Based on data from the Central Bureau of Statistics, Kulon Progo in Indonesia had the highest poverty rate in the province of the Special Region of Yogyakarta; an increasing trend was observed every year from 2019 to 2021; Kulon Progo also had a low poverty line (after Gunung Kidul) compared to other regencies/cities in this province. This study aimed to classify the household welfare status in Kulon Progo in March 2021 using the K-Nearest Neighbor (KNN) method. Since imbalance was found between the poor and non-poor classes, an oversampling technique was employed. Imbalanced data affect classification, particularly when predicting the results of the classification. The following oversampling techniques were employed in this study: Random Oversampling (RO), the Adaptive Synthetic (ADASYN) and the Synthetic Minority Oversampling Technique (SMOTE). It was found that, of the three techniques, RO was the most efficient with k = 5, which yielded the best performance in terms of sensitivity, specificity, the G-mean, and accuracy reaching 0.643, 0.805, 0.719, and 78.873%, respectively. Therefore, it can be concluded that the classification model performed well enough to classify household welfare status, especially among the poor (minority class). (original abstract)
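The G-mean reported in the abstract can be checked against the reported sensitivity and specificity, since the G-mean is conventionally defined as the geometric mean of the two. The following is a minimal sketch under that assumption, not the authors' code; the function names and the confusion-matrix counts in the second part are hypothetical and chosen only for illustration.

```python
import math


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity * specificity)


def rates_from_counts(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    The positive class is taken to be the minority (poor) class.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Values reported in the abstract for Random Oversampling with k = 5.
reported_g_mean = g_mean(0.643, 0.805)
print(round(reported_g_mean, 3))

# Hypothetical counts, NOT taken from the paper, to show the other metrics.
sens, spec, acc = rates_from_counts(tp=30, fn=10, tn=80, fp=20)
print(sens, spec, round(acc, 4))
```

Rounded to three decimals, the first value agrees with the G-mean of 0.719 stated in the abstract.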
- Accessibility
- The Main Library of the Cracow University of Economics
- Bibliography
- Akbar, S., Hayat, M., Kabir, M., and Iqbal, M., (2019). iAFP-gap-SMOTE: An Efficient Feature Extraction Scheme Gapped Dipeptide Composition is Coupled with an Oversampling Technique for Identification of Antifreeze Proteins. Letters in Organic Chemistry, 16(4), pp. 294-302. https://doi.org/10.2174/1570178615666180816101653
- Alsammak, I. L. H., Sahib, H. M. A., and Itwee, W. H., (2020). An Enhanced Performance of K-Nearest Neighbor (K-NN) Classifier to Meet New Big Data Necessities. IOP Conference Series: Materials Science and Engineering, 928(3). https://doi.org/10.1088/1757-899X/928/3/032013
- Awotunde, J. B., Misra, S., Adeniyi, A. E., Abiodun, M. K., Kaushik, M., and Lawrence, M. O., (2022). A Feature Selection-Based K-NN Model for Fast Software Defect Prediction. In O. Gervasi, B. Murgante, S. Misra, A. M. A. C. Rocha, & C. Garau (Eds.), Computational Science and Its Applications - ICCSA 2022 Workshops, pp. 49-61. Springer International Publishing.
- Bekkar, M., Djemaa, H. K., and Alitouche, T. A., (2013). Evaluation Measures for Models Assessment over Imbalanced Data Sets. Journal of Information Engineering and Applications, 3, pp. 27-38.
- BPS-Statistics of DI Yogyakarta Province, (2021). Persentase Penduduk Miskin menurut Kabupaten/Kota di Provinsi DI Yogyakarta (Persen), 2009-2021.
- Chawla, N. V., (2005). Data Mining for Imbalanced Datasets: An Overview. In O. Maimon and L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook (pp. 853-867). Springer US. https://doi.org/10.1007/0-387-25465-X_40
- ChitraDevi, N., Palanisamy, V., Baskaran, K., and Prabeela, S., (2012). A Novel Distance for Clustering to Support Mixed Data Attributes and Promote Data Reliability and Network Lifetime in Large Scale Wireless Sensor Networks. Procedia Engineering, 30, pp. 669-677. https://doi.org/10.1016/j.proeng.2012.01.913
- Dalatu, P. I., Midi, (2020). Modified Statistical Approach for Data Preprocessing to Improve Heterogeneous Distance Functions. In Malaysian Journal of Mathematical Sciences (Vol. 14, Issue 2).
- Elreedy, D., Atiya, A. F., (2019). A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Information Sciences, 505, pp. 32-64. https://doi.org/10.1016/j.ins.2019.07.070
- Gao, K., Khoshgoftaar, T. M., and Wald, R., (2014). Combining Feature Selection and Ensemble Learning for Software Quality Estimation. The Florida AI Research Society.
- Hamel, L., (2009). Model Assessment with ROC Curves. In Encyclopedia of Data Warehousing and Mining, Second Edition, pp. 1316-1323. IGI Global. https://doi.org/10.4018/978-1-60566-010-3.ch204
- Haseela H A., (2022). Hybrid Method for Image Classification. EPRA International Journal of Research and Development (IJRD), 7(2), pp. 59-61. https://doi.org/10.36713/epra2016
- He, H., Bai, Y., Garcia, E. A., and Li, S., (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322-1328. https://doi.org/10.1109/IJCNN.2008.4633969
- Hoque, N., Bhattacharyya, D. K., and Kalita, J. K., (2021). KNN-DK: A Modified K-NN Classifier with Dynamic k Nearest Neighbors. In J. C. Bansal, L. C. C. Fung, M. Simic, & A. Ghosh (Eds.), Advances in Applications of Data-Driven Computing, pp. 21-34. Springer Singapore. https://doi.org/10.1007/978-981-33-6919-1_2
- Hussain, L., Lone, K. J., Awan, I. A., Abbasi, A. A., and Pirzada, J.-R., (2022). Detecting congestive heart failure by extracting multimodal features with synthetic minority oversampling technique (SMOTE) for imbalanced data using robust machine learning techniques. Waves in Random and Complex Media, 32(3), pp. 1079-1102. https://doi.org/10.1080/17455030.2020.1810364
- Indriani, A., (2014). Klasifikasi Data Forum dengan menggunakan Metode Naive Bayes Classifier. Seminar Nasional Aplikasi Teknologi Informasi (SNATI) Yogyakarta. www.bluefame.com,
- Islam, A., Belhaouari, S. B., Rehman, A. U., and Bensmail, H., (2022). KNNOR: An oversampling technique for imbalanced datasets. Applied Soft Computing, 115, 108288. https://doi.org/10.1016/j.asoc.2021.108288
- Jahangiri, M., Jahangiri, M., and Najafgholipour, M., (2020). The sensitivity and specificity analyses of ambient temperature and population size on the transmission rate of the novel coronavirus (COVID-19) in different provinces of Iran. Science of The Total Environment, 728, 138872. https://doi.org/10.1016/j.scitotenv.2020.138872
- Jian, C., Gao, J., and Ao, Y., (2016). A New Sampling Method for Classifying Imbalanced Data Based on Support Vector Machine Ensemble. Neurocomput., 193(C), pp. 115-122. https://doi.org/10.1016/j.neucom.2016.02.006
- Kirtania, R., Mitra, S., and Shankar, B. U., (2020). A novel adaptive k-NN classifier for handling imbalance: Application to brain MRI. Intelligent Data Analysis, 24, pp. 909-924. https://doi.org/10.3233/IDA-194647
- Kubát, M., Matwin, S., (1997). Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. International Conference on Machine Learning.
- Li, J., Zhu, Q., Wu, Q., and Fan, Z., (2021). A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors. Information Sciences, 565, pp. 438-455. https://doi.org/10.1016/j.ins.2021.03.041
- Maxim, L. D., Niebo, R., and Utell, M. J., (2014). Screening tests: a review with examples. Inhalation Toxicology, 26(13), pp. 811-828. https://doi.org/10.3109/08958378.2014.955932
- Noorhalim, N., Ali, A., and Shamsuddin, S. M., (2019). Handling Imbalanced Ratio for Class Imbalance Problem Using SMOTE. In L.-K. Kor, A.-R. Ahmad, Z. Idrus, & K. A. Mansor (Eds.), Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017), pp. 19-30. Springer Singapore.
- Pangastuti, S. S., (2018). Perbandingan Metode Ensemble Random Forest dengan Smote-Boosting dan Smote-Bagging pada Klasifikasi Data Mining untuk Kelas Imbalance (Studi Kasus: Data Beasiswa Bidikmisi Tahun 2017 di Jawa Timur). Institut Teknologi Sepuluh Nopember.
- Pramana, S., Yuniarto, B., Mariyah, S., Santoso, I., and Nooraeni, R., (2018). Data Mining dengan R: Konsep Serta Implementasi. IN MEDIA.
- Pristyanto, Y., Pratama, I., and Nugraha, A. F., (2018). Data level approach for imbalanced class handling on educational data mining multiclass classification. 2018 International Conference on Information and Communications Technology (ICOIACT), pp. 310-314. https://doi.org/10.1109/ICOIACT.2018.8350792
- Rahayu, S., Bharata Adji, T., Akhmad Setiawan, N., and Teknik Elektro dan Teknologi Informasi, D., (2017). Penghitungan k-NN pada Adaptive Synthetic-Nominal (ADASYN-N) dan Adaptive Synthetic-kNN (ADASYN-kNN) untuk Data Nominal-Multi Kategori. Ktrl.Inst (J.Auto.Ctrl.Inst), 9(2).
- Wilson, D. R., and Martinez, T. R., (2000). An Integrated Instance-Based Learning Algorithm. Computational Intelligence, 16(1).
- Ren, F., Cao, P., Li, W., Zhao, D., and Zaiane, O., (2017). Ensemble based adaptive oversampling method for imbalanced data learning in computer aided detection of microaneurysm. Computerized Medical Imaging and Graphics, 55, pp. 54-67. https://doi.org/10.1016/j.compmedimag.2016.07.011
- Shi, Z., (2020). Improving k-Nearest Neighbors Algorithm for Imbalanced Data Classification. IOP Conference Series: Materials Science and Engineering, 719(1), 012072. https://doi.org/10.1088/1757-899X/719/1/012072
- Srinilta, C., Kanharattanachai, S., (2021). Application of Natural Neighbor-based Algorithm on Oversampling SMOTE Algorithms. 2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), pp. 217-220. https://doi.org/10.1109/ICEAST52143.2021.9426310
- Suryadarma, D., Akhmadi, Hastuti, and Toyamah, N., (2005). Objective measures of family welfare for individual targeting: results from pilot project on community based monitoring system in Indonesia. SMERU Research Institute.
- Suud, M., Harsono, (2006). 3 Orientasi Kesejahteraan Sosial. Prestasi Pustaka.
- Tusyakdiah, H., (2021). Implementasi K Nearest Neighbor (KNN) dalam Klasifikasi Status Kerja Lulusan Sekolah Menengah Kejuruan (SMK) dengan Oversampling Synthetic Minority Oversampling Technique (SMOTE) dan Adaptive Synthetic (ADASYN). Universitas Islam Indonesia.
- Widayati, Y. T., Prihati, Y., and Widjaja, S., (2021). Analisis dan Komparasi Algoritma Naive Bayes dan C4.5 untuk Klasifikasi Loyalitas Pelanggan MNC Play Kota Semarang. Transformatika, 18(2), pp. 161-172.
- Wilson, D. R., Martinez, T. R., (1997). Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research, 6, pp. 1-34.
- Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D., (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), pp. 1-37. https://doi.org/10.1007/s10115-007-0114-2
- Xin, L. K., and Rashid, N. binti A., (2021). Prediction of Depression among Women Using Random Oversampling and Random Forest. 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), pp. 1-5. https://doi.org/10.1109/WiDSTaif52235.2021.9430215
- Zhu, W., Zeng, N. F., and Wang, N., (2010). Sensitivity, Specificity, Accuracy, Associated Confidence Interval and ROC Analysis with Practical SAS. Northeast SAS Users Group 2010: Health Care and Life Sciences.
- ISSN
- 1234-7655
- Language
- eng
- URI / DOI
- http://dx.doi.org/10.59170/stattrans-2024-007