The Effect of Feature Selection on the Performance of C5.0, XGBoost, and Random Forest in Classifying Phishing Websites

Michael Jonathan(1*), Silvia Rostianingsih(2), Henry Novianus Palit(3)


(1) Informatics Engineering Study Program, Universitas Kristen Petra, Surabaya
(2) Informatics Engineering Study Program, Universitas Kristen Petra, Surabaya
(3) Informatics Engineering Study Program, Universitas Kristen Petra, Surabaya
(*) Corresponding Author

Abstract


As the number of internet users grows, and website use in particular, phishing actors gain more opportunities to obtain or steal users' personal information. Every website exposes a large amount of information that can serve as features for classifying phishing websites. These features fall into three groups: URL features, content features, and external features. This study uses three methods, C5.0, XGBoost, and Random Forest, and tests their performance to find the best one for classifying phishing websites. It also applies feature selection to remove features that have no effect, so that training time can be shortened. The test results show that, averaged over accuracy, precision, recall, and F1-score, C5.0 achieves 93.5%, XGBoost 96.6%, and Random Forest 95.7%. Feature selection shortens training for all three algorithms: using only the 15 most important features makes training about 3.53 times faster on average. With feature selection, accuracy, precision, recall, and F1-score decrease slightly, but the decrease is not significant and has no major impact on the classification process.
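The two ideas the abstract relies on, keeping only the highest-ranked features and averaging four evaluation metrics, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the importance scores are computed or which library is used, so the importance values, the feature names, and the helper names `top_k_features` and `classification_metrics` below are all hypothetical.

```python
def top_k_features(importances, k=15):
    """Return the names of the k features with the highest importance score.

    `importances` maps feature name -> score. How the scores are produced
    (e.g. by a tree-based model) is outside this sketch.
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical importance scores; names are illustrative only.
example_importances = {
    "url_length": 0.30,
    "nb_dots": 0.05,
    "page_rank": 0.25,
    "nb_hyperlinks": 0.20,
    "domain_age": 0.15,
    "web_traffic": 0.05,
}
selected = top_k_features(example_importances, k=3)

# Toy labels (1 = phishing, 0 = legitimate), not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
average_score = sum(metrics.values()) / len(metrics)

print(selected)
print(metrics, average_score)
```

In a setting like the one described, the model would be retrained on the selected columns only, and the training time and the averaged score would be compared against the full-feature run.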

Keywords


Feature Selection, Website Phishing, Random Forest, C5.0, XGBoost


