Application of Random Forest in Email Filtering for Spam Detection

Billy Christanto(1*), Djoni Haryadi Setiabudi(2)


(1) Informatics Study Program
(2) Informatics Study Program
(*) Corresponding Author

Abstract


Email has become an integral part of the internet experience. As the number of users grows, marketing via email has also become more popular. These emails often annoy users, hence the name “spam”. Because of their excessive volume, there is a need to separate important messages from unimportant ones. To date, there is no optimal solution to this problem. Among the methods in use, machine-learning-based solutions show the most promising results.

The method tested is Random Forest, which is often regarded as superior to Naive Bayes, a popular algorithm for email filtering. Both algorithms are tested and compared on accuracy, recall, and precision. The effects of pre-processing and stemming on the dataset are also tested.
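
As an illustration of the three metrics used in the comparison, the following is a minimal sketch of how accuracy, precision, and recall can be computed from true and predicted labels. The function name `classification_metrics`, the label strings "spam" and "ham", and the toy label lists are assumptions made for this example; they are not taken from the study and do not reproduce its data or results.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells for the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty inputs and empty denominators.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical toy labels, for illustration only.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "spam"]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision penalizes legitimate messages flagged as spam, while recall penalizes spam that slips through, so two filters with the same accuracy can still differ in how useful they are in practice.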

This research shows that both models produce similar accuracy, recall, and precision, reaching 96% in each category. Tests also show that Random Forest needs around 80 times longer to train its model than Naive Bayes, which makes it unsuitable for email filtering purposes.


Keywords


Spam; Email Filtering; Naive Bayes; Random Forest



