Applying Random Forest to Email Filtering for Spam Detection

Authors

  • Billy Christanto, Informatics Study Program
  • Djoni Haryadi Setiabudi, Informatics Study Program

Keywords:

Competence, Broadcaster, Radio, Global Saranghae, Global FM Surabaya

Abstract

Email has become an integral part of the internet experience. As the number of users grows, marketing via email has also become more popular. These emails often annoy users, hence the name “spam”. Because of their sheer volume, there is a need to separate important messages from unimportant ones. So far, there is no optimal solution to this problem. Among the methods in use, machine-learning-based solutions show the most promising results.

The method tested is Random Forest, which is often regarded as superior to Naive Bayesian, a popular algorithm for email filtering. Both algorithms are tested and compared on accuracy, recall and precision. The effects of pre-processing and stemming on the dataset are also tested.
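To illustrate the three metrics used in the comparison, here is a minimal Python sketch that treats spam as the positive class. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study or its dataset.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision and recall for a two-class labelling."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells with `positive` as the target class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators so the function never divides by zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels for eight messages (illustration only).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Precision is the share of messages flagged as spam that really are spam, and recall is the share of actual spam that gets flagged, so a filter can score well on one and poorly on the other.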

This research shows that both models produce similar accuracy, recall and precision, reaching 96% in each category. Tests also show that Random Forest takes around 80 times longer to train its model than Naive Bayesian, which makes it unsuitable for email filtering purposes.

Published

2020-10-03

Section

Articles