Classification of Indonesian-Language News Articles with the Naive Bayes Classifier
Keywords:
Corporate Social Responsibility Disclosure, Information Asymmetry, Corporate Social Responsibility Disclosure Level, Bid-Ask Spread, Firm Size, and Earning Quality
Abstract
Access to the latest news has become easier and more abundant thanks to rapid technological development in recent years. However, article categories are still assigned manually by the writer, which leaves room for mistakes: a writer may pick the wrong category by accident, or deliberately file an article under a popular category just to boost its view count. To address this, a web application was built that automatically assigns each article to the category that fits it best.
The application uses N-Gram features and the Naïve Bayes Classifier method to classify news content. An N-Gram feature groups words into sequences of N consecutive words, such as unigrams (N = 1) or bigrams (N = 2). The Naïve Bayes Classifier is a probabilistic method: it assigns a document to the category that is most probable given the document's features.
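As an illustration of how word N-Grams and a Naïve Bayes classifier fit together, here is a minimal sketch in Python. It is not the application's actual code: the whitespace tokenizer, the multinomial model with add-one smoothing, the `word_ngrams` and `NaiveBayes` names, and the toy headlines are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return the word n-grams of `text` (n=1 for unigrams, n=2 for bigrams)."""
    tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace split
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def __init__(self, n=1):
        self.n = n  # n-gram size used as the feature

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)             # documents per category
        self.feature_counts = defaultdict(Counter)    # n-gram counts per category
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(word_ngrams(text, self.n))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)     # log prior
            for feat in word_ngrams(text, self.n):
                if feat in self.vocab:                # skip n-grams unseen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, for illustration only.
train_texts = [
    "timnas menang di laga final",
    "pelatih timnas umumkan skuad",
    "harga saham naik tajam",
    "bank sentral turunkan suku bunga",
]
train_labels = ["olahraga", "olahraga", "ekonomi", "ekonomi"]

unigram_model = NaiveBayes(n=1).fit(train_texts, train_labels)
bigram_model = NaiveBayes(n=2).fit(train_texts, train_labels)
print(unigram_model.predict("timnas menang lagi"))
print(bigram_model.predict("harga saham turun"))
```

Scores are summed in log space to avoid numeric underflow on long articles. Bigrams produce a much sparser feature set than unigrams, so a test document is more likely to contain only N-Grams that never occurred in training.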
The Naïve Bayes Classifier was tested with four training-to-test dataset ratios. With a 50:50 ratio, accuracy was 0.901 for unigrams and 0.508 for bigrams. With a 60:40 ratio, accuracy was 0.904 for unigrams and 0.498 for bigrams. With a 70:30 ratio, accuracy was 0.947 for unigrams and 0.519 for bigrams. With an 80:20 ratio, accuracy was 0.887 for unigrams and 0.507 for bigrams. The 70:30 ratio therefore yielded the highest accuracy for both unigrams (0.947) and bigrams (0.519).
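The evaluation across training-to-test ratios can be sketched as follows. This is only an illustration: the source does not say how the dataset was partitioned, so the seeded random shuffle, the `split_dataset` and `accuracy` helpers, and the placeholder samples are assumptions.

```python
import random


def split_dataset(samples, train_ratio, seed=0):
    """Shuffle `samples` reproducibly and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumed: random partition
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder samples standing in for labeled news articles.
samples = list(range(10))
for train_ratio in (0.5, 0.6, 0.7, 0.8):
    train, test = split_dataset(samples, train_ratio)
    print(f"{train_ratio:.0%} train: {len(train)} training, {len(test)} test samples")
```

In a full experiment, a classifier would be trained on `train` for each ratio, and `accuracy` would be computed from its predictions on `test`.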