Plagiarism Detection in Java Programming Language Code Using XGBoost

Tomy Widjaja, Andre Gunawan, Liliana Liliana


Easy access to information and cloud server technology makes source code readily available to anyone. In the Industry 4.0 era, the number of informatics students is also growing rapidly. Together, these trends make code plagiarism easier to commit, especially in academic environments. Manually checking for plagiarism is a repetitive, difficult, and time-consuming task, so high-quality automated source code plagiarism detection is needed. The dataset used in this research was collected from the “Dasar Pemrograman” (Programming Fundamentals) class at Petra Christian University. The code is first preprocessed by tokenizing it with a Java grammar. Pairwise features are then computed with three main algorithms, namely Levenshtein distance, greedy string tiling, and bigrams, which produce 12 features plus a collection of statistical features. Finally, the features are used to train an XGBoost model and to run inference with it. The test results show that the proposed features achieve better performance metrics than those of previous research, with an F1-score of 99%. Applying the preprocessing step also improves the performance metrics of both the features proposed in this study and those from previous research.
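
To make the idea of pairwise features concrete, the following is a minimal sketch of two of the similarity measures named in the abstract, Levenshtein distance and bigram overlap, computed over token sequences. It is not the authors' implementation. The regex tokenizer is a hypothetical stand-in for the Java-grammar tokenization, the Jaccard form of the bigram score and the function names are assumptions, and greedy string tiling, the statistical features, and the XGBoost training step are not shown.

```python
import re
from typing import List, Sequence

# Hypothetical, simplified tokenizer. The paper tokenizes with a Java grammar;
# this regex (identifiers, integer literals, single symbols) is only a stand-in.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> List[str]:
    """Split source text into a flat list of tokens, ignoring whitespace."""
    return TOKEN_RE.findall(code)


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: Sequence, b: Sequence) -> float:
    """Distance normalized to [0, 1], where 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigram_jaccard(a: Sequence, b: Sequence) -> float:
    """Jaccard overlap of the sets of adjacent token pairs (assumed variant)."""
    ba, bb = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = ba | bb
    return 1.0 if not union else len(ba & bb) / len(union)


def pairwise_features(code_a: str, code_b: str) -> List[float]:
    """Feature vector for one pair of submissions (illustrative subset only)."""
    ta, tb = tokenize(code_a), tokenize(code_b)
    return [levenshtein_similarity(ta, tb), bigram_jaccard(ta, tb)]
```

In a pipeline like the one described, a vector of this kind would be computed for each pair of submissions and passed, together with a plagiarized or not-plagiarized label, to a gradient-boosted classifier such as XGBoost.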


code plagiarism detection; text processing; pairwise features; XGBoost; Levenshtein distance; greedy string tiling





