Combining the Markov Chain Monte Carlo Algorithm and Statistical Methods for Cross-Lingual Named Entity Recognition across Ethnic Languages in Indonesia for Bible Study

Vania Putri Minarso, Henry Novianus Palit, Justinus Andjarwirawan

Abstract


The Bible is an important bilingual or monolingual text, which makes it a useful resource for training tasks such as translation, named entity analysis, and transliteration. A named entity is a real-world object, such as a person, location, organization, or product, that can be denoted by a proper name. The objective of this experiment is to help people studying the Bible understand the important names in it, whether in Indonesian or in another local language of Indonesia. It can also help missionaries quickly understand named entities in the local language of the place where they work. In this process, there may be a gap between a named entity in Indonesian and its counterpart in the local language.

To address this issue, we designed an application that integrates the Markov Chain Monte Carlo algorithm, wrapped in the efmaral tool, with the statistical method wrapped in the giza++ tool and the IBM Model 2 reparameterization wrapped in the fast align tool. Each tool produces a correspondence between every Indonesian word and a word in the local language. The application then looks for a consensus among these outputs to find the correct named entity in the local language. Named entities are selected based on the Strong's numbering of the original-language text.
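The consensus step described above can be sketched as a simple vote over alignment links. This is a minimal illustration, not the authors' implementation: it assumes each tool's output has already been converted to "i-j" index pairs (source token index, target token index), and that a link is kept when at least two of the three tools propose it. The function names, the voting threshold, and the toy sentences are all hypothetical.

```python
from collections import Counter


def parse_links(line):
    """Parse a line of 'i-j' pairs into a set of (source, target) index tuples."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def consensus_links(alignments, min_votes=2):
    """Keep every link proposed by at least `min_votes` aligners (assumed rule)."""
    votes = Counter(link for links in alignments for link in links)
    return {link for link, count in votes.items() if count >= min_votes}


def project_entity(src_index, links, target_tokens):
    """Return the target tokens linked to one source token, in target order."""
    return [target_tokens[j] for i, j in sorted(links) if i == src_index]


# Toy data: the target sentence is made up for illustration, not real corpus text.
source = ["Yesus", "pergi", "ke", "Yerusalem"]
target = ["Yesus", "lao", "ka", "Yerusalem"]

# Hypothetical outputs of the three aligners for this sentence pair.
efmaral_out = parse_links("0-0 1-1 2-2 3-3")
giza_out = parse_links("0-0 1-2 3-3")
fast_align_out = parse_links("0-0 1-1 2-2 3-2")

agreed = consensus_links([efmaral_out, giza_out, fast_align_out])

# The source token at index 3 ("Yerusalem") is the named entity of interest.
entity_in_target = project_entity(3, agreed, target)
print(entity_in_target)
```

In this sketch the two links proposed by only one tool (1-2 and 3-2) are discarded, so a single tool's error does not affect the projected name.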

Based on the results of the experiment, integrating the efmaral, giza++, and fast align tools yields better accuracy than the efmaral tool alone. The efmaral tool has an accuracy of around 0.07 to 0.66. The giza++ tool has an accuracy of around 0.47 to 0.90. The integrated tools (efmaral, giza++, and fast align) have an accuracy of around 0.36 to 0.87.
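The abstract does not define how accuracy is computed. As a hedged illustration only, the sketch below assumes the common definition: the fraction of named entities whose predicted local-language form exactly matches a reference form. The function name and the example names are hypothetical and are not taken from the experiment.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted name equals the reference name."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical predictions and references for four named entities.
predicted_names = ["Yerusalem", "Musa", "Daud", "Petrus"]
gold_names = ["Yerusalem", "Musa", "Daud", "Paulus"]

score = accuracy(predicted_names, gold_names)
print(score)
```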


Keywords


named entity recognition; Markov Chain Monte Carlo; statistical method; IBM Model 2 reparameterization; efmaral; giza++; fast align



