Viseme-Based Indonesian Lip Reading Modeling Using VGG16 with Jaro-Winkler Similarity and Bigram

Henry Wicaksono, Liliana Liliana, Alvin Nathaniel Tjondrowiguno


Lip reading is a technique for understanding spoken words from the visual representation of lip movements. It has many uses, for example as an aid for laryngectomy patients and for people with hearing disabilities. One study reports that 2.6% of Indonesia's population has a hearing disability, which makes lip reading a relevant solution in Indonesia. This study aims to model a viseme-based Indonesian lip reading system. The method uses VGG16 as the classifier and Jaro-Winkler similarity combined with a bigram model (JW-bigram) as the decoder. The dataset consists of 25 Indonesian sentences composed of 50 different words and spoken by 12 speakers. The results show that the lip reading system built with VGG16 and JW-bigram is more effective, in terms of both accuracy and speed, than the other method combinations.


lip reading; video processing; VGG16; Jaro-Winkler similarity; bigram
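
The decoding idea named in the abstract, matching a predicted viseme string against a vocabulary with Jaro-Winkler similarity and using bigram statistics to choose between look-alike words, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the letter-to-viseme mapping (`TOY_VISEME_CLASS`), the five-word vocabulary, the bigram counts, the add-one smoothing, and the greedy product scoring are all assumptions made up for the example, and the classifier output is simulated by hand instead of coming from VGG16.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


# Hypothetical toy mapping: bilabials share one viseme, labiodentals another.
# It is NOT the Indonesian viseme set used in the study.
TOY_VISEME_CLASS = {"b": "P", "p": "P", "m": "P", "f": "F", "v": "F"}


def to_visemes(word):
    return "".join(TOY_VISEME_CLASS.get(ch, ch) for ch in word)


def decode(predicted, vocab, bigram_counts, start="<s>"):
    """Greedy decoding: score = JW similarity * add-one smoothed bigram prob."""
    prev, sentence = start, []
    for visemes in predicted:
        total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)

        def score(word):
            sim = jaro_winkler(visemes, to_visemes(word))
            prob = (bigram_counts.get((prev, word), 0) + 1) / (total + len(vocab))
            return sim * prob

        prev = max(vocab, key=score)
        sentence.append(prev)
    return sentence


# Invented vocabulary and bigram counts, for illustration only.
vocab = ["saya", "beli", "sepatu", "baru", "paru"]
bigram_counts = {
    ("<s>", "saya"): 3, ("saya", "beli"): 3, ("beli", "sepatu"): 2,
    ("sepatu", "baru"): 2, ("beli", "paru"): 1,
}
# Simulated classifier output; the third string has lost its last viseme.
predicted = ["saya", "Peli", "sePat", "Paru"]
decoded = decode(predicted, vocab, bigram_counts)
print(" ".join(decoded))
```

In this toy setup "baru" and "paru" map to the same viseme string, so similarity alone cannot separate them; the bigram count for ("sepatu", "baru") is what tips the choice. The truncated string "sePat" still resolves to "sepatu" because Jaro-Winkler tolerates the missing character.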
