Faculty of Electrical and Electronic Engineering, University of Transport and Communications, Hanoi, Vietnam.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2894–2898
Article DOI: 10.30574/wjaets.2025.15.2.0877
Received on 20 April 2025; revised on 27 May 2025; accepted on 30 May 2025
Speech recognition has become increasingly important in real-world applications. Vietnamese, however, presents distinctive linguistic challenges, including tones, syllabic structure, and complex morphology, which make speech recognition for this language significantly different from that for languages such as English. In this paper, we propose a deep learning approach that combines Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to recognize Vietnamese speech using the VIVOS dataset. The CNN component extracts spatial features from audio spectrograms, while the BiLSTM captures temporal dependencies in the speech signal in both the forward and backward directions. Experimental results show that the proposed CNN-BiLSTM model achieves a competitive Word Error Rate (WER) of 14.7%. These results highlight the potential of deep learning techniques for recognizing tonal languages such as Vietnamese.
Speech Recognition; Vietnamese; VIVOS; CNN; BiLSTM
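For readers unfamiliar with the metric reported in the abstract, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The Python sketch below illustrates that standard definition only. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper, which may apply different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokens are split on whitespace,
    which is an assumption and not necessarily the paper's preprocessing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 0.2.
print(word_error_rate("tôi đang học tiếng việt", "tôi đang học tiến việt"))
```

Under this definition, a WER of 14.7% corresponds to roughly 15 word errors per 100 reference words. Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.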
Van Khoi Nguyen. A study on the application of deep learning in Vietnamese speech recognition. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2894–2898. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0877.