Toward Text-To-Speech in Low-Resource and Unwritten Languages by Leveraging Transfer Learning: Application in Viet Muong Closed Language Pair

  • Pham Van-Dong
Keywords: Computing methodologies → Speech Synthesis, Tacotron 2, low-resource languages, unwritten language, Muong speech, transfer learning


Text-to-speech (TTS) systems require large amounts of paired text and speech data for training. With over 6,000 languages in the world, building TTS systems for minority and low-resource languages remains very difficult. A traditional TTS pipeline has two components: an acoustic model that predicts acoustic features from text, and a vocoder that converts those features into a waveform. This paper proposes a TTS system for languages with very little data, aimed at supporting minority languages. It combines three techniques: (1) pre-training the acoustic model on high-resource languages and then fine-tuning it on the low-resource language; (2) using knowledge distillation to adapt the model to match a high-quality reference voice; and (3) processing input text for a minority language such as Muong in the same way as Vietnamese text. We first learn linguistic features from Vietnamese speech data with a standard Tacotron 2 acoustic model, then train the acoustic model on Muong speech data, initializing it from the weights of the Vietnamese model. With only 60 minutes of Muong data, the synthesized Muong speech achieves a naturalness score of 3.63 out of 5.0 and a Mel Cepstral Distortion (MCD) of 5.133. These results demonstrate the effectiveness and quality of a Muong TTS system built from very little Muong-language data.
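The Mel Cepstral Distortion reported above measures, in decibels, how far the mel-cepstral coefficients of the synthesized speech lie from those of reference recordings. A minimal sketch of the standard MCD formula follows; the function name and the toy frames are illustrative, and the sketch assumes the two utterances have already been time-aligned frame by frame:

```python
import math

def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over time-aligned frames of mel-cepstral
    coefficients; the 0th (energy) coefficient is conventionally excluded."""
    assert len(ref_frames) == len(syn_frames) and ref_frames
    const = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Euclidean distance over cepstral dimensions 1..D of this frame
        dist = math.sqrt(sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:])))
        total += const * dist
    return total / len(ref_frames)

# Toy example: two utterances of two 4-dimensional frames each
ref = [[1.0, 0.5, 0.2, 0.1], [1.0, 0.4, 0.3, 0.0]]
syn = [[1.0, 0.6, 0.1, 0.1], [1.0, 0.4, 0.2, 0.1]]
print(mel_cepstral_distortion(ref, syn))
```

Lower is better: identical cepstra give an MCD of zero, and the 5.133 reported here is computed over real synthesized and recorded Muong utterances rather than toy vectors.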
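The cross-lingual transfer step, initializing the Muong acoustic model from the Vietnamese model's weights, can be illustrated schematically. The sketch below is not the paper's implementation: parameter names are hypothetical, and plain Python dictionaries stand in for real Tacotron 2 checkpoints. It shows the warm-start idea of copying every pretrained tensor whose name and shape match the target model and re-initializing the rest, such as an embedding table enlarged for new symbols:

```python
import random

def warm_start(pretrained, target_shapes, seed=0):
    """Build an initial parameter dict for the low-resource model:
    reuse pretrained vectors whose name and length match, and
    randomly initialize everything else."""
    rng = random.Random(seed)
    params = {}
    for name, size in target_shapes.items():
        src = pretrained.get(name)
        if src is not None and len(src) == size:
            params[name] = list(src)  # reuse Vietnamese weights
        else:
            params[name] = [rng.gauss(0.0, 0.01) for _ in range(size)]
    return params

# Hypothetical flattened parameter vectors
vietnamese = {"encoder.w": [0.1] * 8, "decoder.w": [0.2] * 8, "embedding": [0.3] * 40}
muong_shapes = {"encoder.w": 8, "decoder.w": 8, "embedding": 48}  # larger symbol set
init = warm_start(vietnamese, muong_shapes)
```

In a real framework the same effect is usually achieved by loading the pretrained checkpoint with non-strict matching before fine-tuning on the Muong data.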


Byambadorj, Z., Nishimura, R., Ayush, A., Ohta, K., & Kitaoka, N. (2021). Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation. In EURASIP Journal on Audio, Speech, and Music Processing.
Cai, Z., Yang, Y., & Li, M. (2023). Cross-lingual multi-speaker speech synthesis with limited bilingual training data. Computer Speech & Language, 77, 101427.
Comini, G., Huybrechts, G., Ribeiro, M. S., Gabrys, A., & Lorenzo-Trueba, J. (2022). Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation. In Interspeech.
Do, P., Coler, M., Dijkstra, J., & Klabbers, E. (2022). Text-to-Speech for Under-Resourced Languages: Phoneme Mapping and Source Language Selection in Transfer Learning. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, European Language Resources Association, Marseille, France, 16–22.
Haldar, R., & Mukhopadhyay, D. (2011). Levenshtein distance technique in dictionary lookup methods: An improved approach. arXiv preprint arXiv:1101.1232.
Huang, W. P., Chen, P. C., Huang, S. F., & Lee, H. Y. (2022). Few-shot cross-lingual TTS using transferable phoneme embedding. arXiv preprint arXiv:2206.15427.
Huybrechts, G., Merritt, T., Comini, G., Perz, B., Shah, R., & Lorenzo-Trueba, J. (2021). Low-resource expressive text-to-speech using data augmentation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 6593–6597.
Jamal, S., Rauf, S. A., & Majid, Q. (2022). Exploring Transfer Learning for Urdu Speech Synthesis. In Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 70–74.
Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Curran Associates Inc., Red Hook, NY, USA.
Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... & Courville, A. C. (2019). MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Neural Information Processing Systems.
Lux, F., & Vu, N. T. (2022). Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features. arXiv preprint arXiv:2203.03191.
Muthukumar, P. K., & Black, A. W. (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2594–2598.
Nguyen, Q. B., Vu, T. T., & Luong, C. M. (2016). The Effect of Tone Modeling in Vietnamese LVCSR System. Procedia Computer Science, 81, 174–181.
Phạm, V. Đ., Do, T. N. D., Mac, D. K., Nguyen, V. S., Nguyen, T. T., & Tran, D. D. (2022). How to generate Muong speech directly from Vietnamese text: Cross-lingual speech synthesis for close language pair. Journal of Military Science and Technology, 81, 138–147.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 4779–4783.
Staib, M., Teh, T. H., Torresquintero, A., Mohan, D. S. R., Foglianti, L., Lenain, R., & Gao, J. (2020). Phonological features for 0-shot multilingual speech synthesis. arXiv preprint arXiv:2008.04107.
Tu, T., Chen, Y. J., Yeh, C. C., & Lee, H. Y. (2019). End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning. arXiv preprint arXiv:1904.06508.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Van Dong, P., & Ha, V. T. H. (2022). Speech translation for unwritten language using intermediate representation: Experiment for Viet-Muong language pair. Journal of Military Science and Technology, CSCE6, 65–76.
Van Dong, P., Thanh, N. T., Do Dat, T., Ha, V. T. H., & Mai, D. T. (2022). Computational linguistic material for Vietnamese speech processing: applying in Vietnamese text-to-speech. International Journal of Advanced Research in Computer Science, 13(6), 49–54.
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
Weiss, R. J., Skerry-Ryan, R. J., Battenberg, E., Mariooryad, S., & Kingma, D. P. (2021). Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.
Wells, D., & Richmond, K. (2021). Cross-lingual transfer of phonological features for low-resource speech synthesis. In Proc. 11th ISCA Speech Synthesis Workshop (SSW 11).
Yang, L. J., Yeh, I. P., & Chien, J. T. (2022). Low-Resource Speech Synthesis with Speaker-Aware Embedding. In ISCSLP International Symposium on Chinese Spoken Language Processing.
Yasuda, Y., Wang, X., & Yamagishi, J. (2021). End-to-end text-to-speech using latent duration based on VQ-VAE. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 5694–5698.
How to Cite
Van-Dong, P. (2024). Toward Text-To-Speech in Low-Resource and Unwritten Languages by Leveraging Transfer Learning: Application in Viet Muong Closed Language Pair. European Journal of Science, Innovation and Technology, 4(3), 14–28.