Audio samples from "Learning pronunciation from a foreign language in speech synthesis networks"

Abstract: Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. This motivates us to investigate how a speech synthesis network learns pronunciation from datasets in different languages. In this study, we are interested in analyzing and taking advantage of a multilingual speech synthesis network. First, we train the speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the languages. Our experimental results show that the learned phoneme embedding vectors are located closer together if their pronunciations are similar across the languages. Consequently, the trained network can synthesize the English speakers' Korean speech and vice versa. Using this result, we propose a training framework that utilizes information from a different language. Specifically, we pre-train a speech synthesis network using datasets from both a high-resource language and a low-resource language, then fine-tune the network using the low-resource language dataset. Finally, we conducted more simulations on 10 different languages to show that the framework is generally extendable to other languages.

Paper: [pdf], Appendix: [pdf]

1. Samples from the Bilingual TTS model

1-1. English sentences read by English/Korean speakers

Text | English speaker | Korean speaker
We don't like to admit our small faults.
Take the winding path to reach the lake.
Slide the box into that empty space.
Either mud or dust are found at all times.
To have is better than to wait and hope.

1-2. Korean sentences read by English/Korean speakers

Text | English speaker | Korean speaker
제법 재미가 있을 것 같아 그 책을 좀 더 자세히 살펴본다.
그 어린이의 답의 정확성이 마이크로 컴퓨터에 부착된 음성 식별기에 의해 평가된다.
복숭아빛 피부가 햇빛에 반짝이고, 푸른 수염이 이 순간에도 몇 밀리의 몇천분의 일씩 자라고 있다.
침팬지가 흥겨워 뛸 때에는 한쪽 다리에 힘을 주어 두 박자의 트롯이 된다.
그는 태어날 때부터 농노였으나 자유의 몸이 되었고, 왕립 예술학교에 보내졌다.

2. Conversion of each English phoneme to the nearest Korean phoneme

Text | Generated from English phoneme | Generated from Korean phoneme
They were three hundred yards apart.
He has many good friends.
The questions may have come vaguely in his mind.
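
The conversion in this section replaces each English phoneme with its nearest Korean phoneme, presumably in the learned phoneme embedding space described in the abstract. The sketch below shows one way such a mapping could be computed. It assumes cosine similarity as the closeness measure (this page does not state the metric) and uses invented three-dimensional toy embeddings; the function names, phoneme symbols, and vectors are all hypothetical and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_phoneme_map(source_emb, target_emb):
    """Map each source phoneme to the most similar target phoneme.

    Both arguments are dicts from phoneme symbol to embedding vector.
    Cosine similarity is an assumption made for this sketch.
    """
    mapping = {}
    for src, vec in source_emb.items():
        mapping[src] = max(
            target_emb,
            key=lambda tgt: cosine_similarity(vec, target_emb[tgt]),
        )
    return mapping


# Toy embeddings invented for illustration only.
english = {"M": [0.9, 0.1, 0.0], "S": [0.1, 0.8, 0.2], "AA": [0.0, 0.2, 0.9]}
korean = {"ㅁ": [1.0, 0.0, 0.1], "ㅅ": [0.0, 1.0, 0.1], "ㅏ": [0.1, 0.1, 1.0]}

mapping = nearest_phoneme_map(english, korean)
for src, tgt in mapping.items():
    print(f"{src} -> {tgt}")
```

With real embeddings, the dictionaries would hold the vectors learned by the bilingual model, and the resulting mapping would be applied to the English phoneme sequence before synthesis.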

3. Effect of pre-training with multilingual dataset

Assuming access to only a small amount of English data (low-resource) and a large amount of Korean data (high-resource), we trained the model in three different ways.
For this demo, we used 10 hours of English data from Blizzard 2013 and 60 hours of Korean data.

  • T-base: Baseline Tacotron with no pre-training
  • PD-H: Pre-train only the Decoder module using the High-resource language, then fine-tune with the low-resource language
  • PA-HL: Pre-train All modules using both the High-resource and Low-resource languages, then fine-tune with the low-resource language
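
The three schemes above differ only in which modules are pre-trained and on which data. The sketch below writes them out as explicit stage lists. The module names (`encoder`, `attention`, `decoder`) and the `Stage` and `training_plan` names are hypothetical, and the sketch assumes that fine-tuning updates all modules, which this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # "train", "pre-train", or "fine-tune"
    modules: tuple   # modules whose weights are updated in this stage
    data: tuple      # datasets used: "high" and/or "low" resource


# Hypothetical module names for a Tacotron-style model.
ALL_MODULES = ("encoder", "attention", "decoder")


def training_plan(scheme):
    """Return the ordered training stages for a scheme name."""
    # Assumption: fine-tuning updates every module on low-resource data.
    fine_tune = Stage("fine-tune", ALL_MODULES, ("low",))
    if scheme == "T-base":
        # No pre-training: train directly on the low-resource data.
        return [Stage("train", ALL_MODULES, ("low",))]
    if scheme == "PD-H":
        # Pre-train only the decoder on the high-resource language.
        return [Stage("pre-train", ("decoder",), ("high",)), fine_tune]
    if scheme == "PA-HL":
        # Pre-train all modules on both languages.
        return [Stage("pre-train", ALL_MODULES, ("high", "low")), fine_tune]
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("T-base", "PD-H", "PA-HL"):
    print(scheme, [stage.name for stage in training_plan(scheme)])
```

In this demo, "high" corresponds to the Korean data and "low" to the English data.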

Text | T-base | PD-H | PA-HL
Pour the stew from the pot into the plate.
Green moss grows on the northern side.
We don't like to admit our small faults.
The marsh will freeze when cold enough.
The kite dipped and swayed, but stayed aloft.

We applied the same pre-training schemes to other languages, using 2 hours of data per language from the CSS10 dataset and 60 hours of Korean data.

Language | Text | PD-H | PA-HL

Chinese:
jiē zhe zǒu chū yī gè xiǎo dàn lái, yī yī yā yā dì chàng.
zhōng guó dì zuò wén zhāng yǒu guǐ fàn, shì shì yě réng rán shì luó xuán.
suī rán míng rén dì wén zhāng, pà nán miǎn yǒu xiē kuā dà.

Dutch:
De parels, die in de zeven met twintig tot gaatjes blijven liggen, zijn de beste.
en die noch in de diepte, noch op hare oppervlakte iets goeds oplevert.
Zoo wist ik niet anders dat wij het eiland Carpathos,

Finnish:
Sellaisten rakentaminen on näet siinä maassa kehittynyt varsin korkealle kannalle.
Eihän tällaisilla vesillä ensinkään pystyisi liikkumaan sellainen vene, jota minä voisin hoidella.
Pikku-muori parka menettää nyt kuningattaren suosion ja sen kautta onnensakin.

French:
Dans ce défilé de condamnés qu'on appelle la destinée humaine,
C'est ce qui avait un peu faussé la balance de ce coeur, penchée d'un seul côté.
Je dis assassinat et vol, monsieur le baron.

German:
Knarrpanti hörte dies alles mit einem selbstzufriedenen Lächeln an und versicherte,
Soviel wie ich mich auf solche Dinge verstehe, ist sie gar nicht weit, mir ist's, als wittere ich ihre Nähe.
und warf ihn unter dem unaussprechlichsten Jubel des versammelten Haufens zum Fenster hinaus.

Greek:
Μα και αν σε ακολουθούσαν, πώς θα προφθάσει μόνος ο πατέρας μου, να φτιάσει τόσα όπλα,
Το Βασιλόπουλο το αντιλήφθηκε, και ανάγκασε τον Κατεργαρίσκο να γυρίσει την πλάτη.
Τι περιμένετε μαζεμένοι εδώ, όταν ο εχθρός ρημάζει τη χώρα μας,

Hungarian:
Egy jól irányzott mozdulattal beledobta a török csónakba.
Együtt szokott vacsorázni Mohácson a tisztikar, s ott még a haragosok is összebékültek.
Dobó valami beszédet mondott nekik, aztán föltette a süvegét, s a robogó paripa felé fordult.

Japanese:
hyaku san kanojo ga isha no genkan e kakat ta no wa sono san yon fun mae de at ta.
sentaku ya no otoko wa, zokka wo utai nagara, kugiri kugiri e shisshisshi toyuu kotoba wo ire ta.
hotondo niwa toyuu mono ga nakat ta. kuruma mawashi, basha mawashi wa muron no koto de at ta.

Russian:
Сестра перевязывает и ласково улыбается, ну, счастливчик вы,
Ну вот их и перерубили. Там, на дороге, еще валяются.
О домашних животных нечего и говорить, скот крупный и мелкий прятался под навес,

Spanish:
No llegábamos a seis mil, pero éramos buena gente aunque me esté mal el decirlo.
Difícil es conocer la cifra exacta a que se elevaron las fuerzas de paisanos armados,
Habrías de ver su diligencia y extremado empeño de hacer cumplidos.

