Audio samples from "Learning pronunciation from a foreign language in speech synthesis networks"

Abstract: Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across languages. When people learn a foreign language, their pronunciation often reflects the characteristics of their native language. This motivates us to investigate how a speech synthesis network learns pronunciation when it is given a multilingual dataset. In this study, we train a speech synthesis network bilingually in English and Korean, and analyze how the network learns the relations between phoneme pronunciations in the two languages. Our experimental results show that the learned phoneme embedding vectors lie closer together when their pronunciations are similar across the languages. Based on this result, we also show that it is possible to train networks that synthesize an English speaker's Korean speech and vice versa. In another experiment, we train the network with a limited amount of English data and a large Korean dataset, and analyze how much data is required to train a resource-poor language with the help of a resource-rich language.

1. Samples from the cross-lingual TTS model (trained with public data)

1-1. English sentences read by English/Korean speakers

Text | English speaker | Korean speaker
Now these things had been struck dead within him.
He confessed that the sketch had startled him.
He began to follow the footprints of the dog.
I suppose you wonder why she is coming up here.
They saw each other for the first time in Boston.
Its diameter was not more than two hundred yards.

1-2. Korean sentences read by English/Korean speakers

Text | English speaker | Korean speaker
내 고향은 경기도 안성군 양성면에 있는 난실리라는 산골이다.
밥을 푼다는 것은 밥을 짓는 과정에서 생기는 마이너스 요인이다.
그렇기 때문에 아름다움은 하나의 발견일 수도 있어.
한참 세수를 하고 나더니, 이번에는 물 속을 빤히 들여다본다.
오누이는 의심스러웠지만 그만 문을 열어 주고 말았어요.
옛날 어느 산골마을에 울기를 잘하는 어린애가 있었습니다.

2. Conversion of each English phoneme to the nearest Korean phoneme

Text | Generated from English phoneme | Generated from Korean phoneme
They were three hundred yards apart.
He has many good friends.
The questions may have come vaguely in his mind.
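
The conversion in this section replaces each English phoneme with its nearest Korean phoneme in the learned embedding space. A minimal sketch of such a lookup is shown below. The embedding values are toy numbers invented for illustration, the phoneme labels are only examples, and cosine similarity is an assumed distance measure; this page does not state which metric was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_korean_phoneme(english_vec, korean_embeddings):
    """Return the Korean phoneme whose embedding is most similar to english_vec."""
    return max(
        korean_embeddings,
        key=lambda p: cosine_similarity(english_vec, korean_embeddings[p]),
    )

# Hypothetical 3-dimensional embeddings (real models use far more dimensions).
english_embeddings = {
    "M": [0.9, 0.1, 0.0],
    "S": [0.1, 0.8, 0.2],
    "AA": [0.0, 0.2, 0.9],
}
korean_embeddings = {
    "ㅁ": [1.0, 0.0, 0.1],
    "ㅅ": [0.0, 1.0, 0.1],
    "ㅏ": [0.1, 0.1, 1.0],
}

# Map every English phoneme to its nearest Korean phoneme.
mapping = {
    phoneme: nearest_korean_phoneme(vec, korean_embeddings)
    for phoneme, vec in english_embeddings.items()
}
print(mapping)
```

With a mapping like this, an English phoneme sequence could be rewritten as a Korean phoneme sequence before synthesis, which is the comparison the two audio columns above illustrate.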

3. Data requirements for foreign language

We used 46.2 hours of Korean data. The amount of English data used is given in the header row of each table.
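
This page does not say how the English subsets were drawn. As a rough sketch only, one way to build a subset that fits a given hour budget is to shuffle the utterances and keep adding them until the budget is reached. The function name `select_subset`, the `(utterance_id, duration_seconds)` input format, and the random sampling strategy are all assumptions made for illustration.

```python
import random

def select_subset(utterances, budget_hours, seed=0):
    """Pick utterances at random until their total duration fills the budget.

    utterances: list of (utterance_id, duration_seconds) pairs (assumed format).
    Returns the chosen ids and their total duration in seconds.
    """
    rng = random.Random(seed)       # fixed seed for a reproducible subset
    shuffled = list(utterances)
    rng.shuffle(shuffled)

    budget_seconds = budget_hours * 3600.0
    chosen, total = [], 0.0
    for utt_id, duration in shuffled:
        if total + duration > budget_seconds:
            continue                # skip utterances that would exceed the budget
        chosen.append(utt_id)
        total += duration
    return chosen, total

# Toy corpus: 3000 utterances of 6 seconds each (5 hours in total).
corpus = [(f"utt{i:04d}", 6.0) for i in range(3000)]
subset_ids, subset_seconds = select_subset(corpus, budget_hours=1)
print(len(subset_ids), subset_seconds / 3600.0)
```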

3-1. English sentences read by English speaker

Text | 10 hours | 5 hours | 3 hours | 1 hour
Then he stepped back with a low cry of pleasure.
And now put yourself in my place for a moment.
The moon had already begun its westward decline.

3-2. English sentences read by Korean speaker

Text | 10 hours | 5 hours | 3 hours | 1 hour
Then he stepped back with a low cry of pleasure.
And now put yourself in my place for a moment.
The moon had already begun its westward decline.

4. Samples from the cross-lingual TTS model (trained with proprietary data)

Here, we used WaveNet instead of the Griffin-Lim algorithm.

Text | English speaker | Korean speaker
Generative adversarial network or variational autoencoder
제너러티브 어드버서리얼 네트워크 또는 베이에이셔널 오토 인코더
Basilar membrane and otolaryngology are not autocorrelations
기저막과 이비인후과는 자기 상관이 아닙니다
That girl did a video about Star Wars lipstick
그 소녀는 스타워즈 립스틱에 대한 비디오를 보았습니다
Peter Piper picked a peck of pickled peppers How many pickled peppers did Peter Piper pick
네가 그린 기린 그림은 못 그린 기린 그림이고 내가 그린 기린 그림은 잘 그린 기린 그림이다
She earned a doctorate in sociology at Columbia University
그녀는 콜럼비아 대학에서 박사학위를 받았다
George Washington was the first President of the United States
조지 워싱턴은 미국의 초대 대통령이다