1. Speech-side prosody control

The speech-side prosody embedding can control the prosody of a specific frame.
  • The red line denotes the 1st dimension of the prosody embedding.
  • The green line denotes the 2nd dimension of the prosody embedding.
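As a rough illustration of what "adjusting a dimension" could look like, the sketch below adds an offset to one dimension of a per-frame embedding sequence over a chosen range of frames. The function name `adjust_dimension`, the list-of-lists layout, and the offset value are all hypothetical and not taken from the authors' implementation; the same indexing would apply to per-phoneme steps on the text side.

```python
from typing import List, Sequence


def adjust_dimension(
    embeddings: Sequence[Sequence[float]],
    dim: int,
    start: int,
    end: int,
    offset: float,
) -> List[List[float]]:
    """Return a copy of `embeddings` with `offset` added to dimension `dim`
    for every step in the half-open range [start, end).

    Hypothetical helper: the layout (one small vector per frame) is an
    assumption made for illustration only.
    """
    # Copy each step so the caller's sequence is left untouched.
    adjusted = [list(step) for step in embeddings]
    # Clamp the range to the valid step indices.
    for t in range(max(start, 0), min(end, len(adjusted))):
        adjusted[t][dim] += offset
    return adjusted


# Five frames of a made-up 2-dimensional prosody embedding.
frames = [[0.0, 0.0] for _ in range(5)]

# Raise the 1st dimension (index 0) on frames 1 and 2 only.
raised = adjust_dimension(frames, dim=0, start=1, end=3, offset=0.5)
```

Only the selected frames and the selected dimension change; every other value is passed to the synthesizer as it was.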

Text: I had a dear friend, once a brown terrier, "skye" they called her.

No Adjustment



Adjusted 1st dimension (pitch)



Adjusted 2nd dimension (amplitude)



Text: But when it came to breaking in, that was a bad time for me.

No Adjustment



Adjusted 1st dimension (pitch)



Adjusted 2nd dimension (amplitude)



Text: I know nothing about it, but Fanny must teach me.

No Adjustment



Adjusted 1st dimension (pitch)



Adjusted 2nd dimension (amplitude)



2. Text-side prosody control

The text-side prosody embedding can control the prosody of a specific phoneme.
  • The red line denotes the 1st dimension of the prosody embedding.
  • The blue line denotes the 2nd dimension of the prosody embedding.
  • The yellow line denotes the 3rd dimension of the prosody embedding.


Text: I had a dear friend, once a brown terrier, “skye” they called her.

No Adjustment



Adjusted 1st dimension (amplitude, length)



Adjusted 2nd dimension (pitch)



Adjusted 3rd dimension (pitch, length)



Text: But when it came to breaking in, that was a bad time for me.

No Adjustment



Adjusted 1st dimension (amplitude, length)



Adjusted 2nd dimension (pitch)



Adjusted 3rd dimension (pitch, length)



Text: I know nothing about it, but Fanny must teach me.

No Adjustment



Adjusted 1st dimension (amplitude, length)



Adjusted 2nd dimension (pitch)



Adjusted 3rd dimension (pitch, length)



3. Effect of the normalized prosody embedding

When we use the prosody embedding for prosody transfer, the generated speech tends to carry over the reference speaker’s identity. For example, when the reference speaker is female and the target speaker is male, the generated speech has a higher pitch than the male speaker’s normal pitch. We show that the normalized prosody embedding can prevent this problem.
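This page does not spell out how the normalization is computed. One plausible reading, shown here purely as an assumption, is that each speaker's average prosody embedding is subtracted so that only the deviation from that speaker's usual prosody is transferred. The names `speaker_mean` and `normalize` and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def speaker_mean(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the embedding over all steps collected for one speaker."""
    count = len(embeddings)
    dims = len(embeddings[0])
    return [sum(step[d] for step in embeddings) / count for d in range(dims)]


def normalize(
    embeddings: Sequence[Sequence[float]], mean: Sequence[float]
) -> List[List[float]]:
    """Subtract a speaker's mean embedding from every step.

    Assumption: normalization is plain mean subtraction; the actual
    method used for these samples may differ.
    """
    return [[value - m for value, m in zip(step, mean)] for step in embeddings]


# Made-up embeddings from a reference speaker (two steps, two dimensions).
reference = [[1.0, 2.0], [3.0, 4.0]]

ref_mean = speaker_mean(reference)            # speaker-level offset
ref_normalized = normalize(reference, ref_mean)  # deviation from that offset
```

Under this assumption, a speaker-level offset such as a generally higher pitch range is removed before transfer, while the relative rises and falls within the utterance are kept.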

Text: He stopped, and Philip nodded at the horrified question in his eyes.

Reference speech (American female)



Target speaker (American male)



Transferred speech (without normalization)



Transferred speech (with normalization)



Text: Well, said York, if they come here they must wear the bearing rein.

Reference speech (American female)



Target speaker (American male)



Transferred speech (without normalization)



Transferred speech (with normalization)



Text: He was taken last night in the yard, and could scarcely crawl home.

Reference speech (American female)



Target speaker (Korean male)



Transferred speech (without normalization)



Transferred speech (with normalization)



4. Singing voice transfer

The results of prosody transfer applied to a singing voice.

Text: Sweet dreams are made of these. Friendly Assistants who work hard to please

Original song

Source: [GitHub](https://google.github.io/tacotron/publications/end_to_end_prosody_transfer/index.html)

Target speaker (American female)



Global style token



Speech-side control



Text-side control



5. BTS "Fake Love" covered by Fake Trump