We propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed methods introduce temporal structures into the embedding networks, which enables fine-grained control of the speaking style of the synthesized speech. The temporal structures can be designed on either the speech side or the text side, which leads to different control resolutions in time. The prosody embedding networks are plugged into end-to-end speech synthesis networks and trained with no supervision other than the target speech to be synthesized. The prosody embedding networks learn to extract prosodic features. By adjusting the learned prosody features, we can change the pitch and amplitude of the synthesized speech at both the frame level and the phoneme level. We also introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbation in prosody transfer tasks.





1. Speech-side prosody control

The speech-side prosody embedding can control the prosody of a specific frame.

  • The red line denotes the 1st dimension of the prosody embedding.
  • The green line denotes the 2nd dimension of the prosody embedding.
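
As a rough illustration of what frame-level adjustment could look like, the sketch below adds an offset to one dimension of a per-frame embedding over a chosen frame range. The function name `adjust_frames`, the list-of-lists layout, and the use of an additive offset are all assumptions made for illustration; this page does not state how the adjustments in the samples were applied.

```python
from typing import List


def adjust_frames(
    embedding: List[List[float]],
    dim: int,
    start: int,
    end: int,
    offset: float,
) -> List[List[float]]:
    """Return a copy of a per-frame embedding with `offset` added to
    dimension `dim` for frames in the half-open range [start, end).

    Hypothetical helper: an additive offset is an assumption, not the
    documented adjustment method.
    """
    if not embedding:
        return []
    if not 0 <= dim < len(embedding[0]):
        raise IndexError("dim is outside the embedding dimensions")
    adjusted = [frame[:] for frame in embedding]  # leave the input untouched
    for t in range(max(start, 0), min(end, len(adjusted))):
        adjusted[t][dim] += offset
    return adjusted


# Five frames, two dimensions (index 0 is the 1st dimension, index 1 the 2nd).
frames = [[0.0, 0.0] for _ in range(5)]
shifted = adjust_frames(frames, dim=0, start=1, end=3, offset=0.5)
```

Here only frames 1 and 2 change in the 1st dimension, which this section associates with pitch; all other frames and the 2nd dimension are left as they were.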



Text: I had a dear friend, once a brown terrier, "skye" they called her.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (pitch)

spectrogram
prosody

Adjusted 2nd dimension (amplitude)

spectrogram
prosody



Text: But when it came to breaking in, that was a bad time for me.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (pitch)

spectrogram
prosody

Adjusted 2nd dimension (amplitude)

spectrogram
prosody



Text: I know nothing about it, but Fanny must teach me.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (pitch)

spectrogram
prosody

Adjusted 2nd dimension (amplitude)

spectrogram
prosody





2. Text-side prosody control

The text-side prosody embedding can control the prosody of a specific phoneme.

  • The red line denotes the 1st dimension of the prosody embedding.
  • The blue line denotes the 2nd dimension of the prosody embedding.
  • The yellow line denotes the 3rd dimension of the prosody embedding.
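
On the text side, the embedding has one vector per phoneme rather than one per frame, so an adjustment is addressed by phoneme position. The sketch below is a hypothetical example: the helper `adjust_phonemes`, the ARPAbet-style phoneme symbols, and the additive offset are assumptions for illustration and are not taken from this page.

```python
from typing import List, Sequence


def adjust_phonemes(
    embedding: List[List[float]],
    positions: Sequence[int],
    dim: int,
    offset: float,
) -> List[List[float]]:
    """Return a copy of a per-phoneme embedding with `offset` added to
    dimension `dim` at the given phoneme positions.

    Hypothetical helper: an additive offset is an assumption.
    """
    adjusted = [vec[:] for vec in embedding]  # leave the input untouched
    for p in positions:
        if not 0 <= p < len(adjusted):
            raise IndexError("phoneme position out of range")
        adjusted[p][dim] += offset
    return adjusted


# Illustrative phoneme sequence for one word, with a 3-dimensional embedding.
phonemes = ["F", "AE", "N", "IY"]
embedding = [[0.0, 0.0, 0.0] for _ in phonemes]

# Raise the 1st dimension (index 0) on the vowel "AE" only.
emphasized = adjust_phonemes(embedding, positions=[1], dim=0, offset=1.0)
```

Because each vector is tied to a phoneme, the change covers that phoneme's whole duration in the output, however many frames it spans. That is the coarser time resolution of text-side control.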



Text: I had a dear friend, once a brown terrier, “skye” they called her.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (amplitude, length)

spectrogram
prosody

Adjusted 2nd dimension (pitch)

spectrogram
prosody

Adjusted 3rd dimension (pitch, length)

spectrogram
prosody



Text: But when it came to breaking in, that was a bad time for me.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (amplitude, length)

spectrogram
prosody

Adjusted 2nd dimension (pitch)

spectrogram
prosody

Adjusted 3rd dimension (pitch, length)

spectrogram
prosody



Text: I know nothing about it, but Fanny must teach me.

No Adjustment

spectrogram
prosody

Adjusted 1st dimension (amplitude, length)

spectrogram
prosody

Adjusted 2nd dimension (pitch)

spectrogram
prosody

Adjusted 3rd dimension (pitch, length)

spectrogram
prosody





3. Effect of the normalized prosody embedding

When we use a prosody embedding for prosody transfer, the result tends to carry the speaker identity of the reference speech. For example, when the reference speaker is female and the target speaker is male, the generated speech has a higher pitch than the male speaker’s normal pitch. We show that the normalized prosody embedding can prevent this problem.
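
The exact form of the normalization is not given on this page. One plausible reading, shown below purely as an assumption, is to subtract each dimension's mean over time, so that a constant speaker-dependent level is removed while the contour is kept. The function name `normalize_over_time` is hypothetical.

```python
from typing import List


def normalize_over_time(embedding: List[List[float]]) -> List[List[float]]:
    """Subtract the per-dimension temporal mean from a prosody embedding.

    Assumption: mean subtraction over time stands in for the page's
    "temporal normalization"; the actual method may differ.
    """
    if not embedding:
        return []
    n_steps = len(embedding)
    n_dims = len(embedding[0])
    means = [sum(step[d] for step in embedding) / n_steps for d in range(n_dims)]
    return [[step[d] - means[d] for d in range(n_dims)] for step in embedding]


# Toy reference embedding: dimension 0 varies over time, dimension 1 is constant.
reference = [[2.0, 1.0], [4.0, 1.0]]
normalized = normalize_over_time(reference)
```

In this toy input the first dimension keeps its rise over time but is re-centered around zero, and the constant second dimension becomes zero. Under the stated assumption, this is the kind of speaker-level offset that would otherwise be transferred to the target speaker.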



Text: He stopped, and Philip nodded at the horrified question in his eyes.

Reference speech (American female)

Target speaker (American male)

Transferred speech (without normalization)

Transferred speech (with normalization)



Text: Well, said York, if they come here they must wear the bearing rein.

Reference speech (American female)

Target speaker (American male)

Transferred speech (without normalization)

Transferred speech (with normalization)



Text: He was taken last night in the yard, and could scarcely crawl home.

Reference speech (American female)

Target speaker (Korean male)

Transferred speech (without normalization)

Transferred speech (with normalization)





4. Singing voice transfer

Results of prosody transfer applied to a singing voice.



Text: Sweet dreams are made of these. Friendly Assistants who work hard to please

Original song

source: https://google.github.io/tacotron/publications/end_to_end_prosody_transfer/index.html

Target speaker (American female)

Global style token

Speech-side control

Text-side control





5. BTS "Fake Love" covered by Fake Trump