Audio samples from "Fine-grained prosodic speech synthesis with jointly trained hierarchical auto-regressive prior"

Abstract: Fine-grained text-to-speech (TTS) models can explicitly model the prosody of each phoneme. To synthesize speech from arbitrary text input, such models require an auto-regressive (AR) prior network that models the prosody vector. Previous work has used vector quantization (VQ) and variational autoencoders (VAE) for the AR prior. However, each has a shortcoming: VAE training is often unstable, which makes results hard to reproduce, and VQ limits the resolution of prosody. This study investigates why a previous AR prior network without VQ or VAE was not trained well, and proposes three methods to remedy the problem: 1) joint training of the AR prior with the entire model, 2) hierarchical prosody modeling, and 3) an additional encoder. The proposed model, which uses neither VQ nor VAE, predicted the prosody vector more accurately, which resulted in more natural speech than the baseline model. Additionally, the proposed model was able to generate very long sentences without modifying the commonly used location-sensitive attention mechanism.
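
To make the role of the AR prior more concrete, here is a minimal sketch of an auto-regressive rollout at synthesis time: one prosody vector is predicted per phoneme, and each prediction is conditioned on the previous one. This is an illustration only, not the network from the paper. The names `rollout_prior` and `toy_step`, the all-zero start vector, and the averaging rule are assumptions made for this example; in a real model, `step` would be a learned network.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rollout_prior(
    text_states: Sequence[Vector],
    step: Callable[[Vector, Vector], Vector],
    prosody_dim: int,
) -> List[Vector]:
    """Predict one prosody vector per phoneme, feeding each prediction
    back in as the condition for the next phoneme."""
    prev = [0.0] * prosody_dim  # assumed all-zero start vector
    outputs: List[Vector] = []
    for state in text_states:
        prev = step(prev, state)
        outputs.append(prev)
    return outputs


def toy_step(prev: Vector, state: Vector) -> Vector:
    # Hypothetical stand-in for a learned network: average the previous
    # prosody vector with the current phoneme's text state.
    return [0.5 * p + 0.5 * s for p, s in zip(prev, state)]


# Three phonemes, two-dimensional toy states and prosody vectors.
text_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
prosody = rollout_prior(text_states, toy_step, prosody_dim=2)
print(prosody)
```

Because every step depends on the previous output, the prior can run on arbitrary text at synthesis time, when no reference audio is available to extract prosody from.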

Paper: [pdf]

1. Samples from the baseline and the proposed method


Text: Another son, Assurbanipal, or the great Sardanapalus of the Greeks, became the King of Nineveh.
Model | Sample
Baseline
Proposed

Text: An elaborate machinery planned for the protection of the trader, and altogether on his side, had long existed for the recovery of debts.
Model | Sample
Baseline
Proposed

Text: it is well to stir into the yeast a bit of soda no larger than a grain of corn already wet up in a teaspoonful of boiling water.
Model | Sample
Baseline
Proposed

Text: Nicol differed with the FBI experts on one bullet taken from Tippit's body.
Model | Sample
Baseline
Proposed

Text: Saward spent all his share at low gaming houses, and in all manner of debaucheries.
Model | Sample
Baseline
Proposed

2. Generating long sentences


Text: Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all. The sun rose on the same tidy front gardens and lit up the brass number four on the Dursleys' front door; it crept into their living room, which was almost exactly the same as it had been on the night when Mr. Dursley had seen that fateful news report about the owls.
Model | Sample
Baseline
Proposed
Proposed w/o M1
Proposed w/o M2
Proposed w/o M3

Text: Harry had the best morning he'd had in a long time. He was careful to walk a little way apart from the Dursleys so that Dudley and Piers, who were starting to get bored with the animals by lunchtime, wouldn't fall back on their favorite hobby of hitting him. They ate in the zoo restaurant, and when Dudley had a tantrum because his knickerbocker glory didn't have enough ice cream on top, Uncle Vernon bought him another one and Harry was allowed to finish the first. Harry felt, afterward, that he should have known it was all too good to last. After lunch they went to the reptile house. It was cool and dark in there, with lit windows all along the walls. Behind the glass, all sorts of lizards and snakes were crawling and slithering over bits of wood and stone. Dudley and Piers wanted to see huge, poisonous cobras and thick, man-crushing pythons. Dudley quickly found the largest snake in the place. It could have wrapped its body twice around Uncle Vernon's car and crushed it into a trash can - but at the moment it didn't look in the mood. In fact, it was fast asleep.
Model | Sample
Baseline
Proposed
Proposed w/o M1
Proposed w/o M2
Proposed w/o M3

Text: October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire. Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones. They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles.
Model | Sample
Baseline
Proposed
Proposed w/o M1
Proposed w/o M2
Proposed w/o M3

3. Prosody control with the proposed method

We changed the value of each dimension of the prosody vector to observe its effect. The controlled text is underlined. Note that the fourth dimension did not learn a meaningful representation.
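
As a rough illustration of this kind of edit, the sketch below shifts one dimension of the per-phoneme prosody vectors over a chosen span of phonemes and leaves everything else untouched. The function name `shift_prosody_dim`, the list-of-lists layout, the four-dimensional toy vectors, and the additive offset are assumptions made for this example; the actual model code may apply the change differently.

```python
from typing import List


def shift_prosody_dim(
    prosody: List[List[float]],
    dim: int,
    start: int,
    end: int,
    delta: float,
) -> List[List[float]]:
    """Return a copy of `prosody` with `delta` added to dimension `dim`
    for the phonemes in the half-open range [start, end)."""
    if not prosody:
        return []
    if not 0 <= dim < len(prosody[0]):
        raise IndexError("prosody dimension out of range")
    edited = [row[:] for row in prosody]  # copy so the input is not mutated
    for i in range(max(start, 0), min(end, len(edited))):
        edited[i][dim] += delta
    return edited


# Toy input: 6 phonemes with 4 prosody dimensions each (sizes assumed).
original = [[0.0, 0.0, 0.0, 0.0] for _ in range(6)]

# Raise the 1st dimension (index 0) for phonemes 2, 3 and 4 only.
edited = shift_prosody_dim(original, dim=0, start=2, end=5, delta=1.5)
print(edited)
```

Synthesizing from `edited` instead of `original` would then change only the selected span, which corresponds to the underlined text in the samples.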


Text: Nicol differed with the FBI experts on one bullet taken from Tippit's body.
No change
Change 1st dim
Change 2nd dim
Change 3rd dim

Text: Saward spent all his share at low gaming houses, and in all manner of debaucheries.
No change
Change 1st dim
Change 2nd dim
Change 3rd dim

4. Generating diverse style for a sentence

Given one sentence, the proposed model can generate diverse styles by using different style tokens.


Text: Nicol differed with the FBI experts on one bullet taken from Tippit's body.
Token 1
Token 2
Token 3
Token 4