Audio Samples from "IMPROVING SEQUENCE-TO-SEQUENCE VOICE CONVERSION BY ADDING TEXT-SUPERVISION"

Paper: Improving Sequence-to-Sequence Voice Conversion by Adding Text-Supervision (Submitted to ICASSP 2019, Nov. 2018)

Authors: Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen Liang, Li-Rong Dai

Abstract: This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method proposed in our previous work achieved higher naturalness and similarity. In this paper, we further improve its performance by utilizing the text transcriptions of parallel training data. First, a multi-task learning structure is designed which adds auxiliary classifiers to the middle layers of the seq2seq model and predicts linguistic labels as a secondary task. Second, a data-augmentation method is proposed which utilizes text alignment to produce extra parallel sequences for model training. Experiments are conducted to evaluate our proposed methods with training sets of different sizes. Experimental results show that multi-task learning with linguistic labels is effective at reducing the errors of seq2seq voice conversion. The data-augmentation method can further improve the performance of seq2seq voice conversion when only 50 or 100 training utterances are available.
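To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) an auxiliary classifier attached to a middle layer of the seq2seq model, trained to predict frame-level linguistic (e.g., phoneme) labels alongside the main acoustic regression task, and (b) cutting an aligned parallel utterance into extra, shorter parallel pairs. All names here (AuxClassifier, lambda_aux, augment_by_alignment, the L1 main loss) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Projects hidden states of a middle layer onto linguistic labels."""
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden):            # hidden: (batch, time, hidden_dim)
        return self.proj(hidden)          # logits: (batch, time, num_labels)

def multitask_loss(pred_acoustic, target_acoustic,
                   aux_logits, target_labels, lambda_aux=0.1):
    """Main acoustic regression loss plus a weighted auxiliary
    classification loss on the predicted linguistic labels."""
    main = nn.functional.l1_loss(pred_acoustic, target_acoustic)
    aux = nn.functional.cross_entropy(
        aux_logits.transpose(1, 2),       # CE expects (batch, classes, time)
        target_labels)                    # (batch, time) integer labels
    return main + lambda_aux * aux

def augment_by_alignment(src_feats, tgt_feats, src_bounds, tgt_bounds):
    """Sketch of the data-augmentation idea: slice a parallel utterance
    at aligned segment boundaries (obtained from forced alignment of the
    text) to produce extra, shorter parallel training pairs."""
    return [(src_feats[s0:s1], tgt_feats[t0:t1])
            for (s0, s1), (t0, t1) in zip(src_bounds, tgt_bounds)]
```

Here lambda_aux is a tunable weight balancing the secondary task against the main acoustic loss; the boundary lists would come from a text-based forced aligner.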


text: 而我竟感觉有一种难以言说的忧伤涌上心头。 (English translation: "Yet I felt an inexpressible sorrow welling up in my heart.")

er2 wo3 jing4 gan3 jue2 you3 yi4 zhong3 nan2 yi3 yan2 shuo1 de0 you1 shang1 yong3 shang4 xin1 tou2.

Reference audio

Size of training data   seq2seq          seq2seq-MT-DA
50 utterances           [audio sample]   [audio sample]

Size of training data   seq2seq          seq2seq-MT
1000 utterances         [audio sample]   [audio sample]


text: 前方桥下侧丁字路口,准备进主路。 (English translation: "At the T-junction beside the bridge ahead, prepare to enter the main road.")

qian2 fang1 qiao2 xia4 ce4 ding1 zi4 lu4 kou3, zhun3 bei4 jin4 zhu3 lu4.

Reference audio

Size of training data   seq2seq          seq2seq-MT-DA
50 utterances           [audio sample]   [audio sample]

Size of training data   seq2seq          seq2seq-MT
1000 utterances         [audio sample]   [audio sample]