seq2seqVC

Audio Samples from "Sequence-to-Sequence Acoustic Modeling for Voice Conversion"

Paper:

Sequence-to-Sequence Acoustic Modeling for Voice Conversion (submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing, Aug. 2018)

Authors:

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai

Abstract:

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by aligning the feature sequences of source and target speakers implicitly using an attention mechanism. At the conversion stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model, which is difficult to achieve with conventional methods. Mel-scale spectrograms are adopted as acoustic features, which contain both excitation and vocal tract descriptions of speech signals. The bottleneck features extracted from source speech using an automatic speech recognition (ASR) model are appended as auxiliary input. A WaveNet vocoder conditioned on Mel-spectrograms is built to reconstruct waveforms from the output of the SCENT model. Experimental results show that our proposed method achieved better objective and subjective performance than the baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic models. The proposed method also outperformed our previous work, which achieved the top rank in Voice Conversion Challenge 2018. Ablation tests further confirm the effectiveness of appending bottleneck features and using the attention module in our proposed method.

0. Reference audio and corresponding texts for Sections 1, 2, and 3

Female		Male

他难道不知道我要存钱去超市买零食啊？ (ta1 nan2 dao4 bu4 zhi1 dao4 wo3 yao4 cun2 qian2 qu4 chao1 shi4 mai3 ling2 shi2 a0?)	马上停机了，明天要是去市里，记得给我充下话费。 (ma3 shang4 ting2 ji1 le0, ming2 tian1 yao4 shi4 qu4 shi4 li3, ji4 de0 gei2 wo3 chong1 xia3 hua4 fei4.)	他难道不知道我要存钱去超市买零食啊？ (ta1 nan2 dao4 bu4 zhi1 dao4 wo3 yao4 cun2 qian2 qu4 chao1 shi4 mai3 ling2 shi2 a0?)	马上停机了，明天要是去市里，记得给我充下话费。 (ma3 shang4 ting2 ji1 le0, ming2 tian1 yao4 shi4 qu4 shi4 li3, ji4 de0 gei2 wo3 chong1 xia3 hua4 fei4.)

1. Comparison of the proposed method with baseline methods

Samples from experiments on the Mandarin dataset.

Methods	Female-to-Male	Male-to-Female
i-JD-GMM
i-bn-DNN
i-VCC2018
Proposed

The "i-" prefix in a method name indicates interpolation for global speaking-rate adjustment. "-bn-" indicates that bottleneck features are used in the DNN-based method, as described in the paper.
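As an illustration of what interpolation for global speaking-rate adjustment could look like, the sketch below stretches a sequence of acoustic feature frames by one global duration ratio using linear interpolation along time. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names `duration_ratio` and `stretch_frames`, and the choice to estimate the ratio from total frame counts of parallel training utterances, are hypothetical.

```python
from typing import List, Sequence


def duration_ratio(source_lengths: Sequence[int], target_lengths: Sequence[int]) -> float:
    """Hypothetical global ratio: total target frames over total source frames."""
    return sum(target_lengths) / sum(source_lengths)


def stretch_frames(frames: Sequence[Sequence[float]], ratio: float) -> List[List[float]]:
    """Linearly resample a feature sequence along time.

    ratio > 1 lengthens the utterance (slower speech);
    ratio < 1 shortens it (faster speech).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    n_in = len(frames)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in * ratio))
    out = []
    for j in range(n_out):
        # Position of output frame j on the input time axis.
        pos = j * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo  # interpolation weight toward the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Usage: four one-dimensional frames stretched to seven frames.
stretched = stretch_frames([[0.0], [1.0], [2.0], [3.0]], 1.75)
```

A single global ratio changes the overall tempo only; it cannot adjust the duration of individual phonemes, which is the limitation the sequence-to-sequence model is meant to address.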

Samples from experiments on the English CMU ARCTIC dataset.

Methods	SLT-to-RMS	RMS-to-SLT
i-JD-GMM
i-bn-DNN
i-VCC2018
Proposed
Reference

The "i-" prefix in a method name indicates interpolation for global speaking-rate adjustment. "-bn-" indicates that bottleneck features are used in the DNN-based method, as described in the paper.

2. Comparison of the proposed method with variants without Mel-spectrograms or bottleneck features

Methods	Female-to-Male	Male-to-Female
Proposed
w/o-Mel
w/o-bn

"w/o-Mel" and "w/o-bn" denote the proposed method without Mel-spectrograms and without bottleneck features, respectively.

Adding bottleneck features improves the pronunciation correctness of converted speech.

Methods	Audio samples	Perceived text
Proposed		想听音乐，您就说打开音乐。 (xiang3 ting1 yin1 yue4, nin2 jiu4 shuo1 da3 kai1 yin1 yue4.)
w/o-bn		想听音乐，就说打开音乐。 (xiang3 ting1 yin1 [unclear voice], [skipped phoneme] jiu4 shuo1 da3 kai1 yin1 yue4.)
Reference		想听音乐，您就说打开音乐。 (xiang3 ting1 yin1 yue4, nin2 jiu4 shuo1 da3 kai1 yin1 yue4.)

3. Comparison of the proposed method with the proposed method without the attention module

Methods	Female-to-Male	Male-to-Female
Proposed
w/o-att
i-w/o-att

"w/o-att" denotes the proposed method without the attention module; "i-w/o-att" denotes the proposed method without attention, using interpolation for global speaking-rate adjustment.

4. Comparison of the proposed method with the proposed method without location code

Methods	Female-to-Male	Male-to-Female
Proposed
w/o-locc

"w/o-locc" denotes the proposed method without location code.

5. Examples of mispronunciation in the proposed method

Conversion pairs	Source utterances	Ground-truth texts	Converted utterances	Perceived text
Female-to-Male		懒鬼起床了，大懒虫，太阳晒屁股了，该运动运动了。(lan2 gui3 qi3 chuang2 le0, da4 lan3 chong2, tai4 yang2 sha4 pi4 gui0 le0, gai1 yun4 dong0 yun4 dong0 le0.)		lan2 gui3 qi3 chang2 le0, da4 lan3 chong2, tai4 yang2 sha4 pi4 gui0 le0, gai1 yun4 dong0 yun4 dong0 le0.
Female-to-Male		做作业从早上八点半到晚上十二点。(zuo4 zuo4 ye4 cong2 zao3 shang4 ba1 dian3 ban4 dao4 shi2 er4 dian3.)		zuo4 zuo4 ye4 cong2 zao3 shang4 ba1 dan4 ban4 dao4 shi2 er4 dian3.
Male-to-Female		想娶我吗？那是要付出代价的。 (xiang2 qu2 wo3 ma0? na4 shi4 yao4 fu4 chu1 dai4 jia0 de0.)		xiang2 qu2 wo3 ma0? na4 shi4 yao4 fu3 chu1 dai4 jia3 de0.
Male-to-Female		你家最近怎么这么多客人呢？(ni3 jia1 zui4 jin4 zen3 me0 zhe4 me0 duo1 ke4 ren0 ne0?)		ni3 jia1 zi4 jin4 zen3 me0 zhe4 me0 duo1 ke3 ren2 ne0?

Note that these samples are selected from the extra non-parallel part of the datasets, so no corresponding ground-truth target utterances can be presented. Phonemes or tones emphasized in red indicate incorrect pronunciation.
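Where the red emphasis is not visible, for example in a plain-text rendering of this page, the mispronounced syllables can be located by comparing the ground-truth pinyin with the perceived pinyin syllable by syllable. The following is a minimal sketch, assuming tone-numbered pinyin and transcriptions with the same number of syllables; the helper names `syllables` and `mismatches` are hypothetical and not part of the paper.

```python
import re
from typing import List, Tuple


def syllables(pinyin: str) -> List[str]:
    """Split tone-numbered pinyin into syllables, dropping punctuation."""
    return re.findall(r"[a-z]+[0-5]", pinyin.lower())


def mismatches(truth: str, heard: str) -> List[Tuple[int, str, str]]:
    """Return (index, expected, perceived) for every differing syllable."""
    t, h = syllables(truth), syllables(heard)
    if len(t) != len(h):
        # Insertions or deletions would need a proper sequence alignment.
        raise ValueError("transcriptions differ in length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(t, h)) if a != b]


# Usage with the last Male-to-Female example in the table above.
truth = "ni3 jia1 zui4 jin4 zen3 me0 zhe4 me0 duo1 ke4 ren0 ne0?"
heard = "ni3 jia1 zi4 jin4 zen3 me0 zhe4 me0 duo1 ke3 ren2 ne0?"
print(mismatches(truth, heard))
# [(2, 'zui4', 'zi4'), (9, 'ke4', 'ke3'), (10, 'ren0', 'ren2')]
```

A difference in the letters marks a phoneme error (zui4 heard as zi4), while a difference only in the trailing digit marks a tone error (ke4 heard as ke3).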