Sequence-to-Sequence Acoustic Modeling for Voice Conversion (Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing, Aug. 2018)
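Female | Male |
他难道不知道我要存钱去超市买零食啊? (ta1 nan2 dao4 bu4 zhi1 dao4 wo3 yao4 cun2 qian2 qu4 chao1 shi4 mai3 ling2 shi2 a0?) [Doesn't he know I need to save money to buy snacks at the supermarket?] |
马上停机了,明天要是去市里,记得给我充下话费。 (ma3 shang4 ting2 ji1 le0, ming2 tian1 yao4 shi4 qu4 shi4 li3, ji4 de0 gei2 wo3 chong1 xia3 hua4 fei4.) [My phone service is about to be cut off; if you go into town tomorrow, remember to top up my credit.] |
他难道不知道我要存钱去超市买零食啊? (ta1 nan2 dao4 bu4 zhi1 dao4 wo3 yao4 cun2 qian2 qu4 chao1 shi4 mai3 ling2 shi2 a0?) [Doesn't he know I need to save money to buy snacks at the supermarket?] |
马上停机了,明天要是去市里,记得给我充下话费。 (ma3 shang4 ting2 ji1 le0, ming2 tian1 yao4 shi4 qu4 shi4 li3, ji4 de0 gei2 wo3 chong1 xia3 hua4 fei4.) [My phone service is about to be cut off; if you go into town tomorrow, remember to top up my credit.] |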
Samples from experiments on the Mandarin dataset.
Methods | Female-to-Male | Male-to-Female | ||
i-JD-GMM | ||||
i-bn-DNN | ||||
i-VCC2018 | ||||
Proposed |
"i-" in the prefixes of method names represent interpolation for global speaking rate adjustment. "-bn-" represents using bottleneck features for DNN based method as we described in the paper.
Samples from experiments on the English CMU ARCTIC dataset.
Methods | SLT-to-RMS | RMS-to-SLT | ||
i-JD-GMM | ||||
i-bn-DNN | ||||
i-VCC2018 | ||||
Proposed | ||||
Reference |
"i-" in the prefixes of method names represent interpolation for global speaking rate adjustment. "-bn-" represents using bottleneck features for DNN based method as we described in the paper.
Methods | Female-to-Male | Male-to-Female | ||
Proposed | ||||
w/o-Mel | ||||
w/o-bn |
"w/o-Mel" and "w/o-bn" represent proposed method without using Mel spectrograms and bottleneck features respectively.
Adding bottleneck features is beneficial for improving the pronunciation correctness of converted speech.
Methods | Audio samples | Texts sound like |
Proposed | | 想听音乐,您就说打开音乐。(xiang3 ting1 yin1 yue4, nin2 jiu4 shuo1 da3 kai1 yin1 yue4.) [If you want to listen to music, just say "turn on the music".] |
w/o-bn | | 想听音乐,就说打开音乐。(xiang3 ting1 yin1 (unclear voice), (skipping phoneme) jiu4 shuo1 da3 kai1 yin1 yue4.) |
Reference | | 想听音乐,您就说打开音乐。(xiang3 ting1 yin1 yue4, nin2 jiu4 shuo1 da3 kai1 yin1 yue4.) [If you want to listen to music, just say "turn on the music".] |
Methods | Female-to-Male | Male-to-Female | ||
Proposed | ||||
w/o-att | ||||
i-w/o-att |
"w/o-att" denotes the proposed method without the attention module; "i-w/o-att" denotes the proposed method without attention and with interpolation for global speaking rate adjustment.
Methods | Female-to-Male | Male-to-Female | ||
Proposed | ||||
w/o-locc |
"w/o-locc" represents proposed method without using location code.
Conversion Pairs | Source utterances | Ground truth texts | Converted utterances | Texts sound like |
Female-to-Male | | 懒鬼起床了,大懒虫,太阳晒屁股了,该运动运动了。(lan2 gui3 qi3 chuang2 le0, da4 lan3 chong2, tai4 yang2 shai4 pi4 gu0 le0, gai1 yun4 dong0 yun4 dong0 le0.) [Get up, lazybones, you big sleepyhead; the sun is already high, time to get some exercise.] | | lan2 gui3 qi3 chang2 le0, da4 lan3 chong2, tai4 yang2 shai4 pi4 gu0 le0, gai1 yun4 dong0 yun4 dong0 le0. |
Female-to-Male | | 做作业从早上八点半到晚上十二点。(zuo4 zuo4 ye4 cong2 zao3 shang4 ba1 dian3 ban4 dao4 shi2 er4 dian3.) [Doing homework from 8:30 in the morning until midnight.] | | zuo4 zuo4 ye4 cong2 zao3 shang4 ba1 dan4 ban4 dao4 shi2 er4 dian3. |
Male-to-Female | | 想娶我吗?那是要付出代价的。 (xiang2 qu2 wo3 ma0? na4 shi4 yao4 fu4 chu1 dai4 jia0 de0.) [Want to marry me? That will come at a price.] | | xiang2 qu2 wo3 ma0? na4 shi4 yao4 fu3 chu1 dai4 jia3 de0. |
Male-to-Female | | 你家最近怎么这么多客人呢?(ni3 jia1 zui4 jin4 zen3 me0 zhe4 me0 duo1 ke4 ren0 ne0?) [Why has your home had so many guests lately?] | | ni3 jia1 zi4 jin4 zen3 me0 zhe4 me0 duo1 ke3 ren2 ne0? |
Note that these samples are selected from the extra non-parallel part of the dataset, so no corresponding ground-truth target utterances can be presented. Phonemes or tones emphasised in red indicate incorrect pronunciation.