Audio Samples from "Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations"

Paper:

Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (a preprint is available here).

Implementation code is available here.

Authors:

Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Abstract:

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of the training data are introduced to provide references for learning the linguistic representations of audio signals. Second, an adversarial training strategy is employed to further remove speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, including pre-training on a multi-speaker dataset and fine-tuning on the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, there are no constraints of frame alignment or frame-by-frame conversion in our proposed method. Experimental results showed that our method obtained higher similarity and naturalness than the top-ranked non-parallel voice conversion method in Voice Conversion Challenge 2018. The performance of our proposed method was also close to that of the state-of-the-art parallel seq2seq voice conversion method.
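
For readers skimming before opening the paper, the sketch below illustrates the three components named in the abstract: a recognition encoder for linguistic content, a speaker encoder for a speaker embedding, and an adversarial speaker classifier applied to the linguistic representations. It is a minimal, hypothetical illustration, not the authors' implementation; module names, layer choices, and dimensions are assumptions, and the paper's actual recognition encoder and decoder are seq2seq networks with attention.

```python
# Minimal sketch of the disentanglement idea -- NOT the authors' code.
# Layer choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class RecognitionEncoder(nn.Module):
    """Maps acoustic features to linguistic representations
    (trained with phoneme supervision in the paper)."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, mel):                 # mel: (B, T, feat_dim)
        out, _ = self.rnn(mel)
        return out                          # (B, T, 2 * hidden_dim)

class SpeakerEncoder(nn.Module):
    """Summarizes an utterance into a single speaker embedding."""
    def __init__(self, feat_dim=80, spk_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, spk_dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)                # h: (1, B, spk_dim)
        return h.squeeze(0)                 # (B, spk_dim)

class SpeakerClassifier(nn.Module):
    """Adversary that tries to identify the speaker from the linguistic
    representations; the recognition encoder is trained to fool it,
    which pushes speaker information out of those representations."""
    def __init__(self, ling_dim=512, n_speakers=100):
        super().__init__()
        self.proj = nn.Linear(ling_dim, n_speakers)

    def forward(self, ling):                # ling: (B, T, ling_dim)
        return self.proj(ling.mean(dim=1))  # pool over time -> logits

# Conversion swaps the speaker embedding: the decoder (a seq2seq network
# with attention in the paper, omitted here) consumes the source
# utterance's linguistic representations plus the *target* speaker's
# embedding.
if __name__ == "__main__":
    src_mel = torch.randn(2, 120, 80)        # source utterances
    tgt_mel = torch.randn(2, 95, 80)         # target-speaker utterances
    ling = RecognitionEncoder()(src_mel)     # (2, 120, 512)
    spk = SpeakerEncoder()(tgt_mel)          # (2, 128)
    logits = SpeakerClassifier()(ling)       # (2, 100)
```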



0. Ground truth target samples

           slt       rms
source     [audio]   [audio]
target     [audio]   [audio]

1. Comparison of the proposed method to baseline methods

Methods                 rms-to-slt   slt-to-rms
para       DNN          [audio]      [audio]
           Seq2seqVC    [audio]      [audio]
non-para   CycleGAN     [audio]      [audio]
           VCC2018      [audio]      [audio]
           Proposed     [audio]      [audio]

2. Reducing the size of training data for adaptation

Size of data   rms-to-slt   slt-to-rms
500            [audio]      [audio]
300            [audio]      [audio]
100            [audio]      [audio]

3. Ablation studies

Methods         rms-to-slt   slt-to-rms
Proposed        [audio]      [audio]
-adv            [audio]      [audio]
-LCT            [audio]      [audio]
-text           [audio]      [audio]
-adv-text       [audio]      [audio]
-pre-training   [audio]      [audio]

"-adv", "-LCT", "-text" represent the proposed method without using adversarial training, contrastive loss, text inputs respectively. "-pre-training" represents the proposed method without using pre-training strategy.

4. More conversion pairs

Ground truth samples:

slt       rms       bdl       clb
[audio]   [audio]   [audio]   [audio]

Inter-gender conversion

Conversion direction   Audio
slt-to-rms             [audio]
rms-to-slt             [audio]
clb-to-bdl             [audio]
bdl-to-clb             [audio]

Intra-gender conversion

Conversion direction   Audio
slt-to-clb             [audio]
clb-to-slt             [audio]
rms-to-bdl             [audio]
bdl-to-rms             [audio]