Demos from "TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis"

Paper: TO DO

Authors: Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Li-Rong Dai

Abstract: This paper presents TaLNet, a model for voice reconstruction that takes ultrasound tongue and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders process the tongue and lip data streams, and the decoder predicts acoustic features conditioned on the encoder outputs and speaker codes. To mitigate the limited amount of dual articulatory-acoustic data available for training, and since our task shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy that exploits the much larger amounts of acoustic-only data available for training TTS models. For this, a Tacotron 2 TTS model is first trained, and the parameters of its decoder are then transferred to the TaLNet decoder. We have evaluated our approach on an unconstrained multi-speaker voice recovery task. Our results show the effectiveness of both the proposed model and the transfer learning strategy. Speech reconstructed using our proposed method significantly outperformed all baselines (DNN, BLSTM and TaLNet without transfer learning) in terms of both naturalness and intelligibility. When the reconstructed speech is decoded with an ASR model, our proposed method achieves a relative WER reduction of over 30% compared to the baselines.
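As a reading aid for the WER figure quoted in the abstract, the following is a minimal sketch of how a word error rate and a relative reduction between two systems can be computed. The function names and the toy sentences are illustrative assumptions, not the paper's evaluation code, data, or results; the paper's ASR setup is not described on this page.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline_wer, proposed_wer):
    """Relative WER reduction of a proposed system over a baseline."""
    return (baseline_wer - proposed_wer) / baseline_wer


# Toy transcripts, invented purely for illustration.
reference = "the cat sat down"
baseline = wer(reference, "the hat sat")        # 2 errors / 4 words = 0.5
proposed = wer(reference, "the cat sat town")   # 1 error / 4 words = 0.25
print(relative_reduction(baseline, proposed))   # 0.5, i.e. a 50% relative reduction
```

Under this definition, a relative reduction of over 30% means the proposed system's WER is below 70% of the baseline's WER, whatever the absolute values are.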


1. Comparison with baselines

DNN | BLSTM | TaLNet w/o transfer | TaLNet | Natural

2. Varying the number of speakers

Number of speakers used | TaLNet w/o transfer | TaLNet
1
3
9
25
75

 


3. Ablation studies

Method | Sample 1 | Sample 2 | Sample 3
TaLNet
w/o tongue
w/o lip
w/o stat
w/o scheduled sampling
w/o finetune
Natural

4. Silent utterances

Audible | Silent