This paper proposes a 2D audiovisual text-to-speech synthesis system that constructs the output signal by selecting and concatenating multimodal segments containing natural combinations of audio and video. We describe the experiments conducted to assess the impact of this joint audio/video synthesis technique on the perceived quality of the synthetic speech. The experiments indicate that a maximal level of audiovisual coherence in the output speech improves the perceived quality compared to the traditional approach of synthesizing the visual signal separately from the audio. In addition, we measured that the maximum allowable desynchronization between the audio and the image sequence is the same, irrespective of whether the degree of desynchronization is constant or time-varying. This tolerance is used in the synthesizer to further optimize the segment cutting points in the audio and video modes.
Mattheyses, W, Latacz, L & Verhelst, W 2009, Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques. in International Conference on Auditory-visual Speech Processing 2009, Norwich, UK. pp. 47-52. <http://www.etro.vub.ac.be/Research/DSSP/PUB_FILES/int_conf/AVSP-2009-mattheyses.pdf>
Mattheyses, W., Latacz, L., & Verhelst, W. (2009). Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques. In International Conference on Auditory-visual Speech Processing 2009, Norwich, UK (pp. 47-52) http://www.etro.vub.ac.be/Research/DSSP/PUB_FILES/int_conf/AVSP-2009-mattheyses.pdf
@inproceedings{625a30ea4151473f8b6dcf3694380cdc,
title = "Multimodal Coherency Issues in Designing and Optimizing Audiovisual Speech Synthesis Techniques",
abstract = "This paper proposes a 2D audiovisual text-to-speech synthesis system that constructs the output signal by selecting and concatenating multimodal segments containing natural combinations of audio and video. We describe the experiments conducted to assess the impact of this joint audio/video synthesis technique on the perceived quality of the synthetic speech. The experiments indicate that a maximal level of audiovisual coherence in the output speech improves the perceived quality compared to the traditional approach of synthesizing the visual signal separately from the audio. In addition, we measured that the maximum allowable desynchronization between the audio and the image sequence is the same, irrespective of whether the degree of desynchronization is constant or time-varying. This tolerance is used in the synthesizer to further optimize the segment cutting points in the audio and video modes.",
keywords = "audiovisual speech synthesis, multimodal unit selection, audiovisual synchrony",
author = "Wesley Mattheyses and Lukas Latacz and Werner Verhelst",
year = "2009",
month = sep,
day = "10",
language = "English",
pages = "47--52",
booktitle = "International Conference on Auditory-visual Speech Processing 2009, Norwich, UK",
}