A common approach in visual speech synthesis is the use of visemes as atomic units of speech. In this paper, phoneme-based and viseme-based audiovisual speech synthesis techniques are compared in order to explore the balance between data availability and improved audiovisual coherence for synthesis optimization. A technique for automatic viseme clustering is described and compared to the standardized viseme set defined in MPEG-4. Both objective and subjective tests indicate that a phoneme-based approach leads to better synthesis results. In addition, the test results improve when a larger number of distinct visemes is defined. This raises questions about the widely applied viseme-based approach. It appears that a many-to-one phoneme-to-viseme mapping cannot describe all subtle details of the visual speech information. Moreover, with viseme-based synthesis, the perceived synthesis quality is affected by the loss of audiovisual coherence in the synthetic speech.
Mattheyses, W, Latacz, L & Verhelst, W 2011, 'Automatic Viseme Clustering for Audiovisual Speech Synthesis', Proceedings of Interspeech, no. 2011, pp. 2173-2176. <http://www.etro.vub.ac.be/Research/DSSP/PUB_FILES/int_conf/Interspeech2011_Mattheyses.pdf>
Mattheyses, W., Latacz, L., & Verhelst, W. (2011). Automatic Viseme Clustering for Audiovisual Speech Synthesis. Proceedings of Interspeech, (2011), 2173-2176. http://www.etro.vub.ac.be/Research/DSSP/PUB_FILES/int_conf/Interspeech2011_Mattheyses.pdf
@article{0581a652b547418fb321f6c20bfb42b9,
title = "Automatic Viseme Clustering for Audiovisual Speech Synthesis",
abstract = "A common approach in visual speech synthesis is the use of visemes as atomic units of speech. In this paper, phoneme-based and viseme-based audiovisual speech synthesis techniques are compared in order to explore the balance between data availability and improved audiovisual coherence for synthesis optimization. A technique for automatic viseme clustering is described and compared to the standardized viseme set defined in MPEG-4. Both objective and subjective tests indicate that a phoneme-based approach leads to better synthesis results. In addition, the test results improve when a larger number of distinct visemes is defined. This raises questions about the widely applied viseme-based approach. It appears that a many-to-one phoneme-to-viseme mapping cannot describe all subtle details of the visual speech information. Moreover, with viseme-based synthesis, the perceived synthesis quality is affected by the loss of audiovisual coherence in the synthetic speech.",
keywords = "audiovisual speech synthesis, visemes, facial animation",
author = "Wesley Mattheyses and Lukas Latacz and Werner Verhelst",
year = "2011",
month = aug,
day = "27",
language = "English",
pages = "2173--2176",
journal = "Proceedings of Interspeech",
issn = "1990-9772",
number = "2011",
note = "Interspeech 2011: 12th Annual Conference of the International Speech Communication Association; conference date: 27-08-2011 through 31-08-2011",
url = "http://www.interspeech2011.org/",
}