Mattheyses, W., Latacz, L. & Verhelst, W. 2010, 'Active Appearance Models for Photorealistic Visual Speech Synthesis', Proceedings of Interspeech 2010, pp. 1113-1116. <http://www.etro.vub.ac.be/Research/DSSP/PUB_FILES/int_conf/interspeech2010_Mattheyses.pdf>
@inproceedings{61f9051de0db4569a5bd00a6b4b897fb,
title = "Active Appearance Models for Photorealistic Visual Speech Synthesis",
abstract = "The perceived quality of a synthetic visual speech signal greatly depends on the smoothness of the presented visual articulators. This paper explains how concatenative visual speech synthesis systems can apply active appearance models to achieve a smooth and natural visual output speech. By modeling the visual speech contained in the system's speech database, a diversification between the synthesis of the shape and the texture of the talking head is feasible. This allows the system to accurately balance between the articulation strength of the visual articulators and the signal smoothness of the visual mode in order to optimize the synthesis. To improve the synthesis quality, an automatic database normalization strategy has been designed that removes variations from the database which are not related to speech production. As was verified by a perception experiment, this normalization strategy significantly improves the perceived signal quality.",
keywords = "audiovisual speech synthesis, AAM modeling",
author = "Wesley Mattheyses and Lukas Latacz and Werner Verhelst",
year = "2010",
month = sep,
day = "26",
language = "English",
pages = "1113--1116",
booktitle = "Proceedings of Interspeech 2010",
issn = "1990-9772",
note = "Interspeech 2010: 11th Annual Conference of the International Speech Communication Association, Interspeech 2010 ; Conference date: 26-09-2010 Through 30-09-2010",
url = "http://www.interspeech2010.org/",
}