We live in a world where there are countless interactions with computer systems in everyday situations. In the ideal case, this interaction feels as familiar and as natural as the communication we experience with other humans. To this end, an ideal means of communication between a user and a computer system consists of audiovisual speech signals. Audiovisual text-to-speech technology allows the computer system to utter any spoken message towards its users. Over the last decades, a wide range of techniques for performing audiovisual speech synthesis has been developed. This paper gives a comprehensive overview of these approaches using a categorization of the systems based on multiple important aspects that determine the properties of the synthesized speech signals. The paper makes a clear distinction between the techniques that are used to model the virtual speaker and the techniques that are used to generate the appropriate speech gestures. In addition, the paper discusses the evaluation of audiovisual speech synthesizers, elaborates on the hardware requirements for performing visual speech synthesis, and describes some important future directions that should stimulate the use of audiovisual speech synthesis technology in real-life applications.
Mattheyses, W & Verhelst, W 2015, 'Audiovisual speech synthesis: An overview of the state-of-the-art', Speech Communication, vol. 66, pp. 182-217. https://doi.org/10.1016/j.specom.2014.11.001
Mattheyses, W., & Verhelst, W. (2015). Audiovisual speech synthesis: An overview of the state-of-the-art. Speech Communication, 66, 182-217. https://doi.org/10.1016/j.specom.2014.11.001
@article{da64953dd196434e948b8ef971329bdf,
title = "Audiovisual speech synthesis: An overview of the state-of-the-art",
abstract = "We live in a world where there are countless interactions with computer systems in everyday situations. In the ideal case, this interaction feels as familiar and as natural as the communication we experience with other humans. To this end, an ideal means of communication between a user and a computer system consists of audiovisual speech signals. Audiovisual text-to-speech technology allows the computer system to utter any spoken message towards its users. Over the last decades, a wide range of techniques for performing audiovisual speech synthesis has been developed. This paper gives a comprehensive overview of these approaches using a categorization of the systems based on multiple important aspects that determine the properties of the synthesized speech signals. The paper makes a clear distinction between the techniques that are used to model the virtual speaker and the techniques that are used to generate the appropriate speech gestures. In addition, the paper discusses the evaluation of audiovisual speech synthesizers, elaborates on the hardware requirements for performing visual speech synthesis, and describes some important future directions that should stimulate the use of audiovisual speech synthesis technology in real-life applications.",
keywords = "Audiovisual speech synthesis, Visual speech synthesis, Speech synthesis",
author = "Wesley Mattheyses and Werner Verhelst",
year = "2015",
month = feb,
day = "1",
doi = "10.1016/j.specom.2014.11.001",
language = "English",
volume = "66",
pages = "182--217",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier",
}