Automatic Recognition of Emotionally Coloured Speech

Emotion in speech has attracted the interest of the speech community for many years, both in speech synthesis and in automatic speech recognition (ASR). In spite of remarkable recent progress in Large Vocabulary Recognition (LVR), systems still fall far short of the ultimate goal of recognising free conversational speech uttered by any speaker in any environment. Current experiments show that the error rate of state-of-the-art large vocabulary recognition systems increases substantially when they are applied to spontaneous or emotional speech. This paper shows that the recognition rate for emotionally coloured speech can be improved by using a language model with an increased representation of emotional utterances.
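
To illustrate the idea, the sketch below shows one simple way such a bias could be introduced: oversampling emotional utterances before estimating n-gram counts, so that emotional wording receives more probability mass. This is a minimal illustration, not the paper's actual language-model pipeline; the toy corpora, the boost factor and the add-one smoothing are illustrative assumptions.

# Minimal sketch (assumed recipe, not the authors' exact method): bias a
# bigram language model toward emotional speech by oversampling emotional
# utterances in the training corpus before counting n-grams.
from collections import Counter
from itertools import chain

def bigram_counts(sentences):
    """Count bigrams over tokenised sentences with <s>/</s> markers."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

def build_lm(general_corpus, emotional_corpus, boost=5):
    """Repeat emotional utterances 'boost' times, then estimate
    add-one-smoothed bigram probabilities over the combined corpus."""
    training = list(general_corpus) + list(emotional_corpus) * boost
    counts = bigram_counts(training)
    unigrams = Counter(chain.from_iterable(
        ["<s>"] + s + ["</s>"] for s in training))
    vocab_size = len(unigrams)

    def prob(prev, word):
        # P(word | prev) with add-one smoothing
        return (counts[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    return prob

# Hypothetical toy corpora of tokenised utterances
general = [["please", "open", "the", "door"]]
emotional = [["i", "am", "so", "happy"], ["this", "is", "awful"]]
lm = build_lm(general, emotional, boost=5)
print(lm("so", "happy"))   # bigram probability after boosting emotional data

In practice the same effect could also be obtained by interpolating a general-domain language model with one trained only on emotional transcriptions; the oversampling form above is used here merely because it is the shortest self-contained illustration.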



