Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition

This study proposes an emotional speech recognition system for smart phones that can be combined with 3G mobile communications and social networks to give users and their groups more interaction and care. The system uses support vector machines (SVM) to recognize emotions in speech, such as happiness, anger, sadness, and the normal state. A hierarchical classifier adjusts the weights of the acoustic features and divides the parameters into energy and frequency categories for training. Twenty-eight commonly used acoustic features, including pitch and volume, were used for training. In addition, a time-frequency parameter obtained by the continuous wavelet transform was used to identify the accent and intonation in a sentence during recognition. The Berlin Database of Emotional Speech was used, with the speech divided into male and female data sets for training. In the experiments, adding the time-frequency parameter increased the accuracy of classifying happy and angry emotions by 4.6% on the male test set and 5.2% on the female test set. For the classification of all emotions, the average accuracy over the male and female data was 63.5% on the test set and 90.9% on the whole data set.
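To make the idea of a hierarchical SVM classifier over energy and frequency feature groups more concrete, the following is a minimal sketch in plain Python. It is not the configuration used in this study. The tree layout (energy features first, frequency features second), the two-value feature vectors, the synthetic training data, and the names `ENERGY`, `FREQUENCY`, `train_linear_svm`, `fit_node`, and `fit_hierarchy` are all assumptions made for illustration. The linear SVM is trained with a simple subgradient method on the hinge loss rather than with a library solver.

```python
# Hypothetical feature layout: column 0 is an energy-type feature (volume),
# column 1 is a frequency-type feature (pitch). Values are assumed to be
# standardized already.
ENERGY = [0]
FREQUENCY = [1]


def train_linear_svm(X, y, lam=0.01, eta=0.05, epochs=200):
    """Fit a linear SVM (w, b) by subgradient descent on the hinge loss.

    X is a list of feature vectors; y holds labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if margin < 1.0:
                # The sample is inside the margin: move toward it.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def fit_node(samples, cols, positive, negative):
    """Train one binary SVM using only the feature columns in `cols`."""
    X, y = [], []
    for features, label in samples:
        if label in positive or label in negative:
            X.append([features[c] for c in cols])
            y.append(1 if label in positive else -1)
    w, b = train_linear_svm(X, y)
    # Return the decision function of this node.
    return lambda f: sum(wj * f[c] for wj, c in zip(w, cols)) + b


def fit_hierarchy(samples):
    """Assumed two-level tree: energy split first, then frequency splits."""
    arousal = fit_node(samples, ENERGY, {"happy", "angry"}, {"sad", "normal"})
    high = fit_node(samples, FREQUENCY, {"happy"}, {"angry"})
    low = fit_node(samples, FREQUENCY, {"normal"}, {"sad"})

    def predict(features):
        if arousal(features) > 0:
            return "happy" if high(features) > 0 else "angry"
        return "normal" if low(features) > 0 else "sad"

    return predict


# Synthetic, linearly separable toy data: ([energy, pitch], label).
TRAIN = [
    ([2.0, 2.0], "happy"), ([1.5, 1.5], "happy"),
    ([2.0, -2.0], "angry"), ([1.5, -1.5], "angry"),
    ([-2.0, -2.0], "sad"), ([-1.5, -1.5], "sad"),
    ([-2.0, 2.0], "normal"), ([-1.5, 1.5], "normal"),
]

predict = fit_hierarchy(TRAIN)
print(predict([1.8, 1.7]))
```

Splitting the decision into stages lets each node see only the feature group that is most useful for its own sub-problem, which is one plausible reading of how a hierarchical classifier can reweight acoustic features. In a real system each node would be trained on the full acoustic feature set, possibly with a nonlinear kernel.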
