Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition
This study proposes an emotional speech recognition system for smart
phone applications. The system is intended to work with 3G mobile
communications and social networks to provide users and their groups
with more interaction and care. It uses support vector machines (SVM)
to recognize emotions in speech such as happiness, anger, sadness, and
a normal state. The mechanism uses a hierarchical classifier to adjust
the weights of acoustic features and divides the parameters into
energy and frequency categories for training. A total of 28 commonly
used acoustic features, including pitch and volume, were used for
training. In addition, a time-frequency parameter obtained by the
continuous wavelet transform was used to identify the accent and
intonation in a sentence during recognition. The Berlin Database of
Emotional Speech was divided into male and female data sets for
training. In the experiments, using the time-frequency parameter
increased the accuracy of classifying happy and angry emotions by 4.6%
on the male test set and 5.2% on the female test set. For the
classification of all emotions, the average accuracy over male and
female data was 63.5% on the test set and 90.9% on the whole data set.
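
The hierarchical SVM idea described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the study: the two-level tree (an energy-based root that separates high-energy from low-energy emotions, followed by frequency-based second-stage classifiers), the toy two-dimensional features, the hyperparameters, and the helper names `train_linear_svm`, `train_hierarchy`, and `predict`. The linear SVM is fitted with plain batch subgradient descent on the hinge loss, which may differ from the solver and kernel used by the authors.

```python
# Hypothetical sketch: a two-stage (hierarchical) linear SVM on toy data.
# The tree layout, feature split, and data are assumptions, not the paper's setup.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by batch subgradient descent on the L2-regularized hinge loss.

    X is a list of feature lists; y holds labels in {-1, +1}.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin: hinge term is active
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decide(model, x):
    """Return +1 or -1 from the sign of the linear decision function."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


ENERGY, FREQUENCY = 0, 1   # column indices of the two toy feature groups
HIGH = ("happy", "angry")  # assumed high-energy emotions
LOW = ("normal", "sad")    # assumed low-energy emotions


def train_hierarchy(samples):
    """samples: list of (features, emotion). Returns the three stage models."""
    def fit(rows, column, positive):
        X = [[features[column]] for features, _ in rows]
        y = [1 if label in positive else -1 for _, label in rows]
        return train_linear_svm(X, y)

    return {
        "root": fit(samples, ENERGY, HIGH),
        "high": fit([s for s in samples if s[1] in HIGH], FREQUENCY, ("happy",)),
        "low": fit([s for s in samples if s[1] in LOW], FREQUENCY, ("normal",)),
    }


def predict(models, features):
    """Route a feature vector through the root, then one second-stage SVM."""
    if decide(models["root"], [features[ENERGY]]) == 1:
        stage = decide(models["high"], [features[FREQUENCY]])
        return "happy" if stage == 1 else "angry"
    stage = decide(models["low"], [features[FREQUENCY]])
    return "normal" if stage == 1 else "sad"


# Synthetic, already-normalized (energy, frequency) pairs; not real speech data.
TRAIN = [
    ([1.0, 0.9], "happy"), ([0.8, 1.1], "happy"),
    ([1.1, -0.9], "angry"), ([0.9, -1.0], "angry"),
    ([-1.0, -1.1], "sad"), ([-0.9, -0.8], "sad"),
    ([-1.1, 1.0], "normal"), ([-0.8, 0.9], "normal"),
]

models = train_hierarchy(TRAIN)
for features, label in TRAIN:
    print(label, "->", predict(models, features))
```

Splitting the decision into stages lets each classifier see only the feature group that is most useful for its sub-problem, which is one plausible reading of dividing the parameters into energy and frequency categories. A real system would replace the toy pairs with the extracted acoustic features and would likely use a kernel SVM library.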
@article{"International Journal of Electrical, Electronic and Communication Sciences:58381",
  author = "Wernhuar Tarng and Yuan-Yuan Chen and Chien-Lung Li and Kun-Rong Hsie and Mingteh Chen",
  title = "Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition",
  keywords = "Smart phones, emotional speech recognition, social networks, support vector machines, time-frequency parameter, Mel-scale frequency cepstral coefficients (MFCC).",
  volume = "4",
  number = "12",
  pages = "1832-8",
}