Transformation of Vocal Characteristics: A Review of Literature

The transformation of vocal characteristics aims to modify a voice so that either the intelligibility of an aphonic voice is increased or the speech of one speaker (the source speaker) is perceived as if another speaker (the target speaker) had uttered it. This paper reviews the current state of the art in voice characteristics transformation. Special emphasis is placed on transformation methodology, and issues in improving the intelligibility and naturalness of transformed speech are discussed. In particular, the modulation theory of speech is suggested as a basis for research on high-quality voice transformation. This approach allows one to separate the linguistic, expressive, organic and perspectival information in speech, based on an analysis of how these are fused when speech is produced. The theory therefore provides the fundamentals not only for manipulating non-linguistic, extra-/paralinguistic and intra-linguistic variables for voice transformation, but also for readily transposing existing voice transformation methods to emotion-related voice quality transformation and speaking style transformation. The popular voice transformation techniques are described from the perspectives of human speech production and perception, and are classified according to whether their underlying principles derive from speech production mechanisms, perception mechanisms, or both. In addition, the advantages and limitations of voice transformation techniques and of the experimental manipulation of vocal cues are discussed through examples from past and present research. Finally, conclusions are drawn and a road map toward more natural voice transformation algorithms is outlined.



