High-Individuality Voice Conversion Based on Concatenative Speech Synthesis

Concatenative speech synthesis can produce speech that is natural and strongly preserves the individuality of a speaker by drawing on a large speech corpus. Building on this method, we propose a voice conversion method whose converted speech has both high individuality and naturalness. We also conducted two subjective evaluation experiments to assess the individuality and sound quality of the converted speech. The results confirm three facts: (a) the proposed method converts speaker individuality well, (b) incorporating the unit selection framework of concatenative speech synthesis (especially the join cost) into conventional voice conversion improves the sound quality of the converted speech, and (c) the proposed method is robust to gender differences between the source speaker and the target speaker.
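To illustrate the role of the join cost in unit selection, the following is a minimal sketch, not the method evaluated in this paper. It assumes each frame is a small feature vector, uses Euclidean distance for both the target cost and the join cost, and combines them with a hypothetical weight `join_weight`. The function name `select_units` and all parameters are illustrative assumptions. The sketch picks one corpus unit per frame by dynamic programming so that the sum of target costs and weighted join costs is minimal.

```python
from math import dist


def select_units(targets, candidates, join_weight=1.0):
    """Choose one candidate unit per frame by dynamic programming.

    targets:     list of T feature vectors (e.g. converted source frames).
    candidates:  list of T lists of feature vectors (target-speaker units).
    join_weight: assumed weight on the join cost (hypothetical parameter).
    Returns (path, total_cost), where path[t] indexes candidates[t].
    """
    # best[t][j]: minimal cumulative cost of any path ending at candidate j of frame t
    best = [[dist(targets[0], c) for c in candidates[0]]]
    back = []  # back[t-1][j]: best predecessor index for candidate j of frame t

    for t in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[t]:
            # Join cost: distance between adjacent selected units (assumed Euclidean).
            costs = [
                best[t - 1][i] + join_weight * dist(prev, c)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            # Target cost: distance between the desired frame and the unit.
            row.append(costs[i_min] + dist(targets[t], c))
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path, total


if __name__ == "__main__":
    targets = [(0.0,), (1.0,), (0.0,)]
    candidates = [[(0.0,)], [(1.0,), (0.2,)], [(0.0,)]]
    # Without a join cost, the unit closest to the target is chosen in frame 1.
    print(select_units(targets, candidates, join_weight=0.0))
    # With a join cost, a smoother but less exact unit can win.
    print(select_units(targets, candidates, join_weight=1.0))
```

With `join_weight=0.0` the selection reduces to a frame-by-frame nearest match, which loosely corresponds to ignoring continuity between adjacent units. Raising the weight trades target accuracy for smoother concatenation.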



