High-Individuality Voice Conversion Based on Concatenative Speech Synthesis
Concatenative speech synthesis can produce speech that sounds natural
and preserves a speaker's individuality by drawing on a large speech
corpus. Building on this method, this paper proposes a voice conversion
method whose converted speech is both natural and high in speaker
individuality. We also conducted two subjective evaluation experiments
to assess the individuality and sound quality of the converted speech.
The results confirm three points: (a) the proposed method converts
speaker individuality well, (b) incorporating the unit-selection
framework of concatenative speech synthesis (especially the join cost)
into conventional voice conversion improves the sound quality of the
converted speech, and (c) the proposed method is robust to gender
differences between the source speaker and the target speaker.
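As a rough illustration of the unit-selection idea the abstract refers to, the sketch below picks one corpus unit per position by minimising the sum of a target cost and a weighted join cost with dynamic programming. It is a generic sketch, not the authors' method: the function names, the `join_weight` parameter, and the use of Euclidean distance for both costs are assumptions made only for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, used here as a stand-in for both cost functions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_units(
    targets: List[Vector],
    candidates: List[List[Vector]],
    join_weight: float = 1.0,
) -> List[int]:
    """Return, for each position, the index of the chosen candidate unit.

    targets[t] is the desired feature vector at position t, and
    candidates[t] lists the corpus units available at that position.
    The chosen path minimises
        sum(target cost) + join_weight * sum(join cost between neighbours).
    """
    if not targets:
        return []

    # best[t][j]: lowest cost of any path that ends in candidate j at position t.
    best = [[euclidean(targets[0], c) for c in candidates[0]]]
    # back[t][j]: index of the predecessor candidate on that lowest-cost path.
    back = [[-1] * len(candidates[0])]

    for t in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[t]:
            target_cost = euclidean(targets[t], cand)
            # Cost of arriving from each previous candidate, including the join.
            arrive = [
                best[t - 1][i] + join_weight * euclidean(prev, cand)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(arrive)), key=arrive.__getitem__)
            row.append(arrive[i_min] + target_cost)
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace the cheapest path backwards from the last position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(len(targets) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path


# Toy one-dimensional example: the middle position has two candidates.
toy_targets = [[0.0], [2.0], [0.0]]
toy_candidates = [[[0.0]], [[2.0], [0.5]], [[0.0]]]
path_without_join = select_units(toy_targets, toy_candidates, join_weight=0.0)
path_with_join = select_units(toy_targets, toy_candidates, join_weight=1.0)
```

In the toy example, ignoring the join cost selects the middle unit that matches the target exactly, while weighting the join cost selects the unit that connects more smoothly to its neighbours. This is the kind of trade-off the join cost introduces.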
[1] Y. Stylianou, O. Cappé, and E. Moulines, "Statistical methods for voice
quality transformation," Proc. of EUROSPEECH, pp.447-450,
September 1995.
[2] A. Kain, and M. W. Macon, "Spectral voice conversion for text-to-speech
synthesis," Proc. of International Conference on Acoustics, Speech and
Signal Processing, Vol. 1, pp.285-288, 1998.
[3] T. Toda, H. Saruwatari, and K. Shikano, "Voice conversion algorithm
based on Gaussian mixture model with dynamic frequency warping of
STRAIGHT spectrum," Proc. of International Conference on Acoustics,
Speech and Signal Processing, Vol. 2, pp.841-844, 2001.
[4] M. Abe, "A segment-based approach to voice conversion," Proc. of
International Conference on Acoustics, Speech and Signal Processing,
pp.765-768, 1991.
[5] D. Sündermann, H. Höge, A. Bonafonte, H. Ney, A. Black, and S.
Narayanan, "Text-independent voice conversion based on unit selection,"
Proc. of International Conference on Acoustics, Speech and Signal
Processing, 2006.
[6] E. Keller, G. Bailly, A. Monaghan, J. Terken, and M. Huckvale,
Improvements in Speech Synthesis, John Wiley & Sons, 1st Ed. 2001, ch.
1.
[7] N. Campbell, "CHATR: A high-definition speech re-sequencing system,"
Proc. of ASA/ASJ Joint Meeting, pp.1223-1228, Honolulu, December
1996.
[8] N. Campbell, and A. W. Black, "Prosody and the selection of source units
for concatenative synthesis," in Progress in Speech Synthesis, Springer
Verlag, Inc., New York, 1995, ch. 22.
[9] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda, "Ximera: A New
TTS from ATR Based on Corpus-Based Technologies," Proc. of ISCA
5th Speech Synthesis Workshop, pp.179-184, Pittsburgh, U.S.A., June
2004.
[10] Synthetic speech sample demonstration of CHATR. Available:
http://feast.atr.jp/chatr/chatr/e_tour/synth_examples.html
[11] Open-Source Large Vocabulary CSR Engine Julius. Available:
http://julius.sourceforge.jp/en_index.php?q=en/index.html
[12] Speech Signal Processing Toolkit (SPTK) Ver 3.0. Available:
http://kt-lab.ics.nitech.ac.jp/%7Etokuda/SPTK/index.html
[13] The Snack Sound Toolkit. Available: http://www.speech.kth.se/snack/
[14] K. Fujii, R. Ueda, H. Kashioka and N. Campbell, "A trial to apply
concatenative speech synthesis to spontaneous speech," Proc. of
International Technical Conference on Circuits/Systems, Computers and
Communications, Vol. 2, pp.653-656, 2006.
@article{"International Journal of Electrical, Electronic and Communication Sciences:50412",
  author = "Kei Fujii and Jun Okawa and Kaori Suigetsu",
  title = "High-Individuality Voice Conversion Based on Concatenative Speech Synthesis",
  abstract = "Concatenative speech synthesis can produce speech that sounds natural and preserves a speaker's individuality by drawing on a large speech corpus. Building on this method, this paper proposes a voice conversion method whose converted speech is both natural and high in speaker individuality. We also conducted two subjective evaluation experiments to assess the individuality and sound quality of the converted speech. The results confirm three points: (a) the proposed method converts speaker individuality well, (b) incorporating the unit-selection framework of concatenative speech synthesis (especially the join cost) into conventional voice conversion improves the sound quality of the converted speech, and (c) the proposed method is robust to gender differences between the source speaker and the target speaker.",
  keywords = "concatenative speech synthesis, join cost, speaker individuality, unit selection, voice conversion",
  volume = "1",
  number = "11",
  pages = "1580-6",
}