Unit Selection Algorithm Using Bi-grams Model For Corpus-Based Speech Synthesis
In this paper, we present a novel statistical approach to
corpus-based speech synthesis. Classically, phonetic information is
defined and treated as an acoustic reference to be respected.
Accordingly, many studies have addressed the classification of
acoustic units, which separates units according to their symbolic
characteristics, and unit selection has classically been driven by a
target cost and a concatenation cost.
In corpus-based speech synthesis systems that use large text
corpora, cost functions have been limited to a juxtaposition of
symbolic criteria, and the acoustic information of the units is not
exploited in the definition of the target cost.
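
The classical cost-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the unit representation (phone, stress, pitch), the mismatch-count target cost, the pitch-difference concatenation cost, and the function name select_units are all assumptions made for illustration. The search is a standard dynamic-programming (Viterbi-style) pass that minimizes the summed target and concatenation costs.

```python
from typing import Callable, List, Sequence, Tuple

Unit = Tuple[str, int, float]   # hypothetical unit: (phone, stress, pitch in Hz)
Target = Tuple[str, int]        # hypothetical target spec: (phone, stress)


def select_units(
    targets: Sequence[Target],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Target, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Return the unit sequence with minimum total cost, and that cost."""
    # best[i][j]: cheapest cost of any path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back: List[List[int]] = []
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k] + concat_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], total


def symbolic_target_cost(t: Target, u: Unit) -> float:
    # Juxtaposition of symbolic criteria: count mismatched features.
    return float(sum(a != b for a, b in zip(t, u[:2])))


def pitch_concat_cost(prev: Unit, cur: Unit) -> float:
    # Assumed join cost: scaled pitch discontinuity between adjacent units.
    return abs(prev[2] - cur[2]) / 100.0


targets = [("h", 0), ("e", 1)]
candidates = [
    [("h", 0, 100.0), ("h", 1, 120.0)],
    [("e", 1, 180.0), ("e", 0, 110.0)],
]
units, total = select_units(targets, candidates,
                            symbolic_target_cost, pitch_concat_cost)
```

In this toy setup the target cost sees only symbolic features, which mirrors the limitation the abstract points out: acoustic information enters only through the join cost.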
In this manuscript, we take into consideration the phonetic
information of each unit that corresponds to its acoustic information.
This is realized by defining a probabilistic linguistic bigram model
used for unit selection. The selected units are extracted from the
English TIMIT corpus.
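
As a rough illustration of what a bigram model over unit labels looks like, the sketch below estimates P(current label | previous label) from label sequences and scores candidate sequences by log-probability. The abstract does not specify the model's details, so everything here is an assumption: the class name BigramModel, the additive smoothing with parameter alpha, the start symbol "<s>", and the toy phone-label sequences (which are not TIMIT data).

```python
import math
from collections import Counter
from typing import Iterable, Sequence


class BigramModel:
    """Bigram model over unit labels with additive (add-alpha) smoothing."""

    START = "<s>"  # assumed sentence-start symbol

    def __init__(self, sequences: Iterable[Sequence[str]], alpha: float = 1.0):
        self.alpha = alpha
        self.context_counts: Counter = Counter()  # count of label as context
        self.bigram_counts: Counter = Counter()   # count of (prev, cur) pairs
        self.vocab = set()
        for seq in sequences:
            padded = [self.START, *seq]
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev: str, cur: str) -> float:
        """Smoothed estimate of P(cur | prev)."""
        num = self.bigram_counts[(prev, cur)] + self.alpha
        den = self.context_counts[prev] + self.alpha * len(self.vocab)
        return num / den

    def log_prob(self, seq: Sequence[str]) -> float:
        """Log-probability of a whole label sequence under the model."""
        padded = [self.START, *seq]
        return sum(math.log(self.prob(p, c))
                   for p, c in zip(padded, padded[1:]))


# Toy training data: phone-label sequences (illustrative only).
model = BigramModel([["sh", "iy"], ["sh", "ix"], ["sh", "iy"]])

# A sequence seen more often in training scores higher.
better = model.log_prob(["sh", "iy"])
worse = model.log_prob(["sh", "ix"])
```

One way such scores could enter unit selection is as a replacement for, or a supplement to, a hand-designed cost: the negative log-probability of a candidate transition behaves like a cost that a dynamic-programming search can minimize. Whether the paper uses it in exactly this way is not stated in the abstract.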
@article{"International Journal of Electrical, Electronic and Communication Sciences:56326",
  author = "Mohamed Ali KAMMOUN and Ahmed Ben HAMIDA",
  title = "Unit Selection Algorithm Using Bi-grams Model For Corpus-Based Speech Synthesis",
  abstract = "In this paper, we present a novel statistical approach to
corpus-based speech synthesis. Classically, phonetic information is
defined and treated as an acoustic reference to be respected.
Accordingly, many studies have addressed the classification of
acoustic units, which separates units according to their symbolic
characteristics, and unit selection has classically been driven by a
target cost and a concatenation cost.
In corpus-based speech synthesis systems that use large text
corpora, cost functions have been limited to a juxtaposition of
symbolic criteria, and the acoustic information of the units is not
exploited in the definition of the target cost.
In this manuscript, we take into consideration the phonetic
information of each unit that corresponds to its acoustic information.
This is realized by defining a probabilistic linguistic bigram model
used for unit selection. The selected units are extracted from the
English TIMIT corpus.",
  keywords = "Unit selection, Corpus-based Speech Synthesis, Bigram model",
  volume = "3",
  number = "11",
  pages = "2050-6",
}