Unit Selection Algorithm Using Bi-grams Model For Corpus-Based Speech Synthesis

In this paper, we present a novel statistical approach to corpus-based speech synthesis. Classically, phonetic information is defined and treated as the acoustic reference to be respected. Accordingly, many studies have addressed the classification of acoustic units, which separates units according to their symbolic characteristics. Unit selection has classically relied on two costs: a target cost and a concatenation cost. In corpus-based speech synthesis systems that use large text corpora, the cost functions have been limited to a juxtaposition of symbolic criteria, and the acoustic information of the units is not exploited in the definition of the target cost. In this paper, we take into account the phonetic information of each unit that corresponds to its acoustic information. This is done by defining a probabilistic linguistic bi-gram model that serves as the basis for unit selection. The selected units are extracted from the English TIMIT corpus.
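To make the idea of bi-gram-driven unit selection concrete, the following is a minimal sketch and not the method of this paper, whose model details are not given here. It assumes that units are represented by symbolic labels, that bi-gram probabilities are estimated with add-one smoothing, and that the best sequence is found with a Viterbi-style dynamic programming search. The names `train_bigram` and `select_units`, and the toy labels, are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab):
    """Return a function giving add-one-smoothed bi-gram log-probabilities.

    `sequences` is a list of unit-label sequences (toy stand-ins for a
    labelled speech corpus); `vocab` is the set of all unit labels.
    """
    context_counts = Counter()  # how often each label appears as a predecessor
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab)

    def logp(prev, cur):
        # Add-one (Laplace) smoothing: an assumption made for this sketch.
        return math.log((bigram_counts[(prev, cur)] + 1)
                        / (context_counts[prev] + vocab_size))

    return logp


def select_units(candidates, logp):
    """Pick one candidate per position, maximizing the summed bi-gram log-probability.

    `candidates` is a list of candidate lists, one per target position.
    Returns (best_score, best_path).
    """
    # best[label] = (score, path) of the best partial path ending in `label`
    best = {c: (0.0, [c]) for c in candidates[0]}
    for column in candidates[1:]:
        new_best = {}
        for cur in column:
            score, path = max(
                ((s + logp(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy usage with hypothetical unit labels.
corpus = [["s", "a", "t"], ["s", "a", "t"], ["s", "a", "d"], ["s", "e", "t"]]
vocab = {"s", "a", "t", "d", "e"}
logp = train_bigram(corpus, vocab)
score, path = select_units([["s"], ["a", "e"], ["t", "d"]], logp)
print(path, score)
```

In this toy setting the bi-gram score plays the role of the selection criterion on its own; a fuller system would presumably combine it with target and concatenation costs.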



