Slovenian Text-to-Speech Synthesis for Speech User Interfaces

The paper presents the design concept of a unit-selection text-to-speech synthesis system for the Slovenian language. Due to its modular and upgradable architecture, the system can be used in a variety of speech user interface applications, ranging from carrier-grade server voice portal applications and desktop user interfaces to specialized embedded devices. Since memory and processing power requirements are important factors for a possible implementation in embedded devices, lexica and speech corpora need to be reduced. We describe a simple and efficient implementation of a greedy subset selection algorithm that extracts a compact subset of text sentences with high coverage. An experiment on a reference text corpus showed that the subset selection algorithm produced a compact sentence subset with low redundancy. The adequacy of the spoken output was evaluated with several subjective tests recommended by the International Telecommunication Union (ITU).
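The greedy subset selection idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the unit inventory or the scoring function, so the code below assumes character bigrams as a stand-in for phonetic units and scores each sentence by the number of not-yet-covered units it contributes. The names `units_of` and `greedy_select` are hypothetical.

```python
from typing import Dict, List, Set


def units_of(sentence: str) -> Set[str]:
    """Hypothetical unit extractor: character bigrams stand in for phonetic units."""
    text = sentence.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def greedy_select(sentences: List[str], max_sentences: int) -> List[str]:
    """Repeatedly pick the sentence that adds the most not-yet-covered units."""
    unit_sets: Dict[int, Set[str]] = {i: units_of(s) for i, s in enumerate(sentences)}
    covered: Set[str] = set()
    selected: List[str] = []
    while unit_sets and len(selected) < max_sentences:
        # Ties are broken in favour of the earliest sentence in the corpus.
        best = max(unit_sets, key=lambda i: (len(unit_sets[i] - covered), -i))
        gain = unit_sets[best] - covered
        if not gain:
            break  # no remaining sentence adds a new unit
        covered |= gain
        selected.append(sentences[best])
        del unit_sets[best]
    return selected
```

Stopping as soon as no candidate adds a new unit keeps the selected subset compact, since any further sentence would be redundant with respect to the assumed unit inventory. A real corpus design would likely replace `units_of` with a grapheme-to-phoneme front end and might weight units by frequency.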



