Abstract: In this paper, we present a novel statistical approach to
corpus-based speech synthesis. Classically, phonetic information is
defined and treated as an acoustic reference to be respected.
Accordingly, many studies have addressed acoustic unit classification.
This type of classification separates units according to their
symbolic characteristics. Within this framework, a target cost and a
concatenation cost are classically defined for unit selection.
In corpus-based speech synthesis systems that use large text
corpora, cost functions have been limited to a juxtaposition of symbolic
criteria, and the acoustic information of the units is not exploited in
the definition of the target cost.
In this manuscript, we take into consideration the phonetic
information of each unit that corresponds to its acoustic information.
This is realized by defining a probabilistic linguistic bigram model
that is used for unit selection. The selected units are extracted from
the English TIMIT corpus.
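
As an illustration of how a bigram model over unit labels could score
candidate sequences, the following is a minimal sketch only. The abstract
does not specify how the model is estimated, so the add-one smoothing, the
function names (train_bigram, bigram_prob, sequence_cost) and the toy phone
sequences are all assumptions made for this example; they are not taken
from the paper or from TIMIT.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sequence-start symbol


def train_bigram(sequences):
    """Count context and bigram occurrences over phone-label sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = [START] + list(seq)
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def bigram_prob(prev, cur, model):
    """P(cur | prev) with add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))


def sequence_cost(seq, model):
    """Negative log-probability of a label sequence; lower is better."""
    padded = [START] + list(seq)
    return -sum(
        math.log(bigram_prob(prev, cur, model))
        for prev, cur in zip(padded, padded[1:])
    )


# Toy training data (invented for illustration, not TIMIT transcriptions).
toy_corpus = [["sh", "iy"], ["sh", "iy", "hh"], ["hh", "ae", "d"]]
model = train_bigram(toy_corpus)
```

With such a model, a sequence seen often in the training data, such as
["sh", "iy"], receives a lower cost than an unseen ordering such as
["iy", "sh"], so the cost could serve as one term in unit selection.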
Abstract: Concatenative speech synthesis is a method that can
produce speech with naturalness and high speaker individuality
by using a large speech corpus. Based on this method, in
this paper, we propose a voice conversion method whose converted
speech has high individuality and naturalness. We also conducted
two subjective evaluation experiments to assess the individuality and
sound quality of the converted speech. The results confirm the
following three facts: (a) the proposed method converts the
individuality of speakers well, (b) incorporating the unit selection
framework (especially the join cost) of concatenative speech synthesis
into conventional voice conversion improves the sound quality of the
converted speech, and (c) the proposed method is robust to gender
differences between the source speaker and the target speaker.
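
To make the role of the join cost in unit selection concrete, the
following is a minimal sketch of a dynamic-programming (Viterbi) search
that minimizes the sum of target costs and weighted join costs. The
abstract does not define the cost functions, so everything here is an
assumption for illustration: each candidate unit is a hypothetical tuple
(feature, left_edge, right_edge) of scalars, both costs are absolute
differences, and the names select_units and w_join are invented.

```python
def select_units(targets, candidates, w_join=1.0):
    """Return (path, total_cost) for the cheapest unit sequence.

    targets[i]    : target feature value for position i (non-empty list)
    candidates[i] : list of (feature, left_edge, right_edge) tuples
    """
    def target_cost(target, unit):
        return abs(target - unit[0])

    def join_cost(prev_unit, unit):
        # Mismatch between the right edge of one unit and the left edge
        # of the next (a stand-in for a spectral distance).
        return abs(prev_unit[2] - unit[1])

    # best[i][j] = (cumulative cost, backpointer) for candidate j at step i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return path, total


# Toy example: the first candidate matches the target exactly but joins
# badly; the second is slightly off target but joins smoothly.
targets = [1.0, 2.0]
candidates = [[(1.0, 0.0, 5.0), (1.2, 0.0, 1.0)], [(2.0, 1.0, 0.0)]]
path_with_join, cost_with_join = select_units(targets, candidates)
path_no_join, cost_no_join = select_units(targets, candidates, w_join=0.0)
```

In this toy case the search picks the smoothly joining candidate when the
join cost is active and the exact-match candidate when w_join is set to
zero, which is the trade-off the join cost is meant to control.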