Abstract
A text-to-speech (TTS) system is one of the human-machine interfaces that use speech.
In recent years, TTS systems have been developed as output devices for human-machine
interfaces, and they are used in many applications such as car navigation systems,
information retrieval over the telephone, voice mail, and speech-to-speech translation
systems. However, most text-to-speech systems still cannot synthesize
speech with various voice characteristics such as speaker individualities and
emotions. To obtain various voice characteristics in text-to-speech systems based on
the selection and concatenation of acoustical units, a large amount of speech data is
necessary, but such data is difficult to collect, segment, and store. For these
reasons, an HMM-based text-to-speech system has been proposed as a way to construct
a speech synthesis system that can generate various voice characteristics. This
dissertation presents the construction of the HMM-based text-to-speech system, in
which spectrum, fundamental frequency, and duration are modeled simultaneously
in a unified HMM framework.
In the system, three main techniques are used: (1) a mel-cepstral analysis/synthesis
technique, (2) speech parameter modeling using HMMs, and (3) an algorithm for
generating speech parameters from HMMs. These three techniques give the system
several capabilities. First, because the TTS system uses the speech
parameter generation algorithm, the spectral and pitch parameters generated from
the trained HMMs can be similar to those of real speech. Second, because the system
generates speech from the HMMs, the voice characteristics of synthetic speech can be
changed by transforming the HMM parameters appropriately. Third, the system
is trainable. In this thesis, the three techniques above are presented first, and
simultaneous modeling of phonetic and prosodic parameters in an HMM framework
is then proposed.
Next, to improve the quality of synthesized speech, the mixed excitation model of
the MELP speech coder and a postfilter are incorporated into the system. Experimental
results show that the mixed excitation model and postfilter significantly improve
the quality of synthesized speech.
Finally, to synthesize speech with various voice characteristics
such as speaker individualities and emotions, a TTS system based on speaker
interpolation is presented.
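
As a rough illustration of the idea behind speaker interpolation, the sketch below blends the Gaussian output distributions of corresponding HMM states from several speakers using interpolation weights. It is a minimal sketch under stated assumptions, not the method used in the dissertation: the function name interpolate_gaussians is hypothetical, the distributions are assumed to be diagonal Gaussians, and both means and variances are interpolated linearly (other variance rules are possible).

```python
from typing import List, Sequence, Tuple


def interpolate_gaussians(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Blend diagonal Gaussians from several speakers (hypothetical helper).

    means[k] and variances[k] describe the same HMM state for speaker k.
    Weights are normalized to sum to one before use.
    Assumption: variances are interpolated linearly, like the means.
    """
    if not means or not (len(means) == len(variances) == len(weights)):
        raise ValueError("need one mean, variance, and weight per speaker")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    dim = len(means[0])
    if any(len(m) != dim for m in means) or any(len(v) != dim for v in variances):
        raise ValueError("all vectors must have the same dimension")

    # Weighted sum per dimension across speakers.
    mean = [sum(a * m[d] for a, m in zip(norm, means)) for d in range(dim)]
    var = [sum(a * v[d] for a, v in zip(norm, variances)) for d in range(dim)]
    return mean, var


if __name__ == "__main__":
    # Two speakers, two-dimensional feature vectors, weights 0.25 and 0.75.
    blended_mean, blended_var = interpolate_gaussians(
        means=[[0.0, 2.0], [4.0, 6.0]],
        variances=[[1.0, 1.0], [3.0, 3.0]],
        weights=[0.25, 0.75],
    )
    print(blended_mean, blended_var)
```

Changing the weights moves the blended distribution between the speakers, which is the sense in which interpolation can produce intermediate voice characteristics.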