
Simultaneous Modeling of Phonetic and Prosodic Parameters, and Characteristic Conversion for HMM-Based Text-to-Speech Systems

Nagoya Institute of Technology, Nagoya, Japan (2002)

Abstract

A text-to-speech (TTS) system is one of the human-machine interfaces that use speech. In recent years, TTS systems have been developed as output devices for human-machine interfaces, and they are used in many applications such as car navigation systems, information retrieval over the telephone, voice mail, and speech-to-speech translation systems. However, most text-to-speech systems still cannot synthesize speech with various voice characteristics such as speaker individualities and emotions. To obtain various voice characteristics in text-to-speech systems based on the selection and concatenation of acoustical units, a large amount of speech data is necessary, and such data is difficult to collect, segment, and store. From these points of view, an HMM-based text-to-speech system has been proposed in order to construct a speech synthesis system that can generate various voice characteristics. This dissertation presents the construction of the HMM-based text-to-speech system, in which spectrum, fundamental frequency, and duration are modeled simultaneously in a unified HMM framework. The system mainly uses three techniques: (1) a mel-cepstral analysis/synthesis technique, (2) speech parameter modeling using HMMs, and (3) a speech parameter generation algorithm from HMMs. Because it uses these three techniques, the system has several capabilities. First, since the TTS system uses the speech parameter generation algorithm, the spectral and pitch parameters generated from the trained HMMs can be similar to those of real speech. Second, since the system generates speech from the HMMs, the voice characteristics of synthetic speech can be changed by transforming the HMM parameters appropriately. Third, the system is trainable. In this thesis, the above three techniques are presented first, and simultaneous modeling of phonetic and prosodic parameters in an HMM framework is proposed.
Next, to improve the quality of synthesized speech, the mixed excitation model of the MELP speech coder and a postfilter are incorporated into the system. Experimental results show that the mixed excitation model and postfilter significantly improve the quality of synthesized speech. Finally, for the purpose of synthesizing speech with various voice characteristics such as speaker individualities and emotions, a TTS system based on speaker interpolation is presented.
