Unit selection in a concatenative speech synthesis system using a large speech database
A. Hunt and A. Black. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 373-376. Atlanta, GA, USA (May 1996)
DOI: 10.1109/ICASSP.1996.541110
Abstract
One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
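The search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_units`, the callables `target_cost` and `join_cost`, and the `beam_width` parameter are hypothetical names introduced here, and the simple beam stands in for whatever pruning the paper actually uses. The cost callables are assumed to already include any trained weights.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target specification (hypothetical type)
U = TypeVar("U")  # database unit (hypothetical type)


def _prune(entries, beam_width):
    """Keep the beam_width cheapest partial paths (no pruning if 0)."""
    if beam_width > 0:
        return sorted(entries, key=lambda e: e[0])[:beam_width]
    return entries


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
    beam_width: int = 0,
) -> Tuple[List[U], float]:
    """Return the lowest-cost unit sequence and its total cost.

    candidates[i] lists the database units that may realise targets[i].
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("need exactly one candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate unit")

    # Each entry: (accumulated cost, index of best predecessor, unit).
    first = [(target_cost(targets[0], u), -1, u) for u in candidates[0]]
    steps = [_prune(first, beam_width)]

    for t, cands in zip(targets[1:], candidates[1:]):
        prev = steps[-1]
        current = []
        for u in cands:
            # Cheapest way to reach u: accumulated cost plus transition cost.
            best_cost, best_idx = min(
                (p_cost + join_cost(p_unit, u), idx)
                for idx, (p_cost, _, p_unit) in enumerate(prev)
            )
            # Add the state occupancy (target) cost of u itself.
            current.append((best_cost + target_cost(t, u), best_idx, u))
        steps.append(_prune(current, beam_width))

    # Trace back from the cheapest final entry.
    last = steps[-1]
    idx = min(range(len(last)), key=lambda i: last[i][0])
    total = last[idx][0]
    path: List[U] = []
    for step in reversed(steps):
        _, back, unit = step[idx]
        path.append(unit)
        idx = back
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Toy data: targets and units are plain numbers (purely illustrative).
    toy_targets = [1.0, 2.0, 3.0]
    toy_candidates = [[0.9, 1.5], [2.2, 1.0], [3.0, 2.5]]
    path, total = select_units(
        toy_targets,
        toy_candidates,
        target_cost=lambda t, u: abs(t - u),
        join_cost=lambda a, b: 0.5 * abs(b - a - 1.0),
    )
    print(path, total)
```

In this sketch the target cost plays the role of the state occupancy cost and the join cost plays the role of the transition cost; a real system would compute both from prosodic and phonetic features rather than from bare numbers.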
%0 Conference Paper
%1 Hunt1996
%A Hunt, Andrew J.
%A Black, Alan W.
%B Proceedings of the 1996 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
%C Atlanta, GA, USA
%D 1996
%K Speech synthesis;speech recognition;Viterbi algorithm;Viterbi decoding;pruned Viterbi search;search problems;state transition network;state occupancy cost;transition cost;State estimation;concatenative speech synthesis system;large speech database;natural-sounding synthesized speech waveforms;phoneme sequence;prosodic information;phonetic context information;training;natural speech;Natural languages;Network synthesis;Control system synthesis;Costs;Databases;Laboratories
%P 373-376
%R 10.1109/ICASSP.1996.541110
%T Unit selection in a concatenative speech synthesis system using a large speech database
%V 1
%X One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.
@inproceedings{Hunt1996,
abstract = {One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights which produce more natural speech than can be obtained by hand-tuning.},
added-at = {2021-02-01T10:51:23.000+0100},
address = {Atlanta, GA, USA},
author = {Hunt, Andrew J. and Black, Alan W.},
biburl = {https://www.bibsonomy.org/bibtex/215558da362f59d7fdd21eb4a7768b0b4/m-toman},
booktitle = {Proceedings of the 1996 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
doi = {10.1109/ICASSP.1996.541110},
file = {:pdfs/hunt_icassp_1996.pdf:PDF},
interhash = {b1658aa9854d8ac5f504120525b280a4},
intrahash = {15558da362f59d7fdd21eb4a7768b0b4},
issn = {1520-6149},
keywords = {Speech synthesis;speech recognition;Viterbi algorithm;Viterbi decoding;pruned Viterbi search;search problems;state transition network;state occupancy cost;transition cost;State estimation;concatenative speech synthesis system;large speech database;natural-sounding synthesized speech waveforms;phoneme sequence;prosodic information;phonetic context information;training;natural speech;Natural languages;Network synthesis;Control system synthesis;Costs;Databases;Laboratories},
month = may,
owner = {schabus},
pages = {373--376},
timestamp = {2021-02-01T10:51:23.000+0100},
title = {Unit selection in a concatenative speech synthesis system using a large speech database},
volume = 1,
year = 1996
}