Speech Synthesis
ISHA GAUTAM
ABSTRACT
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer and can be implemented in software or hardware. The automatic generation of speech waveforms has been under development for several decades. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness remain a major problem. However, the quality of present products has reached an adequate level for several applications, such as multimedia and telecommunications. With some audiovisual information or facial animation (a talking head), it is possible to increase speech intelligibility considerably. This paper describes the work in developing multilingual and international speech applications (SpeechTek West [1], Hilton San Francisco) using speech synthesis technologies such as Text-To-Speech (TTS), formant, and articulatory synthesis. Important issues are involved in enumerating a phone set to represent different languages and in selecting the basic unit of synthesis: half-phones, diphones, syllables, etc. A further step is creating a generic acoustic database that covers language variation for modeling language-specific prosody. Voice-enabled services are growing rapidly around the world, and it is very difficult to have one speech synthesizer for each language. The focus is therefore also on developing common multilingual corpora with support for multiple languages and on building appropriate language-specific linguistic analysis modules for Text-To-Speech (TTS) synthesis. Speech synthesis technology still faces various challenges, such as text normalization, text-to-phoneme conversion, evaluation, prosody, and emotional content.
2. VARIOUS TECHNIQUES
In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and the vocal cords are used to simulate how an airflow passes through the vocal tract, in order to calculate what the resulting sound will be like. It is a great challenge to find good mathematical models, and the development of articulatory synthesis is therefore still at the research stage. The technique is very computation-intensive: its memory requirement is almost negligible, but its CPU usage is large.
2.4 Concatenative Synthesis to your website and correct the errors which occurs.
Sometimes there is a lot of brands and specific words
The concatenative method provides more natural on the site with great importance that they are
and individual sounding speech but the quality with pronounced correctly in all manners . One of the
some consonants may vary considerably and the largest customizations for a customer who sent us a
controlling of pitch and duration may be in some list of over 3000 words that had to be quality
cases difficult, especially with longer units. However, controlled. Another customization was for a site with
for an example diaphone methods such as PSOLA about 200000 pages where the same acronym or
may be used. Some other efforts for controlling abbreviation should be expanded differently
of pitch and duration have been made by depending on at what part of the site it was
example of Galanes et al.[8]. They proposed an mentioned. Many users wonder why the same voice
interpolation/decimation method for re-sampling the reads so much better when it is used in services
speech signals. With concatenation methods the compared to when the same voice, or Text-To-
collecting and labeling of speech samples have Speech (TTS) system, is used for reading similar, or
usually been difficult and very time-consuming. the same, content with other software’s or services.
Currently most of this work can be done The answer is the above mentioned customizations.
automatically by using for example speech
recognition. 4. PHONETICS AND THEORY OF SPEECH
PRODUCTION
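The joining step itself can be illustrated with a short sketch. The following Python fragment is a minimal illustration, not a production unit-selection system; the unit waveforms and the 10 ms overlap are assumed values. It concatenates pre-recorded unit waveforms with a linear crossfade to smooth the joints:

```python
import numpy as np

def concatenate_units(units, sample_rate=16000, overlap_ms=10):
    """Join pre-recorded unit waveforms with a short linear crossfade.

    `units` is a list of 1-D numpy arrays (e.g. diphone recordings),
    each assumed to be longer than the overlap region. A crossfade at
    each joint hides discontinuities at the unit boundaries.
    """
    overlap = int(sample_rate * overlap_ms / 1000)
    out = units[0].astype(np.float64)
    for unit in units[1:]:
        unit = unit.astype(np.float64)
        fade_out = np.linspace(1.0, 0.0, overlap)
        fade_in = np.linspace(0.0, 1.0, overlap)
        # Blend the tail of the signal so far with the head of the next unit.
        joint = out[-overlap:] * fade_out + unit[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], joint, unit[overlap:]])
    return out
```

PSOLA-style pitch and duration modification would operate on the pitch-marked units before this joining step.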
2.5 HMM Synthesis
A quite new technique is speech synthesis based on hidden Markov models (HMMs), a statistical method in which the TTS system is based on a model that is not known beforehand but is refined by continuous training. The technique consumes large CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding speech.
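As a toy illustration of the training step, the sketch below fits a three-state Gaussian HMM to synthetic two-dimensional feature frames using the hmmlearn library. The data, the feature dimension, and the final sampling step are illustrative assumptions; real HMM-based synthesizers use context-dependent models and a dedicated parameter-generation algorithm rather than random sampling:

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

# Toy acoustic features standing in for e.g. cepstral frames of one phone.
rng = np.random.default_rng(0)
frames = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(100, 2)),  # onset-like frames
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),   # steady-state frames
    rng.normal(loc=2.0, scale=0.5, size=(100, 2)),   # offset-like frames
])

# A three-state model is a common choice for a phone-sized unit.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(frames)  # Baum-Welch re-estimation: the "continuous training"

# At synthesis time, parameter trajectories are generated from the model;
# here we simply sample frames as a crude stand-in for that step.
generated, states = model.sample(60)
print(generated.shape, states[:10])
```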
2.6 Hybrid Synthesis

On top of using the best voices available, a layer of improvement can be deployed: both general and customer-specific customizations. Current systems have linguists with long experience of speech synthesis working with transcriptions to tweak the pronunciation and reading of the spoken text. Such a system therefore offers great help to all customers willing to put in some effort to get the best possible speech quality. Sometimes it is enough to do a quality control of a couple of hours of listening to your website and to correct the errors that occur. Sometimes a site contains many brands and specific words for which correct pronunciation is especially important. One of the largest customizations was for a customer who sent us a list of over 3000 words that had to be quality-controlled. Another customization was for a site with about 200000 pages where the same acronym or abbreviation had to be expanded differently depending on which part of the site it appeared in. Many users wonder why the same voice reads so much better when it is used in these services than when the same voice, or Text-To-Speech (TTS) system, is used for reading similar, or even the same, content with other software or services. The answer is the customizations mentioned above.
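A minimal sketch of such customizations is shown below. All lexicon entries, abbreviations, and section names are hypothetical; real TTS engines expose similar hooks through user dictionaries and text pre-processing rules:

```python
# Hypothetical pronunciation lexicon: brand names mapped to respellings
# that the engine reads correctly.
PRONUNCIATION_LEXICON = {
    "ASUS": "ay-soos",
    "Nginx": "engine-ex",
}

# The same abbreviation expanded differently depending on the site section.
CONTEXT_EXPANSIONS = {
    ("medical", "Dr."): "Doctor",
    ("maps", "Dr."): "Drive",
}

def preprocess(text: str, section: str) -> str:
    """Apply context-dependent expansions, then pronunciation respellings."""
    for (ctx, abbrev), expansion in CONTEXT_EXPANSIONS.items():
        if ctx == section:
            text = text.replace(abbrev, expansion)
    for word, respelling in PRONUNCIATION_LEXICON.items():
        text = text.replace(word, respelling)
    return text

print(preprocess("Turn left on Birch Dr.", section="maps"))
# -> "Turn left on Birch Drive"
```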
4. PHONETICS AND THEORY OF SPEECH PRODUCTION

In most languages, the written text does not correspond to its pronunciation, so that in order to describe correct pronunciation some kind of symbolic presentation is needed. Every language has a different phonetic alphabet and a different set of possible phonemes and their combinations. The number of phonetic symbols is between 20 and 60 in each language [12]. A set of phonemes can be defined as the minimum number of symbols needed to describe every possible word in a language. In English there are about 40 phonemes [5]. Due to the complexity and the different kinds of definitions in use, the number of phonemes in English and most other languages cannot be defined exactly.

Phonemes are abstract units, and their pronunciation depends on contextual effects, the speaker's characteristics, and emotions. During continuous speech, the articulatory movements depend on the preceding and following phonemes. The articulators are in different positions depending on the preceding phoneme, and they prepare for the following phoneme in advance. This causes some variation in how an individual phoneme is pronounced. These variants are called allophones, which are subsets of phonemes, and the effect is known as co-articulation. For example, the word lice contains a light /l/ and small contains a dark /l/. These l's are the same phoneme but different allophones, and they have different vocal tract configurations.
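A toy lexicon lookup with a crude letter-to-sound fallback illustrates the idea of mapping words to phoneme strings. The ARPAbet-like entries below are illustrative only, not drawn from any real lexicon:

```python
# Toy phoneme lexicon (hypothetical entries in an ARPAbet-like notation);
# real systems fall back to letter-to-sound rules for unknown words.
LEXICON = {
    "lice": ["L", "AY", "S"],
    "small": ["S", "M", "AO", "L"],
    "speech": ["S", "P", "IY", "CH"],
}

def to_phonemes(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Crude letter-to-sound fallback: one symbol per letter.
    return [c.upper() for c in word if c.isalpha()]

for w in ["speech", "synthesis"]:
    print(w, "->", to_phonemes(w))
```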
The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly. Some efforts to construct language-independent phonemic alphabets have been made during the last few decades.
The resonance frequencies of the vocal tract are called formant frequencies [13]. Each formant frequency also has an amplitude and a bandwidth. The fundamental frequency and the formant frequencies are probably the most important concepts in speech synthesis, and in speech processing in general. Whispering is a special case of speech: when whispering a voiced sound there is no fundamental frequency in the excitation, and only the formant frequencies produced by the vocal tract are perceived. Speech signals of the three vowels (/a/, /i/, /u/) are presented in the time and frequency domains in Fig. 1. The fundamental frequency is about 100 Hz in all cases, and the formant frequencies F1, F2, and F3 of vowel /a/ are approximately 600 Hz, 1000 Hz, and 2500 Hz respectively. For vowel /i/ the first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and for /u/ they are 300 Hz, 600 Hz, and 2300 Hz. The harmonic structure of the excitation is also easy to perceive in the frequency-domain presentation.
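These figures are enough to sketch a rudimentary formant synthesizer: an impulse train at the fundamental frequency is passed through a cascade of second-order resonators, one per formant. The formant values below are those quoted above for /a/; the bandwidths are assumed, since the text does not give them:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 100.0                           # fundamental frequency from Fig. 1
formants = [600.0, 1000.0, 2500.0]   # F1-F3 of /a/ as quoted above
bandwidths = [80.0, 90.0, 120.0]     # assumed values, not from the paper

# Glottal excitation approximated by an impulse train at f0.
n = fs  # one second of signal
excitation = np.zeros(n)
excitation[::int(fs / f0)] = 1.0

# Cascade of second-order resonators, one per formant.
signal = excitation
for f, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    b = [1 - r]                                  # rough gain normalization
    a = [1.0, -2 * r * np.cos(theta), r * r]     # resonator poles
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))                 # normalize to [-1, 1]
```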
For determining the fundamental frequency or pitch of speech, cepstral analysis may be used [5]. The cepstrum is obtained by first windowing the signal and taking its Discrete Fourier Transform (DFT), then computing the logarithmic power spectrum, and finally transforming it back to the time domain by the Inverse Discrete Fourier Transform (IDFT). The procedure is shown in Fig. 3.

Fig. 3. Cepstral analysis.

Cepstral analysis provides a method for separating the vocal tract information from the excitation; the reverse transformation can be carried out to provide a smoother power spectrum, a technique known as homomorphic filtering.
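The procedure of Fig. 3 translates directly into code. The sketch below, in which the frame length and the F0 search range are assumed values, computes the real cepstrum of a frame and reads the fundamental frequency off the quefrency peak:

```python
import numpy as np

def cepstral_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 via the cepstrum: window -> DFT -> log power -> IDFT."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    cepstrum = np.fft.irfft(log_power)
    # The pitch period shows up as a peak at quefrency 1/F0.
    q_min = int(fs / f0_max)
    q_max = int(fs / f0_min)
    peak = q_min + np.argmax(cepstrum[q_min:q_max])
    return fs / peak

# Quick check on a synthetic 100 Hz pulse train.
fs = 16000
t = np.arange(2048)
frame = np.where(t % (fs // 100) == 0, 1.0, 0.0)
print(round(cepstral_f0(frame, fs), 1))  # approximately 100.0
```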
The fundamental frequency or intonation contour over the sentence is important for correct prosody and natural-sounding speech. The different contours are usually analyzed from natural speech in specific situations and with specific speaker characteristics, and rules are then applied to generate the synthetic speech. The fundamental frequency contour can be viewed as a composite set of hierarchical patterns, shown in Fig. 4. The overall contour is generated by the superposition of these patterns.
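A crude illustration of this superposition is a declining phrase component plus localized accent bumps. The shapes and values below are illustrative only, loosely in the spirit of superposition models such as Fujisaki's, and are not taken from any analyzed data:

```python
import numpy as np

# One-second utterance sampled every 10 ms.
t = np.linspace(0.0, 1.0, 101)

# Hierarchical patterns (illustrative values):
phrase = 120.0 - 30.0 * t                            # slow declination over the sentence
accents = (20.0 * np.exp(-((t - 0.3) / 0.05) ** 2)   # accent on one syllable
           + 15.0 * np.exp(-((t - 0.7) / 0.05) ** 2))  # accent on another

f0_contour = phrase + accents  # overall contour = superposition of patterns
print(f0_contour.max(), f0_contour.min())
```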
Human speech is produced by the vocal organs presented in Fig. 5. The main energy source is the lungs with the diaphragm. When speaking, the air flow is forced through the glottis between the vocal cords and the larynx into the three main cavities of the vocal tract: the pharynx and the oral and nasal cavities. From the oral and nasal cavities the air flow exits through the mouth and nose respectively. The V-shaped opening between the vocal cords, called the glottis, is the most important sound source in the vocal system. The vocal cords may act in several different ways during speech. Their most important function is to modulate the air flow by rapidly opening and closing, causing the buzzing sound from which vowels and voiced consonants are produced.

The fundamental frequency of this vibration depends on the mass and tension of the cords and is about 110 Hz, 200 Hz, and 300 Hz for men, women, and children respectively. With stop consonants the vocal cords may act suddenly, moving from a completely closed position, in which they cut the air flow completely, to a totally open position, producing a light cough or a glottal stop. With unvoiced consonants, such as /s/ or /f/, on the other hand, they may be completely open. An intermediate position may also occur with, for example, phonemes like /h/.

If the vocal tract is approximated as a uniform lossless tube, its transfer function is proportional to 1/cos(wl/c), where l is the length of the tube, w is the radian frequency, and c is the sound velocity. The denominator is zero at the frequencies Fi = wi/2*pi = (2i - 1)c/4l (i = 1, 2, 3, ...), which are the resonance (formant) frequencies of the tube.
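With c = 340 m/s and a typical vocal tract length of 17 cm (an assumed value, not given in the text), these resonances work out as follows:

```python
# Formant frequencies of a uniform lossless tube, F_i = (2i - 1) * c / (4l).
c = 340.0   # sound velocity in m/s
l = 0.17    # vocal tract length in meters (typical adult male, assumed)

for i in range(1, 4):
    f = (2 * i - 1) * c / (4 * l)
    print(f"F{i} = {f:.0f} Hz")   # prints 500, 1500, 2500 Hz
```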