
Speech Synthesis Technology

ISHA GAUTAM

(A.P., Electronics & Electrical Engg.)


Manav Rachna International University, Faridabad, India

ABSTRACT
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer and can be implemented in software or hardware. Automatic generation of speech waveforms has been under development for several decades. Recent progress in speech synthesis has produced synthesizers with very high intelligibility, but sound quality and naturalness still remain a major problem. Nevertheless, the quality of present products has reached an adequate level for several applications, such as multimedia and telecommunications. With audiovisual information or facial animation (a talking head) it is possible to increase speech intelligibility considerably. This paper describes work in developing multilingual and international speech applications [1] using speech synthesis technologies such as Text-To-Speech (TTS), formant, and articulatory synthesis. Important issues include enumerating a phone set to represent different languages, selecting the basic unit of synthesis (half-phones, diphones, syllables, etc.), and then creating a generic acoustic database that covers language variations for modeling language-specific prosody. Voice-enabled services are growing rapidly around the world, and it is very difficult to build a separate speech synthesizer for each language. The focus is therefore also on developing common multilingual corpora with support for multiple languages and on building appropriate language-specific linguistic analysis modules for Text-To-Speech (TTS) synthesis. Speech synthesis technology still faces various challenges, including text normalization, text-to-phoneme conversion, evaluation, prosody, and emotional content.

Key words: TTS, prosody, linguistic data, intonation, syllables, phonemes.

1. INTRODUCTION

Speech is the primary means of communication among people. The simplest way to produce synthetic speech is to play long prerecorded samples of natural speech, such as single words, as is done in announcing and information systems. This concatenation of recordings provides high quality and naturalness, but it has a limited vocabulary and usually only one voice, and it is impossible to create a database of all words and common names in the world. It may even be inappropriate to call this speech synthesis, because it consists only of recordings. Thus, for unrestricted (Text-To-Speech) synthesis, shorter pieces of the speech signal are used, such as syllables, phonemes, diphones, or even shorter segments. Speech quality is a multi-dimensional term.

2. VARIOUS TECHNIQUES

2.1 Text-To-Speech Synthesis

The Text-To-Speech (TTS) synthesis procedure consists of two main phases. The first is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation; the second is the generation of speech waveforms, where the acoustic output is produced from this phonetic and prosodic information. These two phases are usually called high-level and low-level synthesis, respectively. The input text might be data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is then preprocessed and analyzed.

2.2 Formant Synthesis

Formant synthesis is a kind of source-filter method based on mathematical models of the human speech organs. The vocal tract pipe is modeled by a number of resonances resembling the formants (frequency bands with high energy) in natural speech. The first electronic voices, the Vocoder and later OVE and PAT, spoke with totally synthetic, electronically produced sounds using formant synthesis.

2.3 Articulatory Synthesis

In articulatory synthesis, models of the human articulators (tongue, lips, teeth, jaw) and vocal cords are used to simulate how an airflow passes through the vocal tract and to calculate what the resulting sound will be like. It is a great challenge to find good mathematical models, and the development of articulatory synthesis is therefore still at the research stage. The technique is very computation-intensive: its memory requirement is almost nothing, but its CPU usage is large.

2.4 Concatenative Synthesis

The concatenative method provides more natural and individual-sounding speech, but the quality of some consonants may vary considerably, and controlling pitch and duration may in some cases be difficult, especially with longer units. However, diphone methods such as PSOLA may be used, for example. Other efforts at controlling pitch and duration have been made, for example by Galanes et al. [8], who proposed an interpolation/decimation method for re-sampling the speech signals. With concatenation methods, collecting and labeling the speech samples has usually been difficult and very time-consuming. Currently, most of this work can be done automatically, for example by using speech recognition.

2.5 HMM Synthesis

A quite new technology is speech synthesis based on HMMs, a mathematical concept called Hidden Markov Models. It is a statistical method in which the TTS system is based on a model that is not known beforehand but is refined by continuous training. The technique consumes large CPU resources but very little memory. This approach seems to give better prosody, without glitches, while still producing very natural-sounding speech.

2.6 Hybrid Synthesis

Computational capabilities are increasing rapidly and the analysis methods of speech production are developing fast, so combined methods may be useful in the future. Naturally, some combinations and modifications of the basic methods have been used with variable success. An interesting approach is to use a hybrid system in which the formant and concatenative methods are applied in parallel to the phonemes for which each is most suitable [9].

3. CUSTOMIZATIONS AND IMPROVEMENTS

On top of using the best voices available, a layer of improvement can be deployed: both general and customer-specific customizations. Current systems have linguists with long experience of speech synthesis working with transcriptions to tweak the pronunciation and reading of the spoken text. Such a system therefore offers great help to all customers who are willing to put in some effort to get the best possible speech quality. Sometimes it is enough to do quality control by listening to a couple of hours of one's website and correcting the errors that occur. Sometimes a site contains many brands and specific words for which correct pronunciation is of great importance. One of the largest customizations was for a customer who sent a list of over 3000 words that had to be quality-controlled. Another customization was for a site of about 200,000 pages where the same acronym or abbreviation had to be expanded differently depending on where on the site it was mentioned. Many users wonder why the same voice reads so much better in these services than when the same voice, or TTS system, is used to read similar or identical content with other software or services. The answer is the customizations mentioned above.

4. PHONETICS AND THEORY OF SPEECH PRODUCTION

In most languages, the written text does not correspond to its pronunciation, so that in order to describe correct pronunciation some kind of symbolic presentation is needed. Every language has a different phonetic alphabet and a different set of possible phonemes and their combinations. The number of phonetic symbols is between 20 and 60 in each language [12]. A set of phonemes can be defined as the minimum number of symbols needed to describe every possible word in a language; in English there are about 40 phonemes [5]. Due to the complexity of and differing kinds of definitions, the number of phonemes in English and most other languages cannot be
defined exactly. Phonemes are abstract units, and their pronunciation depends on contextual effects, the speaker's characteristics, and emotions. During continuous speech, the articulatory movements depend on the preceding and the following phonemes: the articulators are in different positions depending on the preceding phoneme, and they prepare for the following phoneme in advance. This causes variations in how the individual phoneme is pronounced. These variations are called allophones, which are subsets of phonemes, and the effect is known as co-articulation. For example, the word lice contains a light /l/ and small contains a dark /l/. These l's are the same phoneme but different allophones, with different vocal tract configurations.

The phonetic alphabet is usually divided into two main categories, vowels and consonants. Vowels are always voiced sounds, produced with the vocal cords in vibration, while consonants may be either voiced or unvoiced. Vowels have considerably higher amplitude than consonants, and they are also more stable and easier to analyze and describe acoustically. Because consonants involve very rapid changes, they are more difficult to synthesize properly. Some efforts to construct language-independent phonemic alphabets have been made during the last decades. One of the best known is IPA (the International Phonetic Alphabet), which consists of a huge set of symbols for phonemes, suprasegmentals, tone/word accent contours, and diacritics; for example, there are over twenty symbols for fricative consonants alone [2]. Its complexity and its use of Greek symbols make the IPA alphabet quite unsuitable for computers, which usually require standard ASCII as input. Another such phonetic set is SAMPA (Speech Assessment Methods - Phonetic Alphabet), which is designed to map IPA symbols to 7-bit printable ASCII characters. In the SAMPA system, the alphabets for each language are designed individually.

5. REPRESENTATION AND ANALYSIS OF SPEECH SIGNALS

Continuous speech is a set of complicated audio signals, which makes producing them artificially difficult. Speech signals are usually considered voiced or unvoiced, but in some cases they are something between these two. Voiced sounds consist of the fundamental frequency (F0) and its harmonic components produced by the vocal cords (vocal folds). The vocal tract modifies this excitation signal, causing formant (pole) and sometimes anti-formant (zero) frequencies [13]. Each formant frequency also has an amplitude and a bandwidth. The fundamental frequency and the formant frequencies are probably the most important concepts in speech synthesis, and in speech processing in general. Whispering is a special case of speech: when whispering a voiced sound, there is no fundamental frequency in the excitation, and the first formant frequencies produced by the vocal tract are perceived. Speech signals of the three vowels (/a/, /i/, /u/) are presented in the time and frequency domains in Fig. 1. The fundamental frequency is about 100 Hz in all cases, and the formant frequencies F1, F2, and F3 of vowel /a/ are approximately 600 Hz, 1000 Hz, and 2500 Hz, respectively. With vowel /i/ the first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and with /u/ they are 300 Hz, 600 Hz, and 2300 Hz. The harmonic structure of the excitation is also easy to perceive in the frequency-domain presentation.

Fig. 1. The time- and frequency-domain presentation of vowels /a/, /i/ and /u/

It can be seen that the first three formants are inside the normal telephone channel (from 300 Hz to 3400 Hz), so the bandwidth needed for intelligible speech is not very wide. For higher quality, up to 10 kHz bandwidth may be used, which leads to a 20 kHz sampling frequency. Although the fundamental frequency may be outside the telephone channel, the human hearing system is capable of reconstructing it from its harmonic components. Another commonly used method to describe a speech signal is the spectrogram, which is a time-frequency-amplitude presentation of a signal. The spectrogram and the time-domain waveform of the Finnish word kaksi (two) are presented in Fig. 2.

Fig. 2. Spectrogram and time-domain presentation of the Finnish word kaksi (two)
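A spectrogram of the kind shown in Fig. 2 can be computed with a short-time Fourier transform. The following is a minimal sketch using only NumPy; the frame length, hop size, and the synthetic 100 Hz test signal are illustrative choices, not taken from the paper:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Time-frequency-amplitude presentation: windowed frames -> magnitude FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrum per frame
    return 20 * np.log10(spectra + 1e-10)           # amplitudes in dB

# Illustrative input: a square-wave-like voiced signal at 100 Hz, sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sign(np.sin(2 * np.pi * 100 * t))
S = spectrogram(x)
print(S.shape)   # (number of frames, number of frequency bins)
```

Each row of S is one analysis frame; plotting S with time on one axis and frequency on the other gives the familiar spectrogram picture, with the harmonic structure of the excitation visible as horizontal stripes.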

For determining the fundamental frequency or pitch of speech, cepstral analysis may be used [5]. The cepstrum is obtained by first windowing the signal and taking its Discrete Fourier Transform (DFT), then taking the logarithmic power spectrum, and finally transforming it back to the time domain by the Inverse Discrete Fourier Transform (IDFT). The procedure is shown in Fig. 3.

Fig. 3. Cepstral analysis

Cepstral analysis provides a method for separating the vocal tract information from the excitation; the reverse transformation can be carried out to provide a smoother power spectrum, a technique known as homomorphic filtering. The fundamental frequency or intonation contour over the sentence is important for correct prosody and natural-sounding speech. The different contours are usually analyzed from natural speech in specific situations and with specific speaker characteristics, and are then applied as rules to generate the synthetic speech. The fundamental frequency contour can be viewed as the composite set of hierarchical patterns shown in Fig. 4; the overall contour is generated by the superposition of these patterns.

Fig. 4. Hierarchical levels of fundamental frequency

6. SPEECH PRODUCTION

Human speech is produced by the vocal organs presented in Fig. 5. The main energy source is the lungs with the diaphragm. When speaking, the air flow is forced through the glottis between the vocal cords and the larynx to the three main cavities of the vocal tract: the pharynx and the oral and nasal cavities. From the oral and nasal cavities the air flow exits through the mouth and nose, respectively. The V-shaped opening between the vocal cords, called the glottis, is the most important sound source in the vocal system. The vocal cords may act in several different ways during speech. Their most important function is to modulate the air flow by rapidly opening and closing, causing a buzzing sound from which vowels and voiced consonants are produced.

The fundamental frequency of vibration depends on the mass and tension of the cords and is about 110 Hz, 200 Hz, and 300 Hz for men, women, and children, respectively. With stop consonants the vocal cords may act suddenly, moving from a completely closed position, in which they cut the air flow, to a totally open position, producing a light cough or a glottal stop. With unvoiced consonants, such as /s/ or /f/, they may be completely open; an intermediate position may also occur with, for example, phonemes like /h/.
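Fundamental frequencies in these ranges can be estimated with the cepstral procedure described in Section 5 (window, DFT, log power spectrum, IDFT). The following is a minimal sketch; the synthetic 200 Hz voiced-like test frame and the search range are illustrative assumptions, not values from the paper:

```python
import numpy as np

def cepstral_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as in Fig. 3: window -> DFT -> log power spectrum -> IDFT,
    then pick the strongest cepstral peak in the plausible pitch-period range."""
    windowed = frame * np.hanning(len(frame))
    log_power = np.log(np.abs(np.fft.fft(windowed)) ** 2 + 1e-12)
    cepstrum = np.real(np.fft.ifft(log_power))
    lo, hi = int(fs / f0_max), int(fs / f0_min)   # quefrency (lag) search range
    period = lo + np.argmax(cepstrum[lo:hi])      # pitch period in samples
    return fs / period

# Illustrative voiced-like frame: 200 Hz fundamental plus nine harmonics, fs = 8 kHz
fs, f0, n = 8000, 200, 1024
t = np.arange(n) / fs
frame = sum(np.cos(2 * np.pi * f0 * k * t) for k in range(1, 11))
print(cepstral_f0(frame, fs))
```

The log operation turns the harmonic comb of a voiced spectrum into a periodic ripple, so the IDFT concentrates the excitation information into a peak at the pitch period, well separated from the low-quefrency vocal tract envelope.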
Fig. 5. The human vocal organs: (1) nasal cavity, (2) hard palate, (3) alveolar ridge, (4) soft palate (velum), (5) tip of the tongue, (6) dorsum, (7) uvula, (8) radix, (9) pharynx, (10) epiglottis, (11) false vocal cords, (12) vocal cords, (13) larynx, (14) esophagus, and (15) trachea

The pharynx connects the larynx to the oral cavity. It has almost fixed dimensions, but its length may be changed slightly by raising or lowering the larynx at one end and the soft palate at the other end. The soft palate also isolates or connects the route from the nasal cavity to the pharynx. At the bottom of the pharynx are the epiglottis and the false vocal cords, which prevent food from reaching the larynx and isolate the esophagus acoustically from the vocal tract. The epiglottis, the false vocal cords, and the vocal cords are closed during swallowing and open during normal breathing.

The oral cavity is one of the most important parts of the vocal tract. Its size, shape, and acoustics can be varied by the movements of the palate, the tongue, the lips, the cheeks, and the teeth. The tongue in particular is very flexible: the tip and the edges can be moved independently, and the entire tongue can move forward, backward, up, and down. The lips control the size and shape of the mouth opening through which the speech sound is radiated. Unlike the oral cavity, the nasal cavity has fixed dimensions and shape. Its length is about 12 cm and its volume about 60 cm³. The air stream to the nasal cavity is controlled by the soft palate.

From a technical point of view, the glottally excited vocal tract may be approximated as a straight pipe closed at the vocal cords, where the acoustical impedance Zg = ∞, and open at the mouth (Zm = 0). In this case the volume-velocity transfer function of the vocal tract is [12]

V(ω) = 1 / cos(ωl / c),

where l is the length of the tube, ω is the radian frequency, and c is the sound velocity. The denominator is zero at the frequencies Fi = ωi/2π, where ωi·l/c = (2i − 1)·π/2 (i = 1, 2, 3, ...), i.e. Fi = (2i − 1)c/(4l). If l = 17 cm, V(ω) is infinite at the frequencies Fi = 500, 1500, 2500, ... Hz, which means resonances every 1 kHz starting at 500 Hz. If the length l is other than 17 cm, the frequencies Fi are scaled by the factor 17/l.

Vowels and consonants can be approximated with a two-tube and a three-tube model, respectively, as presented in Fig. 6, where lb, lc, and lf are the lengths of the back, center, and front tubes. With a typical constriction length of 3 cm, the constriction resonances occur at multiples of 5333 Hz and can be ignored in applications that use less than 5 kHz bandwidth [12]. For example, with vowel /a/ the narrower tube represents the pharynx opening into the wider tube representing the oral cavity. If both tubes are assumed to have an equal length of 8.5 cm, formants occur at twice the frequencies noted earlier for a single tube [12].

Fig. 6. Examples of two- and three-tube models for the vocal tract

7. CONCLUSION

Speech synthesis has been developed steadily over the last decades, and it has been incorporated into several new applications. For most applications, the intelligibility and comprehensibility of synthetic speech have reached an acceptable level. However, in the fields of prosody, text preprocessing, and pronunciation there is still much work to be done to achieve more natural-sounding speech. Present speech synthesis systems are so complicated that one researcher cannot handle the entire system.
With good modularity it is possible to divide the
system into several individual modules whose
developing process can be done separately if the
communication between the modules is made
carefully.
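The straight-tube resonances quoted in Section 6 can be reproduced with a few lines of code. This minimal sketch uses the constants implied by the text (c = 34000 cm/s, so that a 17 cm tube resonates at 500, 1500, 2500 Hz):

```python
# Resonances of a uniform tube closed at the glottis and open at the lips:
# V(w) = 1/cos(w*l/c) is infinite where cos(w*l/c) = 0, i.e. Fi = (2i-1)*c/(4*l).
C = 34000.0   # sound velocity in cm/s (value implied by the 17 cm / 500 Hz example)

def tube_resonances(length_cm, n=3):
    """First n resonance frequencies (Hz) of a closed-open tube of given length."""
    return [(2 * i - 1) * C / (4 * length_cm) for i in range(1, n + 1)]

print(tube_resonances(17.0))   # -> [500.0, 1500.0, 2500.0]: every 1 kHz from 500 Hz
print(tube_resonances(8.5))    # -> [1000.0, 3000.0, 5000.0]: halving l doubles Fi
```

The second call illustrates the 17/l scaling rule: with an 8.5 cm tube, as in the two-tube approximation of vowel /a/, the resonances occur at twice the single-tube frequencies.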
Three basic methods are used in speech synthesis. The most commonly used techniques in present systems are based on formant and concatenative synthesis. The latter is becoming more and more popular, since the methods for minimizing the problems with discontinuity effects at the concatenation points are becoming more effective. The concatenative method provides more natural and individual-sounding speech, but the quality of some consonants may vary considerably, and controlling pitch and duration may in some cases be difficult, especially with longer units; however, diphone methods such as PSOLA may be used, and other efforts at controlling pitch and duration have been made [8]. With formant synthesis the quality of synthetic speech is more constant, but the speech sounds slightly more unnatural, and individual-sounding speech is more difficult to achieve. Formant synthesis is also more flexible and allows good control of the fundamental frequency. The third basic method, articulatory synthesis, is perhaps the most feasible in theory, especially for stop consonants, because it models the human articulation system directly. On the one hand, articulatory methods are usually rather complex and their computational load is high, so their potential has not been realized yet. On the other hand, computational capabilities are increasing rapidly and the analysis methods of speech production are developing fast, so the method may be useful in the future.

Naturally, some combinations and modifications of these basic methods have been used with variable success. An interesting approach is to use a hybrid system in which the formant and concatenative methods are applied in parallel to the phonemes for which each is most suitable [9-10]. In general, combining the best parts of the basic methods is a good idea, but in practice controlling the synthesizer may become difficult. Some speech coding methods have also been applied to speech synthesis, such as Linear Predictive Coding and sinusoidal modeling. Actually, the first speech synthesizer, VODER, was developed from the speech coding system VOCODER [9-10].

Linear prediction has been used for several decades, but with the basic method the quality has been quite poor. However, with some modifications, such as Warped Linear Prediction (WLP), considerable achievements have been reported [2]. Warped filtering takes advantage of hearing properties, so it is perhaps useful in all source-filter-based synthesis methods. Sinusoidal models have also been applied to speech synthesis for about a decade. Like PSOLA methods, sinusoidal modeling is best suited for periodic signals, and the representation of unvoiced speech is difficult; however, the sinusoidal methods have been found useful for singing voice synthesis [5-7]. Some other techniques, such as artificial neural networks, have also been applied to speech synthesis.

8. REFERENCES

[1] Acero A. Multilingual & International Speech Applications. SpeechTek West, Hilton San Francisco (2007).

[2] Bell Laboratories TTS Homepage. <http://www.bell-labs.com/project/tts/>; Bellcore ORATOR Homepage. <http://www.bellcore.com/ORATOR> (1998).

[3] Beskow J., Dahlquist M., Granström B., Lundeberg M., Spens K-E. and Öhman T. The Teleface Project - Disability, Feasibility, and Intelligibility. Proceedings of Fonetik 97 (1997).

[4] Beskow K., Elenius K. and McGlashan S. The OLGA Project: An Animated Talking Agent in a Dialogue System. Proceedings of Eurospeech 97. <http://www.speech.kth.se/multimodal/papers/> (1997).

[5] Altosaar T., Karjalainen M. and Vainio M. A Multilingual Phonetic Representation and Analysis for Different Speech Databases. Proceedings of ICSLP 96 (3) (1996).

[6] Beskow J. Talking Heads - Communication, Articulation and Animation. Proceedings of Fonetik 96: 53-56 (1996).

[7] Amundsen M. MAPI, SAPI, and TAPI Developers Guide. Sams Publishing (1996).

[8] Benoit C. Speech Synthesis: Present and Future. European Studies in Phonetics & Speech Communication, Netherlands, pp. 119-123 (1995).

[9] Belhoula K. Rule-Based Grapheme-to-Phoneme Conversion of Names. Proceedings of Eurospeech 93 (2): 881-884 (1993).

[10] Abadjieva E., Murray I. and Arnott J. Applying Analysis of Human Emotion Speech to Enhance Synthetic Speech. Proceedings of Eurospeech 93 (2): 909-912 (1993).