A Class of Physical Modeling Recurrent Networks For Analysis / Synthesis of Plucked String Instruments
Abstract—A new approach is proposed that closely synthesizes tones of plucked string instruments by using a class of physical modeling recurrent networks. The strategies employed in this paper consist of a fast training algorithm and a multistage training procedure that are able to obtain the synthesis parameters for a specific instrument automatically. The training vector can be recorded tones of most target plucked instruments, captured with ordinary microphones. The proposed approach delivers encouraging results when it is applied to different types of plucked string instruments such as steel-string guitar, nylon-string guitar, harp, Chin, Yueh-chin, and Pipa. The synthesized tones sound very close to the originals produced by their acoustic counterparts. In addition, this paper presents an embedded technique that can produce special effects such as vibrato and portamento that are vital to the playing of plucked-string instruments. The computation required in the resynthesis processing is also reasonable.

Index Terms—Physical modeling, plucked string instruments, portamento, recurrent networks.

I. INTRODUCTION

TRANSIENT responses of most acoustic instruments are very difficult to reproduce. This is also the main reason that synthetic sounds are not realistic enough with traditional approaches such as wavetable and FM methods. Model-based approaches claim to be able to reproduce such dynamics by modeling the sounding mechanism of a target instrument physically. Many works focus on analysis and modeling of piano soundboards, and top plates and air cavities of guitars and violins [1]–[3]. Techniques such as finite element based methods and ray-tracing methods are useful in analyzing musical instruments, but none of them are practical enough to be used to synthesize musical tones in real-time applications. The most successful applications of model-based techniques are compression, synthesis, and recognition of speech signals by simulating human vocal tracts with a class of digital lattice filters [4], [5]. Among several physical-modeling music synthesis methods, the digital waveguide filters (DWFs) [6]–[8], [11] and the wave digital filters (WDFs) [9], [10] are the most popular and practical ones. An efficient way of applying the DWF method to plucked-string instruments has been proposed in [12], [13]. There are some problems with these approaches, however. First, the synthesizer design is complicated. Second, the excitation wavetable takes a lot of memory. Finally, special effects such as portamento, commonly heard in plucked-string playing, are not addressed.

In [14] and [15], a recurrent network based approach called scattering recurrent network (SRN) for simulating the vibration of a plucked string succeeded in synthesizing plucked-string tones. The structure of the SRN is similar to a lattice filter because this is the basic form that simulates one-dimensional (1-D) wave propagation. One of the major contributions of this approach is that the system parameters can be determined automatically by using the backpropagation through time (BPTT) training algorithm [16]. However, there exist some difficulties when this technology is applied to practical music synthesis systems. First, structures of musical instruments are usually too complicated to be modeled by such a simple 1-D model. Even if a multidimensional architecture is used [17], the computation for the resynthesis processing will be enormous. Second, it is usually difficult to measure the time domain responses of a played instrument at various positions so that the measurement can be used as the training vector. Third, the BPTT method takes many iterations to converge. Finally, the simple waveforms used by SRN as the excitation signals do not produce good synthesis results in every case.

What we want to achieve is to accurately synthesize the tones for any specific plucked-string instrument with reasonable cost. Furthermore, it is desired that the synthesizer design can be done automatically. Therefore, several modifications to the SRN method are proposed. First, the architectures of the networks are simplified so that the complexities required in the training stage and the synthesis stage can be reduced. Second, the training vectors can be musical tones recorded by using ordinary microphones. This allows easy measurement for users without complicated measurement devices. Third, a new training algorithm modified from simulated annealing resilient backpropagation (SARPROP) is used to speed up the training and obtain better system parameters [16], [18]. Fourth, the excitation wavetable should be kept small in its size and can be obtained in the training process. It is noted that it is very difficult for the training to converge to a good solution without this step. Finally, portamento and vibrato effects should be embedded.

In Section II, a class of physical modeling recurrent networks is proposed and the simplified version of the networks is presented. Its connection with the 1-D string model is described. In Section III, resynthesis processing with the proposed technique is presented. In Section IV, a multistage training procedure and a new training algorithm are presented. Synthesis model parameters and the excitation signal are obtained in this stage.

Manuscript received November 15, 1999; revised March 19, 2001. This work was supported in part by National Science Council, Taiwan, R.O.C., under Grant NSC 89-2218-E-006-132.
A. W. Y. Su is with the Department of CSIE, National Cheng-Kung University, Tainan, Taiwan.
S.-F. Liang is with the Department of Electrical and Control Engineering, National Chiao-Tung University, Hsin-Chu, Taiwan.
Publisher Item Identifier S 1045-9227(02)03982-6.
1045-9227/02$17.00 © 2002 IEEE
1138 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002
Fig. 2. The SRN model and the proposed synthesis model. (a) The SRN model for the modeling of musical strings. (b) The proposed new network structure for synthesis of plucked-string instrument tones.
(6)
Fig. 4. The three basic components in the synthesis model shown in Fig. 2(b). (a) The structure of a PB. (b) A pair of delay lines that connect PB-(i − 1) and PB-i. (c) The operation of two reflective ends.
Then, it is equally distributed into the right-going departure neuron and the left-going departure neuron as follows:
(8)
According to Fig. 4(b), represents the initial magnitude of the th delay unit of the th pair of the delay lines. Similar to (8), the initial values of delay buffers are also one half of .
(9)
where the upper-track boundary delay buffer receives the right-going output signal of PB- , and the lower-track boundary delay buffer receives the left-going output signal of PB- .
There are two basic operations in a neuron. The first one sums the signals flowing into this neuron as the net-input. The second one is the so-called activation function that is a mapping between the net-input and the corresponding output. The arrival neurons receive the weighted outputs from the departure neurons or those from the delay buffers as
(11) (14)
SU AND LIANG: CLASS OF PHYSICAL MODELING RECURRENT NETWORKS 1141
(15)
and
(16)
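The equal split of the initial excitation between the two travelling directions, together with the paired delay lines and reflective ends of Fig. 4, can be illustrated with a minimal sketch. The Python code below is not the PB network of this paper: it assumes an ideal, lossless string with rigid, sign-inverting ends and no scattering or decay parameters, and the function name `pluck_ideal_string` is hypothetical.

```python
def pluck_ideal_string(initial_shape, num_steps, pickup):
    """Simulate an ideal lossless string with a pair of delay lines.

    initial_shape: initial displacement at each of the n cells.
    pickup: index of the cell whose displacement is taken as the output.
    Assumption: rigid ends that reflect with a sign inversion.
    """
    # Equal distribution: each travelling wave starts with one half
    # of the initial displacement.
    right = [0.5 * v for v in initial_shape]  # upper track, moves toward the last cell
    left = [0.5 * v for v in initial_shape]   # lower track, moves toward the first cell
    output = []
    for _ in range(num_steps):
        # Physical displacement is the sum of the two travelling waves.
        output.append(right[pickup] + left[pickup])
        # Samples that reach an end are inverted and fed to the other track.
        from_right_end = -right[-1]
        from_left_end = -left[0]
        right = [from_left_end] + right[:-1]
        left = left[1:] + [from_right_end]
    return output
```

With n cells per track, a sample needs 2n steps to travel around the loop and is inverted twice on the way, so the output of this idealized sketch repeats every 2n samples.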
IV. TRAINING

There are some improvements with the model compared to the one in [15]. First, a multistage training procedure is used to obtain multiple sets of synthesis parameters for the varying characteristics of an instrument. Second, a supervised training method is used to obtain the synthesis parameters and the initial excitation waveform automatically. Third, a hybrid-training algorithm is used to speed up the training and obtain better synthesis parameters.

A. Multistage Training Strategy
The multistage training strategy for the synthesis model is shown in Fig. 5. The mean square difference between the recorded and the synthesized tones is used to adjust the initial excitation waveform and synthesis parameters. In Stage #1, the initial excitation waveform, denoted by and in (7) and (9), as well as the first set of the synthesis parameters have to be determined. This training stage employs the recorded tone within the interval as the training vector and the resultant parameters are used for synthesizing a tone from to . Because the synthesis processing no longer requires external signals after the initialization stage, it is not necessary to have the initial input waveform updated after this stage. Stage #2 begins at by using the recorded tone from to as the training vector and the second set of parameters is obtained when this training stage finishes. This procedure stops when all the training vectors are finished.

Fig. 5. The multistage training strategy for determination of model parameters.

B. Training Algorithm
For a recurrent neural network (RNN), BPTT [16], [20] is a widely used training algorithm that is an extension of the standard backpropagation algorithm. This algorithm requires at least 10 000 epochs to converge when it is applied to the proposed synthesis model. In [18], the SARPROP method is proposed for feedforward type networks. This algorithm combines a quick gradient descent algorithm, called resilient backpropagation (RPROP) [22], with a simulated annealing (SA)-based global searching technique [23]. RPROP takes into account the sign of the gradient as seen by a particular parameter instead of the magnitude of the gradient. SA adds random noise to the parameter updates and gradually decreases the magnitudes of the updates during the training process. The SARPROP method can indeed converge much faster than the BPTT method. In our experiments, however, it is found that the SARPROP method is very sensitive to the learning parameters and the initial condition of the system parameters. For a recurrent network, the training sometimes diverges or converges to a totally unacceptable solution.

In this paper, a hybrid-training algorithm consisting of BPTT and SARPROP as shown in Fig. 6 is used in the training procedure of the proposed physical modeling recurrent network. Since this synthesis network is a recurrent neural network, BPTT is used to calculate the magnitude of the gradient for each parameter and the corresponding parameter update value is obtained by SARPROP. In Stage #1, both synthesis parameters and initial excitation waveform must be updated. Only synthesis parameters are updated in the other stages. Particularly, the synthesis network is constructed based on a simplified physical model of a musical instrument. Therefore, each of the synthesis parameters has its physical meaning. The -type parameters simulate the nonuniform characteristics at various physical positions and the -type ones simulate the energy decay factors. The initial values of the synthesis parameters can be reasonable values derived from the physical characteristics of the target instrument instead of random values, so that the training behaves better. This is different from most other applications using neural networks as their parameter-finding mechanisms.
The temporal operation of the proposed model can be unfolded into a multilayer feedforward architecture with synchronous update. Interested readers can refer to references such as [15], [21]. A neural network layer representing one time instant is called a time layer. Since the synthesized signal is the output of a chosen displacement neuron, only this displacement neuron can have a teacher signal that is the recorded tone of the target instrument. The displacement neuron that generates the synthesized output to match the desired output is called the visible neuron and the remaining neurons are called hidden neurons. In each training stage, let denote the recorded tone at time and let the output of the corresponding displacement neuron , which is the th displacement neuron in PB- , be chosen as the synthetic result of the synthesis model. The error signal at any time is defined as
(17)
(18)
(19)
(20)
where represents the local error of the neuron at time layer . The gradient values of the -type parameters at time layer can also be derived as shown in (21) and (22).
(21)
and
(22)
The local error of a displacement neuron can be obtained by (23). If a displacement neuron is a hidden neuron, its gradient value can be computed based on the collection of the local errors of the departure neurons connected with it. If this displacement neuron is a visible neuron (denoted as ), it directly contributes the error signal, as shown in (17), to the total cost function. Therefore, this error signal must be involved in the computation of the local error corresponding to this neuron
otherwise. (23)
The local error of a departure neuron is obtained based on the local error of the arrival neuron or the delay buffer connected with this departure neuron by
The local errors of the delay buffers in the upper track and the lower track can be computed as shown in (28) and (29).
(28)
and
(29)
Since the total gradient for each parameter is the sum of the gradient value corresponding to every time layer, the total gradient for the synthesis parameters in one epoch can be computed by
(30)
and
(31)
(32)
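The statement that the total gradient of a parameter is the sum of its per-time-layer gradients can be checked on a deliberately tiny recurrence. The sketch below, in Python, does not reproduce the network or equations of this paper: it assumes a one-parameter model y[t] = a * y[t-1] with initial excitation y[0] = x0 and a squared-error cost, and all function names are hypothetical.

```python
def forward(a, x0, num_steps):
    """Unfold the toy recurrence y[t] = a * y[t-1] with y[0] = x0."""
    y = [x0]
    for _ in range(num_steps - 1):
        y.append(a * y[-1])
    return y


def cost(a, x0, target):
    """Half the summed squared error between target and model output."""
    y = forward(a, x0, len(target))
    return 0.5 * sum((d - v) ** 2 for d, v in zip(target, y))


def bptt_gradients(a, x0, target):
    """Return (dE/da, dE/dx0) by backpropagation through time."""
    y = forward(a, x0, len(target))
    grad_a = 0.0
    delta_next = 0.0  # local error of the following time layer
    for t in range(len(target) - 1, -1, -1):
        # Local error at time layer t: direct output error plus the
        # error propagated back from time layer t + 1 through a.
        delta = (y[t] - target[t]) + a * delta_next
        if t > 0:
            # Contribution of time layer t to the total gradient of a.
            grad_a += delta * y[t - 1]
        delta_next = delta
    grad_x0 = delta_next  # local error at time layer 0
    return grad_a, grad_x0
```

In this toy setting both the recurrence parameter and the initial excitation receive a gradient, which mirrors the situation of Stage #1 where the synthesis parameters and the excitation waveform are updated together; a central finite difference of `cost` is a convenient check on the result.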
Fig. 7. Simulation of left-hand finger-gliding playing behavior. (a) The left-hand finger glides along the string from position A to position B. (b) The left-hand finger glides along the string from position B to position A. (c) Modified structure of the proposed model for simulating the finger-gliding effect.

finger pressed firmly on the physical position corresponding to position B. In this case, outputs of all the displacement neurons in the region to the left-hand side of position B are forced to zero. On the contrary, if the value of is restored gradually from to its original value along the dashed curve shown in Fig. 8, the model is gradually restored to the original situation. The pitch of the synthetic tone will change from high to low.

Vibrato and portamento effects are actually produced by combining such shortening and lengthening operations. If the gliding on the string can be described by a function of time, the above effects can be easily achieved by changing the -type parameters of the proposed model according to this function. The synthesized tones produced with these operations no longer sound as similar to the tones produced by the target acoustic instruments. However, if the initial part of the synthetic tone is similar to the original, subjects tend to consider that the two are produced by the same instrument and that the special effects are simply ornaments.

VI. ANALYSIS/SYNTHESIS OF PLUCKED-STRING INSTRUMENTS

The following analysis/synthesis experiments cover various types of plucked-string instruments (steel-string guitar, nylon-string guitar, harp, Pipa, Yueh-chin, and Chin) to demonstrate the performance of the proposed method. Pipa, Yueh-chin, and Chin are three traditional Chinese plucked-string instruments [24]. In each case, there are 7 PBs in the proposed model and each PB contains three displacement neurons.

Fig. 9. Original and synthetic tones of various plucked-string instruments. (a) Steel-string guitar. (b) Nylon-string guitar. (c) Harp. (d) Pipa. (e) Yueh-chin. (f) Chin.

TABLE I
THE SIGNAL-TO-NOISE RATIOS OF SYNTHETIC RESULTS CORRESPONDING TO VARIOUS MUSICAL INSTRUMENTS

Fig. 10. Short-time-Fourier analysis of the signals shown in Fig. 9(d) and (f). (a) STFA of original Pipa tone. (b) STFA of synthesized Pipa tone. (c) STFA of original Chin tone. (d) STFA of synthesized Chin tone.

1) Synthesis Results: The analysis/synthesis results are shown in Fig. 9. The upper part of each subfigure shows the original tone and the lower part shows the corresponding synthesized tone. The waveforms of the original tone and the synthesized tone are very close to each other. The SNR for each of the pairs is shown in Table I. Fig. 10(a) and (b) show that there are still small differences coming from the high-frequency components. In general, if the sounding mechanism of an instrument is less perfect, the synthesis is more difficult. For example, Pipa, a lute-like instrument, has a very thin top plate, a nonrigid bridge and less well-constructed strings [25]. Therefore, its response is less smooth compared to instruments such as harp and Chin. This can also be seen in the STFT plots shown in Fig. 10(c) and (d): the Chin tone has a smoother decay pattern than the Pipa tone. Although the SNR results do not look impressive, such performance was not possible in the past. In fact, most physical modeling synthesis methods can only reproduce the magnitude part of the frequency response. In general, people judge the instrument by the first fraction of a second of a tone. Listening tests show that subjects can find differences between the original and the synthesized tones but consider that they do sound very similar and regard the tones as generated by the same instruments.

2) Portamento Effects: Portamento and vibrato are frequently used in the playing of many stringed instruments. Chin is an ancient Chinese plucked-string instrument that consists of a shallow rectangular-like wooden chamber and seven strings, and it is the instrument known to use portamento and vibrato the most. Since there are no frets on the top plate, the player's left-hand fingers can glide along the strings to produce vibrato and portamento effects.
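How a finger glide described by a function of time maps to a pitch trajectory can be sketched under a textbook idealization. The Python code below assumes an ideal string whose fundamental is inversely proportional to its vibrating length and a linear glide; neither assumption comes from the proposed model, which changes its parameters instead of an explicit length, and the function names are hypothetical. The 190 Hz and 215 Hz endpoints are the values reported for the portamento simulation of Fig. 11.

```python
import math


def glide_length(t, duration, start_len, end_len):
    """Vibrating length at time t for a linear glide (assumed glide shape)."""
    frac = min(max(t / duration, 0.0), 1.0)
    return start_len + frac * (end_len - start_len)


def ideal_pitch(ref_freq, ref_len, vib_len):
    """Fundamental of an ideal string: inversely proportional to length."""
    return ref_freq * ref_len / vib_len


def cents(f_from, f_to):
    """Musical interval between two frequencies in cents."""
    return 1200.0 * math.log2(f_to / f_from)


def portamento_track(f_start, f_end, duration, num_points):
    """Return (time, frequency) pairs for a glide from f_start to f_end."""
    start_len = 1.0                # normalized vibrating length at the start
    end_len = f_start / f_end      # shorter string gives the higher pitch
    times = [duration * k / (num_points - 1) for k in range(num_points)]
    return [(t, ideal_pitch(f_start, start_len,
                            glide_length(t, duration, start_len, end_len)))
            for t in times]
```

For example, `portamento_track(190.0, 215.0, 0.5, 6)` gives a rising pitch curve over an assumed half-second glide, and `cents(190.0, 215.0)` expresses the same shift as a musical interval of a little more than two semitones.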
Fig. 11. The STFT plot of the synthetic tone with portamento effect.

The technique described in the previous section is used to simulate portamento. Fig. 11 shows the STFT plot of the simulation. The fundamental frequency is shifted from 190 Hz to 215 Hz. Though the fundamental frequency is shifted to the desired pitch, the timbre has changed and is different from the Chin used in this experiment. Because the beginning transient sounds similar enough to the original, this timbre difference is usually ignored. Nevertheless, simulating these effects without changing the timbre is still an interesting and challenging topic.

VII. CONCLUSION AND FUTURE WORK

A class of physical modeling recurrent networks is proposed to synthesize musical tones of plucked-string instruments. All the required parameters of the synthesis model can be efficiently and automatically obtained by a hybrid BPTT/SARPROP learning algorithm. A specific instrument can be closely synthesized when electronic musicians consider its sound indispensable. The approach is also tested over a wide range of plucked-string instruments and proven to be a very general method. Based on this synthesis model, the portamento effect can be easily synthesized. Because the training vector is easy to obtain, it is possible for users to design their own synthesizers. Although the computational complexity of the resynthesis processing is still large, it is close to that of speech synthesis. Given the rapid progress of current DSP processor design, computation cost in this range should not cause much trouble.

Our future work is as follows. First, the SNR of the synthetic tone to the original tone is still not good enough. Actually, the high-frequency part contributes most of the error. This will be our major focus. Second, playing techniques play very important roles in how an instrument sounds. For example, Chin has thousands of techniques and each technique produces a different timbre. How to handle this problem is a challenging issue. Finally, it is desired to extend the proposed methodology to other types of instruments such as wind

ACKNOWLEDGMENT

The authors would like to thank the editor and the reviewers for their valuable comments.

REFERENCES

[1] P. M. Morse, Vibration and Sound. Woodbury, NY: Amer. Inst. Phys./Acoust. Soc. Amer., 1936.
[2] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments. New York: Springer-Verlag, 1991.
[3] L. Cremer, The Physics of the Violin. Cambridge, MA: MIT Press, 1984.
[4] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[5] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. New York: Macmillan, 1993.
[6] J. O. Smith, “Physical modeling using digital waveguides,” Comput. Music J., vol. 16, no. 4, pp. 74–87, 1992.
[7] J. O. Smith, “Music Application of Digital Waveguide,” Stanford Univ., Stanford, CA, CCRMA Tech. Rep. STAN-M-67.
[8] J. O. Smith, “Efficient synthesis of stringed musical instruments,” in Proc. 1993 Int. Comput. Music Conf., 1993, pp. 64–71.
[9] A. Fettweis, “Wave digital filters: Theory and practice,” Proc. IEEE, vol. 74, pp. 227–327, Feb. 1986.
[10] A. Sart and G. De Poli, “Toward nonlinear wave digital filters,” IEEE Trans. Signal Processing, vol. 47, pp. 597–, Sept. 1999.
[11] V. Duyne et al., “The 3-D tetrahedral digital waveguide with musical applications,” in Proc. 1996 ICMC, Hong Kong, Aug. 1996, pp. 9–16.
[12] V. Välimäki et al., “Physical modeling of plucked string instruments with application to real-time sound synthesis,” J. Audio Eng. Soc., vol. 44, no. 5, pp. 331–353, 1996.
[13] M. Karjalainen et al., “Plucked-string models: From the Karplus-Strong algorithm to digital waveguides and beyond,” Comput. Music J., vol. 22, no. 3, pp. 17–32, 1998.
[14] A. W. Su and S. F. Liang, “Synthesis of plucked-string tones by physical modeling with recurrent neural networks,” Proc. IEEE 1997 Workshop Multimedia Signal Processing, pp. 71–76, June 1997.
[15] S. F. Liang, A. W. Su, and C. T. Lin, “Model-based synthesis of plucked string instruments by using a class of scattering recurrent networks,” IEEE Trans. Neural Networks, vol. 11, pp. 171–185, Jan. 2000.
[16] F. J. Pineda, “Recurrent backpropagation and the dynamical approach to adaptive neural computation,” Neural Comput., vol. 1, pp. 161–172, 1989.
[17] H. Krauß and R. Rabenstein, “Application of multidimensional wave digital filters to boundary value problems,” IEEE Signal Processing Lett., vol. 2, pp. 183–187, July 1995.
[18] N. K. Treadgold and T. D. Gedeon, “Simulated annealing and weight decay in adaptive learning: The SARPROP algorithm,” IEEE Trans. Neural Networks, vol. 9, no. 4, pp. 662–668, 1998.
[19] L. E. Kinsler et al., Fundamentals of Acoustics, 3rd ed. New York: Wiley, 1982.
[20] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for on-line training of recurrent network trajectories,” Neural Comput., vol. 2, pp. 490–501, 1990.
[21] S. Haykin, Neural Networks. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[22] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in Proc. ICNN93, San Francisco, CA, 1993, pp. 586–591.
[23] H. Szu, “Fast simulated annealing,” in Neural Networks for Computing, J. S. Denker, Ed. New York: Amer. Inst. Phys., 1986, pp. 420–425.
[24] H. D. Bodman, Chinese Musical Iconography: A History of Musical Instrument Depicted in Chinese Art. Taipei, Taiwan, R.O.C.: Asian-Pacific Cultural Center, 1987.
[25] S. Feng, “Some acoustical measurements on the chinese musical instrument p’i-p’a,” J. Acoust. Soc. Amer., vol. 75, no. 2, pp. 599–602, 1984.
Alvin W. Y. Su (M’97) was born in Taiwan in 1964. He received the B.S. degree in control engineering from National Chiao-Tung University (NCTU), Taiwan, in 1986. He received the M.S. and Ph.D. degrees in electrical engineering from Polytechnic University, Brooklyn, NY, in 1990 and 1993, respectively.
From 1993 to 1994, he was with Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA. From 1994 to 1995, he was with Computer Communication Lab of the Industrial Technology Research Institute (CCL. ITRI.), Taiwan. In 1995, he joined the Department of Information Engineering and Computer Engineering at Chung-Hwa University, where he serves as an Associate Professor. In 2001, he joined the Department of Computer Science and Information Engineering, National Cheng-Kung University. His research interests include digital audio signal processing, physical modeling of acoustic musical instruments, human computer interface design, video and color image signal processing, and VLSI signal processing.
Dr. Su is a Member of IEEE Computer Society and Signal Processing Society. He is also a Member of Acoustical Society of America and Audio Engineering Society.

Sheng-Fu Liang was born in Tainan, Taiwan, in 1971. He received the B.S. and M.S. degrees in control engineering from the National Chiao-Tung University (NCTU), Taiwan, in 1994 and 1996, respectively. He received the Ph.D. degree in electrical and control engineering from NCTU in 2000.
Currently, he is a Research Assistant Professor in electrical and control engineering at NCTU. His research activities include model-based music synthesis, neural networks, and image processing. His current projects include audio processing and video signal processing.