former are unrecoverable errors in most systems. However, the DER, used to evaluate speaker diarization performance, treats both forms of error equally.

For telephone audio, typically some form of standard energy/spectrum-based speech activity detection is used, since nonspeech tends to be silence or noise sources, although the GMM approach has also been successful in this domain with single-channel [21] or cross-channel [22] classes. For meeting audio, the nonspeech can come from a variety of noise sources, like paper shuffling, coughing, laughing, etc., and energy-based methods do not currently work well for distant microphones [23], [24], so using a simple pretrained speech/nonspeech GMM is generally preferred [6], [25], [23]. An interesting alternative uses a GMM, built on the normalized energy coefficients of the test data, to determine how much nonspeech to reject [24], while preliminary work in [6] shows future potential for a new energy-based method. When supported, multiple-channel meeting audio can be used to help speech activity detection [26]. This problem is felt to be so important in the meetings domain that a separate evaluation for speech activity detection was introduced in the spring 2005 Rich Transcription meeting evaluation [27].
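A minimal sketch of this style of speech activity detection: frame-by-frame maximum-likelihood classification followed by smoothing. Single 1-D Gaussians on a scalar "log-energy" feature stand in for the pretrained speech and nonspeech GMMs, and the means, variances, and smoothing window below are illustrative assumptions, not values from any of the cited systems.

```python
import math

def loglik(x, mu, var):
    """Per-frame log-likelihood under a 1-D Gaussian (stand-in for a GMM)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def detect_speech(frames, speech=(3.0, 1.0), nonspeech=(0.0, 1.0), win=5):
    """Label each frame speech/nonspeech by model likelihood, then
    majority-smooth over a short window to remove spurious flips."""
    raw = [loglik(x, *speech) > loglik(x, *nonspeech) for x in frames]
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - win), min(len(raw), i + win + 1)
        votes = raw[lo:hi]
        smoothed.append(sum(votes) * 2 > len(votes))
    return smoothed
```

The smoothing step plays the role that minimum-duration constraints play in real systems: an isolated noise spike that looks like speech for one frame is outvoted by its nonspeech neighbors.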
B. Change Detection

The aim of this step is to find points in the audio stream likely to be change points between audio sources. If the input to this stage is the unsegmented audio stream, then the change detection looks for both speaker and speech/nonspeech change points. If a speech detector or gender/bandwidth classifier has been run first, then the change detector looks for speaker change points within each speech segment.

Two main approaches have been used for change detection. Both involve looking at adjacent windows of data, calculating a distance metric between the two, and then deciding whether the windows originate from the same source or from different sources. The differences between them lie in the choice of distance metric and thresholding decisions.

The first general approach, used in [15], is a variation on the Bayesian information criterion (BIC) technique introduced in [28]. This technique searches for change points within a window using a penalized likelihood ratio test of whether the data in the window is better modeled by a single distribution (no change point) or two different distributions (change point). If a change is found, the window is reset to the change point and the search restarted. If no change point is found, the window is increased and the search is redone. Some of the issues in applying the BIC change detector are as follows. 1) It has high miss rates on detecting short turns (<2–5 s), so it can be problematic to use on fast-interchange speech like conversations. 2) The full search implementation is computationally expensive (order O(N²)), so most systems employ some form of computation reduction (e.g., [29]).

A second technique, used first in [30] and later in [13], [17], and [31], uses fixed-length windows, representing each window by a Gaussian and the distance between them by the Gaussian divergence (symmetric KL-2 distance). The step-by-step implementation in [19] and the system for telephone audio in [32] are similar but use the generalized log likelihood ratio as the distance metric. The peaks in the distance function are then found and define the change points if their absolute value exceeds a predetermined threshold chosen on development data. Smoothing the distance distribution, or eliminating the smaller of neighboring peaks within a certain minimum duration, prevents the system overgenerating change points at true boundaries. Single Gaussians are generally preferred to GMMs due to the simplified distance calculations. Typical window sizes are 1–2 or 2–5 s when using a diagonal or full covariance Gaussian, respectively. As with BIC, the window length constrains the detection of short turns.

Since the change point detection often only provides an initial base segmentation for diarization systems, which will be clustered and often resegmented later, being able to run the change point detection very fast (typically less than 0.01 ×RT for a diagonal covariance system) is often more important than any performance degradation. In fact, [11] and [19] found no significant performance degradation when using a simple initial uniform segmentation within their systems.

Both change detection techniques require a detection threshold to be empirically tuned for changes in audio type and features. Tuning the change detector is a tradeoff between the desire for long, pure segments to aid in initializing the clustering stage and the desire to minimize missed change points, which produce contaminations in the clustering.

Alternatively, or in addition, a word or phone decoding step with heuristic rules may be used to help find putative speaker change points, such as in [18] and the Cambridge 1998–2003 systems [16], [20]. However, this approach can over-segment the speech data, requires some additional merging or clustering to form viable speech segments, and can miss boundaries in fast speaker interchanges if relying on the presence of silence or gender changes between speakers.
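The second, metric-based approach can be sketched as follows: estimate a Gaussian in each of two adjacent fixed-length windows, compute their Gaussian divergence (symmetric KL-2), and keep local peaks of the distance curve that exceed a threshold. The window length, threshold, and synthetic one-dimensional "features" below are illustrative placeholders, not settings from any cited system.

```python
import math

def gaussian_divergence(mu1, var1, mu2, var2):
    """Symmetric KL (KL-2) distance between two diagonal-covariance Gaussians."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        d += 0.5 * (v1 / v2 + v2 / v1) - 1.0
        d += 0.5 * (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)
    return d

def moments(frames):
    """Per-dimension mean and (floored) variance of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    mu = [sum(f[k] for f in frames) / n for k in range(dim)]
    var = [max(sum((f[k] - mu[k]) ** 2 for f in frames) / n, 1e-6)
           for k in range(dim)]
    return mu, var

def change_points(frames, win=100, threshold=5.0):
    """Slide two adjacent fixed windows over the stream and keep local
    distance peaks above the threshold as hypothesized change points."""
    dist = []
    for t in range(win, len(frames) - win + 1):
        mu1, v1 = moments(frames[t - win:t])
        mu2, v2 = moments(frames[t:t + win])
        dist.append(gaussian_divergence(mu1, v1, mu2, v2))
    peaks = []
    for i in range(1, len(dist) - 1):
        if dist[i] > threshold and dist[i] >= dist[i - 1] and dist[i] >= dist[i + 1]:
            peaks.append(i + win)   # frame index of the hypothesized change
    return peaks
```

The local-peak test is a crude version of the smoothing/minimum-duration filtering described above: the distance curve ramps up as the right window fills with the new speaker, so only the summit of each ramp is reported rather than every above-threshold frame.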
C. Gender/Bandwidth Classification

The aim of this stage is to partition the segments into common groupings of gender (male or female) and bandwidth (low-bandwidth: narrow-band/telephone, or high-bandwidth: studio). This is done to reduce the load on subsequent clustering, provide more flexibility in clustering settings (for example, female speakers may have different optimal parameter settings to male speakers), and supply more side information about the speakers in the final output. If the partitioning can be done very accurately, and assuming no speaker appears in the same broadcast in different classes (for example, both in the studio and via a prerecorded field report), then performing this partitioning early in the system can also help improve performance while reducing the computational load [33]. The potential drawback of this partitioning stage, however, is that if a subset of a speaker's segments is misclassified, the errors can be unrecoverable, although it is possible to allow these classifications to change in a subsequent resegmentation stage, such as in [19].

Classification for both gender and bandwidth is typically done using maximum-likelihood classification with GMMs trained on labeled training data. Either two classifiers are run (one for gender and one for bandwidth) or joint models for gender and bandwidth are used. This can be done either in
1560 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 5, SEPTEMBER 2006
conjunction with the speech/nonspeech detection process or after the initial segmentation. Bandwidth classification can also be done using a test on the ratio of spectral energy above and below 4 kHz. An alternative method of gender classification, used in [17], aligns the word recognition output of a fast ASR system with gender-dependent models and assigns the most likely gender to each segment. This has a high accuracy but is unnecessarily computationally expensive if a speech recognition output is not already available, and segments ideally should be of a reasonable size (typically between 1 and 30 s). Gender classification error rates are around 1%–2% and bandwidth classification error rates are around 3%–5% for broadcast news audio.
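The 4-kHz energy-ratio test for bandwidth can be sketched directly. The plain O(N²) DFT below and the 0.1 decision threshold are illustrative choices, not values from the cited systems; gender classification would instead use GMM likelihood comparisons analogous to the speech/nonspeech sketch earlier.

```python
import cmath
import math

def band_energy_ratio(samples, rate=16000, split_hz=4000):
    """Ratio of spectral energy above vs. below split_hz, via a plain DFT."""
    n = len(samples)
    low = high = 0.0
    for k in range(1, n // 2):          # positive-frequency bins only
        freq = k * rate / n
        spec = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        energy = abs(spec) ** 2
        if freq < split_hz:
            low += energy
        else:
            high += energy
    return high / max(low, 1e-12)

def classify_bandwidth(samples, rate=16000, threshold=0.1):
    """Telephone speech has almost no energy above 4 kHz, so a small
    high/low energy ratio flags narrow-band audio."""
    return "narrow-band" if band_energy_ratio(samples, rate) < threshold else "wide-band"
```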
D. Clustering

The purpose of this stage is to associate or cluster segments from the same speaker together. The clustering ideally produces one cluster for each speaker in the audio, with all segments from a given speaker in a single cluster. The predominant approach used in diarization systems is hierarchical, agglomerative clustering with a BIC-based stopping criterion [28], consisting of the following steps:

0) initialize leaf clusters of tree with speech segments;
1) compute pair-wise distances between each cluster;
2) merge closest clusters;
3) update distances of remaining clusters to new cluster;
4) iterate steps 1)–3) until stopping criterion is met.

The clusters are generally represented by a single full covariance Gaussian [5], [12], [15], [17], [31], [34], but GMMs have also been used [11], [19], [35], sometimes being built using mean-only MAP adaptation of a GMM of the entire test file to each cluster for increased robustness. The standard distance metric between clusters is the generalized likelihood ratio (GLR). It is possible to use other representations or distance metrics, but these have been found the most successful within the BIC clustering framework. The stopping criterion compares the BIC statistic of the two clusters being considered, c_i and c_j, with that of the parent cluster, c, should they be merged, the formulation for the full covariance Gaussian case being

  ΔBIC(c_i, c_j) = (N/2) log|Σ| − (N_i/2) log|Σ_i| − (N_j/2) log|Σ_j| − (λ/2) P log N

where P = d + d(d+1)/2 is the number of free parameters, N = N_i + N_j the number of frames, Σ the covariance matrix, and d the dimension of the feature vector (see, e.g., [20] for a more complete derivation). If the pair of clusters is best described by a single full covariance Gaussian, the ΔBIC will be low, whereas if there are two separate distributions, implying two speakers, the ΔBIC will be high. For each step, the pair of clusters with the lowest ΔBIC is merged and the statistics are recalculated. The process is generally stopped when the lowest ΔBIC is greater than a specified threshold, usually 0. The use of the number of frames in the parent cluster in the penalty factor represents a "local" BIC decision, i.e., just considering the clusters being combined. This has been shown to perform better than the corresponding "global" BIC implementation, which uses the number of frames in the whole show instead [20], [31], [36].

Slight variations of this technique have also been used. For example, the system described in [18] uses essentially the local BIC score (with the number-of-parameters term incorporated within the penalty weight), but sets different thresholds for potential boundaries occurring during speech or nonspeech, motivated by an observation that most true speaker change points occurred during nonspeech regions. A further example, used in the system described in [11] and [37], removes the need for tuning the penalty weight on development data by ensuring that the numbers of parameters in the merged and separate distributions are equal, although the base number of Gaussians and, hence, number of free parameters needs to be chosen carefully for optimal effect. Alternatives to the penalty term, such as using a constant [38], the weighted sum of the number of clusters and number of segments [13], or a penalized determinant of the within-cluster dispersion matrix [34], [39], have also had moderate success, but the BIC method has generally superseded these. Adding a Viterbi resegmentation between multiple iterations of clustering [31] or within a single iteration [11] has also been used to increase performance at the penalty of increased computational cost.

An alternative approach described in [40] uses a Euclidean distance between MAP-adapted GMMs and notes this is highly correlated with a Monte Carlo estimation of the Gaussian divergence (symmetric KL-2) distance while also being an upper bound to it. The stopping criterion uses a fixed threshold, chosen on the development data, on the distance metric. The performance is comparable to the more conventional BIC method.

A further method described in [15] uses "proxy" speakers. A set of proxy models is applied to map segments into a vector space, then a Euclidean distance metric and an ad hoc occupancy stopping criterion are used, but the overall clustering framework remains the same. The proxy models can be built by adapting a universal background model (UBM), such as a 128-mixture GMM, to the test data segments themselves, thus making the system portable to different shows and domains while still giving a consistent performance gain over the BIC method.

Regardless of the clustering employed, the stopping criterion is critical to good performance and depends on how the output is to be used. Under-clustering fragments speaker data over several clusters, while over-clustering produces contaminated clusters containing speech from several speakers. For indexing information by speaker, both are suboptimal. However, when using cluster output to assist in speaker adaptation of speech recognition models, under-clustering may be suitable when a speaker occurs in multiple acoustic environments, and over-clustering may be advantageous in aggregating speech from similar speakers or acoustic environments.
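As a toy rendering of steps 0)–4) and the ΔBIC stopping criterion, the sketch below uses one-dimensional features, so the full covariance matrix reduces to a scalar variance; the segment data, penalty weight λ, and variance floor are all illustrative assumptions rather than tuned values.

```python
import math

def log_det_var(data):
    """log of the scalar variance of 1-D data; the full-covariance case
    would use log|Sigma| here instead."""
    n = len(data)
    mu = sum(data) / n
    return math.log(max(sum((x - mu) ** 2 for x in data) / n, 1e-10))

def delta_bic(ci, cj, lam=1.0, dim=1):
    """Local ΔBIC merge score: low for same-speaker pairs, high otherwise."""
    ni, nj = len(ci), len(cj)
    n = ni + nj
    p = dim + dim * (dim + 1) / 2          # free parameters of a full-covariance Gaussian
    glr = 0.5 * (n * log_det_var(ci + cj)
                 - ni * log_det_var(ci)
                 - nj * log_det_var(cj))
    return glr - 0.5 * lam * p * math.log(n)

def bic_cluster(segments, lam=1.0):
    """Agglomerative clustering: merge the lowest-ΔBIC pair until it exceeds 0."""
    clusters = [list(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        score, i, j = min(pairs)
        if score > 0:                       # stopping criterion: threshold of 0
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Note that the penalty uses log N with N the frames in the candidate merged (parent) cluster, i.e., the "local" BIC decision described above.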
E. Joint Segmentation and Clustering

An alternative approach to running segmentation and clustering stages separately is to use an integrated scheme. This was
TRANTER AND REYNOLDS: OVERVIEW OF AUTOMATIC SPEAKER DIARISATION SYSTEMS 1561
first done in [13] by employing a Viterbi decode between iterations of agglomerative clustering, but an initial segmentation stage was still required. A more recent, completely integrated scheme, based on an evolutive HMM (E-HMM) in which detected speakers help influence both the detection of other speakers and the speaker boundaries, was introduced in [41] and developed in [19] and [42]. The recording is represented by an ergodic HMM in which each state represents a speaker and the transitions model the changes between speakers. The initial HMM contains only one state and represents all of the data. In each iteration, a short speech segment assumed to come from a nondetected speaker is selected and used to build a new speaker model by Bayesian adaptation of a UBM. A state is then added to the HMM to reflect this new speaker, and the transition probabilities are modified accordingly. A new segmentation is then generated from a Viterbi decode of the data with the new HMM, and each model is adapted using the new segmentation. This resegmentation phase is repeated until the speaker labels no longer change. The process of adding new speakers is repeated until there is no gain in terms of comparable likelihood or there is no data left to form a new speaker. The main advantages of this integrated approach are that it uses all the information at each step and allows the use of speaker recognition-based techniques, like Bayesian adaptation of the speaker models from a UBM.
F. Cluster Recombination

In this relatively recent approach [31], state-of-the-art speaker recognition modeling and matching techniques are used as a secondary stage for combining clusters. The signal processing and modeling used in the clustering stage of Section II-D are usually simple: no channel compensation, such as RASTA, since we wish to take advantage of common channel characteristics among a speaker's segments, and limited parameter distribution models, since the model needs to work with small amounts of data in the clusters at the start.

With cluster recombination, clustering is run to under-cluster the audio but still produce clusters with a reasonable amount of speech (several seconds or more). A UBM is built on training data to represent general speakers. Both static and delta coefficients are used, and feature normalization is applied to help reduce the effect of the acoustic environment. Maximum a posteriori (MAP) adaptation (usually mean-only) is then applied on each cluster from the UBM to form a single model per cluster. The cross likelihood ratio (CLR) between any two given clusters is defined [31], [43] as

  CLR(c_i, c_j) = log[ L̄(x_i | M_j) / L̄(x_i | B) ] + log[ L̄(x_j | M_i) / L̄(x_j | B) ]

where L̄(x|M) is the average likelihood per frame of data x given the model M, M_i is the model adapted from cluster c_i, and B is the UBM. The pair of clusters with the highest CLR is merged and a new model is created. The process is repeated until the highest CLR is below a predefined threshold chosen from development data. Because of the computational load at this stage, each gender/bandwidth combination is usually processed separately, which also allows more appropriate UBMs to be used for each case.

Different types of feature normalization have been used with this process, namely RASTA-filtered cepstra with 10-s feature mean and variance normalization [15] and feature warping [44] using a sliding window of 3 s [14], [17]. The latter method had previously been found by one study to be more effective than other standard normalization techniques on a speaker verification task on cellular data [45]. In [17], it was found that the feature normalization was necessary to get a significant gain from the cluster recombination technique.

When the clusters are merged, a new speaker model can be trained with the combined data and distances updated (as in [14] and [17]), or standard clustering rules can be used with a static distance matrix (as in [15]). This recombination can be viewed as fusing intra- and inter-audio-file [43] speaker clustering techniques. On the RT-04F evaluation it was found that this stage significantly improves performance, with further improvements being obtained subsequently by using a variable-prior iterative MAP approach for adapting the UBMs, and building new UBMs including all of the test data [17].
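A toy rendering of the CLR merge criterion: single 1-D Gaussians stand in for the MAP-adapted cluster models and the UBM, and the per-frame averaging is done in the log domain, as is common in practice. All data and model settings below are illustrative assumptions.

```python
import math

def avg_loglik(data, mu, var):
    """Average per-frame log-likelihood under a 1-D Gaussian
    (stand-in for a MAP-adapted GMM or the UBM)."""
    c = -0.5 * math.log(2 * math.pi * var)
    return sum(c - (x - mu) ** 2 / (2 * var) for x in data) / len(data)

def fit(data):
    """Maximum-likelihood Gaussian fit, with a variance floor."""
    mu = sum(data) / len(data)
    var = max(sum((x - mu) ** 2 for x in data) / len(data), 1e-6)
    return mu, var

def clr(xi, xj, ubm):
    """Cross likelihood ratio: high when each cluster's data is better
    explained by the other cluster's model than by the UBM."""
    mi, vi = fit(xi)
    mj, vj = fit(xj)
    mu_b, v_b = ubm
    return ((avg_loglik(xi, mj, vj) - avg_loglik(xi, mu_b, v_b)) +
            (avg_loglik(xj, mi, vi) - avg_loglik(xj, mu_b, v_b)))
```

Normalizing by the UBM is what makes the score robust to small clusters: a cluster only scores highly against another cluster's model if that model explains its data better than the general-speaker background does.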
G. Resegmentation

The last stage found in many diarization systems is a resegmentation of the audio via Viterbi decoding (with or without iterations) using the final cluster models and nonspeech models. The purpose of this stage is to refine the original segment boundaries and/or to fill in short segments that may have been removed for more robust processing in the clustering stage. Filtering the segment boundaries using a word or phone recognizer output can also help reduce the false alarm component of the error rate [31].
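A minimal Viterbi resegmentation sketch, assuming per-cluster single-Gaussian models and a fixed speaker-switch penalty in place of trained transition probabilities; all names and constants here are illustrative, and a real system would also include nonspeech models.

```python
import math

def viterbi_reseg(frames, models, switch_penalty=5.0):
    """Decode the best cluster sequence for the frames, given per-cluster
    1-D Gaussian models (mu, var) and a penalty for changing speaker."""
    def ll(x, m):
        mu, var = m
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

    n_states = len(models)
    score = [ll(frames[0], m) for m in models]
    back = []
    for x in frames[1:]:
        new_score, choices = [], []
        for s in range(n_states):
            # stay in the same state for free, or switch and pay the penalty
            best_prev = max(range(n_states),
                            key=lambda p: score[p] - (0 if p == s else switch_penalty))
            choices.append(best_prev)
            new_score.append(score[best_prev]
                             - (0 if best_prev == s else switch_penalty)
                             + ll(x, models[s]))
        score = new_score
        back.append(choices)
    # trace back the best path
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    path.reverse()
    return path
```

The switch penalty acts as the duration constraint: a single outlier frame is absorbed into the surrounding speaker's segment rather than producing a one-frame turn.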
H. Finding Identities

Although current diarization systems are only evaluated using "relative" speaker labels (such as "spkr1"), it is often possible to find the true identities of the speakers (such as "Ted Koppel"). This can be achieved by a variety of methods, such as building speaker models for people who are likely to be in the news broadcasts (such as prominent politicians or main news anchors and reporters) and including these models in the speaker clustering stage, or running speaker-tracking systems.

An alternative approach, introduced in [46], uses linguistic information contained within the transcriptions to predict the previous, current, or next speaker. Rules are defined based on category and word N-grams chosen from the training data, and are then applied sequentially on the test data until the speaker names have been found. Blocking rules are used to stop rules firing in certain contexts; for example, the sequence "[name] reports X" assigns the next speaker to be [name] unless X is the word "that." An extension of this system, described in [47], learns many rules and their associated probability of being correct automatically from the training data and then applies these simultaneously on the test data using probabilistic combination. Using automatic transcriptions and automatically found speaker turns naturally degrades performance, but potentially 85% of the time can be correctly assigned to the true speaker identity using this method.

Although primarily used for identifying the speaker names given a set of speaker clusters, this technique can associate the same name with more than one input cluster and, therefore, could be thought of as a high-level cluster-combination stage.
I. Combining Different Diarization Methods

Combining methods used in different diarization systems could potentially improve performance over the best single diarization system. It has been shown that the word error rate (WER) of an automatic speech recognizer can be consistently reduced when combining multiple segmentations, even if the individual segmentations themselves do not offer state-of-the-art performance in either DER or resulting WER [48]. Indeed, it seems that diversity between the segmentation methods is just as important as the segmentation quality when being combined. It is expected that gains in DER are also possible by combining different diarization modules or systems.

Several methods of combining aspects of different diarization systems have been tried, for example the "hybridization" or "piped" CLIPS/LIA systems of [35] and [49] and the "plug and play" CUED/MIT-LL system of [20], which both combine components of different systems together. A more integrated merging method is described in [49], while [35] describes a way of using the 2002 NIST speaker segmentation error metric to find regions in two inputs which agree and then uses these to train potentially more accurate speaker models. These systems generally produce performance gains but tend to place some restrictions on the systems being combined, such as the required architecture or equalizing the number of speakers. An alternative approach introduced in [50] uses a "cluster voting" technique to compare the output of arbitrary diarization systems, maintaining areas of agreement and voting using confidences or an external judging scheme in areas of conflict.
then scored against reference "ground-truth" speaker segmentation, which is generated using the rules given in [52]. Since the hypothesis speaker labels are relative, they must be matched appropriately to the true speaker names in the reference. To accomplish this, a one-to-one mapping of the reference speaker IDs to the hypothesis speaker IDs is performed so as to maximize the total overlap of the reference and (corresponding) mapped hypothesis speakers. Speaker diarization performance is then expressed in terms of the miss (speaker in reference but not in hypothesis), false alarm (speaker in hypothesis but not in reference), and speaker-error (mapped reference speaker is not the same as the hypothesized speaker) rates. The overall DER is the sum of these three components. A complete description of the evaluation measure and scoring software implementing it can be found at http://nist.gov/speech/tests/rt/rt2004/fall.

It should be noted that this measure is time-weighted, so the DER is primarily driven by (relatively few) loquacious speakers, and it is, therefore, more important to get the main speakers complete and correct than to accurately find speakers who do not speak much. This scenario models some tasks, such as tracking anchor speakers in broadcast news for text summarization, but there may be other tasks (such as speaker adaptation within automatic transcription, or ascertaining the opinions of several speakers in a quick debate) for which it is less appropriate. The same formulation can be modified to be speaker-weighted instead of time-weighted if necessary, but this is not discussed here. The utility of either weighting depends on the application of the diarization output.
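The scoring just described can be reproduced in miniature as follows. This sketch works on frame-level label sequences rather than timed segments (so there are no collars), and it brute-forces the one-to-one speaker mapping over all permutations, which is only practical for a handful of speakers; a real scorer would solve the mapping with an assignment algorithm.

```python
from itertools import permutations

def der(ref, hyp):
    """Frame-level diarization error rate.
    ref/hyp: per-frame speaker labels, with None marking nonspeech.
    Every one-to-one mapping of reference to hypothesis speakers is tried
    and the one maximizing total overlap is kept."""
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    # pad the hypothesis side so every mapping is one-to-one
    n = max(len(ref_spk), len(hyp_spk))
    hyp_pad = hyp_spk + [None] * (n - len(hyp_spk))
    best, best_overlap = {}, -1
    for perm in permutations(hyp_pad, n):
        mapping = dict(zip(ref_spk, perm))
        overlap = sum(1 for r, h in zip(ref, hyp)
                      if r is not None and h is not None and mapping[r] == h)
        if overlap > best_overlap:
            best_overlap, best = overlap, mapping
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    spk_err = sum(1 for r, h in zip(ref, hyp)
                  if r is not None and h is not None and best[r] != h)
    ref_time = sum(1 for r in ref if r is not None)
    return (miss + fa + spk_err) / ref_time
```

Note that the denominator is the total reference speech time, which is what makes the measure time-weighted: errors on a speaker who talks for most of the show dominate the score.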
[27] J. G. Fiscus, N. Radde, J. S. Garofolo, A. Le, J. Ajot, and C. Laprun, "The rich transcription 2005 spring meeting recognition evaluation," in Proc. Machine Learning for Multimodal Interaction Workshop (MLMI), Edinburgh, U.K., Jul. 2005, pp. 369–389.
[28] S. S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998, pp. 127–132.
[29] B. Zhou and J. Hansen, "Unsupervised audio stream segmentation and clustering via the Bayesian information criterion," in Proc. Int. Conf. Spoken Language Process., vol. 3, Beijing, China, Oct. 2000, pp. 714–717.
[30] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news," in Proc. DARPA Speech Recognition Workshop, Chantilly, VA, Feb. 1997, pp. 97–99.
[31] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, "Improving speaker diarization," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, Nov. 2004. [Online]. Available: http://www.limsi.fr/Individu/barras/publis/rt04f_diarization.pdf
[32] A. E. Rosenberg, A. Gorin, Z. Liu, and S. Parthasarathy, "Unsupervised speaker segmentation of telephone conversations," in Proc. Int. Conf. Spoken Language Process., Denver, CO, Sep. 2002, pp. 565–568.
[33] S. Meignier, D. Moraru, C. Fredouille, L. Besacier, and J.-F. Bonastre, "Benefits of prior acoustic segmentation for automatic speaker segmentation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Montreal, QC, Canada, May 2004, pp. 397–400.
[34] D. Liu and F. Kubala, "Online speaker clustering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Hong Kong, China, Apr. 2003, pp. 572–575.
[35] D. Moraru, S. Meignier, L. Besacier, J.-F. Bonastre, and I. Magrin-Chagnolleau, "The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation," presented at Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. [Online]. Available: http://www.lia.univ-avignon.fr/fich_art/339-mor-icassp2003.pdf
[36] M. Cettolo, "Segmentation, classification and clustering of an Italian corpus," in Proc. Recherche d'Information Assistée par Ordinateur (RIAO), Paris, France, Apr. 2000. [Online]. Available: http://munst.itc.it/people/cettolo/papers/riao00a.ps.gz
[37] J. Ajmera and C. Wooters, "A robust speaker clustering algorithm," in Proc. IEEE ASRU Workshop, St. Thomas, U.S. Virgin Islands, Nov. 2003, pp. 411–416.
[38] S. E. Tranter, M. J. F. Gales, R. Sinha, S. Umesh, and P. C. Woodland, "The development of the Cambridge University RT-04 diarization system," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, Nov. 2004. [Online]. Available: http://mi.eng.cam.ac.uk/reports/abstracts/tranter_rt04.html
[39] H. Jin, F. Kubala, and R. Schwartz, "Automatic speaker clustering," in Proc. DARPA Speech Recognition Workshop, Chantilly, VA, Feb. 1997, pp. 108–111.
[40] M. Ben, M. Betser, F. Bimbot, and G. Gravier, "Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs," in Proc. Int. Conf. Spoken Language Process., Jeju Island, Korea, Oct. 2004, pp. 2329–2332.
[41] S. Meignier, J.-F. Bonastre, C. Fredouille, and T. Merlin, "Evolutive HMM for multispeaker tracking system," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. II, Istanbul, Turkey, Jun. 2000, pp. 1201–1204.
[42] S. Meignier, J.-F. Bonastre, and S. Igounet, "E-HMM approach for learning and adapting sound models for speaker indexing," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, Jun. 2001, pp. 175–180.
[43] D. Reynolds, E. Singer, B. Carlson, J. O'Leary, J. McLaughlin, and M. Zissman, "Blind clustering of speech utterances based on speaker and language characteristics," in Proc. Int. Conf. Spoken Language Process., vol. 7, Sydney, Australia, Dec. 1998, pp. 3193–3196.
[44] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. Odyssey Speaker and Language Recognition Workshop, Crete, Greece, Jun. 2001, pp. 213–218.
[45] C. Barras and J.-L. Gauvain, "Feature and score normalization for speaker verification of cellular data," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. II, Hong Kong, China, Apr. 2003, pp. 49–52.
[46] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain, "Speaker diarization from speech transcripts," in Proc. Int. Conf. Spoken Language Process., Jeju Island, Korea, Oct. 2004, pp. 1272–1275.
[47] S. E. Tranter, "Who really spoke when?—Finding speaker turns and identities in audio," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Toulouse, France, May 2006, pp. 1013–1016.
[48] M. J. F. Gales, D. Y. Kim, P. C. Woodland, H. Y. Chan, D. Mrva, R. Sinha, and S. E. Tranter, "Progress in the CU-HTK transcription system," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1511–1523, Sep. 2006.
[49] D. Moraru, S. Meignier, C. Fredouille, L. Besacier, and J.-F. Bonastre, "The ELISA consortium approaches in speaker segmentation during the NIST 2003 Rich Transcription evaluation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Montreal, QC, Canada, May 2004, pp. 373–376.
[50] S. E. Tranter, "Two-way cluster voting to improve speaker diarization performance," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Philadelphia, PA, Mar. 2005, pp. 753–756.
[51] D. Liu, D. Kiecza, A. Srivastava, and F. Kubala, "Online speaker adaptation and tracking for real-time speech recognition," in Proc. Eur. Conf. Speech Commun. Technol., Lisbon, Portugal, Sep. 2005, pp. 281–284.
[52] J. G. Fiscus, J. S. Garofolo, A. Le, A. F. Martin, D. S. Pallett, M. A. Przybocki, and G. Sanders, "Results of the fall 2004 STT and MDE evaluation," in Proc. Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY, Nov. 2004.
[53] Q. Jin, K. Laskowski, T. Schultz, and A. Waibel, "Speaker segmentation and clustering in meetings," in Proc. ICASSP Meeting Recognition Workshop, Montreal, QC, Canada, May 2004. [Online]. Available: http://isl.ira.uka.de/publications/SchultzJin_NIST04.pdf
[54] D. Istrate, N. Scheffler, C. Fredouille, and J.-F. Bonastre, "Broadcast news speaker tracking for the ESTER 2005 campaign," in Proc. Eur. Conf. Speech Commun. Technol., Lisbon, Portugal, Sep. 2005, pp. 2445–2448.
[55] F. Kubala, S. Colbath, D. Liu, A. Srivastava, and J. Makhoul, "Integrated technologies for indexing spoken language," Commun. ACM, vol. 43, no. 2, pp. 48–56, Feb. 2000.
[56] J. H. L. Hansen, R. Huang, B. Zhou, M. Seadle, J. R. Deller, A. R. Gurijala, M. Kurimo, and P. Angkititrakul, "SpeechFind: Advances in spoken document retrieval for a national gallery of the spoken word," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 712–730, Sep. 2005.
[57] J.-F. Bonastre, F. Wils, and S. Meignier, "ALIZE, a free toolkit for speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. I, Philadelphia, PA, Mar. 2005, pp. 737–740.
[58] D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Q. Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and B. Xiang, "The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. IV, Hong Kong, China, Apr. 2003, pp. 784–787.

Sue E. Tranter (M'04) received the M.Eng. degree in engineering science, specializing in information engineering, from the University of Oxford, Oxford, U.K., in 1996 and the M.Phil. degree in computer speech and language processing from the University of Cambridge, Cambridge, U.K., in 1997.
Following this, she worked as a Research Assistant on MultiMedia Document Retrieval at the University of Cambridge until 2000, and then on nonlinear control theory at the University of Oxford. Since 2002 she has been a Research Associate on the Effective Affordable Reusable Speech-To-Text (EARS) project at the University of Cambridge, specializing in speaker segmentation and clustering.

Douglas Reynolds (SM'98) received the B.E.E. degree (with highest honors) and the Ph.D. degree in electrical engineering, both from the Georgia Institute of Technology, Atlanta.
He joined the Speech Systems Technology Group (now the Information Systems Technology Group), Lincoln Laboratory, Massachusetts Institute of Technology, Cambridge, in 1992. Currently, he is a Senior Member of Technical Staff and his research interests include robust speaker and language identification and verification, speech recognition, and general problems in signal classification and clustering. Douglas is a Senior Member of the IEEE Signal Processing Society and a cofounder and member of the steering committee of the Odyssey Speaker Recognition workshop.