Xavier Anguera: Robust Speaker Diarization for Meetings

Abstract
This thesis presents research on the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where usually more than one microphone is available. The main research and system implementation were carried out while visiting the International Computer Science Institute (ICSI, Berkeley, California) for a period of two years.
Speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where neither the optimum number of speakers nor their identities are known a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, which does not allow their direct application to the meetings domain. Although some efforts had been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which hinders their use on data different from that which they were adapted to.
The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI as a starting point, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then applies a train-free speech/non-speech detector to this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. This system has been modified to use speaker location information (when available), and several algorithms have been adapted or newly created to adapt the system's behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data.
The resulting system is flexible with respect to the meeting room layout, in terms of the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and application domains. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms, proving their suitability to the task.
Resum
This doctoral thesis presents the research carried out in the area of speaker diarization for meeting rooms. It studies the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where access to more than one microphone is normally available for processing. The main block of research was carried out during a stay at the International Computer Science Institute (ICSI, Berkeley, California) over a period of two years.
Speaker diarization has been studied extensively in the domain of radio and television recordings. Most of the proposed systems use some kind of hierarchical clustering of the data into acoustic groups, where neither the optimum number of speakers nor their identities are known beforehand. A very commonly used method is called "bottom-up clustering", in which many initial acoustic groups of data are defined and then iteratively merged until the optimum number of groups is obtained, according to a stopping criterion. All these systems are based on the analysis of a single input channel, which does not allow their direct application to meetings. Furthermore, many of these algorithms need to train models or tune the system parameters using external data, which hinders the application of these systems to data different from those used for adaptation.
The implementation proposed in this thesis aims to solve the aforementioned problems. It takes as its starting point the existing bottom-up speaker diarization system at ICSI. First, the available recording channels are processed to obtain a single audio channel of higher quality, together with information on the positions of the speakers present. Then a speech/non-speech detection system that requires no prior training is applied, and the resulting speech segments are processed with an improved version of the mono-channel speaker diarization system. This system has been modified to use speaker position information (when available), and new algorithms have been adapted or created so that the system obtains as much information as possible directly from the acoustic signal, making it less dependent on development data.
The resulting system is flexible and can be used in any kind of meeting room regardless of the number of microphones or their placement. Moreover, the system requires no training data at all, making it easier to adapt to different kinds of data or application domains. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were presented, with excellent results, at the NIST RT05s and RT06s Rich Transcription evaluations for meetings, where they were evaluated with data from two different subdomains (lectures and conferences). In addition, experiments were run using all the data from the RT evaluations to demonstrate the suitability of the proposed algorithms for this task.
Acknowledgements
It is always difficult to strike a good balance in deciding whom to thank, so as not to leave anyone important out and not to go on for too long. Everyone's life changes constantly, influenced by the people in it. In my life there have been some people who have had an influence on who I have become, and to whom I am thankful.
First of all, to my parents and grandparents. I have always felt they were behind me, helping
me to overcome any rock on my road, and giving me the opportunity to always pursue my
dreams. It is to my parents that I especially dedicate the effort put into writing this thesis.
I would certainly not be writing this if it were not for Javier Hernando, my co-advisor, who became my friend a few years ago, then my co-worker, then my advisor. I thank him for always
believing in me and for letting me find my path.
Although I met him later than Javier, Chuck Wooters, my second co-supervisor, has been an
incredible source of knowledge, advice and friendship during my stay in Berkeley. He was always
welcoming me with a smile and ready to help. It soon became clear we both shared a passion
for speech processing and very similar ideals. It has been a pleasure and an honor being able to
work with him.
The two years doing research at ICSI in Berkeley have been very important for me and for
this PhD. They would not have been possible without the AMI project training program, the
Spanish visitors program and ICSI for hosting me. Many thanks go to Barbara Peskin, Nelson
Morgan and everyone responsible for both programs.
In my path through education there have been many good friends and partners in sacrifice to whom I have looked for motivation, support and enjoyment. In these later PhD years I
would like to always remember people like Pablo, Jan, Mireia, Pere, Andrey, Marta, Jordi and
Josep Maria at UPC, and Manolo, Kofi, Marc, Arlo, James, Andy, Adam and Mathew at ICSI,
just to name a few.
Over the years I have been learning some dance steps, and by now dancing has become not only a good form of exercise but also a way to relax. Thanks to all my tango and ballroom
partners for suffering my stumbles and grumpy days.
Finally, thank you for reading this thesis.
Contents
1 Introduction
1.1 Context and Motivations of this Thesis
1.2 Definition of the Thesis Objectives
1.3 Outline of the Thesis
6 Experiments
6.1 Meetings Domain Experiments Setup
6.1.1 Baseline Systems
6.1.2 Databases
6.1.3 Evaluation Metrics
6.1.4 Reference Segmentation Selection and Calculation
6.2 Experiments from Broadcast News to Meetings
6.3 Speech/Non-Speech Detection Block
6.4 Acoustic Beamforming Experiments
6.4.1 Baseline System Analysis
6.4.2 Reference Channel Estimation Analysis
6.4.3 TDOA Post-Processing Analysis
6.4.4 Signal Output Algorithms Analysis
6.4.5 Use of the Beamformed Signal for ASR
6.5 Speaker Diarization Module Experiments
6.5.1 Individual Algorithms Performance
6.5.2 Algorithms Agglomeration Performance
6.6 Overall Experiments and Analysis of Results
8 Conclusions
8.1 Overall Thesis Final Review
8.2 Review of Objectives Completion
8.3 Possible Future Work Topics
Bibliography
List of Tables
7.1 Systems summary description and DER on the evaluation set for RT05s
7.2 Results for RT06s Speaker Diarization, conference room environment
7.3 Results for RT06s Speaker Diarization, lecture room environment
7.4 Results for RT06s Speech Activity Detection (SAD). Results with * are only for a subset of segments
5.1 Linear microphone array with all microphones equidistant at distance d
List of Figures
6.1 Energy-based system errors depending on its segment minimum duration
6.2 Model-based system errors depending on its segment minimum duration
6.3 Individual meetings DER vs. SNR vs. number of microphones in the RT06s system
6.4 Development set SNR modifying the percentage of noise threshold adjustment
6.5 Development set SNR values modifying the Viterbi transition prob. weights in the F&S algorithm
6.6 Development set SNR values modifying the number of N-best values used for TDOA selection
6.7 DER for the model complexity selection algorithm using different CCR values
6.8 DER for the initial number of clusters algorithm using different CCR values
6.9 DER for the combination of complexity selection + initial number of clusters using different CCR values
6.10 DER variation with the number of parallel models used in CV-EM training
6.11 DER variation with the number of friends used in the friends-and-enemies initialization
6.12 DER variation with the percentage of accepted frames and used Gaussians in frame purification
6.13 DER scores for the baseline system setting the relative weights by hand on development data
6.14 DER evolution with the weight computation iterations
6.15 DER evolution changing the initial feature stream weights
6.16 DER variation with the number of Gaussian mixtures initially assigned to the TDOA models
6.17 DER variation with the CCR parameter in the agglomerate system
6.18 DER variation with the number of friends in the agglomerate system
6.19 DER variation with the number of EM iterations of a standard EM-ML training algorithm
6.20 DER variation with the number of CV-EM parallel models
6.21 DER variation with the frame % acceptance for the frame purification algorithm
6.22 DER variation with the Gaussian % used in the frame purification algorithm
7.1 DER break-down by meeting for the RT05s conference data
7.2 DER break-down by show for the RT05s lecture data
7.3 DER break-down by show for the RT06s conference data
7.4 DER break-down by show for the RT06s lecture data
Chapter 1
Introduction
The purpose of this initial chapter is to present the problems and motivations that sparked and drove the development of this thesis work, and to state what it aimed to achieve. Working towards this, section 1.1 introduces the topic of the thesis and the motivations behind it. Section 1.2 defines the set of objectives pursued with its development. Finally, section 1.3 outlines the contents to be found in each of the remaining chapters of this document.
1.1 Context and Motivations of this Thesis

Speech is still one of the most used ways that humans have to communicate their ideas and to convey information to the world outside of ourselves. In fact, the quantity of information available by means of speech (telephone, radio, television, meetings, lectures, internet, etc.) that is being stored is vast and rapidly increasing, given the ever cheaper means of storage available nowadays. Following the two maxims that say "time is money" and "information is power", it becomes clear how desirable it is to have access to all this information; but as we only have two ears and limited time, we would like someone else to access it for us and to tell us only what is important, not wasting time listening to multiple hours of contentless recordings. At other times we might be interested in accessing some particular bit of this information whose location we do not know, lost inside our "Alexandria audio library". This is one area where speech technology can make a big contribution by means of techniques like audio indexing, where information is automatically extracted from the audio, which makes the processing, search and retrieval of the desired content much easier. To draw a parallel, acoustic indexing could be to an audio-based library what a good librarian is to a paper-based library.
Most of the time when a person speaks, his/her speech is directed to someone or something else with whom we expect to communicate. In fact, even when we are talking to an animal, a machine or a little baby we are adapting our speech so that the message is conveyed to this outer
entity. When dealing with information extraction from a recording, it becomes very important
to answer questions like: “what was said?” as it conveys the message, but also “who said it?” as
information varies depending on who utters the spoken words.
Within the speech technologies, the broad topic of acoustic indexing studies the classification of sounds into different classes/sources. Such classes could be as broad as [cats, dogs, humans] or more concrete like [pit bull, pug, German shepherd]. Algorithms used for acoustic indexing are concerned with the correct classification of the sounds, but not necessarily with correctly separating them when more than one exists in the same audio segment. These purely classification-oriented techniques have sometimes been called audio clustering, and they benefit from the broad topic of clustering, well studied in many areas.
When multiple sounds appear in the same audio signal one must turn to techniques known as audio diarization to process them. As described in Reynolds and Torres-
Carrasquillo (2004), audio diarization is known as the process of annotating an input audio
signal with information that attributes (possibly overlapping) temporal regions of signal to their
specific sources/classes (i.e. creating a “diary” of events in the audio document). These can
include particular speakers, music, background noise sources, and other signal source/channel
characteristics. It is very dependent on the application which particular classes are defined, be-
coming as broad or narrow as intended. In the simplest case, one could refer as audio diarization
to the task of speech versus non-speech detection.
When the possible classes correspond to the different speakers in a recording these techniques
are called speaker diarization. They aim at answering the question “Who spoke when?” given
an audio signal. Algorithms doing speaker diarization need to locate each speaker turn and
assign them to the appropriate speaker cluster. The output of the system is a set of segments
with a unique ID assigned to each person that intervenes in the recording, leaving it to speaker
identification systems to determine the person’s identity given each ID. Until the present time,
the domains that have received most research attention within the speaker diarization community
have been
• Telephone speech: Speaker diarization systems started being evaluated by NIST (National
Institute for Standards and Technology 2006) using single channel telephone speech signals,
within the speaker recognition evaluations in the late 1990’s.
• Broadcast News (radio and TV broadcasts): Mainly with the impetus of DARPA's EARS program (DARPA Effective, Affordable, Reusable Speech-to-Text (EARS) 2004), rich transcription of broadcast news content became the primary research domain for speaker diarization roughly from 2002 to 2004. Rich transcription consists of the addition of extra information (generally called metadata, including speaker diarization information) to the speech-to-text transcriptions.
• Meetings (lectures and conferences): Mainly due to the impetus of the European CHIL and AMI projects (Computers in the Human Interaction Loop (CHIL) website (2006), Augmented Multiparty Interaction (AMI) website (2006)), the focus of research shifted from broadcast news to meetings around 2004. Despite its current prominence, many smaller projects had already studied and recorded meetings in the 1990s.
Talking about speaker diarization is equivalent to talking about speaker segmentation and clustering of an audio document, as both these techniques are normally used together in diarization. On one hand, speaker segmentation (also called speaker change detection) aims at finding the changes of speaker in an audio recording. It differs from acoustic change detection in that it does not treat changes in the background sounds during a single speaker segment as changes to be considered. On the other hand, speaker clustering agglomerates audio segments into homogeneous groups coming from a similar/same source. In its general definition it does not constrain the process to a single file, as all it requires is that each segment contain only a single speaker. When used in conjunction with speaker segmentation for speaker diarization, it clusters the segments created by the segmentation of one single recording.
Finally, also related to speaker diarization are speaker tracking techniques, where the identity of one or more speakers is known a priori and the aim is to locate their interventions within the audio document.
From a general point of view, speaker diarization algorithms are a very useful part of many
speech technology systems, for example:
• Speaker indexing and rich transcription: By indexing the audio according to the speakers
and adding extra information to speech transcripts it becomes easier for humans to locate
information and for machines to process it. Typical automatic uses of such output might be speech summarization and translation.
• Speaker segmentation and clustering helping Automatic Speech Recognition (ASR) sys-
tems: Segmentation algorithms are used to split the audio into small segments (maintaining
all acoustic units intact) for the ASR systems to process. Also, speaker diarization algo-
rithms are used to cluster all the input data into speakers towards model adaptation.
Sometimes the clustering is performed into broader speaker clusters (fewer than the actual number of speakers) to maximize the amount of adaptation data.
• Preprocessing modules for speaker-based algorithms: Speaker diarization can be used be-
fore speaker tracking, speaker identification, speaker verification and other single speaker-
based algorithms, to split the data into individual speakers.
This thesis deals with speaker diarization in the meetings environment. While
doing so, and following the guidelines proposed by NIST in the Rich Transcription (RT) evalu-
ations, it processes the data without any prior information on the number of speakers present
in the meeting or their identities. Furthermore, the system is intended for use without any
assumption on the meeting room layout, which usually contains multiple microphones record-
ing synchronously. These microphones are of different kinds and it is assumed that their exact
location is unknown to the system.
This thesis is presented in partial fulfillment of the requirements for the PhD in Signal Theory and Communications at UPC, where I have taken the necessary doctorate courses and previously prepared the thesis proposal. The proposed system was implemented at the International Computer Science Institute during a two-year research stay, with funding from the AMI project during the first year and from the Spanish visitors program during the second. The implementation of speaker diarization for meetings takes into account all the prior knowledge in speaker diarization for the broadcast news environment present at ICSI at the start of this project. It is based on a modified version of the Bayesian Information Criterion (BIC) which does not require the tuning of any penalty term. It performs an agglomerative clustering of the initial acoustic data after it has been filtered with a speech/non-speech detector to eliminate all non-speech information. The clustering finishes when no cluster pair is available for merging.
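As a rough illustration of this bottom-up procedure (and not the actual ICSI implementation), the following Python sketch shows the clustering loop with a generic, hypothetical merge_score() function standing in for the modified BIC comparison; a positive score means two clusters are better modeled together.

import numpy as np

def agglomerative_clustering(clusters, merge_score):
    """clusters: list of per-cluster feature arrays of shape (frames, dims).
    Iteratively merges the best-scoring cluster pair until no pair has a
    positive merge score (the stopping criterion described above)."""
    while len(clusters) > 1:
        best_pair, best_score = None, 0.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = merge_score(clusters[i], clusters[j])
                if score > best_score:
                    best_pair, best_score = (i, j), score
        if best_pair is None:
            break  # no cluster pair is available for merging
        i, j = best_pair
        merged = np.concatenate([clusters[i], clusters[j]], axis=0)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters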
Improvements are proposed for the system in three main areas. To extend the applicability
of the system to multiple microphone recordings it implements a filter&sum beamforming and
adds several algorithms to improve the output signal when microphones are very dissimilar. The
beamforming algorithm also started being used by the ASR system at ICSI in the meetings Rich
Transcription evaluations with great results directly attributed to this module. Another area is
the speech/non-speech detection where a new train-free system was implemented to allow for an
accurate filtering of the silence segments in a meeting. Finally, within the inherited broadcast
news system, several algorithms are either added or improved to increase the robustness of the
system and to allow for it to extract as much information as possible from each recording allow-
ing for fast adaptation to new domains. These include the automatic initial number of clusters
and model complexity selection algorithms, two purification algorithms to allow better compar-
isons between clusters and a a more robust training. Finally, the time delay of arrival between
microphones in the beamforming module is successfully used in the diarization to increase the
amount of information used to perform the diarization.
1.2 Definition of the Thesis Objectives

The main objective of this thesis is the development of a robust speaker diarization system
towards its use in the meetings domain. In order to fully accomplish this, a set of concrete
objectives is established (without any order of importance):
• The speaker diarization system is to be built using the expertise accumulated at ICSI
in the research done in broadcast news. First, the differences between broadcast news
and meetings need to be analyzed. Then the mono-channel speaker diarization system
used in broadcast news is to be adapted to the meetings domain by first addressing the
points where both domains differ, and then refining the existing algorithms to improve their performance.
• The algorithms implemented for the meetings system should reduce show flakiness, i.e. sudden changes in system performance, within the same set, upon slight modification of the parameter settings. They should also improve robustness across sets, yielding similar results when running the same system on data other than the development set. This can be achieved by researching system parameters that focus on the particular characteristics of the individual audio excerpts instead of the whole set, thus becoming more robust to changes in the set used. These parameters need to have a flat performance response around the optimum so that small changes do not dramatically affect the outcome.
• In a similar fashion, the system is also aimed at being train-free (no external data is
used to train acoustic models prior to the test). This allows both a quick adaptation to
domains and a robust performance when new data within the same domain has a different
acoustic content than the development data. This was already a goal of the broadcast news
system, where only the speech/non-speech detector needed to be trained. The proposed
system aims at replacing this module with a train-free alternative and at implementing all new algorithms and improvements so that they are independent of any data outside the test set for model training. Development data will still be used to set the system parameters.
• The system is developed for participation in the NIST Rich Transcription (RT) evaluations of 2005 and 2006 in order to benchmark the performance of the implemented technology and algorithms against other systems given the same data. All decisions taken and parameter settings are in tune with the existing rules in these evaluations, which intend to measure general system performance, without emphasis on any particular application.
• Last but not least, emphasis is placed on publishing the results and improvements made to the system, so that other research groups are aware of the research progress made at ICSI in terms of speaker diarization. Furthermore, efforts are made to make the system, either in its entirety or some of its modules, available for people to use, both internally and by external users, giving support when possible.
1.3 Outline of the Thesis

This thesis is split into seven main chapters on the topic of robust speaker diarization for meetings. A brief description follows of what is to be found in each chapter.
Chapter 2 takes a look into the proposed problem: how to robustly and optimally determine
“who spoke when?” in a meeting domain where multiple microphones are usually available for
recording. In order to address it, a review of what feature parameters have been previously
used in speaker-related problems is followed by an analysis of the state of the art in speaker
segmentation, which plays an important part in many speaker diarization algorithms. Then a
review of previously proposed diarization algorithms and implementations sets the grounds for
a description of the projects, databases and systems that, to the date, have had its main focus
in the meetings domain. Finally, and given the multichannel nature of a meeting room, acoustic
enhancement theory is introduced to process multiple microphones, and the main techniques are
reviewed for the purpose of obtaining a single “enhanced” channel from multiple inputs.
Chapter 3 leads the reader through the system implementation, basing it on the diarization system that existed for broadcast news prior to this thesis work. An initial review of the ideas behind the system and the implementation of the broadcast news speaker diarization system is followed by an analysis of the differences and needs involved in adapting it to the meetings domain. Finally, a description of the meetings implementation in comparison to the prior system is given. Each of the blocks and algorithms that have been reused, refurbished or created from scratch for the meetings domain is introduced, while leaving for later chapters the detailed description of the novel algorithms presented in this thesis.
Chapter 4 describes in detail all the novel techniques introduced in this thesis for the pro-
cessing of single channel acoustic data. These include a new speech/non-speech decoder which
improves the previous version by being totally train-free and more adapted to the diarization
process. Also presented are several algorithms for speaker cluster description and modeling, including algorithms for the selection of the number of clusters, model complexity selection, a new training
Chapter 6 describes the experiments performed to show the appropriateness of all the techniques. First, it describes the setup for running the experiments and then shows and explains the results for each one, comparing them to a baseline derived from the original broadcast news system prior to this thesis work or from intermediate (well established) points.
Chapter 7 describes the content and motivations behind the NIST Rich Transcription evaluations, which have been the tool used to assess the quality of the proposed diarization system and to compare it to other research systems, and which are the source of all the datasets used in the experiments. ICSI's submissions for 2005 and 2006 are described in detail and results for those evaluations are given.
Finally, chapter 8 summarizes the major contributions and results obtained in this thesis
and proposes some improvements and future work.
Chapter 2

State of the art
In this chapter the main techniques used in recent years for speaker diarization (i.e. speaker segmentation and clustering) and for acoustic beamforming are reviewed. Initially, the features that have been found suitable for speaker diarization are explained. Then the algorithms and systems that deal with the task at hand in general are introduced. Finally, some groundwork is laid on techniques oriented towards performing speaker diarization in meetings, this being the main application domain of this thesis.
Speaker diarization can be defined as a subtype of audio diarization in which the speech segments of the signal are broken down into the different speakers (Reynolds and Torres-Carrasquillo 2004). It generally answers the question "Who spoke when?" and it is sometimes referred to as speaker segmentation and clustering. In the application domain of this thesis it is performed without any prior knowledge of the identity of the speakers in the recordings or of how many there are. This, though, is not a requirement for calling it speaker diarization, as partial knowledge of the identities of some people in the recordings, of the number of speakers or of the structure of the audio (what follows what) might be available and used depending on the application at hand. None of this information is provided in the RT evaluation campaigns organized by NIST (NIST Spring Rich Transcription Evaluation in Meetings website, http://www.nist.gov/speech/tests/rt/rt2005/spring 2006), which is the task used to evaluate all the algorithms presented in this thesis.
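To make the expected output concrete, the short Python snippet below prints a few invented speaker segments in a simplified, RTTM-like form (RTTM being the file format used to score diarization in the NIST RT evaluations); the meeting name, times and speaker labels are made up, and the exact field layout of real RTTM files may differ from this sketch.

# Hypothetical diarization output: (start time, duration, speaker label), in seconds.
segments = [(12.30, 4.10, "spkr_0"), (16.40, 7.80, "spkr_1"), (24.20, 3.00, "spkr_0")]
for start, dur, spk in segments:
    # Simplified RTTM-style SPEAKER record for an invented meeting recording.
    print(f"SPEAKER meeting_01 1 {start:.2f} {dur:.2f} <NA> <NA> {spk} <NA>")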
According to Reynolds and Torres-Carrasquillo (2004), there are three main domains of application for speaker diarization that have received special attention over the years:
• Broadcast news audio: Radio and TV programs with various kinds of programming, usually
containing commercial breaks and music, over a single channel.
• Recorded meetings: meetings or lectures where multiple people interact in the same room
or over the phone. Normally recordings are made with several microphones.
Furthermore, one could consider other particular domains, like air traffic communications,
dialog in the car, and others.
As part of speaker diarization, speaker segmentation and speaker clustering belong to the pattern classification family, where one tries to find categorical (discrete) classes for continuous observations of speech and, by doing so, finds the boundaries between them. Speech recognition
is also a pattern classification problem. As such, they all need to work on a feature set that
represents well the acoustic data and define a distance measure/method to assign each feature
vector to a class.
In general, clustering data into classes is a well studied technique for statistical data analysis,
with applications in many fields, including machine learning, data mining, pattern recognition,
image analysis, bioinformatics and others.
When using clustering techniques for speaker or acoustic clustering one needs to define beforehand the segments that are going to be clustered, which might be of different sizes and characteristics (speech, non-speech, music, noises). When creating the segments using segmentation techniques one needs to be able to separate the speech stream into speakers, not into words or phones. Any speech segment is populated with voiced and unvoiced phones, and with short pauses between some phones or at punctuation. A speaker segmentation and clustering algorithm needs to define the properties of the data to be assigned to a single speaker and the techniques used to assign such data to a single cluster. To do so one needs to choose appropriate acoustic models, their size and their training algorithms, so that they correctly identify differences in the acoustics at the speaker level.
The first section in this chapter takes a look at the features that have been proven useful
for speaker based processing (like speaker diarization). Emphasis is given to alternatives to the
traditional features, to focus on speaker characteristics that better discriminate and help identify
the speakers present in a recording.
Following the features review, an overview of the main techniques that have been used in the area of speaker segmentation and speaker diarization is given. Speaker segmentation is a first step in many speaker diarization systems and therefore it is useful to review the techniques that have mainly been used in the past, laying the groundwork for the speaker diarization review. After explaining the main speaker diarization systems, the focus will shift towards speaker diarization for meetings, which is the implementation focus of this thesis.
In meetings one usually encounters several microphones available for processing, all of them located inside the meeting room at several positions around the speakers. Although most of these microphones are not designed to form a microphone array in theory, in practice it is found useful to apply microphone array beamforming techniques in order to combine the microphone data into one "enhanced" channel and then process only this channel with the diarization system. This has the advantage that the speaker diarization system remains totally transparent to the particularities of each meeting room setting and processes only one channel in any case, gaining in speed versus any other solution involving some sort of processing of all channels in parallel.
In the last section of this state of the art review the main techniques currently available in acoustic beamforming are covered, as these have been applied in the implemented system in order to take advantage of the multiplicity of available microphones. First, an overview of the techniques used to obtain an "enhanced" output signal from multiple input signals is given, and then possible ways to estimate the delay between channels are explored, as this is necessary to align the acoustic data, a step used by the majority of beamforming algorithms.
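As a rough sketch of this "align and sum" idea (and not the filter&sum implementation used later in this thesis), the Python code below estimates a single delay per channel against a reference channel using plain cross-correlation and averages the aligned signals; real systems estimate delays per analysis window and apply weighting and channel selection.

import numpy as np

def delay_and_sum(channels, ref=0, max_delay=500):
    """channels: list of equally long 1-D numpy arrays, one per microphone.
    Returns a single 'enhanced' signal obtained by aligning every channel
    to the reference one and averaging (wrap-around edge effects ignored)."""
    ref_sig = channels[ref]
    aligned = []
    for sig in channels:
        # Search, over a limited lag range, for the delay maximizing the
        # cross-correlation between this channel and the reference.
        lags = range(-max_delay, max_delay + 1)
        xcorr = [np.dot(ref_sig[max_delay:-max_delay],
                        sig[max_delay + lag: len(sig) - max_delay + lag])
                 for lag in lags]
        best_lag = lags[int(np.argmax(xcorr))]
        aligned.append(np.roll(sig, -best_lag))
    return np.mean(aligned, axis=0)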
2.1 Acoustic Features for Speaker Diarization

Speaker diarization falls into the category of speaker-based processing techniques. Features extracted from the acoustic signal are intended to convey information about the speakers in the conversations in order to enable the systems to separate them optimally.
As in speaker recognition and speech recognition systems, commonly used parameterization features in speaker diarization are Mel Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Perceptual Linear Prediction (PLP), Linear Predictive Coding (LPC) and others.
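As a minimal illustration of how such features are typically computed (assuming the third-party librosa package; window length, step and number of coefficients are common choices, not necessarily those used in this thesis):

import librosa

# Load one channel of a meeting recording and compute MFCC features with
# 25 ms analysis windows, a 10 ms step and 19 cepstral coefficients.
signal, sr = librosa.load("meeting_channel1.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=19,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)  # (n_mfcc, number_of_frames)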
In this section some research is pointed out that proposes alternative parameters focusing on speaker characteristics and/or the particular conditions of the tasks they are applied to, all within the speaker-based area, which can constitute an advantage when used alone or in conjunction with the most common parameterization techniques. Although the use of these parameters is still not widespread, they should constitute the tip of the iceberg of parameters exploiting speaker information yet to come.
In order to avoid the influence of background noises and other non-speaker related events,
in Pelecanos and Sridharan (2001) and more recently in Ouellet, Boulianne and Kenny (2005),
feature warping techniques are proposed to change the shape of the p.d.f. of the features to a
Gaussian shape prior to their modeling. They have been applied with success in Sinha, Tranter,
Gales and Woodland (2005) and Zhu, Barras, Lamel and Gauvain (2006) for speaker diarization
in broadcast news and meetings respectively.
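A minimal sketch of the warping idea follows: within a sliding window, each feature value is replaced by the value its rank would have under a standard normal distribution. This is only one common way to implement warping and is not taken from the cited papers; the window length is an arbitrary placeholder.

import numpy as np
from scipy.stats import norm

def feature_warp(features, win=300):
    """features: (n_frames, n_dims) array. Maps each coefficient's rank inside
    a sliding window through the inverse normal CDF (Gaussianization)."""
    n, d = features.shape
    warped = np.empty_like(features, dtype=float)
    half = win // 2
    for t in range(n):
        window = features[max(0, t - half):min(n, t + half)]
        m = len(window)
        for k in range(d):
            rank = np.sum(window[:, k] < features[t, k]) + 1  # 1-based rank
            warped[t, k] = norm.ppf((rank - 0.5) / m)
    return warped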
In the area of speech activity detection (SAD) several features have also been proposed in recent years. In Kristjansson, Deligne and Olsen (2005) some well known features and other new ones are proposed, based on the autocorrelation of the signal or on spectral characteristics.
In Nguyen (2003) a new theoretical framework for natural isometric frontend parameters based on differential geometry is presented and applied to speaker diarization, improving performance when used in combination with standard MFCC parameters.
In Moh, Nguyen and Junqua (2003), Tsai, Cheng and Wang (2004) and Tsai, Cheng, Chao and Wang (2005) speaker diarization systems are proposed that construct a speaker space from the data and project the feature vectors onto it prior to the clustering step. Similarly, Collet, Charlet and Bimbot (2005) propose the technique of anchor modeling (introduced in Sturim, Reynolds, Singer and J.P.Campbell (2001)), where acoustic frames are projected into an anchor model space (previously defined from outside data) and speaker tracking is performed with the resulting parameter vectors. They show that it improves robustness against outside interfering signals and they claim it to be domain independent.
When more than one microphone is used to collect the recordings (for example in meeting rooms), Pardo, Anguera and Wooters (2006a), Pardo, Anguera and Wooters (2006b), ICSI Meeting Recorder Project: Channel skew in ICSI-recorded meetings (2006) and Lathoud and McCowan (2003) show that the use of the time delays between microphones is useful for speaker diarization.
Finally, in Chan, Lee, Zheng and hua Ouyang (2006) the use of vocal source features is proposed for the task of speaker segmentation, using a system based on Delacourt and Wellekens (2000). Also, in Lu and Zhang (2002b) a real-time 2-step algorithm is proposed based on a Bayesian fusion of LSP, MFCC and pitch features.
2.2 Speaker Segmentation

Speaker segmentation has sometimes been referred to as speaker change detection and is closely
related to acoustic change detection. For a given speech/audio stream, speaker segmentation/
change detection systems find the times when there is a change of speaker in the audio. On a
more general level, acoustic change detection aims at finding the times when there is a change
in the acoustics in the recording, which includes speech/non-speech, music/speech and others.
Acoustic change detection can detect boundaries within a speaker turn when the background
conditions change.
The term "speaker segmentation" has also, erroneously, sometimes been used instead of speaker diarization for systems performing both a segmentation into different speaker segments and a clustering of such segments into homogeneous groups. As will be pointed out later on, many systems obtain a speaker diarization output by first performing a speaker segmentation and then grouping the segments belonging to the same speaker. At other times this distinction is not so clear, as segmentation and clustering are mixed together. In this thesis a system will be said to perform speaker segmentation when all the frames assigned to any particular speaker ID are contiguous in time. Otherwise the system will be said to perform speaker segmentation and clustering (or, equivalently, speaker diarization).
At a very general level, two main types of speaker segmentation systems can be found in the bibliography. The first kind comprises systems that perform a single processing pass over the acoustic data, from which the change-points are obtained. A second broad class comprises systems that perform multiple passes, refining the change-point detection decisions over successive iterations. This second class includes two-pass algorithms where in a first pass many change-points are suggested (more than there actually are, and therefore with a high false alarm error) and in a second pass such changes are reevaluated and some are discarded. Also part of this second broad class are systems that use an iterative processing of some sort to converge to an optimum speaker segmentation output. Many of the algorithms to find the change-points reviewed in this section (including all of the metric-based techniques) can either work alone or in a two-step system together with another technique.
On another level, a general classification of the methods available for speaker segmentation
will be used in this section to describe the different algorithms available. In the bibliography
(Ajmera (2004), Kemp, Schmidt, Westphal and Waibel (2000), Chen, Gales, Gopinath, Kanvesky
and Olsen (2002), Shaobing Chen and Gopalakrishnan (1998), Perez-Freire and Garcia-Mateo
(2004)) three groups are defined: metric-based, silence-based and model-based algorithms. In this
thesis this classification will be augmented with a fourth group (called “others”) to amalgamate
all other techniques that do not fit any of the three proposed classes. In the next section the
metric-based techniques are reviewed in detail and in 2.2.2 the other three groups are treated.
Metric-based segmentation is probably the most used technique to date. It relies on the computation of a distance between two acoustic segments to determine whether they belong to the same speaker or to different speakers, and therefore whether there exists a speaker change point in the audio at the point being analyzed. The two acoustic segments are usually next to each other (overlapping or not) and the change-point considered is between them. Most of the distances used for acoustic change detection can also be applied to speaker clustering in order to assess whether two speaker clusters belong to the same speaker.
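The scrolling two-window scheme common to most metric-based detectors can be summarized as in the Python sketch below, where distance() stands for any of the metrics reviewed next; the window size, step and threshold are arbitrary placeholders, not values taken from the cited systems.

import numpy as np

def detect_changes(features, distance, win=200, step=50, threshold=0.0):
    """features: (n_frames, n_dims) acoustic vectors. Slides two adjacent
    windows through the signal, computes a distance between them and keeps
    local maxima of the distance curve that exceed a threshold."""
    scores, centers = [], []
    for t in range(win, len(features) - win, step):
        scores.append(distance(features[t - win:t], features[t:t + win]))
        centers.append(t)
    changes = []
    for i in range(1, len(scores) - 1):
        if scores[i] > threshold and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]:
            changes.append(centers[i])  # frame index of a detected change point
    return changes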
Let’s consider two audio segments (i,j) of parameterized acoustic vectors Xi and Xj of
lengths Ni and Nj respectively, and with mean and variance values µi , σi and µj , σj . Each one
of these segments is modeled using Gaussian processes Mi (µi , σi ) and Mj (µj , σj ), which can
be a single Gaussian or a Gaussian Mixture Model (GMM). On the other hand, let’s consider
the agglomerate of both segments into X , with mean and variance µ, σ and the corresponding
Gaussian process M (µ, σ).
In general, there are two different kinds of distances that can be defined between any pair of
such audio segments. The first kind compares the sufficient statistics from the two acoustic sets
of data without considering any particular model applied to the data, which from now on will
be called statistics-based distances. These are normally very quick to compute and give good
performances if Ni and Nj are big enough to robustly compute the data statistics and the data
being modeled can be well defined using a single mean and variance.
A second group of distances are based on the evaluation of the likelihood of the data according
to models representing it. These distances are slower to compute (as models need to be trained
and evaluated) but can achieve better results than the statistics-based ones, as bigger models can be used to fit more complex data. These will be referred to as likelihood-based techniques. The following are the metrics found in the literature that are of interest for either case:
• Bayesian Information Criterion (BIC): The BIC is probably the most extensively used
segmentation and clustering metric due to its simplicity and effectiveness. It is a likelihood
criterion penalized by the model complexity (amount of free parameters in the model)
introduced by Schwarz (1971) and Schwarz (1978) as a model selection criterion. For a
given acoustic segment Xi , the BIC value of a model Mi applied to it indicates how well
the model fits the data, and is determined by:
BIC(Mi) = log L(Xi, Mi) − (λ/2) #(Mi) log(Ni)    (2.1)
where log L(Xi, Mi) is the log-likelihood of the data given the considered model, λ is a free design parameter dependent on the data being modeled, Ni is the number of frames in the considered segment and #(Mi) is the number of free parameters to estimate in model Mi. This expression is an approximation of the Bayes Factor (BF) (Kass and Raftery (1995), Chickering and Heckerman (1997)) when the acoustic models are trained via ML methods and Ni is considered large.
In order to use BIC to evaluate whether a change point occurs between both segments, the hypothesis that X better models the data is evaluated against the hypothesis that Xi and Xj modeled separately do instead, as in the GLR, by computing:

∆BIC(i, j) = R(i, j) − λP    (2.2)

The term R(i, j) can be written, for the case of models composed of a single Gaussian, as:

R(i, j) = (N/2) log|ΣX| − (Ni/2) log|ΣXi| − (Nj/2) log|ΣXj|    (2.3)
where P is the penalty term, which is a function of the number of free parameters in the model. For a full covariance matrix it is

P = (1/2) (p + (1/2) p(p + 1)) log(N)

with p being the dimension of the feature vectors.
The penalty term accounts for the likelihood increase of bigger models versus smaller ones.
For cases where GMM models with multiple Gaussian mixtures are used, eq. 2.2 is written
as
∆BIC(i, j) = log L(X, M) − (log L(Xi, Mi) + log L(Xj, Mj)) − λ ∆#(i, j) log(N)    (2.4)
where ∆#(i, j) is the difference between the number of free parameters in the combined
model versus the two individual models. For a mathematical proof on the equality of
equations 2.3 and 2.4 please refer to the appendix section.
Although ∆BIC(i, j) is the difference between two BIC criteria used to determine which model suits the data better, it is usual in the speaker diarization literature to refer to this difference simply as the BIC criterion. For the task of speaker segmentation, the technique was first used by Chen and Gopalakrishnan (Shaobing Chen and Gopalakrishnan (1998), Chen and Gopalakrishnan (1998), Chen et al. (2002)), where a single full covariance Gaussian was used for each of the models, as in eq. 2.3 (a small numerical sketch of this computation is given below).
Although not present in the original formulation, the λ parameter was introduced to adjust the effect of the penalty term on the comparison, which constitutes a hidden threshold on the BIC difference. Such a threshold needs to be tuned to the data and therefore its correct setting has been the subject of constant study. Several authors propose ways of automatically selecting λ
(Tritschler and Gopinath (1999), Delacourt and Wellekens (2000), Delacourt, Kryze and
Wellekens (1999a), Mori and Nakagawa (2001), Lopez and Ellis (2000a), Vandecatseye,
Martens et al. (2004)). In Ajmera, McCowan and Bourlard (2003) a GMM is used for each of the models (M, Mi and Mj) and, by building model M with a complexity equal to the sum of the complexities of Mi and Mj, the penalty term cancels out, avoiding the need to set any λ value. The result is equivalent to the GLR metric with this complexity constraint imposed on the models.
In the formulation of BIC by Schwarz (1978) the number of acoustic vectors available to train the model is supposed to tend to infinity for the approximation to converge. In real applications this becomes a problem when there is a big mismatch between the lengths of the two adjacent windows or clusters being compared. Some authors have successfully applied slight modifications to the original formula, either to the penalty term (Perez-Freire and Garcia-Mateo 2004) or to the overall value (Vandecatseye and Martens 2003), to reduce this effect.
Several implementations using BIC as a segmentation metric have been proposed. Initially
Shaobing Chen and Gopalakrishnan (1998) proposed a multiple changing point detection
algorithm in two passes, and later Tritschler and Gopinath (1999), Sivakumaran, Fortuna
and Ariyaeeinia (2001), sian Cheng and min Wang (2003), Lu and Zhang (2002a), Cettolo
and Vescovi (2003) and Vescovi, Cettolo and Rizzi (2003) followed with one or two-pass
algorithms. They all propose a system using a growing window with inner variable-length analysis segments to iteratively find the change points. Tritschler and Gopinath (1999) propose some ways to make the algorithm faster and to focus on detecting very short speaker changes. In Sivakumaran et al. (2001), Cettolo and Vescovi (2003) and Vescovi et al. (2003) speedups are proposed in the way the means and variances of the models are computed. In Roch and Cheng (2004) a MAP-adapted version of the models is presented, which allows shorter speaker change points to be found. By using MAP, this work departs from the way the models are trained in the original formulation (which defines an ML criterion).
Even with the efforts to speed up the processing of BIC, it is computationally more in-
tensive than other statistics-based metrics when used to analyze the signal with high
resolution, but its good performance has kept it as the algorithm of choice in many appli-
cations. This is why some people have proposed BIC as the second pass (refinement) of
a 2-pass speaker segmentation system. As described earlier, an important step in this di-
rection is taken with DISTBIC (Delacourt and Wellekens (2000), Delacourt et al. (1999a),
Delacourt, Kryze and Wellekens (1999b)) where the GLR is used as a first pass. Also in
this direction are Zhou and Hansen (2000), Kim, Ertelt and Sikora (2005) and Tranter and
Reynolds (2004), proposing the use of Hotelling's T² distance, and Lu and Zhang (2002a)
using KL2 (Kullback-Leibler) distance. In Vandecatseye et al. (2004) a normalized GLR
(called NLLR) is used as a first pass and a normalized BIC is used in the refinement step.
Some research has been done to combine alternative sources of information to help the BIC
in finding the optimum change point. This is the case in Perez-Freire and Garcia-Mateo
(2004) where image shot boundaries are used.
In sian Cheng and min Wang (2004) a two-pass algorithm using BIC in both passes is proposed. This is peculiar in that, instead of producing a first pass with a high false alarm (FA) rate and a second pass that merges some of the change-points, the first pass tries to minimize the FA rate and the second pass finds the remaining unseen speaker changes.
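The following numerical sketch illustrates the single-Gaussian ∆BIC comparison of eqs. 2.2-2.3 using numpy only; the λ value is a placeholder, and the sign convention (positive values favoring two separate models, i.e. a likely change point) follows the usual usage of this metric, which may differ across implementations.

import numpy as np

def delta_bic(xi, xj, lam=1.0):
    """xi, xj: (frames, dims) feature arrays of two adjacent segments, each
    assumed to contain more frames than feature dimensions. Computes
    Delta-BIC(i, j) = R(i, j) - lambda * P with single full-covariance Gaussians."""
    def logdet(data):
        # log-determinant of the full covariance matrix of a segment
        return np.linalg.slogdet(np.cov(data, rowvar=False))[1]
    x = np.concatenate([xi, xj], axis=0)
    n, ni, nj, p = len(x), len(xi), len(xj), x.shape[1]
    r = 0.5 * (n * logdet(x) - ni * logdet(xi) - nj * logdet(xj))
    penalty = 0.5 * (p + 0.5 * p * (p + 1)) * np.log(n)
    return r - lam * penalty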
• Generalized Likelihood Ratio (GLR): The GLR (first proposed for change detection by Willsky and Jones (1976) and Appel and Brandt (1982)) is a likelihood-based metric that proposes a ratio between two hypotheses: on one hand, H0 considers that both segments are uttered by the same speaker, and therefore X = Xi ∪ Xj ∼ M(µ, σ) represents the data better. On the other hand, H1 considers that each segment has been uttered by a different speaker, and therefore Xi ∼ Mi(µi, σi) and Xj ∼ Mj(µj, σj) together suit the data better. The ratio test is computed as a likelihood ratio between the two hypotheses as

GLR(i, j) = L(X, M) / (L(Xi, Mi) · L(Xj, Mj))    (2.5)
and determining the distance as D(i, j) = −log(GLR(i, j)), with which, using an appropriate threshold, one can decide whether or not both segments belong to the same speaker. The GLR differs from a similar metric called the standard likelihood ratio test (LLR) in that the p.d.f.'s for the GLR are unknown and must be estimated directly from the data within each considered segment, whereas in the LLR the models are considered to be known a priori. In speaker segmentation the GLR is usually used with two adjacent segments of the same size which are scrolled through the signal, and the threshold is either pre-fixed or dynamically adapted.
In Bonastre, Delacourt, Fredouille, Merlin and Wellekens (2000) the GLR is used to segment the signal into speaker turns in a single processing step for speaker tracking. The
threshold is set so that miss errors are minimized (at the cost of higher false alarms),
as each segment is then independently considered as a potential speaker in the tracking
algorithm.
On the same two-speaker detection task, in Adami, Kajarekar and Hermansky (2002) the
first second of speech is considered to be from the first speaker, and the second speaker is found by determining the change-points via the GLR. A second step assigns segments of speech
to either speaker by comparing the GLR score of each of the two speakers computed across
the recording and selecting the regions where either one is higher.
On the task of change detection for transcription and indexing in Liu and Kubala (1999)
a penalized GLR is used as a second step, to accept/reject change-points previously found
using a pre-trained phone-based decoder (where the ASR phone-set has been reduced into
phone clusters). The penalty applied to the GLR is proportional to the amount of training
data available in the two segments as
GLR′(i, j) = GLR(i, j) / (N1 + N2)^θ    (2.6)

where θ is determined empirically. In the same vein, Metze, Fugen, Pan, Schultz and Yu (2004) use the GLR for a segmentation step in a transcription system for meetings.
Probably the most representative algorithm of the use of GLR for speaker segmentation
is DISTBIC (Delacourt and Wellekens (1999), Delacourt et al. (1999a), Delacourt et al.
(1999b), Delacourt and Wellekens (2000)) where GLR is proposed as the first step of a
two-step segmentation process (using BIC as the second metric). Instead of using the
GLR distance by itself, a low-pass filter is applied to it in order to reduce ripples in the computed distance function (which would generate false maxima/minima points), and then the difference between each local maximum and the adjacent minima is used to assert the change-points.
In Kemp et al. (2000) the Gish distance is compared to other techniques for speaker
segmentation.
• Kullback-Leibler distance (KL or KL2): The KL and KL2 distances (Siegler, Jain, Raj
and Stern (1997), Hung, Wang and Lee (2000)) are widely used due to their fast computation
and acceptable results. Given two random distributions X, Y , the K-L distance (also called
divergence) is defined as
KL(X; Y) = EX (log(PX / PY))    (2.8)
where E_X is the expected value with respect to the PDF of X. When the two distributions
are taken to be Gaussian, one can obtain a closed-form solution to such expression (Campbell
1997) as

KL(X, Y ) = (1/2) tr[(C_X − C_Y)(C_Y^{-1} − C_X^{-1})] + (1/2) tr[(C_Y^{-1} − C_X^{-1})(µ_X − µ_Y)(µ_X − µ_Y)^T]    (2.9)
For GMM models there is no closed-form solution and the KL distance needs to be computed
using sample theory, or one needs to use approximations as shown below. The KL2 distance
can be obtained by symmetrizing the KL in the following way:

KL2(X; Y ) = KL(X; Y ) + KL(Y ; X)    (2.10)

As previously, if both distributions X and Y are considered to be Gaussian, one can obtain
a closed-form solution for the KL2 distance as a function of their covariance matrices and
means. Any two acoustic segments X1 and X2 can be taken as X and Y, and the distance
between them can thus be obtained with these measures.
In Delacourt and Wellekens (2000) the KL2 distance is considered as the first of two steps for
speaker change detection. In Zochova and Radova (2005) KL2 is used again in an improved
version of the previous algorithm.
In Hung et al. (2000) the MFCC acoustic vectors are initially processed via a PCA dimen-
sionality reduction for each of the contiguous scrolling segments (either two independent
PCA or one applied to both segments) and then Mahalanobis, KL and Bhattacharyya
distances are used to determine if there is a change point.
• Divergence Shape Distance (DSD): In a very similar fashion to how the Gish distance
is defined in Gish et al. (1991), the DSD is derived from the KL distance of two classes
with n-variate normal pdfs by eliminating the part affected by the means, as they are easily
biased by environment conditions. It therefore corresponds to the expression

D(i, j) = (1/2) tr[(C_i − C_j)(C_j^{-1} − C_i^{-1})]    (2.11)
In Kim et al. (2005) it is used in a single-step algorithm and its results are compared to
BIC.
The DSD is also used in Lu and Zhang (2002a) as a first step of a two step segmentation
system, using BIC on the refinement step. In Lu and Zhang (2002b) some speed-ups are
proposed to make the previous system run in real time.
The same authors present in Wu, Lu, Chen and Zhang (2003b), Wu, Lu, Chen and Zhang
(2003a) and Wu, Lu, Chen and Zhang (2003c) an improvement to the algorithm using DSD
and a Universal Background Model (UBM) trained from only the data in the processed
show. Evaluation of the likelihood of the data according to the UBM is used to categorize
the features in each analysis segment and only the good quality speech frames from each
one are compared to each other. They use an adaptive threshold (adapted from previous
values) to determine change points.
Such work is inspired by Beigi and Maes (1998) where each segment is clustered in three
classes via a k-means and a global distance is computed by combining the distances between
classes. That work does not specify which particular distance is used between the classes.
• Cross-BIC (XBIC): This distance was introduced by the author in Anguera and Hernando
(2004b) and Anguera (2005). It derives a distance between two adjacent segments by
cross-likelihood evaluation, inspired by the BIC distance and comparable to a distance
between HMMs presented in Juang and Rabiner (1985).
• Other distances: There are many other metrics that are able to define a distance between
two sets of acoustic features or two models. Some of them have been applied to the speaker
segmentation task.
In Omar, Chaudhari and Ramaswamy (2005) the CuSum distance (Basseville and
Nikiforov 1993), the Kolmogorov-Smirnov test (Deshayes and Picard 1986) and BIC are
first used independently to find putative change points and then fused at likelihood level
to assert those changes.
In (Hung et al. 2000) the Mahalanobis and Bhattacharyya distances (Campbell 1997) are
used in comparison to the KL distance for change detection.
In Kemp et al. (2000) the entropy loss (Lee 1998) of coding the data in two segments
instead of only one is proposed in comparison to the Gish and KL distances.
Mori and Nakagawa (2001) apply VQ (vector quantization) techniques to create
a codebook from one of two adjacent segments and then apply a VQ distortion measure
(Nakagawa and Suzuki 1993) to compare its similarity with the other segment. Results
are compared to GLR and BIC techniques.
In Zhou and Hansen (2000) and Tranter and Reynolds (2004) Hotelling's T² distance is
proposed, which is a multivariate analog of the t-distribution. It is applied as the first step of
a two-step segmentation algorithm. It computes the distance between two segments, modeling
each one with a single Gaussian where both covariance matrices are constrained to be the same.
• In Lu, Zhang and Jiang (2002), Lu and Zhang (2002b) and Lu and Zhang (2002a) an
adaptive threshold is made dependent on the P previous distance values as

Th_i = α (1/P) Σ_{p=0}^{P} D(i − p − 1, i − p)    (2.13)
• In Rougui, Rziza, Aboutajdine, Gelgon and Martinez (2006) a dynamic threshold is de-
fined for comparing speaker clusters (rather than speaker segments), where a population of
clusters is used to decide on the threshold value. It is defined as

Th = max(hist(d(M_i, M_j), ∀ i ≠ j))    (2.14)

where hist denotes the histogram and d() is the distance between two models, which in
that work is defined as a modified KL distance to compare two GMM models.
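As a concrete illustration, the following is a minimal sketch (not part of any of the reviewed systems; all function and variable names are illustrative) that computes the GLR-based distance D(i, j) = −log GLR(i, j) and the divergence shape distance of eq. 2.11 between two feature segments, modeling each segment with a single full-covariance Gaussian:

import numpy as np

def gaussian_loglik(X, mu, cov):
    """Total log-likelihood of the rows of X under a full-covariance Gaussian."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('nd,de,ne->n', diff, inv, diff)
    return np.sum(-0.5 * (quad + logdet + d * np.log(2 * np.pi)))

def glr_distance(Xi, Xj):
    """D(i, j) = -log GLR(i, j): joint single-Gaussian model vs. two separate models."""
    X = np.vstack([Xi, Xj])
    ll_joint = gaussian_loglik(X, X.mean(0), np.cov(X.T))
    ll_sep = (gaussian_loglik(Xi, Xi.mean(0), np.cov(Xi.T)) +
              gaussian_loglik(Xj, Xj.mean(0), np.cov(Xj.T)))
    return ll_sep - ll_joint          # larger when the two segments differ

def dsd_distance(Xi, Xj):
    """Divergence shape distance (eq. 2.11), using only the covariance matrices."""
    Ci, Cj = np.cov(Xi.T), np.cov(Xj.T)
    return 0.5 * np.trace((Ci - Cj) @ (np.linalg.inv(Cj) - np.linalg.inv(Ci)))

# toy usage: two 1-second windows of 19-dimensional features at 100 frames/s
rng = np.random.default_rng(0)
Xi = rng.normal(0.0, 1.0, size=(100, 19))
Xj = rng.normal(0.5, 1.2, size=(100, 19))
print(glr_distance(Xi, Xj), dsd_distance(Xi, Xj))

In a sliding-window segmentation these distances would be evaluated for every pair of adjacent windows and compared against a fixed or adaptive threshold such as the one in eq. 2.13.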
In this section the other three classes of speaker segmentation are reviewed, namely
silence/decoder-based, model-based and other segmentation techniques.
These techniques detect speaker changes by hypothesizing that most changes between speakers
occur through a silence segment. They have traditionally been implemented with speech
recognition in mind, as it is very important to obtain clean speaker changes without
cutting any words in half. Systems falling into this category are energy-based and decoder-based
systems.
The energy-based systems use an energy detector to find the points where a speaker change is
most likely to occur. The detector normally obtains a curve with minimum/maximum points at
potential silences, and a threshold is usually used to determine them (Kemp et al. (2000),
Wactlar, Hauptmann and Witbrock (1996), Nishida and Kawahara (2003)). In Siu, Yu and Gish
(1992) the MAD (mean absolute deviation) statistic, which measures the variability in energy
within segments, is used instead in order to find the silence points.
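As an illustration of the energy-based idea, the following minimal sketch (the frame length and the relative threshold are illustrative choices, not taken from any of the cited systems) locates low-energy regions and takes their centers as candidate change points:

import numpy as np

def energy_change_candidates(signal, rate, frame_len=0.02, rel_threshold=0.1):
    """Return candidate speaker-change times (in seconds) located at the centers
    of low-energy (silence) regions of the signal."""
    hop = int(frame_len * rate)
    n_frames = len(signal) // hop
    frames = signal[:n_frames * hop].reshape(n_frames, hop)
    energy = (frames ** 2).mean(axis=1)
    threshold = rel_threshold * np.median(energy)      # simple relative threshold
    silent = energy < threshold
    candidates, start = [], None
    for i, s in enumerate(np.append(silent, False)):   # closes any run at the end
        if s and start is None:
            start = i
        elif not s and start is not None:
            candidates.append((start + i) / 2 * frame_len)
            start = None
    return candidates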
In contrast, decoder-guided segmenters run a full recognition system and obtain the change
points from the detected silence locations (Kubala, Jin, Matsoukas, Nguyen, Schwartz and Makhoul
(1997), Woodland, Gales, Pye and Young (1997), Lopez and Ellis (2000b), Liu and Kubala
(1999), Wegmann, Scattone, Carp, Gillick, Roth and Yamron (1998)). They normally constrain
the minimum duration of the silence segments to reduce false alarms. Some of these systems
use extra information from the decoder, such as gender labels (Tranter and Reynolds 2004) or
wide/narrow band plus music detectors (Hain, Johnson, Turek, Woodland and Young 1998).
The output has normally been used as an input to recognition systems, but not for indexing or
diarization, as there is no clear relationship between the existence of a silence in a recording
and a change of speaker. Such systems sometimes take these points as hypothetical speaker
change points and then use other techniques to define which of them actually mark a change of
speaker and which do not.
Model-Based Segmentation
Initial models (for example GMM’s) are created for a closed set of acoustic classes (telephone-
wideband, male-female, music-speech-silence and combinations of them) by using training data.
The audio stream is then classified by ML (Maximum Likelihood) selection using these mod-
els (Gauvain, Lamel and Adda (1998), Kemp et al. (2000), Bakis, Chen, Gopalakrishnan and
Gopinath (1997), Sankar, Weng, Stolcke and Grande (1998), Kubala et al. (1997)). The
boundaries between models become the segmentation change points. One could also consider the
decoder-guided systems to be model-based, as they model each phoneme and silence, but here
they try to distinguish among broader models, instead of models derived from speech recognition
and trained for individual phones.
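A minimal sketch of this maximum-likelihood classification (assuming per-class GMMs pre-trained on labelled data with scikit-learn; the class names are illustrative) could look as follows:

import numpy as np
from sklearn.mixture import GaussianMixture

def ml_segmentation(features, class_gmms):
    """Label every frame with its most likely acoustic class and return the
    frame indices where the label changes (the hypothesized boundaries)."""
    names = list(class_gmms.keys())
    # one log-likelihood column per pre-trained class model
    scores = np.column_stack([class_gmms[n].score_samples(features) for n in names])
    labels = np.array(names)[scores.argmax(axis=1)]
    boundaries = np.where(labels[1:] != labels[:-1])[0] + 1
    return labels, boundaries

# illustrative usage, with the class models trained beforehand on labelled data:
# class_gmms = {"speech": GaussianMixture(16).fit(speech_feats),
#               "music": GaussianMixture(16).fit(music_feats),
#               "silence": GaussianMixture(4).fit(silence_feats)}
# labels, change_points = ml_segmentation(test_feats, class_gmms)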
This segmentation method very closely resembles the speaker clustering techniques where
the identity of the different speakers (in this case acoustic classes) is known a priori and an ML
segmentation is found. Both areas have a robustness problem, given that they require initial data
to train the models. As will be shown in the speaker clustering section, in later years research
has been done on the topic of blind speaker clustering, where no initial information on the
clusters is known. Some of this research applies these techniques to speaker segmentation; in
particular, some clustering systems make use of ML decoding of evolutive models that look for
the optimum acoustic change points and speaker models at the same time.
In Ajmera, Bourlard and Lapidot (2002) and Ajmera and Wooters (2003) the iterative decod-
ing is done bottom-up (starting with a high number of speaker changes as the product of a first
processing step and then eliminating them until obtaining the optimum amount) and in Meignier,
Bonastre and Igournet (2001) and Anguera and Hernando (2004a) it is done top-down (starting
with one segment and adding extra segments until the desired amount is reached).
In Meignier, Moraru, Fredouille, Besacier and Bonastre (2004) the use of evolutive systems
is analyzed where pretrained models are also used to model background conditions, showing
that, in general, the more prior information that can be given to the system, the better performance
it achieves.
All of these systems use Gaussian Mixture Models (GMM) to model the different classes and
an ML/Viterbi decoding approach to obtain the optimum change points. In Lu, Li and Zhang
(2001) SVMs (Support Vector Machines) are used as a classifier instead of GMM models and
the ML decoding, training them using pre-labelled data.
There are some speaker segmentation techniques proposed in the literature that are not a clear
fit to any of the previous categories. These are therefore mentioned here.
In Vescovi et al. (2003) and Zdansky and Nouza (2005) dynamic programming is proposed to
find the speaker change points. In Zdansky and Nouza (2005) BIC is used as marginal likelihood,
solving the system via ML where all possible number of change points is considered. In Vescovi
et al. (2003) they also use BIC and explore possible computation reduction techniques.
In Pwint and Sattar (2005) a genetic algorithm is proposed where the number of segments
is estimated via the Walsh basis functions and the location of change points is found using a
In Lathoud, McCowan and Odobez (2004) segmentation is based on the location estimation
of the speakers by using a multiple-microphone setting. The difference between two locations is
used as a feature and tracking techniques are employed to estimate the change points of possibly
moving speakers. Further work on using location cues for clustering will be presented in the next
section.
On some occasions the term speaker diarization is confused with speaker clustering.
Speaker clustering refers to the techniques and algorithms that group together all segments
that belong to the same speaker. This does not specify whether such segments come from
the same acoustic file or from different ones, nor does it say anything about how acoustically
homogeneous segments within a single file are obtained. The term speaker diarization refers
to systems that perform a speaker segmentation of the input signal and then a speaker
clustering of the created segments into homogeneous groups (or some hybrid mechanism doing
both at the same time), all within the same file or input stream.
In the literature one can normally find two main applications for speaker diarization. On
one hand, Automatic Speech Recognition (ASR) systems make use of the speaker homogeneous
clusters to adapt the acoustic models to be speaker dependent and therefore increase recognition
performance. On the other hand, speaker indexing and rich transcription systems use the speaker
diarization output as one of (possibly) many pieces of information extracted from a recording,
which allows its automatic indexing and further processing.
This section reviews the main systems present in the literature for both applications. It
mainly focuses on systems that propose solutions to a blind speaker diarization problem, where
no information is known a priori about the number of people or their identities. On one hand,
it is crucial for systems oriented towards rich transcription of the data to accurately estimate
the number of speakers present, as error measures penalize any incorrectly assigned speaker
segment. On the other hand, in ASR systems it becomes more important to have sufficient data
to accurately adapt the resulting speaker models, therefore several speakers with similar acoustic
characteristics are preferably grouped together.
From a high-level point of view one can differentiate between online and offline systems. The
systems that process the data offline have access to all the recording before they start processing
it. These are the most common in the bibliography and they are the main focus of attention of
this review. The online systems only have access to the data that has been recorded up to that
point. They might allow a latency in output to allow for a certain amount of data to become
available for processing, but in any case no information on the complete recording is available.
Such systems usually start with one single speaker (whoever starts talking at the beginning of
the recording) and iteratively increase the number of speakers as new ones intervene. The following
are some representative systems used for online processing:
In Mori and Nakagawa (2001) a clustering algorithm based on the Vector Quantization
(VQ) distortion measure (Nakagawa and Suzuki 1993) is proposed. It starts processing with one
speaker in the code-book and incrementally adds new speakers whenever the VQ distortion with
respect to the current code-book exceeds a threshold.
In Rougui et al. (2006) a GMM based system is proposed, using a modified KL distance
between models. Change points are detected as the speech becomes available, and the data is either
assigned to a speaker already present in the database or a new speaker is created, according to a dynamic
threshold. Emphasis is put into fast classification of the speech segments into speakers by using
a decision tree structure for speaker models.
All systems presented below are based on offline processing, although some of the techniques
presented could potentially also be used in an online implementation. These systems can be
classified into two main groups. On one hand, hierarchical clustering techniques reach the
optimum diarization by iteratively processing different numbers of possible clusters obtained
by merging or splitting existing clusters. On the other hand, other clustering techniques first
estimate the number of clusters and obtain a diarization output without deriving the clusters
from bigger/smaller ones.
Most of the reviewed offline clustering algorithms use hierarchical schemes, where speech
segments or clusters are iteratively split or merged until the optimum number of speakers is
reached. In figure 2.1 a schematic view of the two most commonly used techniques in speaker
clustering is shown. Bottom-up clustering systems are those which start with a large number of
segments/clusters and, via merging techniques, converge to the optimum number of clusters. On
the other hand, top-down systems usually start with one or very few clusters and work their way
up (in the number of clusters, down in the figure) via splitting procedures to obtain the optimum
amount. In the design of either system, two items need to be defined: a distance metric between
clusters or segments, used to decide which clusters to merge or split, and a stopping criterion,
used to determine when the optimum number of clusters has been reached.
Classified by the type of clustering, the following are the most representative techniques
described in the literature:
This is by far the most used approach for speaker clustering, as it welcomes the use of the
speaker segmentation techniques to define a clustering starting point. It is also referred to as
agglomerative clustering and has been used for many years in pattern classification (see for example
Duda and Hart (1973)). Normally a distance matrix between all current clusters (each against
each) is computed and the closest pair is merged, iterating until the stopping criterion
is met (a generic sketch of this loop is given below).
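The generic bottom-up loop just described can be sketched as follows (a schematic outline only; the distance function and the stopping criterion stand for any of the metrics and criteria reviewed in this section):

import numpy as np

def agglomerative_clustering(clusters, distance, should_stop):
    """Generic bottom-up clustering: iteratively merge the closest pair.
    `clusters` is a list of feature matrices, `distance(a, b)` any of the
    reviewed metrics, `should_stop(clusters)` any stopping criterion."""
    while len(clusters) > 1 and not should_stop(clusters):
        n = len(clusters)
        best, pair = np.inf, None
        # pairwise distance matrix between all current clusters
        for i in range(n):
            for j in range(i + 1, n):
                d = distance(clusters[i], clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        merged = np.vstack([clusters[i], clusters[j]])   # merge the closest pair
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters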
One of the earliest pieces of research on speaker clustering for speech recognition was proposed in
Jin, Kubala and Schwartz (1997), using the Gish distance (Gish et al. 1991) to build the distance
matrix, with a weight to favor the merging of neighboring segments. As stopping criterion, the
minimization of a penalized version (to avoid over-merging) of the within-cluster dispersion is
proposed as

W_Jin = Σ_{k=1}^{K} N_k √|Σ_k|    (2.15)
where K is the number of clusters considered, Σk is the covariance matrix of cluster k, with Nk
acoustic segments and | · | indicating the determinant.
Around the same time, in Siegler et al. (1997) the KL2 divergence distance was used as a
distance metric and a stopping criterion was determined with a merging threshold. It shows that
the KL2 distance works better than the Mahalanobis distance for speaker clustering. Also in Zhou
and Hansen (2000) the KL2 metric is used as a cluster distance metric. In this work they first
split the speech segments into male/female and perform clustering on each one independently;
this reduces computation (the number of cluster-pair combinations is smaller) and gives them
better results.
In general, the use of statistics-based distance metrics (not requiring any models to be
trained) is limited in speaker clustering, as they implicitly define distances between a single mean
and covariance matrix from each set, which often falls short when modeling the amount of data
available from one speaker. Some researchers have adapted these distances and obtained
multi-Gaussian equivalents.
In Rougui et al. (2006) they propose a distance between two GMM models based on the KL
distance. Given two models M1 and M2 , with K1 and K2 Gaussian mixtures each, and Gaussian
weights W1 (i), i = 1 . . . K1 and W2 (j), j = 1 . . . K2 , the distance from M1 to M2 is
d(M_1, M_2) = Σ_{i=1}^{K_1} W_1(i) min_{j=1...K_2} KL(N_1(i), N_2(j))    (2.16)
In Beigi, Maes and Sorensen (1998) a distance between two GMM models is proposed by
using the distances between the individual Gaussian mixtures. A distance matrix of d(i, j), ∀i, j
between all possible Gaussian pairs in the two models is processed (distances proposed are the
Euclidean, Mahalanobis and KL) and then the weighted minima of each row and column are
used to compute the final distance.
In Ben, Betser, Bimbot and Gravier (2004) and Moraru, Ben and Gravier (2005) cluster
models are obtained via MAP adaptation from a GMM trained on the whole show. A novel
distance between GMM models is derived from the KL2 distance for the particular case where
only the means are adapted (and therefore weights and variances are identical in both models).
Such distance is defined as

D(M_1, M_2) = sqrt( Σ_{m=1}^{M} W_m Σ_{d=1}^{D} (µ_1(m, d) − µ_2(m, d))² / σ²_{m,d} )    (2.17)

where µ_1(m, d) and µ_2(m, d) are the dth components of the mean vector for Gaussian m,
σ²_{m,d} is the dth variance component for Gaussian m, and M, D are the number of mixtures and
the dimension of the GMM models, respectively.
In Ben et al. (2004) a threshold is applied to such distance to serve as stopping criterion,
while in Moraru et al. (2005) the BIC for the global system is used instead.
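Since only the means differ between the two MAP-adapted models, eq. 2.17 reduces to a weighted Euclidean distance between the stacked mean vectors; a minimal sketch (array shapes and names are illustrative):

import numpy as np

def map_gmm_distance(means1, means2, weights, variances):
    """Eq. 2.17: distance between two mean-only MAP-adapted GMMs that share
    weights and variances. means1, means2, variances: (M, D) arrays; weights: (M,)."""
    per_gaussian = ((means1 - means2) ** 2 / variances).sum(axis=1)   # sum over D
    return np.sqrt((weights * per_gaussian).sum())                    # sum over M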
Leaving behind the statistics-based methods, in Gauvain et al. (1998) and Barras, Zhu,
Meignier and Gauvain (2004) a GLR metric with two penalty terms is proposed, penalizing
large numbers of segments and clusters in the model, with tuning parameters. Iterative Viterbi
decoding and merging iterations find the optimum clustering, which is stopped using the same
metric.
Solomonov, Mielke, Schmidt and Gish (1998) also use GLR, comparing it to KL2 as the
distance metric, and iteratively merge clusters until the estimated cluster purity is maximized,
defined as the average over all segments and all clusters of the ratio of segments belonging to
cluster i among the n closest segments to segment k (which belongs to i). The same stopping
criterion is used in Tsai et al. (2004), where several methods are presented to create a different
reference space for the acoustic vectors that better represents similarities between speakers. The
reference space defines a speaker space to which feature vectors are projected, and the cosine
measure is used as a distance matrix. It is claimed that such projections are more representative
of the speakers.
Other research is done using GLR as distance metric, including Siu et al. (1992) for pilot-
controller clustering and Jin, Laskowski, Schultz and Waibel (2004) for meetings diarization
(using BIC as stopping criterion).
The most commonly used distance and stopping criterion is again BIC, which was initially
proposed for clustering in Shaobing Chen and Gopalakrishnan (1998) and Chen and Gopalakr-
ishnan (1998). The pair-wise distance matrix is computed at each iteration and the pair with
the biggest ∆BIC value is merged. The process finishes when all pairs have ∆BIC < 0. Later
research (Chen et al. (2002), Tritschler and Gopinath (1999), Tranter and Reynolds (2004),
Cettolo and Vescovi (2003) for the Italian language and Meinedo and Neto (2003) for the Portuguese
language) proposes modifications to the penalty term and differences in the segmentation setup.
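A minimal sketch of the ∆BIC merge decision, modeling each cluster with a single full-covariance Gaussian (a common simplification; the λ value and the exact model vary across the cited systems), with the convention used here that a positive value means the pair should be merged:

import numpy as np

def delta_bic(Xi, Xj, lam=1.0):
    """Delta-BIC between two clusters modeled with single full-covariance Gaussians.
    Positive values mean that merging is preferred: merge the pair with the
    largest value and stop when all pairs are negative."""
    def half_n_logdet(X):
        _, logdet = np.linalg.slogdet(np.cov(X.T))
        return 0.5 * len(X) * logdet
    X = np.vstack([Xi, Xj])
    d = X.shape[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(X))
    return half_n_logdet(Xi) + half_n_logdet(Xj) - half_n_logdet(X) + penalty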
In Sankar, Beaufays and Digalakis (1995) and Heck and Sankar (1997) the symmetric rela-
tive entropy distance (Juang and Rabiner 1985) is used for speaker clustering towards speaker
adaptation in ASR. This distance is similar to Anguera (2005) and equivalent to Malegaonkar
et al. (2006), both used for speaker segmentation. It is defined as
D(M_1, M_2) = (1/2) [D_{λ1,λ2} + D_{λ2,λ1}]    (2.18)
An empirically set threshold on the distance is used as a stopping criterion. Later on, the same
authors propose in Sankar et al. (1998) a clustering based on a single GMM model trained on the
whole show, with the weights adapted to each cluster. The distance used is then a count-weighted
entropy change due to merging two clusters (Digalakis, Monaco and Murveit 1996).
Barras et al. (2004), Zhu, Barras, Meignier and Gauvain (2005), Zhu et al. (2006) and later
Sinha et al. (2005) propose a diarization system making use of speaker identification techniques
in the area of speaker modeling. A clustering system initially proposed in Gauvain et al. (1998)
is used to determine an initial segmentation in Barras et al. (2004), Zhu et al. (2005) and Zhu
et al. (2006), while a standard speaker change detection algorithm is used in Sinha et al. (2005).
The systems then use standard agglomerative clustering via BIC, with a λ penalty value set to
obtain more clusters than optimum (under-cluster the data). On the speaker diarization part, it
first classifies each cluster for gender and bandwidth (in broadcast news) and uses a Universal
Background Model (UBM) and MAP adaptation to derive speaker models from each cluster. In
most cases a local feature warping normalization (Pelecanos and Sridharan 2001) is applied to
the features to reduce non-stationary effects of the acoustic environment. The speaker models
are then compared using a metric between clusters called the cross likelihood distance (Reynolds,
Singer, Carlson, O'Leary, McLaughlin and Zissman 1998), in which the data of each cluster is
scored against the model of the other cluster, where M_{i-UBM} indicates that the model has been
MAP adapted from the UBM model. An empirically set threshold stops the iterative merging process.
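Although the exact formulation varies among the cited papers, the cross likelihood ratio is commonly computed by scoring each cluster's data against the other cluster's MAP-adapted model, normalized by the UBM. A hedged sketch, assuming scikit-learn-style GMM objects whose score() method returns the average per-frame log-likelihood:

def cross_likelihood_ratio(Xi, Xj, model_i, model_j, ubm):
    """One common form of the cross likelihood ratio: each cluster's data is
    scored on the other cluster's MAP-adapted model and normalized by the UBM.
    model.score(X) is assumed to return the average per-frame log-likelihood."""
    return ((model_j.score(Xi) - ubm.score(Xi)) +
            (model_i.score(Xj) - ubm.score(Xj)))

Cluster pairs whose ratio lies above the empirically set threshold are merged.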
The same cross-likelihood metric is used in Nishida and Kawahara (2003) to compare two
clusters. In this paper emphasis is given to the selection of the appropriate model when training
data is very small. It proposes a vector quantization (VQ) based method to model small segments,
by defining a model called common variance GMM (CVGMM) where Gaussian weights are set
uniform and variance is tied among Gaussians and set to the variance of all models. For each
cluster BIC is used to select either GMM or CVGMM as the model to be used.
Some other people integrate the segmentation with the clustering by using a model-based
segmentation/clustering scheme. This is the case in Ajmera et al. (2002), Ajmera and Wooters
(2003) and Wooters, Fung, Peskin and Anguera (2004) where an initial segmentation is used to
train speaker models that iteratively decode and retrain on the acoustic data. A threshold-free
BIC metric (Ajmera et al. 2003) is used to merge the closest clusters at each iteration and as
stopping criterion.
In Wilcox, Chen, Kimber and Balasubramanian (1994) a penalized GLR is proposed within
a traditional agglomerative clustering approach. The penalty factor favors merging clusters
which are close in time. To model the clusters, a general GMM is built from all the data in
the recording and only the weights are adapted to each cluster as in Sankar et al. (1998). A
refinement stage composed of iterative Viterbi decoding and EM training follows the clustering,
to redefine segment boundaries, until likelihood converges.
In Moh et al. (2003) a novel approach to speaker clustering is proposed, using speaker tri-
angulation to cluster the speakers. Consider a set of clusters C_k, k = 1 . . . K and the group of
non-overlapped acoustic segments X_s, s = 1 . . . S which populate the different subsets/clusters.
The first step generates the coordinate vector of each cluster with respect to each segment (mod-
eled with a full covariance Gaussian model) by computing the likelihood of each cluster given each
segment. The similarity between two clusters is then defined as the cross correlation between
such vectors as

C(k, j) = Σ_s p(C_k|X_s) p(C_j|X_s)    (2.21)
merging those clusters with higher similarity. This can also be considered as a projection of the
acoustic data into a speaker space prior to the distance computation.
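A minimal sketch of this similarity computation (eq. 2.21), assuming the matrix of segment-given-cluster posteriors p(C_k|X_s) has already been obtained:

import numpy as np

def cluster_similarity(posteriors):
    """posteriors[s, k] = p(C_k | X_s) for segment s and cluster k.
    Returns the K x K similarity matrix C(k, j) of eq. 2.21; the off-diagonal
    pair with the highest value is the best merging candidate."""
    return posteriors.T @ posteriors   # sums p(Ck|Xs) * p(Cj|Xs) over segments s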
In the current literature there are fewer systems that start from one cluster and iteratively split
it until the stopping criterion is met than systems that proceed in the opposite, bottom-up, direction.
In Johnson and Woodland (1998) a top-down clustering method is proposed for speaker
clustering towards ASR, and in Johnson (1999) and Tranter and Reynolds (2004) it is applied
to speaker diarization. The algorithm splits the data iteratively into four sub-clusters and allows
for merging clusters that are very similar to each other. Johnson and Woodland (1998) propose
two different implementations of the algorithm. On one hand, an MLLR likelihood optimization
technique is proposed to obtain resulting clusters well adapted to the ASR MLLR adaptation
step. On the other hand, the Arithmetic Harmonic Sphericity (AHS) metric (Bimbot and Mathan
1993) is proposed to assign speech segments to the created sub-clusters at each stage, together
with a minimum occupancy stopping criterion; the AHS is defined between single Gaussian models.
In Johnson (1999) and Tranter and Reynolds (2004) the AHS-based algorithm is used for
speaker diarization and the stopping criterion is changed to be a cost-based function depending
on several criteria.
In Meignier et al. (2001) and Anguera and Hernando (2004a) an initial cluster is trained with
all the acoustic data available. Iterative decoding/MAP adaptation of new models is performed
where new clusters are split using a likelihood metric averaged over a window. The variation of
the overall likelihood of the data given all models is used as a stopping criterion. In Anguera and
Hernando (2004a) a similar approach is followed and a repository model is further introduced
to improve the purity of the created clusters.
While bottom-up clustering is much more popular than top-down clustering, it is not clear
which one can achieve better results and in which conditions. On the topic of broadcast news
transcription, in Hain et al. (1998) both techniques were compared. On one hand, bottom-
up clustering uses a divergence-like distance measure and a minimum cluster feature count as
stopping criterion. On the other hand, top-down clustering uses the arithmetic harmonic sphericity
distance and also the cluster count as stopping criterion.
Given that top-down and bottom-up techniques could eventually complement each
other, some researchers have proposed combining multiple systems to obtain an
improved speaker diarization.
In Tranter (2005) a cluster voting algorithm is presented to allow diarization output im-
provement by merging two different speaker diarization systems. Tests are performed using two
top-down and two bottom-up systems.
There are some papers in the literature that do not fit into a hierarchical clustering context.
The systems reviewed here all define an algorithm or metric to determine the optimum number
of speakers and a method for finding the optimum speaker clustering given that number.
In Tsai and Wang (2006) a genetic algorithm is proposed to obtain an optimum speaker
clustering that optimizes the overall model likelihood by initial random cluster assignment and
iterative evaluation of the likelihood and mutation. In order to select the optimum amount of
A relatively new learning technique called Variational Bayesian learning (VB) or Ensemble
learning (Attias (2000), MacKay (1997)) is used in Valente and Wellekens (2004), Valente and
Wellekens (2005) and Valente (2006) for speaker clustering. The VB training has the capacity
of model parameter learning and model complexity selection, all in one algorithm. The models
trained with this technique adapt their complexity to the amount of data available for training.
In the proposed systems it computes the optimum clustering for a range of different number of
clusters and uses a distance called free energy to determine the optimum.
In Lapidot (2003) self-organizing maps (SOM) (Lapidot, Gunterman and Cohen (2002),
Kohonen (1990)) are proposed for speaker clustering given a known number of speakers. This is
a VQ algorithm for training the code-books representing each of the speakers. An initial non-
informative set of code-books is created and then SOM is iterated, retraining them until the
number of acoustic frames switching between clusters is close to 0. In order to determine the
optimum number of clusters a likelihood function is defined (derived from the code-words in the
code-books by assigning a Gaussian pdf) and BIC is used.
When used for certain applications it is feasible to obtain an improvement in speaker diarization
by using information other than the acoustics. In this section the use of the transcripts from the
recording and the time delays between channels in a multi-microphone setting are reviewed.
A very interesting area of study to improve speaker diarization in certain conditions is the use of
transcripts of the acoustic signal in order to extract information that can help assign
each speaker turn to a cluster. Such transcripts can be obtained via an automatic speech
recognition system.
One outstanding characteristic of the meetings domain is that multiple microphones are usually
available for processing. The time differences between microphones can be used as a feature to
identify the speakers in a room by their locations, as the speech uttered by each speaker takes
a different time to reach each of the microphones according to its position in the room. Such
a feature has two main drawbacks compared to acoustic features. On one hand, it is prone to errors
when speakers are located symmetrically with respect to the microphones. On the other hand, it
becomes less tractable when speakers move inside the room, which then calls for tracking
algorithms to be used.
For the task of speaker segmentation, in Lathoud, McCowan and Odobez (2004) a speaker
tracking approach is proposed using only between channel differences. In Lathoud, Odobez and
McCowan (2004) the same is extended to speaker clustering and algorithms are proposed for
detection of concurrent events. Ellis and Liu (2004) and Pardo et al. (2006a) also use only delays
for clustering.
According to the literature, the delays between channels cannot outperform the acoustic features,
although in Ajmera, Lathoud and McCowan (2004) it is shown that the combination of delays
and MFCC parameters can improve clustering. Pardo et al. (2006b) reach the same
conclusion and further improve results by using a weighted combination of the delay and
MFCC likelihoods.
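The weighted combination of the two feature streams can be sketched as a simple interpolation of the per-frame log-likelihoods of the two models (the stream weight is an assumed parameter tuned on development data):

def combined_loglik(loglik_mfcc, loglik_delay, mfcc_weight=0.9):
    """Per-frame log-likelihood combining the MFCC and delay (TDOA) streams
    with a relative stream weight, as in weighted-stream clustering systems."""
    return mfcc_weight * loglik_mfcc + (1.0 - mfcc_weight) * loglik_delay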
In recent years there has been increasing emphasis on research on speech and video processing
for the meeting room domain. Within the different projects that are interested in this area, two
different alternative meeting settings have been proposed. On one hand, some consider a lecture
environment where a single speaker gives a talk in front of an audience, which intervenes at
different points of the lecture with questions and remarks. In this situation there is always a
main speaker, facing an audience, and many people listening, facing the speaker. On the other
hand, the conference room environment is a gathering of people where mostly everyone speaks
and discussions are carried out on one or more topics common to all attendees.
There are many research institutions carrying out research on one topic or another related to
meetings. Here, some of the projects that have led the research efforts in recent years are
pointed out.
The project Computers in the Human Interaction Loop (CHIL) aims at the creation of
computers to help in normal human-human interaction, in a non-obtrusive way. CHIL is
an Integrated Project (IP) within the European Union (EU) sixth framework program which
started in 2004 for three years. Within the many lines of research it opened, several intelligent
meeting rooms with audio and video sensors were built where data is collected and research
is performed on the lecture-type meetings (Computers in the Human Interaction Loop (CHIL)
website 2006).
The project Augmented Multi-party Interaction (AMI) is focused on the use of advanced
signal processing, machine learning models and social interaction dynamics to improve human-
to-human communications, particularly during business meetings between local and remote (vir-
tual) participants (Augmented Multiparty Interaction (AMI) website 2006). The AMI project is
also an IP project within the European sixth framework program, focusing on the conference-
type meetings. AMI has been granted a continuation project (called AMIDA) within the Euro-
pean Union seventh framework program.
There are other projects with emphasis on multimodal interaction and human-to-human
communications. Some of them are the “Similar” network of excellence (Similar Network of
Excellence website 2006), the Pascal network (Pattern analysis, Statistical modeling and Com-
putational learning (Pascal) website 2006) and Humaine emotions research (Humaine emotion
research website 2006) in Europe, and Video analysis and content extraction for defense intel-
ligence (VACE) (Video analysis and content extraction for defense intelligence (ARDA-VACE
II) 2006) and Cognitive Assistant that Learns and Organizes (CALO) (Cognitive Assistant that
Learns and Organizes (CALO) website 2006) in the USA.
2.4.2 Databases
In order for research to be performed in speech technologies, there is a constant need for data
collection and annotation. In this respect there have been several efforts over the years to collect
data in the meeting environment. In the particular area of speaker diarization for meetings,
meeting databases accurately transcribed into speaker segments are needed.
Nowadays a few databases are already available and a few more are currently being recorded
and transcribed. Some of them are:
• ICSI Meetings Corpus (ICSI Meetings Recorder corpus (2006), Janin, Baron, Edwards,
Ellis, Gelbart, Morgan, Peskin, Pfau, Shriberg, Stolcke and Wooters (2003)): 75 meet-
ings with about 72 hours in total. They were recorded in a single meeting room, with 4
omnidirectional tabletop and 2 electret microphones mounted on a mock PDA.
• CMU Meeting Corpus (CMU Meetings Corpus website (2006), Burger, Maclaren and Yu
(2002)): 104 meetings of an average duration of 60 minutes with 6.4 participants (on aver-
age) per meeting (only 18 meetings are publicly available through LDC). They are focused
on a given scenario or topic, changing from meeting to meeting. Initial meetings have 1
omnidirectional microphone, newer ones have 3 omnidirectional tabletop microphones.
• NIST Pilot Meeting Corpus (NIST Pilot Meeting Corpus website 2006): Consists of 19
meetings with a total of about 15 hours. Several meeting types are proposed to the atten-
dees. Audio recordings are done using 3 omnidirectional table-top microphones and one
circular directional microphone with 4 elements.
• CHIL Corpus: Recordings were conducted in 4 different meeting room locations, consisting
of lecture-type meetings. Each meeting room is equipped with several distant microphones,
as well as speaker localization microphones and microphone arrays. Each meeting also
contains several video cameras.
• AMI corpus (Augmented Multiparty Interaction (AMI) website 2006): About 100 hours of
meetings with generally 4 participants were recorded, transcribed and released through
their website. These are split into two main groups: real meetings and scenario-based
meetings (where people are briefed to talk about a particular topic). One or more circular
arrays of 8 microphones each are centrally located on the table. No video was collected.
• VACE multimodal corpus (Chen, Rose, Parrill, Han, Tu, Huang, Harper, Quek, McNeill,
Tuttle and Huang 2005): a video and acoustic meeting database created within the
ARDA VACE-II project, recording mainly military-related meetings.
• LDC meetings data: The Linguistic Data Consortium (LDC) has been in charge of tran-
scribing and distributing most of the databases in this list. Also, in an effort to contribute
to the NIST Meetings evaluation campaigns, it recorded a set of meetings (Strassel and
Glenn 2004) within the SPINE/ROAR project (Speech in noisy environments 2006).
The National Institute for Standards and Technology (NIST) (National Institute for Standards
and Technology 2006) has been organizing multiple evaluations over the years on many aspects
of speech technologies. In the area of speaker diarization evaluations, they started in year 2000
with interest in telephone speech (2000, 2001, 2002), broadcast news (2002, 2003, 2004) and
meetings (2002, 2004, 2005, 2006). In the last two years the focus has been geared exclusively
towards the meetings environment.
The datasets used in the meetings evaluations were hand-transcribed by LDC. This acoustic
data constitutes the basis for the development and evaluation of the algorithms proposed in
this thesis. Initially, in 2002, the speaker segmentation task was enclosed within the speaker
recognition evaluation (SRE-02) and used data from the NIST meeting room research project.
This changed for 2004-2006, when speaker diarization became part of the Rich Transcription
(RT) evaluation (RT04s, RT05s and RT06s), grouping it with the speech-to-text evaluation
(STT) on meetings data. The datasets used for these evaluations contain data from CMU, ICSI,
LDC, NIST, CHIL and AMI.
In the following sections the main ideas in the systems presented to each of the NIST meetings
evaluations are explained, together with the particular algorithms that were created explicitly
for processing of meetings data.
In 2002 NIST started the series of speaker diarization evaluations for meetings, including them in
the speaker recognition evaluation. On that occasion systems were evaluated on broadcast news
recordings, telephone conversations and meeting recordings. Only one channel of
audio data was provided in all cases, so multiple-channel techniques were not
necessary. The meetings data used was recorded by NIST.
There were four participants in that evaluation, namely CLIPS-IMAG, LIA, ELISA consor-
tium and MITLL. The systems can be grouped into two categories:
• MIT Lincoln Labs (MITLL) presented a system inspired by speaker identification tech-
niques (Dunn, Reynolds and Quatieri 2000). It first performs a speaker segmentation
using a modified GLR metric like in Wilcox et al. (1994) and follows with a GMM-UBM
modeling technique to cluster segments into the different speakers.
Within the NIST 2004 Spring Rich Transcription Evaluation (NIST Spring Rich Transcrip-
tion Evaluation in Meetings website, http://www.nist.gov/speech/tests/rt/rt2005/spring 2006)
speaker diarization was evaluated in meeting recordings in two different conditions: Multiple
Distant Microphones (MDM) and Single Distant Microphone (SDM). The MDM condition uses
multiple microphones located in the center of a meetings table, and the SDM case uses only
one of these microphones, normally the most centrally located one. This was the first time that
this task was performed for the meetings environment in the MDM condition. A full description of the
different tasks evaluated and the results of such evaluation can be found in Garofolo, Laprun
and Fiscus (2004). Following are the approaches (in brief) that the participants proposed for
the MDM and SDM conditions:
• Macquarie University in Cassidy (2004) proposes the same system for SDM as for MDM,
always using the SDM channel. A BIC-based speaker segmentation step is followed by an
agglomerative clustering using Mahalanobis distance between clusters and BIC as stopping
criterion.
• The ELISA consortium in Fredouille et al. (2004) proposes a two-axis merging strategy. A
horizontal merging consists of the collapse and resegmentation of the clustering output of
their two expert systems (based on BIC and EHMM) as proposed in the RT03 and SRE02
evaluations (Moraru, Meignier, Fredouille, Besacier and Bonastre 2004). This is done for
each individual MDM channel or for the SDM channel. The vertical merging is applied
when processing multiple channels and unifies all the individual channels into one single
resulting output by merging all channels at the output level. It uses an iterative process
that searches for the longest speaker interventions that are common to all outputs and
finally assigns to the closest speaker those segments of short duration on which the different
channels do not agree.
• Carnegie Mellon University (CMU) in Jin et al. (2004) presents a clustering scheme based
on GLR distance and BIC stopping criterion. In order to obtain the initial segmentation
of the data it follows a three-step process: first a Speech Activity Detection (SAD) is done
over all the channels, then the resulting segments for all channels are collapsed into a
single segmentation and the best channel (according to an energy/SNR metric) is chosen
for each segment. Finally GLR change detection is applied on segments >5s to detect any
missed change point. The speaker clustering is done using a global GMM trained on all the
meeting excerpt data and adapted to each segment, using GLR to compute the cluster-pair
distances in an agglomerative clustering process with a BIC stopping criterion.
The RT05s evaluation introduced a different kind of meeting to be evaluated: meetings in a
lecture environment, where a speaker is giving a lecture in front of an audience and
there are occasional question and answer periods. In this evaluation systems could be presented
for either or both subtasks (lecture room and conference room data). The sets of microphones
used were extended from the previous evaluations due to the existence of two new kinds in the
lecture room data (entirely recorded by the partners in the CHIL project). These were labelled
as MM3A (Multiple Mark III microphone arrays), which consisted of one or several 64-element
microphone arrays developed by NIST and positioned on one of the walls of the meetings room;
and MSLA (Multiple source localization microphones) which are four sets of four microphones
each, used primarily for speaker localization, but available also for speaker diarization. For a
more thorough description of the tasks and microphone types please refer to Fiscus, Radde,
Garofolo, Le, Ajot and Laprun (2005). The following is a brief description of the approaches
taken in this evaluation:
• The Macquarie University system (Cassidy 2004) participated only in SDM, expanding
its work from the RT04s system. In the RT05s submission it uses the KL distance between
clusters and post-processes the segments using speaker identification techniques
to refine the segment-to-speaker assignment.
• The TNO speaker diarization system (van Leeuwen 2005) presents a system for MDM using
a single channel. It first uses a Speech Activity Detector (SAD) to filter out non-speech
frames. Then it performs segmentation and clustering using agglomerative clustering via BIC.
• The ICSI-SRI speaker diarization system (Anguera, Wooters, Peskin and Aguilo 2005)
uses a filter&sum module to obtain an enhanced signal on the MDM condition, and then
uses an iterative agglomerative clustering with a BIC-like metric. This system and its
improvements for RT06s are described in this thesis.
• The ELISA consortium system (Istrate, Fredouille, Meignier, Besacier and Bonastre 2005)
is different from their system in RT04s in that a preprocessing step is performed on the
MDM channels to obtain a single enhanced channel. It is based on a weighted sum of
the individual channels, weighted by their relative Signal to Noise Ratio (SNR), without
any relative delays estimation. Three different clustering systems are then proposed. The
first system is based on EHMM (Meignier et al. 2001), doing a top-down clustering. The
second and third systems are both bottom-up, one using speaker change detection via GLR
and agglomerative clustering via BIC, and the other using BIC for change detection and
UBM-BIC in the agglomerative clustering part. All systems use a resegmentation stage at
the end in order to refine the speaker segments. For this evaluation each system was run
individually, with no collapse of the different outputs.
The RT06s evaluation continues its parallel testing of conference room data and lecture room
data. This year five laboratories participated in the evaluation, making it a very good evaluation
in terms of new systems and ideas. For a full description refer to Fiscus, Ajot, Michet and
Garofolo (2006). An overview of the systems in RT06s follows:
• The Athens Information Technology (AIT) system (Rentzeperis, Stergiou, Boukis, Pnev-
matikakis and Polymenakos 2006) uses a speaker segmentation step followed by a clustering step.
The classic BIC implementation (Shaobing Chen and Gopalakrishnan 1998) is used for
speaker segmentation in their primary system. A contrastive system uses a silence-based
method, cutting segments at silence points. As a first step of the clustering process it also
uses BIC to merge adjacent segments believed to be from the same speaker. Finally, all
segments are modeled with GMMs and a likelihood-based technique is used to cluster them.
• The LIMSI system (Zhu et al. 2006) adapts their high-performance system presented for
RT04f (Zhu et al. 2005) in order to process lecture room data. It is based on a 2-stage
processing where a BIC agglomerative clustering precedes a speaker identification module
where cross likelihood (Reynolds et al. 1998) is used to finish the clustering. In this system
the speech activity detection module is reworked to adapt it to the lecture acoustics by
using a likelihood ratio between pretrained speech and silence models. The MDM condition
is processed by randomly selecting one of the channels in the set and running the system
on that channel alone.
• The LIA system (improvements of the E-HMM based speaker diarization system for meet-
ings records 2006) presents a single system based on the EHMM top-down hierarchical
clustering that has been presented in previous evaluations. In this submission there are a
few improvements to the system. One improvement deals with the selection of new speakers
added to the system, which is modified to take into account all currently selected speakers
to make it more robust and to allow all speakers to fall into at least one cluster. Also, a
segment purification algorithm is proposed following Anguera, Wooters, Peskin and Aguilo
(2005) in order to purify the existing clusters from segments belonging to other speakers.
Furthermore, some feature normalization techniques were applied at the frontend level. Fi-
nally, an algorithm to detect overlapping speech was proposed, although it did not succeed
in lowering the final diarization error rate.
• The AMI team (Leeuwen and Huijbregts 2006) was formed by TNO and University of
Twente. They presented three systems to the evaluation. The first system is very similar
to what was presented by TNO in RT05s (van Leeuwen 2005). The other two systems use
a hierarchical clustering following the work at ICSI and presented in Anguera, Wooters,
Peskin and Aguilo (2005). One of the two systems improves the runtime by considering a
Viterbi-based cluster merging criterion. Each cluster is taken out of the ergodic HMM
model (one at a time) and a Viterbi decoding gives the likelihood of the rest modeling the
data. The cluster which causes the least loss in likelihood is eliminated and merged with
the rest. The system iterates while the overall likelihood increases.
• The ICSI system (Anguera, Wooters and Pardo 2006b) is based on the system for RT05s
(Anguera, Wooters, Peskin and Aguilo 2005) and includes many new ideas which will
be covered in the rest of this thesis. The main step forward is the total independence
from training data achieved by the creation of a new hybrid speech/non-speech detector
(Anguera, Aguilo, Wooters, Nadeu and Hernando 2006) and the inclusion of delays as an
independent feature stream.
Possibly the most noticeable difference when performing speaker diarization in the meetings
environment versus other domains (like broadcast news or telephone speech) is the availability, at
times, of multiple channels which are laid out inside the meetings room, synchronously recording
what occurs in the meeting.
In order to take advantage of this fact one needs to explore an area of signal processing
that differs from standard speech modeling techniques pointed out in previous sections and
which constitutes a complex topic of research by itself. This is the area of microphone array
beamforming for speech/acoustic enhancement (see for example Veen and Buckley (1988), Krim
and Viberg (1996)). Although the task at hand differs from some of the assumptions made in
beamforming theory, it will be found beneficial to take it as background for the use of
all the available microphones.
Microphone array beamforming techniques usually take advantage of the fact that the same
acoustic signal arrives at each of the microphones (arranged in the chosen array geometry) at a
slightly different time due to the propagation delay of the signal through the air. By combining
the signals of all microphones (in different ways) one can simulate a directional microphone whose
acoustic beam focuses on the speaker or acoustic event which is predominant, at each instant, in
the meetings room. There are multiple acoustic beamforming techniques which require different
degrees of knowledge on the microphone characteristics and the location of the speakers.
One singular trait of meeting rooms is the existence in some settings of multiple microphones
recording the meeting synchronously. This is taken advantage of in this thesis to obtain a better
signal to be further processed by the speaker diarization system. In this section the basic concepts
behind microphone array processing are introduced to serve as background for the techniques
developed in this thesis.
A propagating wave field must satisfy the wave equation

∇²x(t, r) − (1/c²) ∂²x(t, r)/∂t² = 0    (2.23)

where ∇² is the Laplacian operator, x(·) is the wave field (of any sort) as a function of time and
space, r is the 3D position of the wave and c is the speed of sound (about 330 m/s in air).
In microphone array processing this equation can be solved for two particular cases. On
one hand, when the acoustic wave field is considered monochromatic and plane (for far-field
conditions) it is solved as

x(t, r) = A(t) e^{j(wt − k·r)}    (2.24)

where w = 2πf is the considered frequency (in radians per second), A(t) is the wave field
amplitude and k is the wavenumber vector, defined as

k = (2π/λ) [sin θ cos φ, sin θ sin φ, cos θ]

where λ is the wavelength (λ = c/f ), and θ and φ are the polar coordinates for elevation and
azimuth (respectively) of the waveform position in space.
On the other hand, when the wave is considered spherical (propagating in all directions), as
used in near-field conditions, it is solved as

x(t, r) = (−A(t)/(4πr)) e^{j(wt − kr)}    (2.25)

where now r = |r| determines the scalar distance to the source in any direction and k is the
scalar version of the wavenumber, k = 2π/λ for all directions.
From these formulas one can observe how any acoustic wave can be sampled both in time
and space in a similar way (both dimensions being in the exponential). Time sampling is done
to obtain a digital signal and space sampling is done by a microphone array. In both cases one
can reconstruct the original signal as long as it complies with the Nyquist rule (Ifeachor and
Jervis 1996) (or else there will be spatial/temporal aliasing).
Passive Apertures
In order to describe the effect of the signal when received by a microphone array, the theory
behind transmission/reception of propagating waves needs to be reviewed. An aperture is defined
as a spatial region designed to emit (active) or receive (passive) propagating waves. The concept
of aperture is very broad and is used for many different kinds of waves.
Figure 2.2: Amount of signal seen by a linear aperture for two incoming plane waves (plane wave 1 and plane wave 2) arriving from different directions.
As can be seen in figure 2.2, a passive aperture has a particular orientation in space
and therefore alters the received signal in a different way for each frequency and location. In this
context the aperture function or sensitivity function (A(f, r), with impulse response α(t, r)) is
defined as the response of the aperture to the incoming signal x(τ, r) (with Fourier transform
X(f, r)), resulting in X_R(f, r) as

x_R(t, r) = ∫_{−∞}^{∞} x(τ, r) α(t − τ, r) dτ  ⟷  X_R(f, r) = X(f, r) A(f, r)    (2.26)
The aperture function is defined for a particular direction of arrival. In order to measure
and characterize the response of an aperture for all directions, the directivity pattern (or beam
pattern) is defined as the aperture response for each frequency and direction of arrival. It is given
by

D_R(f, α) = F_r{A(f, r)} = ∫_{−∞}^{∞} A(f, r) e^{j2πα·r} dr    (2.27)

where F_r is the 3D Fourier transform, r now indicates a point along the aperture and α is the
direction vector of the wave,

α = (1/λ) [sin θ cos φ, sin θ sin φ, cos θ]
Over the general directivity pattern in eq. 2.27 one can apply some simplifications, oriented
towards array processing:
• A far-field signal is received (therefore |r| > 2L²/λ, with L being the total length of the
aperture).
• The aperture is considered linear, which leads to a directivity pattern that contains zeros of
reception at α_x = mλ/L, with m being a scalar value.
To all effects, a linear sensor array can be considered as a sampled version of a continuous
linear aperture. One can obtain the aperture function of the array as the superposition of all
individual element functions (e_n(·)), which are the equivalent of the aperture function for each
element and measure the element's response for a particular direction of arrival. The aperture
function is now written as

A(f, x) = Σ_{n=−(N−1)/2}^{(N−1)/2} w_n(f) e_n(f, x − x_n)

for an array with N elements, where e_n is the element function for element n, w_n(f) is the
complex weighting for element n and x_n is the position of such element on the x axis.
For the far-field case, and considering all elements to have an identical element function, the
directivity function can be computed as

D(f, α_x) = Σ_{n=−(N−1)/2}^{(N−1)/2} w_n(f) e^{j2π α_x x_n}
where the complex weighting can be expressed as module and phase in the following way:

w_n(f) = a_n(f)\, e^{j\varphi_n(f)}   (2.31)
where a_n(f) can be used to control the shape of the directivity pattern and ϕ_n(f) to control the angular
location of the main lobe, both being scalar functions.

Beamforming techniques that use a microphone array for acoustic enhancement of the signal
adjust these two parameters to obtain the desired shaping and steering of the lobes of the
directivity pattern towards certain locations in space. Some of these techniques use the far-field
approximation made here, while others (fewer) consider near-field waves, which lead to a different
development of the directivity pattern.
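As an illustrative aside (not part of any system described in this thesis), the following Python sketch evaluates the far-field directivity pattern of a linear array for a given set of amplitude weights a_n and a phase steering angle, showing how the two scalar functions shape and steer the main lobe. The uniform geometry, function name and parameter values are assumptions made only for this example.

    import numpy as np

    def directivity_pattern(freqs, angles_deg, positions, amplitudes, steer_deg=0.0, c=343.0):
        """Far-field directivity |D(f, theta)| of a weighted linear array.
        `amplitudes` are the a_n shaping weights; the phases phi_n are chosen
        here to steer the main lobe towards `steer_deg`."""
        angles = np.radians(np.asarray(angles_deg, dtype=float))
        positions = np.asarray(positions, dtype=float)
        D = np.zeros((len(freqs), len(angles)), dtype=complex)
        for i, f in enumerate(freqs):
            lam = c / f
            alpha_x = np.sin(angles) / lam                     # x component of the direction vector
            alpha_steer = np.sin(np.radians(steer_deg)) / lam
            w = amplitudes * np.exp(-1j * 2 * np.pi * alpha_steer * positions)   # w_n = a_n e^{j phi_n}
            D[i] = np.sum(w[:, None] * np.exp(1j * 2 * np.pi * alpha_x[None, :] * positions[:, None]), axis=0)
        return np.abs(D)

    # e.g. 8 microphones spaced 5 cm apart, uniform amplitudes, main lobe steered to 30 degrees:
    # pattern = directivity_pattern([1000.0, 4000.0], np.linspace(-90, 90, 181),
    #                               (np.arange(8) - 3.5) * 0.05, np.ones(8), steer_deg=30.0)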
The application of the general signal beamforming theory to the case of acoustic beamforming
has some peculiarities and has been broadly studied in the past. In general it is considered that
the acoustic signal is generated by a far-field source (therefore it arrives at the microphones as a
plane wave) and that it has usually been considered as having a narrow-band frequency response
(not taking into consideration the different behavior of the arrays at different frequencies).
There are two main groups of beamforming techniques that can be found in the bibliography.
These are data-independent (or fixed) and data-dependent (or adaptive). The techniques that
are data-independent fix their parameters and maintain them throughout the processing of the
input signal. Data dependent techniques update their parameters to better suit the input signal,
adapting to changing noise conditions. Moreover, there are several postprocessing techniques
that are applied after the beamforming, some of them closely linked to the beamforming process.
Fixed beamforming techniques are simpler to implement than the adaptive ones, but are more
limited in their ability to eliminate highly directive (and sometimes changing) noise sources. The
simplest beamforming technique in this group is the delay&sum (D&S) technique (Flanagan,
Johnson, Kahn and Elko (1994), Johnson and Dudgeon (1993)). The output signal y(n) is defined
as:

y(n) = \frac{1}{M} \sum_{m=1}^{M} x_m(n - \tau_m)   (2.32)

given a set of M microphones, where each microphone has a delay of τ_m relative to the others.
In this technique all channels are equally weighted at the output. D&S beamforming is a
particular case of the more general filter&sum beamforming, where an independent
filter is applied to each channel:

y(n) = \sum_{m=1}^{M} w_m[n]\, x_m(n - \tau_m)   (2.33)
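As a minimal sketch of eqs. 2.32 and 2.33 (assuming the per-channel delays are already known and expressed in integer samples, and using scalar weights in place of full filters), the combination of channels can be written as:

    import numpy as np

    def delay_and_sum(channels, delays, weights=None):
        """Delay&sum (eq. 2.32) or, with per-channel scalar weights, a basic
        filter&sum (eq. 2.33). `delays` holds the integer sample delay tau_m of
        each channel with respect to the reference; uniform 1/M weights by default."""
        M = len(channels)
        if weights is None:
            weights = np.full(M, 1.0 / M)
        delays = np.asarray(delays, dtype=int)
        # choose an output range so that n - tau_m stays inside every channel
        start = max(0, delays.max())
        stop = min(len(x) for x in channels) + min(0, delays.min())
        y = np.zeros(stop - start)
        for x, tau, w in zip(channels, delays, weights):
            y += w * np.asarray(x, dtype=float)[start - tau: stop - tau]
        return y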
One application of such techniques is the Superdirective Beamforming (SDB) (Cox, Zeskind
and Kooij (1986), Cox, Zeskind and Owen (1987)), where the channel filters (also called
superdirective beamformers) are defined to maximize the array gain (or directivity factor), which
is defined as the improvement in signal to noise ratio between the reference channel and the
“enhanced” system output.
For the case of near-field signals (such as when a microphone array is located right in front of
the speaker at a workstation) the SDB has been reformulated (Tager (1998a), Tager (1998b),
McCowan, Moore and Sridharan (2000)) by using the near-field propagation functions for the
acoustic waves, where waves are no longer considered planar.
Considering the speech signal to be narrow-band simplifies the design of beamforming sys-
tems but does not represent reality well. To deal with broadband signals in an effective
manner, several sub-array beamforming techniques have been proposed (Fischer and Kammeyer
(1997), Sanchez-Bote, Gonzalez-Rodriguez and Ortega-Garcia (2003)) where the set of micro-
phones is split into several sub-arrays, each focusing its processing on a particular band, and
all the information is collapsed into the “enhanced” signal at the end.
Adaptive beamforming techniques offer a higher capacity for reducing noise interference
but are much more sensitive to steering errors due to the approximation of the channel delays.
The Generalized Sidelobe Canceller (GSC) technique (Griffiths and Jim 1982) aims at en-
hancing the signal that comes from the desired direction while cancelling out signals coming
from other sources. This is achieved by creating a double path for the signal in the algorithm.
A standard beamforming path is modified by an adaptive path consisting of a blocking matrix
and a set of adaptive filters that aim at minimizing the output noise power. The blocking matrix
blocks the desired signal from the second path. At the end both paths are subtracted to obtain
the output signal. In order to find the optimum coefficients for the lower path, an algorithm
like Least Mean Squares (LMS) can be used.
Although widely used, in practice the GSC can suffer from distortion of the output signal
normally called signal leakage. This is due to the inability of the blocking matrix to completely
eliminate the desired signal from the adaptive path (which is very common in speech due to
its broadband properties). This problem is treated in Hoshuyama, Sugiyama and Hirano (1999)
where the blocking matrix is designed with control of the allowed target error region.
A different kind of adaptive beamforming technique is one that allows a small amount
of distortion of the desired signal, as it is considered not to affect the quality of the signal as
perceived by human ears. One such technique is AMNOR (Adaptive Microphone-
array system for Noise Reduction), introduced by Kaneda and Ohga (1986), Kaneda (1991) and
Kataoka and Ichirose (1990). It introduces a known fictitious desired signal during noise-only
periods in order to adapt the filters to cancel such signal and therefore improve the quality of
the speech parts. One drawback of this technique is the need for accurate speech/non-speech
detection.
Some efforts have been reported applying adaptive beamforming techniques to the near-
field case. In McCowan, Marro and Mauuary (2000) and McCowan, Pelecanos and Sridharan
(2001) adaptive beamforming and super-directive beamforming are combined to this effect.
In real applications none of the previously described beamforming techniques achieves the
levels of improvement on the signal predicted by theory. In practice a post-processing of the acoustic
signal is necessary in order to obtain the optimum output quality. In Zelinski (1988) a Wiener
post-filter is applied in which time delay information is used to further enhance the signal.
Marro, Mahieux and Simmer (1998) present a very thorough analysis of the
interaction of Wiener filtering with filter&sum beamforming, showing that the post-filter can
cancel incoherent noise and allows for slight errors in the estimated array steering. Other post-
filtering approaches applied to microphone array beamforming are proposed in Cohen and
Berdugo (2002) and Valin, Rouat and Michaud (2004).
There are many post-processing techniques aimed at the “enhanced” single channel signal
resulting from the beamforming. Some of them take into account acoustic considerations (Rosca,
Balan and Beaugeant (2003), Zhang, Hansen and Rehar (2004)) or acoustic models (Brandstein
and Griebel 2001) to better enhance the signal.
In order to apply almost any of the array beamforming techniques to an acoustic signal, and
given that the location of the acoustic source is not known, one needs a way to estimate the TDOA
between channels or the Direction of Arrival (DOA) of the signal. In practice the DOA estimation
is much less used for signal enhancement in this domain as its requirements and computational
cost are normally higher than for TDOA. DOA estimation has also been considered less suitable
than TDOA for broadband signals.
Many techniques have been proposed in the past to estimate the TDOA
between a pair of sensors, such as the LMS adaptive filters used in sonar (F. Reed and
Bershad (1981), Schmidt (1986)).
However, the approaches that have become more popular in recent years have been those
based on the cross-correlation of the signals. Given two real signals, x_1 and x_2, the cross-
correlation between them is defined as:

R_{x_1 x_2}(m) = \sum_{n=-\infty}^{\infty} x_1(n) \cdot x_2^*(n - m)   (2.34)

although, as in practice one cannot work with infinite signals, it is estimated as:

\hat{R}_{x_1 x_2}(m) = \sum_{n=-N}^{N} x_1(n) \cdot x_2^*(n - m)   (2.35)
where each signal has length N. In order to do this computation in a more efficient way, both
signals are first Fourier transformed, the product is computed and then the inverse Fourier
transform is applied.
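A small sketch of this frequency-domain computation is given below, assuming two real signals and zero-padding to avoid circular wrap-around; the function name is illustrative:

    import numpy as np

    def xcorr_fft(x1, x2):
        """Cross-correlation of two real signals via the FFT: transform both signals,
        multiply one by the conjugate of the other, and inverse-transform. Zero-padding
        to the next power of two avoids circular wrap-around."""
        n = len(x1) + len(x2) - 1
        nfft = 1 << (n - 1).bit_length()
        X1 = np.fft.rfft(x1, nfft)
        X2 = np.fft.rfft(x2, nfft)
        r = np.fft.irfft(X1 * np.conj(X2), nfft)
        # reorder so lags run from -(len(x2)-1) to +(len(x1)-1); this matches
        # np.correlate(x1, x2, mode="full"), the direct-sum estimate of eq. 2.35
        return np.concatenate((r[nfft - (len(x2) - 1):], r[:len(x1)]))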
When the cross-correlation is computed between two signals where one is a
(similar) delayed version of the other by a time T, the main peak of the cross-correlation will
be located at time ±T (depending on which of the two signals is taken as x_1). In real applications,
though, there are many disturbing factors that will affect the position of the peak or will mask
it. These factors include noise, reverberation and others. The case of reverberation has been
extensively studied in the literature (Champagne, Bedard and Stephenne (1996), Brandstein and
Silverman (1997)).
Addressing this problem, the Generalized Cross Correlation (GCC) was introduced (Knapp
and Carter 1976). It implements a frequency domain weighting of the cross correlation according
to different criteria, in order to make it more robust to external disturbing factors. The general
expression for the GCC is:
R^{GCC}_{x_1 x_2}(m) = \mathcal{F}^{-1}\left(X_1(w) \cdot X_2^*(w) \cdot \psi(w)\right)   (2.36)
where ψ is a weighting function. If ψ = 1 for all w the standard cross correlation formula is
obtained.
The first weighting function that will be considered is the Roth correlation (Roth 1971), which
weights the cross correlation according to the Signal to Noise Ratio (SNR) value of the signal.
Its results approximate an optimum linear Wiener-Hopf filter (Trees 1968). Frequency bands
with a low SNR obtain a poor estimate of the cross correlation and therefore are attenuated
versus high SNR bands.
\psi_{ROTH}(w) = \frac{1}{X_1(w) \cdot X_1^*(w)}   (2.37)
A variation of the ROTH weight is the Smoothed Coherence Transform (SCOT) (Carter, Nuttall
and Cable 1973), which acts upon the same SNR-based weighting concept, but allows the two signals
being compared to have different spectral noise density functions.
\psi_{SCOT}(w) = \frac{1}{\sqrt{X_1(w) X_1^*(w) \cdot X_2(w) X_2^*(w)}}   (2.38)
In environments with high reverberation, the Phase Transform (PHAT) weighting function
(Knapp and Carter 1976) is the most appropriate, as it normalizes the amplitude of the spectral
density of the two signals and uses only the phase information to compute the cross correlation.
It is applied to speech signals in reverberant rooms by Brandstein and Silverman (1997).
\psi_{PHAT}(w) = \frac{1}{|X_1(w) \cdot X_2^*(w)|}   (2.39)
The GCC-PHAT achieves very good performance when the SNR of the signal is high, but
deteriorates when the noise level increases. This is the solution used as weighting function in
the beamforming implementation proposed in this thesis.
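As an illustration of how the GCC-PHAT of eqs. 2.36 and 2.39 can be turned into a TDOA estimate between two channels, the following sketch assumes a single dominant source and a bounded maximum delay; it is a simplified stand-in, not the implementation used in this thesis:

    import numpy as np

    def gcc_phat_tdoa(x_ref, x_m, fs, max_delay_s=0.02, eps=1e-12):
        """TDOA estimate (in seconds) of channel x_m with respect to the reference
        channel x_ref, using the GCC with PHAT weighting (eqs. 2.36 and 2.39)."""
        nfft = 1 << (len(x_ref) + len(x_m) - 1).bit_length()
        X1 = np.fft.rfft(x_ref, nfft)
        X2 = np.fft.rfft(x_m, nfft)
        cross = X1 * np.conj(X2)
        # PHAT weighting: keep only the phase of the cross-power spectrum
        r = np.fft.irfft(cross / (np.abs(cross) + eps), nfft)
        max_lag = int(max_delay_s * fs)
        # candidate lags from -max_lag to +max_lag (negative lags wrap to the end)
        cc = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
        return (np.argmax(np.abs(cc)) - max_lag) / fs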
Another weighting function of interest is the Hannan & Thomson weighting (Knapp and Carter (1976),
Brandstein, Adcock and Silverman (1995)), also known as Maximum Likelihood (ML) correla-
tion, which also tries to maximize the SNR of the signal. For speech applications, Brandstein
et al. (1995) proposed an approximation to it.
Finally, the Eckart filter (Eckart 1952) maximizes the deflection criterion, i.e. the ratio of the
change in mean correlation output due to signal present compared to the standard deviation of
correlation output due to noise alone. The weighting function achieving this is:
\psi_{Eckart}(w) = \frac{S_1(w) S_1^*(w)}{N_1(w) N_1^*(w) \cdot N_2(w) N_2^*(w)}   (2.41)
When applying a known technique to a new task it is preferable to start from some
well rooted theory and from an implementation that has been proven successful in a task
similar to the proposed one, while analyzing its shortcomings in the new domain and propos-
ing improvements to it. This is the case of the diarization system presented for the meetings
environment, which is based on the system previously developed at the International Computer
Science Institute (ICSI) for the task of broadcast news. It has been developed by proposing al-
ternatives to the algorithms that had some room for improvement or that needed to be adapted
to better fit the new domain. Also, given that the broadcast news system is designed to run only
on a single-channel recording, the necessary algorithms have been implemented to combine
the signals from multiple channels/microphones so that they can be processed by the presented
system.
This chapter covers the description of both the broadcast news system and the new meetings
domain system, bridging the gap between both by analyzing the differences that have been
observed during development.
In the first part, the broadcast news system is described in detail, pointing out the main
ideas behind it and its implementation, and baseline results are shown regarding its performance
for the NIST Rich transcription evaluations for broadcast news (RT03s and RT04f) in which
ICSI participated.
Following the broadcast news description, a comparison of some of the parameters measur-
able in both domains (meetings and broadcast news) is offered. The differences between them
are pointed out, as well as the areas where this thesis proposes improvements when converting a
system from one task to the other.
Finally, a description of the meetings domain speaker diarization system is given. The detailed
description of all the novel algorithms involved in the new system is split between the current
and next two chapters. In this chapter a detailed description is given of those algorithms that
have been adapted from different sources but that are not considered a novelty of this thesis
by themselves. It also gives an overview description of the rest of the algorithms (novel in this
thesis) to obtain a complete view of the overall system.
The techniques considered to be the primary contribution of this thesis will be described in
chapter 4, which focuses on the algorithms within the single-channel speaker diarization system,
and chapter 5, which deals with the use of the multiple channels in a meeting room to further improve
the system.
The broadcast news (BN) system currently used at ICSI and which has been used as a base
for the meetings system, was originally created by Jitendra Ajmera circa 2003. He built the
system while he was a PhD student at EPFL (Lausanne, Switzerland) and IDIAP (Martigny,
Switzerland) and implemented it at ICSI while visiting for 6 months. During Ajmera’s stay,
ICSI participated in the NIST 2003 Rich transcription of broadcast news spring evaluation with
the developed system, and soon afterwards in the RT03f (“who spoke the words” evaluation).
The diarization system was then improved and ICSI participated again in the RT04f evaluation
(Wooters et al. 2004), also in broadcast news.
The system is a bottom-up agglomerative clustering approach that uses a modified version
of the BIC distance (Ajmera et al. 2003) in order to iteratively merge the closest clusters until
the same BIC distance determines the system to stop. Speaker segmentation of the data is not
done explicitly before the clustering part, but it is done via Viterbi decoding of the data given
the current speaker models at every iteration. For a thorough description of the system refer to
Ajmera (2004).
The philosophy behind the system and all research that has been done towards implemen-
tation of the meetings system is based on these key concepts:
1. Make the system as robust as possible to data within the same domain which the system
has not been adapted to.
2. Allow for a fast adaptation of the system to use it in new domains (i.e. broadcast news,
meetings, telephone speech, and others).
These key concepts were put into practice by imposing the following guidelines:
• Use as little training data as possible so that the system can be easily adapted to new
domains and is not over-tuned to the data it is trained on.
• Avoid as much as possible the use of thresholds and tuning parameters. If not possible,
try to define parameters that once tuned can achieve good performance in different kinds
of data.
The implementation of the broadcast news system used as a baseline for the meetings domain
was presented to the RT04f broadcast news evaluation (Wooters et al. 2004), which is the latest
broadcast news evaluation conducted by NIST within the EARS (Effective Affordable Reusable
Speech-to-Text) program. It differs from the original diarization system created by Ajmera
(Ajmera and Wooters 2003) in four main points. First, the inclusion of a speech/non-speech
detector to filter out the non-speech segments prior to any further processing of the data.
Second, the discontinuation of the speech/music classifier used in the RT03s evaluation. Third,
the parameterization used was MFCC instead of the PLP features used until then. Finally, the inclusion
of an iterative segmentation-training loop in the algorithm to allow models to converge to the
clusters' data.
Figure 3.1 shows the main blocks constituting the system. In the following sections
a detailed description of the different blocks is given.
The speech/non-speech detector for the BN system is a two-class detector, in which each
class is modeled by a three-state HMM, with a minimum duration of 30 msec. The non-speech
model includes both music and silence. The features used in the SNS detector (MFCC12) are
different from the features used for clustering. This detector was initially used for the ICSI-
SRI BN STT system in RT04f. It was trained on 80 hours of 1996 HUB4 BN acoustic data.
No tuning was done to adapt the detector to speaker diarization for the RT04f evaluation.
[Figure 3.1: Main blocks of the broadcast news diarization system: system initialization followed by the iterative segmentation/merging loop.]
The Diarization Error Rate (DER) is the percentage of time that the system mis-
attributes speakers or non-speech segments. It can be broken down into speaker errors, which ac-
count for mis-attributed speaker segments, and false alarm (FA) and missed speech (MISS) errors,
which account for non-speech labelled as speech and vice versa. For an exhaustive definition of
each of these types of error refer to section 6.1.3.
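As a minimal illustration (assuming the three error times have already been computed and scored against the reference), the DER aggregation can be sketched as:

    def diarization_error_rate(miss_time, fa_time, spkr_error_time, scored_time):
        """DER (in percent) as the fraction of scored speaker time that is wrongly
        attributed: missed speech + false alarm speech + speaker confusion."""
        return 100.0 * (miss_time + fa_time + spkr_error_time) / scored_time

    # e.g. with 2 s missed, 100 s false alarm and 356 s confusion over 2000 s of
    # scored time, DER = 100 * (2 + 100 + 356) / 2000 = 22.9 (hypothetical numbers)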
The first column shows the baseline system, composed of the RT03f system. It has an overall
non-speech error of 5.1% and a speaker error of 17.8%. Adding the speech/non-speech detector
proposed for broadcast news not only improves the non-speech errors but also reduces the
speaker error, due to the reduction in clustering errors as noted above. Finally, it is interesting
to see how much could be achieved in terms of DER if a perfect spnsp detector were built. Such a
detector is obtained by extracting the speaker segments from the reference segmentation and
running the diarization with those as spnsp input. It can be seen that the proposed spnsp
detector is still about 1.2% worse than the perfect detector. The speaker error is lower with the
proposed spnsp detector than with the ideal one. This could indicate that some non-speech data
can still be beneficial for training discriminant speaker models. In this implementation the system
obtained 0.2% and 0.1% MISS errors in the perfect spnsp and baseline systems, which was
later reduced to 0%.
System used %MISS %FA %SPKR %DER
RT03f system 0.1 5.0 17.8 22.95
+SRI/spnsp 1.5 1.2 15.4 18.17
+ideal spnsp 0.2 0.0 16.8 16.98
With respect to the parameters used in the system, as happens in other speech processing
areas, acoustic modeling for speaker diarization is performed based on acoustic features extracted
from the input signal. For the broadcast news system at ICSI the features used have been
modified over the years, finally settling on MFCC features with 19 coefficients,
without any deltas or double deltas and without the zeroth cepstral coefficient, which is linked to the
energy of the signal. For broadcast news these features were computed over a 60 millisecond
analysis window at 20 millisecond intervals. Multiple tests were done, resulting in the selection
of these features. On one hand, the increase in computation involved in using the delta and
double delta coefficients was considered unacceptable given that the system gave mixed results
when using them. On the other hand, MFCC19 was chosen as opposed to PLP12, which was
used in RT03f, due to a slightly better performance when used together with the spnsp
detector.
As can be seen in table 3.2, also from Wooters et al. (2004), the baseline system using PLP
and no spnsp detector produces better overall results than the counterpart MFCC system, but
the latter is better when the spnsp detector is added. In the diarization system for meetings a
combination that uses delays as features is proposed, which is also applicable to all other kinds of
feature vectors.
The system implemented at ICSI for broadcast news is based on an agglomerative clustering
process that iteratively merges clusters until reaching the optimum number of clusters. In order
to initialize the system one needs to obtain an initial set of K clusters (where K > K_opt, the
optimum number of clusters, representing the number of speakers in the recording). During the
implementation of the original system two alternatives were considered. On one hand, a k-means
algorithm was tested in order to split the data into K clusters containing homogeneous frames.
On the other hand, the data could be split into even sized pieces. The second alternative was finally
selected due to its simplicity and the good results that it achieved on most data.
The linear initialization of the data into clusters is a simple algorithm that clusters the data
according to its temporal proximity rather than its acoustic proximity, allowing models to be
trained with acoustic data of very different acoustic characteristics that belongs to the same
speaker.

In order to create K clusters, the cluster initialization algorithm first splits the show into P
partitions (where P = 2 for broadcast news). Then, for each partition, the data is split into K
segments of the same size, labelled 1 . . . K. The initial data for cluster k (where k ∈ 1 . . . K)
is the union of the data labelled k in each of the partitions. This technique is thought to
work better than a more elaborate frame-level k-means algorithm because it takes into account
the possible acoustic variation of the speech belonging to a single speaker. By clustering the
data with k-means one cannot ensure that the resulting clusters contain frames from the same
speaker; they may instead contain acoustic frames that belong to the same phonetic class from
several speakers.
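A minimal sketch of this linear initialization, assuming frame indices 0 . . . N−1 and the P = 2 partitions used for broadcast news, could look as follows (names are illustrative):

    import numpy as np

    def linear_initialization(num_frames, K, P=2):
        """Assign each frame to one of K initial clusters: split the show into P
        partitions, split each partition into K consecutive segments of equal size,
        and give segment k of every partition to cluster k."""
        labels = np.empty(num_frames, dtype=int)
        part_edges = np.linspace(0, num_frames, P + 1, dtype=int)
        for p in range(P):
            seg_edges = np.linspace(part_edges[p], part_edges[p + 1], K + 1, dtype=int)
            for k in range(K):
                labels[seg_edges[k]:seg_edges[k + 1]] = k
        return labels

    # e.g. labels = linear_initialization(num_frames=120000, K=40, P=2)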
Each initial cluster obtained via linear initialization will most certainly have data belonging
to more than one source/speaker. In order for the clusters to achieve some speaker homogeneity
before starting the merging iterations, the algorithm performs three iterations of model training
and Viterbi segmentation of the data. The next section describes in more detail how clusters are mod-
eled. The resulting clusters tend to contain data from a single speaker, or at least a majority of
it.
On some occasions the linear initialization creates clusters with acoustic
segments from more than one speaker, causing potential merging errors and therefore a
decrease in performance. In the improvements for the meeting room data a new initialization
algorithm and a segment purification algorithm, which detects and splits such clusters, will be
proposed.
[Figure 3.2: Ergodic HMM topology used for clustering: K states (clusters), each composed of S sub-states tied to a single GMM, with self-loop weight α and inter-cluster transition weight β.]
The broadcast news clustering algorithm models the acoustic data using an ergodic hidden
Markov model (HMM) topology, as seen in figure 3.2, where each initial state corresponds to
one of the initial clusters. Upon completion of the algorithm’s execution, each remaining state is
considered to represent a different speaker. Each state contains a set of MD sub-states, imposing
a minimum duration for staying in any model. Each one of the sub-states has a probability density
function modeled via a Gaussian mixture model (GMM). The same GMM model is tied to all
sub-states in any given state. Upon entering a state at time n, the model forces a jump to the
following sub-state with probability 1.0 until the last sub-state is reached. In that sub-state,
we can remain in the same sub-state with transition weight α, or jump to the first sub-state of
another state with weight β/M , where M is the number of active states/clusters at that time.
The diarization system for broadcast news used the values α = 0.9 and β = 0.1 with the intention
of favoring the system to stay in the same cluster and therefore model speaker turns longer than
MD frames. As will be shown, this implicitly models the maximum length of a speaker turn.
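As an illustration of this topology (and not the actual decoder implementation), the following sketch builds the transition matrix for M clusters with MD sub-states each, using the α and β weights described above:

    import numpy as np

    def build_transition_matrix(num_clusters, min_dur, alpha=0.9, beta=0.1):
        """Ergodic HMM with minimum duration: each cluster is a chain of `min_dur`
        sub-states traversed with probability 1; from the last sub-state the model
        either stays there (weight alpha) or jumps to the first sub-state of any
        cluster (weight beta / num_clusters)."""
        M, S = num_clusters, min_dur
        A = np.zeros((M * S, M * S))
        for m in range(M):
            base = m * S
            for s in range(S - 1):
                A[base + s, base + s + 1] = 1.0   # forced advance inside the chain
            last = base + S - 1
            A[last, last] = alpha                 # remain in the same cluster
            for m2 in range(M):
                A[last, m2 * S] += beta / M       # jump to any cluster's first sub-state
        return A

    # e.g. A = build_transition_matrix(num_clusters=40, min_dur=150)  # 150 frames of 20 ms = 3 s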
Each of the GMM models initially has a complexity of 5 Gaussian mixtures, which was
optimized using development data from previous evaluations. Upon deciding that two clusters
belong to the same speaker, one of the clusters/models is eliminated from the HMM topology,
M is reduced by 1 and the resulting model is trained from scratch with a complexity being the
sum of the previous two models. This ensures that the complexity of the overall system after
any particular iteration remains constant and therefore the overall likelihood of the data given
the overall HMM model can be compared between iterations.
Given M clusters with their corresponding models, the matrix of distances between every cluster
pair is created and the closest pair is merged if it is determined that both clusters contain
data from the same speaker. In order to obtain a measure of similarity between two clusters
modeled by a GMM a modified version of the Bayesian Information Criterion (BIC) is used,
as introduced by Ajmera in Ajmera, McCowan and Bourlard (2004) and Ajmera and Wooters
(2003). As explained in the state of the art chapter, the BIC value quantifies the appropriateness
of a model given the data. It is a likelihood-based metric that introduces a penalty term (in the
standard formulation) to penalize models by their complexity.
Given two clusters C_i and C_j composed of N_i and N_j acoustic frames respectively, they are
modeled with two GMM models, M_i and M_j. The data used to train each of the models is the
union of the data belonging to all the segments labelled as belonging to that cluster. In
the same manner a third model M_{i+j} is trained with C_{i+j} = C_i ∪ C_j.
In the standard BIC implementation to compare two clusters (Shaobing Chen and
Gopalakrishnan 1998) the penalty term adds a factor λ that is used to determine the effect
of the penalty on the likelihood. The equation of the standard ∆BIC for GMM models is

\Delta BIC(C_i, C_j) = \log p(C_{i+j}|M_{i+j}) - \log p(C_i|M_i) - \log p(C_j|M_j) - \lambda\,\frac{1}{2}\left(\#M_{i+j} - \#M_i - \#M_j\right)\log N_{i+j}   (3.1)

where \#M_· is the number of free parameters to be estimated for each of the models, i.e. relative
to the topology and complexity of the model.
It is considered that two clusters belong to the same speaker if they have a positive ∆BIC
value. Such value is affected by the penalty term (including the factor λ, which acts as a threshold
determining which clusters to merge and which not to). The penalty term also modifies the
order in which the cluster pairs are merged in an agglomerative clustering system, as each pair
will have a different number of total frames and/or model complexities, which will cause the
penalty term to be different. In systems based on ∆BIC this penalty factor (λ) is usually tuned
on development data and it always takes values greater than 0. In some cases two different
values are defined, one for the merging criterion and the other for the stopping criterion.
When training models to be used in ∆BIC, these need to be trained using an ML approach,
but there is no constraint on the kind of model to use. Ajmera's modification to the traditional
BIC formula comes with the inclusion of a constraint on the combined model M_{i+j}:

\#M_{i+j} = \#M_i + \#M_j   (3.2)

This is an easy rule to follow as model M_{i+j} is normally built exclusively for the comparison.
By applying this rule to the ∆BIC formula one avoids having to decide on a particular λ
parameter, and therefore the real threshold applied to consider whether two clusters are from
the same speaker becomes 0. The formula of the modified BIC becomes equivalent to the GLR,
but with the condition that 3.2 applies.
The lack of an extra tuning parameter makes the system more robust to changes in the
data to be processed, although, as the BIC formula is just an approximation of the Bayes
factor (BF) formulation, the increase in robustness sometimes comes with a small detriment in
performance.
In the broadcast news system that has been described, the model Mi+j is generated directly
from the two individual models i and j by pooling all the Gaussian mixtures together. Then the
data belonging to both parent models is used to train the new model via ML. Training is always
performed using an Expectation Maximization (EM-ML) algorithm and performing 5 iterations
on the data.
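A simplified sketch of this ∆BIC comparison is given below, using scikit-learn GMMs as a stand-in for the actual models and enforcing the complexity constraint of eq. 3.2 by giving the combined model the sum of the parents' Gaussians; it is illustrative only, not the thesis implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def modified_delta_bic(data_i, data_j, n_comp_i, n_comp_j):
        """Modified delta-BIC between two clusters: log-likelihood of the pooled data
        under a combined GMM whose number of Gaussians is the sum of the parents'
        (eq. 3.2), minus the log-likelihoods under the individual models.
        A positive value indicates that the two clusters belong to the same speaker."""
        gmm_i = GaussianMixture(n_components=n_comp_i, covariance_type="diag").fit(data_i)
        gmm_j = GaussianMixture(n_components=n_comp_j, covariance_type="diag").fit(data_j)
        data_ij = np.vstack([data_i, data_j])
        gmm_ij = GaussianMixture(n_components=n_comp_i + n_comp_j,
                                 covariance_type="diag").fit(data_ij)
        log_l_combined = gmm_ij.score_samples(data_ij).sum()
        log_l_separate = gmm_i.score_samples(data_i).sum() + gmm_j.score_samples(data_j).sum()
        # with the complexity constraint the penalty term cancels and the threshold is 0
        return log_l_combined - log_l_separate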
Once the ∆BIC metric between all possible cluster pairs has been computed, the system searches for
the largest value and, if ∆BIC_max(C_i, C_j) > 0, the two clusters are merged into a single cluster.
In this case, the merged cluster model is created in the same way as M_{i+j}, although this is
not a requirement. The total complexity of the system remains intact through the merge, as the
number of Gaussian mixtures representing the data is the same, now distributed among M−1 clusters.
The computation of ∆BIC for all possible combinations of clusters is by far the most com-
putationally intensive step in the agglomerative clustering system. Given that the models are
retrained and the data is resegmented after each merge, the models obtained at each iteration
are quite dissimilar to the models in the previous iteration, and it is therefore recommended to
compute all values again. Some techniques were tested to speed up this process by looking at the
behavior of the ∆BIC values. Some are:
• In every iteration merge more than one cluster pair, selecting the pairs with the
highest positive BIC values. This generates mixed results, as it modifies the way that the
models are grouped.

• Compute the BIC value only for clusters which have changed considerably between iter-
ations, maintaining the value from previous iterations for those with almost the same
segments assigned. This also produced mixed results, depending on the shows evaluated.
• Do not compute the BIC value for clusters pairs that obtain a negative value at any given
iteration. This reduces the computation as only positive BIC values will be considered in
subsequent iterations. It was implemented with success in the broadcast news system.
Introduced in the system presented for the RT04f evaluation, after each cluster pair merge a set
of three iterations of model training and Viterbi segmentation of the data given the models is
performed. This achieved only a small improvement on the RT04f evaluation data, but proved to be
positive and to increase the robustness of the system.

After any cluster pair merge the cluster structure changes, with one less cluster in the
system. When performing a Viterbi segmentation many segment boundaries change and some
segments are reassigned to different clusters. Such new clusters are used to retrain the models,
which are used again to segment the data. After three iterations the segmentation has usually
converged and a new merging step is started.
In order to stop the clustering process at the optimum number of clusters, two different
alternatives were proposed as stopping criteria in the RT04f evaluation system. On one hand,
the clustering can be performed while there is a positive ∆BIC distance between any two
clusters; this is called the BIC stopping criterion. On the other hand, the overall likelihood of the
data given all the acoustic models can be compared between iterations, stopping the processing
when it starts decreasing (and reverting to the previous segmentation); this is called the Viterbi
stopping criterion.
It must be noted that the Viterbi criterion can be applied only because the overall system
complexity remains constant between iterations and therefore overall likelihoods are comparable.
Note also that the Viterbi stopping criterion is in fact the BIC criterion applied over the overall
model, comparing a model with M clusters and a model with M-1 clusters and stopping when
M-1 is better than M.
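The overall merging loop with the BIC stopping criterion can be sketched as follows; model retraining, Viterbi resegmentation between merges and the Viterbi stopping alternative are omitted for brevity, so this is only a schematic illustration, not the system itself:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def cluster_loglik(data, n_comp):
        """Sum of per-frame log-likelihoods of `data` under an ML-trained GMM."""
        gmm = GaussianMixture(n_components=n_comp, covariance_type="diag").fit(data)
        return gmm.score_samples(data).sum()

    def agglomerative_merging(clusters, comp_per_cluster=5):
        """Greedy merging with the BIC stopping criterion: merge the pair with the
        largest positive delta-BIC, stop when no positive pair remains."""
        clusters = list(clusters)
        comps = [comp_per_cluster] * len(clusters)
        while len(clusters) > 1:
            best_score, best_pair = 0.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    pooled = np.vstack([clusters[i], clusters[j]])
                    score = (cluster_loglik(pooled, comps[i] + comps[j])
                             - cluster_loglik(clusters[i], comps[i])
                             - cluster_loglik(clusters[j], comps[j]))
                    if score > best_score:
                        best_score, best_pair = score, (i, j)
            if best_pair is None:        # BIC stopping criterion: no positive delta-BIC left
                break
            i, j = best_pair
            clusters[i] = np.vstack([clusters[i], clusters[j]])
            comps[i] += comps[j]         # total number of Gaussians stays constant
            del clusters[j], comps[j]
        return clusters, comps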
Table 3.3 shows the resulting scores when using either the BIC or the Viterbi stopping criterion on
the RT04f dataset. Although the Viterbi stopping criterion achieves an absolute 1.55% improvement
over BIC, the breakdown by shows indicated mixed results, and overall results are the opposite for
other sets.
Table 3.3: Comparison between BIC and Viterbi stopping criteria for RT04f data
The final system for broadcast news uses a BIC stopping criterion as it does not require an
extra clustering iteration for the stopping point to be found.
Once the system stops merging, the segmentation is output into a file. At this stage all re-
moved non-speech regions are taken into account and inserted into the output where appropriate
so that the output file is synchronous with the reference file used to evaluate its performance.
Taking the broadcast news speaker diarization system as a backbone to build a meetings di-
arization system requires the adaptation of the previous system to the new requirements. While
doing so, it was considered important to keep as much as possible the same structure and to
make all the changes adaptable to obtain a system that could run with a correct performance
in both broadcast news and meeting domains.
In this section, first an analysis is performed of some of the parameters that might differ from
meetings to broadcast news, by looking at the input signals and reference segmentation files.
This is intended to be a practical comparison between the two domains in order to identify the
strengths and weaknesses of each one. Then, a more theoretical comparison is proposed between
both domains and some high level changes are proposed to adapt the original broadcast news
system to the meeting domain.
In this section some parameters are computed on both meetings and broadcast news shows in
order to draw some conclusions on the nature of the input data to the speaker diarization system.
In order to constrain the analysis to a known set of data, it has been performed on the RT04f
broadcast news evaluation set and on the RT06s meetings evaluation set.
The RT04f set is composed of 12 shows, both from radio and television programs. The
evaluation region in each of the shows is approximately 40 minutes, although the recording
might be longer. The RT06s set is composed of two subsets, for the lecture data and conference
data subdomains. The conference room data is composed of 8 meeting excerpts, with a length of
around 15 minutes each. The lecture room set is composed of 28 lecture excerpts of varying lengths.
The first parameter obtained from the input signals is the signal to noise ratio (SNR). It was
computed using the stnr tool from the NIST Speech Quality Assurance Package (SPQA) (NIST
Speech tools and APIs 2006) which is also used in the acoustic beamforming system evaluated
in the experiments section. This program estimates the SNR of a file, defined as

SNR = 10 \log_{10}\left(\frac{P_{speech}}{P_{noise}}\right)

where power refers to the Root Mean Square (RMS) of the signal over a sliding window of 20 ms,
with a step of 10 ms. A histogram is created using the RMS values and then the noise
and speech power values are computed.
To determine the noise average power, a raised cosine function is fitted to the peak in the left
hand side of the histogram (lowest values) using a search algorithm to minimize the Chi-Square
distance between the histogram and the function. The midpoint of such function is considered
the mean noise power. Then the obtained raised cosine is subtracted from the histogram in order
to estimate the speech power distribution. The peak speech power is defined as the histogram
bin midpoint below which 95% of the power falls. Given that the speech power contains
additive noise, the computed noise power is subtracted from it before using it in the SNR formula.
As speech and noise do not exist independently in the recorded signal, this method is only
an approximation of the SNR. The result from this tool might not be comparable to the results
from other tools but, according to the authors, it is consistent with results obtained using the same
tool and is therefore adequate to compare the quality of several signals, as intended in this
section. It must be noted that for a few cases this algorithm is known to give erroneous results;
it should therefore be taken as an indicative information source, and averages should be used to avoid
misreadings.
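A rough, simplified sketch of this kind of histogram-based SNR estimation (picking the low-power histogram mode as the noise level instead of fitting a raised cosine, and using a simple percentile for the speech power) is shown below; it is not a reimplementation of the stnr tool:

    import numpy as np

    def estimate_snr_db(signal, fs, win_s=0.02, hop_s=0.01, n_bins=100):
        """Rough SNR estimate from the histogram of frame powers: mean-square power
        over 20 ms windows with a 10 ms step; the noise level is taken as the
        low-power histogram mode and the speech level as a high percentile of the
        power values, corrected for the additive noise before forming the ratio."""
        win, hop = int(win_s * fs), int(hop_s * fs)
        frames = [signal[i:i + win] for i in range(0, len(signal) - win, hop)]
        power = np.array([np.mean(np.asarray(f, dtype=float) ** 2) for f in frames])
        log_p = 10 * np.log10(power[power > 0])
        hist, edges = np.histogram(log_p, bins=n_bins)
        noise_db = edges[np.argmax(hist[: n_bins // 2])] + (edges[1] - edges[0]) / 2
        speech_db = np.percentile(log_p, 95)
        noise_pow = 10 ** (noise_db / 10)
        speech_pow = max(10 ** (speech_db / 10) - noise_pow, 1e-12)
        return 10 * np.log10(speech_pow / noise_pow)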
In order to compute the SNR values, both for the meetings and for the broadcast news
recording, only the regions determined to be part of the evaluation were considered. As pointed
out before, some of the recordings contain more acoustic data than the evaluated region, which
sometimes is excluded due to problems with the microphones (in meetings) or because it contains
commercials or very noisy acoustics (in broadcast news).
First of all, the SNR is computed for the files in the RT04f data set. As can be seen in
table 3.4, the speech peak power remains constant at a very high value, with an average of 65 dB,
while the noise average power is very variable and ranges from around 15 dB to around 62 dB.
Such averages are taken over the SNR values in log domain, and aim at indicating the overall
quality of the dataset.
Some of the shows contain news material where reporters give their reports from the field,
with a high level of background noise, while the anchor is in the studio, with a very good
quality microphone in a controlled environment. The shows that contain few or none of the
field recordings achieve a very good SNR (around 50 dB) while others perform very poorly (for
example the CNN headline news, ABC and CNBC shows).
In the meetings domain the SNR is computed separately for the conference and lecture room
sets. On the conference room set each of the rooms contains a variable number of microphones,
mostly separated into 2 groups: the microphones situated in the middle of the table (labelled
MDM) and the head-mounted microphones, worn by some of the participants (labelled IHM).
Although the speaker diarization system presented in this thesis does not analyze the IHM case,
the SNR for these microphones is also computed for comparison purposes.
Tables 3.5 and 3.6 show the average SNR for the MDM and IHM channels in the RT06s meetings
in the conference room. In both tables the number of microphones available is indicated in the
second column. Then, the third through fifth columns indicate the average (in the linear domain)
of the SNR values for all channels in each meeting. As the variety of microphones causes them
to have very diverse quality levels, the last two columns indicate the maximum and minimum
SNR values to give an idea of how dispersed these are. Finally, the averages (in the log domain,
as done in the broadcast news results) are computed for all meetings.
The speech quality for all cases is approximately the same (around 65 dB). The noise level for
the MDM channels is much higher than for the broadcast news channels, which causes a decrease
in SNR of almost 5 dB. The average noise level is lower for the IHM channels than for the MDM
channels or the broadcast news shows, which leads to an overall better SNR. This is due to the
proximity of these microphones to the speakers and the fact that a meeting room contains less noise
than some broadcast news shows. It is interesting to point out the outstanding quality of the IHM
channels used in the Edinburgh recordings (within the AMI project); at the same time, these meetings
have some of the worst quality MDM microphones.

Overall, the MDM channels in the conference room are of lower quality than the average in
broadcast news, but they remain more constant in quality across meetings.
Table 3.5: Estimated SNR for RT06s Conference Meetings, MDM channels
Table 3.6: Estimated SNR for RT06s Conference Meetings, IHM channels
Finally, table 3.7 shows the computed averages for the RT06s meetings in the lecture room
dataset. In the same way as in table 3.5, for each recording several distant microphones are available.
The average between microphones is computed in the linear domain while the average over all
recordings is computed in the log domain.

Table 3.7: Estimated SNR for RT06s Lecture Room Meetings, MDM channels

Although all meeting recordings were done within the CHIL project, the specifications on
the room layout and on the acoustic environment change between lecture rooms. The speech
average peak power changes immensely among the rooms (from around 50 dB in the AIT recordings
to around 80 dB in the UKA recordings), remaining stable within the same lecture room. The same
thing happens with the noise average power, which is lowest for the AIT recordings and
highest for the UKA recordings. This indicates that the recording settings were not the same for all
sites, such difference being possibly due solely to the amplification applied to the signal by
the recording equipment.

Regarding the SNR over all the channels, the AIT recordings constantly achieve SNR
values in the twenties, while the other recordings are usually in the tens, with a global average of
18.47 dB, which is slightly lower on average than for the meetings in the conference room subdomain.
The differences between minimum and maximum SNR values remain in the same range as in table 3.5.
Apart from looking at the acoustic signal, the reference transcriptions were also analyzed. The
first parameter computed for both the meetings and broadcast news data is the speaking time per
speaker in each of the shows. This is important as it is necessary to create models that can be
trained optimally on the data, and which therefore need to be adjusted if the amount of data per
speaker changes across domains. Tables 3.8, 3.9 and 3.10 show the number of speakers, average time
per speaker and maximum and minimum speaking times.
Show ID              # spkr.   ave. time   max. time   min. time   ave. time   max. time   min. time
                               (manual)    (manual)    (manual)    (FA)        (FA)        (FA)
CMU 20050912-0900    4         373.01      620.49      169.77      283.71      487.14      105.96
CMU 20050914-0900    4         368.04      544.88      131.66      277.61      444.69      102.46
EDI 20050216-1051    4         301.65      432.01      184.03      224.01      306.88      111.21
EDI 20050218-0900    4         318.01      489.42      210.46      238.46      361.15      162.5
NIST 20051024-0930   9         182.45      503.49      32.83       118.46      384.29      1.15
NIST 20051102-1323   8         179.25      336.05      46.04       121.47      257.0       2.76
VT 20050623-1400     5         258.16      438.16      155.11      195.60      342.06      118.84
VT 20051027-1400     4         244.27      581.27      103.53      184.21      457.39      70.19

Table 3.8: Average total speaker duration (in seconds) in RT06s conference room data, for the manual and forced-aligned (FA) references
In all cases the average values vary greatly, even within the same domain. For example, the
CSPAN show in broadcast news has an average speaker length far higher
than any of the other shows. This is due to the similar total show lengths
imposed by NIST for the evaluations combined with the variability in the number of speakers present
in each recording. From these results it is clear that an automatic way of selecting the speaker
model complexity is necessary in order to be able to model each of these possibilities correctly,
as the more data available from a speaker, the more complex the model needs to be in order to
represent the same level of detail for that speaker compared to others.
Another observation concerns the minimum and maximum speaking time columns. The maxi-
mum speaking time indicates how long the main speaker in the recording has spoken. In both
the lecture room and broadcast news recordings this column tends to contain values much
higher than the average speaking time. In lectures this is the case when the excerpt mainly con-
tains the lecturer giving his talk (sometimes filling the entire excerpt and sometimes with a small
question and answer section). In the broadcast news shows it is usual when the show has
an anchor speaker that directs the flow of the program. In the conference room meetings the
NIST shows also tend to have a dominant speaker.
Table 3.9: Average total speaker duration in RT06s lecture room data

The minimum speaking time column indicates the length of time spoken by the speaker with the fewest
interventions. In many of the lecture room meetings this is nonexistent, as the lecturer
speaks for the whole time. In the other cases, many of the recordings in lectures and broadcast
news, and the NIST recordings in the conference room, contain very short durations. These speakers
are difficult to model as not much data is available and could create many problems and errors
when comparing their models with the longer speaking ones. This is why it is sometimes desirable
to talk of agglomerative clustering systems (like the one presented in this thesis) as having the
goal of obtaining the optimum number of final clusters, instead of the exact number of existing
speakers. Although detecting these short speakers and labelling them as independent clusters is
always desirable, it can normally lead to other errors and therefore should be considered as a
secondary priority.
In Table 3.8 two different transcriptions were used to compute these parameters. On one
hand, one set of transcriptions were generated by hand, distributed by NIST and used in the
evaluations. On the other hand, another set of reference transcriptions were generated automat-
ically via forced-alignment of the reference speech-to-text transcriptions to the IHM channels.
Table 3.10: Average total speaker duration in RT04f broadcast news data
The forced-alignments are the ones used in this thesis in the experiments section. For a more
detailed description on the differences and motivation behind the forced-aligned transcriptions
refer to the experiments chapter 6.
Another parameter that describes the different subdomains of application is the number of
expected speakers to be clustered. Given that the number of initial clusters needs to be higher
than the optimum number of clusters, it is important to define an upper boundary on the number
of speakers so that systems are ensured to be able to reach the optimum point. Although an
optimal speaker diarization system using hierarchical agglomerative clustering should be able to
start at a very high number of clusters and work its way down, in reality the correct estimation
of an appropriate upper limit for the number of clusters makes a difference in the resulting
performance. This is explained in more detail in section 4.2.2.
The average number of speakers and their minimum and maximum values are represented
for the three datasets in table 3.11. One can observe how in general the broadcast news shows
contain a large number of speakers (averaging 19), although among the shows considered there was
one case (CSPAN) with 4 speakers. This creates a very large variation (or standard deviation)
between the values. The system processing the broadcast news data needs to ensure a good
performance both when many speakers are present (each with a smaller speaking time) and when
fewer are present. Without any automatic detection of the initial number of speakers, the system
starts at 40 clusters.
In the case of meetings, the lecture room data contains many recordings where only the
lecturer speaks, and others with several people, going up to a maximum number of four speakers.
The standard deviation is therefore smaller compared to broadcast news. In the conference room
data the number of speakers ranges between 4 and 9, with an average of 5.25 speakers present.
Without automatic detection the systems start at 10 or 16 clusters for conference meetings and at
5 or 10 for lecture room data.
Domain                # excerpts   ave. spkrs   min. spkrs   max. spkrs   std spkrs
Meetings conference   8            5.25         4            9            2.05
Meetings lecture      28           3            1            5            1.49
Broadcast news        12           19           4            29           8.35

Table 3.11: Number of speakers per excerpt in the three analyzed datasets
Given the HMM acoustic modeling presented in the previous section, a minimum duration is
applied to the segment length when performing the acoustic decoding of the data using the
Viterbi algorithm. Such segments constitute the speaker turns, and it is therefore important to
analyze how long these are on average in the different subdomains, in order to adapt (if necessary)
this parameter of the system to allow for smaller/larger turns.
Table 3.12 gives the average, maximum and standard deviation of the speaker turn duration
both for the conference room meetings data and for the lecture room data. In the case of the
conference room data, both the manual transcriptions and the forced-alignments are analyzed.
When analyzing this property, two speaker turns from the same speaker but separated by a
silence are considered different turns, and their durations are counted separately.
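As an illustration of how such turn durations can be collected from a reference segmentation given as (start, end, speaker) tuples, a minimal sketch (merging only contiguous same-speaker segments) could be:

    def speaker_turn_durations(segments, gap=0.0):
        """Turn durations from (start, end, speaker) reference segments, processed in
        start-time order. Consecutive segments of the same speaker are merged into one
        turn only when they are contiguous (separated by at most `gap` seconds); a
        silence or another speaker in between starts a new turn."""
        durations = []
        cur_spk, cur_start, cur_end = None, None, None
        for start, end, spk in sorted(segments):
            if spk == cur_spk and start - cur_end <= gap:
                cur_end = max(cur_end, end)          # extend the current turn
            else:
                if cur_spk is not None:
                    durations.append(cur_end - cur_start)
                cur_spk, cur_start, cur_end = spk, start, end
        if cur_spk is not None:
            durations.append(cur_end - cur_start)
        return durations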
It is clear that in the lecture room data the speaker turns are on average of much greater
length, given that a lecturer often speaks for elongated amounts of time. In some cases,
though, it was seen that the transcriptions contained errors where small silence segments that needed
to be transcribed as such were instead included within the adjacent speaker segment. The
maximum speaker turn length for these is approximately 1 minute 45 seconds.
In the conference room data there is a difference between the two transcription sources.
This is mainly due to discrepancies in the transcription of small silences. According to the
NIST rules for the evaluations, any silence segment longer than 0.3 seconds needs to
be considered as such. This can be implemented efficiently in the forced-alignment transcriptions
but it is more difficult for the human transcribers to follow, leading to longer segments being
annotated.
To further illustrate the distribution of the speaker turn durations, Figure 3.3 shows the
histogram of all three analyzed cases for the first 10 s of duration, with a
resolution of 0.1 seconds. It can be observed that both transcriptions for the meetings conference
room recordings have a similar shape, with a sharp peak around 0.3 s, while the broadcast news
shows present a broader shape. Such a small peak duration in the conference data is explained
by the fact that overlap segments were also used when computing the speaker turn durations. These
segments occur when two or more people are speaking at the same time and can simply correspond to
people uttering affirmative/negative responses or short sentences.
[Figure 3.3: Histogram of speaker turn durations for BN-RT04f, Meetings-RT06s (hand alignments) and Meetings-RT06s-FA (forced alignments).]
Overall, the speaker turn length in meetings is much smaller than the average in broadcast
news, which is modeled by the system with a minimum duration of 3 seconds. Such a duration
would be enough if the meetings system were evaluated using the hand-alignments, but it needs to
be reduced when evaluating with forced-alignments.
Given the analysis performed in the previous subsection, it was found interesting to look at the
overlap regions in more detail. Overlap is found less frequently in the broadcast news data,
and it only started being evaluated with the start of the meetings domain evaluations. Nowadays
overlap is considered an important feature of the meeting data and is therefore included in the
main metric of the NIST RT evaluations. An analysis of the overlap was performed for both
forced-alignments and hand-alignments in the conference room data, and is shown in table
3.13. In it, the average, maximum and standard deviation of the segment length are computed for
the overlap regions alone and for the regions without any overlap.
Table 3.13: Overlap analysis between hand and forced alignments in RT06s conference room
meetings

From the average duration of the overlap regions one can see how much difference in average
length there is between the two transcription sources. The hand-alignments are on average double the
length of the forced-alignments, probably due to the difference in how the transcriptions
are created. A human transcriber, upon listening to an overlap region, might have labelled it
coarsely, allowing for a few extra milliseconds on either side. The forced-alignments are based on
the uttered words, which are tightly aligned by the ASR system. One drawback of the forced-
alignments in overlap regions appears when the transcribers that wrote down the words missed
words or sounds existing in the overlaps, in which case the transcription is not aligned correctly.
Finally, on the overlap results, note that the values for the hand-alignments have a much bigger
standard deviation than the automatically generated ones.
Regarding the analysis of the regions without any overlap, the same observation
as in the previous subsection can be made. The average length of the speaker turns is bigger in
the hand-alignments, probably due to the consistent omission of small silence regions, which is also
shown by the values of the maximum segment lengths.
To further analyze the duration of the overlaps, figure 3.4 shows the histograms of the lengths
of the overlap segments in both the forced-alignments and the hand-alignments. As hinted by
the averages, the peak of the forced-alignment overlaps falls around 0.5 seconds, while the peak
of the hand-alignments is around 1 second and presents a broader range of larger values than the
forced-alignments.
[Figure 3.4: Histogram of overlap segment durations for Meetings-RT06s (hand alignments) and Meetings-RT06s-FA (forced alignments).]
From a more theoretical point of view there are many other differences between meetings and
broadcast news that need the system builder's attention when converting a system from one
domain to the other. In table 3.14 some of these differences are pointed out (some of them already
studied in this section) and in some cases a proposed solution, as described in this thesis, is given.
Table 3.14: Main differences between Meetings and Broadcast News recordings
The proposed speaker diarization system for meetings can be broken up into three main
blocks, as shown in figure 3.5: the single channel signal-to-noise ratio (SNR) enhance-
ment block, the multi-channel acoustic fusion block, and the segmentation and clustering system, which
itself can be broken up into the speech/non-speech detection and the single-channel speaker
diarization system.
[Figure: channels 1..N each pass through a Wiener filter; the filtered channels are combined by delay&sum processing, followed by speech/non-speech detection and the single-channel diarization system, which produces the diarization output.]
Figure 3.5: Main blocks involved in the meetings speaker diarization system
When only one channel is to be processed, the acoustic fusion block is bypassed and the
individual, SNR enhanced, signal is processed directly by the segmentation and clustering blocks.
The advantage of this architecture is obvious: there is no need to design a different system or
different techniques depending on the number of microphones available, as both the acoustic fusion
and the segmentation and clustering blocks accept an acoustic channel as input and can simply be
turned on/off depending on the characteristics of the data. The following sections describe each
of the blocks in more detail.
Both the individual channel enhancement block and the acoustic fusion block aim at obtaining
a signal with a better quality than the original in order to improve the performance of the
diarization system.
The individual channels are first Wiener-filtered (Wiener 1949) to improve the
SNR, with the same algorithm as in the ICSI-SRI-UW Meetings recognition system (Mirghafori,
Stolcke, Wooters, Pirinen, Bulyko, Gelbart, Graciarena, Otterson, Peskin and Ostendorf 2004),
which uses a noise reduction algorithm developed for the Aurora 2 front-end, proposed origi-
nally in Adami, Burget, Dupont, Garudadri, Grezl, Hermansky, Jain, Kajarekar, Morgan and
Sivadas (2002). The algorithm performs Wiener filtering with typical engineering modifications,
such as a noise over-estimation factor, smoothing of the filter response, and a spectral flooring.
The original algorithm was modified to use a single noise spectral estimate for each meeting
waveform. This was calculated over all the frames judged to be non-speech by the voice-activity
detection component within the module. As observed in Figure 3.5, the algorithm is indepen-
dently applied to each meeting channel and uses overlap-add resynthesis to create noise-reduced
output waveforms, which then are either fed into the acoustic fusion block (multi-channel) or
directly into the segmentation and clustering block (single-channel).
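As a rough illustration of this kind of per-channel processing (and not of the exact Aurora 2 front-end), the following numpy sketch applies a Wiener-style gain with a single noise estimate taken from the non-speech frames, a noise over-estimation factor, a spectral floor and overlap-add resynthesis; the frame size, factors and smoothing constant are illustrative values.

```python
import numpy as np

def wiener_denoise(signal, nonspeech_mask, frame=512, hop=256,
                   overestimate=1.3, floor=0.05):
    """Single-channel Wiener-style noise reduction with one noise spectral
    estimate per waveform, noise over-estimation, spectral flooring and
    overlap-add resynthesis. `nonspeech_mask` is a per-frame boolean array
    flagging the frames judged to be non-speech."""
    win = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    spec = np.stack([np.fft.rfft(win * signal[i * hop:i * hop + frame])
                     for i in range(n_frames)])
    power = np.abs(spec) ** 2

    # Single noise power estimate from all non-speech frames of the recording,
    # inflated by the over-estimation factor
    noise = overestimate * power[nonspeech_mask[:n_frames]].mean(axis=0)

    # Wiener gain with a spectral floor, lightly smoothed over time
    gain = np.maximum(1.0 - noise / np.maximum(power, 1e-12), floor)
    for t in range(1, n_frames):
        gain[t] = 0.8 * gain[t] + 0.2 * gain[t - 1]

    # Overlap-add resynthesis of the noise-reduced waveform
    out = np.zeros(len(signal))
    for t in range(n_frames):
        out[t * hop:t * hop + frame] += win * np.fft.irfft(gain[t] * spec[t], n=frame)
    return out
```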
The acoustic fusion module makes use of standard beamforming techniques in order to obtain
an “enhanced” version of the signal as a combination of the multiple channel input signals. It
considers the multiple channels to form a microphone array. Neither the microphone positions
nor their acoustic properties are known. Given these constraints, a variation of the simple (yet
effective) delay&sum beamforming technique is applied as it does not require any information
from the microphones in order to operate. As the different microphones have different acoustic directivity patterns and are located in places in the room with different noise levels, a dynamic weighting of the individual channels and a triangular filtering are used to reduce their negative effects. Given this per-channel filtering, the system will be referred to as filter&sum from now on.
The filter&sum beamforming technique involves estimating the relative time delay of arrival
(TDOA) of the acoustic signal with respect to a reference channel. The GCC-PHAT (Generalized
Cross Correlation with Phase Transform) is used to find the potential relative delays regarding
each of the speakers in the meeting. In order to prevent impulsive noise, short-term events and overlapped speech from tainting the TDOA estimates, multiple TDOA values are computed at each time instant and a two-step post-processing algorithm selects the most appropriate value. On one hand, noise is detected by measuring the quality of the computed cross-correlation values at each point with respect to the rest of the meeting, and the computed TDOA values are replaced by the previous (more reliable) values when that quality is considered too low. On the other hand, impulsive events and overlap are dealt with by a two-step Viterbi decoding of the delays in order to obtain the set of TDOA values that is both reliable and stable. A more in-depth explanation of these and other steps involved in the acoustic fusion block is given in chapter 5.
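The core of the TDOA estimation can be sketched as follows; this is a minimal GCC-PHAT computation between one channel and the reference that returns several candidate delays per analysis window, leaving out the noise thresholding and the double Viterbi selection described above (detailed in chapter 5).

```python
import numpy as np

def gcc_phat(ref_seg, chan_seg, max_delay, n_best=4):
    """Return the n_best candidate delays (in samples) between a reference-channel
    segment and another channel's segment using GCC-PHAT."""
    n = len(ref_seg) + len(chan_seg)
    R = np.fft.rfft(ref_seg, n) * np.conj(np.fft.rfft(chan_seg, n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)      # phase transform weighting
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    order = np.argsort(cc)[::-1][:n_best]              # highest correlation peaks
    return order - max_delay, cc[order]                # candidate TDOAs and their scores
```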
Apart from using the post-processed estimated delays for the filter&sum beamforming, they
are also used in the segmentation and clustering block as they can convey information about the
speaker through his/her location in the room. Such information is orthogonal to the acoustic
information and therefore adds useful information to the diarization system. In section 5.3 the
combination of both features is presented in detail.
The speaker segmentation and clustering block of the overall speaker diarization system contains two main parts: the speech/non-speech detector and the single-channel speaker diarization system. The speech/non-speech detector is different from the one used for broadcast news and does not require any training data for its acoustic models. It is a hybrid energy-based/model-based system built on the observation that most of the non-speech that can harm diarization in a meeting is silence. It is further described in section 4.1.
The single-channel speaker diarization module has evolved from the broadcast news system by adding new algorithms and proposing improvements to existing ones. Figure 3.6 shows a block diagram of the diarization process, with the newly proposed algorithms and changes to the baseline system in darker boxes. There are also other improvements in various steps of the algorithms that are not reflected in the figure: the modified duration modeling within the models and the model initialization algorithm. In the following sections each of these new modules and algorithm improvements is described in detail. For those not changing from the baseline, refer to section 3.1 for complete details.
As mentioned earlier, the meetings speaker diarization system makes use of the TDOA values (when available) as an independent feature stream. These features are (N−1)-dimensional vectors, where N is the number of available channels, computed at the same rate as the MFCC parameters for synchronous operation. They are used without any further processing, apart from converting them to HTK format so that they can be properly read by the system.
The acoustic features continue to be Mel Frequency Cepstrum Coefficients (MFCC), but with an analysis window of 30ms (instead of 60ms) computed every 10ms (instead of 20ms). The increase in computation due to having twice the number of feature vectors is justified by an increase in performance.
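For illustration, such a front-end could be configured as sketched below; only the 30 ms window and 10 ms step come from the text, while the sampling rate, the number of cepstral coefficients and the use of the librosa toolkit are assumptions.

```python
import librosa

# Hypothetical meeting waveform; sampling rate and number of coefficients are assumptions.
y, sr = librosa.load("meeting.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                            n_fft=int(0.030 * sr),       # 30 ms analysis window
                            hop_length=int(0.010 * sr))  # 10 ms step
print(mfcc.shape)  # (n_mfcc, number of 10 ms frames)
```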
In order to initialize the hierarchical bottom-up agglomerative clustering, one first needs to define an initial number of clusters Kinit, larger than the optimum number of clusters Kopt. The system defined for broadcast news used Kinit = 40 clusters, a value chosen empirically on development data. It was found that, even though the optimum number of clusters in a recording is independent of the recording's length, when selecting an initial number of clusters for the agglomerative system the total length of the available data has to be considered so that the clusters can be well trained and best represent the speakers. Keeping Kinit constant for any kind of data makes some recordings perform worse, since the initial models contain either too much or too little acoustic data. In the system presented here for meetings, this initial number is made dependent on the amount of data remaining after the speech/non-speech detection. A new parameter called the Cluster Complexity Ratio (CCR) represents the relationship between the amount of data and the cluster complexity. The algorithm is described in detail in section 4.2.2.
Figure 3.6: Single-channel speaker diarization for meetings block diagram (system initialization, iterative closest-pair merging, segment-level purification, local BIC stopping-criterion assessment and a final resegmentation with different constraints producing the diarization output)
The same CCR parameter is also used throughout the agglomerative clustering process to determine the complexity (number of Gaussian mixtures) of the speaker models. Such a mechanism ensures that all models keep a complexity proportional to the amount of data they are trained with, and therefore remain comparable to each other. This is further explained in section 4.2.2.
Given the data assigned to each cluster, the technique used in the baseline system to obtain an initial GMM of a certain complexity has been replaced by a new one that produces better initialized models. Experiments showed that the initial models play an important role in the overall performance of the system, since the initial position of the mixtures is an important factor in how well the model can be trained using EM-ML and therefore how representative it will be of the data. This is particularly crucial in speaker diarization, where small models (initially 5 Gaussians) are used due to the scarce training data.
The broadcast news system uses a method that resembles the HCompV routine in the HTK
toolkit (Young, Kershaw, Odell, Ollason, Valtchev and Woodland 2005) for initialization without
a reference transcription. Given a set of acoustic vectors X = {x[1] . . . x[N ]} and a desired GMM
with complexity M Gaussians, the first Gaussian is computed via the sufficient statistics of the
data X as
$$\mu_1 = \frac{1}{N}\sum_{i=1}^{N} x[i]$$
$$\sigma_1^2 = \frac{1}{M}\left(\frac{1}{N}\sum_{i=1}^{N} x^2[i] - \mu_1^2\right)$$
For the rest of the Gaussian mixtures, equidistant points in X are chosen as means and the
same variance as in Gaussian 1 is used:
$$\mu_i = x\left[i\cdot\frac{N}{M}\right], \qquad \sigma_i^2 = \sigma_1^2$$
with Gaussian weights kept equal for all mixtures, $W_i = \frac{1}{M}$.
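In code form, this baseline initialization amounts to the following sketch (numpy, diagonal-covariance features assumed):

```python
import numpy as np

def baseline_gmm_init(X, M):
    """HCompV-like initialization described above: a global Gaussian plus
    equidistant means with shared variance and uniform weights.
    X has shape (N, dim); returns (means, variances, weights)."""
    N = len(X)
    mu1 = X.mean(axis=0)
    var1 = (np.mean(X ** 2, axis=0) - mu1 ** 2) / M            # sigma_1^2
    means = np.stack([X[min(i * N // M, N - 1)] for i in range(M)])
    means[0] = mu1                                             # first Gaussian
    variances = np.tile(var1, (M, 1))                          # sigma_i^2 = sigma_1^2
    weights = np.full(M, 1.0 / M)                              # W_i = 1/M
    return means, variances, weights
```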
This method has two obvious drawbacks. On one hand, as pointed out above, this technique
does not consider a global ML approach and therefore Gaussian mixtures can easily end up in
local maxima. On the other hand, it does not ensure that the whole acoustic space of the data is covered by the positioned Gaussians.
Figure 3.7: Model initialization via iterative Gaussian splitting: step 1 computes a single Gaussian (µ1, σ1) from the training data X; step 2 repeatedly splits all Gaussians, followed by an EM pass, while 2M′new < M; step 3 splits one Gaussian at a time, followed by EM, until the desired complexity M is reached.
The introduced technique is inspired by the split and vanish techniques used in the GMTK toolkit (Bilmes and Zweig 2002) and by the mixture incrementing function in HTK. As seen in figure 3.7, the initial mean and variance of the data X are computed in the same way as in the previous technique (step 1). Then the algorithm iteratively splits each of the M′prev Gaussian mixtures into two, obtaining a total of M′new mixtures, while 2M′new < M, the desired model complexity (step 2). The M′new Gaussian mixtures are computed from their previous counterparts by
$$\sigma^2_{new1} = \sigma^2_{new2} = \sigma^2_{prev}, \qquad W_{new1} = W_{new2} = \frac{W_{prev}}{2}$$
After each split, a single step EM training of the current models given data X is performed
to allow for the Gaussian mixtures to adapt mean and variance to the data.
Once an extra splitting iteration would exceed the desired number of Gaussian mixtures, the algorithm moves into a single-Gaussian split mode (step 3), in which the Gaussian selected for splitting is the one with the highest weight, split in the same way as shown before. Some experiments were performed with alternative splitting/vanishing procedures, but when initializing GMM models with a small number of Gaussian mixtures the performance was seen to diminish whenever vanishing was applied; therefore the technique applied here only uses splitting. Also, the defunct function implemented in HTK to discard Gaussians with low weight was seen to be detrimental to the GMM models grown here.
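A sketch of this splitting procedure is shown below. The mean perturbation applied when splitting a Gaussian is an assumption (the text only specifies the variance and weight updates), and sklearn's GaussianMixture stands in for the single EM pass after each split.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grow_gmm(X, M, perturb=0.2):
    """Grow a diagonal-covariance GMM on data X (shape (N, dim)) up to M mixtures
    by iterative binary splits followed by one-at-a-time splits, with one EM pass
    after each splitting round."""
    # Step 1: single Gaussian from the global statistics of X
    means = X.mean(axis=0, keepdims=True)
    variances = X.var(axis=0, keepdims=True)
    weights = np.array([1.0])

    def split(idx):
        nonlocal means, variances, weights
        offset = perturb * np.sqrt(variances[idx])        # assumed mean perturbation
        new_means = np.vstack([means, means[idx] + offset])
        new_means[idx] = means[idx] - offset
        means = new_means
        variances = np.vstack([variances, variances[idx]])  # sigma_new1 = sigma_new2 = sigma_prev
        weights = np.append(weights, weights[idx] / 2.0)    # W_new1 = W_new2 = W_prev / 2
        weights[idx] /= 2.0

    def em_pass():
        nonlocal means, variances, weights
        gmm = GaussianMixture(n_components=len(weights), covariance_type='diag',
                              weights_init=weights, means_init=means,
                              precisions_init=1.0 / variances, max_iter=1)
        gmm.fit(X)
        means, variances, weights = gmm.means_, gmm.covariances_, gmm.weights_

    # Step 2: split every Gaussian while the doubled size does not exceed M
    while 2 * len(weights) <= M:
        for idx in range(len(weights)):
            split(idx)
        em_pass()
    # Step 3: split the heaviest Gaussian one at a time until reaching M
    while len(weights) < M:
        split(int(np.argmax(weights)))
        em_pass()
    return means, variances, weights
```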
Once the number of initial clusters Kinit is defined, the broadcast news system, as explained earlier, initializes the speaker clusters by evenly assigning the available data to the different clusters and performing several segmentation-training iterations to allow homogeneous data to cluster together. While this mechanism is very simple and gives surprisingly good results, it does not ensure that the final clusters contain data from only one speaker (i.e. that they have a high purity).
In order to improve on the linear initialization technique, several alternative methods were tested, including K-means at the segment level, E-HMM top-down clustering (Meignier et al. 2001) and others, finally leading to the design of a brand new algorithm called the friends-and-enemies initialization, which is further explained in section 4.2.1.
In order to train the speaker models used throughout the processing, the broadcast news system used a standard EM-ML algorithm, performing five EM-ML iterations regardless of the data or the models being trained. The use of EM on small training datasets has two potential problems. On one hand, the models can overfit the available data, becoming not general enough to represent the speaker at hand. On the other hand, there is no guarantee that the models will converge to the parameters that maximize the likelihood of the data given the model. The number of EM iterations, k = 5, is a parameter that needs to be defined for the system in order to avoid overfitting while still allowing the models to be trained properly on the data. It was seen that modifying the value of k would considerably alter the final performance, and it was therefore desirable to find a more robust algorithm.
For these reasons a new training algorithm has been implemented: the cross-validation EM training algorithm (CV-EM for short), recently proposed by T. Shinozaki in Shinozaki and Ostendorf (2007). It introduces a cross-validation technique, commonly used in decision tree design, into the iterative process of EM, addressing the problems of overfitting and potential local maxima.
Figure 3.8 shows the CV-EM procedure. The system starts from an initial single model to
be trained and finishes also with a single model. On the initial E-step of the EM processing
the training data is split into N partitions as homogeneously as possible (in the implementation
each consecutive frame is assigned to a different partition sequentially until all frames have been
assigned). Then the conditional probability of each frame to each Gaussian mixture in the initial
model is computed. This process is identical to the initial E-step in a similar technique called
parallel EM training (Young et al. 2005).
In the following M-step, each model Mi is reestimated using the sufficient statistics computed
for all partitions except for SSi , which is kept as cross-validation data. This differs from the
parallel EM technique, which collapses all the statistics into creating a single model, losing
the cross-validation properties. In the CV-EM algorithm, once all the N models have been
approximated, new conditional probabilities are computed for the frames in each partition SSi
using model Mi . As data in partition SSi has not been involved in the reestimation of the
parameters in Mi , the accumulated likelihood from all partitions can be used as a cross-validation
to check for convergence, avoiding the possible overfitting to the data. In the implementation a
∆Linc = 0.1% likelihood increase criterion is used.
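The following numpy sketch illustrates the CV-EM idea rather than the exact ICSI implementation: frames are dealt round-robin into partitions, each partition is re-scored with a model estimated from the statistics of all the other partitions, and the accumulated cross-validation likelihood drives a relative-increase stopping criterion.

```python
import numpy as np

def cv_em(X, means, variances, weights, n_parts=10, rel_tol=1e-3, max_iter=50):
    """CV-EM sketch for a diagonal-covariance GMM. X has shape (N, dim);
    means/variances/weights describe the initial model to be trained."""
    part = np.arange(len(X)) % n_parts          # round-robin frame partitioning
    parts = [X[part == i] for i in range(n_parts)]

    def e_step(Xi, mu, var, w):
        # responsibilities (as sufficient statistics) and log-likelihood of Xi
        logp = (-0.5 * (((Xi[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        norm = np.logaddexp.reduce(logp, axis=1, keepdims=True)
        g = np.exp(logp - norm)
        return (g.sum(0), g.T @ Xi, g.T @ (Xi ** 2)), norm.sum()

    def m_step(stat_list):
        n = sum(s[0] for s in stat_list) + 1e-10
        sx = sum(s[1] for s in stat_list)
        sxx = sum(s[2] for s in stat_list)
        mu = sx / n[:, None]
        var = np.maximum(sxx / n[:, None] - mu ** 2, 1e-6)
        return mu, var, n / n.sum()

    # Initial E-step: sufficient statistics accumulated separately per partition
    stats = [e_step(Xi, means, variances, weights)[0] for Xi in parts]
    prev_ll = -np.inf
    for _ in range(max_iter):
        total_ll, new_stats = 0.0, []
        for i in range(n_parts):
            # Model i uses the statistics of every partition except SS_i ...
            mu_i, var_i, w_i = m_step([s for j, s in enumerate(stats) if j != i])
            # ... and partition i is re-scored with it (cross-validation E-step)
            stat_i, ll_i = e_step(parts[i], mu_i, var_i, w_i)
            new_stats.append(stat_i)
            total_ll += ll_i
        # Relative likelihood-increase stopping criterion (0.1% by default)
        if prev_ll > -np.inf and total_ll - prev_ll < rel_tol * abs(prev_ll):
            break
        stats, prev_ll = new_stats, total_ll
    # Single output model derived from the pooled statistics of all partitions
    return m_step(stats)
```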
Shinozaki and Ostendorf (2007) propose a fixed five-iteration schedule when training models for speech recognition, whereas in speaker diarization a relative likelihood-increase stopping criterion is preferred in order to bound the likelihood variation between iterations.
Given two clusters A and B, with data XA and XB and their respective models MA and MB, let us denote the variation in likelihood between two EM iterations when training such models as ∆L(XA|MA) and ∆L(XB|MB). Within the diarization system the modified ∆BIC metric is used to determine whether both clusters belong to the same speaker or not: in the usual course of the algorithm, the resulting ∆BIC value is compared to a threshold of 0. If each of the models is trained for one extra EM iteration, and using the notation just introduced, one can express the resulting ∆BIC′ in terms of the ∆BIC computed in equation 3.3.
For the system to be robust and its results consistent, it is desired that ∆BIC′(A, B) = ∆BIC(A, B), which requires the likelihood variation terms to cancel out. While it is not possible to control the exact likelihood variations between iterations, using a minimum relative likelihood variation as the stopping criterion for the CV-EM training keeps these terms upper bounded and makes the ∆BIC more stable. Furthermore, forcing these variations to be small results in ∆BIC(A, B) ≃ ∆BIC′(A, B), as desired.
According to Shinozaki and Ostendorf (2007), since the N cross-validation models are reestimated from different subsets of the data, the Gaussian mixtures could in principle behave differently on each subset and yield totally different parallel models, in which case the CV-EM algorithm would not be usable. In reality the difference in the number of samples between any two models is only 1/(N−1), which becomes very small when N is large, and therefore prevents this divergence from happening.
Once the CV-EM stopping criterion is reached, the current sufficient statistics computed for each of the subsets are used to derive a single output model. The increase in computation for this parallel training technique is small, as only the M-step requires more operations. When the training set is large, the most costly part of the EM algorithm is the E-step, which takes the same time in the CV-EM algorithm as in standard EM.
In order to avoid quick changes in the speaker turns in both the baseline and the current
system, a minimum duration of 3 seconds is imposed when performing Viterbi segmentation
of the data. This is imposed in the speaker model by using multiple consecutive states with
transition probability 1 between them, and tied Gaussian mixture models, as seen in figure 3.2.
However, it was observed that the maximum speaker turn duration is artificially constrained by the α and β parameters in figure 3.2. As explained in detail in section 4.2.3, these were changed to α = 1 and β = 1 to let the maximum duration be decided solely by the acoustics. This is an important change given that conference room data is very different, in terms of average speaker turn length, from both broadcast news and lecture room data.
As mentioned earlier, when processing multiple microphones the system creates a feature stream, independent of the acoustic stream, composed of the TDOA values between microphones. As explained in section 5.3, each feature stream is represented by different models, and the total likelihood of the data at any instant is obtained as the weighted sum of the log-likelihoods of the respective feature vectors given their models. The resulting log-likelihood affects the decisions made in the Viterbi segmentation module and in the ∆BIC computation between two clusters, which otherwise are identical to the broadcast news system.
In order for the different independent feature streams to be combined at the log-likelihood
level a relative weight has to be assigned for each one depending on their reliability to contribute
to the diarization. Although an initial weight is set for all meetings using development data,
each particular meeting will respond differently to the use of the TDOA values and therefore an
automatic system of reestimating these initial weights is desirable. An effective way was found
using a metric derived from the ∆BIC values computed between all pairs for all feature streams.
It is described in section 5.3.2.
When computing the ∆BIC metric between two clusters it was observed that small amounts of non-speech data negatively affect the speaker models and can therefore cause errors when deciding whether to merge them or not. A new technique called frame-based cluster purification is introduced, which modifies the cluster models for the ∆BIC comparison step in order to obtain more discriminant models. It is explained in detail in section 4.3.1.
It has also been observed that some clusters contain speaker segments from more than one speaker. The models associated with these clusters are able to model both speakers reasonably well and therefore cause problems when compared with other clusters containing either of those speakers, potentially leading to a serious decrease in performance due to erroneous cluster merges. For this reason, during the initial iterations of the segmentation and clustering algorithm, a segment-level cluster purification algorithm detects speaker segments that are very dissimilar to the cluster they belong to and assigns them to a new cluster. A further description of the algorithm is given in section 4.3.2.
In order to save computation when evaluating the ∆BIC metric among all possible pairs, a pruning algorithm was implemented for broadcast news that would not recompute the ∆BIC for those pairs that had previously obtained a negative value. For the meetings system this was revisited, and it was observed on development data that, especially during the initial iterations of the algorithm, the ∆BIC metric would oscillate between small positive and negative values for some clusters until their assigned data finally stabilized. With such a restrictive pruning, the system does not allow these clusters to eventually merge, even though they might belong to the same speaker.
For this reason the pruning algorithm was relaxed to eliminate a cluster pair from further comparisons only if its ∆BIC value falls below a certain threshold (< 0), which is much safer to use.
Such threshold was set to -100 as it was seen that ∆BIC values below this threshold would
always remain negative throughout the process and therefore there is no chance of eliminating
any potential merge pair.
As in the broadcast news system, the cluster pair with the largest ∆BIC value is merged into a single cluster, whose model is the union of both merged models. If no cluster pair obtains a positive ∆BIC value, no merging takes place and the system prepares to finish, as the stopping criterion has been met.
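A sketch of the resulting pair selection with the relaxed pruning is given below; delta_bic stands for the modified-BIC comparison between two cluster models and is assumed to be provided elsewhere.

```python
PRUNE_THRESHOLD = -100.0   # pairs falling below this value are never reconsidered

def select_merge_pair(clusters, delta_bic, pruned):
    """Return the cluster pair with the largest positive ∆BIC, or None when no
    pair is positive (the stopping criterion). `delta_bic(a, b)` is assumed to
    implement the modified-BIC comparison; `pruned` is a set of discarded pairs."""
    best_pair, best_score = None, 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if (i, j) in pruned:
                continue
            score = delta_bic(clusters[i], clusters[j])
            if score < PRUNE_THRESHOLD:
                pruned.add((i, j))      # safe to drop: such pairs stay negative
            elif score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair
```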
In the Meetings system only the local ∆BIC stopping criterion is used as tests using a likeli-
hood criterion (in the same way as in the broadcast news system) resulted in worse performance.
When the system’s stopping point is reached, the algorithm does a final post-processing in
order to output a final clustering. During the iterative merging process the minimum speaker
turn duration is set to be 3 seconds. This is necessary when there are many clusters, each
containing small amounts of data, as the corresponding models can fluctuate a lot and not
model the speaker appropriately.
Once the system determines to stop merging, the optimum amount of clusters has been
reached and the models are expected to contain enough data to model each speaker appropriately.
At this point, a single Viterbi segmentation iteration is performed where the minimum duration
is set to 1.5 seconds in order to allow the output segmentation to contain smaller speaker turns,
given that in meetings the average speaker turn duration is smaller than in broadcast news, as
seen in section 3.2.
Chapter 4
Acoustic Modeling Algorithms for Speaker Diarization in Meetings
This chapter covers the main contributions of this thesis in the area of acoustic modeling for speaker diarization in the meeting domain. As pointed out earlier, these algorithms were either defined to improve an existing algorithm in the baseline system or newly created to solve problems detected in the system.
This chapter is structured into three main sections. The first section introduces a new
speech/non-speech detector that does not require any training data while achieving similar
performance to the prior pre-trained system on non-speech detection, and better diarization
performance.
The second section covers four algorithms used in the definition of the speaker clusters and
the related models. The first algorithm automatically defines a number of initial clusters for
the agglomerative clustering to start with. The second algorithm obtains an initial clustering by
classifying the acoustic data into the desired number of initial clusters. On the topic of speaker
modeling, the third algorithm is used to determine the complexity of each model in the system
given the amount of data available for training. Finally, a modification to the baseline duration modeling is proposed to avoid the artificial constraints previously imposed on the speaker turn duration.
The third section explores the problems derived from clusters containing data from more than a single speaker. When comparing two speaker models, an erroneous decision can be made depending on the amount of such misplaced data. This section presents two algorithms to purify the clusters and avoid such problems. On one hand, the frame-level purification modifies the speaker models only in the comparison step by filtering out acoustic frames that might harm the comparison. On the other hand, the segment-level purification detects full segments that do not match the cluster they belong to and assigns them to a new cluster.
In the broadcast news domain it was shown in Wooters et al. (2004) that the speaker diarization performance can be improved by using a speech/non-speech detector as a first step before the agglomerative clustering process. The speech/non-speech system used in that work was based on acoustic models that needed to be trained on data as similar as possible to the test data. This poses a robustness problem when one intends to use the diarization system on "unseen" data, and slows down the portability of the system to new environments, where new training data needs to be labelled or located and new speech/non-speech models need to be trained. For this reason an alternative was sought that would not require any training.
Among the systems that do not use acoustic models for speech/non-speech detection, the most widely used ones (when non-speech is considered to be mainly of a silence/noise nature) always include energy as a feature. The performance of such systems depends on setting appropriate thresholds, which are typically tuned on development data. Tests done with the energy-based decoder that is part of the hybrid system presented below showed that the optimum threshold depends on the meeting acoustics and would therefore have to be retuned whenever data from a different source needs to be processed, falling into the same trap as the model-based detectors.
A novel system to perform speech/non-speech detection, and its application to speaker diarization in the meetings environment, is presented in this thesis. The system takes advantage of the fact that most non-speech in meetings is silence. It first performs an energy-based detection of the silence portions in the input data using an energy derivative filter based on Li, Zheng, Tsai and Zhou (2002). This stage only needs a coarse setting of a threshold, which is then iteratively modified until a reasonable number of silence segments is hypothesized.
The second stage of the system models speech and silence with GMMs trained on the output of the first stage, and creates the final speech/non-speech segmentation to be used in the diarization system. By running this two-stage system, the use of any external training data to obtain an initial set of acoustic models is avoided. Starting from these initial models, several iterations alternating between segmenting the data and retraining the models are performed to obtain the final speech/non-speech segmentation.
The introduced hybrid system therefore attempts to solve some of the problems of both model-based and energy-based speech/non-speech detectors. On one hand, the need to accurately tune the energy threshold is avoided by iteratively searching for a rough speech/non-speech segmentation with which to initialize the model-based decoder. On the other hand, by using such an initialization for the model-based decoder, there is no need to train its models on pre-labelled data, resulting in a system that is free of the need for training data.
In the following sections both the energy-based decoder and the model-based decoder used
in the hybrid system are described. Finally, the combination of both systems into the hybrid
decoder is explained.
(Block diagram of the hybrid speech/non-speech detector: Wiener filtering, energy-based detector, silence assessment, model-based decoder and output to the speaker diarization system.)
The first stage of the process consists of an energy-based speech/non-speech detector, which can be divided into three major blocks as seen in figure 4.1. Each of these blocks is explained below. First of all, the data is preprocessed using common engineering techniques with the purpose of increasing the quality of the speech signal. Then a derivative filter is applied over the energy signal. Finally, a thresholding method together with a minimum duration enforcement via a Finite State Machine (FSM) is used to detect silences. This work was initiated by M. Aguilo while visiting ICSI and was assembled into the current system by the author. For a deeper explanation of each individual module refer to Aguilo's master's thesis (Aguilo 2005).
Data Preprocessing
Due to the different sources and recording setups, the average amplitude of the signal to be processed can vary over a large range. It therefore needs to be normalized in order to bring consistency to the follow-on processing. A standard energy average over the whole recording would not be adequate due to the existence of extended silence regions and of sudden noise bursts. To compute the normalization constant, the statistic µ shown in equation 4.1 was chosen, as it is more robust to these effects. The same expression is used in the filter&sum processing to obtain the overall channel weighting factor of the input signals (section 5.2.2).
$$\mu = \frac{1}{P}\sum_{p=1}^{P} \max\big(s[p \cdot T F_s], \cdots, s[(p+1) \cdot T F_s]\big) \qquad (4.1)$$
where P is the total number of non-overlapped blocks of duration T·Fs samples (with Fs the sampling rate in samples/second and T the analysis segment size in seconds) in the recording. Each block of samples ranges from p · T·Fs to (p + 1) · T·Fs.
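A direct transcription of equation 4.1 could look as follows; the block length T is a placeholder value, not one taken from the thesis.

```python
import numpy as np

def normalization_constant(s, fs, T=1.0):
    """Equation 4.1: average of the per-block maxima of the signal over
    non-overlapping blocks of T seconds (T = 1.0 s is a placeholder value)."""
    block = int(T * fs)
    P = len(s) // block
    maxima = [np.max(s[p * block:(p + 1) * block]) for p in range(P)]
    return np.mean(maxima)
```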
Finally, a low-pass Butterworth filter removes the high-band noise, leaving only the information of the signal below 4kHz. This is done because the major part of the signal energy is contained in this band and only the energy is needed at this point of the non-speech detection process. The Butterworth filter has been implemented in its IIR form.
Derivative Filtering
Given the normalized and filtered energy signal (ẽ[n]) a derivative filter is used in order to
enhance the speech/non-speech change-points. This processing helps prevent degradation due
to low signal-to-noise ratios or nonstationary environments and was first introduced by Li et al.
Such a filter is defined via an impulse response h[n] parameterized by
$$A = 0.41\,s, \qquad s = \frac{7}{W}, \qquad W = \text{half of the window length} \qquad (4.4)$$
and by the coefficients $[K_1 \ldots K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]$, for a chosen window length W = 31. The selection of an appropriate value for the W parameter is important as it sets the temporal resolution of the detector.
As shown in figure 4.2, the result of the convolution of ẽ[n] and h[n], denoted ê[n], is thresholded and each sample is labelled as speech or non-speech.
Figure 4.2: Left, filter over ẽ[n]. Decision of silence in red after the thresholding.
After this third filtering of the energy signal, some time constraints need to be imposed to avoid switching too quickly between speech and non-speech. A finite state machine (FSM) has been implemented for this purpose. In the FSM described in figure 4.3, the time constraints are enforced through enter and leave times according to the values of ê[n], using two thresholds (enter thrld, Θenter, and leave thrld, Θleave) on each sample. The selection of the right thresholds is crucial to the correctness of the detector and, although the energies have been initially normalized, the appropriate values may differ from meeting to meeting. The threshold enter thrld is defined to be an order of magnitude bigger than leave thrld, and its value is iteratively set by the hybrid system described below. The appropriate minimum time in either the speech or the non-speech state must be estimated using development data but, as will be shown, it is more independent of meeting room variations than the threshold values.
Inside the FSM, the conditions to go from non-speech to speech mirror those to go from speech to non-speech: to go from speech to non-speech, ê[n] has to be higher than the enter threshold (Θenter), and vice versa for the leave threshold (Θleave).
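A much simplified sketch of such a two-threshold FSM is given below; the per-sample operation and the way the minimum state duration is handled are simplifications of the scheme in figure 4.3.

```python
def fsm_silence_labels(e_hat, theta_enter, theta_leave, min_dur):
    """Label each sample of the filtered energy ê[n] as 'speech' or 'non-speech'
    using two thresholds and a minimum number of samples per state."""
    labels, state, count = [], "speech", 0
    for value in e_hat:
        if state == "speech" and value > theta_enter and count >= min_dur:
            state, count = "non-speech", 0      # enter the silence state
        elif state == "non-speech" and value < theta_leave and count >= min_dur:
            state, count = "speech", 0          # leave the silence state
        count += 1
        labels.append(state)
    return labels
```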
The second stage of the process consists of a model-based speech/non-speech detector, which obtains an initial segmentation (used to train its models) from the output of the energy-based detector. It then produces the speech/non-speech labels that are used for the speaker diarization task. By training the models on the output of the energy-based detector, the need for any external training data (or pretrained models) is avoided.
The model-based decoder is composed of a two-state ergodic HMM (following the same architecture as the speaker diarization system), where one state models silence using a single Gaussian and the speech state uses a GMM with M mixtures (M > 1). In each state a minimum duration MD is imposed, which is allowed to differ from the duration set in the energy-based detector. EM-ML is used to train the models and Viterbi decoding to segment the acoustic data. An iterative segmentation-training loop is performed until the overall meeting likelihood stops increasing, and then the system outputs the speech/non-speech labels.
In order for the speech and silence models to represent the acoustic information well, there need to be enough frames of data for each model in the input segmentation. As seen in Anguera, Wooters and Hernando (2006d) and Anguera, Wooters and Hernando (2006b), the silence data can be modeled with a single Gaussian with a very narrow variance. On the other hand, the speech information is much "broader" and dependent on the speakers present in the meeting. It is therefore important that the data used to train the silence model contains as little speech as possible. This translates into a very small "missed speech" rate requirement on the energy-based detector.
(Block diagram of the energy-based detector: the signal s[n] is squared to obtain e[n] = s²[n], normalized by the factor µ, low-pass filtered, passed through the derivative filter to obtain ê[n], and finally a minimum duration is enforced.)
As described above, the functioning of the energy detector depends on setting a threshold value properly. In an exclusively energy-based system such a threshold has to be defined using a development set as close as possible to the test set in order to obtain optimum results. By using a model-based decoding as a second step, the need for a perfectly tuned threshold can be relaxed, since the aim now is only to obtain a rough distinction between speech and non-speech. The energy detector is initially run with a very low threshold pair (1e-5/1e-6). While the number of non-speech segments found (Nsil) is smaller than 10, the threshold pair is raised by an order of magnitude and the energy detector is rerun (its computational requirements are minimal). This is done iteratively until Nsil > 10. At that point, if Nsil > 100 it is considered that there are too many silence segments, and a refinement step lowers the threshold pair, using a smaller threshold step size, until between 10 and 100 non-speech segments are obtained. The range (10 to 100) is defined only roughly, in order to obtain a sufficient amount of silence frames to train the silence model in the model-based decoder with a low percentage of speech labelled as silence.
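This iterative threshold selection can be sketched as follows; run_energy_detector is a placeholder for the energy-based detector described above, and the refinement step size is an assumption.

```python
def tune_energy_thresholds(run_energy_detector):
    """Raise a very low threshold pair by orders of magnitude until at least 10
    silence segments are found, then refine with smaller steps if more than 100
    segments were hypothesized."""
    enter_thr, leave_thr = 1e-5, 1e-6
    segments = run_energy_detector(enter_thr, leave_thr)
    while len(segments) < 10:                  # too few silences: raise thresholds
        enter_thr, leave_thr = enter_thr * 10, leave_thr * 10
        segments = run_energy_detector(enter_thr, leave_thr)
    while len(segments) > 100:                 # too many: back off with a smaller step
        enter_thr, leave_thr = enter_thr / 2, leave_thr / 2
        segments = run_energy_detector(enter_thr, leave_thr)
    return enter_thr, leave_thr, segments
```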
Such speech/non-speech segments are used to train the two models of the model-based decoder, which performs iterative Viterbi decoding and EM-ML training on the data until likelihood convergence is reached.
The use of two well-known speech/non-speech detection techniques back-to-back allows for the creation of a more robust system than using either of them alone. On one hand, in a purely energy-based system the optimum thresholds defining the speech and non-speech segments differ from one recording type to another (as they depend on the room, the microphones used, the distance of the speakers to them, etc.) and therefore need to be optimized using data from the same source, becoming very dependent on it. On the other hand, in a purely model-based system, pre-labelled data is needed to train the models (or initial models generated in some other way), which is also very dependent on the type of recording. By using both systems, any kind of data can be processed on its own, without the burden of collecting or annotating similar data.
The proposed system is not parameter free. There are three main parameters that need to be determined in order to obtain optimum results: the minimum duration of speech/non-speech in the energy-based detector, the number of Gaussian mixtures assigned to speech in the model-based decoder, and the minimum duration of speech and non-speech in that decoder. These are, however, more robust to changes in the recording acoustics.
In this section a new cluster initialization algorithm is presented (see Anguera, Wooters and Hernando (2006c)), called the "friends-and-enemies" initialization due to the way single-speaker segments are grouped with those closest to them, while new clusters are created as enemies of the existing clusters. A cross-likelihood metric is used to determine "friendliness". This algorithm is aimed at improving the prior linear initialization algorithm explained in section 3.1.2.
The cluster initialization block has often been considered of lesser importance in the past, since many segmentation and model-retraining iterations take place later in the process, which should allow any "pseudo-optimal" initialization to perform as well as any other in the end. In this respect, it has been considered that the best initialization is the one that does not introduce any computational burden to the overall system.
With the marked reduction of the error in the current system, it has been seen that the linear initialization does harm the final score, since some initial clustering errors are propagated all the way to the end of the agglomerative clustering and show up in the final result. It has also been observed that a linear initialization without any acoustic constraints on the initially created clusters introduces a random effect in the system, which could be one of the sources of per-show "flakiness", as presented in Mirghafori and Wooters (2006).
The complete initialization is composed of three distinct blocks, as shown in Figure 4.6. The first block performs a speaker-change detection on the acoustic data to identify segments with a high probability of containing only one acoustic event. Such acoustic events can be silence, various noises, an individual speaker or several speakers overlapping each other. This first step is performed using the modified Bayesian Information Criterion (BIC) metric (introduced by Ajmera and Wooters (2003)), computed between two models created from the data in two adjacent windows of size W, connected at the evaluated change point. The modified ∆BIC metric is computed over all the acoustic data every S frames. A possible change point is selected if its ∆BIC value is negative, it corresponds to a local minimum of the ∆BIC values around
it, and there is no other possible change point with smaller ∆BIC value which is closer than
M D frames to it. In the implementation W = 2 second windows are used, with a scroll S = 0.5
seconds. Each window is modeled using a model with 5 Gaussian mixtures (therefore with 10
Gaussians for the combined model) and a M D of 3 seconds, equal to the minimum speaker turn
duration used in the following agglomerative-clustering process.
The second block in the initialization algorithm creates clusters by identifying the segments defined in the first part as friends or enemies of each other. Two given acoustic segments are considered friends if they contain acoustically homogeneous data; only the best friends are brought together to form a cluster. In the same way, two segments are considered enemies if they contain very dissimilar acoustic data. The aim is to obtain N final enemy groups (the desired final number of clusters), each consisting of F segments which are friends of each other. Three different similarity metrics were experimented with to compare each segment pair S1 and S2. The first is a geometric mean of the frame cross-likelihood,
where NSi is the number of frames in segment i, and ΘSi is a model with 5 Gaussian Mixtures
trained with Si .
The second metric normalizes each term by the number of frames in the linear domain
instead, resulting in a penalty to the cross-likelihood as
d2 (S1 , S2 ) = log lkld(S1 |ΘS2 ) + log lkld(S2 |ΘS1 ) − log LS1 − log LS2 (4.7)
The third metric does a full cross-likelihood as introduced by Rabiner in Juang and Rabiner
(1985)
d3 (S1 , S2 ) = log lkld(S1 |ΘS2 ) + log lkld(S2 |ΘS1 ) − log lkld(S1 |ΘS1 ) − log lkld(S2 |ΘS2 ) (4.8)
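As an illustration, the cross-likelihood metric d3 can be computed with off-the-shelf GMMs as sketched below; the diagonal covariances and the use of scikit-learn are choices made here, not taken from the thesis.

```python
from sklearn.mixture import GaussianMixture

def cross_likelihood_d3(seg1, seg2, n_components=5):
    """Metric d3 (equation 4.8): full cross-likelihood between two segments,
    each modeled by a GMM with 5 mixtures trained on its own frames.
    seg1 and seg2 are arrays of shape (n_frames, n_features)."""
    gmm1 = GaussianMixture(n_components, covariance_type="diag").fit(seg1)
    gmm2 = GaussianMixture(n_components, covariance_type="diag").fit(seg2)
    # score() returns the average log-likelihood per frame, hence the scaling
    return (gmm1.score(seg2) * len(seg2) + gmm2.score(seg1) * len(seg1)
            - gmm1.score(seg1) * len(seg1) - gmm2.score(seg2) * len(seg2))
```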
All three metrics take larger values the closer the segments are to each other. In order to initiate the process, one needs to define an initial segment. Again, three criteria have been considered:
1. Select the segment which is most representative of the meeting, which might indicate that
it belongs to the speaker with most participation. In order to find it, a global GMM with
16 Gaussian mixtures is trained using all data in the meeting and the segment with biggest
likelihood (normalized using geometric mean) is chosen.
2. Select the segment which is the least representative of the meeting, which might indicate
that it belongs to the speaker with smallest participation. Using the same GMM model as
before, the segment with smallest averaged likelihood is chosen.
3. Select the segment with the closest average distance to all other segments. Using one of
the distances presented above, the average of distances between each segment and all other
segments is computed and the maximum is chosen.
Figure 4.6 shows an example of how the algorithm works. The horizontal axis represents the speaker segments as found by the first block. The vertical axis shows the distance value associated with each segment. In step (0) the initial segment needs to be determined; in this example criterion 2 is used to find the segment with the smallest averaged likelihood, S1.
Then, in step (1a), the data in S1 is used to train a model with 5 Gaussian mixtures (ΘS1) and to compute one of the metrics d1...3 between itself and all other segments. The F − 1 segments with the largest values are its friends. In this example, F = 3 and the selected friends for S1 are S5 and S7. In step (1b), a new model is trained from all the data in this first cluster (Θ1) and the same metric as before is computed, except that now it is measured between all segments in the model and each of the remaining segments.
A new enemy S2 is selected as the segment with the smallest value with respect to the first cluster. In the same way, in step (2a) F − 1 friends are chosen for S2 and in (2b) a new enemy is selected for both previously established clusters. This is done by computing, for each segment, the sum of the chosen metric given all predefined groups. The processing continues until the desired number of initial clusters N is reached or there are no free segments left.
At that point, in the third block, all created models are used to reassign the acoustic data into the Kinit classes. This is done using a Viterbi decoding where the resulting segmentation is not constrained to the predefined speaker changes, so any previous speaker-change detection errors can be corrected. All data gets assigned to its closest cluster, classifying any acoustic
frames not assigned in the previous block. Finally, one cluster model is trained from each of the
resulting clusters.
This section describes two algorithms that automatically determine the number of Gaussian mixtures per model and the number of initial clusters to be used in the system. In the baseline system these values were tuned using development data. This approach, though, was considered deficient, as it assumes that development and test data behave the same way. It was seen that the appropriate number of clusters and the complexity of each model at each stage strongly depend on the amount of data available, so any difference in the length of the data to be clustered between development and test was seen to harm the performance. Furthermore, each meeting contains a different amount of data after the speech/non-speech detection, which leaves any fixed parameter values untuned to the particular meeting's properties.
In order to determine the number of Gaussian mixtures and the number of initial clusters, the algorithms presented below base their selection on information from each particular recording rather than on a pre-fixed value for all recordings of a certain type. To do so, a new parameter is defined, called the Cluster Complexity Ratio (CCR), which defines a ratio between the amount of data being modeled and the number of mixtures needed to represent it. The CCR is defined using development data and is then used to derive recording-specific values for the above-mentioned parameters. Although the CCR itself still needs tuning, it allows individual parameters to be determined for each show, adding robustness to the system.
The acoustic models used to represent each cluster are a key part of the agglomerative clustering process. On one hand, comparing their likelihoods given the data is how it is decided whether two models belong to the same cluster or not. On the other hand, they are used in the decoding process to redistribute the acoustic data among the different clusters at every iteration.
When designing their size, an important decision is whether to use fixed models (meaning a fixed number of Gaussian mixtures from start to finish) or to allow the number of Gaussian mixtures to vary over time or with occupancy. Using fixed models is a viable alternative, but runs into the problem of having insufficient training data when the number of Gaussian mixtures is set to be high, or of producing too general a model when it is set to be small.
Furthermore, when comparing two models via ∆BIC, if they are too general they tend
to over-merge, and when they are too specific to the data they under-merge. Therefore it is
important to find a tradeoff on the number of mixtures used (model complexity). This has been
addressed in the systems presented by ICSI to the RT evaluations for meetings and broadcast
news (Anguera, Wooters, Peskin and Aguilo (2005) and Wooters et al. (2004)) by using variable
complexities as the merging process progresses. In such systems, all cluster models (regardless
of their size) are initially trained using a fixed number of Gaussian mixtures. Upon merging any
two clusters, the data from both original clusters are merged and a new cluster model is created
as the sum of both parents’ Gaussian mixtures.
Such an approach has a drawback that is addressed with the proposed technique. Models
with the same complexity are modeling different amounts of data (sometimes very different),
therefore their focus is very different. When doing a ∆BIC comparison of such models one cannot
expect to obtain coherent results, therefore system performance can degrade.
An algorithm is presented that selects the number of mixtures to be used when modeling
each cluster according to its occupancy count. This could be referred to as an occupancy
driven approach. After each important change in the amount of data assigned to each cluster
(normally due to a segmentation step), the number of acoustic frames that are assigned to each
of the models is used to determine the number of mixtures by:
$$M_{ij} = \mathrm{round}\left(\frac{N_{ij}}{\mathrm{CCR}_{gauss}}\right) \qquad (4.9)$$
where $N_{ij}$ is the number of acoustic frames assigned to cluster $i$ at iteration $j$ and $M_{ij}$ is the resulting number of Gaussian mixtures for its model.
In both approaches, the previous one and this new one, the total number of mixtures used over all models remains constant on average, being distributed among the different cluster models as described above. This allows tracking of the system evolution by inspecting the total Viterbi decoding likelihood, which can be compared across merging iterations.
The model complexity selection algorithm is executed in the places described in Figure 3.6.
The desired complexity of each model is computed using the equation described above and when
it is different than the current complexity of the model it is readjusted in one of two possible
ways:
• When the final complexity is bigger than the current one, the model is grown, one Gaussian
at a time, as described in step 3 of section 3.3.3.
• When the final complexity is smaller than the current one, models are trained from
scratch following the procedure in section 3.3.3. As it is explained in that section, elim-
inating/vanishing Gaussian mixtures from small models is not desirable and leads to a
decrease in performance.
In order to perform an agglomerative clustering on the data an initial number of clusters Kinit
needs to be defined. This value needs to be higher than the actual number of speakers to allow
the system to perform some iterations before finding the optimum number of clusters Kopt . It
also cannot be too big, as each model needs a minimum cluster occupancy to be trained properly,
and to avoid unnecessary computation.
In prior work (Anguera, Wooters, Peskin and Aguilo (2005) for the meetings domain and
Wooters et al. (2004) for broadcast news data), the number of initial clusters was fixed within
each domain. In the meetings domain, it was set to either 10 or 16 initial clusters, and in the
broadcast news domain it was set to 40 initial clusters. The selection of these values had to be
tuned to be greater than the possible number of speakers in any given recording while maximizing
the performance. As pointed out earlier, this leads to suboptimal results when conditions change.
With the following method, the number of initial clusters is defined on a per recording basis
by taking into account the total amount of data available for clustering:
$$K_{init} = \frac{N_{total}}{G_{clus_{init}} \cdot \mathrm{CCR}_{gauss}} \qquad (4.10)$$
The number of initial clusters is a function of the total amount of data available for clustering, Ntotal, the number of Gaussian mixtures initially assigned per cluster, Gclusinit (as in prior work, Gclusinit = 5), and the Cluster Complexity Ratio CCRgauss presented in the previous section. This initializes the system with an average complexity of Gclusinit Gaussians per cluster and the amount of data per cluster defined by CCRgauss. This technique does not try to guess the real number of speakers present in a recording; rather, it sets an upper bound on the number of clusters that is closely coupled with the complexity selection algorithm and allows a correct modeling of each initial cluster for each particular recording by determining the optimum amount of data it should be trained with.
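Both selections reduce to simple formulas once CCRgauss is fixed; a minimal sketch (with illustrative inputs, not the values used in this thesis) follows.

```python
def model_complexity(n_frames, ccr_gauss):
    """Equation 4.9: Gaussians assigned to a cluster with n_frames frames."""
    return max(1, round(n_frames / ccr_gauss))

def initial_num_clusters(n_total_frames, ccr_gauss, gauss_per_cluster=5):
    """Equation 4.10: initial number of clusters for the recording."""
    return max(1, int(n_total_frames / (gauss_per_cluster * ccr_gauss)))
```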
In this section a small change to the cluster models is proposed which eliminates the dependency of the acoustic models on the average speaker turn length. This is achieved by modifying the acoustic model topology, changing the self-loop and transition probabilities of the last state. By doing so, a minimum duration for a speaker turn can still be imposed as in the past, while no longer influencing the final duration of a speaker turn. While setting a minimum duration for speaker turns is advantageous for the processing of the recordings and can be set independently of the kind of recording encountered, the average speaker turn duration varies considerably between individual recordings and domains. It is therefore better to let the acoustic data alone decide when a speaker turn finishes once it reaches the minimum length.
In the cluster models, each state contains a set of MD sub-states, as seen in figure 4.7, imposing a minimum duration on each model. Each sub-state has a probability density function modeled via a Gaussian mixture model (GMM), and the same GMM is tied across all sub-states of a given state. Upon entering a state at time n, the model forces a jump to the following sub-state with probability 1.0 until the last sub-state is reached. In that sub-state, the model can remain with self-loop weight α, or jump to the first sub-state of another state with weight β/M, where M is the number of active states/clusters at that time. In the baseline system these were set to α = 0.9 and β = 0.1 (summing to 1).
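The topology just described can be illustrated by the transition structure of a single cluster model, sketched below; this is an illustration of the state layout, not the actual HTK/ICSI implementation.

```python
import numpy as np

def cluster_transition_structure(min_dur_states, alpha=1.0, beta=1.0, n_clusters=1):
    """Transition weights of one cluster model: min_dur_states sub-states tied to
    the same GMM, a forced jump to the next sub-state, and on the last sub-state
    a self-loop alpha plus an exit weight beta/M towards each of the M clusters.
    The extra last column holds the weight of leaving the model."""
    S = min_dur_states
    A = np.zeros((S, S + 1))
    for s in range(S - 1):
        A[s, s + 1] = 1.0                   # forced path until the minimum duration
    A[S - 1, S - 1] = alpha                 # stay in the last sub-state
    A[S - 1, S] = beta / n_clusters         # jump to the first sub-state of a cluster
    return A
```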
One disadvantage of using these settings is that it creates an implicit duration model on the
data beyond the minimum duration M D, set as a parameter. Let us consider a sequence of N
feature vectors X={x[1] . . . x[N]}. Let us also consider a set of K cluster models Θ = {Θ1 . . . ΘK }.
Figure 4.7: Cluster models with Minimum duration and modified probabilities
The system imposes an equal probability of choosing any of the clusters once it leaves the previous one, and a minimum duration MD inside each cluster.
In order to study the interaction between the α, β and MD parameters, the likelihood of the data given the models is analyzed. Equation 4.11 gives the likelihood when the system selects model 1 as the initial model and stays in it for the whole N acoustic frames, therefore with zero model changes:
$$L_0(X|\Theta) = L(x[1]|\Theta_1)\prod_{i=2}^{MD}\big(1\cdot L(x[i]|\Theta_1)\big)\cdot\prod_{i=MD+1}^{N}\big(\alpha\cdot L(x[i]|\Theta_1)\big) \qquad (4.11)$$
In equation 4.12 the likelihood is computed for the case when one cluster change occurs
within the decoded N frames. The decoding used imposes that the second model will contain at
least M D acoustic frames. Considering models 1 and 2 it can be written as:
$$L_1(X|\Theta) = L(x[1]|\Theta_1)\prod_{i=2}^{MD}\big(1\cdot L(x[i]|\Theta_1)\big)\cdot\prod_{i=MD+1}^{N_1}\big(\alpha\cdot L(x[i]|\Theta_1)\big)\cdot\frac{\beta}{K}\prod_{i=N_1}^{N_1+MD}\big(1\cdot L(x[i]|\Theta_2)\big)\cdot\prod_{i=N_1+MD+1}^{N}\big(\alpha\cdot L(x[i]|\Theta_2)\big) \qquad (4.12)$$
where N1 indicates a random point in the N frames, as long as N1 > M D and N1 < N − M D.
The transition probabilities in these equations are the terms not affected by the acoustic models. By extending the number of changes to C, the transition probability can be shown to take the expression:
$$Tr(C) = \left(\frac{\beta}{K}\right)^{C} \alpha^{\,N-(C+1)MD} \qquad (4.13)$$
It is composed of two parts. On one hand, the left term depends on the β parameter and exclusively on the number of cluster changes and the number of possible clusters to change to. On the other hand, the right term depends on the α parameter and encodes the duration modeling of each of the acoustic models. This duration model depends on the number of speaker changes C and the minimum duration MD.
In the broadcast news system the parameters were set to α = 0.9, β = 0.1 and MD = 3 seconds. This led to a transition probability dependent on C and MD which, in many cases, created segments whose average duration was very close to MD. This was because, in most cases, when evaluating N frames of data, L_{i≠0}(X|Θ) > L_0(X|Θ). In order to avoid cluster changes every MD seconds, a lower bound for α must be set by ensuring that Tr(C≠0) < Tr(0), computed for the hypothetical case where all models are identical (i.e. Θi = Θj, ∀i, j). Applying this condition to the transition probabilities for all possible values of C gives:
$$\alpha^{MD} > \frac{\beta}{K} \qquad (4.14)$$
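A quick numeric illustration of equations 4.13 and 4.14, under the assumption of identical models and an illustrative frame rate of 100 frames per second, is given below.

```python
def transition_term(C, N, MD, alpha, beta, K):
    """Equation 4.13: transition-probability term for C cluster changes over N
    frames, with minimum duration MD (in frames) and K active clusters."""
    return (beta / K) ** C * alpha ** (N - (C + 1) * MD)

N, MD, K = 1000, 300, 10                     # 10 s of data, 3 s minimum duration
print(transition_term(2, N, MD, 0.9, 0.1, K) > transition_term(0, N, MD, 0.9, 0.1, K))
# True: with the baseline alpha=0.9, beta=0.1 two changes are favored over none
print(0.9 ** MD > 0.1 / K)                   # False: the baseline violates equation 4.14
print(1.0 ** MD > 1.0 / K)                   # True: alpha = beta = 1 satisfies it
```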
In order to remove the dependency of the duration modeling on MD, and in agreement with equation 4.14, the parameters were set to α = 1.0 and β = 1.0. Thus, once a segment exceeds the minimum duration, the HMM state transitions no longer influence the speaker turn length; it is governed solely by the acoustics. This creates a non-standard (but valid) HMM topology, as α + β no longer sums to 1.
Given the speaker clustering algorithm presented in this thesis, there are usually acoustic frames assigned to a cluster which do not belong to the modeled speaker. These frames are either non-speech or frames from another speaker. In this thesis this phenomenon is referred to as cluster "impurity". It is very important to ensure that the clusters contain only one speaker, so that the merging decisions and the stopping-point criterion do not suffer from cluster impurity. Such cluster impurity has been studied at two levels of detail (corresponding to two sources of error), and two algorithms are presented to detect it and purify the clusters.
One source of error occurs when a cluster is created from speech segments from multiple
speakers. In standard agglomerative systems there is no mechanism to split a cluster when
segments from different speakers are assigned to the same cluster. This effect causes an increase
in the final speaker error as seen in the example in Figure 4.8(a) for the case of two misplaced
segments of two existing speakers. It is very possible that the speaker model for the mixed cluster is able to represent both speakers' data and therefore the Viterbi segmentation does not manage to homogenize the cluster by reclassifying the acoustic frames into their respective clusters. At the end of the processing, the mixed cluster is likely to be assigned to a non-existent speaker (or to either of the speakers present in it), causing a large increase in the Diarization Error Rate (DER).

Figure 4.8: Possible speaker clustering errors due to cluster purity problems
The second source of error comes from the interference of non-speech frames in both clusters
during cluster comparison. This is particularly true for short silences and short acoustic events
that belong to the modeled speaker but do not discriminate one speaker from another. This
can affect the final clustering in two ways, as seen in Figure 4.8(b). First, when comparing two
clusters belonging to the same speaker, the confounding frames can cause ∆BIC to decide to keep
them separate. Second, false alarm errors are produced when non-speech frames are assigned to
one of the speakers.
Both sources of error are interrelated and are caused by frames that are assigned to the wrong acoustic model. The difference lies in the unit that is considered to be misassigned (segment or frame). In the next subsections two algorithms are proposed towards solving both problems. The first algorithm identifies the segments that acoustically deviate most from their cluster and splits them into a new cluster. This is referred to as "segment-level" purification. The second algorithm locates the individual frames within a cluster that can cause problems in the merging stage and avoids using them when computing the distance between the cluster pair. It is referred to as "frame-level" purification.
Due to the use of a minimum duration in the acoustic modeling, speech segments that legiti-
mately belong to a particular cluster can be “infected” with sets of non-speech frames and frames
belonging to other sources. Such sets are too short to be taken into account by the segment-based
decoding as independent clusters, or to be eliminated by the model-based speech/non-speech detector
without an important increase in missed speech error. They cause the models to diverge from their
acoustic modeling targets, which is particularly important when considering whether to merge
two clusters. The frame level purification presented here focuses on detecting and eliminating
the non-speech frames that do not help to discriminate between speakers (e.g. short pauses,
occlusive silences, low-information fricatives, etc).
Figure 4.9: Speech and silence histograms of the frame scores
In order to see the effect of typical acoustic speaker models with non-speech data an experi-
ment was performed on all the data belonging to an ICSI meeting used in the RT04s evaluation. All the acoustic frames X from that meeting were split into speech frames X1 and non-speech frames X0 according to the reference segmentation file provided by NIST. A speaker model with 5 Gaussian mixtures was trained using only the speech-labelled frames X1. Then both speech
and non-speech frames were evaluated using such model and two normalized histograms were
created from the resulting likelihood scores, as can be seen in Figure 4.9.
The scores of the non-speech frames X0 are mainly located in the higher part of the histogram,
indicating that X0 usually obtains higher likelihood scores than X1 even when evaluating it
on a model trained only with X1 data. Part of the X1 frames are also in the upper part of
the histogram, which are most probably non-speech frames that are labelled as speech in the
reference file. Even with the use of a speech/non-speech detector, a residual error of around
5% of non-speech data enters the clustering system. In order to purify a cluster both the non-
speech (undetected) data and the speech-labelled non-speech data needs to be eliminated while
maintaining the rest of acoustic frames that discriminate between speakers. It is clear that
likelihood can be used to detect and filter out these frames.
A possible explanation for this behavior is illustrated in Figure 4.10 where a cluster model
ΘA , using M Gaussian mixtures, is trained using acoustic data X1 labelled as speech by the
speech/non-speech detector. After training the model, a group of Gaussian mixtures M1 adapt
their mean and variances to model the subset of the speaker data X1,1 , while another group of
Gaussians M2 appears to model the subset of data X1,2, which are non-speech frames remaining in X1. Since the number of frames in X1,1 is typically much larger than that of X1,2, the numbers of Gaussian mixtures associated to each subgroup satisfy |M1| >> |M2| and, at times, |M2| could be 0 if the non-speech data is minimal. Furthermore, the variance of the non-speech Gaussian mixtures in M2 is always much smaller than that of M1. This is the reason why any non-speech frame
evaluated by the model gets a higher score than a speech frame. This is taken advantage of in
the frame level purification algorithm.
To further prove that the acoustic frames with a higher likelihood are those which are less
suitable to discriminate between speaker models another experiment was performed taking two
speaker clusters trained with acoustic data for two different speakers according to the reference
segmentation. Figure 4.11 illustrates the relationship between the likelihood scores of the data
used in training each of the two models and evaluated on both models. It is possible to determine
an axis between the likelihood values of the two models. The distance to this axis indicates the discriminative power of the data from each cluster. Frames from both clusters with the highest likelihood values are grouped together on this axis, indicating how poorly they can differentiate between speakers.

Figure 4.11: Likelihood scores Lkld(x|model1) versus Lkld(x|model2) for the data used to train two speaker cluster models, evaluated on both models
In order to detect and filter out the non-speech frames using the detected likelihood property of
the non-speech data, two variants of a likelihood-based metric are proposed.
\[
\bar{L}(x[i]|\Theta_A) = \frac{1}{Q}\sum_{j=-Q/2}^{Q/2-1}\log\sum_{m=1}^{\tilde{M}} W_A[m]\,\mathcal{N}_{A,m}(x[i+j]) \tag{4.15}
\]
The two metrics are based on equation 4.15, where Q defines the length of an averaging window used to smooth the measure around the desired frame in order to avoid noisy values; M̃ is the number of Gaussian mixtures used to compute the likelihood (with M̃ ≤ M, M being the number of mixtures in the model); W_A[m] is the mixture weight, and N_{A,m}(x[i+j]) is the result of evaluating x[i+j] on the m-th Gaussian mixture of model Θ_A:
Metric 1 A standard smoothed likelihood over 100ms of data (Q = 5 with 10ms acoustic frames) around each acoustic frame, with M̃ = M (all mixtures in model Θ_A).
Metric 2 The same smoothed likelihood (over 100ms) given a model formed by a subset of all Gaussian mixtures in the speaker model, which includes the mixtures assigned to non-speech. The mixtures used are selected by computing the sum of variances over all dimensions and selecting those with the smallest accumulated variance, M̃ = M_{non-speech}. This second metric is equivalent to metric 1 when 100% of the Gaussian mixtures are selected.
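As an illustration, the following sketch (assuming diagonal-covariance GMMs stored as numpy arrays; the mixture-subset fraction is a hypothetical parameter, not a value specified in the thesis) computes the smoothed per-frame likelihood of equation 4.15 for both metrics:

```python
import numpy as np

def smoothed_frame_likelihood(frames, weights, means, variances, Q=5, use_mixtures=None):
    """Smoothed log-likelihood of eq. 4.15 for every frame in `frames`.

    frames:    (T, D) array of acoustic features (10 ms frames).
    weights:   (M,)  mixture weights of the cluster GMM (diagonal covariances).
    means:     (M, D) mixture means.
    variances: (M, D) mixture variances.
    Q:         averaging window in frames (Q = 5 covers 100 ms of data).
    use_mixtures: optional list of mixture indices; metric 1 uses all mixtures,
               metric 2 keeps only the lowest-variance ones.
    """
    if use_mixtures is not None:                       # metric 2: subset of mixtures
        weights = weights[use_mixtures]
        means = means[use_mixtures]
        variances = variances[use_mixtures]

    # Per-frame GMM log-likelihoods with diagonal covariances.
    diff = frames[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_lkld = np.logaddexp.reduce(np.log(weights) + log_gauss, axis=1)  # (T,)

    # Average over a Q-frame window centred on each frame (eq. 4.15).
    kernel = np.ones(Q) / Q
    return np.convolve(log_lkld, kernel, mode="same")

def lowest_variance_mixtures(variances, fraction=0.5):
    """Metric 2 mixture selection: keep the mixtures whose summed variance over
    all dimensions is smallest (assumed to model non-speech).  The `fraction`
    value is a hypothetical choice used only for this sketch."""
    total_var = variances.sum(axis=1)
    keep = max(1, int(round(fraction * len(total_var))))
    return np.argsort(total_var)[:keep]
```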
When running the Speaker Diarization algorithm, each cluster is modeled with a variable number
of Gaussian mixtures according to the amount of data it contains. It is therefore necessary to analyze at which cluster complexities this behavior is present so that the presented metrics can be used. Figure 4.12 shows the histograms of speech and non-speech frames (according to the reference file) for metric 1, evaluated using models ranging from 1 to 8 mixtures. All model complexities have been trained with the same data and used to evaluate metric 1 on all the meeting data, in the same way as in Figure 4.9.
Figure 4.12: Speech and silence histograms of metric 1 for models with 1, 2, 3 and 4 Gaussian mixtures
It is seen that only the case of 1 Gaussian mixture shows a bigger overlap between the speech
and non-speech histograms, while after 3 mixtures all plots seem identical (in fact, running the
same experiments from 1 to 20 mixtures/model gives identical results from 9 to 20). The frame-
level purification algorithm is therefore applied whenever the number of Gaussian mixtures is
greater than one.
The algorithm is used when gathering the data to compare two clusters using the ∆BIC
metric in the following way:
1. Retrieve all frames assigned to each of two clusters and use either metric for each frame
in both clusters.
2. If M_i > 1, eliminate the P% of frames in each cluster with the highest computed metric.
3. Train two new models with the remaining data and use them for computing the ∆BIC
metric.
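A minimal sketch of this procedure is given below; `metric_fn`, `train_fn` and `delta_bic_fn` are hypothetical placeholders for the smoothed-likelihood metric of equation 4.15, the cluster model training and the ∆BIC computation of the system, and the purification is assumed to be applied only when the cluster models use more than one Gaussian mixture:

```python
import numpy as np

def purify_and_compare(frames_a, frames_b, gmm_a, gmm_b,
                       metric_fn, train_fn, delta_bic_fn, purge_pct=20.0):
    """Frame-level purification before a cluster comparison (sketch).

    metric_fn(frames, gmm): per-frame smoothed likelihood (eq. 4.15).
    train_fn(frames):       trains a new cluster model on the given frames.
    delta_bic_fn(model_a, frames_a, model_b, frames_b): Delta-BIC value.
    All three are placeholders for the corresponding system components;
    `purge_pct` stands for the P% of frames to eliminate.
    """
    def purify(frames, gmm):
        scores = metric_fn(frames, gmm)
        threshold = np.percentile(scores, 100.0 - purge_pct)
        return frames[scores < threshold]            # drop the top purge_pct %

    clean_a, clean_b = purify(frames_a, gmm_a), purify(frames_b, gmm_b)
    model_a, model_b = train_fn(clean_a), train_fn(clean_b)
    return delta_bic_fn(model_a, clean_a, model_b, clean_b)
```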
There are some situations where a cluster retains speaker segments from more than one speaker;
the segment-level cluster purification algorithm is a proposed mechanism used to force splitting
these clusters into two parts. The algorithm detects the segments in each cluster that are likely
to belong to another speaker and reassigns one of them to a new cluster in each iteration of the
agglomerative clustering algorithm. The algorithm works as follows:
1. Find the segment that best represents each model (highest normalized likelihood). This is
done to isolate the effect of a big speaker model when trying to determine if it contains any
segments from more than one speaker. The most representative segment is very likely
to contain only data from one speaker, and it is more reliable to compare it with other
segments of similar size.
2. Compute, within each cluster, the ∆BIC value between the best segment (found in step
1) and each of the other segments. If all pairs have a value greater than a minimum
purity (empirically set to -50) that model is labelled as “pure” and is not checked again
in subsequent iterations.
3. The segment that most differs from its model’s best segment is assigned to a new model.
All models are retrained and the data is resegmented with Viterbi.
In order to avoid instability, the algorithm is run at most Kinit times (Kinit being the number
of initial clusters). Doing so prevents clusters from continuously splitting and merging the same segments over and over.
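The following sketch outlines one iteration of this procedure (the purity bookkeeping and the retraining and resegmentation of step 3 are left to the caller; `norm_lkld_fn` and `delta_bic_fn` are hypothetical placeholders for the corresponding system components):

```python
def segment_purification_pass(clusters, delta_bic_fn, norm_lkld_fn,
                              purity_threshold=-50.0):
    """One iteration of segment-level purification (sketch).

    clusters: dict mapping cluster id -> list of segments (each segment being
              the feature matrix of one decoded speech segment).
    norm_lkld_fn(segment, cluster_id): normalized likelihood of a segment
              given its cluster model.
    delta_bic_fn(seg_a, seg_b): Delta-BIC value between two segments.
    Returns the (cluster id, segment index) to move to a new cluster, or
    None if no segment falls below the purity threshold.
    """
    worst = None
    for cid, segments in clusters.items():
        # Step 1: most representative segment of the cluster.
        best = max(range(len(segments)),
                   key=lambda s: norm_lkld_fn(segments[s], cid))
        # Step 2: compare every other segment against the best one.
        for s, seg in enumerate(segments):
            if s == best:
                continue
            score = delta_bic_fn(segments[best], seg)
            if score < purity_threshold and (worst is None or score < worst[0]):
                worst = (score, cid, s)
    # Step 3 (done by the caller): reassign the worst segment to a new cluster,
    # retrain all models and re-run the Viterbi segmentation.
    return None if worst is None else worst[1:]
```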
Chapter 5

Multichannel Processing for Meetings
In meeting room recordings there is normally access to more than one microphone that recorded
what occurred in the room synchronously, bringing spatial diversity. It is desirable to take
advantage of such signal multiplicity by using multichannel algorithms like acoustic beamforming
techniques. In section 2.5 the basic microphone array theory and the main techniques covered
by the literature in the topic of acoustic beamforming were reviewed, as well as its use for speech
enhancement and the methods previously applied to the meetings environment to this effect.
In order to use multichannel beamforming techniques in the meetings domain one needs to
consider the set of multiple microphones to constitute a microphone array. The characterization
and use of this array is not done in a classical manner as the locations and characteristics of the
available microphones can be non-conventional. The system to be used is required to be robust
and not require much prior information as the microphones can be located anywhere (with
varying distances between them) and can be of very different quality and characteristics (both
in directivity and type). By applying the appropriate techniques, in most cases it is possible to
obtain a signal quality gain and to improve speaker diarization and speech recognition results.
One of the necessary steps to perform acoustic beamforming on this environment, with
the selected techniques, is the estimation of delays between channels using cross-correlation
techniques. Such delay outputs can also be used as an input to the speaker diarization system
to cluster the different speakers in a meeting room by their locations (derived from the delays).
Although by themselves they do not carry as much information as the acoustic signal, when
combined with it using a multi-stream diarization system (presented in this section), important gains are observed with respect to using acoustics alone.
First, in section 5.1 the practical issues encountered in a meeting room multichannel layout are
presented and a filter-and-sum algorithm is proposed and described to process the data. Then in
section 5.2 the full acoustic beamforming implementation developed and used for the speaker
diarization and Automatic Speech Recognition (ASR) tasks is covered. Finally, in section 5.3
the use of the delays obtained from the speaker location estimation is explained, describing how
they improve the acoustic diarization performance by combining both types of features and how
the weighting between features is automatically computed.
Although linear microphone array theory sets a solid theoretical background for acoustic microphone array beamforming, its assumptions often differ from real-life applications. In this
section the practical characteristics that are encountered in the beamforming implementation
and the basic theory behind the implemented system are explained.
When using multichannel enhancement techniques to improve speaker diarization in the meetings
environment one usually encounters a set of characteristics in the meeting room data that will
condition the practical implementation of the system.
1. Microphone array definition: In a meeting the microphones are set in different positions
of the room. Some microphones are on the meeting table, some are on the room walls and
on some occasions the attendees are wearing head-mounted or lapel microphones. Such
multiplicity of types and positions defies the standard concept of microphone array as
analyzed in the theory. The lack of a single/optimum microphone, which obtains a clean
signal from all participants, makes acoustic beamforming using all available microphones
a feasible and worthwhile application for these microphones. Such implementation needs
to be sufficiently general to fit such loose microphone array definition.
2. Number of elements: The number of acoustic channels (microphones) available for pro-
cessing varies from meeting to meeting, not necessarily being kept constant for meetings
coming from the same source. The implementation cannot impose any constraint in the
number of channels it requires for processing, and should optimally obtain an “enhanced”
signal with better quality than either one of the individual signals alone.
3. Different microphone qualities: The frequency response of the different microphones in the
meeting room cannot be considered equal as these can be of multiple types. One needs to
consider possible differences in the contribution of each of the microphones according to
their signal quality, either known a priori or computed automatically by the system.
4. Microphone locations: The exact location of the microphones in the room is unknown.
In some cases one can know the relationship between the positions of certain microphone
groups (for example microphones within a circular microphone array or a linear array).
The microphone settings change for each meeting room and it should not be necessary to
know them a priori.
The filter-and-sum beamforming is one of the simplest beamforming techniques, but it still gives a very good performance. It is based on the fact that, by applying different phase weights to the input channels, the main lobe of the directivity pattern can be steered to a desired location, from where the acoustic input comes. It differs from the simpler delay-and-sum beamformer in that an independent weight is applied to each of the channels before summing them.
Figure 5.1: Linear microphone array with all microphones equidistant at distance d
For the linear array of Figure 5.1, the phase weight applied to microphone n in order to steer towards direction φ′ is

\[
\varphi_n = \frac{-2\pi (n-1)\, d \cos(\phi')\, f}{c} \tag{5.1}
\]

and the resulting directivity pattern becomes

\[
D(f, \phi) = \frac{1}{N}\sum_{n=1}^{N} e^{\,j\frac{2\pi f (n-1) d (\cos\phi - \cos\phi')}{c}} \tag{5.2}
\]
where the term cosφ′ forces the main lobe to move to the direction φ = φ′ .
Such steering can be applied in real applications by inserting time delays to the different
microphone inputs. In this case the delay to be applied to each microphone to steer at angle φ′
is:
\[
\tau_n = \frac{(n-1)\, d \cos\phi'}{c} \tag{5.3}
\]
Each of the microphone inputs is delayed a time τ_n and then all signals are summed to obtain the delay-and-sum output, as can be seen in figure 5.2. The physical interpretation, when the wavefront is considered flat, is that τ_n is the time it takes the same wavefront to reach each of the microphones.
By using such time-delay equivalence, the delay-and-sum output y(t) can be written as

\[
y(t) = \frac{1}{N}\sum_{n=1}^{N} x_n(t - \tau_n) \tag{5.4}
\]
The basic delay-and-sum beamforming considers all channels to have an identical frequency response and therefore applies equal amplitude weights (a_n) to all channels. In the application for meetings it will be considered that microphones have different (and unknown) frequency responses. This problem can be addressed by adding a non-uniform amplitude weight and making both amplitude and phase weights frequency dependent, therefore obtaining a filter-and-sum beamforming system output as

\[
y(f) = \sum_{n=1}^{N} w_n(f)\, x_n(f) \equiv \sum_{n=1}^{N} a_n(f)\, x_n(f)\, e^{-j 2\pi f \frac{(n-1) d \cos\phi'}{c}} \iff y(t) = \sum_{n=1}^{N} a_n(t)\, x_n(t - \tau_n) \tag{5.5}
\]
Figure 5.2 represents what is expressed in equation 5.5. The input signal (considered to be coming from a distant source, with a flat wavefront) arrives at each microphone from an angle φ′ at a different time instant. The signals from the different microphones are passed through a filter w_n, independent for each microphone (1 through N), which accounts for an amplitude weight and a time delay (as seen in eq. 5.5). The output or "enhanced" signal is the sum of all filtered individual signals.
Figure 5.2: Filter-and-sum beamforming: a flat wavefront arriving from angle φ′ is captured by each microphone, filtered by w_n(f) and summed to form the output y(f)
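A minimal sketch of the weighted delay-and-sum operation over one block of samples is shown below (assuming integer sample delays and precomputed per-channel weights; the actual implementation described in section 5.2 additionally estimates the delays and weights per segment):

```python
import numpy as np

def filter_and_sum(channels, delays, weights):
    """Weighted delay-and-sum over a block of samples (sketch of eq. 5.5).

    channels: list of N equal-length 1-D numpy arrays (one block per microphone,
              already Wiener filtered).
    delays:   per-channel TDOA values in samples (integers), relative to the
              reference channel.
    weights:  per-channel amplitude weights a_n (summing to 1).
    """
    output = np.zeros(len(channels[0]))
    for x, d, w in zip(channels, delays, weights):
        shifted = np.roll(x, -d)       # crude time alignment (wraps at the block edges)
        output += w * shifted
    return output
```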
This type of beamforming technique was selected for the implementation of the meetings
system because it agrees with all the desired characteristics. Furthermore, its simplicity allows for a fast implementation, normally running faster than real time, which would eventually allow its use in a real-time system.
This section describes the implementation of the multichannel acoustic beamforming system
for meetings based on the filter-and-sum system presented in section 5.1 and also described in
Anguera, Wooters and Hernando (2005). This involves the processing of the signal from each
available microphone until obtaining the final “enhanced” channel output and other related in-
formation useful for further processing in the mono-channel speaker diarization system presented
in the next section.
Figure 5.3 shows the different blocks involved in the filter-and-sum process. The system is
able to handle from 2 microphones to as many microphones as memory allows in the computer
where it is executed. Each processing stage is performed either on each individual microphone or on all microphones in combination. Each processing block is described in detail below.
Prior to doing any multichannel beamforming, each individual channel is Wiener filtered (Wiener
1949). This aims at cleaning the signal from corrupting noise, which is considered to
Figure 5.3: Filter-and-sum implementation block diagram: noise threshold estimation and thresholding, TDOA values selection (dual-pass Viterbi decoding of the delays), interchannel output weight adaptation, bad quality segments elimination, and channels sum for the generation of the enhanced output signal
be additive and of a stochastic nature. The Wiener filter parameters w(t) are chosen so that the
mean square error between the clean signal x(t) and the resulting output signal s(t) is minimized.
Considering an additive noise n(t), it can be written as

\[
x_n[k] = w_n[k] \ast \big(s_n[k] + n_n[k]\big) \tag{5.6}
\]
where sn [k] and nn [k] are the discrete speech and noise recorded by each of the N channels in
the room, and xn [k] is the cleaned signal which will be further processed by the system.
In this implementation Wiener filtering is applied to each channel independently, not taking advantage of the multichannel properties of the speech or noise being recorded, as done in Rombouts and Moonen (2003) and Doclo and Moonen (2002). Since the microphones are located in unknown places in the room, it is considered that no assumptions can be made on the noise or speech properties at this level. The Wiener filtering implementation is taken from ICSI-SRI-UW
and used in the ASR system as explained in Mirghafori et al. (2004).
The algorithms in this block extract information from the input signals, which will be used
further on in the process to construct the signal output. It is composed of four algorithms, which are described below.
In a typical implementation of a time-delay based beamforming system one needs to select one
of the channels as the reference channel. This channel is compared to all others and the time
delay of arrival (TDOA) is estimated for each pair. It is important for this channel to be the
best representative of the acoustics in the meeting, as the correct estimation of the delays of
each of the channels depends on the chosen reference.
In the meetings transcribed by NIST to be used for the Rich Transcription evaluations (NIST
Rich Transcription evaluations, website: http://www.nist.gov/speech/tests/rt 2006) there is one
microphone indicated to be the most centrally located in the room. Such microphone is chosen
empirically given the room layout and the prior knowledge of the microphone types. This module
overrides that decision and selects one microphone automatically given a criterion based on
acoustics. This is intended for system robustness in cases where absolutely no information on
the room layout and the microphone placements is available. Two possible acoustic criteria were investigated to select such channel:

1. A speech/non-speech detection based on energy is applied to each of the channels independently and the SNR is computed. The channel with the better SNR is chosen to be the reference channel. This poses a problem regarding how accurate the speech/non-speech detection is and how well it correlates between channels. The algorithm implementation computed speech/non-speech for each channel independently and then computed the SNR for each one, giving mixed results. An SNR computation using some combined speech/non-speech technique, where all channels are taken into account to produce one single segmentation, could have improved this selection algorithm.
2. The average cross-correlation between each channel i and all other channels is computed as

\[
\mathrm{cross\_correlation}_i = \frac{1}{MN}\sum_{m=1}^{M}\sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{xcorr}(i, j) \tag{5.7}
\]
where N is the number of channels and M indicates the number of blocks used in the
average. In the implementation GCC-PHAT cross-correlation was used as described below.
The channel with the highest average cross-correlation was chosen as reference channel.
This metric takes into account the total amount of time each speaker speaks and
the quality of each microphone. In the case where all microphones were the same and
all speakers spoke the same amount of time, the chosen microphone should be the most
physically centrally located one, coinciding with what NIST reports in the RT evaluations.
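A sketch of this selection criterion is given below, assuming a placeholder `xcorr_fn` that returns the maximum GCC-PHAT value between two equal-length blocks; the normalization follows equation 5.7 and does not affect the selection of the maximum:

```python
import numpy as np

def pick_reference_channel(channels, block_starts, block_len, xcorr_fn):
    """Reference-channel selection by average cross-correlation (eq. 5.7, sketch).

    channels:     list of N 1-D sample arrays.
    block_starts: starting samples of the M blocks used for the average.
    xcorr_fn(a, b): placeholder for the system's GCC-PHAT correlation routine,
                    returning the maximum correlation value between two blocks.
    """
    n_ch = len(channels)
    avg = np.zeros(n_ch)
    for i in range(n_ch):
        for start in block_starts:
            block_i = channels[i][start:start + block_len]
            for j in range(n_ch):
                if j != i:
                    avg[i] += xcorr_fn(block_i, channels[j][start:start + block_len])
        avg[i] /= (len(block_starts) * n_ch)      # 1/(MN) normalization of eq. 5.7
    return int(np.argmax(avg))                    # highest average correlation
```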
The input signal to the filter-and-sum module is typically a 16-bit, 16 kHz signal, and the output treated by the diarization system has the same characteristics. With 16 bits one can represent values from −32768 to +32767 in a single channel in steps of 1 (the resolution of the input). Such resolution gets modified when performing the weighted sum of N signals, as the resolution becomes smaller than 1 (the range of possible values of the summed signal depends on the weights of the individual signals; it would be 1/N for equal weighting). Although a higher resolution is available after the sum, the signal needs to be quantized to steps of unit value to fit into the 16-bit output channel, therefore introducing a quantization error at each sample.
As the use of an output signal with more bits (such as floating point values) would create an inconsistency with the standard signals used in the system, it was not considered feasible; instead, two simple modifications were made in order to minimize the amount of quantization error whenever possible. These are:
• The input signals usually do not cover the whole dynamic range offered by the available 16 bits
(or only a few instants in the meeting do). A scaling factor was defined for all signals so
that the sum of them will have a dynamic range closer to the available output, minimizing
the quantization errors of the output signal.
There are several alternatives in signal processing to find maximum values of a signal in
order to normalize it. Some alternatives are to compute the absolute maximum amplitude
over all the show, or the Root Mean Square (RMS) value, or other variations of it involving
a histogram of the signal (for example, taking the maximum as the 80% of such histogram).
It was observed that the processed signal contains very low energy areas (silence regions)
with short duration in average, and very high energy areas (impulsive noises, like door
slams, or common laughs or discussions), with even shorter duration. By using the absolute
maximum or RMS it would saturate the normalizing factor to the highest possible value
or bias it according to the amount of silence in the show. A windowed maximum averaging was implemented instead, in blocks of T = 10 seconds, to ensure that every block is highly likely to contain some speech. In each block the maximum value is found and then averaged over all the recording. Such average is used to obtain the overall weighting factor for the signal, in terms of the average maximum of each of the channels, as
\[
W_{all} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{M}\sum_{m=1}^{M}\max\left\{ x\Big[n + \frac{T(m-1)}{f_s}\Big], \cdots, x\Big[n + \frac{Tm}{f_s}\Big] \right\} \tag{5.8}
\]
• The quantization of the output signal is necessary to convert from a floating point value (obtained from the sum of all delayed and weighted signals) to a 16-bit signal. It is quantized to the closest integer value within the range ±32767, allowing a maximum quantization error of ±0.5, instead of using the standard C functions "int" or "floor", which allow a maximum error of 1. A small sketch combining both adjustments is shown after this list.
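The two adjustments can be sketched as follows (hypothetical helper names; the block length and the rounding rule follow the description above):

```python
import numpy as np

def overall_scaling_factor(channels, fs, block_seconds=10):
    """Windowed-maximum scaling factor in the spirit of eq. 5.8: average, over
    all channels, the mean of the per-block absolute maxima."""
    block = int(block_seconds * fs)
    per_channel = []
    for x in channels:
        maxima = [np.abs(x[s:s + block]).max()
                  for s in range(0, len(x) - block + 1, block)] or [np.abs(x).max()]
        per_channel.append(np.mean(maxima))
    return float(np.mean(per_channel))

def quantize_16bit(samples):
    """Round to the nearest integer (maximum error 0.5) and clip to the 16-bit
    range, instead of truncating with int()/floor() (maximum error 1)."""
    return np.clip(np.rint(samples), -32768, 32767).astype(np.int16)
```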
This module was created to deal with all meetings that come from the ICSI Meeting Corpus, which have an error in the synchronization of the channels. This was originally detected and
reported in ICSI Meeting Recorder Project: Channel skew in ICSI-recorded meetings (2006),
indicating that the hardware used for the recordings was found not to keep an exact synchronism
between the different acoustic channels, having a skew between channels of multiples of 2.64ms.
It is not possible to know beforehand the amount of skew of each of the channels, as they did not follow a consistent ordering in their connections to the hardware being used; therefore, it is necessary to automatically detect such skew so that it does not affect the beamforming processing.
The artificially generated skew does not affect the general processing of the channels by an
ASR system as it does not need exact time alignment between the channels (in terms of ms). It
does, though, pose a problem when computing the delays between channels, as it introduces an artificial delay between channel pairs which forces the use of a bigger analysis window for the ICSI meetings than with other meetings in order to compute such delays accurately, increasing the possibility of delay estimation errors and reducing the precision of such values. This module is
therefore used to estimate the skew between each channel and the reference channel (in the case
of ICSI meetings) and use it as a constant bias in the rest of the delay computations from then
on.
In order to estimate the bias a similar technique was used as when estimating the reference
channel and the weighting factor. Given a signal x_i[n] whose skew is to be computed, its cross-correlation with the reference signal is computed for P = 25 blocks of 20 seconds each, evenly spaced along the recording. The segments' length has been chosen in order to ensure that there is some speech within the windows being compared. The average skew is obtained for that channel
by averaging the time delays of arrival (TDOA) obtained for each of the segments when their
cross-correlation function is maximized. The process can be summarized as:

\[
Skew_i = \frac{1}{P}\sum_{m=1}^{P} TDOA_m\big(x_i^m, x_{ref}^m\big) \tag{5.9}
\]
GCC-PHAT Cross-Correlation
The computation of the time delay of arrival (TDOA) between each of the considered channels
and the reference channel is repeated along the recording in order for the beamforming to respond
to changes in the speaker. In this implementation it is computed every 250ms (called segment
size or analysis scroll) over a window of 500ms (called the analysis window) which covers the
current analysis segment and the next. The sizes of the analysis window and of the segment constitute a tradeoff. A big analysis window or segment size leads to a reduction in the resolution of changes in the TDOA. On the other hand, using a very small analysis window
reduces the robustness of the cross-correlation estimation, as less acoustic frames are used to
compute it. The reduction of the segment size also increases the computational cost of the
system, while not increasing the quality of the output signal.
In order to compute the TDOA between the reference channel and any other channel for any given segment, it is usual to estimate it as the delay that maximizes the cross-correlation between the two signal segments. In order to improve robustness against reverberation it is
normal practice to use the Generalized Cross Correlation with Phase Transform (GCC-PHAT)
as presented by Knapp and Carter (1976) and Brandstein and Silverman (1997).
Given two signals xi (n) and xj (n) the GCC-PHAT is defined as:
\[
\hat{G}_{PHAT}(f) = \frac{X_i(f)\,[X_j(f)]^{*}}{\big| X_i(f)\,[X_j(f)]^{*} \big|} \tag{5.10}
\]
Where Xi (f ) and Xj (f ) are the Fourier transforms of the two signals and [ ]∗ denotes the
complex conjugate. The TDOA for these two microphones is estimated as:
\[
\hat{d}_{PHAT}(i, j) = \underset{d}{\operatorname{argmax}}\ \hat{R}_{PHAT}(d) \tag{5.11}
\]
Where R̂P HAT (d) is the inverse Fourier transform of Eq. 5.10.
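A compact sketch of the GCC-PHAT TDOA estimation of equations 5.10 and 5.11 for one analysis window is given below (the M-best extraction here simply keeps the highest values of the correlation function, which is an approximation of selecting the M relative maxima):

```python
import numpy as np

def gcc_phat_tdoa(x_ref, x_i, max_delay=None, n_best=1):
    """TDOA estimation between the reference channel and channel i with GCC-PHAT
    (eqs. 5.10 and 5.11) over one analysis window.

    Returns the n_best candidate delays (in samples) with their GCC-PHAT values."""
    n = len(x_ref) + len(x_i)                     # zero-pad to avoid circular effects
    X_ref = np.fft.rfft(x_ref, n=n)
    X_i = np.fft.rfft(x_i, n=n)
    cross = X_ref * np.conj(X_i)
    cross /= np.abs(cross) + 1e-12                # phase transform (eq. 5.10)
    r = np.fft.irfft(cross, n=n)
    r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))   # centre the zero delay
    delays = np.arange(-(n // 2), n // 2 + 1)
    if max_delay is not None:                     # keep only physically possible delays
        keep = np.abs(delays) <= max_delay
        delays, r = delays[keep], r[keep]
    order = np.argsort(r)[::-1][:n_best]          # M-best values (approximation)
    return delays[order], r[order]
```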
Although the maximum value of R̂P HAT (d) corresponds to the estimated TDOA for that
particular segment, there are three particular cases for which it was considered not appropriate
to use the absolute maximum from the cross-correlation function. On one hand, the maximum
can be due to a spurious noise or event not related to the speaker active at that time in the
surrounding acoustic region, with the speaker of interest being represented by another local maximum of the cross-correlation.
On the other hand, when two or more speakers are overlapping each other, each speaker
will be represented by a maximum of the cross-correlation function, but the absolute maximum
might not be constantly assigned to the same speaker, resulting in artificial speaker switching.
In order to effectively enhance the signal it would be optimum to first detect when more than
one speaker is speaking at the same time and then obtain a filter-and-sum signal for each one,
stabilizing the selected delays and preventing constant speaker switching. Due to a lack
of an efficient overlap detector, this was not implemented in this thesis and remains as future
work.
Also, when the segment that has been processed is entirely filled with non-speech acoustic
data (either noise or random acoustic events) the GCC-PHAT function obtained will not be at
all informative. In such case no source delay information can be extracted from the signal and
the delays ought to be discarded and substituted by something more informative.
In the system implementation, to deal with such issues, the top M relative maxima of the function in eq. 5.11 are computed and several delay post-processing techniques are implemented to stabilize
and choose the appropriate delay before aligning the signals for the sum. These are described
below:
Once the TDOA values of all channels have been computed, we have seen above that it is
desirable to apply a TDOA post-processing to obtain the set of delay values to be applied to
each of the signals when performing the filter-and-sum as proposed in eq. 2.33. We propose
and implement two different filtering steps: Noisy TDOA detection and elimination (TDOA
continuity enhancement), and 1-best TDOA selection from the M-best computed vector.
TDOA Post-Processing
The first filtering proposed intends to detect those TDOA values that are not reliable. A TDOA
value does not show any useful information when it is computed over a silence (or mainly silence) region, or when the SNR of either of the signals being compared is very low, making them very dissimilar. The first problem could be addressed by using a speech/non-speech detector prior
to any further processing. Initial experiments indicated that further errors were introduced due
to the detector used. An improvement was obtained by applying a simple continuity filter on
the TDOA values based on their GCC-PHAT values by using a “noise threshold”:
\[
TDOA_i[n] = \begin{cases} TDOA_i[n-1] & \text{if GCC-PHAT}_i[n] < Thr_{noise} \\ TDOA_i[n] & \text{if GCC-PHAT}_i[n] \geq Thr_{noise} \end{cases} \tag{5.12}
\]
where T hrnoise is the “noise threshold”, defined as the minimum correlation value at which
it can be considered that the correlation is returning feasible results. It should be considered
independently in every meeting as the correlation values are dependent not only on the signal
itself but also on the microphone distribution in the different meeting rooms. In order to find an
appropriate value for it, the distribution of computed correlation values needs to be evaluated
for each meeting. In the diarization system presented for RT05s (Anguera, Wooters, Peskin and Aguilo 2005) a constant threshold was fixed for all meetings. This caused the system to filter
out a high amount of delays in some meetings while keeping non-speech segments unmodified
from others. For RT06s a threshold was computed for each meeting as the value that filters out
the 10% of lower cross-correlation values. This considers that in each meeting there are 10% of
frames that either are non-speech or unreliable in terms of TDOA estimation.
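A minimal sketch of this per-meeting thresholding and of the continuity filter of equation 5.12 is shown below:

```python
import numpy as np

def continuity_filter(tdoas, gcc_values, rel_fraction=0.10):
    """Noisy-TDOA filtering (eq. 5.12) with a per-meeting relative threshold.

    tdoas:      (C,) selected TDOA value per segment for one channel.
    gcc_values: (C,) GCC-PHAT value associated with each TDOA.
    rel_fraction: fraction of lowest-correlation segments assumed unreliable
                  (10% as in the RT06s system).
    """
    thr_noise = np.percentile(gcc_values, 100 * rel_fraction)
    filtered = np.array(tdoas, dtype=float)
    for c in range(1, len(filtered)):
        if gcc_values[c] < thr_noise:        # unreliable segment: hold the previous value
            filtered[c] = filtered[c - 1]
    return filtered, thr_noise
```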
Figure 5.4 shows the histogram of the two AMI meetings present in RT05s to illustrate this
change. Such histograms are generated by taking the GCC-PHAT output values for the selected TDOA values, placing them in bins with minimum value 0 and maximum 1, and normalizing them. Most of the meetings present a bimodal histogram like that of AMI 20041210-1052, where the
relative minimum between the modes falls around the 10% of the values. In such case selecting
a noise threshold at 10% absolute (0.1 value applied to the GCC-PHAT output) or finding the
threshold at 10% of all computed values (0.0842 over 1 for AMI 20041210-1052) gives almost
the same result. On the other hand, some meetings, like AMI 20050204-1206, obtain a poor
distribution of GCC-PHAT values, concentrating them in the lower part of the histogram. In
this case there is a big difference between the two kinds of thresholding (0.1 versus 0.0532 for
AMI 20050204-1206).
Figure 5.4: Normalized GCC-PHAT histograms for meetings AMI_20041210-1052 and AMI_20050204-1206, showing the 10% absolute and 10% relative noise thresholds
Within each meeting there is also a slight difference in the distributions of each of the
channel's correlations. It was found that there was no difference between computing an individual threshold for each channel or one global threshold for all channels; therefore, a global threshold was used for all channels in the system.
The second post-processing technique applied to the computed delays is used to select the appropriate delay among the M-best GCC-PHAT values computed at each step.
As pointed out previously, the aim here is to maximize speaker continuity avoiding constant
delay switching in the case of multiple speakers, and to filter out undesired steering towards
spurious noises present in the room.
As seen in figure 5.5 a 2-level Viterbi decoding of the M-best TDOA computed was applied.
The first level consists of a local individual-channel decoding where the 2-best delays are chosen
from the M-best delays computed for that channel at every segment. The second level of decoding considers all combinations of such 2-best delays across all channels and selects the final TDOA values that are most consistent across all of them. For each step one needs to define the topology of the
Viterbi algorithm and the emission and transition probabilities to be used. The selection of a
2-step algorithm is due in part to computational constraints as an exhaustive search over all
possible combinations of all M-best values for all channels would easily become computationally
prohibitive.
Both steps choose the most probable (and second most probable) sequence of hidden states
where each item is related to the TDOA values computed for one segment. In the first step the
set of possible states at each instant is given by the computed M-best values. Each possible state
has an emission probability for each processed segment, equal to the GCC-PHAT value for each
delay (P1m [c], where m is the m-best value being considered and c is the current segment).
The transition probability between two states is taken as inversely proportional to the distance between their delays. Given two nodes, i and j, at segments c and c − 1, respectively, the transition probability between them is

\[
Tr_1(i, j)[c] = \frac{\operatorname{max\_diff}(i, j) - \big| TDOA_i[c] - TDOA_j[c-1] \big|}{\operatorname{max\_diff}(i, j)} \tag{5.13}
\]

where max_diff(i, j) = max(|TDOA_i[c] − TDOA_j[c − 1]|, ∀i, j). This way all transition probabilities are locally bounded between 0 and 1, assigning a 0 probability to the furthest-apart delay pair.
This first Viterbi level aims at finding the best two TDOA values that represent the meeting’s
speakers at any given time. By doing so it is considered that the system will be able to choose the
most appropriate/stable TDOA for that segment and a secondary delay, which can come from
interfering events, other speakers or the same speaker’s echoes. Such TDOA values are any two
(not allowing the paths to collapse) of the M-best computed previously by the system, and are
chosen exclusively based on their distance to surrounding TDOA values and their GCC-PHAT
values.
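The following sketch illustrates the first decoding level for a single channel, simplified to select only the single best path (the thesis implementation keeps the two best non-overlapping paths); the transition term follows equation 5.13 and the GCC-PHAT values act as emission probabilities:

```python
import numpy as np

def viterbi_best_tdoa(mbest_tdoa, mbest_gcc, trans_weight=25.0):
    """First-level Viterbi selection of one TDOA path for one channel (sketch).

    mbest_tdoa: (C, M) M-best TDOA candidates for each of the C segments.
    mbest_gcc:  (C, M) their GCC-PHAT values (emission probabilities).
    trans_weight: weight applied to the transition term in the log domain.
    Returns the index of the chosen candidate at every segment.
    """
    C, M = mbest_tdoa.shape
    eps = 1e-12
    score = np.log(mbest_gcc[0] + eps)                # initial per-state scores
    back = np.zeros((C, M), dtype=int)
    for c in range(1, C):
        # Distance between every candidate of segment c-1 (rows) and c (columns).
        dist = np.abs(mbest_tdoa[c][None, :] - mbest_tdoa[c - 1][:, None])
        trans = 1.0 - dist / (dist.max() + eps)       # eq. 5.13, bounded in [0, 1]
        total = score[:, None] + trans_weight * np.log(trans + eps)
        back[c] = np.argmax(total, axis=0)
        score = total[back[c], np.arange(M)] + np.log(mbest_gcc[c] + eps)
    path = np.zeros(C, dtype=int)
    path[-1] = int(np.argmax(score))
    for c in range(C - 1, 0, -1):                     # backtrace of the best path
        path[c - 1] = back[c, path[c]]
    return path
```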
The second level Viterbi decoding finds the best possible path given the set of hidden states generated by all possible combinations of delays from the 2-best delays obtained earlier for each channel. Let g^i[c] = [g_1^i[c] ... g_{N−1}^i[c]] be the vector of dimension N − 1 (the number of channels for which TDOA values are computed) describing which TDOA value is used for each channel, with each g_n^i[c] indicating the position within the 2-best list of TDOA values considered for channel n at segment c. Given also that any xcorr_phat_n^{g_n^i[c]}[c] value (the GCC-PHAT value associated to the g_n^i[c]-best TDOA value for channel n at segment c) takes values in [0, 1], the emission probability of each considered TDOA combination g^i[c] at segment c is taken as the product of the individual GCC-PHAT values, computed in the log domain as
\[
P_2(i)[c] = \sum_{n=1}^{N}\log\Big( xcorr\_phat_n^{\,g_n^i[c]}[c] \Big) \tag{5.14}
\]
which can be considered as the extension of P1 (i)[c] to the case of multiple TDOA values where
we consider that the different dimensions are independent from each other (interpreted as in-
dependence of the TDOA values obtained for each channel at segment c, not their relationship
with each other in space along time).
On the other hand, the transition probabilities are also computed in a similar way as in the first step, but in this case they introduce a new dimension to the computation, as now a vector of TDOA values needs to be taken into account. As was done with the emission probabilities, the total distance is considered as the sum of the individual distances of each element. Given TDOA(g_n^i, n)[c] as the TDOA value for the g_n^i[c]-best element in channel n for segment c, the transition probability between TDOA position vectors i and j is determined by
\[
Tr_2(i, j)[c] = \sum_{n=1}^{N}\frac{\operatorname{max\_diff}(i, j, n) - \big| TDOA(g_n^i, n)[c] - TDOA(g_n^j, n)[c] \big|}{\operatorname{max\_diff}(i, j, n)} \tag{5.15}
\]

where now max_diff(i, j, n) = max(|TDOA(g_n^i, n)[c] − TDOA(g_n^j, n)[c]|, ∀i, j, n).
This second level of processing considers the relationship in space present between all chan-
nels, as they are presumably steering to the same point in space. By performing a decoding
over time it selects the TDOA vector elements according to their distance to the vectors in its
surroundings.
In both cases the transition probabilities are weighted to emphasize their effect on the decision of the best path, in the same way as in ASR systems (by a multiplicative weight in the log domain). It will be shown in the experiments section that a weight value of 25 for both cases is what optimized the diarization error given the development set.
To illustrate how the two-step Viterbi decoding works on the TDOA values let us consider
figure 5.6. It shows a situation where four microphones are used in a room where two speakers
are talking to each other, with some overlapped speech. There are also one or more noisy events of short duration and general room noise, both represented by a "noise" source. Given one of the microphones as a reference, the delay to each of the other microphones is computed, resulting in delays from speech coming from either speaker (D(s[1, 2], m)) or from any of the noisy events (D(nx, m)) with m = 1 . . . 3.
Figure 5.6: Example meeting room with two speakers (spkr1, spkr2), a noise source and four microphones, showing the delays D(s1,m), D(s2,m) and D(nx,m) from each source to each channel
For a particular segment the M-best TDOA values from the GCC-PHAT cross correlation
function are computed. The first Viterbi step determines for each individual channel the 2-best
paths across time for all the meeting. Figure 5.7 shows a possible Viterbi trellis for the first
step for channel 1, where each column represents the M-best TDOA values computed for one
segment. In this example four segments were considered where two speakers are overlapping each other, and there are also some occasional noisy events. For any given segment the Viterbi algorithm finds the two best paths (forced not to overlap with each other) according to the distance of their delays to the chosen delays in the neighboring segments (transition probability) and to their cross-correlation values (emission probability). The resulting paths could be as illustrated in Figure 5.7.
The third computed segment contains a noisy event that is well detected by channel 1 and
the reference channel, and therefore it appears as the first in the M-best computed TDOA list.
Figure 5.7: Two-step TDOA Viterbi decoding example, step 1 for an individual channel
The Viterbi decoding can avoid selecting this event, as its delay differs too much from the best delays in its surroundings and both speakers also appear with high correlation values. On the
other hand, the first and second segments contain the delays referring to the true speakers in the
first and second-best positions, although alternated in both segments. This example illustrates
a possible case where they cannot be correctly ordered and therefore there is a quick speaker
change in the first and second-best delay paths in that segment.
The second step Viterbi decoding is intended to add an extra layer of robustness for the
selection of the appropriate delays by considering all the possible delay combinations from all
channels. Figure 5.8 shows the trellis formed by considering for each segment (in columns) all
possible combinations of m-best delays (g_n^i[c]) for the 3-channel case.
In this step only the best path is selected according to the overall combined distances and
correlation values among all possible combinations. In this example the algorithm is able to solve
the order mismatch from the previous step, selecting the delays relative to speaker 1 for all the
segments. The current implementation also computes the 2-best paths in this step and outputs a signal steered at the two sets of TDOA values, although the diarization algorithm only uses the first of them.
In order to take advantage of the second (or more) delays steering at the overlapping speakers in the meeting, it is necessary to achieve further progress in reliable speaker overlap detection algorithms, which remains future work at the end of this thesis.
In the implementation of the second level Viterbi decoding a big burden in computation time
could be faced depending on the amount of microphones to be processed. In the second level
Viterbi the amount of possible states for each instant k is defined by

\[
S[k] = M_2^{\,D}, \qquad D = N - 1 \tag{5.16}
\]

where M_2 is the number of best TDOA values extracted from the M-best values in the first Viterbi level (in this implementation M_2 = 2) and D is the number of channels for which TDOA values are computed. As the amount of states grows exponentially when increasing D, it becomes computationally prohibitive for meetings with 16 or more microphones
available (for N = 17, M_2 = 2, S[k] = 65536). For a feasible implementation, when N > 5 the pool of microphones is split into blocks of 5 and the Viterbi is computed in each block
independently. This is a suboptimal solution as not all microphones are used to optimize the
delays and therefore it is not certain that all blocks will converge to the same solution. It is
though much faster in processing time and it was not observed to degrade the overall performance
compared to using all microphones together.
Once all information is computed from the input signals and the optimum TDOA values are
selected, it is time to output the enhanced signal and any accompanying information to be
used by the subsequent systems. In this module several algorithms were used to account for the
differences between the standard linear microphone array theory and the implementation in this
module.
In the typical formulation of the filter&sum processing, the additive noise components on each of
the channels are expected to be random processes with very similar power density distributions.
This allows the noise on each channel to be statistically cancelled and the relevant signal en-
hanced when the delay-adjusted channels are summed. In standard beamforming systems, this
noise cancellation is achieved through the use of identical microphones placed only a few inches
apart from each other.
In the meeting room it is considered that all of the distant microphones form a microphone
array. However, by having different types of microphones there is a change in the characteristics
of the signal being recorded and therefore a change in the power density distributions of the
resulting additive noises. Also when two microphones are far from each other, the speech they
record will be affected by noise of a different nature, due to the room’s impulse response, and
will have different amplitude depending on the position of the speaker talking.
This issue is addressed by weighting each channel in the filter&sum processing. The weights
are adapted continuously during the meeting. This is inspired by the fact that the different
channels will have different signal quality depending on their relative distance to the person
speaking, which probably changes constantly during a recording.
The weight for channel n at segment c (Wn [c]) is computed in the following way:
\[
W_n[c] = \begin{cases} \dfrac{1}{N} & c = 0 \\[2mm] (1-\alpha)\cdot W_n[c-1] + \alpha\cdot avexcorr_n[c] & \text{otherwise} \end{cases} \tag{5.17}
\]
where α is the adaptation ratio, which was empirically set to α = 0.05, c is the segment being
processed, and avexcorrn [c] is the average of the cross-correlation between channel n and all
other channels being all delayed using the selected T DOAn [c] value for that channel:
\[
avexcorr_n[c] = \frac{1}{N-1}\sum_{\substack{j=1 \\ j \neq n}}^{N-1}\sum_{k=0}^{W} x_n[cS - TDOA_n[c] - k]\; x_j[cS - TDOA_j[c] - k] \tag{5.18}
\]

where S and W are respectively the scroll/segment size and the window size of the filter&sum processing.
Although efforts are made to ensure that the TDOA values assigned to each of the channels
are correct, in some cases the signal of one of the channels at a particular segment is itself of
such low quality that its use in the sum would only degrade the overall quality. This usually
happens when the quality of one or more microphones is very different from the others (for
example the PDA microphones in the ICSI meeting room recordings as explained in Janin, Ang,
Bhagat, Dhillon, Edwards, Macias-Guarasa, Morgan, Peskin, Shriberg, Stolcke, Wooters and
Wrede (2004)).
In the filter&sum processing all available microphones in the room are used and a dynamic
selection and elimination of the microphones that could harm the overall signal quality at every
particular segment is performed. The previously defined avexcorr_n[c] is used to determine the channel quality: if avexcorr_n[c] < 1/(4N), then W_n[c] = 0. After checking all the channels for any elimination, the remaining weights are renormalized to sum up to 1.
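A minimal sketch of the weight adaptation (equations 5.17 and 5.18) and the bad-channel elimination rule is shown below:

```python
import numpy as np

def update_channel_weights(prev_weights, avexcorr, alpha=0.05):
    """Adaptive channel weighting with bad-channel elimination (sketch).

    prev_weights: (N,) weights of the previous segment (initialised to 1/N).
    avexcorr:     (N,) average cross-correlation of each delayed channel with
                  all the others for the current segment (eq. 5.18).
    alpha:        adaptation ratio of eq. 5.17.
    """
    prev_weights = np.asarray(prev_weights, dtype=float)
    avexcorr = np.asarray(avexcorr, dtype=float)
    n_ch = len(prev_weights)
    weights = (1.0 - alpha) * prev_weights + alpha * avexcorr   # eq. 5.17
    weights[avexcorr < 1.0 / (4 * n_ch)] = 0.0                  # drop bad channels
    total = weights.sum()
    return weights / total if total > 0 else np.full(n_ch, 1.0 / n_ch)
```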
Once the output weight has been determined for each channel at a particular segment, all the signals are summed up to form the output "enhanced" signal. Acoustic continuity of this output signal needs to be ensured at all times. The theoretical filter&sum equation shown in eq. 2.33 causes discontinuities in the signal at the segment edges due to the mismatch between the summed-up signals of consecutive segments.
A triangular window is therefore used to smooth and reduce the discontinuity between any
two segments, as seen in figure 5.9. At every segment the triangular filter smooths the signal
delayed using that segment’s selected TDOA value with the signals delayed using the TDOA
values from the previous segment. By using the triangular window the system obtains a constant
total value without discontinuities. The actual implementation follows equation 5.19.
\[
y[cS + k] = W_{overall}\bigg( \alpha[k]\sum_{n=1}^{N} w_n[c]\, x_n[cS + k - TDOA_n[c]] + (1 - \alpha[k])\sum_{n=1}^{N} w_n[c]\, x_n[cS + k - TDOA_n[c-1]] \bigg) \tag{5.19}
\]
where S is the segment sample length, c is the segment being processed and k is the sample
within that segment being processed.
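The per-segment output with triangular smoothing (equation 5.19) can be sketched as follows, assuming integer TDOA values in samples and indices that stay within the channel signals:

```python
import numpy as np

def sum_segment(channels, weights, tdoa_cur, tdoa_prev, seg_start, seg_len, w_overall=1.0):
    """One output segment with triangular smoothing between the TDOA values of
    the current and the previous segment (sketch of eq. 5.19)."""
    ramp = np.arange(seg_len) / float(seg_len)       # alpha[k]: 0 -> 1 triangular crossfade
    out = np.zeros(seg_len)
    for x, w, d_cur, d_prev in zip(channels, weights, tdoa_cur, tdoa_prev):
        cur = x[seg_start - d_cur: seg_start - d_cur + seg_len]
        prev = x[seg_start - d_prev: seg_start - d_prev + seg_len]
        out += w * (ramp * cur + (1.0 - ramp) * prev)
    return w_overall * out                           # overall scaling factor of eq. 5.8
```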
In the standard implementation the analysis window overlaps 50% with the segment window,
which agrees with the 50% triangular overlap applied here.

Figure 5.9: Triangular-window smoothing of the delayed channels (Ch.1 ... Ch.N) and overall weighting W_overall to produce the enhanced signal

After all samples from both
overlapping windows are summed, the overall weighting factor computed earlier is applied to ensure that the dynamic range of the filter&summed signal is optimally matched with the available dynamic range of the output file. The resulting signal is further processed by the speaker diarization system, which is described in the next chapter. Together with the acoustic signal, the TDOA values used for the channel delays are also written to an ASCII file for use by the diarization system as features, as explained in the next section.
Performing an acoustic beamforming of the multiple input signals has several advantages,
including the simplicity of the following speaker diarization system, which can be reused from
the broadcast news system as it only needs to compute the output for a single acoustic channel.
Another advantage is the independence of the proposed system to the room layout and number
of microphones.
Acoustic beamforming has the drawback that all spatial information about the speaker location, which is carried by the multiple microphones in the room, is lost in the process. For this reason, when multiple microphones are available (and therefore a beamforming is performed), the speaker location information is reused in the speaker diarization module.
Such information comes from the Time Delay of Arrival (TDOA) values between each micro-
phone and the reference channel. Although extensive research has gone into speaker localization
using multiple microphones (including the identification of each speaker from the others), this
is only possible when the topology and exact location of all microphones is known in advance.
This is not considered in the current implementation of the system as multiple room topologies
are to be processed and, for some of them, the microphone locations are not known.
Apart from the TDOA values, there are other features that could be useful to determine the
difference between speakers. One such possibility is the relative amplitude between the different
channels, which should be able to identify whoever is closer to what microphone, being therefore
an indicator of the speakers' locations. This metric is, however, very correlated with the TDOA values, suffering from the same problems, and therefore has not been considered in this thesis. Further study should be done to indicate whether using both information streams could lead to further improvements in speaker diarization.
For any given set of N channels at frame n (x1 [n] . . . xN [n]) the beamforming system determines
a single vector T DOA[n] with dimension N − 1 obtained from the best TDOA values between
each microphone and the reference. The dimension of the TDOA vector will change depending
on the number of microphones available. This does not indicate a priori that TDOA vectors with lower dimension will discriminate worse between speakers, as it depends not only on the number of microphones but also on the acoustic properties of the room (influencing how accurate the TDOA values are), the microphone topology and the location of the speakers.
For any particular frame, the T DOA[n] vector will contain a set of TDOA values that identify
the location of the main acoustic source towards where the acoustic beamforming is steering.
To exemplify this, figure 5.10 shows the histograms and X-Y plot of the first two dimensions of
the TDOA vectors extracted for all speech frames in the show ICSI 20000807-1000 containing
six speakers (which is the actual number of participants in that meeting). The histograms show the existence of around 6 speakers, with some of them being closer together than others. In the X-Y
plot the higher density places indicate higher probability of speakers. The points that fall far
from any of the speakers are due to silence regions not eliminated by the post-processing step
in the beamforming, or by acoustic events other than speakers (pen drops, door slams, etc).
There are also some vectors falling along one of the speaker axes indicating that a speaker is
most probably active during that frame instant but the different dimensions do not agree. This
can be due to overlap regions where each microphone points at a different speaker or errors in
the TDOA approximation. This problem is common when computing TDOA values and is one
of the issues addressed by the double-Viterbi post-processing algorithm. The remaining TDOA
vectors not detected by the post-process tend to cause errors in the diarization algorithm by
causing models to fit such data as an independent speaker.
The use of delays for speaker diarization using the presented diarization system was initiated by J.M. Pardo and presented in Pardo et al. (2006a). Later, the same authors proposed in Pardo et al. (2006b) the combination of TDOA values and acoustics in order to improve results even further. Also at ICSI, work by Gallardo-Antolin, Anguera and Wooters (2006) explored other related features.
[Figure 5.10: histograms of the TDOA values in channel 1 and channel 2, and X-Y plot of TDOA channel 1 versus TDOA channel 2, for all speech frames of meeting ICSI 20000807-1000.]
In order to use these delay vectors to add extra information to the speaker diarization
module they are treated as a feature vector and modeled by a GMM. As in Lathoud,
McCowan and Odobez (2004), a single Gaussian is used to model each of the clusters initially created
in the diarization system. Figure 5.11 indicates the way that features computed from the
acoustic signal and the TDOA values are fused.
Upon starting the diarization two feature streams are available for processing: the acoustic
stream (composed of 19 MFCC features, computed every 10ms) and the TDOA stream,
computed in the beamforming module. In theory the same TDOA values that are used for the
beamforming process could be reused in this module, but in practice, in order to obtain synchrony
between acoustics and TDOA values, they are recomputed every 10ms. The use of the same TDOA
values was also tested by repeating each value several times (25 times for a 250ms scroll)
with slightly worse (but acceptable) results, showing its feasibility in case of computational
constraints.
Figure 5.11: Fusion of TDOA values and acoustic features within the speaker diarization module. [Block diagram: system initialization and cluster initialization, followed by acoustic and TDOA model training, Viterbi segmentation (fusion), BIC-based cluster-pair distances and closest-pair selection with stopping criterion (fusion), merging of the closest acoustic and TDOA model pairs, and final segmentation and system output.]
In order to process the signal using both feature streams the system maintains two different
and independent HMM speaker model sets while keeping a single speaker clustering, which is
defined using both streams. The speaker models use the same structure as in the standard system
(an ergodic HMM) and share the number of speakers, defining a model pair for each speaker
cluster, but each model can have a different complexity, depending on the optimum way that
the data in each stream should be modeled.
The first step in the system is to initialize the K initial speaker clusters. This entails splitting
the input data among these K clusters. This is currently done in the same way as in the standard
system, using solely the acoustic data stream. Once an initial clustering is defined, the initial
models are created both for the acoustics and the TDOA values and the system enters the
segmentation/training step. Both speaker models are used in the Viterbi decoding to determine
the optimum path among the different speaker clusters by considering the joint log-likelihood
for any given frame as
$$L(x_{aco}[n], x_{del}[n] \,|\, \Theta_{aco}, \Theta_{del}) = W_1 \cdot L(x_{aco}[n] \,|\, \Theta_{aco}) + W_2 \cdot L(x_{del}[n] \,|\, \Theta_{del}) \qquad (5.20)$$
where Θaco, xaco[n] are the acoustic model and data, Θdel, xdel[n] are the delay model and data,
and W1, W2 weight the effect of each stream in the decoding, with W1 + W2 = 1. In this
formulation each stream is considered statistically independent from the other, which
is a plausible assumption given that acoustic and TDOA features convey very different
information. If more feature streams are available, this formulation can be expanded, with each
feature likelihood being weighted by a different Wi. When running the Viterbi decoding, a minimum
duration for a speaker segment of MD = 3 seconds (optimized on the development data) is set
for both models in order to avoid constant changes in the clustering. Once a new speaker
clustering is defined, the models are retrained independently.
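As an illustration of how the per-frame fused score of eq. 5.20 could be computed, the following minimal sketch (hypothetical names; per-cluster GMMs are assumed to expose a per-frame log-likelihood such as sklearn's GaussianMixture.score_samples) combines the two streams for one speaker cluster:

```python
import numpy as np

def fused_log_likelihood(x_aco, x_del, gmm_aco, gmm_del, w1, w2):
    """Weighted per-frame log-likelihood of eq. 5.20 for one speaker cluster.

    x_aco: (T, 19) MFCC frames; x_del: (T, N-1) TDOA frames (both every 10 ms).
    gmm_aco, gmm_del: acoustic and TDOA cluster models with a score_samples()
    method returning per-frame log-likelihoods (e.g. sklearn GaussianMixture).
    """
    assert abs(w1 + w2 - 1.0) < 1e-6        # stream weights must sum to 1
    ll_aco = gmm_aco.score_samples(x_aco)   # (T,) acoustic log-likelihoods
    ll_del = gmm_del.score_samples(x_del)   # (T,) TDOA log-likelihoods
    return w1 * ll_aco + w2 * ll_del        # fused emission score per frame
```

These fused scores would then act as the emission probabilities of the ergodic HMM during the Viterbi decoding, subject to the 3-second minimum duration constraint.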
The second step where the feature stream fusion takes place is the clustering step, where
the closest cluster pair is selected and the clusters and models are merged (or the processing finishes
if the stopping criterion decides so). As explained in 3.1, the cluster pair comparison metric
of choice is a variant of the BIC metric where the penalty term is eliminated by constraining the
complexity of the different models being compared. In this particular case the formulation of
the BIC contemplating the fusion of both streams can be defined directly from equation
5.20 as
$$\Delta BIC(A, B) = W_1 \cdot \Delta BIC_{aco}(A, B) + W_2 \cdot \Delta BIC_{del}(A, B) \qquad (5.21)$$
where A, B are the two clusters whose distance is being computed, and W1, W2 are the same
weights as in eq. 5.20. This can also be directly expanded to use more than 2 streams.
If frame or segment purification are to be applied, they are applied using only the acoustic
frames. In the case of frame purification this is because the TDOA models react in a different
way to the non-speech data than the acoustic models do.
The same stopping criterion as in the regular system is used. As long as the system does not
decide to stop the clustering process, the closest cluster pair is selected and merged, together
with the models belonging to those clusters. In the case of the TDOA models the merging is done
by joining both existing models and retraining the overall model using all the data from both
clusters. In the case of the acoustic models the merging is either done in the same way or modified
according to the new complexity determined for the resulting model.
Whenever the system decides to stop clustering, a final Viterbi decoding is performed
using again both feature streams, with a smaller minimum duration, as explained for the meetings
system in section 3.3.
As seen in equations 5.21 and 5.20, in order to combine the acoustic and TDOA features one
needs to determine an optimum set of weights Wi that define how relevant each stream is. Without
an automatic way to determine such values, they need to be found using development data by
performing a sweep of the Wi parameters and optimizing the Diarization Error Rate (DER) score. This
constitutes a robustness problem due to the possibly large differences between the development
and test sets in terms of the relative importance of the features. It also becomes a tedious job
if the number of parallel feature streams is large. Some of the factors that reduce the ability of a
feature set to optimally represent the speakers in a recording (and therefore call for its relevance to
be reduced) are:
• TDOA values in noisy environments (where approximation of the correct TDOA value is
difficult) or with multiple impulsive noises.
When setting the values by hand they are normally defined equally for all meetings and
therefore they do not account for peculiarities due to the meeting room (noisier rooms) or to
the nature of the meetings (the kind of usual attendees or whether they move from their seats). The
automatic weight setting algorithm presented here is able to compute the optimum values for
each meeting independently.
Prior art in weight selection for feature fusion needs to be searched for in areas other
than speaker diarization, such as speaker verification and biometric fusion techniques (Fiérrez-
Aguilar, Ortega-García and González-Rodríguez (2003), Ross, Jain and Qian (2001), Verlinde,
Chollet and Acheroy (2000)) and speech recognition (Misra, Bourlard and Tyagi (2003),
Ikbal, Misra, Sivadas, Hermansky and Bourlard (2004), Li (2005)). Throughout the literature, a
widely used technique for automatic weighting of different feature streams is based on the entropy
of the feature vectors.
Initial tests were performed using the inverse entropy as a relative weight to see how discriminant
each feature stream was. This was done by obtaining the weights on a frame basis via the
inverse entropy of the posterior probabilities of the cluster models given the data. For MFCC,
PLP and other acoustic features these entropies were comparable to each other and could therefore
determine a correct relative weight between features, as shown in Misra et al. (2003). When
using it with TDOA values, their GMM models are such that low entropy values are obtained for
almost every frame, regardless of how accurately the TDOA values represent a real speaker
position.
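For reference, a rough sketch of the frame-level inverse-entropy weighting tried in those initial tests (following the idea described in Misra et al. (2003); all names here are illustrative) could be:

```python
import numpy as np

def inverse_entropy_weights(posteriors_per_stream, eps=1e-12):
    """Per-frame stream weights from the inverse entropy of cluster posteriors.

    posteriors_per_stream: list of (T, K) arrays, one per feature stream, with
    the posterior probability of each of the K cluster models given each frame.
    Returns an (S, T) array of weights normalized to sum to 1 at every frame.
    """
    inv_ent = []
    for post in posteriors_per_stream:
        p = np.clip(post, eps, 1.0)
        entropy = -np.sum(p * np.log(p), axis=1)   # (T,) entropy per frame
        inv_ent.append(1.0 / (entropy + eps))      # low entropy -> large weight
    inv_ent = np.asarray(inv_ent)                  # (S, T)
    return inv_ent / inv_ent.sum(axis=0, keepdims=True)
```

As discussed above, the TDOA GMMs yield very low entropies for almost every frame, so this weighting degenerates for that stream and was not adopted.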
The proposed technique in this thesis uses the Bayesian Information Criterion (BIC) to
compare how well each feature stream differentiates between clusters in order to determine an
appropriate stream weighting. The ∆BIC values are independent of the complexity and topology
of the models being used and are a good indication of how close two clusters are.
Figure 5.12: First merge cluster-pair BIC values and histogram for acoustic and TDOA features
Given the ∆BIC values between all cluster pairs for the acoustic and TDOA models, figure
5.12 shows the values and their histograms for the meeting EDI 20050216-1051 from the RT06s
evaluation data set, computed for all pairs (given 22 initial clusters) at the first iteration of
the clustering. The TDOA values are much bigger on average and contain more positive values
than the acoustic values. If a weight W1 = 0.5 (equal relevance) were used, the TDOA ∆BIC
values would mask the acoustic ones and alone decide which pair to merge, possibly leading to errors as
not all the information is considered. In order to allow the different feature streams to contribute
on equal terms to the merging decision, both ∆BIC value sets need to be transformed to
the same scale using the W1 weight. In this way the TDOA values, with overall high ∆BIC,
are penalized with respect to the acoustic values so that they become comparable to each other. For a general
case of M feature streams, the weight Wi assigned to each stream i is defined as
$$W_i = \frac{1/\sqrt{P_i}}{\sum_{j=1}^{M} 1/\sqrt{P_j}} \qquad (5.22)$$
where Pi is computed from the N ∆BIC values obtained for all cluster pairs xj, xk of each
feature stream as
$$P_i = \frac{1}{N} \sum_{j=1}^{N-1} \sum_{k=j+1}^{N} \Delta BIC_i^2(x_j, x_k) \qquad (5.23)$$
The automatic computation of the Wi weights is performed at the first clustering step, when
the ∆BIC values are computed. At the initial segmentation step no weight has been automatically
defined yet, and therefore some initial weight still needs to be determined by hand, or it can
be set to an uninformative W1 = W2 = 0.5.
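A minimal sketch of the automatic weight computation of equations 5.22 and 5.23 (illustrative names; the ∆BIC values are assumed to be already computed for all current cluster pairs of each stream) is:

```python
import numpy as np

def bic_based_stream_weights(delta_bic_per_stream):
    """Stream weights W_i following eqs. 5.22 and 5.23.

    delta_bic_per_stream: list with one array per feature stream, holding the
    Delta-BIC values of all cluster pairs for that stream. Returns weights
    that sum to 1, penalizing streams with overall large Delta-BIC values.
    """
    inv_sqrt_p = []
    for dbic in delta_bic_per_stream:
        dbic = np.asarray(dbic, dtype=float)
        p_i = np.mean(dbic ** 2)             # eq. 5.23: average squared Delta-BIC
        inv_sqrt_p.append(1.0 / np.sqrt(p_i))
    inv_sqrt_p = np.asarray(inv_sqrt_p)
    return inv_sqrt_p / inv_sqrt_p.sum()     # eq. 5.22: normalization over streams

# For the acoustic + TDOA case:
# w_aco, w_tdoa = bic_based_stream_weights([dbic_acoustic_pairs, dbic_tdoa_pairs])
```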
On subsequent clustering iterations the models usually represent the clusters better and
obtain more accurate ∆BIC values. In order to allow the system to refine the weight
as the merging iterations progress, the ∆BIC values are kept for all cluster pairs that disappeared
during previous iterations and the values of the existing pairs are recomputed. Then a new weight is computed
taking into account both old and updated values, allowing for a weight adaptation that
contains enough samples for a robust computation.
To illustrate the effect of the weight adaptation as the system iterates, figure 5.13 shows
the evolution of the Wi weight over the initial 10 iterations of the algorithm for meeting
CMU 20050912-0900 (in the RT06s data set).
Figure 5.13: Acoustic weight evolution with the number of iterations for meeting CMU 20050912-0900. [Plot of the acoustic weight (y axis, 0.68 to 0.88) versus the number of weight-setting iterations (1 to 10).]
It is common on all meetings to start with bigger values for the acoustic part and to see it reduced over time, converging to a final value
(W1 = 0.71 in this case, converging after 5 iterations). The optimum weight always favors
the acoustic values over the TDOA values for all shows, both when computed automatically
and when set manually. By doing it automatically each show obtains its own optimum value, which would
have been set to W1 = 0.9 manually for the RT06s set (including this meeting).
In the experiments section the effect of the number of iterations over which the weight is
computed on the final DER score is evaluated. It is found that the weights always converge to
constant values with optimum DER values, therefore leading to a robust solution with one less
tuning parameter.
Chapter 6
Experiments
This chapter describes the experiments performed with the different proposed techniques in order
to evaluate their suitability for the task of speaker diarization for meetings. This is done by first
defining a baseline system to compare all algorithms to. Such baseline system is derived from the
broadcast news mono-channel system with several improvements that were considered standard
and necessary for the system as adapted to meetings.
Then the set of metrics used in the evaluation of the different techniques is described in
detail. Next, the databases that are used to compare the algorithms' performance with that of
the baseline, and the reference segmentations used in the experiments, are explained
and justified. Finally, the different experiments with the proposed algorithms are performed
and the results are discussed.
When comparing the results of new speech-related algorithms one usually faces some sort
of “flakiness”. This term started being used for speaker diarization during the RT04f workshop
(NIST Fall Rich Transcription Evaluation website 2006) to account for two phenomena
that were common to all diarization systems presented in that evaluation. These were the big
variance of the scores among all evaluated shows and the extreme susceptibility of the scores to
experience big changes upon small modifications of the tuning parameters.
As in some other disciplines within the speech technologies, when comparing the performance
of algorithms against a baseline, the selection of the baseline, databases and test conditions
makes a difference in showing when the proposed algorithms perform best.
In many cases, due to flakiness, testing the same algorithms with two different databases or
baseline systems leads to two very different results, one proving the validity of the proposed
algorithm and the other otherwise.
6.1 Meetings Domain Experiments Setup
In order to run meaningful and fair experiments using the algorithms proposed in this thesis
one needs to define:
• A baseline system, which acts as the comparison ground for all systems proposed and
tested.
• Common development and test datasets, based on the NIST RT evaluation datasets,
in order for results to be comparable between experiments and with systems outside of this
thesis.
• A set of metrics in order to evaluate such systems with commonly used and available
techniques.
In the following subsections each of these items is described as it has been used in this thesis
for most of the experiments with the system's main blocks.
Taking as a reference the block diagram in figure 3.5, experiments were conducted on three
of the main blocks, namely the filter&sum module, the speech/non-speech module and the
mono-channel speaker diarization module. For each block a baseline was defined to suit its
characteristics and to allow for the selection of its optimum parameters. The initial
Wiener filtering of the signal was not analyzed, as it was used without modification from its
original implementation outside of the scope of this thesis.
The baseline system used for the experiments on the diarization module and on the speech/non-speech
module corresponds to a modified version of the broadcast news system presented for
the NIST RT04f evaluation as described in section 3.1. This corresponds to a mono-channel
system (or Single Distant Microphone, SDM, in the meetings domain) with the following main
differences from RT04f:
• The speech/non-speech (spnsp) detector used in the experiments for the hybrid spnsp algorithm
is composed of a two-state HMM model trained with meetings data, as it was used
in the RT05s evaluation and explained in section 3.1. For the speaker diarization module
the proposed hybrid spnsp detector was used instead, with parameters equal to the values
used for the RT06s evaluation (see Anguera, Wooters and Pardo (2006b), (Anguera, Wooters
and Pardo 2006a)). These use the parameter values optimized in the spnsp experiments
section.
• During the agglomerative clustering processing the same speaker turn minimum duration
is applied as in the broadcast news system (3 seconds). Before outputting the resulting
segmentation, a final segmentation step is performed using the same speaker models but
reducing the minimum duration to 1.5 seconds to allow for smaller speaker turns to be
properly detected.
• The HMM acoustic models used in the segmentation of the data do not have any maximum
time constraint, as explained in section 4.2.3, to allow the speaker segments to be as long
as the acoustics dictate. As shown in Anguera, Wooters and Hernando (2006a) this does not
change the DER of the systems much but allows longer speaker segments to be created.
• A few bugs regarding floating-point inexactitudes were fixed, which slightly
changed the system outputs.
• The BIC-based stopping criterion is used in all experiments in order to stop clustering
when the optimum number of clusters is reached.
The baseline system used for experiments on the beamforming module is the one submitted
to the RT06s NIST evaluation campaign. This contains all the modules explained in
section 5.2, with their parameters optimized using a subset of 10 meetings from the development
data available for RT06s.
6.1.2 Databases
The datasets used in the experiments in this thesis were obtained from the data available for
the Rich Transcription (RT) evaluations in the meetings domain. So far the evaluations on
meetings have been RT02, RT04s, RT05s and RT06s. In the latter two years only the conference
room type data has been used, as it contains a richer variety of speakers and characteristics
matching more closely the aim of the algorithms presented in the thesis.
From all available datasets, two groups have been defined as development and test. The
RT02, RT04s and RT05s sets form the development set, with a total of 24 meeting excerpts,
ranging from 10 to 12 minutes in duration each. The RT06s set has been used as test set
(with 8 meeting excerpts) to compare the system improvements on data not used to tune its
parameters. Table 6.1 summarizes the data available in each of the RT sets used. For a
complete list of the individual files refer to appendix B.
These sets have a few special characteristics that need to be taken into account. On one
hand, the development set contains four meetings for which only one microphone is available.
These are two pairs of CMU meetings recorded for the RT02 and RT04s evaluations. They
are not suitable for evaluating the beamforming performance but are left in the development data
to obtain fair and comparable results.
On the other hand, the meeting NIST 20050412-1303 from the RT05s dataset contains one
speaker who participated in the meeting through a telephone device. As will be described
later on, using forced alignments to robustly evaluate the data leads to results where
this speaker is not included in the reference files, and therefore causes a big bias in the scores.
Depending on the test performed, this meeting will be eliminated to allow for a fair comparison
(when doing so it will be clearly stated).
The two main metrics used to evaluate the algorithms presented in this thesis are the Diarization
Error Rate (DER) and the Signal-to-Noise Ratio (SNR).
The main metric used for the speaker diarization experiments is the Diarization Error Rate
(DER) as described and used by NIST in the RT evaluations (NIST Fall Rich Transcription on
meetings 2006 Evaluation Plan 2006). It is measured as the fraction of time that is not attributed
correctly to a speaker or to non-speech. To measure it, a script named md-eval-v21.pl (NIST
MD-eval-v21 DER evaluation script 2006), developed by NIST, was used.
As per the definition of the task, the system hypothesis diarization output does not need to
identify the speakers by name or definite ID, therefore the ID tags assigned to the speakers in
the hypothesis and in the reference segmentation do not need to be the same. This is unlike
the non-speech tags, which are marked as unlabelled gaps between two speaker segments, and
therefore do implicitly need to be identified as such.
The evaluation script first performs an optimum one-to-one mapping of all speaker label IDs
between the hypothesis and reference files. This allows the scoring of different ID tags between the
two files. The Diarization Error Rate score is computed as
$$DER = \frac{\sum_{s=1}^{S} dur(s) \cdot \left( \max(N_{ref}(s), N_{hyp}(s)) - N_{correct}(s) \right)}{\sum_{s=1}^{S} dur(s) \cdot N_{ref}(s)} \qquad (6.1)$$
where S is the total number of segments in which both reference and hypothesis files
contain the same speaker(s) pair(s); it is obtained by collapsing together the hypothesis and
reference speaker turns. The terms Nref(s) and Nhyp(s) indicate the number of speakers speaking
in segment s, and Ncorrect(s) indicates the number of speakers that speak in segment s and have
been correctly matched between reference and hypothesis. Segments labelled as non-speech are
considered to contain 0 speakers. When all speakers/non-speech in a segment are correctly
matched, the error for that segment is 0.
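As a minimal illustration of eq. 6.1 (not the NIST md-eval script itself), and assuming the segments have already been collapsed and the optimal speaker mapping applied, the DER could be computed as:

```python
def diarization_error_rate(segments):
    """DER of eq. 6.1 over collapsed scoring segments.

    segments: iterable of (dur, n_ref, n_hyp, n_correct) tuples, where dur is
    the segment duration, n_ref and n_hyp are the number of reference and
    hypothesis speakers in it, and n_correct the correctly matched speakers.
    Non-speech segments have 0 speakers.
    """
    error_time = sum(d * (max(nr, nh) - nc) for d, nr, nh, nc in segments)
    scored_time = sum(d * nr for d, nr, nh, nc in segments)
    return error_time / scored_time

# Example: a 1 s segment with 2 reference speakers of which 1 is found, plus a
# 0.5 s false-alarm segment, over 4 s of scored speaker time -> DER = 0.375
# diarization_error_rate([(2.0, 1, 1, 1), (1.0, 2, 1, 1), (0.5, 0, 1, 0)])
```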
The DER error can be decomposed into the errors coming from the different sources, which
are:
• Speaker error: percentage of scored time that a speaker ID is assigned to the wrong speaker.
This type of error does not account for speakers in overlap not detected or any error coming
from non-speech frames. It can be written as
$$E_{Spkr} = \frac{\sum_{s=1}^{S} dur(s) \cdot \left( \min(N_{ref}(s), N_{hyp}(s)) - N_{correct}(s) \right)}{T_{score}} \qquad (6.2)$$
where $T_{score} = \sum_{s=1}^{S} dur(s) \cdot N_{ref}(s)$ is the total scoring time (the denominator of eq. 6.1).
• False alarm speech: percentage of scored time that a hypothesized speaker is labelled as
non-speech in the reference. It can be formulated as
$$E_{FA} = \frac{\sum_{s=1}^{S} dur(s) \cdot \left( N_{hyp}(s) - N_{ref}(s) \right)}{T_{score}} \quad \forall \; \left( N_{hyp}(s) - N_{ref}(s) \right) > 0 \qquad (6.3)$$
computed only over segments where the reference segment is labelled as non-speech.
• Missed speech: percentage of scored time that a hypothesized non-speech segment corre-
sponds to a reference speaker segment. It can be expressed as
$$E_{MISS} = \frac{\sum_{s=1}^{S} dur(s) \cdot \left( N_{ref}(s) - N_{hyp}(s) \right)}{T_{score}} \quad \forall \; \left( N_{ref}(s) - N_{hyp}(s) \right) > 0 \qquad (6.4)$$
computed only over segments where the hypothesis segment is labelled as non-speech.
• Overlap speaker: percentage of scored time that some of the multiple speakers in a segment
do not get assigned to any speaker. This error usually folds into either EMISS or EFA,
depending on whether it is the reference or the hypothesis that contains the unassigned speakers.
If multiple speakers appear in both the reference and the hypothesis, the error produced
belongs to ESpkr.
When evaluating performance, a collar around every reference speaker turn can be defined
which accounts for inexactitudes in the labelling of the data. NIST estimated that
a ±250ms collar could account for all these differences. When people overlap each
other in the recording it is stated so in the reference file, with as many as 5 speaker turns
being assigned to the same time instant. As pointed out in the denominator of eq. 6.1, the total
evaluated time includes the overlaps. Errors produced when the system does not detect any or
some of the multiple speakers in overlap count as missed speaker errors.
Once the performance is obtained for each individual meeting excerpt, a time-weighted
average is computed over all meetings in a given set to obtain an overall average score. The scored
time is used for such weighting, as it indicates the total time (overlapped speakers included)
that has been evaluated in each excerpt.
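For completeness, the set-level score is just this scored-time-weighted mean (illustrative helper, assuming the per-meeting scored times and DER values are already available):

```python
def weighted_average_der(per_meeting):
    """Time-weighted average DER over a set of meetings.

    per_meeting: iterable of (scored_time, der) pairs, one per meeting excerpt.
    """
    total_time = sum(t for t, _ in per_meeting)
    return sum(t * der for t, der in per_meeting) / total_time
```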
Signal-to-Noise Ratio
The SNR is a metric based on the power ratio between the signal and the noise. It is mainly
used in signal processing applications to evaluate how the desired signal stands out from the
background noise. In this thesis it was considered useful to measure how much quality the
resulting speech signal has after the beamforming module. In speech signals it is not clear what
part belongs to silence and what to speech, therefore several methods can be applied to compute
an SNR approximation. Each computation method can give quite different values from the others,
and therefore comparisons should only be made using the same estimation algorithm. In this
thesis the method used is the one in the NIST Speech Quality Assurance Package (SPQA) (NIST Speech
tools and APIs 2006), described in detail in section 3.2.1.
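For illustration only (this is the basic power-ratio definition, not the SPQA estimation used in the thesis), an SNR value could be approximated from a waveform and speech/non-speech labels as:

```python
import numpy as np

def snr_db(samples, speech_mask):
    """Rough SNR estimate in dB given a waveform and a boolean speech mask.

    samples: 1-D array with the signal; speech_mask: boolean array of the same
    length marking speech frames. The thesis instead uses the NIST SPQA
    estimation described in section 3.2.1.
    """
    speech_power = np.mean(samples[speech_mask].astype(float) ** 2)
    noise_power = np.mean(samples[~speech_mask].astype(float) ** 2)
    return 10.0 * np.log10(speech_power / noise_power)
```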
The use of predefined reference segmentations is necessary to compute the DER given the system
hypotheses. The data used in this chapter all comes from the NIST evaluations, which defined a
set of rules on how the transcriptions should be made. In the latest evaluation (NIST Fall Rich
Transcription on meetings 2006 Evaluation Plan 2006) they were:
• Within a speaker's turn, pauses lasting less than 0.3 seconds are considered to belong
to that speaker turn. Pauses longer than 0.3 seconds, or in between different speaker
turns, are considered non-speech. This value was determined in 2003 by NIST as the
minimum duration for a pause that would indicate an utterance boundary.
• Vocal noises, such as laughs, coughs, sneezes, breaths and lipsmacks are considered non-speech,
and this is taken into account when determining segment boundaries.
• Although not a rule in creating the transcriptions, it is worth mentioning again the collar of
±0.25 seconds considered around each reference segment boundary when comparing
it to the hypothesis, in order to account for inexactitudes in computing the real segment
boundary.
Within the NIST evaluation campaigns all data sent out for development and test was
carefully transcribed by hand by the Linguistic Data Consortium (LDC). Such transcription
was usually done by listening to the channel with the best possible quality for each participant
(which usually is the Individual Head Microphone, IHM, channel when available), and then the
transcriptions were collapsed into a main reference file for all participants.
Prior to the RT06s evaluation, the use of forced alignments of the acoustic data was under
consideration by NIST and by some of the participants (including ICSI). Although hand alignments
were still used in RT06s, it is the intention of NIST to change the reference transcriptions to forced
alignments in the near future. The need for such a change became strong
when areas in overlap started being scored as part of the main metric for system performance.
In chapter 3.2 a quantitative comparison is made between forced and hand alignments. In brief,
the main drawbacks found in the hand-aligned references are:
• Transcription time inconsistency due to the gap of 1 year between the transcriptions
for each evaluation, which leads to changes in transcription criteria, human
transcribers, transcription tools, etc., resulting in systematic differences between the reference
files from which the systems try to learn.
• Inability, at times, to detect short speaker pauses when these are around 0.3 seconds. This
leads to problems for systems which are trained on this data and which are then unable to
determine when a speaker pause has to be a silence and when it does not.
• Existence of extended durations when labelling the overlap speech. As seen in chapter 3.2
the average length of speech in overlap is bigger in the hand-alignments, usually because the
human transcribers added some arbitrary padding to either side of some overlap regions,
leading to greater overlap errors. Such difference varies from evaluation to evaluation and
was detected only in the RT06s data, when overlap became part of the main metric.
• The inability, at times, to identify in the distant microphones (the ones actually used in the
evaluations) some sounds or artifacts that are heard and transcribed in the IHM channels
(much closer to each speaker's mouth).
It was decided at ICSI that development for the RT06s evaluation had to be done using
forced alignments in order to avoid these problems. In order to obtain the forced alignment of a
meeting recording a two-step process was followed:
1. The human word transcription for each one of the IHM channels was used to do a forced
alignment of the audio in each of the IHM channels to such transcription, obtaining a
time-aligned word transcription for each speaker wearing a headset. To do so, the ICSI-
SRI ASR system (Janin, Stolcke, Anguera, Boakye, Cetin, Frankel and Zheng 2006) was
used. Experiments pursued by NIST after the RT06s evaluation (Fiscus, Garofolo, Ajot and
Michet (2006)) indicated that very similar behaviors for all participants could be obtained
using either the ICSI-SRI transcriptions or LIMSI's ASR system transcriptions.
2. The transcriptions from each individual speaker were collapsed into a single file and the
transcription rules were applied to determine when two words were to be joined into a
single speaker segment or two speaker segments needed to be created.
Using forced alignments also has several drawbacks worth pointing out:
• In the way that these were done, an IHM channel needed to be provided for each participant
in the meetings in order to obtain that channel's alignment. One meeting in RT05s (named
NIST 20050412-1303) contained a speaker participating through a telephone speaker who was not
considered, therefore creating a transcript lacking some of the data. This could be
avoided by using other channels instead, trying to always select the source with the optimum
quality.
• Errors in the transcription of the words (which is done by human transcribers) propagate
into the forced alignments. These errors were measured to be much smaller than when
transcribing the speaker turns directly.
• Each ASR system makes its own systematic errors/decisions, which translate into systematic
segmentation issues. These are thought to be the main difference between the forced-alignment
outputs of the different ASR systems that could be used. Although such differences are very small, in order
to create good quality transcripts and reduce this variability, they could be derived from the
output of multiple systems.
All results reported in this thesis were computed using the forced alignments obtained using
the ICSI-SRI ASR system, unless otherwise stated.
This section covers the experiments done to assess the performance of the system given the
different improvements proposed in the previous chapters. To do so, the following structure will
be followed:
• This section sets the baseline using the system described in section 6.1.1.
• Section 6.3 covers the speech/non-speech detector applied to the baseline system in
the SDM case.
• Section 6.4 explores the different algorithms introduced in the beamforming block. It quantifies
their performance both in SNR and DER.
• Section 6.5 takes both the speech/non-speech detector and the beamforming system and
analyzes the performance of the diarization block with the introduced algorithms. It first
analyzes each individual algorithm's contribution separately from the others and then iteratively
agglomerates them into the final optimized system using all blocks.
As explained above, there are several baseline systems considered to test the different
modules proposed in this thesis. By doing so, every module's performance can be evaluated
independently.
The system used to evaluate the speech/non-speech detector can be considered the baseline
of this thesis, as it is directly derived from the broadcast news (BN) system found at ICSI at the
start of this thesis work. Such a system already contains a few improvements over the initial BN
system, but these are considered core and will not be evaluated.
The other baseline used (although it is in reality an intermediate system) uses the beamforming
system submitted to the RT06s evaluation and the hybrid speech/non-speech detector
together with the initial baseline. This is used to evaluate the algorithms in the acoustic beamforming
module and in the diarization module.
Table 6.2: Results for the CV-EM training algorithm in the agglomerate system
Table 6.2 shows the baseline scores to compare against throughout the following sections. The
difference between the baseline and the RT06s system is the inclusion or not of the 4 CMU meetings
from the development set that have a single channel. The version with 20 meeting excerpts is used to
develop the beamforming system, while the complete baseline is used to evaluate it (all meetings
containing more than one channel).
Experiments for the speech/non-speech module were run on the SDM condition to make them
directly comparable with the baseline system results shown in the previous section, although
in this case two slightly different development and test sets were used. The development set
consisted of the RT02 + RT04s datasets (16 meeting excerpts) and the test set was the RT05s
set (with the exception of the NIST meeting with faulty transcriptions). Forced alignments were
used to evaluate the DER, MISS and FA errors.
In the development of the proposed hybrid speech/non-speech detector there are three main
parameters that need to be set. These are the minimum durations for the speech/non-speech
segments in the energy block and in the models block, and the complexity of the models in
the models block.
Figure 6.1: Energy-based system errors depending on its segment minimum duration
The development set was used to first estimate the minimum duration of the speech and
non-speech segments in the energy-based detector. In figure 6.1 one can see the MISS and FA
scores for various durations (in number of frames). While for a final speech/non-speech system one would
choose the value that gives the minimum total error, in this case the goal is to obtain enough
non-speech data to train the non-speech models in the second step. It is very important to
choose the value with the smallest MISS so that the non-speech model is as pure as possible. This is
so because the speech model is usually assigned more Gaussian mixtures in the modeling step,
and therefore a bigger FA rate does not influence it as much. It can be observed how in the range
of durations between 1000 and 8000 the MISS rate remains quite flat, which indicates how robust
the system is to variations in the data. For any new dataset, even if the minimum of the MISS
rate is not located at the same value as in the development set, the chosen duration will most probably still
be a very reasonable solution. A duration of 2400 (150ms) is chosen, with MISS = 0.3%
and FA = 9.5% (total 9.7%).
Figure 6.2: Model-based system errors depending on its segment minimum duration
The same procedure is followed to select the minimum duration for the speech and non-speech
segments decoded with the model-based decoder, using the minimum duration determined in
the previous analysis for the energy-based detector. In figure 6.2 one can see the FA and MISS
error rates for different minimum segment sizes (the same for speech and non-speech); since the
curve is almost identical when using different numbers of mixtures for the speech model, a complexity of
2 Gaussian mixtures for the speech model and 1 for silence is chosen. In contrast to the energy-based
system, this second step does output a final result to be used in the diarization system,
therefore it is necessary to find the minimum segment duration that minimizes the total percent
error. A minimum error of 5.6% was achieved using a minimum duration of 0.7 seconds. If the
parameters in the energy-based detector that minimize the overall speech/non-speech error had
been chosen (at 8000 frames, 0.5 seconds) instead of the current ones, the obtained
scores would have had a minimum error of 6.0% after the cluster-based decoder step.
In table 6.3 results are presented for the development and evaluation sets using the selected
parameters, taking into account only the MISS and FA errors of the proposed module. Used
as a comparison, the “all-speech” system shows the total percentage of data labelled as non-speech
in the reference (ground truth) files. After obtaining the forced alignment from the STT system,
many non-speech segments with a very small duration existed due to the strict application
of the 0.3s minimum pause duration rule to the forced alignment segmentations. The second row
shows the speech/non-speech results using the SRI speech/non-speech system (Stolcke, Anguera,
Boakye, Cetin, Grezl, Janin, Mandal, Peskin, Wooters and Zheng 2005), which was developed
using training data coming from various meeting sources, with its parameters optimized using
the development data presented here and the forced alignment reference files. If tuned using the
hand annotated reference files provided by NIST for each data set, it obtains a much bigger FA
rate, possibly due to the fact that it is more complicated in hand annotated data to follow the
0.3s silence rule. The third and fourth rows belong to the results for the presented algorithm. The
third row shows the errors at the intermediate stage of the algorithm, after the energy-based
decoding. These are not comparable with the other systems as the optimization here is done
with respect to the MISS error, and not the TOTAL error. The fourth row shows the result of the final
output from both systems together.
Although the speech/non-speech error rate obtained for the development set is worse than
what is obtained using the pre-trained system, it is almost 25% relative better on the evaluation
set. This changes when considering the final DER. In order to test the usability of such
speech/non-speech output for the speaker diarization of meetings data, the baseline system was
used interposing each of the three speech/non-speech modules shown in table 6.3.
It is seen in table 6.4 that the use of any speech/non-speech detection algorithm improves the
performance of the speaker diarization system. Both systems perform much better than just
using the diarization system alone. This is due to the agglomerative clustering technique, which
starts with a large number of speaker clusters and tries to converge to an optimum number of
clusters via cluster-pair comparisons. As non-speech data is distributed among all clusters, the
more non-speech they contain, the less discriminative the comparison is, leading to more errors.
In both the development and evaluation sets the final DER of the proposed speech/non-speech
system outperforms the system using pre-trained models by 14% relative (development) and 10% relative (evaluation).
It can be seen how the DER on the development set is much better
than that of the pretrained system, even though the proposed system has a worse speech/non-speech
error. This indicates that the proposed system obtains a set of speech/non-speech segments that
is more tightly coupled with the diarization system.
This section analyzes the appropriateness of the different techniques implemented
for the acoustic beamforming of the multiple available signals into an “enhanced” signal. The
experiments were conducted using both the development and evaluation sets as described in 6.1.2,
where 4 meetings from CMU were taken out of the development set as they only contain a
single microphone.
The experiments use as a comparison system the filter&sum (F&S) beamforming used in the
RT06s NIST evaluation, which contains all the modules and algorithms described in section 5.2.
This implementation is the one used in the following section to test the appropriateness of all
the algorithms in the single-channel diarization module. Each module is evaluated by comparing
the performance of the system with and without it, while maintaining all other modules in place.
The metrics used in the experiments in this section are the Signal-to-Noise Ratio
(SNR) and the Diarization Error Rate (DER) as described in 6.1.3. In order to conduct a fair
comparison, the F&S output was obtained for each considered beamforming configuration and its SNR was
computed. After this, the DER was obtained by running the diarization module on that signal
(previously parameterized) using the optimum diarization parameters according to the results in
section 6.5. The TDOA values were not used in this analysis and the speech/non-speech labels
were kept constant to those of the RT06s system (used as the baseline system), as computed and
explained in section 4.1.3. This was done in order to focus on the changes in DER coming only from
changes in the beamforming module.
Table 6.5 shows the SNR and DER results for the development and test sets. The SNR
values are obtained in the same way as in section 3.2, doing a linear average of the values from
each meeting source. The first thing to observe is that although the SNR for the test set is
much higher than for the development set, the DER values show the opposite trend, which raises a warning
on how uncorrelated these two metrics are. This phenomenon is repeated throughout the
experiments in this section.
Figure 6.3: Individual meetings DER vs. SNR vs. number of microphones in the RT06s system
To further show the lack of correlation between the SNR and DER values, figure 6.3 shows the
individual values for all shows (22 dev + 8 eval) used in the experiment in table 6.5. The
meetings on the X axis are sorted according to the number of available microphones (shown in
parentheses). The DER and SNR values share the same Y axis, although SNR is better the
higher it is and DER the opposite. No correlation can be observed, either between the SNR and
DER values or between the SNR and the number of channels in the meetings.
As for the SNR values, they depend entirely on the particular rooms, the time of day of the recordings
and the type of microphones being used. Two cases where values are stable are the AMI project
meetings (including AMI and EDI recordings), which keep a very constant SNR value of around
37dB on average. The DER results depend on these and many other factors. In Mirghafori and
Wooters (2006) some of these factors are studied, referring to the high variability of the DER
values as show flakiness.
Therefore, it becomes clear that SNR and DER measure, and are affected by, different phenomena.
A signal output with a higher SNR (therefore higher signal quality) does not necessarily
lead to a better DER. Given that the aim of this thesis is to improve the diarization output, the
DER is the metric that will be most observed (and minimized), but the average SNR will still be
shown in all cases as a comparison. For other applications, like using the output signal for ASR,
the SNR is still the metric to be maximized. Results for ASR using the presented filter&sum
system are shown in section 6.4.5.
As explained in section 5.2.2, a cross-correlation based algorithm was proposed and implemented
in the RT06s system in order to select the optimum channel to act as the reference channel in the
beamforming system. By automatically selecting this channel the system becomes independent
of the human assignment of the SDM channel, which NIST selects in the RT datasets and which
was used as reference in previous versions of the system.
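As a simple sketch of the idea (not the exact implementation of section 5.2.2), the reference channel could be chosen as the one with the highest average cross-correlation with the other channels, computed over a short excerpt of the recording:

```python
import numpy as np

def pick_reference_channel(channels):
    """Select the reference channel by average cross-correlation (sketch).

    channels: (N, L) array with one time-aligned excerpt per microphone.
    Returns the index of the channel best correlated with all the others.
    """
    n = len(channels)
    avg_corr = np.zeros(n)
    for i in range(n):
        corr = [np.max(np.correlate(channels[i], channels[j], mode="full"))
                for j in range(n) if j != i]
        avg_corr[i] = np.mean(corr)
    return int(np.argmax(avg_corr))
```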
Table 6.6 shows the results of the RT06s system (with automatic selection) and of the same
system using the SDM channel as reference. With the automatic selection of the optimum channel
the results become slightly worse both in SNR and DER. Although the DER of the development
set is almost equal to the hand-picked reference channel case, in the test set there is a decrease
in DER performance of 1.87% relative. Considering the development results, it is preferable
and more robust to use the automatic selection of the reference channel, as it then becomes
possible to use this system in areas other than the RT evaluation data, where there might not
be any prior information on which microphone to select as the reference channel.
The post-processing module (explained in 5.2.3) includes the noise thresholding algorithm and
the TDOA stability algorithm using Viterbi decoding. These post-process the
computed TDOA values to select the final delays to be applied to each signal prior to summing
the different channels. The noise thresholding algorithm detects those TDOA values that
most probably come from a silence region and substitutes their value with the previous, more stable,
delay value. It does so by finding the threshold that marks 10% of the TDOA values as noise. The
TDOA stability algorithm uses a double-pass Viterbi algorithm to select the optimum among
all possible combinations of the N-best computed TDOA values.
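A minimal sketch of the percentage-based noise thresholding (hypothetical variable names; the GCC-PHAT value of the selected peak is assumed to be available per frame) could be:

```python
import numpy as np

def noise_threshold_tdoa(tdoa, xcorr, noise_fraction=0.10):
    """Replace TDOA values of probable silence frames with the previous delay.

    tdoa: (T, N-1) selected TDOA values per frame; xcorr: (T,) GCC-PHAT value
    of the selected peak. Frames whose correlation falls within the lowest
    `noise_fraction` of all frames are assumed to be noise/silence.
    """
    tdoa = tdoa.copy()
    threshold = np.quantile(xcorr, noise_fraction)   # cuts 10% of frames as noise
    for t in range(1, len(tdoa)):
        if xcorr[t] <= threshold:
            tdoa[t] = tdoa[t - 1]                    # keep previous stable delay
    return tdoa
```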
These modules are the second version of the initial algorithms implemented and presented in
the RT05s system, explained in Anguera, Wooters, Peskin and Aguilo (2005). On one hand, the
initial algorithm for the noise thresholding used a fixed threshold set by hand (using development
data) for all meetings. This caused problems in very noisy shows, as will be shown in the
results, where it is used as a comparison with a threshold of 0.1 over the GCC-PHAT value for
each frame (whose values can range from 0 to 1). On the other hand, the first version of the
stability 1-best selection algorithm used a simple distance-based rule to decide whether to use
the 1st-best value or some other value in the N-best list. If the difference between the TDOA
value of the first element in the N-best TDOA list for a particular frame and the selected value
of the previous frame was greater than a threshold, the N-best list was searched for any other
element that was closer, in which case that one was selected instead.
Table 6.7 shows results for the RT06s system with the latest version of both algorithms
and compares them to using the RT05s versions or not using any algorithm (for
either algorithm individually and overall).
The development set SNR using the RT06s version of both algorithms is better than with the
combination of the RT05s algorithms, but is worse than not using any continuity algorithm or not using
post-processing at all. The evaluation set SNR for RT06s, though, outperforms all other cases. As
for the DER, on the development set the RT06s system outperforms all other combinations. On
the test set, results are slightly worse with the proposed RT06s algorithms than not using anything
or using a fixed noise threshold. Overall, the RT06s post-processing algorithm outperforms
the lack of post-processing by 8.3% relative on the development set, while it is 2.9% worse
on the evaluation set.
The change of the noise thresholding to a percentage-based threshold in RT06s, versus using
a fixed threshold, slightly benefits the DER results on the development set while making them worse
on the test set. The new algorithm was implemented given the problems obtained when processing
very noisy meetings (as was the case for the LDC meetings), for which the RT05s algorithm
labelled many frames as non-speech.
Figure 6.4: Development set SNR modifying the percentage of noise threshold adjustment
Figure 6.4 shows the SNR obtained for the development set by sweeping the percentage of
frames with the lowest GCC-PHAT values considered as noise. Beyond 10% of frames selected,
the SNR results are very stable. This was the value selected for the RT06s system as it modifies
as few frames as possible while achieving good performance. If instead of a 10% threshold the
maximum SNR point had been selected (20%), the results for the SNR and DER would be as shown in
table 6.8.
By using the optimum value for the SNR on the development set it is observed that the
evaluation set obtains a slightly worse SNR, but both the dev and eval sets obtain an important
improvement in DER. This optimum value was not found during the development of the RT06s
system as the development set was slightly different from the one used in these experiments.
These scores demonstrate the better performance of the noise thresholding using a percentage
instead of a fixed threshold or no thresholding, shown in table 6.7 above.
The use of the TDOA-selection continuity algorithm is justified by the results in table 6.7.
The DER results, compared to those of the RT05s algorithm, show a 6.9% relative improvement
on the development set and a more modest 1% relative on the test set. Compared to not
doing anything, it obtains similar results on the development set but a 3% relative improvement on
the test set. This algorithm, though, requires the computation of a double Viterbi decoding over
the multiple TDOA values, which can take a long time, depending on the number
of microphones to process. Although the results benefit the system, it is doubtful whether it is
feasible to use it in a realtime application.
Figure 6.5: Development set SNR values modifying the Viterbi transition prob. weights in the
F&S algorithm
To further study the effect of the double-Viterbi decoding, the behavior of three parameters
of the algorithm is studied with respect to the SNR on the development set. These are the
weights used to enhance the transition probabilities in each of the 2 Viterbi decoding levels
(see section 5.2.3 for more details) and the number of initial N-best TDOA values given to the
algorithm to select from. Figure 6.5 shows the SNR for the development set when changing the
relative weights of either the first or second Viterbi step. The default value of 25 for both variables
is a relative maximum of both curves (which are very similar to each other). Other values
around the default ones obtain better SNR. For both cases, table 6.9 shows the SNR and DER for
the default RT06s case and for the cases where either weight is set to 15.
Selecting the alternative value for the weight of the first Viterbi decoding obtains an
improvement in the SNR of both the devel and eval sets and in the DER of the development set, but the
DER gets worse for the evaluation set. On the contrary, by using the alternative weight in the
second Viterbi only the SNR on the development set improves. It is desirable to keep both
weights at the same value to avoid over-tuning to the data, which on average is best at
25, as set for the RT06s evaluation.
The third parameter to analyze in the algorithm is the number of N-best values to be
considered by the algorithm when selecting the optimum TDOA value. The first Viterbi step
does a local selection within each channel from the N-best possible TDOA values down to the 2-best,
which are then considered by the second Viterbi in a global decoding among all TDOA values
from all channels. The number of possible initial TDOA values is a parameter that determines
how many possible peaks of the GCC-PHAT function are considered by the first Viterbi.
These maxima might represent cases with multiple speakers in overlap or single speakers with
impulsive noises. The selection of the right number of initial N-best values needs to account for
concurrent acoustic events while avoiding the selection of false peaks of the GCC-PHAT function.
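As a small, purely illustrative sketch of how the N-best candidate delays could be extracted from a GCC-PHAT function (ignoring refinements such as restricting the lag range or minimum peak separation), one could pick the N highest local maxima:

```python
def nbest_tdoa_candidates(gcc_phat, n_best=4):
    """Return the delays (in samples) of the N highest local maxima of a
    GCC-PHAT function centered at lag 0 (sketch only).

    gcc_phat: 1-D sequence of GCC-PHAT values for lags -L..L.
    """
    peaks = [i for i in range(1, len(gcc_phat) - 1)
             if gcc_phat[i] > gcc_phat[i - 1] and gcc_phat[i] >= gcc_phat[i + 1]]
    peaks.sort(key=lambda i: gcc_phat[i], reverse=True)  # strongest peaks first
    center = len(gcc_phat) // 2
    return [i - center for i in peaks[:n_best]]
```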
Figure 6.6: Development set SNR values modifying the number of N-best values used for TDOA
selection
Figure 6.6 shows the SNR for the development set when selecting from 2 to 10 initial N-best
values. The RT06s system default value of 4 obtains a stable SNR behavior (not much change
is seen in SNR around it). Choosing only the 2-best peaks of the TDOA values gives a slightly
better SNR on the development set, and it was therefore compared to 4-best by computing the SNR
and DER on the test set, shown in table 6.10.
Although the SNR for the development set is slightly better using 2-best than using 4-best,
the DER for that set behaves very poorly compared to the RT06s system. The SNR for the eval
set is worse than in the default system, although the DER slightly outperforms the default. The
optimum value for diarization is therefore left at the 4-best TDOA values.
The signal output module includes the relative channel weight estimation algorithm (explained
in 5.2.4) and the elimination of frames from low quality channels (also seen in 5.2.4).
The relative channel weight algorithm is necessary when the different microphones are of very
diverse types and therefore the levels and kinds of noise being recorded are different. In such
cases the standard delay-and-sum theory does not apply, as the noise from the different channels
does not cancel itself out. One needs to find appropriate weights for each of the channels that
reduce the effect of a channel when the quality of its signal is poor and magnify it
when it is very good. In the RT06s system the relative weights are computed once the channel
delays are known, as a function of the average correlation between the signals of all channels.
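A minimal sketch of such a correlation-based channel weighting (illustrative only; the RT06s implementation also adapts the weights over time) is:

```python
import numpy as np

def channel_weights(aligned_segments):
    """Relative channel weights from the average cross-correlation (sketch).

    aligned_segments: (N, L) array holding one TDOA-aligned analysis segment
    per channel. Channels that correlate well with the others get larger
    weights; with equal weights 1/N this reduces to plain delay-and-sum.
    """
    n = len(aligned_segments)
    avg_corr = np.zeros(n)
    for i in range(n):
        corr = [np.corrcoef(aligned_segments[i], aligned_segments[j])[0, 1]
                for j in range(n) if j != i]
        avg_corr[i] = np.mean(corr)
    avg_corr = np.clip(avg_corr, 0.0, None)   # ignore badly anti-correlated channels
    if avg_corr.sum() == 0.0:
        return np.full(n, 1.0 / n)
    return avg_corr / avg_corr.sum()
```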
The elimination of certain channels when their quality is too low also uses the correlation
information, to determine when a channel in a particular frame is of such low quality that it
is better not to use it in the output, as it would degrade the output quality. This is done
automatically and in a dynamic way (only certain frames are eliminated, not all the data in that
recording). Table 6.11 shows the results comparing the RT06s system with the same system without
the use of the relative channel weighting and the automatic channel elimination algorithms.
When no channel weights are used, a constant weight of 1/N is applied to all channels.
Table 6.11: Results for relative channel weights and elimination of bad channels
The SNR of the development set improves when not using either of the techniques,
but gets worse on the test set. The DER improves by 1.8% relative when using the relative channel
weights on the development set and by 4.7% relative on the evaluation set. By eliminating the bad
channels from the processing, the DER does not change on the development set but it improves
on the evaluation set.
The beamforming system presented in this thesis was also used to obtain an enhanced signal
for the ASR systems at ICSI presented to the RT NIST evaluations. For RT05s the same beamforming
system was used for ASR as for diarization. As explained in Stolcke et al. (2005),
evaluated on the RT04s eval set and not considering the CMU mono-channel meetings, the new
beamforming outperformed by 2.8% absolute (from 42.9% word error rate to 40.1%) the previous
beamforming system in use at ICSI, which was based on delay&sum of full speech segments.
For the RT06s system the beamforming module was tuned separately from the diarization
module to optimize for Word Error Rate (WER), which is a word-based metric (unlike the DER,
which is time-based). This led to a system that was more robust than the RT05s beamformer.
Table 6.12: WER using RT06s ASR system including the presented beamformer
As seen in Janin et al. (2006) and reproduced in table 6.12, the RT05s and RT06s datasets
were used to evaluate the RT06s ASR submission in terms of WER. In both datasets there is an
improvement of almost 2% absolute between using a single channel and using the MDM
beamformed signal, where the ASR system only differs in the use of the F&S algorithm and
minor tuning parameters, optimized for each case.
The improvement becomes much larger between the MDM and ADM cases, where the gain is
exclusively due to the larger number of microphones available in the ADM case and therefore
to the better signal quality obtained by the beamforming processing.
The Mark III microphone arrays (MM3a) were available for the RT06s evaluation. Tests
comparing results with other state-of-the-art beamforming systems showed that the proposed
beamformer achieves excellent performance.
This section analyzes each of the improvements to the mono-channel diarization module
proposed in this thesis: the model complexity selection, the selection of the initial number of
clusters, the CV-EM training, the friends-and-enemies initialization, the frame and segment
purification, and the automatic multi-stream weight estimation.
All experiments use the same baseline system as described in 6.1.1. This uses the new hybrid
speech/non-speech detector and the RT06s beamforming system when necessary. In order to
evaluate all possible diarization working conditions, three different tasks are run:
• The SDM system uses only the single SDM channel as defined by NIST for the RT datasets
used in the experiments.
• The MDM mono-stream system uses the beamforming of the multiple available channels
to obtain a single acoustic channel for use by the system. The same configuration is used
for the diarization of this channel as for the SDM channel.
• The MDM multi-stream system uses the beamformed signal as well as the extracted TDOA
values to add information to the diarization on the location of the speakers. Although this
system uses the same acoustic channels as the pure MDM task, this evaluation condition
is added because it behaves differently on the data than the other tasks.
Experiments were performed in two steps to find the optimum parameters for all algorithms
by using the development set. First, each algorithm was tested alone against the proposed
baseline. One or two parameters for each algorithm were searched for the optimum configuration.
Then, the different algorithms were combined in order of individual improvement (from most to
least) and the optimum parameters were searched for again, obtaining the optimum system
according to the development set. Finally, the evaluation set was used to test how the system
performs on unseen data.
The optimization of the parameters is done using the arithmetic average of the three consid-
ered systems, with the exception of the multi-stream weight computation algorithm which only
affects the system using the TDOA feature stream.
In this chapter each algorithm is tested independently, compared to the baseline system.
As pointed out in the system description chapters, the speaker diarization system via ag-
glomerative clustering, as implemented in this thesis, uses very small GMM models trained with
small amounts of speaker data. These allow the system to be computationally faster than
systems based on UBM adaptation techniques (like Zhu et al. (2006)), but it requires a more
accurate selection of the number of initial clusters, the complexity of the cluster models, and the
way each model is trained, so that data is equally well represented in each model and comparisons
between them yield better decisions in the agglomerative process.
For both the number of initial clusters and the complexity of the cluster models, the Cluster
Complexity Ratio (CCR) is defined as the parameter to be optimized, as described in section
4.2.2. In the experiments performed for both algorithms an initial analysis studied the effect of
this parameter on the final DER when using either algorithm alone, compared to the baseline
system. As both algorithms use the same parameter, a joint experiment using both algorithms
was used to tune the optimum CCR value according to the average DER between the three
systems considered in the experiments.
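As a minimal sketch of how a single CCR value drives both decisions, assuming the CCR expresses the amount of data (taken here as seconds of speech) that should support each Gaussian, with the exact definition given in section 4.2.2:

```python
def model_complexity(cluster_speech_seconds, ccr):
    """Number of Gaussians for a cluster model: one Gaussian per CCR seconds
    of speech assigned to the cluster (at least one Gaussian)."""
    return max(1, round(cluster_speech_seconds / ccr))

def initial_num_clusters(total_speech_seconds, ccr, gaussians_per_initial_cluster=5):
    """Initial number of clusters, assuming each initial cluster starts with
    `gaussians_per_initial_cluster` Gaussians that should each receive about
    CCR seconds of speech."""
    return max(2, round(total_speech_seconds / (ccr * gaussians_per_initial_cluster)))
```

Under these assumptions, CCR = 7 and ten minutes of detected speech would, for instance, start the system with about 17 clusters of 5 Gaussians each.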
Figure 6.7: DER for the model complexity selection algorithm using different CCR values
Figure 6.7 shows the DER for the development set data using the three system implementa-
tions for model complexity selection. While the SDM and MDM systems show similar behavior
across the different CCR values, the TDOA-MDM system tends to obtain the best scores for the
lower range of evaluated CCR values. Given that only the acoustic feature models are affected
by the algorithm (not the TDOA values), this indicates that when using the TDOA values the
system becomes more robust to complexity selection errors, as bigger models (obtained with
smaller CCR values) do not overfit the data as much as when using acoustics alone.
Figure 6.8: DER for the initial number of clusters algorithm using different CCR values

The same behavior is observed when evaluating the initial number of clusters algorithm in
figure 6.8. When using the TDOA-MDM system the DER is very stable from CCR=5 to 10.
Both algorithms also show an increase in DER when the CCR values are higher than 10, which
indicates that the models become too small to represent the speaker data well and that too few
models are initially created to allow the system to distribute the data appropriately among the
speakers.
Figure 6.9: DER for the combination of complexity selection + initial number of clusters using
different CCR values
For both algorithms the CCR value that obtains the optimum average DER lies in the range
of CCR=6 to 8. In order to use a single CCR value, both algorithms are combined and the DER
is shown in figure 6.9. There the average DER between the three systems when using complexity
selection or initial number of clusters alone is compared to the average DER when using both
combined. A clear average minimum appears at CCR = 7.
Table 6.13 summarizes the results shown in the figures for the development set and shows
the evaluation set scores obtained using the optimum parameters for each algorithm. While
combining the two algorithms with the new optimum CCR value gives an improvement over
either algorithm alone on the development set, on the evaluation set the combined score is very
similar to the worse of the two individual algorithms. In any case, an improvement of 7.05%
relative is obtained on the development set and only 0.2% relative on the evaluation set.
Table 6.13: DER for the development and evaluation sets for models complexity selection and
initial number of clusters algorithms
Cross-Validation EM Training
As seen in the previous section, the correct training of the speaker models makes a big difference
in the speaker diarization system. The EM-ML algorithm is appropriate for this task, but with
a fixed number of iterations it normally undertrains or overfits to the training data. To solve
this problem, the Cross-Validation EM (CV-EM) algorithm performs EM training iterations over
a set of parallel models, which provides a robust validation set to determine when to stop
training (defined as the point where the total likelihood of the validation set increases by less
than 0.1% between two iterations).
When using the CV-EM algorithm, the initial models use on average more EM iterations than
the 5 iterations that were set for the standard EM-ML system. Once the models are retrained
using almost the same data as in previous iterations, the CV-EM algorithm stops after 1 or 2
iterations, while the standard EM keeps doing 5 iterations. This reduces on average the
computation of the system. The use of multiple parallel models in the CV-EM algorithm does
not pose a computational burden, as the increase in computation is minimal, coming only from
the multiple accumulation of statistics for each model.
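The following is a simplified sketch of the CV-EM idea using scikit-learn GMMs run one EM step at a time: the data is split into as many partitions as parallel models, each parallel model is updated on all partitions except its own and validated on the held-out one, and training stops when the summed validation log-likelihood improves by less than 0.1%. The full algorithm shares per-partition sufficient statistics between the parallel models instead of fitting separate GMMs, so this sketch illustrates the stopping logic rather than the exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cv_em_train(features, n_components, n_parallel=35, rel_tol=1e-3, max_em_iters=50):
    """Cross-validation EM training sketch.

    `features` is an (N, D) array of frames assigned to one cluster.  Each of
    the `n_parallel` models is advanced by one EM iteration per outer loop
    (warm_start=True, max_iter=1) on all partitions except its own, and its
    held-out partition contributes to the total validation log-likelihood
    used for the stopping decision.
    """
    parts = np.array_split(features, n_parallel)
    models = [GaussianMixture(n_components, warm_start=True, max_iter=1,
                              random_state=0) for _ in range(n_parallel)]
    prev_ll = -np.inf
    for _ in range(max_em_iters):
        total_ll = 0.0
        for i, gmm in enumerate(models):
            train = np.vstack([p for j, p in enumerate(parts) if j != i])
            gmm.fit(train)                             # one warm-started EM iteration
            total_ll += gmm.score(parts[i]) * len(parts[i])
        if np.isfinite(prev_ll) and (total_ll - prev_ll) < rel_tol * abs(prev_ll):
            break                                      # less than 0.1% relative gain: stop
        prev_ll = total_ll
    return models
```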
Figure 6.10: DER variation with the number of parallel models used in CV-EM training
In the proposed CV-EM algorithm the number of parallel models needs to be defined a priori.
Figure 6.10 shows the evolution of the DER for the three considered systems when modifying
the number of parallel models used in the algorithm. In the three cases the minimum DER value
is found at 35 parallel models. At this optimum value, the CV-EM training obtains a 17.50%
DER, a 6.46% relative improvement over the baseline.
Once the initial number of desired clusters is defined, the algorithm called friends-and-enemies
was proposed to distribute the available acoustic data among these clusters so that cluster purity
is maximized and no speaker (regardless of the length of their intervention in the recording) is
left without an exclusive cluster. The proposed algorithm is an iterative process where initial
single-speaker segments are clustered together with their closest segments (friends) and new
clusters are seeded with the most dissimilar remaining segments (enemies).
In the definition of the algorithm there are three degrees of freedom that need to be evaluated.
On one hand, a way of selecting the initial speaker segment to start the iterative processing
and a metric of “closeness” between segments need to be defined. On the other hand, one needs
to determine the optimum number of friend-segments to group into a cluster so that it is well
represented but contains data from only one speaker.
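A minimal sketch of this iterative grouping is given below, assuming a precomputed distance matrix between the initial single-speaker segments; the seed selection and distance metric variants of section 4.2.1 are abstracted away, and the specific choices here (a "central" first seed, average distance to the assigned segments for the enemy) are illustrative assumptions.

```python
import numpy as np

def friends_and_enemies(dist, n_clusters, n_friends=3):
    """Friends-and-enemies initialization sketch.

    `dist` is a symmetric (n_segments x n_segments) distance matrix.  Each
    cluster is seeded with one segment plus its `n_friends` closest unassigned
    segments; the unassigned segment furthest from everything assigned so far
    (the "enemy") seeds the next cluster.  Leftover segments are finally
    attached to the cluster of their closest assigned segment.
    """
    n = dist.shape[0]
    unassigned = set(range(n))
    labels = -np.ones(n, dtype=int)
    seed = int(np.argmin(dist.sum(axis=1)))        # assumed first seed: most central segment
    for c in range(n_clusters):
        unassigned.discard(seed)
        friends = sorted(unassigned, key=lambda s: dist[seed, s])[:n_friends]
        unassigned -= set(friends)
        labels[[seed] + friends] = c
        if not unassigned:
            break
        assigned = np.where(labels >= 0)[0]
        seed = max(unassigned, key=lambda s: dist[s, assigned].mean())   # the enemy
    for s in list(unassigned):                     # attach remaining segments
        assigned = np.where(labels >= 0)[0]
        labels[s] = labels[assigned[np.argmin(dist[s, assigned])]]
        unassigned.discard(s)
    return labels
```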
In the description of the algorithm, in section 4.2.1, three alternatives are proposed for both
the selection of the initial segment and the distance metric between segments. Table 6.14 shows
the average DER of the considered systems using the different combinations of initial segment
selection and distance metric as numbered in section 4.2.1 for the development set. These values
were computed using 2 friends for each cluster (i.e. total of 3 segments per initial cluster).
Table 6.14: Average DER for the development set using various possible distance metrics and
initial segment selection criteria
From the results in table 6.14, the combinations 2-2 and 3-3 are the ones with the best DER
values; compared to the baseline's 18.71% DER they represent a 1.9% and a 3.2% relative
improvement respectively. Taking the combination 2-2, figure 6.11 shows the evolution of the
DER for each system with the number of friends chosen.
For values greater than 1 the system is very robust, as the three systems obtain very stable
results. The minimum average DER is obtained using 3 friends, with a 17.47% DER, a 6.62%
relative improvement over the baseline.
Figure 6.11: DER variation with the number of friends used in the friends-and-enemies initial-
ization
The frame and segment purification algorithms deal with the problem of cluster impurity at two
different levels. The segment-level purification locates the segments within a cluster that are
most likely to belong to a different speaker and places them in a different cluster. The frame-
level purification locates those frames that impede the proper discrimination between clusters
and eliminates them when performing the cluster-pair comparison.
First, experiments were performed on the frame purification algorithm. For all experiments
either metric 1 or 2 was used (see section 4.3.1). Two main parameters determine the behavior
of the algorithm given the data. On one hand, the percentage P% of data with the highest
metric values to be eliminated from the model comparison. On the other hand, the percentage
of Gaussian mixtures with the smallest average variance that are used to compute metric 2. In
fact, metric 1 can be considered equivalent to metric 2 when 100% of the Gaussians are used.
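A minimal sketch of the frame purification step is given below, assuming diagonal-covariance cluster GMMs, with metric 2 taken as the frame log-likelihood computed over only the fraction of Gaussians with the smallest average variance (metric 1 being the special case where all Gaussians are used); the fraction of frames kept and all names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def frame_purification(frames, means, variances, weights, keep_frac=0.7, gauss_frac=0.5):
    """Drop the highest-likelihood frames before a cluster-pair comparison.

    frames: (N, D) acoustic frames of the cluster.
    means, variances, weights: parameters of the cluster's diagonal GMM,
    with shapes (G, D), (G, D) and (G,).
    Only the `gauss_frac` fraction of Gaussians with the smallest average
    variance is used to score the frames (gauss_frac=1.0 reduces to metric 1),
    and only the `keep_frac` fraction of frames with the lowest score is kept,
    since high-likelihood frames carry little speaker-discriminant information.
    """
    avg_var = variances.mean(axis=1)
    n_used = max(1, int(round(gauss_frac * len(weights))))
    used = np.argsort(avg_var)[:n_used]                   # lowest average variance first
    log_dens = []
    for g in used:
        diff = frames - means[g]
        ll = -0.5 * (np.sum(diff ** 2 / variances[g], axis=1)
                     + np.sum(np.log(2.0 * np.pi * variances[g])))
        log_dens.append(np.log(weights[g]) + ll)
    metric = logsumexp(np.stack(log_dens, axis=1), axis=1)  # per-frame log-likelihood
    threshold = np.quantile(metric, keep_frac)
    return frames[metric <= threshold]
```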
For the development data, figure 6.12 shows the average DER for several percentages of used
frames, using 50% and 75% of the available Gaussians. The optimum value is found for 50% of
the Gaussians and P = 70% of the frames. While using 75% of the Gaussians shows a clear
minimum point with much higher values around it, with 50% the values oscillate around the
minimum, indicating a more robust selection of the optimum point.
In the segment purification algorithm no parameters were tuned. Table 6.15 compares the
selected systems to the baseline both on the development set and on the evaluation set. While
segment purification obtains an improvement of 2.5% relative on the development set, it obtains
worse results than the baseline on the evaluation set. The frame purification algorithm obtains
improvements of 3.6% and 2.8% relative on the development and evaluation sets respectively.
Figure 6.12: DER variation with the percentage of accepted frames and used Gaussians in frame
purification
Table 6.15: DER results for the segment and frame purification algorithms
In order to test the effectiveness of the automatic weighting scheme for multi-stream feature
sets, only the MDM system with TDOA values was used. By setting the weights automatically,
the optimum relationship between the acoustic and TDOA features can be computed for each
meeting. In fact, given that TDOA features determine the identity of the speaker by his/her
physical location in the room, they can suffer from modeling errors whenever the speakers move
around the room, which in the RT meetings only happens in a few cases.
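For reference, the way the two streams enter the clustering can be sketched as a weighted sum of the per-frame stream log-likelihoods, which is the usual formulation for this kind of multi-stream modeling; the function below is only an illustration of that combination.

```python
def combined_log_likelihood(acoustic_ll, tdoa_ll, w_acoustic):
    """Weighted combination of the frame log-likelihoods of the two streams
    for one cluster: the acoustic stream receives weight W1 = w_acoustic and
    the TDOA stream 1 - W1.  With w_acoustic = 1.0 the system reduces to the
    mono-stream (acoustics only) case.
    """
    return w_acoustic * acoustic_ll + (1.0 - w_acoustic) * tdoa_ll
```

The hand-tuned setting discussed below corresponds, for instance, to calling this with w_acoustic = 0.9.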
Figure 6.13: DER scores for the baseline system setting the relative weights by hand on develop-
ment data
The alternative to automatically setting the weights is to define a relative weight using a
development set and apply it to all meetings in the test set. This alternative lacks the flexibility
to treat each meeting differently, and can diverge from the development to the test set, lacking
robustness to changes. Figure 6.13 shows the DER on the development set for the TDOA-MDM
system, where the relative weight between TDOA and acoustic features has been set a priori.
The large dynamic range of DER scores obtained across the possible weights is easily observable.
Given the computed values, the optimum working weight is at W1 = 0.9 with a DER of 16.49%.
The proposed automatic weighting computes the weights after the ∆BIC values are obtained
for all cluster pairs. This can be done after each iteration of the agglomerative clustering, and
therefore a decision must be made whether the initial iteration weights are kept throughout the
process or they are allowed to readapt using the new ∆BIC pairs.
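The exact re-estimation rule is the one given in section 5.3.2. Purely to illustrate the idea of deriving the weights from the ∆BIC values of all cluster pairs and re-adapting them across iterations, the sketch below assumes a simple balancing rule, so it is a stand-in rather than a reproduction of the actual algorithm.

```python
import numpy as np

def reestimate_stream_weight(delta_bic_acoustic, delta_bic_tdoa, prev_w, smooth=0.5):
    """Illustrative (assumed) weight re-estimation from per-pair Delta-BIC values.

    `delta_bic_acoustic` and `delta_bic_tdoa` hold the Delta-BIC contribution of
    each stream for every cluster pair at the current iteration.  The assumed
    rule moves the acoustic weight towards the value that balances the average
    magnitude of the two contributions (a stream with a much larger Delta-BIC
    spread gets a smaller weight so it does not dominate the merging decisions)
    and smooths the update with the previous iteration's weight.
    """
    a = np.mean(np.abs(delta_bic_acoustic))
    t = np.mean(np.abs(delta_bic_tdoa))
    target = t / (a + t + 1e-12)
    return smooth * prev_w + (1.0 - smooth) * target
```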
Figure 6.14: DER variation with the number of iterations in which the stream weights are re-estimated
It was observed in figure 5.13 in section 5.3.2 that the weights usually converge to a stable
value after several agglomerative clustering iterations. It remains to be seen whether these
weights obtain an appropriate DER result. Figure 6.14 shows the evolution of the DER computed
on the development set when changing the number of iterations in which the weights are
reestimated. The DER decreases as the number of iterations increases, with the exception of
iteration 3, stabilizing around iteration 9. This indicates that the system tends to obtain better
values for the weight as it progresses, and therefore there is no need in the final system to tune
the number of iterations. Instead, the system was allowed to adapt a new weight as long as the
stopping criterion did not stop the process.
Figure 6.15: DER evolution changing the initial feature stream weights

One final parameter of the weighting algorithm is the initial weight used in the system
initialization and the initial segmentation, before the first clustering occurs. In order to study
the effect of the chosen initial weight, figure 6.15 shows the DER variation computed on the
development set using the automatic weighting algorithm, without limiting the number of
adaptation iterations and varying the initial weight. The final system DER for the development
data changes
depending on the initial weight setting, with the optimum at W1,init = 0.65. This variation is
considered minimal compared to the DER variation observed when using a manual setting,
shown in figure 6.13. In fact, by selecting a non-informative initial weight Wi = 0.5 there is only
a 0.57% absolute DER loss, which might be acceptable for many applications where the nature
of the data changes rapidly and it becomes a burden to tune the system every time.
In table 6.16 the DER is computed for several implementations. The mono-stream system
uses only acoustic features, the other systems use both acoustics and TDOA values, differing
in the way that the weights are found. The system “inv-entropy” performs a frame-wise inverse
entropy weight estimation as described in Misra et al. (2003). The “manual weights” system
finds the optimum weights using a development set and is set to W1 = 0.9. The other two lines
show results using the automatic weighting with different initial weights, W1 = 0.9, optimum
in the development set for the manual case and W1 = 0.65 optimum in the development set for
the automatic weight setting.
Given these results, it can be seen that using inverse entropy does not achieve good results. On
average the entropy method assigns a higher weight to the TDOA values, while all optimum
performance points do otherwise. Also, observe that all the multi-stream methods (except inverse
entropy) greatly outperform the mono-stream baseline system.
Automatic weighting obtains, at its optimum point, a 14.1% relative improvement versus
manual weighting on the development set. Manually setting the weight achieves the best
performance on the test set, although weights around that point obtain much higher errors
(DER = 22.85% for W1 = 0.85 and DER = 22.29% for W1 = 0.95), which casts doubt on its
robustness for other data sets. On the other hand, the values for the automatic weighting
algorithm on the test set remain stable (DER = 20.5% on average) for most observed weights.
Such a system could be expanded to compute the weights for more than 2 streams, where doing
it automatically becomes much easier than performing a sweep of possible values using a
development set. Even if performance is not improved in all cases, the automatic algorithm
makes it much easier to adapt the system to new domains quickly, which follows one of the
thesis objectives.
Given these results, the automatic setting of the relative weights between features is set to
use an unlimited number of adaptation iterations, with an initial weight Wi = 0.65. This is the
first algorithm added to the baseline system in the following section, as it only affects one of the
systems evaluated.
In the previous section each proposed algorithm has been tested on its own against the baseline
system described in 6.1.1. Table 6.17 summarizes the results of each algorithm as applied
independently to the baseline system, using either only the TDOA-MDM system (for the weights
computation) or the average of all three systems. The last column shows the rank in improvement
over the baseline obtained by each algorithm. This rank is used to determine the order in which
the algorithms are added to the agglomerate to form the final system.
Table 6.17: Summary of DER improvements for the individual algorithms in the development
set
Given the procedure of the previous section, it must be emphasized that in some cases there are
big differences between the three systems being tested (SDM, MDM and TDOA-MDM); as a
consequence, some of the algorithms obtain better results with one system than with another.
Results computed as an average show a lower improvement percentage than what could be
obtained in a targeted application. By using an average of three systems and an extended development
set (20-24 meeting excerpts), the level of uncertainty of the results (very common in speaker
diarization, where it is often called “flakiness”) is reduced.
In this section the baseline system is iteratively augmented with the different individual
algorithms, and further analysis of the parameters studied in the previous section is performed
to assess whether the same settings (or others) are optimum at each step. Ideally a full system
should be built and a full search done over the whole parameter space to find the optimum set,
but the large number of dimensions of this space and the number of algorithms make this
procedure infeasible. Instead, a greedy procedure iteratively adapts each algorithm to perform
optimally within the system.
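A compact sketch of this greedy procedure is shown below; `evaluate` stands for a hypothetical routine that runs the three diarization tasks with the selected algorithms (re-tuning their parameters) and returns the average development DER, and an algorithm is kept only if it does not degrade that average.

```python
def greedy_aggregate(baseline_der, ranked_algorithms, evaluate):
    """Greedy construction of the agglomerate system.

    `ranked_algorithms` is a list of (name, settings) pairs sorted by their
    individual improvement over the baseline (largest first); `evaluate` is a
    hypothetical function returning the average development DER of the system
    built with the currently selected algorithms.
    """
    selected, best_der = [], baseline_der
    for name, settings in ranked_algorithms:
        candidate = selected + [(name, settings)]
        der = evaluate(candidate)
        if der <= best_der:              # keep the algorithm only if it helps (or ties)
            selected, best_der = candidate, der
    return selected, best_der
```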
Figure 6.16: DER variation with the number of Gaussian mixtures initially assigned to the TDOA
models
The first algorithm to be included into the system is the automatic weighting of the TDOA
and acoustic streams using an initial weight W1 = 0.65. At this point, and given that the second
algorithm to include is the model complexity selection for the acoustic features, it is interesting
to see how the complexity of the TDOA models in the TDOA-MDM system affects the behavior
of the system. Figure 6.16 shows that 1 Gaussian mixture is the optimum initial complexity for
the TDOA models. The complexity of the TDOA models then increases as the different clusters
merge, by creating models that contain the sum of both parents' complexities.
As hinted above, the algorithms to include in the next step are the model complexity selection
and number of initial clusters. Figure 6.17 shows the evolution of the DER when changing the
values of the CCR parameter for the three considered systems. As when the algorithms were
studied individually, the TDOA-MDM system performs better with lower CCR values than the
other two. The optimum working point remains at CCR=7.
Figure 6.17: DER variation with the CCR parameter in the agglomerate system

Table 6.18 shows the development and evaluation set DER scores for all systems considered
up to this point, together with the average. It compares the current results with those of the
baseline and with the system at the prior agglomerative step, so that overall improvements can
be observed
as well as the relative improvement of using this technique in the agglomerate. While the average
DER values are all better than the baseline, on the evaluation set the SDM system performs
much worse than the baseline. Compared to the prior system (the stream-weight selection), an
improvement is seen mostly on the test set, where the prior system obtained worse results than
the baseline but, combined with this algorithm, obtains a 17.03% DER versus 18.65% DER, a
net gain of 1.62%.
Table 6.18: Results for the model complexity and number of initial clusters in the agglomerate
system
Next, the friends-and-enemies algorithm is evaluated in conjunction with the previous systems.
In the previous section it was determined that the combinations of metric and initial segment
selection corresponding to 2-2 and 3-3 gave the best results. Figure 6.18 shows the average DER
when using either set of parameters. Although the optimum number of friends was 3 when
evaluating the algorithm by itself, in this case the minimum for both combinations clearly moves
to 4 friends.
In Table 6.19 results are shown for the development and evaluation sets for both metric sets
with 4 friends per cluster, comparing them to the prior best system and to the baseline.
Figure 6.18: DER variation with the number of friends in the agglomerate system

Although on the development set the MDM system is improved with both metric alternatives
with respect to the prior and baseline systems, this is not enough to obtain a better average
result. On the evaluation set the same thing happens, this time with the TDOA-MDM system
obtaining much better results than previously, but they are masked by the bad performance of
the SDM and MDM conditions. Even though the algorithm shows that it can be useful for certain
tasks and conditions, on average it is unable to improve the agglomerate system at this point,
and therefore it is not included in the next step.
The next algorithm in succession is the CV-EM training algorithm. The function of the
CV-EM algorithm as used in this thesis is to execute at each training step the optimum number
of iterations that allow the cluster models to optimally model the data without overfitting or
undertraining. In order to compare this algorithm to the standard EM-ML training algorithm,
figure 6.19 shows the average DER as a function of the number of EM training iterations for
the system at this point using standard EM-ML. The optimum number of iterations is 5, as
used in the baseline system.
Figure 6.19: DER variation with the number of EM iterations of a standard EM-ML training algorithm

Figure 6.20: DER variation with the number of CV-EM parallel models

When using the CV-EM algorithm at this point in the system, figure 6.20 shows the DER
for all three considered systems when selecting different numbers of CV-EM parallel models.
Contrary to the same test performed with the algorithm in isolation, in this case all three systems
show a very similar behavior with respect to the number of models used. This can be explained
by the increased robustness of the system given the previously applied algorithms, which makes
its performance much more stable and less flaky to small changes in the parameters. In the
current CV-EM application the optimum number of models is 25, although 20 would be a good
choice too, as results are very stable in that region. Using 15 or fewer models increases the DER,
probably because the different models contain data that starts to differ too much from model
to model, which leads the EM steps to obtain divergent parameters for the models.
Table 6.20 shows the results comparing the inclusion of the CV-EM training algorithm to the
prior system (which does not include the friends-and-enemies initialization) and to the baseline
system. On the development set the main improvement comes from the MDM system, leading
to a final slight gain over the prior system. On the evaluation set results are much improved
and a 5.9% relative improvement is observed.
Table 6.20: Results for the CV-EM training algorithm in the agglomerate system
The next algorithm to introduce, according to the order of individual improvement, is the
frame purification algorithm. When used in isolation, the algorithm achieved optimum perfor-
mance when using 50% of the possible Gaussians and keeping 70% of the frames (eliminating
the 30% with the highest evaluated metric). In order to test both parameters in this setting, the
first sweep is performed on the percentage of frames while keeping the 50% of Gaussians fixed.
Figure 6.21: DER variation with the frame % acceptance for frame purification algorithm
Figure 6.21 shows the DER for the three compared systems for different frame acceptance
percentages. The case of 100% corresponds to the prior system, without any frame purification.
In all cases the curves show a high DER at 60%, with a stable improvement as the percentage
of accepted frames increases (with the exception of 80% in TDOA-MDM). Both SDM and
TDOA-MDM obtain several values with better DER than the 100% case, whereas for MDM the
100% case always behaves better. The optimum working point according to the average DER is
at 90% of frames accepted.
Fixing now the percentage of frames at 90%, the different values for the percentage of Gaussians
used are studied. Figure 6.22 shows the DER when this percentage goes from 20% to 100%. The
SDM and MDM systems have a very flat behavior, which is disrupted in the TDOA-MDM system
from 40% to 60%. The optimum points are at 25% and 100%, which are equivalent to the 2nd
and 1st metrics shown in the algorithm description.

Figure 6.22: DER variation with the percentage of Gaussians used in the frame purification algorithm
These results have a double interpretation. On one hand, the success of the algorithm in
improving the DER proves that acoustic frames with high likelihood are more prone to conveying
information that is not useful for discriminating between speakers. This could be used in other
fields, like speaker identification, where techniques based on frame bagging are already in use to
omit the frames with the lowest likelihoods.
Table 6.21: Results for the frame purification algorithm in the agglomerate system
Table 6.21 shows the DER for the frame purification algorithm (using both 25% and 100% of
the Gaussians). Both settings show excellent performance on the development set and on the
evaluation set, outperforming the baseline and the prior system (the agglomerate system with
CV-EM training). This corresponds to 7.4% and 3.5% relative improvements on the evaluation
set using 25% and 100% of the Gaussians respectively.
Finally, the segment purification algorithm is evaluated using the system composed of all
successful prior systems. Table 6.22 shows the results in comparison with the prior system
(selected with 25% Gaussians used) and the baseline.
Table 6.22: Results for the segment purification algorithm in the agglomerate system
Results are mixed, improving for certain systems and worsening for others. On average both
the development and evaluation sets obtain worse results than with the prior system. An
interesting effect is also noticed: for both the development and evaluation sets the MDM system
performs worse than the SDM, which indicates that the segment purification can somehow better
identify segments from other speakers when only one channel is used. Given these results and
the high computational load that the segment purification poses on the system, it is removed
from the final experiments.
In the previous section most of the algorithms proposed in this thesis for speaker diarization
have been analyzed, first by themselves, comparing them to the baseline, and then as part of an
agglomerate system in order to obtain an optimum final system.
System | DER devel | Improv. vs. prior | Improv. vs. baseline | DER eval | Improv. vs. prior | Improv. vs. baseline
Baseline | 18.71% | – | – | 23.23% | – | –
multi-stream weights | 17.93% | 4.16% | 4.16% | 23.97% | -3.18% | -3.18%
# init clusters + complexity | 17.19% | 4.12% | 8.12% | 23.18% | 3.29% | 0.21%
Friends-and-enemies init | 17.77% | -3.37% | – | 23.79% | -2.63% | –
CV-EM Training | 17.17% | 0.11%(*) | 8.23% | 21.79% | 5.99%(*) | 6.19%
Frame purification | 16.77% | 2.32% | 10.36% | 20.16% | 7.48% | 13.21%
Segment purification | 16.82% | -0.29% | – | 20.55% | -1.93% | –
Table 6.23: Summary of average DER for the agglomerate system on development and evaluation data
Table 6.23 shows a summary of the results analyzed in the previous section and computes
the relative improvement of each algorithm with respect to the previous one and to the baseline
(accumulating all improvements). Both the friends-and-enemies and segment purification algo-
rithms obtain bad results in this experiment and are therefore not included in the agglomerate
system result (for results marked with *, the relative improvement with respect to the prior
system is computed without taking the excluded algorithms into account). The algorithms not
included in this final experiment are still valid and obtain good results in certain situations, but
not on the average of all cases.
Taking into account the average DER between the SDM, MDM and TDOA-MDM system
outputs, and tuning all the algorithms to it, the proposed algorithms in the diarization module
improve the DER by up to 10.36% relative on the development set and up to 13.21% relative on
the evaluation set.
Such an approach of optimizing the average DER obtains systems that perform well for all the
different tasks considered. However, in some applications where multiple microphones are
available it is interesting to find the best result possible. To obtain it, the TDOA-MDM system
was selected and its parameters optimized according to the parameter sweeps performed in the
previous section.
Table 6.24: Results for the TDOA-MDM task using different algorithm settings
The obtained results are shown in table 6.24, where DER results are given only for the
TDOA-MDM task, which is usually the one with the best performance. By using the optimum
parameters according to the development sweeps in the previous section, and including all
algorithms in the system, the resulting optimum TDOA-MDM system obtains an improvement
of 5.08% on the evaluation set versus the system optimized for the average DER of all three
tasks.
The final system has a robust performance over changes in the data, sometimes at the cost
of not obtaining the absolute minimum DER in all cases. To illustrate this, consider the system
labelled as best TDOA-MDM on the development set, which is a system built using only the
automatic weighting algorithm and the definition of the number of clusters and the complexity
of the models. This system vastly outperforms the optimum systems on the development set,
but when changing to a different set it returns a poor performance. By using the optimum
systems once all algorithms are in place, the results are even on both sets at the expense of some
increase in the DER on the development set.
Table 6.25 shows the DER scores to illustrate the overall improvement achieved by the system
while transforming from broadcast news speaker diarization to diarization for meetings. The BN
baseline system shows the DER of the described baseline which uses model-based speech/non-
speech detection. As pointed out, even though this system is already a step forward from the
system at the start of this thesis work, it acts as a good baseline for all the work done in
beamforming and speaker diarization.
The meetings baseline is the same baseline as in the BN system, but using the hybrid
speech/non-speech detection and the baseline RT06s beamforming. Finally, the optimum TDOA-
MDM system has been presented earlier in this section and shows the optimum/robust results
obtained by using all proposed algorithms.
One interesting result for the BN baseline system is the outstanding difference between devel-
opment and evaluation results. While the system performs rather poorly on the development set,
for that particular combination of parameters it obtains a very good result on the evaluation set.
Running any acoustic beamforming using more microphones than just the SDM always results
in an increase in DER. In fact, an experiment using the RT06s beamforming on top of the
BN baseline achieves a 23.63% DER on the dev set (a slight improvement) and 26.3% DER on
the evaluation set (much worse, similar to the result for the meetings baseline).
This is another example of the flakiness and lack of robustness of the baseline system.
On one hand, while one set of data performs well, another set can perform very poorly for the
same parameter settings. On the other hand, when making changes to the system, not all datasets
behave the same way: achieving an improvement on one set does not mean that it translates
to others. This problem is the keystone of research in speaker diarization and has been a main
concern and issue during the whole thesis development, as random results jeopardize the
assessment of new techniques that, although beneficial to the system, might have been discarded
due to poor performance.
Given each independent recording (meetings, broadcast news or other sources), the speaker
diarization algorithm processes it and obtains an output segmentation. Such a segmentation
might show a slight improvement due to the applied algorithms but could also obtain a very
high DER due to factors like a badly chosen stopping point. As the final DER is the time-
weighted sum over all excerpts, if a few of them experience such bad behavior the final score is
worse than previous runs, misleading one into thinking that the tested algorithm is not correct.
When computing the DER for a small set of excerpts (8 to 10), these errors have a big impact
on the final score.
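Writing T_i for the scored speaker time of excerpt i and DER_i for its individual error rate, the pooled score is the time-weighted average

\[
\mathrm{DER}_{\mathrm{overall}} = \frac{\sum_i T_i \,\mathrm{DER}_i}{\sum_i T_i},
\]

so a single excerpt with a large product T_i · DER_i can dominate the result when only 8 to 10 excerpts are pooled.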
While the DER score is the standard way of measuring diarization performance in the NIST RT
evaluations, and is ultimately the metric to reduce in order to show improvements in the systems,
there are several alternatives to mitigate the problems posed above. On one hand, bigger
development and evaluation sets can be used so as to reduce the effect of these outliers. To
work with such datasets there must be accurate and consistent transcriptions to test against,
which should be obtained using automated mechanisms like forced alignments. On the other
hand, the DER metric could be altered to eliminate the outlier scores from the average during
development. Although this would handle the occasional excerpts with big errors, it does not
help improve those that are considered “hard nuts” (shows that always perform very badly), and
it is therefore difficult to define outlier boundaries that describe the system correctly.
Finally, for comparison purposes, table 6.26 shows the initial and final systems evaluated
this time using the hand-alignments proposed by NIST during the evaluation campaigns and
splitting results into evaluation sources. These are used for comparison only, as development
using these references was stopped in favor of the forced-alignment references, which are much
more robust across years.
Chapter 7

NIST Evaluations in Speaker Diarization
The National Institute of Standards and Technology (NIST) (National Institute for Standards
and Technology 2006) is an agency of the U.S. Commerce Department's Technology Admin-
istration that was created to provide standards and measurements for U.S. industry. Within
NIST, the speech group's mission is to contribute to the advancement of spoken language pro-
cessing technologies (both speech recognition and understanding) so that spoken language can
reliably serve as an alternative modality for the human-computer interface of the future. This
is done by defining performance measurement methods, providing reference materials, perform-
ing benchmark tests within the research and development community and building prototype
systems to show the latest speech technology advances and future applications.
In the last decade NIST has been performing a series of evaluations in the topic of speaker
diarization in order to empower research institutions to evaluate their research systems using
a common framework, including data and specifications to follow. Each evaluation in speaker
diarization has been included within a more general framework including other research areas.
From 2000 to 2002 speaker diarization was run within the speaker recognition evaluation (SRE,
see NIST Speech Recognition Evaluation (2006)) and from then on it has been included in
the Rich Transcription Evaluation (RT eval, see NIST Rich Transcription evaluations, website:
http://www.nist.gov/speech/tests/rt (2006)).
ICSI has participated regularly in the speaker diarization evaluations and also in the speaker
recognition and speech-to-text evaluations. During my stay at ICSI I have been part of the team
entering the latest broadcast news speaker diarization evaluation and the latest two evaluations
of speaker diarization for meetings. In this chapter an overview is first given of the evaluation
campaigns in speaker diarization for meetings over the last two years; then the systems ICSI
participated with are explained, as well as the performance they achieved. Finally some personal
insight is offered on the pros and cons of these evaluations.
7.1 NIST Rich Transcription Evaluations in Speaker Diarization for Meetings
The Rich Transcription evaluations conducted by NIST started with RT02s in 2002 and have
continued until the latest one (RT06s). According to NIST (Spring 2005 (RT-05S) Rich
Transcription Meeting Recognition Evaluation Plan (n.d.), Spring 2006 (RT-06S) Rich Transcrip-
tion Meeting Recognition Evaluation Plan (n.d.)), the Rich Transcription (RT) of a spoken
document addresses the need for information other than the set of words that have been said
(extracted with a Speech-to-Text, STT, system). When obtaining a transcription of the words
that have been spoken in a recording it is difficult to recover all the information that the speakers
tried to convey. This is because spoken language is much more than just the spoken words; it
contains information about the speakers, prosodic cues and intent, and much more.
The goal of future RT systems is for transcripts to be created with all sorts of metadata to
allow the user to fully understand the content of an audio recording without listening to it. In the
recent RT evaluations NIST has focused on three core technologies that are important elements
of the metadata content. These are Speech-to-Text (STT), Speaker Diarization (SPKR) and
Speech Activity Detection (SAD). In the last two years (RT05s and RT06s) evaluations have
been focusing on the meetings domain.
This section focuses on the two latest evaluations performed on the meetings domain, namely
RT05s and RT06s. These two are similar in that two different subdomains were proposed, with
different microphone configurations within each subdomain. All systems were allowed to run
with unlimited runtime speed so that they could be comparable within the same metrics. The
speed of each system was reported as part of the system description.
• Conference room meetings: These are conducted around a meeting table with several
participants involved in an active conversation among them. They contain varying amounts
of speaker overlap (depending on the nature of the meeting). These have been the focus
of research of several projects, including the European AMI project.
• Lecture room meetings: These are conducted in a lecture setting where a lecturer gives
a presentation in front of an audience, which normally interrupts with questions during
the talk. In these meetings the lecturer normally speaks for most of the time, and speech
becomes more balanced during question-and-answer sections. They have been the focus of
research of the European CHIL project.
In each one of the meeting rooms there are multiple microphones available which record
the signal synchronously. In some settings there are also cameras, but these fall outside the
scope of the speaker diarization evaluation. The microphones are clustered in different groups to
determine different conditions/evaluation subtasks. The following list describes the terminology
used for each of the possible groups and whether it is used in the speaker diarization evaluation
and in which domain:
SDM (Single Distant Microphone): This is defined as one of the centrally located microphones
in the room, located on the meetings table. This microphone is always part of the bigger
MDM group. Both lecture and conference room subdomains run this task.
MDM (Multiple Distant Microphones): These are a set of microphones situated on the meeting
table. All participants in the conference room subdomain sit around the table as well as
participants on the lecture room subdomain except for the lecturer. This task also exists
in both subdomains.
MM3A (Multiple Mark III Microphone Arrays): The lecture meetings contain one or two
of these arrays, which were built by NIST and contain 64 microphones set up linearly.
Diarization could be run on either the 64 channels or a beamformed version distributed
by Karlsruhe University for RT06s.
MSLA (Multiple Source Localization Microphone Arrays): These are four groups of four mi-
crophones positioned into a “T” shape array which were originally defined for speaker
localization. They are only found in the lecture subdomain.
ADM (All Distant Microphones): In lecture room recordings this task allows the system to use
all possible microphones previously explained (all except for the IHM microphones). The
conference room subdomain does not usually define this task as all distant microphones
are of MDM type.
IHM (Individual Headphone Microphone Arrays): Although not evaluated in the diarization
evaluations, these microphones are worn by some of the participants in the meetings. They
are a task in the STT evaluation and are also used when creating the forced-alignment
reference segmentations for speaker diarization.
Each one of the NIST evaluations starts one year in advance, during the workshop organized to
share results from the previous RT evaluation. There all the participants are able to make
comments on the different tasks and propose changes in the evaluation or possible new evalua-
tions. A schedule is then set for each of the necessary deadlines leading to the next evaluation.
During the months following the workshop normally a set of conference calls occur where
further details are polished in terms of available databases, metrics used or changes in the tasks.
A deadline is normally set for research groups to commit to run the evaluations. This is normally
about one month before the evaluation starts.
In the months prior to the evaluation period some development and training data is dis-
tributed, and for STT there are limits set on the sources of data that can be used for training,
so that what differentiates the systems is their algorithms and not the amount of data they are
trained on.
The evaluation data is handed to all sites at the same time and they have normally about
three weeks to process it. In RT05s and RT06s the conference room results were due a week earlier
than the lecture room results, to allow labs with fewer resources to process all. By participating
in the evaluation all sites make a pledge not to do any development using the evaluation data,
so that results from their systems are a realistic indication of performance on unseen data.
Once the results have been turned in to NIST, scores are normally delivered to the partici-
pating sites within a week. The score computed for each entry is the Diarization Error Rate
(DER), as explained in the experiments chapter. For the task of speech activity detection the
same score is used but considering any speaker segment as speech, whether it is one or multiple
speakers talking.
After results are made public, each participant prepares a paper describing the systems
they used in order to share the knowledge acquired during the evaluation. These papers are
presented in a workshop where all participants can meet each other and start planning for the
following year's evaluation. For both RT05s and RT06s the workshop has coincided with the
MLMI workshop, and the papers of the evaluation participants have been published in Lecture
Notes in Computer Science from Springer jointly with the workshop's papers.
The test datasets used in both the RT05s and RT06s evaluations were composed of conference
and lecture type data. The conference data is composed of ten and nine meeting excerpts,
respectively, of 12 minutes each. One meeting was eliminated from RT06s after the evaluation
finished for technical issues. These datasets have been used in this thesis to evaluate the different
proposed techniques and are covered in more detail in the experiments chapter and in appendix B.
The lecture room data for test was composed of excerpts of different sizes contributed by
the different partners in the CHIL project and corresponding to different instants in a lecture
meeting. In particular:
• RT05s test data was composed of 29 excerpts, all recorded at Karlsruhe University. Up to
three excerpts were selected from each meeting, but systems were not expected to process
the data from each meeting together. The majority of the data corresponded to the lecturer,
resulting in many excerpts where only one person was speaking. The shortest excerpt was
69 seconds long and the longest 468 seconds.
• RT06s test data was composed of 38 excerpts of five minutes each, recorded in 5 different
CHIL meeting rooms: 4 at AIT, 4 at IBM, 2 at ITC, 24 at Karlsruhe and 4 at UPC. This
year the excerpts were chosen to contain a bigger variety of speakers and situations. After
the evaluation finished, the set was reduced to 28 excerpts for technical reasons.
The development data used in these evaluations was usually a compilation of the data sets
from previous evaluation campaigns. The sets used for conference room data were from the
RT02s and RT04s evaluations for RT05s, and a subset of RT02s through RT05s for the RT06s
evaluation. For the lecture room evaluations, as this subdomain was first included in RT05s,
there were no prior datasets available and therefore NIST distributed a set of transcribed lecture
recordings similar to those in RT05s. For RT06s, development was done using a subset of the
original development set plus the RT05s evaluation set.
Although the diarization system does not use any training data, the speech/non-speech
detector used in RT05s needed to be trained. It used around 80 hours of meetings data extracted
from the ICSI meeting corpus.
In this section an overview of the ICSI participation in the NIST RT evaluations on meetings
for 2005 and 2006 is given. These evaluations served as a test for the techniques and algorithms
presented in this thesis, which were created and evolved during this time.
For the RT05s evaluation ICSI presented several systems combining different alternative algo-
rithms. All combinations have in common:
• Frontend composed of a single acoustic stream with 19th order MFCC, no deltas, 30 msec
analysis window, 10 msec step size.
• Each initial cluster is modeled with a GMM with five Gaussian mixtures.
• Iterative segmentation/training.
ICSI participated in the speaker diarization task on conference room and lecture room data.
The speech activity detection (SAD) algorithm was also run on the data, but it did not compete
in the official SAD evaluation. The systems submitted for each domain are described next.
For the conference room environment the submission consisted of one primary system in each
of the MDM and SDM conditions. The MDM system uses filter&sum to acoustically fuse all
the available channels into one enhanced channel, and then applies speaker diarization to this
enhanced channel. The SDM condition skips the filter&sum processing, as the system's input
is already a single channel (from the most centrally located microphone according to NIST).
The filter&sum processing lacks some of the delay post-processing improvements presented in
RT06s.
In the lecture room environment the submission consisted of primary systems for the MDM,
SDM and MSLA tasks, and contrastive systems for MDM (two systems), SDM and MSLA
(two systems).
The following is a brief description of each of these systems and their motivation:
• MDM, SDM and MSLA primary condition (MDM/SDM/MSLA p-omnione): It was ob-
served in the development data that on many occasions it was possible to obtain the best
performance by just guessing one speaker for the whole duration of the lecture. This is
particularly true when the meeting excerpt consists only of the lecturer speaking, but is
often also achieved in the question-and-answer section since many of the excerpts in the de-
velopment data consisted of very short questions followed by long answers by the lecturer.
They were therefore presented as the primary submissions, serving also as a baseline score
for the lecture room environment. Contrary to what was observed in the development data,
the contrastive (“real”) systems outperformed the primary (“guess one speaker”) submis-
sions on the evaluation data. Depending on what data is to be processed (the length of
the lecturer turn and the amount of silence in the recordings) it might not be feasible to
improve upon a “dummy” system with the current state of the art diarization systems.
• MDM using speech/non-speech detection (mdm c-spnspone): This differs from the primary
submission only in the use of the speech/non-speech (spnsp) detector to eliminate the areas
of non-speech. On the development data it was observed that non-speech regions were
only labelled (in the hand-made references) when there was a change of speakers, which
never happened for the “all lecturing” sections. In a real system, though, it is important
to detect these silences and not attribute them to speakers. This submission is meant to
complement the previous one by trying to improve performance where between-speech
silences are marked.
• MDM using only the TableTop microphone (mdm c-ttoppur): Of the five available mi-
crophones in the lecture room, one microphone (labelled as the “TableTop” microphone) is
of clearly better quality than all the others (which can be verified via an SNR comparison
among the channels). It is located in a different part of the room and is of a different kind,
which could be the reason for its better performance. In the evaluation data it was found
by using an SNR estimator, and the standard diarization was run on it. No spnsp detection
was used in this system.
• SDM using the SDM channel with a minimum duration of 12 seconds for each cluster
(sdm c-pur12s): This uses the clustering system on the SDM channel and does not use the
spnsp detector either. It was observed that, using a minimum duration of 12 seconds, the
issue of silences marked as speech in the reference files could be bypassed, forcing the
system to end with fewer clusters.
• MSLA with standard filter&sum (msla c-nwsdpur12s): In order to combine the various
available speaker-localization arrays, we used the filter&sum processing, using a random
channel from one of the arrays as the reference channel. The enhanced channel obtained
was then clustered using the 12 second minimum duration system.
• MSLA with weighted filter&sum (msla c-wsdpur12s): In the time between the conference
room and lecture room submissions, experiments were performed with a first version of
the weighted filter&sum algorithm as presented in this thesis. It was applied to the MSLA
channels in this system.
The main metric used for the RT05s evaluation was the Diarization Error Rate (DER) not
taking into account the speaker overlap regions. The DER scores as they were released by
NIST are shown in the ninth column of table 7.1, together with a summary of each system’s
characteristics. The numbers in the tenth column reflect improvements after small bug fixes
right after the evaluation, mainly coming from problems in two of the meetings.
1 This system uses a weighted version of filter&sum using correlations (slightly different from the one presented in this thesis).
System ID | Room type | Task | Submit type | Delay&sum | # Initial clusters | Acoustic min. dur. | Mics used | DER | post-eval DER
p-dspursys | Conf. | MDM | Primary | YES | 10 | 3 sec | All | 18.56% | 16.33%
p-pursys | Conf. | SDM | Primary | NO | 10 | 3 sec | SDM | 15.32% | —
p-omnione | Lect. | MDM | Primary | NO | n/a | n/a | n/a | 12.21% | —
c-spnspone | Lect. | MDM | Contrast | NO | n/a | n/a | n/a | 12.84% | —
c-ttoppur | Lect. | MDM | Contrast | NO | 5 | 5 sec | Tabletop | 10.41% | 10.21%
p-omnione | Lect. | SDM | Primary | NO | n/a | n/a | n/a | 12.21% | —
c-pur12s | Lect. | SDM | Contrast | NO | 5 | 12 sec | SDM | 10.43% | 10.47%
p-omnione | Lect. | MSLA | Primary | NO | n/a | n/a | n/a | 12.21% | —
c-nwsdpur12s | Lect. | MSLA | Contrast | YES | 5 | 12 sec | All | 9.98% | 9.66%
c-wsdpur12s | Lect. | MSLA | Contrast | YES (1) | 5 | 12 sec | All | 9.99% | 9.78%
Table 7.1: Systems summary description and DER on the evaluation set for RT05s
In figures 7.1 and 7.2 the DER scores are shown for each of the excerpts used in the evaluation
for conference and lecture room data, respectively. The different excerpts appear on the
horizontal axis and the DER on the vertical axis, with one curve for each of the systems
described above. For the lecture room data the figure omits the full meeting names and only
shows their endings, which indicate the content of the meeting. Excerpts ending in “E1” or
“E3” contain only the lecturer, so it is easier for the system to obtain a perfect diarization.
[Figure: DER per meeting for the MDM and SDM systems on the RT05s conference data.]
Figure 7.1: DER break-down by meeting for the RT05s conference data
The use of filter&sum to enhance the signal before clustering turned out to be a bad choice
for the conference room systems, as the SDM DER is lower than the MDM DER. This is
explained by the large difference in signal quality among the microphones: when the best
quality microphone is used as the SDM channel, it is difficult to improve on that signal by
combining it with the other channels via filter&sum. A weighted version of the algorithm was
therefore proposed to automatically (and adaptively) give more weight to the channels with
better quality signal (a simplified sketch of this idea is given after figure 7.2). The weight
computation was further improved for the RT06s evaluation.
[Figure: DER per excerpt for the mdm, sdm and msla submissions on the RT05s lecture data.]
Figure 7.2: DER break-down by show for the RT05s lecture data
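The following sketch illustrates the general idea of the weighted filter&sum discussed above: each channel is aligned to a reference channel using the cross-correlation peak and then weighted by its normalized correlation with that reference. This is a simplified stand-in (assuming equal-length numpy signals), not the exact algorithm submitted to the evaluations.

import numpy as np
from scipy.signal import correlate

def weighted_delay_and_sum(channels, ref_index=0):
    # Simplified weighted delay-and-sum over equal-length channels.
    ref = np.asarray(channels[ref_index], dtype=np.float64)
    aligned, weights = [], []
    for chan in channels:
        chan = np.asarray(chan, dtype=np.float64)
        xcorr = correlate(ref, chan, mode="full")
        delay = int(np.argmax(xcorr)) - (len(chan) - 1)   # samples to shift chan
        shifted = np.roll(chan, delay)                     # circular shift; fine for a sketch
        # Normalized correlation with the reference acts as the channel weight.
        w = np.dot(ref, shifted) / (np.linalg.norm(ref) * np.linalg.norm(shifted) + 1e-12)
        aligned.append(shifted)
        weights.append(max(w, 0.0))
    weights = np.array(weights)
    weights /= weights.sum() + 1e-12                       # weights sum to one
    return np.sum(weights[:, None] * np.vstack(aligned), axis=0)

In the system described in this thesis the weights are computed from channel correlations and updated adaptively over time, rather than being fixed per channel as in this sketch.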
For the RT06s evaluation a total of 23 systems were presented in the multiple tasks and subtasks
proposed. Each system uses one or more of the improvements presented in this thesis. The
common characteristics of all systems are:
• Frontend composed of at least one acoustic stream of 19th order MFCC features, with no
deltas, a 30 msec analysis window and a 10 msec step size (see the sketch after this list).
• Each initial cluster is modeled with a GMM with five Gaussian mixtures.
• Iterative segmentation/training.
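The common acoustic frontend above can be approximated with standard tooling. A minimal sketch, assuming librosa and a 16 kHz single-channel recording (the exact filterbank configuration of the frontend used at ICSI may differ):

import librosa

def extract_frontend_features(wav_path):
    # 19 MFCCs, no deltas, 30 ms analysis window, 10 ms step.
    y, sr = librosa.load(wav_path, sr=16000)
    win = int(0.030 * sr)      # 480 samples
    hop = int(0.010 * sr)      # 160 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                                n_fft=512, win_length=win, hop_length=hop)
    return mfcc.T              # shape: (n_frames, 19)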
In the following list the main characteristics of the presented systems are explained. Across
tasks, systems with the same ID are identical or very similar, differing only in a few parameters.
Their characteristics are:
p-wdels: This is the primary system presented this year for all multi-microphone conditions. It
uses most of the proposed improvements of this thesis, and all changes in the diarization
code from last year’s evaluation.
c-newspnspdelay: This system is presented for the multi-microphone cases and is composed of
the RT05s evaluation code using the new filter&sum algorithm, this year’s hybrid speech/non-
speech detector and the delays as an extra feature stream for clustering. It uses a minimum
duration of 3 seconds, 1/5 initial Gaussians for the delay/acoustic streams and a stream
weight split of 0.1/0.9, fixed for all meetings (a sketch of this stream combination is given
after this list). It is intended to measure the improvement from using delays in the system
with respect to last year’s performance.
c-wdelsfix: This system is identical to p-wdels except for the decision of the initial number
of clusters, which is fixed to 16 and 10 clusters for conference and lecture rooms, respectively.
It is intended to assess the robustness of the automatic selection of the initial number of clusters.
c/p-nodels: This system contains all of the RT06s improvements with respect to filter&sum
(when available, in MDM), speech/non-speech detection and the other diarization algorithms,
except for the inclusion of the delays as an extra feature stream.
c-oldbase: This system uses all improvements in filter&sum (when available, in MDM) and
speech/non-speech detection while using the RT05s core speaker diarization system. It is
meant to serve as a baseline result for RT06s systems.
c-guessone: This system guesses a single speaker for the whole show. In RT05s this was presented
as the primary system for lecture room data, showing the need to beat such a system before
speaker diarization on lecture data can be considered a reasonable task. In RT06s it is again
presented as a baseline lecture-room system to compare against the other lecture-room
systems.
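To illustrate how the delay features can be combined with the acoustic features during clustering (as in c-newspnspdelay), the sketch below evaluates a weighted combination of the per-frame log-likelihoods of two GMMs, one per stream, with the 0.1/0.9 split mentioned above. It is a simplified illustration using scikit-learn GMMs; the actual system uses these likelihoods inside its iterative segmentation and training.

from sklearn.mixture import GaussianMixture

def combined_stream_loglik(acoustic_feats, delay_feats,
                           acoustic_gmm, delay_gmm,
                           w_acoustic=0.9, w_delay=0.1):
    # Per-frame log-likelihood of one cluster, combining an acoustic GMM
    # and a delay GMM with fixed stream weights.
    return (w_acoustic * acoustic_gmm.score_samples(acoustic_feats)
            + w_delay * delay_gmm.score_samples(delay_feats))

# One cluster: 5 Gaussians for the acoustic stream, 1 for the delay stream
# (hypothetical feature matrices: 19-dim MFCC frames and TDOA vectors).
acoustic_gmm = GaussianMixture(n_components=5, covariance_type="diag")
delay_gmm = GaussianMixture(n_components=1, covariance_type="diag")
# acoustic_gmm.fit(cluster_mfcc); delay_gmm.fit(cluster_tdoa)
# ll = combined_stream_loglik(test_mfcc, test_tdoa, acoustic_gmm, delay_gmm)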
In this section the NIST official scores are shown for all of the ICSI systems presented in the
RT06s evaluation in the speaker diarization (SPKR) task and the speech activity detection
(SAD) task. In RT06s the main metric used was DER including the speaker overlap regions. In
tables 7.2 and 7.3, the SPKR results are shown both for conference and lecture room data, and
in table 7.4 the SAD results are shown. During the development of the systems for RT06s the
focus was switched to using forced-alignments as reference segmentations instead of hand-alignments,
which were believed to be less reliable. The results tables therefore show scores against both
the official hand-made references and the forced-alignment references.
In general, results for RT06s using hand-alignments were much worse than in previous years
for conference room data, an effect that was much less pronounced when scoring against the
forced alignments. This might be due to the increased complexity of the data and to a decrease
in the quality of the hand-generated transcriptions for the RT06s evaluation.
Table 7.2: Results for RT06s Speaker Diarization, conference room environment
Table 7.3: Results for RT06s Speaker Diarization, lecture room environment
In the SPKR task for the conference room a substantial improvement can be seen between the
first three MDM systems and the last two, due to the use of delays as features in diarization. In
lecture room data (Table 7.3, third column) the use of delays negatively affects performance,
possibly due to people moving around the room (the delays assign a different speaker to each
location).
Figure 7.3 shows the DER per meeting for each of the presented systems. It is interesting to
observe that the primary MDM system (mdm p-wdels) obtains flatter scores across the shows
than last year’s system, labelled as mdm c-newspnspdelay; both are shown with dashed lines
in figure 7.3.
In general, the more microphones available for processing, the better the results. Since the
diarization system is the same, the improvement comes from the filter&sum processing. This
is clear in the conference room data, while in the lecture room data results are mixed; it is
believed that this is due to the large difference in quality between the microphone used as
SDM and all the others.
[Figure: DER per meeting for the RT06s conference systems (mdm and sdm variants).]
Figure 7.3: DER break-down by show for the RT06s conference data
In the lecture room results shown in Table 7.3 a comparison is made between the manual and
forced-alignment DER for all systems submitted. The third column shows the results using the
latest release of the manual reference segmentations (18 meeting segments). When generating the
forced-alignments using the IHM channels from each individual speaker we could not produce
them for the meeting segments containing speakers not wearing any headset microphone. The
last column shows results using forced-alignment references for a subset of 17 meeting segments
in which all speakers wore a headset microphone. For comparison purposes, the second-to-last
column shows results on this same subset using the hand-alignments.
Results using FA references are much better than those using hand-alignments in the conference
room, while they remain similar in the lecture room (with a constant improvement of 0.5% to 1%
for FA). It is believed that the conference room manual references contain many human-created
problems, which were filtered out of the lecture room references after these were redistributed
several times.
Figure 7.4 shows the break-down of the DER for all presented systems on the lecture room
data. Some meetings are much harder to process than others, creating spikes in the DER curves
that are more or less pronounced depending on the system. In some cases the ADM systems
perform as well on these “hard” meetings as on the easier ones.
Finally, table 7.4 shows results on conference and lecture room data for the SAD task, using
the new speech/non-speech detector developed for the RT06s evaluation.
[Figure: DER per meeting for the RT06s lecture systems (adm, mdm, msla and sdm variants).]
Figure 7.4: DER break-down by show for the RT06s lecture data
Table 7.4: Results for RT06s Speech Activity Detection (SAD). Results with * are only for a
subset of segments
The RT06s speech/non-speech detector was developed using forced-alignment (FA) data. The
SAD results are therefore better when scored against the forced-alignment references, as shown
in the forced-alignment column. The increase in % MISS on the hand-aligned conference data
compared to the FA results is probably due to silence regions (longer than 0.3s) that are
correctly labelled in the FA transcriptions but are considered speech by the hand-alignments.
As was done for the diarization experiments, a subset of meetings was created to appropriately
evaluate the lecture room systems using forced-alignment references, along with the corresponding
hand-alignments for completeness. One initial observation is that the error rate decreases
dramatically when evaluating only this subset of shows with hand-alignments. Possible explanations
are transcription errors caused by the lower quality of the non-headset microphones used in
the eliminated meetings, and/or an overall decrease in the quality of those meetings for reasons
other than the transcription process.
As in the diarization results, these experiments also obtain better results the more microphones
are used, thanks to the filter&sum module. When comparing the forced-alignment subset with
the hand-alignment subset, the former keeps a better balance between misses and false alarms,
indicating that parameters defined in development translate robustly to the evaluation data.
Overall, for RT06s there was a big improvement from using the delays between microphones
as a feature in the diarization process for conference room data, while mixed results were
obtained for the lecture room. Also, a general improvement was observed when using filter&sum
on as many microphone signals as possible.
I strongly believe in the advantages of any evaluation where different independent research
groups work towards solving a common problem. But as much as I think they are beneficial,
there are some issues that could be improved.
Evaluations are also a good framework for sharing resources between research groups, which
allows better systems to be created and more systems to reach the best possible performance.
This was the case in RT06s, when Karlsruhe University shared the output of their beamforming
system in order to allow other labs to obtain results and perform research with the MM3A
microphone array.
One drawback of the current rich transcription evaluations is the reduced number of participants
in some of the tasks. This has been addressed by setting up smaller tasks, like Speech Activity
Detection (SAD), in which many groups participated in RT06s.
By repeating the evaluations in successive years, technology and new ideas brought in by one
group can be used by another with the purpose of solving or improving the problem at hand.
Baseline tools and systems should be made available to research groups willing to participate,
in order to allow them to obtain competitive results without the need to build a whole system.
Chapter 8
Conclusions
In this chapter, a general review is first given of the improvements achieved by the end of this
thesis. Then all the objectives proposed in the introduction are reviewed and their completion
is analyzed. Finally, future work still to be done is proposed.
This PhD thesis addresses the topic of speaker diarization for meetings. In answering the
question “Who spoke when?”, the presented speaker diarization system is able to process a
variable number of microphones spread around the meeting room and determine the optimum
output without any prior knowledge of the number of speakers or their identities.
The presented system takes as baseline the speaker diarization technology for broadcast news
existing at the International Computer Science Institute (ICSI) and adapts it to the meetings
domain by developing new algorithms and improving existing ones. While prior work on speaker
diarization for meetings proposed some form of parallel diarization of the individual channels
followed by a fusion of the multiple channel outputs, the proposed system uses acoustic
beamforming to obtain an “enhanced” single channel together with information about the
speaker positions, and combines both in a single-channel speaker diarization process.
The system then discards non-speech segments using a new hybrid speech/non-speech detector
and processes both the acoustics and the speaker position information. The new algorithms
include automatic model complexity selection, model initialization and training, selection of
the number of initial clusters and their initial segments, and frame and segment purification,
among others.
The development of the system was closely linked to participation in the speaker diarization
evaluations within the Rich Transcription (RT) evaluations for meetings proposed by NIST in
2005 and 2006. In both submissions the systems proposed by ICSI, for both lecture and
conference room data, performed very well.
Experiments were done using the NIST Rich Transcription evaluation datasets to analyze
the suitability of each individual module, obtaining results that can be easily compared with
other systems and implementations. A 41.15% relative improvement is reported on the development
set when comparing the system at the start of the thesis with the optimum system proposed,
and a 25.45% relative improvement is reported on the evaluation set.
At the start of the thesis (as stated in the introduction) a set of objectives was defined. At this
point these have all been successfully completed and are reviewed in the following paragraphs.
In general, a successful system was implemented based on the broadcast news technology
available at ICSI at the start of the thesis. During this process most of the differences between
broadcast news and meetings were analyzed and algorithms were proposed to bridge the gap
between the two. These include, for example, the multi-channel setup in meetings versus the
single broadcast news channel, the different nature of the non-speech data to be detected and
the existence of shorter (on average) speaker turns. The system was made modular so that
the acoustic beamformer, the speech/non-speech detector and the speaker diarization modules
are independent from each other and exchange information through files. This allowed the
beamforming module to be reused in the automatic speech recognition system for meetings,
with good results.
Two main ideals that were already in place at ICSI at the start of the thesis were followed
closely: the system should be easy to adapt to new domains, and its parameters should be
robust rather than flaky. In terms of easy adaptation, as already mentioned, the system was
developed as separate blocks to allow easy recombination depending on the needs. In fact,
within the meetings domain, the same speaker diarization module and speech/non-speech
module were reused for the SDM and MDM conditions, either using the beamforming as an
initial step or not. The core speaker diarization module was kept very similar to the broadcast
news system, so it could be readapted to that domain with little effort.
Robustness and flakiness are problems present in many speaker diarization systems currently
under research. Mostly through the inclusion of the proposed new diarization algorithms, the
experiments have shown that final results on development and test data follow each other
closely, showing an increase in robustness with respect to the start of the thesis work. Regarding
flakiness, some parameters were defined to substitute others which showed important differences
in Diarization Error Rate (DER) when their value was slightly modified. The DER accounts
for the percentage of incorrectly assigned time. With the new parameters, in many cases, the
DER curves were shown to be flatter, therefore reducing the flakiness. In some other cases
there is still work to do.
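For reference, the DER mentioned above can be sketched at frame level as follows. This is a simplified illustration (the official scores are produced by NIST's scoring tool over time segments, with a forgiveness collar and overlap handling that this sketch ignores):

from itertools import permutations

def frame_der(ref, hyp):
    # ref and hyp are per-frame speaker labels (None marks non-speech).
    # Finds the best one-to-one speaker mapping and returns the fraction of
    # scored (reference speech) frames that are missed, falsely detected
    # or attributed to the wrong speaker.
    ref, hyp = list(ref), list(hyp)
    ref_spk = sorted({r for r in ref if r is not None})
    hyp_spk = sorted({h for h in hyp if h is not None})
    # Pad so each hypothesis speaker maps to a reference speaker or to nothing.
    padded_ref = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))
    best_err = None
    # Brute-force mapping search; fine for the small speaker counts in meetings.
    for perm in permutations(padded_ref, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        err = 0
        for r, h in zip(ref, hyp):
            if r is None and h is None:
                continue
            if r is None:
                err += 1                     # false alarm speech
            elif h is None:
                err += 1                     # missed speech
            elif mapping[h] != r:
                err += 1                     # speaker error
        best_err = err if best_err is None else min(best_err, err)
    scored = sum(1 for r in ref if r is not None)
    return best_err / max(scored, 1)

# 8 scored reference frames, 1 speaker error -> DER = 1/8 = 12.5%
ref = ["A", "A", "A", None, None, "B", "B", "B", "B", "A"]
hyp = ["1", "1", "1", None, None, "2", "2", "1", "2", "1"]
print(frame_der(ref, hyp))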
Although the system needs development data in order to tune some of its parameters, with
the development of the hybrid speech/non-speech detector it no longer requires any external
training data. This also speaks in favor of the self-sufficiency of the system, another ideal
followed during its implementation, very much in tune with the capability for fast adaptation
to new domains and robustness to changes in the data being tested.
Both in 2005 and 2006 the speaker diarization system entered the NIST Rich Transcription
(RT) evaluations, where a common task and common datasets were processed by multiple
research laboratories. In both entries the ICSI system performed very well. Participation was
established as a goal and milestone in order to push the research and development of the system
to be ready for the evaluations. In 2005 the main improvement consisted of the development
of the initial version of the beamforming system. In 2006 it was a set of improvements to the
beamforming and many changes to the speaker diarization module, as well as a totally new
speech/non-speech detector.
As important as the technical improvements and innovations is the task of increasing public
awareness of the system and algorithms being proposed. In this respect, the RT evaluations
are a wonderful way to meet people from the same research area and to expose one’s research
to the community. Another very important way is the publication of articles in conferences and
technical journals. Since the start of this thesis work more than 15 papers have been accepted
for publication explaining the different improvements and capabilities of the system.
Yet another way is the transfer of technology or knowledge between research labs, which
allows researchers to build on top of pre-established research from others. This was the case
of the TNO-Twente research group (within the AMI project), which implemented part of their
RT06s contribution based on ICSI’s system, or LIA (Avignon), which experimented with the
segment purification algorithm originally proposed in ICSI’s RT05s submission. Also in this
group is the direct transfer of resources by means of system source code, as was originally
done by IDIAP to bring the initial speaker diarization system to ICSI (thanks to Jitendra
Ajmera), and more recently by the author to take it to the University of Washington (UW).
Finally, the diarization system has recently been adapted for speaker tracking in a Spanish
evaluation task within UPC ??.
As with everything else in life, there is more that could be done, and this thesis is no exception.
One topic within the meeting domain that has recently received considerable attention is
speaker overlap detection, that is, the detection of segments where more than one speaker is
talking at the same time and the output of an appropriate ID for each participant. In the
NIST RT06s evaluation the main metric included overlap for the first time, and multiple
research labs (including the author) investigated techniques for its detection without any
success in reducing the overall error. There is still work to be done in detecting when more
than one person is speaking, which should come both from the beamforming module (where
speakers are well determined by their location) and from the diarization module (where data
with multiple overlapping speakers has special acoustic properties compared to single speakers).
For ASR systems for meetings it would also be very beneficial to create multiple signals in
overlap regions, each one obtained by steering the beamformer towards one of the speakers.
Another area where research should be directed is the creation of strong links between the
ASR transcription output and diarization. Although the use of diarization algorithms to help
ASR systems with model adaptation is well established, the use of ASR for diarization has
only been studied briefly. It could be useful in a number of areas, such as the definition of
possible speaker change-points (ranging from word-level to discourse-level boundaries), or the
assignment of speaker IDs (or the correct name) based on the transcription content (on which
LIMSI has done some research). Both areas could also benefit from the combination of plausible
speaker labels with the ASR N-best words at each instant, which could be used at decoding
level to reduce the errors in both the ASR and diarization tasks.
In the topic of discourse modeling, speaker diarization could benefit from research on ways
to model the turn-taking between speakers. By using information at a higher level than pure
acoustics, the transition probabilities between speakers could be set appropriately to help the
decoding. One such piece of high-level information is easily noticeable in broadcast news,
where the anchor speaker is very likely to speak after every other speaker. A similar analysis
could be made in the meetings domain, possibly classifying the meetings into several types
(more fine-grained than the current lecture/conference classification) to which different models
would be applied. Possible types could be: moderated meetings (with one person acting as
anchor), structured meetings (people following an order, without an anchor) and unstructured
meetings (where everyone intervenes at random, presumably with a higher amount of overlap
regions). It could also be considered to split the meeting into several parts, with each person’s
participation dependent on which part/topic the meeting is in.
One of the objectives of this thesis was to increase the robustness of the system to mismatches
between the development data and the test data, and to make the system parameters less
sensitive to the processed data by obtaining parameters more closely linked to the acoustics
and by eliminating any model training step from the system. The experiments section has
shown that an important step forward has been taken in that direction. There is still more
that can be done on this topic towards eliminating as many tuning parameters as possible,
letting the algorithms select such parameters solely from the data. It is also important to
better understand the underlying processes that lead the system to score very differently
depending on the particular meeting (leading to “easy” and “difficult” meetings).
Finally, although current systems in the RT evaluations are defined with no particular application
in mind, trying to be adaptable to any possible application, this poses a burden on the capacity
of such systems to obtain the optimum score and makes them more computationally intensive,
as most of the algorithms used for diarization are iterative. It would be interesting to explore
particular areas of application where, for example, the number of speakers in a meeting is
known. This particular information would probably change the way speaker diarization
algorithms are designed and would allow for lower DER scores, most probably in the region
where speaker identification techniques are nowadays.
Appendix A
BIC Formulation for Gaussian Mixture Models
The purpose of this appendix is to show the equivalence between two different representations
of the Bayesian Information Criterion (BIC): one based on the likelihood of the data given the
models, which allows the models to be arbitrary and as complex as necessary for the task at
hand, and another that depends only on the sufficient statistics of the data, which considers
the case of a single Gaussian modeling the data. These two representations are used
interchangeably in the literature with various modifications, which sometimes cause the results
not to be comparable with each other.
\[
BIC(M_i) = \log L(X_i, M_i) - \frac{1}{2}\,\lambda\,\#(M_i)\,\log(N_i) \tag{A.1}
\]
where \(\log L(X_i, M_i)\) is the log-likelihood of the data given the considered model, \(\#(M_i)\)
is the number of free parameters in the model and \(N_i\) is the number of data points. The
parameter \(\lambda\) is a design parameter which is not part of the original BIC formulation but
is used to change the effect of the penalty term. This formula allows the model \(M_i\) to be
of any kind.
If instead the model is considered to be a single full-covariance Gaussian, eq. A.1 can be
rewritten as

\[
BIC(M_i) = -\frac{1}{2} N_i \log(|S_i|) - \frac{N_i}{2}\, d\,(1 + \log(2\pi)) - \frac{1}{2}\,\lambda\,\#(M_i)\,\log(N_i) \tag{A.2}
\]
where \(S_i\) is the covariance matrix of the data and \(d\) is its dimension. This formulation
depends only on the sufficient statistics of the data, and therefore its computation is very fast.
Let us now derive equation A.2 from equation 2.1. Considering that the model is a single
Gaussian with full covariance, the log-likelihood term of eq. 2.1 can be written as
\[
\log L(X_i, M_i) = \sum_{n=1}^{N_i} \log p(x_i[n] \mid M_i)
= \log \prod_{n=1}^{N_i} \frac{1}{(2\pi)^{d/2} |S_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_i[n]-\bar{x})' S_i^{-1} (x_i[n]-\bar{x}) \right) \tag{A.3}
\]
By expanding the \(N_i\) products one obtains a sum of terms in the exponential, where each
term is a scalar value. One can use the properties of the trace to obtain a closed form for it:
since the trace of a scalar is the scalar itself, applying the trace does not change the result.
Let us then consider the trace of the sum of quadratic forms in the exponent:

\[
Tr\!\left[ (x_i[1]-\bar{x})' S_i^{-1} (x_i[1]-\bar{x}) + \cdots + (x_i[N_i]-\bar{x})' S_i^{-1} (x_i[N_i]-\bar{x}) \right]
\]

1. Applying the property \(Tr[\sum_n A_n] = \sum_n Tr[A_n]\):

\(\cdots = Tr[(x_i[1]-\bar{x})' S_i^{-1} (x_i[1]-\bar{x})] + \cdots + Tr[(x_i[N_i]-\bar{x})' S_i^{-1} (x_i[N_i]-\bar{x})]\)

2. Applying the circular property of the trace to each element:

\(\cdots = Tr[(x_i[1]-\bar{x})(x_i[1]-\bar{x})' S_i^{-1}] + \cdots + Tr[(x_i[N_i]-\bar{x})(x_i[N_i]-\bar{x})' S_i^{-1}]\)

3. Applying the matrix algebra property \(AB + CB = (A+C)B\), one can isolate the inverse
covariance matrix. Given the definition of the (maximum likelihood) covariance matrix,
\(\sum_n (x_i[n]-\bar{x})(x_i[n]-\bar{x})' = N_i S_i\), one can identify it in the equation and therefore obtain

\(\cdots = Tr[N_i S_i S_i^{-1}] = N_i\, Tr[I] = N_i\, d\)
Given this result, substituting it back into the likelihood of eq. A.3, plugging that likelihood
into the BIC formulation of eq. A.1 and using the properties of the logarithm:

\[
BIC(M_i) = \log\!\left[ \frac{1}{(2\pi)^{d N_i/2}\, |S_i|^{N_i/2}}\, e^{-N_i d/2} \right] - \frac{1}{2}\,\lambda\,\#(M_i)\,\log(N_i)
= -\frac{d N_i}{2}\log(2\pi) - \frac{N_i}{2}\log(|S_i|) - \frac{N_i d}{2} - \frac{1}{2}\,\lambda\,\#(M_i)\,\log(N_i)
\]
Finally obtaining

\[
BIC(M_i) = -\frac{1}{2} N_i \log(|S_i|) - \frac{N_i}{2}\, d\,(1 + \log(2\pi)) - \frac{1}{2}\,\lambda\,\#(M_i)\,\log(N_i) \tag{A.4}
\]

which is in fact equation A.2. Note that a factor of 1/2 applies to each term in the expression.
This factor is sometimes omitted, causing the optimum \(\lambda\) value to differ between
implementations.
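The equivalence just derived can be verified numerically. A minimal sketch, assuming numpy, that computes the BIC of a single full-covariance Gaussian both from the per-frame log-likelihood (eq. A.1) and from the sufficient statistics (eq. A.2), taking #(M_i) as the number of free parameters of the mean and covariance:

import numpy as np

def bic_from_loglik(X, lam=1.0):
    # Eq. A.1 with a single full-covariance Gaussian estimated on X.
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)        # ML covariance (divide by N)
    diff = X - mean
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    loglik = -0.5 * np.sum(quad + d * np.log(2 * np.pi) + np.log(np.linalg.det(cov)))
    n_params = d + d * (d + 1) / 2                  # free parameters: mean + covariance
    return loglik - 0.5 * lam * n_params * np.log(n)

def bic_from_stats(X, lam=1.0):
    # Eq. A.2: only the sufficient statistics of X are needed.
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True)
    n_params = d + d * (d + 1) / 2
    return (-0.5 * n * np.log(np.linalg.det(cov))
            - 0.5 * n * d * (1 + np.log(2 * np.pi))
            - 0.5 * lam * n_params * np.log(n))

X = np.random.default_rng(0).normal(size=(500, 4))
print(bic_from_loglik(X), bic_from_stats(X))        # the two values coincide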
Appendix B
This appendix lists the data used for the development and test sets in this thesis. This data
forms the datasets used by NIST in the RT evaluations, in the conference room recordings
subdomain.
Table B.1 shows the complete meeting names and some relevant information about each
meeting. The total time column indicates the length, in seconds, of the excerpt extracted from
each meeting for use in the evaluation. The column titled effective duration indicates the
length of the speech regions in each meeting, as indicated by the forced-alignment reference
segmentation files.
207
208
Adami, A., Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain, P., Kajarekar,
S., Morgan, N. and Sivadas, S.: 2002, Qualcomm-icsi-ogi features for asr, Proc. International
Conference on Speech and Language Processing.
Adami, A. G., Kajarekar, S. S. and Hermansky, H.: 2002, A new speaker change detection
method for two-speaker segmentation, Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, Orlando, Florida.
Aguilo, M.: 2005, Deteccion de actividad oral en un sistema de diarizacion, Master’s thesis,
UPC.
Ajmera, J.: 2004, Robust Audio Segmentation, PhD thesis, Ecole Polytechnique Federale de
Lausanne.
Ajmera, J., Bourlard, H. and Lapidot, I.: 2002, Improved unknown-multiple speaker clustering
using HMM, Technical report, IDIAP.
Ajmera, J., Lathoud, G. and McCowan, I.: 2004, Clustering and segmenting speakers and their
locations in meetings, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Vol. 1, pp. 605–608.
Ajmera, J., McCowan, I. and Bourlard, H.: 2003, Robust speaker change detection, Technical
report, IDIAP.
Ajmera, J., McCowan, I. and Bourlard, H.: 2004, Robust speaker change detection, IEEE Signal
Processing Letters 11(8), 649–651.
Ajmera, J. and Wooters, C.: 2003, A robust speaker clustering algorithm, IEEE Automatic
Speech Recognition and Understanding Workshop, US Virgin Islands, USA.
Anguera, X.: 2005, Xbic: Real-time cross probabilities measure for speaker segmentation, Tech-
nical Report TR-99-2004, ICSI.
209
210 Bibliography
Anguera, X., Aguilo, M., Wooters, C., Nadeu, C. and Hernando, J.: 2006, Hybrid speech/non-
speech detector applied to speaker diarization of meetings, Speaker Odyssey 06, Puerto
Rico, USA.
Anguera, X. and Hernando, J.: 2004a, Evolutive speaker segmentation using a repository system,
Proc. International Conference on Speech and Language Processing, Jeju Island, Korea.
Anguera, X. and Hernando, J.: 2004b, XBIC: nueva medida para segmentacion de locutor hacia
el indexado automatico de la senal de voz, III Jornadas en Tecnologia del Habla, Valencia,
Spain.
Anguera, X., Wooters, C. and Hernando, J.: 2005, Speaker diarization for multi-party meetings
using acoustic fusion, IEEE Automatic Speech Recognition and Understanding Workshop,
Puerto Rico, USA.
Anguera, X., Wooters, C. and Hernando, J.: 2006a, Automatic cluster complexity and quantity
selection: Towards robust speaker diarization, MLMI’06, Washington DC, USA.
Anguera, X., Wooters, C. and Hernando, J.: 2006b, Frame purification for cluster comparison
in speaker diarization, MMUA’06, Toulouse, France.
Anguera, X., Wooters, C. and Hernando, J.: 2006c, Friends and enemies: A novel initialization
for speaker diarization, Proc. International Conference on Speech and Language Processing,
Pittsburgh, USA.
Anguera, X., Wooters, C. and Hernando, J.: 2006d, Purity algorithms for speaker diarization
of meetings data, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Toulouse, France.
Anguera, X., Wooters, C. and Pardo, J. M.: 2006a, Robust speaker diarization for meetings:
ICSI RT06s evaluation system, Proc. International Conference on Speech and Language
Processing, Pittsburgh, USA.
Anguera, X., Wooters, C. and Pardo, J. M.: 2006b, Robust speaker diarization for meetings: ICSI
RT06s meetings evaluation system, RT06s Meetings Recognition Evaluation, Washington
DC, USA.
Anguera, X., Wooters, C., Peskin, B. and Aguilo, M.: 2005, Robust speaker segmentation for
meetings: The ICSI-SRI spring 2005 diarization system, RT05s Meetings Recognition Eval-
uation, Edinburgh, Great Brittain.
Appel, U. and Brandt, A.: 1982, Adaptive sequential segmentation of piecewise stationary time
series, Inf. Sci. 29(1), 27–56.
Bibliography 211
Attias, H.: 2000, A variational bayesian framework for graphical models, Advances in Neural
information processing systems . MIT Press, Cambridge.
Bakis, R., Chen, S., Gopalakrishnan, P. and Gopinath, R.: 1997, Transcription of broadcast
news shows with the IBM large vocabulary speech recognition system, Speech Recognition
Workshop, pp. 67–72.
Barras, C., Zhu, X., Meignier, S. and Gauvain, J.-L.: 2004, Improving speaker diarization, Fall
2004 Rich Transcription Workshop (RT04), Palisades, NY.
Basseville, M. and Nikiforov, I.: 1993, Detection of abrupt changes-theory abd application,
Prentice-Hall.
Beigi, H. S. and Maes, S. H.: 1998, Speaker, channel and environment change detection, World
Congress on Automation.
Beigi, H. S., Maes, S. H. and Sorensen, J. S.: 1998, A distance measure between collections of dis-
tributions and its application to speaker recognition, Proc. IEEE International Conference
on Acoustics, Speech and Signal Processing, Detroit, USA.
Ben, M., Betser, M., Bimbot, F. and Gravier, G.: 2004, Speaker diarization using bottom-up
clustering based on a parameter-derived distance between adapted GMMs, Proc. Interna-
tional Conference on Speech and Language Processing, Jeju Island, Korea.
Bilmes, J. and Zweig, G.: 2002, The graphical models toolkit: an open source software system
for speech and time-series processing, Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, Orlando, Fl, USA.
Bimbot, F. and Mathan, L.: 1993, Text-free speaker recognition using an arithmetic-harmonic
sphericity measure, Eurospeech’93, Berlin, Germany, pp. 169–172.
Bonastre, J.-F., Delacourt, P., Fredouille, C., Merlin, T. and Wellekens, C.: 2000, A speaker
tracking system based on speaker turn detection for NIST evaluation, Proc. IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, pp. 1177–
1180.
Brandstein, M., Adcock, J. and Silverman, H.: 1995, A practical time-delay estimator for local-
izing speech sources with a microphone array, Comput. Speech Lang. 9, 153–159.
Brandstein, M. and Griebel, S.: 2001, Explicit Speech Modeling for Microphone Array Applica-
tions, Springer, chapter 7.
212 Bibliography
Brandstein, M. S. and Silverman, H. F.: 1997, A robust method for speech signal time-delay
estimation in reverberant rooms, Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing, Munich, Germany.
Burger, S., Maclaren, V. and Yu, H.: 2002, The ISL meeting corpus: The impact of meeting
type on speech style, Proc. International Conference on Speech and Language Processing,
Denver, USA.
Campbell, J. P.: 1997, Speaker recognition: a tutorial, Proceedings of the IEEE 1.85(9), 1437–
1462.
Canseco, L., Lamel, L. and Gauvain, J.-L.: 2005, A comparative study using manual and auto-
matic transcriptions for diarization, IEEE Automatic Speech Recognition and Understand-
ing Workshop, San Juan, Puerto Rico.
Canseco-Rodriguez, L., Lamel, L. and Gauvain, J.-L.: 2004a, Speaker Diarization from Speech
Transcripts, Proc. International Conference on Speech and Language Processing, Jeju Is-
land, S. Korea, pp. 1272–1275.
Canseco-Rodriguez, L., Lamel, L. and Gauvain, J.-L.: 2004b, Towards using STT for Broadcast
News Speaker Diarization, Proc. DARPA RT04, Palisades NY.
Carter, G., Nuttall, A. H. and Cable, P. G.: 1973, The smoothed coherence transform, Proc.
IEEE (Lett.) 61, 1497–1498.
Cassidy, S.: 2004, The macquarie speaker diarization system for rt04s, NIST 2004 Spring Rich
Transcrition Evaluation Workshop, Montreal, Canada.
Cettolo, M. and Vescovi, M.: 2003, Efficient audio segmentation algorithms based on the BIC,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing.
Champagne, B., Bedard, S. and Stephenne, A.: 1996, Performance of time-delay estimation in
the presence of room reverberation, IEEE Transactions on Speech and Audio Processing .
Chan, W., Lee, T., Zheng, N. and hua Ouyang: 2006, Use of vocal source features in speaker
segmentation, Proc. IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing, Toulouse, France.
Chen, L., Rose, R. T., Parrill, F., Han, X., Tu, J., Huang, Z., Harper, M., Quek, F., McNeill,
D., Tuttle, R. and Huang, T.: 2005, Vace multimodal meeting corpus, MLMI, Edimburgh,
UK.
Bibliography 213
Chen, S. S., Gales, M. J. F., Gopinath, R. A., Kanvesky, D. and Olsen, P.: 2002, Automatic
transcription of broadcast news, Speech Communication 37, 69–87.
Chen, S. S. and Gopalakrishnan, P.: 1998, Clustering via the bayesian information criterion
with applications in speech recognition, Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, Vol. 2, Seattle, USA, pp. 645–648.
Chickering, D. M. and Heckerman, D.: 1997, Efficient approximations for the marginal likelihood
of bayesian networks with hidden variables, Machine Learning 29, 181–212.
Cohen, I. and Berdugo, B.: 2002, Speech enhancement based on a microphone array and log-
spectral amplitude estimation, 22nd Convention of Electrical and Electronics Engineers in
Israel.
Collet, M., Charlet, D. and Bimbot, F.: 2005, A correlation metric for speaker tracking us-
ing anchor models, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Philadelphia, USA.
Cox, H., Zeskind, R. and Kooij, I.: 1986, Practical supergain, IEEE Transactions on Acoustics,
Speech and Signal Processing 34(3), 393–397.
Cox, H., Zeskind, R. and Owen, M.: 1987, Robust adaptive beamforming, IEEE Transactions
on Acoustics, Speech and Signal Processing 35(10), 1365–1376.
Delacourt, P., Kryze, D. and Wellekens, C. J.: 1999a, Detection of speaker changes in an audio
document, Eurospeech-1999, Budapest, Hungary.
Delacourt, P., Kryze, D. and Wellekens, C. J.: 1999b, Speaker-based segmentation for audio data
indexing, ESCA Workshop on accessing Information in Audio Data.
Delacourt, P. and Wellekens, C. J.: 1999, Audio data indexing: Use of second-order statistics
for speaker-based segmentation, IEEE International Conference on Multimedia, Computing
and Systems, Florence, Italy.
214 Bibliography
Delacourt, P. and Wellekens, C. J.: 2000, DISTBIC: A speaker-based segmentation for audio
data indexing, Speech Communication: Special Issue in Accessing Information in Spoken
Audio 32, 111–126.
Deshayes, J. and Picard, D.: 1986, Off-line statistical analysis of change-point models using
non-parametric and likelihood methods, Springer-Verlag.
Digalakis, V., Monaco, P. and Murveit, H.: 1996, Genones: generalized mixture tying in con-
tinuous hidden markov model-based speech recognizers, IEEE transactions on speech and
audio processing 4(4), 281–289.
Doclo, S. and Moonen, M.: 2002, Gsvd-based optimal filtering for single and multimicrophone
speech enhancement, IEEE Trans. Signal Processing 50, 2230–2244.
Duda, R. and Hart, P.: 1973, Pattern classification and Scene analysis, John Wiley & Sons.
Dunn, R. B., Reynolds, D. and Quatieri, T. F.: 2000, Approaches to speaker detection and
tracking in conversational speech, Digital signal processing 10, 93–112.
Eckart, C.: 1952, Optimal rectifier systems for the detection of steady signals, Technical Report
Rep SI0 12692, SI0 Ref 52-11,1952, Univ. California, Scripps Inst. Oceanography, Marine
Physical Lab.
Ellis, D. and Liu, J. C.: 2004, Speaker turn detection based on between-channels differences,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing.
F. Reed, P. F. and Bershad, N.: 1981, Time delay estimation using the lms adaptive filter - static
behavior, IEEE Transactions on Acoustics, Speech and Signal Processing .
Fischer, S. and Kammeyer, K.-D.: 1997, Broadband beamforming with adaptive postfiltering for
speech acquisition in noisy environments, Proc. IEEE International Conference on Acous-
tics, Speech and Signal Processing.
Fiscus, J. G., Ajot, J., Michet, M. and Garofolo, J. S.: 2006, The rich transcription 2006 spring
meeting recognition evaluation, NIST 2006 Spring Rich Transcrition Evaluation Workshop,
Washington DC, USA.
Fiscus, J. G., Garofolo, J., Ajot, J. and Michet, M.: 2006, Rt-06s speaker diarization results and
speech activity detection results, NIST 2006 Spring Rich Transcrition Evaluation Work-
shop, Washington DC, USA.
Bibliography 215
Fiscus, J. G., Radde, N., Garofolo, J. S., Le, A., Ajot, J. and Laprun, C. D.: 2005, The rich tran-
scription 2005 spring meeting recognition evaluation, NIST 2005 Spring Rich Transcrition
Evaluation Workshop, Edimburgh, UK.
Flanagan, J., Johnson, J., Kahn, R. and Elko, G.: 1994, Computer-steered microphone arrays for
sound transduction in large rooms, Journal of the Acoustic Society of America 78, 1508–
1518.
Fredouille, C., Moraru, D., Meignier, S., Besacier, L. and Bonastre, J.-F.: 2004, The NIST 2004
spring rich transcription evaluation: Two-axis merging strategy in the context of multiple
distant microphone based meeting speaker segmentation, NIST 2004 Spring Rich Trans-
crition Evaluation Workshop, Montreal, Canada.
Gallardo-Antolin, A., Anguera, X. and Wooters, C.: 2006, Multi-stream speaker diarization
systems for the meetings domain, Proc. International Conference on Speech and Language
Processing, Pittsburgh, USA.
Gangadharaiah, R., Narayanaswamy, B. and Balakrishnan, N.: 2004, A novel method for two-
speaker segmentation, Proc. International Conference on Speech and Language Processing,
Jeju, S. Korea.
Garofolo, J. S., Laprun, C. D. and Fiscus, J. G.: 2004, The rich transcription 2004 spring
meeting recognition evaluation, NIST 2004 Spring Rich Transcrition Evaluation Workshop,
Montreal, Canada.
Gauvain, J.-L., Lamel, L. and Adda, G.: 1998, Partitioning and transcription of broadcast news
data, Proc. International Conference on Speech and Language Processing, Vol. 4, Sidney,
Australia, pp. 1335–1338.
Gish, H. and Schmidt, M.: 1994, Text-independent speaker identification, Signal Processing
Magazine, IEEE pp. 18–32.
Gish, H., Siu, M.-H. and Rohlicek, R.: 1991, Segregation of speakers for speech recognition
and speaker identification, Proc. IEEE International Conference on Acoustics, Speech and
Signal Processing, Vol. 2, Toronto, Canada, pp. 873–876.
Griffiths, L. and Jim, C.: 1982, An alternative approach to linearly constrained adaptive beam-
forming, IEEE Trans. on Antenas and Propagation .
Hain, T., Johnson, S., Turek, A., Woodland, P. and Young, S. J.: 1998, Segment generation
and clustering in the HTK broadcast news transcription system, DARPA Broadcast News
Transcription and Understanding Workshop, pp. 133–137.
216 Bibliography
Heck, L. and Sankar, A.: 1997, Acoustic clustering and adaptation for robust speech recognition,
Eurospeech-97, Rhodes, Greece.
Hoshuyama, O., Sugiyama, A. and Hirano, A.: 1999, A robust adaptive beamformer for micro-
phone arrays with a blocking matrix using coefficient-constrained adaptive filters, IEEE
Trans. on Signal Processing .
Hung, J., Wang, H. and Lee, L.: 2000, Automatic metric based speech segmentation for broad-
cast news via principal component analysis, Proc. International Conference on Speech and
Language Processing, Beijing, China.
Ifeachor, E. and Jervis, B.: 1996, Digital signal processing: a practical approach, Addison-Wesley.
Ikbal, S., Misra, H., Sivadas, S., Hermansky, H., and Bourlard, H.: 2004, Entropy based com-
bination of tandem representations for noise robust asr, Proc. International Conference on
Speech and Language Processing, South Korea.
improvements of the E-HMM based speaker diarization system for meetings records, T.: 2006,
The rich transcription 2006 spring meeting recognition evaluation, NIST 2006 Spring Rich
Transcrition Evaluation Workshop, Washington DC, USA.
Istrate, D., Fredouille, C., Meignier, S., Besacier, L. and Bonastre, J.-F.: 2005, NIST RT05S eval-
uation: Pre-processing techniques and speaker diarization on multiple microphone meetings,
NIST 2005 Spring Rich Transcrition Evaluation Workshop, Edinburgh, UK.
Janin, A., Ang, J., Bhagat, S., Dhillon, R., Edwards, J., Macias-Guarasa, J., Morgan, N., Peskin,
B., Shriberg, E., Stolcke, A., Wooters, C. and Wrede, B.: 2004, The icsi meeting project:
Resources and research, ICCASP, Montreal.
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T.,
Shriberg, E., Stolcke, A. and Wooters, C.: 2003, The ICSI meeting corpus, ICCASP, Hong
Kong.
Bibliography 217
Janin, A., Stolcke, A., Anguera, X., Boakye, K., Cetin, O., Frankel, J. and Zheng, J.: 2006, The
ICSI-SRI spring 2006 meeting recognition system, Proceedings of the Rich Transcription
2006 Spring Meeting Recognition Evaluation, Washington, USA.
Jin, H., Kubala, F. and Schwartz, R.: 1997, Automatic speaker clustering, DARPA Speech Recog-
nition workshop, Chantilly, USA.
Jin, Q., Laskowski, K., Schultz, T. and Waibel, A.: 2004, Speaker segmentation and clustering in
meetings, NIST 2004 Spring Rich Transcrition Evaluation Workshop, Montreal, Canada.
Johnson, D. and Dudgeon, D.: 1993, Array signal processing, Prentice Hall.
Johnson, S.: 1999, Who spoke when? - automatic segmentation and clustering for determining
speaker turns, Eurospeech-99, Budapest, Hungary.
Johnson, S. and Woodland, P.: 1998, Speaker clustering using direct maximization of the MLLR-
adapted likelihood, Proc. International Conference on Speech and Language Processing,
Vol. 5, pp. 1775–1779.
Juang, B. and Rabiner, L.: 1985, A probabilistic distance measure for hidden markov models,
AT&T Technical Journal 64, AT&T.
Kaneda, Y.: 1991, Directivity characteristics of adaptive microphone-array for noise reduction
(amnor), Journal of the Acoustical Society of Japan 12(4), 179–187.
Kaneda, Y. and Ohga, J.: 1986, Adaptive microphone-array system for noise reduction, IEEE
Trans. on Acoustics, Speech, and Signal Processing .
Kass, R. E. and Raftery, A. E.: 1995, Bayes factors, Journal of the American Statistics association
90, 773–795.
Kataoka, A. and Ichirose, Y.: 1990, A microphone array configuration for anmor (adaptive
microphone-array system for noise reduction), Journal of the Acoustical Society of Japan
11(6), 317–325.
Kemp, T., Schmidt, M., Westphal, M. and Waibel, A.: 2000, Strategies for automatic segmenta-
tion of audio data, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Istanbul, Turkey, pp. 1423–1426.
Kim, H.-G., Ertelt, D. and Sikora, T.: 2005, Hybrid speaker-based segmentation system us-
ing model-level clustering, Proc. IEEE International Conference on Acoustics, Speech and
Signal Processing, Philadelphia, USA.
Knapp, C. H. and Carter, G. C.: 1976, The generalized correlation method for estimation of time
delay, IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-24(4), 320–
327.
218 Bibliography
Kohonen, T.: 1990, The self-organizing map, Proceedings of the IEEE 78(9), 1464–1480.
Krim, H. and Viberg, M.: 1996, Two decades of array signal processing research, IEEE Signal
Processing Magazine pp. 67–94.
Kristjansson, T., Deligne, S. and Olsen, P.: 2005, Voicing features for robust speech detection,
Proc. International Conference on Speech and Language Processing, Lisbon, Portugal.
Kubala, F., Jin, H., Matsoukas, S., Gnuyen, L., Schwartz, R. and Machoul, J.: 1997, The 1996
BBN byblos HUB-4 transcription system, Speech Recognition Workshop, pp. 90–93.
Lapidot, I.: 2003, SOM as likelihood estimator for speaker clustering, Eurospeech, Geneva,
Switzerland.
Lapidot, I., Gunterman, H. and Cohen, A.: 2002, Unsupervised speaker recognition based
on competition between self-organizing-maps, IEEE Transactions on Neural Networks
13(4), 877–887.
Lathoud, G. and McCowan, I. A.: 2003, Location based speaker segmentation, Proc. IEEE
International Conference on Acoustics, Speech and Signal Processing.
Lathoud, G., McCowan, I. and Odobez, J.: 2004, Unsupervised location-based segmentation of
multi-party speech, ICASSP-NIST Meeting Recognition Workshop.
Lathoud, G., Odobez, J.-M. and McCowan, I.: 2004, Short-term spatio-temporal clustering of
sporadic and concurrent events, Technical Report IDIAP-RR 04-14, IDIAP.
Lee, K.-F.: 1998, Large vocabulary speaker-independent continuous speech recognition: the
SPHINX system, PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA.
Leeuwen, D. A. V. and Huijbregts, M.: 2006, The AMI speaker diarization system for NIST
RT06s meeting data, NIST 2006 Spring Rich Transcrition Evaluation Workshop, Washing-
ton DC, USA.
Li, Q., Zheng, J., Tsai, A., and Zhou, Q.: 2002, Robust endpoint detection and energy nor-
malization for real-time speech and speaker recognition, IEEE Transactions on Speech and
Audio Processing 10(3).
Li, X.: 2005, Combination and Generation of Parallel Feature Streams for Improved Speech
Recognition, PhD thesis, ECE Department, CMU.
Liu, D. and Kubala, F.: 1999, Fast speaker change detection for broadcast news transcription
and indexing, Eurospeech-99, Vol. 3, Budapest, Hungary, pp. 1031–1034.
Bibliography 219
Lopez, J. F. and Ellis, D. P. W.: 2000a, Using acoustic condition clustering to improve acoustic
change detection on broadcast news, Proc. International Conference on Speech and Lan-
guage Processing, Beijing, China.
Lopez, J. F. and Ellis, D. P. W.: 2000b, Using acoustic condition clustering to improve acoustic
change detection on broadcast news, Proc. International Conference on Speech and Lan-
guage Processing, Beijing, China.
Lu, L., Li, S. Z. and Zhang, H.-J.: 2001, Content-based audio segmentation using support vector
machines, ACM Multimedia Conference, pp. 203–211.
Lu, L. and Zhang, H.-J.: 2002a, Real-time unsupervised speaker change detection, ICPR’02,
Vol. 2, Quebec City, Canada.
Lu, L. and Zhang, H.-J.: 2002b, Speaker change detection and tracking in real-time news broad-
casting analysis, ACM International Conference on Multimedia, pp. 602–610.
Lu, L., Zhang, H.-J. and Jiang, H.: 2002, Content analysis for audio classification and segmen-
tation, IEEE Transactions on Speech and Audio Processing 10(7), 504–516.
Malegaonkar, A., Ariyaeeinia, A., Sivakumaran, P. and Fortuna, J.: 2006, Unsupervised speaker
change detection using probabilistic pattern matching, IEEE Signal Processing Letters
13(8), 509–512.
Marro, C., Mahieux, Y. and Simmer, K.: 1998, Analysis of noise reduction and dereverberation
techniques based on microphone arrays with postfiltering, IEEE Trans. on Speech and Audio
Processing .
McCowan, I.: 2001, Robust Speech Recognition using microphone arrays, PhD thesis, Queensland
University of Technology, Australia.
McCowan, I. A., Pelecanos, J. and Sridharan, S.: 2001, Robust speaker recognition using micro-
phone arrays, IEEE Speaker Odyssey recognition workshop.
McCowan, I., Gatica-Perez, D., Bengio, S., Lathoud, G., Barnard, M. and Zhang, D.: 2005,
Automatic analysis of multimodal group actions in meetings, IEEE Trans. on Pattern
Analysis and Machine Intelligence 27, 305–317.
McCowan, I., Marro, C. and Mauuary, L.: 2000, Robust speech recognition using near-field
superdirective beamforming with post-filtering, Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing, Vol. 3, pp. 1723–1726.
220 Bibliography
McCowan, I., Moore, D. and Sridharan, S.: 2000, Speech enhancement using near-field superdi-
rectivity with an adaptive sidelobe canceler and post-filter, Australian International Con-
ference on Speech Science and Technology, pp. 268–273.
Meignier, S., Bonastre, J.-F. and Igournet, S.: 2001, E-HMM approach for learning and adapting
sound models for speaker indexing, A speaker Oddissey, Chania, Crete, pp. 175–180.
Meignier, S., Moraru, D., Fredouille, C., Besacier, L. and Bonastre, J.-F.: 2004, Benefits of
prior acoustic segmentation for automatic speaker segmentation, Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, Montreal, Canada.
Meinedo, H. and Neto, J.: 2003, Audio segmentation, classification and clustering in a broadcast
news task, Proc. IEEE International Conference on Acoustics, Speech and Signal Process-
ing, Hong-Kong, China.
Metze, F., Fugen, C., Pan, Y., Schultz, T. and Yu, H.: 2004, The ISL RT-04S meetings tran-
scription system, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Montreal, Canada.
Mirghafori, N., Stolcke, A., Wooters, C., Pirinen, T., Bulyko, I., Gelbart, D., Graciarena, M.,
Otterson, S., Peskin, B. and Ostendorf, M.: 2004, From switchboard to meetings: Develop-
ment of the 2004 ICSI-SRI-UW meeting recognition system, Proc. International Conference
on Speech and Language Processing, Jeju Island, Korea.
Mirghafori, N. and Wooters, C.: 2006, Nuts and flakes: A study of data characteristics in speaker
diarization, Proc. IEEE International Conference on Acoustics, Speech and Signal Process-
ing, Toulouse, France.
Misra, H., Bourlard, H., and Tyagi, V.: 2003, New entropy based combination rules in hmm/ann
multi-stream asr, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Hong Kong.
Moh, Y., Nguyen, P. and Junqua, J.-C.: 2003, Towards domain independent speaker clustering,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Hong
Kong.
Moraru, D., Ben, M. and Gravier, G.: 2005, Experiments on speaker tracking and segmentation in
radio broadcast news, Proc. International Conference on Speech and Language Processing,
Lisbon, Portugal.
Moraru, D., Besacier, L., Meignier, S., Fredouille, C. and francois Bonastre, J.: 2004, Speaker di-
arization in the elisa consodrium over the last 4 years, NIST 2004 Spring Rich Transcrition
Evaluation Workshop, Montreal, Canada.
Bibliography 221
Moraru, D., Meignier, S., Besacier, L., Bonastre, J.-F. and Magrin-Chagnolleau, I.: 2002, The
ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recog-
nition evaluation, NIST 2002 Spring Rich Transcrition Evaluation Workshop.
Moraru, D., Meignier, S., Besacier, L., Bonastre, J.-F. and Magrin-Chagnolleau, I.: 2004, The
ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recog-
nition evaluation, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Montreal, Canada.
Moraru, D., Meignier, S., Fredouille, C., Besacier, L. and Bonastre, J.-F.: 2004, The ELISA
consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich
transcription evaluation, Proc. IEEE International Conference on Acoustics, Speech and
Signal Processing, Montreal, Canada.
Mori, K. and Nakagawa, S.: 2001, Speaker change detection and speaker clustering using VQ
distortion for broadcast news speech recognition, Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing, Vol. 1, Salt Lake City, USA, pp. 413–416.
Nakagawa, S. and Suzuki, H.: 1993, A new speech recognition method based on VQ-distortion
and hmm, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing,
Vol. 2, Minneapolis, USA, pp. 676–679.
Nguyen, P.: 2003, SWAMP: An isometric frontend for speaker clustering, NIST 2003 Rich Tran-
scription Workshop, Boston, USA.
Nishida, M. and Kawahara, T.: 2003, Unsupervised speaker indexing using speaker model se-
lection based on bayesian information criterion, Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing, Hong Kong.
Omar, M. K., Chaudhari, U. and Ramaswamy, G.: 2005, Blind change detection for audio
segmentation, Proc. IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing, Philadelphia, USA.
Ouellet, P., Boulianne, G. and Kenny, P.: 2005, Fravors of gaussian warping, Proc. International
Conference on Speech and Language Processing, Lisbon, Portugal.
Pardo, J. M., Anguera, X. and Wooters, C.: 2006a, Speaker diarization for multi-microphone
meetings using only between-channel differences, MLMI 2006.
Pardo, J. M., Anguera, X. and Wooters, C.: 2006b, Speaker diarization for multiple distant
microphone meetings: Mixing acoustic features and inter-channel time differences, Proc.
International Conference on Speech and Language Processing.
Pattern analysis, Statistical modeling and Computational learning (Pascal) website: 2006.
*http://www.pascal-network.org/
Pelecanos, J. and Sridharan, S.: 2001, Feature warping for robust speaker verification, ISCA
Speaker Recognition Workshop Odyssey, Crete, Greece.
Perez-Freire, L. and Garcia-Mateo, C.: 2004, A multimedia approach for audio segmentation in
TV broadcast news, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Montreal, Canada, pp. 369–372.
Pwint, M. and Sattar, F.: 2005, A segmentation method for noisy speech using genetic algo-
rithm, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing,
Philadelphia, USA.
Rentzeperis, E., Stergiou, A., Boukis, C., Pnevmatikakis, A. and Polymenakos, L. C.: 2006,
The 2006 Athens Information Technology speech activity detection and speaker diarization
systems, NIST 2006 Spring Rich Transcription Evaluation Workshop, Washington DC, USA.
Reynolds, D. A., Singer, E., Carlson, B. A., O'Leary, G. C., McLaughlin, J. J. and Zissman, M. A.: 1998, Blind clustering of speech utterances based on speaker and language characteristics, Proc. International Conference on Speech and Language Processing, Sydney, Australia.
Reynolds, D. and Torres-Carrasquillo, P.: 2004, The MIT Lincoln Laboratories RT-04F diariza-
tion systems: Applications to broadcast audio and telephone conversations, Fall 2004 Rich
Transcription Workshop (RT04), Palisades, NY.
Roch, M. and Cheng, Y.: 2004, Speaker segmentation using the MAP-adapted bayesian infor-
mation criterion, Odyssey-04, Toledo, Spain, pp. 349–354.
Rombouts, G. and Moonen, M.: 2003, QRD-based unconstrained optimal filtering for acoustic noise reduction, Signal Processing 83(9), 1889–1904.
Rosca, J., Balan, R. and Beaugeant, C.: 2003, Multi-channel psychoacoustically motivated speech
enhancement, Proc. IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing.
Ross, A., Jain, A. K. and Qian, J. Z.: 2001, Information fusion in biometrics, 3rd International
Conference on Audio and Video-Based Person Authentication.
Roth, P.: 1971, Effective measurements using digital signal analysis, IEEE Spectrum 8, 62–70.
Rougui, J., Rziza, M., Aboutajdine, D., Gelgon, M. and Martinez, J.: 2006, Fast incremental
clustering of gaussian mixture speaker models for scaling up retrieval in on-line broadcast,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse,
France.
Sankar, A., Beaufays, F. and Digalakis, V.: 1995, Training data clustering for improved speech
recognition, Eurospeech-95, Madrid, Spain.
Sankar, A., Weng, F., Rivlin, Z., Stolcke, A. and Gadde, R. R.: 1998, Development of SRI's 1997 broadcast news transcription system, DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, USA.
Schmidt, R.: 1986, Multiple emitter location and signal parameter estimation, IEEE Transac-
tions on Antennas and Propagation.
Schwarz, G.: 1971, A sequential student test, The Annals of Mathematical Statistics 42(3), 1003–1009.
Schwarz, G.: 1978, Estimating the dimension of a model, The Annals of Statistics 6, 461–464.
Chen, S. S. and Gopalakrishnan, P. S.: 1998, Speaker, environment and channel change detection and clustering via the bayesian information criterion, Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Virginia, USA.
Shinozaki, T. and Ostendorf, M.: 2007, Cross-validation EM training for robust parameter estimation, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (submitted).
Cheng, S.-S. and Wang, H.-M.: 2003, A sequential metric-based audio segmentation method via the bayesian information criterion, Eurospeech'03, Geneva, Switzerland.
Cheng, S.-S. and Wang, H.-M.: 2004, METRIC-SEQDAC: A hybrid approach for audio segmentation, Proc. International Conference on Speech and Language Processing, Jeju, South Korea.
Siegler, M. A., Jain, U., Raj, B. and Stern, R. M.: 1997, Automatic segmentation, classification
and clustering of broadcast news audio, DARPA Speech Recognition Workshop, Chantilly,
pp. 97–99.
Sinha, R., Tranter, S. E., Gales, M. J. F. and Woodland, P. C.: 2005, The Cambridge University
March 2005 speaker diarisation system, European Conference on Speech Communication
and Technology (Interspeech), Lisbon, Portugal, pp. 2437–2440.
Siu, M.-H., Yu, G. and Gish, H.: 1992, An unsupervised, sequential learning algorithm for
the segmentation of speech waveforms with multiple speakers, Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, Vol. 2, San Francisco, USA, pp. 189–
192.
Sivakumaran, P., Fortuna, J. and Ariyaeeinia, A.: 2001, On the use of the bayesian information
criterion in multiple speaker detection, Eurospeech’01, Scandinavia.
Solomonoff, A., Mielke, A., Schmidt, M. and Gish, H.: 1998, Clustering speakers by their voices,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2,
Seattle, USA, pp. 757–760.
Spring 2005 (RT-05S) Rich Transcription Meeting Recognition Evaluation Plan: n.d.
*http://www.nist.gov/speech/tests/rt/rt2005/spring/rt05s-meeting-eval-plan-V1.pdf
Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation Plan: n.d.
*http://www.nist.gov/speech/tests/rt/rt2006/spring/docs/rt06s-meeting-eval-plan-V2.pdf
Stolcke, A., Anguera, X., Boakye, K., Cetin, O., Grezl, F., Janin, A., Mandal, A., Peskin, B.,
Wooters, C. and Zheng, J.: 2005, Further progress in meeting recognition: The ICSI-SRI spring 2005 speech-to-text evaluation system, RT05s Meetings Recognition Evaluation, Edinburgh, UK.
Strassel, S. and Glenn, M.: 2004, Shared linguistic resources for human language technology in
the meeting domain, ICASSP-DARPA Meetings Diarization Workshop, Montreal, Canada.
Sturim, D., Reynolds, D., Singer, E. and Campbell, J. P.: 2001, Speaker indexing in large audio
databases using anchor models, Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing, Salt Lake City, USA.
Tager, W.: 1998a, Études en traitement d'antenne pour la prise de son, PhD thesis, Université de Rennes.
Tager, W.: 1998b, Near field superdirectivity (NFSD), Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing, pp. 2045–2048.
Tranter, S.: 2005, Two-way cluster voting to improve speaker diarization performance, Proc.
IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, USA.
Tranter, S. and Reynolds, D.: 2004, Speaker diarization for broadcast news, ODYSSEY’04,
Toledo, Spain.
Van Trees, H. L.: 1968, Detection, Estimation, and Modulation Theory, Vol. 1, Wiley.
Tritschler, A. and Gopinath, R.: 1999, Improved speaker segmentation and segments clustering
using the bayesian information criterion, Eurospeech’99, pp. 679–682.
Tsai, W.-H., Cheng, S.-S., Chao, Y.-H. and Wang, H.-M.: 2005, Clustering speech utterances by
speaker using eigenvoice-motivated vector space models, Proc. IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, Philadelphia, USA.
Tsai, W.-H., Cheng, S.-S. and Wang, H.-M.: 2004, Speaker clustering of speech utterances using a
voice characteristic reference space, Proc. International Conference on Speech and Language
Processing, Jeju Island, Korea.
Tsai, W.-H. and Wang, H.-M.: 2006, On maximizing the within-cluster homogeneity of speaker
voice characteristics for speech utterance clustering, Proc. IEEE International Conference
on Acoustics, Speech and Signal Processing, Toulouse, France.
Valente, F.: 2006, Infinite models for speaker clustering, Proc. International Conference on
Speech and Language Processing, Pittsburgh, USA.
Valente, F. and Wellekens, C.: 2004, Variational bayesian speaker clustering, Speaker Odyssey,
Toledo, Spain.
Valente, F. and Wellekens, C.: 2005, Variational bayesian adaptation for speaker clustering,
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, USA.
Valin, J., Rouat, J. and Michaud, F.: 2004, Microphone array post-filter for separation of simul-
taneous non-stationary sources, Proc. IEEE International Conference on Acoustics, Speech
and Signal Processing.
van Leeuwen, D.: 2005, The TNO speaker diarization system for NIST RT05s for meeting data, NIST 2005 Spring Rich Transcription Evaluation Workshop, Edinburgh, UK.
Vandecatseye, A. and Martens, J.-P.: 2003, A fast, accurate and stream-based speaker segmen-
tation and clustering algorithm, Eurospeech’03, Geneva, Switzerland, pp. 941–944.
Vandecatseye, A., Martens, J.-P. et al.: 2004, The COST278 pan-European broadcast news database, LREC'04, Lisbon, Portugal.
Van Veen, B. D. and Buckley, K. M.: 1988, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine.
Verlinde, P., Chollet, G. and Acheroy, M.: 2000, Multi-modal identity verification using expert
fusion, Information Fusion 1(1), 17–33.
Vescovi, M., Cettolo, M. and Rizzi, R.: 2003, A DP algorithm for speaker change detection,
Eurospeech’03.
Video analysis and content extraction for defense intelligence (ARDA-VACE II): 2006.
*http://www.informedia.cs.cmu.edu/arda/vaceII.html
Wactlar, H., Hauptmann, A. and Witbrock, M.: 1996, News on-demand experiments in speech
recognition, ARPA STL Workshop.
Wegmann, S., Scattone, F., Carp, I., Gillick, L., Roth, R. and Yamron, J.: 1998, Dragon Systems' 1997 broadcast news transcription system, DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, USA.
Wiener, N.: 1949, Extrapolation, Interpolation, and Smoothing of Stationary Time
Series, Wiley.
Wilcox, L., Chen, F., Kimber, D. and Balasubramanian, V.: 1994, Segmentation of speech using
speaker identification, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Vol. 1, Adelaide, Australia, pp. 161–164.
Willsky, A. S. and Jones, H. L.: 1976, A generalized likelihood ratio approach to the detection
and estimation of jumps in linear systems, IEEE Transactions on Automatic Control AC-
21(1), 108–112.
Woodland, P., Gales, M., Pye, D. and Young, S.: 1997, The development of the 1996 HTK
broadcast news transcription system, Speech Recognition Workshop, pp. 73–78.
Wooters, C., Fung, J., Peskin, B. and Anguera, X.: 2004, Towards robust speaker segmentation:
The ICSI-SRI fall 2004 diarization system, Fall 2004 Rich Transcription Workshop (RT04),
Palisades, NY.
Wu, T., Lu, L., Chen, K. and Zhang, H.-J.: 2003a, UBM-based incremental speaker adaptation,
ICME’03, Vol. 2, pp. 721–724.
Wu, T., Lu, L., Chen, K. and Zhang, H.-J.: 2003b, UBM-based real-time speaker segmentation for
broadcasting news, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing.
Wu, T., Lu, L., Chen, K. and Zhang, H.-J.: 2003c, Universal background models for real-time
speaker change detection, International Conference on Multimedia Modeling.
Yamaguchi, M., Yamashita, M. and Matsunaga, S.: 2005, Spectral cross-correlation features for
audio indexing of broadcast news and meetings, Proc. International Conference on Speech
and Language Processing.
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V. and Woodland, P.: 2005, The HTK
Book, Cambridge University Engineering Department.
Zdansky, J. and Nouza, J.: 2005, Detection of acoustic change-points in audio records via global
BIC maximization and dynamic programming, Proc. International Conference on Speech
and Language Processing, Lisbon, Portugal.
Zelinski, R.: 1988, A microphone array with adaptive post-filtering for noise reduction in re-
verberant rooms, Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing, Vol. 5, pp. 2578–2581.
Zhang, X., Hansen, J. and Rehar, K.: 2004, Speech enhancement based on a combined multi-
channel array with constrained iterative and auditory masked processing, Proc. IEEE In-
ternational Conference on Acoustics, Speech and Signal Processing.
Zhou, B. and Hansen, J. H.: 2000, Unsupervised audio stream segmentation and clustering via
the bayesian information criterion, Proc. International Conference on Speech and Language
Processing, Vol. 3, Beijing, China, pp. 714–717.
Zhu, X., Barras, C., Lamel, L. and Gauvain, J.-L.: 2006, Speaker diarization: from broadcast
news to lectures, NIST 2006 Spring Rich Transcription Evaluation Workshop, Washington
DC, USA.
Zhu, X., Barras, C., Meignier, S. and Gauvain, J.-L.: 2005, Combining speaker identification
and BIC for speaker diarization, Proc. International Conference on Speech and Language
Processing, Lisbon, Portugal.
Zochova, P. and Radova, V.: 2005, Modified DISTBIC algorithm for speaker change detection,
Proc. International Conference on Speech and Language Processing, Lisbon, Portugal.