
186 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008

Unsupervised Pattern Discovery in Speech


Alex S. Park, Member, IEEE, and James R. Glass, Senior Member, IEEE

Abstract—We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

Index Terms—Speech processing, unsupervised pattern discovery, word acquisition.

Manuscript received January 10, 2007; revised July 26, 2007. This work was supported by the National Science Foundation under Grant #IIS-0415865. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Helen Meng. A. S. Park was with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. He is now with Tower Research Capital, New York, NY 10013 USA (e-mail: [email protected]). J. R. Glass is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2007.909282. 1558-7916/$25.00 © 2007 IEEE

I. INTRODUCTION

OVER the last several decades, significant progress has been made in developing automatic speech recognition (ASR) systems which are now capable of performing large-vocabulary continuous speech recognition [1], [2]. In spite of this progress, the underlying paradigm of most approaches to speech recognition has remained the same. The problem is cast as one of classification, where input data (speech) is segmented and classified into a preexisting set of known categories (words). Discovering where these word entities come from is typically not addressed. This problem is of interest to us because it represents a key difference in the language processing strategies employed by humans and machines. Equally important, it raises the question of how much can be learned from speech data alone, in the absence of supervised input.

In this paper, we propose a computational technique for extracting words and linguistic entities from speech without supervision. The inspiration for our unsupervised approach to speech processing comes from two sources. The first source comes from a set of experiments conducted by developmental psychologists studying infant language learning. Saffran et al. found that 8-month-old infants are able to detect the statistical properties of commonly co-occurring syllable patterns, indicating that the identification of recurring patterns may be important in the word acquisition process [3]. Our second source of inspiration is implementational in nature and relates to current research in comparative genomics [4], [5]. In that area of research, pattern discovery algorithms are needed in order to find genes and structurally important sequences from massive amounts of genomic DNA or protein sequence data. Unlike speech, the lexicon of interesting subsequences is not known ahead of time, so these items must be discovered from the data directly. By aligning sequences to each other and identifying patterns that repeat with high recurrence, these biologically important sequences, which are more likely to be preserved, can be readily discovered. Our hope is to find analogous techniques for speech based on the observation that patterns of speech sounds are more likely to be consistent within word or phrase boundaries than across. By aligning continuous utterances to each other and finding similar sequences, we can potentially discover frequently occurring words with minimal knowledge of the underlying speech signal. The fundamental assumption of this approach is that acoustic speech data displays enough regularity to make finding such matches possible.

This paper primarily focuses on the unsupervised processing of speech data to automatically extract words and linguistic phrases. Our work differs substantially from other approaches to unsupervised word acquisition (see Section II) in that it operates directly on the acoustic signal, using no intermediate recognition stage to transform the audio into a symbolic representation. Although the inspiration for our methods is partially derived from experiments in developmental psychology, we make no claims on the cognitive plausibility of these word acquisition mechanisms in actual human language learning.

The results obtained in this paper are summarized as follows.
1) We demonstrate how to find subsequence alignments between the spectral representations of pairs of continuous utterances. In so doing, we propose a variation of a well-known dynamic programming technique for time series alignment, which we call segmental dynamic time warping (DTW). This task is motivated by the assumption that common words and phrases between utterance pairs are likely to be acoustically similar to each other. This algorithm allows us to find low distortion alignments between different regions of time in a given audio stream, which correspond to similar sounding speech patterns.
2) We show how low distortion alignments generated by the segmental DTW algorithm can be used to find recurring speech patterns in an audio stream. These patterns can be clustered together by representing the audio stream as an
abstract adjacency graph. The speech pattern clusters that are discovered using this methodology are shown to correspond to words and phrases that are relevant to the audio streams from which they are extracted.

The remainder of this paper is organized as follows: We briefly survey related work in the areas of pattern discovery and unsupervised word acquisition in Section II. Section III describes the segmental DTW algorithm, an adaptation of a widely known dynamic programming technique, which is designed to find matching acoustic patterns between spoken utterances. In Section IV, we demonstrate how to induce a graph representation from the audio stream. We also employ clustering techniques to discover patterns that correspond to words and phrases in speech by aggregating the alignment paths that are produced by the segmental DTW algorithm. The background for the experiments conducted in this paper is presented in Section V, including a description of the speech data used and specifics about our choice of signal representation. We give examples of the types of word entities found and analyze the results of our algorithm in Section VI, then conclude and discuss directions for future work in Section VII.

II. RELATED WORK

There have been a variety of research efforts that are related to the work presented in this paper. We can roughly categorize these works into two major groups: applications of pattern discovery principles to domains outside of natural language processing, and unsupervised learning techniques within the field of natural language processing.

A. Pattern Discovery

The works summarized in this section represent a variety of different fields, ranging from computational biology to music analysis to multimedia summarization. There is a common underlying theme in all of this research: the application of pattern discovery principles to sequence data. We briefly describe work in each of these fields below.

In computational biology, research in pattern discovery algorithms is motivated by the problem of finding motifs (biologically significant recurring patterns) in biological sequences. Although the body of proposed approaches is too large to list here, a survey of the more important techniques is described in [6] and [7]. The class of algorithms most relevant to our work are based upon sequence comparison, where multiple sequences are compared to one another to determine which regions of the sequence are recurring. Since biological sequences can be abstractly represented as strings of discrete symbols, many of the comparison techniques have roots in string alignment algorithms. In particular, a popular approach to alignment is the use of dynamic programming to search an edit distance matrix (also known as a distance matrix, position weight matrix, or position-specific scoring matrix) for optimal global alignments [8] or optimal local alignments [9]. The distance matrix is a structure which generates a distance or similarity score for each pair of symbols in the sequences being compared. We make use of distance matrices for alignment in this paper as well, although the sequences we work with are derived from the audio signal, and are therefore composed of real-valued vectors, not discrete symbols.

Distance matrices are also used extensively by researchers in the music analysis community. In this area of research, the music audio is parameterized as a sequence of feature vectors, and the resulting sequence is used to create a self-distance matrix. The structure of the distance matrix can then be processed to induce music structure (i.e., distinguish between chorus and verse), characterize musical themes, summarize music files, and detect duplicate music files [10]–[13]. We carry over the use of distance matrices for pattern discovery in music audio to our own work in speech processing.

B. Unsupervised Language Acquisition

The area of research most closely related to our work concerns the problem of unsupervised knowledge acquisition at the lexical level. Most recently, Roy et al. have proposed a model for lexical acquisition by machine using multimodal inputs, including speech [14]. Roy used a recurrent neural network trained on transcribed speech data to output a stream of phoneme probabilities for phonemically segmented audio. Words were learned by pairing audio and visual events and storing them as lexical items in a long-term memory structure. In [15], de Marcken demonstrated how to learn words from phonetic transcriptions of continuous speech by using a model-based approach to lexicon induction. The algorithm iteratively updates parameters of the model (lexicon) to minimize the description length of the model given the available evidence (the input corpus).

Brent proposed a model-based dynamic programming approach to word acquisition by considering the problem as one of segmentation (i.e., inferring word boundaries in speech) [16]. In his approach, the input corpus is presented as a single unsegmented stream. The optimal segmentation of the corpus is found through a dynamic programming search, where an explicit probability model is used to evaluate each candidate segmentation. A similar strategy is used by Venkataraman in [17], although the utterance level representation of the corpus is used as a starting point rather than viewing the entire corpus as a single entity. The estimation of probabilities used in the segmentation algorithms of Brent and Venkataraman differs, but the overall strategies of the two techniques are conceptually similar. More recently, Goldwater has improved upon these model-based approaches by allowing for sparse solutions and more thoroughly investigating the role of search in determining the optimal segmentation of the corpus [18].

We note here that each of the above examples used a phonological lexicon as a foundation for the word acquisition process, and none of the techniques described were designed to be applied to the speech signal directly. The algorithms proposed by de Marcken and Roy both depend on a phonetic recognition system to convert the continuous speech signal into a set of discrete units. The systems of Brent and Venkataraman were evaluated using speech data phonemically transcribed by humans in a way that applied a consistent phoneme sequence to a particular word entity, regardless of pronunciation.

Pattern discovery in audio has been previously proposed by several researchers. In [19], Johnson used a specialized distance metric for comparing covariance matrices of audio segments to find non-news events such as commercials and jingles
in broadcast news. Typically, the repeated events were identical to one another and were on the order of several seconds long. Unsupervised processing of speech has also been considered as a first step in acoustic model development [20]. Bacchiani proposed a method for breaking words into smaller acoustic segments and clustering those segments to jointly determine acoustic subword units and word pronunciations [21]. Similarly, Deligne demonstrated how to automatically derive an inventory of variable-length acoustic units directly from speech by quantizing the spectral observation vectors, counting symbol sequences that occur more than a specified number of times, and then iteratively refining the models that define each of these symbol sequences [22].

III. SEGMENTAL DTW

This section motivates and describes a dynamic programming algorithm which we call segmental DTW [23], [24]. Segmental DTW takes as input two continuous speech utterances and finds matching pairs of subsequences. This algorithm serves as the foundation for the pattern discovery methodology described in this paper.

Dynamic time warping was originally proposed as a way of comparing two whole word exemplars to each other by way of some optimal alignment. Given two utterances, X and Y, we can represent each as a time series of spectral vectors, X = (x_1, ..., x_M) and Y = (y_1, ..., y_N), respectively. The optimal alignment path between X and Y, phi*, can be computed, and the accumulated distortion between the two utterances along that path, D_phi*(X, Y), can be used as a basis for comparison. Formally, we define a warping relation, or warp path, phi, to be an alignment which maps X to Y while obeying several constraints. The warping relation can be written as a sequence of ordered pairs

    phi = (i_1, j_1), (i_2, j_2), ..., (i_T, j_T)    (1)

that represents the mapping x_{i_k} <-> y_{j_k} for k = 1, ..., T.

In the case of global alignment, phi maps all of sequence X to all of sequence Y. The globally optimal alignment phi* is the one which minimizes

    D_phi(X, Y) = sum_{k=1}^{T} d(x_{i_k}, y_{j_k})    (2)

In (2), d(x_{i_k}, y_{j_k}) represents the unweighted Euclidean distance between feature vectors x_{i_k} and y_{j_k}.

Although there are a number of spectral representations that are widely used in the speech research community, in this paper we use whitened Mel-scale cepstral coefficients (MFCCs). The process of whitening decorrelates the dimensions of the feature vector and normalizes the variance in each dimension. These characteristics of this spectral representation make the Euclidean distance metric a reasonable choice for comparing two feature vectors, as the distance in each dimension will also be uncorrelated and have equal variance. We note that our choice of feature representation and distance measure are not specifically designed to be stable when comparing different speakers or when comparing speech from different environmental conditions. Finding robust feature representations is a difficult problem in its own right, and we defer treatment of this issue to more extensive research done in the area.

When the utterances that we are trying to compare happen to be isolated words, the globally optimal alignment is a suitable way to directly measure the similarity of two utterances at the acoustic level. However, if the utterances consist of multiple-word sequences, the distances and paths produced by optimal global alignment may not be meaningful. Although DTW was applied to the problem of connected word recognition via a framework called level building, this technique still required the existence of a set of isolated word reference templates [25]. In that respect, the problem has significant differences from the one in which we are interested. Consider the pair of utterances:
1) "He too was diagnosed with paranoid schizophrenia";
2) "... were willing to put Nash's schizophrenia on record."
Even in an optimal scenario, a global alignment between these two utterances would be forced to map speech frames from dissimilar words to one another, making the overall distortion difficult to interpret. This difficulty arises primarily because each utterance is composed of a different sequence of words, meaning that the utterances cannot be considered from a global perspective. However, (1) and (2) do share similarities at the local level. Namely, both utterances contain the word "schizophrenia." Identifying and aligning such similar local segments is the problem we seek to address in this section. Our proposed solution is a segmental variant of DTW that attempts to find subsequences of two utterances that align well to each other. Segmental DTW consists of two main components: a local alignment procedure which produces multiple warp paths that have limited temporal variation, and a path trimming procedure which retains only the lower distortion regions of an alignment path.

A. Local Alignment

In this section, we modify the basic DTW algorithm in several important ways. First, we incorporate global constraints to restrict the allowable shapes that a warp path can take. Second, we attempt to generate multiple alignment paths for the same two input sequences by employing different starting and ending points in the DTW search.

The need for global constraints in the DTW process can be seen by considering the example in Fig. 1. The shape of the path in the figure corresponds to an alignment that indicates that X is not a temporally dilated form of Y, or vice versa. A more rigid alignment would prevent an overly large temporal skew between the two sequences, by keeping frames from one utterance from getting too far ahead of frames from the other. The following criterion, proposed by Sakoe, accomplishes this goal [26]. For a warp path originating at (1, 1), the kth coordinate of the path, (i_k, j_k), must satisfy

    |i_k - j_k| <= R    (3)

The constraint in (3) essentially limits the path to a diagonal region of width 2R + 1. This region is shown in Fig. 1, for a value of R = 2. Depending on the size of R, the ending point
of the constrained path may not reach (M, N). An alignment path resulting in unassigned frames in either of the input utterances may be desirable in cases where only part of the utterances match.

In addition to limiting temporal skew, the constraint in (3) also introduces a natural division of the search grid into regions suitable for generating multiple alignment paths with offset start coordinates as shown in Fig. 2. For utterances of length M and N, with a constraint parameter of R, the start coordinates will be

    (1, 1), (1, 2R + 2), (1, 4R + 3), ... and (2R + 2, 1), (4R + 3, 1), ...

with successive start coordinates offset by 2R + 1 along each axis. Based on these coordinates, we will have a number of diagonal regions, each defining a range of alignments between the two utterances with different offsets but the same temporal rigidity. Within each region, we can use dynamic time warping to find the optimal local alignment phi_k, where k is the index of the diagonal region.

Fig. 1. Nonideal warp path that can result from unconstrained alignment. For this path, all frames from X are mapped to the first frame of Y, and all frames from Y are mapped to the last frame of X. The alignment corresponding to the warp path is displayed in the lower part of the figure. The shaded region of the graph represents the allowable set of path coordinates following the band constraint in (3) with R = 2.

Fig. 2. Multiple alignment paths resulting from applying the band constraint with R = 1. The alignments corresponding to each diagonal region are shown below the grid.

B. Path Refinement

At this stage, we are left with a family of local warp paths phi_k for k = 1, ..., K, where K is the number of diagonal regions. Because we are only interested in finding portions of the alignment which are similar to each other, the next step is to refine the warp path by discarding parts of the alignment with high distortion. Although there are a number of possible methods that could be used to accomplish this objective, we proceed by identifying and isolating the length-constrained minimum average (LCMA) distortion fragment of the local alignment path. We then extend the path fragment to include neighboring points falling below a particular threshold.

The problem of finding the LCMA distortion fragment can be described more generally as follows. Consider a sequence of positive real numbers

    A = a_1, a_2, ..., a_n    (4)

and a length constraint parameter L. Then, the length-constrained minimum average subsequence LCMA(A, L) is a consecutive subsequence of A with length at least L that minimizes the average of the numbers in the subsequence. More formally, we wish to find s and e that satisfy

    (s, e) = argmin_{s, e} (1 / (e - s + 1)) sum_{k=s}^{e} a_k    (5)

with e - s + 1 >= L. In our work, we make use of an algorithm proposed by Lin et al. for finding LCMA(A, L) in O(n log L) time [27].

In order to apply this algorithm to our task, recall that every warp path is a sequence of ordered pairs

    phi = (i_1, j_1), (i_2, j_2), ..., (i_T, j_T)    (6)

Associated with each warp path is a distortion sequence whose values are real and positive

    delta_phi = d(x_{i_1}, y_{j_1}), d(x_{i_2}, y_{j_2}), ..., d(x_{i_T}, y_{j_T})    (7)

The minimum distortion warp path fragment is the subsequence (i_s, j_s), ..., (i_e, j_e) of phi whose indices satisfy

    (s, e) = LCMA(delta_phi, L)    (8)

The minimum length criterion plays a practical role in computing the minimum average subsequence. Without the length constraint, the minimum average subsequence would typically be just the smallest single element in the original sequence. Likewise, for our application, it has the effect of preventing spurious matches between short segments within each utterance. The length criterion also has important conceptual implications. The value of L serves to control the granularity of repeating patterns that are returned by the segmental DTW procedure. Small values of L will lead to many short, subword patterns being found, while large values of L will return fewer, but more linguistically significant patterns such as words or phrases.
Fig. 3. Family of constrained warp paths with R = 10 for the pair of utterances in our example. The frame rate for this distance matrix is 200 frames per second. The associated LCMA path fragments, with L = 100, are shown in bold as part of each warp path. Each path fragment is associated with an average distortion that indicates how well the aligned segments match one another.

Fig. 4. Utterance level view of a warp path from Fig. 3. The segment bounded by the vertical black lines corresponds to the LCMA fragment for this particular warp path, while the remainder of the white line corresponds to the fragment resulting from extending the LCMA fragment to neighboring regions with low distortion.
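The fragment extension shown in Fig. 4 can be sketched as follows, assuming a per-frame distortion sequence along the warp path. The 10% margin mirrors the example threshold mentioned in the text, and the exact extension rule and function name are our own illustration, not the authors' implementation.

```python
def extend_fragment(distortions, s, e, margin=0.10):
    # Grow the LCMA span [s, e] outward while the neighboring per-frame
    # distortions stay within (1 + margin) times the fragment's average
    # distortion, mimicking the threshold-based extension in the text.
    avg = sum(distortions[s:e + 1]) / (e - s + 1)
    threshold = (1.0 + margin) * avg
    while s > 0 and distortions[s - 1] <= threshold:
        s -= 1
    while e < len(distortions) - 1 and distortions[e + 1] <= threshold:
        e += 1
    return s, e
```

Other thresholding choices (e.g., relative to the maximum distortion inside the fragment) would behave similarly; the point is only that low-distortion neighbors are absorbed into the fragment.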
In separate experiments, we found that the reliability of alignment paths found by the algorithm, in terms of matching accuracy, was positively correlated with path length [28]. This result, along with our need to limit the found paths to a manageable number for a given audio stream, led us to select a relatively long minimum length constraint of 500 ms. We discuss some less arbitrary methods for determining an optimal setting for L in Section VII. In the remainder of this section, we show example outputs that are produced when segmental DTW is applied to pairs of utterances.

C. Example Outputs

In this section, we step through the segmental DTW procedure for the pair of utterances presented at the beginning of Section III. The distance matrix for these two utterances is displayed in Fig. 3. In this distance matrix, each cell corresponds to the Euclidean distance between frames from each of the utterances being compared. The cell at row i, column j, corresponds to the distance between frame i of the first utterance and frame j of the second utterance. The local similarity between the utterance portions containing the word "schizophrenia" is evident in the diagonal band of low distortion cells stretching from the time coordinates (1.6, 0.9) to (2.1, 1.4). From the distance matrix, a family of constrained warp paths is found using dynamic time warping as shown in Fig. 3. The width parameter which constrains the extent of time warping is set to R = 10 frames which, at a 5-ms analysis rate, corresponds to a total allowable offset of 105 ms. The warp paths are overlaid with their associated length-constrained minimum average path fragments. The length parameter used in this example is L = 100, which corresponds to approximately 500 ms. The coloring of the warp path fragments corresponds to the average distortion of the path fragment, with bright red fragments indicating low distortion paths and darker shades indicating high distortion paths. Typically, there is a wide range of distortion values for the path fragments found, but only the lowest distortion fragments are of interest to us, as they indicate potential local matches between utterances.

An alternate view of the distortion path, including a frame-level view of the individual utterances, is shown in Fig. 4. This view of the distortion path highlights the need for extending the path fragments discovered using the LCMA algorithm. Although the distortion remains low from the onset of the word "schizophrenia" in each utterance, the LCMA path fragment (shown in red) starts almost 500 ms after this initial drop in distortion. In order to compensate for this phenomenon, we allow for path extension using a distortion threshold based on the values in the path fragment, for example within 10% of the distortion of the original fragment. The extension of the fragment is shown in Fig. 4 as a white line.

Although the endpoints of the extended path fragment in Fig. 4 happen to coincide with the common word boundaries for that particular example, in many cases the segmental DTW algorithm will align subword sequences or even multiword sequences. This is because, aside from fragment length, the segmental DTW algorithm makes no use of lexical identity when searching for an alignment path.

IV. FROM PATHS TO CLUSTERS

In order to apply the segmental DTW algorithm to an audio stream longer than a short sentence, we first perform silence detection on the audio stream to break it into shorter utterances. This segmentation step is described in more detail in Section V-B. Segmental DTW is then performed on each pair of utterances. With the appropriate choice of length constraint, this procedure generates a large number of alignment path fragments that are distributed throughout the original audio stream. Each alignment path consists of two intervals (the regions in time purported to be matching), and the associated distortion along that interval. Fig. 5 illustrates the distribution of path fragments throughout the audio stream. This visualization demonstrates how some time intervals in the audio match well to many other intervals, with up to 17 associated path fragments, while some time intervals have few, if any, matches. Since these fragments serve to link regions in time that are acoustically similar to one another, a natural question to ask is whether they can be used to build clusters of similar sounding speech segments with a common underlying lexical identity.
Fig. 5. Histogram indicating the number of path fragments present for each
instant of time for the Friedman lecture. The distribution of path fragments is
irregular, indicating that certain time intervals have more acoustic matches than
others.

Fig. 7. Top—a partial utterance with the time regions from its associated path
fragments shown in white. Paths are ordered from bottom to top in increasing
order of distortion. Bottom—the corresponding similarity profile for the same
time interval is shown as a solid line, with the smoothed version shown as a
dashed line (raised for clarity). The extracted time indices are denoted as dots
at the profile peaks.
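The profile-and-peak procedure illustrated in Fig. 7 can be sketched as follows, assuming each path fragment is given as a (start, end) frame interval with an average distortion. The inverse-distortion weight used here is a stand-in for the paper's similarity score in (9), and the smoothing window and peak rule are illustrative choices of ours.

```python
import numpy as np

def extract_nodes(T, fragments, win=5):
    # Sum a per-fragment similarity score over the frames each fragment
    # covers, smooth the resulting profile with a moving average, and keep
    # the local maxima as candidate node time indices.
    profile = np.zeros(T)
    for (start, end), distortion in fragments:
        profile[start:end + 1] += 1.0 / (1.0 + distortion)
    kernel = np.ones(win) / win
    smooth = np.convolve(profile, kernel, mode="same")
    peaks = [t for t in range(1, T - 1)
             if smooth[t] > smooth[t - 1] and smooth[t] >= smooth[t + 1]]
    return smooth, peaks
```

Frames covered by many low-distortion fragments accumulate the largest profile values, so the extracted peaks fall inside the heavily matched regions, as in the dots of Fig. 7.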
Fig. 6. Production of an adjacency graph from alignment paths and extracted nodes. The audio stream is shown as a timeline, while the alignment paths are shown as pairs of colored lines at the same height above the timeline. Node relations are captured by the graph on the right, with edge weights given by the path similarities.

Our approach to this problem is cast in a graph theoretical framework, which represents the audio stream as an abstract adjacency graph consisting of a set of nodes N and a set of edges E. In this graph, the nodes correspond to locations in time, and the edges correspond to measures of similarity between those time indices. Given an appropriate choice of nodes and edges, graph clustering techniques can be applied to this abstract representation to group together the nodes in the graph that are closest to one another. Since graph clustering and partitioning algorithms are an active area of research [29]–[31], a wide range of techniques can be applied to this stage of the problem.

An overview of the graph conversion process is shown in Fig. 6. The time indices indicated in the audio stream are realized as nodes in the adjacency graph, while the alignment paths overlapping the time indices are realized as edges between the nodes. We use these alignment paths to derive edge weights by applying a simple linear transformation of the average path distortions, with the weight between two nodes being given by the following similarity score

    w_ij = (θ − D̄(p_ij)) / θ    (9)

In this equation, w_ij is the weight on the edge between nodes i and j, p_ij is the alignment path common to both nodes, D̄(p_ij) is the average distortion for that path, and θ is a threshold used to normalize the path distortions. The average distortion is used as opposed to the total distortion in order to normalize for path length when comparing paths with different durations. Paths with average distortion greater than θ are not included in the similarity computation. The distortion threshold chosen for all experiments in this paper was θ = 2.5, which resulted in approximately 10% of the generated alignment paths being retained. The resulting edge weights are closer to 1 between nodes with high similarity, and closer to zero (or nonexistent) for nodes with low similarity.

A. Node Extraction

While it is relatively straightforward to see how alignment path fragments can be converted into graph edges given a set of time index nodes in the audio stream, it is less clear how these nodes can be extracted in the first place. In this section, we describe the node extraction procedure.

Recall that the input to the segmental DTW algorithm is not a single contiguous audio stream, but rather a set of utterances produced by segmenting the audio using silence detection. Our goal in node extraction is to determine a set of discrete time indices within these utterances that are representative of their surrounding time interval. This is accomplished by using information about the alignment paths that populate a particular utterance.

Consider the example shown in Fig. 7. In this example, there are a number of alignment paths distributed throughout the utterance with different average path distortions. The distribution of alignment paths is such that some time indices are covered by many more paths than others—and are therefore similar to more time indices in other utterances. These heavily covered time indices are typically located within the words and phrases that are matched via multiple alignment paths.

We can use the alignment paths to form a similarity profile by summing the similarity scores of (9) over time. That is, the similarity score at time t, S(t), is given by

    S(t) = Σ_{p ∈ P_t} w_p    (10)

In this equation, P_t are the paths that overlap time t, and w_p is the similarity value for path p given by (9).

After smoothing the similarity profile with a 0.5-s triangular averaging window, we take the peaks from the resulting smoothed profile and use those time indices as the nodes in our adjacency graph. Because our extraction procedure finds locations with locally maximized similarity within the utterance, the resulting time indices demarcate locations that are more likely to bear resemblance to other locations in the audio stream.
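A minimal numerical sketch of this node extraction step follows; the frame rate, window length, peak test, and the exact linear form of the similarity transform are assumptions for illustration, not the authors' code:

```python
import numpy as np

def similarity_profile(n_frames, paths, theta=2.5):
    """Accumulate per-path similarity over every frame a path covers.
    `paths` holds (start_frame, end_frame, avg_distortion) triples;
    the similarity weight is assumed to be (theta - avg_dist) / theta,
    a linear transform of the average path distortion."""
    profile = np.zeros(n_frames)
    for start, end, avg_dist in paths:
        if avg_dist > theta:
            continue  # paths above the distortion threshold are discarded
        profile[start:end + 1] += (theta - avg_dist) / theta
    return profile

def extract_nodes(profile, win=50):
    """Smooth with a triangular averaging window (e.g., 50 frames = 0.5 s
    at a 100-Hz frame rate) and return local maxima, which serve as the
    node time indices."""
    kernel = np.bartlett(win)
    kernel /= kernel.sum()
    smoothed = np.convolve(profile, kernel, mode="same")
    return [t for t in range(1, len(smoothed) - 1)
            if smoothed[t - 1] < smoothed[t] >= smoothed[t + 1]]
```

Frames covered by many low-distortion paths accumulate large profile values, so the returned peaks fall inside frequently matched words and phrases.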
192 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 1, JANUARY 2008

Fig. 8. Example of graph clustering output. Nodes are colored according to cluster membership. Dashed lines indicate intercluster edges.

The reasoning behind this procedure can be understood by noting that only some portions of the audio stream will have high similarity (i.e., low distortion) to other portions. By focusing on the peaks of the aggregated similarity profile, we restrict ourselves to finding those locations that are most similar to other locations. Since every alignment path covers only a portion of an utterance, the similarity profile will fluctuate over time. This causes each utterance to separate naturally into multiple nodes corresponding to distinct patterns that can be joined together via their common alignment paths. Each path that overlaps a node maps to an edge in the adjacency graph representation of the audio stream. The method we describe for inducing a graph from the alignment paths is one of many possible techniques. We discuss other possibilities for graph conversion in Section VII.

B. Graph Clustering

Once an adjacency graph has been generated for the audio stream using the extracted nodes and path fragment edges, the challenge of finding clusters in the graph remains. In an adjacency graph, a good clustering is one where nodes in one cluster are more densely connected to each other than they are to nodes in another cluster. The clustered adjacency graph in Fig. 8 illustrates this concept. A naive approach to this problem is to simply threshold the edge weights and use the groups of connected components that remain as clusters. Though conceptually simple, this approach is prone to accidental merging if even a single edge with high weight exists between two clusters that should be separated. In contrast to simple edge thresholding, a number of more sophisticated algorithms for automatic graph clustering have been proposed by researchers in other fields [32], [33]. For some applications, such as task scheduling for parallel computing, the clustering problem is cast as a partitioning task, where the number and size of desired clusters is known and the objective is to find the optimal set of clusters with those criteria in mind. For other applications, such as detecting community structure in social and biological networks, the number and size of clusters is typically unknown, and the goal is to discover communities and groups from the relationships between individuals.

In our work, the clustering paradigm aligns more closely with the latter example, as we are attempting to discover groups of segments corresponding to the same underlying lexical entity, and not partition the audio stream into a set of clusters with uniform size. Since a detailed treatment of the graph clustering problem is outside the scope and intent of this work, we focus on an efficient, bottom-up clustering algorithm for finding community structure in networks proposed by Newman [34]. The Newman algorithm begins with all edges removed and each node in its own group, then merges groups together in a greedy fashion by adding edges back to the graph in the order that maximizes a modularity measure which is given by

    Q = Σ_i (e_ii − a_i²)    (11)

where e_ij is the fraction of edges in the original network that connect vertices in group i to those in group j, and a_i = Σ_j e_ij. More informally, Q is the fraction of edges that fall within groups, minus the expected value of the same quantity if edges fall at random without regard for the community structure of the graph. The value of Q ranges between 0 and 1, with 0 being the expected modularity of a clustering where intercluster edges occurred about as frequently as intracluster edges, and higher scores indicating more favorable clusterings of the graph. The advantages of this particular algorithm are threefold. First, it easily allows us to incorporate edge weight information in the clustering process by considering weights as fractional edges in computing edge counts. Second, it is extremely fast, operating in O((m + n)n) time in the worst case. Finally, the modularity criterion offers a data-driven measure for determining the number of clusters to be detected from a particular graph.

Because our goal is to separate the graph into groups joining nodes sharing the same word(s), multiple groups containing the same word are more desirable than fewer groups containing many different words. We therefore associate a higher cost with the action of mistakenly joining two unlike groups than that of mistakenly leaving two like groups unmerged. This observation leads us to choose a conservative stopping point for the clustering algorithm at 80% of peak modularity.

C. Nodes to Intervals

Recall from Section IV-A that the nodes in the adjacency graph represent not time intervals in the original audio stream, but time indices. For the purposes of clustering, this time index abstraction may be adequate for representing nodes, but we will, at times, require associating a time interval corresponding to that node. One situation where we need a time interval rather than the time index corresponding to the node is for determining how to transcribe a node. As can be seen from the example in Fig. 7, the alignment paths overlapping a particular node rarely agree on starting and ending times for their respective time intervals. We assign a time interval to a node by computing the average start and end times for all the alignment paths for edges occurring within the cluster to which that node belongs.

V. EXPERIMENTAL BACKGROUND

A. Speech Data

Word-level experiments in this paper are performed on speech data taken from an extensive corpus of academic lectures recorded at MIT [35]. At present, the lecture corpus includes more than 300 h of audio data recorded from eight different courses and over 80 seminars given on a variety of topics such as poetry, psychology, and science. Many of these lectures are publicly available on the MITWorld website [36] and as a part of the MIT OpenCourseware initiative [37]. In most cases, each lecture takes place in a classroom environment, and the

audio is recorded with an omnidirectional microphone (as part of a video recording).

TABLE I
SEGMENT OF SPEECH TAKEN FROM A LECTURE, "THE WORLD IS FLAT," DELIVERED BY THOMAS FRIEDMAN

We used six lectures for the experiments and examples in this work, each one delivered by a different speaker. The lectures ranged in duration from 47 to 85 min, with each focusing on a well-defined topic. Five of the lectures were academic in nature, covering topics like linear algebra, physics, and automatic speech recognition. The remaining lecture was delivered by Thomas Friedman, a New York Times columnist, who spoke for 75 min on the material in his recent book, "The World is Flat." An example of the type of speech from this lecture is shown in Table I. We note that the transcript deviates significantly from patterns typically observed in formal written text, exhibiting artifacts such as filled pauses (1), false starts (1), sentence fragments (2), and sentence planning errors (3).

One of the unique characteristics of the lectures described above is the quantity of speech data that is available for any particular speaker. Unlike other sources of speech data, these lectures are primarily comprised of a single speaker addressing an audience for up to an hour or more at a time, making them particularly well suited to our word-discovery technique. Moreover, the focused and topical nature of the lectures we investigate tends to result in relatively small vocabularies which make frequent use of subject-specific keywords that may not be commonly used in everyday speech.

B. Segmentation

The lectures in the dataset are typically recorded as a single stream of audio often over 1 h in length, with no supplementary indicators of where one utterance stops and another begins. For many of the processing steps undertaken in subsequent stages, we require a set of discrete utterances in order to compare utterances to one another. In order to subdivide the audio stream into discrete segments of continuous speech, we use a basic phone recognizer to identify regions of silence in the signal [38]. Silent regions with duration longer than 2 s are removed, and the portions of speech in between those silences are used as the isolated utterances. The use of a phone recognizer is not a critical prerequisite for this segmentation procedure, since we only use the output to make a speech activity decision at each particular point in time. In the absence of a phone recognizer, a less sophisticated technique for speech activity detection can be substituted in its place. Most of the utterances produced during the segmentation procedure are short, averaging durations of less than 3 s. The segmentation procedure is also conservative enough that segmentation end points are rarely placed in the middle of a word.

TABLE II
CLUSTER STATISTICS FOR ALL LECTURES PROCESSED IN THIS PAPER. ONLY CLUSTERS WITH AT LEAST THREE MEMBERS ARE INCLUDED IN THIS TABLE. THE LAST TWO COLUMNS INDICATE HOW MANY OF THE GENERATED CLUSTERS ARE ASSOCIATED WITH A SINGLE WORD IDENTITY OR A MULTIWORD PHRASE

C. Computational Considerations

As described, the pattern discovery process requires that each utterance is compared with each other utterance. The number of segmental DTW comparisons required for each audio stream is therefore quadratic in the number of utterances. This step is the most computationally intensive part of the process; node generation and clustering do not incur significant computation costs. Since each pair of utterances can be compared independently, we perform these comparisons in parallel to speed up computation. The number of comparisons can potentially be reduced by merging matching segments as they are found.

VI. CLUSTER ANALYSIS

We processed the six lectures described in Section V-A using the segmental DTW algorithm and generated clusters for each. Overall cluster statistics for these lectures are shown in Table II. We will return to this table momentarily, but for illustrative purposes, we focus on clusters from the Thomas Friedman lecture. A more detailed view of the clusters with at least three members is shown in Table III. In this table, the clusters are listed first in decreasing order of size, denoted by N, then by decreasing order of density δ. The density, a measure of the "interconnectedness" of each cluster, is given by

    δ = 2E / (N(N − 1))    (12)

The quantity δ in the above equation is the fraction of edges E observed in the cluster out of all possible edges that could exist between cluster nodes. Higher densities indicate greater agreement between nodes. Table III also includes a purity score for each cluster. The purity score is a measure of how accurately the clustering algorithm is able to group together like speech nodes, and is determined by calculating the percentage of nodes that agree with the lexical identity of the cluster. The cluster identity, in turn, is derived by looking at the underlying reference transcription for each node and choosing the word or phrase that appears most frequently in the nodes of that particular cluster. Clusters with no majority word or phrase (such as those matching subword speech segments) are labeled as “ .”

1) Example Clusters: Some examples of specific clusters with high purity are shown in Fig. 9. Cluster 27 in Fig. 9 is an

TABLE III
INFORMATION FOR THE 63 CLUSTERS WITH AT LEAST THREE MEMBERS GENERATED FOR THE FRIEDMAN LECTURE.
CLUSTERS ARE ORDERED FIRST BY SIZE, THEN IN DECREASING ORDER OF DENSITY

example of a high-density cluster, with each node connecting to each other node, and the underlying transcriptions confirm that each node corresponds to the same recurring phrase. The other
two clusters in Fig. 9, while not displaying the same degree of in-
terconnectedness, nevertheless all consist of nodes with similar
transcriptions. One interesting property of these clusters is the
high degree of temporal locality displayed by their constituent
nodes. With the exception of node 587, most of the other nodes
occur within 5 min of the other nodes in their respective clusters.
This locality may be indicative of transient topics in the lecture
which require the usage of terms that are only sporadically used.
In the case of cluster 27, these four instances of “search engine
optimize-” were the only instances where they were spoken in
the lecture.
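The density of (12) and the purity measure reported in Table III can be sketched as follows (a minimal illustration; the input encodings are assumptions, not the authors' code):

```python
from collections import Counter

def cluster_density(num_nodes, num_edges):
    """Density per eq. (12): observed edges over all possible node
    pairs, 2E / (N(N - 1)); defined for clusters of two or more nodes."""
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

def cluster_purity(node_labels):
    """Fraction of nodes whose reference transcription agrees with the
    cluster's majority label (the cluster identity)."""
    majority_count = Counter(node_labels).most_common(1)[0][1]
    return majority_count / len(node_labels)
```

For example, a fully connected four-node cluster such as cluster 27 has density 1.0, while a cluster in which three of four nodes share a transcription has purity 0.75.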
2) Cluster Statistics: Several interesting points can be noted
regarding the clusters generated from the Friedman lecture.
First, most clusters (56 of 63) have a word or phrase that can be
considered to be the lexical identity of the cluster. Out of these
clusters, over 73% of the clusters have a purity of 100%, which
offers encouraging evidence that the segmental DTW measures
and subsequent clustering procedure are able to correctly group
recurring words together. As might be expected, the cluster
density appears to be positively correlated to cluster purity,
with an average purity of 87% among clusters with density
greater than 0.05, and an average purity of 53% among clusters
with density less than or equal to 0.05. We also observe that the clustering algorithm does not appear to discriminate between

Fig. 9. Detailed view of clusters 17, 24, and 10, including the node indices, transcriptions, and locations in the audio stream.
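Looking back at the clustering stage of Section IV-B, the modularity measure of (11) that guides the greedy merging can be sketched for a fixed partition (the edge-list encoding is an illustrative assumption):

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_i (e_ii - a_i**2), where e_ij is the fraction of edges
    joining group i to group j and a_i = sum_j e_ij.
    `edges`: list of (u, v) node pairs; `community`: node -> group id."""
    m = len(edges)
    e = defaultdict(float)
    for u, v in edges:
        ci, cj = community[u], community[v]
        # Split each edge between (ci, cj) and (cj, ci) so that
        # within-group edges count fully toward e[(ci, ci)].
        e[(ci, cj)] += 0.5 / m
        e[(cj, ci)] += 0.5 / m
    a = defaultdict(float)
    for (ci, _cj), frac in e.items():
        a[ci] += frac
    return sum(e[(ci, ci)] - a[ci] ** 2 for ci in a)
```

Placing all nodes in one group yields Q = 0, while a partition that isolates densely connected subgraphs scores higher; the greedy algorithm merges groups in whatever order raises Q most.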

TABLE IV
TWENTY MOST RELEVANT WORDS FOR EACH LECTURE, LISTED IN DECREASING ORDER OF TFIDF SCORE.
WORDS OCCURRING AS PART OF A CLUSTER FOR THAT LECTURE ARE SHOWN IN BOLD

single words and multiword phrases that are frequently spoken as a single entity, with more than half of the clusters (31 of 56) mapping to multiword phrases.

Overall cluster purity statistics for the five other academic lectures processed in this paper are shown in Table II. We found that across all six lectures, approximately 83% of the generated clusters had density greater than 0.05, and among these higher density clusters, the average purity was 92.2%. In contrast, the average purity across all of the lower density clusters was only 72.6%. These statistics indicate that the observations noted in the previous paragraph appear to transfer to the other lectures. Some notable differences between the Friedman lecture and the academic lectures are the larger average cluster size, and higher overall purity across the clusters in general. The larger size of some clusters can be attributed to the more focused nature of the academic lecture vocabulary, while the higher purity may be a result of differences in speaking style.

A cursory view of the cluster identities for each lecture indicates that many clusters correspond to words or phrases that are highly specific to the subject material of that particular lecture. For example, in the physics lecture, the words "charge," "electric," "surface," and "epsilon," all correspond to some of the larger clusters for the lecture. This phenomenon is expected, since relevant content words are likely to recur more often, and function words such as "the," "is," and "of," are of short duration and typically exhibit significant pronunciation variation as a result of coarticulation with adjacent words. One way of evaluating how well the clusters capture the subject content of a lecture is to consider the coverage of relevant words by the generated clusters.

Since there is no easy way of measuring word relevancy directly, for the purposes of our work, we use each word's term-frequency, inverse document-frequency (TFIDF) score as a proxy for its degree of relevance [39]. The TFIDF score is the frequency of the word within a document normalized by the frequency of the same word across multiple documents. Our rationale for using this score is that words with high frequency within the lecture, but low frequency in general usage, are more likely to be specific to the subject content for that lecture. The word lists in Table IV are the 20 most relevant words for each lecture ranked in decreasing order of their TFIDF score. Each list was generated as follows.
1) First, words in the reference transcription were stemmed to merge pluralized nouns with their root nouns, and various verb tenses with their associated root verbs.
2) Partial words, filled pauses, single letters and numbers, and contractions such as "you've" or "i'm" were removed from the reference transcription.
3) Finally, the remaining words in the lecture were ranked by TFIDF, where the document frequency was determined using the 2K most common words in the Brown corpus [40].
When considered in the context of each lecture's title, the lists of words generated in Table IV appear to be very relevant to the subject matter of each lecture, which qualitatively validates our use of the TFIDF measure. The words for each lecture in Table IV are highlighted according to their cluster coverage, with words represented by a cluster shown in bold. On average, 14.8 of the top 20 most relevant words are covered by a cluster generated by our procedure. This statistic offers encouraging

evidence that the recurring acoustic patterns discovered by our approach are not only similar to each other (as shown by the high average purity), but also informative about the lexical content of the audio stream.

VII. DISCUSSION AND FUTURE WORK

This paper has focused on the unsupervised acquisition of lexical entities from the information produced by the segmental DTW algorithm. We demonstrated how to use alignment paths, which indicate pairwise similarity, to transform the audio stream into an abstract adjacency graph which can then be clustered using standard graph clustering techniques. As part of our evaluation, we showed that the clusters generated by our proposed procedure have both high purity and good coverage of terms that are relevant to the subject of the underlying lecture.

As we noted in Section V-A, there are several reasons why the lecture audio data was particularly well suited for pattern discovery using segmental DTW. First, the material was single-speaker data in a consistent environment, which allowed us to ignore issues of robustness with our feature representation. Second, the amount of topic-specific data ensured that there were enough instances of repeated words for the algorithm to find. For both of these reasons, our algorithm would likely not perform as well if applied directly to other domains, such as Switchboard or Broadcast News. In particular, we would not expect to find clusters of the same size or density without reducing the length parameter and/or including more edges in the adjacency graph. The reason for this is mainly due to speaker changes and a paucity of repeated content words. Speaking style is not as significant an issue, as the lecture data exhibits speech that is much more conversational than read speech or broadcast news data.

The work presented in this paper represents only an initial investigation into the more general problem of knowledge acquisition from speech. Many directions for future work remain, and we expand upon some of them here.

In our experiments, we chose a large value for the length parameter to limit the over-generation of alignment path fragments corresponding to short, possibly spurious, acoustic matches. Typically, low-distortion path fragments corresponding to words or phrases are recoverable from shorter path fragments during the extension step of path refinement. Discovery of longer fragments is therefore not particularly sensitive to our choice of the length parameter. Larger values primarily serve to prevent short path fragments (usually corresponding to subword matches) from being passed on to the node generation and clustering stage. Within the context of word acquisition, these shorter path fragments are problematic because they cause dissimilar words to cluster with one another via common subword units. Possibilities for future work include using smaller values of the length parameter for discovery of subword units or determining its appropriate setting in a more principled manner. For example, the optimal setting could be determined by performing pattern discovery over the audio stream using multiple values and choosing the best one according to some selection criterion.

An incremental strategy for improving cluster purity and finding more precise word boundaries may be to adopt an iterative approach to cluster formation. After clusters have been formed and the time intervals for each node have been estimated, edge weights between cluster nodes can be recomputed using the start and end times of the node intervals as constraints. Based on these new edge weights, nodes can be rejected from the cluster and the time intervals can be reestimated, with the process continuing until convergence to a final set of nodes. The idea behind this approach is to eliminate chaining and partial match errors by forcing clusters to be generated based on distortions that are computed over a consistent set of speech intervals.

Similarly, one could imagine using an interval-based clustering strategy to help avoid accidental merging of lexically different clusters, which can occur as a result of "chained" multiword phrases, or matched subword units such as "tion." Interval-based clustering would resolve this problem by using whole time intervals as nodes, rather than time indices. This approach would allow a hierarchical representation of a particular speech segment and distinguish between overlapping intervals of different lengths.

At a more abstract level, we believe that an interesting direction for future work would be to incorporate some way to build and update a model of the clustered intervals using some type of hidden Markov model or generalized word template. This would introduce significant computational savings by reducing the number of required comparisons.

Another area for future exploration is the automatic identification and transcription of cluster identities. We have previously proposed algorithms for doing so using isolated word recognition and phonetic recognition combined with a large baseform dictionary [24]. This task illustrates how unsupervised pattern discovery can provide complementary information to more traditional automatic speech recognition systems. Since most speech recognizers process each utterance independently of one another, they typically do not take advantage of the consistency with which the same word is uttered when repeated in the test data. Alignment paths generated by segmental DTW can find locations where an automatic transcription is not consistent by indicating where acoustically similar segments produced different transcriptions.

This paper documents our initial research on unsupervised strategies for speech processing. While conventional large vocabulary speech recognition would likely perform well in matched training and testing scenarios, there are many real-world scenarios where a paucity of content information can expose the brittleness of purely supervised approaches. We believe that techniques such as the one in this paper, which rely less on training data, can be combined with conventional speech recognizers to create more flexible, hybrid systems that can learn from and adapt to unexpected input. Examples of such unexpected input include: accented speech, out-of-vocabulary words, new languages, and novel word usage patterns. In each of these scenarios, exploiting the consistency of repeated patterns in the test data has not been fully explored, and we believe it is a promising direction for future research.

REFERENCES

[1] J. L. Gauvain, L. Lamel, and G. Adda, "The LIMSI broadcast news transcription system," Speech Commun., vol. 37, no. 1–2, pp. 89–108, May 2002.

[2] A. Ljolje, D. M. Hindle, M. D. Riley, and R. W. Sproat, "The AT&T LVCSR-2000 system," in Proc. DARPA Speech Transcription Workshop, College Park, MD, May 2000 [Online]. Available: http://www.nist.gov/speech/publications/tw00/pdf/cts30.pdf
[3] J. R. Saffran, R. N. Aslin, and E. L. Newport, "Statistical learning by 8-month old infants," Science, vol. 274, pp. 1926–1928, Dec. 1996.
[4] I. Rigoutsos and A. Floratos, "Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm," Bioinformatics, vol. 14, no. 1, pp. 55–67, Feb. 1998.
[5] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, "Approaches to the automatic discovery of patterns in biosequences," J. Comput. Biol., vol. 5, no. 2, pp. 279–305, 1998.
[6] G. K. Sandve and F. Drabløs, "A survey of motif discovery methods in an integrated framework," Biol. Direct, vol. 1, pp. 1–11, Apr. 2006.
[7] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[8] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol., vol. 48, pp. 443–453, 1970.
[9] M. S. Waterman and M. Eggert, "A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons," J. Mol. Biol., vol. 197, pp. 723–725, 1987.
[10] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, Jun. 2000, pp. 749–752.
[11] C. J. Burges, D. Plastina, J. C. Platt, E. Renshaw, and H. Malvar, "Using audio fingerprinting for duplicate detection and thumbnail generation," in Proc. Int. Conf. Acoust., Speech, Signal Process., Philadelphia, PA, Mar. 2005, vol. 3, pp. 9–12.
[12] R. Dannenberg and N. Hu, "Pattern discovery techniques for music audio," in Proc. Int. Conf. Music Inf. Retrieval, Paris, France, Oct. 2002, pp. 63–70.
[13] M. Goto, "A chorus-section detecting method for musical audio signals," in Proc. Int. Conf. Acoust., Speech, Signal Process., Apr. 2003, vol. 5, pp. 437–440.
[14] D. Roy and A. Pentland, "Learning words from sights and sounds: A computational model," Cognitive Sci., vol. 26, no. 1, pp. 113–146, Jan. 2002.
[15] C. G. de Marcken, "Unsupervised language acquisition," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 1996.
[16] M. R. Brent, "An efficient probabilistically sound algorithm for segmentation and word discovery," Mach. Learn., vol. 34, no. 1–3, pp. 71–105, Feb. 1999.
[17] A. Venkataraman, "A statistical model for word discovery in transcribed speech," Comput. Ling., vol. 27, no. 3, pp. 352–372, Sep. 2001.
[18] S. Goldwater, T. Griffiths, and M. Johnson, "Contextual dependencies in unsupervised word segmentation," in Proc. Coling/ACL, Sydney, Australia, 2006, pp. 673–680.
[19] S. Johnson and P. Woodland, "A method for direct audio search with application to indexing and retrieval," in Proc. Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, 2000, pp. 1427–1430.
[20] J. Glass, "Finding acoustic regularities in speech: Application to phonetic recognition," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, 1988.
[21] M. Bacchiani and M. Ostendorf, "Joint lexicon, acoustic unit inventory and model design," Speech Commun., vol. 29, no. 2–4, pp. 99–114, Nov. 1999.
[22] S. Deligne and F. Bimbot, "Inference of variable length acoustic units
[28] A. Park, "Unsupervised pattern discovery in speech: Applications to word acquisition and speaker segmentation," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 2006.
[29] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000 [Online]. Available: citeseer.ist.psu.edu/article/shi97normalized.html
[30] M. Meila and J. Shi, "Learning segmentation by random walks," in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. Cambridge, MA: MIT Press, 2001, vol. 13, pp. 873–879.
[31] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 849–856.
[32] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," in SIAM Int. Conf. Data Mining, Newport Beach, CA, 2005, pp. 274–285.
[33] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Phys. Rev. E, vol. 69, 2004, 026113.
[34] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, 2004, 066133.
[35] J. Glass, T. J. Hazen, L. Hetherington, and C. Wang, "Analysis and processing of lecture audio data: Preliminary investigations," in Proc. HLT-NAACL 2004 Workshop Interdisciplinary Approaches to Speech Indexing and Retrieval, Boston, MA, May 2004, pp. 9–12.
[36] MIT, "MIT World" [Online]. Available: http://mitworld.mit.edu
[37] MIT, "MIT Open Courseware" [Online]. Available: http://ocw.mit.edu
[38] J. Glass, "A probabilistic framework for segment-based speech recognition," Comput. Speech Lang., vol. 17, no. 2–3, pp. 137–152, 2003.
[39] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Cornell Univ., Ithaca, NY, Tech. Rep. TR87-881, 1987.
[40] W. N. Francis and H. Kucera, Frequency Analysis of English Usage: Lexicon and Grammar. Boston, MA: Houghton-Mifflin, 1982.

Alex S. Park (M'06) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 2001, 2002, and 2006, respectively.
While at MIT, he performed his doctoral research as a member of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory. His research interests include unsupervised learning in speech, auditory signal processing, speaker recognition, and noise robust speech recognition. While a student, he took part in research internships with Nuance Communications and ATR Laboratories. He is currently with Tower Research Capital in New York.

James R. Glass (SM'06) received the S.M. and Ph.D.
for continuous speech recognition,” in Proc. Int. Conf. Acoust., Speech, degrees in electrical engineering and computer sci-
Signal Process., Munich, Germany, 1997, vol. 3, pp. 1731–1734. ence from the Massachusetts Institute of Technology
[23] A. Park and J. Glass, “Towards unsupervised pattern discovery in
(MIT), Cambridge, in 1985, and 1988, respectively.
speech,” in Proc. IEEE Workshop Autom. Speech Recognition Under-
After starting in the Speech Communication
standing, San Juan, Puerto Rico, 2005, pp. 53–58.
[24] A. Park and J. R. Glass, “Unsupervised word acquisition from speech Group at the MIT Research Laboratory of Elec-
using pattern discovery,” in Proc. Int. Conf. Acoust., Speech, Signal tronics, he has worked at the Laboratory for
Process., Toulouse, France, Apr. 2006, pp. I-409–I-412. Computer Science, now the Computer Science and
[25] C. S. Myers and L. R. Rabiner, “A level building dynamic time Artificial Intelligence Laboratory (CSAIL), since
warping algorithm for connected word recognition,” IEEE Trans. 1989. Currently, he is a Principal Research Scientist
Acoust, Speech, Signal Process., vol. ASSP-29, no. 2, pp. 284–297, at CSAIL, where he heads the Spoken Language
Apr. 1981. Systems Group. He is also a Lecturer in the Harvard-MIT Division of Health
[26] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimiza- Sciences and Technology. His primary research interests are in the area of
tion for spoken word recognition,” IEEE Trans. Acoust., Speech, Signal speech communication and human–computer interaction, centered on auto-
Process., vol. ASSP-26, no. 1, pp. 43–49, Feb. 1978. matic speech recognition and spoken language understanding. He has lectured,
[27] Y.-L. Lin, T. Jiang, and K.-M. Chao, “Efficient algorithms for lo- taught courses, supervised students, and published extensively in these areas.
cating the length-constrained heaviest segments with applications to Dr. Glass has been a member of the IEEE Signal Processing Society Speech
biomolecular sequence analysis,” J. Comput. Syst. Sci., vol. 65, no. 3, Technical Committee, and an Associate Editor for the IEEE TRANSACTIONS ON
pp. 570–586, Jan. 2002. AUDIO, SPEECH, AND LANGUAGE PROCESSING.