Unsupervised Pattern Discovery in Speech
A. S. Park and J. R. Glass, Massachusetts Institute of Technology
IEEE Transactions on Audio, Speech, and Language Processing, February 2008
DOI: 10.1109/TASL.2007.909282
Abstract—We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

Index Terms—Speech processing, unsupervised pattern discovery, word acquisition.

Manuscript received January 10, 2007; revised July 26, 2007. This work was supported by the National Science Foundation under Grant #IIS-0415865. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Helen Meng. A. S. Park was with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA. He is now with Tower Research Capital, New York, NY 10013 USA (e-mail: [email protected]). J. R. Glass is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2007.909282

I. INTRODUCTION

Over the last several decades, significant progress has been made in developing automatic speech recognition (ASR) systems which are now capable of performing large-vocabulary continuous speech recognition [1], [2]. In spite of this progress, the underlying paradigm of most approaches to speech recognition has remained the same. The problem is cast as one of classification, where input data (speech) is segmented and classified into a preexisting set of known categories (words). Discovering where these word entities come from is typically not addressed. This problem is of interest to us because it represents a key difference in the language processing strategies employed by humans and machines. Equally important, it raises the question of how much can be learned from speech data alone, in the absence of supervised input.

In this paper, we propose a computational technique for extracting words and linguistic entities from speech without supervision. The inspiration for our unsupervised approach to speech processing comes from two sources. The first source comes from a set of experiments conducted by developmental psychologists studying infant language learning. Saffran et al. found that 8-month-old infants are able to detect the statistical properties of commonly co-occurring syllable patterns, indicating that the identification of recurring patterns may be important in the word acquisition process [3]. Our second source of inspiration is implementational in nature and relates to current research in comparative genomics [4], [5]. In that area of research, pattern discovery algorithms are needed in order to find genes and structurally important sequences from massive amounts of genomic DNA or protein sequence data. Unlike speech, the lexicon of interesting subsequences is not known ahead of time, so these items must be discovered from the data directly. By aligning sequences to each other and identifying patterns that repeat with high recurrence, these biologically important sequences, which are more likely to be preserved, can be readily discovered. Our hope is to find analogous techniques for speech based on the observation that patterns of speech sounds are more likely to be consistent within word or phrase boundaries than across them. By aligning continuous utterances to each other and finding similar sequences, we can potentially discover frequently occurring words with minimal knowledge of the underlying speech signal. The fundamental assumption of this approach is that acoustic speech data displays enough regularity to make finding such matches possible.

This paper primarily focuses on the unsupervised processing of speech data to automatically extract words and linguistic phrases. Our work differs substantially from other approaches to unsupervised word acquisition (see Section II) in that it operates directly on the acoustic signal, using no intermediate recognition stage to transform the audio into a symbolic representation. Although the inspiration for our methods is partially derived from experiments in developmental psychology, we make no claims on the cognitive plausibility of these word acquisition mechanisms in actual human language learning.

The results obtained in this paper are summarized as follows.
1) We demonstrate how to find subsequence alignments between the spectral representations of pairs of continuous utterances. In so doing, we propose a variation of a well-known dynamic programming technique for time series alignment, which we call segmental dynamic time warping (DTW). This task is motivated by the assumption that common words and phrases between utterance pairs are likely to be acoustically similar to each other. This algorithm allows us to find low distortion alignments between different regions of time in a given audio stream, which correspond to similar sounding speech patterns.
2) We show how low distortion alignments generated by the segmental DTW algorithm can be used to find recurring speech patterns in an audio stream. These patterns can be clustered together by representing the audio stream as an abstract adjacency graph.
The speech pattern clusters that are discovered using this methodology are shown to correspond to words and phrases that are relevant to the audio streams from which they are extracted.

The remainder of this paper is organized as follows: We briefly survey related work in the areas of pattern discovery and unsupervised word acquisition in Section II. Section III describes the segmental DTW algorithm, an adaptation of a widely known dynamic programming technique, which is designed to find matching acoustic patterns between spoken utterances. In Section IV, we demonstrate how to induce a graph representation from the audio stream. We also employ clustering techniques to discover patterns that correspond to words and phrases in speech by aggregating the alignment paths that are produced by the segmental DTW algorithm. The experimental background for the experiments conducted in this paper is presented in Section V, including a description of the speech data used and specifics about our choice of signal representation. We give examples of the types of word entities found and analyze the results of our algorithm in Section VI, then conclude and discuss directions for future work in Section VII.

II. RELATED WORK

There have been a variety of research efforts that are related to the work presented in this paper. We can roughly categorize these works into two major groups: applications of pattern discovery principles to domains outside of natural language processing, and unsupervised learning techniques within the field of natural language processing.

A. Pattern Discovery

The works summarized in this section represent a variety of different fields, ranging from computational biology to music analysis to multimedia summarization. There is a common underlying theme in all of this research: the application of pattern discovery principles to sequence data. We briefly describe work in each of these fields below.

In computational biology, research in pattern discovery algorithms is motivated by the problem of finding motifs (biologically significant recurring patterns) in biological sequences. Although the body of proposed approaches is too large to list here, a survey of the more important techniques is given in [6] and [7]. The class of algorithms most relevant to our work is based upon sequence comparison, where multiple sequences are compared to one another to determine which regions of the sequence are recurring. Since biological sequences can be abstractly represented as strings of discrete symbols, many of the comparison techniques have roots in string alignment algorithms. In particular, a popular approach to alignment is the use of dynamic programming to search an edit distance matrix (also known as a distance matrix, position weight matrix, or position-specific scoring matrix) for optimal global alignments [8] or optimal local alignments [9]. The distance matrix is a structure which gives a distance or similarity score for each pair of symbols in the sequences being compared. We make use of distance matrices for alignment in this paper as well, although the sequences we work with are derived from the audio signal, and are therefore composed of real-valued vectors, not discrete symbols.

Distance matrices are also used extensively by researchers in the music analysis community. In this area of research, the music audio is parameterized as a sequence of feature vectors, and the resulting sequence is used to create a self-distance matrix. The structure of the distance matrix can then be processed to induce music structure (i.e., distinguish between chorus and verse), characterize musical themes, summarize music files, and detect duplicate music files [10]–[13]. We carry over the use of distance matrices for pattern discovery in music audio to our own work in speech processing.

B. Unsupervised Language Acquisition

The area of research most closely related to our work concerns the problem of unsupervised knowledge acquisition at the lexical level. Most recently, Roy et al. have proposed a model for lexical acquisition by machine using multimodal inputs, including speech [14]. Roy used a recurrent neural network trained on transcribed speech data to output a stream of phoneme probabilities for phonemically segmented audio. Words were learned by pairing audio and visual events and storing them as lexical items in a long-term memory structure.

In [15], de Marcken demonstrated how to learn words from phonetic transcriptions of continuous speech by using a model-based approach to lexicon induction. The algorithm iteratively updates parameters of the model (lexicon) to minimize the description length of the model given the available evidence (the input corpus).

Brent proposed a model-based dynamic programming approach to word acquisition by considering the problem as one of segmentation (i.e., inferring word boundaries in speech) [16]. In his approach, the input corpus is presented as a single unsegmented stream. The optimal segmentation of the corpus is found through a dynamic programming search, where an explicit probability model is used to evaluate each candidate segmentation. A similar strategy is used by Venkataraman in [17], although the utterance level representation of the corpus is used as a starting point rather than viewing the entire corpus as a single entity. The estimation of probabilities used in the segmentation algorithms of Brent and Venkataraman differs, but the overall strategies of the two techniques are conceptually similar. More recently, Goldwater has improved upon these model-based approaches by allowing for sparse solutions and more thoroughly investigating the role of search in determining the optimal segmentation of the corpus [18].

We note here that each of the above examples used a phonological lexicon as a foundation for the word acquisition process, and none of the techniques described were designed to be applied to the speech signal directly. The algorithms proposed by de Marcken and Roy both depend on a phonetic recognition system to convert the continuous speech signal into a set of discrete units. The systems of Brent and Venkataraman were evaluated using speech data phonemically transcribed by humans in a way that applied a consistent phoneme sequence to a particular word entity, regardless of pronunciation.

Pattern discovery in audio has been previously proposed by several researchers. In [19], Johnson used a specialized distance metric for comparing covariance matrices of audio segments to find non-news events such as commercials and jingles in broadcast news.
Typically, the repeated events were identical to one another and were on the order of several seconds long. Unsupervised processing of speech has also been considered as a first step in acoustic model development [20]. Bacchiani proposed a method for breaking words into smaller acoustic segments and clustering those segments to jointly determine acoustic subword units and word pronunciations [21]. Similarly, Deligne demonstrated how to automatically derive an inventory of variable-length acoustic units directly from speech by quantizing the spectral observation vectors, counting symbol sequences that occur more than a specified number of times, and then iteratively refining the models that define each of these symbol sequences [22].

III. SEGMENTAL DTW

This section motivates and describes a dynamic programming algorithm which we call segmental DTW [23], [24]. Segmental DTW takes as input two continuous speech utterances and finds matching pairs of subsequences. This algorithm serves as the foundation for the pattern discovery methodology described in this paper.

Dynamic time warping was originally proposed as a way of comparing two whole word exemplars to each other by way of some optimal alignment. Given two utterances, X and Y, we can represent each as a time series of spectral vectors, X = (x_1, ..., x_m) and Y = (y_1, ..., y_n), respectively. The optimal alignment path between X and Y, φ*, can be computed, and the accumulated distortion between the two utterances along that path, D_φ*(X, Y), can be used as a basis for comparison. Formally, we define a warping relation, or warp path, φ, to be an alignment which maps X to Y while obeying several constraints. The warping relation can be written as a sequence of ordered pairs

    φ = (φ_1, ..., φ_T),  φ_k = (i_k, j_k)    (1)

… speakers or when comparing speech from different environmental conditions. Finding robust feature representations is a difficult problem in its own right, and we defer treatment of this issue to more extensive research done in the area.

When the utterances that we are trying to compare happen to be isolated words, the globally optimal alignment is a suitable way to directly measure the similarity of two utterances at the acoustic level. However, if the utterances consist of multiple word sequences, the distances and paths produced by optimal global alignment may not be meaningful. Although DTW was applied to the problem of connected word recognition via a framework called level building, this technique still required the existence of a set of isolated word reference templates [25]. In that respect, the problem has significant differences to the one in which we are interested. Consider the pair of utterances:
1) "He too was diagnosed with paranoid schizophrenia";
2) "… were willing to put Nash's schizophrenia on record."
Even in an optimal scenario, a global alignment between these two utterances would be forced to map speech frames from dissimilar words to one another, making the overall distortion difficult to interpret. This difficulty arises primarily because each utterance is composed of a different sequence of words, meaning that the utterances cannot be considered from a global perspective. However, utterances 1) and 2) do share similarities at the local level. Namely, both utterances contain the word "schizophrenia." Identifying and aligning such similar local segments is the problem we seek to address in this section. Our proposed solution is a segmental variant of DTW that attempts to find subsequences of two utterances that align well to each other. Segmental DTW is comprised of two main components: a local alignment procedure which produces multiple warp paths that have limited temporal variation, and a path trimming procedure which retains only the lower distortion regions of an alignment path.
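As a point of reference for the segmental variant that follows, the whole-utterance alignment described above can be sketched in a few lines. This is a minimal, unoptimized DTW over lists of feature vectors; the function name, the Euclidean frame distance, and the three-way step pattern are illustrative choices on our part, not code from the paper.

```python
import math

def dtw(x, y):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns the optimal warp path (a list of (i, j) frame-index pairs)
    and the average frame-pair distortion accumulated along that path.
    """
    m, n = len(x), len(y)
    INF = float("inf")
    # acc[i][j] = minimum accumulated distortion for a path ending at (i, j)
    acc = [[INF] * n for _ in range(m)]
    acc[0][0] = math.dist(x[0], y[0])
    for i in range(m):
        for j in range(n):
            if i == j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else INF,            # advance in x only
                acc[i][j - 1] if j > 0 else INF,            # advance in y only
                acc[i - 1][j - 1] if i and j else INF,      # advance in both
            )
            acc[i][j] = math.dist(x[i], y[j]) + best
    # Trace the optimal path back from (m-1, n-1) to (0, 0).
    path = [(m - 1, n - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        cands = []
        if i > 0:
            cands.append((i - 1, j))
        if j > 0:
            cands.append((i, j - 1))
        if i > 0 and j > 0:
            cands.append((i - 1, j - 1))
        path.append(min(cands, key=lambda c: acc[c[0]][c[1]]))
    path.reverse()
    return path, acc[m - 1][n - 1] / len(path)
```

For multiword utterances, this global alignment is exactly what the discussion above argues against: it is forced to pair frames of dissimilar words. Segmental DTW instead restricts the search to diagonal bands and trims each path to its best-matching fragment.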
… the end of the constrained path may not reach (m, n). An alignment path resulting in unassigned frames in either of the input utterances may be desirable in cases where only part of the utterances match.

In addition to limiting temporal skew, the constraint in (3) also introduces a natural division of the search grid into regions suitable for generating multiple alignment paths with offset start coordinates, as shown in Fig. 2. For utterances of length m and n, with a constraint parameter of R, the start coordinates will be spaced at intervals of 2R + 1 along each axis of the grid:

    (1, (2R+1)k + 1)  for k = 0, 1, ...    and    ((2R+1)k + 1, 1)  for k = 1, 2, ...    (6)

Based on these coordinates, we will have a number of diagonal regions, each defining a range of alignments between the two utterances with different offsets but the same temporal rigidity. Within each region, we can use dynamic time warping to find the optimal local alignment φ_k, where k is the index of the diagonal region.

B. Path Refinement

Associated with each warp path φ is a distortion sequence whose values are real and positive

    d_φ(k) = D(i_k, j_k),  k = 1, ..., T    (7)

where D(i, j) is the distance between frames x_i and y_j. The minimum distortion warp path fragment is the subsequence of φ that satisfies the length-constrained minimum average (LCMA) criterion

    LCMA_L(φ) = argmin_{(s, e) : e − s + 1 ≥ L}  (1 / (e − s + 1)) Σ_{k = s}^{e} d_φ(k)    (8)

The minimum length criterion plays a practical role in computing the minimum average subsequence. Without the length constraint, the minimum average subsequence would typically be just the smallest single element in the original sequence. Likewise, for our application, it has the effect of preventing spurious matches between short segments within each utterance. The length criterion also has important conceptual implications. The value of L serves to control the granularity of repeating patterns that are returned by the segmental DTW procedure. Small values of L will lead to many short, subword patterns being found, while large values of L will return fewer, but more linguistically significant patterns such as words or phrases.
In separate experiments, we found that the reliability of alignment paths found by the algorithm, in terms of matching accuracy, was positively correlated with path length [28]. This result, along with our need to limit the found paths to a manageable number for a given audio stream, led us to select a relatively long minimum length constraint of 500 ms. We discuss some less arbitrary methods for determining an optimal setting for L in Section VII. In the remainder of this section, we show example outputs that are produced when segmental DTW is applied to pairs of utterances.

C. Example Outputs

In this section, we step through the segmental DTW procedure for the pair of utterances presented at the beginning of Section III. The distance matrix for these two utterances is displayed in Fig. 3. In this distance matrix, each cell corresponds to the Euclidean distance between frames from each of the utterances being compared. The cell at row i, column j, corresponds to the distance between frame i of the first utterance and frame j of the second utterance. The local similarity between the utterance portions containing the word "schizophrenia" is evident in the diagonal band of low distortion cells stretching from the time coordinates (1.6, 0.9) to (2.1, 1.4). From the distance matrix, a family of constrained warp paths is found using dynamic time warping, as shown in Fig. 3. The width parameter which constrains the extent of time warping is set to R = 10 frames, at a 5-ms analysis rate, which corresponds to a total allowable offset of 105 ms. The warp paths are overlaid with their associated length-constrained minimum average path fragments. The length parameter used in this example is L = 100, which corresponds to approximately 500 ms. The coloring of the warp path fragments corresponds to the average distortion of the path fragment, with bright red fragments indicating low distortion paths and darker shades indicating high distortion paths. Typically, there is a wide range of distortion values for the path fragments found, but only the lowest distortion fragments are of interest to us, as they indicate potential local matches between utterances.

Fig. 3. Family of constrained warp paths with R = 10 for the pair of utterances in our example. The frame rate for this distance matrix is 200 frames per second. The associated LCMA path fragments, with L = 100, are shown in bold as part of each warp path. Each path fragment is associated with an average distortion that indicates how well the aligned segments match one another.

An alternate view of the distortion path, including a frame-level view of the individual utterances, is shown in Fig. 4. This view of the distortion path highlights the need for extending the path fragments discovered using the LCMA algorithm. Although the distortion remains low from the onset of the word "schizophrenia" in each utterance, the LCMA path fragment (shown in red) starts almost 500 ms after this initial drop in distortion. In order to compensate for this phenomenon, we allow for path extension using a distortion threshold based on the values in the path fragment, for example within 10% of the distortion of the original fragment. The extension of the fragment is shown in Fig. 4 as a white line.

Fig. 4. Utterance level view of a warp path from Fig. 3. The segment bounded by the vertical black lines corresponds to the LCMA fragment for this particular warp path, while the remainder of the white line corresponds to the fragment resulting from extending the LCMA fragment to neighboring regions with low distortion.

Although the endpoints of the extended path fragment in Fig. 4 happen to coincide with the common word boundaries for that particular example, in many cases, the segmental DTW algorithm will align subword sequences or even multiword sequences. This is because, aside from fragment length, the segmental DTW algorithm makes no use of lexical identity when searching for an alignment path.

IV. FROM PATHS TO CLUSTERS

In order to apply the segmental DTW algorithm to an audio stream longer than a short sentence, we first perform silence detection on the audio stream to break it into shorter utterances. This segmentation step is described in more detail in Section V-B. Segmental DTW is then performed on each pair of utterances. With the appropriate choice of length constraint, this procedure generates a large number of alignment path fragments that are distributed throughout the original audio stream. Each alignment path consists of two intervals (the regions in time purported to be matching), and the associated distortion along that interval. Fig. 5 illustrates the distribution of path fragments throughout the audio stream. This visualization demonstrates how some time intervals in the audio match well to many other intervals, with up to 17 associated path fragments, while some time intervals have few, if any, matches. Since these fragments serve to link regions in time that are acoustically similar to one another, a natural question to ask is whether they can be used to build clusters of similar sounding speech segments with a common underlying lexical identity.
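The pairwise pooling step just described might be organized as follows. Here `segmental_dtw` is a stand-in for the full procedure of Section III, and its output format (two intervals plus an average distortion per fragment) is an assumption of this sketch, not a format prescribed by the paper.

```python
import itertools

def discover_fragments(utterances, segmental_dtw):
    """Pool alignment path fragments from every pair of utterances.

    `utterances` is a list of feature-vector sequences produced by
    silence detection. `segmental_dtw` is assumed to return a list of
    (interval_a, interval_b, avg_distortion) tuples for one utterance
    pair. Each pooled fragment records which two regions of the audio
    stream it links, so later stages can count how many fragments cover
    any given time interval.
    """
    fragments = []
    for (a, ua), (b, ub) in itertools.combinations(enumerate(utterances), 2):
        for span_a, span_b, distortion in segmental_dtw(ua, ub):
            fragments.append((a, span_a, b, span_b, distortion))
    return fragments
```

Note that the number of pairs grows quadratically with the number of utterances, which is one reason the minimum length constraint is also used to keep the fragment count manageable.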
Fig. 5. Histogram indicating the number of path fragments present for each instant of time for the Friedman lecture. The distribution of path fragments is irregular, indicating that certain time intervals have more acoustic matches than others.

Our approach to this problem is cast in a graph theoretical framework, which represents the audio stream as an abstract adjacency graph consisting of a set of nodes N and a set of edges E. In this graph, the nodes correspond to locations in time, and the edges correspond to measures of similarity between those time indices. Given an appropriate choice of nodes and edges, graph clustering techniques can be applied to this abstract representation to group together the nodes in the graph that are closest to one another. Since graph clustering and partitioning algorithms are an active area of research [29]–[31], a wide range of techniques can be applied to this stage of the problem.

An overview of the graph conversion process is shown in Fig. 6. The time indices indicated in the audio stream are realized as nodes in the adjacency graph, while the alignment paths overlapping the time indices are realized as edges between the nodes. We use these alignment paths to derive edge weights by applying a simple linear transformation of the average path distortions, with the weight between two nodes being given by the following similarity score

    w_ij = (θ − D̄_φ) / θ    (9)

In this equation, w_ij is the weight on the edge between nodes i and j, φ is the alignment path common to both nodes, D̄_φ is the average distortion for that path, and θ is a threshold used to normalize the path distortions. The average distortion is used as opposed to the total distortion in order to normalize for path length when comparing paths with different durations. Paths with average distortion greater than θ are not included in the similarity computation. The distortion threshold chosen for all experiments in this paper was θ = 2.5, which resulted in approximately 10% of the generated alignment paths being retained. The resulting edge weights are closer to 1 between nodes with high similarity, and closer to zero (or nonexistent) for nodes with low similarity.

Fig. 6. Production of an adjacency graph from alignment paths and extracted nodes. The audio stream is shown as a timeline, while the alignment paths are shown as pairs of colored lines at the same height above the timeline. Node relations are captured by the graph on the right, with edge weights given by the path similarities.

A. Node Extraction

While it is relatively straightforward to see how alignment path fragments can be converted into graph edges given a set of time index nodes in the audio stream, it is less clear how these nodes can be extracted in the first place. In this section, we describe the node extraction procedure.

Recall that the input to the segmental DTW algorithm is not a single contiguous audio stream, but rather a set of utterances produced by segmenting the audio using silence detection. Our goal in node extraction is to determine a set of discrete time indices within these utterances that are representative of their surrounding time interval. This is accomplished by using information about the alignment paths that populate a particular utterance.

Fig. 7. Top—a partial utterance with the time regions from its associated path fragments shown in white. Paths are ordered from bottom to top in increasing order of distortion. Bottom—the corresponding similarity profile for the same time interval is shown as a solid line, with the smoothed version shown as a dashed line (raised for clarity). The extracted time indices are denoted as dots at the profile peaks.

Consider the example shown in Fig. 7. In this example, there are a number of alignment paths distributed throughout the utterance with different average path distortions. The distribution of alignment paths is such that some time indices are covered by many more paths than others, and are therefore similar to more time indices in other utterances. These heavily covered time indices are typically located within the words and phrases that are matched via multiple alignment paths.

We can use the alignment paths to form a similarity profile by summing the similarity scores of (9) over time. That is, the similarity score at time t, s(t), is given by

    s(t) = Σ_{φ ∈ P_t} w_φ    (10)

In this equation, P_t are the paths that overlap time t, and w_φ is the similarity value for φ given by (9).

After smoothing the similarity profile with a 0.5-s triangular averaging window, we take the peaks from the resulting smoothed profile and use those time indices as the nodes in our adjacency graph. Because our extraction procedure finds locations with locally maximized similarity within the utterance, the resulting time indices demarcate locations that are more likely to bear resemblance to other locations in the audio stream.
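The edge weighting of (9) and the profile-and-peak-picking procedure just described can be sketched as follows. The frame-level interval representation, the half-window size, and all names are illustrative choices; only the linear weight transformation, the thresholding at θ, the triangular smoothing, and the peak picking come from the text above.

```python
def edge_weight(avg_distortion, theta=2.5):
    """Similarity score of (9): 1 at zero distortion, 0 at the threshold
    theta, and excluded entirely (None) above it."""
    if avg_distortion > theta:
        return None
    return (theta - avg_distortion) / theta

def similarity_profile(paths, num_frames, theta=2.5):
    """Similarity profile of (10): sum the weights of all retained paths
    overlapping each frame of one utterance. `paths` is a list of
    (start_frame, end_frame, avg_distortion) intervals."""
    profile = [0.0] * num_frames
    for start, end, distortion in paths:
        w = edge_weight(distortion, theta)
        if w is None:
            continue  # path rejected by the distortion threshold
        for t in range(start, end):
            profile[t] += w
    return profile

def extract_nodes(profile, half_window):
    """Smooth the profile with a triangular averaging window, then return
    the indices of local maxima; these become the graph nodes."""
    n = len(profile)
    smoothed = []
    for t in range(n):
        num = den = 0.0
        for k in range(-half_window, half_window + 1):
            if 0 <= t + k < n:
                w = half_window + 1 - abs(k)  # triangular weights
                num += w * profile[t + k]
                den += w
        smoothed.append(num / den)
    return [t for t in range(1, n - 1)
            if smoothed[t - 1] < smoothed[t] >= smoothed[t + 1]]
```

At a 5-ms frame rate, the 0.5-s window of the text corresponds to `half_window = 50`; in this sketch the window size is left as a parameter.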
TABLE I. Segment of speech taken from a lecture, "The World Is Flat," delivered by Thomas Friedman.

TABLE II. Cluster statistics for all lectures processed in this paper. Only clusters with at least three members are included in this table. The last two columns indicate how many of the generated clusters are associated with a single word identity or a multiword phrase.

TABLE III. Information for the 63 clusters with at least three members generated for the Friedman lecture. Clusters are ordered first by size, then in decreasing order of density.

TABLE IV. Twenty most relevant words for each lecture, listed in decreasing order of TFIDF score. Words occurring as part of a cluster for that lecture are shown in bold.
single words and multiword phrases that are frequently spoken Since there is no easy way of measuring word relevancy
as a single entity, with more than half of the clusters (31 of 56) directly, for the purposes of our work, we use each word’s
mapping to multiword phrases. term-frequency, inverse document-frequency (TFIDF) score as
Overall cluster purity statistics for the five other academic a proxy for its degree of relevance [39]. The TFIDF score is the
lecture processed in this paper are shown in Table II. We found frequency of the word within a document normalized by the
that across all six lectures, approximately 83% of the generated frequency of the same word across multiple documents. Our
clusters had density greater than 0.05, and among these higher rationale for using this score is that words with high frequency
density clusters, the average purity was 92.2%. In contrast, the within the lecture, but low frequency in general usage are more
average purity across all of the lower density clusters was only likely to be specific to the subject content for that lecture. The
72.6%. These statistics indicate that the observations noted in word lists in Table IV are the 20 most relevant words for each
the previous paragraph appear to transfer to the other lectures. lecture ranked in decreasing order of their TFIDF score. Each
Some notable differences between the Friedman lecture and the list was generated as follows.
academic lectures are the larger average cluster size, and higher 1) First, words in the reference transcription were stemmed to
overall purity across the clusters in general. The larger size of merge pluralized nouns with their root nouns, and various
some clusters can be attributed to the more focused nature of the verb tenses with their associated root verbs.
academic lecture vocabulary, while the higher purity may be a 2) Partial words, filled pauses, single letters and numbers, and
result of differences in speaking style. contractions such as “you’ve” or “i’m” were removed from
A cursory view of the cluster identities for each lecture in- the reference transcription.
dicates that many clusters correspond to words or phrases that 3) Finally, the remaining words in the lecture were ranked
are highly specific to the subject material of that particular lec- by TFIDF, where the document frequency was deter-
ture. For example, in the physics lecture, the words “charge,” mined using the 2K most common words in the Brown
“electric,” “surface,” and “epsilon,” all correspond to some of corpus [40].
the larger clusters for the lecture. This phenomenon is expected, When considered in the context of each lecture’s title, the lists
since relevant content words are likely to recur more often, and of words generated in Table IV appear to be very relevant to
function words such as “the,” “is,” and “of,” are of short du- the subject matter of each lecture, which qualitatively validates
ration and typically exhibit significant pronunciation variation our use of the TFIDF measure. The words for each lecture in
as a result of coarticulation with adjacent words. One way of Table IV are highlighted according to their cluster coverage,
evaluating how well the clusters capture the subject content of with words represented by a cluster shown in bold. On average,
a lecture is to consider the coverage of relevant words by the 14.8 of the top 20 most relevant words are covered by a cluster
generated clusters. generated by our procedure. This statistic offers encouraging
evidence that the recurring acoustic patterns discovered by our approach are not only similar to each other (as shown by the high average purity), but also informative about the lexical content of the audio stream.

VII. DISCUSSION AND FUTURE WORK

This paper has focused on the unsupervised acquisition of lexical entities from the information produced by the segmental DTW algorithm. We demonstrated how to use alignment paths, which indicate pairwise similarity, to transform the audio stream into an abstract adjacency graph, which can then be clustered using standard graph clustering techniques. As part of our evaluation, we showed that the clusters generated by our proposed procedure have both high purity and good coverage of terms that are relevant to the subject of the underlying lecture.
As we noted in Section V-A, there are several reasons why the lecture audio data was particularly well suited for pattern discovery using segmental DTW. First, the material was single-speaker data recorded in a consistent environment, which allowed us to ignore issues of robustness with our feature representation. Second, the amount of topic-specific data ensured that there were enough instances of repeated words for the algorithm to find. For both of these reasons, our algorithm would likely not perform as well if applied directly to other domains, such as Switchboard or Broadcast News. In particular, we would not expect to find clusters of the same size or density without reducing the length parameter and/or including more edges in the adjacency graph. This is mainly due to speaker changes and the paucity of repeated content words. Speaking style is not as significant an issue, as the lecture data exhibits speech that is much more conversational than read speech or broadcast news data.
The work presented in this paper represents only an initial investigation into the more general problem of knowledge acquisition from speech. Many directions for future work remain, and we expand upon some of them here.
In our experiments, we chose a large value for the length parameter to limit the over-generation of alignment path fragments corresponding to short, possibly spurious, acoustic matches. Typically, low-distortion path fragments corresponding to words or phrases are recoverable from shorter path fragments during the extension step of path refinement. Discovery of longer fragments is therefore not particularly sensitive to our choice of this parameter. Larger values primarily serve to prevent short path fragments (usually corresponding to subword matches) from being passed on to the node generation and clustering stage. Within the context of word acquisition, these shorter path fragments are problematic because they cause dissimilar words to cluster with one another via common subword units. Possibilities for future work include using smaller values of the length parameter for discovery of subword units, or determining its appropriate setting in a more principled manner. For example, the optimal setting could be determined by performing pattern discovery over the audio stream using multiple values and choosing the best one according to some selection criterion.
An incremental strategy for improving cluster purity and finding more precise word boundaries may be to adopt an iterative approach to cluster formation. After clusters have been formed and the time intervals for each node have been estimated, edge weights between cluster nodes can be recomputed using the start and end times of the node intervals as constraints. Based on these new edge weights, nodes can be rejected from the cluster and the time intervals can be reestimated, with the process continuing until convergence to a final set of nodes. The idea behind this approach is to eliminate chaining and partial match errors by forcing clusters to be generated based on distortions that are computed over a consistent set of speech intervals.
Similarly, one could imagine using an interval-based clustering strategy to help avoid accidental merging of lexically different clusters, which can occur as a result of "chained" multiword phrases or matched subword units such as "tion." Interval-based clustering would resolve this problem by using whole time intervals as nodes, rather than time indices. This approach would allow a hierarchical representation of a particular speech segment and distinguish between overlapping intervals of different lengths.
At a more abstract level, we believe that an interesting direction for future work would be to incorporate some way to build and update a model of the clustered intervals using some type of hidden Markov model or generalized word template. This would introduce significant computational savings by reducing the number of required comparisons.
Another area for future exploration is the automatic identification and transcription of cluster identities. We have previously proposed algorithms for doing so using isolated word recognition and phonetic recognition combined with a large baseform dictionary [24]. This task illustrates how unsupervised pattern discovery can provide complementary information to more traditional automatic speech recognition systems. Since most speech recognizers process each utterance independently of one another, they typically do not take advantage of the consistency with which the same word is uttered when repeated in the test data. Alignment paths generated by segmental DTW can find locations where an automatic transcription is not consistent by indicating where acoustically similar segments produced different transcriptions.
This paper documents our initial research on unsupervised strategies for speech processing. While conventional large-vocabulary speech recognition would likely perform well in matched training and testing scenarios, there are many real-world scenarios where a paucity of content information can expose the brittleness of purely supervised approaches. We believe that techniques such as the one in this paper, which rely less on training data, can be combined with conventional speech recognizers to create more flexible, hybrid systems that can learn from and adapt to unexpected input. Examples of such unexpected input include accented speech, out-of-vocabulary words, new languages, and novel word usage patterns. In each of these scenarios, exploiting the consistency of repeated patterns in the test data has not been fully explored, and we believe it is a promising direction for future research.

REFERENCES

[1] J. L. Gauvain, L. Lamel, and G. Adda, "The LIMSI broadcast news transcription system," Speech Commun., vol. 37, no. 1–2, pp. 89–108, May 2002.
[2] A. Ljolje, D. M. Hindle, M. D. Riley, and R. W. Sproat, "The AT&T LVCSR-2000 system," in Proc. DARPA Speech Transcription Workshop, College Park, MD, May 2000 [Online]. Available: http://www.nist.gov/speech/publications/tw00/pdf/cts30.pdf
[3] J. R. Saffran, R. N. Aslin, and E. L. Newport, "Statistical learning by 8-month old infants," Science, vol. 274, pp. 1926–1928, Dec. 1996.
[4] I. Rigoutsos and A. Floratos, "Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm," Bioinformatics, vol. 14, no. 1, pp. 55–67, Feb. 1998.
[5] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, "Approaches to the automatic discovery of patterns in biosequences," J. Comput. Biol., vol. 5, no. 2, pp. 279–305, 1998.
[6] G. K. Sandve and F. Drabløs, "A survey of motif discovery methods in an integrated framework," Biol. Direct, vol. 1, pp. 1–11, Apr. 2006.
[7] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[8] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol., vol. 48, pp. 443–453, 1970.
[9] M. S. Waterman and M. Eggert, "A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons," J. Mol. Biol., vol. 197, pp. 723–725, 1987.
[10] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, Jun. 2000, pp. 749–752.
[11] C. J. Burges, D. Plastina, J. C. Platt, E. Renshaw, and H. Malvar, "Using audio fingerprinting for duplicate detection and thumbnail generation," in Proc. Int. Conf. Acoust., Speech, Signal Process., Philadelphia, PA, Mar. 2005, vol. 3, pp. 9–12.
[12] R. Dannenberg and N. Hu, "Pattern discovery techniques for music audio," in Proc. Int. Conf. Music Inf. Retrieval, Paris, France, Oct. 2002, pp. 63–70.
[13] M. Goto, "A chorus-section detecting method for musical audio signals," in Proc. Int. Conf. Acoust., Speech, Signal Process., Apr. 2003, vol. 5, pp. 437–440.
[14] D. Roy and A. Pentland, "Learning words from sights and sounds: A computational model," Cognitive Sci., vol. 26, no. 1, pp. 113–146, Jan. 2002.
[15] C. G. de Marcken, "Unsupervised language acquisition," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 1996.
[16] M. R. Brent, "An efficient probabilistically sound algorithm for segmentation and word discovery," Mach. Learn., vol. 34, no. 1–3, pp. 71–105, Feb. 1999.
[17] A. Venkataraman, "A statistical model for word discovery in transcribed speech," Comput. Ling., vol. 27, no. 3, pp. 352–372, Sep. 2001.
[18] S. Goldwater, T. Griffiths, and M. Johnson, "Contextual dependencies in unsupervised word segmentation," in Proc. Coling/ACL, Sydney, Australia, 2006, pp. 673–680.
[19] S. Johnson and P. Woodland, "A method for direct audio search with application to indexing and retrieval," in Proc. Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, 2000, pp. 1427–1430.
[20] J. Glass, "Finding acoustic regularities in speech: Application to phonetic recognition," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, 1988.
[21] M. Bacchiani and M. Ostendorf, "Joint lexicon, acoustic unit inventory and model design," Speech Commun., vol. 29, no. 2–4, pp. 99–114, Nov. 1999.
[22] S. Deligne and F. Bimbot, "Inference of variable length acoustic units for continuous speech recognition," in Proc. Int. Conf. Acoust., Speech, Signal Process., Munich, Germany, 1997, vol. 3, pp. 1731–1734.
[23] A. Park and J. Glass, "Towards unsupervised pattern discovery in speech," in Proc. IEEE Workshop Autom. Speech Recognition Understanding, San Juan, Puerto Rico, 2005, pp. 53–58.
[24] A. Park and J. R. Glass, "Unsupervised word acquisition from speech using pattern discovery," in Proc. Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, Apr. 2006, pp. I-409–I-412.
[25] C. S. Myers and L. R. Rabiner, "A level building dynamic time warping algorithm for connected word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 2, pp. 284–297, Apr. 1981.
[26] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-26, no. 1, pp. 43–49, Feb. 1978.
[27] Y.-L. Lin, T. Jiang, and K.-M. Chao, "Efficient algorithms for locating the length-constrained heaviest segments with applications to biomolecular sequence analysis," J. Comput. Syst. Sci., vol. 65, no. 3, pp. 570–586, Jan. 2002.
[28] A. Park, "Unsupervised pattern discovery in speech: Applications to word acquisition and speaker segmentation," Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 2006.
[29] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000 [Online]. Available: citeseer.ist.psu.edu/article/shi97normalized.html
[30] M. Meila and J. Shi, "Learning segmentation by random walks," in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. Cambridge, MA: MIT Press, 2001, pp. 873–879.
[31] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 849–856.
[32] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," in Proc. SIAM Int. Conf. Data Mining, Newport Beach, CA, 2005, pp. 274–285.
[33] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Phys. Rev. E, vol. 69, 026113, 2004.
[34] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, 066133, 2004.
[35] J. Glass, T. J. Hazen, L. Hetherington, and C. Wang, "Analysis and processing of lecture audio data: Preliminary investigations," in Proc. HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, Boston, MA, May 2004, pp. 9–12.
[36] MIT, "MIT World" [Online]. Available: http://mitworld.mit.edu
[37] MIT, "MIT OpenCourseWare" [Online]. Available: http://ocw.mit.edu
[38] J. Glass, "A probabilistic framework for segment-based speech recognition," Comput. Speech Lang., vol. 17, no. 2–3, pp. 137–152, 2003.
[39] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Cornell Univ., Ithaca, NY, Tech. Rep. TR87-881, 1987.
[40] W. N. Francis and H. Kucera, Frequency Analysis of English Usage: Lexicon and Grammar. Boston, MA: Houghton-Mifflin, 1982.

Alex S. Park (M'06) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 2001, 2002, and 2006, respectively.
While at MIT, he performed his doctoral research as a member of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory. His research interests include unsupervised learning in speech, auditory signal processing, speaker recognition, and noise-robust speech recognition. While a student, he took part in research internships with Nuance Communications and ATR Laboratories. He is currently with Tower Research Capital in New York.

James R. Glass (SM'06) received the S.M. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 1985 and 1988, respectively.
After starting in the Speech Communication Group at the MIT Research Laboratory of Electronics, he has worked at the Laboratory for Computer Science, now the Computer Science and Artificial Intelligence Laboratory (CSAIL), since 1989. Currently, he is a Principal Research Scientist at CSAIL, where he heads the Spoken Language Systems Group. He is also a Lecturer in the Harvard-MIT Division of Health Sciences and Technology. His primary research interests are in the area of speech communication and human–computer interaction, centered on automatic speech recognition and spoken language understanding. He has lectured, taught courses, supervised students, and published extensively in these areas.
Dr. Glass has been a member of the IEEE Signal Processing Society Speech Technical Committee, and an Associate Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING.