
Unsupervised Learning

吉建民

USTC
[email protected]

June 4, 2023

Used Materials

Disclaimer: These slides draw on S. Russell and P. Norvig's Artificial Intelligence – A Modern Approach slides, Prof. 徐林莉 (Xu Linli)'s course slides and other online course materials, as well as open-source code from GitHub and content from several online blogs.
Table of Contents

Unsupervised Learning
Clustering
Principal Component Analysis

Supervised learning has many successes

▶ Document classification
▶ Protein prediction
▶ Face recognition
▶ Speech recognition
▶ Vehicle steering etc.

However...

▶ Labeled data can be rare or expensive in many real applications

- Speech
- Medical data
- Protein
- ···

▶ Unlabeled data is much cheaper and abundant


Question: Can we use unlabeled data to help?

Unsupervised learning

Learning from unlabeled data (without supervision)

▶ What can we predict from unlabeled data?
▶ Groups or clusters in the data
▶ Density estimation
▶ Low-dimensional structure
▶ Principal Component Analysis (PCA) (linear)
▶ Manifold learning (non-linear)
Table of Contents

Unsupervised Learning
Clustering
Principal Component Analysis

Clustering

▶ Are there any “groups” in the data?
▶ What is each group?
▶ How many groups?
▶ How to identify them?
Clustering
▶ Group the data objects into subsets or “clusters”:
▶ High similarity within clusters
▶ Low similarity between clusters

▶ A common and important task with many applications in science, engineering, information science, and elsewhere:
▶ Group genes that perform the same function
▶ Group individuals that have similar political views
▶ Categorize documents with similar topics
▶ Identify similar objects from pictures
Clustering

▶ Input: training set of input points
Dtrain = {x1 , . . . , xn }
▶ Output: assignment of each point to a cluster
( C(1), . . . , C(n) ) where C(i) ∈ { 1, . . . , k }
K-means clustering

Create centers and assign points to centers so as to minimize the sum of squared distances
K-means objective

▶ Each cluster is represented by a centroid µ
▶ Encode each point by its cluster center, pay a cost for deviation
▶ Loss function based on reconstruction:

$$\mathrm{Loss}_{\text{kmeans}} = \sum_{j=1}^{n} \left\| \mu_{C(j)} - x_j \right\|^2$$
K-means algorithm

▶ Goal: $\min_{\mu}\min_{C}\sum_{j=1}^{n}\left\|\mu_{C(j)} - x_j\right\|^2$
▶ Strategy: alternating minimization
▶ Step 1: if we know the cluster centers µ, we can find the best assignment C
▶ Step 2: if we know the cluster assignments C, we can find the best cluster centers
K-means algorithm
Optimize the loss function Loss(µ, C):

$$\min_{\mu}\min_{C}\sum_{j=1}^{n}\left\|\mu_{C(j)} - x_j\right\|^2$$

(1) Fix µ, optimize C:

$$\min_{C(1),C(2),\dots,C(n)}\sum_{j=1}^{n}\left\|\mu_{C(j)} - x_j\right\|^2$$

Assign each point to the nearest cluster center.

(2) Fix C, optimize µ:

$$\min_{\mu_1,\mu_2,\dots,\mu_k}\sum_{j=1}^{n}\left\|\mu_{C(j)} - x_j\right\|^2$$

Solution: the average of the points in each cluster, which is exactly the second step (re-center).
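As an illustration of this alternating scheme, here is a minimal NumPy sketch of the two k-means steps (assignment and re-centering); the function and variable names are my own and not from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: alternate between assignment and re-centering.

    X: (n, d) data matrix; k: number of clusters.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers as k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1 (fix mu, optimize C): assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (fix C, optimize mu): move each center to the mean of its points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: re-centering no longer changes anything
        centers = new_centers
    loss = ((X - centers[labels]) ** 2).sum()  # sum of squared distances
    return centers, labels, loss
```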
K-Means

K-means clustering: Example

K-means clustering: Example

K-means clustering: Example

K-means clustering: Example
Repeat until convergence

Properties of K-means algorithm

▶ Guaranteed to converge in a finite number of iterations
▶ ... but only to a local minimum
▶ The objective is non-convex, so coordinate descent on it is not guaranteed to converge to the global minimum
▶ Running time per iteration: simple and efficient
▶ Assigning data points to the closest cluster center: O(KN)
▶ Updating each cluster center to the average of its assigned points: O(N)
▶ Different initializations lead to different results (see the sketch below)
▶ The k-means problem (finding the global optimum of the objective above) is NP-hard
▶ Not robust to noise and outliers
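To see the sensitivity to initialization in practice, one can run k-means several times with different random starts and compare the final loss values. A small sketch using scikit-learn (assuming it is installed; the dataset here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

# Run k-means with different random initializations; the final loss (inertia)
# can differ, i.e. the algorithm converges to different local minima.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))
```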
K-means convergence

K-means getting stuck

K-means not able to properly cluster

Changing the features (distance function) can help

Table of Contents

Unsupervised Learning
Clustering
Principal Component Analysis

Principal Component Analysis

▶ What is dimensionality reduction?


▶ Why dimensionality reduction?
▶ Principal Component Analysis (PCA)
▶ Nonlinear PCA using Kernels

What is dimensionality reduction?

▶ Dimensionality reduction refers to mapping the original high-dimensional data onto a lower-dimensional space.
- The criterion for dimensionality reduction can differ depending on the problem setting.
▶ Unsupervised setting: minimize the information loss
(nearest reconstruction: sample points should lie close to the projection hyperplane)
▶ Supervised setting: maximize the class discrimination
(maximum separability: the projections of the sample points onto the hyperplane should be as spread out as possible)
▶ After centering the samples, the two criteria are equivalent

▶ Given a set of data points of d-dimensional variables {x1 , x2 , . . . , xn }
▶ Compute the linear transformation (projection)

P ∈ Rd×m : x ∈ Rd → y = P⊤ x ∈ Rm (m ≪ d)
What is dimensionality reduction?

P ∈ Rd×m : x ∈ Rd → y = P⊤ x ∈ Rm

High-dimensional data

Why dimensionality reduction?

▶ Most machine learning and data mining techniques may not be effective for high-dimensional data
▶ Curse of Dimensionality
▶ Query accuracy and efficiency degrade rapidly as the dimension increases.
▶ The intrinsic dimension may be small.
▶ For example, the number of genes responsible for a certain type of disease may be small.
Curse of Dimensionality
▶ When dimensionality increases, data becomes increasingly sparse in the space that it occupies
▶ Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
▶ If N1 = 100 represents a dense sample for a single-input problem, then N10 = 100^10 is the sample size required for the same sampling density in dimension 10
▶ The ratio of the volume of a hypersphere with radius r and dimension d to that of a hypercube with sides of length 2r and dimension d converges to 0 as d goes to infinity: nearly all of the high-dimensional space is “far away” from the center
▶ Experiment (shown in the figure): randomly generate 500 points and compute the difference between the max and min distance between any pair of points
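The experiment described on this slide can be reproduced with a few lines of NumPy; this is a sketch under my own choice of distribution and dimensions, not the exact code behind the original figure.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Randomly generate 500 points and look at the spread of pairwise distances as the
# dimension grows: the relative gap (max - min) / min shrinks, i.e. distances concentrate.
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    dists = pdist(X)                      # all pairwise Euclidean distances
    print(d, (dists.max() - dists.min()) / dists.min())
```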
High dimensional spaces are empty

Lost in space
Let’s consider a hypersphere of radius r inscribed in a hypercube with sides of length 2r, and take the ratio of the volume of the hypersphere to that of the hypercube. We observe the following trends.
▶ in 2 dimensions:

$$\frac{V(S_2(r))}{V(H_2(2r))} = \frac{\pi r^2}{4 r^2} \approx 78.5\%$$

▶ in 3 dimensions:

$$\frac{V(S_3(r))}{V(H_3(2r))} = \frac{\tfrac{4}{3}\pi r^3}{8 r^3} \approx 52.4\%$$

▶ when the dimensionality d increases, asymptotically

$$\lim_{d\to\infty} \frac{V(S_d(r))}{V(H_d(2r))} = \lim_{d\to\infty} \frac{\pi^{d/2}}{2^{d}\,\Gamma\!\left(\tfrac{d}{2}+1\right)} = 0$$
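As a quick numerical check of this limit (my own sketch, using the closed-form ball volume with the gamma function):

```python
import math

# Ratio of the volume of the d-ball of radius r to the cube of side 2r:
#   pi^(d/2) / (2^d * Gamma(d/2 + 1))   -- independent of r
def ball_to_cube_ratio(d: int) -> float:
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 10, 20, 50):
    print(d, ball_to_cube_ratio(d))
# d=2 -> 0.785..., d=3 -> 0.523..., and the ratio collapses toward 0 as d grows.
```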
Why dimensionality reduction?

▶ Visualization: projection of high-dimensional data onto 2D or 3D
▶ Data compression: efficient storage and retrieval
▶ Noise removal: positive effect on query accuracy
Application of feature reduction

▶ Face recognition
▶ Handwritten digit recognition
▶ Text mining
▶ Image retrieval
▶ Microarray data analysis
▶ Protein classification
▶ ···

What is Principal Component Analysis?

▶ Principal component analysis (PCA)
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
- Retains most of the sample’s information
- Useful for the compression and classification of data
▶ By information we mean the variation present in the sample, given by the correlations between the original variables.
▶ The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Principal components (PCs)

Given n points in a d-dimensional space, for large d, how does one project onto a low-dimensional space while preserving broad trends in the data and allowing it to be visualized?
Geometric picture of principal components

▶ Given n points in a d-dimensional space, for large d, how does one project onto a 1-dimensional space?
▶ Choose a line that fits the data so the points are spread out well along the line
Let us see it on a figure

PCA seeks to minimize the information loss after dimensionality reduction, which can be understood as keeping the projected data as spread out as possible. This spread can be measured by the variance (µ is the mean):

$$\mathrm{Var}(a) = \frac{1}{n}\sum_{i=1}^{n} (a_i - \mu)^2$$

After centering the data, i.e., µ = 0:

$$\mathrm{Var}(a) = \frac{1}{n}\sum_{i=1}^{n} a_i^2$$
Geometric picture of principal components
Center the data:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad x'_i = x_i - \bar{x}, \quad 1 \le i \le n.$$

For the centered data, i.e., $\bar{x}' = 0$, the following are equivalent: find a line that
▶ maximizes the variance of the projected data
▶ maximizes the sum of squares of the data samples’ projections onto that line
▶ minimizes the sum of squared distances to the line
Algebraic Interpretation — 1D

▶ Minimizing the sum of squared distances to the line is the same as maximizing the sum of squares of the projections onto that line, thanks to Pythagoras.

The projection length is: $x^\top \dfrac{w}{\|w\|}$
Algebraic Interpretation — 1D

The projection length is: $x^\top u = u^\top x$, subject to $u^\top u = 1$
Geometric picture of principal components

Geometric picture of principal components

▶ the 1st PC u1 is a minimum-distance fit to a line in X space
▶ the 2nd PC u2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Algebraic derivation of PCs

▶ Given a sample of n observations on a vector of d variables
{ x1 , x2 , . . . , xn } ⊂ Rd
▶ First project the data onto a one-dimensional space with a d-dimensional vector u1 (where $u_1^\top u_1 = 1$):
$\{ u_1^\top x_1, u_1^\top x_2, \dots, u_1^\top x_n \}$
▶ Find u1 to maximize the variance of the projected data:

$$\frac{1}{n}\sum_{i=1}^{n}\left(u_1^\top x_i - u_1^\top \bar{x}\right)^2 = u_1^\top S u_1$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$
Algebraic derivation of PCs
▶ To solve $\max_{u_1} u_1^\top S u_1$ subject to $u_1^\top u_1 = 1$
▶ Let λ1 be a Lagrange multiplier:

$$L = u_1^\top S u_1 + \lambda_1\left(1 - u_1^\top u_1\right)$$
$$\frac{\partial L}{\partial u_1} = S u_1 - \lambda_1 u_1 = 0$$
$$S u_1 = \lambda_1 u_1$$

⇒ u1 is an eigenvector of S

$$u_1^\top S u_1 = \lambda_1$$

⇒ u1 is the eigenvector corresponding to the largest eigenvalue λ1
▶ That is, the optimal value of $\max_{u_1} u_1^\top S u_1$ subject to $u_1^\top u_1 = 1$ is the largest eigenvalue of S
▶ The eigenvalues of S can be computed by forming the characteristic polynomial $|S - \lambda I| = 0$ (I is the identity matrix); the eigenvalues are the roots of this equation
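For a concrete 2×2 example (my own numbers), the characteristic-polynomial route and a numerical eigendecomposition give the same answer:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # a small symmetric "covariance" matrix

# Characteristic polynomial |S - lambda I| = (2 - l)^2 - 1 = l^2 - 4l + 3 = 0
lams = np.roots([1.0, -4.0, 3.0])      # roots of l^2 - 4l + 3
print(sorted(lams))                    # [1.0, 3.0]

print(np.linalg.eigvalsh(S))           # [1.0, 3.0] -- same eigenvalues, computed numerically
```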
Algebraic derivation of PCs

▶ To find the second component u2, solve:

$$\max_{u_2} u_2^\top S u_2 \quad \text{subject to} \quad u_2^\top u_2 = 1 \;\text{and}\; u_1^\top u_2 = 0$$

- u2 is the eigenvector with the second largest eigenvalue λ2
- · · ·
Algebraic derivation of PCs
▶ Main steps for computing PCs
▶ Calculate the covariance matrix S:

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$$

or first center the data { x'_1 , x'_2 , . . . , x'_n } so that $\bar{x}' = 0$; let $X = [x'_1, x'_2, \dots, x'_n] \in \mathbb{R}^{d\times n}$; then $S = \frac{1}{n} X X^\top$
▶ Find the first m eigenvectors $\{u_i\}_{i=1}^{m}$
▶ Form the projection matrix

P = [u1 u2 · · · um ] ∈ Rd×m

▶ A new test point can be projected as:

xnew ∈ Rd → P⊤ xnew ∈ Rm
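These steps translate directly into a few lines of NumPy. The following is a minimal sketch (function and variable names are mine); note that here the rows of X are the samples, unlike the column convention on the slide.

```python
import numpy as np

def pca(X, m):
    """Minimal PCA sketch.

    X: (n, d) data matrix with n samples of dimension d.
    Returns the (d, m) projection matrix P, the (n, m) projected data, and the mean.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                         # center the data
    S = Xc.T @ Xc / len(X)                 # covariance matrix S = (1/n) X' X'^T
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
    P = eigvecs[:, order[:m]]              # first m eigenvectors as columns
    Y = Xc @ P                             # projected data: y = P^T x for each sample
    return P, Y, x_bar
```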
Algebraic derivation of PCs

$$y = P^\top x \in \mathbb{R}^m$$

▶ Getting the old data back?
- If P is a square matrix, we can recover x by

$$x = \left(P^\top\right)^{-1} y = P y = P P^\top x$$

Note: $u_i^\top u_i = 1$ and $u_i^\top u_j = 0$ for $i \ne j$, so $P^\top P = I_m$ (when m = d) and $(P^\top)^{-1} = P$
▶ Here P is not full (m ≪ d), but we can still approximately recover x by $\hat{x} = P y = P P^\top x$, losing some information
▶ Objective: lose the least amount of information
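Continuing the sketch above, the reconstruction and its error can be checked in a couple of lines (again with my own illustrative names and toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy data: 200 samples in 5 dimensions

P, Y, x_bar = pca(X, m=2)                  # pca() from the sketch above
X_hat = Y @ P.T + x_bar                    # approximate recovery: x_hat = P P^T (x - x_bar) + x_bar
err = np.sum((X - X_hat) ** 2)             # reconstruction error, minimized by PCA
print(err)
```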
Optimality property of PCA

Optimality property of PCA
Main theoretical result:
The matrix P consisting of the first m eigenvectors of the covariance matrix S solves the following minimization problem, where $X' = P P^\top X$ is the reconstruction:

$$\min_{P:\,P^\top P = I_m}\; \sum_{i=1}^{d}\sum_{j=1}^{n}\left(x_{ij} - x'_{ij}\right)^2$$

Notice that, for an m × n matrix A and an n × m matrix B,

$$\operatorname{trace}(AB) = \operatorname{trace}(BA) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} b_{ji}$$

so $\arg\min_P \sum_{i=1}^{d}\sum_{j=1}^{n}\left(x_{ij} - x'_{ij}\right)^2$ is equivalent to $\arg\max_P \sum_{i=1}^{d}\sum_{j=1}^{n} x_{ij} x'_{ij}$, since $\sum_{i=1}^{d}\sum_{j=1}^{n} x'^{\,2}_{ij} = \operatorname{trace}\!\left((P P^\top X)^\top P P^\top X\right) = \operatorname{trace}\!\left(X^\top P P^\top X\right)$.

PCA projection minimizes the reconstruction error among all linear projections of size m.
PCA for image compression

Nonlinear PCA using Kernels

Rewrite PCA in terms of dot products

▶ Assume the data has been centered, i.e., $\sum_i x_i = 0$
▶ The covariance matrix S can be written as $S = \frac{1}{n}\sum_i x_i x_i^\top$
▶ If u is an eigenvector of S corresponding to a nonzero eigenvalue λ:

$$S u = \frac{1}{n}\sum_i x_i x_i^\top u = \lambda u \;\Rightarrow\; u = \frac{1}{n\lambda}\sum_i \left(x_i^\top u\right) x_i$$

▶ Eigenvectors of S lie in the space spanned by all data points

Kernel methods:
▶ denote the (feature-space) representation of x as φ(x)
▶ define the kernel function k : X × X → R by k(xi , xj ) = φ(xi )⊤ φ(xj )
▶ define the kernel matrix K: Kij = k(xi , xj )
Nonlinear PCA using Kernels

$$S u = \frac{1}{n}\sum_i x_i x_i^\top u = \lambda u \;\Rightarrow\; u = \frac{1}{n\lambda}\sum_i \left(x_i^\top u\right) x_i$$

The covariance matrix can be written in matrix form.
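The resulting procedure, which works entirely with the kernel matrix K, can be sketched as follows. This is a generic kernel-PCA outline in NumPy with an RBF kernel chosen purely for illustration, not the specific derivation on the slide.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, m, gamma=1.0):
    """Minimal kernel PCA sketch: project onto the top m nonlinear components."""
    n = len(X)
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix in feature space
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecompose the centered kernel matrix (symmetric)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:m]
    lam, A = eigvals[order], eigvecs[:, order]
    # Projection of training point j onto component k is sqrt(lambda_k) * A[j, k]
    return A * np.sqrt(np.maximum(lam, 0))
```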
Nonlinear PCA

Comments on PCA

▶ Linear dimensionality reduction method
▶ Can be kernelized
▶ Many nonlinear dimensionality reduction methods (Isomap, graph Laplacian eigenmap, and locally linear embedding/LLE) can be described as kernel PCA with a special kernel
▶ Non-convex optimization problem
▶ But easy to solve…
Want to Learn More?
▶ Machine Learning: a Probabilistic Perspective, K. Murphy
▶ Pattern Classification, R. Duda, P. Hart, and D. Stork.
Standard pattern recognition textbook. Limited to
classification problems. Matlab code.
http://rii.ricoh.com/~stork/DHS.html
▶ Pattern recognition and machine learning. C. Bishop
▶ The Elements of statistical Learning: Data Mining, Inference,
and Prediction. T. Hastie, R. Tibshirani, J. Friedman,
Standard statistics textbook. Includes all the standard
machine learning methods for classification, regression,
clustering. R code. http://www-stat-class.stanford.
edu/~tibs/ElemStatLearn/
▶ Introduction to Data Mining, P.-N. Tan, M. Steinbach, V.
Kumar. AddisonWesley, 2006
▶ Principles of Data Mining, D. Hand, H. Mannila, and P.
Smyth. MIT Press, 2001
▶ 统计学习方法 (Statistical Learning Methods), 李航 (Li Hang)
Machine Learning in AI

Machine Learning History

Summary

▶ Supervised learning
▶ Learning Decision Trees
▶ K Nearest Neighbor Classifier
▶ Linear Predictions
▶ Support Vector Machines
▶ Unsupervised learning
▶ Clustering
▶ Principal Component Analysis
Homework

▶ Does the k-means algorithm always converge? If so, give a proof; if not, explain why.
