Pattern Recognition
Pattern recognition is a data analysis method that uses machine learning algorithms to
automatically recognize patterns and regularities in data. This data can be anything from
text and images to sounds or other definable qualities. Pattern recognition systems can
recognize familiar patterns quickly and accurately.
Feature extraction
Feature extraction is the process of determining which features will be used for learning. The
description and general properties of the patterns may be known, but for the classification task
at hand we still have to decide which measurable features to extract and feed to the classifier.
An example of pattern recognition is classification, which attempts to assign each input value
to one of a given set of classes (for example, determine whether a given email is "spam" or
"non-spam"). However, pattern recognition is a more general problem that encompasses other
types of output as well.
Before talking about discriminant functions for the normal density, we first need to know
what a normal distribution is and how it is represented, both for a single variable and for a vector
variable. Let's begin with the continuous univariate normal, or Gaussian, density:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]

where the expected value of x (its mean) is

\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx
and where the expected squared deviation or variance is
\sigma^{2} = E[(x-\mu)^{2}] = \int_{-\infty}^{\infty} (x-\mu)^{2}\,p(x)\,dx
The univariate normal density is completely specified by two parameters: its mean μ and
variance σ². The function f(x) can be abbreviated as N(μ, σ²), which says that x is distributed
normally with mean μ and variance σ². Samples from a normal distribution tend to cluster about
the mean, with a spread related to the standard deviation σ.
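As a quick illustration, the following minimal NumPy sketch (the mean, standard deviation, and sample size are arbitrary choices) evaluates the density above and checks that samples cluster about the mean with a spread set by σ:

```python
import numpy as np

def univariate_normal_pdf(x, mu, sigma):
    """Evaluate the univariate normal density N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
samples = rng.normal(mu, sigma, size=100_000)

# Samples cluster about the mean with spread set by the standard deviation.
print(samples.mean(), samples.std())         # close to mu and sigma
print(univariate_normal_pdf(mu, mu, sigma))  # peak of the density, 1/(sqrt(2*pi)*sigma)
```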
The general multivariate normal density in d dimensions is written as

f(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}\,(x-\mu)\right]
where x is a d-component column vector, μ is the d-component mean vector, Σ is
the d-by-d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse, respectively.
Also, (x − μ)ᵗ denotes the transpose of (x − μ).
and the covariance matrix is

\Sigma = E[(x-\mu)(x-\mu)^{t}] = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx
where the expected value of a vector or a matrix is found by taking the expected values of its
individual components. That is, if xi is the i-th component of x, μi the i-th component of μ,
and σij the ij-th component of Σ, then

\mu_i = E[x_i]

and

\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]
The covariance matrix Σ is always symmetric and positive semidefinite; we restrict attention
here to the case in which Σ is positive definite, so that its determinant is strictly positive. The
diagonal elements σii are the variances of the respective xi (i.e., σi²), and the off-diagonal
elements σij are the covariances of xi and xj. If xi and xj are statistically independent,
then σij = 0. If all off-diagonal elements are zero, p(x) reduces to the product of the univariate
normal densities for the components of x.
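The density and its parameters can be checked numerically. The sketch below uses a made-up mean vector and covariance matrix, evaluates the formula above directly, and, assuming SciPy is available, compares the result with SciPy's built-in implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, cov):
    """Evaluate the d-dimensional normal density at x using the formula above."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

mu = np.array([0.0, 1.0])
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])   # symmetric, positive definite
x = np.array([0.5, 0.5])

print(mvn_pdf(x, mu, cov))
print(multivariate_normal(mean=mu, cov=cov).pdf(x))  # same value from SciPy
```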
Discriminant Functions
Discriminant functions give a convenient way of making minimum-error-rate classification
decisions. In a problem with feature vector Y and state of nature variable w, we can represent
the discriminant function for class wi as:

g_i(Y) = \ln p(Y \mid w_i) + \ln P(w_i)

where, as defined in previous essays, p(Y|wi) is the class-conditional probability density function
for Y when the state of nature is wi, and P(wi) is the prior probability that nature is in
state wi. If we take each p(Y|wi) to be a multivariate normal density, that is,
p(Y|wi) = N(μi, σi²I), then the discriminant function reduces to:

g_i(Y) = -\frac{\|Y - \mu_i\|^{2}}{2\sigma_i^{2}} + \ln P(w_i)

where ||·|| denotes the Euclidean norm, that is,

\|Y - \mu_i\|^{2} = (Y - \mu_i)^{t}(Y - \mu_i)
Next week, we will look in more depth at discriminant functions for the normal density,
examining the special cases of the covariance matrix.
Continuing from where we left off in Part 1: in a problem with feature vector x and state of
nature variable w, we can represent the discriminant function as:

g_i(x) = -\frac{1}{2}(x - \mu_i)^{t}\Sigma_i^{-1}(x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i)
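For concreteness, the discriminant above translates directly into code. The following sketch uses illustrative two-class parameters (the means, covariances, and priors are made up) and assigns a point to the class with the largest gi(x):

```python
import numpy as np

def gaussian_discriminant(x, mu_i, sigma_i, prior_i):
    """g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - d/2 ln 2pi - 1/2 ln|Sigma_i| + ln P(w_i)."""
    d = mu_i.shape[0]
    diff = x - mu_i
    return (-0.5 * diff @ np.linalg.inv(sigma_i) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))

# Assign x to the class with the largest discriminant value.
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]        # illustrative means
cov = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # illustrative covariances
priors = [0.6, 0.4]

x = np.array([1.0, 2.0])
scores = [gaussian_discriminant(x, m, s, p) for m, s, p in zip(mu, cov, priors)]
print(int(np.argmax(scores)))  # index of the predicted class
```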
We will now look at the special cases of this discriminant function for a multivariate normal distribution.
Case 1: Σi = σ²I
This is the simplest case and it occurs when the features are statistically independent and
each feature has the same variance, σ². Here the covariance matrix is diagonal, since it is simply
σ² times the identity matrix I. Geometrically, the samples fall into equal-sized hyperspherical
clusters centered about their respective mean vectors. The determinant and inverse are easy to
compute: |Σi| = σ²ᵈ and Σi⁻¹ = (1/σ²)I. Because both ½ ln |Σi| and the (d/2) ln 2π term in the
equation above are independent of i, we can ignore them, and we obtain this simplified
discriminant function:
g_i(x) = -\frac{\|x - \mu_i\|^{2}}{2\sigma^{2}} + \ln P(w_i)

where ||·|| denotes the Euclidean norm, that is,

\|x - \mu_i\|^{2} = (x - \mu_i)^{t}(x - \mu_i)
If the prior probabilities are not equal, the discriminant function shows that the squared
distance ||x − μi||² must be normalized by the variance σ² and offset by adding ln P(wi);
therefore, if x is equally near two different mean vectors, the optimal decision favors the
a priori more likely class. Expansion of the quadratic form (x − μi)ᵗ(x − μi) yields:

g_i(x) = -\frac{1}{2\sigma^{2}}\left[x^{t}x - 2\mu_i^{t}x + \mu_i^{t}\mu_i\right] + \ln P(w_i)
which appears to be a quadratic function of x. However, the quadratic term xᵗx is the same for
all i, so it can be dropped as an additive constant, and we obtain the equivalent linear
discriminant function:

g_i(x) = w_i^{t}x + w_{i0}
where
w_i = \frac{1}{\sigma^{2}}\mu_i

and

w_{i0} = -\frac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(w_i)
Here wi0 is called the threshold or bias for the i-th category.
A classifier that uses linear discriminant functions is called a linear machine. For a linear
machine, the decision surfaces are pieces of hyperplanes defined by the linear equations
gi(x) = gj(x) for the two categories with the highest posterior probabilities. In this case,
the equation can be written as

w^{t}(x - x_0) = 0
where

w = \mu_i - \mu_j

and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^{2}}{\|\mu_i - \mu_j\|^{2}}\ln\frac{P(w_i)}{P(w_j)}\,(\mu_i - \mu_j)
These equations define a hyperplane through the point x0 and orthogonal to the vector w.
Because w = μi − μj, the hyperplane separating Ri and Rj is orthogonal to the line linking the
means. If P(wi) = P(wj), the point x0 is halfway between the means and the hyperplane is the
perpendicular bisector of the line between the means, as shown in Figure 1 below. If
P(wi) ≠ P(wj), the point x0 shifts away from the a priori more likely mean.
Figure 1
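As a quick numerical check of Case 1, the sketch below (with made-up means, a made-up shared variance, and unequal priors) builds w and x0 from the formulas above and verifies that the two discriminants agree at x0, i.e. that x0 really lies on the decision boundary:

```python
import numpy as np

# Case 1 sketch: identical, spherical covariances (Sigma_i = sigma^2 I).
# The means, variance, and priors below are made-up values for illustration.
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 2.0])
sigma2 = 1.5
p_i, p_j = 0.7, 0.3

def g(x, mu, prior):
    """Simplified Case 1 discriminant: -||x - mu||^2 / (2 sigma^2) + ln P(w)."""
    return -np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(prior)

# Hyperplane parameters for w^t (x - x0) = 0.
w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w ** 2) * np.log(p_i / p_j) * w

# At x0 the two discriminants agree, so x0 lies on the decision boundary.
print(np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)))  # True
```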
Case 2: Σi = Σ
Another case occurs when the covariance matrices for all of the classes are identical.
Geometrically, this corresponds to a situation where the samples fall into hyperellipsoidal
clusters of equal size and shape, with the cluster for the i-th class centered about the mean
vector μi. As in Case 1, both the ½ ln |Σi| and (d/2) ln 2π terms can be ignored because they are
independent of i. This leads to the simplified discriminant function:
g_i(x) = -\frac{1}{2}(x - \mu_i)^{t}\Sigma^{-1}(x - \mu_i) + \ln P(w_i)
If the prior probabilities P(wi) are equal for all classes, the ln P(wi) term can be ignored;
if they are unequal, the decision is biased in favor of the a priori more likely class. Expansion
of the quadratic form (x − μi)ᵗΣ⁻¹(x − μi) produces a sum involving the term xᵗΣ⁻¹x, which is
independent of i. After this term is dropped, we get the resulting linear discriminant function:
g_i(x) = w_i^{t}x + w_{i0}

where

w_i = \Sigma^{-1}\mu_i

and

w_{i0} = -\frac{1}{2}\mu_i^{t}\Sigma^{-1}\mu_i + \ln P(w_i)
Because the discriminants are linear, the resulting decision boundaries are again
hyperplanes. If the decision regions Ri and Rj are contiguous, the boundary between them has
the equation:

w^{t}(x - x_0) = 0
where

w = \Sigma^{-1}(\mu_i - \mu_j)

and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(w_i)/P(w_j)\right]}{(\mu_i - \mu_j)^{t}\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)
Because w = Σ⁻¹(μi − μj) is generally not in the direction of μi − μj, the hyperplane separating
Ri and Rj is generally not orthogonal to the line between the means. It does, however, intersect
that line at the point x0; if the prior probabilities are equal, x0 is halfway between the means,
and if they are not equal, the boundary hyperplane is shifted away from the a priori more likely
mean. Figure 2 below shows what the decision boundary looks like for this case.
Figure 2
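An analogous check for Case 2, again with made-up parameters, shows that x0 lies on the boundary even though w = Σ⁻¹(μi − μj) is no longer parallel to μi − μj:

```python
import numpy as np

# Case 2 sketch: a shared covariance matrix Sigma for all classes.
# All numbers below are illustrative, not taken from the text.
mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
p_i, p_j = 0.4, 0.6
sigma_inv = np.linalg.inv(sigma)

def g(x, mu, prior):
    """Case 2 discriminant: -1/2 (x-mu)^t Sigma^{-1} (x-mu) + ln P(w)."""
    diff = x - mu
    return -0.5 * diff @ sigma_inv @ diff + np.log(prior)

# Boundary hyperplane w^t (x - x0) = 0; w is generally not parallel to mu_i - mu_j.
diff_mu = mu_i - mu_j
w = sigma_inv @ diff_mu
x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (diff_mu @ sigma_inv @ diff_mu) * diff_mu

print(np.isclose(g(x0, mu_i, p_i), g(x0, mu_j, p_j)))  # True: x0 is on the boundary
```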
Case 3: Σi = arbitrary
In the general multivariate Gaussian case, where the covariance matrices are different for
each class, the only term that can be dropped from the initial discriminant function is the
(d/2) ln 2π term. The resulting discriminant functions are inherently quadratic:

g_i(x) = x^{t}W_i x + w_i^{t}x + w_{i0}
where
W_i = -\frac{1}{2}\Sigma_i^{-1}

w_i = \Sigma_i^{-1}\mu_i

and

w_{i0} = -\frac{1}{2}\mu_i^{t}\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i)
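The Case 3 discriminant can be written directly in code. The sketch below uses two illustrative classes with different covariance matrices and assigns a test point to the class with the largest gi(x):

```python
import numpy as np

def quadratic_discriminant(x, mu_i, sigma_i, prior_i):
    """Case 3: g_i(x) = x^t W_i x + w_i^t x + w_i0 with arbitrary Sigma_i."""
    sigma_inv = np.linalg.inv(sigma_i)
    W_i = -0.5 * sigma_inv
    w_i = sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(sigma_i))
            + np.log(prior_i))
    return x @ W_i @ x + w_i @ x + w_i0

# Illustrative two-class problem with different covariance matrices.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
]
x = np.array([1.0, 1.5])
scores = [quadratic_discriminant(x, m, s, p) for m, s, p in params]
print(int(np.argmax(scores)))  # predicted class index
```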
EXAMPLE
Given the set of data below for a distribution with two classes w1 and w2, both with prior
probability 0.5, find the discriminant functions and the decision boundary.
Sample w1 w2
1 -5.01 -0.91
2 -5.43 1.30
3 1.08 -7.75
4 0.86 -5.47
5 -2.67 6.14
6 4.94 3.60
7 -2.51 5.37
8 -2.25 7.18
9 5.56 -7.39
10 1.03 -7.50
From the data given above we use the Case 1 equations, treating each class as a univariate
normal described by its sample mean and variance. Writing x_{i,k} for the k-th sample of class
wi, the means are:

\mu_1 = \frac{1}{10}\sum_{k=1}^{10} x_{1,k} = -0.44

\mu_2 = \frac{1}{10}\sum_{k=1}^{10} x_{2,k} = -0.543

and the variances are

\sigma_1^{2} = \frac{1}{10}\sum_{k=1}^{10} (x_{1,k} - \mu_1)^{2} = 31.34

\sigma_2^{2} = \frac{1}{10}\sum_{k=1}^{10} (x_{2,k} - \mu_2)^{2} = 52.62
g_1 = \frac{-0.44}{31.34} - \frac{(-0.44)^{2}}{2(31.34)} + \ln(0.5) = -0.710

g_2 = \frac{-0.543}{52.62} - \frac{(-0.543)^{2}}{2(52.62)} + \ln(0.5) = -0.706
and, because the two classes have the same prior probability, the decision boundary x0 lies
halfway between the means, at x0 = (−0.44 + (−0.543))/2 ≈ −0.49.
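The steps of this example can also be carried out with a short script. The sketch below computes the sample means and variances from the table, evaluates the per-class discriminants in the quadratic form used above, and places the equal-prior boundary midway between the means:

```python
import numpy as np

# Feature values for the two classes, copied from the table above.
w1 = np.array([-5.01, -5.43, 1.08, 0.86, -2.67, 4.94, -2.51, -2.25, 5.56, 1.03])
w2 = np.array([-0.91, 1.30, -7.75, -5.47, 6.14, 3.60, 5.37, 7.18, -7.39, -7.50])
prior = 0.5  # both classes are equally likely a priori

mu1, mu2 = w1.mean(), w2.mean()
var1, var2 = w1.var(), w2.var()  # sample variances (dividing by N)

def g(x, mu, var):
    """Univariate discriminant in the form used above: -(x - mu)^2 / (2 var) + ln P(w)."""
    return -(x - mu) ** 2 / (2 * var) + np.log(prior)

# With equal priors, the example places the decision boundary midway between the means.
x0 = 0.5 * (mu1 + mu2)

x_test = 1.0  # an arbitrary test point
predicted = "w1" if g(x_test, mu1, var1) > g(x_test, mu2, var2) else "w2"
print(mu1, mu2, var1, var2, x0, predicted)
```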
Principal Component Analysis (PCA) is one of the most popular linear dimensionality reduction
algorithms. It is a projection-based method that transforms the data by projecting it onto a set
of orthogonal (perpendicular) axes chosen so that the projected data retains as much of the
original variance as possible. In two-dimensional data, for example, the first principal axis is
the direction along which the data has maximum variance.
As an example, consider compressing a color image with PCA. The image is first reshaped to two
dimensions by folding the depth (the three color channels) into the columns, so a 225-column
image becomes 225 × 3 = 675 columns. Applying PCA to this reshaped array then compresses the
image down to a reduced number of components, from which an approximation of the original image
can be reconstructed.
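Here is a minimal sketch of that reshape-then-PCA workflow using scikit-learn; the 225 × 225 × 3 image (random pixels standing in for a real photo) and the choice of 50 components are assumptions made for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assume a 225 x 225 RGB image; random pixels stand in for a real image here.
rng = np.random.default_rng(0)
image = rng.random((225, 225, 3))

# Reshape to 2-D: fold the colour depth into the columns, 225 x (225 * 3) = 225 x 675.
flat = image.reshape(225, 225 * 3)

# Keep a reduced number of components (an arbitrary choice here), then reconstruct.
pca = PCA(n_components=50)
reduced = pca.fit_transform(flat)          # compressed representation, shape (225, 50)
restored = pca.inverse_transform(reduced)  # approximate reconstruction, shape (225, 675)

print(reduced.shape, restored.shape)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```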
The Expectation-Maximization (EM) algorithm works by choosing random values for the missing
data points and using those guesses to estimate a second set of quantities (the model
parameters). The new estimates are then used to produce better guesses for the first set, and
the process continues until the algorithm converges on a fixed point.
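The loop described above can be illustrated with a toy sketch: one-dimensional data with a few made-up missing entries, filled in and re-estimated until the guesses stop changing. This is the impute-and-re-estimate idea rather than a full EM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional data with missing entries (the values are made up).
data = np.array([2.1, 1.9, np.nan, 2.5, np.nan, 1.7, 2.2])
missing = np.isnan(data)

filled = data.copy()
filled[missing] = rng.normal(size=missing.sum())  # start from random guesses

for _ in range(100):
    mu, sigma = filled.mean(), filled.std()  # re-estimate parameters from current guesses
    new_guess = np.full(missing.sum(), mu)   # best guess for the missing values
    if np.allclose(filled[missing], new_guess):
        break                                # converged to a fixed point
    filled[missing] = new_guess

print(mu, sigma)
```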