Pattern Recognition

Pattern recognition is a machine learning technique that automatically recognizes patterns in data. Feature extraction is the process of determining which properties or measurements of the patterns should be used for classification; it involves describing the patterns in a way that highlights their similarities and differences. Discriminant functions are used in pattern recognition to assign patterns to categories based on their features. They are derived from the probability that a pattern belongs to a given category, given its features and the prior probabilities of the categories.


Q.1 What is pattern recognition? Explain the concept of feature extraction in a pattern recognition system with an example.

Pattern recognition is a data analysis method that uses machine learning algorithms to
automatically recognize patterns and regularities in data. This data can be anything from
text and images to sounds or other definable qualities. Pattern recognition systems can
recognize familiar patterns quickly and accurately.

Feature extraction

Feature extraction is the process of determining which features of the patterns will be used for learning. Although the description and properties of the patterns are known, for the classification task at hand it is still necessary to extract the specific features on which the classifier will operate.

Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

An example of pattern recognition is classification, which attempts to assign each input value
to one of a given set of classes (for example, determine whether a given email is "spam" or
"non-spam"). However, pattern recognition is a more general problem that encompasses other
types of output as well.
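As an illustration of feature extraction for the spam example, the sketch below (not from the original text) maps a raw email string to a small numeric feature vector that a classifier could then use. The particular features and the extract_features name are illustrative assumptions, not a standard API.

# Minimal feature-extraction sketch for the spam / non-spam example.
# The chosen features are illustrative assumptions, not a fixed standard.

SPAM_WORDS = {"free", "winner", "prize", "urgent", "offer"}

def extract_features(email_text):
    """Map a raw email string to a fixed-length numeric feature vector."""
    words = email_text.lower().split()
    return [
        float(len(words)),                                 # message length in words
        sum(w.strip("!.,") in SPAM_WORDS for w in words),  # count of "spammy" words
        email_text.count("!"),                             # number of exclamation marks
        sum(c.isupper() for c in email_text) / max(len(email_text), 1),  # uppercase ratio
    ]

# Example: the resulting vectors, not the raw text, are fed to the classifier.
print(extract_features("URGENT!!! You are a WINNER, claim your FREE prize now!"))
print(extract_features("Hi team, the meeting is moved to 3pm tomorrow."))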

Q. 2 Explain, with an example, the discriminant function for the normal density.

Introduction to Normal or Gaussian Distribution

Before talking about discriminant functions for the normal density, we first need to know
what a normal distribution is and how it is represented for just a single variable, and for a vector
variable. Let's begin with the continuous univariate normal, or Gaussian, density:

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]

for which the expected value of x is

\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx
and where the expected squared deviation or variance is

\sigma^{2} = E[(x-\mu)^{2}] = \int_{-\infty}^{\infty} (x-\mu)^{2}\,p(x)\,dx
The univariate normal density is completely specified by two parameters: its mean μ and variance σ². The density can be written as N(μ, σ²), which says that x is distributed normally with mean μ and variance σ². Samples from a normal distribution tend to cluster about the mean, with a spread related to the standard deviation σ.

For the multivariate normal density in d dimensions, p(x) is written as

p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,\exp\!\left[-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}(x-\mu)\right]
where x is a d-component column vector, μ is the d-component mean vector, Σ is the d-by-d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse, respectively. Also, (x − μ)^t denotes the transpose of (x − μ),

and

\Sigma = E\!\left[(x-\mu)(x-\mu)^{t}\right] = \int (x-\mu)(x-\mu)^{t}\,p(x)\,dx
where the expected value of a vector or a matrix is found by taking the expected value of its individual components; i.e., if x_i is the i-th component of x, μ_i the i-th component of μ, and σ_ij the ij-th component of Σ, then

\mu_i = E[x_i]
and

\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]
The covariance matrix Σ is always symmetric and positive definite, which means that the determinant of Σ is strictly positive. The diagonal elements σ_ii are the variances of the respective x_i (i.e., σ_i²), and the off-diagonal elements σ_ij are the covariances of x_i and x_j. If x_i and x_j are statistically independent, then σ_ij = 0. If all off-diagonal elements are zero, p(x) reduces to the product of the univariate normal densities for the components of x.
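As a small illustration (not part of the original text), the sketch below estimates the mean vector and covariance matrix from a set of d-dimensional samples; these are the sample counterparts of the expectations defined above. The particular numbers are arbitrary assumptions.

import numpy as np

# Draw samples from an assumed 2-D Gaussian just to have data to work with.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)   # shape (n, d)

# Sample estimates of mu = E[x] and Sigma = E[(x - mu)(x - mu)^t].
mu_hat = X.mean(axis=0)
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)   # population-style estimate

print("estimated mean:", mu_hat)
print("estimated covariance:\n", sigma_hat)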

Discriminant Functions

Discriminant functions are used to achieve the minimum probability of error in decision-making problems. In a problem with feature vector Y and state-of-nature variable w, the discriminant function for class w_i can be written as:

g_i(Y) = \ln p(Y \mid w_i) + \ln P(w_i)
where p(Y | w_i) is the class-conditional probability density function for Y when the state of nature is w_i, and P(w_i) is the prior probability that nature is in state w_i. If we take p(Y | w_i) to be a multivariate normal density N(μ_i, Σ_i), then, writing the feature vector as x, the discriminant function becomes:

g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i)

We will now look at the special cases of the covariance matrix Σ_i for the multivariate normal distribution.

Case 1: Σ_i = σ²I

This is the simplest case; it occurs when the features are statistically independent and each feature has the same variance σ². Here the covariance matrix is diagonal, since it is simply σ² times the identity matrix I, so the samples fall into equal-sized clusters centered about their respective mean vectors. The determinant and inverse are |Σ_i| = σ^{2d} and Σ_i⁻¹ = (1/σ²)I. Because both |Σ_i| and the (d/2) ln 2π term in the equation above are independent of i, they can be ignored, and we obtain this simplified discriminant function:

g_i(x) = -\frac{\|x-\mu_i\|^{2}}{2\sigma^{2}} + \ln P(w_i)
where ||.|| denotes the Euclidean norm, that is,

\|x-\mu_i\|^{2} = (x-\mu_i)^{t}(x-\mu_i)

If the prior probabilities are not equal, the discriminant function shows that the squared distance ‖x − μ_i‖² must be normalized by the variance σ² and offset by adding ln P(w_i); therefore, if x is equally near two different mean vectors, the optimal decision favors the a priori more likely category. Expansion of the quadratic form (x − μ_i)^t(x − μ_i) yields:

g_i(x) = -\frac{1}{2\sigma^{2}}\left[x^{t}x - 2\mu_i^{t}x + \mu_i^{t}\mu_i\right] + \ln P(w_i)
which looks like a quadratic function of x. However, the quadratic term x^t x is the same for all i, so it can be ignored as an additive constant, and we obtain the equivalent linear discriminant function:

g_i(x) = w_i^{t}x + w_{i0}
where

w_i = \frac{1}{\sigma^{2}}\mu_i
and

w_{i0} = -\frac{1}{2\sigma^{2}}\mu_i^{t}\mu_i + \ln P(w_i)
wi0 is the threshold or bias for the ith category.
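A minimal sketch (illustrative, not from the original text) of this Case 1 linear machine: given class means, a shared variance σ², and priors, it builds w_i and w_i0 as above and assigns a point to the class with the largest g_i(x). The toy means and variance are assumptions.

import numpy as np

def case1_discriminants(x, means, sigma2, priors):
    """Case 1 (Sigma_i = sigma^2 I): g_i(x) = w_i^t x + w_i0."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2                               # w_i = mu_i / sigma^2
        w0 = -mu @ mu / (2 * sigma2) + np.log(prior)  # threshold / bias term
        scores.append(w @ x + w0)
    return np.array(scores)

# Assumed toy parameters: two classes in 2-D with equal variance.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
scores = case1_discriminants(np.array([1.0, 1.2]), means, sigma2=1.0, priors=[0.5, 0.5])
print("g(x) per class:", scores, "-> decide class", scores.argmax())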

A classifier that uses linear discriminant functions is called a linear machine. For a linear machine, the decision surfaces are pieces of hyperplanes defined by the linear equations g_i(x) = g_j(x) for the two categories with the highest posterior probabilities. In this case, the equation can be written as

w^{t}(x - x_0) = 0
where

w = \mu_i - \mu_j
and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^{2}}{\|\mu_i - \mu_j\|^{2}}\,\ln\frac{P(w_i)}{P(w_j)}\,(\mu_i - \mu_j)
These equations define a hyperplane through the point x_0 and orthogonal to the vector w. Because w = μ_i − μ_j, the hyperplane separating R_i and R_j is orthogonal to the line linking the means. If P(w_i) = P(w_j), the point x_0 is halfway between the means and the hyperplane is the perpendicular bisector of the line between the means, as shown in Figure 1 below. If P(w_i) ≠ P(w_j), the point x_0 shifts away from the more likely mean.

Figure 1

Case 2: Σ_i = Σ

Another case occurs when the covariance matrices for all the classes are identical. This corresponds to a situation where the samples fall into hyperellipsoidal clusters of equal size and shape, with the cluster of the i-th class centered about the mean vector μ_i. Both the ½ ln|Σ_i| and the (d/2) ln 2π terms can be ignored, as in Case 1, because they are independent of i. This leads to the simplified discriminant function:

g_i(x) = -\frac{1}{2}(x-\mu_i)^{t}\,\Sigma^{-1}(x-\mu_i) + \ln P(w_i)

If the prior probabilities P(w_i) are equal for all classes, the ln P(w_i) term can be ignored; if they are unequal, the decision is biased in favor of the a priori more likely class. Expansion of the quadratic form (x − μ_i)^t Σ⁻¹(x − μ_i) produces a term x^t Σ⁻¹ x that is independent of i. After this term is dropped, we get the resulting linear discriminant function:

g_i(x) = w_i^{t}x + w_{i0}
where

w_i = \Sigma^{-1}\mu_i
and

w_{i0} = -\frac{1}{2}\mu_i^{t}\,\Sigma^{-1}\mu_i + \ln P(w_i)
Because the discriminants are linear, the resulting decision boundaries are again hyperplanes. If R_i and R_j are contiguous, the boundary between them has the equation:

w^{t}(x - x_0) = 0
where

w = \Sigma^{-1}(\mu_i - \mu_j)
and

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(w_i)/P(w_j)\right]}{(\mu_i - \mu_j)^{t}\,\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)

Because w = Σ⁻¹(μ_i − μ_j) is generally not in the direction of μ_i − μ_j, the hyperplane separating R_i and R_j is generally not orthogonal to the line between the means. It does, however, intersect that line at the point x_0: if the prior probabilities are equal, x_0 is halfway between the means; if they are not, the boundary hyperplane is shifted away from the more likely mean. Figure 2 below shows what the decision boundary looks like for this case.

Figure 2
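A minimal sketch (illustrative, not from the original text) of the Case 2 discriminant with a shared covariance matrix Σ: it computes w_i = Σ⁻¹μ_i and w_i0 as above and picks the class with the largest score. The toy covariance, means and priors are assumptions.

import numpy as np

def case2_discriminants(x, means, shared_cov, priors):
    """Case 2 (Sigma_i = Sigma): g_i(x) = w_i^t x + w_i0 with w_i = Sigma^-1 mu_i."""
    cov_inv = np.linalg.inv(shared_cov)
    scores = []
    for mu, prior in zip(means, priors):
        w = cov_inv @ mu
        w0 = -0.5 * mu @ cov_inv @ mu + np.log(prior)
        scores.append(w @ x + w0)
    return np.array(scores)

# Assumed toy parameters: two classes sharing one covariance matrix.
shared_cov = np.array([[1.0, 0.3],
                       [0.3, 2.0]])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
scores = case2_discriminants(np.array([1.5, 0.5]), means, shared_cov, priors=[0.6, 0.4])
print("g(x) per class:", scores, "-> decide class", scores.argmax())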

Case 3: Σ_i arbitrary

In the general multivariate Gaussian case, where the covariance matrices are different for each class, the only term that can be dropped from the initial discriminant function is the (d/2) ln 2π term. The resulting discriminant function is quadratic in x:

g_i(x) = x^{t}W_i\,x + w_i^{t}x + w_{i0}
where

W_i = -\frac{1}{2}\Sigma_i^{-1}
w_i = \Sigma_i^{-1}\mu_i
and

w_{i0} = -\frac{1}{2}\mu_i^{t}\,\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i)

This leads to hyperquadric decision boundaries, as seen in Figure 3 below.

Figure 3
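A minimal sketch (illustrative, not from the original text) of the general Case 3 quadratic discriminant with a separate covariance matrix per class; note that the ln|Σ_i| term can no longer be dropped. The toy parameters are assumptions.

import numpy as np

def case3_discriminant(x, mu, cov, prior):
    """Case 3 (arbitrary Sigma_i): g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ W @ x + w @ x + w0

# Assumed toy parameters: two classes with different covariance matrices.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 0.8], [0.8, 0.5]]), 0.5),
]
x = np.array([1.0, 1.5])
scores = [case3_discriminant(x, mu, cov, p) for mu, cov, p in params]
print("g(x) per class:", scores, "-> decide class", int(np.argmax(scores)))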

EXAMPLE

Given the data below, drawn from a distribution with two classes w_1 and w_2, both with prior probability 0.5, find the discriminant functions and the decision boundary.

Sample w1 w2
1 -5.01 -0.91
2 -5.43 1.30
3 1.08 -7.75
4 0.86 -5.47
5 -2.67 6.14
6 4.94 3.60
7 -2.51 5.37
8 -2.25 7.18
9 5.56 -7.39
10 1.03 -7.50

From the data given above we use the equations from Case 1, describing each class by its sample mean and variance. The means are:

\mu_1 = \frac{1}{10}\sum_{k=1}^{10} x_k^{(1)} = -0.44
\mu_2 = \frac{1}{10}\sum_{k=1}^{10} x_k^{(2)} = -0.543
and the variances are

\sigma_1^{2} = \frac{1}{10}\sum_{k=1}^{10}\left(x_k^{(1)} - \mu_1\right)^{2} = 31.34
\sigma_2^{2} = \frac{1}{10}\sum_{k=1}^{10}\left(x_k^{(2)} - \mu_2\right)^{2} = 52.62

The discriminant functions are then

g_1 = \frac{-0.44}{31.34} - \frac{(-0.44)^{2}}{2 \times 31.34} + \ln(0.5) = -0.710
g_2 = \frac{-0.543}{52.62} - \frac{(-0.543)^{2}}{2 \times 52.62} + \ln(0.5) = -0.706
and, because the two classes have the same prior probability, the decision boundary lies halfway between the means, at x_0 = \tfrac{1}{2}(\mu_1 + \mu_2) \approx -0.49.
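As a quick check (not part of the original text), the sketch below computes the two class means from the table and the equal-prior midpoint boundary x_0 = ½(μ_1 + μ_2).

import numpy as np

# Samples from the table above: one value per sample for each class.
w1 = np.array([-5.01, -5.43, 1.08, 0.86, -2.67, 4.94, -2.51, -2.25, 5.56, 1.03])
w2 = np.array([-0.91, 1.30, -7.75, -5.47, 6.14, 3.60, 5.37, 7.18, -7.39, -7.50])

mu1, mu2 = w1.mean(), w2.mean()
x0 = 0.5 * (mu1 + mu2)   # equal priors: boundary is halfway between the means

print(f"mu1 = {mu1:.3f}, mu2 = {mu2:.3f}, decision boundary x0 = {x0:.3f}")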

Q. 3 How can PCA be used for dimensionality reduction?

Principal Component Analysis (PCA) is one of the most popular linear dimensionality reduction algorithms. It is a projection-based method that transforms the data by projecting it onto a set of orthogonal (perpendicular) axes. For two-dimensional data, for example, the first principal component is the line along which the data has maximum variance.

Dimensionality reduction involves reducing the number of input variables or columns in the modeling data. PCA is a technique from linear algebra that can be used to perform this reduction automatically. The reduced representation can then be used as the input to predictive models, and new raw data is projected onto the same components before making predictions.

Hands-on implementation of image compression using PCA

The image is first reshaped to 2-dimensional form by multiplying the number of columns by the depth, so 225 × 3 = 675 columns. Applying PCA to this 2-D array compresses the image: only the leading principal components are kept, giving a representation of reduced dimension from which an approximation of the original image can be reconstructed.
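A minimal sketch of this compression idea using scikit-learn's PCA; the random stand-in "image" array, its shape, and the number of components are illustrative assumptions rather than values from the text above.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a colour image of shape (height, width, channels).
image = np.random.rand(300, 225, 3)

# Reshape to 2-D: merge width and depth (225 * 3 = 675 columns).
flat = image.reshape(image.shape[0], -1)      # shape (300, 675)

# Keep a reduced number of components to compress the image.
pca = PCA(n_components=50)
compressed = pca.fit_transform(flat)          # shape (300, 50)

# Approximate reconstruction from the compressed representation.
restored = pca.inverse_transform(compressed).reshape(image.shape)

print("compressed shape:", compressed.shape)
print("variance retained: %.3f" % pca.explained_variance_ratio_.sum())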

Steps involved in PCA:

1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix of the standardized data.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select the k eigenvectors that correspond to the k largest eigenvalues.
5. Construct a projection matrix W from the top k eigenvectors.
6. Project the data onto W to obtain the new k-dimensional feature subspace, as shown in the sketch below.
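A minimal from-scratch sketch of these steps with NumPy (illustrative; details such as the eigenvalue sorting and the random input are assumptions of this sketch, not prescribed by the text above).

import numpy as np

def pca_reduce(X, k):
    """Reduce X (n samples x d features) to k dimensions following the steps above."""
    # 1. Standardize each feature to zero mean and unit variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(Xs, rowvar=False)
    # 3. Eigen-decomposition (eigh: covariance matrices are symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Select the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    # 5. Projection matrix W (d x k) from the top-k eigenvectors.
    W = eigvecs[:, order]
    # 6. Project the data onto the new k-dimensional subspace.
    return Xs @ W

X = np.random.rand(100, 5)
print(pca_reduce(X, k=2).shape)   # -> (100, 2)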
Q. 4 Describe the Expectation-Maximization algorithm.

The expectation-maximization (EM) algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values for the latent variables, then optimizing the model, and then repeating these two steps until convergence.

It works by choosing random values for the missing data points and using those guesses to estimate a second set of data. The new values are used to create a better guess for the first set, and the process continues until the algorithm converges on a fixed point.

The expectation-maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete-data case. In particular, expectation maximization attempts to find the parameters θ that maximize the log probability log P(x; θ) of the observed data.
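A minimal sketch (illustrative, not from the original text) of EM for a two-component 1-D Gaussian mixture: the E-step computes responsibilities for the latent component labels, and the M-step re-estimates the parameters from those responsibilities. The synthetic data and initial guesses are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of two 1-D Gaussians (the component labels are latent).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial guesses for the mixing weight, means and variances.
pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibility of component 1 for each point.
    p0 = (1 - pi) * normal_pdf(x, mu[0], var[0])
    p1 = pi * normal_pdf(x, mu[1], var[1])
    r1 = p1 / (p0 + p1)
    r0 = 1 - r1
    # M-step: re-estimate parameters from the weighted data.
    mu = np.array([np.sum(r0 * x) / r0.sum(), np.sum(r1 * x) / r1.sum()])
    var = np.array([np.sum(r0 * (x - mu[0]) ** 2) / r0.sum(),
                    np.sum(r1 * (x - mu[1]) ** 2) / r1.sum()])
    pi = r1.mean()

print("estimated means:", mu, "variances:", var, "weight of component 1:", pi)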
