KNN and HMM
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a supervised learning algorithm most often used for
classification.
It is a simple algorithm that stores all available cases and classifies new
cases by a majority vote of their k nearest neighbors.
A new case is assigned to the class most common among its K
nearest neighbors, as measured by a distance function.
Common distance functions include the Euclidean, Manhattan, Minkowski, and
Hamming distances.
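To make these concrete, here is a minimal sketch of those four distance functions in Python; the plain-tuple point format is our assumption:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # Generalization: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions at which the values differ (for categorical data).
    return sum(x != y for x, y in zip(a, b))
```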
What is K?
In the algorithm, for each test data point we look at the K nearest
training data points, take the most frequently occurring class among them, and assign
that class to the test point. Therefore, K represents the number of training data
points lying in proximity to the test data point that we use to
determine its class (see the implementation sketch after the algorithm steps below).
The algorithm measures the distance from the query point to each training point
using some sort of function (usually Euclidean), then analyzes those results and assigns
the query point to the group containing the points closest to it.
You can use KNN for both classification and regression problems. However, it
is more widely used for classification problems in industry.
Let’s take a simple case to understand this algorithm. Imagine a spread of
red circles (RC) and green squares (GS) on a plane.
We intend to find out the class of a new point, the blue star (BS). BS can either be RC or
GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we
wish to take a vote from.
Let’s say K = 3. Hence, we now draw a circle with BS as its center, just big
enough to enclose exactly three data points on the plane.
Suppose the three closest points to BS are all RC. Then, with a good level of
confidence, we can say that BS should belong to the class RC.
Here, the choice became very obvious, as all three votes from the closest
neighbors went to RC. The choice of the parameter K is very crucial in this
algorithm.
Algorithm
Load the data
Initialise the value of K
To get the predicted class, iterate over every training data point:
Calculate the distance between the test data and each row of training data. Here we
use Euclidean distance as our distance metric, since it is the most popular
choice. Other metrics that can be used are Chebyshev, cosine, etc.
Sort the calculated distances in ascending order based on distance values
Take the top K rows from the sorted array
Find the most frequent class among these rows
Return the predicted class
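These steps translate almost line-for-line into Python. The sketch below is illustrative, not a production implementation; the (point, label) data layout and the sample coordinates echoing the RC/GS/BS example are our assumptions:

```python
import math
from collections import Counter

def knn_predict(training_data, test_point, k):
    """Predict the class of test_point from its k nearest training points."""
    # Compute the Euclidean distance to every training point.
    distances = []
    for point, label in training_data:
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(point, test_point)))
        distances.append((d, label))
    # Sort by distance, ascending, and take the top k rows.
    distances.sort(key=lambda pair: pair[0])
    top_k = distances[:k]
    # Return the most frequent class among them.
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Made-up coordinates echoing the RC/GS/BS example above:
training = [((1, 1), "RC"), ((2, 1), "RC"), ((1.5, 2), "RC"),
            ((5, 5), "GS"), ((6, 5), "GS"), ((5, 6), "GS")]
print(knn_predict(training, (1.6, 1.4), k=3))  # -> "RC"
```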
Choosing the right value for K
To select the K that’s right for your data, we run the KNN
algorithm several times with different values of K and
choose the K that minimizes the number of errors we
encounter while maintaining the algorithm’s ability to
make accurate predictions on data it hasn’t seen before.
Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less
stable.
Inversely, as we increase the value of K, our predictions
become more stable due to majority voting / averaging, and are
thus more likely to be accurate (up to a certain point).
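One common way to run this search is to hold out a validation set and sweep over candidate values of K, keeping the one with the fewest errors. The sketch below is illustrative and reuses the knn_predict function and training list from the earlier sketch; the validation points are made up:

```python
def choose_k(training_data, validation_data, k_values):
    """Return the K with the fewest misclassifications on held-out data."""
    best_k, best_errors = None, float("inf")
    for k in k_values:
        errors = sum(knn_predict(training_data, point, k) != label
                     for point, label in validation_data)
        if errors < best_errors:
            best_k, best_errors = k, errors
    return best_k

# Made-up held-out points; odd K values avoid ties in two-class voting.
validation = [((1.2, 1.1), "RC"), ((5.5, 5.2), "GS")]
print(choose_k(training, validation, k_values=[1, 3, 5]))
```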
You will have to note the following points before selecting
KNN:
KNN is computationally expensive.
Variables should be normalized, or else variables with larger ranges can
bias the distances (see the normalization sketch below).
More work is needed in the pre-processing stage before applying KNN, such as
outlier and noise removal.
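To address the normalization point, here is a minimal min-max rescaling sketch; the list-of-tuples layout is our assumption:

```python
def min_max_normalize(points):
    """Rescale each feature to [0, 1] so no single feature dominates the distance."""
    cols = list(zip(*points))          # one tuple per feature column
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(p, mins, maxs))
            for p in points]
```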
Summary
The k-nearest neighbors (KNN) algorithm is a simple, supervised machine
learning algorithm that can be used to solve both classification and
regression problems. It’s easy to implement and understand, but has a
major drawback of becoming significantly slower as the size of the data in
use grows.
KNN works by finding the distances between a query and all the examples
in the data, selecting the specified number of examples (K) closest to the
query, then voting for the most frequent label (in the case of
classification) or averaging the labels (in the case of regression).
In the case of classification and regression, we saw that choosing the right
K for our data is done by trying several Ks and picking the one that works
best.
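For the regression case, only the final step changes: instead of voting, we average the neighbors' numeric targets. A minimal sketch under the same assumed (point, target) data layout:

```python
import math

def knn_regress(training_data, test_point, k):
    """Average the numeric targets of the k nearest training points."""
    distances = sorted(
        (math.sqrt(sum((x - y) ** 2 for x, y in zip(point, test_point))), target)
        for point, target in training_data)
    # Mean of the targets of the K closest points.
    return sum(target for _, target in distances[:k]) / k
```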
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical Markov model
(chain) in which the system being modeled is assumed to
be a Markov process with hidden (unobserved) states.
In simpler Markov models (like a Markov chain), the state
is directly visible to the observer, and therefore the state
transition probabilities are the only parameters,
while in a hidden Markov model the state is not directly
visible; only the output (in the form of data or “tokens”),
which depends on the state, is visible.
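To make the hidden/visible split concrete, an HMM is specified by initial-state probabilities, state-transition probabilities, and emission (output) probabilities. The states, tokens, and numbers below are all made-up illustrations, not values from any source:

```python
states = ["Sunny", "Cloudy"]                 # hidden: never observed directly
observations = ["Walk", "Shop", "Stay in"]   # visible tokens

# P(first hidden state)
initial = {"Sunny": 0.6, "Cloudy": 0.4}

# P(next state | current state): the Markov chain over hidden states.
transition = {
    "Sunny":  {"Sunny": 0.7, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Cloudy": 0.6},
}

# P(observation | state): what the observer actually sees.
emission = {
    "Sunny":  {"Walk": 0.6, "Shop": 0.3, "Stay in": 0.1},
    "Cloudy": {"Walk": 0.2, "Shop": 0.4, "Stay in": 0.4},
}
```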
What is a Markov Property?
A stochastic process (a random process, i.e. a collection of
random variables that changes through time) satisfies the
Markov property if the probability of future states of the process
depends only upon the present state, not on the sequence
of states preceding it.
This is commonly referred to as the memoryless property.
Any random process that satisfies the Markov property is
known as a Markov process.
Markov chain
A Markov chain is the simplest such model: a sequence of states in which
the probability of moving to the next state depends only on the current
state, so it is described entirely by its state-transition probabilities.
Applications
DNA sequence analysis
Prediction of genes
Horizontal gene transfer
Radiation hybrid mapping
Speech recognition
Vehicle trajectory prediction
Positron emission tomography (PET)
Digital communication
Music analysis
Optical signal detection
Gesture learning for human-robot interfaces
Example:
Using this Markov chain, what is the probability that Wednesday will be
cloudy if today (Monday) is sunny?
Answer:
The following are the different transition paths that can result in a cloudy
Wednesday given that today (Monday) is sunny.
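Formally, P(cloudy on Wednesday | sunny on Monday) is the sum, over every possible Tuesday state s, of P(s | sunny) · P(cloudy | s). The transition probabilities come from the Markov chain diagram, which is not reproduced here, so the matrix below uses placeholder numbers; the computation itself is the general method:

```python
# Placeholder transition matrix P(next | current); the real values come
# from the Markov chain diagram in the original slides.
P = {
    "Sunny":  {"Sunny": 0.6, "Cloudy": 0.3, "Rainy": 0.1},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.4, "Rainy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Cloudy": 0.4, "Rainy": 0.4},
}

# Two-step probability: sum over every possible Tuesday state.
# Paths: Sunny->Sunny->Cloudy, Sunny->Cloudy->Cloudy, Sunny->Rainy->Cloudy.
prob = sum(P["Sunny"][tue] * P[tue]["Cloudy"] for tue in P)
print(prob)  # 0.18 + 0.12 + 0.04 = 0.34 with these placeholder numbers
```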