ML Algo

Logistic Regression Explained

In this post, I will explain Logistic Regression in simple terms. It could be considered a
Logistic Regression for dummies post, however, I’ve never really liked that expression.

In the Machine Learning world, Logistic Regression is a kind of parametric classification model, despite having the word 'regression' in its name.

This means that logistic regression models have a fixed number of parameters that depends on the number of input features, and they output a categorical prediction, such as whether a plant belongs to a certain species or not.

In reality, the theory behind Logistic Regression is very similar to that of Linear Regression, so if you don't know what Linear Regression is, take 5 minutes to read this super easy guide:

Linear Regression Explained (Linear Regression explained simply, towardsdatascience.com)

In Logistic Regression, we don't directly fit a straight line to our data like in linear regression. Instead, we fit an S-shaped curve, called the sigmoid, to our observations.

Sigmoid function fitted to some data

Let's examine this figure closely.

First of all, as we said before, Logistic Regression models are classification models; specifically binary classification models (they can only be used to distinguish between 2 different categories, like whether a person is obese or not given their weight, or whether a house is big or small given its size). This means that our data has two kinds of observations (Category 1 and Category 2 observations), as we can see in the figure.

Note: This is a very simple example of Logistic Regression; in practice, much harder problems can be solved using these models, with a wide range of features and not just a single one.

Secondly, as we can see, the Y-axis goes from 0 to 1. This is because the sigmoid function's output always lies between 0 and 1, and this fits very well our goal of classifying samples into two different categories. By computing the sigmoid function of X (a weighted sum of the input features, just like in Linear Regression), we get a probability (between 0 and 1, obviously) of an observation belonging to one of the two categories.

The formula for the sigmoid function is the following:
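$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

where x is the weighted sum of the input features.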

If we wanted to predict if a person was obese or not given their weight, we would first
compute a weighted sum of their weight (sorry for the lexical redundancy) and then
input this into the sigmoid function:

1) Calculate weighted sum of inputs

Weighted sum of the input features (feature in this case)

2) Calculate the probability of Obese

Use of the sigmoid equation for this calculation
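As a minimal sketch of those two steps in Python (the coefficients b0 and b1 below are made-up placeholders, not values fitted to real data):

import math

# Hypothetical model parameters (placeholders, not fitted values)
b0 = -10.0   # intercept
b1 = 0.12    # coefficient for the weight feature (per kg)

def sigmoid(x):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-x))

def probability_of_obese(weight_kg):
    x = b0 + b1 * weight_kg   # 1) weighted sum of the input feature(s)
    return sigmoid(x)         # 2) the sigmoid turns that sum into a probability

print(probability_of_obese(60))   # low probability
print(probability_of_obese(120))  # high probability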

Alright, this looks cool and all, but isn’t this meant to be a Machine Learning
model? How do we train it? That is a good question. There are multiple ways to train
a Logistic Regression model (fit the S shaped line to our data). We can use an iterative
optimisation algorithm like Gradient Descent to calculate the parameters of the model
(the weights) or we can use probabilistic methods like Maximum likelihood.

If you don’t know what any of these are, Gradient Descent was explained in the Linear
Regression post, and an explanation of Maximum Likelihood for Machine Learning can
be found here:
Probability Learning III: Maximum Likelihood. Another step in our way to become probability masters… (towardsdatascience.com)

Once we have used one of these methods to train our model, we are ready to make some
predictions. Let's see an example of how the process of training a Logistic Regression
model and using it to make predictions would go:

1. First, we would collect a Dataset of patients who have and who have not been
diagnosed as obese, along with their corresponding weights.

2. After this, we would train our model to fit our S-shaped curve to the data and obtain the parameters of the model. After training using Maximum Likelihood, we get the following parameters:

Parameters and equation of X

3. Now, we are ready to make some predictions: imagine we got two patients; one is
120 kg and one is 60 kg. Let's see what happens when we plug these numbers into the
model:
Results of using the fitted model to predict obesity given patient weight

As we can see, the 60 kg patient has a very low probability of being obese, while the 120 kg patient has a very high one.

Logistic Regression results for the previous examples.

In the previous figure, we can see the results given by the Logistic Regression model for
the discussed examples. Now, given the weight of any patient, we could calculate their
probability of being obese, and give our doctors a quick first round of information!
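To reproduce this kind of workflow end to end, a rough scikit-learn sketch looks like the following; the weights and labels are invented toy data, not the dataset behind the figures above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: patient weights (kg) and whether they were diagnosed obese
weights = np.array([[55], [62], [70], [80], [95], [110], [120], [135]])
obese = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(weights, obese)

# Probability of the 'obese' class for two new patients (60 kg and 120 kg)
print(model.predict_proba([[60], [120]])[:, 1])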
6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R
Overview

 Understand one of the most popular and simple machine learning classification algorithms, the Naive Bayes algorithm
 It is based on the Bayes Theorem for calculating probabilities and conditional probabilities
 Learn how to implement the Naive Bayes Classifier in R and Python

Introduction

Here’s a situation you’ve got into in your data science project:

You are working on a classification problem and have generated your set of hypotheses, created features and discussed the importance of variables. Within an hour, stakeholders want to see the first cut of the model.

What will you do? You have hundreds of thousands of data points and quite a few
variables in your training data set. In such a situation, if I were in your place, I would
have used ‘Naive Bayes‘, which can be extremely fast relative to other classification
algorithms. It works on Bayes theorem of probability to predict the class of unknown
data sets.

In this article, I'll explain the basics of this algorithm, so that the next time you come across large data sets, you can bring this algorithm into action. In addition, if you are a newbie in Python or R, you should not be overwhelmed by the code included in this article.

Project to apply Naive Bayes

Problem Statement

HR analytics is revolutionizing the way human resources departments operate, leading to higher efficiency and better results overall. Human resources teams have been using analytics for years.

However, the collection, processing, and analysis of data have been largely manual, and given
the nature of human resources dynamics and HR KPIs, the approach has been constraining
HR. Therefore, it is surprising that HR departments woke up to the utility of machine
learning so late in the game. Here is an opportunity to try predictive analytics in identifying
the employees most likely to get promoted.

Table of Contents

1. What is the Naive Bayes algorithm?
2. How does the Naive Bayes algorithm work?
3. What are the pros and cons of using Naive Bayes?
4. 4 applications of the Naive Bayes algorithm
5. Steps to build a basic Naive Bayes model in Python
6. Tips to improve the power of the Naive Bayes model

What is the Naive Bayes algorithm?

It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or upon the existence
of the other features, all of these properties independently contribute to the probability
that this fruit is an apple and that is why it is known as ‘Naive’.

The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods in some settings.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x)
and P(x|c). Look at the equation below:
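$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$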
Above,

 P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, which is the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.

How does the Naive Bayes algorithm work?

Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable 'Play' (suggesting the possibility of playing). Now, we need to classify whether players will play or not based on the weather conditions. Let's follow the steps below to perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create a likelihood table by finding the probabilities, for example Overcast probability = 0.29 and probability of playing = 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method of posterior probability discussed above.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability, so the prediction is Yes.
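As a quick check, the same arithmetic in Python, using the counts quoted above:

# 9 'Yes' days out of 14, 5 sunny days out of 14,
# and 3 of the 9 'Yes' days were sunny
p_sunny_given_yes = 3 / 9
p_yes = 9 / 14
p_sunny = 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6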

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

What are the Pros and Cons of Naive Bayes?

Pros:

 It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
 It performs well in the case of categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Cons:

 If a categorical variable has a category (in the test data set) which was not observed in the training data set, then the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
 On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
 Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.

4 Applications of Naive Bayes Algorithms

 Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used for making predictions in real time.
 Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
 Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms there. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
 Recommendation System: A Naive Bayes classifier and Collaborative Filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

How to build a basic model using Naive Bayes in Python and R?

Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python. There are three types of Naive Bayes models under the scikit-learn library:

 Gaussian: It is used in classification and it assumes that features follow a normal distribution.

 Multinomial: It is used for discrete counts. For example, in a text classification problem, this goes one step further than Bernoulli trials: instead of "word occurring in the document", we record "how often the word occurs in the document"; you can think of it as "the number of times outcome x_i is observed over the n trials".

 Bernoulli: The binomial model is useful if your feature vectors are binary (i.e.
zeros and ones). One application would be text classification with ‘bag of words’
model where the 1s & 0s are “word occurs in the document” and “word does not
occur in the document” respectively.

R Code:

require(e1071) # Holds the Naive Bayes classifier

Train <- read.csv(file.choose())
Test <- read.csv(file.choose())

# Make sure the target variable belongs to a two-class classification problem only
levels(Train$Item_Fat_Content)

model <- naiveBayes(Item_Fat_Content ~ ., data = Train)
class(model)

pred <- predict(model, Test)
table(pred)
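Since the heading promises Python as well, here is a rough scikit-learn equivalent of the R snippet; the file names and the target column are assumptions that mirror the R code, not files shipped with this article:

import pandas as pd
from sklearn.naive_bayes import GaussianNB  # or MultinomialNB / BernoulliNB

# Assumed file names and target column, mirroring the R snippet above
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = pd.get_dummies(train.drop(columns=["Item_Fat_Content"]))
y_train = train["Item_Fat_Content"]
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = GaussianNB()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(pd.Series(pred).value_counts())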

Above, we looked at the basic Naive Bayes model. You can improve the power of this basic model by tuning its parameters and handling its assumptions intelligently. Let's look at some methods to improve the performance of a Naive Bayes model. I'd also recommend going through this document for more details on text classification using Naive Bayes.

Tips to improve the power of Naive Bayes Model


Here are some tips for improving the power of a Naive Bayes model:

 If continuous features do not have a normal distribution, we should use a transformation or other methods to convert them to a normal distribution.
 If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction to predict the class of the test data set.
 Remove correlated features, as highly correlated features are effectively voted twice in the model, which can inflate their importance.
 Naive Bayes classifiers have limited options for parameter tuning, like alpha=1 for smoothing, fit_prior=[True|False] to learn class prior probabilities or not, and some other options (look at the details here). I would recommend focusing on your data pre-processing and feature selection.
 You might think of applying some classifier combination technique like ensembling, bagging or boosting, but these methods would not help. Actually, "ensembling, boosting, bagging" won't help since their purpose is to reduce variance. Naive Bayes has no variance to minimize.

End Notes

In this article, we looked at one of the supervised machine learning algorithms, "Naive Bayes", which is mainly used for classification. Congrats, if you've thoroughly read and understood this article, you've already taken your first step to mastering this algorithm. From here, all you need is practice.

Further, I would suggest you focus more on data pre-processing and feature selection prior to applying the Naive Bayes algorithm. In a future post, I will discuss text and document classification using Naive Bayes in more detail.
Intuitive Guide to Latent Dirichlet Allocation

Topic modelling refers to the task of identifying topics that best describe a set of documents. These topics only emerge during the topic modelling process (and are therefore called latent). One popular topic modelling technique is Latent Dirichlet Allocation (LDA). Though the name is a mouthful, the concept behind it is very simple.

Briefly, LDA imagines a fixed set of topics. Each topic represents a set of words. The goal of LDA is to map all the documents to the topics in such a way that the words in each document are mostly captured by those imaginary topics. We will go through this method systematically, and by the end you will be comfortable enough to use it on your own.

This is the fourth blog post in the series of light on math machine learning A-Z. You can find the
previous blog posts linked to the letter below.

A B C D* E F G H I J K L M N O P Q R S T U V W X Y Z

*denotes articles behind Medium Paywall.

Why topic modeling?

What are some real-world uses of topic modelling? Historians can use LDA to identify important events in history by analysing text based on year. Web-based libraries can use LDA to recommend books based on your past readings. News providers can use topic modelling to understand articles quickly or cluster similar articles. Another interesting application is unsupervised clustering of images, where each image is treated similarly to a document.

What’s unique about this article? Is this another fish in the sea?

The short answer is a big NO! I've swept through many different articles out there, and there are many great articles/videos giving the intuition. However, most of them stop at answering questions like:
 What is the intuition behind LDA?

 What is a Dirichlet distribution?

I do talk about these, but I don't believe we should stop there. The way these models are trained is a key component I find missing in many of the articles I read. So I try to answer a few more questions, like:

 What is the mathematical entity we’re interested in solving for?

 How do we solve for that?

What is the big idea behind LDA?

Once you understand the big idea, I think it helps you to understand why the mechanics of LDA are the way they are. So here goes:

Each document can be described by a distribution of topics, and each topic can be described by a distribution of words.

But why do we use this idea? Let's imagine it through an example.

LDA in layman’s terms

Say you have a set of 1000 words (i.e. the 1000 most common words found in all the documents) and you have 1000 documents. Assume that each document, on average, has 500 of these words appearing in it. How can you understand what category each document belongs to? One way is to connect each document to each word by a thread based on the word's appearance in the document. Something like below.
Modeling documents just with words. You can see that we can’t really infer any useful
information due to the large amount of connections

Then, when you see that some documents are connected to the same set of words, you know they discuss the same topic. You can then read one of those documents and know what all these documents talk about. But to do this you don't have enough thread. You're going to need around 500*1000 = 500,000 threads for that. But we are living in 2100 and we have exhausted all the resources for manufacturing threads, so they are very expensive and you can only afford 10,000 threads. How can you solve this problem?

Go deeper to reduce the threads!

We can solve this problem by introducing a latent (i.e. hidden) layer. Say we know 10 topics/themes that occur throughout the documents. But these topics are not observed; we only observe words and documents, thus the topics are latent. We want to utilise this information to cut down on the number of threads. Then what you can do is connect the words to the topics depending on how well each word falls within a topic, and then connect the topics to the documents based on which topics each document touches upon.

Now say each document has around 5 topics and each topic relates to 500 words. That is, we need 1000*5 = 5,000 threads to connect documents to topics and 10*500 = 5,000 threads to connect topics to words, adding up to 10,000.

Words are modeled by a set of topics and documents are modeled by a set of topics. The relationships are clearer than in the first example because there are far fewer connections.

Note: The topics I use here ("Animals", "Sports", "Tech") are imaginary. In the real solution, you won't have such named topics but something like (0.3*Cats, 0.4*Dogs, 0.2*Loyal, 0.1*Evil) representing the topic "Animals". That is, as mentioned before, each topic is a distribution over words.

A different view: how does LDA imagine documents are generated?

To give more context to what's going on, LDA assumes the following generative process is behind any document you see. For simplicity, let us assume we are generating a single document with 5 words, but the same process generalises to M documents with N words each. The caption explains pretty well what's going on here, so I won't reiterate.

How a document is generated. First, α (alpha) organises the ground θ (theta), and then you go and pick a ball from θ. Based on what you pick, you're sent to ground β (beta), which is organised by η (eta). Now you pick a word from β and put it into the document. You iterate this process 5 times to get 5 words out.

This image is a depiction of what an already-learnt LDA system looks like. But to arrive at this
stage you have to answer several questions, such as:

 How do we know how many topics there are in the documents?
 You see that we already have a nice structure in the grounds that helps us generate sensible documents, because the organisers have ensured the proper design of the grounds. How do we find such good organisers?

These questions will be answered in the next few sections. Additionally, we're going to get a bit technical from this point onwards. So buckle up!

Note: LDA does not care about the order of the words in a document. Usually, LDA uses the bag-of-words feature representation to represent a document. It makes sense because, if I take a document, jumble the words and give it to you, you can still guess what sort of topics are discussed in the document.

Getting a bit mathematical …

Before diving into the details, let's get a few things across, like notations and definitions.

Definitions and notations

 k — Number of topics a document belongs to (a fixed number)
 V — Size of the vocabulary
 M — Number of documents
 N — Number of words in each document
 w — A word in a document, represented as a one-hot encoded vector of size V (i.e. V is the vocabulary size)
 w (bold w) — A document, i.e. a vector of N words
 D — The corpus, a collection of M documents
 z — A topic from a set of k topics. A topic is a distribution over words; for example, it might be Animal = (0.3 Cats, 0.4 Dogs, 0 AI, 0.2 Loyal, 0.1 Evil)

Defining document generation more mathematically

First, let's put the ground-based example about generating documents above into a proper mathematical drawing.
Graphical model of LDA. Here I mark the shapes of all the possible variables (both observed and hidden). But remember that θ, z, and β are distributions, not deterministic values.

Let's decipher what this is saying. We have a single α value (i.e. the organiser of ground θ) which defines what θ, the topic distribution for documents, is going to look like. We have M documents and get some θ distribution for each such document. Now, to understand things more clearly, squint your eyes and make that M plate disappear (assuming there's only a single document), woosh!

Now that single document has N words and each word is generated by a topic. You
generate N topics to be filled in with words. These N words are still placeholders.

Now the top plate kicks in. Based on η, β has some distribution (a Dirichlet distribution to be precise, discussed soon), and according to that distribution, β provides a word distribution for each of the k topics. Now you fill in a word for each placeholder (in the set of N placeholders), conditioned on the topic it represents.

Voilà, you now have a document with N words!


Why are α and η constant?

α and η are shown as constants in the image above, but it is actually more complex than that. For example, α has a topic distribution parameter for each document (a θ ground for each document), ideally a matrix of shape (M x k). And η has a parameter vector for each topic, so η will be of shape (k x V). In the above drawing, the constants actually represent matrices, formed by replicating the single value into every cell of the matrix.

Let’s understand θ and β in more detail

θ is a random matrix, where θ(i,j) represents the probability of the i-th document containing words belonging to the j-th topic. If you take a look at what ground θ looks like in the example above, you can see that the balls are nicely laid out in the corners and not so much in the middle. The advantage of having such a property is that the words we produce are likely to belong to a single topic, as is normally the case with real-world documents. This property arises from modelling θ as a Dirichlet distribution. Similarly, β(i,j) represents the probability of the i-th topic containing the j-th word, and β is also a Dirichlet distribution. Below, I'm providing a quick detour to understand the Dirichlet distribution.

Quick detour: Understanding the Dirichlet distribution

The Dirichlet distribution is the multivariate generalisation of the Beta distribution. Here we discuss an example of a 3-dimensional problem, where we have 3 parameters in α that affect the shape of θ (i.e. the distribution). For an N-dimensional Dirichlet distribution you have an N-length vector as α. You can see how the shape of θ changes with different α values. For example, you can see how the top middle plot shows a similar shape to the θ ground.

The main take-away is as follows:

Large α values push the distribution to the middle of the triangle, whereas smaller α values push the distribution to the corners.
How the distribution of θ changes with different α values
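A quick way to build intuition for this with NumPy (the α vectors below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# Small alphas concentrate samples near the corners (sparse topic mixtures);
# large alphas concentrate samples near the middle (even topic mixtures).
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
    samples = rng.dirichlet(alpha, size=5)
    print(alpha)
    print(samples.round(2))  # each row sums to 1: one topic mixture per sample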

How do we learn the LDA?

We still haven't answered the real problem: how do we find the exact α and η values? Before that, let me list the latent (hidden) variables we need to find.

 α — Distribution-related parameter that governs what the distribution of topics looks like for all the documents in the corpus
 θ — Random matrix where θ(i,j) represents the probability of the i-th document containing the j-th topic
 η — Distribution-related parameter that governs what the distribution of words in each topic looks like
 β — Random matrix where β(i,j) represents the probability of the i-th topic containing the j-th word

Formulating what we need to learn

If I were to mathematically state what I'm interested in finding, it is as below:
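In the notation defined above, the quantity of interest is the posterior

$$p(\theta, \mathbf{z}, \beta \mid D; \alpha, \eta) = \frac{p(\theta, \mathbf{z}, \beta, D; \alpha, \eta)}{p(D; \alpha, \eta)}$$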

It looks scary but contains a simple message. This is basically saying,

I have a set of M documents, each document having N words, where each word is generated by a
single topic from a set of K topics. I’m looking for the joint posterior probability of:

 θ — A distribution of topics, one for each document,

 z — N Topics for each document,

 β — A distribution of words, one for each topic,

given,

 D — All the data we have (i.e. the corpus),

and using parameters,

 α — A parameter vector for each document (document — Topic distribution)

 η — A parameter vector for each topic (topic — word distribution)

But we cannot calculate this nicely, as this entity is intractable. So how do we solve it?

How do I solve this? Variational inference to the rescue

There are many ways to solve this. But for this article, I’m going to focus on variational inference.
The probability we discussed above is a very messy intractable posterior (meaning we cannot
calculate that on paper and have nice equations). So we’re going to approximate that with some
known probability distribution that closely matches the true posterior. That’s the idea behind
variational inference.
The way to do this is to minimise the KL divergence between the approximation and true
posterior as an optimisation problem. Again I’m not going to swim through the details as this is
out of scope.

But we'll take a quick look at the optimisation problem:
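$$(\gamma^*, \phi^*, \lambda^*) = \arg\min_{\gamma, \phi, \lambda} D\big(q(\theta, \mathbf{z}, \beta \mid \gamma, \phi, \lambda) \,\|\, p(\theta, \mathbf{z}, \beta \mid D; \alpha, \eta)\big)$$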

γ , ϕ and λ represent the free variational parameters we approximate θ,z and β with, respectively.
Here D(q||p) represents the KL divergence between q and p. And by changing γ,ϕ and λ, we get
different q distributions having different distances from the true posterior p. Our goal is to find the
γ* , ϕ* and λ* that minimise the KL divergence between the approximation q and the true
posterior p.

With everything nicely defined, it’s just a matter of iteratively solving the above optimisation
problem until the solution converges. Once you have γ* , ϕ* and λ* you have everything you need
in the final LDA model.
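In practice you rarely run this optimisation by hand; as a rough sketch, scikit-learn's LatentDirichletAllocation (which is fitted with variational inference) can be used on a toy corpus like this (the documents below are invented for illustration):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = [
    "cats and dogs are loyal animals",
    "dogs chase cats in the park",
    "the football team won the match",
    "the tennis match was exciting",
]

# Bag-of-words counts (LDA ignores word order, as noted earlier)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures (theta-like)
print(doc_topics.round(2))
print(lda.components_.shape)        # per-topic word weights (beta-like)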

Wrap up

In this article we discussed Latent Dirichlet Allocation (LDA). LDA is a powerful method that allows us to identify topics within documents and map documents to those topics. LDA has many uses, such as recommending books to customers.

We looked at how LDA works with an example of connecting threads. Then we saw a different perspective based on how LDA imagines a document is generated. Finally, we went into the training of the model. Here we discussed a significant amount of the mathematics behind LDA, while keeping the math light. We took a look at what a Dirichlet distribution looks like, what probability distribution we're interested in finding (i.e. the posterior), and how we solve for it using variational inference.

I will post a tutorial on how to use LDA for topic modelling, including some cool analysis, in a separate article. Cheers.
Naive Bayes Explained

Naive Bayes is a probabilistic algorithm that's typically used for classification problems. Naive Bayes is simple, intuitive, and yet performs surprisingly well in many cases. For example, the spam filters that email apps use are built on Naive Bayes. In this article, I'll explain the rationale behind Naive Bayes and build a spam filter in Python. (For simplicity, I'll focus on binary classification problems.)

Theory

Before we get started, please memorize the notations used in this article:

Basic Idea

To make classifications, we need to use X to predict Y. In other words, given a data point X = (x1, x2, …, xn), what are the odds of Y being y? This can be rewritten as the following equation:
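$$P(Y = y \mid X = (x_1, x_2, \dots, x_n))$$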

This is the basic idea of Naive Bayes; the rest of the algorithm is really about how to calculate the conditional probability above.
Bayes Theorem

So far Mr. Bayes has no contribution to the algorithm. Now is his time to shine.
According to the Bayes Theorem:

Bayes Theorem
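$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$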

This is a rather simple transformation, but it bridges the gap between what we want to do
and what we can do. We can’t get P(Y|X) directly, but we can get P(X|Y) and P(Y) from
the training data. Here’s an example:

Weather dataset, from the University of Edinburgh

In this case, X =(Outlook, Temperature, Humidity, Windy), and Y=Play. P(X|Y) and P(Y)
can be calculated:
Example of finding P(Y) and P(X|Y)

Naive Bayes Assumption and Why

Theoretically, it is not hard to find P(X|Y). However, it is much harder in reality as the
number of features grows.

7 parameters are needed for a 2-feature binary dataset

Estimating the joint distribution requires more data

Having this many parameters in the model is impractical. To solve this problem, a naive assumption is made: we pretend all features are independent. What does this mean?

Conditional independence

Now, with the help of this naive assumption (naive because features are rarely independent), we can make classifications with far fewer parameters:

Naive Bayes Classifier


Naive Bayes needs fewer parameters (4 in this case)

This is a big deal. We changed the number of parameters from exponential to linear. This
means that Naive Bayes handles high-dimensional data well.
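Written out, the naive assumption lets the likelihood factorise, so the classifier picks the label with the largest product:

$$P(X \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y), \qquad \hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(x_i \mid Y = y)$$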

Categorical And Continuous Features

Categorical Data

For categorical features, the estimation of P(Xi|Y) is easy.

Calculate the likelihood of categorical features

However, one issue is that if some feature values never show up (perhaps due to a lack of data), their likelihood will be zero, which makes the whole posterior probability zero. One simple way to fix this problem is called the Laplace estimator: add imaginary samples (usually one) to each category.
Laplace Estimator
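With add-one (Laplace) smoothing, the categorical likelihood estimate becomes

$$P(x_i = v \mid Y = y) = \frac{\operatorname{count}(x_i = v,\; Y = y) + 1}{\operatorname{count}(Y = y) + K}$$

where K is the number of possible values of the feature x_i.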

Continuous Data

For continuous features, there are essentially two choices: discretization and continuous
Naive Bayes.

Discretization works by breaking the data into categorical values. The simplest
discretization is uniform binning, which creates bins with fixed range. There are, of
course, smarter and more complicated ways such as Recursive minimal entropy
partitioning or SOM based partitioning.

Discretizing Continuous Feature for Naive Bayes

The second option is to utilize known distributions. If the features are continuous, the Naive Bayes algorithm can be written as:

f is the probability density function

For instance, if we visualize the data and see a bell-curve-like distribution, it is fair to
make an assumption that the feature is normally distributed

The first step is calculating the mean and variance of the feature for a given label y:
variance adjusted by the degree of freedom

Now we can calculate the probability density f(x):
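Assuming the feature is normally distributed for a given label y, the density is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right)$$

where μ_y and σ_y² are the mean and variance computed in the previous step.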

There are, of course, other distributions:

Naive Bayes, from Scikit-Learn

Although these methods vary in form, the core idea behind them is the same: assume the feature satisfies a certain distribution, estimate the parameters of that distribution, and then get the probability density function.

Strength and Weakness

Strength

1. Even though the naive assumption is rarely true, the algorithm performs surprisingly well in many cases

2. Handles high dimensional data well. Easy to parallelize and handles big data well

3. Performs better than more complicated models when the data set is small

Weakness

1. The estimated probability is often inaccurate because of the naive assumption. Not ideal for regression use or probability estimation

2. When data is abundant, other more complicated models tend to outperform Naive
Bayes

Summary

Naive Bayes utilizes the most fundamental probability knowledge and makes a naive
assumption that all features are independent. Despite the simplicity (some may say
oversimplification), Naive Bayes gives a decent performance in many applications.

Now that you understand how Naive Bayes works, it is time to try it in real projects!
A beginner’s guide to dimensionality reduction in Machine Learning

This is my first article on medium. Here, I’ll be giving a quick overview of what
dimensionality reduction is, why we need it and how to do it.

What is Dimensionality Reduction?

Dimensionality reduction is simply the process of reducing the dimension of your feature set. Your feature set could be a dataset with a hundred columns (i.e. features), or it could be an array of points that make up a large sphere in three-dimensional space. Dimensionality reduction is bringing the number of columns down to, say, twenty, or converting the sphere to a circle in two-dimensional space.

That is all well and good but why should we care? Why would we drop 80 columns off
our dataset when we could straight up feed it to our machine learning algorithm and let it
do the rest?

The Curse of Dimensionality

We care because the curse of dimensionality demands that we do. The curse of
dimensionality refers to all the problems that arise when working with data in the higher
dimensions, that did not exist in the lower dimensions.

As the number of features increases, the number of samples we need also increases. The more features we have, the more samples we will need to have all combinations of feature values well represented in our sample.
The Curse of Dimensionality

As the number of features increases, the model becomes more complex, and the more features there are, the greater the chance of overfitting. A machine learning model that is trained on a large number of features becomes increasingly dependent on the data it was trained on and, in turn, overfitted, resulting in poor performance on real data and defeating the purpose.

Avoiding overfitting is a major motivation for performing dimensionality reduction. The fewer features our training data has, the fewer assumptions our model makes and the simpler it will be. But that is not all; dimensionality reduction has a lot more advantages to offer, like:

1. Less misleading data means model accuracy improves.

2. Fewer dimensions mean less computing. Less data means that algorithms train faster.

3. Less data means less storage space required.

4. Fewer dimensions allow the use of algorithms unfit for a large number of dimensions.

5. Removes redundant features and noise.

Feature Selection and Feature Engineering for dimensionality reduction

Dimensionality reduction can be done by both feature selection methods and feature engineering methods.

Feature selection is the process of identifying and selecting relevant features for your
sample. Feature engineering is manually generating new features from existing features,
by applying some transformation or performing some operation on them.

Feature selection can be done either manually or programmatically. For example, suppose you are trying to build a model which predicts people's weights and you have collected a large corpus of data which describes each person quite thoroughly. If you had a column that described the color of each person's clothing, would that be much help in predicting their weight? I think we can safely agree it won't be. This is something we can drop without further ado. What about a column that described their heights? That's a definite yes. We can make these simple manual feature selections and reduce the dimensionality when the relevance or irrelevance of certain features is obvious or common knowledge. And when it's not glaringly obvious, there are a lot of tools we can employ to aid our feature selection.

1. Heatmaps that show the correlation between features are a good idea.

2. So is just visualising the relationship between the features and the target variable by plotting each feature against the target variable.

Now let us look at a few programmatic methods for feature selection from the popular machine learning library scikit-learn, namely,

1. Variance Threshold and

2. Univariate selection.

Variance Threshold is a baseline approach to feature selection. As the name suggests, it drops all features where the variance along the column does not exceed a threshold value. The premise is that a feature which doesn't vary much within itself has very little predictive power.
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
>>> selector = VarianceThreshold()  # default threshold of 0 drops constant columns
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
Univariate Feature Selection uses statistical tests to select features. Univariate describes
a type of data which consists of observations on only a single characteristic or attribute.
Univariate feature selection examines each feature individually to determine the strength
of the relationship of the feature with the response variable. Some examples of statistical
tests that can be used to evaluate feature relevance are Pearson Correlation, Maximal
information coefficient, Distance correlation, ANOVA and Chi-square. Chi-square is used
to find the relationship between categorical variables and Anova is preferred when the
variables are continuous.

Scikit-learn exposes feature selection routines like SelectKBest, SelectPercentile or GenericUnivariateSelect as objects that implement a transform method based on the scores of ANOVA, chi2 or mutual information. Sklearn offers f_regression and mutual_info_regression as the scoring functions for regression, and f_classif and mutual_info_classif for classification.
The F-test checks for and only captures linear relationships between features and labels. A highly correlated feature is given a higher score and less correlated features are given lower scores. Correlation can be deceptive, as it doesn't capture strong non-linear relationships. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
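As a small illustrative sketch, using scikit-learn's built-in iris data (chosen here purely for convenience):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4)

# Keep the 2 features with the highest ANOVA F-scores against the class label
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)       # (150, 2)
print(selector.scores_)  # per-feature F-scores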

Feature selection is the simplest of dimensionality reduction methods. We will look at a few feature engineering methods for dimensionality reduction later.

Linear Dimensionality Reduction Methods

The most common and well known dimensionality reduction methods are the ones that
apply linear transformations, like

1. PCA (Principal Component Analysis): Popularly used for dimensionality reduction in continuous data, PCA rotates and projects data along the direction of increasing variance. The features with the maximum variance are the principal components.

2. Factor Analysis: a technique used to reduce a large number of variables into a smaller number of factors. The values of the observed data are expressed as functions of a number of possible causes in order to find which are the most important. The observations are assumed to be caused by a linear transformation of lower-dimensional latent factors plus added Gaussian noise.

3. LDA (Linear Discriminant Analysis): projects data in a way that maximises class separability. Examples from the same class are put closely together by the projection; examples from different classes are placed far apart.

PCA orients data along the direction of the component with maximum variance whereas
LDA projects the data to signify the class separability
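For instance, a minimal PCA sketch with scikit-learn (the iris data and the choice of 2 components are just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional iris features onto the 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component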

Non-linear Dimensionality Reduction Methods

Non-linear transformation methods, or manifold learning methods, are used when the data doesn't lie on a linear subspace. They are based on the manifold hypothesis, which says that in a high-dimensional structure, the most relevant information is concentrated in a small number of low-dimensional manifolds. If a linear subspace is a flat sheet of paper, then a rolled-up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll, a canonical problem in the field of non-linear dimensionality reduction. Some popular manifold learning methods are:

1. Multi-dimensional scaling (MDS): A technique used for analyzing the similarity or dissimilarity of data as distances in a geometric space. It projects data to a lower dimension such that data points that are close to each other (in terms of Euclidean distance) in the higher dimension are close in the lower dimension as well.

2. Isometric Feature Mapping (Isomap): Projects data to a lower dimension while preserving the geodesic distance (rather than the Euclidean distance as in MDS). Geodesic distance is the shortest distance between two points on a curve.

3. Locally Linear Embedding (LLE): Recovers global non-linear structure from linear fits. Each local patch of the manifold can be written as a linear, weighted sum of its neighbours, given enough data.

4. Hessian Eigenmapping (HLLE): Projects data to a lower dimension while preserving the local neighbourhood like LLE, but uses the Hessian operator to better achieve this result, hence the name.

5. Spectral Embedding (Laplacian Eigenmaps): Uses spectral techniques to perform dimensionality reduction by mapping nearby inputs to nearby outputs. It preserves locality rather than local linearity.

6. t-distributed Stochastic Neighbor Embedding (t-SNE): Computes the probability that pairs of data points in the high-dimensional space are related and then chooses a low-dimensional embedding which produces a similar distribution.

Shows the resulting projection from applying different manifold learning methods on a
3D S-Curve
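A rough sketch of that kind of comparison, using scikit-learn's S-curve generator and two of the methods above:

from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, TSNE

# 3D S-curve toy data
X, color = make_s_curve(n_samples=1000, random_state=0)

# Unroll the curve into 2 dimensions with two different manifold learners
X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_isomap.shape, X_tsne.shape)  # (1000, 2) (1000, 2)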

Auto-encoders

Another popular dimensionality reduction method that gives spectacular results is auto-encoders, a type of artificial neural network that aims to copy its inputs to its outputs. An autoencoder compresses the input into a latent-space representation and then reconstructs the output from this representation. It is composed of two parts:

1. Encoder: compresses the input into a latent-space representation.

2. Decoder: reconstructs the input from the latent-space representation.
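As a minimal sketch of this idea in Keras (the layer sizes and the random data are placeholders, not a recommended architecture):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 100, 20  # placeholder sizes

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)      # encoder
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim)          # random stand-in data
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
Z = encoder.predict(X)                       # compressed representation
print(Z.shape)  # (1000, 20)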


In subsequent posts, let us look more deeply into linear and non-linear dimensionality
reduction methods.
