ML Algo
In this post, I will explain Logistic Regression in simple terms. It could be considered a
Logistic Regression for dummies post, although I’ve never really liked that expression.
In short, Logistic Regression is a parametric classification model: it has a certain fixed number
of parameters that depends on the number of input features, and it outputs a categorical
prediction, such as whether a plant belongs to a certain species or not.
In reality, the theory behind Logistic Regression is very similar to that of Linear
Regression, so if you don’t know what Linear Regression is, it is worth taking five minutes
to read an introductory guide first.
In Logistic Regression, we don’t directly fit a straight line to our data like in linear
regression. Instead, we fit an S-shaped curve, called the sigmoid, to our observations.
First of all, as mentioned before, Logistic Regression models are classification models;
specifically binary classification models: they can only be used to distinguish between two
different categories, such as whether a person is obese or not given their weight, or whether a house
is big or small given its size. This means that our data has two kinds of
observations (Category 1 and Category 2 observations), as we can see in the
figure.
If we wanted to predict whether a person is obese or not given their weight, we would first
compute a weighted sum of their weight (sorry for the lexical redundancy) and then
input this into the sigmoid function, which squashes any real number into a value between 0 and 1:
sigmoid(z) = 1 / (1 + e^(-z))
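A minimal Python sketch of that computation (the names b0 and b1 are placeholders for the model's bias and weight coefficient, not values from this article):

import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def obesity_probability(weight_kg, b0, b1):
    # weighted sum of the single input feature, then the sigmoid
    return sigmoid(b0 + b1 * weight_kg)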
Alright, this looks cool and all, but isn’t this meant to be a Machine Learning
model? How do we train it? That is a good question. There are multiple ways to train
a Logistic Regression model (fit the S-shaped curve to our data). We can use an iterative
optimisation algorithm like Gradient Descent to calculate the parameters of the model
(the weights), or we can use probabilistic methods like Maximum Likelihood.
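To make the gradient descent option concrete, here is a rough NumPy sketch that fits the two parameters by minimising the log-loss (the learning rate and iteration count are arbitrary placeholders, and in practice you would standardise the inputs first):

import numpy as np

def fit_logistic_regression(x, y, lr=0.001, n_iters=5000):
    # x: feature values (e.g. weights in kg), y: labels (1 = obese, 0 = not obese)
    b0, b1 = 0.0, 0.0                              # start from zero parameters
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))   # current predicted probabilities
        b0 -= lr * np.mean(p - y)                  # gradient of the log-loss w.r.t. b0
        b1 -= lr * np.mean((p - y) * x)            # gradient of the log-loss w.r.t. b1
    return b0, b1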
If you don’t know what any of these are, Gradient Descent was explained in the Linear
Regression post, and an explanation of Maximum Likelihood for Machine Learning can
be found here:
Probability Learning III: Maximum Likelihood (towardsdatascience.com)
Once we have used one of these methods to train our model, we are ready to make some
predictions. Let's see an example of how the process of training a Logistic Regression
model and using it to make predictions would go:
1. First, we would collect a dataset of patients who have and who have not been
diagnosed as obese, along with their corresponding weights.
2. After this, we would train our model to fit our S-shaped curve to the data and
obtain the parameters of the model. After training using Maximum Likelihood, we
got the following parameters:
3. Now, we are ready to make some predictions: imagine we got two patients; one is
120 kg and one is 60 kg. Let's see what happens when we plug these numbers into the
model:
Results of using the fitted model to predict obesity given patient weight
As we can see, the 60 kg patient has a very low probability of being obese, while
the 120 kg patient has a very high one.
In the previous figure, we can see the results given by the Logistic Regression model for
the discussed examples. Now, given the weight of any patient, we could calculate their
probability of being obese, and give our doctors a quick first round of information!
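As a quick sanity check on those two predictions, here is the sigmoid evaluated with made-up parameter values (b0 = -10 and b1 = 0.1 are chosen only for illustration; they are not the parameters actually fitted in the article):

import numpy as np

def predict_obesity_proba(weight_kg, b0=-10.0, b1=0.1):
    # b0 and b1 are hypothetical values chosen only to illustrate the shape of the curve
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * weight_kg)))

print(predict_obesity_proba(60))    # roughly 0.02 -> very low probability of obesity
print(predict_obesity_proba(120))   # roughly 0.88 -> very high probability of obesity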
6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R
Introduction
You are working on a classification problem and have generated your set of hypotheses,
created features, and discussed the importance of variables. Within an hour,
stakeholders want to see the first cut of the model.
What will you do? You have hundreds of thousands of data points and quite a few
variables in your training data set. In such a situation, if I were in your place, I would
use Naive Bayes, which can be extremely fast relative to other classification
algorithms. It works on Bayes’ theorem of probability to predict the class of unknown
data sets.
In this article, I’ll explain the basics of this algorithm, so that the next time you come
across a large data set, you can bring this algorithm into action. In addition, if you are
a newbie in Python or R, you should not be overwhelmed by the code available
in this article.
Problem Statement
HR analytics is revolutionising the way human resources departments operate. However, the
collection, processing, and analysis of data have been largely manual, and given the nature of
human resources dynamics and HR KPIs, that approach has been constraining HR. It is therefore
surprising that HR departments woke up to the utility of machine learning so late in the game.
Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted.
What is the Naive Bayes algorithm?
Naive Bayes is a classification technique based on Bayes’ theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or upon the existence
of the other features, all of these properties independently contribute to the probability
that this fruit is an apple, and that is why it is known as ‘Naive’.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). Look at the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
Above, P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior
probability of the class, P(x|c) is the likelihood (the probability of the predictor given the
class), and P(x) is the prior probability of the predictor.
Let’s understand it using an example. Below I have a training data set of weather and the
corresponding target variable ‘Play’ (suggesting the possibility of playing). Now, we need
to classify whether players will play or not based on the weather conditions. Let’s follow the
steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, for example the probability of
Overcast = 0.29 and the probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the posterior probability discussed above: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny).
Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes|Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability
(compared with the probability of not playing), so on a sunny day players are predicted to play.
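The same arithmetic in a few lines of Python, with the counts read straight from the frequency table:

# 14 days in total, 9 of them "Yes", 5 sunny days, 3 of which were sunny and "Yes"
p_sunny_given_yes = 3 / 9      # 0.33
p_sunny = 5 / 14               # 0.36
p_yes = 9 / 14                 # 0.64

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6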
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in
multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs
better compared to other models like logistic regression, and you need less
training data.
It performs well with categorical input variables compared to numerical
variables. For numerical variables, a normal distribution is assumed (a bell curve,
which is a strong assumption).
Cons:
If a categorical variable has a category in the test data set that was not observed
in the training data set, the model will assign it zero probability and will be
unable to make a prediction. This is often known as the “Zero Frequency” problem. To
solve it, we can use a smoothing technique; one of the simplest is Laplace
estimation (see the example after this list).
On the other hand, Naive Bayes is also known to be a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In
real life, it is almost impossible to get a set of predictors that are
completely independent.
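To see the zero-frequency problem and Laplace smoothing in numbers, here is a tiny sketch (the counts are made up for illustration):

def smoothed_likelihood(count_value_and_class, count_class, n_categories, alpha=1):
    # alpha = 0 gives the raw estimate; alpha = 1 is Laplace (add-one) smoothing
    return (count_value_and_class + alpha) / (count_class + alpha * n_categories)

# a category value never seen with this class in the training data:
print(smoothed_likelihood(0, 9, 3, alpha=0))   # 0.0 -> the whole posterior collapses to zero
print(smoothed_likelihood(0, 9, 3, alpha=1))   # ~0.083 -> small but non-zero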
Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python.
There are three types of Naive Bayes model under the scikit-learn library (a usage sketch
follows the list):
Gaussian: It is used in classification and assumes that the (numerical) features follow a
normal distribution.
Multinomial: It is used for discrete counts. For example, in a text classification
problem, instead of the Bernoulli-style “word occurs in the document or not”, we
count how often a word occurs in the document; you can think of it as “the number of
times outcome number x_i is observed over the n trials”.
Bernoulli: The binomial model is useful if your feature vectors are binary (i.e.
zeros and ones). One application would be text classification with a ‘bag of words’
model where the 1s and 0s are “word occurs in the document” and “word does not
occur in the document” respectively.
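Python Code: a minimal scikit-learn sketch (the toy arrays below are placeholders; swap in your own training data):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])   # toy numeric features
y = np.array([0, 0, 1, 1])                       # toy binary labels

model = GaussianNB()          # swap in MultinomialNB / BernoulliNB for count / binary features
model.fit(X, y)
print(model.predict([[1, 1], [9, 9]]))        # predicted classes
print(model.predict_proba([[1, 1], [9, 9]]))  # class probabilities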
R Code:
require(e1071)  # holds the Naive Bayes classifier
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())
# Make sure the target variable is of a two-class classification problem only
levels(Train$Item_Fat_Content)
model <- naiveBayes(Item_Fat_Content ~ ., data = Train)
class(model)
pred <- predict(model, Test)
table(pred)
Above, we looked at the basic Naive Bayes model. You can improve the power of this
basic model by tuning its parameters and handling its assumptions intelligently; I’d
recommend going through this document for more details on text classification using
Naive Bayes.
End Notes
Further, I would suggest you focus more on data pre-processing and feature selection
before applying the Naive Bayes algorithm. In a future post, I will discuss text and
document classification using Naive Bayes in more detail.
Intuitive Guide to Latent Dirichlet Allocation
Topic modelling refers to the task of identifying topics that best describe a set of documents.
These topics only emerge during the topic modelling process (and are therefore called latent).
One popular topic modelling technique is known as Latent Dirichlet Allocation (LDA). Though
the name is a mouthful, the concept behind it is very simple.
To put it briefly, LDA imagines a fixed set of topics. Each topic represents a set of words. The
goal of LDA is to map all the documents to the topics in such a way that the words in each
document are mostly captured by those imaginary topics. We will go through the method
systematically, and by the end you will be comfortable enough to use it on your own.
This is the fourth blog post in the light on math machine learning A-Z series.
What are some real-world uses of topic modelling? Historians can use LDA to identify
important events in history by analysing text grouped by year. Web-based libraries can use LDA
to recommend books based on your past readings. News providers can use topic modelling
to understand articles quickly or to cluster similar articles. Another interesting application
is unsupervised clustering of images, where each image is treated similarly to a document.
What’s unique about this article? Is this another fish in the sea?
The short answer is a big NO! I’ve been through many different articles out there, and there are
many great articles and videos giving the intuition. However, most of them stop at answering
questions like:
What is the intuition behind LDA?
I do talk about this, but I don’t believe we should stop there. The way these models are trained is
a key component I find missing in many of the articles I read. So I try to answer a few more
questions, such as how the model is actually trained.
Once you understand the big idea, I think it helps you to understand why the mechanics of LDA
are the way they are. So here goes:
Each document can be described by a distribution of topics and each topic can be described by a
distribution of words
But why do we use this idea? Let’s imagine it through an example.
Say you have a set of 1000 words (i.e. the 1000 most common words found in all the documents) and
you have 1000 documents. Assume that each document, on average, has 500 of these words
appearing in it. How can you understand what category each document belongs to? One way is
to connect each document to each word by a thread based on their appearance in the document.
Something like below.
Modeling documents just with words. You can see that we can’t really infer any useful
information due to the large number of connections.
Then, when you see that some documents are connected to the same set of words, you know they
discuss the same topic, and you can read one of those documents to know what all of those
documents talk about. But to do this you don’t have enough thread. You’re going to need around
500*1000 = 500,000 threads for that. But say we are living in 2100 and we have exhausted all the
resources for manufacturing threads, so they are very expensive and you can only afford 10,000
threads. How can you solve this problem?
We can solve this problem by introducing a latent (i.e. hidden) layer. Say we know of 10
topics/themes that occur throughout the documents. But these topics are not observed; we only
observe words and documents, thus topics are latent. We want to utilise this information to
cut down on the number of threads. What you can do then is connect the words to the topics
depending on how well each word falls under a topic, and then connect the topics to the documents
based on what topics each document touches upon.
Now say each document has around 5 topics and each topic relates to around 500 words.
That is, we need 1000*5 threads to connect documents to topics and 10*500 threads to connect
topics to words, adding up to just 10,000.
Documents are modeled by a set of topics and topics are modeled by a set of words. The
relationships are clearer than in the first example because there are far fewer connections.
Note: The topics I use here (“Animals”, “Sports”, “Tech”) are imaginary. In the real solution, you
won’t have such topics but rather something like (0.3*Cats, 0.4*Dogs, 0.2*Loyal, 0.1*Evil)
representing the topic “Animals”. That is, as mentioned before, each topic is a distribution over words.
To give more context to what’s going on, LDA assumes the following generative process is
behind any document you see. For simplicity, let us assume we are generating a single document
with 5 words, but the same process generalises to M documents with N words each. The
caption below explains what’s going on, so I won’t reiterate.
How a document is generated. First α (alpha) organises the ground θ (theta), and then you go and
pick a ball from θ. Based on what you pick, you’re sent to ground β (beta), which is organised by η
(eta). Now you pick a word from β and put it into the document. You iterate this process 5 times
to get 5 words out.
This image is a depiction of what an already-learnt LDA system looks like. But to arrive at this
stage you have to answer several questions, such as how the topic and word distributions are
found in the first place. These will be answered in the next few sections. Additionally, we’re going
to get a bit technical from this point onwards. So buckle up!
Note: LDA does not care about the order of the words in the document. Usually, LDA uses the
bag-of-words feature representation to represent a document. This makes sense because, if I take a
document, jumble the words, and give it to you, you can still guess what sort of topics are
discussed in it.
Before diving into the details, let’s get a few things across, like notations and definitions.
M — Number of documents
N — Number of words in a document
K — Number of topics
V — Size of the vocabulary
First, let’s turn the ground-based example about generating documents above into a proper
mathematical drawing.
Graphical model of LDA. Here I mark the shapes of all the possible variables (both
observed and hidden). But remember that θ, z, and β are distributions, not deterministic values.
Let’s decipher what this is saying. We have a single α value (i.e. the organiser of ground θ) which
defines what θ, the topic distribution for documents, is going to be like. We have M documents and
some θ distribution for each such document. Now, to understand things more clearly, squint your
eyes and make that M plate disappear (assuming there’s only a single document), woosh!
Now that single document has N words, and each word is generated by a topic. You
generate N topics to be filled in with words; these N words are still placeholders.
Now the top plate kicks in. Based on η, β has some distribution (a Dirichlet distribution, to be
precise) and, according to that distribution, β gives a word distribution for each of the k topics.
Now you fill in a word for each placeholder (in the set of N placeholders), conditioned
on the topic it represents.
α and η are shown as constants in the image above, but it is actually more complex than that. For
example, α has a topic distribution for each document (a θ ground for each document), i.e. a matrix
of shape (M x K). And η has a parameter vector for each topic, so η will be of shape (K x V). In the
drawing above, the constants actually represent matrices, formed by replicating the single
value to every cell of the matrix.
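To make the generative story concrete, here is a toy NumPy sketch of that process (the sizes K, V and N and the α, η values are arbitrary illustrations, not anything LDA has learnt):

import numpy as np

K, V, N = 10, 1000, 5            # topics, vocabulary size, words in this document
alpha, eta = 0.1, 0.01           # Dirichlet concentration parameters (toy values)

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.full(V, eta), size=K)   # one word distribution per topic (K x V)
theta = rng.dirichlet(np.full(K, alpha))        # topic distribution for this document (the theta "ground")

document = []
for _ in range(N):
    z = rng.choice(K, p=theta)       # pick a topic from theta
    w = rng.choice(V, p=beta[z])     # pick a word from that topic's word distribution
    document.append(w)
print(document)                      # five word ids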
We still haven’t answered the real problem: how do we know the exact α and η values? Before
that, let me list the latent (hidden) variables we need to find: θ (the per-document topic
distributions), z (the per-word topic assignments) and β (the per-topic word distributions).
I have a set of M documents, each document having N words, where each word is generated by a
single topic from a set of K topics. I’m looking for the joint posterior probability of θ, z and β,
given the words w and the parameters α and η, i.e. p(θ, z, β | w, α, η).
But we cannot calculate this nicely, as this posterior is intractable. So how do we solve it?
There are many ways to solve this. But for this article, I’m going to focus on variational inference.
The probability we discussed above is a very messy intractable posterior (meaning we cannot
calculate that on paper and have nice equations). So we’re going to approximate that with some
known probability distribution that closely matches the true posterior. That’s the idea behind
variational inference.
The way to do this is to minimise the KL divergence between the approximation and true
posterior as an optimisation problem. Again I’m not going to swim through the details as this is
out of scope.
γ, ϕ and λ represent the free variational parameters with which we approximate θ, z and β,
respectively, and D(q||p) denotes the KL divergence between q and p. By changing γ, ϕ and λ we
get different q distributions at different distances from the true posterior p. Our goal is to find the
γ*, ϕ* and λ* that minimise that distance:
(γ*, ϕ*, λ*) = argmin over (γ, ϕ, λ) of D( q(θ, z, β | γ, ϕ, λ) || p(θ, z, β | w, α, η) )
With everything nicely defined, it’s just a matter of iteratively solving the above optimisation
problem until the solution converges. Once you have γ* , ϕ* and λ* you have everything you need
in the final LDA model.
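In practice you rarely implement this optimisation yourself; for example, scikit-learn's LatentDirichletAllocation runs online variational inference for you. A minimal sketch on a placeholder corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are loyal pets",              # placeholder corpus
        "football and cricket are popular sports",
        "phones and laptops run software"]

counts = CountVectorizer().fit_transform(docs)       # bag-of-words counts
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)               # per-document topic mixtures (theta-like)
print(doc_topics.round(2))
print(lda.components_.shape)                         # per-topic word weights (beta-like), shape (K, V)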
Wrap up
In this article we discussed Latent Dirichlet Allocation (LDA). LDA is a powerful method
that allows us to identify topics within a set of documents and to map documents to those topics.
It has many uses, such as recommending books to customers.
We looked at how LDA works with an example of connecting threads. Then we saw a different
perspective, based on how LDA imagines a document is generated. Finally, we went into the
training of the model. Here we discussed a significant amount of the mathematics behind LDA
while keeping the math light: we touched on the Dirichlet distribution, looked at which
probability distribution we’re interested in finding (i.e. the posterior), and saw how to
approximate it using variational inference.
I will post a tutorial on how to use LDA for topic modelling, including some cool analysis, as a
separate article. Cheers.
Naive Bayes Explained
Naive Bayes is a probabilistic algorithm that’s typically used for classification problems.
Naive Bayes is simple, intuitive, and yet performs surprisingly well in many cases. For
example, the spam filters that email apps use are built on Naive Bayes. In this article, I’ll explain
the rationale behind Naive Bayes and build a spam filter in Python. (For simplicity, I’ll …)
Theory
Before we get started, here are the notations used in this article: X = (x1, x2, …, xn) denotes a
data point with n features, and Y denotes the class label we want to predict.
Basic Idea
To make classifications, we need to use X to predict Y. In other words, given a data point
X = (x1, x2, …, xn), what are the odds of Y being y? This can be rewritten as the conditional
probability
P(Y = y | X = (x1, x2, …, xn))
This is the basic idea of Naive Bayes; the rest of the algorithm really focuses on
how to calculate the conditional probability above.
Bayes Theorem
So far Mr. Bayes has no contribution to the algorithm. Now is his time to shine.
According to Bayes’ theorem:
P(Y|X) = P(X|Y) * P(Y) / P(X)
This is a rather simple transformation, but it bridges the gap between what we want to do
and what we can do. We can’t get P(Y|X) directly, but we can get P(X|Y) and P(Y) from
the training data. Here’s an example:
In this case, X =(Outlook, Temperature, Humidity, Windy), and Y=Play. P(X|Y) and P(Y)
can be calculated:
Example of finding P(Y) and P(X|Y)
Theoretically, it is not hard to find P(X|Y). However, it becomes much harder in reality as the
number of features grows: to estimate the joint probability P(x1, x2, …, xn | Y) directly, we would
need enough data for every possible combination of feature values, so the number of parameters
grows exponentially with the number of features.
Having this amount of parameters in the model is impractical. To solve this problem, a
naive assumption is made: we pretend all features are independent. What does
this mean?
Conditional independence: P(X|Y) = P(x1|Y) * P(x2|Y) * … * P(xn|Y)
Now, with the help of this naive assumption (naive because features are rarely
independent in reality), we can make classifications with far fewer parameters:
P(Y|X) ∝ P(Y) * P(x1|Y) * P(x2|Y) * … * P(xn|Y)
This is a big deal. We changed the number of parameters from exponential to linear. This
means that Naive Bayes handles high-dimensional data well.
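A tiny sketch of what the factorisation buys us; the per-feature likelihoods below are made-up numbers, purely to show that each class score is just a prior times a product of one-dimensional likelihoods:

# made-up priors and per-feature likelihoods for a two-class weather-style example
priors = {"Play": 0.64, "NoPlay": 0.36}
likelihoods = {
    "Play":   {"Outlook=Sunny": 0.33, "Humidity=High": 0.33, "Windy=False": 0.67},
    "NoPlay": {"Outlook=Sunny": 0.60, "Humidity=High": 0.80, "Windy=False": 0.40},
}

features = ["Outlook=Sunny", "Humidity=High", "Windy=False"]
scores = {}
for label in priors:
    score = priors[label]
    for f in features:
        score *= likelihoods[label][f]   # naive assumption: features independent given the class
    scores[label] = score
print(scores)   # predict the class with the larger score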
Categorical Data
However, one issue is that if some feature values never show up in the training data (perhaps
due to lack of data), their likelihood will be zero, which makes the whole posterior probability
zero. One simple way to fix this problem is the Laplace estimator: add imaginary samples
(usually one) to each category.
Laplace Estimator
Continuous Data
For continuous features, there are essentially two choices: discretization and continuous
Naive Bayes.
Discretization works by breaking the data into categorical values. The simplest
discretization is uniform binning, which creates bins with fixed range. There are, of
course, smarter and more complicated ways such as Recursive minimal entropy
partitioning or SOM based partitioning.
The second option is utilizing known distributions. If the features are continuous, the
Naive Bayes algorithm can be written with a probability density function in place of each P(xi|Y).
For instance, if we visualize the data and see a bell-curve-like distribution, it is fair to
assume that the feature is normally distributed.
The first step is calculating the mean and variance of the feature for a given label y (with the
variance adjusted by the degrees of freedom); the Gaussian density with those parameters then
gives P(xi|Y).
Although these methods vary in form, the core idea behind them is the same: assume the feature
satisfies a certain distribution, estimate the parameters of that distribution, and then get the
probability density function.
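A minimal sketch of the Gaussian version for a single continuous feature (toy numbers; a library such as scikit-learn's GaussianNB does this for you across all features and classes):

import numpy as np

x_given_y = np.array([61.0, 64.0, 67.0, 68.0, 70.0])   # toy feature values observed for one class y

mu = x_given_y.mean()
var = x_given_y.var(ddof=1)          # variance adjusted by the degrees of freedom

def gaussian_pdf(x, mu, var):
    # density of x under a normal distribution with the estimated parameters
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(66.0, mu, var))   # plugs into the Naive Bayes product as P(x | y)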
Strength
1. Even though the naive assumption is rarely true, the algorithm performs
surprisingly well in many cases.
2. It handles high-dimensional data well, is easy to parallelize, and handles big data well.
3. It performs better than more complicated models when the data set is small.
Weakness
1. The estimated probabilities are often unreliable, because the independence
assumption rarely holds exactly.
2. When data is abundant, other more complicated models tend to outperform Naive
Bayes.
Summary
Naive Bayes utilizes the most fundamental probability knowledge and makes a naive
assumption that all features are independent. Despite the simplicity (some may say
oversimplification), Naive Bayes gives a decent performance in many applications.
Now that you understand how Naive Bayes works, it is time to try it in real projects!
A beginner’s guide to dimensionality reduction in Machine Learning
This is my first article on medium. Here, I’ll be giving a quick overview of what
dimensionality reduction is, why we need it and how to do it.
That is all well and good but why should we care? Why would we drop 80 columns off
our dataset when we could straight up feed it to our machine learning algorithm and let it
do the rest?
We care because the curse of dimensionality demands that we do. The curse of
dimensionality refers to all the problems that arise when working with data in higher
dimensions that did not exist in lower dimensions.
As the number of features increases, the number of samples needed to cover the feature space
also increases. The more features we have, the greater the number of samples we will need to
have all combinations of feature values well represented in our sample.
The Curse of Dimensionality
As the number of features increases, the model becomes more complex, and the greater the
number of features, the greater the chance of overfitting. A machine learning model that is
trained on a large number of features becomes increasingly dependent on the data it was
trained on, and in turn overfitted, resulting in poor performance on real data and defeating the
purpose.
1. Fewer features mean less chance of overfitting, as discussed above.
2. Fewer dimensions mean less computing, and less data means that algorithms
train faster.
Feature selection is the process of identifying and selecting relevant features for your
sample. Feature engineering is manually generating new features from existing features,
by applying some transformation or performing some operation on them.
2. Another simple manual approach is visualising the relationship between the features and
the target variable by plotting each feature against the target variable.
Now let us look at a few programmatic methods for feature selection from the popular
machine learning library scikit-learn, namely,
2. Univariate selection (see the sketch below).
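For instance, univariate selection with scikit-learn's SelectKBest might look like this (a sketch on random placeholder data; with real data you would pass your own feature matrix and target):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))         # placeholder: 100 samples, 20 features
y = rng.integers(0, 2, size=100)       # placeholder binary target

selector = SelectKBest(score_func=f_classif, k=5)      # keep the 5 highest-scoring features
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                      # (100, 5)
print(selector.get_support(indices=True))   # indices of the selected columns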
The most common and well-known dimensionality reduction methods are the ones that
apply linear transformations, like
1. PCA (Principal Component Analysis): rotates and projects the data along the
directions of maximum variance.
2. LDA (Linear Discriminant Analysis): projects data in a way that the class
separability is maximised. Examples from the same class are put close
together by the projection, while examples from different classes are placed far
apart.
PCA orients the data along the directions of maximum variance, whereas LDA projects the
data so as to emphasise class separability.
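A short scikit-learn sketch of both projections (the iris data set is used here only as a convenient stand-in):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)       # unsupervised: directions of maximum variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: maximises class separability
print(X_pca.shape, X_lda.shape)                    # (150, 2) (150, 2)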
Non-linear transformation methods, or manifold learning methods, are used when the
data doesn’t lie on a linear subspace. They are based on the manifold hypothesis, which says
that in a high-dimensional structure, the most relevant information is concentrated in a small
number of low-dimensional manifolds. If a linear subspace is a flat sheet of paper, then a
rolled-up sheet of paper is a simple example of a non-linear manifold; informally, this is
called a Swiss roll, a canonical problem in the field of non-linear dimensionality
reduction. Several popular manifold learning methods exist, and the figure below compares them.
Shows the resulting projection from applying different manifold learning methods on a
3D S-Curve
Auto-encoders
Another popular dimensionality reduction method that gives spectacular results is the
auto-encoder, a type of artificial neural network that aims to copy its inputs to its
outputs. It compresses the input into a latent-space representation and then
reconstructs the output from this representation. An autoencoder is composed of two
parts: an encoder, which compresses the input into the latent-space representation, and a
decoder, which reconstructs the input from that representation.
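A minimal Keras sketch of that two-part structure (the layer sizes and random data below are placeholders):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 64, 8
X = np.random.rand(1000, input_dim).astype("float32")    # placeholder data

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)       # encoder: compress to the latent space
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)    # decoder: reconstruct the input

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)           # trained to copy its input to its output

encoder = keras.Model(inputs, encoded)        # the latent representation is the reduced data
print(encoder.predict(X, verbose=0).shape)    # (1000, 8)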