DATA ANALYTICS
UNIT 2
2.1 Regression modeling
Regression analysis is a form of predictive modelling technique which investigates the relationship
between a dependent (target) variable and independent variable(s) (predictors). This technique is used for
forecasting, time series modelling and finding the causal effect relationship between variables.
For example, the relationship between rash driving and the number of road accidents caused by a driver is best
studied through regression.
Regression models are widely used data analytics techniques that allow the identification and
estimation of possible relationships between a pattern or variable of interest, and factors that
influence that pattern. Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other independent variables are
held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line
to the data points in such a manner that the distances of the data points from the curve or line
are minimized.
There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between the dependent variable and the independent
variables.
2. It indicates the strength of impact of multiple independent variables on a dependent
variable.
Regression analysis also allows us to compare the effects of variables measured on different
scales, such as the effect of price changes and the number of promotional activities. These
benefits help market researchers / data analysts / data scientists evaluate and select the
best set of variables to be used for building predictive models.
Example: Suppose there is a marketing company A, which runs various advertisements every year
and earns sales from them. The list below shows the advertising spend of the company in the last 5
years and the corresponding sales:
Fig 2.1 Example of Regression
Now the company wants to spend $200 on advertisement in the year 2019 and wants to know
the predicted sales for that year. To solve such prediction problems in machine learning, we need
regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables. In regression, we plot a graph
between the variables which best fits the given data points; using this plot, the machine learning
model can make predictions about the data. In simple words, "Regression shows a line or curve
that passes through all the data points on target-predictor graph in such a way that the vertical
distance between the data points and the regression line is minimum." The distance between data
points and line tells whether a model has captured a strong relationship or not.
o Dependent Variable: The main factor in Regression analysis which we want to predict
or understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variable, or which are
used to predict the values of the dependent variable, are called independent variables, also
known as predictors.
o Outliers: An outlier is an observation with either a very low or a very high value in
comparison to the other observed values. An outlier may distort the result, so it
should be handled carefully or avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
this condition is called multicollinearity. It should not be present in
the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not with the test dataset, the problem is called overfitting. If our algorithm
does not perform well even on the training dataset, the problem is called
underfitting.
As mentioned above, regression analysis helps in the prediction of a continuous variable. There
are various real-world scenarios where we need future predictions, such as weather conditions,
sales, and marketing trends; for such cases we need a technique that can make predictions
accurately. Regression analysis is such a technique: a statistical method used in machine learning
and data science.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
1. Linear Regression
It is one of the most widely known modeling techniques. In this technique, the dependent
variable is continuous, independent variable(s) can be continuous or discrete, and nature of
regression line is linear.
Linear regression is a statistical regression method which is used for predictive analysis. It
shows the relationship between the continuous variables. It is used for solving the regression
problem in machine learning. Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.
Linear regression establishes a relationship between the dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the
error term. This equation can be used to predict the value of the target variable based on the given
predictor variable(s).
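As a minimal illustrative sketch, the following R code fits such a line with lm() on hypothetical advertisement/sales figures (the values of Fig 2.1 are not reproduced here, so these numbers are placeholders) and predicts the sales for an advertisement spend of $200:

# Hypothetical advertisement spend and sales per year (placeholder values)
ads   <- c(90, 120, 150, 100, 130)
sales <- c(1000, 1300, 1800, 1200, 1380)

# Fit Y = a + b*X + e by ordinary least squares
fit <- lm(sales ~ ads)
summary(fit)                 # intercept a, slope b, and error statistics

# Predict sales for an advertisement spend of $200 (the 2019 question above)
predict(fit, newdata = data.frame(ads = 200))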
Important Points:
• There must be a linear relationship between the independent and dependent variables
• Multiple regression suffers from multicollinearity, autocorrelation, heteroskedasticity.
• Linear Regression is very sensitive to Outliers. It can terribly affect the regression line
and eventually the forecasted values.
• Multicollinearity can increase the variance of the coefficient estimates and make the estimates
very sensitive to minor changes in the model. The result is that the coefficient estimates are
unstable
• In case of multiple independent variables, we can go with forward selection, backward
elimination and step wise approach for selection of most significant independent variables.
2. Logistic Regression
Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or
discrete format such as 0 or 1. The logistic regression algorithm works with categorical variables
such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc. It is a predictive analysis
algorithm which works on the concept of probability. Logistic regression uses the sigmoid
function (logistic function) to map predicted values to probabilities. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to
1, and values below the threshold level are rounded down to 0.
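A minimal R sketch of this idea, using hypothetical pass/fail data and the built-in glm() function (the binomial family uses the logit link, whose inverse is the sigmoid above):

# Hypothetical binary outcome: whether a student passes (1) or fails (0)
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
passed <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)

# Logistic regression via glm() with the binomial family
fit <- glm(passed ~ hours, family = binomial)

# Predicted probabilities (points on the S-curve) and a 0.5 threshold
p <- predict(fit, newdata = data.frame(hours = c(2.5, 7.5)), type = "response")
ifelse(p > 0.5, 1, 0)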
Important Points:
• Logistic regression is widely used for classification problems
• Logistic regression doesn’t require linear relationship between dependent and independent
variables. It can handle various types of relationships because it applies a non-linear log
transformation to the predicted odds ratio
• To avoid overfitting and underfitting, we should include all significant variables. A good
approach to ensure this practice is to use a stepwise method to estimate the logistic regression
• It requires large sample sizes, because maximum likelihood estimates are less powerful at low
sample sizes than ordinary least squares
• The independent variables should not be correlated with each other i.e. no multi collinearity.
However, we have the options to include interaction effects of categorical variables in the
analysis and in the model.
• If the values of the dependent variable are ordinal, it is called ordinal logistic regression
• If the dependent variable is multi-class, it is known as multinomial logistic regression.
3. Polynomial Regression:
Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model. It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and corresponding conditional values of y.
Suppose there is a dataset whose data points are arranged in a non-linear fashion; in such a case,
linear regression will not fit those data points well. To cover such
data points, we need polynomial regression. In polynomial regression, the original features are
transformed into polynomial features of a given degree and then modelled using a linear
model, which means the data points are best fitted using a polynomial curve.
o The equation for polynomial regression is derived from the linear regression equation: the
linear equation Y = b0 + b1x is extended to the polynomial regression equation
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
o The model is still linear, because the coefficients b0, ..., bn enter the equation linearly; only the features are quadratic and higher-order terms of x.
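A short R sketch of polynomial regression on simulated non-linear data; poly(x, 2, raw = TRUE) supplies the x and x² features while lm() still estimates the coefficients linearly:

set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(50, sd = 0.5)   # non-linear pattern

# Linear model on polynomial features of x
fit <- lm(y ~ poly(x, 2, raw = TRUE))
coef(fit)                                          # estimates of b0, b1, b2

predict(fit, newdata = data.frame(x = 1.5))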
4. Support Vector Regression
Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. When it is used for regression problems, it is termed
Support Vector Regression (SVR). Support Vector Regression is a regression algorithm which works
for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Kernel: a function used to map lower-dimensional data into higher-dimensional data.
o Hyperplane: in SVR, the best-fit line that helps to predict the continuous target variable.
o Boundary lines: the two lines drawn around the hyperplane that create a margin for the data points.
o Support vectors: the data points closest to the boundary, which define the hyperplane.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of data points are covered in that margin. The main goal of SVR is to consider the
maximum data points within the boundary lines and the hyperplane (best-fit line) must contain a
maximum number of data points. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
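A small sketch of Support Vector Regression in R, assuming the e1071 package (which provides svm()); the data here are simulated for illustration:

library(e1071)   # install.packages("e1071") if needed

set.seed(2)
x <- seq(0, 10, length.out = 80)
y <- sin(x) + rnorm(80, sd = 0.2)
d <- data.frame(x, y)

# With a numeric response, svm() performs epsilon-support-vector regression;
# epsilon is the half-width of the tube (boundary lines) around the hyperplane.
fit <- svm(y ~ x, data = d, kernel = "radial", epsilon = 0.1)

predict(fit, newdata = data.frame(x = c(2.5, 7.5)))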
5. Decision Tree Regression
Decision tree regression builds a tree-structured model in which the dataset is split at decision
nodes: starting from the root node, the data is divided into child nodes, and these child nodes are
further divided into their own children and themselves become the parent nodes
of those nodes. Consider the below image:
The above image shows an example of decision tree regression; here, the model is trying to
predict the choice of a person between a sports car and a luxury car.
6. Random Forest Regression
o Random forest is one of the most powerful supervised learning algorithms, capable of
performing regression as well as classification tasks.
o Random forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output based on the average of the individual tree outputs.
The combined decision trees are called base models, and the ensemble can be represented more
formally as:
g(x) = f1(x) + f2(x) + ... + fn(x), where each fi(x) is an individual decision tree (base model).
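A brief sketch of random forest regression in R, assuming the randomForest package and using the built-in mtcars data for illustration:

library(randomForest)   # install.packages("randomForest") if needed

# Predict miles per gallon from the other columns; each tree is a base model
# and the forest's prediction is the average of the tree outputs.
set.seed(3)
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

predict(fit, newdata = mtcars[1:3, ])
importance(fit)    # which predictors the trees rely on most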
2.2 Multivariate Analysis
In data analytics, we look at different variables (or factors) and how they might impact certain
situations or outcomes. For example, in marketing, you might look at how the variable “money
spent on advertising” impacts the variable “number of sales.” In the healthcare sector, you might
want to explore whether there’s a correlation between “weekly hours of exercise” and
“cholesterol level.” This helps us to understand why certain outcomes occur, which in turn
allows us to make informed predictions and decisions for the future.
One real-world example is the weather. The weather at any particular place does not depend
solely on the ongoing season; many other factors, such as humidity and pollution, play their
specific roles. In the same way, the variables in a multivariate analysis stand in for real-world
situations, products, services, or decisions that involve multiple variables.
In 1928, Wishart presented his paper "The precise distribution of the sample covariance
matrix of the multivariate normal population", which marked the beginning of multivariate
analysis (MVA). In the 1930s, R.A. Fisher, Hotelling, S.N. Roy, and B.L. Xu et al. did a great deal
of fundamental theoretical work on multivariate analysis. At that time, it was widely used in the
fields of psychology, education, and biology. In the middle of the 1950s, with the appearance and
expansion of computers, multivariate analysis began to play a big role in the geological,
meteorological, medical, and social sciences. From then on, new theories and new methods were
proposed and tested constantly in practice, and at the same time more application fields were
explored. With the aid of modern computers, we can apply the methodology of multivariate
analysis to perform rather complex statistical analyses.
1. Multivariate data analysis helps in the reduction and simplification of data as much as
possible without losing any important details.
2. As MVA has multiple variables, the variables are grouped and sorted on the basis of their
unique features.
3. The variables in multivariate data analysis could be dependent or independent. It is
important to verify the collected data and analyze the state of the variables.
4. In multivariate data analysis, it is very important to understand the relationship between
all the variables and predict the behavior of the variables based on observations.
5. In multivariate data analysis, statistical hypotheses are formulated based on the parameters of
the multivariate data and then tested. This testing is carried out to determine whether or not the assumptions are true.
Advantages of multivariate data analysis:
1. The main advantage of multivariate analysis is that since it considers more than one
factor of independent variables that influence the variability of dependent variables,
the conclusion drawn is more accurate.
2. Since the analysis is tested, the drawn conclusions are closer to real-life situations.
Disadvantages of multivariate data analysis:
1. Multivariate data analysis includes many complex computations and hence can be
laborious.
2. The analysis necessitates the collection and tabulation of a large number of observations
for various variables. This process of observation takes a long time.
Dependence methods
Dependence methods are used when one or some of the variables are dependent on others.
Dependence looks at cause and effect; in other words, can the values of two or more independent
variables be used to explain, describe, or predict the value of another, dependent variable? To
give a simple example, the dependent variable of “weight” might be predicted by independent
variables such as “height” and “age.”
In machine learning, dependence techniques are used to build predictive models. The analyst
enters input data into the model, specifying which variables are independent and which ones are
dependent—in other words, which variables they want the model to predict, and which variables
they want the model to use to make those predictions.
Interdependence methods
Interdependence methods are used to understand the structural makeup and underlying patterns
within a dataset. In this case, no variables are dependent on others, so you’re not looking for
causal relationships. Rather, interdependence methods seek to give meaning to a set of variables
or to group them together in meaningful ways.
Commonly used multivariate analysis techniques include:
• Multiple logistic regression
• Multivariate analysis of variance (MANOVA)
• Factor analysis
• Cluster analysis
• Multiple linear regression
Multiple linear regression is a dependence method which looks at the relationship between one
dependent variable and two or more independent variables. A multiple regression model will tell
you the extent to which each independent variable has a linear relationship with the dependent
variable. This is useful as it helps you to understand which factors are likely to influence a
certain outcome, allowing you to estimate future outcomes.
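A minimal R sketch of the weight/height/age example above, with hypothetical values; summary() reports one coefficient per independent variable:

# Hypothetical data: weight predicted from height and age
d <- data.frame(
  weight = c(62, 70, 80, 55, 90, 75),
  height = c(165, 172, 180, 158, 185, 176),
  age    = c(30, 34, 41, 25, 45, 38)
)

# Multiple linear regression: one dependent variable, two independent variables
fit <- lm(weight ~ height + age, data = d)
summary(fit)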
• Multiple logistic regression
Multiple logistic regression is a dependence method used to predict the probability of a binary
outcome. For example, to predict whether or not a given customer will make an insurance claim,
you could enter several independent variables into your model, such as age, whether or not they
have a serious health condition, their occupation, and so on. Using these variables, a logistic
regression analysis will calculate the probability of the event (making a claim) occurring.
Another oft-cited example is the filters used to classify email as "spam" or "not spam."
Example of MANOVA:
Suppose you work for an engineering company that is on a mission to build a super-fast, eco-
friendly rocket. You could use MANOVA to measure the effect that various design
combinations have on both the speed of the rocket and the amount of carbon dioxide it emits.
Your metric dependent variables are speed in kilometers per hour, and carbon dioxide measured
in parts per million. Using MANOVA, you’d test different combinations (e.g. E1, M1, and F1
vs. E1, M2, and F1, vs. E1, M3, and F1, and so on) to calculate the effect of all the independent
variables. This should help you to find the optimal design solution for your rocket.
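A hedged R sketch of such a MANOVA, using base R's manova() on hypothetical rocket-test data with two design factors (here called engine and fuel) and the two metric outcomes:

# Hypothetical test data: two metric dependent variables and two design factors
set.seed(4)
d <- data.frame(
  engine = factor(rep(c("E1", "E2", "E3"), each = 10)),
  fuel   = factor(rep(c("F1", "F2"), times = 15)),
  speed  = rnorm(30, mean = 28000, sd = 500),   # km/h
  co2    = rnorm(30, mean = 400, sd = 20)       # parts per million
)

# MANOVA: test the effect of the factors on both outcomes jointly
fit <- manova(cbind(speed, co2) ~ engine * fuel, data = d)
summary(fit, test = "Pillai")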
• Factor analysis
Factor analysis is an interdependence technique which seeks to reduce the number of variables in
a dataset. If you have too many variables, it can be difficult to find patterns in your data. At the
same time, models created using datasets with too many variables are susceptible to overfitting.
Overfitting is a modeling error that occurs when a model fits too closely and specifically to a
certain dataset, making it less generalizable to future datasets, and thus potentially less accurate
in the predictions it makes. Factor analysis works by detecting sets of variables which correlate
highly with each other. These variables may then be condensed into a single variable. Data
analysts will often carry out factor analysis to prepare the data for subsequent analyses.
Factor analysis example:
Let’s imagine you have a dataset containing data pertaining to a person’s income, education
level, and occupation. You might find a high degree of correlation among each of these
variables, and thus reduce them to the single factor “socioeconomic status.” You might also have
data on how happy they were with customer service, how much they like a certain product, and
how likely they are to recommend the product to a friend. Each of these variables could be
grouped into the single factor “customer satisfaction” (as long as they are found to correlate
strongly with one another). Even though you’ve reduced several data points to just one factor,
you’re not really losing any information—these factors adequately capture and represent the
individual variables concerned. With your “streamlined” dataset, you’re now ready to carry out
further analyses.
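A small R sketch using the built-in factanal() function on simulated survey data constructed so that two groups of items correlate, mirroring the two factors described above:

# Simulated data: items 1-3 share one latent factor, items 4-6 share another
set.seed(5)
ses  <- rnorm(200); csat <- rnorm(200)
d <- data.frame(
  income     = ses  + rnorm(200, sd = 0.4),
  education  = ses  + rnorm(200, sd = 0.4),
  occupation = ses  + rnorm(200, sd = 0.4),
  service    = csat + rnorm(200, sd = 0.4),
  liking     = csat + rnorm(200, sd = 0.4),
  recommend  = csat + rnorm(200, sd = 0.4)
)

# Extract two latent factors; the loadings show which variables each factor condenses
fit <- factanal(d, factors = 2)
print(fit$loadings, cutoff = 0.3)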
• Cluster analysis
Another interdependence technique, cluster analysis is used to group similar items within a
dataset into clusters. When grouping data into clusters, the aim is for the variables in one cluster
to be more similar to each other than they are to variables in other clusters. This is measured in
terms of intracluster and intercluster distance. Intracluster distance looks at the distance between
data points within one cluster. This should be small. Intercluster distance looks at the distance
between data points in different clusters. This should ideally be large. Cluster analysis helps you
to understand how data in your sample is distributed, and to find patterns.
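A minimal R sketch of cluster analysis with kmeans() on the built-in iris measurements; the within-cluster and between-cluster sums of squares correspond to the intracluster and intercluster distances mentioned above:

set.seed(6)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$cluster                        # cluster assignment of each observation
km$withinss                       # intracluster variation (should be small)
km$betweenss                      # intercluster variation (should be large)
table(km$cluster, iris$Species)   # how the clusters line up with the species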
2.3 Bayesian Modelling
In data science, data modelling is the process of finding the function by which data was
generated. In this context, data modelling is the goal of any data analysis task. For instance, if
you have a 2D dataset (see the figures below) and you find that the two variables are linearly correlated,
you may decide to model it using linear regression.
Bayesian data modelling means modelling your data using Bayes' theorem. Let us revisit Bayes'
rule:
P(H | E) = P(E | H) · P(H) / P(E)
In the above equation, H is the hypothesis and E is the evidence. In the real world, however, we
interpret the Bayesian components slightly differently: the evidence is usually expressed by data, and the
hypothesis reflects the expert's prior estimation of the posterior. Therefore, we can rewrite
Bayes' rule as:
P(θ | Data) = P(Data | θ) · P(θ) / P(Data)
that is, posterior = (likelihood × prior) / marginal probability of the data.
In the above definition we learned about the prior, the posterior, and the data, but what
about the θ parameter? θ is the set of coefficients that best describes the data. You may think of θ as the
slope and intercept of your linear regression equation, or the vector of coefficients w in
your polynomial regression function. As you can see in the above equation, θ is the single missing
parameter, and the goal of Bayesian modelling is to find it.
In order to select the suitable distribution for your data, you should learn about the data domain
and gather information from previous studies in it. You may also ask an expert to learn how data
is developed over time. If you have big portions of data, you may visualize it and try to detect
certain patterns in how it evolves over time, and select your probability distribution accordingly.
Maximum Likelihood Estimation (MLE) is the simplest form of Bayesian data modelling, in
which we ignore both the prior and the marginal probability (i.e. we consider both quantities equal to 1).
The formula of MLE is:
θ_MLE = argmax over θ of P(Data | θ) = argmax over θ of ∏i P(xi | θ)
Example
"A company captures the blood glucose (BG) levels of diabetics, analyses these levels, and sends
its clients suitable diet preferences. After one week of entering her BG levels, a client asked the
company's smart app whether she can consume desserts after her lunch or not. Her after-meal
BG levels were {172, 171, 166, 175, 170, 165, 160}. Knowing that the client's after-meal
BG should not exceed 200 mg/dl, what should the app recommend to the client?"
The goal here is to estimate an average BG level for the client, to be used in the company's
recommendation system. Having the above BG sample data, we assume these readings are
drawn from a normal distribution with a mean of 168.43 and a standard deviation of 4.69.
In order to find the maximum likelihood, we need to estimate different values of BG levels and
calculate the corresponding likelihoods. The BG value with maximum likelihood is the most
suitable estimation of the BG.
To automate this process, we generate 1000 random numbers in the same range as the captured BG
levels and measure the corresponding likelihoods. The results are illustrated in the following
plot.
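A sketch of this procedure in R (the random seed differs, so the exact maximizing value will vary slightly from the 169.55 quoted below):

bg <- c(172, 171, 166, 175, 170, 165, 160)             # observed BG levels
sigma <- sd(bg) * sqrt((length(bg) - 1) / length(bg))  # population SD, about 4.69

# 1000 candidate mean BG levels in the range of the captured data
set.seed(7)
theta <- runif(1000, min = min(bg), max = max(bg))

# Likelihood of the sample for each candidate mean (prior and marginal ignored)
lik <- sapply(theta, function(m) prod(dnorm(bg, mean = m, sd = sigma)))

theta[which.max(lik)]   # maximum-likelihood estimate, close to mean(bg) = 168.43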
As you can see, the maximum likelihood estimate of randomly generated BGs equals 169.55,
which is very close to the average of captured BG levels (168.43). The difference between both
values is due to the small size of captured data. The larger sample size you have, the smaller
difference between both estimates you get.
Based on this estimate, the app can recommend that its client consume a small piece of dessert
at least 3 hours after her lunch, together with a suitable insulin dose.
Maximum A Posteriori (MAP) is the second approach of Bayesian modelling, where the
posterior is calculated using both likelihood and prior probabilities.
You can think of MAP modelling as a generalization of MLE in which the likelihood is
combined with prior information. Technically speaking, we treat our data samples as if
they were generated by the prior distribution:
θ_MAP = argmax over θ of P(Data | θ) · P(θ)
Example
"Suppose that the client updated the app with her historical BG levels for the last month, which
turned out to be 171 mg/dl on average, with a standard deviation of 3. How can the app use this
new information to update the estimate of the patient's BG level?"
In this case, we have two distributions of Blood Glucose, one of recent data (the likelihood), and
the other of historical data (the prior). Each data source can be expressed as normal distribution
(see the following figure).
The posterior here is obtained by multiplying the prior probability of each candidate θ by the
probability of each data point:
P(θ | Data) ∝ P(θ) · ∏i P(xi | θ)
where the prior probability P(θ) is evaluated over the prior (historical) distribution.
As our goal is to maximize the posterior estimate, we generate random values of θ and measure
the corresponding estimations. By generating 500 guesses of θ, we obtain the posterior
distribution shown in the following plot (in black). For visualization reasons, the posterior
probabilities were raised to the power of 0.08.
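A sketch of the MAP computation in R, combining the prior (171 mg/dl, SD 3) with the likelihood of the recent readings; the exact result depends on the random guesses of θ:

bg <- c(172, 171, 166, 175, 170, 165, 160)             # recent data (likelihood)
sigma <- sd(bg) * sqrt((length(bg) - 1) / length(bg))

prior_mean <- 171; prior_sd <- 3                       # historical data (prior)

set.seed(8)
theta <- runif(500, min = min(bg), max = max(bg))      # 500 guesses of theta

# Unnormalised posterior: prior density times the likelihood of the sample
post <- sapply(theta, function(m) {
  dnorm(m, mean = prior_mean, sd = prior_sd) * prod(dnorm(bg, mean = m, sd = sigma))
})

theta[which.max(post)]   # MAP estimate, pulled towards the prior mean of 171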
Now we get a more generalized estimation of the client's BG level. The MAP technique excludes
two measurements from the posterior distribution (the red data points outside the black curve) and
generates a more reliable estimate. As the plot below shows, the standard error (SE) of the
posterior is less than that of the likelihood. The standard error is an indication of the reliability of an
estimated expectation (i.e. the mean of the predicted normal distribution). It is calculated as SE = σ/√N,
where σ is the standard deviation and N is the data size.
The new MAP posterior estimate of the patient's BG level is higher than the one estimated
using MLE. This may lead the app to prohibit the patient from consuming desserts after her lunch
until her BG levels become more stable.
2.4 Inference and Bayesian Network
A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph. It is also called a Bayes network,
belief network, decision network, or Bayesian model. Bayesian networks are probabilistic,
because these networks are built from a probability distribution, and also use probability theory
for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it
consists of two parts:
o Causal Component
o Actual numbers
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram. A Bayesian network graph is made up
of nodes and arcs (directed links), where:
o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which determines the effect of the parents on that node. A Bayesian network is based on the joint
probability distribution and conditional probability.
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1,
x2, x3, ..., xn are known as the joint probability distribution.
The joint probability P[x1, x2, x3, ..., xn] can be written in terms of conditional probabilities
using the chain rule:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
so that the joint distribution factorizes as P(X1, ..., Xn) = ∏i P(Xi | Parents(Xi)).
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds
reliably to a burglary but also sometimes responds to minor earthquakes. Harry has two
neighbors, David and Sophia, who have taken the responsibility of informing Harry at work when they
hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets
confused with the phone ringing and calls then too. On the other hand, Sophia likes to
listen to loud music, so she sometimes misses the alarm. Here we would like to compute
the probability of the burglary alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and both David and Sophia called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend on
the alarm probability.
o The network represents that David and Sophia do not perceive the burglary directly,
do not notice minor earthquakes, and do not confer with each other before calling.
o The conditional distributions for each node are given as a conditional probabilities table, or
CPT.
o Each row in the CPT must sum to 1, because all the entries in the row represent an
exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if
there are two parents, the CPT will contain 4 probability values.
The list of all events occurring in this network is:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probabilities as P[D, S, A, B, E].
Using the joint probability distribution and the network structure, this can be rewritten as:
P[D, S, A, B, E] = P[D | A] · P[S | A] · P[A | B, E] · P[B] · P[E]
Fig 2.15 Example of Bayesian Network graph
Let's take the observed probabilities for the Burglary and Earthquake components:
The conditional probability that David calls depends on the probability of the alarm.
The conditional probability that Sophia calls depends on its parent node, "Alarm."
From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B ∧ ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint distribution.
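The same calculation can be written as a few lines of R, using only the probabilities that appear in the formula above:

# CPT entries used in the query (taken from the calculation above)
p_S_given_A    <- 0.75    # P(Sophia calls | alarm)
p_D_given_A    <- 0.91    # P(David calls  | alarm)
p_A_given_nBnE <- 0.001   # P(alarm | no burglary, no earthquake)
p_nB <- 0.998             # P(no burglary)
p_nE <- 0.999             # P(no earthquake)

# P(S, D, A, ¬B, ¬E) = P(S|A) P(D|A) P(A|¬B,¬E) P(¬B) P(¬E)
p_S_given_A * p_D_given_A * p_A_given_nBnE * p_nB * p_nE   # about 0.00068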
A Bayesian network supports several forms of inference with slightly different semantics, such as computing the posterior probability of query variables given observed evidence.
2.5 Support Vector and Kernel Methods
A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and
regression. Its working can be illustrated with the example of classifying apples and lemons.
Other algorithms will learn the most evident, most representative characteristics of apples and
lemons, for example that apples are green and rounded while lemons are yellow and have an elliptic form.
In contrast, SVM will search for apples that are very similar to lemons, for example apples which
are yellow and have elliptic form. This will be a support vector. The other support vector will be a
lemon similar to an apple (green and rounded). So other algorithms learn
the differences, while SVM learns similarities.
As we go from left to right, all the examples will be classified as apples until we reach the yellow
apple. From this point, the confidence that a new example is an apple drops while the lemon class
confidence increases. When the lemon class confidence becomes greater than the apple class
confidence, the new examples will be classified as lemons (somewhere between the yellow apple
and the green lemon).
Based on these support vectors, the algorithm tries to find the best hyperplane that separates the
classes. In 2D the hyperplane is a line, so it would look like this:
But why did I draw the blue boundary like in the picture above? I could also draw boundaries like
this:
As you can see, we have an infinite number of possibilities to draw the decision boundary. So
how can we find the optimal one?
Intuitively the best line is the line that is far away from both apple and lemon examples (has the
largest margin). To have optimal solution, we have to maximize the margin in both ways (if we
have multiple classes, then we have to maximize it considering each of the classes).
So if we compare the picture above with the picture below, we can easily observe, that the first is
the optimal hyperplane (line) and the second is a sub-optimal solution, because the margin is far
shorter.
Because we want to maximize the margin taking into consideration all the classes, instead of using
one margin for each class we use a "global" margin, which takes all the classes into consideration.
This margin would look like the purple line in the following picture:
This margin is orthogonal to the boundary and equidistant to the support vectors.
Basic Steps
The basic steps of the SVM are:
1. Select two hyperplanes (in 2D, lines) which separate the data with no points between them.
2. Maximize their distance (the margin).
3. The average line (the line halfway between the two) will be the decision boundary.
In this case we cannot find a straight line to separate apples from lemons. So how can we solve
this problem? We will use the Kernel Trick.
The basic idea is that when a data set is inseparable in the current dimensions, add another
dimension, maybe that way the data will be separable. Just think about it, the example above is in
2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons,
maybe there is a level difference, so lemons are on level one and apples are on level two. In this
case, we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between level 1
and 2.
To solve this problem we shouldn’t just blindly add another dimension, we should transform the
space so we generate this level difference intentionally.
Mapping from 2D to 3D
Let's assume that we add another dimension called x3, in which each point is given the value
x3 = x1² + x2².
If we plot the surface defined by x3 = x1² + x2², we will get something like this:
Now we have to map the apples and lemons (which are just simple points) to this new space.
Think about it carefully, what did we do? We just used a transformation in which we added levels
based on distance. If you are in the origin, then the points will be on the lowest level. As we move
away from the origin, it means that we are climbing the hill (moving from the center of the plane
towards the margins) so the level of the points will be higher.
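A small R sketch of this mapping, assuming the e1071 package: two concentric classes that are not linearly separable in 2D become separable by a plane once x3 = x1² + x2² is added, which is what a radial kernel achieves implicitly:

library(e1071)

# Two classes arranged as concentric rings: not linearly separable in 2D
set.seed(9)
n <- 100
r <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))       # inner vs outer ring
a <- runif(n, 0, 2 * pi)
d <- data.frame(x1 = r * cos(a), x2 = r * sin(a),
                class = factor(rep(c("apple", "lemon"), each = n / 2)))

# Add the extra dimension: the two rings now sit at different "levels"
d$x3 <- d$x1^2 + d$x2^2

fit3d <- svm(class ~ x1 + x2 + x3, data = d, kernel = "linear")
mean(predict(fit3d, d) == d$class)   # near-perfect separation in 3D

# The kernel trick does this implicitly: radial kernel on the original 2D data
fit2d <- svm(class ~ x1 + x2, data = d, kernel = "radial")
mean(predict(fit2d, d) == d$class)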
Pros
1. SVM can be very efficient, because it uses only a subset of the training data, only the support
vectors
2. Works very well on smaller data sets, on non-linear data sets and high dimensional spaces
3. Is very effective in cases where number of dimensions is greater than the number of samples
4. It can have high accuracy, sometimes can perform even better than neural networks
5. Not very sensitive to overfitting
Cons
1. Training can be slow on very large data sets
2. Choosing a good kernel function and tuning its parameters is not easy
Popular use cases
1. Text Classification
2. Detecting spam
3. Sentiment analysis
4. Aspect-based recognition
5. Handwritten digit recognition
2.6 Analysis of Time Series
Time series analysis (TSA) is the backbone of prediction and forecasting analysis, specific to time-based
problem statements.
With the help of time series analysis we can prepare numerous time-based analyses and results
using:
• Forecasting
• Segmentation
• Classification
• Descriptive analysis
• Intervention analysis
• ARIMA Model
An auto-regressive integrated moving average (ARIMA) model expresses the value of a variable as a
linear function of its previous values and of the residual errors at previous time steps of a stationary
time series. However, real-world data may be non-stationary and have seasonality, so Seasonal
ARIMA and Fractional ARIMA were developed. ARIMA works on univariate time series; to
handle multiple variables, VARIMA was introduced (a sketch of fitting such models follows after this list).
• Exponential Smoothing
It models the value of a variable as an exponentially weighted linear function of previous values.
This statistical model can handle trend and seasonality as well.
• LSTM
A Long Short-Term Memory (LSTM) model is a recurrent neural network used for time
series modelling that accounts for long-term dependencies. It can be trained on large amounts of data to
capture the trends in multivariate time series.
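A brief sketch of fitting the first two of these models in R on the built-in AirPassengers series, using the base arima() and HoltWinters() functions:

# Monthly airline passenger counts (built-in dataset), 1949-1960
data(AirPassengers)

# Seasonal ARIMA: order = (p, d, q), seasonal order with period 12
fit_arima <- arima(AirPassengers, order = c(1, 1, 1),
                   seasonal = list(order = c(1, 1, 1), period = 12))
predict(fit_arima, n.ahead = 12)$pred     # forecast the next 12 months

# Exponential smoothing with trend and seasonality (Holt-Winters)
fit_hw <- HoltWinters(AirPassengers)
predict(fit_hw, n.ahead = 12)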
i. Trend: the long-term increasing or decreasing behaviour of a variable over time. There is no
fixed interval, and the movement continues over the whole timeline; a trend can be positive,
negative, or null.
ii. Seasonality: regular shifts that repeat at a fixed interval within the continuous timeline.
iii. Cyclical: movements that have no fixed interval, with uncertainty in the movement and its pattern.
iv. Irregularity: unexpected situations, events, or scenarios and spikes in a short time span.
Time series analysis in R is used to see how an object behaves over a period of time. In R, it can be
easily done with the ts() function with some parameters. ts() takes the data vector, and each
data point is associated with a timestamp value as given by the user. This function is mostly used to
learn and forecast the behavior of an asset in business over a period of time, for example sales
analysis of a company, inventory analysis, price analysis of a particular stock or market,
population analysis, etc.
The main arguments of ts() are:
• data represents the vector of observations.
• start represents the time of the first observation, e.g. c(2020, 4) for the 4th period of 2020.
• end (optional) represents the time of the last observation.
• frequency represents the number of observations per unit of time; for example, frequency = 12
for monthly data, frequency = 4 for quarterly data, and frequency = 52 for weekly data.
Example: Let's take the example of the COVID-19 pandemic. We take the total number of
positive COVID-19 cases per week, worldwide, from 22 January 2020 to 15 April 2020 as the
data vector.
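A sketch of the ts() call for this example; the weekly case counts below are illustrative placeholders, since the actual figures from the original output are not reproduced here:

# Illustrative placeholder values for weekly worldwide positive cases
positive_cases <- c(580, 7818, 28266, 81601, 158821, 337553,
                    813981, 1771514, 2680605, 3145407, 3517345, 3777796)

# 12 weekly observations starting in the 4th week of 2020 (22 January 2020),
# so frequency = 52 observations per year
covid_ts <- ts(positive_cases, start = c(2020, 4), frequency = 52)

covid_ts
plot(covid_ts, xlab = "Week", ylab = "Total positive cases")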
Irregular temporal behavior is ubiquitous in the world surrounding us: wind speeds, water wave
heights, stock market prices, exchange rates, blood pressure, or heart rate all fluctuate more or
less irregularly in time.
Such signals are the output of quite complex systems with nonlinear feedback loops and external
driving. Our goal is to quantify, understand, model, and predict such irregular fluctuations.
Therefore, our research includes the study of deterministic and stochastic model systems which
are selected because of interesting dynamical/statistical behavior and which serve as
paradigmatic data models. We design data analysis methods, we design tests and measures of
performance for such methods, and we apply these to data sets with
various properties. Last but not least, we study data sets because we wish to understand the
underlying phenomena and to improve data based predictions of fluctuations.
Out of the initially listed examples of data sources, the atmosphere sticks out for two reasons:
On the one hand, the atmosphere is an exciting highly complex physical system, where many
different sub-fields of physics meet:
Hydro dynamical transport, thermodynamics, light-matter interaction, droplet formation,
altogether forming a system which is not only far from equilibrium but also far from a linear
regime around some working point. On the other hand, climate change and the impact of
extreme weather on human civilization give atmospheric physics a high relevance. Since the
only perfect model of the atmosphere is the real world itself, and since due to very strong
nonlinearities and hierarchical structures the misleading effects of any approximation might
be tremendous, a data based approach to climate issues is urgently needed as a complement to
simulating climate by models.
In a broader context, our work can also be seen as part of what is nowadays called data science: we
make huge data sets accessible to our studies, we design visualization concepts and analysis
tools, we test hypotheses, construct models, and apply forecast schemes similar to machine
learning. The time series aspect enters our work through the fact that the temporal order in which
data are recorded carries part of the information, and physics enters as background information,
constraints, and reference models.
2.7 Rule Induction
Rule induction is an area of machine learning in which formal rules are extracted from a set of
observations.
The rules extracted may represent a full scientific model of the data, or merely represent local
patterns in the data.
Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic
decision rules explain an inherent relationship between the attributes and class labels in the data
set. Many real-life experiences are based on intuitive rule induction. For example, we can
proclaim a rule that states “if it is 8 a.m. on a weekday, then highway traffic will be heavy” and
“if it is 8 p.m. on a Sunday, then the traffic will be light.” These rules are not necessarily right all
the time. 8 a.m. weekday traffic may be light during a holiday season. But, in general, these rules
hold true and are deduced from real-life experience based on our every day observations. Rule
induction provides a powerful classification approach.
Key points:
1. Rule induction is the extraction of useful if-then rules from data based on statistical significance.
2. It is the process of learning, from cases or instances, if-then rule relationships that consist of an
antecedent (the if-part, defining the preconditions or coverage of the rule) and a consequent (the
then-part, stating a classification, prediction, or other expression of a property that holds for the
cases defined in the antecedent).
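A toy sketch in base R of inducing simple one-attribute if-then rules (in the spirit of the traffic example above), reporting each rule's coverage and accuracy; the data are hypothetical:

# For one attribute, pick the majority class per attribute value and report
# the resulting if-then rules with their coverage and accuracy
induce_rules <- function(attribute, class) {
  tab <- table(attribute, class)
  for (v in rownames(tab)) {
    best    <- colnames(tab)[which.max(tab[v, ])]
    support <- sum(tab[v, ])
    acc     <- tab[v, best] / support
    cat(sprintf("IF attribute = %s THEN class = %s  (covers %d cases, accuracy %.2f)\n",
                v, best, support, acc))
  }
}

# Hypothetical traffic observations
traffic <- data.frame(
  time   = c("8am_weekday", "8am_weekday", "8am_weekday", "8pm_sunday",
             "8pm_sunday", "8am_weekday", "8pm_sunday"),
  volume = c("heavy", "heavy", "light", "light", "light", "heavy", "light")
)
induce_rules(traffic$time, traffic$volume)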
2.8 Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal Components.
It is one of the popular tools that is used for exploratory data analysis and predictive modeling. It
is a technique to draw strong patterns from the given dataset by reducing the variances.
PCA works by considering the variance of each attribute, because an attribute with high variance
indicates a good split between the classes; in this way PCA reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing the power
allocation in various communication channels. It is a feature extraction technique, so it retains
the important variables and drops the least important ones. It is a statistical procedure that
uses an orthogonal transformation which converts a set of correlated variables to a set of
uncorrelated variables. PCA is a most widely used tool in exploratory data analysis and in
machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique
used to examine the interrelations among a set of variables. It is also known as a general factor
analysis where regression determines a line of best fit.
Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n: the first principal
component has the most importance, and the nth principal component has the least importance.
The main steps of PCA are as follows:
1. Standardizing the data: in a particular column, features with high variance are more important
than features with lower variance. If the importance of features should be independent of the
variance of the features, we divide each data item in a column by the standard deviation of the
column. We will call the resulting matrix Z.
2. Calculating the covariance of Z: we take the matrix Z, transpose it, and multiply the transposed
matrix by Z. The output matrix is the covariance matrix of Z.
3. Calculating the eigenvalues and eigenvectors: we calculate the eigenvalues and eigenvectors of
the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the
axes with the highest information, and the corresponding eigenvalues tell how much variance lies
along each of those directions.
4. Sorting the eigenvectors: we take all the eigenvalues and sort them in decreasing order, i.e. from
largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P*.
5. Calculating the new features (principal components): we multiply the matrix Z by the sorted
eigenvector matrix P*. In the resulting matrix Z*, each observation is a linear combination of the
original features, and the columns of Z* are independent of each other.
6. Removing less important features: from the new feature set we decide what to keep and what to
remove; only the relevant or important features are kept in the new dataset, and the unimportant
features are removed.
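A short R sketch that follows these steps on the built-in iris measurements and then checks the result against the built-in prcomp():

X <- as.matrix(iris[, 1:4])

# Step by step, following the description above
Z     <- scale(X)                    # standardize each column
C     <- cov(Z)                      # covariance matrix of Z
eg    <- eigen(C)                    # eigenvalues and eigenvectors
P     <- eg$vectors[, order(eg$values, decreasing = TRUE)]   # sorted eigenvectors
Zstar <- Z %*% P                     # new features (principal components)

eg$values / sum(eg$values)           # share of variance explained by each PC

# The same result with the built-in prcomp(); keep only the first two PCs
pca <- prcomp(X, center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])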
2.9 Neural Networks
Scientists agree that our brain has around 100 billion neurons. These neurons have hundreds of
billions of connections between them.
Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system. The neurons
are responsible for receiving input from the external world, for sending output (commands to our
muscles), and for transforming the electrical signals in between.
1943: Warren S. McCulloch and Walter Pitts published "A logical calculus of the ideas
immanent in nervous activity". This research sought to understand how the human brain could
produce complex patterns through connected brain cells, or neurons. One of the main ideas that
came out of this work was the comparison of neurons with a binary threshold to Boolean logic
(i.e., 0/1 or true/false statements).
1958: Frank Rosenblatt is credited with the development of the perceptron, documented in his
research "The Perceptron: A Probabilistic Model for Information Storage and Organization in
the Brain". He took McCulloch and Pitts's work a step further by introducing weights to the
equation. Leveraging an IBM 704, Rosenblatt was able to get a computer to learn how to
distinguish cards marked on the left from cards marked on the right.
1974: While numerous researchers contributed to the idea of backpropagation, Paul Werbos was
the first person in the US to note its application within neural networks, in his 1974 PhD thesis.
1989: Yann LeCun published a paper illustrating how the use of constraints in backpropagation
and its integration into the neural network architecture can be used to train algorithms. This
research successfully leveraged a neural network to recognize hand-written zip code digits
provided by the U.S. Postal Service.
Artificial neural networks (ANNs) are composed of node layers, containing an input layer,
one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to
another and has an associated weight and threshold. If the output of any individual node is
above the specified threshold value, that node is activated, sending data to the next layer of
the network. Otherwise, no data is passed along to the next layer of the network. Neural
networks can adapt to changing input; so the network generates the best possible result
without needing to redesign the output criteria. The concept of neural networks, which has its
roots in artificial intelligence, is swiftly gaining popularity in the development of trading
systems.
• Neural networks are a series of algorithms that mimic the operations of an animal brain
to recognize relationships between vast amounts of data.
• As such, they tend to resemble the connections of neurons and synapses found in the
brain.
• They are used in a variety of applications in financial services, from forecasting and
marketing research to fraud detection and risk assessment.
• Neural networks with several process layers are known as "deep" networks and are used
for deep learning algorithms
• The success of neural networks for stock market price prediction varies.
Working of neural networks
The human brain is the inspiration behind neural network architecture. Human brain cells, called
neurons, form a complex, highly interconnected network and send electrical signals to each other
to help humans process information. Similarly, an artificial neural network is made of artificial
neurons that work together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that, at their core, use
computing systems to solve mathematical calculations.
i. Input Layer
Information from the outside world enters the artificial neural network through the input layer.
Input nodes process the data, analyze or categorize it, and pass it on to the next layer.
ii. Hidden Layer
Hidden layers take their input from the input layer or from other hidden layers. Each hidden layer
analyzes the output of the previous layer, processes it further, and passes it on to the next layer.
iii. Output Layer
The output layer gives the final result of all the data processing done by the artificial neural
network; it can have one or many nodes.
Types of Neural Networks
A. Perceptron
The perceptron, one of the simplest and oldest models of a neuron, was introduced by Frank
Rosenblatt (and later analyzed by Minsky and Papert). It is the smallest unit of a neural network that
does certain computations to detect features or business intelligence in the input data. It accepts
weighted inputs and applies an activation function to obtain the output as the final result. The
perceptron is also known as a TLU (threshold logic unit). It is a supervised learning algorithm that
classifies the data into two categories, and is therefore a binary classifier. A perceptron separates the
input space into two categories by a hyperplane represented by the following equation:
w·x + b = 0, with output f(x) = 1 if w·x + b > 0 and f(x) = 0 otherwise.
Fig 2.24 Perceptron
Advantages
Perceptrons can implement logic gates like AND, OR, or NAND.
Disadvantages
Perceptrons can only learn linearly separable problems such as the Boolean AND problem. For
non-linear problems such as the Boolean XOR problem, they do not work (see the sketch below).
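A minimal perceptron sketch in base R, trained on the Boolean AND problem with the classic perceptron learning rule:

# Training data: the Boolean AND function (linearly separable)
X <- matrix(c(0, 0,  0, 1,  1, 0,  1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 0, 0, 1)

w <- c(0, 0); b <- 0; lr <- 0.1       # weights, bias, learning rate

for (epoch in 1:20) {
  for (i in 1:nrow(X)) {
    out <- ifelse(sum(w * X[i, ]) + b > 0, 1, 0)   # threshold activation
    err <- y[i] - out
    w <- w + lr * err * X[i, ]        # perceptron learning rule
    b <- b + lr * err
  }
}

ifelse(X %*% w + b > 0, 1, 0)         # predictions: 0 0 0 1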
B. Feed Forward Neural Network
This is the simplest form of neural network, where input data travels in one direction only, passing
through artificial neural nodes and exiting through output nodes. Hidden layers may
or may not be present, but input and output layers are always present. Based on this, they can be
further classified as single-layered or multi-layered feed-forward neural networks.
The number of layers depends on the complexity of the function. It has uni-directional forward
propagation but no backward propagation. Weights are static here. An activation function is
fed by inputs which are multiplied by weights; a classifying (step) activation
function is used. For example, the neuron is activated and produces 1 as output if its weighted
input is above the threshold (usually 0), and it is not activated (output considered as -1) if the input
is below the threshold. Feed-forward networks are fairly simple to maintain and are
equipped to deal with data that contains a lot of noise.
Disadvantages:
1. Cannot be used for deep learning [due to absence of dense layers and back
propagation]
C. Multilayer Perceptron
An entry point towards complex neural nets where input data travels through various layers
of artificial neurons. Every single node is connected to all neurons in the next layer which
makes it a fully connected neural network. Input and output layers are present along with
multiple hidden layers, i.e. at least three layers in total. It has bi-directional
propagation, i.e. forward propagation and backward propagation.
Inputs are multiplied with weights and fed to the activation function and in
backpropagation, they are modified to reduce the loss. In simple words, weights are machine
learnt values from Neural Networks. They self-adjust depending on the difference between
predicted outputs vs training inputs. Nonlinear activation functions are used followed by
softmax as an output layer activation function.
• Speech Recognition
• Machine Translation
• Complex Classification
1. Used for deep learning [due to the presence of dense fully connected layers and
back propagation]
D. Convolutional Neural Network
A convolutional neural network contains a three-dimensional arrangement of neurons in which each
neuron in the convolutional layer only processes the information from a small part of the
visual field (its receptive field). Input features are taken in batches, like a filter. The network understands the
images in parts and can compute these operations multiple times to complete the full image
processing. Processing involves conversion of the image from RGB or HSI scale to grey-scale.
Further changes in the pixel values help to detect the edges, and images can then
be classified into different categories.
• Image processing
• Computer Vision
• Speech Recognition
• Machine translation
Radial Basis Function Network consists of an input vector followed by a layer of RBF
neurons and an output layer with one node per category. Classification is performed by
measuring the input’s similarity to data points from the training set where each neuron stores
a prototype. This will be one of the examples from the training set.
When a new input vector [the n-dimensional vector that you are trying to classify] needs to
be classified, each neuron calculates the Euclidean distance between the input and its
prototype. For example, if we have two classes, class A and class B, and the new input to be
classified is closer to the class A prototypes than to the class B prototypes, then it is tagged or
classified as class A.
Each RBF neuron compares the input vector to its prototype and outputs a value between 0 and 1
which is a measure of similarity. If the input equals the prototype, the output of that RBF neuron
will be 1, and as the distance between the input and the prototype grows the response falls off
exponentially towards 0. The curve generated by the neuron's response tends towards a typical
bell curve. The output layer consists of a set of neurons, one per category.
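A tiny R sketch of the RBF neuron response described above, using a Gaussian similarity to two stored prototypes (all values hypothetical):

# Gaussian RBF response of a neuron with a stored prototype
rbf <- function(x, prototype, gamma = 1) exp(-gamma * sum((x - prototype)^2))

proto_A <- c(1, 1)       # stored prototype for class A
proto_B <- c(5, 5)       # stored prototype for class B

x_new <- c(1.5, 2)       # input vector to classify

sim <- c(A = rbf(x_new, proto_A), B = rbf(x_new, proto_B))
sim                       # values in (0, 1]; 1 means the input equals the prototype
names(which.max(sim))     # the input is tagged with the closer class, here "A"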
A Recurrent Neural Network is designed to save the output of a layer and feed it back to the input,
which helps in predicting the outcome of the layer. The first layer is typically a feed-forward neural
network, followed by a recurrent layer in which some information from the
previous time-step is remembered by a memory function. Forward propagation is
implemented in this case. The network stores information required for its future use. If the prediction is
wrong, the learning rate is employed to make small changes, so that the network gradually works
towards making the right prediction during backpropagation.
2. Used with convolution layers to extend the pixel effectiveness.
LSTM networks are a type of RNN that uses special units in addition to standard units.
LSTM units include a "memory cell" that can maintain information in memory for long
periods of time. A set of gates is used to control when information enters the memory, when
it is output, and when it is forgotten. There are three types of gates: the input gate, the output gate,
and the forget gate. The input gate decides how much information from the last sample will be kept
in memory; the output gate regulates the amount of data passed to the next layer; and the forget
gate controls the rate at which stored memory is discarded. This architecture lets them learn
longer-term dependencies.
This is one of many implementations of LSTM cells; many other architectures exist.
A sequence-to-sequence model consists of two Recurrent Neural Networks: an encoder that processes the input and a decoder that produces the output. The encoder and decoder work in tandem, either sharing the same parameters or using different ones. In contrast to a plain RNN, this model is particularly applicable in cases where the length of the input sequence differs from the length of the output sequence. While they share the benefits and limitations of RNNs, these models are applied mainly in chatbots, machine translation and question-answering systems.
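A compact PyTorch sketch of the encoder-decoder arrangement described above; the vocabulary sizes, hidden size and sequence lengths are invented for illustration.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
    def forward(self, src):
        _, state = self.rnn(self.embed(src))
        return state                          # the encoder summarises the input sequence

class Decoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
    def forward(self, tgt, state):
        output, _ = self.rnn(self.embed(tgt), state)
        return self.out(output)               # the decoder produces the output sequence

enc, dec = Encoder(vocab=100, hidden=32), Decoder(vocab=120, hidden=32)
src = torch.randint(0, 100, (1, 7))           # input sequence of length 7
tgt = torch.randint(0, 120, (1, 5))           # output sequence of length 5 (lengths may differ)
logits = dec(tgt, enc(src))
print(logits.shape)                           # torch.Size([1, 5, 120])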
A modular neural network has a number of different networks that function independently
and perform sub-tasks. The different networks do not really interact with or signal each other
during the computation process. They work independently towards achieving the output.
As a result, a large and complex computational process can be carried out significantly faster by breaking it down into independent components. The computation speed increases because the networks do not interact with, or even connect to, each other.
Advantages of Modular Neural Network
1. Efficient
2. Independent training
3. Robustness
Disadvantages of Modular Neural Network
Deep neural networks, or deep learning networks, have several hidden layers with millions of
artificial neurons linked together. A number, called weight, represents the connections between
one node and another. The weight is a positive number if one node excites another, or negative if
one node suppresses the other. Nodes with higher weight values have more influence on the
other nodes.
Theoretically, deep neural networks can map any input type to any output type. However, they
also need much more training as compared to other machine learning methods. They need
millions of examples of training data rather than perhaps the hundreds or thousands that a
simpler network might need.
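A tiny numerical illustration of the weighted connections described above; all values are made up.

import numpy as np

inputs = np.array([0.9, 0.3, 0.5])
# Positive weights excite the receiving node, negative weights suppress it,
# and larger magnitudes mean more influence on the receiving node.
weights = np.array([0.8, -0.4, 0.1])
bias = 0.05

activation = 1 / (1 + np.exp(-(np.dot(weights, inputs) + bias)))   # sigmoid of the weighted sum
print(activation)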
The term fuzzy refers to things that are not clear or are vague. Fuzzy logic is an approach to
variable processing that allows for multiple possible truth values to be processed through the
same variable. Fuzzy logic attempts to solve problems with an open, imprecise spectrum of data
and heuristics that makes it possible to obtain an array of accurate conclusions. Fuzzy logic is
designed to solve problems by considering all available information and making the best
possible decision given the input.
Fuzzy-Logic theory has introduced a framework whereby human knowledge can be formalized and used by machines in a wide variety of applications, ranging from cameras to trains. The ideas discussed so far are concerned with only this aspect of Fuzzy Logic-based systems, that is, the application of human experience
to machine-driven applications. While there are numerous instances where such techniques are relevant, there are also applications where it is challenging for a human user to articulate the knowledge that they hold, such as driving a car or recognizing images. Machine learning techniques provide an excellent platform in such circumstances: where sets of inputs and corresponding outputs are available, a model can be built that learns the transformation from the input data to the outputs.
In the Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents absolute falsity. In a fuzzy system, truth is not restricted to these two extremes: intermediate values are also allowed, representing statements that are partially true and partially false.
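A small sketch of this idea of partial truth, using a made-up triangular membership function for the statement "the temperature is warm".

def mu_warm(temp_c):
    # Made-up triangular membership: fully 'warm' at 25 C, not warm at all below 15 C or above 35 C.
    if 15 <= temp_c <= 25:
        return (temp_c - 15) / 10
    if 25 < temp_c <= 35:
        return (35 - temp_c) / 10
    return 0.0

print(mu_warm(25))   # 1.0  -> absolutely true that it is warm
print(mu_warm(18))   # 0.3  -> partially true and partially false
print(mu_warm(40))   # 0.0  -> absolutely false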
Architecture
Its architecture contains four parts (a minimal end-to-end sketch is given after the list):
• Rule base: It contains the set of rules and the IF-THEN conditions provided by the experts
to govern the decision-making system, on the basis of linguistic information. Recent
developments in fuzzy theory offer several effective methods for the design and tuning of
fuzzy controllers. Most of these developments reduce the number of fuzzy rules.
• Fuzzification: It is used to convert the inputs, i.e. crisp numbers, into fuzzy sets. Crisp inputs are the exact values measured by sensors and passed into the control system for processing, such as temperature, pressure or rpm.
• Inference engine: It determines the matching degree of the current fuzzy input with
respect to each rule and decides which rules are to be fired according to the input field.
Next, the fired rules are combined to form the control actions.
• Defuzzification: It is used to convert the fuzzy sets obtained by the inference engine into a
crisp value. There are several defuzzification methods available and the best-suited one is
used with a specific expert system to reduce the error.
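Putting the four parts together, the following is a minimal, simplified sketch of a fuzzy controller; the membership functions, the two IF-THEN rules and the weighted-average defuzzification are inventions for illustration rather than the method of any specific system.

# Minimal fan-speed controller sketch: fuzzify a crisp temperature, fire two IF-THEN rules,
# then defuzzify with a weighted average. All sets, rules and centres are invented.
def mu_cold(t):  return max(0.0, min(1.0, (25 - t) / 10))   # 1 at <= 15 C, 0 at >= 25 C
def mu_hot(t):   return max(0.0, min(1.0, (t - 15) / 10))   # 0 at <= 15 C, 1 at >= 25 C

def fan_speed(temp_c):
    # Rule 1: IF temperature is cold THEN fan speed is low  (centre 200 rpm)
    # Rule 2: IF temperature is hot  THEN fan speed is high (centre 1200 rpm)
    firing = [(mu_cold(temp_c), 200.0), (mu_hot(temp_c), 1200.0)]
    num = sum(w * centre for w, centre in firing)
    den = sum(w for w, _ in firing)
    return num / den if den > 0 else 0.0                    # defuzzified crisp output

print(fan_speed(21))   # 800.0 rpm: partly 'cold', partly 'hot', so an intermediate speed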
Procedure
Step 1: Divide the input and output spaces into fuzzy regions.
We start by assigning some fuzzy sets to each input and output space. Wang and Mendel
specified an odd number of evenly spaced fuzzy regions, determined by 2N+1 where N is an
integer. As we will see later on, the value of N affects the performance of our models and can
result in under- or over-fitting at times. N is, therefore, one of the hyper-parameters that we will use to tune this system's performance.
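A small sketch of this partitioning step, assuming evenly spaced triangular regions as in the original recommendation; the range [0, 10] and N = 2 are arbitrary choices for illustration.

import numpy as np

def fuzzy_regions(lo, hi, N):
    """Return the centres of 2N+1 evenly spaced triangular fuzzy regions over [lo, hi]."""
    return np.linspace(lo, hi, 2 * N + 1)

def memberships(x, centres):
    """Triangular membership of x in each region; each triangle peaks at its centre
    and reaches zero at the neighbouring centres."""
    width = centres[1] - centres[0]
    return np.clip(1 - np.abs(x - centres) / width, 0, 1)

centres = fuzzy_regions(0.0, 10.0, N=2)          # 5 regions over the input space
print(centres)                                   # [ 0.   2.5  5.   7.5 10. ]
print(memberships(3.0, centres))                 # degrees of membership of x = 3.0 in each region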
Step 2: Generate fuzzy rules from the data pairs.
We can use our input and output spaces, together with the fuzzy regions that we have just defined, and the dataset for the application, to generate fuzzy rules of the form:
If {antecedent clauses} then {consequent clauses}
We start by determining the degree of membership of each sample in the dataset to the different
fuzzy regions in that space. If, as an example, we consider a sample depicted below:
Fig. 2.38 Corresponding degrees of membership for the sample
We then assign to each space the region having the maximum degree of membership, indicated by the highlighted elements in the table above, so that a rule can be obtained:
The next illustration shows a second example, together with the degree of membership results
that it generates.
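A hedged sketch of this rule-generation step for one made-up (x1, x2, y) sample: each variable is assigned the region in which it has the maximum degree of membership, and the memberships are multiplied to give the rule degree used in the next step. The region labels and sample values are assumptions.

import numpy as np

centres = np.linspace(0.0, 10.0, 5)                        # 2N+1 = 5 evenly spaced regions
names = ['S2', 'S1', 'CE', 'B1', 'B2']                     # conventional region labels

def memberships(x):
    width = centres[1] - centres[0]
    return np.clip(1 - np.abs(x - centres) / width, 0, 1)  # triangular membership degrees

def best_region(x):
    degrees = memberships(x)
    idx = int(np.argmax(degrees))                          # keep the region with maximum membership
    return names[idx], float(degrees[idx])

x1, x2, y = 3.0, 7.8, 5.1                                  # one made-up training sample
(r1, d1), (r2, d2), (ry, dy) = best_region(x1), best_region(x2), best_region(y)
print(f"IF x1 is {r1} AND x2 is {r2} THEN y is {ry}")
print("degree of the rule =", round(d1 * d2 * dy, 3))      # product strategy used in the next step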
Step 2 is very straightforward to implement, yet it suffers from one problem: it will generate conflicting rules, that is, rules that have the same antecedent clauses but different consequent clauses. Wang and Mendel solved this issue by assigning a degree to each rule, using a product strategy such that the degree is the product of all the degree-of-membership values from both the
antecedent and consequent spaces forming the rule. We retain the rule having the largest degree, while we discard the rules having the same antecedent but a smaller degree.
If we refer to the previous example, the degree of Rule 1 is the product of the highlighted membership values.
We notice that this procedure reduces the number of rules radically in practice.
It is also possible to fuse human knowledge with the knowledge obtained from data by introducing a human element into the rule degree. This has high applicability in practice, as human supervision can assess the reliability of the data, and hence of the rules generated from it, directly. In cases where human intervention is not desirable, this factor is set to 1 for all rules. The degree of Rule 1 can hence be defined as follows:
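The expression that follows is not reproduced in these notes; it is presumably the standard Wang-Mendel rule degree, i.e. the product of the membership degrees of the sample in the regions appearing in the rule, multiplied by the (optional) human-assigned factor:

$D(\text{Rule 1}) = \mu_{A_1}(x_1) \cdot \mu_{B_1}(x_2) \cdot \mu_{C_1}(y) \cdot m_1$

where the $\mu$ terms are the degrees of membership of the sample in the antecedent and consequent regions of the rule, and $m_1$ is the human-assigned degree, equal to 1 when no supervision is used.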
A Combined Fuzzy Rule Base is a matrix that holds the fuzzy rule-base information for a system. It can contain the rules that are generated numerically using the procedure described above, as well as rules obtained from human experience.
Fig.2.41 Combined Fuzzy Rule Base for this system. Note Rule 1 and Rule 2.
The final step in this procedure explains the defuzzification strategy used to determine the value
of y, given (x1, x2). Wang and Mendel suggest a different approach to the max-min computation
used by Mamdani. We have to consider that, in practical applications, the number of input spaces
will be significant when compared to the typical control application where Fuzzy Logic is
typically used. Besides, this procedure will generate a large number of rules, and therefore it
would be impractical to compute an output using the ‘normal’ approach.
For a given input combination (x1, x2), we combine the antecedents of a given rule using the product operator to determine the degree to which that rule's output applies to (x1, x2).
We now define the centre of a fuzzy region as the point that has the smallest absolute value among all points at which the membership function for this region is equal to 1, as illustrated below.
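The defuzzified output can then be written as follows (a reconstruction of the standard Wang-Mendel centre-average formula, stated here because the original figure is not reproduced): for an input (x1, x2), each rule i fires with a degree equal to the product of its antecedent memberships, and the crisp output is the weighted average of the centres of the output regions of the K rules:

$y = \dfrac{\sum_{i=1}^{K} m^{i}\,\bar{y}^{i}}{\sum_{i=1}^{K} m^{i}}, \qquad m^{i} = \mu_{A^{i}}(x_1)\,\mu_{B^{i}}(x_2)$

where $\bar{y}^{i}$ is the centre of the output region of rule $i$.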
Testing
A (very rough) implementation of the above algorithm was developed in Python to test it with real datasets. The code and data used are available on GitHub. Some considerations on this system include:
• The sets were created using the recommendation in the original paper, that is, evenly spaced. It is, however, interesting to see the effects of changing this method. One idea is to create the sets around the dataset mean with a spread related to the standard deviation; this could be investigated in future work.
• The system created does not cater for categorical data implicitly, and this is a future
improvement that can affect the performance of the system considerably in real-life
scenarios.
Testing metrics
We will use the coefficient of determination (R-Squared) to assess the performance of this system
and to tune the hyper parameter that was identified, the number of fuzzy sets generated.
To explain R-Squared, we must first define the sum of squares total and the sum of squares residual.
The sum of squares total is the sum of the squared differences between the dependent variable (y) and its mean.
The sum of squares residual is the sum of the squared differences between the actual and the estimated values of the dependent variable.
R-Squared is then one minus the ratio of the residual to the total sum of squares. We notice that R-Squared takes a value between 0 and 1: the larger, the better. If R-Squared = 1, there is no error and the estimated values are equal to the actual values.
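In symbols, using the standard definitions:

$SS_{tot} = \sum_i (y_i - \bar{y})^2, \qquad SS_{res} = \sum_i (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$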
For a second test, we have used the Weather in Szeged 2006–2016 dataset available on Kaggle. The dataset contains a number of weather-related features, including temperature and humidity.
For this exercise, we will be discarding most of these features and assessing whether we can predict the temperature given the month and the humidity. Upon examining the data, we notice that the average temperature varies between 1 and 23 degrees Celsius, with a variability of about 20 degrees per month.
The average humidity varies between 0.63 and 0.85.
The best Fuzzy system that was tested consisted of 3 fuzzy spaces (N=1) for the input variables
and nine fuzzy spaces (N=4) for the temperature. The system generated the nine rules depicted in
the Fuzzy Distribution Map below and attained an R-Squared value of 0.75, using a test sample of
20%.
2.12 Fuzzy Decision Tree
FID (Fuzzy Decision Tree) was introduced in 1996. It is a classification system implementing the popular and efficient recursive-partitioning technique of decision trees, while combining fuzzy representation and approximate reasoning to deal with noise and language uncertainty. A fuzzy decision tree induction method, based on the reduction of classification ambiguity with fuzzy evidence, has been developed. Fuzzy decision trees represent classification knowledge in a way that is closer to human thinking and are more robust in tolerating imprecise, conflicting, and missing information.
Fuzzy Decision Trees (FID) combine fuzzy representation, and its approximate reasoning, with symbolic decision trees. As such, they can handle language-related uncertainty, noise, and missing or faulty features, exhibit robust behaviour, and provide a comprehensible knowledge interpretation.
Unlike classical mathematics, which is deterministic, the introduction of fuzzy theory brings the
discipline of mathematics into a new field, which expands the range of applications of
mathematics, bringing it from the exact space to the fuzzy space. Like all disciplines, fuzzy
theory was born from the needs of production and practice. From time to time, we encounter
fuzzy concepts in many areas of real life. For example, we use fast and slow when describing speed, and expensive and cheap when describing price, and it is difficult to separate these adjectives
with a precise boundary. The distinctive feature of fuzzy decision tree algorithm is the
integration of fuzzy theory and classical decision tree algorithm, which expands the application
field of decision tree algorithm.
The fuzzification improvements over the classical decision tree algorithm mainly include the following.
Compared with the decision rules of a crisp decision tree algorithm, the fuzzy rules generated by a fuzzy decision tree are more realistic, and the set composed of these fuzzy rules is called the fuzzy rule set. If the basis for selecting the splitting attribute in the decision tree algorithm is collectively called a crisp heuristic, and that of the fuzzy decision tree a fuzzy heuristic, then the differences between these two types of algorithms lie mainly in the
heuristics, the leaf-node selection (or branch-termination) criteria, and the final generated rules. Although the fuzzy decision tree algorithm is an improvement of the decision tree algorithm, this does not mean that the fuzzy algorithm is better than the crisp algorithm in all respects; different algorithms should be chosen according to the actual application area.
7. The generation of FDT for pattern classification consists of three major steps namely fuzzy
partitioning (clustering), induction of FDT and fuzzy rule inference for classification.
8. The first crucial step in the induction process of FDT is the fuzzy partitioning of input space
using any fuzzy clustering techniques.
9. FDTs are constructed using any standard algorithm like Fuzzy ID3 where we follow a top-
down, recursive divide and conquer approach, which makes locally optimal decisions at each
node.
10. As the tree is being built, the training set is recursively partitioned into smaller subsets and
the generated fuzzy rules are used to predict the class of an unseen pattern by applying suitable
fuzzy inference/reasoning mechanism on the FDT.
11. The general procedure for generating fuzzy decision trees using Fuzzy ID3 is as follows:
• Prerequisites: a fuzzy partition space, a leaf-selection threshold th and the best-node selection criterion.
• Procedure: while there exist candidate nodes
Stochastic search algorithms are designed for problems with inherent random noise or
deterministic problems solved by injected randomness. The search favors designs with better
performance. An important feature of stochastic search algorithms is that they can carry out
broad search of the design space and thus avoid local optima. Also, stochastic search algorithms
do not require gradients to guide the search, making them a good fit for discrete problems.
However, there is no guarantee that the optimum solution will be found, and the algorithm must be run multiple times to make sure the attained solutions are robust. To handle constraints, penalties can be applied to designs that violate them. For constraints that are difficult to formulate explicitly, a true/false check is straightforward to implement.
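A minimal random-search sketch illustrating these ideas; the objective function, the constraint and the penalty weight are invented for illustration.

import numpy as np

rng = np.random.default_rng(42)

def objective(x):
    return (x[0] - 2) ** 2 + (x[1] + 1) ** 2          # toy function to minimise

def penalised(x):
    # Constraint x0 + x1 <= 1 handled with a penalty on designs that violate it.
    violation = max(0.0, x[0] + x[1] - 1.0)
    return objective(x) + 100.0 * violation

best_x, best_f = None, np.inf
for _ in range(5000):                                 # broad, gradient-free search of the design space
    x = rng.uniform(-5, 5, size=2)
    f = penalised(x)
    if f < best_f:
        best_x, best_f = x, f                         # the search favours better-performing designs
print(best_x, best_f)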
In order to describe stochastic processes in statistical terms, we can give the following
definitions:
• Observation: the result of one trial.
• Population: all the possible observations that can be registered from a trial.
• Sample: a set of results collected from separate, independent trials.
For example, the toss of a fair coin is a random process, but thanks to the Law of Large Numbers we know that, given a large number of trials, we will get approximately the same number of heads and tails.
The Law of Large Numbers states that: "As the size of the sample increases, the mean value of the sample will better approximate the mean or expected value in the population. Therefore, as the sample size goes to infinity, the sample mean will converge to the population mean."
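A quick simulation of this statement for a fair coin (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    tosses = rng.integers(0, 2, size=n)    # 0 = tails, 1 = heads for a fair coin
    print(n, tosses.mean())                # the sample mean approaches the expected value 0.5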
Some examples of random processes are stock markets and medical data such as blood pressure and EEG signals.
One of the main applications of Machine Learning is modelling stochastic processes. Some of the most common types of stochastic process are outlined below.
Poisson processes
Poisson Processes are used to model a series of discrete events in which we know the average time between the occurrences of different events but we don't know exactly when each of these events will take place. A process can be considered a Poisson Process if it satisfies the following criteria:
1. The events are independent of each other (if an event happens, this does not alter the
probability that another event can take place).
2. Two events can’t take place simultaneously.
3. The average rate at which events occur is constant.
Let's take power cuts as an example. The electricity provider might advertise that power cuts are likely to happen every 10 months on average, but we can't precisely tell when the next power cut is going to happen. For example, if a major problem occurs, the electricity might go off repeatedly for 2–3 days (e.g. if the company needs to make some changes to the power source) and then stay on for the next 2 years. Therefore, for this type of process, we
can be quite sure of the average time between the events, but their occurrences are randomly spaced in time. From a Poisson Process we can then derive a Poisson Distribution, which can be used to find the probability of the waiting time between the occurrences of different events or the number of events that take place in a given period.
A Poisson Distribution can be modelled using the following formula, where λ represents the expected (average) number of events in a period and k is the number of events actually observed.
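The formula itself is not reproduced in these notes; it is the standard Poisson probability mass function:

$P(k \text{ events in a period}) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$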
Some examples of phenomena which can be modelled using Poisson Processes are radioactive decay and the arrival of calls at a call centre.
A Random Walk can be any sequence of discrete steps (of always the same length) moving in
random directions. Random Walks can take place in any type of dimensional space (eg. 1D, 2D,
nD).
Let’s imagine that we are in a park and we can see a dog looking for food. He is currently in
position zero on the number line and he has an equal probability to move left or right to find any
food.
Now, if we want to find out what the position of the dog is going to be after a certain number N of steps, we can again take advantage of the Law of Large Numbers. Using this law, we find that as N goes to infinity our dog will probably be back at its starting point; this result is not particularly informative, because the positive and negative displacements cancel each other out on average.
Therefore, we can instead use the Root-Mean-Square (RMS) of the displacements as our distance metric (we first square all the values, then calculate their average and finally take the square root of the result). In this way, all our negative numbers become positive and the average no longer cancels out to zero.
In this example, using RMS we find that if our dog takes 100 steps it will have moved, on average, about 10 steps (√100) away from its starting point.
Brownian Motion can be used to describe a continuous-time random walk. Some examples of random-walk applications are tracing the path taken by molecules as they move through a gas and modelling the fluctuation of share prices.
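A short simulation of the one-dimensional random walk described above, confirming the zero mean and the RMS distance of roughly √N steps:

import numpy as np

rng = np.random.default_rng(0)
n_walks, n_steps = 10_000, 100
steps = rng.choice([-1, 1], size=(n_walks, n_steps))   # equal probability of stepping left or right
positions = steps.sum(axis=1)                          # final position of each walk

print(positions.mean())                                # close to 0: on average the dog is back at the start
print(np.sqrt(np.mean(positions ** 2)))                # RMS distance, close to sqrt(100) = 10 steps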
HMMs are probabilistic graphical models used to predict a sequence of hidden (unknown) states
from a set of observable states. This class of models follows the Markov processes assumption:
“The future is independent of the past, given that we know the present”. Therefore, when
working with Hidden Markov Models, we just need to know our present state in order to make a
prediction about the next one (we don’t need any information about the previous states). To
make our predictions using HMMs we just need to calculate the joint probability of our hidden
states and then select the sequence which yields the highest probability (the most likely to
happen).
In order to calculate the joint probability, we need three main types of information:
• Initial condition: the initial probability we have to start our sequence in any of the hidden
states.
• Transition probabilities: the probabilities of moving from one hidden state to another.
• Emission probabilities: the probabilities of moving from a hidden state to an observable
state.
As a simple example, let's imagine we are trying to predict what the weather is going to be like tomorrow, based on the type of clothing worn by the people around us. In this case, the different types of weather are our hidden states (e.g. sunny, windy and rainy) and the types of clothing worn are our observable states (e.g. t-shirt, long trousers and jacket). Our initial condition is the starting point of the series. The transition probabilities represent the likelihood of moving from one type of weather to another from one day to the next. Finally, the emission probabilities are the probabilities that someone wears a certain attire depending on the weather of the previous day.
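A small sketch of the joint-probability computation for this kind of example; all probability values below are invented, and the brute-force search over hidden sequences stands in for what the Viterbi algorithm does efficiently.

from itertools import product

states = ['sunny', 'rainy']
initial = {'sunny': 0.6, 'rainy': 0.4}                                  # initial condition
transition = {'sunny': {'sunny': 0.7, 'rainy': 0.3},                    # transition probabilities
              'rainy': {'sunny': 0.4, 'rainy': 0.6}}
emission = {'sunny': {'t-shirt': 0.8, 'jacket': 0.2},                   # emission probabilities
            'rainy': {'t-shirt': 0.1, 'jacket': 0.9}}

def joint_probability(hidden_seq, obs_seq):
    p = initial[hidden_seq[0]] * emission[hidden_seq[0]][obs_seq[0]]
    for prev, cur, obs in zip(hidden_seq, hidden_seq[1:], obs_seq[1:]):
        # Markov assumption: only the present state matters for the next step.
        p *= transition[prev][cur] * emission[cur][obs]
    return p

obs = ['t-shirt', 't-shirt', 'jacket']
# Brute force over all possible hidden sequences and keep the most likely one.
best = max(product(states, repeat=len(obs)), key=lambda h: joint_probability(h, obs))
print(best, joint_probability(best, obs))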
As the length of the observed sequence grows, the number of probabilities and possible scenarios to evaluate increases exponentially. To solve this problem, it
is possible to use another algorithm called the Viterbi Algorithm. From a Machine Learning point of view, the observations form our training data and the number of hidden states is a hyper-parameter to tune. One of the most common applications of HMMs in Machine Learning is speech recognition.
Gaussian Processes are a class of stationary, zero-mean stochastic processes which are completely determined by their autocovariance functions. This class of models can be used for both regression and classification tasks. One of the greatest advantages of Gaussian Processes is that they can provide estimates of uncertainty, for example giving us an estimate of how sure an algorithm is that an item belongs to a given class or not.
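A short scikit-learn sketch of the uncertainty estimates mentioned above, using synthetic data:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel()                               # synthetic training data

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X, y)

X_new = np.array([[2.5], [7.0]])
mean, std = gp.predict(X_new, return_std=True)      # the model also returns its own uncertainty
print(mean, std)                                    # std is small near the data, large far from it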
Imagine now that one of your friends challenges you to a game of dice and you make 50 throws. In the case of a fair die, we would expect each of the six faces to have the same probability of appearing (1/6).
However, the more you keep playing, the more you notice that the die tends to land on the same faces. At this point, you start suspecting that the die might be loaded, and therefore you update your initial belief about the probability of each face appearing.
This process is known as Bayesian Inference: a process through which we update our beliefs about the world based on the gathering of new evidence. We start with a prior belief and, once we update it with new information, we construct a posterior belief. The same reasoning applies to Gaussian Processes: we can place a prior distribution over functions, and later update that distribution using Bayes' Rule once we gather new training data.
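A sketch of this updating process for the loaded-die suspicion; the prior, the invented 'loaded' likelihoods and the sequence of throws are all made up for illustration.

# Prior belief: the die is probably fair. New evidence: ten throws, mostly sixes.
priors = {'fair': 0.9, 'loaded': 0.1}
likelihood = {'fair': {face: 1 / 6 for face in range(1, 7)},
              'loaded': {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.75}}  # invented bias

throws = [6, 6, 3, 6, 6, 6, 2, 6, 6, 6]

posteriors = dict(priors)
for face in throws:
    # Bayes' Rule: posterior is proportional to likelihood times prior, renormalised each time.
    unnorm = {h: likelihood[h][face] * posteriors[h] for h in posteriors}
    total = sum(unnorm.values())
    posteriors = {h: p / total for h, p in unnorm.items()}

print(posteriors)   # the posterior belief shifts strongly towards the 'loaded' hypothesis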
Auto-Regressive Moving Average (ARMA) processes are a very important class of stochastic processes used to analyse time series. What characterizes ARMA models is that their autocovariance functions depend on only a finite number of unknown parameters.
The ARMA model makes use of two main parameters (p, q). These are:
• p: the order of the auto-regressive (AR) part of the model.
• q: the order of the moving-average (MA) part of the model.
ARMA processes assume that a time series fluctuates uniformly around a time-invariant mean. If we are trying to analyse a time series which does not follow this pattern, the series will need to be differenced until it achieves stationarity.
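A short statsmodels sketch along these lines, fitting an ARMA(2, 1) model (written as ARIMA with d = 0) to a synthetic stationary series; the coefficients and the series are invented.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic AR(2)-like series fluctuating around a constant mean.
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 0.5 * y[t - 1] - 0.25 * y[t - 2] + rng.normal()

model = ARIMA(y, order=(2, 0, 1))      # (p, d, q): d = 0 because the series is already stationary
result = model.fit()
print(result.summary())
print(result.forecast(steps=5))        # forecast the next five points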