
DATA ANALYTICS BY- NIRAJ KUMAR TIWARI

DATA ANALYTICS (KCS-051/KIT-601)

DATA ANALYTICS
UNIT 2
2.1 Regression modeling
Regression analysis is a predictive modelling technique which investigates the relationship
between a dependent (target) variable and one or more independent (predictor) variables. This
technique is used for forecasting, time series modelling and finding causal relationships between
variables. For example, the relationship between rash driving and the number of road accidents
caused by a driver is best studied through regression.

Regression models are widely used data analytics techniques that allow the identification and
estimation of possible relationships between a pattern or variable of interest and the factors that
influence that pattern. Regression analysis helps us to understand how the value of the dependent
variable changes with respect to an independent variable when the other independent variables are
held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line
to the data points in such a manner that the distances of the data points from the curve or line are
minimized.

There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between the dependent variable and the independent
variables.
2. It indicates the strength of impact of multiple independent variables on a dependent
variable.
Regression analysis also allows us to compare the effects of variables measured on different
scales, such as the effect of price changes and the number of promotional activities. These
benefits help market researchers, data analysts and data scientists to evaluate and select the
best set of variables to be used for building predictive models.

Example: Suppose there is a marketing company A, which runs various advertisements every year
and gets sales from them. The list below shows the advertisements made by the company in the last 5
years and the corresponding sales:

Fig 2.1 Example of Regression

Now, the company plans to spend $200 on advertisement in the year 2019 and wants to
predict the sales for this year. To solve such prediction problems in
machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables. In regression, we fit a graph
between the variables which best fits the given data points; using this plot, the machine learning
model can make predictions about the data. In simple words, "Regression shows a line or curve
that passes through the data points on the target-predictor graph in such a way that the vertical
distance between the data points and the regression line is minimum." The distance between the data
points and the line tells whether a model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to predict
or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are
used to predict its values, are called independent variables, also
known as predictors.
o Outliers: An outlier is an observation which contains either a very low value or a very high
value in comparison to the other observed values. An outlier may distort the result, so it
should be handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each
other, the condition is called multicollinearity. It should not be present in
the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with the test dataset, the problem is called overfitting. And if our algorithm
does not perform well even with the training dataset, the problem is called
underfitting.

Uses of Regression Analysis

As mentioned above, regression analysis helps in the prediction of a continuous variable. There
are various scenarios in the real world where we need future predictions, such as weather
conditions, sales, marketing trends, etc. For such cases we need a technique which
can make predictions accurately. Regression analysis is such a
statistical method, and it is used in machine learning and data science.

Below are some other reasons for using Regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the
least important factor, and how each factor affects the others.

2.1.1 Types of regression techniques


There are various types of regressions which are used in data analytics, data science and machine
learning. These techniques are mostly driven by three metrics (number of independent variables,
type of dependent variables and shape of regression line).

1. Linear Regression
It is one of the most widely known modeling techniques. In this technique, the dependent
variable is continuous, independent variable(s) can be continuous or discrete, and nature of
regression line is linear.
Linear regression is a statistical regression method which is used for predictive analysis. It
shows the relationship between the continuous variables. It is used for solving the regression
problem in machine learning. Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.

Linear regression establishes a relationship between the dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line
and e is the error term. This equation can be used to predict the value of the target variable based
on the given predictor variable(s).

Fig 2.2 (a) Linear Regression

How to obtain the best fit line (values of a and b)?


This task can be accomplished by the Least Squares Method. It is the most common method
used for fitting a regression line. It calculates the best-fit line for the observed data by
minimizing the sum of the squares of the vertical deviations from each data point to the line.
Because the deviations are squared before they are added, there is no cancelling out between positive
and negative values.

Fig 2.2 (b) Linear Regression
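As a minimal illustration in R (using small made-up advertising and sales values, since the data in
Fig 2.1 are not reproduced here), the least-squares slope and intercept can be computed directly and
compared with the built-in lm() function:

# Hypothetical advertising spend (x) and sales (y) -- illustrative values only
x <- c(90, 120, 150, 100, 130)
y <- c(1000, 1300, 1800, 1200, 1380)

# Least-squares estimates: b = Sxy / Sxx, a = mean(y) - b * mean(x)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)

# The same fit using R's built-in linear model function
fit <- lm(y ~ x)
coef(fit)

# Predict the sales for an advertising spend of 200
predict(fit, newdata = data.frame(x = 200))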

Important Points:
• There must be a linear relationship between the independent and dependent variables.
• Multiple regression suffers from multicollinearity, autocorrelation and heteroskedasticity.
• Linear regression is very sensitive to outliers. They can terribly affect the regression line
and eventually the forecasted values.
• Multicollinearity can increase the variance of the coefficient estimates and make the estimates
very sensitive to minor changes in the model. The result is that the coefficient estimates are
unstable.
• In the case of multiple independent variables, we can use forward selection, backward
elimination and stepwise approaches to select the most significant independent variables.

2. Logistic Regression

Logistic regression is another supervised learning algorithm which is used to solve
classification problems. In classification problems, the dependent variable is in a binary or
discrete format such as 0 or 1. The logistic regression algorithm works with categorical variables
such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc. It is a predictive analysis
algorithm which works on the concept of probability. Logistic regression uses the sigmoid
function (also called the logistic function) to map predictions to probabilities. This sigmoid
function is used to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:

Fig 2.3 Logistic Regression

o It uses the concept of threshold levels: values above the threshold level are rounded up to
1, and values below the threshold level are rounded down to 0.

There are three types of logistic regression:


o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Important Points:
• Logistic regression is widely used for classification problems.
• Logistic regression doesn't require a linear relationship between the dependent and independent
variables. It can handle various types of relationships because it applies a non-linear log
transformation to the predicted odds ratio.
• To avoid overfitting and underfitting, we should include all significant variables. A good
approach to ensure this practice is to use a stepwise method to estimate the logistic regression.
• It requires large sample sizes because maximum likelihood estimates are less powerful at low
sample sizes than ordinary least squares.
• The independent variables should not be correlated with each other, i.e. no multicollinearity.
However, we have the option to include interaction effects of categorical variables in the
analysis and in the model.
• If the values of the dependent variable are ordinal, then it is called ordinal logistic regression.
• If the dependent variable is multi-class, then it is known as multinomial logistic regression.
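A minimal sketch of fitting a binary logistic regression in R uses glm() with the binomial family;
here it is shown on R's built-in mtcars data, predicting transmission type (am: 0 = automatic,
1 = manual) from weight and horsepower:

# Fit logistic regression: family = binomial applies the logit (sigmoid) link
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(model)

# Predicted probabilities from the sigmoid, then a 0.5 threshold as described above
p <- predict(model, type = "response")
predicted_class <- ifelse(p > 0.5, 1, 0)
table(actual = mtcars$am, predicted = predicted_class)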


3. Polynomial Regression:
Polynomial regression is a type of regression which models a non-linear dataset using a linear
model. It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and the corresponding conditional values of y.

Suppose there is a dataset whose data points are arranged in a non-linear
fashion; in such a case, linear regression will not fit those data points well. To cover such
data points, we need polynomial regression. In polynomial regression, the original features are
transformed into polynomial features of a given degree and then modeled using a linear
model, which means the data points are best fitted using a polynomial curve.

Fig 2.4 Polynomial Regression

o The equation for polynomial regression is also derived from the linear regression equation: the
linear regression equation Y = b0 + b1x is transformed into the polynomial regression
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients; x is
our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features
include quadratic and higher-order terms.
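A short sketch of polynomial regression in R: the single feature x is expanded into polynomial
terms with poly() and fitted with an ordinary linear model (the data here are simulated purely for
illustration):

set.seed(1)
# Simulated non-linear data: quadratic trend plus noise
x <- seq(-3, 3, length.out = 50)
y <- 2 + 1.5 * x + 0.8 * x^2 + rnorm(50, sd = 1)

# Plain linear fit vs. degree-2 polynomial fit
linear_fit <- lm(y ~ x)
poly_fit   <- lm(y ~ poly(x, degree = 2, raw = TRUE))

# The polynomial model is still linear in its coefficients b0, b1, b2
coef(poly_fit)

# Compare fit quality (proportion of variance explained)
summary(linear_fit)$r.squared
summary(poly_fit)$r.squared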

4. Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression. Support Vector Regression is a regression algorithm which works
for continuous variables.
Below are some keywords which are used in Support Vector Regression:

o Kernel: A function used to map lower-dimensional data into higher-dimensional
data.
o Hyperplane: In a general SVM, it is a separating line between two classes, but in SVR it
is the line which helps to predict the continuous variable and covers most of the data
points.
o Boundary lines: The two lines drawn apart from the hyperplane which create a
margin for the data points.
o Support vectors: The data points which are nearest to the hyperplane
and of the opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of data points are covered in that margin. The main goal of SVR is to consider the
maximum data points within the boundary lines and the hyperplane (best-fit line) must contain a
maximum number of data points. Consider the below image:

Fig 2.5 Support Vector Regression

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
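Assuming the e1071 package (a common R interface to LIBSVM) is available, a minimal SVR sketch on
simulated data could look like the following; the data and parameter values are illustrative only:

# install.packages("e1071")   # assumed to be installed
library(e1071)

set.seed(42)
# Simulated continuous target with a non-linear relationship
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)
dat <- data.frame(x = x, y = y)

# eps-regression with a radial (RBF) kernel; epsilon controls the width of the tube
svr_model <- svm(y ~ x, data = dat, type = "eps-regression",
                 kernel = "radial", epsilon = 0.1)

# Predictions from the fitted SVR model
dat$y_hat <- predict(svr_model, dat)
head(dat)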

5. Decision Tree Regression:


Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems. It can solve problems for both categorical and numerical
data. Decision Tree regression builds a tree-like structure in which each internal node represents
the "test" for an attribute, each branch represents the result of the test, and each leaf node
represents the final decision or result. A decision tree is constructed starting from the root
node/parent node (the full dataset), which splits into left and right child nodes (subsets of the dataset).
These child nodes are further divided into their own child nodes, and themselves become the parent
nodes of those nodes. Consider the below image:

Fig 2.6 Decision Tree Regression

The above image shows an example of decision tree regression; here, the model is trying to
predict the choice of a person between a sports car and a luxury car.
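Assuming the rpart package (shipped with standard R installations) is available, a small
regression-tree sketch on the built-in mtcars data might look like this:

# Regression tree: predict fuel efficiency (mpg) from weight and horsepower
library(rpart)
tree_model <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# Inspect the learned splits (internal nodes = tests, leaves = predicted values)
print(tree_model)

# Predict mpg for a hypothetical car
predict(tree_model, newdata = data.frame(wt = 2.5, hp = 110))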

6. Random Forest Regression

o Random forest is one of the most powerful supervised learning algorithms which is
capable of performing regression as well as classification tasks.
o Random forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output as the average of the individual tree outputs. The
combined decision trees are called base models, and the ensemble can be represented more
formally as:

g(x) = (f0(x) + f1(x) + f2(x) + ...) / N, where N is the number of trees


o Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble learning, in
which the aggregated decision trees run in parallel and do not interact with each other.
o With the help of random forest regression, we can reduce overfitting in the model by
creating random subsets of the dataset.


Fig 2.7 Random Forest Regression
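Assuming the randomForest package is installed, a minimal random forest regression sketch on the
built-in mtcars data is shown below; ntree = 500 is an illustrative choice:

# install.packages("randomForest")   # assumed to be installed
library(randomForest)

set.seed(7)
# An ensemble of bagged regression trees whose predictions are averaged
rf_model <- randomForest(mpg ~ wt + hp + disp, data = mtcars,
                         ntree = 500, importance = TRUE)
rf_model                 # out-of-bag estimate of % variance explained
importance(rf_model)     # relative importance of each predictor
predict(rf_model, newdata = data.frame(wt = 2.5, hp = 110, disp = 160))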

2.2 Multivariate Analysis

In data analytics, we look at different variables (or factors) and how they might impact certain
situations or outcomes. For example, in marketing, you might look at how the variable “money
spent on advertising” impacts the variable “number of sales.” In the healthcare sector, you might
want to explore whether there’s a correlation between “weekly hours of exercise” and
“cholesterol level.” This helps us to understand why certain outcomes occur, which in turn
allows us to make informed predictions and decisions for the future.

Multivariate analysis is defined as a process involving multiple variables that together
result in one outcome. This reflects the fact that the majority of problems in the real world
are multivariate. For example, we cannot predict the weather of any year based only on the
season; there are multiple factors like pollution, humidity, precipitation, etc. Multivariate
data analysis is a type of statistical analysis that involves more than two variables
resulting in a single outcome. Many problems in the world are practical examples of
multivariate equations, as whatever happens in the world happens due to multiple reasons.

One such example of the real world is the weather. The weather at any particular place does not
solely depend on the ongoing season, instead many other factors play their specific roles, like
humidity, pollution, etc. Just like this, the variables in the analysis are prototypes of real-time
situations, products, services, or decision-making involving more variables.


The History of Multivariate analysis

In 1928, Wishart presented his paper "The precise distribution of the sample covariance
matrix of the multivariate normal population", which marked the beginning of MVA. In the 1930s,
R.A. Fisher, Hotelling, S.N. Roy, B.L. Xu and others did a lot of fundamental theoretical
work on multivariate analysis. At that time, it was widely used in the fields of psychology,
education, and biology. In the middle of the 1950s, with the appearance and expansion of
computers, multivariate analysis began to play a big role in the geological, meteorological,
medical and social sciences. From then on, new theories and new methods were proposed
and tested constantly in practice, and at the same time more application fields were
explored. With the aid of modern computers, we can apply the methodology of multivariate
analysis to perform rather complex statistical analyses.

Fig 2.8 History of Multivariate analysis

Objectives of multivariate data analysis:

1. Multivariate data analysis helps in the reduction and simplification of data as much as
possible without losing any important details.
2. As MVA has multiple variables, the variables are grouped and sorted on the basis of their
unique features.
3. The variables in multivariate data analysis could be dependent or independent. It is
important to verify the collected data and analyze the state of the variables.
4. In multivariate data analysis, it is very important to understand the relationship between
all the variables and predict the behavior of the variables based on observations.
5. Statistical hypotheses are created based on the parameters of the multivariate data and
tested to determine whether or not the assumptions are true.

Advantages of multivariate data analysis:

The following are the advantages of multivariate data analysis:

1. The main advantage of multivariate analysis is that since it considers more than one
factor of independent variables that influence the variability of dependent variables,
the conclusion drawn is more accurate.
2. Since the analysis is tested, the drawn conclusions are closer to real-life situations.

Disadvantages of multivariate data analysis:

The following are the disadvantages of multivariate data analysis:

1. Multivariate data analysis includes many complex computations and hence can be
laborious.
2. The analysis necessitates the collection and tabulation of a large number of observations
for various variables. This process of observation takes a long time.

2.2.1 Multivariate data analysis techniques


There are many different techniques for multivariate analysis, and they can be divided into two
categories:
• Dependence techniques
• Interdependence techniques

Multivariate analysis techniques: Dependence vs. interdependence

Dependence methods
Dependence methods are used when one or some of the variables are dependent on others.
Dependence looks at cause and effect; in other words, can the values of two or more independent
variables be used to explain, describe, or predict the value of another, dependent variable? To
give a simple example, the dependent variable of “weight” might be predicted by independent
variables such as “height” and “age.”

In machine learning, dependence techniques are used to build predictive models. The analyst
enters input data into the model, specifying which variables are independent and which ones are
dependent—in other words, which variables they want the model to predict, and which variables
they want the model to use to make those predictions.

Interdependence methods
Interdependence methods are used to understand the structural makeup and underlying patterns
within a dataset. In this case, no variables are dependent on others, so you’re not looking for
causal relationships. Rather, interdependence methods seek to give meaning to a set of variables
or to group them together in meaningful ways.

Multivariate Analysis Techniques


Some common multivariate analysis techniques are as below:
• Multiple linear regression

• Multiple logistic regression
• Multivariate analysis of variance (MANOVA)
• Factor analysis
• Cluster analysis
• Multiple linear regression
Multiple linear regression is a dependence method which looks at the relationship between one
dependent variable and two or more independent variables. A multiple regression model will tell
you the extent to which each independent variable has a linear relationship with the dependent
variable. This is useful as it helps you to understand which factors are likely to influence a
certain outcome, allowing you to estimate future outcomes.

Example of multiple regression:


As a data analyst, you could use multiple regression to predict crop growth. In this example,
crop growth is your dependent variable and you want to see how different factors affect it. Your
independent variables could be rainfall, temperature, amount of sunlight, and amount of fertilizer
added to the soil. A multiple regression model would show you the proportion of variance in
crop growth that each independent variable accounts for.

Fig 2.9 Example of multiple regression
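Continuing the crop-growth example, a sketch in R with simulated data (no real data set is given in
the text, so all values below are made up) could be:

set.seed(10)
n <- 100
# Simulated independent variables -- purely illustrative ranges
rainfall    <- runif(n, 50, 200)
temperature <- runif(n, 15, 35)
sunlight    <- runif(n, 4, 12)
fertilizer  <- runif(n, 0, 50)

# Simulated dependent variable: crop growth driven by the four factors plus noise
growth <- 5 + 0.03 * rainfall + 0.2 * temperature +
          0.5 * sunlight + 0.1 * fertilizer + rnorm(n, sd = 1)
crops <- data.frame(growth, rainfall, temperature, sunlight, fertilizer)

# Multiple linear regression: one dependent variable, several independent variables
fit <- lm(growth ~ rainfall + temperature + sunlight + fertilizer, data = crops)
summary(fit)   # coefficients and the proportion of variance explained (R-squared)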

• Multiple logistic regression


Logistic regression analysis is used to calculate (and predict) the probability of a binary event
occurring. A binary outcome is one where there are only two possible outcomes; either the event
occurs (1) or it doesn’t (0). So, based on a set of independent variables, logistic regression can
predict how likely it is that a certain scenario will arise. It is also used for classification.

Example of logistic regression:


Let’s imagine you work as an analyst within the insurance sector and you need to predict how
likely it is that each potential customer will make a claim. You might enter a range of
independent variables into your model, such as age, whether or not they have a serious health
condition, their occupation, and so on. Using these variables, a logistic regression analysis will
calculate the probability of the event (making a claim) occurring. Another oft-cited example is
the filters used to classify email as “spam” or “not spam.”

• Multivariate analysis of variance (MANOVA)

Multivariate analysis of variance (MANOVA) is used to measure the effect of multiple
independent variables on two or more dependent variables. With MANOVA, it's important to
note that the independent variables are categorical, while the dependent variables are metric in
nature. A categorical variable is a variable that belongs to a distinct category—for example, the
variable “employment status” could be categorized into certain units, such as “employed full-
time,” “employed part-time,” “unemployed,” and so on. A metric variable is measured
quantitatively and takes on a numerical value.
In MANOVA analysis, you’re looking at various combinations of the independent variables to
compare how they differ in their effects on the dependent variable.

Example of MANOVA:
Suppose you work for an engineering company that is on a mission to build a super-fast, eco-
friendly rocket. You could use MANOVA to measure the effect that various design
combinations have on both the speed of the rocket and the amount of carbon dioxide it emits.

In this scenario, your categorical independent variables could be:


• Engine type, categorized as E1, E2, or E3
• Material used for the rocket exterior, categorized as M1, M2, or M3
• Type of fuel used to power the rocket, categorized as F1, F2, or F3

Your metric dependent variables are speed in kilometers per hour, and carbon dioxide measured
in parts per million. Using MANOVA, you’d test different combinations (e.g. E1, M1, and F1
vs. E1, M2, and F1, vs. E1, M3, and F1, and so on) to calculate the effect of all the independent
variables. This should help you to find the optimal design solution for your rocket.
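For the rocket example, a MANOVA sketch in base R could look like the following; the data are
simulated and the factor levels E1..E3, M1..M3, F1..F3 follow the text:

set.seed(3)
n <- 90
rockets <- data.frame(
  engine   = factor(sample(c("E1", "E2", "E3"), n, replace = TRUE)),
  material = factor(sample(c("M1", "M2", "M3"), n, replace = TRUE)),
  fuel     = factor(sample(c("F1", "F2", "F3"), n, replace = TRUE))
)
# Two metric dependent variables: speed (km/h) and CO2 (ppm) -- simulated values
rockets$speed <- 25000 + 500 * as.numeric(rockets$engine) + rnorm(n, sd = 300)
rockets$co2   <- 400  - 20  * as.numeric(rockets$fuel)   + rnorm(n, sd = 10)

# MANOVA: categorical independent variables, multiple metric dependent variables
fit <- manova(cbind(speed, co2) ~ engine + material + fuel, data = rockets)
summary(fit, test = "Pillai")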

• Factor analysis

Factor analysis is an interdependence technique which seeks to reduce the number of variables in
a dataset. If you have too many variables, it can be difficult to find patterns in your data. At the
same time, models created using datasets with too many variables are susceptible to overfitting.
Overfitting is a modeling error that occurs when a model fits too closely and specifically to a
certain dataset, making it less generalizable to future datasets, and thus potentially less accurate
in the predictions it makes. Factor analysis works by detecting sets of variables which correlate
highly with each other. These variables may then be condensed into a single variable. Data
analysts will often carry out factor analysis to prepare the data for subsequent analyses.
Factor analysis example:
Let’s imagine you have a dataset containing data pertaining to a person’s income, education
level, and occupation. You might find a high degree of correlation among each of these
variables, and thus reduce them to the single factor “socioeconomic status.” You might also have
data on how happy they were with customer service, how much they like a certain product, and
how likely they are to recommend the product to a friend. Each of these variables could be
grouped into the single factor “customer satisfaction” (as long as they are found to correlate
strongly with one another). Even though you’ve reduced several data points to just one factor,
you’re not really losing any information—these factors adequately capture and represent the
individual variables concerned. With your “streamlined” dataset, you’re now ready to carry out
further analyses.
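A small factor analysis sketch in base R with factanal(), applied here to simulated data in which two
groups of variables are driven by two hidden factors (roughly analogous to "socioeconomic status"
and "customer satisfaction"):

set.seed(5)
n <- 300
ses  <- rnorm(n)   # hidden "socioeconomic status" factor
csat <- rnorm(n)   # hidden "customer satisfaction" factor

survey <- data.frame(
  income     = 0.8 * ses  + rnorm(n, sd = 0.4),
  education  = 0.7 * ses  + rnorm(n, sd = 0.4),
  occupation = 0.9 * ses  + rnorm(n, sd = 0.4),
  service    = 0.8 * csat + rnorm(n, sd = 0.4),
  liking     = 0.7 * csat + rnorm(n, sd = 0.4),
  recommend  = 0.9 * csat + rnorm(n, sd = 0.4)
)

# Extract two latent factors; the loadings show which variables group together
fa <- factanal(survey, factors = 2, rotation = "varimax")
print(fa$loadings, cutoff = 0.3)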

• Cluster analysis

Another interdependence technique, cluster analysis is used to group similar items within a
dataset into clusters. When grouping data into clusters, the aim is for the variables in one cluster
to be more similar to each other than they are to variables in other clusters. This is measured in
terms of intracluster and intercluster distance. Intracluster distance looks at the distance between
data points within one cluster. This should be small. Intercluster distance looks at the distance
between data points in different clusters. This should ideally be large. Cluster analysis helps you
to understand how data in your sample is distributed, and to find patterns.

Cluster analysis example:


A prime example of cluster analysis is audience segmentation. If you were working in
marketing, you might use cluster analysis to define different customer groups which could
benefit from more targeted campaigns. As a healthcare analyst, you might use cluster analysis to
explore whether certain lifestyle factors or geographical locations are associated with higher or
lower cases of certain illnesses. Because it’s an interdependence technique, cluster analysis is
often carried out in the early stages of data analysis.

Fig 2.10 Cluster analysis example
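A minimal k-means sketch in base R, using the numeric columns of the built-in iris data purely as
example data for segmentation:

# Standardise the numeric columns before clustering
features <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(features, centers = 3, nstart = 25)

# Cluster sizes and within-cluster (intracluster) sum of squares
km$size
km$withinss

# Attach the cluster label to each observation
head(data.frame(iris[, 1:4], cluster = km$cluster))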

2.3 Data Modelling

In data science, data modelling is the process of finding the function by which data was
generated. In this context, data modelling is the goal of any data analysis task. For instance if
you have a 2d dataset (see the figures below), and you find the 2 variables are linearly correlated,
you may decide to model it using linear regression.

Fig 2.11 Data Modelling (linear regression)


Or if you visualize your data and find out it’s non-linear correlated (as the following figures),
you may model it using nonlinear regression.

Fig 2.12 Data Modelling (nonlinear regression)

2.3.1 Bayesian Data Modelling

Bayesian data modelling means modelling your data using Bayes' Theorem. Let us revisit Bayes' Rule
again:

P(H | E) = P(E | H) * P(H) / P(E)

In the above equation, H is the hypothesis and E is the evidence. In the real world, however, we
interpret the Bayesian components differently: the evidence is usually expressed by data, and the
hypothesis reflects the expert's prior estimation of the posterior. Therefore, we can re-write
Bayes' Rule as:

P(θ | data) = P(data | θ) * P(θ) / P(data)

where P(θ) is the prior, P(data | θ) is the likelihood, P(data) is the marginal probability of the
data, and P(θ | data) is the posterior.


In the above definition we learned about the prior, the posterior, and the data, but what
about the θ parameter? θ is the set of coefficients that best define the data. You may think of θ as the
slope and intercept of your linear regression equation, or the vector of coefficients w in
your polynomial regression function. As you can see in the above equation, θ is the single missing
parameter, and the goal of Bayesian modelling is to find it.

Bayesian Modelling & Probability Distributions

Bayes Rule is a probabilistic equation, where each term in it is expressed as a probability.


Therefore, modelling the prior and the likelihood must be achieved using probabilistic functions.
In this context, probability distributions arise as a concrete tool in Bayesian modelling, as they
provide a great variety of probabilistic functions that suit numerous types of discrete and
continuous variables.

In order to select a suitable distribution for your data, you should learn about the data domain
and gather information from previous studies in it. You may also ask an expert to learn how the data
develops over time. If you have large amounts of data, you may visualize it, try to detect
certain patterns in how it evolves over time, and select your probability distribution accordingly.

2.3.2 Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is the simplest form of Bayesian data modelling, in
which we ignore both the prior and the marginal probabilities (i.e. consider both quantities equal to 1).
The formula of MLE is:

θ_MLE = argmax_θ P(data | θ) = argmax_θ ∏_i P(x_i | θ)

The steps of MLE are as follows:

• Select a probability distribution that best describes your data


• Estimate random value(s) of θ
• Tune θ value(s) and measure the corresponding likelihood
• Select θ that correspond to the maximum likelihood

Example

“a company captures the blood glucose (BG) levels of diabetics, analyses these levels, and send
its clients suitable diet preferences. After one week of inserting her BG levels, a client asked the
company smart app whether she can consume desserts after her lunch or not? By checking her
after-meal BG levels, it was {172,171,166,175,170,165,160}. Knowing that the client after-meal
BG should not exceed 200 mg/dl, what should the app recommend to the client? “

The goal here is to estimate an average BG level for the client to use in the company's
recommendation system. Having the above BG sample data, we assume these readings are
drawn from a normal distribution, with a mean of 168.43 and a standard deviation of 4.69.

Fig 2.13(a) Example of Bayesian data modeling

Fig 2.13(b) Example of Bayesian data modeling


As you can imagine, multiplying small probability values (red dots in the above figure)
generates very small value (very close to 0). Therefore, we replace multiplication with
summation. In the above case, the sum of these probabilities equal 0.406:


In order to find the maximum likelihood, we need to estimate different values of BG levels and
calculate the corresponding likelihoods. The BG value with maximum likelihood is the most
suitable estimation of the BG.

To automate this process, we generate 1000 random numbers in the same range as the captured BG
levels and measure the corresponding likelihoods. The results are illustrated in the following
plot.

Fig 2.13(c) Example of Bayesian data modeling

As you can see, the maximum likelihood estimate of randomly generated BGs equals 169.55,
which is very close to the average of captured BG levels (168.43). The difference between both
values is due to the small size of captured data. The larger sample size you have, the smaller
difference between both estimates you get.
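The text scores each candidate mean by summing the individual normal densities (rather than the more
common log-likelihood); a small sketch in R that follows the same grid-search idea over randomly
generated candidate means is shown below:

bg <- c(172, 171, 166, 175, 170, 165, 160)    # captured after-meal BG levels
sigma <- sqrt(mean((bg - mean(bg))^2))        # population SD, about 4.69 as in the text

set.seed(99)
# 1000 random candidate BG means within the range of the captured data
candidates <- runif(1000, min = min(bg), max = max(bg))

# Likelihood score of each candidate, summing the normal densities as in the text
score <- sapply(candidates, function(mu) sum(dnorm(bg, mean = mu, sd = sigma)))

# The candidate with maximum likelihood should be close to mean(bg) = 168.43
candidates[which.max(score)]
mean(bg)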

Based on this estimate, the app can recommend that its client consume a small piece of dessert
at least 3 hours after her lunch, while taking a suitable insulin dose.

Maximum A Posteriori (MAP)

Maximum A Posteriori (MAP) is the second approach of Bayesian modelling, where the
posterior is calculated using both likelihood and prior probabilities.

You can think of MAP modelling as a generalization of MLE, where the likelihood is
combined with prior information. Technically speaking, we deal with our data samples as if
they were generated by the prior distribution:

θ_MAP = argmax_θ P(data | θ) * P(θ)

Example

“Suppose that the client updated the app with her historical BG levels for the last month, which
turned out to be 171 mg/dl on average, with standard deviation of 3, how the app can use this
new information to update the patient’s BG level?”
In this case, we have two distributions of Blood Glucose, one of recent data (the likelihood), and
the other of historical data (the prior). Each data source can be expressed as normal distribution
(see the following figure).

Fig 2.13(d) Example of Bayesian data modeling

The posterior here is updated by multiplying the prior marginal probability by each term of data
probabilities:

The prior marginal probability P(θ) is the summation of all data probabilities over the prior
distribution:

As our goal is to maximize posterior estimation, we generate random values of θ and measure
the corresponding estimations. By generating 500 guesses of θ, we obtain the following posterior
distribution (in black). For visualization reasons, I raised the value of posterior probabilities to
the power of 0.08.

Fig 2.13(e) Example of Bayesian data modeling

Now we get a more generalized estimation of the client's BG level. The MAP technique excludes
two measures from the posterior distribution (the red data points outside the black curve) and
generates a more reliable estimate. As the below plot shows, the Standard Error (SE) of the
posterior is less than that of the likelihood. The standard error is an indication of the reliability of a
standard expectation (i.e. the mean of the predicted normal distribution). It is calculated as SE = σ/√N,
where σ is the standard deviation and N is the data size.

Fig 2.13(f) Example of Bayesian data modeling

The new MAP posterior estimate of the patient's BG level is higher than the one estimated
using MLE. This may lead the app to prohibit the patient from consuming desserts after her lunch
until her BG levels become more stable.

2.4 Inference and Bayesian Network

2.4.1 Bayesian Network

A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph. It is also called a Bayes network,
belief network, decision network, or Bayesian model. Bayesian networks are probabilistic,
because these networks are built from a probability distribution, and also use probability theory
for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and experts' opinions, and it
consists of two parts:

o Directed Acyclic Graph
o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram. A Bayesian network graph is made up
of nodes and arcs (directed links), where:

Fig 2.14 Bayesian Network graph

o Each node corresponds to a random variable, and a variable can
be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities
between random variables. These directed links or arrows connect pairs of nodes in
the graph.
These links represent that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.

o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which determines the effect of the parents on that node. A Bayesian network is based on the joint
probability distribution and conditional probability.

Joint probability distribution:

If we have variables x1, x2, x3,....., xn, then the probabilities of a different combination of x1,
x2, x3.. xn, are known as Joint probability distribution.

P[x1, x2, x3,....., xn], it can be written as the following way in terms of the joint probability
distribution.

= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]

= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably
responds to a burglary but also responds to minor earthquakes. Harry has two
neighbors, David and Sophia, who have taken the responsibility to inform Harry at work when they
hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets
confused with the phone ringing and calls then too. On the other hand, Sophia likes to
listen to loud music, so sometimes she misses the alarm. Here we would like to compute
the probability of the burglar alarm going off.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and both David and Sophia have called Harry.

Solution:

o The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, while David's and Sophia's calls depend on
the alarm probability.
o The network represents that David and Sophia do not directly perceive the burglary
and do not notice minor earthquakes, and they do not confer before calling.
o The conditional distributions for each node are given as a conditional probability table, or
CPT.
o Each row in the CPT must sum to 1 because the entries in the row represent an
exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents has a CPT with 2^k probability values. Hence, if
there are two parents, then the CPT will contain 4 probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of probability P[D, S, A, B, E], and can
rewrite this probability statement using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] * P[S, A, B, E]
= P[D | S, A, B, E] * P[S | A, B, E] * P[A, B, E]
= P[D | A] * P[S | A, B, E] * P[A, B, E]
= P[D | A] * P[S | A] * P[A | B, E] * P[B, E]
= P[D | A] * P[S | A] * P[A | B, E] * P[B | E] * P[E]

Fig 2.15 Example of Bayesian Network graph

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.


P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm (A):

The Conditional probability of Alarm A depends on Burglar and earthquake:

B E P(A= True) P(A= False)

True True 0.94 0.06

True False 0.95 0.05

False True 0.31 0.69

False False 0.001 0.999

Table 2.1 Conditional probability for Alarm (A)

Conditional probability for David (D) Calls:

The Conditional probability of David that he will call depends on the probability of Alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95

Table 2.2 Conditional probability for David (D)

Conditional probability table for Sophia (S) Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm."


A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98

Table 2.3 Conditional probability for Sophia (S)

From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ^ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint distribution.
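The final calculation can be reproduced directly in R from the CPT values above:

# P(S, D, A, ~B, ~E) = P(S|A) * P(D|A) * P(A|~B,~E) * P(~B) * P(~E)
p_S_given_A    <- 0.75
p_D_given_A    <- 0.91
p_A_given_nBnE <- 0.001
p_not_B        <- 0.998
p_not_E        <- 0.999

p_S_given_A * p_D_given_A * p_A_given_nBnE * p_not_B * p_not_E   # about 0.00068045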

2.4.2 Inference in Bayesian Network


Inference is the process of calculating a probability distribution of interest e.g. P(A | B=True), or
P(A,B|C, D=True). The terms inference and queries are used interchangeably.

The following terms are all forms of inference with slightly different semantics:

• Prediction - focused around inferring outputs from inputs.


• Diagnostics - inferring inputs from outputs.
• Supervised anomaly detection - essentially the same as prediction
• Unsupervised anomaly detection - inference is used to calculate the P(e) or more
commonly log(P(e)).
• Decision making under uncertainty - optimization and inference combined.

A few examples of inference in practice:

• Given a number of symptoms, which diseases are most likely?


• How likely is it that a component will fail, given the current state of the system?
• Given recent behavior of 2 stocks, how will they behave together for the next 5 time
steps?

2.5 Support Vector Machine


The Support Vector Machine is a supervised learning algorithm mostly used for classification but
it can be used also for regression. The main idea is that based on the labeled data (training data)
the algorithm tries to find the optimal hyperplane which can be used to classify new data points.
In two dimensions the hyperplane is a simple line. Usually a learning algorithm tries to learn
the most common characteristics of a class and the classification is based on those representative
characteristics learnt.

For example, let's consider two classes, apples and lemons.

Other algorithms will learn the most evident, most representative characteristics of apples and
lemons, like apples being green and rounded while lemons are yellow and have an elliptic form.

In contrast, SVM will search for apples that are very similar to lemons, for example apples which
are yellow and have an elliptic form. This will be a support vector. The other support vector will be a
lemon similar to an apple (green and rounded). So other algorithms learn
the differences while SVM learns similarities.

Fig 2.16 (a) Support Vector Machine

As we go from left to right, all the examples will be classified as apples until we reach the yellow
apple. From this point, the confidence that a new example is an apple drops while the lemon class
confidence increases. When the lemon class confidence becomes greater than the apple class
confidence, the new examples will be classified as lemons (somewhere between the yellow apple
and the green lemon).
Based on these support vectors, the algorithm tries to find the best hyperplane that separates the
classes. In 2D the hyperplane is a line, so it would look like this:

Fig 2.16 (b) Support Vector Machine


But why did I draw the blue boundary like in the picture above? I could also draw boundaries like
this:

Fig 2.16 (c) Support Vector Machine

As you can see, we have an infinite number of possibilities to draw the decision boundary. So
how can we find the optimal one?

Finding the Optimal Hyperplane

Intuitively the best line is the line that is farthest away from both the apple and lemon examples (it has the
largest margin). To have the optimal solution, we have to maximize the margin in both directions (if we
have multiple classes, then we have to maximize it considering each of the classes).

Fig 2.16 (d) Support Vector Machine

So if we compare the picture above with the picture below, we can easily observe, that the first is
the optimal hyperplane (line) and the second is a sub-optimal solution, because the margin is far
shorter.


Fig 2.16 (e) Support Vector Machine

Because we want to maximize the margin taking all the classes into consideration, instead of using
one margin per class, we use a "global" margin which takes all the classes into account.
This margin would look like the purple line in the following picture:

Fig 2.16 (f) Support Vector Machine

This margin is orthogonal to the boundary and equidistant to the support vectors.

Basic Steps

The basic steps of the SVM are:


1. select two hyperplanes (in 2D: lines) which separate the data with no points between them (the red
lines)
2. maximize their distance (the margin)
3. the average line (here the line halfway between the two red lines) will be the decision
boundary

SVM for Non-Linear Data Sets

An example of non-linear data is:


Fig 2.17 Support Vector Machine: non-linear data set

In this case we cannot find a straight line to separate apples from lemons. So how can we solve
this problem? We will use the Kernel Trick.

The basic idea is that when a data set is inseparable in the current dimensions, add another
dimension, maybe that way the data will be separable. Just think about it, the example above is in
2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons,
maybe there is a level difference, so lemons are on level one and apples are on level two. In this
case, we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between level 1
and 2.

Mapping to Higher Dimensions

To solve this problem we shouldn’t just blindly add another dimension, we should transform the
space so we generate this level difference intentionally.

Mapping from 2D to 3D

Let's assume that we add another dimension called x3. In this new dimension, the points are
positioned using the formula x3 = x1² + x2².
If we plot the surface defined by the x² + y² formula, we will get something like this:


Fig 2.18 Support Vector Machine mapping from 2D to 3D

Now we have to map the apples and lemons (which are just simple points) to this new space.
Think about it carefully, what did we do? We just used a transformation in which we added levels
based on distance. If you are in the origin, then the points will be on the lowest level. As we move
away from the origin, it means that we are climbing the hill (moving from the center of the plane
towards the margins) so the level of the points will be higher.
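A small sketch in R of the 2D-to-3D mapping described here: points that are not linearly separable in
(x1, x2) become separable after adding the feature x3 = x1² + x2² (the data are simulated purely for
illustration):

set.seed(11)
n <- 100
# Inner class near the origin ("apples"), outer class on a surrounding ring ("lemons")
r     <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))
theta <- runif(n, 0, 2 * pi)
x1    <- r * cos(theta)
x2    <- r * sin(theta)
label <- rep(c("apple", "lemon"), each = n / 2)

# No straight line separates the classes in 2D, but the new dimension
# x3 = x1^2 + x2^2 puts the two classes on different "levels"
x3 <- x1^2 + x2^2
tapply(x3, label, range)   # the two ranges do not overlap, so a plane separates them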

Pros

1. SVM can be very efficient, because it uses only a subset of the training data, namely the support
vectors
2. Works very well on smaller data sets, on non-linear data sets and in high dimensional spaces
3. Is very effective in cases where the number of dimensions is greater than the number of samples
4. It can have high accuracy, sometimes performing even better than neural networks
5. Not very sensitive to overfitting

Cons

1. Training time is high when we have large data sets


2. When the data set has more noise (i.e. target classes are overlapping) SVM doesn’t perform
well

Popular Use Cases

1. Text Classification
2. Detecting spam
3. Sentiment analysis
4. Aspect-based recognition
5. Handwritten digit recognition

2.6 Analysis of time series


A time series is a sequence of observations over a certain period. A univariate time series
consists of the values taken by a single variable at periodic time instances over a period, and a
multivariate time series consists of the values taken by multiple variables at the same periodic
time instances over a period. The simplest example of a time series that all of us come across on
a day to day basis is the change in temperature throughout the day or week or month or year.
The analysis of temporal data is capable of giving us useful insights on how a variable changes
over time, or how it depends on the change in the values of other variable(s). This relationship of
a variable on its previous values and/or other variables can be analyzed for time series
forecasting and has numerous applications in artificial intelligence.

Significance of Time Series

TSA is the backbone for prediction and forecasting analysis, specific to the time-based problem
statements.

• Analyzing the historical dataset and its patterns


• Understanding and matching the current situation with patterns derived from the previous stage.
• Understanding the factor or factors influencing certain variable(s) in different periods.

With the help of time series analysis we can prepare numerous time-based analyses and results
using:

• Forecasting
• Segmentation
• Classification
• Descriptive analysis
• Intervention analysis

2.6.1 Time Series Modeling Techniques


To capture the components of a time series, there are a number of popular time series modeling
techniques, as mentioned below:
• Naïve Methods
These are simple estimation techniques in which the predicted value is set equal to the
mean of the preceding values of the time-dependent variable, or to the previous actual value. They are
used as baselines for comparison with more sophisticated modelling techniques.
• Auto Regression
Auto regression predicts the values of future time periods as a function of values at previous
time periods. Predictions of auto regression may fit the data better than that of naïve methods,
but it may not be able to account for seasonality.

• ARIMA Model
An auto-regressive integrated moving-average model expresses the value of a variable as a linear
function of its previous values and of the residual errors at previous time steps of a stationary time
series. However, real world data may be non-stationary and have seasonality; thus Seasonal
ARIMA and Fractional ARIMA were developed. ARIMA works on univariate time series; to
handle multiple variables, VARIMA was introduced.
• Exponential Smoothing
It models the value of a variable as an exponential weighted linear function of previous values.
This statistical model can handle trend and seasonality as well.
• LSTM
Long Short-Term Memory model (LSTM) is a recurrent neural network which is used for time
series to account for long term dependencies. It can be trained with large amount of data to
capture the trends in multi-variate time series.
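As a minimal sketch, an ARIMA model can be fitted in base R with arima(); here on the built-in
AirPassengers series, with an illustrative (not tuned) seasonal specification:

# Built-in monthly airline passengers series, 1949-1960
data("AirPassengers")

# Seasonal ARIMA on the log scale: non-seasonal order (p, d, q) = (1, 1, 1),
# seasonal order (P, D, Q) = (0, 1, 1) with period 12
fit <- arima(log(AirPassengers), order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit

# Forecast the next 12 months with predict(), then transform back from logs
fc <- predict(fit, n.ahead = 12)
exp(fc$pred)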

2.6.2 Components of Time Series Analysis

A time series has 4 components as given below:

i. Trend: The long-term increasing or decreasing behavior of a variable over time. It has no
fixed interval, and any divergence within the given dataset forms a continuous
timeline. The trend can be positive, negative, or null.
ii. Seasonality: Regular shifts that repeat at fixed intervals within the dataset on a continuous
timeline.
iii. Cyclical: Movements with no fixed interval and an uncertain pattern.
iv. Irregularity: Unexpected situations/events/scenarios and spikes in a short time span.

Fig 2.19 Components of Time Series Analysis

Limitations of Time Series Analysis

Time series has the below-mentioned limitations:


• Similar to other models, missing values are not supported by TSA.
• The data points must be linear in their relationship.
• Data transformations are mandatory, which makes the analysis a little expensive.
• Models mostly work on univariate data.

Data Types of Time Series

There are two major types of time series data:


1. Stationary
2. Non-stationary
Stationary: A stationary dataset should follow the rules below and should not have the trend,
seasonality, cyclical, or irregularity components of a time series:
• The MEAN of the series should be completely constant over the period of the analysis
• The VARIANCE should be constant with respect to the time-frame
• The COVARIANCE between values should depend only on the lag between them, not on the time
at which it is measured

Non-stationary: This is just the opposite of stationary.

Fig 2.20 Stationary vs. non-stationary time series

2.6.3 Time Series Analysis in R

Time series analysis in R is used to see how an object behaves over a period of time. In R, it can be
done easily with the ts() function and some parameters. The time series object takes the data vector
and connects each data point with a timestamp value as given by the user. This function is mostly
used to learn and forecast the behavior of an asset in business for a period of time. For example,
sales analysis of a company, inventory analysis, price analysis of a particular stock or market,
population analysis, etc.

Syntax: objectName <- ts(data, start, end, frequency)


where,
• data represents the data vector
• start represents the first observation in time series
• end represents the last observation in time series

• frequency represents the number of observations per unit of time. For example, frequency = 12 for
monthly data and frequency = 1 for annual data.

Example: Let's take the example of the COVID-19 pandemic. We take the total number of
positive COVID-19 cases worldwide, weekly from 22 January 2020 to 15 April 2020, as the
data vector.

Code:

# Weekly data of COVID-19 positive cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843, 471497,
       936851, 1508725, 2072113)

# library required for the decimal_date() function
library(lubridate)

# output to be created as a png file
png(file = "timeSeries.png")

# creating the time series object
# starting from 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
          frequency = 365.25 / 7)

# plotting the graph
plot(mts, xlab = "Weekly Data",
     ylab = "Total Positive Cases",
     main = "COVID-19 Pandemic",
     col.main = "darkgreen")

# saving the file
dev.off()

Output


Fig 2.21 Time Series Analysis in R

2.7 Nonlinear Dynamics and Time Series Analysis

Irregular temporal behavior is ubiquitous in the world surrounding us: wind speeds, water wave
heights, stock market prices, exchange rates, blood pressure, or heart rate all fluctuate more or
less irregularly in time.
Such signals are the output of quite complex systems with nonlinear feedback loops and external
driving. Our goal is to quantify, understand, model, and predict such irregular fluctuations.
Therefore, our research includes the study of deterministic and stochastic model systems which
are selected because of interesting dynamical/statistical behavior and which serve as
paradigmatic data models. We design data analysis methods, we design tests and measures of
performance for such methods, and we apply these to data sets with
various properties. Last but not least, we study data sets because we wish to understand the
underlying phenomena and to improve data based predictions of fluctuations.

Out of the initially listed examples of data sources, the atmosphere sticks out for two reasons:
On the one hand, the atmosphere is an exciting highly complex physical system, where many
different sub-fields of physics meet:
Hydrodynamical transport, thermodynamics, light-matter interaction, droplet formation,
altogether forming a system which is not only far from equilibrium but also far from a linear
regime around some working point. On the other hand, climate change and the impact of
extreme weather on human civilization give atmospheric physics a high relevance. Since the
only perfect model of the atmosphere is the real world itself, and since due to very strong
nonlinearities and hierarchical structures the misleading effects of any approximation might
be tremendous, a data based approach to climate issues is urgently needed as a complement to
simulating climate by models.

In a broader context, our work can also be seen as part of what is nowadays called data science: we make huge data sets accessible to our studies, we design visualization concepts and analysis tools, we test hypotheses, construct models, and apply forecast schemes similar to machine
learning. The time series aspect enters our work through the fact that the temporal order in which
data are recorded carries part of the information, and physics enters as background information,
constraints, and reference models.

2.8 Rule induction

Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic
decision rules explain an inherent relationship between the attributes and class labels in the data
set. Many real-life experiences are based on intuitive rule induction. For example, we can
proclaim a rule that states “if it is 8 a.m. on a weekday, then highway traffic will be heavy” and
“if it is 8 p.m. on a Sunday, then the traffic will be light.” These rules are not necessarily right all
the time. 8 a.m. weekday traffic may be light during a holiday season. But, in general, these rules hold true and are deduced from real-life experience based on our everyday observations. Rule induction provides a powerful classification approach.

Key points:

1. Rule induction is the extraction of useful if-then rules from data based on statistical significance.

2. It is the process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (the if-part, defining the preconditions or coverage of the rule) and a consequent (the then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

3. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

2.9 Principal Component Analysis

Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal Components.
It is one of the popular tools that is used for exploratory data analysis and predictive modeling. It
is a technique to draw strong patterns from the given dataset by reducing the variances.

PCA works by considering the variance of each attribute, because attributes with high variance tend to give a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique: it retains the important variables and drops the least important ones. It is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables; it is also known as general factor analysis, where regression determines a line of best fit.

The PCA algorithm is based on some mathematical concepts such as:


o Variance and Covariance
o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:


o Dimensionality: It is the number of features or variables present in the given dataset.
More easily, it is the number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each other. Such
as if one changes, the other variable also gets changed. The correlation value ranges from
-1 to +1. Here, -1 occurs if variables are inversely proportional to each other, and +1
indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.

2.9.1 Principal Components in PCA


As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs is either equal to or less than the original features present
in the dataset.

Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n; it means the 1st PC has the most importance, and the nth PC will have the least importance.

2.9.2 Steps for PCA algorithm

1. Getting the dataset


Firstly, we need to take the input dataset and divide it into two subparts X and Y, where
X is the training set, and Y is the validation set.

2. Representing data into a structure


Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data
items, and the column corresponds to the Features. The number of columns is the
dimensions of the dataset.

3. Standardizing the data

In this step, we will standardize our dataset. In a particular column, features with high variance are considered more important than features with lower variance. If the importance of the features is independent of their variance, we divide each data item in a column by the standard deviation of that column. Here we will name the resulting matrix Z.

4. Calculating the Covariance of Z

To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.

5. Calculating the Eigen Values and Eigen Vectors

Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with high information, and the corresponding eigenvalues indicate how much variance lies along each of those directions.

6. Sorting the Eigen Vectors

In this step, we will take all the eigenvalues and sort them in decreasing order, which means from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resultant matrix will be named P*.

7. Calculating the new features Or Principal Components

Here we will calculate the new features. To do this, we will multiply the standardized matrix Z by the P* matrix. In the resultant matrix Z*, each observation is a linear combination of the original features, and each column of the Z* matrix is independent of the others.

8. Remove less or unimportant features from the new dataset.

The new feature set has been obtained, so we will decide here what to keep and what to remove. That is, we will only keep the relevant or important features in the new dataset, and the unimportant features will be removed.
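The following is a minimal, hedged sketch of these steps in base R; the data matrix and the object names (Z, P_star, Z_star) are illustrative assumptions and not part of the original notes.

# Minimal PCA sketch following the steps above, on a hypothetical 2-feature dataset.
X <- matrix(c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0,
              2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1), ncol = 2)

Z <- scale(X)                    # Step 3: centre each column and divide by its standard deviation
C <- cov(Z)                      # Step 4: covariance matrix of Z
e <- eigen(C)                    # Step 5: eigenvalues and eigenvectors

P_star <- e$vectors              # Step 6: eigen() already returns eigenvalues in decreasing order
Z_star <- Z %*% P_star           # Step 7: project the data onto the principal components

explained <- e$values / sum(e$values)   # Step 8: keep the components explaining most variance
explained                                # proportion of variance per component
head(Z_star[, 1])                        # scores on the first principal component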

Fig 2.21 Principal Components in PCA

Applications of Principal Component Analysis


o PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns when data has high dimensions. Some fields where PCA is used are finance, data mining, psychology, etc.
2.10 Neural Networks
2.10.1 Neurons

Scientists agree that our brain has around 100 billion neurons. These neurons have hundreds of billions of connections between them.

Fig 2.22 Neurons

Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system. The neurons
are responsible for receiving input from the external world, for sending output (commands to our
muscles), and for transforming the electrical signals in between.

1943: Warren S. McCulloch and Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity". This research sought to understand how the human brain could produce complex patterns through connected brain cells, or neurons. One of the main ideas that came out of this work was the comparison of neurons with a binary threshold to Boolean logic (i.e., 0/1 or true/false statements).

1958: Frank Rosenblatt is credited with the development of the perceptron, documented in his
research, “The Perceptron: A Probabilistic Model for Information Storage and Organization in
the Brain” He takes McCulloch and Pitt’s work a step further by introducing weights to the
equation. Leveraging an IBM 704, Rosenblatt was able to get a computer to learn how to
distinguish cards marked on the left vs. cards marked on the right.

1974: While numerous researchers contributed to the idea of backpropagation, Paul Werbos was the first person in the US to note its application within neural networks, in his PhD thesis.
1989: Yann LeCun published a paper illustrating how the use of constraints in backpropagation
and its integration into the neural network architecture can be used to train algorithms. This
research successfully leveraged a neural network to recognize hand-written zip code digits
provided by the U.S. Postal Service.

• A neural network is a series of algorithms that endeavors to recognize underlying


relationships in a set of data through a process that mimics the way the human brain
operates. In this sense, neural networks refer to systems of neurons, either organic or
artificial in nature.
• Neural networks reflect the behavior of the human brain, allowing computer programs to
recognize patterns and solve common problems in the fields of AI, machine learning, and
deep learning.
• Neural networks, also known as artificial neural networks (ANNs) or simulated neural
networks (SNNs), are a subset of machine learning and are at the heart of deep
learning algorithms. Their name and structure are inspired by the human brain,
mimicking the way that biological neurons signal to one another.

Artificial neural networks (ANNs) are composed of node layers, containing an input layer,
one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to
another and has an associated weight and threshold. If the output of any individual node is
above the specified threshold value, that node is activated, sending data to the next layer of
the network. Otherwise, no data is passed along to the next layer of the network. Neural
networks can adapt to changing input; so the network generates the best possible result
without needing to redesign the output criteria. The concept of neural networks, which has its
roots in artificial intelligence, is swiftly gaining popularity in the development of trading
systems.


Fig 2.23 Simple neural Network

• Neural networks are a series of algorithms that mimic the operations of an animal brain
to recognize relationships between vast amounts of data.
• As such, they tend to resemble the connections of neurons and synapses found in the
brain.
• They are used in a variety of applications in financial services, from forecasting and
marketing research to fraud detection and risk assessment.
• Neural networks with several processing layers are known as "deep" networks and are used for deep learning algorithms.
• The success of neural networks for stock market price prediction varies.
Working of neural networks

The human brain is the inspiration behind neural network architecture. Human brain cells, called
neurons, form a complex, highly interconnected network and send electrical signals to each other
to help humans process information. Similarly, an artificial neural network is made of artificial
neurons that work together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that, at their core, use
computing systems to solve mathematical calculations.

Simple neural network architecture


A basic neural network has interconnected artificial neurons in three layers:

i. Input Layer
Information from the outside world enters the artificial neural network from the input layer.
Input nodes process the data, analyze or categorize it, and pass it on to the next layer.

ii. Hidden Layer


Hidden layers take their input from the input layer or other hidden layers. Artificial neural
networks can have a large number of hidden layers. Each hidden layer analyzes the output from
the previous layer, processes it further, and passes it on to the next layer.

iii. Output Layer


The output layer gives the final result of all the data processing by the artificial neural network.
It can have single or multiple nodes. For instance, if we have a binary (yes/no) classification
problem, the output layer will have one output node, which will give the result as 1 or 0.
However, if we have a multi-class classification problem, the output layer might consist of more
than one output node.

Types of Neural Networks

A. Perceptron

The perceptron, introduced by Frank Rosenblatt and later analyzed by Minsky and Papert, is one of the simplest and oldest models of a neuron. It is the smallest unit of a neural network that does certain computations to detect features or business intelligence in the input data. It accepts weighted inputs and applies an activation function to obtain the output as the final result. The perceptron is also known as a TLU (threshold logic unit). It is a supervised learning algorithm that classifies the data into two categories, and is thus a binary classifier. A perceptron separates the input space into two categories by a hyperplane represented by the following equation:

w1*x1 + w2*x2 + ... + wn*xn + b = 0

Fig 2.24 Perceptron

Advantages
Perceptrons can implement Logic Gates like AND, OR, or NAND.

Disadvantages

Perceptrons can only learn linearly separable problems, such as the boolean AND problem. For non-linearly separable problems, such as the boolean XOR problem, a single perceptron does not work.
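As a hedged illustration of this point, the sketch below trains a single perceptron in R on the linearly separable boolean AND problem; the learning rate, epoch count and the ±1 output coding are illustrative choices, not part of the original notes.

# Hedged sketch: training a single perceptron on the boolean AND problem.
X <- matrix(c(0, 0, 1, 1,
              0, 1, 0, 1), ncol = 2)        # four input patterns (rows)
y <- c(-1, -1, -1, 1)                        # AND outputs coded as -1 / +1

w <- c(0, 0); b <- 0; lr <- 0.1

for (epoch in 1:20) {
  for (i in 1:nrow(X)) {
    activation <- sum(w * X[i, ]) + b
    y_hat <- ifelse(activation >= 0, 1, -1)  # step activation (threshold at 0)
    if (y_hat != y[i]) {                     # perceptron rule: update only on mistakes
      w <- w + lr * y[i] * X[i, ]
      b <- b + lr * y[i]
    }
  }
}

# The learned hyperplane w1*x1 + w2*x2 + b = 0 now separates the two classes
ifelse(X %*% w + b >= 0, 1, -1)              # predictions for the four patterns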

Fig 2.25 Perceptron (P) Disadvantage

B. Feed Forward Neural Networks

The simplest form of neural network, where input data travels in one direction only, passing through artificial neural nodes and exiting through output nodes. Hidden layers may or may not be present, but input and output layers are always present. Based on this, they can be further classified as single-layered or multi-layered feed-forward neural networks.

The number of layers depends on the complexity of the function. The network has uni-directional forward propagation but no backward propagation, and the weights are static. An activation function is fed by inputs which are multiplied by weights. To do so, a classifying activation function or step activation function is used. For example: the neuron is activated if its input is above a threshold (usually 0) and produces 1 as an output; the neuron is not activated if its input is below the threshold (usually 0), which is treated as -1. They are fairly simple to maintain and are equipped to deal with data which contains a lot of noise.
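A minimal, hedged sketch of one forward pass through such a network in R is shown below; the layer sizes and random weights are illustrative placeholders, not trained values.

# Hedged sketch: a single forward pass through a tiny feed-forward network
# with one hidden layer.
set.seed(42)
sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.5, -1.2, 0.3)                 # one input vector with 3 features
W1 <- matrix(rnorm(3 * 4), nrow = 3)    # input -> hidden weights (3 x 4)
b1 <- rnorm(4)
W2 <- matrix(rnorm(4 * 1), nrow = 4)    # hidden -> output weights (4 x 1)
b2 <- rnorm(1)

h <- sigmoid(as.vector(x %*% W1) + b1)  # hidden layer activations
y <- sigmoid(as.vector(h %*% W2) + b2)  # output: data flows one way only
y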


Fig 2.25 Feed Forward Neural Networks

Applications on Feed Forward Neural Networks:

• Simple classification (where traditional Machine-learning based classification


algorithms have limitations)
• Face recognition [simple, straightforward image processing]
• Computer vision [Where target classes are difficult to classify]
• Speech Recognition

Advantages of Feed Forward Neural Networks

1. Less complex, easy to design & maintain


2. Fast and speedy [One-way propagation]
3. Highly responsive to noisy data

Disadvantages of Feed Forward Neural Networks

1. Cannot be used for deep learning [due to absence of dense layers and back
propagation]

C. Multilayer Perceptron

An entry point towards complex neural nets where input data travels through various layers
of artificial neurons. Every single node is connected to all neurons in the next layer which
makes it a fully connected neural network. Input and output layers are present having
multiple hidden layers, i.e. at least three or more layers in total. It has bi-directional propagation, i.e. forward propagation and backward propagation.

Inputs are multiplied with weights and fed to the activation function, and in backpropagation the weights are modified to reduce the loss. In simple words, weights are machine-learnt values in neural networks; they self-adjust depending on the difference between the predicted outputs and the training targets. Nonlinear activation functions are used, followed by softmax as the output layer activation function.

Fig 2.26 Multi-Layer Perceptron

Applications on Multi-Layer Perceptron

• Speech Recognition
• Machine Translation
• Complex Classification

Advantages on Multi-Layer Perceptron

1. Used for deep learning [due to the presence of dense fully connected layers and
back propagation]

Disadvantages on Multi-Layer Perceptron:

1. Comparatively complex to design and maintain
2. Comparatively slow (depends on the number of hidden layers)

D. Convolutional Neural Network

Convolution neural network contains a three-dimensional arrangement of neurons, instead of


the standard two-dimensional array. The first layer is called a convolutional layer. Each

neuron in the convolutional layer only processes the information from a small part of the visual field. Input features are taken in batches, like a filter. The network understands the images in parts and can compute these operations multiple times to complete the full image processing. Processing involves conversion of the image from the RGB or HSI scale to grey-scale. Further changes in the pixel values help to detect the edges, and images can be classified into different categories.

Propagation is uni-directional where the CNN contains one or more convolutional layers followed by pooling, and bi-directional where the output of the convolution layer goes to a fully connected neural network for classifying the images, as shown in the diagram below. Filters are used to extract certain parts of the image. In an MLP the inputs are multiplied with weights and fed to the activation function; convolution uses ReLU, while the MLP uses a nonlinear activation function followed by softmax. Convolutional neural networks show very effective results in image and video recognition, semantic parsing and paraphrase detection.

Fig 2.27 Convolution Neural Network

Applications on Convolution Neural Network

• Image processing
• Computer Vision
• Speech Recognition
• Machine translation

Advantages of Convolution Neural Network

1. Used for deep learning with few parameters


2. Fewer parameters to learn as compared to a fully connected layer

Disadvantages of Convolution Neural Network

• Comparatively complex to design and maintain


• Comparatively slow [depends on the number of hidden layers]

E. Radial Basis Function Neural Networks

Radial Basis Function Network consists of an input vector followed by a layer of RBF
neurons and an output layer with one node per category. Classification is performed by
measuring the input’s similarity to data points from the training set where each neuron stores
a prototype. This will be one of the examples from the training set.

Fig 2.28 Radial Basis Function Neural Networks

When a new input vector [the n-dimensional vector that you are trying to classify] needs to
be classified, each neuron calculates the Euclidean distance between the input and its
prototype. For example, if we have two classes, i.e. class A and class B, and the new input to be classified is closer to the class A prototypes than to the class B prototypes, then it could be tagged or classified as class A.

Each RBF neuron compares the input vector to its prototype and outputs a value between 0 and 1, which is a measure of similarity. If the input equals the prototype, the output of that RBF neuron will be 1, and as the distance between the input and the prototype grows, the response falls off exponentially towards 0. The curve generated from the neuron's response tends towards a typical bell curve. The output layer consists of a set of neurons [one per category].
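As a hedged sketch, the R snippet below computes the response of RBF neurons to a new input using a Gaussian similarity measure; the prototypes, the input and the width parameter are illustrative assumptions.

# Hedged sketch: the response of RBF neurons to an input vector.
rbf_activation <- function(x, prototype, sigma = 1) {
  d <- sqrt(sum((x - prototype)^2))     # Euclidean distance to the stored prototype
  exp(-d^2 / (2 * sigma^2))             # 1 when x equals the prototype, decays towards 0
}

proto_A <- c(1, 1)                      # prototype taken from "class A" training data
proto_B <- c(5, 5)                      # prototype taken from "class B" training data
x_new   <- c(1.5, 0.8)                  # new input to classify

c(A = rbf_activation(x_new, proto_A),
  B = rbf_activation(x_new, proto_B))   # higher value -> more similar class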


Fig 2.29 (a) Radial Basis Function Neural Networks

Fig 2.29 (b) Radial Basis Function Neural Networks

Fig 2.29 (c) Radial Basis Function Neural Networks

Application: Power Restoration


a. Powercut P1 needs to be restored first


b. Powercut P3 needs to be restored next, as it impacts more houses
c. Powercut P2 should be fixed last as it impacts only one house

F. Recurrent Neural Networks

In a Recurrent Neural Network, the output of a layer is saved and fed back to the input to help in predicting the outcome of the layer. The first layer is typically a feed forward neural network, followed by a recurrent neural network layer where some information from the previous time-step is remembered by a memory function. Forward propagation is implemented in this case, and the network stores information required for its future use. If the prediction is wrong, the learning rate is employed to make small changes, so that the network gradually moves towards making the right prediction during backpropagation.

Fig 2.30 Recurrent Neural Networks

Applications of Recurrent Neural Networks

• Text processing like auto suggest, grammar checks, etc.


• Text to speech processing
• Image tagger
• Sentiment Analysis

Advantages of Recurrent Neural Networks

1. One of the advantages is the ability to model sequential data where each sample can be assumed to be dependent on historical ones.

2. Used with convolution layers to extend the pixel effectiveness.

Disadvantages of Recurrent Neural Networks

1. Gradient vanishing and exploding problems


2. Training recurrent neural nets could be a difficult task
3. Difficult to process long sequential data using ReLU as an activation function.

Improvement over RNN: LSTM (Long Short-Term Memory) Networks

Fig 2.31 LSTM (Long Short-Term Memory) Networks

LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a ‘memory cell’ that can maintain information in memory for long periods of time. A set of gates is used to control when information enters the memory, when it is output, and when it is forgotten. There are three types of gates, viz. the input gate, the output gate and the forget gate. The input gate decides how much information from the last sample will be kept in memory; the output gate regulates the amount of data passed to the next layer, and the forget gate controls how quickly stored memory is discarded. This architecture lets them learn longer-term dependencies.

This is one of the implementations of LSTM cells, many other architectures exist.


Fig 2.32 Implementations of LSTM

G. Sequence to sequence models

A sequence to sequence model consists of two Recurrent Neural Networks. Here, there exists an encoder that processes the input and a decoder that processes the output. The encoder and decoder work simultaneously, either using the same parameters or different ones. This model, in contrast to a plain RNN, is particularly applicable in those cases where the length of the input sequence differs from the length of the output sequence. While they possess similar benefits and limitations to the RNN, these models are usually applied mainly in chat bots, machine translation, and question answering systems.


Fig 2.33 Sequence to sequence models

H. Modular Neural Network

A modular neural network has a number of different networks that function independently
and perform sub-tasks. The different networks do not really interact with or signal each other
during the computation process. They work independently towards achieving the output.

Fig 2.34 Modular Neural Network

As a result, a large and complex computational process can be done significantly faster by breaking it down into independent components. The computation speed increases because the networks are not interacting with, or even connected to, each other.


Applications of Modular Neural Network

1. Stock market prediction systems


2. Adaptive MNN for character recognitions
3. Compression of high-level input data

Advantages of Modular Neural Network

1. Efficient
2. Independent training
3. Robustness
Disadvantages of Modular Neural Network

1. Moving target Problems


I. Deep neural network architecture

Deep neural networks, or deep learning networks, have several hidden layers with millions of
artificial neurons linked together. A number, called weight, represents the connections between
one node and another. The weight is a positive number if one node excites another, or negative if
one node suppresses the other. Nodes with higher weight values have more influence on the
other nodes.

Theoretically, deep neural networks can map any input type to any output type. However, they
also need much more training as compared to other machine learning methods. They need
millions of examples of training data rather than perhaps the hundreds or thousands that a
simpler network might need.

2.11 Fuzzy logic: Extracting fuzzy models from data


Fuzzy logic
Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false" (1 or 0) Boolean logic on which the modern computer is based. The idea of fuzzy logic was first advanced by Lotfi Zadeh of the University of California at Berkeley in the 1960s.

The term fuzzy refers to things that are not clear or are vague. Fuzzy logic is an approach to variable processing that allows multiple possible truth values to be processed through the same variable. Fuzzy logic attempts to solve problems with an open, imprecise spectrum of data and heuristics that makes it possible to obtain an array of accurate conclusions. Fuzzy logic is designed to solve problems by considering all available information and making the best possible decision given the input.

Fuzzy-Logic theory has introduced a framework whereby human knowledge can be formalized
and used by machines in a wide variety of applications, ranging from cameras to trains. The
basic ideas that we discussed in the earlier posts were concerned with only this aspect with
regards to the use of Fuzzy Logic-based systems; that is the application of human experience
into machine-driven applications. While there are numerous instances where such techniques are
relevant; there are also applications where it is challenging for a human user to articulate the
knowledge that they hold. Such applications include driving a car or recognizing images.
Machine learning techniques provide an excellent platform in such circumstances, where sets of
inputs and corresponding outputs are available, building a model that provides the
transformation from the input data to the outputs using the available data.

In the Boolean system, the truth value 1.0 represents absolute truth and 0.0 represents absolute falsehood. In the fuzzy system, however, truth is not restricted to these absolute values: intermediate values are also present, which are partially true and partially false.

Fig 2.35 Fuzzy logic

Architecture
Its Architecture contains four parts:
• Rule base: It contains the set of rules and the IF-THEN conditions provided by the experts
to govern the decision-making system, on the basis of linguistic information. Recent
developments in fuzzy theory offer several effective methods for the design and tuning of
fuzzy controllers. Most of these developments reduce the number of fuzzy rules.
• Fuzzification: It is used to convert inputs i.e. crisp numbers into fuzzy sets. Crisp inputs
are basically the exact inputs measured by sensors and passed into the control system for
processing, such as temperature, pressure, rpm’s, etc.
• Inference engine: It determines the matching degree of the current fuzzy input with
respect to each rule and decides which rules are to be fired according to the input field.
Next, the fired rules are combined to form the control actions.
• Defuzzification: It is used to convert the fuzzy sets obtained by the inference engine into a
crisp value. There are several defuzzification methods available and the best-suited one is
used with a specific expert system to reduce the error.


Fig 2.36 Fuzzy logic Architecture

Procedure

Step 1: Divide the input and output spaces into fuzzy regions.

We start by assigning some fuzzy sets to each input and output space. Wang and Mendel specified an odd number of evenly spaced fuzzy regions, determined by 2N+1, where N is an integer. As we will see later on, the value of N affects the performance of our models and can result in under/overfitting at times. N is, therefore, one of the hyperparameters that we will use to tweak this system's performance.


Fig.2.37 Divisions of an input space into fuzzy regions where N=2
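A hedged R sketch of such a division is given below; it uses evenly spaced triangular membership functions, and the region labels (s2, s1, ce, b1, b2) are assumed to follow the figure's convention rather than being taken from the original text.

# Hedged sketch: divide an input space [lo, hi] into 2N+1 evenly spaced
# triangular fuzzy regions and compute degrees of membership for a value.
triangular <- function(x, a, b, c) {
  # membership rises from a to a peak at b, then falls back to zero at c
  pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)
}

fuzzy_regions <- function(lo, hi, N) {
  centres <- seq(lo, hi, length.out = 2 * N + 1)
  step <- centres[2] - centres[1]
  lapply(centres, function(m) function(x) triangular(x, m - step, m, m + step))
}

# Example with N = 2 (five regions, as in the figure) on the interval [0, 10]
regions <- fuzzy_regions(0, 10, 2)
names(regions) <- c("s2", "s1", "ce", "b1", "b2")   # labels assumed from the figure
sapply(regions, function(f) f(6.5))                  # membership of x = 6.5 in each region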

Step 2 : Generate Fuzzy Rules from data.

We can use our input and output spaces, together with the fuzzy regions that we have just defined,
and the dataset for the application to generate fuzzy rules in the form of:
If {antecedent clauses} then {consequent clauses}

We start by determining the degree of membership of each sample in the dataset to the different
fuzzy regions in that space. If, as an example, we consider a sample depicted below:


Fig.2.38 Sample 1 shown against the fuzzy regions of the input and output spaces

we obtain the following degrees of membership values.

Fig.2.39 Degree-of-membership values for sample-1

We then assign, to each space, the region having the maximum degree of membership, as indicated by the highlighted elements in the above table, so that it is possible to obtain a rule:

sample 1 => If x1 is b1 and x2 is s1 then y is ce => Rule 1

The next illustration shows a second example, together with the degree of membership results

that it generates.


Fig.2.40 Degree-of-membership values for sample-2

This sample will, therefore, produce the following rule:

sample 2 => If x1 is b1 and x2 is ce then y is b1 => Rule 2

Step 3 : Assign a degree to each rule.

Step 2 is very straightforward to implement, yet it suffers from one problem: it will generate conflicting rules, that is, rules that have the same antecedent clauses but different consequent clauses. Wang and Mendel solved this issue by assigning a degree to each rule, using a product strategy such that the degree is the product of all the degree-of-membership values from both the antecedent and consequent spaces forming the rule. We retain the rule having the most significant degree, while we discard the rules having the same antecedent but a smaller degree.

If we refer to the previous example, the degree of Rule 1 equates to

D(Rule 1) = m_b1(x1) * m_s1(x2) * m_ce(y)

and for Rule 2 we obtain

D(Rule 2) = m_b1(x1) * m_ce(x2) * m_b1(y)

We notice that this procedure reduces the number of rules radically in practice.

It is also possible to fuse human knowledge with the knowledge obtained from data by introducing a human element to the rule degree, which has high applicability in practice, as human supervision can assess the reliability of the data, and hence of the rules generated from it, directly. In the cases where human intervention is not desirable, this factor is set to 1 for all rules. The degree of Rule 1 can hence be defined as

D(Rule 1) = m_b1(x1) * m_s1(x2) * m_ce(y) * m_expert(Rule 1)

where m_expert(Rule 1) is the degree assigned to the rule by the human expert.
Step 4: Create a Combined Fuzzy Rule Base

It is a matrix that holds the fuzzy rule-base information for a system. A Combined Fuzzy Rule

Base can contain the rules that are generated numerically using the procedure described above,
but also rules that are obtained from human experience.


Fig.2.41 Combined Fuzzy Rule Base for this system. Note Rule 1 and Rule 2.

Step 5: Determine a mapping based on the Combined Fuzzy Rule Base.

The final step in this procedure explains the defuzzification strategy used to determine the value of y, given (x1, x2). Wang and Mendel suggest a different approach to the max-min computation used by Mamdani. We have to consider that, in practical applications, the number of input spaces will be significant when compared to the typical control application where Fuzzy Logic is typically used. Besides, this procedure will generate a large number of rules, and therefore it would be impractical to compute an output using the ‘normal’ approach.


For a given input combination (x1, x2), we combine the antecedents of a given rule to determine the degree of output control corresponding to (x1, x2), using the product operator. If m_O^i denotes the degree of output control for the i-th rule, whose antecedents are the fuzzy regions I1 and I2, then

m_O^i = m_I1(x1) * m_I2(x2)

Therefore, for Rule 1,

If x1 is b1 and x2 is s1 then y is ce

the degree of output control is m_O^1 = m_b1(x1) * m_s1(x2).

We now define the centre of a fuzzy region as the point that has the smallest absolute value among all points at which the membership function for this region is equal to 1, as illustrated below:

Fig 2.42 Center of fuzzy region

The value of y for a given (x1, x2) combination is then the weighted average of the centres c^i of the output regions of the rules:

y = ( Σ_{i=1..K} m_O^i * c^i ) / ( Σ_{i=1..K} m_O^i )

where K is the number of rules.
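A hedged R sketch of this weighted-average defuzzification is shown below; the rule degrees and region centres are illustrative placeholders, not values computed in the text.

# Hedged sketch of the Wang-Mendel style defuzzification described above.
wm_output <- function(rule_degrees, rule_centres) {
  # rule_degrees : m_O^i, product of antecedent memberships for each rule
  # rule_centres : centre c^i of the output fuzzy region of each rule
  sum(rule_degrees * rule_centres) / sum(rule_degrees)
}

# Example with K = 3 rules
m_O     <- c(0.48, 0.20, 0.05)   # degrees of output control for each rule
centres <- c(5.0, 7.5, 10.0)     # centres of the rules' output regions
wm_output(m_O, centres)          # crisp output y for the given (x1, x2)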

Testing

A (very dirty) implementation of the above algorithm was developed in Python to test it with real datasets. The code and data used are available on GitHub. Some considerations on this system include:

• The fuzzy system is generated from the test data directly.

• The sets were created using the recommendation in the original paper, that is evenly spaced.
It is, however, interesting to see the effects of changing this method. One idea is to have sets
created around the dataset mean with a spread relatable to the standard deviation — this
might be investigated in a future post.

• The system created does not cater for categorical data implicitly, and this is a future
improvement that can affect the performance of the system considerably in real-life
scenarios.

Testing metrics

We will use the coefficient of determination (R-Squared) to assess the performance of this system and to tune the hyperparameter that was identified, the number of fuzzy sets generated.

To explain R-Squared, we must first define the sum of squares total and the sum of squares residual. The sum of squares total is the sum of the squared differences between the dependent variable (y) and the mean of the observed dependent variable:

SS_total = Σ_i (y_i − mean(y))²

The sum of squares residual is the sum of the squared differences between the actual and estimated values of the dependent variable:

SS_residual = Σ_i (y_i − ŷ_i)²

R-Squared can then be calculated as

R² = 1 − SS_residual / SS_total

We notice that R-Squared will have a value between 0 and 1; the larger, the better. If R-Squared = 1, then there is no error and the estimated values are equal to the actual values.
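A hedged R sketch of this metric is given below; the actual and predicted values are illustrative numbers only.

# Hedged sketch: computing R-Squared for a set of actual vs. predicted values.
r_squared <- function(actual, predicted) {
  ss_total    <- sum((actual - mean(actual))^2)   # total sum of squares
  ss_residual <- sum((actual - predicted)^2)      # residual sum of squares
  1 - ss_residual / ss_total
}

# Illustrative numbers only
actual    <- c(3.1, 4.0, 5.2, 6.9, 8.1)
predicted <- c(3.0, 4.3, 5.0, 7.2, 7.8)
r_squared(actual, predicted)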

Case Study : Temperature

For a second test, we have used the Weather in Szeged 2006–2016 available at Kaggle. The

dataset has over 96,000 training examples that consist of 12 features.


For this exercise, we will be discarding most of these features and assessing whether we can predict the temperature given the month and the humidity. Upon examining the data, we notice that the average temperature varied between 1 and 23 degrees Celsius, with a variability of about 20 degrees per month.

Fig 2.43 Average monthly temperature

The average humidity varies between 0.63 and 0.85, but we also notice that it is always possible to reach 100%, irrespective of the month.


Fig 2.44 Average humidity

The best Fuzzy system that was tested consisted of 3 fuzzy spaces (N=1) for the input variables
and nine fuzzy spaces (N=4) for the temperature. The system generated the nine rules depicted in
the Fuzzy Distribution Map below and attained an R-Squared value of 0.75, using a test sample of
20%.

Fig 2.45 Fuzzy Distribution Map

2.12 Fuzzy Decision Tree
FID (Fuzzy Decision Tree) was introduced in 1996. It is a classification system implementing the popular and efficient recursive partitioning technique of decision trees, while combining fuzzy representation and approximate reasoning for dealing with noise and language uncertainty. A fuzzy decision tree induction method, based on the reduction of classification ambiguity with fuzzy evidence, has been developed. Fuzzy decision trees represent classification knowledge in a way that is closer to human thinking and are more robust in tolerating imprecise, conflicting, and missing information.

Fuzzy Decision Tree (FID) combines fuzzy representation, and its approximate reasoning, with
symbolic decision trees. As such, they provide for handling of language related uncertainty,
noise, missing or faulty features, robust behavior, while also providing comprehensible
knowledge interpretation.

Intelligent Decision Tree Algorithm Based on Fuzzy Theory

Unlike classical mathematics, which is deterministic, the introduction of fuzzy theory brings the
discipline of mathematics into a new field, which expands the range of applications of
mathematics, bringing it from the exact space to the fuzzy space. Like all disciplines, fuzzy
theory was born from the needs of production and practice. From time to time, we encounter
fuzzy concepts in many areas of real life. For example, we use fast and slow when describing
speed, expensive, and cheap when describing price, and it is difficult to separate these adjectives
with a precise boundary. The distinctive feature of fuzzy decision tree algorithm is the
integration of fuzzy theory and classical decision tree algorithm, which expands the application
field of decision tree algorithm.

The fuzzification improvements of the classical decision tree algorithm mainly include the following:

(1) Preprocessing of continuous attributes (i.e., how to fuzzify continuous


attributes): for most fuzzy decision tree algorithms, fuzzification of continuous
attributes before modeling is necessary, and a few algorithms fuzzify the data in
the process of modeling.
(2) Selection rules for splitting attributes: compared with the rules for split
attribute selection of clear decision tree, the fuzzy decision tree algorithm extends
them to adapt to fuzzy data.
(3) Matching rules for decision trees: for a fuzzy decision tree, it will give the degree
to which the test data belong to a certain classification, i.e., a reflection of the
attribute affiliation, rather than an absolute classification like a clear decision tree.

Compared with the decision rules of the clear decision tree algorithm, the fuzzy rules generated
by the fuzzy decision tree are more realistic, and the set composed of these fuzzy rules is called
the fuzzy rule set, as shown in Fig. 2.46. If the basis for splitting attribute selection of decision
tree algorithm is collectively called clear heuristic, and that of fuzzy decision tree is called fuzzy
heuristic, then the differences between these two types of algorithms are mainly in the

differences of heuristics, leaf node selection criteria or branch ending criteria, and the final
generated rules. Although the fuzzy decision tree algorithm is an improvement of the decision
tree algorithm, it does not mean that this fuzzy algorithm is better than the clear algorithm in all
aspects, and different algorithms should be chosen according to the actual application area.

Fig. 2.46 Decision tree model generated based on fuzzy algorithm.


Fig. 2.47 Architecture of fuzzy decision tree induction

7. The generation of FDT for pattern classification consists of three major steps namely fuzzy
partitioning (clustering), induction of FDT and fuzzy rule inference for classification.

8. The first crucial step in the induction process of FDT is the fuzzy partitioning of input space
using any fuzzy clustering techniques.

9. FDTs are constructed using any standard algorithm like Fuzzy ID3 where we follow a top-
down, recursive divide and conquer approach, which makes locally optimal decisions at each
node.

10. As the tree is being built, the training set is recursively partitioned into smaller subsets and
the generated fuzzy rules are used to predict the class of an unseen pattern by applying suitable
fuzzy inference/reasoning mechanism on the FDT.
11. The general procedure for generating fuzzy decision trees using Fuzzy ID3 is as follows :

• Prerequisites : A Fuzzy partition space, leaf selection threshold th and the best
node selection criterion
• Procedure : While there exist candidate nodes

2.13 Stochastic search methods


Stochastic search is an optimization approach which incorporates randomness in its exploration of the search space. Sophisticated search techniques form the backbone of modern machine
learning and data analysis. Computer systems that can extract information from huge data sets,
to recognize patterns, to do classification, or to suggest diagnoses, in short, systems that are
adaptive and to some extent able to learn, fundamentally rely on effective and efficient search
techniques. The ability of organisms to learn and adapt to signals from their environment is one
of the core features of life. Technically, any adaptive system needs some kind of search operator
in order to explore a feature space which describes all possible configurations of the system.
Usually, one is interested in “optimal” or at least close to “optimal” configurations defined with
respect to a specific application domain: the weight settings of a neural network for correct
classification of some data, parameters that describe the body shape of an airplane with
minimum drag, a sequence of jobs assigned to a flexible production line in a factory resulting in
minimum idle time for the machine park, the configuration for a stable bridge with minimum
weight or minimum cost to build and maintain, or a set of computer programs that implement a
robot control task with a minimum number of commands.

Stochastic search algorithms are designed for problems with inherent random noise or
deterministic problems solved by injected randomness. The search favors designs with better
performance. An important feature of stochastic search algorithms is that they can carry out
broad search of the design space and thus avoid local optima. Also, stochastic search algorithms
do not require gradients to guide the search, making them a good fit for discrete problems.
However, there is no necessary condition for an optimum solution and the algorithm must run
multiple times to make sure the attained solutions are robust. To handle constraints, penalties can
also be applied on designs that violate constraints. For constraints that are difficult to be
formulated explicitly, a true/false check is straightforward to implement.
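As a hedged illustration, the R sketch below performs a simple stochastic (random) search over a two-dimensional design space to minimize a test objective; the objective function, bounds and iteration budget are illustrative assumptions, not part of the original notes.

# Hedged sketch: stochastic (random) search minimizing a simple objective.
objective <- function(x) (x[1] - 3)^2 + (x[2] + 1)^2 + sin(5 * x[1])

set.seed(7)
lower <- c(-10, -10); upper <- c(10, 10)
best_x <- runif(2, lower, upper)
best_f <- objective(best_x)

for (i in 1:5000) {
  candidate <- runif(2, lower, upper)   # random exploration of the design space
  f <- objective(candidate)
  if (f < best_f) {                     # keep the design with better performance
    best_f <- f
    best_x <- candidate
  }
}
best_x; best_f                           # best design found (re-run several times for robustness)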


In order to describe stochastic processes in statistical terms, we can give the following

definitions:
• Observation: the result of one trial.
• Population: all the possible observations that can be registered from a trial.
• Sample: a set of results collected from separate, independent trials.

For example, the toss of a fair coin is a random process, but thanks to The Law of the Large

Numbers we know that given a large number of trials we will get approximately the same

number of heads and tails.

The Law of the Large Numbers states that: “As the size of the sample increases, the mean value

of the sample will better approximate the mean or expected value in the population. Therefore,

as the sample size goes to infinity, the sample mean will converge to the population mean. It is

important to be clear that the observations in the sample must be independent”

Some examples of random processes are stock markets and medical data such as blood pressure

and EEG analysis.

One of the main applications of Machine Learning is modelling stochastic processes. Some

examples of stochastic processes used in Machine Learning are:

Poisson processes

Poisson Processes are used to model a series of discrete events in which we know the average

time between the occurrence of different events but we don’t know exactly when each of these

events might take place.


A process can be considered to belong to the class of Poisson Processes if it meets the following criteria:
1. The events are independent of each other (if an event happens, this does not alter the probability that another event can take place).
2. Two events can't take place simultaneously.
3. The average rate of event occurrence is constant.

Let's take power cuts as an example. The electricity provider might advertise that power cuts are likely to happen every 10 months on average, but we can't precisely tell when the next power cut is going to happen. For example, if a major problem happens, the electricity might go off repeatedly for 2-3 days (e.g. in case the company needs to make some changes to the power source) and then stay on for the next 2 years. Therefore, for this type of process, we can be quite sure of the average time between the events, but their occurrence is randomly spaced in time. From a Poisson process, we can then derive a Poisson distribution, which can be used to find the probability of the waiting time between the occurrence of different events or the number of possible events in a time period.

A Poisson distribution can be modelled using the following formula, where k represents the number of events which can take place in a period and λ is the expected (average) number of events in that period:

P(k events in a period) = (λ^k * e^(−λ)) / k!

Some examples of phenomena which can be modelled using Poisson Processes are radioactive

decay of atoms and stock market analysis.
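A hedged R sketch of working with a Poisson distribution is given below; the rate lambda = 4 is an illustrative value, not taken from the text.

# Hedged sketch: Poisson probabilities for a process averaging lambda = 4 events per period.
lambda <- 4

k <- 0:10
round(dpois(k, lambda), 4)         # probability of exactly k events in one period

ppois(2, lambda)                   # probability of at most 2 events in one period

# Waiting times between Poisson events follow an exponential distribution
mean(rexp(10000, rate = lambda))   # approximately 1 / lambda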


Random Walk and Brownian motion processes

A Random Walk can be any sequence of discrete steps (of always the same length) moving in

random directions. Random Walks can take place in any type of dimensional space (eg. 1D, 2D,

nD).

Figure 2.48: Random Walk in High Dimensions


Let’s imagine that we are in a park and we can see a dog looking for food. He is currently in

position zero on the number line and he has an equal probability to move left or right to find any

food.

Figure 2.49: Number Line

Now, if we want to find out what the position of the dog is going to be after a certain number N of steps, we can take advantage again of The Law of the Large Numbers. Using this law, we find out that, as N goes to infinity, our dog will probably be back at its starting point. Anyway, this is not of much use in this case.

Therefore, we can instead try to use the Root-Mean-Square (RMS) as our distance metric (we first square all the values, then we calculate their average and finally we take the square root of the result). In this way, all our negative numbers become positive and the average is no longer equal to zero.

In this example, using RMS we find out that if our dog takes 100 steps, it will have moved about 10 steps from the origin on average (√100 = 10).
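A hedged R sketch of this result is shown below; it simulates many one-dimensional random walks and checks that the RMS displacement after 100 steps is close to 10. The number of simulated walks is an illustrative choice.

# Hedged sketch: RMS displacement of a 1-D random walk after N steps is roughly sqrt(N).
set.seed(123)
N <- 100          # steps per walk
walks <- 5000     # number of independent walks

final_positions <- replicate(walks, sum(sample(c(-1, 1), N, replace = TRUE)))

mean(final_positions)           # close to 0 (the starting point, on average)
sqrt(mean(final_positions^2))   # RMS displacement, close to sqrt(100) = 10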

As mentioned before, a Random Walk describes a discrete-time process, while Brownian Motion can be used to describe a continuous-time random walk. Some examples of random walk applications are: tracing the path taken by molecules when moving through a gas during the diffusion process, sports event predictions, etc.

Hidden Markov Models (HMMs)

HMMs are probabilistic graphical models used to predict a sequence of hidden (unknown) states

from a set of observable states. This class of models follows the Markov processes assumption:

“The future is independent of the past, given that we know the present”. Therefore, when

working with Hidden Markov Models, we just need to know our present state in order to make a

prediction about the next one (we don’t need any information about the previous states). To
make our predictions using HMMs we just need to calculate the joint probability of our hidden


states and then select the sequence which yields the highest probability (the most likely to

happen).

In order to calculate the joint probability, we need three main types of information:
• Initial condition: the initial probability we have to start our sequence in any of the hidden
states.
• Transition probabilities: the probabilities of moving from one hidden state to another.
• Emission probabilities: the probabilities of moving from a hidden state to an observable
state.

As a simple example, let's imagine we are trying to predict what the weather is going to be like tomorrow based on what a group of people is wearing.

In this case, the different types of weather are going to be our hidden states (e.g. sunny, windy and rainy) and the types of clothing worn are going to be our observable states (e.g. t-shirt, long trousers and jackets). Our initial condition is going to be our starting point in the series. The transition probabilities are going to represent the likelihood of moving from one type of weather to another from one day to the next. Finally, the emission probabilities are going to be the probabilities that someone wears a certain attire depending on the weather of the previous day.

Figure 2.50 Hidden Markov Model example
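As a hedged sketch, the R code below computes the joint probability of a short hidden-state path and an observation sequence for a toy weather HMM; every probability in it is an illustrative assumption, not data from the text.

# Hedged sketch: joint probability of a hidden-state path and an observation
# sequence in a small HMM, using the initial, transition and emission probabilities.
states <- c("sunny", "rainy")
obs    <- c("t-shirt", "jacket")

init     <- c(sunny = 0.6, rainy = 0.4)              # initial condition
trans    <- matrix(c(0.7, 0.3,                        # transition probabilities
                     0.4, 0.6), nrow = 2, byrow = TRUE,
                   dimnames = list(states, states))
emission <- matrix(c(0.8, 0.2,                        # emission probabilities
                     0.3, 0.7), nrow = 2, byrow = TRUE,
                   dimnames = list(states, obs))

joint_prob <- function(hidden, observed) {
  p <- init[hidden[1]] * emission[hidden[1], observed[1]]
  for (t in 2:length(hidden)) {
    p <- p * trans[hidden[t - 1], hidden[t]] * emission[hidden[t], observed[t]]
  }
  unname(p)
}

# Probability of the path sunny -> sunny -> rainy given what people wore
joint_prob(c("sunny", "sunny", "rainy"), c("t-shirt", "t-shirt", "jacket"))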


One main problem when using Hidden Markov Models is that, as the number of states increases, the number of probabilities and possible scenarios increases exponentially. In order to solve that, it is possible to use another algorithm called the Viterbi Algorithm. From a Machine Learning point of view, the observations form our training data and the number of hidden states forms the hyperparameters to tune. One of the most common applications of HMMs in Machine Learning is in agent-based situations such as Reinforcement Learning.

Figure 2.51 HMMs in Reinforcement Learning


Gaussian Processes

Gaussian Processes are a class of stationary, zero-mean stochastic processes which are completely determined by their autocovariance functions. This class of models can be used for both regression and classification tasks. One of the greatest advantages of Gaussian Processes is that they can provide estimates of uncertainty, for example giving us an estimate of how sure an algorithm is that an item belongs to a class or not. In order to deal with situations which embed a certain degree of uncertainty, probability distributions are typically used.

A simple example of a discrete probability distribution is the roll of a die.

Imagine now one of your friends challenges you to play at dice and you make 50 throws. In the case of a fair die, we would expect each of the 6 faces to have the same probability of appearing (1/6 each). This is shown in Figure 2.52.


Figure 2.52: Fair Dice Probability Distribution

Anyway, the more you keep playing, the more you notice that the die tends to land always on the same faces. At this point, you start thinking the die might be loaded, and therefore you update your initial belief about the probability distribution.

Figure 2.53: Loaded Dice Probability Distribution


This process is known as Bayesian Inference. Bayesian Inference is a process through which we update our beliefs about the world based on the gathering of new evidence. We start with a prior belief and, once we update it with brand new information, we construct a posterior belief. This same reasoning is valid for discrete distributions as well as for continuous distributions. Gaussian processes therefore allow us to describe probability distributions that we can later update using Bayes' Rule once we gather new training data.
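As a hedged illustration of this prior-to-posterior update, the R sketch below places a Beta prior on the probability that a (possibly loaded) die lands on six and updates it with simulated rolls; the prior parameters and the loading of the die are illustrative assumptions, not part of the original text.

# Hedged sketch: Bayesian updating of the probability that a die lands on six.
set.seed(99)

prior_a <- 1; prior_b <- 5          # Beta prior with mean 1/6, matching a fair-die belief

rolls <- sample(1:6, 50, replace = TRUE, prob = c(rep(0.1, 5), 0.5))  # simulated loaded die
sixes <- sum(rolls == 6)

post_a <- prior_a + sixes                     # Beta-Binomial conjugate update
post_b <- prior_b + (length(rolls) - sixes)

prior_a / (prior_a + prior_b)   # prior mean of P(six)
post_a / (post_a + post_b)      # posterior mean, pulled towards the observed data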

Auto-Regressive Moving Average processes

Auto-Regressive Moving Average (ARMA) processes are a really important class of stochastic processes used to analyse time series. What characterizes ARMA models is that their autocovariance functions only depend on a limited number of unknown parameters.

The ARMA acronym can be broken down into two main parts:

• Auto-Regressive = the model takes advantage of the connection between a predefined number of lagged observations and the current one.
• Moving Average = the model takes advantage of the relationship between the residual errors and the observations.

The ARMA model makes use of two main parameters (p, q). These are:

• p = the number of lag observations.
• q = the size of the moving average window.

ARMA processes assume that a time series fluctuates uniformly around a time-invariant mean. If we are trying to analyse a time series which does not follow this pattern, then the series will need to be differenced until it achieves stationarity.
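A hedged R sketch of simulating and fitting an ARMA(1, 1) process with the base arima.sim() and arima() functions is given below; the coefficients are illustrative values, not estimates from real data.

# Hedged sketch: simulating and fitting an ARMA(p = 1, q = 1) model in base R.
set.seed(321)
series <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 500)

# Fit an ARMA(1, 1); arima() uses order = c(p, d, q) with d = 0 (no differencing)
fit <- arima(series, order = c(1, 0, 1))
fit$coef            # estimated AR and MA coefficients, close to 0.6 and 0.4

# A non-stationary series would first be differenced, e.g. with diff(series)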

