UNIT-5
Predictive analytics:
Predictive analytics is a branch of advanced analytics that makes predictions
about future outcomes using historical data combined with statistical modeling,
data mining techniques and machine learning. Companies employ predictive
analytics to find patterns in historical data in order to identify risks and opportunities.
Predictive analytics is often associated with big data and data science.
Classification models
Classification models fall under supervised learning. They use what they learn from
historical, labelled data to assign new records to predefined categories, for example
flagging an email as spam or not spam.
Clustering models
Clustering models fall under unsupervised learning. They group data based on
similar attributes. For example, an e-commerce site can use the model to
separate customers into similar groups based on common features and develop
marketing strategies for each group. Common clustering algorithms include
k-means clustering, mean-shift clustering, density-based spatial clustering of
applications with noise (DBSCAN), expectation-maximization (EM) clustering
using Gaussian Mixture Models (GMM), and hierarchical clustering.
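As a rough sketch of how clustering works in practice, the snippet below runs k-means with
scikit-learn on a tiny, made-up set of customer features (the spend and order counts are
invented purely for illustration):

# Minimal k-means sketch using scikit-learn; the customer data is made up.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer rows: [annual_spend, orders_per_year]
X = np.array([
    [200, 2], [250, 3], [220, 2],        # low-spend customers
    [1200, 15], [1100, 14], [1300, 18],  # high-spend customers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", kmeans.labels_)          # group assigned to each customer
print("Cluster centres:\n", kmeans.cluster_centers_)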
Linear regression
Linear regression analysis is used to predict the value of a variable based on the
value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's
value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving
one or more independent variables that best predict the value of the dependent
variable. Linear regression fits a straight line or surface that minimizes the
discrepancies between predicted and actual output values. There are simple
linear regression calculators that use a “least squares” method to discover the
best-fit line for a set of paired data. You then estimate the value of Y (the
dependent variable) from X (the independent variable).
To understand the concept, let’s consider a salary dataset that gives the value of the
dependent variable (salary) for every value of the independent
variable (years of experience).
Salary dataset:
The equation of the regression line is given by:
y = a + bx
where y is the predicted response value, a is the y-intercept, x is the feature
value and b is the slope.
To create the model, let’s estimate the values of the regression coefficients a
and b. Once these coefficients are estimated, the response model can be used for
prediction. Here we are going to use the least squares technique.
The principle of least squares is one of the popular methods for fitting a curve to
given data. Let (x1, y1), (x2, y2), …, (xn, yn) be n observations from an
experiment. We are interested in finding the curve that minimizes the sum of the
squared differences between the observed y values and the values predicted by
the curve.
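For a straight line y = a + bx, the standard least-squares estimates are
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a = ȳ − b·x̄. Below is a minimal sketch of these
calculations in Python; the small salary figures are invented for illustration and are not
taken from the dataset referred to above:

# Least-squares estimates for y = a + b*x (illustrative, invented salary data).
import numpy as np

x = np.array([1, 2, 3, 4, 5])                       # years of experience
y = np.array([40000, 45000, 50000, 55000, 60000])   # salary

x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
a = y_mean - b * x_mean                                              # intercept

print(f"y = {a:.2f} + {b:.2f} * x")
print("Predicted salary at 6 years of experience:", a + b * 6)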
Multiple Linear Regression:
One of the most common types of predictive analysis is multiple linear
regression. This type of analysis allows you to understand the relationship
between a continuous dependent variable and two or more independent variables.
The independent variables can be either continuous (like age and height) or
categorical (like gender and occupation). It's important to note that if an
independent variable is categorical, you should dummy code it before running the
analysis.
In multiple linear regression, the dependent variable is the outcome or result
you are trying to predict. The independent variables are the factors that
explain your dependent variable. You can use them to build a model that
accurately predicts your dependent variable from the independent variables.
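As an illustration, here is a minimal multiple linear regression sketch with scikit-learn; the
age, height, and income values are invented for the example:

# Multiple linear regression sketch with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical rows: [age, height_cm] -> dependent variable (e.g., income)
X = np.array([[25, 170], [32, 165], [40, 180], [28, 175], [50, 160]])
y = np.array([30000, 42000, 58000, 36000, 65000])

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients (one per independent variable):", model.coef_)
print("Prediction for age 35, height 172:", model.predict([[35, 172]]))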
For your model to be reliable and valid, there are some essential requirements: the
relationship between the dependent variable and the independent variables should be
approximately linear, the residuals should be independent and roughly normally distributed
with constant variance, and the independent variables should not be highly correlated with
one another (no severe multicollinearity).
Data Visualization:
Our eyes are drawn to colours and patterns. We can quickly tell red from
blue, and a square from a circle. Our culture is visual, including everything from
art and advertisements to TV and movies.
Data visualization is another form of visual art that grabs our interest and
keeps our eyes on the message. When we see a chart, we quickly see trends and
outliers. If we can see something, we internalize it quickly. It’s storytelling with
a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t
see a trend, you know how much more effective a visualization can be. Commonly
used data visualization techniques include the following:
Box plots
Histograms
Heat maps
Charts
Tree maps
Word Cloud/Network diagram
Box Plots
A boxplot is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile
(Q1), median, third quartile (Q3), and “maximum”). It can tell you about your
outliers and what their values are. It can also tell you if your data is
symmetrical, how tightly your data is grouped, and if and how your data is
skewed.
A box plot is a graph that gives you a good indication of how the values in the
data are spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space,
which is useful when comparing distributions between many groups or datasets.
For some distributions/datasets, you will find that you need more information
than the measures of central tendency (median, mean, and mode). You need to
have information on the variability or dispersion of the data.
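A minimal box-plot sketch with matplotlib, using randomly generated values purely for
illustration (the group names and parameters are invented):

# Box plot sketch with matplotlib (random data for illustration).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=100) for m in (0, 2, 4)]

plt.boxplot(groups, labels=["group A", "group B", "group C"])
plt.ylabel("value")
plt.title("Five-number summary per group")
plt.show()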
Histograms
A histogram is a plot that lets you discover, and show, the underlying frequency
distribution (shape) of a set of continuous data. This allows the inspection of
the data for its underlying distribution (e.g., normal distribution), outliers,
skewness, etc. It is an accurate representation of the distribution of numerical
data and relates to only one variable. It uses bins (or buckets): ranges of values
that divide the entire range of the data into a series of intervals, and then counts
how many values fall into each interval.
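A minimal histogram sketch with matplotlib, again on invented, randomly generated data:

# Histogram sketch with matplotlib (random data for illustration).
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(1).normal(loc=50, scale=10, size=1000)

plt.hist(values, bins=20)      # 20 equal-width bins over the data range
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Underlying frequency distribution")
plt.show()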
Heat Maps
A heat map is a data visualization tool that uses colour the way a bar graph uses
height and width: to encode the magnitude of values.
If you’re looking at a web page and you want to know which areas get the most
attention, a heat map shows you in a visual way that’s easy to assimilate and
make decisions from. It is a graphical representation of data where the
individual values contained in a matrix are represented as colours. Useful for
two purposes: for visualizing correlation tables and for visualizing missing values
in the data. In both cases, the information is conveyed in a two-dimensional
table.
Note that heat maps are useful when examining a large number of values, but
they are not a replacement for more precise graphical displays, such as bar
charts, because colour differences cannot be perceived accurately.
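As a sketch, the snippet below builds a correlation table from invented data and displays it
as a heat map with matplotlib (pandas is assumed for the correlation step; the column names
and relationships are made up):

# Correlation heat map sketch (matplotlib + pandas; data is invented).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.8 * df["x"] + rng.normal(scale=0.5, size=200)   # correlated with x
df["z"] = rng.normal(size=200)                               # independent of x

corr = df.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation matrix as a heat map")
plt.show()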
Charts
Line Chart
The simplest technique, a line plot, is used to show the relationship or dependence
of one variable on another. To plot the relationship between two variables,
we can simply call the plot function, as in the sketch below.
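A minimal line-chart sketch with matplotlib, using invented monthly values:

# Line chart sketch: relationship of one variable to another (invented data).
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 15, 14, 18, 21]   # hypothetical monthly sales

plt.plot(months, sales, marker="o")
plt.xlabel("month")
plt.ylabel("sales")
plt.title("Trend over time")
plt.show()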
Bar Charts
Bar charts are used for comparing the quantities of different categories or
groups. Values of a category are represented with the help of bars and they can
be configured with vertical or horizontal bars, with the length or height of each
bar representing the value.
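A minimal bar-chart sketch with matplotlib (the categories and quantities are invented):

# Bar chart sketch comparing categories (invented data).
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 36]

plt.bar(categories, values)          # use plt.barh(...) for horizontal bars
plt.xlabel("category")
plt.ylabel("quantity")
plt.show()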
Pie Chart
A pie chart represents parts of a whole: each category is drawn as a slice whose
size is proportional to its share of the total.
Scatter Charts
A scatter chart plots individual data points for two variables and is useful for
showing how the two variables are related or correlated.
Bubble Charts
It is a variation of scatter chart in which the data points are replaced with
bubbles, and an additional dimension of data is represented in the size of the
bubbles.
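A minimal bubble-chart sketch with matplotlib, where the s argument maps a third variable
to bubble size (all values are invented):

# Bubble chart sketch: a scatter plot where marker size encodes a third dimension.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 14, 8, 20, 16]
size = [30, 120, 60, 300, 150]   # third dimension mapped to bubble area

plt.scatter(x, y, s=size, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Bubble chart (size = third variable)")
plt.show()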
Timeline Charts
A timeline chart displays a sequence of events or values in chronological order,
making it easy to see when things happened and how they relate in time.
Regression Coefficients:
A regression coefficient is the quantity that sits in front of an independent variable in your
regression equation. It is a parameter estimate describing the relationship between one of
the independent variables in your model and the dependent variable.
In the simple linear regression below, the quantity 0.5, which sits in front of the variable X, is
a regression coefficient. The intercept, in this case 2, is also a coefficient, but you’ll hear it
referred to instead as the “intercept,” “constant,” or “β0”.
For the sake of this discussion, we will leave the intercept out.
Ŷ = 2 + 0.5X
Regression coefficients tell us about the line of best fit and the estimated relationship
between an independent variable and the dependent variable in our model. In a simple
linear regression with only one independent variable, the coefficient determines the slope of
the regression line; it tells you whether the regression line is upward or downward-sloping
and how steep the line is.
Regressions can have more than one independent variable and, therefore, more than one
regression coefficient. In the multivariate regression below, there are two independent
variables (X1 and X2). This means you have two regression coefficients: 0.7 and −3.2. Each
coefficient gives you information about the relationship between one of the independent
variables and the dependent (or response) variable, Y.
Ŷ = −4 + 0.7X1 − 3.2X2
In the regression here, the coefficient 0.7 suggests a positive linear relationship between X1
and Y. If all other independent variables in the model are held constant, as X1 increases by
1 unit, we estimate that Y increases by 0.7 units.
In OLS, you can find the regression line by minimizing the sum of squared errors. Here the
errors, or residuals, are the vertical distances between each point on the scatter plot and
the regression line. The regression coefficient on X tells you the slope of the regression line.
We typically perform our regression calculations using statistical software like R or Stata.
When we do this, we not only create scatter plots and lines but also create a regression
output table like the one below. A regression output table is a table summarizing the
regression line, the errors of your model, and the statistical significance of each parameter
estimated by your model.
In the regression output table, the independent variables (X1 and X2) are listed in the first
column, and the coefficients on these variables are listed in the coefficient column, one row
per variable.
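As an illustration, the sketch below simulates data roughly matching the example equation
above and prints a regression output table with statsmodels (the simulated values are
invented, so the printed estimates will only approximate −4, 0.7 and −3.2):

# Sketch of producing a regression output table with statsmodels (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)
Y = -4 + 0.7 * X1 - 3.2 * X2 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(np.column_stack([X1, X2]))   # adds the intercept column
model = sm.OLS(Y, X).fit()
print(model.summary())   # coefficients, standard errors, p-values, R-squared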
In linear regression, your regression coefficients will be constants that are either positive or
negative. Here is how you can interpret the coefficients.
1. Non-Zero Coefficient: a non-zero coefficient indicates that the model estimates some
relationship between the independent variable and the dependent variable; a coefficient of
zero would mean the variable has no estimated effect on the outcome.
2. Positive Coefficient: if the regression coefficient is positive, there is a positive (or direct)
relationship between the independent variable and the dependent variable. As X increases,
Y tends to increase as well.
3. Negative Coefficient: if the regression coefficient is negative, there is a negative (or
inverse) relationship between the independent variable and the dependent variable. As X
increases, Y tends to decrease, and as X decreases, Y tends to increase.
Remember, your coefficients are only estimates. You’ll never know with certainty what the
true parameters are, or what the exact relationship is between your variables.
In regression, you can estimate how much of the variation in your dependent variable is
explained by the independent variables by calculating R², and you can assess how
confident you can be in your estimates using tests of statistical significance.
In a simple OLS linear regression of the form Y = B0 + B1X, you can find the regression
coefficient B1 using the following equation:
B1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
and the intercept is then B0 = ȳ − B1x̄.
So far, we have mainly discussed the simplest form of regression: a linear regression
with one independent variable. As you continue to study statistics, you’ll encounter many
more complex forms of regression. In these other regression models, the coefficients might
take on slightly different forms and may need to be interpreted differently.
Linear Regression
Linear regression is one of the most basic forms of regression. As you saw earlier, in linear
regression, you find a line of best fit (a regression line) that minimizes the sum of squared
errors. This line models the relationship between a dependent variable and an independent
variable.
Logistic Regression (or Logit Regression)
We use logistic regression when we want to study a binary outcome and are trying to
estimate the likelihood of one of the two possible outcomes occurring. Logistic regression
allows you to predict whether an outcome variable will be true or false, a win or a loss, heads
or tails, 1 or 0, or any other binary set of outcomes.
In logistic regression, you interpret the regression coefficients differently than you would in a
linear model. In linear regression, a coefficient of 2 means that as your independent variable
increases by one unit, your dependent variable is expected to increase by 2 units. In logistic
regression, a coefficient of 2 means that as your independent variable increases by one unit,
the log odds of your dependent variable increase by 2.
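A minimal logistic regression sketch with scikit-learn; the hours-studied/pass-fail data is
invented, and the printed coefficient is the estimated change in the log odds per unit of X:

# Logistic regression sketch with scikit-learn (invented binary data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature (e.g., hours studied) and a pass/fail outcome.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("Coefficient (change in log odds per unit of X):", clf.coef_[0][0])
print("Predicted probability of passing at X = 4.5:", clf.predict_proba([[4.5]])[0][1])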
Non-Linear Regression
In a non-linear regression, you estimate the relationship between your variables using a
curve rather than a line. For example, if we know that the relationship between Y and X1
cannot simply be expressed by a line, but rather by a curve, we may want to include X1,
but also its quadratic version X1². In this case, we will get two coefficients related to X1:
one for X1 and one for X1². Something like this:
Ŷ = −8.2 + 1.5X1 − 0.5(X1)²
Here, we can no longer say that if X1 changes by one unit, Y changes by 1.5 units, since X1
appears twice in the regression. Instead, the relationship between Y and X1 is non-linear. If
the level of X1 is 1 and we increase it by 1 unit, then Y increases by roughly (1.5 − 1) = 0.5 units.
However, if the level of X1 is 2 and we increase it by 1 unit, then Y changes by roughly
(1.5 − 2) = −0.5 units, i.e. it decreases. This is because the partial derivative of Y with respect
to X1 is no longer a constant; it is 1.5 − 2 × 0.5 × X1 = 1.5 − X1.
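A sketch of fitting such a quadratic relationship with numpy.polyfit, on data simulated from
the example equation above (the noise level and sample values are invented):

# Quadratic (non-linear in X) regression sketch using numpy.polyfit (simulated data).
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 50)
y = -8.2 + 1.5 * x - 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

# Fit y = c0 + c1*x + c2*x^2 ; polyfit returns the highest degree first.
c2, c1, c0 = np.polyfit(x, y, deg=2)
print(f"Estimated curve: y = {c0:.2f} + {c1:.2f}*x + {c2:.2f}*x^2")
print("Marginal effect at x = 1:", c1 + 2 * c2 * 1)   # derivative c1 + 2*c2*x
print("Marginal effect at x = 2:", c1 + 2 * c2 * 2)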
Ridge Regression
Ridge regression is a technique used in machine learning. Statisticians and data scientists
use ridge regressions to adjust linear regressions to avoid overfitting the model to training
data. In ridge regression, the parameters of the model (including the regression coefficients)
are found by minimizing the sum of squared errors plus a value called the ridge regression
penalty.
As a result of the adjustment, the model's predictions become less sensitive to changes in
the independent variables. In other words, the coefficients in a ridge regression tend to be
smaller in absolute value than the coefficients in an OLS regression.
Lasso Regression
Lasso regression is similar to ridge regression. It is an adjustment method used with OLS to
adjust for the risk of overfitting a model to training data. In a lasso regression, you adjust
your OLS regression line by a value known as the lasso regression penalty. Similar to the
ridge regression penalty, the lasso penalty shrinks the coefficients in the regression
equation; unlike ridge, lasso can shrink some coefficients exactly to zero, effectively
dropping those variables from the model.
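As a rough illustration of the shrinkage effect, the sketch below fits OLS, ridge, and lasso on
the same simulated data and compares the coefficients (the data, alpha values, and true
coefficients are all invented):

# Ridge and lasso sketch with scikit-learn: the penalties shrink coefficients
# relative to ordinary least squares (simulated data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=100)

print("OLS:  ", LinearRegression().fit(X, y).coef_)
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)    # smaller in absolute value
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)    # some driven exactly to zero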