
BDA

UNIT-5

Predictive analytics:
Predictive analytics is a branch of advanced analytics that makes predictions
about future outcomes using historical data combined with statistical modeling,
data mining techniques and machine learning. Companies employ predictive
analytics to find patterns in this data to identify risks and opportunities.
Predictive analytics is often associated with big data and data science.

Types of predictive modeling

Predictive analytics models are designed to assess historical data, discover
patterns, observe trends, and use that information to predict future trends.
Popular predictive analytics models include classification, clustering, and time
series models.

Classification models

Classification models fall under the branch of supervised machine learning
models. These models categorize data based on historical data, describing
relationships within a given dataset. For example, a classification model can be
used to group customers or prospects into segments for segmentation purposes.
Alternatively, it can be used to answer questions with binary outputs, such as
yes or no, or true and false; popular use cases are fraud detection and credit
risk evaluation. Types of classification models include logistic regression,
decision trees, random forest, neural networks, and Naïve Bayes.
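As a rough illustration of how one of these classifiers works, here is a minimal pure-Python sketch of logistic regression trained by gradient descent. The transaction amounts and labels are invented for the fraud-detection use case mentioned above; a real project would use a library implementation.

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit a one-feature logistic model p = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            # Gradient of the log-loss for a single sample
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Invented data: scaled transaction amounts, labelled 1 = fraud, 0 = legitimate
xs = [0.2, 0.4, 0.5, 1.8, 2.0, 2.3]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Classify as fraud (1) if the predicted probability is at least 0.5."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

print(predict(0.3), predict(2.1))  # expect: 0 1
```

The binary yes/no output mentioned in the text corresponds to thresholding the predicted probability at 0.5.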

Clustering models

Clustering models fall under unsupervised learning. They group data based on
similar attributes. For example, an e-commerce site can use the model to
separate customers into similar groups based on common features and develop
marketing strategies for each group. Common clustering algorithms include k-
means clustering, mean-shift clustering, density-based spatial clustering of
applications with noise (DBSCAN), expectation-maximization (EM) clustering
using Gaussian Mixture Models (GMM), and hierarchical clustering.
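To make the grouping idea concrete, here is a minimal pure-Python sketch of k-means, the first algorithm listed, applied to invented one-dimensional customer-spend figures. A real e-commerce segmentation would use many features and a library implementation.

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Plain k-means on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    random.seed(seed)
    centroids = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Invented customer spend values forming two obvious groups
spend = [12, 15, 14, 13, 90, 95, 92, 88]
print(kmeans_1d(spend))  # two centroids: one near 13.5, one near 91
```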

Time series models


Time series models use various data inputs at a specific time frequency, such as
daily, weekly, monthly, et cetera. It is common to plot the dependent variable
over time to assess the data for seasonality, trends, and cyclical behavior,
which may indicate the need for specific transformations and model types.
Autoregressive (AR), moving average (MA), ARMA, and ARIMA models are all
frequently used time series models. As an example, a call center can use a time
series model to forecast how many calls it will receive per hour at different
times of day.
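The AR/MA family is best fitted with statistical software, but the intuition behind a moving-average baseline can be sketched in a few lines. The hourly call counts below are invented for the call-centre example.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations:
    the simplest moving-average-style baseline for a time series."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Invented hourly call counts at a call centre
calls = [120, 132, 128, 140, 135, 138]
print(moving_average_forecast(calls))  # (140 + 135 + 138) / 3 ≈ 137.67
```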

Predictive analytics can be deployed across various industries for different
business problems. Below are a few industry use cases to illustrate how
predictive analytics can inform decision-making in real-world situations.

 Banking: Financial services use machine learning and quantitative tools to
predict credit risk and detect fraud. As an example, BondIT is a company
that specializes in fixed-income asset-management services. Predictive
analytics allows it to respond to dynamic market changes in real time in
addition to static market constraints. This use of technology allows it
both to customize personal services for clients and to minimize risk.
 Healthcare: Predictive analytics in health care is used to detect and
manage the care of chronically ill patients, as well as to track specific
infections such as sepsis. Geisinger Health used predictive analytics to
mine health records to learn more about how sepsis is diagnosed and
treated. Geisinger created a predictive model based on health records
for more than 10,000 patients who had been diagnosed with sepsis in the
past. The model yielded impressive results, correctly predicting patients
with a high rate of survival.
 Human resources (HR): HR teams use predictive analytics and employee
survey metrics to match prospective job applicants, reduce employee
turnover and increase employee engagement. This combination of
quantitative and qualitative data allows businesses to reduce their
recruiting costs and increase employee satisfaction, which is particularly
useful when labor markets are volatile.
 Marketing and sales: While marketing and sales teams are very familiar
with business intelligence reports to understand historical sales
performance, predictive analytics enables companies to be more proactive
in the way that they engage with their clients across the customer
lifecycle. For example, churn predictions can enable sales teams to
identify dissatisfied clients sooner, enabling them to initiate
conversations to promote retention. Marketing teams can leverage
predictive data analysis for cross-sell strategies, and this commonly
manifests itself through a recommendation engine on a brand’s website.
 Supply chain: Businesses commonly use predictive analytics to manage
product inventory and set pricing strategies. This type of predictive
analysis helps companies meet customer demand without overstocking
warehouses. It also enables companies to assess the cost and return on
their products over time. If one part of a given product becomes more
expensive to import, companies can project the long-term impact on
revenue if they do or do not pass on additional costs to their customer
base. For a deeper look at a case study, you can read more about
how FleetPride used this type of data analytics to inform their decision
making on their inventory of parts for excavators and tractor trailers.
Past shipping orders enabled them to plan more precisely to set
appropriate supply thresholds based on demand.
Benefits of predictive modelling

 Security: Every modern organization must be concerned with keeping
data secure. A combination of automation and predictive analytics
improves security: specific patterns associated with suspicious and
unusual end-user behavior can trigger specific security procedures.
 Risk reduction: In addition to keeping data secure, most businesses are
working to reduce their risk profiles. For example, a company that
extends credit can use data analytics to better understand if a customer
poses a higher-than-average risk of defaulting. Other companies may use
predictive analytics to better understand whether their insurance
coverage is adequate.
 Operational efficiency: More efficient workflows translate to improved
profit margins. For example, understanding when a vehicle in a fleet used
for delivery is going to need maintenance before it’s broken down on the
side of the road means deliveries are made on time, without the
additional costs of having the vehicle towed and bringing in another
employee to complete the delivery.
 Improved decision making: Running any business involves making
calculated decisions. Any expansion or addition to a product line or other
form of growth requires balancing the inherent risk with the potential
outcome. Predictive analytics can provide insight to inform the decision-
making process and offer a competitive advantage.

Linear regression
Linear regression analysis is used to predict the value of a variable based on the
value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's
value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving
one or more independent variables that best predict the value of the dependent
variable. Linear regression fits a straight line or surface that minimizes the
discrepancies between predicted and actual output values. There are simple
linear regression calculators that use a “least squares” method to discover the
best-fit line for a set of paired data. You then estimate the value of Y
(the dependent variable) from X (the independent variable).

It is a commonly used type of predictive analysis: a statistical approach
for modeling the relationship between a dependent variable and a given set of
independent variables.

There are two types of linear regression.


 Simple Linear Regression
 Multiple Linear Regression
Let’s discuss Simple Linear regression using R.

Simple Linear Regression:

It is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables. One variable, denoted x, is
regarded as the independent variable, and the other, denoted y, is regarded
as the dependent variable. It is assumed that the two variables are linearly
related. Hence, we try to find a linear function that predicts the response
value (y) as accurately as possible as a function of the feature or independent
variable (x).

To understand the concept, consider a salary dataset that gives the value of
the dependent variable (salary) for every value of the independent variable
(years of experience).

Salary dataset:
The equation of the regression line is given by:

y = a + bx

where y is the predicted response value, a is the y-intercept, x is the
feature value, and b is the slope.

To create the model, we estimate the values of the regression coefficients a
and b. Once these coefficients are estimated, the response model can be used
for prediction. Here we use the least squares technique.

The principle of least squares is one of the popular methods for finding a
curve that fits a given set of data. Let (x1, y1), (x2, y2), …, (xn, yn) be n
observations from an experiment. We are interested in finding the curve that
best fits these observations.
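The text above mentions R, but the least-squares arithmetic is the same in any language. Here is a pure-Python sketch that estimates a and b exactly as described, using a made-up salary dataset:

```python
def least_squares(xs, ys):
    """Estimate intercept a and slope b of y = a + b*x by least squares:
    b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up salary data: years of experience vs salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [40, 45, 52, 58, 65]
a, b = least_squares(years, salary)
print(a, b)  # intercept ≈ 33.1, slope ≈ 6.3
```

Each additional year of experience is estimated to add about 6.3 (thousand) to salary in this toy dataset.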
Multiple Linear Regression:
One of the most common types of predictive analysis is multiple linear
regression. This type of analysis allows you to understand the relationship
between a continuous dependent variable and two or more independent variables.

The independent variables can be either continuous (like age and height) or
categorical (like gender and occupation). It's important to note that
categorical independent variables should be dummy coded before running the
analysis.
In multiple linear regression, the dependent variable is the outcome or result
that you're trying to predict. The independent variables are the things that
explain your dependent variable. You can use them to build a model that
accurately predicts your dependent variable from the independent variables.
For your model to be reliable and valid, there are some essential requirements:

 The independent and dependent variables are linearly related.

 There is no strong correlation between the independent variables.

 Residuals have a constant variance.

 Observations should be independent of one another.

 It is important that all variables follow multivariate normality.

Data visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools
provide an accessible way to see and understand trends, outliers, and patterns
in data. This section covers detailed techniques and the benefits of data
visualization.

Benefits of good data visualization

Our eyes are drawn to colours and patterns. We can quickly distinguish red from
blue, and a square from a circle. Our culture is visual, including everything
from art and advertisements to TV and movies.

Data visualization is another form of visual art that grabs our interest and
keeps our eyes on the message. When we see a chart, we quickly see trends and
outliers. If we can see something, we internalize it quickly. It’s storytelling with
a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t
see a trend, you know how much more effective a visualization can be. The uses
of Data Visualization as follows.

 Powerful way to explore data with presentable results.
 Primary use is in the pre-processing portion of the data mining
process.
 Supports the data cleaning process by finding incorrect and missing
values.
 Supports variable derivation and selection, that is, determining
which variables to include in or discard from the analysis.
 Also plays a role in combining categories as part of the data
reduction process.
Data Visualization Techniques

 Box plots
 Histograms
 Heat maps
 Charts
 Tree maps
 Word Cloud/Network diagram

Box Plots

A box plot is a standardized way of displaying the distribution of data based
on a five-number summary (“minimum”, first quartile (Q1), median, third
quartile (Q3), and “maximum”). It can tell you about your outliers and what
their values are. It can also tell you if your data is symmetrical, how tightly
your data is grouped, and if and how your data is skewed.

A box plot is a graph that gives you a good indication of how the values in the
data are spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space,
which is useful when comparing distributions between many groups or datasets.
For some distributions/datasets, you will find that you need more information
than the measures of central tendency (median, mean, and mode). You need to
have information on the variability or dispersion of the data.
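The five-number summary a box plot is built from can be computed directly. A small sketch using Python's standard statistics module follows; the sample data is arbitrary, and other quartile conventions give slightly different Q1/Q3 values.

```python
import statistics

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum: the five numbers a box plot displays.
    Quartiles here use statistics.quantiles with the 'inclusive' method."""
    s = sorted(data)
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return s[0], q1, med, q3, s[-1]

data = [7, 15, 36, 39, 40, 41]
print(five_number_summary(data))  # (7, 20.25, 37.5, 39.75, 41)
```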

List of Methods to Visualize Data

 Column Chart: It is also called a vertical bar chart, where each
category is represented by a rectangle. The height of the
rectangle is proportional to the values that are plotted.
 Bar Graph: It has rectangular bars in which the lengths are
proportional to the values which are represented.
 Stacked Bar Graph: It is a bar style graph that has various
components stacked together so that apart from the bar, the
components can also be compared to each other.
 Stacked Column Chart: It is similar to a stacked bar graph; however,
the data is stacked vertically in columns.
 Area Chart: It combines the line chart and the bar chart to show how
the numeric values of one or more groups change over time, with the
area below each line filled in.
 Dual Axis Chart: It combines a column chart and a line chart and
then compares the two variables.
 Line Graph: The data points are connected through a straight line;
therefore, creating a representation of the changing trend.
 Mekko Chart: It can be called a two-dimensional stacked chart
with varying column widths.
 Pie Chart: It is a chart where various components of a data set
are presented in the form of a pie which represents their
proportion in the entire data set.
 Waterfall Chart: With the help of this chart, the cumulative
effect of sequentially introduced positive or negative values can be
understood.
 Bubble Chart: It is a multi-variable graph that is a hybrid of a
scatter plot and a proportional area chart.
 Scatter Plot Chart: It is also called a scatter chart or scatter
graph. Dots are used to denote values for two different numeric
variables.
 Bullet Graph: It is a variation of a bar graph, often used as a
replacement for dashboard gauges and meters.
 Funnel Chart: The chart shows the flow of users through a business
or sales process.
 Heat Map: It is a technique of data visualization that shows the
level of instances as color in two dimensions.
Histograms

A histogram is a graphical display of data using bars of different heights. In
a histogram, each bar groups numbers into ranges. Taller bars show that more
data falls in that range. A histogram displays the shape and spread of
continuous sample data.

It is a plot that lets you discover, and show, the underlying frequency
distribution (shape) of a set of continuous data. This allows the inspection of
the data for its underlying distribution (e.g., normal distribution), outliers,
skewness, etc. It is an accurate representation of the distribution of
numerical data and relates to only one variable. It uses bins (or buckets):
the entire range of values is divided into a series of intervals, and the
number of values falling into each interval is counted.

Bins are consecutive, non-overlapping intervals of a variable. As adjacent
bins leave no gaps, the rectangles of a histogram touch each other to indicate
that the original variable is continuous.
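The binning step described above can be sketched in a few lines of Python; the height values and bin edges are arbitrary:

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall into each consecutive, non-overlapping bin.
    Bin i covers [bin_edges[i], bin_edges[i+1]); by convention, the last bin
    also includes its upper edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = (i == len(counts) - 1)
            if bin_edges[i] <= v < bin_edges[i + 1] or (last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts

# Arbitrary height values (cm) counted into three bins
heights = [150, 152, 160, 161, 165, 170, 171, 178, 180]
print(histogram_counts(heights, [150, 160, 170, 180]))  # [2, 3, 4]
```

The bar drawn over each bin would have a height proportional to its count.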

Heat Maps

A heat map uses colour the way a bar graph uses height and width: as a data
visualization tool.
If you’re looking at a web page and you want to know which areas get the most
attention, a heat map shows you in a visual way that’s easy to assimilate and
make decisions from. It is a graphical representation of data where the
individual values contained in a matrix are represented as colours. Useful for
two purposes: for visualizing correlation tables and for visualizing missing values
in the data. In both cases, the information is conveyed in a two-dimensional
table.
Note that heat maps are useful when examining a large number of values, but
they are not a replacement for more precise graphical displays, such as bar
charts, because colour differences cannot be perceived accurately.

Charts

Line Chart

The simplest technique, a line plot is used to plot the relationship or dependence
of one variable on another. To plot the relationship between the two variables,
we can simply call the plot function.
Bar Charts

Bar charts are used for comparing the quantities of different categories or
groups. Values of a category are represented with the help of bars and they can
be configured with vertical or horizontal bars, with the length or height of each
bar representing the value.

Pie Chart

It is a circular statistical graph which is divided into slices to illustrate
numerical proportion. The arc length of each slice is proportional to the
quantity it represents. As a rule, pie charts are used to compare the parts of
a whole and are
most effective when there are limited components and when text and
percentages are included to describe the content. However, they can be
difficult to interpret because the human eye has a hard time estimating areas
and comparing visual angles.

Scatter Charts

Another common visualization technique is the scatter plot: a two-dimensional
plot representing the joint variation of two data items. Each marker (a symbol
such as a dot, square, or plus sign) represents an observation, and the marker
position indicates the value for each observation. When you assign more than
two measures, a scatter plot matrix is produced: a series of scatter plots
displaying every possible pairing of the measures assigned to the
visualization. Scatter plots are used for examining the relationship, or
correlations, between X and Y variables.

Bubble Charts

It is a variation of scatter chart in which the data points are replaced with
bubbles, and an additional dimension of data is represented in the size of the
bubbles.

Timeline Charts

Timeline charts illustrate events in chronological order, for example the
progress of a project, an advertising campaign, or an acquisition process, in
whatever unit of time the data was recorded (week, month, quarter, year).
A timeline chart shows the chronological sequence of past or future events on
a timescale.
Tree Maps

A treemap is a visualization that displays hierarchically organized data as a
set of nested rectangles, with parent elements tiled by their child elements.
The sizes and colours of the rectangles are proportional to the values of the
data points they represent. A leaf-node rectangle has an area proportional to
the specified dimension of the data. Depending on the choice, the leaf node is
coloured, sized, or both according to chosen attributes. Treemaps make
efficient use of space and can thus display thousands of items on the screen
simultaneously.

Word Clouds and Network Diagrams for Unstructured Data

The variety of big data brings challenges because semi-structured and
unstructured data require new visualization techniques. A word cloud visual
represents the frequency of a word within a body of text by its relative size
in the cloud. This technique is used on unstructured data as a way to display
high- or low-frequency words.

Another visualization technique that can be used for semi-structured or
unstructured data is the network diagram. Network diagrams represent
relationships as nodes (individual actors within the network) and ties
(relationships between the individuals). They are used in many applications,
for example for analysis of social networks or mapping product sales across
geographic areas.

Interaction Techniques for Data Visualisation:

Data visualization is basically a graphical representation of information and
data: visual content through which people understand the significance of data.
There are various data visualization methods and techniques that help people
understand the importance of data. In general, patterns, trends, and
correlations might go unnoticed in text-based data, but through visualization
techniques they can be exposed and recognized more easily with different
software.

Interactive data visualization supports exploratory thinking so that decision-
makers can actively investigate intriguing findings. Interactive visualization
supports faster decision making, greater data access, and stronger user
engagement, along with desirable results in several other metrics. Some of the
key findings include:
 70% of interactive visualization adopters improve collaboration
and knowledge sharing.
 64% of interactive visualization adopters improve user trust in
underlying data.
 Interactive visualization users engage with data more frequently.
 Interactive visualizers are more likely than static visualizers to
be satisfied with the use of analytical tools.

The Benefits Of Interactive Data Visualizations

 Finding correlations - Displaying data on a single dashboard can help you
find the different connections behind better and worse performance. For
example, whether sales are higher because of a better online presence, or
because of a recent paid advertising campaign, or whether there is a
correlation between a few recent negative reviews and a drop in sales. It
is important that any decisions or new strategies you wish to implement
are data-driven and have a solid foundation.
 Quick action - As mentioned earlier, you process information faster when
it is visualized. This means you notice issues or gaps in performance
faster and have the ability to act on those findings. It also gives
investors, board members, and other stakeholders a better overview of
the situation.
 Identifying new trends - Humans are good at categorizing things or
recognizing patterns, in fact, our brains are wired to do so. So by
curating performance dashboards, you have a much better chance of
spotting trends and figuring out which of your strategies have been more
effective, as well as other factors that have an impact on your success.
 Complex concepts are simplified - The goal of interactive data
visualization is to convey insights, which leads to business intelligence.
It's easier to tell a story and share your findings when you yourself can
understand them better. Moreover, a lot of tools allow you to drill down
into your insights and generate reports on interactive data that then
could be used when creating easier to 'digest' content.

What Is a Regression Coefficient?

A regression coefficient is the quantity that sits in front of an independent variable in your
regression equation. It is a parameter estimate describing the relationship between one of
the independent variables in your model and the dependent variable.
In the simple linear regression below, the quantity 0.5, which sits in front of the variable X, is
a regression coefficient. The intercept—in this case 2—is also a coefficient, but you’ll hear it
referred to, instead, as the “intercept,” “constant,” or " β0".

For the sake of this article, we will leave the intercept out of our discussion.

Y^=2+0.5X

Regression coefficients tell us about the line of best fit and the estimated relationship
between an independent variable and the dependent variable in our model. In a simple
linear regression with only one independent variable, the coefficient determines the slope of
the regression line; it tells you whether the regression line is upward or downward-sloping
and how steep the line is.

Regression Coefficient in Multiple Regression

Regressions can have more than one independent variable and, therefore, more than one
regression coefficient. In the multiple regression below, there are two independent
variables (X1 and X2). This means you have two regression coefficients: 0.7 and -3.2. Each
coefficient gives you information about the relationship between one of the independent
variables and the dependent (or response) variable, Y.

Y^=−4+0.7X1−3.2X2

In the regression here, the coefficient 0.7 suggests a positive linear relationship between X1
and Y. If all other independent variables in the model are held constant, as X1 increases by
1 unit, we estimate that Y increases by 0.7 units.

The coefficient in front of X2 is negative, indicating a negative correlation between X2 and Y.
If X2 were to increase by one unit, and all other variables were held constant, we would
predict Y to decrease by 3.2 units.
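The interpretation of these two coefficients can be checked numerically by evaluating the example equation; this is just the model stated above, written as a function:

```python
def predict_y(x1, x2):
    """Evaluate the example model: Y-hat = -4 + 0.7*X1 - 3.2*X2."""
    return -4 + 0.7 * x1 - 3.2 * x2

base = predict_y(10, 2)        # arbitrary starting point
x1_up = predict_y(11, 2)       # X1 rises one unit, X2 held constant
x2_up = predict_y(10, 3)       # X2 rises one unit, X1 held constant
print(round(x1_up - base, 1))  # 0.7  (the positive coefficient on X1)
print(round(x2_up - base, 1))  # -3.2 (the negative coefficient on X2)
```

Whatever starting point you pick, a one-unit change in an independent variable moves the prediction by exactly its coefficient, because the model is linear.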

Applying the Regression Coefficient

Simple Linear Regression Model

Consider a scatterplot showing the relationship between an independent
variable (also called a predictor variable) plotted along the x-axis and the
dependent variable plotted along the y-axis. Each point on the scatter plot
represents an observation from a dataset.
In linear regression, we estimate the relationship between the independent variable (X) and
the dependent variable (Y) using a straight line. There are a few different ways to fit this line,
but the most common method is called the Ordinary Least Squares Method (or OLS).

In OLS, you can find the regression line by minimizing the sum of squared errors. Here the
errors, or residuals, are the vertical distances between each point on the scatter plot and
the regression line. The regression coefficient on X tells you the slope of the regression line.

Regression Output Tables

We typically perform our regression calculations using statistical software like R or Stata.
When we do this, we not only create scatter plots and lines but also create a regression
output table: a table summarizing the regression line, the errors of your model, and the
statistical significance of each parameter estimated by your model.

In such a table, the independent variables (X1 and X2) are listed in the first column, and
the coefficients on these variables are listed in the second column.

Regression Coefficient Interpretation

In linear regression, your regression coefficients will be constants that are either positive or
negative. Here is how you can interpret the coefficients.
1. Non-Zero Coefficient

A non-zero regression coefficient indicates a relationship between the independent variable
and the dependent variable.

2. Positive Coefficient

If the regression coefficient is positive, there is a positive relationship between the
independent variable and the dependent variable. As X increases, Y tends to increase, and
as X decreases, Y tends to decrease.

3. Negative Coefficient

If the regression coefficient is negative, there is a negative (or inverse) relationship between
the independent variable and the dependent variable. As X increases, Y tends to decrease,
and as X decreases, Y tends to increase.

Remember, your coefficients are only estimates. You’ll never know with certainty what the
true parameters are, and what the exact relationship is between your variables.

In regression, you can estimate how much of the variation in your dependent variable can
be explained by the independent variables by calculating R2, and you can calculate how
confident you can be in your estimates using tests of statistical significance.

How To Find Regression Coefficients

In a simple OLS linear regression of the form Y = B0 + B1X, you can find the regression
coefficient B1 using the following equation:

B1 = Cov(X, Y) / Var(X)

Chances are, however, that you will not be solving regression coefficients by hand. Instead,
you'll use software like Excel, R, or Stata to find your regression coefficients.
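For completeness, the covariance-over-variance formula can be verified in a few lines of pure Python. The toy data lies exactly on a line, so the slope comes out exactly:

```python
def slope_cov_var(xs, ys):
    """B1 = Cov(X, Y) / Var(X), the OLS slope from the formula above
    (population covariance and variance, i.e. dividing by n in both,
    so the n's cancel)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    return cov / var

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 1 + 2x
print(slope_cov_var(xs, ys))  # 2.0
```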

Regression Coefficients in Different Types of Regression Models

In this article, we’ve mainly discussed the simplest form of regression: a linear regression
with one independent variable. As you continue to study statistics, you’ll encounter many
more complex forms of regression. In these other regression models, the coefficients might
take on slightly different forms and may need to be interpreted differently.

Here’s a list of some commonly used regression models.

Linear Regression

Linear regression is one of the most basic forms of regression. As you saw earlier, in linear
regression, you find a line of best fit (a regression line) that minimizes the sum of squared
errors. This line models the relationship between a dependent variable and an independent
variable.
Logistic Regression (or Logit Regression)

We use logistic regression when we want to study a binary outcome and are trying to
estimate the likelihood of one of the two possible outcomes occurring. Logistic regression
allows you to predict whether an outcome variable will be true or false, a win or a loss, heads
or tails, 1 or 0, or any other binary set of outcomes.

In logistic regression, you interpret the regression coefficients differently than you would in a
linear model. In linear regression, a coefficient of 2 means that as your independent variable
increases by one unit, your dependent variable is expected to increase by 2 units. In logistic
regression, a coefficient of 2 means that as your independent variable increases by one unit,
the log odds of your dependent variable increase by 2.
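The log-odds interpretation can be made concrete with a small sketch: a coefficient of 2 shifts the log odds by 2 per unit of X, and the logistic function converts log odds back to a probability. The baseline log odds of 0 below is an arbitrary choice for illustration.

```python
import math

def probability_after_increase(base_log_odds, coefficient, delta=1):
    """In logistic regression a coefficient shifts the log odds, not the
    probability: new_log_odds = old + coefficient * delta, and then
    p = 1 / (1 + exp(-log_odds))."""
    log_odds = base_log_odds + coefficient * delta
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficient of 2, as in the text: a one-unit increase adds 2 to the log odds
p0 = probability_after_increase(0.0, 2, delta=0)  # log odds 0 -> p = 0.5
p1 = probability_after_increase(0.0, 2, delta=1)  # log odds 2
print(round(p0, 3), round(p1, 3))  # 0.5 0.881
```

Note that the same +2 shift in log odds produces a different change in probability depending on where you start, which is why logistic coefficients cannot be read as constant probability changes.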

Regression Models with Non-linear Terms

In a non-linear regression, you estimate the relationship between your variables using a
curve rather than a line. For example, if we know that the relationship between Y and X1
cannot simply be expressed by a line, but rather by a curve, we may want to include X1,
but also its quadratic version X1^2. In this case, we get two coefficients related to X1:
one for X1 and one for X1^2. Something like this:

Y^ = -8.2 + 1.5X1 - 0.5X1^2

Here, we can no longer say that if X1 changes by one unit, Y changes by a fixed number of
units, since X1 appears twice in the regression. Instead, the relationship between Y and X1
is non-linear: the partial derivative of Y with respect to X1 is 1.5 - 2 × 0.5 × X1 = 1.5 - X1,
which is no longer a constant. If the level of X1 is 1, the marginal effect of increasing X1 is
(1.5 - 1) = 0.5 units; if the level of X1 is 2, the marginal effect is (1.5 - 2) = -0.5 units.
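The changing marginal effect can be checked numerically from the example equation; marginal_effect below is simply the partial derivative written out:

```python
def y_hat(x1):
    """The quadratic example model: Y-hat = -8.2 + 1.5*X1 - 0.5*X1^2."""
    return -8.2 + 1.5 * x1 - 0.5 * x1 ** 2

def marginal_effect(x1):
    """Partial derivative dY/dX1 = 1.5 - 2 * 0.5 * X1 = 1.5 - X1:
    the effect of X1 depends on the current level of X1."""
    return 1.5 - x1

print(marginal_effect(1))  # 0.5  -> Y still rising at X1 = 1
print(marginal_effect(2))  # -0.5 -> Y falling at X1 = 2
print(round(y_hat(2) - y_hat(1), 2))  # actual change in Y-hat from X1 = 1 to 2
```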

Ridge Regression

Ridge regression is a technique used in machine learning. Statisticians and data scientists
use ridge regressions to adjust linear regressions to avoid overfitting the model to training
data. In ridge regression, the parameters of the model (including the regression coefficients)
are found by minimizing the sum of squared errors plus a value called the ridge regression
penalty.

As a result of the adjustment, the model's predictions become less sensitive to changes in
the independent variables. In other words, the coefficients in a ridge regression tend to be
smaller in absolute value than the coefficients in an OLS regression.
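The shrinkage effect is easiest to see in the special case of a single centred predictor, where both the OLS and ridge slopes have closed forms. This is a sketch of that special case, not a general ridge implementation:

```python
def ols_slope(xs, ys):
    """OLS slope for centred data: sum(x*y) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """Ridge slope for centred data: the penalty lam is added to the
    denominator, shrinking the coefficient toward zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Centred toy data (both x and y have mean zero)
xs = [-2, -1, 0, 1, 2]
ys = [-4, -2, 0, 2, 4]  # slope 2 under OLS
print(ols_slope(xs, ys))         # 2.0
print(ridge_slope(xs, ys, 2.0))  # smaller in absolute value than the OLS slope
```

Larger values of the penalty lam shrink the coefficient further, trading a little bias for lower variance on new data.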

Lasso Regression

Lasso regression is similar to ridge regression. It is an adjustment method used with OLS to
adjust for the risk of overfitting a model to training data. In a Lasso regression, you adjust
your OLS regression line by a value known as the Lasso regression penalty. Similar to the
Ridge regression, the lasso regression penalty shrinks the coefficients in the regression
equation.
