FDS Unit FINAL
UNIT V
PREDICTIVE ANALYTICS
Linear Least Squares
Implementation
Goodness of Fit
Testing a Linear Model
Weighted Resampling
Regression Using Statsmodels
Multiple Regression
Nonlinear Relationships
Logistic Regression
Estimating Parameters
Time Series Analysis
Moving Averages
Missing Values
Serial Correlation
Autocorrelation
Introduction to Survival Analysis
LIST OF IMPORTANT QUESTIONS
UNIT V
PREDICTIVE ANALYTICS
PART A (2 marks)
PART B (16 marks)
1. Explain the least square method.
2. Fit a straight line y = mx + b by the method of least squares to the following data:
x | 1 | 2 | 3 | 4 | 5
y | 2 | 5 | 3 | 8 | 7
3. We will now calculate a chi-square statistic for a specific example. Suppose that we have a simple random sample of 600 M&M candies spread over the six colors.
5. Calculate the weighted arithmetic mean for the following data:
Number of TVs per Household | Number of Households
1 | 73
2 | 378
3 | 459
4 | 90
UNIT V
PART A (2 marks)
The term correlation can be defined as the degree of interdependence between two variables. Two variables are said to be correlated when a change in the value of one variable leads to a change in the value of the other. The change may be in the same direction (an increase in one variable accompanies an increase in the other, or a decrease accompanies a decrease) or in the opposite direction (an increase in one variable accompanies a decrease in the other).
When an increase in the value of one variable leads to an increase in the value of the other, and a decrease in the value of one leads to a decrease in the value of the other, the correlation between the two variables is said to be positive.
When the change in the values of two related variables is in the same direction and
same proportion, the correlation is called perfect positive correlation. The coefficient of
correlation, in this case, is +1. On the other hand, when the value of the two related
variables changes in the same proportion but in opposite direction, the correlation is called
a perfect negative correlation. The coefficient of correlation, in this case, is -1.
This method of measuring correlation was given by Karl Pearson in 1896. It is also known as the 'Pearson coefficient of correlation' or the 'product moment method of correlation'. It is one of the most widely used mathematical methods of computing correlation, and it clearly expresses both the degree and the direction of the relationship between two variables. Karl Pearson's coefficient is denoted by 'r' and is computed on the basis of the mean and standard deviation.
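For reference, the coefficient can be written in terms of the means and standard deviations of the two variables as:

r = ∑(x − x̄)(y − ȳ) / (n σx σy)

where x̄ and ȳ are the means, σx and σy the standard deviations, and n the number of pairs of observations.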
The assumptions of Karl Pearson's coefficient of correlation are:
Normality.
Cause and effect relationship.
Linear relationship.
12. Write the two merits and demerits of Spearman's rank difference method.
13. Give any two merits and demerits of the coefficient of concurrent deviation method.
In this method, only the direction of change is studied; the magnitude of change is completely ignored.
It is only a rough indicator of correlation.
The lines which give the best estimate of the value of one variable for any given
value of the other variable are called the ‘Lines of regression’ or ‘Regression lines’. In other
words, regression lines are used to predict the value of the dependent variable when the
value of the independent variable is known.
The regression equation of X on Y describes the variations in the values of X for given changes in the values of Y. In other words, this equation is used for estimating or predicting the value of X for a given value of Y. This equation is expressed as follows:
X − X̄ = bxy (Y − Ȳ)
When the coefficient of correlation (r) and the standard deviations (σx, σy) are given in the question, the regression coefficient bxy can be easily calculated as
bxy = r (σx / σy)
The regression equation of Y on X describes the variations in the values of Y for given changes in the values of X. In other words, this equation is used for estimating or predicting the value of Y for a given value of X. This equation is expressed as follows:
Y − Ȳ = byx (X − X̄)
When the coefficient of correlation (r) and the standard deviations (σx, σy) are given in the question, the regression coefficient byx can be easily calculated as
byx = r (σy / σx)
When we study the correlation among more than two variables but consider only the inter-relationship between two of them, with the third variable assumed to be constant, the correlation is said to be a partial correlation.
You can use multiple linear regression when you want to know how strong the relationship is between two or more independent variables and one dependent variable (e.g., how rainfall, temperature, and amount of fertilizer added affect crop growth).
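As a minimal sketch of how such a model might be fitted with the statsmodels library (the data and the column names rainfall, temperature, fertilizer and yield_ are illustrative assumptions, not from the original):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data; the variable names are assumed for illustration.
    df = pd.DataFrame({
        "rainfall":    [100, 120, 90, 110, 130, 95],
        "temperature": [20, 22, 19, 21, 24, 18],
        "fertilizer":  [5, 6, 4, 5, 7, 4],
        "yield_":      [2.1, 2.9, 1.8, 2.5, 3.4, 1.7],
    })

    # Fit one dependent variable on several independent variables at once.
    model = smf.ols("yield_ ~ rainfall + temperature + fertilizer", data=df).fit()
    print(model.summary())  # coefficients, R-squared, t-statistics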
28. Comparison between Correlation and Regression

Basis | Correlation | Regression
Meaning | A statistical measure that defines the co-relationship, or association, of two variables. | Describes how an independent variable is associated with the dependent variable.
The correlation coefficient for the given data can be computed as
r = (n∑xy − ∑x∑y) / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]
where n is the number of pairs of observations. Based on the value obtained through this formula, we can determine how strong the association between the two variables is.
The components of a time series are:
Trend
Seasonal Variations
Cyclical Variations
Random or Irregular Movements
34. What are the parameters to be estimated?
Missing values arise from equipment malfunctions, lost files, and many other reasons. In any dataset, there are usually some missing data.
Let us assume that the given data points are (x1, y1), (x2, y2), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. Also, let f(x) be the fitting curve and let d represent the error, or deviation, from each given point. The least-squares principle says that the curve that best fits the data is the one for which the sum of the squares of all the deviations from the given values is a minimum.
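Written out, with each deviation di = yi − f(xi), the quantity to be minimized is:

S = d1² + d2² + … + dn² = ∑ (yi − f(xi))²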
PART B (16 marks)
1. Explain the least square method.
The least square method is the process of finding a regression line, or best-fitted line, for a data set described by an equation. The method works by minimizing the sum of the squares of the residuals of the points from the curve or line, so that the trend of the outcomes is found quantitatively. Curve fitting arises in regression analysis, and fitting equations to derive the curve is done by the least square method.
Let us look at a simple example. Ms. Dolma said in class, "Students who spend more time on their assignments get better grades." A student wants to estimate his grade for spending 2.3 hours on an assignment. Through the least-squares method, it is possible to determine a predictive model that will help him estimate his grade far more accurately. The method is simple because it requires nothing more than some data and perhaps a calculator.
In this section, we’re going to explore least squares, understand what it means, learn
the general formula, steps to plot it on a graph, know what are its limitations, and see what
tricks we can use with least squares.
The least-squares method is a statistical method used to find the line of best fit, of the form y = mx + b, for the given data. The fitted curve of the equation is called the regression line. Our main objective in this method is to reduce the sum of the squares of the errors as much as possible; this is the reason it is called the least-squares method. It is often used in data fitting, where the best-fit result is taken to be the one that minimizes the sum of squared errors, each error being the difference between an observed value and the corresponding fitted value. The sum of squared errors also helps in finding the variation in the observed data. For example, with 4 data points, the method arrives at a graph like the one in Figure 2.
Figure 2: Least Square Method
The two basic categories of least-squares problems are ordinary (or linear) least squares and nonlinear least squares.
Even though the least-squares method is considered the best method for finding the line of best fit, it has a few limitations:
It exhibits only the relationship between the two variables; all other causes and effects are not taken into consideration.
It is unreliable when the data are not evenly distributed.
It is very sensitive to outliers; these can skew the results of the least-squares analysis.
Least Square Method Graph
The straight line in the fitted graph shows the potential relationship between the independent variable and the dependent variable. The ultimate goal of this method is to reduce the difference between each observed response and the response predicted by the regression line: smaller residuals mean that the model fits better. The method works by reducing the residual of each point from the line. Residuals can be measured vertically or perpendicularly to the line; vertical residuals are mostly used in polynomial and hyperplane problems, while perpendicular residuals are used in the general case.
The least-squares method finds the curve that best fits a set of observations with a minimum sum of squared residuals or errors. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. The method is used to find a line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-intercept. The formulas to calculate the slope m and the intercept b are:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
b = (∑y - m∑x)/n
Following are the steps to calculate the least squares line using the above formulas.
Step 1: Draw a table with 4 columns, where the first two columns are for the x and y points.
Step 2: In the next two columns, find xy and x² for each row.
Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
Step 4: Find the value of the slope m using the above formula.
Step 5: Calculate the value of b using the above formula.
Step 6: Substitute the values of m and b into the equation y = mx + b.
2. Fit a straight line y = mx + b by the method of least squares to the following data:

x | 1 | 2 | 3 | 4 | 5
y | 2 | 5 | 3 | 8 | 7

Solution:

x | y | xy | x²
1 | 2 | 2 | 1
2 | 5 | 10 | 4
3 | 3 | 9 | 9
4 | 8 | 32 | 16
5 | 7 | 35 | 25
∑x = 15 | ∑y = 25 | ∑xy = 88 | ∑x² = 55

Here n = 5. Find the value of m by using the formula,
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = (5 × 88 − 15 × 25) / (5 × 55 − 15²)
m = (440 − 375) / (275 − 225)
m = 65/50 = 13/10 = 1.3
Find the value of b by using the formula,
b = (∑y − m∑x)/n
b = (25 − 1.3 × 15)/5
b = (25 − 19.5)/5
b = 5.5/5 = 1.1
Hence the required least squares line is y = 1.3x + 1.1.
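This result can be checked in Python; a minimal sketch using NumPy, whose polyfit with degree 1 performs exactly this ordinary least-squares line fit:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2, 5, 3, 8, 7])

    # Degree-1 polynomial fit = least-squares straight line.
    m, b = np.polyfit(x, y, 1)
    print(m, b)  # 1.3 1.1 (up to floating-point rounding)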
3. We will now calculate a chi-square statistic for a specific example. Suppose that we have a simple random sample of 600 M&M candies spread over the six colors. If the null hypothesis were true, then the expected count for each of these colors would be (1/6) × 600 = 100. We now use this in our calculation of the chi-square statistic.
We calculate the contribution to our statistic from each of the colors; each is of the form
(Actual − Expected)² / Expected.
We then total all of these contributions and determine that our chi-square statistic is
125.44 + 22.09 + 0.09 + 25 + 29.16 + 33.64 = 235.42.
Degrees of Freedom
The number of degrees of freedom for a goodness of fit test is simply one less than
the number of levels of our variable. Since there were six colors, we have 6 – 1 = 5 degrees
of freedom.
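A sketch of the same calculation in Python with scipy.stats.chisquare follows. The observed counts used here (212, 147, 103, 50, 46, 42) are an assumption: the original table is not reproduced above, and these are simply one set of counts that sums to 600 and reproduces the six contributions quoted in the answer.

    from scipy.stats import chisquare

    observed = [212, 147, 103, 50, 46, 42]  # assumed counts, see note above
    expected = [100] * 6                    # (1/6) * 600 under the null

    stat, p_value = chisquare(observed, f_exp=expected)
    print(stat)     # 235.42
    print(p_value)  # tiny p-value at 5 degrees of freedom: reject the null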
5. Calculate the weighted arithmetic mean for the following data:

Number of TVs per Household | Number of Households
1 | 73
2 | 378
3 | 459
4 | 90
Solution:
As many of the values in this data set are repeated multiple times, you can easily compute the sample mean as a weighted mean. Follow these steps to calculate the weighted arithmetic mean:
Step 1: Assign a weight to each value in the data set; here, the weight of each number of TVs is the number of households reporting it.
Step 2: Compute the numerator of the weighted mean formula by multiplying each value by its weight and adding the products: (1 × 73) + (2 × 378) + (3 × 459) + (4 × 90) = 73 + 756 + 1377 + 360 = 2566.
Step 3: Now, compute the denominator of the weighted mean formula by adding the weights together: 73 + 378 + 459 + 90 = 1000.
Step 4: Divide the numerator by the denominator: 2566 / 1000 = 2.566 TVs per household.
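The same arithmetic in one line of Python (a sketch using NumPy's weighted average):

    import numpy as np

    tvs = [1, 2, 3, 4]               # values
    households = [73, 378, 459, 90]  # weights

    # np.average computes sum(value * weight) / sum(weights).
    print(np.average(tvs, weights=households))  # 2.566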
6. Explain about Multiple Regression.
In linear regression, there is only one independent variable and one dependent variable involved. But in the case of multiple regression, there is a set of independent variables that helps us to better explain or predict the dependent variable y. The model takes the form
y = β0 + β1x1 + β2x2 + … + βkxk + ε
where x1, x2, …, xk are the k independent variables and y is the dependent variable.
Residual: The variation in the dependent variable that is not explained by the regression model is called the residual or error variation. It is also known as random error, or sometimes just "error"; it is a random error arising, for example, from sampling variability.
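A minimal sketch of this model and its residuals in NumPy (the data are made up for illustration): a column of ones is added for the intercept β0, np.linalg.lstsq estimates the coefficients, and the residuals are the observed y minus the fitted values.

    import numpy as np

    # Made-up data: 6 observations of 2 independent variables.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                  [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
    y = np.array([3.1, 3.9, 7.2, 6.8, 11.1, 10.9])

    # Design matrix with intercept: y = b0 + b1*x1 + b2*x2 + error.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    fitted = A @ coef
    residuals = y - fitted  # the unexplained (error) variation
    print(coef, residuals)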
Advantages of Stepwise Multiple Regression
Only independent variables with nonzero regression coefficients are included in the regression equation.
The changes in the multiple standard errors of estimate and the coefficient of
determination are shown.
The stepwise multiple regression is efficient in finding the regression equation with
only significant regression coefficients.
The steps involved in developing the regression equation are clear.
Mostly, statistical inference has been kept at the bivariate level. Inferential statistical tests have also been developed for multivariate analyses, which analyze the relations among more than two variables. A commonly used extension of correlation analysis for multivariate inference is multiple regression analysis, which shows the correlation between each independent variable and the dependent variable.
Multicollinearity
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another, making their separate effects on the dependent variable hard to distinguish.
Signs of Multicollinearity
Typical signs are high correlations among pairs of independent variables, regression coefficients that change sharply when a variable is added or dropped, and coefficients with unexpectedly large standard errors or implausible signs.
7. Distinguish between linear and nonlinear equations.
To find the difference between the two kinds of equations, i.e., linear and nonlinear, one should know their definitions. So, let us define them and see the difference between them.
Linear Equations | Nonlinear Equations
It forms a straight line, or represents the equation of a straight line. | It does not form a straight line, but forms a curve.
It has only one degree; that is, the maximum degree of the equation is 1. | A nonlinear equation has degree 2 or more than 2, but not less than 2.
All such equations form a straight line in the XY plane; the line can be extended in any direction, but in a straight form. | It forms a curve, and if we increase the value of the degree, the curvature of the graph increases.
General form: y = mx + c, where x and y are the variables, m is the slope of the line and c is a constant value. | General form: ax² + by² = c, where x and y are the variables and a, b and c are the constant values.
Examples: 10x = 1; 9y + x + 2 = 0; 4y = 3x; 99x + 12 = 23y | Examples: x² + y² = 1; x² + 12xy + y² = 0; x² + x + 2 = 25
8. A linear equation usually has only one variable; if an equation has two variables in it, then it is defined as a linear equation in two variables. For example, 5x + 2 = 1 is a linear equation in one variable, but 5x + 2y = 1 is a linear equation in two variables.
Solved Examples
Example 1: Solve for x: 3x + 9 = 2x + 18.
Solution:
⇒ 3x − 2x = 18 − 9
⇒ x = 9
Example 2: Solve the system x + 2y = 1 and x = y.
Solution: Given x + 2y = 1 and x = y, substitute x = y into the first equation:
⇒ y + 2y = 1
⇒ 3y = 1
⇒ y = 1/3
∴ x = y = 1/3
9. How many years will it take for a bacteria population to reach 9000, if its growth is modeled by
P(t) = 10000 / (1 + e^(−0.12(t − 20)))
where t is in years?
Solution:
Set P(t) = 9000:
9000 = 10000 / (1 + e^(−0.12(t − 20)))
1 + e^(−0.12(t − 20)) = 10000/9000 = 10/9
e^(−0.12(t − 20)) = 1/9 ≈ 0.111
−0.12(t − 20) = ln(0.111)
t = −ln(0.111)/0.12 + 20
On simplifying,
t ≈ 38.31 years
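A quick numeric check (a sketch, assuming the logistic model above):

    import math

    # Solve 9000 = 10000 / (1 + exp(-0.12 * (t - 20))) for t.
    t = 20 - math.log(1 / 9) / 0.12
    print(t)  # about 38.31 years

    # Verify by substituting back into the model.
    print(10000 / (1 + math.exp(-0.12 * (t - 20))))  # 9000.0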
10. Explain serial correlation and autocorrelation.
Serial correlation (also called Autocorrelation) is where error terms in a time series
transfer from one period to another. In other words, the error for one time period a is
correlated with the error for a subsequent time period b. For example, an underestimate for
one quarter’s profits can result in an underestimate of profits for subsequent quarters. This
can result in a myriad of problems, including:
Inefficient Ordinary Least Squares Estimates and any forecast based on those
estimates. An efficient estimator gives you the most information about a sample;
inefficient estimators can perform well, but require much larger sample sizes to do
so.
Exaggerated goodness of fit (for a time series with positive serial correlation and
an independent variable that grows over time).
Standard errors that are too small (for a time series with positive serial correlation
and an independent variable that grows over time).
t-statistics that are too large.
False positives for significant regression coefficients. In other words, a regression
coefficient appears to be statistically significant when it is not.
Types of Autocorrelation
The most common form of autocorrelation is first-order serial correlation, which can
either be positive or negative.
Positive serial correlation is where a positive error in one period carries over into a positive error in the following period.
Negative serial correlation is where a positive error in one period tends to be followed by a negative error in the next period, so the errors alternate in sign.
Second-order serial correlation is where an error affects data two time periods later. This
can happen when your data has seasonality. Orders higher than second-order do happen,
but they are rare.
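As an illustration, first-order serial correlation can be simulated by letting each period's error carry over a fraction of the previous one (a sketch with made-up parameters):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8  # first-order correlation between successive errors (assumed)
    n = 200

    errors = np.zeros(n)
    for t in range(1, n):
        # Each error carries over a fraction rho of the previous error.
        errors[t] = rho * errors[t - 1] + rng.normal()

    # Sample first-order autocorrelation; close to rho for large n.
    print(np.corrcoef(errors[:-1], errors[1:])[0, 1])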
Autocorrelation in the residuals can be detected in several ways (a code sketch follows this list):
A plot of residuals. Plot e_t against t and look for clusters of successive residuals on one side of the zero line. You can also try adding a Lowess line to the plot.
A Durbin-Watson test.
A Lagrange Multiplier test.
A Ljung-Box test.
A correlogram. A pattern in the results is an indication of autocorrelation; values far from zero should be looked at with suspicion.
The Moran's I statistic, which is similar to a correlation coefficient.
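For instance, the Durbin-Watson and Ljung-Box tests are available in statsmodels (a sketch; resid here stands in for the residuals of any fitted model):

    import numpy as np
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import acorr_ljungbox

    # Made-up, strongly autocorrelated residual series.
    rng = np.random.default_rng(1)
    resid = np.cumsum(rng.normal(size=100))

    # Durbin-Watson: near 2 means no first-order autocorrelation;
    # toward 0, positive autocorrelation; toward 4, negative.
    print(durbin_watson(resid))

    # Ljung-Box: small p-values reject "no autocorrelation up to lag 10".
    print(acorr_ljungbox(resid, lags=[10]))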