II-I_MCA_Data Science and Analytics_Course Material_Unit2
COURSE MATERIAL
UNIT 2
COURSE MCA
DEPARTMENT MCA
SEMESTER 21
PREPARED BY A.JYOTHSNA
(Faculty Name / s) Assistant Professor
Version V-1
MCA-SEM 21
SVCE TIRUPATI
1. Course Objectives
The objectives of this course are to:
1. Provide a set of practical skills for handling data that comes in a variety of formats and sizes, such as text, spatial, and time-series data.
2. Cover the data analysis lifecycle: initial access and acquisition, modelling, transformation, integration, querying, application of statistical learning and data mining methods, and presentation of results.
3. Introduce data wrangling, the process of converting raw data into a more useful form that can subsequently be analysed.
2. Prerequisites
Students should have knowledge on
1. Basic Mathematics
2. Basic understanding of programming
3. Syllabus
UNIT II
Linear Regression, Simple Linear Regression, Multiple Linear Regression, Other
Considerations in the Regression Model, Comparison of Linear Regression with K-
Nearest Neighbours.
4. Course outcomes
1. Understand business intelligence and business and data analytics.
2. Understand business data analysis through powerful data-application tools.
3. Understand the methods of data mining.
4. Apply basic tools (plots, graphs, summary statistics) to carry out EDA.
5. Understand the key elements of a data science project.
6. Identify the appropriate data science technique and/or algorithm to use for the major data science tasks.
5. CO-PO Mapping
DSA PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO1 3 3 2 2
CO2 3 3 2 2
CO3 3 3 2 2
CO4 3 3 2 2
CO5 3 3 2 2
6. Lesson Plan
8 | Revision on Unit-2 | T1
8. Lecture Notes
e_i = y_i − ŷ_i represents the ith residual: the difference between the ith observed response value and the ith response value that is predicted by our linear model.
The residual sum of squares (RSS) is defined as
    RSS = e_1² + e_2² + … + e_n²
The least squares approach chooses β̂0 and β̂1 to minimize the RSS. Using some calculus, one can show that the minimizers are
    β̂1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²
    β̂0 = ȳ − β̂1 x̄
where x̄ and ȳ are the sample means.
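The least squares formulas above can be computed directly. The sketch below is a minimal illustration using NumPy; the data points are invented for the example, not taken from the notes.

```python
import numpy as np

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals e_i = y_i - yhat_i and the residual sum of squares
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)

print(beta0_hat, beta1_hat, rss)
```

Any other choice of intercept and slope would give a larger RSS on this data, which is what "least squares" means.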
1.3 ASSESSING THE ACCURACY OF THE COEFFICIENT ESTIMATES:
We assume the true relationship has the form
    Y = β0 + β1X + ε
where β0 is the intercept term, β1 is the slope, and ε is a mean-zero random error term.
To calculate how accurate the sample mean μ̂ is as an estimate of μ, we need its standard error. From the well-known formula
    Var(μ̂) = SE(μ̂)² = σ² / n
where σ is the standard deviation of each of the realizations y_i of Y.
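Analogous standard-error formulas exist for the coefficient estimates: SE(β̂1)² = σ²/Σ(x_i − x̄)² and SE(β̂0)² = σ²[1/n + x̄²/Σ(x_i − x̄)²], with σ² estimated by RSS/(n − 2) (the residual standard error). A sketch with invented toy data, assuming these standard formulas:

```python
import numpy as np

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
x_bar = x.mean()

# Least squares fit (same formulas as in the notes)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar
rss = np.sum((y - (beta0 + beta1 * x)) ** 2)

# Residual standard error: sigma estimated by sqrt(RSS / (n - 2))
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates
se_beta1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))
se_beta0 = rse * np.sqrt(1 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2))

print(se_beta0, se_beta1)
```

Small standard errors relative to the estimates themselves indicate accurately estimated coefficients.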
The R² statistic takes the form of a proportion, the proportion of variance explained, so it always takes on a value between 0 and 1 and is independent of the scale of Y.
Given estimates β̂0, β̂1, β̂2, …, β̂p, we can make predictions using the formula
    ŷ = β̂0 + β̂1x_1 + β̂2x_2 + … + β̂px_p
This is done using standard statistical software. The values β̂0, β̂1, β̂2, …, β̂p that minimize the RSS are the multiple least squares regression coefficient estimates.
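Minimizing the RSS over all coefficients simultaneously can be sketched with NumPy's least squares solver on simulated data (the coefficients 1.0, 2.0, and −0.5 below are made up for the example):

```python
import numpy as np

# Simulated data: two predictors with known true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add an intercept column and solve the least squares problem,
# which minimizes RSS over (beta0, beta1, beta2)
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predictions from the fitted multiple regression model
y_hat = A @ beta_hat
print(beta_hat)
```

With low noise, the fitted coefficients land close to the true values used to generate the data.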
2.6.2 SOME IMPORTANT QUESTIONS
When we perform multiple linear regression, we are usually interested in answering a few important questions:
1. Is at least one of the predictors x_1, x_2, …, x_p useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Is there a Relationship between the Response and Predictors?
In the simple linear regression setting, to determine whether there is a relationship between the response and the predictor we simply check whether β1 = 0.
In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e., whether β1 = β2 = … = βp = 0.
As in simple linear regression, we use a hypothesis test to answer this question:
    H0: β1 = β2 = … = βp = 0
versus the alternative
    Ha: at least one βj is non-zero.
This hypothesis test is performed by computing the F-statistic
    F = ((TSS − RSS)/p) / (RSS/(n − p − 1))
where TSS = Σ(y_i − ȳ)² is the total sum of squares. When there is no relationship between the response and the predictors, we expect F to be close to 1; an F-statistic much larger than 1 is evidence against H0.
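As an illustration, the sketch below simulates data in which only the first of three predictors matters and computes F = ((TSS − RSS)/p)/(RSS/(n − p − 1)); the data and coefficients are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)  # only the first predictor matters

# Fit the full multiple regression model by least squares
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

rss = np.sum((y - A @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F = ((TSS - RSS) / p) / (RSS / (n - p - 1));
# values far above 1 are evidence against H0: beta1 = ... = betap = 0
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(F)
```

Because one predictor genuinely drives the response here, the F-statistic comes out far above 1, so H0 would be rejected.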
There are three classical approaches to choosing a smaller set of models to consider:
Forward Selection
Backward Selection
Mixed Selection
Forward Selection
We begin with the null model, which contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. This approach is continued until some stopping rule is satisfied.
Backward Selection
We start with all variables in the model and remove the variable with the largest p-value, that is, the least statistically significant variable. The new (p − 1)-variable model is fit, and the variable with the largest p-value is again removed. This procedure continues until a stopping rule is reached.
Mixed Selection
This is a combination of forward and backward selection. We start with no variables in the model and, as with forward selection, add the variable that provides the best fit. If the p-value for any variable in the model rises above a threshold as new variables are added, we remove that variable. We continue these forward and backward steps until all variables in the model have a sufficiently low p-value.
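The forward selection procedure above can be sketched in a few lines; this is a simplified illustration on simulated data (it greedily minimizes RSS and stops after a fixed number of predictors rather than a p-value rule):

```python
import numpy as np

def rss_of_fit(X_sub, y):
    """RSS of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_selection(X, y, k):
    """Greedily add the predictor that most reduces RSS, up to k predictors."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining and len(chosen) < k:
        best = min(remaining, key=lambda j: rss_of_fit(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Demo: predictor 2 carries the signal, so it should be chosen first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y, 2))
```

Backward selection would run the same loop in reverse, starting from all four predictors and dropping the least significant one at each step.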
Model Fit
Two of the most common numerical measures of model fit are the RSS and R². These quantities are computed and interpreted in the same fashion as for simple linear regression.
In simple regression, R² is the square of the correlation between the response and the predictor. In multiple regression, it equals the square of the correlation between the response and the fitted linear model.
An R² value close to 1 indicates that the model explains a large portion of the variance in the response variable.
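Both measures, and the simple-regression identity between R² and the squared correlation, can be checked on toy data (the numbers below are invented for the example):

```python
import numpy as np

# Hypothetical toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression fit by least squares
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# R^2 = 1 - RSS/TSS: the proportion of variance in y explained by the model
r_squared = 1 - rss / tss

# In simple regression, R^2 equals the squared correlation of x and y
assert abs(r_squared - np.corrcoef(x, y)[0, 1] ** 2) < 1e-10
print(r_squared)
```

On this nearly linear toy data, R² comes out very close to 1, matching the interpretation in the notes.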
Predictions
Once we have fit the multiple regression model, making predictions is straightforward, but there are three sorts of uncertainty associated with a prediction.
The least squares plane is only an estimate of the true population regression plane. The inaccuracy in the coefficient estimates is related to the reducible error.
Model Bias
In practice, assuming a linear model for f(X) is almost always an approximation of reality, so there is an additional source of potentially reducible error called model bias.
Irreducible Error
Even if we knew f(X) exactly, the response value could not be predicted perfectly because of the random error ε in the model; this is the irreducible error.
2.7 OTHER CONSIDERATIONS IN THE REGRESSION MODEL
2.7.1 Qualitative Predictors
Some predictors are not quantitative but qualitative, taking values in one of a small number of categories. Such predictors are incorporated into a regression model by creating dummy variables that take the value 0 or 1.
Resulting Model
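As a hedged sketch of dummy-variable coding for a two-level qualitative predictor: the variable names and values below (region, balance) are hypothetical, chosen only to illustrate the technique. With a single 0/1 dummy, the intercept is the mean of the baseline group and the dummy coefficient is the difference between group means.

```python
import numpy as np

# Hypothetical qualitative predictor with two levels
region = np.array(["East", "West", "East", "West", "East"])
balance = np.array([510.0, 480.0, 530.0, 470.0, 520.0])

# Encode the factor as a 0/1 dummy variable: 1 if West, 0 if East
d = (region == "West").astype(float)

# Fit balance = beta0 + beta1 * d by least squares;
# beta0 is the East mean, beta0 + beta1 the West mean,
# so beta1 is the West-minus-East difference
A = np.column_stack([np.ones(len(d)), d])
beta, *_ = np.linalg.lstsq(A, balance, rcond=None)
print(beta)
```

A qualitative predictor with k levels would need k − 1 dummy variables, one per non-baseline level.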
9. Practice Quiz
1. Linear Regression is a supervised machine learning algorithm
a) true
b) false
2. Which of the following methods do we use to find the best-fit line for data in Linear
Regression?
a) Least Square Error
b) Maximum Likelihood
c) Logarithmic Loss
d) Both A and B
3. Which of the following evaluation metrics can be used to evaluate a model while
modeling a continuous output variable?
a) AUC-ROC
b) Accuracy
c) Logloss
d) Mean-Squared-Error
4. Which of the following is true about residuals?
a) Lower is better
b) Higher is better
c) Either a or b, depending on the situation
d) None of these
5. Which of the following statements is true about outliers in Linear regression?
a) Linear regression is sensitive to outliers
b) Linear regression is not sensitive to outliers
c) Can’t say
d) None of these
6. Which of the following metrics can be used for evaluating regression models?
1) R Squared
2) Adjusted R Squared
3) F Statistics
4) RMSE / MSE / MAE
a) 2 and 4
b) 1 and 2
c) 3 and 4
d) All of the above
7. A regression model in which more than one independent variable is used to
predict the dependent variable is called
a) a simple linear regression model
b) a multiple regression model
c) an independent model
d) none of the above
8. A multiple regression model has the form: y = 2 + 3x1 + 4x2. As x1 increases by 1
unit (holding x2 constant), y will?
a) increase by 3 units
b) decrease by 3 units
c) increase by 4 units
d) decrease by 4 units
9. A multiple regression model has
a) only one independent variable
b) more than one dependent variable
c) more than one independent variable
d) none of the above
10. A measure of goodness of fit for the estimated regression equation is the
a) multiple coefficient of determination
b) mean square due to error
c) mean square due to regression
d) none of the above
10. Assignments
S.No | Question | BL | CO
1 | Define simple linear regression and explain how to estimate the coefficients. | 2 | 1
2 | Define hypothesis testing and explain it with an example. | 2 | 1
3 | Compare Linear Regression with K-Nearest Neighbors. | 2 | 1
11. Questions
S.No | Question | BL | CO
1 | Define simple linear regression and explain how to estimate the coefficients. | 1 | 1
2 | Define hypothesis testing and explain it with an example. | 2 | 1
3 | Compare Linear Regression with K-Nearest Neighbors. | 2 | 1
13. Real Time Applications
S.No Application CO
1 Predictive Analytics: 1
Predictive analytics, i.e., forecasting future opportunities and risks, is the most
prominent application of regression analysis in business.
2 Operational Efficiency: 1
Regression models can also be used to optimize business processes. A
factory manager, for example, can create a statistical model to
understand the impact of oven temperature on the shelf life of the
cookies baked in those ovens.
3 Supporting Decisions: 1
Businesses today are overloaded with data on finances, operations and
customer purchases.
4 Correcting Errors: 1
Regression is not only great for lending empirical support to
management decisions but also for identifying errors in judgment.
5 New Insights: 1
Over time businesses have gathered a large volume of unorganized data
that has the potential to yield valuable insights.
16. Mini Project Suggestion
1. Budget a Long Drive
Suppose you want to go on a long drive (from Delhi to Lonawala). Before going
on a trip this long, it’s best to prepare a budget and figure out how much you
need to spend on a particular section. You can use a linear regression model
here to determine the cost of gas you’ll have to get.
2. Compare Unemployment Rates with Gains in Stock Market
If you’re an economics enthusiast, or if you want to use your knowledge of
Machine Learning in this field, then this is one of the best linear regression project
ideas for you. We all know how unemployment is a significant problem for our
country. In this project, we’d find the relation between the unemployment
rates and the gains happening in the stock market.
3. Compare Salaries of Batsmen with The Average Runs They Score per Game
Cricket is easily the most popular game in India. You can use your knowledge
of machine learning in this simple yet exciting project where you’ll plot
the relationship between the salaries of batsmen and the average runs they
score in every game. Our cricketers are among some of the highest-earning
athletes in the world. Working on this project would help you find out how
much their batting averages are responsible for their earnings.
4. Compare the Dates in a Month with the Monthly Salary
This project explores the application of machine learning in human resources
and management. It is among the beginner-level linear regression projects, so
if you haven’t worked on such a project before, then you can start with this
one. Here, you’ll take the dates present in a month and compare it with the
monthly salary.
5. Compare Average Global Temperatures and Levels of Pollution
Pollution and its impact on the environment is a prominent topic of discussion.
The recent pandemic has also shown us how we can still save our
environment. You can use your machine learning skills in this field too. This
project would help you in understanding how machine learning can solve the
various problems present in this domain as well.