
SVCE TIRUPATI

COURSE MATERIAL

DATA SCIENCE & ANALYTICS


SUBJECT (CA20FPC302)

UNIT 2

COURSE MCA

DEPARTMENT MCA

SEMESTER 21

PREPARED BY A.JYOTHSNA
(Faculty Name / s) Assistant Professor

Version V-1

PREPARED / REVISED DATE 17-03-2022


TABLE OF CONTENTS – UNIT 2


S. NO  CONTENTS
1      COURSE OBJECTIVES
2      PREREQUISITES
3      SYLLABUS
4      COURSE OUTCOMES
5      CO - PO/PSO MAPPING
6      LESSON PLAN
7      ACTIVITY BASED LEARNING
8      LECTURE NOTES
       2.1  LINEAR REGRESSION
       2.2  ESTIMATING THE COEFFICIENTS
       2.3  ASSESSING THE ACCURACY OF THE COEFFICIENT ESTIMATES
       2.4  HYPOTHESIS TESTING
       2.5  ASSESSING THE ACCURACY OF THE MODEL
       2.6  MULTIPLE LINEAR REGRESSION
       2.7  OTHER CONSIDERATIONS IN THE REGRESSION MODEL
       2.8  COMPARISON OF LINEAR REGRESSION WITH K-NEAREST NEIGHBORS
9      PRACTICE QUIZ
10     ASSIGNMENTS
11     QUESTIONS & ANSWERS
12     SUPPORTIVE ONLINE CERTIFICATION COURSES
13     REAL TIME APPLICATIONS
14     CONTENTS BEYOND THE SYLLABUS
15     PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
16     MINI PROJECT SUGGESTION

1. Course Objectives
The objectives of this course are to:
1. Provide a set of practical skills for handling data that comes in a variety of
formats and sizes, such as text, spatial, and time-series data.
2. Cover the data analysis lifecycle, from initial access and acquisition through
modelling, transformation, integration, querying, application of statistical
learning and data mining methods, and presentation of results.
3. Introduce data wrangling, the process of converting raw data into a more
useful form that can subsequently be analysed.

2. Prerequisites
Students should have knowledge of
1. Basic Mathematics
2. Basic understanding of programming

3. Syllabus
UNIT II
Linear Regression, Simple Linear Regression, Multiple Linear Regression, Other
Considerations in the Regression Model, Comparison of Linear Regression with K-
Nearest Neighbours, Linear Regression.

4. Course outcomes
1. Understand business intelligence and business and data analytics.
2. Understand business data analysis through powerful data application tools.
3. Understand the methods of data mining.
4. Apply basic tools (plots, graphs, summary statistics) to carry out EDA.
5. Understand the key elements of a data science project.
6. Identify the appropriate data science technique and/or algorithm to use for the
major data science tasks.

5. Co-PO / PSO Mapping

DSA PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 P10 PO11 PO12 PSO1 PSO2

CO1 3 3 2 2

CO2 3 3 2 2

CO3 3 3 2 2

CO4 3 3 2 2

CO5 3 3 2 2

6. Lesson Plan

Lecture No.  Topics to be covered                                        References
1            Linear Regression                                           T1
2            Simple Linear Regression                                    T1
3            Multiple Linear Regression                                  T1
4            Other Considerations in the Regression Model                T1
5            Comparison of Linear Regression with K-Nearest Neighbors    T1
6            Comparison of Linear Regression with K-Nearest Neighbors    T1
7            Linear Regression                                           T1
8            Revision on Unit-2                                          T1
(The unit is covered over 5 weeks.)

7. Activity Based Learning

1. Develop a regression model using Linear Regression.
2. Analyze the performance of a Linear Regression model versus KNN.

8. Lecture Notes

2.1 LINEAR REGRESSION


Linear regression is a very simple approach for supervised learning. It is a useful
tool for predicting a quantitative response, and it is used for finding a linear
relationship between a target and one or more predictors.
There are two types of linear regression:
• Simple Linear Regression
• Multiple Linear Regression
2.1.1 Simple Linear Regression:
Simple linear regression is a very straightforward approach for predicting
a quantitative response Y on the basis of a single predictor variable X. It
assumes that there is approximately a linear relationship between X and
Y. Mathematically, we can write this linear relationship as
Y ≈ β0 + β1X
 β0 and β1 are two unknown constants that represent the intercept
and slope terms in the linear model.
 β0 and β1 together are known as the model coefficients or parameters.
2.2 ESTIMATING THE COEFFICIENTS:
 In practice, β0 and β1 are unknown, so we must use the data to estimate
the coefficients.
 Let y^i = β^0 + β^1 xi be the prediction for Y based on the ith value of X.

 ei = yi − y^i represents the ith residual: the difference between the
ith observed response value and the ith response value that is predicted
by our linear model.
 The residual sum of squares (RSS) is defined as
o RSS = e1² + e2² + … + en²
 The least squares approach chooses β^0 and β^1 to minimize the RSS.
Using some calculus, one can show that the minimizers are

β^1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   (sums over i = 1, …, n)
β^0 = ȳ − β^1 x̄
where x̄ and ȳ are the sample means.
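To make this concrete, here is a minimal sketch in R (the language of the
prescribed textbook); the data vectors x and y are made up purely for
illustration:

# Illustration data (hypothetical)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Closed-form least squares estimates, matching the formulas above
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# R's built-in lm() computes the same coefficients
fit <- lm(y ~ x)
coef(fit)
c(beta0_hat, beta1_hat)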
2.3 ASSESSING THE ACCURACY OF THE COEFFICIENT ESTIMATES:
Y = β0 + β1X + ε
where β0 = intercept term
β1 = slope
ε = mean-zero random error term
 The standard error of the sample mean μ^ is needed to assess
how accurate μ^ is as an estimate of μ.
 From the well-known formula
Var(μ^) = SE(μ^)² = σ²/n
where σ is the standard deviation of each observation.
 For the standard errors associated with β^0 and β^1, we use the
following formulas:
SE(β^0)² = σ² [1/n + x̄² / Σ(xi − x̄)²]
SE(β^1)² = σ² / Σ(xi − x̄)²
where σ² = Var(ε)

 These standard errors can be used to compute confidence intervals. A
95% confidence interval is defined as a range of values such that with
95% probability the range will contain the true unknown value of the
parameter. It has the form
β^1 ± 2 · SE(β^1)
o That is, there is approximately a 95% chance that the interval
[β^1 − 2 · SE(β^1), β^1 + 2 · SE(β^1)]
will contain the true value of β1.
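As a quick illustration, continuing the R sketch above, confint() reports
these intervals directly:

# Approximate 95% confidence intervals for beta0 and beta1
confint(fit, level = 0.95)

# summary() reports the standard errors used to build them
summary(fit)$coefficients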


2.4 HYPOTHESIS TESTING

 Standard errors can also be used to perform hypothesis tests on the
coefficients. The most common hypothesis test involves testing the null
hypothesis
H0: There is no relationship between X and Y, versus the alternative hypothesis
Ha: There is some relationship between X and Y.
Mathematically, this corresponds to testing
H0: β1 = 0
versus
Ha: β1 ≠ 0
since if β1 = 0 then the model reduces to Y = β0 + ε and X is not associated
with Y.
 To test the null hypothesis, we compute a t-statistic given by
t = (β^1 − 0) / SE(β^1)
This will have a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.

 Using statistical software, it is easy to compute the probability of
observing any value equal to |t| or larger. We call this probability the p-
value.
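In R, the coefficient table printed by summary() carries out exactly this test
for each coefficient (illustrative, using the fit from the earlier sketch):

# Each row: estimate, standard error, t-statistic, and p-value.
# A small p-value on the slope row lets us reject H0: beta1 = 0.
summary(fit)$coefficients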
2.5 ASSESSING THE ACCURACY OF THE MODEL
 Once the null hypothesis is rejected in favor of the alternative
hypothesis, it is natural to want to quantify the extent to which the model
fits the data.
 The quality of a linear regression fit is typically assessed using two
related quantities:
 The Residual Standard Error (RSE)
 The R² Statistic
Residual Standard Error (RSE)
 We compute the RSE as
RSE = √(RSS / (n − 2))
where RSS = Residual Sum of Squares = Σ(yi − y^i)²

 The RSE is considered a measure of the lack of fit of the model to the data.
 If the predictions obtained using the model are very close to the true
outcome values, i.e., if y^i ≈ yi for i = 1, 2, …, n, then the RSE will be small,
and we can conclude that the model fits the data very well.
 On the other hand, if y^i is very far from yi for one or more observations,
then the RSE may be quite large, indicating that the model doesn't fit the
data well.
R² Statistic
 The R² statistic provides an alternative measure of fit.

 It takes the form of a proportion, the proportion of variance explained, and
so it always takes on a value between 0 and 1, independent of the
scale of Y.
R² = (TSS − RSS) / TSS = 1 − RSS / TSS
where TSS = Total Sum of Squares = Σ(yi − ȳ)²

 An R² statistic that is close to 1 indicates that a large proportion of the
variability in the response has been explained by the regression.
 The R² statistic has an interpretational advantage over the RSE, since
unlike the RSE, it always lies between 0 and 1.
 The R² statistic is a measure of the linear relationship between X
and Y. In simple linear regression, R² = r², where r is the correlation between X and Y.
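The following sketch computes both quantities by hand for the illustrative fit
above and checks them against what summary() reports:

n   <- length(y)
rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares

rse <- sqrt(rss / (n - 2))            # residual standard error
r2  <- 1 - rss / tss                  # proportion of variance explained

c(rse, summary(fit)$sigma)            # the two values should match
c(r2,  summary(fit)$r.squared)        # likewise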
2.6 MULTIPLE LINEAR REGRESSION

 Simple linear regression is a useful approach for predicting a response on
the basis of a single predictor variable. However, in practice we often have
more than one predictor.
 Multiple linear regression can accommodate multiple predictors:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
 We interpret βj as the average effect on Y of a one-unit increase in Xj,
holding all other predictors fixed.
 For example, in the advertising setting,
sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
2.6.1 Estimating the Regression Coefficients

 Given estimates β^0, β^1, β^2, …, β^p, we can make predictions
using the formula
y^ = β^0 + β^1x1 + β^2x2 + … + β^pxp
 We estimate β0, β1, …, βp as the values that minimize the sum of
squared residuals
RSS = Σ(yi − y^i)² = Σ(yi − β^0 − β^1xi1 − … − β^pxip)²
 This is done using standard statistical software. The values β^0, β^1,
β^2, …, β^p that minimize the RSS are the multiple least squares regression
coefficient estimates.
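A minimal R sketch with two simulated predictors (x1, x2, and y2 are made up
for illustration), showing how lm() computes these estimates:

# Simulated illustration data: y2 depends linearly on x1 and x2
set.seed(1)
x1 <- runif(50)
x2 <- runif(50)
y2 <- 1 + 2 * x1 - 3 * x2 + rnorm(50, sd = 0.3)

mfit <- lm(y2 ~ x1 + x2)   # least squares over all predictors at once
coef(mfit)                 # estimates of beta0, beta1, beta2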

2.6.2 SOME IMPORTANT QUESTIONS
 When we perform multiple linear regression, we are usually interested in
answering a few important questions:
1. Is at least one of the predictors x1, x2, …, xp useful in predicting
the response?
2. Do all the predictors help to explain Y, or is only a subset of the
predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict,
and how accurate is our prediction?
Is There a Relationship between the Response and Predictors?
 In the simple linear regression setting, in order to determine the relationship
between the response and the predictor, we simply check whether β1 = 0.
 In the multiple regression setting with p predictors, we need to ask
whether all of the regression coefficients are zero, i.e., whether β1 = β2 =
… = βp = 0.
 As in simple linear regression, we use a hypothesis test to answer
the question:
H0: β1 = β2 = … = βp = 0
versus the alternative
Ha: at least one βj is non-zero.
 This hypothesis test is performed by computing the F-statistic
F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]
 If the linear model assumptions are correct, then
E{RSS / (n − p − 1)} = σ²
and, provided H0 is true,
E{(TSS − RSS) / p} = σ²
 Hence, when there is no relationship between the response and
predictors, one would expect the F-statistic to take on a value close
to 1. On the other hand, if Ha is true, then E{(TSS − RSS) / p} > σ².
So we expect F to be greater than 1.
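Continuing the multiple regression sketch, summary() reports this overall
F-statistic:

# Overall F-test of H0: beta1 = beta2 = 0 for the sketch model
summary(mfit)$fstatistic   # F value, numerator df, denominator df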

Deciding on Important Variables

 The task of determining which predictors are associated with the
response, in order to fit a single model involving only those predictors,
is referred to as variable selection.
 There are three classical approaches to choosing a smaller set of
models to consider:
 Forward Selection
 Backward Selection
 Mixed Selection
Forward Selection
 We begin with the null model, which contains an intercept but no
predictors.
 We then fit p simple linear regressions and add to the null model the
variable that results in the lowest RSS.
 This approach is continued until some stopping rule is satisfied.
Backward Selection
 We start with all variables in the model, and remove the variable with
the largest p-value, that is, the variable that is least statistically significant.
 The new (p − 1)-variable model is fit, and the variable with the largest
p-value is again removed.
 This procedure continues until a stopping rule is reached.
Mixed Selection
 This is a combination of forward and backward selection.
 We start with no variables in the model and, as with forward selection,
we add the variable that provides the best fit.
 We continue to perform these forward and backward steps until all
variables in the model have a sufficiently low p-value (see the sketch below).
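A sketch of all three strategies using base R's step(), with the caveat that
step() uses the AIC criterion as its stopping rule rather than the raw p-values
described above; the model objects reuse the earlier multiple regression sketch:

null <- lm(y2 ~ 1)                 # intercept-only (null) model
full <- lm(y2 ~ x1 + x2)           # all candidate predictors

# Forward selection: start small, add variables while AIC improves
step(null, scope = formula(full), direction = "forward")

# Backward selection: start full, drop the least useful variables
step(full, direction = "backward")

# Mixed (stepwise) selection: allow both adding and dropping
step(null, scope = formula(full), direction = "both")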
Model Fit
 Two of the most common numerical measures of model fit are the RSE
and R².
 These quantities are computed and interpreted in the same fashion
as for simple linear regression.
 In simple regression, R² is the square of the correlation between the
response and the variable.
 In multiple regression, it equals the square of the correlation
between the response and the fitted linear model.
 An R² value close to 1 indicates that the model explains a large
portion of the variance in the response variable.
Predictions
 Once we have fit the multiple regression model, there are three sources
of uncertainty associated with its predictions.
 The least squares plane is only an estimate of the true population
regression plane. The inaccuracy in the coefficient estimates is related
to the reducible error.
Model Bias

In practice, assuming a linear model for f(X) is almost always an
approximation of reality, so there is an additional source of potentially
reducible error, called model bias. Finally, even if we knew f(X) exactly,
the response could not be predicted perfectly because of the random error
ε in the model: this is the irreducible error.
2.7 OTHER CONSIDERATIONS IN THE REGRESSION MODEL
2.7.1 Qualitative Predictors

 Some predictors are not quantitative but are qualitative, taking a
discrete set of values.
 These are also called categorical predictors or factor variables.
 For example: investigate differences in credit card balance
between males and females, ignoring the other variables. We create a
new dummy variable
xi = 1 if the ith person is female; xi = 0 if the ith person is male.
Resulting model:
yi = β0 + β1xi + εi
so the average balance is β0 for males and β0 + β1 for females.
 With more than two levels, we create additional dummy
variables; the level with no dummy variable is known as the baseline.
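In R, lm() builds these dummy variables automatically for factor predictors;
a small sketch with made-up balance data:

# Made-up illustration data: credit card balance by gender
gender  <- factor(c("Male", "Female", "Female", "Male", "Female", "Male"))
balance <- c(480, 510, 530, 470, 520, 460)

qfit <- lm(balance ~ gender)
# Intercept = baseline level (Female, the first level alphabetically);
# one dummy coefficient per remaining level (here, genderMale)
coef(qfit)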
2.7.2 Extensions of the Linear Model
 Two of the most important assumptions state that the relationship
between the predictors and the response is additive and linear.
 The additive assumption means that the effect of changes in a
predictor Xj on the response Y is independent of the values of the
other predictors.
 The linear assumption states that the change in the response Y due to
a one-unit change in Xj is constant, regardless of the value of Xj.
Removing the Additive Assumption
 In the model
Sales = β0 + β1 × TV + β2 × Radio + β3 × Newspaper + ε
if TV increases by one unit, then sales will increase by β1 units,
independently of the amount spent on radio.
 This simple model may be wrong: it may be the case that the
effect of TV on sales should increase as radio increases.
 How can we extend the standard linear regression model by
"relaxing" the additive assumption?


 An interaction term β3X1X2 is added to capture the interaction effect:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
  = β0 + (β1 + β3X2)X1 + β2X2 + ε
 Since β˜1 = β1 + β3X2 changes with X2, the effect of X1 on Y is no longer
constant: adjusting X2 will change the impact of X1 on Y.
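In R's formula notation, x1 * x2 expands to x1 + x2 + x1:x2, where x1:x2 is the
interaction (product) term; a sketch on the earlier illustration data:

ifit <- lm(y2 ~ x1 * x2)   # main effects plus the x1:x2 interaction
coef(ifit)                 # the x1:x2 row is the beta3 estimate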
Non-Linear Relationships
 The linear regression model assumes a linear relationship between
the response and predictors.
 But in some cases, the true relationship between the response and
the predictors may be non-linear.
 A very simple way to directly extend the linear model to
accommodate non-linear relationships is to use polynomial
regression.
 Polynomial regression is a form of regression analysis in which the
relationship between the independent variable X and the
dependent variable Y is modelled as an nth-degree polynomial in X.
 It fits a non-linear relationship between the value of X and the
corresponding conditional mean of Y.
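Because such a model is still linear in its coefficients, lm() handles it
directly; a sketch using poly() on the earlier (x, y) data:

# Quadratic fit; poly(x, 2) supplies the x and x^2 terms
# (orthogonalised by default)
pfit <- lm(y ~ poly(x, 2))
summary(pfit)$r.squared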
2.8 COMPARISON OF LINEAR REGRESSION WITH K-NEAREST NEIGHBORS
 Linear regression is an example of a parametric approach because it
assumes a linear functional form for f(X).
 Parametric methods have several advantages. They are often easy to fit,
because one need estimate only a small number of coefficients.
 But parametric methods do have a disadvantage: by construction, they
make strong assumptions about the form of f(X).
 If the specified functional form is far from the truth, and prediction
accuracy is our goal, then the parametric method will perform poorly.
 Non-parametric methods do not explicitly assume a parametric form for
f(X), and thereby provide an alternative and more flexible approach for
performing regression.
 One of the simplest and best-known non-parametric methods is KNN regression.
 The KNN regression method is closely related to the KNN classifier.
 KNN regression first identifies the K training observations that are closest
to a prediction point x0, represented by N0. It then estimates f(x0) using the
average of the training responses in N0:
f^(x0) = (1/K) Σ (xi in N0) yi
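A tiny base-R sketch of this averaging rule for one query point, reusing the
(x, y) data from the simple regression example; in practice a dedicated package
would typically be used instead:

# KNN regression prediction at a single point x0
knn_predict <- function(x0, x, y, k = 3) {
  nearest <- order(abs(x - x0))[1:k]  # indices of the k closest training points
  mean(y[nearest])                    # average their responses
}

knn_predict(4.5, x, y, k = 3)         # non-parametric estimate of f(4.5)
predict(fit, data.frame(x = 4.5))     # linear regression estimate, for comparison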

9. Practice Quiz
1. Linear Regression is a supervised machine learning algorithm.
a) true
b) false
2. Which of the following methods do we use to find the best-fit line for data in Linear
Regression?
a) Least Square Error
b) Maximum Likelihood
c) Logarithmic Loss
d) Both A and B
3. Which of the following evaluation metrics can be used to evaluate a model while
modeling a continuous output variable?
a) AUC-ROC
b) Accuracy
c) Logloss
d) Mean-Squared-Error
4. Which of the following is true about residuals?
a) Lower is better
b) Higher is better
c) A or B, depending on the situation
d) None of these
5. Which of the following statements is true about outliers in Linear Regression?
a) Linear regression is sensitive to outliers
b) Linear regression is not sensitive to outliers
c) Can't say
d) None of these
6. Which of the following metrics can be used for evaluating regression models?
1) R Squared
2) Adjusted R Squared
3) F Statistics
4) RMSE / MSE / MAE
a) 2 and 4.
b) 1 and 2.
c) 3 and 4.
d) All of the above.
7. A regression model in which more than one independent variable is used to
predict the dependent variable is called
a) a simple linear regression model
b) a multiple regression model
c) an independent model
d) none of the above
8. A multiple regression model has the form: y = 2 + 3x1 + 4x2. As x1 increases by 1
unit (holding x2 constant), y will?

a) increase by 3 units
b) decrease by 3 units
c) increase by 4 units
d) decrease by 4 units
9. A multiple regression model has
a) only one independent variable
b) more than one dependent variable
c) more than one independent variable
d) none of the above
10. A measure of goodness of fit for the estimated regression equation is the
a) multiple coefficient of determination
b) mean square due to error
c) mean square due to regression
d) none of the above

10. Assignments

S.No  Question                                                          BL  CO
1     Define simple linear regression and explain how to estimate
      the coefficients.                                                  2   1
2     Define hypothesis testing and explain hypothesis testing with
      an example.                                                        2   1
3     Compare Linear Regression with K-Nearest Neighbors.                2   1

11. Questions

S.No  Question                                                          BL  CO
1     Define simple linear regression and explain how to estimate
      the coefficients.                                                  1   1
2     Define hypothesis testing and explain hypothesis testing with
      an example.                                                        2   1
3     Compare Linear Regression with K-Nearest Neighbors.                2   1

12. Supportive Online Certification Courses

1. Essentials of Data Science With R Software - 2: Sampling Theory and Linear
Regression Analysis, by Prof. Shalabh, IIT Kanpur (12 weeks).

13. Real Time Applications
S.No  Application                                                               CO
1     Predictive Analytics:                                                     1
      Predictive analytics, i.e., forecasting future opportunities and risks,
      is the most prominent application of regression analysis in business.
2     Operational Efficiency:                                                   1
      Regression models can also be used to optimize business processes. A
      factory manager, for example, can create a statistical model to
      understand the impact of oven temperature on the shelf life of the
      cookies baked in those ovens.
3     Supporting Decisions:                                                     1
      Businesses today are overloaded with data on finances, operations, and
      customer purchases; regression analysis helps turn this data into
      evidence for decision-making.
4     Correcting Errors:                                                        1
      Regression is not only great for lending empirical support to
      management decisions but also for identifying errors in judgment.
5     New Insights:                                                             1
      Over time, businesses have gathered a large volume of unorganized data
      that has the potential to yield valuable insights.

14. Contents Beyond the Syllabus

1. Multiple Linear Regression Analysis with R
Applying the multiple linear regression model using R.
2. Variable Selection using LASSO Regression
Data analysts and data scientists use different regression methods for different
kinds of analytics problems, from the simplest to the most complex. One of the
most talked-about methods is the Lasso. The Lasso is often described as one of
the most useful linear regression tools, and we are about to find out why.

15. Prescribed Text Books & Reference Books

Text Books
1. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An
Introduction to Statistical Learning with Applications in R, Springer, 2013,
web link: www.statlearning.com.
2. Mark Gardener, Beginning R: The Statistical Programming Language, Wiley, 2015.
3. J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd
edition, Morgan Kaufmann, 2012.
References:
1. Sinan Ozdemir, Principles of Data Science, Packt Publishing, 2016.
2. Joel Grus, Data Science from Scratch, O'Reilly Media, 2015.

16. Mini Project Suggestion
1. Budget a Long Drive
Suppose you want to go on a long drive (from Delhi to Lonawala). Before going
on a trip this long, it’s best to prepare a budget and figure out how much you
need to spend on a particular section. You can use a linear regression model
here to determine the cost of gas you’ll have to get.
2. Compare Unemployment Rates with Gains in Stock Market
If you’re an economics enthusiast, or if you want to use your knowledge of
Machine Learning in this field, then this is one of the best linear regression project
ideas for you. We all know how unemployment is a significant problem for our
country. In this project, we’d find the relation between the unemployment
rates and the gains happening in the stock market.
3. Compare Salaries of Batsmen with The Average Runs They Score per Game
Cricket is easily the most popular game in India. You can use your knowledge
of machine learning in this simple yet exciting project where you’ll plot
the relationship between the salaries of batsmen and the average runs they
score in every game. Our cricketers are among some of the highest-earning
athletes in the world. Working on this project would help you find out how
much their batting averages are responsible for their earnings.
4. Compare the Dates in a Month with the Monthly Salary
This project explores the application of machine learning in human resources
and management. It is among the beginner-level linear regression projects, so
if you haven't worked on such a project before, you can start with this
one. Here, you'll take the dates present in a month and compare them with the
monthly salary.
5. Compare Average Global Temperatures and Levels of Pollution
Pollution and its impact on the environment is a prominent topic of discussion.
The recent pandemic has also shown us how we can still save our
environment. You can use your machine learning skills in this field too. This
project would help you in understanding how machine learning can solve the
various problems present in this domain as well.
