
PE 515 CS – DATA SCIENCE

UNIT – V
04.01.2023
 Performance Measure
 Logistic regression implementation in R
 K- Nearest Neighbors (KNN)
 K- Nearest Neighbors Implementation in R
 Clustering: K-Means Algorithm
 K-Means Implementation in R
 Case Studies of Data Science Application
o Weather Forecasting
o Stock Market Prediction
o Object Recognition
o Real Time Sentiment Analysis

Performance measures

 Result of a case study – classification of cars based on certain attributes
 2 classes are Hatchback and SUV
 Confusion matrix – What was predicted by classifier (prediction) and
actual (Reference / True Condition)
 Number of data points used = 20
                   Reference
Prediction     Hatchback    SUV
Hatchback          10        1
SUV                 0        9

 Example: True Positive

o 1st word refers to the truth / non-truth of the prediction
o 2nd word refers to the label that was predicted (positive / negative)

 True Positive – success of the classifier (Power)


 False positive – mistake of the classifier (Type I error)
 True Negative – success of the classifier
 False Negative – mistake of the classifier (Type II error)

 Perfect classifier – all values other than the diagonal elements are ‘0’.

Measures of Performance
Terminology
 TP – True Positive (Correct Identification of Positive Labels)
 TN – True Negative (Correct Identification of Negative Labels)
 FP – False Positive (a negative label incorrectly identified as positive)
 FN – False Negative (a positive label incorrectly identified as negative)

 Total Samples, N = TP + TN + FP + FN
1) Accuracy – overall effectiveness of a classifier, A = (TP + TN) / N
 Maximum value that accuracy can take is 1
This happens when the classifier separates the two groups exactly (i.e.
FP = 0 & FN = 0)
 Total number of True Positive Labels = TP + FN
 Total number of True Negative Labels = TN + FP
2) Sensitivity – effectiveness of a classifier to identify positive labels,
Se=TP/(TP+FN)
3) Specificity – effectiveness of a classifier to identify negative labels,
Sp= TN/(FP+TN)
 Both Se & Sp lie between 0 & 1, 1 is an ideal value for each of them
4) Balanced Accuracy – BA = (Sensitivity + Specificity) / 2
5) Prevalence – How often does the yes condition actually occur in our
sample, P = (TP+FN)/N
6) Positive Predictive Value – Proportion of correct results in labels
identified as positive,
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) + (1 - specificity) * (1 - prevalence))
7) Negative Predictive Value – Proportion of correct results in labels
identified as negative,
NPV = (specificity * (1 - prevalence)) / ((1 - sensitivity) * prevalence + specificity * (1 - prevalence))
8) Detection Rate = TP / N
9) Detection Prevalence – Prevalence of Predicted Events, DP = (TP + FP) / N
10) The Kappa Statistic (or value) is a metric that compares an
Observed Accuracy with Expected Accuracy (random chance)
Kappa = (observed accuracy – expected accuracy) / (1 – expected
accuracy)
Observed Accuracy, OA = (a + d) / N
Expected Accuracy, EA = ((a+c)(a+b) + (b+d)(c+d)) / N^2

Where a, b, c and d are TP, FP, FN and TN respectively
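
These measures can be verified directly in base R. Below is a minimal sketch using the 20-car confusion matrix from this case study (counts hard-coded for illustration):

# Confusion-matrix counts from the car case study (a = TP, b = FP, c = FN, d = TN)
TP <- 10; FP <- 1; FN <- 0; TN <- 9
N <- TP + TN + FP + FN

accuracy <- (TP + TN) / N                            # 0.95
sensitivity <- TP / (TP + FN)                        # 1.00
specificity <- TN / (FP + TN)                        # 0.90
balancedAccuracy <- (sensitivity + specificity) / 2  # 0.95

# Kappa from observed vs. expected accuracy
OA <- (TP + TN) / N
EA <- ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / N^2  # 0.5
kappa <- (OA - EA) / (1 - EA)                                # 0.9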


 The first class shown in the result is the positive label
 The second class shown in the result is the negative label
 Refer to the final line of the output for confirmation: “Positive Class” –
Hatchback

 Which measure is most important depends on the kind of application

ROC – Receiver Operating Characteristics


 Originally developed and used in signal detection theory
 ROC graph:
 Sensitivity as a function of 1 - specificity
 Sensitivity on the y-axis and 1 - specificity on the x-axis

 Best value for sensitivity and specificity is 1.

For single classifier:


 ROC can be used to
o see the classifier performance at different threshold levels (from 0
to 1)
o AUC – Area under the ROC
 An area of 1 represents a perfect test; an area of 0.5
represents a worthless model
 0.90 – 1 = excellent
 0.80 – 0.90 = good
 0.70 – 0.80 = fair
 0.60 – 0.70 = poor
o If AUC < 0.5, check whether your labels have been marked the opposite
way around
For several classifiers:
 ROC can be used to
o Compare different classifiers at one threshold or over all threshold
levels
o Performance: Model 3 > Model 2 > Model 1
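
As a sketch of how an ROC curve and AUC can be computed in R, assuming the pROC package (not used elsewhere in these notes); ‘labels’ and ‘probs’ are hypothetical placeholders for the true classes and the predicted probabilities of any classifier:

# install.packages("pROC")
library(pROC)
rocObj <- roc(response = labels, predictor = probs)  # build the ROC object
plot(rocObj)  # plot the ROC curve
auc(rocObj)   # area under the ROC curve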

05.01.2023
Logistic Regression implementation in R

 Case study
o Problem statement
Solve the case study using R
o Read the data from a .csv file
o Understand the data

o glm() function
o Interpret the results

Key points
 Logistic regression is primarily used as a classification algorithm
 It is a supervised learning algorithm
o Data is labeled
 Parametric Approach
 The decision boundary is derived based on the probability interpretation
 Decision boundary can be linear or nonlinear
 The probabilities are modeled as a sigmoidal function

Automotive Crash Testing


 Problem statement – a crash test is a form of destructive testing that is
performed in order to ensure high safety standards for various cars

 Several cars have rolled into an independent audit unit for crash test
 They are being evaluated on a defined scale {poor (-10) to excellent (10)}
on the following parameters:
1. Manikin head impact
2. Manikin body impact
3. Interior impact
4. HVAC (heating, ventilation & air conditioning) impact
5. Safety alarm system
 Each crash test is very expensive to perform
 The crash test was performed for only 100 cars
 Type of a car – Hatchback / SUV, was noted
 However, with this data they should be able to predict the type of a
car in the future
 Part of data reserved for building a model and remaining kept for
analysis
 Data for 80 cars is given in crashTest_1.csv
 Data for remaining 20 cars is given in crashTest_1_TEST.csv
 Use logistic regression classification technique to classify the car types
as hatchback / SUV

Solution to Case Study Using R


 Setting working directory, clearing variables in the workspace
 Installing or loading required packages
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())  # to clear the environment
# install.packages("caret", dependencies = TRUE)
library(caret)  # for confusionMatrix()

Reading the data


 Data for this case study is provided in the files crashTest_1.csv
(Training Data) & crashTest_1_TEST.csv (Testing Data)
 To read the data from a .csv file, use the read.csv() function
 read.csv() – reads a file in a table format and creates a data frame from
it
 Syntax – read.csv(file, row.names = 1)
o file – the name of the file from which the data are to be read. Each
row of the table appears as one line of the file.
o row.names – a vector of row names. This can be a vector giving
the actual row names, a single number giving the column of the
table which contains the row names, or a character string giving the
name of the table column containing the row names.

#Reading the data


crashTest_1 <- read.csv("crashTest_1.csv", row.names = 1)
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv", row.names = 1)

Viewing the data


 View(crashTest_1)


Understanding the data


 crashTest_1 contains 80 observations of 6 variables
 crashTest_1_TEST contains 20 observations of 6 variables
 The first five variables are Manikin head impact, Manikin body impact,
Interior impact, HVAC impact and Safety alarm system
 The first five columns are details about the car and the last column is
the label which says whether the car type is Hatchback / SUV

Structure of the data


 Variables and their data types
 str()
syntax – str(object)
 object – any R object about which you want to have some information


Summary of the data


 Summary of data – The function invokes particular methods which
depend on the class of the first argument
 summary()
o summary gives a five-number summary (plus the mean) for numeric
attributes in the data
 Syntax
o summary (object)
 object - any R object about which you want to have some information

 Summary of the training data

 Summary of the test data

glm()
glm(formula, data, family)

Arguments:
 formula – an object of class "formula" (or one that can be coerced to
that class): a symbolic description of the model to be fitted
 data – a data frame containing the variables
 family – a description of the error distribution and link function to be
used in the model.
For glm, this can be a character string naming a family function, a
family function or the result of a call to a family function. In particular,
family = 'binomial' corresponds to logistic regression

Building a logistic regression model


# Model
logisfit <- glm(formula = CarType ~ ., family = 'binomial',
                data = crashTest_1)

The probability is modeled as a sigmoidal function:

p(X) = e^(b0 + b1X1 + ... + bpXp) / (1 + e^(b0 + b1X1 + ... + bpXp))

Log odds ratio:

The odds ratio is the probability of success / probability of failure,
p(X) / (1 - p(X)), where p(X) = probability of success and
1 - p(X) = probability of failure.

The decision boundary is the hyperplane given by the equation
log(p(X) / (1 - p(X))) = b0 + b1X1 + ... + bpXp = 0
Two degrees of freedom are reported:
1st – the NULL model (intercept only – Reduced Model): 80 – 1 = 79
2nd – the model with all variables included (Full Model): 80 – 6 = 74

Summary of model

summary(logisfit)

Fisher Scoring iterations (MLE): number of iterations = 25
Finding the Odds
 predict()
 syntax: predict(object)

# Finding the odds

logisTrain <- predict(logisfit, type = 'response')  # by default, on the training set

 predict() with type = 'response' gives probabilities
 otherwise, by default, it returns log(odds)
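
For instance, the relation between the two outputs can be checked with the model built above (the logistic function maps log-odds to probabilities):

logOdds <- predict(logisfit)                     # default: log(odds)
probs <- predict(logisfit, type = 'response')    # probabilities
all.equal(probs, exp(logOdds) / (1 + exp(logOdds)))  # TRUE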

Plotting the probabilities


plot(logisTrain)

Classes are well separated
Which side belongs to which car type?

Identifying probabilities associated with the Car Type


 Mean of probabilities
 This helps us identify the probabilities associated with the two classes
tapply(logisTrain, crashTest_1$CarType, mean)

 Low probabilities are associated with car type ‘Hatchback’
 Higher probabilities are associated with car type ‘SUV’

Predicting on test data

# Predicting on test data


logispred <- predict(logisfit, newdata = crashTest_1_TEST, type = 'response')
plot(logispred)

“logispred” is the output which has the probability values

Results
 Each test point is classified as Hatchback / SUV by setting a threshold
on the predicted probability (here 0.5)

crashTest_1_TEST[logispred <= 0.5, "LogisPred"] <- "Hatchback"
crashTest_1_TEST[logispred > 0.5, "LogisPred"] <- "SUV"


 Predicted results are stored in the 7th column (LogisPred)

Confusion Matrix
confusionMatrix(table(crashTest_1_TEST[, 7], crashTest_1_TEST[, 6]),
                positive = 'Hatchback')

10.01.2023
k – Nearest Neighbors (kNN)

 Simple and powerful classification algorithm


 k-Nearest Neighbors (kNN) is a non-parametric method used for classification
 It is a lazy learning algorithm where all computation is deferred until
classification
 It is also an instance-based learning algorithm where the function is
approximated locally
 In logistic regression, the training data were used earlier to estimate the
hyperplane parameters, and those estimated parameters were then used to
predict the test data
 In kNN, the data itself is used rather than estimated parameters; the number
of neighbors ‘k’ is a tuning parameter for the algorithm
 (It is not a parameter derived from the data)
 This is the distinction between parametric and non-parametric methods
 In logistic regression, prediction is not possible without the estimated
parameters
 In kNN, prediction is possible using the data alone
 No prior training work is required for classification using kNN (a key
difference between logistic regression and kNN)

Why kNN and when does one use it?

Why kNN?
 Simplest of all classification algorithms and easy to implement
 There is no explicit training phase and the algorithm does not perform any
generalization of the training data (no optimization required)

When does one use this kNN algorithm?


 When there are nonlinear decision boundaries between classes and when the
amount of data is large

 As the amount of data grows, the computation also increases, but there are
many ways to address this issue.

Input features
 input features can be both quantitative and qualitative

Outputs
 Outputs are categorical values, which typically are the classes of the data

 kNN predicts a categorical value using the majority vote of the nearest neighbors

Assumptions
 Being nonparametric, the algorithm does not make any assumptions about the
underlying data distribution
 Select the parameter ‘k’ based on the data
 Requires a distance metric to define proximity between any two data points
 Examples: Euclidean distance, Mahalanobis distance or Hamming distance
(see the small sketch below)
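
A tiny base-R illustration of the most common choice, the Euclidean distance (the vectors are made up for the example):

x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))  # 5, computed directly
dist(rbind(x, y))     # the built-in dist() gives the same result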

Algorithm
 The kNN classification is performed using the following four steps
1. Compute the distance matrix between the test data point and all the
labeled data points
2. Order the labeled data points in the increasing order of the distance metric
3. Select the top ‘k’ labeled data points and look at the class labels
4. Find the class label that the majority of these ‘k’ labeled data points have
and assign it to the test data point
 Easy to solve multiclass problems using kNN
 Consider a new test point xnew
 Calculate the distances from xnew to all labeled data points (refer to the
image below)
 When a distance is ‘0’, the point is xnew itself
 If k = 3, take the 3 smallest distances
 If all 3 points are from class 1, then xnew is also from class 1
 If all 3 points are from class 2, then xnew is also from class 2
 If 2 belong to class 1 & 1 belongs to class 2, then by the majority of
votes xnew is assigned to class 1
 If 2 belong to class 2 & 1 belongs to class 1, then by the majority of
votes xnew is assigned to class 2

 The algorithm can be used with minor modification for function
approximation (regression)
 If k = 5, take the 5 nearest points, look for the majority vote, and
assign that class to the new test data point (see the sketch below)
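
The four steps translate almost directly into base R. A minimal sketch (hypothetical names; Euclidean distance; ties between classes are not handled):

# Assumed inputs: 'trainX' a numeric matrix, 'trainY' a factor of class
# labels, 'xnew' a numeric vector with one value per column of trainX
knnPredict <- function(trainX, trainY, xnew, k = 3) {
  diffs <- sweep(trainX, 2, xnew)  # step 1: differences to xnew
  d <- sqrt(rowSums(diffs^2))      #         Euclidean distances
  ord <- order(d)                  # step 2: increasing order of distance
  top <- trainY[ord[1:k]]          # step 3: labels of the k nearest points
  names(which.max(table(top)))     # step 4: majority vote
}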

Illustration of kNN

 There is a possibility of data points getting misclassified in the region
where there is a mix of the data points (as shown in the below figure)

 Farther away from the mixed region, the misclassification problem is less

 We didn’t define any explicit boundary here
 This algorithm can be used effectively for solving problems with
complicated nonlinear boundaries

Illustration of kNN (Testing)


 No label for test data (yellow color data point)
 In the next image, when k=3, the new test data point (yellow) will get
the label (red – class 2)
 In the next image, when k=5, the new test data point (yellow) will get
the label (blue – class 1)


Things to consider
 Following are some things one should consider before applying kNN
algorithm
o Parameter selection
o Presence of noise
o Feature selection and scaling
o Curse of dimensionality
Parameter selection
o The best choice of ‘k’ depends on the data; it is commonly tuned by
cross-validation (see the sketch after this section)
o Larger values of ‘k’ reduce the effect of noise on classification
but make the decision boundaries between classes less distinct
o Smaller values of ‘k’ tend to be affected by noise, though they give
clearer separation between classes
Feature selection and scaling
o It is important to remove irrelevant features
o When the number of features is too large, and suspected to be
highly redundant, feature extraction is required
o If the features are carefully chosen then it is expected that the
classification will be better
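
A common way to select ‘k’ is cross-validation. A hedged sketch using caret’s train() (caret is already loaded for the confusion matrix; ‘trainData’ with a factor column ‘Class’ is a hypothetical dataset):

library(caret)
fit <- train(Class ~ ., data = trainData, method = "knn",
             tuneGrid = data.frame(k = seq(1, 15, by = 2)),  # odd k avoids ties
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune  # the value of k with the best cross-validated accuracy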

28.01.2023
K- nearest neighbours implementation in R

 Case Study
o Problem statement
 Solve the Case Study using R
o Read the data from a .csv file
o Understand the data
o knn() function
o interpret the results
 Key points
o knn is primarily used as a classification algorithm
o it is a supervised learning algorithm
 data is labeled
o Non-parametric method
o No explicit training phase is involved
o Lazy learning algorithm
o Notion of distance is needed

o Majority voting method

 Automotive Service Company: A Case Study


 Problem statement
o An automotive service chain is launching its new grand service
station this weekend. They offer to service a wide variety of
cars. The current capacity of the station is to check 315 cars
thoroughly per day.
o As an inaugural offer, they claim to freely check all cars that
arrive on their launch day, and report whether they need
servicing or not.
o Unexpectedly, they got 450 cars. The service men won’t work
longer than the working hours but the data analysts have to.
o Can you save the day for the new service station?

 How can a data scientist save the day for them?


o “serviceTrainData.csv” – a dataset containing attributes of each
car that can be measured easily and quickly, together with a label
saying whether that car needed servicing or not.
o For the cars they cannot check in detail, they measure the same
attributes – “serviceTestData.csv”
o The kNN classification technique is used to classify the cars which
cannot be tested manually and to say whether servicing is needed
or not.
 Getting things ready
o Setting working directory, clearing variables in the workspace
o Installing or loading required packages
# knn implementation in R
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())  # to clear the environment
# install.packages("caret", dependencies = TRUE)
# install.packages("class", dependencies = TRUE)
library(caret)  # for confusionMatrix()
library(class)  # for knn()
 Reading the data
 Data for this case study is provided in file named
o serviceTrainData.csv
o serviceTestData.csv
 To read the data from a .csv file, the read.csv() function is used
 read.csv() – reads a file in table format and creates a data frame from it
 Syntax – read.csv(file, row.names)
 file – the name of the file from which the data are to be read. Each
row of the table appears as one line of the file.
 row.names – a vector of row names. This can be a vector giving the
actual row names, a single number giving the column of the table
which contains the row names, or a character string giving the name of
the table column containing the row names.
# reading the data
ServiceTrain <- read.csv("serviceTrainData.csv")
ServiceTest <- read.csv("serviceTestData.csv")

 Viewing the data

 Understanding the data


o ServiceTrain contains 315 observations of 6 variables
o ServiceTest contains 135 observations of 6 variables
o The variables are:
1. OilQual
2. Engineperf
3. NormMileage
4. TyreWear
5. HVACwear
6. Service
o First five columns are the details about the car and last column
is the label which says whether a service is needed or not
 Structure of the data
o Variables and their data types
o str() – compactly displays the internal structure of an R object
o Syntax – str(object)
o object – any R object about which you want to have some
information

 Summary of the data
o Summary of data
 The function invokes particular methods which depend on
the class of the first argument
o summary()
o gives a five-number summary (plus the mean) for numeric attributes
in the data
o Syntax – summary(object)
o object – any R object about which you want to have some
information

 Implementation of k-nearest neighbours: knn()
knn(train, test, cl, k = 1)
Arguments
 train – matrix or data frame of training set cases
 test – matrix or data frame of test set cases.
A vector will be interpreted as a row vector for a single case
 cl – factor of true classifications of the training set
 k – number of neighbours considered
 Applying knn algorithm on data

# Applying the k-NN algorithm
# k-Nearest Neighbours is a lazy algorithm and can do prediction directly
# with the testing dataset. The knn() command accepts the training and
# testing datasets and the class variable of interest, i.e. the outcome
# categorical variable is provided via the parameter "cl". The parameter
# "k" specifies the number of nearest neighbours required.

predictedknn <- knn(train = ServiceTrain[, -6],
                    test = ServiceTest[, -6],
                    cl = ServiceTrain$Service,
                    k = 3)
o ServiceTrain[, -6] gives all columns of ServiceTrain except the
last (label) column
o ServiceTest[, -6] gives all columns of ServiceTest except the
last column
o ServiceTrain$Service gives the last column of the training data as a
classification factor to the algorithm

Results: Generating Confusion Matrix Manually

# Command to develop and print a confusion matrix
conf_matrix <- table(predictedknn, ServiceTest[, 6])

predictedknn   No   Yes
        No     99    0
        Yes     0   36

# Accuracy is calculated by summing the true positives and true negatives
# and dividing by the total number of samples
knn_accuracy <- sum(diag(conf_matrix)) / nrow(ServiceTest)

> knn_accuracy
[1] 1
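
The same result can be cross-checked with caret’s confusionMatrix(), mirroring the logistic regression case study (a sketch; ‘Yes’ is assumed here to be the positive class):

confusionMatrix(table(predictedknn, ServiceTest[, 6]), positive = "Yes")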

 Conclusion
o read.csv() can be used to read data from .csv files
o str() gives the data types of each attribute in a given R object
o summary() provides a summary of R objects
o k-nearest neighbours is a supervised learning technique – it needs
labeled data
o In R, the kNN algorithm can be implemented using knn()
