Data Science Unit-5
UNIT – V
04.01.2023
Performance Measure
Logistic regression implementation in R
K- Nearest Neighbors (KNN)
K- Nearest Neighbors Implementation in R
Clustering: K-Means Algorithm
K-Means Implementation in R
Case Studies of Data Science Application
o Weather Forecasting
o Stock Market Prediction
o Object Recognition
o Real Time Sentiment Analysis
Performance measures
The 2 classes are Hatchback and SUV
Confusion matrix – rows give what was predicted by the classifier (Prediction), columns give the actual class (Reference / True Condition)
Number of data points used = 20

                Reference
Prediction    Hatchback   SUV
Hatchback         10        1
SUV                0        9

For a perfect classifier, all values other than the diagonal elements are '0' (here one actual SUV is misclassified as a Hatchback).
Measures of Performance
Terminology
TP – True Positive (positive labels correctly identified as positive)
TN – True Negative (negative labels correctly identified as negative)
FP – False Positive (negative labels incorrectly identified as positive)
FN – False Negative (positive labels incorrectly identified as negative)
Total samples, N = TP + TN + FP + FN
1) Accuracy – overall effectiveness of a classifier, A = (TP + TN) / N
The maximum value accuracy can take is 1
This happens when the classifier separates the two groups exactly (i.e. FP = 0 & FN = 0)
Total number of actual positive labels = TP + FN
Total number of actual negative labels = TN + FP
2) Sensitivity – effectiveness of a classifier to identify positive labels,
Se=TP/(TP+FN)
3) Specificity – effectiveness of a classifier to identify negative labels,
Sp= TN/(FP+TN)
Both Se & Sp lie between 0 & 1, 1 is an ideal value for each of them
4) Balanced Accuracy – BA = (Sensitivity + Specificity) / 2
5) Prevalence – How often does the yes condition actually occur in our
sample, P = (TP+FN)/N
6) Positive Predictive Value – proportion of correct results among labels identified as positive,
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) + (1 − specificity) * (1 − prevalence))
7) Negative Predictive Value – proportion of correct results among labels identified as negative,
NPV = (specificity * (1 − prevalence)) / ((1 − sensitivity) * prevalence + specificity * (1 − prevalence))
8) Detection Rate = TP / N
9) Detection Prevalence – prevalence of predicted events, DP = (TP + FP) / N
10) The Kappa statistic (or value) is a metric that compares the observed accuracy with the expected accuracy (random chance):
Kappa = (observed accuracy − expected accuracy) / (1 − expected accuracy)
With 2×2 confusion-matrix counts a = TP, b = FP, c = FN, d = TN:
Observed Accuracy, OA = (a + d) / N
Expected Accuracy, EA = ((a + b)(a + c) + (c + d)(b + d)) / N²
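A minimal R sketch that computes all of the above measures from the counts in the 20-car example (TP = 10, TN = 9, FP = 1, FN = 0, with Hatchback as the positive class):

TP <- 10; TN <- 9; FP <- 1; FN <- 0
N  <- TP + TN + FP + FN                      # 20

accuracy    <- (TP + TN) / N                 # 0.95
sensitivity <- TP / (TP + FN)                # 1.00
specificity <- TN / (FP + TN)                # 0.90
balanced_accuracy <- (sensitivity + specificity) / 2   # 0.95
prevalence  <- (TP + FN) / N                 # 0.50

ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))  # ~0.91
npv <- (specificity * (1 - prevalence)) /
  ((1 - sensitivity) * prevalence + specificity * (1 - prevalence))  # 1.00

detection_rate       <- TP / N               # 0.50
detection_prevalence <- (TP + FP) / N        # 0.55

# Kappa, with a = TP, b = FP, c = FN, d = TN
a <- TP; b <- FP; c <- FN; d <- TN
OA <- (a + d) / N                                      # observed accuracy, 0.95
EA <- ((a + b) * (a + c) + (c + d) * (b + d)) / N^2    # expected accuracy, 0.50
kappa <- (OA - EA) / (1 - EA)                          # 0.9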
Which measure is most important depends on the kind of application.
For several classifiers, ROC curves can be used to
o Compare different classifiers at one threshold or over all threshold levels
o Compare overall performance (in the example plot, Model 3 > Model 2 > Model 1)
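A sketch of such a comparison in R, assuming the pROC package is installed; the three score vectors below are synthetic stand-ins for three classifiers' predicted probabilities:

library(pROC)

set.seed(1)
labels <- rep(c(0, 1), each = 50)          # true classes
score1 <- labels + rnorm(100, sd = 1.5)    # weakest classifier
score2 <- labels + rnorm(100, sd = 0.8)
score3 <- labels + rnorm(100, sd = 0.4)    # strongest classifier

roc1 <- roc(labels, score1)
roc2 <- roc(labels, score2)
roc3 <- roc(labels, score3)

plot(roc1, col = "red")
plot(roc2, col = "blue",  add = TRUE)
plot(roc3, col = "green", add = TRUE)

auc(roc1); auc(roc2); auc(roc3)   # AUC summarises performance over all thresholds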
05.01.2023
Logistic Regression implementation in R
Case study
o Problem statement
Solve the case study using R
o Read the data from a .csv file
o Understand the data
o glm() function
o Interpret the results
Key points
Logistic regression is primarily used as a classification algorithm
It is a supervised learning algorithm
o Data is labeled
Parametric Approach
The decision boundary is derived based on the probability interpretation
Decision boundary can be linear or nonlinear
The probabilities are modeled by a sigmoidal function (see the sketch below)
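A quick R sketch of that sigmoidal (logistic) function, which maps the linear predictor z to a probability in (0, 1):

sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l",
     xlab = "z (linear predictor)", ylab = "P(class = 1)")
abline(h = 0.5, lty = 2)   # the usual 0.5 decision threshold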
Several cars have rolled into an independent audit unit for crash testing
They are being evaluated on a defined scale {poor (-10) to excellent (10)}
on the following parameters:
1. Manikin head impact
2. Manikin body impact
3. Interior impact
4. HVAC (heating, ventilation & air conditioning) impact
5. Safety alarm system
Each crash test is very expensive to perform
The crash test was performed for only 100 cars
Type of a car – Hatchback / SUV, was noted
However, with this data, the unit should in future be able to predict the type of a car
Part of the data is reserved for building a model and the remainder is kept for testing
Data for 80 cars is given in crashTest_1.csv
Data for remaining 20 cars is given in crashTest_1_TEST.csv
Use logistic regression classification technique to classify the car types
as hatchback / SUV
crashTest_1.csv (Training Data) & crashTest_1_TEST.csv (Testing Data)
To read the data from a .csv file, use read.csv() function
read.csv() – reads a file in a table format and creates a data frame from
it
Syntax – read.csv(file, row.names = 1)
o file – the name of the file from which the data are to be read. Each row of the table appears as one line of the file.
o row.names – a vector of row names. This can be a vector giving the actual row names, a single number giving the column of the table which contains the row names, or a character string giving the name of the table column containing the row names.
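For this case study the two files can be read as below (row.names = 1 assumes the first column of each .csv holds the row names):

crashTest_1      <- read.csv("crashTest_1.csv",      row.names = 1)
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv", row.names = 1)

str(crashTest_1)    # data types of each attribute
head(crashTest_1)   # first few rows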
Summary of training data
Summary of test data
glm()
glm(formula, data, family)
Arguments:
formula – an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted
data – a data frame containing the variables
family – a description of the error distribution and link function to be used in the model.
For glm, this can be a character string naming a family function, a family function, or the result of a call to a family function; in particular, family = 'binomial' corresponds to logistic regression
2 degrees of freedom are reported:
1st – for the NULL model (intercept only – Reduced Model): 80 − 1 = 79
2nd – for the model including all the variables (Full Model, intercept + 5 predictors): 80 − 6 = 74
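A sketch of the model fit; the response column name CarType and the model object name logismodel are assumptions (the five impact scores are taken to be the remaining columns, hence the formula CarType ~ .):

logismodel <- glm(CarType ~ ., data = crashTest_1, family = "binomial")

summary(logismodel)   # coefficients, null/residual deviance with their df,
                      # and the number of Fisher scoring iterations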
Summary of model
Fisher Scoring iterations (MLE), no. of iterations = 25
Finding the Odds
predict()
syntax: predict(object)
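For a fitted glm, predict() returns the linear predictor (the log-odds) by default; exponentiating gives the odds, and type = "response" gives probabilities. A sketch using the model object assumed above:

log_odds <- predict(logismodel)   # log-odds on the training data
odds     <- exp(log_odds)         # odds

# Predicted probabilities for the 20 held-out cars
logispred <- predict(logismodel, newdata = crashTest_1_TEST, type = "response")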
Classes are well separated
Which side belongs to which car type?
Predicting on test data
Results
The test point is classified as Hatchback / SUV by setting a threshold:
crashTest_1_TEST[logispred <= 0.5, "LogisPred"] <- "Hatchback"
crashTest_1_TEST[logispred > 0.5, "LogisPred"] <- "SUV"
Predicted results are stored in the 7th column (LogisPred)
Confusion Matrix
library(caret)   # provides confusionMatrix()
confusionMatrix(table(crashTest_1_TEST[,7], crashTest_1_TEST[,6]),
                positive = 'Hatchback')
10.01.2023
k – Nearest Neighbors (kNN)
Why kNN?
Simplest of all classification algorithms and easy to implement
There is no explicit training phase and the algorithm does not perform any generalization of the training data (no optimization required)
As the amount of data grows, the computational cost also increases, but there are many ways to address this issue.
Input features
input features can be both quantitative and qualitative
Outputs
Outputs are categorical values, which typically are the classes of the data
kNN predicts the categorical value using the majority vote of the nearest neighbors
Assumptions
Being nonparametric, the algorithm does not make any assumptions about the
underlying data distribution
Select the parameter ‘k’ based on the data
Requires a distance metric to define proximity between any two data points
Example: Euclidean distance, Mahalanobis distance or Hamming distance
Algorithm
The kNN classification is performed using the following four steps
1. Compute the distance matrix between the test data point and all the
labeled data points
2. Order the labeled data points in the increasing order of the distance metric
3. Select the top ‘k’ labeled data points and look at the class labels
4. Find the class label that the majority of these ‘k’ labeled data points have
and assign it to the test data point
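A from-scratch R sketch of these four steps, using Euclidean distance (the function name and the toy data are illustrative):

knn_classify <- function(train, labels, xnew, k = 3) {
  # Step 1: distances from the test point to all labeled points
  d <- sqrt(rowSums((train - matrix(xnew, nrow(train), ncol(train),
                                    byrow = TRUE))^2))
  # Steps 2 & 3: order by distance and take the top-k class labels
  nearest <- labels[order(d)[1:k]]
  # Step 4: majority vote
  names(which.max(table(nearest)))
}

# Toy usage: two well-separated 2-D classes
set.seed(1)
train  <- rbind(matrix(rnorm(20, 0), ncol = 2), matrix(rnorm(20, 3), ncol = 2))
labels <- rep(c("class1", "class2"), each = 10)
knn_classify(train, labels, xnew = c(2.5, 2.8), k = 3)   # expected: "class2"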
Easy to solve multiclass problems using kNN
Consider a new test point xnew
Calculate the distances from xnew to all labeled points (refer to the image below)
A distance of '0' corresponds to xnew itself
If k = 3, take the 3 smallest distances
If all 3 points are from class 1, then xnew is also assigned to class 1
If all 3 points are from class 2, then xnew is also assigned to class 2
If 2 belong to class 1 & 1 belongs to class 2, then by majority vote xnew is assigned to class 1
If 2 belong to class 2 & 1 belongs to class 1, then by majority vote xnew is assigned to class 2
The algorithm can be used with minor modification for function approximation (regression)
If k = 5, take the 5 nearest points, look for the majority vote, and assign that class to the new test data point.
Illustration of kNN
There is a possibility of data points getting misclassified in the region where data points from the two classes mix (as shown in the figure below)
Note that no explicit boundary was defined here
The algorithm can therefore be used effectively for problems with complicated nonlinear boundaries
Things to consider
Following are some things one should consider before applying kNN
algorithm
o Parameter selection
o Presence of noise
o Feature selection and scaling
o Curse of dimensionality
Parameter selection
o The best choice of ‘k’ depends on the data
o Larger values of 'k' reduce the effect of noise on classification but make the decision boundaries between classes less distinct
o Smaller values of 'k' give clearer separation between classes but tend to be more affected by noise
Feature selection and scaling
o It is important to remove irrelevant features
o When the number of features is too large and suspected to be highly redundant, feature extraction is required
o If the features are carefully chosen and properly scaled, the classification is expected to be better (see the scaling sketch below)
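Because kNN is distance-based, a feature measured on a large scale can dominate the distance; R's scale() standardises each column. A tiny sketch with made-up units:

X <- data.frame(length_m = runif(10, 3, 5),       # metres
                mass_g   = runif(10, 900, 2500))  # grams: dominates raw distances
X_scaled <- scale(X)    # each column rescaled to mean 0, sd 1
colMeans(X_scaled)      # ~0
apply(X_scaled, 2, sd)  # 1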
28.01.2023
K- nearest neighbours implementation in R
Case Study
o Problem statement
Solve the Case Study using R
o Read the data from a .csv file
o Understand the data
o knn() function
o interpret the results
Key points
o knn is primarily used as a classification algorithm
o it is a supervised learning algorithm
data is labeled
o Non-parametric method
o No explicit training phase is involved
o Lazy learning algorithm
o Notion of distance is needed
o Majority voting method
Viewing the data
Summary of the data
o summary() invokes particular methods depending on the class of the first argument
o For numeric attributes it gives the five-number summary plus the mean
o Syntax – summary(object)
o object – any R object about which you want to have some information
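Typical usage before running knn() (the data frame name serviceData is illustrative):

str(serviceData)       # data type of each attribute
summary(serviceData)   # numeric summary for each attribute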
Implementation of k-nearest neighbours: knn()
knn(train, test, cl, k = 1)
Arguments
train – matrix or data frame of training set cases
test – matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case
cl – factor of true classifications of the training set
k – number of neighbours considered
Applying knn algorithm on data
# Applying the k-NN algorithm
# k-nearest neighbours is a lazy algorithm and can predict directly on the
# testing dataset. The knn() command accepts the training and testing
# datasets; the class variable of interest (the outcome categorical variable)
# is provided via the parameter "cl", and the parameter "k" specifies the
# number of nearest neighbours required.
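A sketch of the call (knn() lives in the 'class' package; the data frame names ServiceTrain/ServiceTest, the feature columns 1–5 with the class in column 6, and k = 3 are assumptions based on the output shown below):

library(class)

predictedknn <- knn(train = ServiceTrain[, 1:5],  # feature columns
                    test  = ServiceTest[, 1:5],
                    cl    = ServiceTrain[, 6],    # true class labels of training set
                    k     = 3)                    # number of neighbours (assumed)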
# Command to develop and print a confusion matrix
conf_matrix = table(predictedknn, ServiceTest[,6])
conf_matrix

predictedknn No Yes
         No  99   0
         Yes  0  36

# Accuracy: proportion of correct (diagonal) predictions
knn_accuracy = sum(diag(conf_matrix)) / sum(conf_matrix)
> knn_accuracy
[1] 1
Conclusion
o read.csv() can be used to read data from .csv files
o str() gives the data types of each attribute in the given R object
o summary() – provides a summary of R objects
o k-nearest neighbours is a supervised learning technique – it needs labeled data
o In R, the kNN algorithm can be implemented using knn() (from the 'class' package)