
PE 515 CS – DATA SCIENCE

UNIT – V
04.01.2023
 Performance Measure
 Logistic regression implementation in R
 K- Nearest Neighbors (KNN)
 K- Nearest Neighbors Implementation in R
 Clustering: K-Means Algorithm
 K-Means Implementation in R
 Case Studies of Data Science Application
o Weather Forecasting
o Stock Market Prediction
o Object Recognition
o Real Time Sentiment Analysis

Performance measures

 Result of a case study – classification of cars based on certain attributes
 2 classes are Hatchback and SUV
 Confusion matrix – What was predicted by classifier (prediction) and
actual (Reference / True Condition)
 Number of data points used = 20
                   Reference
Prediction     Hatchback    SUV
Hatchback          10        1
SUV                 0        9

 Example: True Positive

o 1st word refers to the truth / non-truth of the prediction
o 2nd word refers to the label that was predicted (positive / negative)

 True Positive – success of the classifier (Power)


 False positive – mistake of the classifier (Type I error)
 True Negative – success of the classifier
 False Negative – mistake of the classifier (Type II error)

 Perfect classifier – all values other than the diagonal elements are ‘0’.

Measures of Performance
Terminology
 TP – True Positive (Correct Identification of Positive Labels)
 TN – True Negative (Correct Identification of Negative Labels)
 FP – False Positive (a negative label incorrectly identified as positive)
 FN – False Negative (a positive label incorrectly identified as negative)

 Total Samples, N = TP + TN + FP + FN
1) Accuracy – overall effectiveness of a classifier, A = (TP + TN) / N
 Maximum value that accuracy can take is 1
This happens when the classifier separates the two groups exactly (i.e.
FP = 0 & FN = 0)
 Total number of True Positive Labels = TP + FN
 Total number of True Negative Labels = TN + FP
2) Sensitivity – effectiveness of a classifier to identify positive labels,
Se=TP/(TP+FN)
3) Specificity – effectiveness of a classifier to identify negative labels,
Sp= TN/(FP+TN)
 Both Se & Sp lie between 0 & 1, 1 is an ideal value for each of them
4) Balanced Accuracy – BA = (Sensitivity + Specificity) / 2
5) Prevalence – How often does the yes condition actually occur in our
sample, P = (TP+FN)/N
6) Positive Predictive Value – Proportion of correct results in labels
identified as positive,
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) + (1 - specificity) * (1 - prevalence))
7) Negative Predictive Value – Proportion of correct results in labels
identified as negative,
NPV = (specificity * (1 - prevalence)) / ((1 - sensitivity) * prevalence + specificity * (1 - prevalence))
8) Detection Rate = TP / N
9) Detection Prevalence – Prevalence of Predicted Events, DP = (TP + FP) / N
10) The Kappa Statistic (or value) is a metric that compares an
Observed Accuracy with Expected Accuracy (random chance)
Kappa = (observed accuracy – expected accuracy) / (1 – expected
accuracy)
Observed Accuracy, OA = (a + d) / N
Expected Accuracy, EA = ((a+c)(a+b) + (b+d)(c+d)) / N^2

Where a, b, c and d are TP, FP, FN and TN respectively
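
These measures can be verified directly in base R. Below is a minimal sketch using the 20-car confusion matrix from this case study (counts hard-coded for illustration):

# Confusion-matrix counts from the car case study (a = TP, b = FP, c = FN, d = TN)
TP <- 10; FP <- 1; FN <- 0; TN <- 9
N <- TP + TN + FP + FN

accuracy <- (TP + TN) / N                            # 0.95
sensitivity <- TP / (TP + FN)                        # 1.00
specificity <- TN / (FP + TN)                        # 0.90
balancedAccuracy <- (sensitivity + specificity) / 2  # 0.95

# Kappa from observed vs. expected accuracy
OA <- (TP + TN) / N
EA <- ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / N^2  # 0.5
kappa <- (OA - EA) / (1 - EA)                                # 0.9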


 The first class shown in the result is the positive label
 The second class shown in the result is the negative label
 Refer to the final line of the output for confirmation: “Positive Class” –
Hatchback

 Which measure is most important depends on the kind of application

ROC – Receiver Operating Characteristics


 Originally developed and used in signal detection theory
 ROC graph:
 Sensitivity as a function of 1 - specificity
 Sensitivity on the y-axis and 1 - specificity on the x-axis

 Best value for sensitivity and specificity is 1.

For single classifier:


 ROC can be used to
o see the classifier performance at different threshold levels (from 0
to 1)
o AUC – Area under the ROC
 An area of 1 represents a perfect test; an area of 0.5
represents a worthless model
 0.90 – 1 = excellent
 0.80 – 0.90 = good
 0.70 – 0.80 = fair
 0.60 – 0.70 = poor
o If AUC < 0.5, check whether your labels have been marked the opposite
way around
For several classifiers:
 ROC can be used to
o Compare different classifiers at one threshold or over all threshold
levels
o Performance: Model 3 > Model 2 > Model 1
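
As a sketch of how an ROC curve and AUC can be computed in R, assuming the pROC package (not used elsewhere in these notes); ‘labels’ and ‘probs’ are hypothetical placeholders for the true classes and the predicted probabilities of any classifier:

# install.packages("pROC")
library(pROC)
rocObj <- roc(response = labels, predictor = probs)  # build the ROC object
plot(rocObj)  # plot the ROC curve
auc(rocObj)   # area under the ROC curve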

05.01.2023
Logistic Regression implementation in R

 Case study
o Problem statement
Solve the case study using R
o Read the data from a .csv file
o Understand the data

o glm() function
o Interpret the results

Key points
 Logistic regression is primarily used as a classification algorithm
 It is a supervised learning algorithm
o Data is labeled
 Parametric Approach
 The decision boundary is derived based on the probability interpretation
 Decision boundary can be linear or nonlinear
 The probabilities are modeled as a sigmoidal function

Automotive Crash Testing


 Problem statement – a crash test is a form of destructive testing that is
performed in order to ensure high safety standards for various cars

 Several cars have rolled into an independent audit unit for crash test
 They are being evaluated on a defined scale {poor (-10) to excellent (10)}
on the following parameters:
1. Manikin head impact
2. Manikin body impact
3. Interior impact
4. HVAC (heating, ventilation & air conditioning) impact
5. Safety alarm system
 Each crash test is very expensive to perform
 The crash test was performed for only 100 cars
 Type of a car – Hatchback / SUV, was noted
 However, with this data they should be able to predict the type of a
car in the future
 Part of data reserved for building a model and remaining kept for
analysis
 Data for 80 cars is given in crashTest_1.csv
 Data for remaining 20 cars is given in crashTest_1_TEST.csv
 Use logistic regression classification technique to classify the car types
as hatchback / SUV

Solution to Case Study Using R


 Setting working directory, clearing variables in the workspace
 Installing or loading required packages
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())  # to clear the environment
# install.packages("caret", dependencies = TRUE)
library(caret)  # for confusionMatrix()

Reading the data


 Data for this case study is provided in the files crashTest_1.csv
(Training Data) & crashTest_1_TEST.csv (Testing Data)
 To read the data from a .csv file, use the read.csv() function
 read.csv() – reads a file in a table format and creates a data frame from
it
 Syntax – read.csv(file, row.names = 1)
o file – the name of the file from which the data are to be read. Each
row of the table appears as one line of the file.
o row.names – a vector of row names. This can be a vector giving
the actual row names, a single number giving the column of the
table which contains the row names, or a character string giving the
name of the table column containing the row names.

#Reading the data


crashTest_1 <- read.csv("crashTest_1.csv", row.names = 1)
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv", row.names = 1)

Viewing the data


 View(crashTest_1)


Understanding the data


 crashTest_1 contains 80 observations of 6 variables
 crashTest_1_TEST contains 20 observations of 6 variables
 The first five variables are Manikin head impact, Manikin body impact,
Interior impact, HVAC impact and Safety alarm system
 The first five columns are details about the car and the last column is
the label which says whether the car type is Hatchback / SUV

Structure of the data


 Variables and their data types
 str()
syntax – str(object)
 object – any R object about which you want to have some information


Summary of the data


 Summary of data – The function invokes particular methods which
depend on the class of the first argument
 summary()
o summary gives a five-number summary (plus the mean) for numeric
attributes in the data
 Syntax
o summary (object)
 object - any R object about which you want to have some information

 Summary of the training data

 Summary of the test data

glm()
glm(formula, data, family)

Arguments:
 formula – an object of class "formula" (or one that can be coerced to
that class): a symbolic description of the model to be fitted
 data – a data frame containing the variables
 family – a description of the error distribution and link function to be
used in the model.
For glm, this can be a character string naming a family function, a
family function or the result of a call to a family function. In particular,
family = 'binomial' corresponds to logistic regression

Building a logistic regression model


# Model
logisfit <- glm(formula = CarType ~ ., family = 'binomial',
                data = crashTest_1)

The probability is modeled as a sigmoidal function:

p(X) = e^(b0 + b1X1 + ... + bpXp) / (1 + e^(b0 + b1X1 + ... + bpXp))

Log odds ratio:

The odds ratio is the probability of success / probability of failure,
p(X) / (1 - p(X)), where p(X) = probability of success and
1 - p(X) = probability of failure.

The decision boundary is the hyperplane given by the equation
log(p(X) / (1 - p(X))) = b0 + b1X1 + ... + bpXp = 0
Two degrees of freedom are reported:
1st – the NULL model (intercept only – Reduced Model): 80 – 1 = 79
2nd – the model with all variables included (Full Model): 80 – 6 = 74

Summary of model

summary(logisfit)

Fisher Scoring iterations (MLE): number of iterations = 25
Finding the Odds
 predict()
 syntax: predict(object)

# Finding the odds

logisTrain <- predict(logisfit, type = 'response')  # by default, on the training set

 predict() with type = 'response' gives probabilities
 otherwise, by default, it returns log(odds)
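
For instance, the relation between the two outputs can be checked with the model built above (the logistic function maps log-odds to probabilities):

logOdds <- predict(logisfit)                     # default: log(odds)
probs <- predict(logisfit, type = 'response')    # probabilities
all.equal(probs, exp(logOdds) / (1 + exp(logOdds)))  # TRUE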

Plotting the probabilities


plot(logisTrain)

Classes are well separated
Which side belongs to which car type?

Identifying probabilities associated with the Car Type


 Mean of probabilities
 This helps us identify the probabilities associated with the two classes
tapply(logisTrain, crashTest_1$CarType, mean)

 Low probabilities are associated with car type ‘Hatchback’
 Higher probabilities are associated with car type ‘SUV’

Predicting on test data

# Predicting on test data


logispred <- predict(logisfit, newdata = crashTest_1_TEST, type = 'response')
plot(logispred)

“logispred” is the output which has the probability values

Results
 Each test point is classified as Hatchback / SUV by setting a threshold
on the predicted probability (here 0.5)

crashTest_1_TEST[logispred <= 0.5, "LogisPred"] <- "Hatchback"
crashTest_1_TEST[logispred > 0.5, "LogisPred"] <- "SUV"


 Predicted results are stored in the 7th column (LogisPred)

Confusion Matrix
confusionMatrix(table(crashTest_1_TEST[, 7], crashTest_1_TEST[, 6]),
                positive = 'Hatchback')

10.01.2023
k – Nearest Neighbors (kNN)

 Simple and powerful classification algorithm


 k-Nearest Neighbors (kNN) is a non-parametric method used for classification
 It is a lazy learning algorithm where all computation is deferred until
classification
 It is also an instance-based learning algorithm where the function is
approximated locally
 In logistic regression, the training data were used earlier to estimate the
hyperplane parameters, and those estimated parameters were then used to
predict the test data
 In kNN, the data itself is used rather than estimated parameters; the number
of neighbors ‘k’ is a tuning parameter for the algorithm
 (It is not a parameter derived from the data)
 This is the distinction between parametric and non-parametric methods
 In logistic regression, prediction is not possible without the estimated
parameters
 In kNN, prediction is possible using the data alone
 No prior training work is required for classification using kNN (a key
difference between logistic regression and kNN)

Why kNN and when does one use it?

Why kNN?
 Simplest of all classification algorithms and easy to implement
 There is no explicit training phase and the algorithm does not perform any
generalization of the training data (no optimization required)

When does one use this kNN algorithm?


 When there are nonlinear decision boundaries between classes and when the
amount of data is large

 As the amount of data grows, the computation also increases, but there are
many ways to address this issue.

Input features
 input features can be both quantitative and qualitative

Outputs
 Outputs are categorical values, which typically are the classes of the data

 kNN predicts a categorical value using the majority vote of the nearest neighbors

Assumptions
 Being nonparametric, the algorithm does not make any assumptions about the
underlying data distribution
 Select the parameter ‘k’ based on the data
 Requires a distance metric to define proximity between any two data points
 Examples: Euclidean distance, Mahalanobis distance or Hamming distance
(see the small sketch below)
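
A tiny base-R illustration of the most common choice, the Euclidean distance (the vectors are made up for the example):

x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))  # 5, computed directly
dist(rbind(x, y))     # the built-in dist() gives the same result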

Algorithm
 The kNN classification is performed using the following four steps
1. Compute the distance matrix between the test data point and all the
labeled data points
2. Order the labeled data points in the increasing order of the distance metric
3. Select the top ‘k’ labeled data points and look at the class labels
4. Find the class label that the majority of these ‘k’ labeled data points have
and assign it to the test data point
 Easy to solve multiclass problems using kNN
 Consider a new test point xnew
 Calculate the distances from xnew to all labeled data points (refer to the
image below)
 When a distance is ‘0’, the point is xnew itself
 If k = 3, take the 3 smallest distances
 If all 3 points are from class 1, then xnew is also from class 1
 If all 3 points are from class 2, then xnew is also from class 2
 If 2 belong to class 1 & 1 belongs to class 2, then by the majority of
votes xnew is assigned to class 1
 If 2 belong to class 2 & 1 belongs to class 1, then by the majority of
votes xnew is assigned to class 2

 The algorithm can be used with minor modification for function
approximation (regression)
 If k = 5, take the 5 nearest points, look for the majority vote, and
assign that class to the new test data point (see the sketch below)
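
The four steps translate almost directly into base R. A minimal sketch (hypothetical names; Euclidean distance; ties between classes are not handled):

# Assumed inputs: 'trainX' a numeric matrix, 'trainY' a factor of class
# labels, 'xnew' a numeric vector with one value per column of trainX
knnPredict <- function(trainX, trainY, xnew, k = 3) {
  diffs <- sweep(trainX, 2, xnew)  # step 1: differences to xnew
  d <- sqrt(rowSums(diffs^2))      #         Euclidean distances
  ord <- order(d)                  # step 2: increasing order of distance
  top <- trainY[ord[1:k]]          # step 3: labels of the k nearest points
  names(which.max(table(top)))     # step 4: majority vote
}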

Illustration of kNN

 There is a possibility of data points getting misclassified in the region
where there is a mix of the data points (as shown in the below figure)

 Farther away from the mixed region, the misclassification problem is less

 We didn’t define any explicit boundary here
 This algorithm can be used effectively for solving problems with
complicated nonlinear boundaries

Illustration of kNN (Testing)


 No label for test data (yellow color data point)
 In the next image, when k=3, the new test data point (yellow) will get
the label (red – class 2)
 In the next image, when k=5, the new test data point (yellow) will get
the label (blue – class 1)


Things to consider
 Following are some things one should consider before applying kNN
algorithm
o Parameter selection
o Presence of noise
o Feature selection and scaling
o Curse of dimensionality
Parameter selection
o The best choice of ‘k’ depends on the data; it is commonly tuned by
cross-validation (see the sketch after this section)
o Larger values of ‘k’ reduce the effect of noise on classification
but make the decision boundaries between classes less distinct
o Smaller values of ‘k’ tend to be affected by noise, though they give
clearer separation between classes
Feature selection and scaling
o It is important to remove irrelevant features
o When the number of features is too large, and suspected to be
highly redundant, feature extraction is required
o If the features are carefully chosen then it is expected that the
classification will be better
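
A common way to select ‘k’ is cross-validation. A hedged sketch using caret’s train() (caret is already loaded for the confusion matrix; ‘trainData’ with a factor column ‘Class’ is a hypothetical dataset):

library(caret)
fit <- train(Class ~ ., data = trainData, method = "knn",
             tuneGrid = data.frame(k = seq(1, 15, by = 2)),  # odd k avoids ties
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune  # the value of k with the best cross-validated accuracy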

28.01.2023
K- nearest neighbours implementation in R

 Case Study
o Problem statement
 Solve the Case Study using R
o Read the data from a .csv file
o Understand the data
o knn() function
o interpret the results
 Key points
o knn is primarily used as a classification algorithm
o it is a supervised learning algorithm
 data is labeled
o Non-parametric method
o No explicit training phase is involved
o Lazy learning algorithm
o Notion of distance is needed

o Majority voting method

 Automotive Service Company: A Case Study


 Problem statement
o An automotive service chain is launching its new grand service
station this weekend. They offer to service a wide variety of
cars. The current capacity of the station is to check 315 cars
thoroughly per day.
o As an inaugural offer, they claim to freely check all cars that
arrive on their launch day, and report whether they need
servicing or not.
o Unexpectedly, they got 450 cars. The service men won’t work
longer than the working hours but the data analysts have to.
o Can you save the day for the new service station?

 How can a data scientist save the day for them?


o “serviceTrainData.csv” – a dataset containing attributes of each
car that can be measured easily and quickly, together with a label
saying whether that car needed servicing or not.
o For the cars they cannot check in detail, they measure the same
attributes – “serviceTestData.csv”
o The kNN classification technique is used to classify the cars which
cannot be tested manually and to say whether servicing is needed
or not.
 Getting things ready
o Setting working directory, clearing variables in the workspace
o Installing or loading required packages
# knn implementation in R
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())  # to clear the environment
# install.packages("caret", dependencies = TRUE)
# install.packages("class", dependencies = TRUE)
library(caret)  # for confusionMatrix()
library(class)  # for knn()
 Reading the data
 Data for this case study is provided in file named
o serviceTrainData.csv
o serviceTestData.csv
 To read the data from a .csv file, the read.csv() function is used
 read.csv() – reads a file in table format and creates a data frame from it
 Syntax – read.csv(file, row.names)
 file – the name of the file from which the data are to be read. Each
row of the table appears as one line of the file.
 row.names – a vector of row names. This can be a vector giving the
actual row names, a single number giving the column of the table
which contains the row names, or a character string giving the name of
the table column containing the row names.
# reading the data
ServiceTrain <- read.csv("serviceTrainData.csv")
ServiceTest <- read.csv("serviceTestData.csv")

 Viewing the data

 Understanding the data


o ServiceTrain contains 315 observations of 6 variables
o ServiceTest contains 135 observations of 6 variables
o The variables are:
1. OilQual
2. Engineperf
3. NormMileage
4. TyreWear
5. HVACwear
6. Service
o First five columns are the details about the car and last column
is the label which says whether a service is needed or not
 Structure of the data
o Variables and their data types
o str() – compactly displays the internal structure of an R object
o Syntax – str(object)
o object – any R object about which you want to have some
information

 Summary of the data
o Summary of data
 The function invokes particular methods which depend on
the class of the first argument
o summary()
o gives a five-number summary (plus the mean) for numeric attributes
in the data
o Syntax – summary(object)
o object – any R object about which you want to have some
information

 Implementation of k-nearest neighbours: knn()
knn(train, test, cl, k = 1)
Arguments
 train – matrix or data frame of training set cases
 test – matrix or data frame of test set cases.
A vector will be interpreted as a row vector for a single case
 cl – factor of true classifications of the training set
 k – number of neighbours considered
 Applying knn algorithm on data

# Applying the k-NN algorithm
# k-Nearest Neighbours is a lazy algorithm and can do prediction directly
# with the testing dataset. The knn() command accepts the training and
# testing datasets and the class variable of interest, i.e. the outcome
# categorical variable is provided via the parameter "cl". The parameter
# "k" specifies the number of nearest neighbours required.

predictedknn <- knn(train = ServiceTrain[, -6],
                    test = ServiceTest[, -6],
                    cl = ServiceTrain$Service,
                    k = 3)
o ServiceTrain[, -6] gives all columns of ServiceTrain except the
last (label) column
o ServiceTest[, -6] gives all columns of ServiceTest except the
last column
o ServiceTrain$Service gives the last column of the training data as a
classification factor to the algorithm

Results: Generating Confusion Matrix Manually

# Command to develop and print a confusion matrix
conf_matrix <- table(predictedknn, ServiceTest[, 6])

predictedknn   No   Yes
        No     99    0
        Yes     0   36

# Accuracy is calculated by summing the true positives and true negatives
# and dividing by the total number of samples
knn_accuracy <- sum(diag(conf_matrix)) / nrow(ServiceTest)

> knn_accuracy
[1] 1
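
The same result can be cross-checked with caret’s confusionMatrix(), mirroring the logistic regression case study (a sketch; ‘Yes’ is assumed here to be the positive class):

confusionMatrix(table(predictedknn, ServiceTest[, 6]), positive = "Yes")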

 Conclusion
o read.csv() can be used to read data from .csv files
o str() gives the data types of each attribute in a given R object
o summary() provides a summary of R objects
o k-nearest neighbours is a supervised learning technique – it needs
labeled data
o In R, the kNN algorithm can be implemented using knn()
