Analyzing the Ionosphere using R
Jeffrey Strickland
2022-09-05
Radar Data
This radar data was collected by a system in Goose Bay, Labrador. This system consists of a
phased array of 16 high-frequency antennas with a total transmitted power on the order of
6.4 kilowatts. See [1] for more details. The targets were free electrons in the ionosphere.
“Good” radar returns are those showing evidence of some type of structure in the
ionosphere. “Bad” returns are those that do not; their signals pass through the ionosphere
[1].
Received signals were processed using an autocorrelation function whose arguments are
the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay
system. Instances in this database are described by 2 attributes per pulse number,
corresponding to the complex values returned by the function resulting from the complex
electromagnetic signal.
Each instance in the dataset is described as follows:
• There are 34 predictors in columns 1 through 34; all 34 are continuous.
• The 35th attribute is either “good” or “bad” according to the definition summarized
above.
Data Preprocessing
Load the Data and Print a Summary
First, we use read.csv to read the ionosphere file into RStudio. We also print a summary of
the data, as has been our custom.
ion_data <- read.csv("C:/Users/jeff/Documents/Data/ionosphere.csv")
summary(ion_data)
## V1 V2 V3 V4
## Min. :0.0000 Min. :0 Min. :-1.0000 Min. :-1.00000
## 1st Qu.:1.0000 1st Qu.:0 1st Qu.: 0.4721 1st Qu.:-0.06474
## Median :1.0000 Median :0 Median : 0.8711 Median : 0.01631
## Mean :0.8917 Mean :0 Mean : 0.6413 Mean : 0.04437
## 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.: 1.0000 3rd Qu.: 0.19418
## Max. :1.0000 Max. :0 Max. : 1.0000 Max. : 1.00000
## V5 V6 V7 V8
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.4127 1st Qu.:-0.0248 1st Qu.: 0.2113 1st Qu.:-0.05484
## Median : 0.8092 Median : 0.0228 Median : 0.7287 Median : 0.01471
## Mean : 0.6011 Mean : 0.1159 Mean : 0.5501 Mean : 0.11936
## 3rd Qu.: 1.0000 3rd Qu.: 0.3347 3rd Qu.: 0.9692 3rd Qu.: 0.44567
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## V9 V10 V11 V12
## Min. :-1.00000 Min. :-1.00000 Min. :-1.00000 Min. :-1.00000
## 1st Qu.: 0.08711 1st Qu.:-0.04807 1st Qu.: 0.02112 1st Qu.:-0.06527
## Median : 0.68421 Median : 0.01829 Median : 0.66798 Median : 0.02825
## Mean : 0.51185 Mean : 0.18135 Mean : 0.47618 Mean : 0.15504
## 3rd Qu.: 0.95324 3rd Qu.: 0.53419 3rd Qu.: 0.95790 3rd Qu.: 0.48237
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## V13 V14 V15 V16
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.07372 1st Qu.: 0.0000 1st Qu.:-0.08170
## Median : 0.6441 Median : 0.03027 Median : 0.6019 Median : 0.00000
## Mean : 0.4008 Mean : 0.09341 Mean : 0.3442 Mean : 0.07113
## 3rd Qu.: 0.9555 3rd Qu.: 0.37486 3rd Qu.: 0.9193 3rd Qu.: 0.30897
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.00000
## V17 V18 V19 V20
## Min. :-1.0000 Min. :-1.000000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.225690 1st Qu.: 0.0000 1st Qu.:-0.23467
## Median : 0.5909 Median : 0.000000 Median : 0.5762 Median : 0.00000
## Mean : 0.3819 Mean :-0.003617 Mean : 0.3594 Mean :-0.02402
## 3rd Qu.: 0.9357 3rd Qu.: 0.195285 3rd Qu.: 0.8993 3rd Qu.: 0.13437
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000 Max. : 1.00000
## V21 V22 V23 V24
## Min. :-1.0000 Min. :-1.000000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.243870 1st Qu.: 0.0000 1st Qu.:-0.36689
## Median : 0.4991 Median : 0.000000 Median : 0.5318 Median : 0.00000
## Mean : 0.3367 Mean : 0.008296 Mean : 0.3625 Mean :-0.05741
## 3rd Qu.: 0.8949 3rd Qu.: 0.188760 3rd Qu.: 0.9112 3rd Qu.: 0.16463
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000 Max. : 1.00000
## V25 V26 V27 V28
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.: 0.0000 1st Qu.:-0.33239 1st Qu.: 0.2864 1st Qu.:-0.44316
## Median : 0.5539 Median :-0.01505 Median : 0.7082 Median :-0.01769
## Mean : 0.3961 Mean :-0.07119 Mean : 0.5416 Mean :-0.06954
## 3rd Qu.: 0.9052 3rd Qu.: 0.15676 3rd Qu.: 0.9999 3rd Qu.: 0.15354
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.00000
## V29 V30 V31 V32
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.000000
## 1st Qu.: 0.0000 1st Qu.:-0.23689 1st Qu.: 0.0000 1st Qu.:-0.242595
## Median : 0.4966 Median : 0.00000 Median : 0.4428 Median : 0.000000
## Mean : 0.3784 Mean :-0.02791 Mean : 0.3525 Mean :-0.003794
## 3rd Qu.: 0.8835 3rd Qu.: 0.15407 3rd Qu.: 0.8576 3rd Qu.: 0.200120
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.000000
## V33 V34 Class
## Min. :-1.0000 Min. :-1.00000 Length:351
## 1st Qu.: 0.0000 1st Qu.:-0.16535 Class :character
## Median : 0.4096 Median : 0.00000 Mode :character
## Mean : 0.3494 Mean : 0.01448
## 3rd Qu.: 0.8138 3rd Qu.: 0.17166
## Max. : 1.0000 Max. : 1.00000
Encode Character Variable
Here, we encode the variable “Class” with numeric values 1 and 0, where 1 = “good” and
0 = “bad”. We have previously performed this transformation using LabelEncoder from the
superml package.
library(superml)
lbl = LabelEncoder$new()
ion_data$Y = lbl$fit_transform(ion_data$Class)
After running the above script we get the new column Y comprised of 0s and 1s, so we can
drop the Class column.
drop<-"Class"
ion_data = ion_data[!(names(ion_data) %in% drop)]
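Had we preferred not to depend on superml, the same 0/1 encoding could have been produced
with base R before dropping the Class column. The following is a minimal sketch, assuming
the mapping stated above (1 = “good”, 0 = “bad”):
# Sketch: base-R alternative to LabelEncoder
ion_data$Y <- ifelse(ion_data$Class == "good", 1, 0)
ion_data$Class <- NULL   # drop the character column, as above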
Train-Test Split
Here, we split the data into 60% training data and 40% test data. The training set is used
to fit the model in the next section, while the test set is held out for evaluation later.
library(caTools)
samples <- sample.split(ion_data$Y, SplitRatio = 0.6)  # split on the outcome so both sets keep the class balance
# Train data
ion_trn<- subset(ion_data,samples==TRUE)
# Test data
ion_tst<- subset(ion_data,samples==FALSE)
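Because sample.split draws at random, the exact rows that land in ion_trn and ion_tst (and
therefore the numbers reported below) will change from run to run. Setting a seed before the
split makes the results reproducible; a minimal sketch, where the seed value is arbitrary:
# Sketch: make the 60/40 split reproducible
set.seed(42)
samples <- sample.split(ion_data$Y, SplitRatio = 0.6)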
Fitting the Model
Here, we train a model using the glm() function from the stats package. glm is used to fit
generalized linear models, specified by giving a symbolic description of the linear predictor
and a description of the error distribution.
A typical predictor has the form response ~ terms, where response is the (numeric)
response vector and terms is a series of terms which specifies a linear predictor for the
response. For binomial families like ours, the response can also be specified as a factor
(where the first level denotes “bad” and all others “good”). A terms specification of the
form first + second indicates all the terms in first together with all the terms in second,
with any duplicates removed. In our data, predictors such as V1 and V2 play the role of
these terms, while the response is the character variable originally labeled “Class”, which
we re-encoded as “Y”.
The summary function can be used to obtain or print a summary of the results, and the anova
function to produce an analysis of variance table. We ran several models but only kept the
results of the model we labeled model_lm. The predictors V3, V5, V8, V22 and V26 are
attributes of the radar returns, each of which is classified as either “good” (1) or
“bad” (0). The Coefficients table provides each estimate, its standard error, z-value, and
the associated p-value.
The standard deviation of an estimate is called the standard error. The standard error of a
coefficient measures how precisely the model estimates the coefficient’s unknown value; it is
always positive, and the smaller it is, the more precise the estimate. For our model, the
standard errors are all of roughly the same magnitude, so the coefficients are estimated with
broadly similar precision, and the standard errors are relatively small.
Notice that the Residual deviance is smaller than the Null deviance, which is what we
want. The null deviance tells us how well the response variable can be predicted by a
model with only an intercept term. The residual deviance tells us how well the response
variable can be predicted by a model with p predictor variables. The lower the value, the
better the model is able to predict the value of the response variable.
To determine whether the model is “useful” we can compute the Chi-Square statistic as:
𝜒² = Null deviance − Residual deviance
with 𝑝 degrees of freedom, where 𝑝 is the number of predictors (here, 5). The p-values in
the coefficient table come from z-values and are different from Chi-Square values (they come
from different distributions). We can find the 𝑝-value associated with this Chi-Square
statistic; a short sketch follows the model output below. The lower the 𝑝-value, the better
the model is able to fit the dataset compared to a model with just an intercept term.
model_lm = glm(Y ~ V3 + V5 + V8 + V22 + V26, data = ion_trn, family = binomial)
summary(model_lm)
##
## Call:
## glm(formula = Y ~ V3 + V5 + V8 + V22 + V26, family = binomial,
## data = ion_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7123 -0.4699 -0.2998 0.3519 3.1770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.7510 0.5059 5.438 5.40e-08 ***
## V3 -3.6441 0.7182 -5.074 3.90e-07 ***
## V5 -2.2427 0.5180 -4.329 1.50e-05 ***
## V8 -1.2238 0.4377 -2.796 0.005174 **
## V22 2.4450 0.5811 4.208 2.58e-05 ***
## V26 -1.7644 0.5340 -3.304 0.000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 276.92 on 210 degrees of freedom
## Residual deviance: 139.46 on 205 degrees of freedom
## AIC: 151.46
##
## Number of Fisher Scoring iterations: 6
print(paste("Chi-Square =", model_lm$null.deviance/model_lm$deviance))
## [1] "Chi-Square = 1.98565603155999"
ANOVA Table
The ANOVA table for our model lists the terms added sequentially and the drop in residual
deviance that each one produces.
The deviance residual is another type of residual measure. It measures the disagreement
between the maxima of the observed and the fitted log-likelihood functions. Since logistic
regression uses the maximum likelihood principle, the goal in logistic regression is to
minimize the sum of the deviance residuals.
The null deviance, in turn, is simply the residual deviance of an intercept-only model.
Standard ‘raw’ residuals aren’t used much in GLM modeling because they don’t always make
sense; for the sake of completeness, they can be extracted with residuals(model_lm, type =
"response"). A better alternative is Pearson residuals, which also come in a standardized
variant that is useful for checking that the residuals have a constant variance.
anova(model_lm)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 210 276.92
## V3 1 78.447 209 198.47
## V5 1 15.629 208 182.84
## V8 1 18.155 207 164.69
## V22 1 12.629 206 152.06
## V26 1 12.598 205 139.46
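The same sequential deviance drops can be given Chi-Square p-values directly by passing a
test argument to anova(); a minimal sketch (output omitted):
# Sketch: attach a Chi-Square test to each sequentially added term
anova(model_lm, test = "Chisq")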
Plotting Residuals
Here, we’ll plot the deviance residuals and the Pearson residuals. Interestingly, the
deviance residuals have a pattern not present in the plots of the other classes of residuals:
there is a ‘hump’ around 0.5 and another around 1.5. If we were doing this analysis for real,
that should prompt an investigation. The Pearson residuals show the same result, but on a
different scale.
# Density of the deviance residuals
plot(density(resid(model_lm, type='deviance')))
lines(density(resid(model_lm, type='deviance')), col='red')
# Density of the standardized Pearson residuals
plot(density(rstandard(model_lm, type='pearson')))
lines(density(rstandard(model_lm, type='pearson')), col='red')
Plotting the Model
There are four plots that the model provides information for, where we only have to have R
perform a simple plot. The plots are: a Residuals vs. Fitted plot, a Normal Q-Q plot, a
Scale-Location plot, and a Residuals vs. Leverage plot.
When looking at the Residuals vs. Fitted plot, we check for two things:
1. Verify that the red line is roughly horizontal across the plot. If it is, then the
assumption of homoscedasticity is likely satisfied for a given regression model. That
is, the spread of the residuals is roughly equal at all fitted values.
2. Verify that there is no clear pattern among the residuals. In other words, the
residuals should be randomly scattered around the red line with roughly equal
variability at all fitted values.
Residuals vs. Fitted
This plot is a scatter plot of residuals on the 𝑦-axis and fitted values (estimated responses)
on the 𝑥-axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
The plot suggests that there is a decreasing relationship between our modeled responses
and their residuals, but the relationship may not be linear. It also shows a data point
(#129) that is nearly off the chart. We’ll look at that point later in the discussion.
Normal Q-Q
A Normal Q-Q plot helps us assess whether a set of data plausibly came from some
theoretical Normal distribution. It is a scatterplot created by plotting two sets of quantiles
against one another. If both sets of quantiles came from the same distribution, we should
see the points forming a line that’s roughly straight.
Scale vs. Location
A scale-location plot is a type of plot that displays the fitted values of a regression model
along the x-axis and the square root of the standardized residuals along the y-axis.
Residuals vs. Leverage
Each observation from the dataset is shown as a single point within the plot. The x-axis
shows the leverage of each point and the y-axis shows the standardized residual of each
point.
Leverage refers to the extent to which the coefficients in the regression model would
change if a particular observation was removed from the dataset. Observations with high
leverage have a strong influence on the coefficients in the regression model. If we remove
these observations, the coefficients of the model would change noticeably.
If any point in this plot falls outside of Cook’s distance (the red dashed lines) then it is
considered to be an influential observation. We can see that observation #129 lies closest
to the border of Cook’s distance, but it doesn’t fall outside of the dashed line. This means
there are not any influential points in our regression model.
# Produce the four diagnostic plots described above
plot(model_lm)
Making Predictions
Now that we have what appears to be a decent model, we use it to make predictions. The
stats package provides the predict.glm() method for this task, and the generic predict()
function dispatches to it for a fitted glm object; calling predict() is the simpler route and
the one we’ll use. The type = "link" argument asks for predictions on the scale of the link
function, here the “logit” link used in the model (by default, the binomial family uses the
logit link). Our responses are “1”s and “0”s.
The pred object that we create here will be used for model scoring and diagnostics, as we
go forward.
pred = ifelse(predict(model_lm, type = "link") > 0, "1", "0")
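Because a positive value on the link (logit) scale corresponds to a fitted probability above
0.5, the same labels could equally be produced on the probability scale. A minimal sketch of
the equivalent call (not part of the original pipeline; pred_prob is a hypothetical name):
# Sketch: identical labels via fitted probabilities and a 0.5 cutoff
pred_prob = ifelse(predict(model_lm, type = "response") > 0.5, "1", "0")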
Going forward, we want to ensure that the actual values and the predicted values are the
same type. We know that we’ll need both sets to be character values for scoring and
building a confusion matrix, so we transform the actuals here.
ion_trn$Y_trn<-as.character(ion_trn$Y)
ion_tst$Y_tst<-as.character(ion_tst$Y)
Score the Trained Model
We use the actual and predicted values for scoring the model by checking the percentage of
predictions that match the actuals.
score<-mean(pred==ion_trn$Y_trn)
score
## [1] 0.8672986
Results
Here, we’ll generate several kinds of results, including the table representing the confusion
matrix, a matrix of overall results, and a matrix of classes. Recall that the classes represent
the responses described as free electrons in the ionosphere. “Good” radar returns are those
showing evidence of some type of structure in the ionosphere. “Bad” returns are those that
do not; their signals pass through the ionosphere.
Confusion Matrix
To compare the predictions against the actual training labels, we want both the predicted
values and the actual values formatted the same way. So, we make them arrays and check their
structures to ensure they match.
ion_trn$Y_trn<-as.array(ion_trn$Y_trn)
pred<-as.array(pred)
str(pred)
## chr [1:211(1d)] "0" "0" "0" "0" "1" "1" "0" "0" "0" "0" "0" "1" "1" "0" .
..
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:211] "1" "2" "4" "7" ...
str(ion_trn$Y_trn)
## chr [1:211(1d)] "0" "1" "1" "0" "1" "1" "0" "1" "0" "0" "0" "1" "1" "1" .
..
We use the confusionMatrix() function from the caret package to construct the confusion
matrix. To keep the code simple, we first build a helper table named train_tab, which
cross-tabulates the predicted and actual values (we’ll use all of its output later), and we
print a table of actuals and predictions.
library(caret)
train_tab = table(predicted = pred, actual = ion_trn$Y)
train_con_mat = confusionMatrix(train_tab)
results <- confusionMatrix(train_tab)
print(table(predicted = pred,actuals = ion_trn$Y))
## actuals
## predicted 0 1
## 0 125 19
## 1 9 58
The Confusion Matrix Results
First, look at the confusion matrix. The table is organized so that the actuals are the
columns and the predicted values are the rows. So, if we look at row 0 and column 0, we are
viewing the frequency with which the model predicted the actual “0”s correctly. These are
called “true negatives” or TNs. If we shift to the right, while still in row 0, we see that
there are 19 counts where the model predicted the “0” class when the returns were actually
“1”s. These are called “false negatives” or FNs.
Moving down to row 1 and column 1, guess what? These are the frequency of “true positives”
or TPs, when the model predicted “1”s and the actuals were “1”s. So, shifting left to row 1
and column 0, we have the “false positives” or FPs, and these are cases when the model
predicted “1”s while they were actually “0”s.
To summarize, there are 125 TNs, 19 FNs, 58 TPs, and 9 FPs.
Confusion Matrix
             Actual
Predicted    Bad   Good
  Bad        125     19
  Good         9     58
The Overall Results
The confusion matrix object we created (above) provides a lot of model performance metrics.
We’ll get the matrix of overall results first. The matrix of overall results provides an
accuracy score, which we’ll explain below, with lower and upper bounds that form a confidence
interval. It also gives us the p-value for the accuracy score. Accuracy is the ratio of the
correctly labeled classes to the whole pool of classes, and may be the most intuitive
measure. Accuracy answers the following question: How many returns did we correctly label out
of all the returns?
Accuracy = (TP + TN) / (TP + FP + FN + TN)
• numerator: all correctly labeled classes (both “good” and “bad”)
• denominator: all classes
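As a quick check against the overall results printed below, accuracy can be recomputed
directly from the confusion table above; a minimal sketch using the counts shown there:
# Sketch: accuracy from the confusion table (125 TNs and 58 TPs out of 211 returns)
(125 + 58) / (125 + 19 + 9 + 58)   # = 0.8673, matching the Accuracy entry below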
Finally, the overall results give us the p-value of McNemar’s Test, which we’ll explain now,
since it may be the most informative result.
In a widely cited 1998 paper on the use of statistical hypothesis tests to compare
classifiers, titled “Approximate Statistical Tests for Comparing Supervised Classification
Learning Algorithms”, Thomas Dietterich recommends the use of McNemar’s test.
“For algorithms that can be executed only once, McNemar’s test is the only test
with acceptable Type I error.”
— Thomas Dietterich
Specifically, Dietterich’s study was concerned with the evaluation of different statistical
hypothesis tests, some operating upon the results from resampling methods. The concern
of the study was low Type I error, that is, the statistical test reporting an effect when in fact
no effect was present (false positive).
Statistical tests for deep learning models have always been an issue, and as we’ll see, this
is the case for multinomial logistic regression toward the end of this chapter. However,
McNemar’s test is favorably accepted. The test is based on contingency tables, which we will
not discuss at length here. We will simply state that a contingency table is a tabulation or
count of two categorical variables. In the case of McNemar’s test, we are interested in
binary variables, correct/incorrect or yes/no, for a control and a treatment, or for two
cases. This is called a 2×2 contingency table, and it is much like our confusion matrix.
The McNemar Test
The McNemar test statistic is calculated from a contingency table as:
Statistic = (Yes/No − No/Yes)² / (Yes/No + No/Yes)
Where Yes/No is the count of test instances that Classifier1 got correct and Classifier2 got
incorrect, and No/Yes is the count of test instances that Classifier1 got incorrect and
Classifier2 got correct.
This calculation of the test statistic assumes that each cell in the contingency table used in
the calculation has a count of at least 25. The test statistic has a Chi-Squared distribution
with 1 degree of freedom.
We can see that only two elements of the contingency table are used, specifically that the
Yes/Yes and No/No elements are not used in the calculation of the test statistic. As such, we
can see that the statistic is reporting on the different correct or incorrect predictions
between the two models, not the accuracy or error rates. This is important to understand
when making claims about the finding of the statistic.
The default assumption, or null hypothesis, of the test is that the two cases disagree to the
same amount. If the null hypothesis is rejected, it suggests that there is evidence to suggest
that the cases disagree in different ways, that the disagreements are skewed.
Given the selection of a significance level, the p-value calculated by the test can be
interpreted as follows:
• 𝑝 > 𝛼: fail to reject 𝐻0 , no difference in the disagreement (e.g. treatment had no
effect).
• 𝑝 <= 𝛼: reject 𝐻0 , significant difference in the disagreement (e.g. treatment had an
effect).
Interpreting the McNemar Test
It is important to take a moment to clearly understand how to interpret the result of the
test in the context of two machine learning classifier models.
The two terms used in the calculation of the McNemar’s Test capture the errors made by
both models. Specifically, the No/Yes and Yes/No cells in the contingency table. The test
checks if there is a significant difference between the counts in these two cells. That is all.
If these cells have counts that are similar, it shows us that both models make errors in
much the same proportion, just on different instances of the test set. In this case, the result
of the test would not be significant and the null hypothesis would not be rejected.
“Under the null hypothesis, the two algorithms should have the same error rate
…”
— Approximate Statistical Tests for Comparing Supervised Classification
Learning Algorithms, 1998.
If these cells have counts that are not similar, it shows that both models not only make
different errors, but in fact have a different relative proportion of errors on the test set. In
this case, the result of the test would be significant and we would reject the null hypothesis.
“So we may reject the null hypothesis in favor of the hypothesis that the two
algorithms have different performance when trained on the particular training”
— Approximate Statistical Tests for Comparing Supervised Classification
Learning Algorithms, 1998.
We can summarize this as follows:
• Fail to Reject Null Hypothesis: Classifiers have a similar proportion of errors on
the test set.
• Reject Null Hypothesis: Classifiers have a different proportion of errors on the test
set.
After performing the test and finding a significant result, it may be useful to report a
statistical measure of effect size in order to quantify the finding. For example, a natural
choice would be to report the odds ratios, or the contingency table itself, although both of
these assume a sophisticated reader.
It may be useful to report the difference in error between the two classifiers on the test
set. In this case, be careful with your claims, as the significance test does not report on
the difference in error between the models, only on the relative difference in the proportion
of errors between the models.
Finally, in using McNemar’s test, Dietterich highlights two important limitations that must
be considered. They are:
1. There is no measure of the training set or model variability.
2. It is a more indirect comparison of models than the train/test split method.
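The McnemarPValue reported in the overall results below can be obtained directly from our
2×2 table of predictions and actuals using the base stats function, which applies the
continuity-corrected form of the statistic above; a minimal sketch:
# Sketch: McNemar's test on the predicted-vs-actual table built earlier
mcnemar.test(train_tab)   # its p-value should closely match McnemarPValue below (about 0.089)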
as.matrix(results, what = "overall")
## [,1]
## Accuracy 8.672986e-01
## Kappa 7.055716e-01
## AccuracyLower 8.139494e-01
## AccuracyUpper 9.099767e-01
## AccuracyNull 6.350711e-01
## AccuracyPValue 3.726257e-14
## McnemarPValue 8.897301e-02
The Classes Results
Taking these in order, Sensitivity is the ratio of the classes correctly labeled “good” by
our model to all classes that are “good” in reality. Sensitivity answers the following
question: Of all the classes that are good, how many of those do we correctly predict?
Sensitivity = TP / (TP + FN)
• numerator: “good” labeled for “good” classes.
• denominator: all classes that are “good” (whether detected by our model or not)
Specificity is the ratio of the classes correctly labeled “bad” by the model to all the
classes that are “bad” in reality. Specificity answers the following question: Of all the
classes that are “bad”, how many of those did we correctly predict?
Specificity = TN / (TN + FP)
• numerator: “bad” labeled for “bad” classes.
• denominator: all classes that are “bad” in reality (whether labeled “bad” or “good”)
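Both quantities can be checked against the confusion table. One caveat worth noting:
confusionMatrix() treats the first factor level (“0”, i.e. “bad”) as the positive class
unless a positive = argument is supplied, so the Sensitivity and Specificity it reports below
are computed with “bad” playing the role of the positive class. A minimal sketch using the
counts shown earlier:
# Sketch: class metrics from the confusion table, with "0" (bad) as the positive class
125 / (125 + 9)    # Sensitivity = 0.9328 - actual "0"s that were predicted "0"
58 / (58 + 19)     # Specificity = 0.7532 - actual "1"s that were predicted "1"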
To better understand Sensitivity and Specificity, we’ll use a receiver operating
characteristic (ROC) curve.
as.matrix(results, what = "classes")
## [,1]
## Sensitivity 0.9328358
## Specificity 0.7532468
## Pos Pred Value 0.8680556
## Neg Pred Value 0.8656716
## Precision 0.8680556
## Recall 0.9328358
## F1 0.8992806
## Prevalence 0.6350711
## Detection Rate 0.5924171
## Detection Prevalence 0.6824645
## Balanced Accuracy 0.8430413
The Receiver Operating Characteristic Curve
An important way to visualize sensitivity and specificity is via the receiver operating
characteristic curve. Let’s see how we can generate this curve in R. The pROC package’s
roc() function is nice in that it lets one plot confidence intervals for the curve.
On the x-axis specificity decreases as the false positives increase. On the y-axis sensitivity
increases with false positives. One interpretation of this is in terms of how much you ‘pay’
in terms of false positives to obtain true positives. The area under the curve summarizes
this: if it is high you pay very little, while if it is low you pay a lot. The ‘ideal’ curve achieves
sensitivity 1 for specificity 1, and has AUC 1. This implies you pay nothing in false positives
for true positives. Our observed curve is pretty good though, as it has a high slope early on,
and a high AUC of 0.804.
library(pROC)
N_ion_all = nrow(ion_trn)
N_ion_trn = round(0.75*(N_ion_all))
N_ion_tst = N_ion_all - N_ion_trn
model02 = glm(Y ~ V3 + V5 + V8 + V22 + V26, data = ion_data, family = 'binomial')
predictions <- predict(model02, newdata = ion_tst, type = 'response')[1:N_ion_tst]
ion_sensitivity <- sensitivity(factor(round(predictions)), factor(ion_tst['Y'][1:N_ion_tst,]))
ion_specificity <- specificity(factor(round(predictions)), factor(ion_tst['Y'][1:N_ion_tst,]))
ion_roc <- roc(round(predictions), ion_tst['Y'][1:N_ion_tst,], ci=TRUE, plot=TRUE,
               auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE, print.auc=TRUE,
               show.thres=TRUE, col = "red", grid.col=c("red2", "red3"))
ion.ci <- ci.se(ion_roc)
plot(ion.ci, type='shape', col = "lightblue", lwd = 2)
plot(ion.ci, type='bars', col = 'blue', lwd = 2)
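For reference, pROC’s roc() expects the observed classes as its first (response) argument and
a continuous score as its second (predictor) argument, whereas the call above passes rounded
predictions first, which also collapses the scores to 0/1. A minimal sketch of the more
conventional call (an alternative to, not a replacement for, the code above; it assumes
ion_tst$Y holds the 0/1 labels):
# Sketch: conventional ROC - actual classes as response, fitted probabilities as predictor
probs <- predict(model_lm, newdata = ion_tst, type = "response")
ion_roc_alt <- roc(response = ion_tst$Y, predictor = probs, ci = TRUE)
auc(ion_roc_alt)   # area under the curve on the held-out test set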
library(pROC)
library(ggplot2)   # ggroc() returns a ggplot object, so ggplot2 must be attached
roc.list <- roc(Y ~ V3 + V5 + V8 + V22 + V26, data = ion_trn)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, l = 25))
dat.ci.list <- lapply(ci.list, function(ciobj)
  data.frame(x = as.numeric(rownames(ciobj)),
             lower = ciobj[, 1],
             upper = ciobj[, 3]))
p <- ggroc(roc.list) + theme_minimal() +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", alpha = 0.7,
              color = "grey", size = 1.5) +
  coord_equal()
# add a shaded confidence band for each predictor's ROC curve
for(i in seq_along(dat.ci.list)) {
  p <- p + geom_ribbon(
    data = dat.ci.list[[i]],
    aes(x = x, ymin = lower, ymax = upper),
    fill = i + 1,
    alpha = 0.2,
    inherit.aes = FALSE)
}
p   # render the plot
Back to the classes results.
Precision is the number of true positives divided by all the predicted positives:
Precision = TP / (TP + FP)
• numerator: all the actual “good” classes that were classified as “good”.
• denominator: all the predicted “good” classes, whether they were true or false.
The F1-score considers both precision and recall (recall is the same as sensitivity): it is
the harmonic mean of the two. The F1-score is best if there is some sort of balance between
precision (𝑝) and recall (𝑟) in the system. On the other hand, the F1-score is low if one
measure is improved at the expense of the other. For example, if 𝑃 = 1 and 𝑅 = 0, the F1
score is 0.
F1-Score = 2 ⋅ (Recall ⋅ Precision) / (Recall + Precision)
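As a quick check, the F1 entry in the class results above can be recomputed from the reported
Precision and Recall; a minimal sketch:
# Sketch: F1 recomputed from the reported Precision and Recall
p_val <- 0.8680556; r_val <- 0.9328358
2 * p_val * r_val / (p_val + r_val)   # about 0.8993, matching the F1 entry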
Balanced Accuracy is simply the arithmetic mean of Sensitivity and Specificity:
Balanced Accuracy = (Sensitivity + Specificity) / 2
Prevalence is a term borrowed from medicine, where it is the proportion of individuals with a
disease. I’m a rocket scientist, not a medical doctor, but in our setting it is the
proportion of returns that really are “bad”. So, it provides the percentage of the data whose
class is “bad”. And this takes us to some general rules of thumb.
Rules of Thumb
Accuracy is a great measure but only when you have symmetric data (false negatives &
false positives counts are close), and if false negatives and false positives have similar costs.
If the costs of false positives and false negatives are different, then F1 is your savior. F1
is best if you have an uneven class distribution.
Precision is how sure you are of your true positives, while recall is how sure you are that
you are not missing any positives.
Choose Sensitivity if false positives are far more acceptable than false negatives. In other
words, if the occurrence of false negatives is so unacceptable/intolerable that you’d rather
get some extra false positives (false alarms) than miss some true positives. For example, you
would rather have some healthy people labeled diabetic than leave a diabetic person labeled
healthy.
Choose Precision if you want to be more confident of your true positives. For example, you
would rather have some spam emails in your inbox than some regular emails in your spam box.
So, Outlook wants to be extra sure that email 𝑋 is spam before it is sent to your spam box
and you never get to see it.
Choose Specificity if you want to cover all true negatives, meaning you don’t want any false
alarms or you don’t want any false positives. For example, if you’re running a drug test in
which all people who test positive will immediately go to jail, you don’t want anyone drug-
free going to jail. False positives here are intolerable.
Table of Results Plot
This is a pretty rudimentary plot, but it does give an idea of the proportions of TPs, FPs,
TNs, and FNs. The top left box represents TNs and the bottom right represents TPs; FPs are
under the TNs and FNs are above the TPs. We’ll make our visual better by creating a heatmap
in the next section.
plot(as.table(results))
Heatmap of the Confusion Matrix
Define the Elements of the Confusion Matrix
First, we create a dataframe called data comprised of the actual and predicted classes, two
columns of (“0”, “1”)s. We also rename the column headings, using “Actual” and “Predicted”.
Second, we form a dataframe named actual and populate it with a table of the actuals from
data. That is, we compute the frequencies using data$Actual. Then, we rename the column
headings: “Actual” and “ActualFreq”.
Third, we generate another dataframe named confusion comprised of a table of data$Actual and
data$Predicted. Since “Actual” and “Predicted” were already columns of data, the predicted
frequencies become the third column, and we rename the columns: “Actual”, “Predicted”, and
“Freq”.
Finally, we merge the confusion dataframe and the actual dataframe, sorted by “Actual”. This
gives us a dataframe populated with actuals, predicted values, predicted frequencies and
actual frequencies (“Actual”, “Predicted”, “Freq”, “ActualFreq”). Since we want the heatmap
to show percentages rather than counts, we divide the frequencies, “Freq”, by the totals in
“ActualFreq”, giving us a fifth column of percentages, called “Percent”.
data = data.frame(cbind(ion_trn$Y_trn, pred))
names(data) = c("Actual", "Predicted")
#compute frequency of actual categories
actual = as.data.frame(table(data$Actual))
names(actual) = c("Actual","ActualFreq")
#build confusion matrix
confusion = as.data.frame(table(data$Actual, data$Predicted))
names(confusion) = c("Actual","Predicted","Freq")
#calculate percentage of test cases based on actual frequency
confusion = merge(confusion, actual, by="Actual")
confusion$Percent = confusion$Freq/confusion$ActualFreq*100
Render The Heatmap Plot
We’ll create the plot using three layers and then render it:
• Layer 1: first we draw tiles and fill them with color based on the percentage of test cases.
• Layer 2: we define the text that will fill the heatmap, including the values for the
percentage of FP, TP, FN, and TN.
• Layer 3: we draw diagonal tiles. We use alpha = 0 so as not to hide the previous layers, but
size = 1 to highlight the border.
• Then we render the plot.
# Layer 1
tile <- ggplot() + theme(text = element_text(size=16)) +
  geom_tile(aes(x=Predicted, y=Actual, fill=Percent), data=confusion, color="black", size=0.1) +
  labs(x="Predicted", y="Actual")
# Layer 2
tile = tile +
  geom_text(aes(x=Predicted, y=Actual, label=sprintf("%.1f", Percent)), data=confusion, size=5, colour="black") +
  scale_fill_gradient(low="grey", high="red")
# Layer 3
tile = tile +
  geom_tile(aes(x=Predicted, y=Actual), data=subset(confusion, as.character(Predicted)==as.character(Actual)), color="blue", size=1, fill="black", alpha=0)
# Render
tile
References
1. Sigillito, V., Wing, S., Hutton, L., & Baker, K. (1988). Ionosphere. UCI Machine
Learning Repository.