Credit Card Default
Background
Yu-Chen (Amber) Lu
Date: 2018-12-21
Abstract
This project finds the best-fitting algorithm for predicting the amount of given credit (in NT dollars) from the other variables, selecting the important predictors with the Bayesian information criterion.
0. Introduction
A high credit card default rate can put a business in trouble, or even bankrupt it. The purpose of this project is
to predict whether a client will default on her/his credit card, so that the business can underwrite credit cards more
carefully for potential clients who are highly likely to be unable to pay their bills.
To gain more insight into this dataset, I performed exploratory data analysis using ggplot2. Another perspective
of this project is to predict the limit balance of a credit card. I used the “Default of Credit Card Clients
Dataset”, downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00350/.
This dataset contains information on default payments, demographic factors, credit data, history
of payment, and bill statements of credit card clients in Taiwan from April 2005 to September
2005.
(17) BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
(18) BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
(19) PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
(20) PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
(21) PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
(22) PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
(23) PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
(24) PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
(25) default.payment.next.month: Default payment (1=yes, 0=no)
3) Citation
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive
accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2),
2473-2480.
4) Library sources
library(plyr)
library(dplyr)
library(tree)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(gbm)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(class)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(gam)
## ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5
## 1 1 20000 2 2 1 24 2 2 -1 -1 -2
## 2 2 120000 2 2 2 26 -1 2 0 0 0
## 3 3 90000 2 2 2 34 0 0 0 0 0
## 4 4 50000 2 2 1 37 0 0 0 0 0
## 5 5 50000 1 2 1 57 -1 0 -1 0 0
## 6 6 50000 1 1 2 37 0 0 0 0 0
## PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
## 1 -2 3913 3102 689 0 0 0
## 2 2 2682 1725 2682 3272 3455 3261
## 3 0 29239 14027 13559 14331 14948 15549
## 4 0 46990 48233 49291 28314 28959 29547
## 5 0 8617 5670 35835 20940 19146 19131
## 6 0 64400 57069 57608 19394 19619 20024
## PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## 1 0 689 0 0 0 0
## 2 0 1000 1000 1000 0 2000
## 3 1518 1500 1000 1000 1000 5000
## 4 2000 2019 1200 1100 1069 1000
## 5 2000 36681 10000 9000 689 679
## 6 2500 1815 657 1000 1000 800
## default.payment.next.month
## 1 1
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
glimpse(CreditCard)
## Observations: 30,000
## Variables: 25
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ LIMIT_BAL <dbl> 20000, 120000, 90000, 50000, 50000,...
## $ SEX <int> 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2,...
## $ EDUCATION <int> 2, 2, 2, 2, 2, 1, 1, 2, 3, 3, 3, 1,...
## $ MARRIAGE <int> 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2,...
## $ AGE <int> 24, 26, 34, 37, 57, 37, 29, 23, 28,...
## $ PAY_0 <int> 2, -1, 0, 0, -1, 0, 0, 0, 0, -2, 0,...
## $ PAY_2 <int> 2, 2, 0, 0, 0, 0, 0, -1, 0, -2, 0, ...
## $ PAY_3 <int> -1, 0, 0, 0, -1, 0, 0, -1, 2, -2, 2...
## $ PAY_4 <int> -1, 0, 0, 0, 0, 0, 0, 0, 0, -2, 0, ...
## $ PAY_5 <int> -2, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, ...
## $ PAY_6 <int> -2, 2, 0, 0, 0, 0, 0, -1, 0, -1, -1...
## $ BILL_AMT1 <dbl> 3913, 2682, 29239, 46990, 8617, 644...
## $ BILL_AMT2 <dbl> 3102, 1725, 14027, 48233, 5670, 570...
## $ BILL_AMT3 <dbl> 689, 2682, 13559, 49291, 35835, 576...
## $ BILL_AMT4 <dbl> 0, 3272, 14331, 28314, 20940, 19394...
## $ BILL_AMT5 <dbl> 0, 3455, 14948, 28959, 19146, 19619...
## $ BILL_AMT6 <dbl> 0, 3261, 15549, 29547, 19131, 20024...
## $ PAY_AMT1 <dbl> 0, 0, 1518, 2000, 2000, 2500, 55000...
## $ PAY_AMT2 <dbl> 689, 1000, 1500, 2019, 36681, 1815,...
## $ PAY_AMT3 <dbl> 0, 1000, 1000, 1200, 10000, 657, 38...
## $ PAY_AMT4 <dbl> 0, 1000, 1000, 1100, 9000, 1000, 20...
## $ PAY_AMT5 <dbl> 0, 0, 1000, 1069, 689, 1000, 13750,...
## $ PAY_AMT6 <dbl> 0, 2000, 5000, 1000, 679, 800, 1377...
## $ default.payment.next.month <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
# Get some plots of the data set
CreditCard_plot = CreditCard
CreditCard_plot$LIM_cut = cut(as.numeric(as.character(CreditCard_plot$LIMIT_BAL)),
c((0:8)*130000), right = FALSE,
labels = c("0-130K", "130K-260K", "260K-390K",
"390K-520K", "520K-650K", "650K-780K",
"780K-910K", "910K-1040K")) # Categorize LIMIT_BAL
CreditCard_plot$Age_cut = cut(as.numeric(as.character(CreditCard_plot$AGE)),
c(seq(20,80,10)), right = FALSE) # Categorize AGE
# Convert format
CreditCard_plot$default.payment.next.month =
as.character(CreditCard_plot$default.payment.next.month)
CreditCard_plot$EDUCATION = as.character(CreditCard_plot$EDUCATION)
# Plot 1 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(LIM_cut, Age_cut)) +
geom_bar(stat = "identity", aes(fill = Age_cut), position = "dodge") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs( title = "Age and Limit Balance Categories", x = "Limit Balance Categories (NTD)",
y = "Age Categories (Years)")
[Figure: bar chart "Age and Limit Balance Categories"; x: Limit Balance Categories (NTD), 0-130K through 910K-1040K; y and fill: Age Categories (Years)]
# Plot 2 --------------------------------------------------------------------------
# The original dataset's default.payment.next.month column shows 0 and 1,
# which mean the credit card does not default and does default, respectively.
# This is not clear for someone who does not know the dataset,
# so I changed the labels of default.payment.next.month on the plot
ggplot(data=CreditCard_plot, aes(x=AGE)) +
geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(default.payment.next.month ~., labeller=default_labeller) +
geom_vline(data=CreditCard_plot, aes(xintercept=mean(AGE, na.rm=T)),
linetype="dashed", size=1, colour="red") +
labs(title = "Histogram of Age and Credit Card Default", x = "Age (Years)",
y = "Count", tag = "B")
## Warning: The labeller API has been updated. Labellers taking `variable`and
## `value` arguments are now deprecated. See labellers documentation.
[Figure B: "Histogram of Age and Credit Card Default"; x: Age (Years), y: Count, faceted by No Default / Default]
## Comment: The histograms show a lower percentage of credit card default
## for people between ages 25 and 40. Also, most of the clients are
## between ages 25 and 35.
# Plot 3 --------------------------------------------------------------------------
CreditCard_plot2 = CreditCard_plot
CC_Default = CreditCard_plot$default.payment.next.month
CreditCard_plot2$default.payment.next.month[CC_Default=="1"] = "Default"
CreditCard_plot2$default.payment.next.month[CC_Default=="0"] = "No Default"
CreditCard_plot2$SEX[CreditCard_plot$SEX=="1"] = "Male"
CreditCard_plot2$SEX[CreditCard_plot$SEX=="2"] = "Female"
ggplot(data=CreditCard_plot2, aes(x=AGE)) +
geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(default.payment.next.month ~SEX) +
geom_vline(data=CreditCard_plot2, aes(xintercept=mean(AGE, na.rm=T)),
linetype="dashed", size=1, colour="red") +
labs(title = "Histogram of Age, Gender, and Credit Card Default", x = "Age (Years)",
y = "Count", tag = "C")
[Figure C: "Histogram of Age, Gender, and Credit Card Default"; x: Age (Years), y: Count, faceted by Default status (rows) and Female / Male (columns)]
# Plot 4 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(x=LIMIT_BAL, colour=default.payment.next.month)) +
stat_density(geom="line",position="identity") +
stat_density(geom="line", aes(color = "default.payment.next.month")) +
labs(title = "Density of Limit Balance and Credit Card Default",
x = "Limit Balance (NTD)", y = "Density", tag = "D") +
scale_colour_discrete(name="Default", breaks=c("0", "1", "default.payment.next.month"),
labels=c("No", "Yes", "All (Yes and No)"))
7
[Figure D: "Density of Limit Balance and Credit Card Default"; x: Limit Balance (NTD), y: Density, lines for No, Yes, and All (Yes and No)]
## Comment: The light blue line, which represents the density of credit card
## default, has a high peak at a limit balance of about 10,000 NTD.
## This might tell us that credit cards are too easy to approve without
## careful consideration of applicants' credit scores.
# Plot 5 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(MARRIAGE, fill = default.payment.next.month)) +
labs(title = "Stacked Bar Chart of Marital Status and Credit Card Default",
subtitle = ("0: Missing Data; 1: Married; 2: Single; 3: Others"),
x = "Marital status", y = "Count", tag = "E") + geom_histogram(bins = 7) +
scale_fill_discrete(name="Default", breaks=c("0", "1"), labels=c("No", "Yes"))
[Figure E: "Stacked Bar Chart of Marital Status and Credit Card Default"; x: Marital status (0-3), fill: Default (No / Yes)]
# Plot 6 --------------------------------------------------------------------------
edu_count_table = as.data.frame(table(CreditCard_plot$EDUCATION))
edu_count_table$Prob = edu_count_table$Freq / sum(edu_count_table$Freq)
colnames(edu_count_table) = c("Edu", "Freq", "Prob" )
[Figure F: "Pie Chart of Education Level"; slices: Missing Data, Graduate, University, High school, Others, Unknown]
## BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
## Min. : -6676 Min. :-17710 Min. :-61506 Min. :-65167
## 1st Qu.: 2988 1st Qu.: 2694 1st Qu.: 2500 1st Qu.: 2142
## Median : 20185 Median : 20300 Median : 19834 Median : 19120
## Mean : 48509 Mean : 47284 Mean : 45182 Mean : 42037
## 3rd Qu.: 59626 3rd Qu.: 57920 3rd Qu.: 54734 3rd Qu.: 50176
## Max. :613860 Max. :581775 Max. :578971 Max. :548020
## BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
## Min. :-53007 Min. :-339603 Min. : 0 Min. : 0
## 1st Qu.: 1503 1st Qu.: 1150 1st Qu.: 0 1st Qu.: 0
## Median : 18478 Median : 18028 Median : 1636 Median : 1534
## Mean : 39540 Mean : 38271 Mean : 3397 Mean : 3389
## 3rd Qu.: 47853 3rd Qu.: 47424 3rd Qu.: 3478 3rd Qu.: 3310
## Max. :547880 Max. : 514975 Max. :300000 Max. :358689
## PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 1222 Median : 1000 Median : 1000 Median : 1000
## Mean : 3367 Mean : 3156 Mean : 3219 Mean : 3442
## 3rd Qu.: 3000 3rd Qu.: 2939 3rd Qu.: 3000 3rd Qu.: 2974
## Max. :508229 Max. :432130 Max. :332000 Max. :345293
## default.payment.next.month
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
# Define function to calculate default rate based on one factor
# (N, names, and default are assumed to be defined in an earlier chunk;
# the end of the loop body was lost in extraction and is reconstructed here)
DefaultRate = function(tab){
  DefaultRateList = list()
  for (i in 1:length(N)){
    factor = names[N[i]]
    fre_cc = as.data.frame(table(CreditCard[factor]))
    fre_de = as.data.frame(table(default[factor]))
    # Left join
    fre_table = merge(fre_cc, fre_de, by='Var1', all.x=TRUE)
    fre_table[is.na(fre_table)] = 0 # Replace NA with 0
    colnames(fre_table) = c(factor, 'AllData', 'Default')
    fre_table$Rate = fre_table$Default / fre_table$AllData # Default rate
    DefaultRateList[[i]] = fre_table
  }
  return(DefaultRateList)
}
DefaultRateMat = DefaultRate(CreditCard)
# Plot 7 --------------------------------------------------------------------------
# Extract LIMIT_D_R default rate
LimitBal_D_R = as.data.frame(DefaultRateMat[[1]])
LimitBal_D_R$LIM_cut = cut(as.numeric(as.character(LimitBal_D_R$LIMIT_BAL)),
c((0:8)*130000), right = FALSE,
labels = c("0-130K", "130K-260K", "260K-390K",
"390K-520K", "520K-650K", "650K-780K",
"780K-910K", "910K-1040K")) # Categorize LIMIT_BAL
LimitBal_D_R$Rate_cut = cut(as.numeric(as.character(LimitBal_D_R$Rate)),
c(seq(0,1.1,0.1)), right = FALSE) # Categorize Defualt Rate
[Figure G: "Probability of Default Rate Based on Limit Balance"; default-rate bands from 20%-30% to more than 60% across limit-balance categories]
## Comment: The average default rate is relatively low
## when the limit balance is between 250,000 and 500,000.
## The volatility of the default rate is relatively high
## when the limit balance is between 500,000 and 750,000.
# Plot 8 --------------------------------------------------------------------------
Edu_D_R = as.data.frame(DefaultRateMat[[3]])
Edu_D_R$Rate = as.numeric(as.character(Edu_D_R$Rate))
Edu_D_R$EDUCATION = c("MissingData", "Graduate", "University",
"HighSchool", "Others", "Unknown1", "Unknown2")
[Figure H: bar chart "Count of Default and Non-Default Based on Education Level"; x: Education Level (MissingData, Graduate, University, HighSchool, Others, Unknown1, Unknown2), y: Count, fill: Default Yes / No]
# (the first lines of this chunk were lost in extraction; reconstructed as a
# point plot of the education-level default rate)
ggplot(data=Edu_D_R, aes(x=EDUCATION, y=Rate)) + geom_point() +
labs(title = "Scatter Plot of Default Rate and Education Level",
x = "Education Level", y = "Default Rate", tag = "I")
[Figure I: "Scatter Plot of Default Rate and Education Level"; x: Education Level, y: Default Rate (about 0.00-0.25)]
## Comment: Clients with higher education levels have lower default rates.
# Plot 9 --------------------------------------------------------------------------
Age_D_R = as.data.frame(DefaultRateMat[[5]])
[Figure J: "Level of Default Rate Based on Age"; x: Age (Years), y: Count of All Data, coloured by default rate (0.0-0.6)]
## Comment: Most cardholders are between ages 22 and 40.
## The variance of the default rate is higher for clients older than 60.
## 0 1 2 3
##
## 1 0 0 2 6 0
## 1 1 1690 2633 30
## 2 1 2370 2940 63
## 3 12 1048 894 36
## 4 0 18 23 1
## 5 0 48 46 1
## 6 0 14 11 0
## 2 0 0 2 4 0
## 1 3 2032 4176 20
## 2 5 4472 4080 99
## 3 32 1813 1015 67
## 4 0 34 45 2
## 5 0 102 81 2
## 6 0 14 10 2
## We can see some unknown values and 0s in this dataset
##### Note: those missing values could be predicted as well,
##### but that is not included in this project.
# Data Cleaning
# Remove rows with Education=0,5,6 and MARRIAGE=0,3 and LIMIT_BAL,SEX,AGE=0
without0 = apply(CreditCard,1, function(x) all(x[2:6]!=0) && x[4]!=5 && x[4]!=6 && x[5]!=3)
CreditCard = CreditCard[without0,]
There are 24 factors against the amount of given credit. In order to avoid overfitting, I selected the most
important factors using forward stepwise selection.
Algorithm: Forward Stepwise Selection
a. Let M0 denote the null model, which contains no predictors.
b. For k = 0, 1, . . . , p − 1, where p is the number of predictors:
b.1 Consider all p − k models that augment the predictors in Mk with one additional predictor.
b.2 Choose the best among these p − k models, and call it Mk+1. Here "best" is defined as having the smallest RSS, or equivalently the largest R².
c. Select a single best model from among M0, M1, . . . , Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
At step c, I chose the Bayesian Information Criterion (BIC) to estimate the prediction error. The BIC
gives unnecessary variables a much greater penalty, so it is more efficient at avoiding overfitting.
Criterion: Bayesian Information Criterion. For a least squares model with d predictors, up to irrelevant
constants,
BIC = (1/n)(RSS + d σ̂² log n)
Since log n > 2 for n > 7, the BIC places a heavier penalty on models with many variables.
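The selection step above can be sketched with base R's step(), which performs forward stepwise selection and, when k = log(n), penalizes by BIC rather than AIC. The sketch below runs on the built-in mtcars data as a stand-in (the report itself uses regsubsets on the credit card data):

```r
# Illustrative sketch: forward stepwise selection with a BIC penalty,
# using base R's step() on built-in mtcars (stand-in for the credit data)
n = nrow(mtcars)
null_model = lm(mpg ~ 1, data = mtcars)           # M0: no predictors
fwd = step(null_model,
           scope = ~ cyl + disp + hp + wt + qsec, # candidate predictors
           direction = "forward",
           k = log(n),                            # log(n) penalty => BIC
           trace = 0)
names(coef(fwd))                                  # predictors chosen by BIC
```

With this penalty, step() stops adding a variable as soon as the RSS improvement no longer outweighs the log(n) cost per coefficient, which is exactly why BIC tends to select smaller models than AIC.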
2. Quantitative factors as responses: statistical learning on LIMIT_BAL against other factors
# Set up the 70% for train set and the rest of 30% for test set
train = round(nrow(CreditCard) * 0.7,0)
train = sample(nrow(CreditCard) ,train)
CreditCard.train = CreditCard[train, ]
CreditCard.test = CreditCard[-train, ]
# Determine which parameters are more important using forward selection
# (this chunk was lost in extraction; reconstructed to mirror the later
# default.payment.next.month version; regsubsets() is from the leaps package)
Forward = regsubsets(LIMIT_BAL ~., data = CreditCard.train,
                     method="forward", nvmax=length(CreditCard)-1)
Forward.summary = summary(Forward)
# Determine how many parameters will be used using **Bayesian Information Criterion (BIC)**
plot(Forward.summary$bic,type='b',col="blue", pch=19, xlab = "Number of Variables",
ylab = "Cross-Validated Prediction Error",
main = "Forward Stepwise Selection using BIC")
points(which.min(Forward.summary$bic), Forward.summary$bic[which.min(Forward.summary$bic)],
col="red", pch=19)
[Figure: "Forward Stepwise Selection using BIC"; x: Number of Variables, y: Cross-Validated Prediction Error (about -9000 at the minimum, marked in red)]
# Linear Regression
fit.lm = lm(formulaQ2, data = CreditCard.train)
# Predict
yhat.lm = predict(fit.lm, CreditCard.test)
# Test MSE
mse_lm = round(mean((yhat.lm - CreditCard.test$LIMIT_BAL)^2), 4)
paste("The test MSE using linear regression is", mse_lm)
##################################################################################
# Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet package)
tryalpha = seq(0,1,0.1)
x.train = model.matrix(formulaQ2, data = CreditCard.train)
x.test = model.matrix(formulaQ2, data = CreditCard.test)
y = CreditCard.train$LIMIT_BAL
mse_glmnet = rep(NA, length(tryalpha))
for (i in 1:length(tryalpha)){
  # Loop body reconstructed (the original was lost in extraction):
  # cross-validate lambda for each alpha and record the test MSE
  fit.glmnet = cv.glmnet(x.train, y, alpha = tryalpha[i])
  yhat.glmnet = predict(fit.glmnet, newx = x.test, s = "lambda.min")
  mse_glmnet[i] = mean((yhat.glmnet - CreditCard.test$LIMIT_BAL)^2)
}
plot(tryalpha, mse_glmnet, xlab = "Alpha", ylab = "Test Mean-Squared Errors",
main = "Test MSE using Regularized Generalized Linear Models")
[Figure: "Test MSE using Regularized Generalized Linear Models"; x: Alpha, y: Test Mean-Squared Errors (about 1.05595e10 to 1.0561e10)]
# Test MSE
# Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet)
paste("The lowest test MSE using glmnet is",
min(mse_glmnet), "with alpha =",
tryalpha[which.min(mse_glmnet)], "as alpha is in [0, 1]")
## [1] "The lowest test MSE using glmnet is 10559269010.7383 with alpha = 0.6 as alpha is in [0, 1]"
##################################################################################
# Tree
# Fit a regression tree to the training set
fit.tree = tree(formulaQ2, data = CreditCard.train)
# Test MSE
tree.pred = predict(fit.tree, newdata = CreditCard.test)
mse.tree = mean((tree.pred - CreditCard.test$LIMIT_BAL)^2)
print (paste("The test MSE using tree is", round(mse.tree,4) ))
# Bagging / random forest over candidate mtry values (the loop head was lost
# in extraction; trymtry = 10:16 matches the range shown in the plot)
trymtry = 10:16
mse.bag = rep(NA, length(trymtry))
for (i in 1:length(trymtry)){
  fit.bag = randomForest(formulaQ2, data = CreditCard.train, mtry = trymtry[i])
  yhat.bag = predict(fit.bag, newdata = CreditCard.test)
  mse.bag[i] = round(mean((yhat.bag - CreditCard.test$LIMIT_BAL)^2),4)
}
plot(trymtry, mse.bag, type = "b", xlab = "Number of variables", ylab = "Test MSE",
main = "Test MSE using Bagging Approach")
[Figure: "Test MSE using Bagging Approach"; x: Number of variables (10-16), y: Test MSE (about 8.7e9 to 8.9e9)]
plot(lambdas, test.err, type = "b", xlab = "Shrinkage values", ylab = "Test MSE",
main = "Test MSE using Boosting Algorithm")
[Figure: "Test MSE using Boosting Algorithm"; x: Shrinkage values, y: Test MSE]
mse_boosting = round(min(test.err), 4)
paste("The test MSE using boosting is", mse_boosting)
## Test.MSE
## lm 10560901025
## glmnet 10559269011
## tree 11370064101
## bag 8686941488
## boosting 9383846760
I performed machine learning on this credit card dataset using five algorithms: linear regression (lm),
lasso and elastic-net regularized generalized linear models (glmnet), a regression tree, bagging, and boosting.
From the test MSE table, we can see that linear regression is not a good fit for predicting the limit balance,
even when tuning the mixing parameter α from 0 to 1. Among these five algorithms, the bagging approach
has the lowest test MSE. This is because bagging (bootstrap aggregation) reduces the variance, and hence
the prediction mean-squared error, of a statistical learning method: it draws many bootstrap training sets
from the data, builds a separate prediction model on each, and averages the resulting predictions. Thus the
test MSE is lower than that of a single tree.
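The variance-reduction argument for bagging can be illustrated with a tiny base-R simulation on synthetic data (not the credit card set): an average of B = 25 bootstrap estimates varies much less across repetitions than a single bootstrap estimate.

```r
# Toy illustration of why bagging helps: averaging B bootstrap estimates
# has lower variance than a single one. Synthetic data only.
set.seed(1)
x = rnorm(200, mean = 5)
# 1000 repetitions of a single bootstrap mean (B = 1)
single = replicate(1000, mean(sample(x, replace = TRUE)))
# 1000 repetitions of an average of B = 25 bootstrap means
bagged = replicate(1000, mean(replicate(25, mean(sample(x, replace = TRUE)))))
c(var_single = var(single), var_bagged = var(bagged))
```

The same effect carries over to trees: each bagged tree is high-variance on its own, but the averaged prediction is much more stable.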
Then I applied the same process to whether a client defaults on the payment next month (default.payment.next.month)
against the other factors.
# Determine which parameters are more important using forward selection
Forward = regsubsets(default.payment.next.month ~., data = CreditCard.train,
method="forward", nvmax=length(CreditCard)-1)
Forward.summary = summary(Forward)
# Determine how many parameters will be used using **Bayesian Information Criterion (BIC)**
plot(Forward.summary$bic,type='b',col="blue", pch=19,
xlab = "Number of Variables",
ylab = "Cross-Validated Prediction Error",
main = "Forward Stepwise Selection using BIC")
points(which.min(Forward.summary$bic),
Forward.summary$bic[which.min(Forward.summary$bic)],
col="red", pch=19)
[Figure: "Forward Stepwise Selection using BIC"; x: Number of Variables (1-23), y: Cross-Validated Prediction Error (about -2650 at the minimum, marked in red)]
## I picked 6 parameters, which give the minimum error using BIC
######################################################################################
# Generalized Linear Model with Logistic Regression
fit.glm = glm(formulaQ3, data = CreditCard.train, family = binomial)
# Predict
pred.prob = predict(fit.glm, CreditCard.test, type = "response")
pred.glm = rep(0, length(pred.prob))
pred.glm[pred.prob > 0.5] = 1
pred.table = table(pred.glm, CreditCard.test$default.payment.next.month)
pred.table
##
## pred.glm 0 1
## 0 6604 1500
## 1 210 471
# Sensitivity: TP/P = TPR
Sensitivity = pred.table[1,1] / sum(pred.table[,1])
# Specificity: TN/N = TNR
Specificity = pred.table[2,2] / sum(pred.table[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table[1,1], pred.table[2,2]) / sum(pred.table[,])
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table[1,2],pred.table[2,1]) / sum(pred.table[,])
# Predict table
pred.table.lda = table(pred.prob.lda$class, CreditCard.test$default.payment.next.month)
pred.table.lda
##
## 0 1
## 0 6570 1457
## 1 244 514
# Sensitivity: TP/P = TPR
Sensitivity = pred.table.lda[1,1] / sum(pred.table.lda[,1])
# Specificity: TN/N = TNR
Specificity = pred.table.lda[2,2] / sum(pred.table.lda[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table.lda[1,1],pred.table.lda[2,2]) / sum(pred.table.lda[,])
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table.lda[1,2],pred.table.lda[2,1]) / sum(pred.table.lda[,])
# Predict table
pred.table.qda = table(pred.prob.qda$class, CreditCard.test$default.payment.next.month)
pred.table.qda
##
## 0 1
## 0 5574 841
## 1 1240 1130
# Sensitivity: TP/P = TPR
Sensitivity = round(pred.table.qda[1,1] / sum(pred.table.qda[,1]),4)
# Specificity: TN/N = TNR
Specificity = round(pred.table.qda[2,2] / sum(pred.table.qda[,2]),4)
# Accuracy: (TP + TN)/(P + N)
Accuracy = round(sum(pred.table.qda[1,1],
pred.table.qda[2,2]) / sum(pred.table.qda[,]),4)
# Total Error Rate: (FP + FN)/(P + N)
TotalError = round(sum(pred.table.qda[1,2],
pred.table.qda[2,1]) / sum(pred.table.qda[,]),4)
# Set up the 70% for train set and the rest of 30% for test set
train = round(nrow(CreditCard) * 0.7,0)
train = sample(nrow(CreditCard) ,train)
CreditCard.bic.train = CreditCard.bic[train, ]
CreditCard.bic.test = CreditCard.bic[-train, ]
pred.tables.knn = table(NULL)
Accuracy.knn.table = table(NULL)
for(K in 6:25){
  # (the loop body was lost at a page break: for each K it classifies the
  # test set with class::knn and stores the accuracy in Accuracy.knn.table)
}
[Figure: "Accuracy using KNN"; x: Number of K (6-25), y: Accuracy (about 0.74-0.77)]
paste("The highest accuracy using KNN is", max(Accuracy.knn.table), " with K =",
which.max(Accuracy.knn.table))
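Since the body of the KNN loop was lost above, here is a self-contained sketch of the same idea using class::knn on the built-in iris data, which stands in for the BIC-selected credit card features:

```r
library(class)  # knn(); 'class' ships with R. iris is a stand-in dataset.
set.seed(1)
idx = sample(nrow(iris), 100)                       # 100 training rows
train_x = iris[idx, 1:4];   test_x = iris[-idx, 1:4]
train_y = iris$Species[idx]; test_y = iris$Species[-idx]
# Test accuracy for each K, mirroring the report's loop over K = 6:25
acc = sapply(6:25, function(K)
  mean(knn(train_x, test_x, train_y, k = K) == test_y))
paste("Best K:", (6:25)[which.max(acc)], "accuracy:", round(max(acc), 3))
```

Note that knn() takes the training matrix, test matrix, and training labels directly; there is no separate fit step, so the whole tuning loop is just repeated prediction.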
##
## pred.gam 0 1
## 0 6814 1971
pred.table.gam.Accuracy = round(pred.table.gam[1] / length(pred.prob.gam),4)
paste("The accuracy of generalized additive model is", pred.table.gam.Accuracy)
fit.tree = tree(formulaQ3, data = CreditCard.train)
##
## pred.tree 0 1
## 0 6528 1333
## 1 286 638
# Sensitivity: TP/P = TPR
Sensitivity = pred.table.tree[1,1] / sum(pred.table.tree[,1])
# Specificity: TN/N = TNR
Specificity = pred.table.tree[2,2] / sum(pred.table.tree[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table.tree[1,1], pred.table.tree[2,2]) / sum(pred.table.tree)
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table.tree[1,2], pred.table.tree[2,1]) / sum(pred.table.tree[,])
######################################################################################
# Lasso and Elastic-Net Regularized Generalized Linear Models
# Fit a regression tree to the training set
x.train = model.matrix(formulaQ3, data = CreditCard.train)
x.test = model.matrix(formulaQ3, data = CreditCard.test)
tryalpha = seq(0,1,0.1)
acc_glmnet = rep(NA, length(tryalpha))
for (i in 1:length(tryalpha)){
  # Loop body reconstructed (the original was lost at a page break):
  # cross-validate a binomial glmnet for each alpha and record test accuracy
  fit.glmnet = cv.glmnet(x.train, CreditCard.train$default.payment.next.month,
                         family = "binomial", alpha = tryalpha[i])
  pred.glmnet = predict(fit.glmnet, newx = x.test, s = "lambda.min",
                        type = "class")
  acc_glmnet[i] = mean(pred.glmnet == CreditCard.test$default.payment.next.month)
}
[Figure: "Accuracy using Regularized Generalized Linear Models"; x: Alpha, y: Accuracy (about 0.790-0.798)]
## [1] "The highest accuracy using glmnet is 0.7977 with alpha = 0 as alpha is in [0, 1]"
P3_accuracy = data.frame("Accuracy"= c( glm.Confusion$Accuracy, lda.Confusion$Accuracy,
qda.Confusion$Accuracy, max(Accuracy.knn.table),
pred.table.gam.Accuracy, tree.Confusion$Accuracy,
max(acc_glmnet)))
rownames(P3_accuracy) = c("glm", "lda", "qda", "KNN", "gam", "tree", "glmnet" )
P3_accuracy
## Accuracy
## glm 0.8054
## lda 0.8064
## qda 0.7631
## KNN 0.7742
## gam 0.7756
## tree 0.8157
## glmnet 0.7977
# From these seven algorithms, Tree has the highest accuracy.
In part 3, I performed machine learning on this credit card dataset using seven algorithms: generalized
linear model (glm), linear and quadratic discriminant analysis, k-nearest neighbors, generalized additive
model, classification tree, and lasso and elastic-net regularized generalized linear models. From the accuracy
table, the probability of credit card default next month appears to be close to a linear function of the other
factors, given the high accuracy of the generalized linear model. In addition, the repayment status of the
previous two months is a clear distinguishing feature. Thus the classification tree has the highest accuracy,
and lda the second highest.
Conclusion
1. There are more credit card defaults at a limit balance of about 10,000 NTD. This might mean that
credit cards are too easy to issue to people who have low credit scores. The variance of the default rate
for limit balances over 500,000 NTD is higher than for other ranges of limit balance.
2. The default rate is lower for cardholders with higher education levels. Moreover, the default rate for
clients over age 60 is higher than for middle-aged and young people.
3. The best-fitting algorithm for predicting limit balance is the bagging approach.
4. The best-fitting algorithm for predicting whether a client defaults next month is the classification tree.
Reference
1. https://bradzzz.gitbooks.io/ga-dsi-seattle/content/dsi/dsi_05_classification_databases/2.1-lesson/
assets/datasets/DefaultCreditCardClients_yeh_2009.pdf
2. https://gerardnico.com/data_mining/stepwise_regression
3. http://www-math.mit.edu/~rmd/650/bic.pdf