
Project 3 Thera Bank

The document summarizes a project report for Thera Bank that aims to build a classification model to identify customers most likely to purchase a loan. It performs exploratory data analysis on a dataset of 5,000 customers, identifies patterns and outliers. Key findings include age and experience being normally distributed while income, credit spending and mortgages have many outliers. Family size and education correlate with loan acceptance, with more advanced education customers needing loans for higher studies. The analysis will inform feature selection for building a predictive model to increase the loan approval success rate.

Uploaded by

Meghapriya1234

PROJECT #3

Project report on Thera Bank


1.0 PROJECT OBJECTIVE

The objective is to build a classification model that identifies the customers with the highest
probability of purchasing a personal loan.

2.0 ASSUMPTIONS

Increase the success rate from last year's 9% by converting liability customers into personal loan
customers.

3.0 EXPLORATORY DATA ANALYSIS

Exploratory data analysis is performed on the Thera Bank data set before predictive analysis and
model building to identify potential customers. It uses histograms, identification of outliers,
and identification of specific areas (ZIP codes) to shorten the process of identifying the
customer base.

3.1 Environment set up and data analysis

3.1.1 Installation of packages and library setup

install.packages("caret")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("randomForest")
install.packages("lattice")
install.packages("ggplot2")
install.packages("scales")
library(ROCR)
library(ineq)
library(rattle)
library(RColorBrewer)

3.1.2 Setup working directory

Set up working directory and import “Thera Bank_dataset.xlsx” for further interpretation of data and
model build-up.

setwd("C:/Users/MEGHA/Desktop/Thera Bank")

3.1.3 Import the data and read data set

library(readxl)

Thera_Bank_dataset <- read_excel("Thera Bank_dataset.xlsx")

View(Thera_Bank_dataset)

dim(Thera_Bank_dataset): To find out number of observations and variables.

[1] 5000 14

To find out Class of each Feature, along with internal structure

> str(Thera_Bank_dataset)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5000 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/year) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...

colnames(Thera_Bank_dataset) <- c('ID', 'Age (in years)', 'Experience (in years)',
'Income (in K/year)', 'ZIP Code', 'Family members', 'CCAvg', 'Education', 'Mortgage',
'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard')

> head(Thera_Bank_dataset)

# A tibble: 6 x 14

ID `Age (in years)` `Experience (in~ `Income (in K/y~


<dbl> <dbl> <dbl> <dbl>
1 1 25 1 49
2 2 45 19 34
3 3 39 15 11
4 4 35 9 100
5 5 35 8 45
6 6 37 13 29

#renaming columns for ease of understanding

> names(Thera_Bank_dataset)[2]<-"age_in_years"
> names(Thera_Bank_dataset)[3]<-"experience_in_years"
> names(Thera_Bank_dataset)[4]<-"income_in_K_month"
> names(Thera_Bank_dataset)[5]<-"zip_code"
> names(Thera_Bank_dataset)[6]<-"family_size"
> names(Thera_Bank_dataset)[10]<-"did_accept_personal_loan_offer"
#target (response) variable
> names(Thera_Bank_dataset)[11]<-"have_securities_account"
> names(Thera_Bank_dataset)[12]<-"have_deposit_account"
> names(Thera_Bank_dataset)[13]<-"have_online_access"
> names(Thera_Bank_dataset)[14]<-"have_CC"

> summary(Thera_Bank_dataset)

ID age_in_years experience_in_years
Min. : 1 Min. :23.00 Min. :-3.0
1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0
Median :2500 Median :45.00 Median :20.0
Mean :2500 Mean :45.34 Mean :20.1
3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0
Max. :5000 Max. :67.00 Max. :43.0
income_in_K_month zip_code family_size
Min. : 8.00 Min. : 9307 Min. :1.000
1st Qu.: 39.00 1st Qu.:91911 1st Qu.:1.000
Median : 64.00 Median :93437 Median :2.000
Mean : 73.77 Mean :93153 Mean :2.397
3rd Qu.: 98.00 3rd Qu.:94608 3rd Qu.:3.000
Max. :224.00 Max. :96651 Max. :4.000
NA's :18
CCAvg Education Mortgage
Min. : 0.000 Min. :1.000 Min. : 0.0
1st Qu.: 0.700 1st Qu.:1.000 1st Qu.: 0.0
Median : 1.500 Median :2.000 Median : 0.0
Mean : 1.938 Mean :1.881 Mean : 56.5
3rd Qu.: 2.500 3rd Qu.:3.000 3rd Qu.:101.0
Max. :10.000 Max. :3.000 Max. :635.0
did_accept_personal_loan_offer have_securities_account
Min. :0.000 Min. :0.0000
1st Qu.:0.000 1st Qu.:0.0000
Median :0.000 Median :0.0000
Mean :0.096 Mean :0.1044
3rd Qu.:0.000 3rd Qu.:0.0000
Max. :1.000 Max. :1.0000
have_deposit_account have_online_access have_CC
Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.0000 Median :1.0000 Median :0.000
Mean :0.0604 Mean :0.5968 Mean :0.294
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.0000 Max. :1.000

family_size_fact Education_fact
1 :1464 1:2096
2 :1292 2:1403
3 :1009 3:1501
4 :1217
NA's: 18

have_deposit_account_fact have_online_access_fact
0:4698 0:2016
1: 302 1:2984
have_CC_fact zip_code_fact have_securities_account_fact
0:3530 94720 : 169 0:4478
1:1470 94305 : 127 1: 522
95616 : 116
90095 : 71
93106 : 57
92037 : 54
(Other):4406

Out of 5000 total customers, 9.6% took a personal loan and 90.4% did not.

did_accept_personal_loan_offer_fact
0:4520
1: 480
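
The class shares can be re-derived from the two counts printed above. The following is a minimal sketch; the vector name loan_counts and its element names are ours and do not appear in the report.

```r
# Counts taken from the factor summary above: 0 = no loan, 1 = loan.
loan_counts <- c(no_loan = 4520, loan = 480)

# prop.table() turns counts into shares of the total.
loan_share <- prop.table(loan_counts)
print(round(100 * loan_share, 1))
```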

Points considered:

#1. ID is only an identifier and is therefore removed from the dataset.

#2. Professional experience contains negative values (minimum -3); all negative values are replaced with 0.

#3. There are invalid ZIP codes (some with 4 digits, which need to be prefixed with 0).

#4. The data is imbalanced: only 9.6% of the customers accepted the offer.

#5. The majority of customers (about 90%) do not have a securities account.

#6. The majority of customers (about 94%) do not have a deposit account.

#7. Mortgage data is highly skewed: the median is 0, so at least half of the customers have no house mortgage.

#8. Age and professional experience are approximately normally distributed.
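
Points 2 and 3 could be handled along the following lines. This is only a sketch on a small made-up data frame (toy, with illustrative values) and not the report's actual cleaning code; note that padding turns the ZIP code into a character string.

```r
# Toy rows standing in for the real data (values are illustrative only).
toy <- data.frame(
  experience_in_years = c(-3, 1, 19, -1),
  zip_code            = c(9307, 91107, 90089, 94720)
)

# Point 2: clamp negative experience to 0.
toy$experience_in_years <- pmax(toy$experience_in_years, 0)

# Point 3: left-pad ZIP codes to five digits (result is character).
toy$zip_code <- sprintf("%05d", as.integer(toy$zip_code))

print(toy)
```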


## Univariate analysis
> hist(Thera_Bank_dataset$age_in_years,main = "Histogram of Age",xlab = "age_in_years")

X-axis: Age
Y-axis: Frequency

Inference : Age is very close to the normal distribution


#Converting the categorical variables to factor from numeric

> Thera_Bank_dataset$family_size_fact<-as.factor(Thera_Bank_dataset$family_size)
> Thera_Bank_dataset$Education_fact<-as.factor(Thera_Bank_dataset$Education)
> Thera_Bank_dataset$did_accept_personal_loan_offer_fact<-as.factor(Thera_Bank_dataset$did_accept_personal_loan_offer)
> Thera_Bank_dataset$have_deposit_account_fact<-as.factor(Thera_Bank_dataset$have_deposit_account)
> Thera_Bank_dataset$have_online_access_fact<-as.factor(Thera_Bank_dataset$have_online_access)
> Thera_Bank_dataset$have_CC_fact<-as.factor(Thera_Bank_dataset$have_CC)
> Thera_Bank_dataset$zip_code_fact<-as.factor(Thera_Bank_dataset$zip_code)
> Thera_Bank_dataset$have_securities_account_fact<-as.factor(Thera_Bank_dataset$have_securities_account)

# Grouped Bar Plot


counts <- table(Thera_Bank_dataset$family_size_fact, Thera_Bank_dataset$did_accept_personal_loan_offer)
barplot(counts, main="family_size_fact vs did_accept_personal_loan_offer",
xlab="Personal Loan No vs Yes", col=c("darkblue","red","green","yellow"),
legend = rownames(counts), beside=TRUE)

Inference : Customers with more family members are more likely to take a loan.
0-No loan
1-Loan

counts <- table(Thera_Bank_dataset$Education, Thera_Bank_dataset$did_accept_personal_loan_offer)
barplot(counts, main="Education Category vs did_accept_personal_loan_offer",
xlab="Personal Loan No vs Yes", col=c("darkblue","red","green"),
legend = c("1 Undergrad", "2 Graduate","3 Advanced/Professional"), beside=TRUE)

Inference (hypothesis) : Customers with Advanced/Professional education may require loans for higher studies


Boxplot for numerical data

boxplot(Thera_Bank_dataset$age_in_years, main = toupper("Boxplot of Age"),ylab = "age_in_years",col = "blue")

Inference : There are few outliers in the Age column

boxplot(Thera_Bank_dataset$experience_in_years,main = toupper("Boxplot of Experience"),ylab = "experience_in_years",col = "purple")

Inference : There are few outliers in the Experience column


boxplot(Thera_Bank_dataset$income_in_K_month,main = toupper("Boxplot of Monthly Income"),ylab = "Monthly Income",col = "orange")

Inference : Many outliers in the monthly income


boxplot(Thera_Bank_dataset$CCAvg,main = toupper("Boxplot of Average Spending of credit card per month"),ylab = "Average Spending",col = "red")

Inference : The average monthly credit card spending also has many outliers

boxplot(Thera_Bank_dataset$Mortgage,main = toupper("Boxplot of House Mortgage if any"),ylab = "House Mortgage",col = "red")

Inference : The house mortgage also has many outliers

Maximum outliers observed in


• Monthly income
• Average spending against Credit card
• Mortgage

3.1.4 Correlation between the numeric features

my_data <- Thera_Bank_dataset[, c(2,3,4,7,9)]


> res <- cor(my_data)
> round(res, 2)

age_in_years experience_in_years
age_in_years 1.00 0.99
experience_in_years 0.99 1.00
income_in_K_month -0.06 -0.05
CCAvg -0.05 -0.05
Mortgage -0.01 -0.01
income_in_K_month CCAvg Mortgage
age_in_years -0.06 -0.05 -0.01
experience_in_years -0.05 -0.05 -0.01
income_in_K_month 1.00 0.65 0.21
CCAvg 0.65 1.00 0.11
Mortgage 0.21 0.11 1.00
                    age_in_years experience_in_years income_in_K_month       CCAvg    Mortgage
age_in_years          1.00000000          0.99421486       -0.05526862 -0.05201218 -0.01253859
experience_in_years   0.99421486          1.00000000       -0.04657418 -0.05007651 -0.01058155
income_in_K_month    -0.05526862         -0.04657418        1.00000000  0.64598367  0.20680623
CCAvg                -0.05201218         -0.05007651        0.64598367  1.00000000  0.10990472
Mortgage             -0.01253859         -0.01058155        0.20680623  0.10990472  1.00000000

library(corrplot)
Thera_Bank_dataset_cont_vars<-Thera_Bank_dataset[,c(2:4,7,9)]
corrmatrix<-cor(Thera_Bank_dataset_cont_vars)
corrplot(corrmatrix,method='circle',type='upper',order='FPC')
Inference
1. Age and professional experience are very highly correlated (0.99).
2. Income and credit card spending are moderately correlated (0.65).
3. Income and mortgage amount are weakly correlated (0.21).
4. There is no significant correlation between years of experience and income (-0.05).
5. The plotted correlation matrix corroborates these observations. Because age and experience
carry nearly the same information, we drop the professional experience column and keep only age in years.
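
Highly correlated pairs can also be listed programmatically instead of being read off the matrix. The sketch below uses synthetic data; the data frame toy, the 0.9 threshold and the name high_pairs are assumptions made for illustration and are not taken from the report.

```r
set.seed(1)
n   <- 200
age <- round(runif(n, 23, 67))
toy <- data.frame(
  age        = age,
  experience = age - 25 + rnorm(n, sd = 1),  # nearly a linear function of age
  income     = rnorm(n, mean = 70, sd = 40)  # unrelated to age by construction
)

corr      <- cor(toy)
threshold <- 0.9  # assumed cut-off, not taken from the report

# Blank out the diagonal and lower triangle so each pair is listed once.
corr[lower.tri(corr, diag = TRUE)] <- NA
high <- which(abs(corr) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)
```

One column of each listed pair is then a candidate for removal, as done above for professional experience.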

4.0 Cluster Analysis


Only the numerical features are used for clustering.
The categorical features are not considered because k-means relies on distances, which are not
meaningful for category codes.
1. Age in Years
2. Experience
3. Monthly Income
4. CCAvg
5. Mortgage

wss <- (nrow(my_data)-1)*sum(apply(my_data,2,var))
> for(i in 2:15) wss[i] <- sum(kmeans(my_data, centers=i, iter.max=15)$withinss)
> plot(1:15, wss, type="b", main="15 clusters", xlab="no. of clusters", ylab="within-cluster sum of squares")

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters
into which the data may be clustered. The Elbow Method is one of the most popular methods to
determine this optimal value of k (clusters).
We now demonstrate this method using the K-Means clustering technique.
Inference : Based on the elbow curve, 4 clusters is a reasonable choice.
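
To make the elbow idea concrete, here is a minimal sketch on synthetic data with three well-separated groups. The data, the names toy and max_k, and the use of nstart = 10 are our assumptions; the report's own run uses my_data.

```r
set.seed(42)
# Three well-separated synthetic blobs in two dimensions.
centers <- rbind(c(0, 0), c(10, 10), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(k) {
  cbind(rnorm(50, centers[k, 1]), rnorm(50, centers[k, 2]))
}))

max_k <- 8
wss   <- numeric(max_k)
for (k in 1:max_k) {
  # tot.withinss is the total within-cluster sum of squares for k clusters.
  wss[k] <- kmeans(toy, centers = k, nstart = 10)$tot.withinss
}
print(round(wss, 1))
```

Plotting wss against 1:max_k with type = "b" gives the elbow curve; in this toy case the drop flattens sharply after k = 3, which is the number of groups that was generated.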

fit <- kmeans(my_data,4)


> library(cluster)
> library(fpc)

Getting the cluster means


mydata <- data.frame(my_data,fit$cluster)
> cluster_mean <- aggregate(mydata,by = list(fit$cluster),FUN = mean)
> cluster_mean

Group.1 age_in_years experience_in_years income_in_K_month


1 1 45.35738 20.12164 56.98742
2 2 45.73072 20.40566 51.66331
3 3 44.62385 19.47706 129.17431
4 4 44.44778 19.44667 139.28778
CCAvg Mortgage fit.cluster
1 1.566745 141.411074 1
2 1.387648 0.000000 2
3 3.044434 343.116208 3
4 3.605644 1.925556 4

"It is important to remember that Data Analytics Projects require a delicate balance between
experimentation, intuition, but also following (once in a while) a process to avoid getting fooled by
randomness and 'finding results and patterns' that are mainly driven by our own biases and not by
the facts/data themselves"
Because k-means is sensitive to outliers, we re-cluster the data after removing them.
> my_data2<-my_data
> outliers3 <- boxplot(my_data2$income_in_K_month, plot=FALSE)
> outliers3<-outliers3$out
> my_data2 <- my_data2[-which(my_data2$income_in_K_month %in% outliers3),]
> outliers4 <- boxplot(my_data2$CCAvg, plot=FALSE)
> outliers4<-outliers4$out
> my_data2 <- my_data2[-which(my_data2$CCAvg %in% outliers4),]
> outliers5 <- boxplot(my_data2$Mortgage, plot=FALSE)
> outliers5<-outliers5$out
> my_data2 <- my_data2[-which(my_data2$Mortgage %in% outliers5),]
> nrow(my_data2)

[1] 4402
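
The removal step relies on the boxplot rule (values beyond 1.5 times the interquartile range from the hinges). The sketch below shows the same rule on a made-up income vector; the names toy, out_values and toy_clean are ours. It filters with a logical condition, which is a slightly safer variant than -which(...) because it still keeps all rows when no outliers are found.

```r
# Toy income vector with one extreme value (illustrative numbers only).
toy <- data.frame(income = c(39, 64, 73, 98, 45, 81, 500))

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses.
out_values <- boxplot.stats(toy$income)$out

# Logical filtering stays correct even when out_values is empty.
toy_clean <- toy[!toy$income %in% out_values, , drop = FALSE]

print(out_values)
print(nrow(toy_clean))
```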

Plotting elbow curve for the outlier removed data


Inference : 5 clusters make sense here

fit2<-kmeans(my_data2,5)
> my_data3 <- data.frame(my_data2)
> cluster_mean_2 <- aggregate(my_data3,by = list(fit2$cluster),FUN = mean)
> cluster_mean_2

Group.1 age_in_years experience_in_years income_in_K_month


1 1 45.41659 20.198930 59.32114
2 2 35.10999 9.868307 71.47323
3 3 43.52968 18.610350 142.68493
4 4 47.66155 22.370759 32.45961
5 5 54.64748 29.233094 81.78273
CCAvg Mortgage
1 1.6275914 144.8599465
2 1.8128075 0.0000000
3 3.7670928 2.1476408
4 0.9817447 0.8626817
5 2.0223741 0.3280576

Inference : The 5 clusters make much more sense after outlier removal

my_data2$cluster<-fit2$cluster
install.packages("dplyr")
library(dplyr)
head(my_data2)

# A tibble: 6 x 6
age_in_years experience_in_ye~ income_in_K_mon~ CCAvg Mortgage cluster
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 25 1 49 1.6 0 2
2 45 19 34 1.5 0 4
3 35 9 100 2.7 0 2
4 37 13 29 0.4 155 1
5 53 27 72 1.5 0 5
6 50 24 22 0.3 0 4

index<-as.integer(row.names.data.frame(my_data2))

did_accept_personal_loan_offer<-Thera_Bank_dataset[index,10]

my_data2$Personal_loan<-did_accept_personal_loan_offer

head(my_data2)

age_in_years experience_in_y~ income_in_K_mon~ CCAvg Mortgage cluster Personal_loan$d~

1 25 1 49 1.6 0 2 0

2 45 19 34 1.5 0 4 0

3 35 9 100 2.7 0 2 0

4 37 13 29 0.4 155 1 0

5 53 27 72 1.5 0 5 0

6 50 24 22 0.3 0 4 0

To check the Personal_loan vs Cluster barchart


# Grouped Bar Plot

counts <- table(my_data2$Personal_loan, my_data2$cluster)

barplot(counts, main="Cluster vs Personal Loan", xlab="Cluster",
col=c("red","green"), legend = c("Personal_Loan_No","Personal_Loan_Yes"), beside=TRUE)
Inference :

Since the target population is close to 500 customers, we will target the cluster 4 segment to achieve a
higher conversion ratio.

This would help the company spend its marketing budget on the customers most likely to convert.

5.0 CART Model


Creating Training and Testing Dataset

The given data set is divided into training and testing data sets in a 70:30 proportion. The
distribution of the responder and non-responder classes is verified in both data sets to ensure
that it is nearly the same.
library(caret)
set.seed(111)
trainIndex <- createDataPartition(Personal_loan,p=0.7,
list = FALSE,
times = 1)
train.data <- Thera_Bank_dataset[trainIndex,2:length(Thera_Bank_dataset)]
test.data <- Thera_Bank_dataset[-trainIndex,2:length(Thera_Bank_dataset)]
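
The idea of a class-preserving 70:30 split can be sketched in base R as follows. This is an illustration on a made-up target vector y with about 10% positives, not caret's implementation; the names y, train_idx, train_y and test_y are ours.

```r
set.seed(111)
# Toy target with roughly the report's class balance (about 10% positives).
y <- factor(c(rep(0, 900), rep(1, 100)))

# Sample 70% of the row indices within each class separately.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts should show the same share of responders.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```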

5.1 Model Building - CART (Unbalanced Dataset)

Setting the control parameter inputs for rpart
library(rpart)
r.ctrl <- rpart.control(minsplit = 100,
minbucket = 10,
cp = 0,
xval = 10)
#The ID column was already excluded when train.data was created
cart.train <- train.data
m1 <- rpart(formula = Personal_loan~.,
data = cart.train,
method = "class",
control = r.ctrl)
#install.packages("rattle")
#install.packages("RColorBrewer")
library(rattle)
fancyRpartPlot(m1)

printcp(m1)

Classification tree:
rpart(formula = Personal_loan ~ ., data = cart.train, method =
"class", control = r.ctrl)
Variables actually used in tree construction:
[1] CCAvg Education Family_members Income_Monthly
Zip_code
Root node error: 308/3062 = 0.10059
n= 3062

CP nsplit rel error xerror xstd

1 0.3214286 0 1.000000 1.00000 0.054039

2 0.1525974 2 0.357143 0.36688 0.033871

3 0.0487013 3 0.204545 0.21753 0.026283

4 0.0097403 5 0.107143 0.23052 0.027039

5 0.0000000 6 0.097403 0.23377 0.027224


5.2 Pruning the cart tree to ensure that the model is not overfitting

“Overfitting happens when a model learns the detail and noise in the training data to the extent that
it negatively impacts the performance of the model on new data. The problem is that these concepts
do not apply to new data and negatively impact the models ability to generalize”
plotcp(m1)

We use 0.045 as the pruning parameter (cp) and rebuild the tree.

ptree <- prune(m1, cp = 0.045)
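
As a cross-check on the pruning parameter, two widely used heuristics can be applied to the CP table printed by printcp(m1) above. The sketch below copies that table into a data frame (the names cp_table, best, limit and one_se are ours); it is not the rule the report used to arrive at 0.045.

```r
# CP table copied from the printcp(m1) output above.
cp_table <- data.frame(
  CP     = c(0.3214286, 0.1525974, 0.0487013, 0.0097403, 0.0000000),
  nsplit = c(0, 2, 3, 5, 6),
  xerror = c(1.00000, 0.36688, 0.21753, 0.23052, 0.23377),
  xstd   = c(0.054039, 0.033871, 0.026283, 0.027039, 0.027224)
)

# Heuristic 1: the row with the lowest cross-validated error.
best <- which.min(cp_table$xerror)

# Heuristic 2 (one-standard-error rule): the smallest tree whose xerror
# is within one xstd of the minimum.
limit  <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= limit))

print(cp_table[unique(c(best, one_se)), ])
```

Both heuristics point to the row with 3 splits (CP 0.0487013). The value 0.045 used above keeps 5 splits, whose cross-validated error (0.23052) is within one standard error of that minimum.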


printcp(ptree)

Classification tree:
rpart(formula = Personal_loan ~ ., data = cart.train, method =
"class", control = r.ctrl)

Variables actually used in tree construction:


[1] CCAvg Education Family_members Income_Monthly
Zip_code

Root node error: 308/3062 = 0.10059

n= 3062

CP nsplit rel error xerror xstd

1 0.321429 0 1.00000 1.00000 0.054039

2 0.152597 2 0.35714 0.36688 0.033871

3 0.048701 3 0.20455 0.21753 0.026283

4 0.045000 5 0.10714 0.23052 0.027039

fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges"))
5.3 Predicting on the test set

## Scoring Holdout sample

cart.test <- test.data

cart.test$predict.class = predict(ptree, cart.test,type = "class")

x<-cart.test$Personal_loan

cart.test$predict.score = predict(ptree, cart.test, type = "prob")

library(caret)

confusionMatrix(table(as.factor(x),cart.test$predict.class ))

5.4 Confusion Matrix and Statistics

0 1

0 1744 22

1 25 147

Accuracy : 0.9757

95% CI : (0.9679, 0.9821)

No Information Rate : 0.9128

P-Value [Acc > NIR] : <2e-16

Kappa : 0.8489

Mcnemar's Test P-Value : 0.7705

Sensitivity : 0.9859

Specificity : 0.8698

Pos Pred Value : 0.9875

Neg Pred Value : 0.8547

Prevalence : 0.9128

Detection Rate : 0.8999

Detection Prevalence : 0.9112

Balanced Accuracy : 0.9278

'Positive' Class : 0
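
The headline statistics can be re-derived from the four counts in the table above. A minimal sketch follows; the object names are ours. The reported sensitivity and specificity correspond to column-wise ratios, with class 0 as the positive class.

```r
# Counts from the confusion matrix above, laid out as printed
# (matrix() fills column by column).
cm <- matrix(c(1744, 25, 22, 147), nrow = 2,
             dimnames = list(row = c("0", "1"), col = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```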

5.6 AUC/ROC performance metrics

ROC curve for pruned tree


library("ROCR")
Pred.cart = predict(ptree, newdata = cart.test, type = "prob")[,2]
Pred2 = prediction(Pred.cart, cart.test$Personal_loan)
plot(performance(Pred2, "tpr", "fpr"))
abline(0, 1, lty = 2)

Computing AUC
auc.tmp <- performance(Pred2,"auc")
auc <- as.numeric(auc.tmp@y.values)
print(auc)
[1] 0.973238
Inference : The area under the curve is about 0.97
Result : The CART model achieved about 97.5% accuracy on the test data in predicting which
customers will take a personal loan.
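
For intuition, the AUC equals the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder. The sketch below computes it from ranks on six made-up scores; the function auc_rank and the toy vectors are ours and are not part of ROCR.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic, scaled).
auc_rank <- function(score, label) {
  r     <- rank(score)          # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
label <- c(1,   1,   1,    0,   0,   0)

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc_rank(score, label))
```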

6.0 Random Forest model

library(randomForest)

library(caret)
trainIndex <- createDataPartition(Personal_loan, p=0.7, list = FALSE, times = 1)

Thera_Bank_dataset_2 <- Thera_Bank_dataset[,-5]

train.data <- Thera_Bank_dataset_2[trainIndex,2:length(Thera_Bank_dataset_2)]

train.data$Personal_loan<-as.factor(train.data$Personal_loan)

train.data<-na.omit(train.data)

test.data <- Thera_Bank_dataset_2[-trainIndex,2:length(Thera_Bank_dataset_2)]

test.data<-na.omit(test.data)

test.data$Personal_loan<-as.factor(test.data$Personal_loan)

model1 <- randomForest(Personal_loan ~ ., ntree = 100,data = train.data, importance = TRUE)

model1

Call:

randomForest(formula = Personal_loan ~ ., data = train.data, ntree = 100, importance = TRUE)

Type of random forest: classification

Number of trees: 100

No. of variables tried at each split

OOB estimate of error rate: 1.41%

0 1 class.error

0 2730 6 0.002192982

1 37 277 0.117834395
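
The OOB error rate and the per-class errors follow directly from the counts in this table. A short sketch of the arithmetic (object names are ours; rows are taken as the actual class and columns as the predicted class):

```r
# OOB confusion counts from the output above (matrix() fills by column).
oob <- matrix(c(2730, 37, 6, 277), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all misclassified cases over all cases.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(class_error)
print(round(100 * oob_error, 2))
```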

Pred_rf <- predict(model1, test.data, type = 'class')


confusionMatrix(test.data$Personal_loan, Pred_rf)
Confusion Matrix and Statistics

Reference

Prediction 0 1

0 1766 2

1 25 139

Accuracy : 0.986

95% CI : (0.9797, 0.9908)

No Information Rate : 0.927

P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9039

Mcnemar's Test P-Value : 2.297e-05

Sensitivity : 0.9860

Specificity : 0.9858

Pos Pred Value : 0.9989

Neg Pred Value : 0.8476

Prevalence : 0.9270

Detection Rate : 0.9141

Detection Prevalence : 0.9151

Balanced Accuracy : 0.9859

'Positive' Class : 0

Result : Random forest has performed very well with 98.6% accuracy on the test data

ROC curve for random forest

library(pROC)
Pred_rf <- predict(model1, test.data, type = 'prob')[,2]
rf.roc<-roc(test.data$Personal_loan,Pred_rf)
plot(rf.roc)
auc(rf.roc)
Area under the curve: 0.9975
Inference : The ROC curve is very close to ideal
Inference:
• Monthly Income and Education are the most significant factors in deciding whether a customer takes a personal loan.
• Random forest performed better, with 98.6% accuracy on the test data, compared to the CART
model, which gave 97.5% accuracy in predicting the people who will take a personal loan on the
test data.
