Algorithm Comparison For Data Mining Classification: Assessing Bank Customer Credit Scoring Default Risk
ABSTRACT
Rating consumer credit risk involves assessing the risk of credit applications; thus, every business must appropriately
identify debtors and non-debtors. This study uses machine learning approaches to model consumer credit risk and
compares the results to the logistic model, determining whether machine learning improves the rating of client defaults.
The study examines how customer attributes affect the default experience. Despite advances in machine learning models for
credit assessment, unbalanced datasets and some algorithms' inability to explain their forecasts remain major issues.
This study used 2005 data on Taiwanese credit card consumers' education, age, marital status, payment history, and sex.
The default experience is modeled using Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree,
Random Forest, AdaBoost, and Gradient Boosting. The models were evaluated on accuracy, precision, recall, the
receiver operating characteristic (ROC) curve, and the precision-recall curve. Random Forest performed best, with a
ROC metric of 97%, outperforming the other models across the accuracy metrics. The logistic model underperformed, while
machine learning improved the default classification.
Keywords: Credit scoring; artificial intelligence; machine learning; classification techniques; logistic regression
The applicant's final score, which the lender will use to decide whether to grant the loan, is based on a threshold or cut-off score, the threshold value (Tc). A loan applicant's status is usually (0) for good and (1) for bad. The model's score is f(x) for new loan applications. If this score is below Tc, the loan is approved; otherwise, it is denied.

LITERATURE REVIEW

This section highlights how financial institutions in both developed and emerging nations are developing and implementing cutting-edge technologies based on Artificial Intelligence and Machine Learning (AI-ML) strategies to deal with their various credit risks. The majority of financial organizations today deal with various risks daily; credit risk, operational risk, market risk, and liquidity risk are a few of these dangers (Leo et al. 2019).

Few writers have discussed the socioeconomic implications of determining a client's credit score in earlier research papers, which have mostly focused on a customer's demographics and statistical factors (Moradi & Mokhatab Rafiei, 2019). The authors emphasized that political alterations have an impact on economic aspects as well. In order to estimate credit risk, they also took politico-economic issues into account. To anticipate whether a specific loan is performing, they created an adaptive network-based fuzzy inference system. Most banks and other financial institutions now prioritize social and economic effects due to Covid-19. One of the writers evaluated customers' credit scores using data from an Iranian bank, especially when societal and economic conditions are exceptional. Using the features of Iranian bank clients' behavior as input, they assessed credit scoring with a fuzzy inference method, outperforming more traditional models, especially during economic crises.

Researchers have stressed the non-linear and non-parametric correlations between the factors influencing bank lending and the volume of outstanding loans (Ozgur et al. 2021). They showed how 19 macroeconomic, local, and international variables impacted Turkish bank loans between 2002Q4 and 2019Q2. They contrasted the regression model with ML-based approaches to determine how these factors affect the results. The authors also pointed out that conventional linear regression methods struggled with the extremely high dimensionality of the datasets, whereas ML-based techniques were able to accommodate this. For the majority of their debt recovery management, banking institutions rely on outside sources, which entails increased expenses and market risks. Therefore, it is always advisable to have a reliable strategy in place for predicting debt repayment before extending any credit to debtors.

Some authors classified the sufficiency of the borrower using ML approaches such as Random Forest (RF) and AdaBoost (Aniceto et al. 2020). Using a loan database from a Brazilian bank, the researchers examined various ML techniques and evaluated the suitability of borrowers. Low-income borrowers of large Brazilian financial institutions make up the majority of the data sets, and the portfolio's default rate was close to 48%. Using real-world data, they developed a machine learning (ML) model and showed that Random Forest and AdaBoost performed better than competing methods. Some authors suggested using decision tree models to determine whether a loan is likely to be performing or non-performing. Most academics emphasized that credit scoring is a classification problem (Boughaci & Alkhawaldeh, 2018). They compared the German and Australian credit data sets with well-known classifier benchmarks, coupling the Support Vector Machine (SVM) model with the Local Search method (LS), the Stochastic Local Search technique (SLS), and the Variable Neighborhood Search (VNS) approach to determine a person's credit score.

MACHINE-LEARNING CLASSIFICATION TECHNIQUES IN CREDIT SCORING

The major goal is to create a model that can effectively categorize and measure borrower repayment behavior as well as anticipate borrower loan applications. This section provides an overview of the most popular modern machine-learning classification approaches that are pertinent to this research and were used to create the credit-scoring models.

LOGISTIC REGRESSION

Logistic Regression (LR) is a specific type of Generalized Linear Model (GLM), a generalization of the ideas of ordinary linear models. Logistic regression is therefore similar to linear regression and is employed in this analysis to solve a classification problem. A binary outcome variable, typically denoted by 0 or 1, is modeled using LR. According to Thomas, the scoring model's result must be binary (accept/good loan, 0; reject/bad loan, 1), based on a number of independent variables (Ala'raj & Abbod 2015).
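For illustration, a minimal scikit-learn sketch of such a logistic-regression scoring model is given below. It is not the study's actual code: the feature matrix X, the label y, and the cut-off Tc are synthetic placeholders (the real dataset is described later), and the approve/reject rule simply mirrors the threshold logic above.

# Sketch: logistic regression as a credit-scoring model (assumed, not the authors' code).
# X: placeholder client attributes, y: default flag (1 = bad/default, 0 = good).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                  # placeholder for six client attributes
y = (rng.random(1000) < 0.22).astype(int)       # ~22% defaults, mirroring the dataset's imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Scale the features, then fit the logistic model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Score new applications: approve if the predicted default probability is below a cut-off Tc.
Tc = 0.5                                        # illustrative threshold, not taken from the paper
p_default = model.predict_proba(X_test)[:, 1]
decision = np.where(p_default < Tc, "approve", "reject")
print(decision[:10])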
K-NEAREST NEIGHBORS

K-Nearest Neighbors (KNN) is one of the most widely applied credit-scoring techniques. It belongs to the category of non-parametric classification methods, and it is well known that non-parametric classifiers are frequently affected by outliers, especially when the training sample size is small. Numerous credit-scoring researchers have utilized KNN to evaluate the risk involved in making a loan to a business or a person (Mukid et al. 2018). The k-nearest neighbor classifier is commonly based on the Euclidean distance between the test sample and the given training samples. The primary principle of the k-NN method is that the training data are used to select the k nearest neighbors of each new point that needs to be predicted; the average of the values of the new point's k closest neighbors can then be used as the prediction (Zhang & Wang 2016).

DECISION TREES

Decision trees are now frequently used to fit data, anticipate default, and improve credit rating. Decision-tree algorithms work top-down, selecting the variable that divides the dataset "best" at each stage. Any of a number of criteria, such as the Gini index, information value, or entropy, can be used to determine what is "best." An outcome is predicted by following the tree's branches from the starting (root) node to a leaf node; the final leaf node contains the solution. Classification trees deliver nominal responses, such as "true" or "false," while regression trees produce numerical results (Bastos, 2007).

SUPPORT VECTOR MACHINE

Another effective machine-learning method used in classification and credit-scoring problems is the SVM. Due to its excellent outcomes, it is widely employed in the field of credit scoring as well as other areas. SVMs take the form of a linear classifier: given inputs belonging to two classes, an SVM predicts which of the two classes each input most likely belongs to. Binary classification is accomplished by constructing the finest hyperplane (line) that divides the input data into two groups (good and bad credit).

SVM can be used in both linear and non-linear separation settings. The latter uses a basis expansion h(x), which can be converted back to a non-linear boundary in the original space, to construct the linear boundary in an extended and transformed version of the feature space. It is necessary to understand how the kernel function K computes the inner products of vectors in the transformed space by using the original space X as input (Dastile et al. 2020).

Fitting linear classifiers and regressors with convex loss functions, such as those of (linear) Support Vector Machines and Logistic Regression, is straightforward with stochastic gradient descent (SGD). Text categorization and NLP are two areas where SGD has been successfully applied to tackle sparse and massive machine-learning challenges. SGD does not belong to a particular family of machine-learning models; strictly speaking, it is only an optimization technique whose sole purpose is to train a model. A straightforward stochastic gradient descent learning procedure that supports various classification penalties and loss functions is implemented by the SGDClassifier class (Condori-Alejo et al. 2021).
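As a rough illustration of how the classifiers above could be fitted, the sketch below reuses the placeholder X_train/X_test/y_train/y_test from the logistic-regression sketch; the hyperparameters (number of neighbors, tree depth, kernel) are illustrative choices, not values reported by the study.

# Sketch: fitting the classifiers discussed above with scikit-learn (assumed, not the authors' code).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True)),
    # SGDClassifier with hinge loss trains a linear SVM by stochastic gradient descent.
    "Linear SVM via SGD": make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=42)),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")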
THE DATASET

The dataset chosen for this study includes anonymized information on 30,000 Taiwanese credit card customers from 2005, with client characteristics used as explanatory variables (Yeh & Lien, 2009). The dataset includes characteristics of the clients, including whether or not they were in default on their obligations. Of the 30,000 debtors, 23,364 paid back their loans, while 6,636 missed payments; around 78 percent of the dataset's debtors are good debtors, and 22 percent are bad debtors. In this study, the outcome variable was a binary variable called default payment (Yes = 1, No = 0). As shown in Table 1, the data are divided into 23 columns with various numerical values, and categorical information such as education is also encoded as a numerical value.
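A minimal sketch of loading such data and checking the class balance described above is given below. The file name "credit_default.csv" and the label column name are assumptions about a local copy of the UCI "default of credit card clients" data, not values taken from the paper.

# Sketch: load the credit card default data and inspect the class balance (assumed, not the authors' code).
import pandas as pd

df = pd.read_csv("credit_default.csv")           # hypothetical local path to the UCI table
target = "default payment next month"            # assumed label column (1 = default, 0 = no default)

print(df.shape)                                  # expected: (30000, 24) for the full table
print(df[target].value_counts())                 # expected: roughly 23,364 non-defaults vs 6,636 defaults
print(df[target].value_counts(normalize=True))   # roughly 78% good vs 22% bad debtors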
DATA PREPROCESSING

Feature Selection (FS) is a technique for figuring out the most helpful features. Preprocessing of the dataset is necessary when it contains useless data that is noisy (outliers), unreliable (missing), or inconsistent. Any extraneous and correlated data were eliminated from the dataset in order to increase model accuracy and obtain useful results. We then applied data cleaning, discretization, and target-class balancing to our data to produce a dataset that would work well with our algorithms.

DATA CLEANING

The methods employed to address missing data involve excluding the affected records or imputing them with a predetermined value. In the case of noisy data, many techniques can be utilized, including binning algorithms, clustering, a combination of human and machine inspection, and regression analysis. Inconsistencies can be fixed manually. Some values in the dataset have no official meaning on the UCI site; for example, the Education attribute's values should range from one to four, but some records also had values greater than four (331).

FEATURES SELECTION

Feature selection is a method to reduce dimensionality. Its primary use is the extraction of discrete subsets of pertinent features from the original dataset according to the assessment criterion.

DATA TRANSFORMATION

Data transformation is the process of converting data from one format to another so that it can be used for data mining. A few techniques for doing transformation include normalization, smoothing, aggregation, and generalization. The values in the dataset were all expressed as numbers in the records; for instance, categorical information such as sex was encoded as "1" for male and "2" for female. That was problematic because leaving these codes as numbers would make the data less interpretable; therefore, we needed to alter some columns to make them better suited for analyzing the outcome. We converted these attributes (sex, education, and marital status) to string representations.

DATA REDUCTION

Analyzing enormous amounts of data requires a lot of time. Data reduction can be accomplished using data cube aggregation, data compression, dimensionality reduction, concept hierarchy generation, and discretization. Because a group of academics concluded that discretization enhances the efficiency of the naive Bayesian algorithm (Lustgarten, 2008), we discretized continuous variables in our dataset. Therefore, the Age attribute was split into ten-year bins, and the Limit_bal attribute was reclassified from its values in New Taiwan dollars ((0-100,000), (100,001-500,000), and over 500,001) into the labels Low, Medium, and High. Our next step in the preprocessing process was data reduction, where we shrank the dataset to obtain two equal class representations, the default and no-default classes. From 30,000 records, we reduced the data to 13,210 records, split 50/50 with 6,605 records for each class. To improve performance, we eliminated all redundant information from our dataset; as a result, there are now only six attributes instead of 24. After that, a 70/30 split was performed randomly across the full data set to create a training set and a test set.

Sampling methods such as SMOTE, kNN, and Tomek-links can be applied to imbalanced datasets such as this one. However, the study relied on a relatively simple random over-sampling method for the response variable due to technical limitations and time constraints. Model performance was evaluated on the test data set, which was used to assess the models' predictive capabilities after training on the training set.
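A minimal sketch of these preprocessing steps is given below, continuing from the dataframe df and label column target loaded earlier. The column names (AGE, LIMIT_BAL, SEX, EDUCATION, MARRIAGE, PAY_0), the bin edges, and the choice of random under-sampling of the majority class to obtain two equal classes are assumptions used for illustration, not the study's exact procedure.

# Sketch: discretization, encoding, class balancing, and 70/30 split (assumed, not the authors' code).
import pandas as pd
from sklearn.model_selection import train_test_split

# Discretize AGE into ten-year bins and LIMIT_BAL into Low/Medium/High labels.
df["AGE_BIN"] = pd.cut(df["AGE"], bins=range(20, 90, 10))
df["LIMIT_BAL_BIN"] = pd.cut(
    df["LIMIT_BAL"],
    bins=[0, 100_000, 500_000, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Replace the numeric sex codes with readable categories.
df["SEX"] = df["SEX"].map({1: "male", 2: "female"})

# Approximate the paper's balanced 13,210-record set by random under-sampling
# of the majority (no-default) class down to the size of the default class.
defaults = df[df[target] == 1]
non_defaults = df[df[target] == 0].sample(n=len(defaults), random_state=42)
balanced = pd.concat([defaults, non_defaults]).sample(frac=1.0, random_state=42)

# Keep an illustrative set of six attributes, one-hot encode the categoricals,
# and perform the random 70/30 train/test split.
features = ["LIMIT_BAL_BIN", "SEX", "EDUCATION", "MARRIAGE", "AGE_BIN", "PAY_0"]
X = pd.get_dummies(balanced[features])
y = balanced[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)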
The following are the most widely used metrics for evaluating the effectiveness of credit-scoring models among the many assessment metrics used in the literature (Dastile et al. 2020). A confusion matrix is a common tool for assessing a classifier's performance (see Table 2). All of the instances in the data set are displayed in a confusion matrix and are divided into four groups:

TABLE 2. Confusion matrix description

Observed class      Predicted Class (1)=Good    Predicted Class (0)=Bad
Class (1)=Good      True Positive               False Negative
Class (0)=Bad       False Positive              True Negative
TP (True Positive): the positive outcomes that the model correctly predicted; in our case, the model correctly predicted defaults that actually occurred in the data.

TN (True Negative): the negative outcomes that the model correctly predicted; in our case, the model correctly identified non-defaults that were indeed non-defaults in the real data.

FP (False Positive): the positive outcomes that the model predicted incorrectly; in our case, observations the model projected as defaults that were not defaults in the real data.

FN (False Negative): the negative outcomes that the model predicted incorrectly; in our case, observations the model projected as non-defaults that were in fact defaults in the real data.

The consequences of misclassification in the context of credit rating are very different. False positives cause the lender to lose some or all of the interest and the principal that was supposed to be repaid, whereas false negatives only represent the opportunity cost of the interest that could have been earned. Because these individuals were given a loan after the model classified them as good borrowers, false positives are substantially more expensive (Bunker et al. 2016; Nyangena 2019).

The true positive rate is the proportion of actual positives that were accurately identified as such:

TPR = TP / (TP + FN)    (1)

Similarly, the true negative rate is the percentage of actual bad (negative) cases that were accurately categorized as such:

TNR = TN / (TN + FP)    (2)

One of the most often used metrics in the field of accounting and finance, specifically for credit-rating applications, is the Percentage Correctly Classified (PCC). The PCC rate calculates the percentage of cases in a given data set that are correctly classified as having good or bad credit, and it is an important factor to consider when assessing the proposed scoring models' capacity for classification:

PCC = (TP + TN) / (TP + TN + FP + FN)    (3)

To assess the accuracy of the model's predictions of loan default, the Receiver Operating Characteristic (ROC) curve, a standard classification statistic, is used. The probability of binary outcomes, which in our situation are default and non-default, can be assessed using the ROC curve. The ROC curve compares the ratio of false positives to true positives (Osei et al. 2021).

The advantage of the ROC curve is that different thresholds or modeling techniques can be compared using the area under the curve (AUC), with a greater AUC signifying a better model. Movement along a line indicates a change in the threshold used to classify a positive instance, and each line on the plot represents the curve for a single model. The threshold is 0 at the upper right and 1 at the lower left. The AUC, which ranges from 0 to 1 with a good model scoring higher, is the area under the ROC curve; a ROC AUC of 0.5 results from a model making random predictions. Since Sensitivity and (1 - Specificity) are plotted, the ROC curve effectively plots the True Positive Rate against the False Positive Rate.
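A minimal sketch of computing the confusion matrix and the metrics defined above with scikit-learn is shown below. It continues from the X_train/X_test/y_train/y_test produced in the preprocessing sketch, uses a decision tree as a stand-in for any of the fitted classifiers, and assumes the default class is encoded as 1; none of this is the authors' code.

# Sketch: confusion matrix, TPR, TNR, PCC, and ROC AUC for one fitted model (assumed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, roc_auc_score

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows/columns follow scikit-learn's label order [0, 1], so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

tpr = tp / (tp + fn)                   # equation (1): true positive rate (sensitivity/recall)
tnr = tn / (tn + fp)                   # equation (2): true negative rate (specificity)
pcc = (tp + tn) / (tp + tn + fp + fn)  # equation (3): percentage correctly classified (accuracy)

p_default = clf.predict_proba(X_test)[:, 1]
print(f"TPR={tpr:.3f}  TNR={tnr:.3f}  PCC={pcc:.3f}")
print(f"accuracy={accuracy_score(y_test, y_pred):.3f}  "
      f"F1={f1_score(y_test, y_pred):.3f}  "
      f"ROC AUC={roc_auc_score(y_test, p_default):.3f}")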
MODEL RESULT ANALYSIS AND DISCUSSION

The model performance was assessed using the metrics presented above. It can be difficult and subjective to agree on a single criterion for evaluating performance, depending on the nature of the task at hand. The quality of a classification model can be evaluated by inspecting the confusion matrix. As described above, a 70/30 random split of the data set was used to create the training and test sets, and, given the time and technical limitations, a simple random over-sampling strategy was applied to the response variable rather than methods such as Tomek-links, kNN, or SMOTE; the test data set was then used to evaluate how well the trained models predict.

A confusion matrix is a table that compares the model's predicted classes with the actual classes of the labeled data in the validation set. The confusion matrices for each of our ensemble classifiers are shown in Figure (2).

The ideal evaluation model would actually be a profit function, expressed in terms of recall and precision, which would need to be optimized. The trade-off between the TP (profit) and the FP (cost), both of which are captured by the F-measure or the precision-recall AUC, would be used to estimate the profit. The F-measure, however, is the metric picked for this study's evaluation of the models. The model with the best F-measure can then be modified and verified in an effort to produce improved evaluation metrics and predictions.
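A sketch of ranking candidate models by F-measure on the test set, the selection criterion just described, is given below. The model list mirrors the classifiers compared in this study, but the hyperparameters are scikit-learn defaults used for illustration, not the study's settings.

# Sketch: rank the candidate classifiers by F-measure on the held-out test set (assumed, not the authors' code).
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Neighbors": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, est in candidates.items():
    est.fit(X_train, y_train)
    scores[name] = f1_score(y_test, est.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} F1 = {score:.3f}")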
The results in Table 3 below demonstrate how well the models performed on these measures after the best model for each class was chosen. Using the results table as a guide, we estimated the range of values of the measures that could be employed to evaluate our models. The Random Forest model, followed by the Decision Tree, has the best accuracy score, while the SVM Classifier has the lowest precision score. On the other hand, the Random Forest and Decision Tree both have the greatest recall scores, whereas Logistic Regression has the lowest. In contrast, the Random Forest has the highest precision score, followed by the Decision Tree, with Logistic Regression the lowest.

Figure (3) shows the ROC curve of each model. The ROC curve for Random Forest has a convex shape, which indicates lower rates of false negative and false positive errors compared with the other curves; in other words, for each given value of sensitivity and specificity, Random Forest performs optimally. In addition, there is little to no difference in performance between AdaBoost, K Neighbors, and Gradient Boosting. Random Forest and Decision Tree are the most promising alternatives, yet it is impossible to discern a preferred individual technique from the curves alone. The skewed ROC curves observed for Logistic Regression, K Neighbors, SVM, AdaBoost, and Gradient Boosting indicate that increased specificity came at the cost of markedly reduced sensitivity.

Choosing the kind of error that the Bank can tolerate when dealing with credit risk is essential. False positives force us to turn away consumers who would otherwise be profitable clients because the models incorrectly labeled them as the Bank's worst customers, which the Precision score captures. False negatives increase the Bank's risk by labeling consumers as low-risk when, in reality, they would be more likely to default and cause losses for the business, which the Recall score captures.
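For completeness, a sketch of how per-model ROC and precision-recall curves such as those in Figures 3 and 4 could be produced is shown below; the use of matplotlib and scikit-learn's display utilities, continuing from the fitted candidates above, is an assumption about tooling rather than the authors' plotting code.

# Sketch: plot ROC and precision-recall curves for each fitted model (assumed, not the authors' code).
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))
for name, est in candidates.items():
    RocCurveDisplay.from_estimator(est, X_test, y_test, name=name, ax=ax_roc)
    PrecisionRecallDisplay.from_estimator(est, X_test, y_test, name=name, ax=ax_pr)

ax_roc.plot([0, 1], [0, 1], linestyle="--", label="Random (AUC = 0.5)")
ax_roc.set_title("ROC curves")
ax_pr.set_title("Precision-recall curves")
ax_roc.legend()
plt.tight_layout()
plt.show()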
Figure (4) displays the Precision-Recall (PR) curves for each model; the AUC values for each model are all greater than 0.5, indicating that the models performed well.

CONCLUSION AND FUTURE WORKS

The study concludes that machine learning models are more effective at estimating credit risk when dealing with unbalanced information, such as credit data sets. However, the models might have performed even better had the dataset contained more features. Additionally, more advanced sampling strategies such as SMOTE might have helped to balance the unbalanced data set and boost performance. Consequently, the findings are not restricted to any one particular bank and may be applied more broadly, for example to the forecasting of early instances of corporate insolvency.

Future work should investigate various dataset pre-processing techniques, such as feature selection or data-filtering techniques, and ascertain their potential effects on the outcomes. A filtering-condensing strategy could be used rather than pure filtering, which would remove both outlier items and non-informative entries that could negatively affect the training process.

This research has received no external funding.
REFERENCES

Condori-Alejo, H. I., Aceituno-Rojo, M. R., & Alzamora, G. S. 2021. Rural micro credit assessment using machine learning in a Peruvian microfinance institution. Procedia Computer Science 187: 408–413.
Dastile, X., Celik, T., & Potsane, M. 2020. Statistical and machine learning models in credit scoring: A systematic literature survey. Applied Soft Computing 91: 106263.
Durand, D. 1941. Risk Elements in Consumer Installment Financing. National Bureau of Economic Research, New York.
Leo, M., Sharma, S., & Maddulety, K. 2019. Machine learning in banking risk management: A literature review. Risks 7(1): 29.
Moradi, S., & Mokhatab Rafiei, F. 2019. A dynamic credit risk assessment model with data mining techniques: Evidence from Iranian banks. Financial Innovation 5(1): 1–27.
Mukid, M. A., Widiharih, T., Rusgiyono, A., & Prahutama, A. 2018. Credit scoring analysis using weighted k nearest neighbor. Journal of Physics: Conference Series 1025(1): 12114.
Nyangena, B. O. 2019. Consumer Credit Risk Modelling Using Machine Learning Algorithms: A Comparative Approach. Strathmore University.
Osei, S., Mpinda, B. N., Sadefo-Kamdem, J., & Fadugba, J. 2021. Accuracies of some Learning or Scoring Models for Credit Risk Measurement.
Ozgur, O., Karagol, E. T., & Ozbugday, F. C. 2021. Machine learning approach to drivers of bank lending: Evidence from an emerging economy. Financial Innovation 7(1): 1–29.
Pławiak, P., Abdar, M., Pławiak, J., Makarenkov, V., & Acharya, U. R. 2020. DGHNL: A new deep genetic hierarchical network of learners for prediction of credit scoring. Information Sciences 516: 401–418.
Tang, L., Cai, F., & Ouyang, Y. 2019. Applying a nonparametric random forest algorithm to assess the credit risk of the energy industry in China. Technological Forecasting and Social Change 144: 563–572.
Yeh, I.-C., & Lien, C. 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36(2): 2473–2480.
Zhang, Y., & Wang, J. 2016. K-nearest neighbors and a kernel density estimator for GEFCom2014 probabilistic wind power forecasting. International Journal of Forecasting 32(3): 1074–1080.
Zhou, X., Cheng, S., Zhu, M., Guo, C., Zhou, S., Xu, P., Xue, Z., & Zhang, W. 2018. A state of the art survey of data mining-based fraud detection and credit scoring. MATEC Web of Conferences 189: 3002.