Data Preparation
The first step in any data science project is careful preparation of the data for the forthcoming analysis. This is followed by cleaning and validating the dataset to preserve its integrity.
Data Import
The first operation is importing the dataset into R, using the read.csv() function to read the CSV file and convert it into a data frame called df. The stringsAsFactors = TRUE argument automatically converts all categorical variables to factors. This ensures that variables intended to be categorical, such as Credit_Score, are handled appropriately throughout the analysis.
df <- read.csv("path_to_data.csv", stringsAsFactors = TRUE)
Data Cleaning and Pre-processing
Once the data has been loaded, cleaning operations are performed, which include the removal of duplicates and the treatment of missing values. Duplicate rows are removed using the distinct() function, which keeps only the first occurrence of each duplicated row.
Next, lapply() is applied to the numerical columns (identified by num_cols) to replace missing values. Specifically, the ifelse() function locates any NA entries in each numeric column and substitutes them with that column's median.
Finally, the Credit_Score variable is converted to a factor so that its credit score levels are treated as categories in the analysis and modeling that follow.
# Remove duplicates (distinct() comes from the dplyr package)
library(dplyr)
df <- distinct(df)
# Replace missing values in numerical columns with the median
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
# Convert Credit_Score to factor type
df$Credit_Score <- as.factor(df$Credit_Score)
Data Validation
Once cleaning is complete, the next important step is data validation to confirm consistency and integrity. The str(df) call inspects the structure of the dataset and the data types of its columns. This confirms that variables are of the intended type (e.g. numeric vs. categorical) and will not be confused during modeling.
The summary(df) function produces summary statistics, including means, medians, and quartiles. This step helps reveal whether any columns contain outliers or implausible values.
In addition, a correlation heatmap identifies multicollinearity between the numeric features. The heatmap visualizes the correlation coefficients, making it easy to spot highly correlated features at a glance so they can be addressed to avoid bias during modeling.
# Check the structure of the data frame
str(df)
# Generate summary statistics
summary(df)
# Create a correlation heatmap for numerical columns
library(corrplot)
corr_matrix <- cor(df[num_cols], use = "complete.obs")
corrplot(corr_matrix, method = "color")
H1. Missing Values and Errors in Data Detrimentally Impact Credit Score Prediction Models.
Missing values and data errors lead to inaccurate predictions. If critical numerical features such as debt amounts or income levels are missing or inconsistent, the predictive power of a credit score prediction model suffers. Although replacing missing values with the median helps preserve the usability of the data, more serious problems, such as large gaps or erroneous records, can still undermine the model's predictions. Thus, the success of a credit scoring model is governed directly by data quality and completeness.
H2. High Correlation Between Debt-to-Income Ratio and Credit Score Indicates Strong Impact on
Creditworthiness.
A high correlation within the dataset between debt-related parameters (such as the debt-to-income ratio) and credit score would indicate a heavy reliance on debt as a predictor of creditworthiness. This hypothesis proposes that the debt-to-income ratio is strongly correlated with the credit score output, and that changes in the ratio have substantial effects on predicted credit scores. If such correlations exist, it would suggest that effective debt management contributes meaningfully to creditworthiness, given its influence on credit scores.
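As a hedged sketch of how H2 could be checked, a Spearman correlation between a debt-to-income ratio and a numeric score can be computed. The column names and the synthetic data below are assumptions standing in for the real dataset:

```r
# Sketch: checking the H2 correlation on synthetic data.
# The real analysis would use df's actual columns instead.
set.seed(42)
n      <- 200
income <- runif(n, 20000, 120000)          # hypothetical Annual_Income
debt   <- runif(n, 1000, 50000)            # hypothetical Outstanding_Debt
dti    <- debt / income                    # debt-to-income ratio feature
score  <- -2 * dti + rnorm(n, sd = 0.2)    # simulated score: worsens as dti grows
r      <- cor(dti, score, method = "spearman")
r  # strongly negative on this synthetic data
```

A large absolute correlation here would support H2; on real data, cor.test() would additionally report whether the association is statistically significant.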
Objective
The purpose of the analysis is to develop a program for predicting credit scores by employing data
preprocessing, exploratory data analysis, and machine learning classification techniques. The study
focuses mainly on data cleaning, feature engineering, and model evaluation processes to enhance the
precision of credit assessment.
Objective 1 (Priyank): The first task is analyzing the distributions of key financial variables using visual methods, including histograms, box plots, and scatter plots. The analysis will examine numerical attributes such as Age, Annual Income, and Outstanding Debt, as well as categorical variables such as Occupation. This exploration will also help identify missing values, outliers, and patterns in the data, ensuring the dataset is properly prepared for modeling.
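These distribution checks can be sketched with base R graphics; the column names (Age, Annual_Income, Outstanding_Debt) are assumptions about the dataset, and a synthetic frame stands in for df:

```r
# Sketch of the Objective 1 distribution checks using base R graphics.
set.seed(1)
df_demo <- data.frame(
  Age              = sample(21:70, 100, replace = TRUE),
  Annual_Income    = rlnorm(100, meanlog = 10.5, sdlog = 0.5),
  Outstanding_Debt = runif(100, 0, 40000)
)
hist(df_demo$Annual_Income, main = "Annual Income", xlab = "Income")
boxplot(df_demo$Age, main = "Age")  # box plot highlights any outliers
plot(df_demo$Annual_Income, df_demo$Outstanding_Debt,
     xlab = "Annual Income", ylab = "Outstanding Debt")  # scatter plot
```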
Objective 2 (Abinash): The second objective is to study the relationships between financial attributes and credit scores. This will be done by conducting a correlation analysis to assess the effect of key predictors, such as outstanding debt, credit utilization ratio, and income, on credit scores. Identifying strong predictors early will help with feature selection for the machine learning models.
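As a sketch of this correlation analysis, cor.test() reports both the coefficient and a p-value; the vectors below are synthetic stand-ins for the real credit utilization and score columns:

```r
# Sketch: correlation between a predictor and a numeric score proxy.
set.seed(3)
credit_util <- runif(150, 0, 1)                        # hypothetical predictor
score_num   <- 1 - credit_util + rnorm(150, sd = 0.3)  # simulated relationship
res <- cor.test(credit_util, score_num)
res$estimate  # correlation coefficient (negative here by construction)
res$p.value   # significance of the association
```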
Objective 3 (Raghav): Feature engineering will enhance the predictive power of the dataset. New variables such as the debt-to-income ratio will be created, and encoding techniques will be applied to categorical variables. Missing numerical values will be filled with the median, while missing categorical values will be labeled "Unknown." This treatment ensures both the quality of the data and its readiness for machine learning.
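The feature engineering steps above can be sketched as follows; the toy frame and its column names are assumptions, not the actual dataset:

```r
# Sketch of Objective 3: new ratio feature plus "Unknown" for missing categories.
toy <- data.frame(
  Outstanding_Debt = c(12000, 8000, 25000),
  Annual_Income    = c(60000, 40000, 50000),
  Occupation       = factor(c("Engineer", NA, "Teacher"))
)
toy$Debt_to_Income <- toy$Outstanding_Debt / toy$Annual_Income  # new feature
levels(toy$Occupation) <- c(levels(toy$Occupation), "Unknown")  # add the level
toy$Occupation[is.na(toy$Occupation)] <- "Unknown"              # fill NAs
toy$Debt_to_Income  # 0.2 0.2 0.5
```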
Objective 4 (Dipesh): The final step is building machine learning classification models for credit score evaluation. Techniques such as Random Forest and Multinomial Logistic Regression will be used to train the models and classify credit scores into "Poor," "Standard," and "Good." Model performance will be evaluated using accuracy, precision, and confusion matrices to establish its effectiveness in predicting creditworthiness.
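As a hedged sketch of the modeling step, multinomial logistic regression via nnet::multinom (nnet ships with base R) is shown below on the built-in iris data as a stand-in for the credit dataset; a Random Forest would follow the same train/evaluate pattern using the randomForest package:

```r
# Sketch of Objective 4: train a multinomial model and evaluate it.
library(nnet)
set.seed(7)
idx   <- sample(nrow(iris), 100)     # simple train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]
fit   <- multinom(Species ~ ., data = train, trace = FALSE)
pred  <- predict(fit, newdata = test)
acc   <- mean(pred == test$Species)  # overall accuracy
table(Predicted = pred, Actual = test$Species)  # confusion matrix
```

On the real data, the response would be the three-level Credit_Score factor, and precision per class could be read off the confusion matrix.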