Project Report - Credit Card Fraud Detection
Problem Statement
To design and execute a data mining project encompassing data cleaning, preprocessing, classification, and evaluation of performance metrics.
Introduction
Given the prevalence of fraud, there is a pressing need for robust fraud detection systems. Broadly, fraud detection falls into two categories: misuse detection and anomaly detection. Misuse detection employs machine-learning-based classification models to differentiate between fraudulent and legitimate transactions. Anomaly detection, by contrast, establishes a baseline from sequential records, defining the attributes of a typical transaction and building a distinctive profile for it. This report presents a strategy for misuse detection utilizing a combination of K-Nearest Neighbors (KNN), Naive Bayes, and Decision Tree models.
Dataset Details
This dataset contains credit card transactions made by European cardholders in the year 2023. It comprises over
550,000 records, and the data has been anonymized to protect the cardholders' identities. The primary objective
of this dataset is to facilitate the development of fraud detection algorithms and models to identify potentially
fraudulent transactions.
1. Key Features:
a. id: Unique identifier for each transaction
b. V1-V28: Anonymized features representing various transaction attributes (e.g., time, location)
c. Amount: The transaction amount
d. Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
2. Target variable:
The target variable chosen is ‘Class’, which indicates whether a transaction is fraudulent.
3. Head: (items 3-5 can be reproduced with the loading and inspection sketch after this list)
4. Describe:
5. Shape:
(110177, 31)
6. Correlation Heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
# "seaborn" as a bare style name is deprecated in recent matplotlib; "seaborn-v0_8" is the current equivalent
plt.style.use("seaborn-v0_8")
plt.rcParams['figure.figsize'] = (22, 11)
title = "Correlation Heatmap"
plt.title(title, fontsize=18, weight='bold')
# Pairwise correlations between all numeric columns of the dataset
sns.heatmap(creditcard_dataset.corr(), cmap="coolwarm", annot=True)
plt.show()
● The most significant strong positive correlations are between V16, V17, and V18, and between V9 and V10.
● The most significant strong negative correlations are between V4 and V14, V4 and V12, V4 and V10, V10 and V11, V11 and V14, V11 and V12, and V21 and V22.
● There is a clear lack of high positive correlations in the range of V19 to V28.
● There are several moderate to semi-strong positive and negative correlations in the range of V1 to V18.
Code Overview:
1) Data Preprocessing
● Handling missing values by filling with the mean.
● Dropping duplicate rows.
● Standardizing features using different scalers: StandardScaler, RobustScaler, and MinMaxScaler.
● Detecting outliers using box plots, k-means clustering, and RobustScaler.
● Resampling the minority class for balancing (a minimal sketch of these steps follows below).
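A minimal sketch of these preprocessing steps, assuming the creditcard_dataset DataFrame from the loading sketch above; the use of sklearn.utils.resample for upsampling and the random_state value are assumptions, not necessarily what the project used.
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.utils import resample

# Fill missing values with the column mean and drop duplicate rows
creditcard_dataset = creditcard_dataset.fillna(creditcard_dataset.mean(numeric_only=True))
creditcard_dataset = creditcard_dataset.drop_duplicates()

# Scale the feature columns with RobustScaler to dampen the effect of outliers
# (StandardScaler or MinMaxScaler can be swapped in here for comparison)
feature_cols = [c for c in creditcard_dataset.columns if c not in ("id", "Class")]
creditcard_dataset[feature_cols] = RobustScaler().fit_transform(creditcard_dataset[feature_cols])

# Upsample the minority (fraud) class so both classes are equally represented
majority = creditcard_dataset[creditcard_dataset["Class"] == 0]
minority = creditcard_dataset[creditcard_dataset["Class"] == 1]
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_dataset = pd.concat([majority, minority_upsampled])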
2) Modeling
● Splitting the balanced data into training and testing sets.
● Implementing three classifiers: Naive Bayes, KNN, and Decision Tree.
● Evaluating each model's performance using classification reports.
● Plotting confusion matrices and ROC curves for model evaluation (a minimal sketch follows below).
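A minimal sketch of the modeling and evaluation loop, assuming the balanced_dataset from the preprocessing sketch; the 80/20 split matches the Methods section, while hyperparameters such as n_neighbors=5 are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay
import matplotlib.pyplot as plt

X = balanced_dataset.drop(columns=["id", "Class"])
y = balanced_dataset["Class"]

# 80/20 split, stratified on the target so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
    # Confusion matrix and ROC curve for visual comparison of the classifiers
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.show()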
Glossary:
1. Standard Scaler: Scales features by removing the mean and scaling to unit variance.
2. Robust Scaler: Scales features using the median and interquartile range to mitigate the effect of outliers.
3. MinMaxScaler: Scales features to a specified range (default 0 to 1); a small comparison of the three scalers follows after this glossary.
4. Confusion Matrix: A table showing true and predicted values to evaluate classifier performance.
5. ROC Curve: Graphical representation of a classifier's true positive rate against false positive rate.
6. Cross Validation: Method to assess model performance by iteratively splitting data into training and
validation sets.
7. Naive Bayes Classifier: Probabilistic algorithm based on Bayes' theorem, assuming feature
independence.
8. K-Nearest Neighbors (KNN) Classifier: Predicts by majority vote of its k-nearest neighbors in
feature space.
9. Decision Tree Classifier: Builds a tree structure to classify instances based on feature conditions.
10. Box plot: Shows the median, quartiles, and potential outliers of numerical data.
11. Scatter plot: Displays values for two variables as points on a 2D plane.
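To make the difference between the three scalers in items 1-3 concrete, here is a minimal sketch on a toy column containing one outlier; the values are purely illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# A single feature with one large outlier (illustrative values only)
amounts = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

print(StandardScaler().fit_transform(amounts).ravel())   # zero mean, unit variance
print(RobustScaler().fit_transform(amounts).ravel())     # centered on the median, scaled by the IQR
print(MinMaxScaler().fit_transform(amounts).ravel())     # squeezed into [0, 1]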
Methods:
The dataset was taken from Kaggle, a prominent online community in the fields of machine learning and data science. The code was executed on Google Colab, a cloud-based platform. Prior to feeding the data into the algorithms, all datasets were partitioned into training data (80%) and test data (20%). To enhance the robustness of the methodology and maintain consistent model performance, Stratified K-Fold cross-validation with five folds was employed. The data was further cleaned by dropping duplicate rows and filling missing values, and the skewed class distribution of the fraud data was addressed by resampling the minority class. Features were standardized, and box plots and scatter plots were used to identify outliers; outliers detected through these methods were handled with RobustScaler, which proved effective against these data anomalies. The ‘Class’ variable was isolated as the target for predictive analysis. To ensure consistency and comparability across features, Min-Max Scaling was also applied.
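A minimal sketch of the Stratified K-Fold cross-validation described above, reusing the X, y, and models objects from the earlier sketches; the F1 scoring metric and shuffling are assumptions.
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Five stratified folds preserve the class ratio in every train/validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(name, scores.mean())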
Cross Validation
Classifiers Output
Findings:
RobustScaler Advantage:
● In the absence of RobustScaler, the models show perfect scores for the majority class while failing to identify any instances of the minority class, which signals a lack of generalization and model bias. RobustScaler helps in handling outliers and can improve the models' ability to detect the minority class (a sketch of this with/without comparison follows below).
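A minimal sketch of the with/without RobustScaler comparison behind this finding, assuming train/test splits built from the unscaled features; KNN is used here only as an example classifier.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Without scaling: distance-based models are dominated by large raw feature values
raw_model = KNeighborsClassifier().fit(X_train, y_train)
print(classification_report(y_test, raw_model.predict(X_test)))

# With RobustScaler: outliers are dampened before distances are computed
scaled_model = make_pipeline(RobustScaler(), KNeighborsClassifier()).fit(X_train, y_train)
print(classification_report(y_test, scaled_model.predict(X_test)))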
Conclusion:
1. Naive Bayes
● Performs reasonably well but has slightly lower recall for fraudulent transactions.
2. KNN:
● Shows better performance in identifying both classes with higher precision and recall.
3. Decision Tree:
● Achieves perfect scores, indicating a likely overfitting issue.
These models offer different trade-offs between precision and recall for identifying fraudulent transactions. While KNN appears well balanced, the Decision Tree's perfect scores likely indicate overfitting, meaning it may not generalize well to new data. Further validation on additional held-out data could improve the models' performance and help them generalize better to new datasets.
References:
1. Kaggle
2. Stack Overflow
3. GeeksforGeeks
4. Javatpoint
5. Naive Bayes
6. Decision tree