PROJECT REPORT - CREDIT CARD FRAUD DETECTION

Submitted by: Snehal Jain


Roll no. 220727
BA (CA + Maths)
Project link: Credit Card Fraud Detection.pdf

Problem Statement

To design and execute a data mining project encompassing data cleaning, preprocessing, classification, and the evaluation of performance metrics.

Introduction

Given the prevalence of fraud, there is a pressing need for robust fraud detection systems. Broadly, fraud
detection falls into two categories: misuse detection and anomaly detection. Misuse detection employs
machine-learning-based classification models to differentiate between fraudulent and legitimate transactions.
Conversely, anomaly detection establishes a baseline from sequential records to define the attributes of a typical
transaction and create a distinctive profile for it. This report presents a misuse detection strategy built on a
blend of k-nearest neighbors (KNN), Naive Bayes, and decision tree models.

Dataset Details

This dataset contains credit card transactions made by European cardholders in the year 2023. It comprises over
550,000 records, and the data has been anonymized to protect the cardholders' identities. The primary objective
of this dataset is to facilitate the development of fraud detection algorithms and models to identify potentially
fraudulent transactions.

1. Key Features:
a. id: Unique identifier for each transaction
b. V1-V28: Anonymized features representing various transaction attributes (e.g., time, location)
c. Amount: The transaction amount
d. Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
2. Target variable:

The target variable is ‘Class’, which indicates whether a transaction is fraudulent.

3. Head: (a screenshot of the first five rows of the dataset appeared here in the original report)

4. Describe: (a screenshot of the dataset's summary statistics appeared here in the original report)
5. Shape:
(110177, 31)
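
A minimal sketch of how these summaries can be reproduced, assuming the Kaggle CSV has been downloaded locally (the filename creditcard_2023.csv is a placeholder):

import pandas as pd

# Load the anonymized transactions dataset (filename is a placeholder)
creditcard_dataset = pd.read_csv("creditcard_2023.csv")

print(creditcard_dataset.head())      # first five rows
print(creditcard_dataset.describe())  # per-column summary statistics
print(creditcard_dataset.shape)       # (rows, columns), e.g. (110177, 31)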

6. Correlation Heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# The "seaborn" style was renamed in matplotlib 3.6; use "seaborn-v0_8" on newer versions
plt.style.use("seaborn-v0_8")
plt.rcParams['figure.figsize'] = (22, 11)
plt.title("Correlation Heatmap", fontsize=18, weight='bold')
sns.heatmap(creditcard_dataset.corr(), cmap="coolwarm", annot=True)
plt.show()

● The most significant strong positive correlations are among V16, V17, and V18, and between V9 and V10.
● The most significant strong negative correlations are between V4 and V14, V4 and V12, V4 and V10, V10 and V11, V11 and V14, V11 and V12, and V21 and V22.
● There is a clear lack of strong positive correlations in the range V19 to V28.
● There are several moderate to semi-strong positive and negative correlations in the range V1 to V18.

Code Overview:

1) Data Preprocessing
● Handling missing values by filling with the mean.
● Dropping duplicate rows.
● Standardizing features using different scalers: StandardScaler, RobustScaler, and MinMaxScaler.
● Detecting outliers using box plots, the k-means clustering algorithm, and RobustScaler.
● Resampling the minority class for balancing (a sketch of these steps follows this list).
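
A minimal sketch of the preprocessing steps above, assuming creditcard_dataset is the DataFrame loaded earlier; the IQR bounds and the use of resample() follow the report's description, and the random seed is an assumption:

import pandas as pd
from sklearn.utils import resample

# Fill missing values with the column mean and drop duplicate rows
creditcard_dataset = creditcard_dataset.fillna(creditcard_dataset.mean())
creditcard_dataset = creditcard_dataset.drop_duplicates()

# Flag outliers in 'Amount' with the boxplot IQR rule
q1, q3 = creditcard_dataset['Amount'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = ((creditcard_dataset['Amount'] < q1 - 1.5 * iqr) |
            (creditcard_dataset['Amount'] > q3 + 1.5 * iqr))
print("Outliers flagged:", outliers.sum())

# Upsample the minority (fraud) class to match the majority class
majority = creditcard_dataset[creditcard_dataset['Class'] == 0]
minority = creditcard_dataset[creditcard_dataset['Class'] == 1]
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])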

2) Modeling
● Splitting the balanced data into training and testing sets.
● Implementing three classifiers: Naive Bayes, KNN, and Decision Tree.
● Evaluating each model's performance using classification reports.
● Plotting confusion matrices and ROC curves for model evaluation (a sketch of the modeling workflow follows).
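
A minimal sketch of the modeling workflow, assuming balanced is the resampled DataFrame from the preprocessing sketch; the 80/20 split and the three classifiers follow the report, while hyperparameters are left at scikit-learn defaults as an assumption:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# 'id' is a row identifier, so it is excluded from the features
X = balanced.drop(columns=['Class', 'id'])
y = balanced['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))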

Glossary:

1. StandardScaler: Scales features by removing the mean and scaling to unit variance.
2. RobustScaler: Scales features using the median and interquartile range to mitigate the effect of outliers.
3. MinMaxScaler: Scales features to a specified range (default 0 to 1). A comparison of the three scalers follows this glossary.
4. Confusion Matrix: A table of true versus predicted labels used to evaluate classifier performance.
5. ROC Curve: A plot of a classifier's true positive rate against its false positive rate.
6. Cross Validation: A method for assessing model performance by iteratively splitting the data into training and validation sets.
7. Naive Bayes Classifier: A probabilistic algorithm based on Bayes' theorem that assumes feature independence.
8. K-Nearest Neighbors (KNN) Classifier: Predicts the class of a sample by majority vote of its k nearest neighbors in feature space.
9. Decision Tree Classifier: Builds a tree structure that classifies instances based on feature conditions.
10. Box plot: Shows the median, quartiles, and potential outliers of numerical data.
11. Scatter plot: Displays values for two variables as points on a 2D plane.
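
A minimal sketch comparing the three scalers on a single feature with an obvious outlier; the values are illustrative only:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# One feature whose last value (500) is an outlier
amounts = np.array([[10.0], [20.0], [30.0], [40.0], [500.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    scaled = scaler.fit_transform(amounts)
    # RobustScaler keeps the bulk of the data on a comparable scale because
    # it centers on the median and scales by the IQR instead of mean/variance
    print(type(scaler).__name__, scaled.ravel().round(2))
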
Methods:

Preprocessing

1. Data Loading
● Import the necessary libraries: pandas, matplotlib, seaborn, and sklearn modules.
● Read the dataset using `pd.read_csv()` and describe its basic statistics and shape.
● Visualize the correlation heatmap using seaborn's `heatmap()`.
2. Handling Missing Values and Duplicates
● Fill missing values with the mean using the `fillna()` method.
● Remove duplicate rows using `drop_duplicates()`.
3. Outlier Detection
● Boxplot IQR method.
● Scatter plot using cluster analysis.
● Remove outliers using `RobustScaler`.
4. Feature Scaling
● Standardize features using `StandardScaler`.
● Handle outliers using `RobustScaler`.
● Normalize features using `MinMaxScaler`.
5. Handling Imbalanced Data
● Upsample the minority class using `resample()` to balance the dataset.
6. Train-Test Split
● Split the balanced data into training and testing sets using `train_test_split()`.

Evaluation Methods

1. Confusion Matrix
● Define a function to plot the confusion matrix for different classifiers.
2. ROC Curve Plotting Function
● Define a function to plot the ROC curve for different classifiers (a sketch of these two plotting helpers follows this section).
3. Cross Validation
● Split the data into 5 subsets using stratified k-fold cross-validation.

Classifier Performance Metrics

1. Naive Bayes Classifier
● Initialize and fit the Gaussian Naive Bayes classifier.
● Generate predictions and print the classification report.
● Plot the confusion matrix and ROC curve for the Naive Bayes classifier.
2. K-Nearest Neighbors (KNN) Classifier
● Initialize and fit the KNN classifier.
● Generate predictions and print the classification report.
● Plot the confusion matrix and ROC curve for the KNN classifier.
3. Decision Tree Classifier
● Initialize and fit the Decision Tree classifier.
● Generate predictions and print the classification report.
● Plot the confusion matrix and ROC curve for the Decision Tree classifier.
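
A minimal sketch of the two plotting helpers described above, assuming y_test, X_test, and the fitted classifiers from the modeling sketch; the function names are placeholders:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

def plot_conf_matrix(y_true, y_pred, title):
    # Heatmap of the 2x2 confusion matrix (true vs. predicted labels)
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

def plot_roc(y_true, y_score, title):
    # ROC curve from predicted probabilities of the positive class
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.2f}')
    plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(title)
    plt.legend()
    plt.show()

# Example usage with a fitted classifier `clf`:
# plot_conf_matrix(y_test, clf.predict(X_test), "Naive Bayes")
# plot_roc(y_test, clf.predict_proba(X_test)[:, 1], "Naive Bayes")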

Experimental Setup and Dataset Handling

The dataset was taken from Kaggle, a prominent online community in the fields of machine learning and data
science. The code was executed on Google Colab, a cloud-based platform. Before feeding the data into the
algorithms, I partitioned it into training data (80%) and test data (20%). To enhance the robustness of the
methodology and maintain consistent model performance, I employed Stratified K-Fold cross-validation with a
fold value of 5. I addressed the skewed nature of the credit card fraud data by dropping duplicate rows and
filling in missing values. Features were standardized, and box plots and scatter plots were used to identify
outliers; outliers detected through these methods were handled with RobustScaler, which proved effective against
such data anomalies. The 'Class' variable was isolated as the target for predictive analysis. To ensure
consistency and comparability across features, Min-Max scaling was applied, standardizing inconsistent value
ranges within the dataset.
Cross Validation

The data was cross-validated with a fold value of 5.
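
A minimal sketch of the 5-fold stratified cross-validation described above, assuming X, y, and the classifiers dictionary from the modeling sketch:

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in classifiers.items():
    # Accuracy on each of the five stratified folds
    scores = cross_val_score(clf, X, y, cv=skf, scoring='accuracy')
    print(name, scores.round(3), "mean:", scores.mean().round(3))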


Classifier Outputs

1. Naive Bayes
2. KNN Classification
3. Decision Tree

(The classifier output screenshots for each model appeared here in the original report.)
Output Analysis:

1. Naive Bayes Classifier Report (Accuracy: 0.92)

Class     Precision   Recall   F1-score
Class 0   0.87        0.98     0.92
Class 1   0.98        0.85     0.91

2. KNN Classifier Report (Accuracy: 0.94)

Class     Precision   Recall   F1-score
Class 0   0.92        0.96     0.94
Class 1   0.95        0.92     0.94

3. Decision Tree Classifier Report (Accuracy: 1.00)

Class     Precision   Recall   F1-score
Class 0   1.00        1.00     1.00
Class 1   1.00        1.00     1.00
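
As a consistency check, each F1-score is the harmonic mean of the corresponding precision and recall. For the Naive Bayes report on Class 0, for example:

F1 = 2 * (0.87 * 0.98) / (0.87 + 0.98) = 2 * 0.8526 / 1.85 ≈ 0.92

which matches the table above.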

Findings:

RobustScaler Advantage:
● Without RobustScaler, the models show perfect scores for the majority class while failing to identify any instances of the minority class, which signals a lack of generalization and model bias. RobustScaler helps handle outliers and can improve a model's ability to detect the minority class.

Conclusion:

1. Naive Bayes
● Performs reasonably well but has slightly lower recall for fraudulent transactions.

2. KNN:
● Shows better performance in identifying both classes with higher precision and recall.

3. Decision Tree:
● Achieves perfect scores, indicating a likely overfitting issue.

These models offer different trade-offs between precision and recall when identifying fraudulent transactions.
While KNN appears well balanced, the Decision Tree's perfect scores suggest overfitting, so it may not
generalize well to new data. Further validation on unseen data could improve the models and help them
generalize better to new datasets.

References:

1. Kaggle
2. Stack Overflow
3. GeeksforGeeks
4. JavaTpoint
5. Naive Bayes
6. Decision tree
