Vaishnavidocumentation
MASTER OF SCIENCE
IN
COMPUTER SCIENCE
Submitted by
VAISHNAVI A(22SPCS39)
Dr.T.S.URMILA MCA,M.Phil,Ph.D
MADURAI-625009
NOVEMBER-2023
CREDIT CARD FRAUD DETECTION
A MINI PROJECT REPORT
THIAGARAJAR COLLEGE (AUTONOMOUS),
MADURAI-625009
BONAFIDE CERTIFICATE
External Examiner
VAISHNAVI A(22SPCS39)
M.Sc., Computer Science,
Department of Computer Science,
Thiagarajar College (Autonomous),
Madurai-625009.
DECLARATION
Date: (22SPCS39)
ACKNOWLEDGEMENT
I thank all the faculty members of the Computer Science Department for their
cooperation. Above all, I am personally grateful to my family and
friends for their continuous encouragement.
CONTENTS
1.INTRODUCTION
1.1 ABSTRACT
Credit card transaction fraud costs card issuers billions of dollars every year. A
well-developed fraud detection system with a state-of-the-art fraud detection model is regarded
as essential to reducing fraud losses. The detection of fraudulent transactions has become a
significant factor affecting the wider adoption of electronic payment. The main contribution
of the work is the development of a fraud detection system that employs a machine learning
architecture together with an advanced feature engineering process. To demonstrate the
effectiveness of the proposed system for detecting fraud in credit card transactions,
experiments were performed using a real-world public credit card transaction data set (Credit
Card Fraud Dataset) consisting of fraudulent and legitimate transactions. Two machine
learning algorithms, Light Gradient Boosting Machine (Light GBM) and Random Forest, are
implemented and compared. The managerial implication of this work is that credit card issuers can apply the
proposed methodology to efficiently identify fraudulent transactions to protect customers’
interests and reduce fraud losses and regulatory costs.
1.2 PROJECT DESCRIPTION
The main contribution of the work is the development of a fraud detection system that
employs a machine learning architecture together with an advanced feature engineering process.
The managerial implication of this work is that credit card issuers can apply the proposed
methodology to efficiently identify fraudulent transactions to protect customers’ interests and
reduce fraud losses and regulatory costs.
1.3 MODULE DESCRIPTION
LIST OF MODULES
• Data preprocessing
• Model Creation
• Performance Evaluation
Data Preprocessing:
• Data preprocessing is the process of removing unwanted data from the dataset.
• Missing data removal: In this step, null values such as missing values and NaN values
are replaced by 0.
• Encoding categorical data: Categorical data are variables with a finite set of
label values. These labels are encoded as numbers (label encoding) before training.
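The two preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration on a small made-up table, not the project's dataset. The column names `Amount`, `Merchant` and `Class` and the helper name `preprocess` are assumptions chosen for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values with 0 and encode the other columns as integers."""
    out = df.copy()
    # Missing data removal: replace null/NaN values in numeric columns with 0.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(0)
    # Encoding categorical data: map each distinct label to an integer code.
    for col in out.columns.difference(numeric_cols):
        out[col] = out[col].astype("category").cat.codes
    return out


# Hypothetical sample data, used only to show the effect of each step.
raw = pd.DataFrame({
    "Amount": [120.5, None, 42.0],
    "Merchant": ["online", "store", "online"],
    "Class": [0, 1, 0],
})
clean = preprocess(raw)
print(clean)
```

After the call, the missing `Amount` becomes 0 and the `Merchant` labels become integer codes, so every column is numeric and can be passed to a classifier.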
Model Creation:
Data Splitting:
• In this process, the dataset is divided into a training dataset and a test dataset.
• The available data is partitioned into two portions, usually for cross-validation purposes.
• One portion of the data is used to develop a predictive model and the other to evaluate the
model’s performance.
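The idea of data splitting can be sketched in a few lines of plain Python. The function name `split_train_test`, the 80/20 ratio and the seed are assumptions made for illustration. In practice a library routine such as scikit-learn's `train_test_split` is normally used for this step.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into a training part and a test part."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


transactions = list(range(10))  # stand-in for ten transaction records
train_rows, test_rows = split_train_test(transactions)
print(len(train_rows), len(test_rows))
```

Every record ends up in exactly one of the two portions, so the model is never evaluated on data it was trained on.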
Classification:
Light GBM is a gradient boosting framework based on decision trees that increases the
efficiency of the model and reduces memory usage. It uses two novel techniques: Gradient-based
One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
Random forest or random decision forests are an ensemble learning method for
classification, regression and other tasks that operate by constructing a multitude of decision
trees at training time and outputting the class that is the mode of the classes (classification)
or mean/average prediction (regression) of the individual trees.
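The way a random forest combines its trees can be illustrated with a small sketch. It shows only the final aggregation step, not the training of the trees. The function names and the example votes are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the individual trees' numeric predictions."""
    return mean(tree_outputs)


# Hypothetical votes of five trees for one transaction (0 = normal, 1 = fraud).
votes = [0, 1, 0, 0, 1]
forest_class = aggregate_classification(votes)

# Hypothetical numeric outputs of three trees for a regression task.
outputs = [0.25, 0.5, 0.75]
forest_value = aggregate_regression(outputs)

print(forest_class, forest_value)
```

Three of the five trees vote for class 0, so the forest predicts a normal transaction, while the regression forest returns the mean of its trees' outputs.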
Prediction:
After the classification algorithms are trained, predicted values are obtained for the
testing data.
In the Prediction module, each credit card transaction is predicted to be either fraud or non-fraud.
Performance Evaluation:
• The final result is generated from the overall classification and prediction. The
performance of the proposed approach is evaluated using measures such as:
• Accuracy
• Precision
• Recall
• F1-score
• The result is presented as a graph that compares the two algorithms and shows which
algorithm provides higher accuracy.
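The four measures listed above follow directly from the counts in a confusion matrix. The sketch below computes them from true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn). The function name and the example counts are made up for illustration and are not results from this project.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision: share of predicted frauds that really are fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual frauds that the model detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 test transactions.
metrics = evaluation_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because fraud is rare, accuracy alone can look high even when many frauds are missed, which is why precision, recall and F1-score are reported alongside it.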
2.SYSTEM ANALYSIS
The large-scale use of credit cards and the lack of effective security systems result in
billion-dollar losses to credit card fraud.
In the existing system, machine learning methods including Support Vector Machine (SVM)
and Decision Tree are used.
The existing system does not effectively classify and predict fraud in credit card transactions.
DISADVANTAGES:
2.2 PROPOSED SYSTEM
This project proposes an approach for detecting fraudulent credit card transactions that uses
machine learning algorithms such as Light Gradient Boosting Machine (Light GBM) and Random Forest.
The proposed method can identify relatively more fraudulent transactions than the existing
methods under an acceptable false positive rate.
The managerial implication of the work is that credit card issuers can apply the methodology
to efficiently identify fraudulent transactions to protect customers’ interests and reduce
fraud losses and regulatory costs.
ADVANTAGES:
3.SYSTEM CONFIGURATION
3.3 SOFTWARE SPECIFICATION
Python:
Python is one of those rare languages which can claim to be both simple and powerful.
You will find yourself pleasantly surprised to see how easy it is to concentrate on the solution to
the problem rather than the syntax and structure of the language you are programming in. The
official introduction to Python is: Python is an easy to learn, powerful programming language. It
has efficient high-level data structures and a simple but effective approach to object-oriented
programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature,
make it an ideal language for scripting and rapid application development in many areas on most
platforms. Most of these features are discussed in more detail in the next section.
Features of Python
Simple
Python is a simple and minimalistic language. Reading a good Python program feels
almost like reading English, although very strict English! This pseudo-code nature of Python is
one of its greatest strengths. It allows you to concentrate on the solution to the problem rather than
the language itself.
Easy to Learn
As you will see, Python is extremely easy to get started with. Python has an extraordinarily
simple syntax, as already mentioned.
High-level Language
When you write programs in Python, you never need to bother about the low-level details such as
managing the memory used by your program, etc.
Portable
Due to its open-source nature, Python has been ported to (i.e. changed to make it work on) many
platforms. All your Python programs can work on any of these platforms without requiring any changes
at all if you are careful enough to avoid any system-dependent features.
You can use Python on GNU/Linux, Windows, FreeBSD, Macintosh, Solaris, OS/2, Amiga,
AROS, AS/400, BeOS, OS/390, z/OS, Palm OS, QNX, VMS, Psion, Acorn RISC OS, VxWorks,
PlayStation, Sharp Zaurus, Windows CE and Pocket PC!
You can even use a platform like Kivy to create games for your computer and for iPhone, iPad,
and Android.
Interpreted
A program written in a compiled language like C or C++ is converted from the source language
i.e. C or C++ into a language that is spoken by your computer (binary code i.e. 0s and 1s) using a compiler
with various flags and options. When you run the program, the linker/loader software copies the program
from hard disk to memory and starts running it.
Python, on the other hand, does not need compilation to binary. You just run the program directly from
the source code. Internally, Python converts the source code into an intermediate form called bytecodes
and then translates this into the native language of your computer and then runs it. All this, actually,
makes using Python much easier since you don’t have to worry about compiling the program, making
sure that the proper libraries are linked and loaded, etc. This also makes your Python programs much
more portable, since you can just copy your Python program onto another computer and it just works!
Object Oriented
Extensible
If you need a critical piece of code to run very fast or want to have some piece of algorithm not
to be open, you can code that part of your program in C or C++ and then use it from your Python program.
Embeddable
You can embed Python within your C/C++ programs to give scripting capabilities for your
program’s users.
Extensive Libraries
The Python Standard Library is huge indeed. It can help you do various things involving regular
expressions, documentation generation, unit testing, threading, databases, web browsers, CGI, FTP,
email, XML, XML-RPC, HTML, WAV files, cryptography, GUI (graphical user interfaces), and other
system-dependent stuff. Remember, all this is always available wherever Python is installed. This is
called the Batteries Included philosophy of Python.
4.SYSTEM DESIGN
The input design is the link between the information system and the user. It comprises
developing specifications and procedures for data preparation, that is, the steps necessary
to put transaction data into a usable form for processing. This can be achieved by instructing the
computer to read data from a written or printed document, or by having people key
the data directly into the system. The design of input focuses on controlling the amount of input
required, controlling errors, avoiding delay, avoiding extra steps and keeping the process
simple. The input is designed so that it provides security and ease of use while
retaining privacy. Input design considered the following things:
4.2 OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and presents
the information clearly. In any system, the results of processing are communicated to the users and
to other systems through outputs. In output design it is determined how the information is to be
displayed for immediate need and also as hard copy output. It is the most important and direct
source of information for the user. Efficient and intelligent output design improves the system’s
relationship with the user and helps decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right
output must be developed while ensuring that each output element is designed so that people
will find the system easy and effective to use. When analysts design computer output, they
should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create document, report, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the
following objectives.
a. Convey information about past activities, current status or projections of the Future.
b. Signal important events, opportunities, problems, or warnings.
c. Trigger an action.
d. Confirm an action.
4.3 LIST OF DIAGRAMS
A data flow diagram shows the way information flows through a process or system. The data flow
diagram of the proposed system is given below.
[Data flow diagram: Input data -> Data Preprocessing (Handling missing values, Label Encoding) ->
Training Data -> Classification (Light GBM, Random Forest) -> Prediction -> Performance evaluation]
5.SYSTEM TESTING
SYSTEM TESTING
System testing is the stage of implementation which is aimed at ensuring that the system
works accurately and efficiently before live operation commences. Testing is the
process of executing a program with the intent of finding an error. A good test case is one
that has a high probability of finding an error. A successful test is one that uncovers an
as-yet-undiscovered error.
Testing is vital to the success of the system. System testing makes a logical
assumption that if all parts of the system are correct, the goal will be successfully achieved.
A series of tests are performed before the system is ready for user acceptance testing.
Any engineered product can be tested in one of the following ways. Knowing the specified
function that a product has been designed to perform, tests can be conducted to demonstrate
that each function is fully operational. Knowing the internal working of a product, tests can be
conducted to ensure that “all gears mesh”, that is, the internal operation of the product
performs according to the specification and all internal components have been adequately
exercised.
UNIT TESTING:
Unit testing is the testing of each module before the integration of the overall
system is done. Unit testing focuses verification efforts on the smallest unit of software
design, the module. This is also known as ‘module testing’.
The modules of the system are tested separately. This testing is carried out during
programming itself. In this testing step, each module is found to be working
satisfactorily with regard to the expected output from the module. There are some validation
checks for the fields. For example, a validation check is done to verify the data
given by the user, covering both the format and the validity of the data entered. This makes it
easy to find errors and debug the system.
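A unit test for a field validation check of this kind could look like the sketch below. The function `validate_transaction`, its rules (a non-negative numeric `Amount` and a `Class` of 0 or 1) and the test cases are hypothetical. They are not taken from the project code and only show the shape of such a test.

```python
import unittest


def validate_transaction(record):
    """Return True if Amount is a non-negative number and Class is 0 or 1."""
    amount = record.get("Amount")
    label = record.get("Class")
    # Format check: bool is a subclass of int, so it is excluded explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return False
    # Validity check: rejects negative amounts (and NaN, which fails the comparison).
    if not amount >= 0:
        return False
    return label in (0, 1)


class ValidateTransactionTest(unittest.TestCase):
    def test_accepts_well_formed_record(self):
        self.assertTrue(validate_transaction({"Amount": 12.5, "Class": 0}))

    def test_rejects_negative_amount(self):
        self.assertFalse(validate_transaction({"Amount": -1.0, "Class": 0}))

    def test_rejects_text_amount(self):
        self.assertFalse(validate_transaction({"Amount": "12.5", "Class": 1}))

    def test_rejects_unknown_class(self):
        self.assertFalse(validate_transaction({"Amount": 3.0, "Class": 2}))


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateTransactionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test exercises one rule in isolation, so a failing test points directly at the check that is broken.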
INTEGRATION TESTING:
Data can be lost across an interface, one module can have an adverse effect on
another, and sub-functions, when combined, may not produce the desired major function.
Integration testing is systematic testing that can be done with sample data. The need for
the integration test is to find the overall system performance. There are two types of
integration testing. They are:
White box testing is a test case design method that uses the control structure of the
procedural design to derive test cases. Using white box testing methods, we derived test
cases that guarantee that all independent paths within a module have been exercised at
least once.
VALIDATION TESTING:
A simple definition is that validation succeeds when the software functions in a manner
that can be reasonably expected by the customer.
OUTPUT TESTING:
After performing validation testing, the next step is output testing of the proposed
system, since no system could be useful if it does not produce the required output
in the specified format. The users are asked about the format they require for the output
displayed or generated by the system under consideration. Here the output format is
considered in two ways. One is the screen format and the other is the printed format. The output
format on the screen is found to be correct as the format was designed in the system design
phase according to the user needs. For the hard copy also, the output comes out as per the
requirements specified by the user. Hence the output testing does not result in any
correction in the system.
6.SYSTEM IMPLEMENTATION
The user should be very careful while implementing a project to ensure that what
they have planned is properly implemented. The user should not change the purpose
of the project while implementing it. The user should not go in a roundabout way to achieve
a solution; it should be direct, crisp, clear and to the point.
Implementation is the stage of the project when the theoretical design is
turned into a working system. Thus it can be considered to be the most critical stage
in achieving a successful new system and in giving the user confidence that the new
system will work and be effective.
7.SAMPLES
Data Preprocessing
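import warnings

import matplotlib.pyplot as plt
import pandas as pd
import scikitplot as skplt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Load the credit card transaction dataset
data = pd.read_csv('creditcard_1.csv')
data.head()
data.info()

"""Preprocessing"""
# Count the missing values in each column
print()
print(data.isnull().sum())
data.describe()

"""Data visualization"""
# Bar chart of the number of transactions per class (0 = normal, 1 = fraud)
data['Class'].value_counts().sort_index().plot(kind='bar')
plt.title("Transaction Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

Normal = data[data['Class']==0]
Fraud = data[data['Class']==1]
print()
print("Fraud Cases : {}".format(len(Fraud)))
print()
print("Valid Cases : {}".format(len(Normal)))

# Features are all columns except the last; the label is the last column (Class)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Hold out part of the data for testing (the 80/20 ratio here is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)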
LGBM Classifier
"""LGBM Classifier"""
from lightgbm import LGBMClassifier

# Train the Light GBM model and predict on the test set
lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)
lgbm_pred=lgbm.predict(X_test)
print('\n')
print("------Accuracy------")
lgbm_=accuracy_score(y_test, lgbm_pred)*100
print(lgbm_)
print('\n')
print("------Classification Report------")
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, lgbm_pred))
print('\n')
print('Confusion_matrix')
lgbm_cm = confusion_matrix(y_test, lgbm_pred)
print(lgbm_cm)
print('\n')
# Rows are the true classes, columns are the predicted classes
tn = lgbm_cm[0][0]
fp = lgbm_cm[0][1]
fn = lgbm_cm[1][0]
tp = lgbm_cm[1][1]
Total_TN_FP=lgbm_cm[0][0]+lgbm_cm[0][1]
Total_FN_TP=lgbm_cm[1][0]+lgbm_cm[1][1]
specificity = tn / (tn+fp)
lgbm_specificity=format(specificity,'.3f')
print('LGBM_specificity:',lgbm_specificity)
print()
plt.figure()
sns.heatmap(confusion_matrix(y_test,lgbm_pred),annot = True)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
'''RANDOM FOREST'''
# Train a random forest with 10 trees and predict on the test set
rf_clf=RandomForestClassifier(n_estimators=10)
rf_clf.fit(X_train,y_train)
rf_ypred=rf_clf.predict(X_test)
print('\n')
print("------Accuracy------")
rf=accuracy_score(y_test, rf_ypred)*100
print(rf)
print('\n')
print("------Classification Report------")
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, rf_ypred))
print('\n')
print('Confusion_matrix')
rf_cm = confusion_matrix(y_test, rf_ypred)
print(rf_cm)
print('\n')
# Rows are the true classes, columns are the predicted classes
tn = rf_cm[0][0]
fp = rf_cm[0][1]
fn = rf_cm[1][0]
tp = rf_cm[1][1]
Total_TN_FP=rf_cm[0][0]+rf_cm[0][1]
Total_FN_TP=rf_cm[1][0]+rf_cm[1][1]
specificity = tn / (tn+fp)
sensitivity = tp / (tp+fn)
rf_specificity=format(specificity,'.3f')
rf_sensitivity=format(sensitivity,'.3f')
print('RF_specificity:',rf_specificity)
print('RF_sensitivity:',rf_sensitivity)
print('\n')
# Learning curve of the random forest using 7-fold cross-validation
skplt.estimators.plot_learning_curve(RandomForestClassifier(n_estimators=10),
    X_train, y_train, cv=7, shuffle=True, scoring="accuracy", n_jobs=-1, figsize=(6,4),
    title_fontsize="large", text_fontsize="large",
    title="Random Forest Classification Learning Curve")
plt.figure()
sns.heatmap(confusion_matrix(y_test,rf_ypred),annot = True)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
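# Comparison of the accuracy of the two models
vals=[lgbm_,rf]
inds=range(len(vals))
labels=["LGBM","RF"]
fig,ax = plt.subplots()
ax.bar(inds, vals)
ax.set_xticks(inds)
ax.set_xticklabels(labels)
ax.set_ylabel("Accuracy (%)")
plt.show()

# Report the predicted class of the first ten test transactions
# (the printed messages are assumed; the originals were lost in extraction)
for i in range(0,10):
    if rf_ypred[i]==0:
        print("Transaction {}: Normal".format(i))
    else:
        print("Transaction {}: Fraud".format(i))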
7.2 SCREEN SHOTS
DATA PREPROCESSING:
DATA SPLITTING:
This figure shows the dataset being split into a training set and a test set.
IMPLEMENTING ALGORITHMS:
LGBM:
Confusion Matrix:
Comparison Graph:
The graph below shows the number of fraud and normal transactions.
Random Forest:
Confusion Matrix:
COMPARING THE ALGORITHMS:
This figure compares both algorithms in a bar chart and
shows whether the credit card transactions are normal or fraud.
8.CONCLUSION
9.FUTURE ENHANCEMENT
10.BIBLIOGRAPHY
10.2 WEB LINKS
1. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
2. http://www.w3schools.blog/detection-and-prevention-of-fraud
3. https://github.com/topics/credit-card-fraud-detection