
Credit Card Transaction Anomalies Using ML

A Project Report

Submitted in partial fulfillment of

the requirements for the award

of the degree

of

Bachelor of Technology
In
Department of Computer Science and Engineering

By

2100032430 - P MANOJ KRISHNA MOULI

under the supervision of

Mrs S Kavitha
Assistant Professor

Department of Computer Science and Engineering


Koneru Lakshmaiah Education Foundation
(Deemed to be University estd., u/s 3 of UGC Act 1956)
Green Fields, Vaddeswaram-522502, Guntur(Dist.), Andhra Pradesh, India
April 2025
Declaration

The Capstone Project-2 report entitled "Credit Card Transaction Anomalies Using ML" is a
record of bonafide work of 2100032430 - P MANOJ KRISHNA MOULI, submitted in partial
fulfillment of the requirements for the award of B.Tech in Computer Science and Engineering
at KL University. The results embodied in this report have not been copied from any other
department, university, or institute.

2100032430 - P MANOJ KRISHNA MOULI

Certificate
This is to certify that the Capstone Project-2 entitled "Credit Card Transaction Anomalies Using
ML" is a record of bonafide work of 2100032430 - P MANOJ KRISHNA MOULI, carried out under
our guidance and supervision, and submitted in partial fulfillment of the requirements for the
award of B.Tech in the Department of Computer Science and Engineering at K L University.

The results embodied in this report have not been copied from any other department,
university, or institute.

Signature of the Supervisor Project Co-Ordinator


Mrs. S Kavitha DR. K. SWATHI
Associate Professor Associate Professor

Signature of the HOD Signature of the External Examiner

Acknowledgement

It is a great pleasure for us to express our gratitude to our honorable President, Sri. Koneru
Satyanarayana, for giving us the opportunity and a platform with facilities for completing this
project-based laboratory report.

We express our sincere gratitude to our HOD, DR. A. SENTHIL, for his support of our academic
growth. We consider it our privilege to thank him for providing us with capable faculty and the
facilities needed to turn our ideas into reality.

We express our sincere thanks to our project supervisor, MRS. S KAVITHA, for her novel ideas,
encouragement, appreciation, and intellectual zeal, which motivated us to complete this report
successfully.

Finally, we are pleased to acknowledge our indebtedness to all those who devoted themselves
directly or indirectly to making this project report a success.

ABSTRACT

A credit card is the most widely used electronic payment method, and the increasing
volume of daily electronic transactions makes it more vulnerable to fraud. Credit card
companies have suffered heavy losses from card fraud, and detecting it is currently one
of the most pressing issues. Credit card companies are looking for the right technologies
and systems to detect and reduce fraudulent credit card transactions. Several methods
for identifying credit card fraud have been surveyed and highlighted in this project and
compared in terms of the advantages and disadvantages of each one.
for each one. These anomalies pose challenges to Fraud Control Systems (comprising Fraud
Detection Systems & Fraud Prevention Systems) and compromise the transparency of online
payments. Financial institutions are thus compelled to enhance the security of credit card
transactions and ensure the safe and efficient use of e-banking services by their customers.
To achieve this objective, financial institutions are investing efforts in developing more
effective anomaly detection techniques. Credit card anomaly detection involves using
machine learning algorithms to identify unusual or suspicious patterns in credit card
transactions. By analyzing various features such as transaction amount, location, and
frequency, these models aim to detect potentially fraudulent activities, providing a proactive
approach to prevent unauthorized transactions and enhance overall security in the financial
system. The findings of this study have significant implications for online job platforms,
enabling them to proactively filter out spam content and enhance the overall user experience.
By leveraging machine learning techniques, platforms can effectively mitigate the adverse
effects of spam postings, thereby fostering a more reliable and trustworthy job search
environment. Future research directions may involve the exploration of additional features
and advanced modeling techniques to further enhance detection accuracy and scalability.

CONTENTS

S. No.  Chapter Title                                         Page No.

1.  Acknowledgement                                           4
2.  Abstract                                                  5
3.  List of Figures                                           15
4.  Introduction                                              9-19
    1.1 Background
    1.2 Problem Statement
    1.3 Objectives
    1.4 Project Scope
    1.5 Limitation of Work
    1.6 Thesis Structure
    1.7 Machine Learning
    1.8 Machine Learning Models
    1.9 Key Components and Functionalities of ML
    1.10 Principal Component Analysis
5.  Literature Survey                                         20-22
    2.1 Journals
    2.2 Existing System
    2.3 Limitations of Existing System
6.  Proposed Methodology                                      23-26
    3.1 Dataset Details
7.  System Design                                             27-36
    4.1 Data Flow Diagram
    4.2 UML Diagram
    4.3 Use Case Diagram
    4.4 Class Diagram
    4.5 Object Diagram
    4.6 Sequence Diagram
    4.7 Activity Diagram
8.  System Implementation                                     37-45
    5.1 Sample Coding
    5.2 Input Screen
    5.3 Output
9.  Testing                                                   46-48
    6.1 Introduction to Testing
10. Implementation Result                                     49-50
    7.1 Implementation
11. Conclusion and Future Enhancement                         51
    8.1 Conclusion
12. Data Pre-processing and Analysis                          52-59
    9.1
13. Bibliography                                              60
    10.1 References
LIST OF FIGURES
Fig No. Title Page No.
1 Data flow diagram 28
2 Use case diagram 31
3 Class diagram 32
4 Sequence diagram 34
5 Activity diagram 36
6 Input screen 44
7 Output image 45
8 Home page 49
9 Add csv to page 49
10 Results page 50

CHAPTER 1

INTRODUCTION

Credit card transactions are an integral part of our modern financial landscape, facilitating
millions of purchases daily. However, alongside their convenience comes the risk of
anomalies, which can lead to unauthorized charges, financial losses, and even identity
theft. Anomaly detection, a critical component of transaction monitoring, plays a pivotal
role in identifying and mitigating these risks. Anomaly detection in credit card
transactions involves the identification of patterns that deviate significantly from the
expected behavior of legitimate transactions. Rather than solely focusing on fraud
detection, our project aims to employ anomaly detection techniques to enhance the
security and integrity of credit card transactions.

Key objectives:

• Identifying unusual patterns: Our project seeks to develop algorithms capable of recognizing
unusual patterns or behaviors within credit card transactions. These anomalies may include
unusually large transactions, transactions from unfamiliar locations, or patterns inconsistent
with the cardholder's typical spending behavior.
• Reducing false positives: Traditional fraud detection systems often generate false positives,
flagging legitimate transactions as suspicious. By employing advanced anomaly detection
techniques, we aim to minimize false positives, thereby enhancing the efficiency of transaction
monitoring while reducing unnecessary disruptions for cardholders.
• Enhancing security: Anomalies in credit card transactions can signal potential security
threats, such as compromised accounts or fraudulent activities. By promptly detecting and
addressing these anomalies, our project aims to bolster the security of credit card transactions
and safeguard cardholders' financial assets.
• Mitigating financial losses: The financial repercussions of credit card fraud and unauthorized
transactions can be substantial, leading to significant losses for both cardholders and financial
institutions. Through effective anomaly detection, we aim to mitigate these losses by
identifying and intercepting fraudulent activities before they escalate.
• Improving customer experience: Beyond security concerns, our project also aims to enhance
the overall customer experience by minimizing disruptions caused by fraudulent transactions.
By swiftly identifying and resolving anomalies, we strive to instill confidence among
cardholders and foster trust in the reliability of credit card transactions.

In data mining and machine learning, anomalies are often categorized into different types
based on their characteristics:

• Point anomalies: Point anomalies occur when a single instance of data significantly deviates
from the normal behavior of the rest of the dataset. In other words, a point anomaly is an
individual data point that stands as an exception compared to the majority of the data.
Detecting point anomalies involves identifying these isolated instances that exhibit unusual or
unexpected behavior within the dataset.
• Contextual anomalies: Contextual anomalies are detected when a data instance exhibits
anomalous behavior only within a specific context or condition, while appearing normal in
other contexts. This type of anomaly is also referred to as a conditional anomaly. Contextual
anomaly detection involves identifying instances where the deviation from normal behavior
occurs under specific circumstances, highlighting the importance of considering contextual
factors in anomaly detection algorithms.
• Collective anomalies: Collective anomalies refer to situations where a group or collection of
related data instances exhibits anomalous behavior collectively, even though individual data
points may appear normal when considered independently. In other words, each data point
may not deviate significantly from the norm on its own, but the collective behavior of the
group as a whole is anomalous. Detecting collective anomalies involves analyzing the
relationships and interactions among data points to identify anomalous patterns or clusters
that emerge collectively within the dataset.
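As a minimal, self-contained illustration of point-anomaly detection (a sketch only; the amounts below are hypothetical and not taken from the project's dataset), a simple z-score rule flags a transaction that lies far from the mean of the others:

import numpy as np

# Hypothetical transaction amounts; the last value is an unusually large purchase.
amounts = np.array([25.00, 40.50, 12.99, 33.20, 18.75, 29.10, 2450.00])

# A point anomaly deviates strongly from the bulk of the data, so flag values
# more than two standard deviations from the mean (a deliberately loose cut-off
# for this tiny illustrative sample).
z_scores = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z_scores) > 2])   # only the 2450.00 transaction is flagged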
1.1 Background:
In the modern digital landscape, the proliferation of online job portals has revolutionized
recruitment processes, offering convenience and efficiency to both job seekers and
employers. However, amidst the vast array of legitimate job postings lie deceptive spam
postings. These spam postings often contain misleading information, fraudulent schemes,
or malicious intents, leading to wasted time, financial loss, compliance risks and potential
security risks for unsuspecting users. To combat this issue, machine learning (ML)
techniques have emerged as powerful tools for automatically identifying and filtering out
spam job postings. ML models can analyze large volumes of job postings, discern
patterns, and learn from labeled data to distinguish between legitimate and spam postings
effectively. By leveraging various features such as job description, company profile,
salary information, and user engagement metrics, ML algorithms can extract relevant
information and detect anomalies indicative of spam.
One common approach involves using supervised learning algorithms like support vector
machines (SVM), decision trees, or neural networks trained on labeled datasets
comprising both legitimate and spam job postings.

1.2 Problem Statement:

With the rise of digital transactions, credit card fraud has become a significant concern
for financial institutions and consumers alike. Detecting fraudulent activities in real-time
is crucial to prevent financial losses and protect customers' assets. Traditional rule-based
systems often fail to keep up with evolving fraud tactics. Hence, there is a pressing need
for more advanced anomaly detection techniques leveraging machine learning
algorithms. This research aims to address the following key challenges:
1.2.1 Imbalanced Data: Credit card fraud is relatively rare compared to legitimate
transactions, resulting in imbalanced datasets. Developing effective anomaly
detection models that can handle this class imbalance while maintaining high
accuracy is essential.
1.2.2 Complex Patterns: Fraudulent activities often exhibit complex patterns that are
difficult to capture using traditional methods. Machine learning algorithms need to be
capable of identifying these patterns amidst the noise inherent in transactional data.
1.2.3 Real-time Detection: Timeliness is crucial in fraud detection. The model should be
able to analyse transactions in real-time, flagging suspicious activities promptly to
prevent further damage.
1.2.4 Generalization: An effective anomaly detection model should generalize well across
different types of fraud and adapt to new fraud tactics over time. It should be robust
against concept drift, where the characteristics of fraudulent behaviour may change
gradually.
1.2.5 Interpretability and Explainability: While achieving high accuracy is important, it
is equally crucial for the model to provide interpretable explanations for its decisions.
This is essential for gaining trust from stakeholders and for compliance purposes.
Addressing these challenges requires the development and evaluation of novel machine
learning algorithms tailored specifically for credit card fraud detection. This research
aims to propose and validate such algorithms, ultimately contributing to the advancement
of fraud detection systems in the financial sector.
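To illustrate the class-imbalance challenge above (a sketch on synthetic data, not the project's dataset), class weighting together with precision/recall-based evaluation keeps a model honest when fraud is rare:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data standing in for transactions (about 1% "fraud").
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.99, 0.01],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" penalizes mistakes on the rare class more heavily, and
# precision/recall are far more informative than raw accuracy on such data.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))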

1.3 Objectives:

The main objectives of this research are:

1.3.1 Data Collection and Analysis

1.3.2 Feature Engineering

1.3.3 Model Development

1.3.4 Evaluation and Validation

1.3.5 Deployment and Integration

1.4 Project Scope:

The scope of this project is intended to make the web system's procedure clearer. The project
focuses primarily on the security aspect of the system.
1.4.1 Scope of the user
1.4.1.1 Click the fake job identification stage on the home page.
1.4.1.2 Enter the job title, job industry, job description, professional
requirements, and educational requirements during the fake job
identification stage.
1.4.2 Scope of the system
1.4.2.1 Home: The home page allows the user to start a detection and learn
about the web system.
1.4.2.2 Fake Job Identification: Identify whether the job title and description
entered by the user are genuine or fake.

1.5 Limitation Of Work

It is a well-known fact that every system has its limitations, and this proposed system is no
exception. One limitation of this application is that it does not have an elaborate interface,
since the goal is only to identify the job posting. It has a simple interface with a form in which
the user enters the job posting title, description, and professional requirements, and a submit
button. The design focuses only on the identification of a job posting based on the patterns
learned by the machine learning model.

1.6 Thesis Structure

Chapter 1

The first chapter contains the most important part, which is a fundamental description of
the idea of the whole project. This chapter covers the background, problem statement,
objectives, project scope, limitations of the work, and the structure of this thesis.

Chapter 2

This chapter describes the related work of other researchers in order to gain a further
understanding of the project idea. The concept of machine learning is described in this
chapter, and machine learning models are discussed based on reading material and sources
such as papers, journals, related websites, and existing systems.

Chapter 3

This chapter describes the plan of the proposed strategy, which exploits a machine learning
model. It explains the system and the system requirements of the project in more detail.

Chapter 4

This chapter explains the working of the system and shows how the system is developed
for the project.

Chapter 5

This chapter shows the testing of the machine learning models and the conclusion of the
identification, along with the detection result of the system.

Chapter 6

The final chapter concludes with the achievement of the expected results and the future
work of this proposed project.

1.7 Machine Learning

Machine learning is a subset of artificial intelligence (AI) that enables computers to learn
from data and improve their performance on a task without being explicitly programmed. It is
a powerful tool used across various domains, from predicting customer behavior in
e-commerce to diagnosing diseases in healthcare. In documentation, understanding the basics
of machine learning can facilitate collaboration with data scientists, engineers, and other
stakeholders involved in developing machine learning systems. It revolves around the idea of
enabling computers to automatically learn and improve from experience.

At its core, machine learning involves training algorithms to recognize patterns in data and
make decisions or predictions based on those patterns. The process typically consists of the
following steps:

1. Data Collection: The first step in any machine learning project is gathering relevant data.
This data can come from various sources such as databases, sensors, or web scraping. In
documentation, it's essential to understand the type and format of data being used to inform
users about the inputs required for the machine learning system.

2. Data Preprocessing: Raw data often contains noise, missing values, or inconsistencies that
can impact model performance. Data preprocessing involves cleaning, transforming, and
organizing the data to make it suitable for training machine learning models. This step may
include tasks like removing duplicates, handling missing values, and scaling numerical
features.

3. Feature Engineering: Feature engineering is the process of selecting, transforming, or creating


features (i.e., variables) that are relevant for training the machine learning model. This step
requires domain knowledge and creativity to extract meaningful information from the data.
Documenting the features used in the model helps users understand the factors influencing its
predictions.

4. Model Selection: Choosing the right algorithm or model architecture is crucial for achieving
good performance. There are various types of machine learning models, including supervised
learning (e.g., regression, classification), unsupervised learning (e.g., clustering,
dimensionality reduction), and reinforcement learning. Each type has its strengths and
weaknesses depending on the nature of the problem and the available data. Documenting the
rationale behind model selection helps users understand the trade-offs involved.

5. Model Training: Training a machine learning model involves feeding the prepared data into
the chosen algorithm and adjusting its parameters to minimize errors or optimize performance
metrics. This process requires computational resources and may involve techniques like
cross-validation to evaluate model generalization.

6. Model Evaluation: Once trained, the model needs to be evaluated on a separate dataset to
assess its performance and generalization ability. Common evaluation metrics include
accuracy, precision, recall, and F1 score for classification tasks, and mean squared error or
R-squared for regression tasks. Documenting the evaluation results helps users interpret the
model's reliability and limitations.

7. Deployment and Monitoring: After successful evaluation, the trained model is deployed into
production environments where it can make predictions on new data. Continuous monitoring
is essential to ensure that the model performs as expected and remains accurate over time.
Documenting deployment procedures and monitoring protocols helps maintain transparency
and accountability.
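A compact sketch of steps 4 to 6 on synthetic data (illustrative only; the random-forest classifier here is a placeholder, not the project's final model):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in for steps 1-3: a synthetic, already-clean dataset.
X, y = make_classification(n_samples=5_000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 4-5: choose and train a model.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Step 6: evaluate on held-out data with more than one metric.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))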

1.8 Machine Learning Models

Machine learning models are mathematical representations of patterns and relationships
within data that enable computers to make predictions, classifications, or decisions without
being explicitly programmed. These models serve as the core components of machine learning
systems, and understanding their types, architectures, and functionalities is essential for
leveraging machine learning effectively.

Types of Machine Learning Models:

1. Linear Models: Linear models assume a linear relationship between input features and the
target variable. Examples include linear regression for regression tasks and logistic regression
for binary classification tasks. These models are simple and interpretable but may struggle to
capture complex patterns in data.
2. Decision Trees: Decision trees partition the feature space into hierarchical decision rules
based on feature values, leading to a tree-like structure. Decision trees are easy to interpret
and can handle both numerical and categorical data. However, they are prone to overfitting
and may lack generalization.
3. Ensemble Models: Ensemble models combine multiple base models to improve predictive
performance. Random Forest and Gradient Boosting Machines (GBM) are popular ensemble
techniques. They work by aggregating the predictions of multiple weak learners, resulting in
more robust and accurate models.
4. Support Vector Machines (SVM): SVMs are supervised learning models used for
classification and regression tasks. They find the optimal hyperplane that separates data
points into different classes while maximizing the margin between classes. SVMs are effective
in high-dimensional spaces and are less prone to overfitting.
5. Neural Networks: Neural networks are a class of models inspired by the structure and
function of the human brain. They consist of interconnected layers of artificial neurons
(nodes) organized into an input layer, hidden layers, and an output layer. Deep neural
networks, with many hidden layers, are capable of learning complex representations from
data and are widely used in tasks such as image recognition, natural language processing,
and reinforcement learning.
6. Clustering Models: Clustering models group similar data points together based on their
features, without any predefined labels. K-means clustering and hierarchical clustering are
common techniques used for unsupervised learning tasks such as customer segmentation,
anomaly detection, and data compression.
7. Recommender Systems: Recommender systems predict user preferences or item ratings to
provide personalized recommendations. Collaborative filtering and content-based filtering are
two popular approaches used in recommendation systems.

1.9 Key Components and Functionalities of ML

Machine learning (ML) encompasses a diverse set of algorithms and techniques that enable
computers to learn from data and make predictions or decisions without being explicitly
programmed. Understanding the key components and functionalities of machine learning is
crucial for effectively designing, implementing, and evaluating ML systems.

Components of Machine Learning:

1. Data: Data is the foundation of machine learning. It includes both input features
(independent variables) and corresponding labels (for supervised learning) or unlabelled
instances (for unsupervised learning). The quality, quantity, and relevance of data significantly
impact the performance of machine learning models.
2. Model: A model is a mathematical representation of a problem domain or a system under
study. It captures the relationships between input features and output predictions. Models can
range from simple linear regressions to complex deep neural networks.
3. Algorithm: An algorithm is a set of rules or procedures used by a machine learning model to
learn from data and make predictions. Different algorithms are suited to different types of
problems and data distributions. Common machine learning algorithms include linear
regression, decision trees, support vector machines, k-nearest neighbours, and neural networks.
4. Features: Features are the variables or attributes used as input to machine learning models.
Feature selection, extraction, and engineering involve identifying relevant features that
contribute to the predictive power of the model. Features can be numerical, categorical, or
textual, and their quality directly impacts model performance.
5. Parameters: Parameters are the internal variables of a machine learning model that are
optimized during the training process to minimize prediction errors or optimize performance
metrics. Examples of parameters include weights and biases in neural networks or coefficients
in linear regression.

Functionalities of Machine Learning:

1. Training: Training is the process of fitting a machine learning model to training data by
adjusting its parameters or structure to minimize prediction errors or optimize performance
metrics. Training typically involves iterative optimization algorithms such as gradient descent
or backpropagation.
2. Evaluation: Evaluation involves assessing the performance of a trained machine learning
model on unseen data to measure its generalization ability. Evaluation metrics quantify the
model's accuracy, precision, recall, F1 score, mean squared error, or other performance
measures depending on the task.
3. Prediction: Prediction is the process of using a trained machine learning model to make
predictions or decisions on new, unseen data. Given input features, the model produces output
predictions or classifications based on the patterns learned from the training data.
4. Deployment: Deployment is the process of integrating a trained machine learning model
into production environments where it can make real-time predictions on new data.
Deployment involves considerations such as scalability, latency, model-serving infrastructure,
and monitoring.
5. Monitoring: Monitoring involves continuously tracking the performance of deployed
machine learning models in production environments to detect drift, degradation, or
anomalies. Monitoring ensures that models remain accurate and reliable over time and
facilitates model maintenance and updates.
6. Interpretability: Interpretability refers to the ability to understand and explain how machine
learning models make predictions or decisions. Techniques such as feature-importance
analysis and model visualization help stakeholders understand and trust a model's behaviour.


1.10 Principal Component Analysis

Principal Component Analysis (PCA) and Isolation Forest are both techniques used in machine
learning and data analysis for detecting anomalies in credit card transactions.

Principal Component Analysis (PCA):
- PCA is a dimensionality reduction technique primarily used for feature extraction and data
visualization. It transforms high-dimensional data into a lower-dimensional form by finding
the principal components (PCs) that capture the maximum variance in the data.
- The principal components are linear combinations of the original features, ordered by the
amount of variance they explain.
- PCA is useful for reducing the dimensionality of data while preserving most of its important
information. It is commonly used for exploratory data analysis, visualization, and speeding up
machine learning algorithms.
- PCA works by calculating the eigenvectors and eigenvalues of the covariance matrix of the
data and then selecting a subset of these vectors to form the new lower-dimensional space.
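A short sketch of PCA in practice (illustrative only; the matrix X below is synthetic and merely stands in for a table of numeric transaction features):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric transaction features, built with low-rank
# structure so that a few components capture most of the variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5)) @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(1_000, 20))

# Standardize first so that no single feature scale dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain roughly 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("original shape:", X.shape, "reduced shape:", X_reduced.shape)
print("variance explained by the first components:", pca.explained_variance_ratio_[:5])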

Credit card fraud remains a significant concern for financial institutions and consumers,
necessitating efficient and accurate anomaly detection methods. Traditional methods
often struggle to cope with the complexities and imbalances inherent in credit card
transaction data. The Isolation Forest algorithm, a relatively recent anomaly detection
technique, shows promise due to its ability to efficiently identify outliers in high-
dimensional data without relying on assumptions about the underlying distributions.

This research seeks to explore the following aspects related to the application of the
Isolation Forest algorithm in credit card fraud detection:

1. Effectiveness in Handling Imbalanced Data: Credit card fraud datasets typically exhibit
severe class imbalances, with fraudulent transactions forming a small fraction of the overall
data. Because Isolation Forest isolates outliers without depending on class labels, it can
potentially mitigate the challenges posed by class imbalance, leading to more accurate
detection of fraudulent activities.
2. Scalability and Efficiency: As financial institutions process vast volumes of
transactions daily, the scalability and computational efficiency of the detection
algorithm are crucial. The Isolation Forest algorithm's linear time complexity and
ability to work well with high-dimensional data make it a promising candidate for
real-time
fraud detection systems.
3. Generalization to Different Types of Fraud: Credit card fraud manifests in various
forms, ranging from simple unauthorized transactions to sophisticated identity theft
schemes. Assessing the Isolation Forest algorithm's capability to generalize across
different types of fraud scenarios is essential for its practical utility in real-world applications.

4. Integration with Real-Time Detection Systems: The Isolation Forest algorithm's
suitability for real-time processing can significantly enhance fraud detection systems'
responsiveness. Investigating its seamless integration with existing transaction
processing pipelines and evaluating its performance under stringent latency
requirements are critical considerations.
5. Interpretability and Explainability: Despite its effectiveness, the Isolation Forest
algorithm's black-box nature might pose challenges in interpreting and explaining its
decisions, which are essential for stakeholders' trust and regulatory compliance.
Assessing techniques to enhance the algorithm's interpretability without compromising
its detection capabilities is imperative.

CHAPTER 2
LITERATURE SURVEY
Prajal Save et al. [1] have proposed a model based on a decision tree and a combination
of Luhn's and Hunt's algorithms. Luhn's algorithm is used to determine whether an
incoming transaction is fraudulent or not. It validates credit card numbers via the input,
which is the credit card number. Address Mismatch and Degree of Outlier Ness are used
to assess the deviation of each incoming transaction from the cardholder's normal profile.
In the final step, the general belief is strengthened or weakened using Bayes Theorem,
followed by recombination of the calculated probability with the initial belief of fraud
using an advanced combination heuristic. Finally, the highest accuracy obtained was 89.44%.

Vimala Devi J. et al. [2] presented and implemented three machine learning algorithms
(Support Vector Machine, Random Forest, and Decision Tree) to detect counterfeit
transactions. Many measures are used to evaluate the performance of classifiers or
predictors; these metrics are either prevalence-dependent or prevalence-independent.
Furthermore, these techniques are used in credit card fraud detection mechanisms, and
the results of the algorithms have been compared. Finally, the highest accuracy obtained
was 90.64%.

Popat and Chaudhary [3] presented supervised algorithms including Deep Learning, Logistic
Regression, Naive Bayes, Support Vector Machine (SVM), Neural Network, Artificial Immune
System, K-Nearest Neighbour, Data Mining, Decision Tree, Fuzzy-logic-based systems, and
Genetic Algorithms. Credit card fraud detection algorithms identify transactions that have a
high probability of being fraudulent. They compared machine-learning algorithms for
prediction, clustering, and outlier detection. Finally, the highest accuracy obtained was 90.43%.

Kibria and Sevkli [4] built a deep learning model using the grid search technique. The built
model's performance is compared with that of two other traditional machine-learning
algorithms: logistic regression (LR) and support vector machine (SVM). The developed model
is applied to the credit card dataset and the results are compared to the logistic regression
and support vector machine models.

Siddhant Bagga et al. [5] presented several techniques for determining whether a transaction
is genuine or fraudulent. They evaluated and compared the performance of nine techniques on
credit card fraud data, including logistic regression, KNN, random forest, quadratic
discriminant analysis, naive Bayes, multilayer perceptron, AdaBoost, ensemble learning, and
pipelining, using different parameters and metrics. The ADASYN method is used to balance the
dataset. Accuracy, recall, F1 score, the Balanced Classification Rate, and Matthews' correlation
coefficient are used to assess classifier performance. The aim is to determine which technique
is best suited to the problem based on these metrics. Finally, the highest accuracy obtained
was 88.998%.

2.1 Existing System

Initially, the system utilized the Support Vector Machine (SVM) algorithm for classification
tasks on the credit card transaction dataset. SVM is a popular machine learning algorithm
known for its effectiveness in binary classification tasks. Through a thorough analysis of IEEE
papers, it becomes evident that various machine learning algorithms such as Bayesian
methods, Random Forest, Logistic Regression, and Support Vector Machines (SVM) have been
extensively employed. These models exhibit promising results, achieving accuracies upwards
of 92% in detecting anomalies within credit card transactions.

Drawbacks:
• The system used weak data preprocessing techniques, so the accuracy obtained was low,
and different sets of features had an impact on the results; this is overcome by using PCA.
• The algorithms used by the system take a long time to produce outcomes.

1. Low Accuracy Due to Insufficient Data Preprocessing:
• The system initially employed SVM for classification tasks but encountered low accuracy.
This could be attributed to inadequate data preprocessing techniques.
• Data preprocessing is crucial for preparing the dataset before feeding it into the machine
learning model. It involves steps such as handling missing values, scaling features, encoding
categorical variables, and removing outliers.
• Insufficient data preprocessing can lead to noisy or biased data, which can negatively impact
the model's performance.
• To address this issue, it is essential to implement more robust data preprocessing techniques
to ensure that the dataset is clean, balanced, and suitable for training the model.
2. High Computational Time with Existing Algorithms:
• Another challenge is the long computational time taken by the algorithms to produce
outcomes.
• Algorithms such as SVM, Bayesian methods, Random Forest, and Logistic Regression were
employed, but they were computationally intensive, resulting in longer processing times.
• High computational time can be a bottleneck, especially when dealing with large datasets or
complex models.

The following strategies can address these issues:
• Dimensionality Reduction Techniques: Principal Component Analysis (PCA) is mentioned as a
method to address the impact of different sets of features on results. PCA can reduce the
dimensionality of the dataset while retaining most of its variance, thereby speeding up
computation.
• Model Optimization: Tuning hyperparameters and optimizing algorithms can help improve
efficiency, for example by using optimized implementations of algorithms or exploring parallel
computing techniques.
• Data Sampling: Instead of using the entire dataset, consider data sampling techniques such
as random sampling or stratified sampling to work with smaller subsets of data, which can
reduce computational time while still maintaining representative samples.
• Model Selection: Evaluate and compare the performance of different machine learning
algorithms to identify the most efficient ones for the task at hand. Some algorithms may
inherently require less computational time while achieving comparable or better results.

In summary, addressing the issues of low accuracy due to insufficient data


preprocessing and high computational time requires a combination of robust data
preprocessing techniques, dimensionality reduction methods like PCA, and
optimization strategies for model selection and implementation. By improving the
quality of the dataset and optimizing the algorithms, it's possible to enhance the
accuracy and efficiency of anomaly detection in credit card transactions.
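As a sketch of how these ideas can be combined (an assumed pipeline for illustration, not the project's exact code), scaling, PCA, and a classifier can be chained so that dimensionality reduction happens before the comparatively expensive SVM training:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for preprocessed transaction features.
X, y = make_classification(n_samples=5_000, n_features=30, n_informative=10, random_state=1)

# Scale -> reduce dimensionality with PCA -> classify with SVM.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),   # fewer features means faster SVM training
    SVC(kernel="rbf"),
)
pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))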

2.1.1 LIMITATIONS OF EXISTING SYSTEM

1. It doesn't work for large data sets.

2. Limited Context Understanding: Naive Bayes assumes independence among features,


which may lead to a limited understanding of the contextual nuances in job postings.
This limitation can affect the system's ability to discern more sophisticated spam
tactics.
CHAPTER 3

PROPOSED METHODOLOGY
The proposed system for anomaly detection in credit card transactions represents a
significant advancement over the existing framework by addressing various limitations
and introducing key enhancements across several critical aspects.

1. Improved Accuracy with Isolation Forest Algorithm:

Anomaly detection in credit card transactions is a complex task due to the ever-evolving
nature of fraudulent activities and the vast amount of legitimate transactions. The
proposed system tackles this challenge by leveraging the Isolation Forest algorithm,
renowned for its effectiveness in isolating anomalies within datasets. Unlike traditional
methods such as clustering or density-based approaches, Isolation Forest excels in
identifying outliers by constructing random decision trees and isolating instances with
fewer splits. This inherent ability to isolate anomalies makes it particularly well-suited for
detecting fraudulent transactions amidst legitimate ones.

Moreover, the utilization of the Kaggle dataset for training and testing purposes enhances
the robustness and generalizability of the model. The Kaggle dataset likely comprises a
diverse range of transaction scenarios, capturing various patterns and anomalies
encountered in real-world credit card usage. By training the model on such a
representative dataset, the proposed system ensures that it can effectively discern between
normal and anomalous transactions across different contexts, thereby improving its
accuracy and reliability.
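A minimal sketch of this idea using scikit-learn's IsolationForest (synthetic data and an assumed contamination value, not figures from this project):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: mostly "normal" transactions plus a few extreme ones.
rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=15, size=(2_000, 4))
outliers = rng.normal(loc=400, scale=50, size=(20, 4))
X = StandardScaler().fit_transform(np.vstack([normal, outliers]))

# Isolation Forest isolates points that need few random splits to separate.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
labels = model.fit_predict(X)          # -1 = anomaly, 1 = normal
scores = model.decision_function(X)    # lower score = more anomalous

print("transactions flagged as anomalous:", int((labels == -1).sum()))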

2. User-Friendly GUI:

One notable improvement over the existing system is the introduction of a user-friendly
Graphical User Interface (GUI). While the existing system might have been effective in
detecting anomalies, its lack of a GUI might have posed challenges in terms of user
interaction and comprehension. The new GUI in the proposed system addresses this
limitation by providing users with an intuitive interface to interact with the anomaly
detection module.

Through the GUI, users can easily input transaction data, visualize the results of anomaly
detection, and interpret the findings. Additionally, features such as interactive charts,
customizable settings, and real-time feedback enhance the overall user experience, making
it easier for both analysts and non-technical users to leverage the system effectively. By
improving accessibility and usability, the GUI empowers users to make informed decisions based
on the anomaly detection results, ultimately contributing to better fraud prevention and mitigation
strategies.

3. Speed:

In addition to accuracy and usability enhancements, the proposed system also prioritizes
efficiency in terms of training and prediction speed. The ability to process large volumes
of transaction data rapidly is crucial for timely anomaly detection, especially in the
context of credit card fraud prevention where swift action is paramount.

By optimizing algorithms and leveraging parallel processing techniques, the proposed


system achieves faster training and prediction speeds compared to the existing
framework. This acceleration facilitates real-time monitoring of transactions, enabling
rapid identification and response to fraudulent activities as they occur. As a result, the
proposed system enhances the agility and effectiveness of fraud detection operations,
thereby reducing the potential financial losses associated with fraudulent transactions.

In conclusion, the proposed system represents a comprehensive solution to the challenges


of anomaly detection in credit card transactions. By combining advanced algorithms,
representative datasets, user-friendly interfaces, and optimized performance, it offers
significant improvements in accuracy, usability, and efficiency. As a result, the proposed
system is poised to make a substantial impact in enhancing fraud detection capabilities
and safeguarding financial transactions against fraudulent activities.

3.1 Dataset Details:


1. Credit Card Fraud Detection Datasets: These datasets typically contain anonymized
credit card transactions, where each transaction is labeled as fraudulent or legitimate.
The data usually includes features such as transaction amount, timestamp, and
anonymized customer information. These datasets are commonly used for training
machine learning models to detect fraudulent transactions.
2. Synthetic Fraud Datasets: Some datasets on Kaggle may consist of synthetic or
simulated credit card transactions generated specifically for fraud detection research.
These datasets often mimic real-world transaction patterns and anomalies, making them
valuable for testing the effectiveness of anomaly detection algorithms.
3. Imbalanced Datasets: Credit card fraud detection datasets are often highly imbalanced, with
fraudulent transactions making up only a small fraction of all records. Such imbalanced
datasets pose challenges for machine learning model training and evaluation, requiring
techniques such as oversampling, undersampling, or advanced anomaly detection algorithms
to address the class imbalance effectively.
4. Additional Features: In addition to transaction-specific features, some datasets may
include additional information such as merchant category, transaction location, or
device information. These additional features can enrich the dataset and improve the
performance of anomaly detection models by capturing more nuanced patterns of
fraudulent behavior.

When searching for credit card transaction anomaly detection datasets on Kaggle or
other data science platforms, it's essential to review the dataset descriptions, understand
the data schema, and assess the quality and relevance of the data for your specific
research or application. Additionally, ensure that you comply with any data usage
restrictions and privacy regulations when working with sensitive financial transaction
data.
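For orientation, a small sketch of loading such a dataset with pandas (the file name and the Class/Amount column names are assumptions based on the widely used Kaggle credit card fraud CSV and may differ for other datasets):

import pandas as pd

# Path and column names are assumptions for illustration only.
df = pd.read_csv("creditcard.csv")

print(df.shape)
print(df["Class"].value_counts())                      # how many fraud vs. legitimate rows
print(df["Class"].value_counts(normalize=True) * 100)  # class imbalance as percentages
print(df["Amount"].describe())                         # basic view of transaction amounts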

ADVANTAGES OF PROPOSED SYSTEM


The proposed system for anomaly detection in credit card transactions offers several
advantages over existing systems:

1. Improved Accuracy: By employing the Isolation Forest algorithm and leveraging


representative datasets from Kaggle, the proposed system aims to enhance accuracy
levels in detecting anomalies. Isolation Forest is particularly effective in isolating
anomalies within datasets, leading to more precise identification of fraudulent
transactions compared to traditional methods. Additionally, training the model on
diverse and representative datasets improves its ability to generalize to unseen data,
further enhancing accuracy.
2. Efficiency: The proposed system is designed for faster training and prediction speeds
compared to existing systems. This efficiency is crucial for real-time monitoring of
credit card transactions, enabling timely identification and mitigation of fraudulent
activities. Faster processing times also contribute to improved operational efficiency
and reduced response times, ultimately leading to better fraud prevention outcomes.
3. User-Friendly Interface: Introducing a user-friendly Graphical User Interface (GUI)
addresses a significant limitation of existing systems by enhancing accessibility and
usability. The GUI provides an intuitive platform for users to interact with the
system, input transaction data, visualize results, and interpret anomaly detection
findings. This user-friendly interface makes the system more accessible to both
technical and non-technical users, empowering them to make informed decisions based on
the anomaly detection results.
4. Scalability: The proposed system is designed to be scalable, capable of handling large
volumes of credit card transactions efficiently. Scalability is essential for accommodating the
growing volume of transactions in the financial industry while maintaining high levels of
performance and accuracy. The system's scalability ensures that it can adapt to changing
demands and accommodate increased transaction volumes without sacrificing effectiveness
or speed.
5. Robustness: By utilizing advanced anomaly detection algorithms and leveraging diverse
datasets, the proposed system exhibits robustness in detecting various types of anomalies in
credit card transactions.

CHAPTER 4
SYSTEM DESIGN

4.1 Data Flow Diagram:

1. The Data Flow Diagram (DFD) is also called a bubble chart. It is a simple graphical
formalism that can be used to represent a system in terms of the input data to the system,
the various processing carried out on this data, and the output data generated by the system.
2. The Data Flow Diagram (DFD) is one of the most important modelling tools. It is used to
model the system components. These components are the system process, the data used by
the process, the external entities that interact with the system, and the information flows in
the system.
3. The DFD shows how information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be
partitioned into levels that represent increasing information flow and functional detail.

Fig 1: Architecture of Proposed Model

Fig 2: Data Flow Diagram

4.2 UML Diagram:

UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard is managed, and
was created, by the Object Management Group.

The goal is for UML to become a common language for creating models of object-oriented
software. In its current form, UML comprises two major components: a meta-model and a
notation. In the future, some form of method or process may also be added to, or associated
with, UML. The Unified Modeling Language is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of a software system, as well as for business
modeling and other non-software systems.

The UML represents a collection of best engineering practices that have proven successful in
the modeling of large and complex systems. UML is an important part of developing
object-oriented software and the software development process. The UML mostly uses
graphical notations to express the design of software projects.

GOALS
The following are the primary goals in the design of the UML:

1. Provide users with a ready-to-use, expressive visual modeling language so they can develop
and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the market for object-oriented tools.
6. Support higher-level development concepts such as collaborations, frameworks, patterns,
and components.
7. Integrate best practices.

4.3 Use Case Diagram:

A usage case outline in the United Showing Language (UML) is a kind of friendly
diagram described by and produced using a Use case assessment. Its inspiration is to
present a graphical layout of the convenience given by a structure with respect to
performers, their goals (tended to as utilize cases), and any circumstances between those
usage cases. The vital inspiration driving a use case frame is to show what structure
capacities are performed for which performer. The positions of the performers in the
structure can be depicted.
A usage case outline is a sensible depiction of the correspondence among the parts of the
system. It is used in system assessment to perceive, make sense of, and figure out
structure necessities. The usage case contains a lot of possible orders of association
between the application and client in a particular environment and associated with a
particular goal. It incorporates a get-together of parts for example, classes, and
association focuses that can be used together to such an extent that will have an impact
29
more imperative than how much the various parts combined.

30
The usage case should cover all application practices that have implications for the
client. The figure shows the usage case outline for graphical mystery word approval
using the Passpoints contrive for new clients. By looking at the diagram, four use cases
will be found which are making a username, making a mystery expression, picking a
picture, and saving the mystery key. Moreover, the performer of this usage case frame is
another client.

A performer can be described as something that connects points with the structure. The
performer can be a human client or an internal and outside application. Another huge
point is to recognize as far as possible which is shown in the outline.

The performer client lies outside the structure as it is an external client of the
application. Then, the figure shows the use case frame for a graphical mystery key check
for an ongoing client. There are moreover four use cases that can be found in the chart
which are entering a username, entering a mystery key, picking a picture, and
confirming.

Fig 3. Use Case Diagram

4.4 Class Diagram:
In software engineering, a class diagram in the Unified Modeling Language is a type of static
structure diagram that describes the structure of a system by showing the system's classes,
their attributes, operations (or methods), and the relationships among the classes. It shows
which class contains which information.

A class diagram depicts a class's attributes and operations and the constraints imposed on the
system. Class diagrams are widely used in the modeling of object-oriented systems because
they are the only UML diagrams that can be mapped directly to object-oriented languages. The
class diagram shows a collection of classes, interfaces, associations, collaborations, and
constraints. It is also known as a structural diagram. Activity diagrams, by contrast, are loosely
defined diagrams that show workflows of stepwise activities and actions, with support for
choice, iteration, and concurrency. In UML, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a system. UML activity
diagrams can also model the internal logic of a complex operation. In many ways, UML activity
diagrams are the object-oriented equivalent of flow charts and data flow diagrams (DFDs) from
structured development.

Fig 4. Class Diagram


4.5 Object Diagram

Object diagrams are derived from class diagrams, so object diagrams are dependent upon
class diagrams. An object diagram represents an instance of a class diagram. The basic
concepts are similar for class diagrams and object diagrams. Object diagrams also represent
the static view of a system, but this static view is a snapshot of the system at a particular
moment. Object diagrams are used to render a set of objects and their relationships as an
instance. In other words, the purposes of object diagrams are similar to those of class
diagrams. The difference is that a class diagram represents an abstract model consisting of
classes and their relationships, whereas an object diagram focuses on an instance at a
particular moment.

4.6 Sequence Diagram

A sequence diagram is an interaction diagram that shows how processes operate with one
another and in what order. A sequence diagram also shows object interactions arranged in
time sequence. It depicts the objects and classes involved in the scenario and the sequence of
messages exchanged between the objects needed to carry out the functionality of the
scenario. Sequence diagrams are sometimes called event diagrams or event scenarios.

A sequence diagram shows parallel vertical lines (lifelines) for the different processes or
objects that live simultaneously, and horizontal arrows for the messages exchanged between
them, in the order in which they occur in the system. This allows simple runtime scenarios to
be specified graphically. The figure shows the sequence diagram of the identification process.
The sequence diagram begins with the initiation of the identification process, triggered by the
submission of a job posting description. Next, the processed description is fed into the
machine learning classifier, which has been trained on a dataset of labeled job postings
(genuine or fake). The classifier then evaluates the features extracted from the description
and predicts whether the posting is genuine or fake based on its learned patterns and decision
boundaries. The validation mechanism checks for any additional criteria or red flags
associated with fake job postings, such as suspicious URLs, unrealistic salary offers, or
inconsistent contact information. Once the classification and validation steps are complete,
the system aggregates the results and generates a final decision regarding the authenticity of
the job posting. If the majority of classifiers and validators indicate that the posting is fake,
appropriate actions can be taken, such as flagging the posting for review or alerting the
platform administrators.

Fig 5. Sequence Diagram

4.7 Activity Diagram

Activity diagrams are graphical representations of workflows of stepwise activities and
actions, with support for choice, iteration, and synchronization. In the Unified Modeling
Language, activity diagrams can be used to describe the business and operational
step-by-step workflows of components in a system. An activity diagram shows the overall
flow of control.

An activity diagram is a flowchart that represents the flow from one activity to the next. The
activity diagram for the identification of fake job postings involves a series of steps aimed at
analyzing the textual content of job postings and determining their authenticity. The process
depicted is described below (a small code sketch of these steps is given after the figure):

1. Input Job Posting Description: The process begins with the input of the job posting
description, which contains textual information about the job role, responsibilities,
qualifications, and other relevant details.
2. Text Preprocessing: The job posting description undergoes text preprocessing, which
involves steps such as tokenization, removing stopwords, stemming, and lemmatization.
3. Feature Extraction: Next, features are extracted from the preprocessed text. This
involves identifying key words, phrases, and other linguistic features that may indicate the
authenticity or falsity of the job posting.
4. Classification Model Selection: Based on the extracted features, a classification model is
selected for identifying fake job postings. Common models include Naive Bayes and Logistic
Regression.
5. Model Training: The selected classification model is trained using a labeled dataset of
job postings, where each posting is categorized as either genuine or fake.
6. Prediction: Once the model is trained, it is used to predict the authenticity of the input job
posting description. The model assigns a probability score or a binary classification (fake
or genuine) to the posting based on its analysis of the textual features.
7. Thresholding: If the model outputs a probability score, a thresholding step may be
applied to convert the scores into binary predictions.
8. Output Decision: Finally, based on the model's prediction, a decision is made regarding
the authenticity of the job posting. If the posting is classified as fake, appropriate actions
may be taken, such as flagging the posting for review or removing it from the platform.

Fig 6. Activity Diagram
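The sketch below illustrates steps 2 to 7 with an assumed scikit-learn pipeline (TF-IDF features plus a Naive Bayes classifier); the tiny inline dataset is purely hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled postings: 1 = fake, 0 = genuine.
texts = [
    "Earn $5000 weekly from home, no experience, pay registration fee now",
    "Software engineer required, 3+ years Python, on-site interviews",
    "Urgent hiring!!! send bank details to claim guaranteed job offer",
    "Data analyst role, bachelor's degree in statistics preferred",
]
labels = [1, 0, 1, 0]

# Preprocessing, feature extraction (TF-IDF), and classification in one pipeline.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(texts, labels)

new_post = ["Work from home, instant payout, registration fee required"]
print("predicted label:", clf.predict(new_post)[0])            # 1 means flagged as fake
print("fake probability:", clf.predict_proba(new_post)[0][1])  # used for thresholding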

CHAPTER 5

SYSTEM IMPLEMENTATION
If you have been in the field of machine learning for some time, you must have created
some machine learning or deep learning models, and you must have wondered how people
will use your Jupyter notebook. The answer is that they won't. People cannot use your
Jupyter notebooks directly; you need to deploy your model either as an API, as a complete
web service, on a mobile device, on a Raspberry Pi, and so on.

What is Python Flask?

Flask is a web framework: a Python module that lets you develop web applications easily.
It has a small and easy-to-extend core; it is a micro-framework that does not include an
ORM (Object Relational Mapper) or similar features. It does, however, provide useful
features such as URL routing and a template engine. It is a WSGI web application framework.

What is a Web Framework?

A web application framework, or simply a web framework, is a collection of libraries and
modules that enables web application developers to write applications without worrying
about low-level details such as protocols, thread management, and so on.

What is Flask?
Flask is a popular Python web framework used for developing web applications. It is a
lightweight and flexible framework that allows developers to build web applications
quickly and easily.
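
For illustration only, a minimal Flask application (a generic sketch, separate from the project's app.py listed in Section 3.2) looks like this:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # A single route returning plain text
    return "Hello, Flask!"

if __name__ == "__main__":
    # Starts Flask's built-in development server on http://127.0.0.1:5000
    app.run(debug=True)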

WSGI
WSGI is a specification for a common interface between web servers and web applications
or frameworks written in Python. It was created to simplify the process of deploying web
applications on different web servers, as it allows developers to write their applications
independently of the web server they will run on.
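
To make the WSGI contract concrete, the sketch below shows a bare WSGI application: a callable that receives the request environment and a start_response function. This is a generic illustration, not part of the project code:

# A minimal WSGI application; any WSGI-compliant server can host this callable.
def application(environ, start_response):
    status = "200 OK"
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    start_response(status, headers)
    return [b"Hello from a bare WSGI app"]

if __name__ == "__main__":
    # Serve it with the reference server from the Python standard library.
    from wsgiref.simple_server import make_server
    with make_server("127.0.0.1", 8000, application) as server:
        server.serve_forever()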

Gensim
Gensim is an open-source Python library used for unsupervised topic modelling and natural
language processing. It is designed to extract semantic topics from documents and can
handle large text collections, which distinguishes it from machine learning packages that
only target in-memory processing. Gensim also provides efficient multicore implementations
of various algorithms to increase processing speed.
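
A toy sketch of Gensim's topic extraction is shown below; it is only an illustration of the library's interface and is not part of the Flask application listed later:

from gensim import corpora
from gensim.models import LdaModel

# A toy corpus; real use would start from tokenized, preprocessed documents.
texts = [
    ["card", "transaction", "fraud", "alert"],
    ["payment", "card", "purchase", "online"],
    ["fraud", "detection", "model", "transaction"],
]
dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

# Extract two semantic topics from the corpus
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())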
PyCharm
PyCharm is a popular integrated development environment (IDE) used for Python development.
It is developed by JetBrains and is available in both free and paid editions. PyCharm
offers a rich set of features, such as syntax highlighting, code completion, debugging,
testing, code refactoring, and version control integration.

C:\Users\yaswa\Desktop\Capstone-2\CreditCardTransactionAnomalies\Anomaly_detection\projectcode>python app.py

C:\Users\yaswa\Desktop\Capstone-2\CreditCardTransactionAnomalies\Anomaly_detection\projectcode\e
Trying to unpickle estimator ExtraTreeRegressor from version 1.3.0 when using version 1.4.2.
This might lead to breaking code or invalid results. Use at your own risk.
For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

C:\Users\yaswa\Desktop\Capstone-2\CreditCardTransactionAnomalies\Anomaly_detection\projectcode\e
Trying to unpickle estimator IsolationForest from version 1.3.0 when using version 1.4.2.
This might lead to breaking code or invalid results. Use at your own risk.
For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

 * Serving Flask app 'app'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a
production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit

3.2 Sample coding:

App.py:

from flask import Flask, render_template, request
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import IsolationForest
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import os

app = Flask(__name__)

# Load the trained Isolation Forest model
model_path = "isolation_forest.pkl"
with open(model_path, "rb") as f:
    model = pickle.load(f)


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        f = request.files['file']
        data = pd.read_csv(f)

        # The next few lines were truncated in the original listing; they are
        # reconstructed here to mirror the notebook code in Chapter 9.
        X = data.drop("Class", axis=1)
        y = data["Class"]
        y_pred = model.predict(X)
        y_pred[y_pred == 1] = 0
        y_pred[y_pred == -1] = 1

        conf_matrix = confusion_matrix(y, y_pred)
        accuracy = accuracy_score(y, y_pred)
        report = classification_report(y, y_pred)

        # Collect the mislabeled transactions
        errors_original = y != y_pred
        mislabel_df = pd.DataFrame(data=X[errors_original], columns=X.columns)
        mislabel_df['Predicted_Label'] = y_pred[errors_original]
        mislabel_df['True_Label'] = y[errors_original]

        # Plot and save the confusion matrix for display on the results page
        plt.figure(figsize=(6, 6))
        sns.heatmap(pd.DataFrame(conf_matrix), xticklabels=['Valid', 'Fraud'],
                    yticklabels=['Valid', 'Fraud'], linewidths=0.05, annot=True,
                    fmt="d", cmap='BuPu')
        plt.title("Isolation Forest Classifier - Confusion Matrix")
        plt.xlabel('Predicted Value')
        plt.ylabel('True Value')
        plt.savefig('static/confusion_matrix.png')

        return render_template('result.html', accuracy=accuracy, report=report,
                               mislabel_df=mislabel_df.to_html(),
                               image='static/confusion_matrix.png')


if __name__ == '__main__':
    app.run(debug=True)
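
Once the server is running, the /predict endpoint can be exercised from a small client script. The CSV file name below is hypothetical; as app.py expects, the file must contain the transaction feature columns plus a "Class" column:

import requests

# Post a CSV of transactions to the locally running Flask app
with open("creditcard_sample.csv", "rb") as f:
    response = requests.post("http://127.0.0.1:5000/predict", files={"file": f})

print(response.status_code)   # 200 on success
print(response.text[:500])    # beginning of the rendered result.html page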

index.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Isolation Forest Classifier</title>
    <link rel="stylesheet"
          href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
    <style>
        body {
            background-color: #f8f9fa;
            font-family: Arial, sans-serif;
            margin: 0;
            padding: 0;
        }
        .container {
            max-width: 600px;
            margin: 100px auto;
            padding: 20px;
            border-radius: 10px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
            background-color: #ffffff;
        }
        h1 {
            text-align: center;
            margin-bottom: 30px;
        }
        form {
            text-align: center;
        }
        input[type="file"] {
            display: none;
        }
        label {
            background-color: #007bff;
            color: #ffffff;
            padding: 10px 20px; /* value truncated in the original listing; assumed to match the submit button */
            border-radius: 5px;
            cursor: pointer;
        }
        input[type="submit"] {
            background-color: #28a745;
            color: #ffffff;
            border: none;
            padding: 10px 20px;
            border-radius: 5px;
            cursor: pointer;
        }
        input[type="submit"]:hover {
            background-color: #218838;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Credit Card Transaction Anomalies Prediction</h1>
        <form method="POST" action="/predict" enctype="multipart/form-data">
            <input type="file" name="file" id="file" accept=".csv">
            <label for="file">Choose File</label>
            <input type="submit" value="Predict">
        </form>
    </div>
</body>
</html>

Result.html:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Isolation Forest Classifier - Results</title>
</head>
<body>
<h1>Isolation Forest Classifier - Results</h1>
<h2>Accuracy Score: {{ accuracy }}</h2>
<h2>Classification Report:</h2>
<pre>{{ report }}</pre>
<h2>Mislabeled Transactions:</h2>
{{ mislabel_df | safe }}
<h2>Confusion Matrix:</h2>
<img src="{{ image }}" alt="Confusion Matrix">
</body>
</html>

3.3 Input Screen:

Fig 7. Input Screen

3.4 Output:

Fig 8. Output image

Fig 9. Output screen

CHAPTER 6

TESTING
6.1 Introduction to Testing

The purpose of testing is to discover errors. Testing is the process of trying to find every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, and assemblies, as well as of the finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests; each test type addresses a particular testing requirement.

Types of Testing

Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic
is functioning properly and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of individual software units of
the application, and it is done after the completion of an individual unit and before
integration. This is a structural test that relies on knowledge of the unit's construction and
is invasive. Unit tests perform basic tests at the component level and exercise a specific
business process, application, and/or system configuration. Unit tests ensure that each unique
path of a business process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
Unit testing is usually conducted as part of a combined code-and-unit-test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
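
As a concrete illustration, a unit test for this project could isolate and verify the label-remapping step used by the models. The helper function below is hypothetical; it simply mirrors the -1/1 to 1/0 relabelling that appears in the Chapter 9 listings, so that this logic can be tested on its own (pytest style):

import numpy as np

def relabel_predictions(raw):
    """Map scikit-learn outlier output (1 = inlier, -1 = outlier) to 0/1 class labels."""
    labels = np.asarray(raw).copy()
    labels[labels == 1] = 0
    labels[labels == -1] = 1
    return labels

def test_relabel_predictions():
    raw = np.array([1, -1, 1, -1, -1])
    expected = np.array([0, 1, 0, 1, 1])
    assert np.array_equal(relabel_predictions(raw), expected)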

Test Strategy and Approach

Field testing will be performed manually and functional tests will be written in detail.

Test Objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages, and responses must not be delayed.

Features to be Tested

• Verify that the entries are in the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
Integration Testing
Integration tests are designed to test integrated software components to determine whether
they actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that, although the components were
individually satisfactory (as shown by successful unit testing), the combination of components
is correct and consistent. Integration testing is specifically aimed at exposing the problems
that arise from the combination of components.

Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.

The task of the integration test is to check that components or software applications, for
example components in a software system or, one step up, software applications at the company
level, interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects were
encountered.

Functional Test
Functional tests provide systematic demonstrations that the functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input: identified classes of valid input must be accepted.

Invalid Input: identified classes of invalid input must be rejected.

Functions: identified functions must be exercised.

Output: identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests are focused on requirements, key functions,
or special test cases. In addition, systematic coverage pertaining to identifying business
process flows, data fields, predefined processes, and successive processes must be considered
for testing. Before functional testing is complete, additional tests are identified and the
effective value of current tests is determined.

System Test

System testing ensures that the entire integrated software system meets the requirements. It
tests a configuration to ensure known and predictable results. An example of system testing is
the configuration-oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.

White Box Testing

White box testing is a testing approach in which the tester has knowledge of the inner
workings, structure, and language of the software, or at least its purpose. It is used to test
areas that cannot be reached from a black box level.

Black Box Testing

Black box testing is testing the software without any knowledge of the inner workings,
structure, or language of the module being tested. Black box tests must be written from a
definitive source document, such as a specification or requirements document. It is a form of
testing in which the software under test is treated as a black box: you cannot "see" into it.
The test provides inputs and checks outputs without considering how the software works.

CHAPTER 7

IMPLEMENTATION RESULT
1.1 Implementation

The following are outputs obtained:

Fig 10. Home page

Fig 11. Add CSV to page

Fig 12. Results page

CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENT

8.1 Conclusion:
In conclusion, the identification of fake job postings using machine learning (ML) techniques
has shown promising results in improving the efficiency and accuracy of detecting fraudulent
job advertisements. Through the use of features such as textual content, metadata, and user
behavior patterns, ML models have demonstrated the capability to distinguish between
legitimate and fake job postings with a high degree of accuracy. We have experimented with
machine learning algorithms (Isolation Forest and PCA). This work presents a comparative study
evaluating traditional machine learning classifiers. Among the traditional machine learning
algorithms, the highest classification accuracy was obtained with the Logistic Regression
classifier (99% accuracy).
Despite the advancements made in the field of identifying fake job postings using ML,
there are several areas for future exploration and improvement:

1. Data Augmentation and Enhancement: Enhancing the quality and diversity of datasets
used for training ML models can lead to better generalization and robustness.

2. Deep Learning Architectures: Further exploration of deep learning architectures, such as
recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based
models like BERT, could potentially improve the performance of fake job posting detection
systems.

CHAPTER 9

9.1 Data Pre-processing and Analysis:

import pandas as pd
import matplotlib.pyplot as plt

# Transaction Class Distribution
# (df is the transactions DataFrame loaded earlier; LABELS is assumed to be
#  defined earlier in the notebook, e.g. LABELS = ["Normal", "Fraud"])
count_classes = pd.value_counts(df['Class'], sort=True)
count_classes.plot(kind='bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency")

Text(0, 0.5, 'Frequency')
Fig 13: Transaction class Distribution

Local Outlier Factor Model:

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.neighbors import LocalOutlierFactor
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define the Local Outlier Factor (LOF) model
# (X, y and outlier_fraction are defined earlier in the notebook)
lof = LocalOutlierFactor(n_neighbors=20, algorithm='auto', leaf_size=30,
                         metric='minkowski', p=2, metric_params=None,
                         contamination=outlier_fraction)

# Fit the LOF model and predict outliers
y_pred_lof = lof.fit_predict(X)

# Reshape the prediction values: 0 for valid transactions, 1 for fraudulent transactions
y_pred_lof[y_pred_lof == 1] = 0
y_pred_lof[y_pred_lof == -1] = 1

# Calculate and plot the confusion matrix
conf_matrix_lof = confusion_matrix(y, y_pred_lof)
labels = ['Valid', 'Fraud']
plt.figure(figsize=(6, 6))
sns.heatmap(pd.DataFrame(conf_matrix_lof), xticklabels=labels, yticklabels=labels,
            linewidths=0.05, annot=True, fmt="d", cmap='BuPu')
plt.title("Local Outlier Factor - Confusion Matrix")
plt.ylabel('True Value')
plt.xlabel('Predicted Value')
plt.show()

# Count errors (mislabeled transactions)
n_errors_lof = (y_pred_lof != y).sum()
print("Local Outlier Factor:", n_errors_lof)

# Print accuracy score and classification report
print("Accuracy Score :", accuracy_score(y, y_pred_lof))
print("Classification Report :\n", classification_report(y, y_pred_lof))

# Create a DataFrame containing the mislabeled transactions and their features
mislabel_df_lof = pd.DataFrame(data=X[y_pred_lof != y], columns=X.columns)

# Add columns for the predicted and true labels
mislabel_df_lof['Predicted_Label'] = y_pred_lof[y_pred_lof != y]
mislabel_df_lof['True_Label'] = y[y_pred_lof != y]

# Print the mislabeled transactions DataFrame
print("Mislabeled Transactions (Local Outlier Factor):")
print(mislabel_df_lof)

Fig 14: Confusion matrix (Local Outlier Factor)

accuracy 1.00 28481
macro avg 0.51 0.51 0.51 28481
weighted avg 1.00 1.00 1.00 28481

Mislabeled Transactions (Local Outlier Factor):


Time V1 V2 V3 V4 V5 V6 \
235644 148479.0 -1.541678 3.846800 -7.604114 3.121459 -1.254924 -2.084875
9800 14500.0 -4.538653 -0.672919 0.934677 -2.588147 -3.465292 1.340718
254344 156685.0 -0.129778 0.141547 -0.894702 -0.457662 0.810608 -0.504723
161052 113829.0 -2.108449 0.116285 -3.169226 -0.649842 -4.809183 5.527902
192529 129741.0 -1.396204 2.618584 -6.036770 3.552454 1.030091 -2.950358
... ... ... ... ... ... ... ...
11331 19728.0 -0.493844 1.962996 -2.526668 2.132008 0.097104 -2.118047
88876 62330.0 1.140865 1.221317 -1.452955 2.067575 0.854742 -0.981223
153885 100501.0 -6.985267 5.151094 -4.599338 4.534479 0.849054 -0.210701
152920 97587.0 1.379129 -2.219215 -0.917027 -0.743102 -0.709693 1.424519
245556 152802.0 1.322724 -0.843911 -2.096888 0.759759 -0.196377 -1.166353

V7 V8 V9 ... V22 V23 V24 \


235644 -2.385027 1.471140 -2.530507 ... 1.064222 0.065370 0.257209
9800 -0.027617 -3.304568 4.851260 ... -0.383959 -1.924213 0.620426
254344 1.373588 -0.209476 0.208494 ... -0.246526 0.484108 0.359637
161052 9.203722 -1.415340 -1.589447 ... 1.395863 -0.582287 -1.739466
192529 -1.528506 0.189319 -1.433554 ... -0.390176 0.356029 -0.762352
... ... ... ... ... ... ... ...
11331 0.311326 0.426446 0.790226 ... -0.379834 0.177180 0.315463
88876 0.325714 -0.037721 0.113219 ... -0.793460 -0.132333 -0.331586
153885 -4.425230 -5.134525 0.069321 ... -2.056177 -0.280334 0.120771
152920 -1.130784 0.344343 1.267057 ... 1.108769 -0.150223 -1.657732
245556 0.482534 -0.349791 1.045007 ... -0.121562 -0.208574 -0.254752

V25 V26 V27 V28 Amount Predicted_Label \


235644 -0.693654 -0.335702 0.577052 0.398348 122.68 0
9800 0.668034 -0.750027 1.194460 -1.043734 570.49 1
254344 -0.435972 -0.248480 0.021527 0.109192 187.11 0
161052 1.331244 0.998640 0.404425 -1.192841 1681.28 1
192529 0.096510 -0.487861 0.062655 -0.240732 1.00 0
... ... ... ... ... ... ...
11331 -0.451658 -0.450737 0.196984 -0.186761 89.99 1
88876 0.664878 -0.309312 0.099942 0.122988 1.00 0
153885 0.569358 0.145971 0.300193 1.779364 0.76 0
152920 -0.536700 -0.026398 -0.034698 -0.020558 352.30 1
245556 -0.098324 -0.613874 0.002654 0.072386 357.95 0

161052 0
192529 1
... ...
11331 0
88876 1
153885 1
152920 0
245556 1

[97 rows x 32 columns]

Isolation Forest Model

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import IsolationForest
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

# Define and fit the Isolation Forest model
# (X, y, outlier_fraction and state are defined earlier in the notebook)
isolation_forest = IsolationForest(n_estimators=100, max_samples=len(X),
                                   contamination=outlier_fraction,
                                   random_state=state, verbose=0)
y_pred_isolation_forest = isolation_forest.fit_predict(X)

# Reshape the prediction values: 0 for valid transactions, 1 for fraudulent transactions
y_pred_isolation_forest[y_pred_isolation_forest == 1] = 0
y_pred_isolation_forest[y_pred_isolation_forest == -1] = 1

# Persist the trained model for use in the Flask application
with open("isolation_forest.pkl", "wb") as f:
    pickle.dump(isolation_forest, f)

# Calculate and plot the confusion matrix
conf_matrix_isolation_forest = confusion_matrix(y, y_pred_isolation_forest)
labels = ['Valid', 'Fraud']
plt.figure(figsize=(6, 6))
sns.heatmap(pd.DataFrame(conf_matrix_isolation_forest), xticklabels=labels,
            yticklabels=labels, linewidths=0.05, annot=True, fmt="d", cmap='BuPu')
plt.title("Isolation Forest Classifier - Confusion Matrix")
plt.ylabel('True Value')
plt.xlabel('Predicted Value')
plt.show()

# Count and report mislabeled transactions
n_errors_isolation_forest = (y_pred_isolation_forest != y).sum()
print("Isolation Forest:", n_errors_isolation_forest)
print("Accuracy Score :", accuracy_score(y, y_pred_isolation_forest))
print("Classification Report :\n", classification_report(y, y_pred_isolation_forest))

errors_original = y != y_pred_isolation_forest
mislabel_df = pd.DataFrame(data=X[errors_original], columns=X.columns)
mislabel_df['Predicted_Label'] = y_pred_isolation_forest[errors_original]
mislabel_df['True_Label'] = y[errors_original]
print("Mislabeled Transactions:")
print(mislabel_df)

Output

Fig 15: Confusion matrix (Isolation Forest)

Isolation Forest: 73
Accuracy Score : 0.9974368877497279
Classification Report :
              precision    recall  f1-score   support

0 1.00 1.00 1.00 28432
1 0.26 0.27 0.26 49

accuracy 1.00 28481


macro avg 0.63 0.63 0.63 28481
weighted avg 1.00 1.00 1.00 28481

Mislabeled Transactions:
Time V1 V2 V3 V4 V5 \
235644 148479.0 -1.541678 3.846800 -7.604114 3.121459 -1.254924
261843 160204.0 -26.389030 -17.755687 -10.278766 10.413010 -9.446086
254344 156685.0 -0.129778 0.141547 -0.894702 -0.457662 0.810608
192529 129741.0 -1.396204 2.618584 -6.036770 3.552454 1.030091
47299 43164.0 -12.008347 -17.860112 -4.743411 0.810574 -10.726424
... ... ... ... ... ... ...
146790 87883.0 -1.360293 -0.458069 -0.700404 2.737229 -1.005106
88876 62330.0 1.140865 1.221317 -1.452955 2.067575 0.854742
76705 56705.0 -22.187453 -18.955081 -7.277285 4.786356 -6.214326
153885 100501.0 -6.985267 5.151094 -4.599338 4.534479 0.849054
245556 152802.0 1.322724 -0.843911 -2.096888 0.759759 -0.196377

V6 V7 V8 V9 ... V22 V23 \


235644 -2.084875 -2.385027 1.471140 -2.530507 ... 1.064222 0.065370
261843 6.220135 7.705953 -2.311311 5.501918 ... -0.215655 -9.002474
254344 -0.504723 1.373588 -0.209476 0.208494 ... -0.246526 0.484108
192529 -2.950358 -1.528506 0.189319 -1.433554 ... -0.390176 0.356029
47299 5.766249 15.406218 -1.111401 -0.704243 ... -0.154666 12.451839
... ... ... ... ... ... ... ...
146790 2.891399 5.802537 -1.933197 -1.017717 ... -0.053812 0.580106
88876 -0.981223 0.325714 -0.037721 0.113219 ... -0.793460 -0.132333
76705 3.137738 8.676152 -2.790720 3.982722 ... 0.090859 -1.274548
153885 -0.210701 -4.425230 -5.134525 0.069321 ... -2.056177 -0.280334
245556 -1.166353 0.482534 -0.349791 1.045007 ... -0.121562 -0.208574

V24 V25 V26 V27 V28 Amount \


235644 0.257209 -0.693654 -0.335702 0.577052 0.398348 122.68
261843 1.126551 0.396670 0.220177 -6.592504 1.373170 1441.06
254344 0.359637 -0.435972 -0.248480 0.021527 0.109192 187.11
192529 -0.762352 0.096510 -0.487861 0.062655 -0.240732 1.00
47299 -0.909707 2.497907 -0.483602 -2.037786 0.266541 5114.10
... ... ... ... ... ... ...
146790 0.216927 0.151643 -0.332115 -0.469800 1.495006 829.41
88876 -0.331586 0.664878 -0.309312 0.099942 0.122988 1.00
76705 0.743121 0.893134 0.885374 1.276724 -2.285577 950.79
153885 0.120771 0.569358 0.145971 0.300193 1.779364 0.76
245556 -0.254752 -0.098324 -0.613874 0.002654 0.072386 357.95

PLAGIARISM REPORT:

CHAPTER 10
BIBLIOGRAPHY

10.1 References:

[1] P. Save, P. Tiwarekar, K. N., and N. Mahyavanshi, "A Novel Approach for Detecting
Credit Card Fraud using Decision Trees," International Journal of Computer Applications,
vol. 161, no. 13, pp. 6–9, 2017, doi: 10.5120/ijca2017913413.

[2] J. Vimala Devi and K. S. Kavitha, "Fraud Detection in Credit Card Transactions by
using Classification Algorithms," in Proc. International Conference on Current Trends in
Computer, Electrical, Electronics and Communication (CTCEEC), 2017, doi:
10.1109/CTCEEC.2017.8455091.

[3] R. R. Popat and J. Chaudhary, "A Survey on Credit Card Fraud Detection Using Machine
Learning," in Proc. 2nd International Conference on Trends in Electronics and Informatics
(ICOEI), 2018, doi: 10.1109/ICOEI.2018.8553963.

[4] G. Kibria and M. Sevkli, "Application of Deep Learning for Credit Card Approval: A
Comparison with Two Machine Learning Techniques," International Journal of Machine
Learning and Computing, vol. 11, no. 4, 2021, doi: 10.18178/ijmlc.2021.11.4.1049.

[5] S. Bagga, A. Goyal, N. Gupta, and A. Goyal, "Credit Card Fraud Detection using
Pipelining and Ensemble Learning," Procedia Computer Science, vol. 173, 2020, doi:
10.1016/j.procs.2020.06.014.

[6] A. RB and S. K. KR, "Credit Card Fraud Detection Using Artificial Neural Network,"
Global Transitions Proceedings, 2021, doi: 10.1016/j.gltp.2021.01.006.
