Machine Learning Report
B. REPORT
1. Baseline Practice: Under the current baseline practice, each credit account is
assigned a status: active, suspended, closed, or in collection. For modeling
purposes, we treat transactions on accounts with a status of suspended, closed, or
in collection as predicted fraudulent, while transactions on active accounts are
treated as predicted non-fraudulent. Accordingly, we convert account_status into a
binary variable (0 for non-fraud, i.e. active; 1 for fraud, i.e. suspended, closed, or in
collection) and calculate the baseline profit from the resulting confusion matrix.
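For concreteness, a minimal pandas sketch of this conversion and profit calculation is shown below. The file name, the column names (account_status, is_fraud, amt), and the per-outcome payoff values are illustrative assumptions, not the exact cost-reward structure behind our final figures.

```python
import pandas as pd

# Minimal sketch of the baseline evaluation; column names and payoffs are
# illustrative assumptions, not the project's exact cost-reward structure.
df = pd.read_csv("transactions.csv")  # hypothetical file name

# Baseline prediction: 1 = fraud (suspended, closed, or in collection), 0 = non-fraud (active)
df["baseline_pred"] = (df["account_status"].str.lower() != "active").astype(int)

def payoff(actual, predicted, amount):
    """Per-transaction profit; swap in the agreed cost-reward structure."""
    if predicted == 1 and actual == 1:   # TP: fraudulent charge blocked
        return amount
    if predicted == 1 and actual == 0:   # FP: legitimate sale blocked, fee forgone
        return -0.02 * amount
    if predicted == 0 and actual == 1:   # FN: fraud goes through, full loss
        return -amount
    return 0.02 * amount                 # TN: normal transaction fee earned

baseline_profit = sum(
    payoff(a, p, amt)
    for a, p, amt in zip(df["is_fraud"], df["baseline_pred"], df["amt"])
)
print(f"Baseline profit: {baseline_profit:,.2f}")
```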
We then implement an ML model using the best model on the leaderboard, the Light
Gradient Boosted Trees Classifier with Early Stopping at a threshold of 0.1. We
compare its results with the baseline practice and recommend that the bank adopt
the ML model.
● Comparison with the baseline practice (on the test set)
Baseline profit: $334,850.178
ML model profit: $570,623.119 (outperforms the baseline practice)
● Estimated annual profit of the ML model
Because our profit is calculated on the test set, which accounts for only 20% of the
total data (spanning 2 years), we scale it up to estimate the annual profit.
Annual Profit = (Profit on Test Set / Number of Data Points in Test Set) × (Number of Data Points in Original Set / 2)
              = (570,623.119 / 92,620) × (463,099 / 2)
              = $1,426,693.17
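The same scaling step can be written as a small helper, shown below; the row counts are the figures reported above, and the two-year divisor reflects the dataset's time span.

```python
def estimated_annual_profit(test_profit, n_test, n_total, years=2):
    """Scale test-set profit to the full dataset, then average per year."""
    return (test_profit / n_test) * (n_total / years)

# With the figures above, this comes out to roughly $1.43M per year.
print(estimated_annual_profit(570_623.119, 92_620, 463_099))
```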
[Table: Preprocessing steps and explanations]
[Table: Profit/cost matrix, with entries under Predicted (Fraud) given as 2% of the transaction amount and 100% of the transaction amount]
4. Modeling procedures
i. Model selection & Hyperparameters: To select the most suitable model with
strong performance and minimal overfitting, we experimented with the AdaBoost
Classifier, Decision Trees, and various hyperparameter settings using DataRobot
and TPOT. The model from DataRobot outperformed the others. To determine the
optimal threshold, predictions on the validation set were exported to Excel to build
a profit curve, which showed that a threshold of 0.1 maximizes profit.
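For reference, the automated search with TPOT looked roughly like the sketch below; the file name, feature columns, and search budget are assumptions, and the DataRobot runs were configured through its interface rather than in code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Sketch of the automated model/hyperparameter search; the file name,
# target column, and search budget are illustrative assumptions.
df = pd.read_csv("transactions_clean.csv")   # hypothetical preprocessed file
X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

tpot = TPOTClassifier(
    generations=5,          # small budget for illustration
    population_size=50,
    scoring="roc_auc",
    cv=5,
    random_state=42,
    verbosity=2,
    n_jobs=-1,
)
tpot.fit(X_train, y_train)
print("Hold-out AUC:", tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # inspect the winning pipeline
```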
To determine an effective decision threshold for fraud prediction, we evaluated
model performance across thresholds ranging from 0 to 1 in small increments. For
each threshold, we calculated the total profit on the holdout set using predefined
cost-reward structures for TP, FP, FN, and TN outcomes. Using DataRobot, we
tested our holdout set with the top-performing model, Light Gradient Boosted
Trees Classifier with Early Stopping. The profit curve increased initially and then
plateaued, showing local stability around a threshold of 0.1. We selected this
threshold as it offered a stable trade-off between fraud detection and false
positives while maximizing profit.
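The threshold sweep itself is straightforward to reproduce outside of Excel. The sketch below assumes the exported holdout predictions contain a fraud probability column (prediction), the true label (is_fraud), and the transaction amount (amt), and reuses the hypothetical payoff structure from the baseline sketch above.

```python
import numpy as np
import pandas as pd

def payoff(actual, predicted, amount):
    # Same hypothetical cost-reward structure as in the baseline sketch.
    if predicted == 1:
        return amount if actual == 1 else -0.02 * amount
    return -amount if actual == 1 else 0.02 * amount

# Holdout predictions exported from the modeling tool (hypothetical file name).
preds = pd.read_csv("holdout_predictions.csv")

# Sweep candidate thresholds and record total profit at each one.
results = []
for threshold in np.arange(0.0, 1.01, 0.01):
    flagged = (preds["prediction"] >= threshold).astype(int)
    profit = sum(
        payoff(a, p, amt)
        for a, p, amt in zip(preds["is_fraud"], flagged, preds["amt"])
    )
    results.append((threshold, profit))

profit_curve = pd.DataFrame(results, columns=["threshold", "profit"])
best = profit_curve.loc[profit_curve["profit"].idxmax()]
print(f"Best threshold: {best.threshold:.2f}, profit: {best.profit:,.2f}")
```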
5. Evaluation process
After running the project on DataRobot, no leakage or overfitting was detected. The
top features with the strongest signal are zip, log_amt,
Purchase_frequency_merchant, age, and hour of transaction. Zip indicates that
certain areas are more likely to see fraud. The other features capture behavioral
patterns and anomalies, such as unusual transaction amounts, purchases from new
or irregular merchants, age-related spending behavior, and off-hour transactions,
which are often linked to fraud.
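Although the final model was trained and evaluated inside DataRobot, the feature-signal check can be approximated with an open-source LightGBM classifier with early stopping, as sketched below; the file name and feature columns are assumptions based on the features listed above.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Approximate the gradient-boosted-trees model with LightGBM and inspect which
# features carry the strongest signal. File and column names are assumptions.
df = pd.read_csv("transactions_clean.csv")   # hypothetical preprocessed file
features = ["zip", "log_amt", "Purchase_frequency_merchant", "age", "hour"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["is_fraud"], test_size=0.2,
    stratify=df["is_fraud"], random_state=42,
)

model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # mirrors "early stopping"
)

# Rank features by how often they are used for splits.
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```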