Machine Learning Report

The document outlines a fraud detection model for credit card transactions, emphasizing the importance of ethical considerations and data management procedures. It details the modeling process, including feature engineering, hyperparameter tuning, and the use of SMOTE to address class imbalance, ultimately recommending the implementation of a Machine Learning model that significantly outperforms the baseline practice. The estimated annual profit from the model is projected at approximately $1,426,693.17.


ASSUMPTIONS

1. There are no online transactions.


2. “account_status” is the baseline practice, which indicates the bank’s action after
fraud is predicted.
A. GENERAL QUESTIONS
1. Leakage: “account_status” causes target leakage, since we assume it reflects actions taken by the bank after fraud is predicted; “time_since_last_trans” may also lead to data leakage.
2. Ethical considerations: Under GDPR Article 6(1)(f) and FCRA §1681b,
personal data may be processed for fraud detection under legitimate interest or
permissible use. However, features like “first name”, “last name”, “street”, “job”
may raise privacy and discrimination concerns and should be excluded.
3. Distances combined: Customer and merchant locations were combined via the Haversine distance formula; abnormally large distances indicate potential fraud (see the sketch after this list).
4. Purchase frequency: Sudden spikes or drops in spending on specific categories, within a short time window, or at unfamiliar merchants may signal fraud. Frequency features help capture these abnormal patterns.
5. Missing values: There are a total of 995 missing values in the column time_since_last_trans, which correspond to cardholders’ first transactions. The fraud rate among these rows is 8.9%, much higher than the fraud rate of the whole dataset (0.5%). The rationale is that there is no prior history to assess user behavior, making it harder to detect anomalies, and fraudsters may exploit this to carry out fraudulent transactions. A boolean feature, is_not_first_transaction, was added to flag first-time transactions and improve model predictability.
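A minimal pandas sketch of how the two signals described above could be derived. The column names (lat, long, merch_lat, merch_long, time_since_last_trans) follow the report; the file name transactions.csv and the helper haversine_km are placeholders, not part of the original work.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

# One transaction per row, columns named as in the report (path is a placeholder)
df = pd.read_csv("transactions.csv")

# Distance between cardholder and merchant; unusually large values may indicate fraud
df["Distance_km"] = haversine_km(df["lat"], df["long"], df["merch_lat"], df["merch_long"])

# 1 if the cardholder has a prior transaction (time_since_last_trans present), else 0
df["is_not_first_transaction"] = df["time_since_last_trans"].notna().astype(int)
```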

B. REPORT
1. Baseline Practice: The current baseline practice involves assigning each credit
account a status: active, suspended, closed, or in collection. For modeling
purposes, we treat accounts with a status of suspended, closed, or in collection as
indicating predicted fraudulent transactions, while active accounts are considered
non-fraudulent. Hence, we convert the account_status into a binary variable: 0 for
non-fraud (active) and 1 for fraud (suspended, closed, or in collection), then
calculate profit based on the confusion matrix.
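A short sketch of this baseline conversion, reusing the df from the sketch above. The exact status strings stored in account_status are an assumption here; the mapping itself follows the rule stated in the paragraph.

```python
import pandas as pd

# Baseline "predicted fraud" flag derived from the bank's account status:
# active -> 0 (non-fraud); suspended, closed, in collection -> 1 (fraud)
status_to_pred = {"active": 0, "suspended": 1, "closed": 1, "in collection": 1}
df["baseline_pred"] = df["account_status"].map(status_to_pred)

# Confusion-matrix counts of the baseline against the true label is_fraud,
# from which the baseline profit is then calculated
baseline_confusion = pd.crosstab(df["is_fraud"], df["baseline_pred"],
                                 rownames=["actual"], colnames=["predicted"])
print(baseline_confusion)
```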
We then implement an ML model using the best model on the leaderboard, the Light Gradient Boosted Trees Classifier with Early Stopping, at a threshold of 0.1. We compare the result with the baseline practice and recommend that the bank implement the ML model.
● Comparison with the baseline practice (on the test set)
Baseline’s Profit: 334850.178
ML Model’s Profit: 570623.119 (outperformed the baseline practice)
● The estimated annual profit of the ML model
Because our profit is calculated on the test set, which accounts for only 20% of the total data (spanning 2 years), we need to scale it up to estimate the annual profit.
Annual profit = (Profit on Test set / Number of Data on Test set) × (Number of Data on Original set / 2)
              = (570623.119 / 92620) × (463099 / 2)
              = $1,426,693.17

2. Action/Intervention: We build a Machine Learning model, predict fraud, and calculate the total annual profit. Potentially, we can suggest actions once a transaction is predicted to be fraudulent, such as suspending or freezing the customer’s spending privileges.

3. Data Management Procedure


a. Data review: The original dataset contains 463,099 credit card transactions
across 23 features, with < 0.1% missing values in one column. The target variable
“is_fraud” is highly imbalanced, with only 0.5% labelled as fraud. Features
include numeric, categorical, text, datetime, geolocation, and customer/merchant
identifiers. The “amt” field is heavily right-skewed due to infrequent high-value
transactions. The “account_status” and “is_fraud” are highly correlated. The
dataset is clean and requires minimal preprocessing to handle missing
data and formatting.
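The review above can be reproduced with a few quick pandas checks. A sketch under the column names used in the report; the file path is a placeholder and the expected figures are those stated in the paragraph.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")                         # placeholder path

print(df.shape)                                              # expected: (463099, 23)
print(df["is_fraud"].value_counts(normalize=True))           # class imbalance (~0.5% fraud)
print(df.isna().sum().sort_values(ascending=False).head())   # missing values by column
print(df["amt"].skew())                                      # right-skew of transaction amounts
print(df.groupby("account_status")["is_fraud"].mean())       # strong association with the target
```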

b. Preprocessing and Feature engineering:

Preprocessing steps and explanations:

Feature creation
● Distance_km: calculated using the Haversine distance formula from merchant and customer latitude/longitude (merch_lat, merch_long, lat, long).
● is_not_first_transaction: binary variable created from time_since_last_trans; = 1 if time_since_last_trans is not missing, else = 0.
● Purchase_frequency_time: 1 / (mean of time_since_last_trans for one customer).
● Purchase_frequency_merchant: purchases at one merchant / total purchases by the customer.
● Purchase_frequency_category: purchases in one category / total purchases by the customer (the three frequency features are sketched after this table).

Feature exclusion
● account_status: causes target leakage.
● first, last, street, job: ethical considerations.
● trans_num: uninformative.
● time_since_last_trans: removed after the frequency features were created; its missing values were highly correlated with fraud.
● city, state: not needed since ZIP code is already used.

Encoding
● Gender_binary: processed using one-hot encoding.

Feature extraction
● Hour: created by extracting the hour from “trans_time”.
● Age: created by subtracting the year (from “dob”) from 2025.

Numerical transformation
● Log_amt: log transformation applied to handle skewed transaction amounts.
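A sketch of how the three frequency features could be computed with pandas groupby. The customer identifier cc_num and the merchant and category column names are assumptions (the report only states that the ratios are per customer); for readability the sketch runs on the full df, whereas the report derives these features after splitting (see section 3.c).

```python
# Purchase_frequency_time: 1 / (mean time between a customer's transactions)
df["Purchase_frequency_time"] = (
    1.0 / df.groupby("cc_num")["time_since_last_trans"].transform("mean"))

# Purchase_frequency_merchant: purchases at this merchant / total purchases by the customer
cust_total = df.groupby("cc_num")["amt"].transform("size")
cust_merch = df.groupby(["cc_num", "merchant"])["amt"].transform("size")
df["Purchase_frequency_merchant"] = cust_merch / cust_total

# Purchase_frequency_category: purchases in this category / total purchases by the customer
cust_cat = df.groupby(["cc_num", "category"])["amt"].transform("size")
df["Purchase_frequency_category"] = cust_cat / cust_total
```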

c. Data partitioning: We split the dataset into three subsets: training (60%), validation (20%), and test (20%). We used the training set to build the model, the validation set to evaluate the model and find the threshold that maximizes profit, and the test set to calculate the profit generated by the ML model (see the sketch below).
A notable point is that the three frequency features were created after splitting the data, to avoid data leakage, because the frequency formulas are calculated from averages over the dataset.
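A sketch of the 60/20/20 split using scikit-learn’s train_test_split twice, stratified on the target; the random seed is arbitrary, and the excluded columns from the table above would also be dropped from X in practice.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["is_fraud"])   # excluded features (account_status, etc.) also dropped here
y = df["is_fraud"]

# First carve off 20% as the test set, then split the remaining 80% into 60/20
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)  # 0.25 * 0.8 = 0.2
```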
d. Profit components

Profit component per transaction, by actual vs. predicted class:

● Actual 0 (No Fraud), Predicted 0 (No Fraud): + Processing fees (2% of transaction amount)
● Actual 0 (No Fraud), Predicted 1 (Fraud): - Saved Amount (10% * 7.5; applied to 10% of the false positives, randomly sampled)
● Actual 1 (Fraud), Predicted 0 (No Fraud): + Processing fees (2% of transaction amount)
● Actual 1 (Fraud), Predicted 1 (Fraud): + Preventable fraud amount (100% of transaction amount)

4. Modeling procedures
i. Model selection & Hyperparameters: To select the most suitable model with strong performance and minimal overfitting, we experimented with an AdaBoost Classifier and a Decision Tree, tuning hyperparameters using DataRobot and TPOT. The model from DataRobot outperformed the others. To determine the optimal threshold, predictions on the validation set were exported to Excel to build a profit curve, revealing that a threshold of 0.1 maximizes profit.

Hyperparameter tuning and model selection were automated using DataRobot.


The platform evaluated a wide range of algorithms, including decision trees,
random forests, gradient boosting machines (e.g., XGBoost), logistic regression,
and neural networks. To optimize each model's performance, DataRobot applied
techniques such as grid search and 5-fold cross-validation, which we explicitly
configured. The platform generated a leaderboard ranking all models based on
performance metrics like accuracy and AUC, allowing us to select the
best-performing model from the validation results.
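DataRobot performs this search automatically; as a rough local analogue, continuing the split sketch above, a grid search with 5-fold cross-validation on a gradient boosting classifier could look like the following. The parameter grid is purely illustrative, not the grid DataRobot actually searched.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; DataRobot explores its own parameter spaces
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",   # AUC, one of the leaderboard metrics mentioned above
    cv=5,                # 5-fold cross-validation
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```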

ii. Sampling Routines


To address the severe class imbalance in the training data, we applied SMOTE
(Synthetic Minority Over-sampling Technique) using the imblearn library.
SMOTE generates synthetic samples of the minority class (fraud cases) by
interpolating between existing minority observations and their nearest neighbors.
During this process, only the minority class was oversampled; the majority class
was left unchanged, preserving the original distribution of legitimate transactions.
The result was a balanced training set that helped the model learn to detect fraud
more effectively without being biased toward the dominant class.
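A minimal sketch of the SMOTE step with imblearn, applied only to the training split from the earlier sketch; it assumes the features in X_train are already numeric (encoded).

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample only the minority (fraud) class in the training data;
# the validation and test sets are left untouched.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))      # heavily imbalanced (~0.5% fraud)
print("after: ", Counter(y_train_bal))  # balanced classes after resampling
```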

iii. Potential prediction decision thresholds

To determine an effective decision threshold for fraud prediction, we evaluated
model performance across thresholds ranging from 0 to 1 in small increments. For
each threshold, we calculated the total profit on the holdout set using predefined
cost-reward structures for TP, FP, FN, and TN outcomes. Using DataRobot, we
tested our holdout set with the top-performing model, Light Gradient Boosted
Trees Classifier with Early Stopping. The profit curve increased initially and then
plateaued, showing local stability around a threshold of 0.1. We selected this
threshold as it offered a stable trade-off between fraud detection and false
positives while maximizing profit.
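A sketch of that threshold sweep, under stated assumptions: the payoffs follow the table in section 3.d literally, the false-positive term is our reading of “10% of the FP randomly sampled” as a 7.5 cost applied to 10% of false positives in expectation, and y_proba stands for the model’s predicted fraud probabilities on the holdout set (e.g. exported from DataRobot).

```python
import numpy as np

# y_proba: predicted fraud probabilities on the holdout set, e.g.
# y_proba = model.predict_proba(X_test)[:, 1]

def profit_at_threshold(y_true, y_proba, amt, threshold):
    """Total profit at one threshold, using the payoff table from section 3.d."""
    pred = (y_proba >= threshold).astype(int)
    tn = (y_true == 0) & (pred == 0)
    fp = (y_true == 0) & (pred == 1)
    fn = (y_true == 1) & (pred == 0)
    tp = (y_true == 1) & (pred == 1)
    profit = 0.0
    profit += 0.02 * amt[tn].sum()    # TN: processing fee, 2% of amount
    profit -= 0.10 * 7.5 * fp.sum()   # FP: 7.5 cost on 10% of FPs (our interpretation)
    profit += 0.02 * amt[fn].sum()    # FN: processing fee, as stated in the table
    profit += 1.00 * amt[tp].sum()    # TP: preventable fraud amount, 100% of amount
    return profit

thresholds = np.arange(0.01, 1.00, 0.01)
profits = [profit_at_threshold(y_test.values, y_proba, X_test["amt"].values, t)
           for t in thresholds]
best_threshold = thresholds[int(np.argmax(profits))]  # the report finds ~0.1 on its data
```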

5. Evaluation process
After running the model on DataRobot, no leakage or overfitting was detected. The top features with the strongest signal are zip, log_amt, Purchase_frequency_merchant, age, and hour of transaction. Zip indicates that certain areas are more likely to have fraud. The other features capture behavioral patterns and anomalies - such as unusual transaction amounts, buying from new or irregular merchants, age-related spending behavior, and off-hour transactions - which are often linked to fraud.
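DataRobot reports feature impact directly; as a rough local analogue, continuing the earlier sketches, permutation importance on the fitted model would surface the same kind of ranking.

```python
from sklearn.inspection import permutation_importance

# Rank features by how much shuffling each one degrades validation AUC
result = permutation_importance(
    search.best_estimator_, X_val, y_val,
    scoring="roc_auc", n_repeats=5, random_state=42)
ranking = sorted(zip(X_val.columns, result.importances_mean),
                 key=lambda kv: kv[1], reverse=True)
print(ranking[:5])   # the report's top signals: zip, log_amt, Purchase_frequency_merchant, age, hour
```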
