Problem Statement

In this project we are going to build a simple neural network (NN) to perform credit card fraud detection.

The data contains customer features and a binary label, where "1" indicates a positive case (fraud) and "0" indicates a negative case (legitimate).

Before modeling, you will need to perform a train-test split to reserve part of your data for validation. One difficulty with fraud detection data is that you usually have only a few positive data points, a.k.a. the imbalanced dataset problem. You need to think about strategies for splitting and sampling your data that avoid the issues caused by this imbalance.
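One simple strategy for the split itself is stratified sampling, which keeps the fraud ratio the same in both the training and test sets. Below is a minimal sketch; it assumes the data has already been loaded into a pandas DataFrame named df with a binary label column named Class (placeholder names; adjust them to match your actual dataset).

# Minimal sketch of a stratified train-test split.
# Assumes a pandas DataFrame `df` whose label column is named "Class".
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Class"])   # feature columns
y = df["Class"]                  # binary label: 1 = fraud, 0 = legitimate

# stratify=y keeps the fraud/legitimate ratio identical in both splits,
# so the test set is guaranteed to contain some positive cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)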

Imbalanced data can cause the following issues. Suppose we have 100 data points and only two of them are positive:

1. You will likely have 0 positive cases in your test data when you split randomly.

2. A model that simply predicts every case to be negative achieves 98% accuracy, but does that mean it is a good model?

Therefore, accuracy should not be the sole evaluation metric. Precision, recall, F1 score, and AUC are better-suited metrics to consider.

A confusion matrix illustrates this problem well by breaking predictions down into true/false positives and true/false negatives.
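As a sketch of how the confusion matrix and these metrics can be computed with scikit-learn, assume y_test holds the true labels, y_pred the predicted classes, and y_score the predicted fraud probabilities (all placeholder names):

# Sketch of the metrics discussed above.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

print(confusion_matrix(y_test, y_pred))               # [[TN FP], [FN TP]]
print("precision:", precision_score(y_test, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_test, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_test, y_pred))
print("AUC:      ", roc_auc_score(y_test, y_score))   # uses probabilities, not classes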


We will use Keras, a high-level open-source deep learning library that lets users build and train NN models efficiently without explicitly programming the NN internals.

Keras official documentation: https://keras.io/

To summarize, you will proceed through the project in the following steps:

1. Load the data and briefly explore which columns are suitable to use as features and what the distribution of the label looks like.
2. Preprocess your feature set. If your data has a significant imbalance problem, think of strategies for drawing a representative validation set. Also think about how to re-balance your TRAINING dataset.
3. Once you have clean X_train, X_test, Y_train and Y_test, start building your NN. Using Keras, this is as simple as a few lines of code, and varying your network structure is like playing with LEGO! (See the sketch after this list.)
4. Train your model on your training set. You may play with hyperparameters such as the number of layers, the number of hidden neurons, different activation functions, the learning rate, and the number of epochs to improve your model's performance.
5. Evaluate your model with your reserved validation set and report your performance in accuracy, precision and recall. Discuss briefly what you observe from these metrics.
6. [Bonus] Try to implement another classic ML algorithm and compare it with your NN in terms of performance and computation time.
7. [Bonus 2] Learn more about and apply dropout, batch normalization and weight initialization to see if you can further improve your model.
8. [Bonus 3] Implement cross validation.
9. [Bonus 4] Evaluate using AUC: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
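As referenced in step 3, below is a minimal Keras sketch covering steps 3 to 5. The layer sizes, learning rate, number of epochs and the class-weight re-balancing are illustrative assumptions, not prescribed settings; X_train, y_train, X_test, y_test are the arrays produced by your split (the steps above call the labels Y_train/Y_test).

# Minimal Keras sketch for steps 3-5 (hyperparameters are illustrative; tune them).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # predicted probability of fraud
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()],
)

# One simple way to re-balance the TRAINING data: weight the rare
# positive class more heavily in the loss.
n_neg, n_pos = np.bincount(np.asarray(y_train).astype(int))
class_weight = {0: 1.0, 1: n_neg / n_pos}

model.fit(X_train, y_train, epochs=20, batch_size=256,
          class_weight=class_weight, validation_split=0.1)

# Evaluate on the reserved test set and report the metrics from step 5.
loss, acc, prec, rec = model.evaluate(X_test, y_test, verbose=0)
print(f"accuracy={acc:.3f}  precision={prec:.3f}  recall={rec:.3f}")

Weighting the loss with class_weight is only one possible re-balancing strategy; oversampling the minority class in the training set is another common option.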

A final hint: NNs can be expensive to train. If you do not have powerful hardware, start with a mini NN and test how well your machine handles the computation.
