HPC Mini Project Report
SPSS MODELER
A Mini Project
Submitted by
Hope Foundation's
International Institute of Information Technology
AY 2018-2019
Semester-1
Classification algorithms using SPSS Modeler
TABLE OF CONTENTS
1. PROBLEM STATEMENT
2. ABSTRACT
3. INTRODUCTION
4. OBJECTIVE
5. METHODOLOGY
6. MATHEMATICAL MODEL
7. ALGORITHM
8. FLOWCHART
9. RESULT
10. CONCLUSION
11. REFERENCES
1. PROBLEM STATEMENT
Apply a Logistic Regression classifier and a Random Forest classifier to CBC data using
the SPSS Modeler tool.
2. ABSTRACT
The Laser Interferometer Gravitational-Wave Observatory (LIGO) and the Virgo detector are
large-scale physics experiments designed to directly detect gravitational waves. The LIGO
Scientific Collaboration (LSC) and the Virgo Collaboration pursue gravitational wave
science with these detectors, along with partner collaborations around the world. The
gravitational wave strain signals are represented in the form of events.
We apply a supervised machine learning algorithm to predict an event from the strain
type and strain value. The model is trained on 70% of the data. Testing is done on the
remaining 30% of the dataset, where the strain value and strain type are given as input and
the model predicts the event.
3. INTRODUCTION
Data mining is a technique used in various domains to give meaning to the available data.
Classification is a data mining (machine learning) technique used to predict group
membership for data instances.
Classification is a technique in which we categorize data into a given number of classes. The
main goal of a classification problem is to identify the category/class into which a new data
instance falls.
Classification is used to find out to which group each data instance in a given dataset
belongs. It is used for classifying data into different classes according to some
constraints. Several major kinds of classification algorithms, including C4.5, ID3, k-nearest
neighbor classifier, Naive Bayes, SVM, and ANN, are used for classification. Generally, a
classification technique follows one of three approaches: Statistical, Machine Learning, or
Neural Network.
Classification is a two-step process. In the first step, the model is created by applying a
classification algorithm to the training data set. In the second step, the extracted model is
tested against a predefined test data set to measure the trained model's performance and
accuracy. Classification is therefore the process of assigning a class label to data instances
whose class label is unknown.
SPSS Modeler
IBM SPSS Modeler is a data mining and text analytics software application from IBM.
It is used to build predictive models and conduct other analytic tasks. It has a visual
interface which allows users to leverage statistical and data mining algorithms without
programming.
One of its main aims from the outset was to get rid of unnecessary complexity in data
transformations, and to make complex predictive models very easy to use. The first
version incorporated decision trees (ID3), and neural networks (backprop), which could
both be trained without underlying knowledge of how those techniques worked.
IBM SPSS Modeler was originally named Clementine by its creators, Integral
Solutions Limited. This name continued for a while after SPSS's acquisition of the
product. SPSS later changed the name to SPSS Clementine, and then later to PASW
Modeler.[1] Following IBM's 2009 acquisition of SPSS, the product was renamed IBM
SPSS Modeler.
Applications:
Classification algorithms:
• Logistic Regression
• Random Forest
Random forest, as its name implies, consists of a large number of individual decision
trees that operate as an ensemble. Each individual tree in the random forest spits out a
class prediction and the class with the most votes becomes our model’s prediction (see
figure below).
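The fundamental concept behind random forest is a simple but powerful one: the
wisdom of crowds. In data science terms, the random forest model works so well because a
large number of relatively uncorrelated trees operating as a committee will outperform any
of the individual trees.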
The low correlation between models is the key. Just like how investments with low
correlations (like stocks and bonds) come together to form a portfolio that is greater than the
sum of its parts, uncorrelated models can produce ensemble predictions that are more
accurate than any of the individual predictions. The reason for this wonderful effect is that
the trees protect each other from their individual errors (as long as they do not
constantly all err in the same direction). While some trees may be wrong, many other trees
will be right, so as a group the trees are able to move in the correct direction. Therefore, the
prerequisites for random forest to perform well are:
1. There needs to be some actual signal in our features so that models built using those
features do better than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have low
correlations with each other.
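The effect of combining uncorrelated trees can be illustrated numerically. The sketch below is not part of the SPSS Modeler workflow. It assumes, purely for illustration, that every tree is correct independently with the same probability p, and it computes the probability that the majority vote is correct. The function name majority_accuracy and the values 0.6 and 101 are hypothetical.

```python
from math import comb


def majority_accuracy(n_trees: int, p: float) -> float:
    """Probability that more than half of n_trees independent voters,
    each correct with probability p, give the correct answer.

    Assumes an odd n_trees so that ties cannot occur.
    """
    if n_trees % 2 == 0:
        raise ValueError("use an odd number of trees to avoid ties")
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


if __name__ == "__main__":
    # Illustrative values only: each tree is right 60% of the time.
    for n in (1, 11, 101):
        print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble improves with more trees only when p is above 0.5 and the errors are independent, which mirrors the two prerequisites listed above.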
4. OBJECTIVE
• To perform supervised machine learning on a gravitational wave strain dataset.
• To use multiple classification algorithms and compare their efficiency.
• To find out which classification algorithm has the highest accuracy and correctly
predicts the event.
5. METHODOLOGY
• The gravitational wave strain data for H1 and L1 has 3 attributes – strain value,
strain type and event.
• The dataset is split into a training dataset (70%) and a testing dataset (30%).
• The training dataset is fed to the classification algorithm to train the model to
correctly predict the event.
• The model is tested on the testing dataset where the event is predicted as the final
output.
• The accuracy of every tested model is compared and the model with the best
accuracy is identified.
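The split and the accuracy comparison described above can be sketched outside SPSS Modeler as follows. This is a minimal illustration in plain Python, not the Modeler stream used in the project. The helper names split_dataset and accuracy, the fixed seed, and the sample records are all hypothetical.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_events, predicted_events):
    """Fraction of records whose predicted event equals the true event."""
    if len(true_events) != len(predicted_events):
        raise ValueError("inputs must have the same length")
    correct = sum(t == p for t, p in zip(true_events, predicted_events))
    return correct / len(true_events)


if __name__ == "__main__":
    # Made-up records in the form (strain value, strain type, event).
    records = [(0.1 * i, "H1" if i % 2 else "L1", "event_a") for i in range(10)]
    train, test = split_dataset(records)
    print(len(train), len(test))
```

Each trained model would be scored with accuracy on the same test portion, so that the comparison between classifiers is made on identical records.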
6. MATHEMATICAL MODEL
• Logistic Regression:
Logistic regression can handle any number of numerical and/or categorical variables.
b0 = Regression constant.
p = probability of a class.
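The symbols defined above fit the standard logistic regression model, sketched here for reference. The coefficients b_1 ... b_n and the predictor variables x_1 ... x_n are notation assumed for illustration; only b_0 and p are named in the text.

```latex
\[
\ln\!\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}}
\]
```

The left-hand form shows that the log-odds of the class are linear in the predictors, and the right-hand form gives the predicted probability directly.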
• Random Forest:
In a decision tree, the major challenge is the identification of the attribute for the root
node at each level. This process is known as attribute selection. We have three popular
attribute selection measures:
1. Information Gain
2. Gini Index
3. Gain Ratio
Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets
the entropy changes. Information gain is a measure of this change in entropy.
Entropy
Entropy is the measure of uncertainty of a random variable. It characterizes the impurity of
an arbitrary collection of examples. The higher the entropy, the greater the information
content.
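The measures described above can be made concrete with a short sketch. It uses the standard definitions of Shannon entropy, Gini impurity, and information gain. The function names and the toy labels are hypothetical and are not taken from the project data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H = sum(p_i * log2(1 / p_i)) over class proportions."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gini_index(labels):
    """Gini impurity G = 1 - sum(p_i ** 2) over class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


if __name__ == "__main__":
    parent = ["a", "a", "b", "b"]  # toy class labels
    print(entropy(parent), gini_index(parent))
    print(information_gain(parent, [["a", "a"], ["b", "b"]]))
```

A split that separates the classes perfectly recovers the full parent entropy as gain, while a split that leaves every subset as mixed as the parent yields a gain of zero.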
7. ALGORITHM
1) Split the dataset into a training dataset (70%) and a testing dataset (30%).
2) Train the model using the training dataset and apply one of the classification
algorithms.
3) Compare the accuracy of every classification algorithm.
For the Random Forest classifier, prediction on the test data works as follows:
a. Take the test features, use the rules of each randomly created decision tree to
predict the outcome, and store the predicted outcome (target).
b. Calculate the votes for each predicted target.
c. Take the predicted target with the most votes as the final prediction of the random
forest algorithm.
d. To perform the prediction using the trained random forest algorithm, we pass
the test features through the rules of each randomly created tree. Suppose
we formed 100 random decision trees to form the random forest.
e. Each tree in the random forest may predict a different target (outcome) for the same
test features. The votes for each predicted target are then counted. Suppose the 100
random decision trees predict 3 unique targets x, y, z. The number of votes for x
is the number of trees, out of 100, whose prediction is x, and likewise for the other
2 targets (y, z). If x gets the most votes, say 60 of the 100 random decision trees
predict x, then the random forest returns x as the predicted target.
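The voting steps above can be sketched in a few lines. This is only an illustration of majority voting, not the SPSS Modeler implementation. The function forest_predict is hypothetical, and the trees are stand-in callables with fixed outputs rather than trained decision trees.

```python
from collections import Counter


def forest_predict(trees, features):
    """Return (winning target, vote counts) for one test record.

    `trees` is any sequence of callables mapping features to a target;
    they stand in for trained decision trees.
    """
    votes = Counter(tree(features) for tree in trees)
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


if __name__ == "__main__":
    # 100 stand-in trees: 60 vote x as in the example above;
    # the 25/15 split between y and z is made up for illustration.
    trees = [lambda f: "x"] * 60 + [lambda f: "y"] * 25 + [lambda f: "z"] * 15
    print(forest_predict(trees, features=(0.0, "H1")))
```

Returning the vote counts alongside the winner makes it easy to see how decisive the forest's decision was for a given record.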
8. FLOWCHART
9. RESULT
Logistic Regression
Logistic Regression:
10. CONCLUSION
Thus, we applied two different classification algorithms (Logistic Regression and Random
Forest Classifier) to the gravitational wave strain dataset. The efficiency of the Random
Forest Classifier is substantially higher than that of Logistic Regression.
11. REFERENCES
• https://stackabuse.com/decision-trees-in-python-with-scikit-learn/
• https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
• https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/