
Student Specialization using Naïve Bayes and Support Vector Machine


MINI PROJECT REPORT
SUBMITTED TO

RAMAIAH INSTITUTE OF TECHNOLOGY


(Autonomous Institute, Affiliated to VTU)
Bangalore – 560054

SUBMITTED BY
Keerthana D 1MS14CS053
Manoj J Shet 1MS14CS064
Prasad Hegde 1MS14CS086
Ajeya S H 1MS14CS146

As part of the course Data Analytics Laboratory – CSL717

SUPERVISED BY
Faculty
Parkavi.A

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


RAMAIAH INSTITUTE OF TECHNOLOGY
Aug-Dec 2017

Department of Computer Science and Engineering
Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
Bangalore – 54

CERTIFICATE

This is to certify that Keerthana D (1MS14CS053), Manoj J Shet (1MS14CS064), Prasad Hegde (1MS14CS086) and Ajeya S H (1MS14CS146) have completed the Mini Project "Student Specialization using Naïve Bayes and Support Vector Machine".

We declare that the content embodied in this B.E. 7th Semester report is original and has not been copied.

Submitted by:
Keerthana D 1MS14CS053
Manoj J Shet 1MS14CS064
Prasad Hegde 1MS14CS086
Ajeya S H 1MS14CS146
(Dept. of CSE, RIT)

Guided by:
Prof. Parkavi A
(Assistant Professor, Dept. of CSE, RIT)

Department of Computer Science and Engineering
Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
Bangalore – 54

Evaluation Sheet

Evaluation criteria (marks): Content and Demonstration (15), Speaking Skills (2), Teamwork (2), Neatness and Care (2), Effectiveness & Productivity (4); Total Marks (25).

Sl. No   USN           Name
1        1MS14CS053    Keerthana D
2        1MS14CS064    Manoj J Shet
3        1MS14CS086    Prasad Hegde
4        1MS14CS146    Ajeya S H

Evaluated By

Name: Parkavi.A
Designation: Assistant Professor
Department: Computer Science & Engineering, RIT
Signature:

HOD, CSE

Table of Contents

1. Abstract
2. Introduction
3. Literature Survey
4. Algorithm
5. Implementation
6. Results and Discussions
7. Conclusion
8. References
1. Abstract

A student's marks in various subjects usually reflect their interests and specialization: one tends to do well in the fields one is good at and specialized in. A single domain may be covered by several subjects, and the average of the marks in those subjects can be considered a valid measure of the extent of expertise in that domain. The best of those averages is taken as the student's specialization. We can achieve this by applying classification algorithms; we have used Naïve Bayes and SVM (Support Vector Machines) to evaluate specializations based on a set of training data.

2. Introduction

Data analytics, also known as analysis of data or data analysis, is a process of inspecting,
cleansing, transforming, and modeling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, in different business,
science, and social science domains. It is done with the aid of specialized systems and software.

One of the most popular and widely used data analytics software is R. R is an open
source programming language and software environment for statistical computing and graphics
that is supported by the R Foundation for Statistical Computing. The R language is widely used
among statisticians and data miners for developing statistical software and data analysis. R
provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests,
time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly
extensible.
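
As a brief illustration (a minimal sketch using R's built-in cars data set, which is not part of this project), a classical statistical analysis takes only a few lines:

# Linear model on R's built-in cars data set: stopping distance vs speed
fit <- lm(dist ~ speed, data = cars)
summary(fit)    # classical statistical test output for the fitted model

# Graphical technique: scatter plot with the fitted regression line overlaid
plot(cars$speed, cars$dist, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit, col = "red")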

Classification is one of the important statistical techniques and is used in many applications. In general, it can be defined as the systematic arrangement of items into groups or categories according to established criteria. There are various classification algorithms, for example Naïve Bayes, Support Vector Machines, k-nearest neighbors, logistic regression and decision trees. The best-suited algorithm is chosen based on factors such as the size of the data set, the required results and the accuracy needed.

A student takes up various courses during their degree. These courses can be classified into broad categories: for example, C++, Java and Python fall under Programming, while Data Communication and Computer Networks come under Networking. One tends to do well and score more in the courses one is interested in. So the average of the grade points in each category can be considered a valid measure of the level of expertise in that particular domain. Each student has their own interests, and their specialization is taken to be the category with the best average, as the sketch below illustrates.
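
As a concrete illustration of this labelling step, here is a minimal R sketch with hypothetical marks and category names (not our actual data):

# Hypothetical grade points for one student, grouped by category
marks <- list(
  Programming = c(85, 90, 78),  # e.g. C++, Java, Python
  Networking  = c(72, 68)       # e.g. Data Communication, Computer Networks
)
averages <- sapply(marks, mean)               # average per category
specialization <- names(which.max(averages))  # category with the best average
print(specialization)                         # "Programming"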

In this project we consider students' marks in various courses and try to find each individual's specialization. This is initially done by mathematically calculating the averages and finding the greatest of them. The labelled data is then fed as training data into the chosen classification algorithms, Naïve Bayes and SVM, and the rest of the computations are performed by the classification algorithms.

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem.
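
A minimal, self-contained sketch of such a classifier in R, using the e1071 package on the built-in iris data set (iris stands in for our marks data here):

library(e1071)

# 80/20 train/test split of the built-in iris data set
set.seed(42)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a naive Bayes model and predict the class of each test example
model <- naiveBayes(Species ~ ., data = train)
pred  <- predict(model, test)
table(pred, test$Species)  # confusion matrix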

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
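
A comparable sketch with e1071's svm() on the same stand-in data set (the gamma and cost values mirror those used later in our implementation):

library(e1071)

set.seed(42)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Radial-basis SVM; gamma and cost are the main tuning parameters
model <- svm(Species ~ ., data = train, gamma = 0.2, cost = 1)
pred  <- predict(model, test)
mean(pred == test$Species)  # test-set accuracy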

3. Literature Survey

Zhongheng Zhang [1] explains the working of the Naive Bayes classifier and a few implementations of it in R. The Naïve Bayes classifier applies Bayes' theorem, combining prior knowledge with current evidence to predict a posterior probability. A model is built from a training data set, and the model is then used to predict probabilities for test data. The paper discusses two libraries that implement the Naive Bayes algorithm: the e1071 package, whose naiveBayes() function returns an object that can be used for further prediction, and the caret package, which uses the train() function for training before probabilities are predicted (sketched below).
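
A hedged sketch of the caret route described in [1], on iris rather than the paper's data; note that caret's "nb" method additionally requires the klaR package to be installed:

library(caret)

# Train a naive Bayes model with 5-fold cross-validation;
# method "nb" wraps klaR::NaiveBayes under the hood
set.seed(42)
fit  <- train(Species ~ ., data = iris, method = "nb",
              trControl = trainControl(method = "cv", number = 5))
pred <- predict(fit, iris)  # class predictions from the fitted model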

Anand Shanker Tewari, Tasif Sultan, Ansari Asim and Gopal Barman [2] propose a book recommendation technique based on opinion mining and a Naïve Bayes classifier to recommend top-ranking books to buyers. The paper also considers an important factor, the price of the book, during recommendation and presents a novel, efficient tabular method for recommending books to a buyer, especially when the buyer visits the website for the first time. The system runs a text crawler on book reviews and extracts all the popular negative and positive adjective keywords, and the recommendation process takes the price of the book into account along with the reviews.

Shih-Chung Hsu, I-Chieh Chen and Chung-Lin Huang [3] present an image classification method consisting of salient region (SR) detection, local feature extraction, and a pairwise-local-observation-based Naive Bayes classifier (NBPLO). Based on the discriminative pairwise local observations, a structured-object-model-based Naive Bayes classifier for image classification is developed. The paper discusses feature extraction, bag-of-features (BoF) and other techniques that support Naïve Bayes classification. Unlike pyramid matching pursuit, the method also outperforms the conventional BoF method; however, there is still room for improvement and some problems remain to be solved.

Durgesh K Srivastava and Lekha Bhambhu [4] deal with the working of the Support Vector Machine (SVM) for data classification, its performance, and the steps that increase SVM classification accuracy. SVM is a supervised learning method whose special property is to simultaneously minimize the empirical classification error and maximize the geometric margin; the separating hyperplane with the largest margin determines the efficiency of the classification. The paper also deals with kernel function selection and model selection, where model selection refers to tuning the parameters that affect the generalization error. A comparative study of results on different data with different kernel functions such as linear, polynomial and sigmoid is presented (a small sketch of such a comparison follows). The paper also discusses rough sets, a newer tool for dealing with uncertain and incomplete knowledge.
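
A small sketch of such a comparative study with e1071 (on iris, not the paper's data sets); svm()'s cross argument reports cross-validated accuracy:

library(e1071)

# Compare kernel functions via 5-fold cross-validation
for (k in c("linear", "polynomial", "radial", "sigmoid")) {
  model <- svm(Species ~ ., data = iris, kernel = k, cross = 5)
  cat(k, ": cross-validated accuracy =", model$tot.accuracy, "%\n")
}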

Alexandros Karatzoglou, David Meyer and Kurt Hornik [5] first briefly explain what SVMs are, along with concepts such as regression, classification and novelty detection. The paper also highlights more than ten kernel functions that can be used for classification. Various other existing software implementations of SVMs are mentioned, such as libsvm, SVMlight, SVMTorch and the MATLAB SVM Toolbox, and details are given about data sets such as Iris, Spam, Vowel and DNA. The paper briefly describes functions such as ksvm in the kernlab package, which provides basic kernel functionality (a minimal sketch follows), and svmlight in the klaR package, which includes utility functions for classification and visualization, along with the svm and svmpath functions. A comparative study of the different SVM implementations on the different data sets is then presented.
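
For reference, a minimal sketch of kernlab's ksvm() with a Gaussian (RBF) kernel, again on the stand-in iris data rather than the paper's data sets:

library(kernlab)

# ksvm with the rbfdot (Gaussian) kernel supplied by kernlab
fit  <- ksvm(Species ~ ., data = iris, kernel = "rbfdot")
pred <- predict(fit, iris)
table(pred, iris$Species)  # confusion matrix on the training data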

Nikhil Bajaj, Niko J. Murrell, Julie G. Whitney, Jan P. Allebach and George T.-C. Chiu [6] introduce a method for integrating expert-defined allowable confusions into SVM systems, with an example implementation in a least-squares support vector machine (LS-SVM) tested on industrial data; it is shown to improve the overall performance of a multi-class classification system when an appropriate performance measurement method is formulated. The proposed approach was tested on an industrial data set collected from a multi-sensor sorting application, where expert knowledge of allowable and acceptable confusions is available. A confusion-matrix-augmented performance metric was shown to have the potential to improve the combined performance of an LS-SVM-based multi-class classifier when expert knowledge of acceptable misclassification or confusion is available.

4. Algorithm

i. Naïve Bayes Classifier

# Drop the numeric Domain code and the USN column (V2); the remaining subject
# marks are the predictors and Domain1 (the letter label) is the target
trainNB <- subset(marks_train, select = -c(Domain, V2))
testNB  <- subset(marks_test,  select = -c(Domain, V2))

naive_model <- naiveBayes(Domain1 ~ ., data = trainNB)
naive_pred  <- predict(naive_model, testNB)

ii. Support Vector Machine

trainSvm <- subset(marks_train, select = -c(V2))
testSvm  <- subset(marks_test,  select = -c(Domain1, V2))

# Default radial kernel; note that the numeric Domain code is passed as the
# predictor (x) and the letter label Domain1 as the response (y)
svm_model <- svm(trainSvm$Domain, trainSvm$Domain1, gamma = 0.2, cost = 1)
svm_pred  <- predict(svm_model, testSvm$Domain)

aa <- svm_pred

5. Implementation

# Domain codes: Non core = 1, Programming = 2, Networking = 3,
# Circuits = 4, Bot = 5, Core = 6, Compiler = 7


library(data.table)
library(class)
library(caret)
library(plyr)
library(e1071)    # provides naiveBayes() and svm()
library(rminer)
library(ROCR)
library(ggplot2)

# Accumulators for each student's numeric domain code and letter label;
# the two dummy entries pad for the two header rows dropped later
l  <- c(0, 0)
l1 <- c("", "")

# Read the marks sheet; row 1 gives the category of each subject column
marks <- read.table("Book1.csv", header = FALSE, sep = ",")


marks <- marks[, 1:27]

categories <- c("Non core", "Programming", "Networking", "Circuits",
                "Bot", "Core", "Compiler")
subject    <- c("N", "P", "T", "R", "B", "C", "M")  # one-letter label per category

# For each student row, average the marks of the subjects in each category;
# the per-category if-else chain of the original is collapsed into a match()
for (i in 3:nrow(marks))
{
  count   <- rep(0, 7)
  summ    <- rep(0, 7)
  average <- rep(0, 7)
  for (j in 3:ncol(marks))
  {
    d <- match(as.character(marks[1, j]), categories)
    if (!is.na(d))
    {
      count[d]   <- count[d] + 1
      summ[d]    <- summ[d] + strtoi(marks[i, j])
      average[d] <- summ[d] / count[d]
    }
  }

  # Pick the category with the best average as this student's specialization
  max   <- 0
  index <- 99
  s     <- ""
  for (k in 1:7)
  {
    if (average[k] >= max)
    {
      max   <- average[k]
      index <- k
      s     <- subject[k]
    }
  }

  l  <- c(l, as.numeric(index))
  l1 <- c(l1, s)
}
# Attach the computed labels, drop the two header rows and the first column,
# then split the students 80/20 into training and test sets
marks <- transform(marks, Domain = as.numeric(l))
marks <- transform(marks, Domain1 = as.factor(l1))
marks <- marks[-1, ]
marks <- marks[-1, ]
marks <- marks[, -1]
data <- sample(2, nrow(marks), replace = TRUE, prob = c(0.80, 0.20))
marks_train <- marks[data == 1, ]
marks_test  <- marks[data == 2, ]

#~~~~~~~~~~~~~~~~~~~~~~~~~~SVM START~~~~~~~~~~~~~~~~~~~~~~~~~~~~

trainSvm <- subset(marks_train, select = -c(V2))
testSvm  <- subset(marks_test,  select = -c(Domain1, V2))

# Default radial kernel; the numeric Domain code is passed as the predictor (x)
# and the letter label Domain1 as the response (y)
svm_model <- svm(trainSvm$Domain, trainSvm$Domain1, gamma = 0.2, cost = 1)

svm_pred <- predict(svm_model, testSvm$Domain)
aa <- svm_pred  # keep the original factor predictions

# Map the predicted letter labels back to numeric domain codes (1..7),
# replacing the original if-else chain over "N","P","T","R","B","C","M"
svm_pred <- match(as.character(svm_pred), subject)

#~~~~~~~~~~~~~~~~~~~~~~~~~~SVM END~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#~~~~~~~~~~~~~~~~~~~NAIVE BAYES START~~~~~~~~~~~~~~~~~~~~~~~~~~

# Drop the numeric Domain code and the USN column; the subject marks are
# the predictors and Domain1 (the letter label) is the target
trainNB <- subset(marks_train, select = -c(Domain, V2))
testNB  <- subset(marks_test,  select = -c(Domain, V2))

naive_model <- naiveBayes(Domain1 ~ ., data = trainNB)
naive_pred  <- predict(naive_model, testNB)

# Map the Naive Bayes letter predictions back to numeric domain codes (1..7),
# again replacing the original if-else chain
naive_pred <- match(as.character(naive_pred), subject)
# ~~~~~~~~~~~~~~~~~~~~~~NAIVE BAYES END~~~~~~~~~~~~~~~~~~~~~~~

marks_test <- transform(marks_test, SVM = as.factor(svm_pred))
marks_test <- transform(marks_test, NB  = as.factor(naive_pred))
result <- subset(marks_test, select = -c(Domain1))
result <- rename(result, c("V2" = "USN"))

# Plot the true domain and both predictions for every test student; mapping the
# colours inside aes() makes ggplot draw the legend itself (base legend() does
# not work on a ggplot)
p <- ggplot(result) +
  geom_point(aes(USN, SVM,    colour = "SVM")) +
  geom_point(aes(USN, NB,     colour = "NB")) +
  geom_point(aes(USN, Domain, colour = "Domain")) +
  scale_colour_manual(name = NULL,
                      values = c(SVM = "red", NB = "green", Domain = "blue"))
print(p)

6. Results and Discussions

The result is a plot of the specializations identified for the test data, comparing the true domain with the SVM and Naïve Bayes predictions for each student. The plot varies from run to run because the test data is chosen randomly.

7. Conclusion

We have successfully met the objective, i.e., identifying a student's specialization by applying the Naïve Bayes and SVM classification algorithms. We can see from the results that the outputs of the two classifiers deviate from the expected output. The accuracy of SVM is found to be better than that of Naïve Bayes. Accuracy is expected to improve with a larger data set and a better training set.

8. References

Literature Survey:

[1] Naive Bayes classification in R - Zhongheng Zhang
[2] Opinion Based Book Recommendation Using Naive Bayes Classifier - Anand Shanker Tewari, Tasif Sultan, Ansari Asim, Gopal Barman
[3] Image Classification Using Pairwise Local Observations Based Naive Bayes Classifier - Shih-Chung Hsu, I-Chieh Chen and Chung-Lin Huang
[4] Data Classification Using Support Vector Machine - Durgesh K Srivastava, Lekha Bhambhu
[5] Support Vector Machines in R - Alexandros Karatzoglou, David Meyer, Kurt Hornik
[6] Expert-Prescribed Weighting for Support Vector Machine Classification - Nikhil Bajaj, Niko J. Murrell, Julie G. Whitney, Jan P. Allebach, George T.-C. Chiu

Web References:

• https://www.r-project.org/about.html

• http://rischanlab.github.io/SVM.html

• https://www.kaggle.com/chinki/naive-bayes-classification-for-iris-dataset

• https://www.tutorialspoint.com/r/

• https://www.w3schools.in/r/

• https://www.rstudio.com/

• https://en.wikipedia.org/wiki/R_(programming_language)

• http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/

• https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

