DA Report Format
SUBMITTED BY
Keerthana D 1MS14CS053
Manoj J Shet 1MS14CS064
Prasad Hegde 1MS14CS086
Ajeya S H 1MS14CS146
SUPERVISED BY
Faculty
Parkavi.A
Department of Computer Science and Engineering
Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
Bangalore – 54
CERTIFICATE
We declare that the entire content embodied in this B.E. 7th Semester report is original and has not been copied from elsewhere.
Department of Computer Science and Engineering
Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
Bangalore – 54
Evaluation Sheet
Sl. No  USN         Name          Content and         Speaking    Teamwork  Neatness and  Effectiveness and  Total
                                  Demonstration (15)  Skills (2)  (2)       care (2)      Productivity (4)   Marks (25)
1       1MS14CS053  Keerthana D
2       1MS14CS064  Manoj J Shet
3       1MS14CS086  Prasad Hegde
4       1MS14CS146  Ajeya S H
Evaluated By
Name: Parkavi.A
Designation: Assistant Professor
Department: Computer Science & Engineering, RIT
Signature:
HOD, CSE
Table of Contents
1. Abstract
2. Introduction
3. Literature Survey
4. Algorithm
5. Implementation
6. Results
7. Conclusion
8. References
1. Abstract
A student's marks in various subjects usually reflect their interests and specializations: one tends to do well in the fields one is good at. A single domain can be covered by several subjects, and the average of the marks in those subjects can be considered a valid measure of the student's expertise in that domain. The best of these averages is taken as the student's specialization. We achieve this by applying classification algorithms; we have considered Naïve Bayes and SVM (Support Vector Machines) to evaluate specializations based on a set of training data.
2. Introduction
Data analytics, also known as analysis of data or data analysis, is a process of inspecting,
cleansing, transforming, and modeling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, in different business,
science, and social science domains. It is done with the aid of specialized systems and software.
One of the most popular and widely used data analytics tools is R, an open
source programming language and software environment for statistical computing and graphics
that is supported by the R Foundation for Statistical Computing. The R language is widely used
among statisticians and data miners for developing statistical software and data analysis. R
provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.
A student takes up various courses during graduation. These courses can be classified into broad categories: for example, C++, Java and Python fall under Programming, while Data Communication and Computer Networks come under Networking. One tends to do well and score higher in courses one is interested in than in those one is not. The average of the grade points in each category can therefore be considered a valid measure of the level of expertise in that domain. Each student has their own interests, and their specialization is the category with the best average.
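As a minimal sketch of the per-category averaging described above (the course names, marks and category mapping here are hypothetical, not taken from the project data), base R can compute the specialization directly:

```r
# Hypothetical marks for one student; the course-to-category mapping is illustrative
marks <- c(Cpp = 85, Java = 90, Python = 88, DataComm = 70, Networks = 72)
category <- c(Cpp = "Programming", Java = "Programming", Python = "Programming",
              DataComm = "Networking", Networks = "Networking")

# Average the marks per category, then pick the category with the best average
avgs <- tapply(marks, category[names(marks)], mean)
specialization <- names(avgs)[which.max(avgs)]
print(avgs)
print(specialization)  # "Programming"
```

The project applies the same idea per student across all rows of the marks table before handing the labels to the classifiers.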
In this project we consider the marks of students in various courses and try to find each individual's specialization. This is initially done by mathematically computing the category averages and finding the greatest of them. This data is then fed as training data into the chosen classification algorithms, Naïve Bayes and SVM, and the rest of the computation is done by the classifiers.
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
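The two-category setting can be illustrated with the e1071 package used later in the implementation; the data set and kernel choice below are for illustration only, not the project's data:

```r
library(e1071)

# Two-class subset of iris: setosa vs versicolor, which are linearly separable
iris2 <- droplevels(subset(iris, Species != "virginica"))

# Fit a linear SVM and check accuracy on the training data
model <- svm(Species ~ ., data = iris2, kernel = "linear")
pred  <- predict(model, iris2)
accuracy <- mean(pred == iris2$Species)
print(accuracy)
```

On this separable subset the training accuracy should be essentially perfect, which is what makes it a convenient toy example of the binary case.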
3. Literature Survey
Zhongheng Zhang [1] describes the working of the Naive Bayes classifier and a few implementations of it in R. The Naive Bayes classifier applies Bayes' theorem, combining prior knowledge with current evidence to predict a posterior probability. A model is built from a training data set, and this model is then used to predict class probabilities for test data.
The paper discusses two libraries that implement the Naive Bayes algorithm. The first is the e1071 package, whose naiveBayes() function returns an object that can be used for further prediction. The second is the caret package, which uses the train() function for training before probabilities are predicted.
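A minimal sketch of the e1071 workflow described in [1], using the standard iris data set rather than the project's data:

```r
library(e1071)

# naiveBayes() builds a model from the training data; predict() then
# returns the most probable class for each new observation
model <- naiveBayes(Species ~ ., data = iris)
pred  <- predict(model, iris)
print(mean(pred == iris$Species))  # high accuracy on this easy data set
```

The same two calls, with the project's marks data frame in place of iris, are the core of the Naive Bayes branch of the implementation.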
Anand Shanker Tewari, Tasif Sultan, Ansari Asim and Gopal Barman [2] propose a book recommendation technique based on opinion mining and the Naïve Bayes classifier to recommend top-ranked books to buyers. The paper also considers an important factor, the price of the book, during recommendation, and presents an efficient tabular method for recommending books, especially when the buyer is visiting the website for the first time. The system runs a text crawler over book reviews to extract all the popular negative and positive adjective keywords, which are then used together with price in the recommendation process.
Shih-Chung Hsu, I-Chieh Chen and Chung-Lin Huang [3] present an image classification method consisting of salient region (SR) detection, local feature extraction, and a Naive Bayes classifier based on pairwise local observations (NBPLO). Based on the discriminative pairwise local observations, a structured object model for Naive Bayes image classification is developed. The paper discusses feature extraction, bag-of-features (BoF) and other techniques that support Naïve Bayes classification. Unlike pyramid matching pursuit, this method also outperforms the conventional BoF method; however, there is still room for improvement and some problems remain to be solved.
Durgesh K. Srivastava and Lekha Bhambhu [4] deal with the working of the Support Vector Machine (SVM) for data classification, its performance, and the steps that increase SVM classification accuracy. SVM is a supervised learning method whose special property is to simultaneously minimize the empirical classification error and maximize the geometric margin. The separating hyperplane with the largest margin determines the efficiency of classification. The paper also deals with kernel function selection and model selection.
Model selection refers to the tuning of parameters that affect generalization error. A comparative study of results on different data sets with different kernel functions (linear, polynomial, sigmoid, etc.) is also carried out. The paper further discusses rough sets, a comparatively new tool for dealing with uncertain and incomplete knowledge.
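A comparative study in the spirit of [4] can be sketched in a few lines of R; the data set, default hyperparameters and training-set accuracy metric here are illustrative choices, not those of the paper:

```r
library(e1071)

# Fit one SVM per kernel and compare accuracy on the training data
kernels <- c("linear", "polynomial", "radial", "sigmoid")
acc <- sapply(kernels, function(k) {
  m <- svm(Species ~ ., data = iris, kernel = k)
  mean(predict(m, iris) == iris$Species)
})
print(acc)
```

A proper study would of course compare held-out (cross-validated) accuracy and tune the kernel parameters, as [4] discusses under model selection.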
Alexandros Karatzoglou, David Meyer and Kurt Hornik [5] first briefly explain what an SVM is, along with concepts such as regression, classification and novelty detection. They also highlight more than ten kernel functions that can be used for classification.
Other existing SVM implementations are mentioned, such as libsvm, SVMlight, SVMTorch and the MATLAB SVM Toolbox, followed by details of data sets such as Iris, Spam, Vowel and DNA. The paper briefly describes several R interfaces: ksvm in the kernlab package, which provides basic kernel functionality; svmlight in the klaR package, which includes utility functions for classification and visualization; and the svm and svmpath functions. A comparative study of the different SVM implementations on the different data sets is then carried out.
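As a small sketch of the kernlab interface surveyed in [5] (again on iris, purely for illustration):

```r
library(kernlab)

# ksvm() with a radial basis ("rbfdot") kernel, one of the many kernels
# kernlab exposes; predict() returns the predicted class labels
model <- ksvm(Species ~ ., data = iris, kernel = "rbfdot")
pred  <- predict(model, iris)
print(mean(pred == iris$Species))
```

Swapping the kernel argument (e.g. "vanilladot" for linear, "polydot" for polynomial) is all that is needed to reproduce the kind of comparison the paper performs.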
Nikhil Bajaj, Niko J. Murrell, Julie G. Whitney, Jan P. Allebach and George T.-C. Chiu [6] introduce a method for integrating expert-defined allowable confusions into SVM systems, with an example implementation in a least squares support vector machine (LS-SVM). The approach was tested on an industrial data set collected from a multi-sensor sorting application, where expert knowledge of allowable and acceptable confusion is available. A confusion-matrix-augmented performance metric was shown to have the potential to improve the combined performance of an LS-SVM based multi-class classifier when expert knowledge of acceptable misclassification or confusion is available and an appropriate performance measurement method is formulated.
4. Algorithm
5. Implementation
# (Listing resumes mid-loop: the branches for the first four categories, and
#  the enclosing for-loops over students i and courses j, are analogous.)
{
count[5]<-count[5]+1
summ[5]<-summ[5]+strtoi(marks[i,j])
average[5]<-summ[5]/count[5]
}
else if(marks[1,j]=="Core")
{
count[6]<-count[6]+1
summ[6]<-summ[6]+strtoi(marks[i,j])
average[6]<-summ[6]/count[6]
}
else if(marks[1,j]=="Compiler")
{
count[7]<-count[7]+1
summ[7]<-summ[7]+strtoi(marks[i,j])
average[7]<-summ[7]/count[7]
}
}
max<-0
index<-99
s<-""
for(k in 1:7)
{
if(average[k]>=max)
{
max<-average[k]
index<-k
s<-subject[k]
}
}
l<-c(l,as.numeric(index))
l1<-c(l1,s)
}
# Attach the computed domain to each student, as a numeric index and a factor label
marks <- transform(marks, Domain = as.numeric(l))
marks <- transform(marks, Domain1 = as.factor(l1))
# Drop the two header rows and the first column
marks <- marks[-1, ]
marks <- marks[-1, ]
marks <- marks[, -1]
# Random 80/20 train-test split
data <- sample(2, nrow(marks), replace = TRUE, prob = c(0.80, 0.20))
marks_train <- marks[data == 1, ]
marks_test <- marks[data == 2, ]
#~~~~~~~~~~~~~~~~~~~~~~~~~~SVM START~~~~~~~~~~~~~~~~~~~~~~~~~~~~
svm_pred <- predict(svm_model, testSvm$Domain)
aa <- svm_pred
# Map the predicted class letters to the numeric domain indices via a lookup
# vector (equivalent to the original element-by-element if-else chain)
domain_codes <- c(N = 1, P = 2, T = 3, R = 4, B = 5, C = 6, M = 7)
svm_pred <- as.numeric(domain_codes[as.character(svm_pred)])
#~~~~~~~~~~~~~~~~~~~~~~~~~~SVM END~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~NAIVE BAYES START~~~~~~~~~~~~~~~~~~~~~~~~~~
# Same letter-to-index lookup as for the SVM predictions
domain_codes <- c(N = 1, P = 2, T = 3, R = 4, B = 5, C = 6, M = 7)
naive_pred <- as.numeric(domain_codes[as.character(naive_pred)])
# ~~~~~~~~~~~~~~~~~~~~~~NAIVE BAYES END~~~~~~~~~~~~~~~~~~~~~~~
marks_test <- transform(marks_test,SVM=as.factor(svm_pred))
marks_test <- transform(marks_test,NB=as.factor(naive_pred))
result<-subset(marks_test,select = -c(Domain1))
result <- rename(result,c("V2" = "USN"))
# Map colours inside aes() so ggplot2 draws the legend itself; base-graphics
# calls such as plot.new() and legend() do not apply to a ggplot object
p <- ggplot(result) +
  geom_point(aes(USN, SVM, colour = "SVM")) +
  geom_point(aes(USN, NB, colour = "NB")) +
  geom_point(aes(USN, Domain, colour = "Domain"))
print(p)
6. Results
The result is a plot of the specialization identified for each student in the test data. It varies on each run because the train/test split is drawn randomly.
7. Conclusion
We have achieved the objective of identifying a student's specialization by applying the Naïve Bayes and SVM classification algorithms. We can see from the results that the outputs of these two classifiers deviate from the expected output. The accuracy of SVM is found to be better than that of Naïve Bayes. The accuracy is expected to improve as the size of the training data set increases.
8. References
Literature Survey:
Web References:
• https://www.r-project.org/about.html
• http://rischanlab.github.io/SVM.html
• https://www.kaggle.com/chinki/naive-bayes-classification-for-iris-dataset
• https://www.tutorialspoint.com/r/
• https://www.w3schools.in/r/
• https://www.rstudio.com/
• https://en.wikipedia.org/wiki/R_(programming_language)
• http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/
• https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/