Chapter 1
Introduction
1.1 Overview
Data mining (DM) is the process of analysing data and summarizing it into useful information.
In short, data mining is the process of finding correlations or patterns in large databases [1]. DM
consists of analysing large quantities of data to extract previously unknown or hidden interesting
patterns, such as groups of similar data records (cluster analysis), anomalies (anomaly detection)
and dependencies (association rule mining). This involves the use of database techniques such as
spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be
used in further analysis through machine learning and predictive analytics. For example, the data
mining step might identify multiple groups in the data, which a decision support system can then
use to obtain more accurate prediction results.
DM analyzes data stored in data warehouses, or data that may come from all parts of a business,
from production to management [2]. Weiss and Indurkhya have
[3].
According to Technology Forecast [4] and Frawley et al. [5], data mining is the nontrivial
extraction of implicit, previously unknown and potentially useful information, such as knowledge
rules, constraints and regularities, from data stored in repositories, using pattern recognition
technologies as well as statistical techniques. According to Weiss and Kulikowski, classification
is learning a function that maps (classifies) a data item into one of several predefined classes
[6]. Apte and Hong suggested that the classification methods of data mining are used as part of
knowledge discovery applications, which may include classifying trends in financial markets and
education, and identifying objects of interest in large image datasets [7]. Regression is a
predictive technique that maps a data item to a prediction variable. Clustering is a descriptive
task in which a finite set of categories or clusters is identified to describe the data, e.g. to
identify those students who are short of attendance and have shown poor performance in sessionals
[8], [9], [10]. Cheeseman and Stutz [11] proposed that examples of clustering applications in a
knowledge discovery context include discovering similar groups.
For DM to be effective, systems should allow users to discover information as well as knowledge
from their own perspectives. Query languages or graphical user interfaces are required to express
DM requests and the discovered information or knowledge, so that results from the DM engine are
understandable and usable for end users.
Smart learning is a learning system that enables learners to learn in a real-world environment.
In a smart learning environment, the use of intelligent technologies such as cloud computing,
learning analytics or big data focuses on how learning data can be captured, analyzed and
directed towards improving learning and teaching, and supporting the development of personalized
and adaptive learning [15], [16].
The essence of smart education is to create intelligent environments using smart technologies,
so that smart pedagogies can be facilitated to provide personalized learning services and
empower learners [17]. Zhu et al. [18] have proposed a research framework of smart education, as
shown in Figure 1.2. The framework describes three essential elements of smart education: (a)
smart environments, (b) smart pedagogy and (c) smart learners.
[Figure 1.2: Research framework of smart education [18]. Labels: Traditional Approach; Cloud
Computing, Mobile Computing, Intelligent Learning Agents; Smarter Education (Ideology); Smart
Learners]
Smart pedagogies consist of (a) mass-based generative learning, (b) individual-based
personalized learning, (c) group-based collaborative learning and (d) class-based differentiated
instruction. A four-tier architecture of smart pedagogies [18] is shown in Figure 1.3.
[Figure 1.3: Four-tier architecture of smart pedagogies [18]. Labels: Mass-Based Generative
Learning; Collective Intelligence; Individual-Based Personalized Learning; Personalized
Expertise; Class-Based Differentiated Instruction; Basic Knowledge & Core Skills]
Another smart learning environment framework [19] has been illustrated in Figure 1.4.
Quality teaching involves the use of pedagogical techniques to produce learning outcomes for
students [20]. It involves dimensions like (a) effective design of curriculum (b) course content
According to Sahlberg, Finland's national education policies are intended to raise student
achievement and are built upon the ideas of (a) sustainable leadership that places strong emphasis
on teaching and learning, (b) intelligent accountability, (c) encouraging schools to craft optimal
learning environments and (d) implementing educational content that best helps their students
reach the general goals of schooling [21].
A study conducted by Slater et al. has found a relationship between observable teacher
characteristics and student performance [22]. The authors investigated whether observable
characteristics of teachers correlate with measures of teacher effectiveness.
Alignment of teaching and learning process as well as student assessment to the teaching
and learning framework.
Regular improvement of existing support systems and development of new ones using new
ICT and e-learning tools.
Awards for teaching excellence and for producing students with good academic records.
Use of multimedia techniques and student access to the library, updated books, journals,
research papers, and electronic and digital documents.
Educational data mining [26] is the field that applies data mining techniques in educational
environments to strengthen learning systems. It plays an important role in educational
systems, where education is a primary factor for society [27]. Educational data mining is receiving
great attention for several reasons, such as (a) the need to increase the quality of education,
(b) the need to find solutions to problems arising from complex educational datasets and (c) the
competitive environment among academic institutions. The main challenge for institutions is to
analyze their performance in depth in terms of student performance, teaching skills and academic
activities.
Important student-related factors, such as performance in sessionals, attendance and lab work,
are used for analyzing and predicting student class results. Widely used data mining techniques,
e.g. decision trees, neural networks, nearest neighbor and naive Bayes, are being applied in
educational data mining. Using these techniques, useful knowledge can be discovered through
classification, clustering and association rules, which helps in increasing the quality of
education [28].
Data mining techniques applied to educational data are significant for educational organizations
as well as for students, as they enable an effective decision support system. These techniques
enhance our understanding of learning by finding educational trends that can improve student
performance, course selection, in-house training and faculty development. Factors correlated
with academic performance have been identified using linear regression analysis [29]. According
to Liebowitz, adaptive/personalized learning, educational data mining, data visualization, visual
analytics, knowledge management and blended/e-learning play growing roles in better informing
higher education officials and teachers [30].
Data mining, supported by machine learning, statistical and visualization techniques, can help
in finding and extracting knowledge. To collect data, students are asked to fill in
questionnaires and feedback forms. These forms
approach towards educational patterns or trends, interest towards technologies, teaching
methodologies to be adopted. The collected data is then analyzed using techniques such as decision
trees, neural networks, naive Bayes, support vector machines and k-means in order to help in
of students, interest in course, prediction of
student retention, prediction of course suitability, and personalized intervention strategy [32].
With a competitive environment prevailing among educational institutions, the main objective
of higher education institutes is to disseminate quality education to their students and to
improve the quality of managerial decisions. Quality of education can be improved by gaining
knowledge from educational data, which helps academic planners in higher education institutes to
Teaching skills
Soft skills
Course content
Infrastructure requirement
Faculty development programmes
Students' preference for industrial training
Academic trends
Social and emotional learning
Data mining technology can generate new business opportunities by providing capabilities such as
automated prediction of trends and behaviors. It automates the process of finding predictive
information in large databases. Technologies such as Big Data and IoT are emerging fields of
data mining. Big Data consists of data that can be structured, semi-structured or unstructured,
and optimized techniques such as real-time queries are used to respond to queries against
databases in less than a second. The data is collected, stored, organized and analyzed to discover
new insights and improve business decisions. Data mining analytics using Big Data involves
several phases of data handling: data collection, storage, organization, analysis, visualization,
and action or result utilization. Big data job opportunities include roles for specialized
programmers, statistical modelers and advanced mathematicians.
The future scope of data mining includes career prospects in marketing, analysis, statistics,
applied mathematics, data visualization and movement, cloud computing, relational databases,
and product placement and management.
Academics
Commercial organizations
Education sector
Data mining techniques are used to extract meaningful information from large volumes of data,
but the educational environment is the least explored as far as these techniques are concerned.
Various classification and clustering techniques are being implemented on datasets from other
fields to enhance various quality parameters, but their implementation in the educational
environment, for improving institutional effectiveness and enhancing the student/teacher learning
process, has been limited so far. Also, not much effort is visible in the literature either to
gather meaningfully the huge amounts of data being produced by different categories of
educational systems, or to use the already gathered data for meaningful mining on a large scale
so as to improve the teaching/learning processes.
Therefore, the authors chose to work on educational data mining, which is an emerging field of
data mining. Techniques such as a) statistics, b) decision trees, c) K-means, d) Naïve Bayes, e)
support vector machines, f) K-Nearest Neighbors and g) neural networks have not yet been
implemented on the educational datasets. Moreover, no suitable datasets are available in the
public domain for such experimentation. These techniques, when implemented on educational
datasets for exploring and analyzing the unique types of data that come from the educational
domain, can be purposefully employed to improve learning analytics.
These learning analytics can, in turn, help in understanding students' learning dynamics, and may
prove helpful for educators and policy makers. Also, not much support is available for carrying
out such explorations and discovering data patterns through customized or standard generic
tools. Some of the potential application areas where these data mining techniques have not been
applied so far are:
1. Some of the big challenges our educational system is facing relate to quality and skill-driven
education, better placement of students, and lack of support in adopting new educational
patterns as per market requirements.
2. The decision-making process becomes more complex as the number of horizontal and vertical
educational entities increases. An educational institution requires more efficient approaches
to manage and support decision-making procedures.
3. There are data mining techniques to extract meaningful knowledge from large datasets, but
their applications in the education sector have largely remained unexplored till now.
These unexplored areas are:
There is a need to introduce latest knowledge related to the educational trends into
the systems.
In the educational sector, there is enormous growth in Big Data, and educational data differs
from data in other fields in the following respects: (a) learning resources are increasing, (b)
datasets vary between the formal and non-formal sectors, (c) curricula are updated, (d) students'
behavioral attributes vary, (e) criteria for assessing students vary from institution to
institution, (f) demographic factors, (g) data fetched from web-enabled sources such as MOOCs,
Moodle and e-learning differ from those of the formal education sector, (h) datasets vary between
regular and distance learning modes, (i) academic and non-academic skills of students, and (j)
data vary across multiple streams of engineering such as computers, IT, electronics &
communication, electrical, mechanical, textile and civil. The major challenges that exist in
educational data classification are:
Data redundancy
Outlier detection
Input data formats vary across algorithmic tools, i.e. data may be accepted in numeric or in
string format.
With the growth in educational technology, there is an exponential increase in digital traces in
the education sector, matched by growth in computing power. However, due to the absence of
knowledge discovery process approaches for educational datasets, it has become really difficult
to analyze and consolidate the data. Therefore, there is an urgent need to analyze and evaluate
this huge amount of unstructured data using educational data mining techniques for effective
decision making and for predicting academic trends.
In the present scenario, the emergence of new technologies and the inclusion of various modes of
e-learning and other online educational resources in the teacher-taught paradigm, in the formal as
well as informal education sectors, have resulted in the collection of huge volumes of data. For
this structured, semi-structured or unstructured data to make reasonable sense to the stakeholders
of the systems, the emerging trends of data mining need to be explored for processing this data on
distributed systems with parallel computation. With more stress on skilled manpower, quality
education is most important, and the overall performance of students is therefore of great
concern to the field of higher education.
Educational data mining techniques are well suited to inspecting educational data. With the
mandatory adoption of mining techniques, the education sector benefits from faster
decision-making based on analyses of data fetched from students.
The data from students and other stakeholders may include a) preferences for courses, course
outcomes, training (especially vocational training), industry-oriented courses as optional
subjects, job profiles, etc., b) choices of the appropriate existing subjects, c) available
options at the national and international levels, d) in-house training needs for employees and
management, and so on. Subsequently, these techniques help in acquiring useful information
pertaining to students, such as a) prediction of skilled students, b) finding new educational
trends that are in line with industry standards, c) fulfillment of the demand for skilled
manpower, d) targeting learners showing unsatisfactory performance in class, e) updates to syllabi,
and f) guiding learners to choose the right course in which to undergo training.
1.9 Objectives
1. To propose the prediction of students' preferences for selection of courses using existing
significant clustering techniques.
2. To predict the learning behaviors of students and the probability of their placements in
the educational system currently being followed.
3. To analyze datasets, classification and pattern recognition for finding academic trends to
support academic and job-oriented decision making by students.
Using the classification and clustering techniques of data mining, we intended to predict the
class result of students based on the attributes taken. These techniques can help in identifying
those students who show poor performance in sessionals in an educational institution. The main
purpose of using these techniques is to produce actionable outcomes from the academic
performance of students.
Clustering students on attributes such as class performance, sessionals and attendance is
essential for this purpose. It is intended to strengthen the decision-making approach used to
monitor and improve the performance of students. We have shown that, on increasing the number of
clusters, the accuracy improves and a better grouping of the data can be found. It also helps us
to cluster those students who need special attention.
Using data mining classification techniques, we can predict the learning behaviors of students
and the probability of their placements in educational system currently being followed.
The classifiers have been implemented on the educational datasets, and a comparative analysis of
classifier accuracy has been performed using Python machine learning techniques. This
research has led to better selection of classifiers for future data analysis. Apart from this,
emphasis has been laid on the implementation of Big Data analytics for getting meaningful
information from unstructured data, so as to help students in selecting their choices for
industrial training.
Within the scope of this work, a tool has been developed using Dot Net technologies for
association rule mining, and another tool has been developed in PHP for feedback analysis to help
in in-house training programmes.
The use of classification and pattern recognition for finding academic trends has been
shown to support academic and job-oriented decision making by students.
1.11 Contributions
The dataset is generated from feedback obtained from students of the NIELIT (National Institute
of Electronics & IT) Shimla centre (erstwhile DOEACC Society), an autonomous scientific
society under the administrative control of the Ministry of Electronics & Information Technology
(MoE&IT), Government of India. It has been set up to carry out human resource development
and related activities in the area of Information, Electronics & Communications Technology
(IECT). NIELIT is engaged in both the formal and non-formal sectors of education in the area of
IECT, besides the development of industry-oriented quality education and training programmes in
state-of-the-art areas. The feedback proforma used is attached as Annexure A-1. The
feedback parameters have been categorized as a) teaching skills, b) course content and c)
infrastructure facilities. Statistical techniques of data mining, such as regression analysis,
have been implemented on this dataset. The results obtained from the SPSS package in the form of
mean and standard deviation have contributed to the thesis by improving a) institutional
effectiveness, b) student learning, c) teaching skills of instructors and d) infrastructure
quality. The contribution related to the results of these statistical techniques of data mining
has been published in the following paper:
After this literature survey, data mining techniques have been applied to an educational dataset
synthesized for the purpose of evaluating the entropy of data attributes. A decision tree
classifier has been implemented on this dataset in order to obtain the following results: (a)
prediction of the performance of students in a particular class and (b) identification of
students whose attendance is short and who have performed poorly in sessionals.
It has been shown that the WEKA tool and the decision tree algorithm produce exactly the
same information gain for the root attribute. The following paper based upon this idea has been
published:
P. Guleria, N. Thakur and M. Sood, "Predicting Student Performance using Decision Tree
Classifiers and Information Gain," International Conference on Parallel, Distributed and
Grid Computing, Solan, 2014, pp. 126-129, IEEE. doi: 10.1109/PDGC.2014.7030728.
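The entropy and information gain used to choose the root attribute of a decision tree can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the records, the attribute names (`attendance`, `sessional`, `result`) and their values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy records; attribute names and values are illustrative only.
students = [
    {"attendance": "good", "sessional": "good", "result": "pass"},
    {"attendance": "good", "sessional": "poor", "result": "pass"},
    {"attendance": "short", "sessional": "good", "result": "fail"},
    {"attendance": "short", "sessional": "poor", "result": "fail"},
]

gains = {a: information_gain(students, a, "result") for a in ("attendance", "sessional")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
```

In this toy data `attendance` separates the two classes completely, so it has the larger gain and would be selected as the root attribute.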
Further in this direction, the authors have applied K-means clustering and K-Nearest Neighbor
classification to another dataset synthesized for this purpose and related to the educational
environment. The synthesized dataset is based on the performance of students in a) internal
exams, b) attendance, c) lab work, d) assignments and e) overall performance. Using the K-means
clustering technique on this dataset, the following two clusters have been generated: (a)
students who are short of attendance and (b) students who have performed poorly in sessionals.
This work also concludes that, on increasing the number of clusters K, the accuracy improves
and a better clustering of the data in the dataset can be found. In the K-Nearest Neighbor
technique, the nearest class for an upcoming group of fresh students has been determined using K
values, which can help in a) identifying the group of students who have good practical as well as
good overall performance in the class and b) strengthening the decision-making approach of
instructors to monitor the capabilities of the group. This work on K-means and K-Nearest
Neighbors has been published in the research papers given below:
-
International Journal of Innovations & Advancement in Computer Science, 2347 8616, Vol.
3, Issue 8, 2014.
-Nearest Neighbor: A
International Journal of Computer Science and
Information Security (IJCSIS), Vol. 12, 2016.
-Means
and K-
ISSN: 0974-5572, Vol. 10, No.40, 2017.
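One way to inspect the effect of the number of clusters K is to run K-means for several values of K and compare the within-cluster sum of squares (WCSS). The sketch below uses plain Lloyd's algorithm in standard-library Python. It rests on assumptions: the thesis does not state here which quality measure it used, so WCSS is only an assumed proxy, and the (attendance, sessional marks) pairs are invented for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss

# Hypothetical (attendance %, sessional marks) pairs, invented for illustration.
students = [(92, 81), (88, 77), (90, 85), (52, 79), (48, 74), (55, 82), (89, 38), (91, 42)]

wcss_by_k = {k: kmeans(students, k)[1] for k in (1, 2, 3, len(students))}
```

With one cluster per point the WCSS drops to zero, so a lower WCSS at larger K is expected in itself; in practice the value of K is usually chosen where the improvement levels off.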
The academic trends related to students' preferred choices for industrial training have been
predicted using association rule mining, and its importance in academic counseling has been
highlighted. The knowledge has been extracted from a semi-synthesized dataset specially created
for this purpose for students with an engineering background. Using association rule mining,
preferred courses for industrial training have been extracted from the dataset. In this work,
rules have been discovered using the Apriori algorithm, which help instructors (a) to find the
interest of students in specialized/industry-oriented courses and (b) to enhance the
effectiveness of academic planning/decision-making. The results related to this work have been
published in the research paper given below:
P.
International Journal of Advance Research in Science and
Engineering (IJARSE), ISSN-2319-8354(E), Vol. No.4, special issue (01), 2015.
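The level-wise search behind the Apriori algorithm, together with the support and confidence of a rule, can be sketched as below. This is a simplified illustration under stated assumptions, not the ASP.NET tool described in the thesis: the course names, the transactions and the 0.4 support threshold are all hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates of the next size, and
        # prune any candidate that has an infrequent subset (the Apriori property).
        candidates = {
            a | b for a in level for b in level
            if len(a | b) == size
            and all(frozenset(s) in level for s in combinations(a | b, size - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return frequent[antecedent | consequent] / frequent[antecedent]

# Hypothetical course choices of five students, invented for illustration.
transactions = [
    {"Java", "Android"},
    {"Java", "Android", "Python"},
    {"Python", "BigData"},
    {"Java", "Android", "BigData"},
    {"Python", "BigData"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
rule_conf = confidence(frequent, frozenset({"Java"}), frozenset({"Android"}))
```

Here every student who chose `Java` also chose `Android`, so the rule Java -> Android has confidence 1.0, while the pair (`Java`, `Python`) falls below the support threshold and is discarded.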
A further contribution of this thesis includes neural network based clustering and classification
approaches. In this work, experimentation has been performed on the same dataset that was
synthesized for the K-means technique. The first approach proposed is based on the
Self-Organizing Map (SOM), which is a type of Artificial Neural Network (ANN). In this approach,
students are clustered into natural classes based on certain attributes, so that similar classes
are grouped together. In the second approach, pattern recognition has been carried out through a
two-layer feed-forward network to classify inputs into a set of target categories. The SOM learns
by recognizing major features in the input data presented to it. The SOM neural network based
clustering and pattern recognition perform data mining by training the network to identify the
classified and misclassified data. The findings have shown that (a) the network is trained
properly and (b) for the input data of students, the desired target classes are classified
accurately. The confusion matrix results represent the percentage of accurately classified
classes of students, which a) helps in clustering students and b) strengthens decision making in
supervising the appraisal of students. This work has been published as per the following details.
P. Guleria
International Journal of Control Theory and Applications, ISSN: 0974-5572, Vol. 10, No.40,
2017.
The next contribution is based on Bayes' theorem. Using the Naïve Bayes classifier, the
probability of placement of students has been predicted from a dataset synthesized for
experimentation purposes. The results a) help instructors/educators to update the
syllabi/curricula as per industry needs and b) guide students to focus on skills such as
quantitative aptitude, reasoning, communication and technical skills, apart from regular studies,
to improve their chances of placement. The results of this work have been published.
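As an illustration of the Naïve Bayes placement prediction described above, the following sketch estimates the class posterior for categorical attributes. It is an assumption-laden toy, not the thesis implementation: the attributes (`aptitude`, `communication`), the records and the use of add-one (Laplace) smoothing are all hypothetical choices.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    values = defaultdict(set)            # attribute -> set of observed values
    for row in rows:
        for attr, val in row.items():
            if attr != target:
                value_counts[(attr, row[target])][val] += 1
                values[attr].add(val)
    return class_counts, value_counts, values

def posterior(model, sample):
    """Normalised P(class | sample) under the naive independence assumption."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # class prior
        for attr, val in sample.items():
            # Add-one smoothing avoids zero probabilities for unseen combinations.
            score *= (value_counts[(attr, cls)][val] + 1) / (n_cls + len(values[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical student records, invented for illustration.
rows = [
    {"aptitude": "high", "communication": "good", "placed": "yes"},
    {"aptitude": "high", "communication": "good", "placed": "yes"},
    {"aptitude": "high", "communication": "poor", "placed": "yes"},
    {"aptitude": "low", "communication": "good", "placed": "no"},
    {"aptitude": "low", "communication": "poor", "placed": "no"},
    {"aptitude": "low", "communication": "poor", "placed": "no"},
]

model = train(rows, "placed")
probs = posterior(model, {"aptitude": "high", "communication": "good"})
```

For this sample the unnormalised scores are 0.5 × 0.8 × 0.6 = 0.24 for `yes` and 0.5 × 0.2 × 0.4 = 0.04 for `no`, giving a placement probability of 6/7.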
A Support Vector Machine (SVM) is a supervised data mining technique that is effective in terms
of accuracy of results. The SVM technique has been applied to a dataset similar to the one used
with Bayes' theorem to predict the placement results of students. Here, the dataset is classified
into labels, with cross-validation applied to find the best possible values. The placement
results obtained using the SVM technique classify attributes and provide better insight to
students, helping them keep up to date with the current academic scenario in order to get placed.
This work has been published as per the details given below:
After this, the authors have developed web-based educational classification tools in ASP.NET
and PHP. Predictive rules, in the form of preferred courses for students to undergo industrial
training, have been derived using the Apriori algorithm of the association rule mining technique,
with ASP.NET as the front end and SQL Server 2008 as the back end. Another tool has been developed
with PHP as the front end and MySQL as the back end for feedback analytics, categorizing the
following: (a) teaching skills, (b) course content, (c) infrastructure of the institute and (d)
other deliverables.
Another contribution is in the emerging field of big data. For this, the authors have first
synthesized an experimental dataset consisting of courses related to the field of ICT and their
attributes. The dataset is processed through the proposed methodology based on the MapReduce
algorithm. The framework uses mapper and reducer functions to run the job in parallel on a
single-node cluster using the Hadoop Distributed File System. The data is converted into
individual tuples, and meaningful data is obtained from the reducer function. The results and
their analysis show that MapReduce can provide students with career counseling support, which
strengthens their decision-making in opting for the right course(s) for training activities as
per industry requirements. The proposed methodology is expected to be pivotal in designing and
implementing a support system that will facilitate intelligent decision-making by parents,
teachers and mentors related to the careers of their children/wards/students, and the
strengthening of in-house training programmes.
P. Guleria and M. Sood, "Big Data Analytics: Predicting Academic Course Preference
using Hadoop Inspired MapReduce", Fourth International Conference on Image
Information Processing (ICIIP), 2017, pp. 1-4, IEEE. doi: 10.1109/ICIIP.2017.8313734.
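The map, shuffle and reduce phases described above can be imitated in a few lines of plain Python. This single-process sketch only mirrors the data flow of a MapReduce job and does not use Hadoop; the record format (comma-separated course names per student) and the course names are assumptions made for illustration.

```python
from collections import defaultdict

def mapper(record):
    """Emit a (course, 1) tuple for each course named in a comma-separated record."""
    for course in record.split(","):
        course = course.strip()
        if course:
            yield course, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Sum the counts collected for one course."""
    return key, sum(values)

# Hypothetical preference records, one line per student.
records = ["Python, Big Data", "Java, Python", "Big Data, Python"]

pairs = [pair for record in records for pair in mapper(record)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
preferred = max(counts, key=counts.get)  # most frequently chosen course
```

On a real cluster the mapper and reducer would run as separate tasks over file splits; the grouping step here stands in for the shuffle that the framework performs between them.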
Finally, the authors have performed educational data classification and a comparative analysis of
classifiers through Python programming on datasets synthesized for three different subjects with
similar attributes. In this work, the following has been done: a) classification of the
educational dataset has been performed using different classifiers, b) the precision values of
the classifier models have been compared, c) a validation dataset has been created and d) the
course attracting maximum interest among the three different courses has been predicted. The work
related to this has been published in a research paper.
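Comparing classifiers by precision amounts to computing, for each model, the fraction of its positive predictions that are correct, TP / (TP + FP). The sketch below shows that calculation only. The labels, the predictions and the names `classifier_a` and `classifier_b` are hypothetical placeholders and do not correspond to any result reported in the thesis.

```python
def precision(y_true, y_pred, positive):
    """Fraction of predictions of `positive` that are correct: TP / (TP + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical true labels and predictions from two placeholder classifiers.
y_true = ["pass", "pass", "fail", "pass", "fail", "fail"]
predictions = {
    "classifier_a": ["pass", "pass", "pass", "pass", "fail", "fail"],
    "classifier_b": ["pass", "fail", "pass", "pass", "pass", "fail"],
}

scores = {name: precision(y_true, y_pred, "pass") for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)  # classifier with the highest precision
```

In this toy comparison `classifier_a` makes four positive predictions of which three are correct (precision 0.75), while `classifier_b` gets two of four right (precision 0.5).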
Chapter 2 presents the literature survey of KDD, the data mining step of the KDD process, and
its methods and techniques.
Chapter 3 covers the different mining techniques implemented for classification and clustering
of data. The working principles and algorithms of these techniques are discussed in this
chapter. The mining techniques discussed in this chapter are as follows: a) Statistical techniques,
In Chapter 4, the research methodology used in this work is presented. The methodology has been
divided into five phases, which are as follows: a) study of the related literature, b) study of
the functional requirements of education and training followed in an academic institute, c)
synthesis of datasets, and predictive analytics of educational trends and decision making using
data mining techniques, d) implementation of classification and clustering techniques over the
datasets and evaluation of the results, and e) development of tools for association rule mining
using the Apriori algorithm in ASP.NET and for feedback analytics in PHP.
In Chapter 5, the tools used for finding results using data mining techniques have been discussed.
The tools discussed are: a) WEKA, b) MATLAB, c) SPSS, d) RapidMiner, e) Hadoop
MapReduce framework, f) Python.
In Chapter 6, the experiments, results and discussion are presented in detail. In this chapter,
analysis and classification of educational data are carried out for effective decision making.
The results have been obtained using the various data mining techniques discussed in Chapter 3.
The techniques applied to the educational dataset are as follows: a) statistical techniques, b)
classification, c) clustering, d) pattern recognition, e) supervised learning and f)
probabilistic approach.
In Chapter 7, web-based data mining tools are proposed for association rule mining and for
performing feedback analysis. The tools have been developed in ASP.NET and PHP.
In Chapter 8, Educational data classification using Hadoop Inspired MapReduce framework has
been presented. Here, the data is distributed using Map and Reduce phases for parallel
computation of data. Using MapReduce framework, preferred courses have been predicted for
students to undergo industrial trainings.
In Chapter 9, educational data classification has been performed using the Python language. Using
Python, multiple classifiers have been implemented over the dataset and compared in terms of
their precision values. The classifiers implemented over the dataset are as follows: a) KNN, b)
SVM, c) CART, d) linear regression, e) Naïve Bayes and f) linear discriminant analysis.
The conclusion and future scope are highlighted in Chapter 10, followed by references.