ml-lab
MACHINE LEARNING
LABORATORY MANUAL
Machine learning
Machine learning is a subset of artificial intelligence in the field of computer science that often
uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve
performance on a specific task) with data, without being explicitly programmed. In the past
decade, machine learning has given us self-driving cars, practical speech recognition, effective
web search, and a vastly improved understanding of the human genome.
Supervised learning: The computer is presented with example inputs and their desired outputs,
given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. As
special cases, the input signal can be only partially available, or restricted to special feedback:
Semi-supervised learning: the computer is given only an incomplete training signal: a training set
with some (often many) of the target outputs missing.
Active learning: the computer can only obtain training labels for a limited set of instances (based
on a budget), and also has to optimize its choice of objects to acquire labels for. When used
interactively, these can be presented to the user for labeling.
Reinforcement learning: training data (in form of rewards and punishments) is given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing
a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in
data) or a means towards an end (feature learning).
In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task. Density estimation finds the
distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping them
into a lower-dimensional space. Topic modeling is a related problem, where a program is given
a list of human language documents and is tasked with finding out which documents cover similar
topics.
Decision tree learning: Decision tree learning uses a decision tree as a predictive model, which maps
observations about an item to conclusions about the item's target value.
Association rule learning: Association rule learning is a method for discovering interesting relations
between variables in large databases.
An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is
a learning algorithm that is vaguely inspired by biological neural networks. Computations are
structured in terms of an interconnected group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural networks are non-linear statistical data
modeling tools. They are usually used to model complex relationships between inputs and outputs,
to find patterns in data, or to capture the statistical structure in an unknown joint probability
distribution between observed variables.
Deep learning
Falling hardware prices and the development of GPUs for personal use in the last few years have
contributed to the development of the concept of deep learning which consists of multiple hidden
layers in an artificial neural network. This approach tries to model the way the human brain
processes light and sound into vision and hearing. Some successful applications of deep learning
are computer vision and speech recognition.
Support vector machines (SVMs) are a set of related supervised learning methods used for
classification and regression. Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that predicts whether a new example
falls into one category or the other.
Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to some pre designated criterion or
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between members
of the same cluster) and separation between different clusters. Other methods are based on
estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a
common technique for statistical data analysis.
Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical
model that represents a set of random variables and their conditional independencies via a directed
acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the network can be used to
compute the probabilities of the presence of various diseases. Efficient algorithms exist that
perform inference and learning.
Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment
so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt
to find a policy that maps states of the world to the actions the agent ought to take in those states.
Reinforcement learning differs from the supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions explicitly corrected.
Genetic algorithms
A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and
uses methods such as mutation and crossover to generate new genotype in the hope of finding
good solutions to a given problem. In machine learning, genetic algorithms found some uses in
the 1980s and 1990s. Conversely, machine learning techniques have been used to improve the
performance of genetic and evolutionary algorithms.
Rule-based machine learning is a general term for any machine learning method that identifies,
learns, or evolves "rules" to store, manipulate or apply, knowledge. The defining characteristic of
a rule-based machine learner is the identification and utilization of a set of relational rules that
collectively represent the knowledge captured by the system. This is in contrast to other machine
learners that commonly identify a singular model that can be universally applied to any instance
in order to make a prediction. Rule-based machine learning approaches include learning classifier
systems, association rule learning, and artificial immune systems.
Feature selection is the process of selecting an optimal subset of relevant features for use in model
construction. It is assumed the data contains some features that are either redundant or irrelevant,
and can thus be removed to reduce calculation cost without incurring much loss of information.
Common optimality criteria include accuracy, similarity and information measures.
SEMESTER – III
LAB REQUIREMENTS:
Python or any ML tools like R
LIST OF EXPERIMENTS :
1. Demonstrate how you structure data in Machine Learning
2. Implement data preprocessing techniques on real time dataset
3. Implement Feature subset selection techniques
4. Demonstrate how you will measure the performance of a machine learning model
5. Write a program to implement the naïve Bayesian classifier for a sample training dataset.
Compute the accuracy of the classifier, considering few test data sets.
6. Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set.
7. Apply EM algorithm to cluster a set of data stored in a .CSV file.
8. Write a program to implement k-Nearest Neighbor algorithm to classify the data set.
9. Apply the technique of pruning for a noisy data monk2 data, and derive the decision tree from this
data. Analyze the results by comparing the structure of pruned and unpruned tree.
10. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test
the same using appropriate data sets
TOTAL: 60 PERIODS
Ex.No:1
Demonstrate how you structure data in Machine Learning
Date:
Machine Learning is one of the most widely used technologies by data scientists and ML experts for
deploying real-time projects. However, machine learning skills alone are not sufficient for solving real-
world problems and designing a better product; you also need good exposure to data
structures.
The data structure used for machine learning is quite similar to other software development
fields where it is often used. Machine Learning is a subset of artificial intelligence that includes
various complex algorithms to solve mathematical problems to a great extent. Data structure helps
to build and understand these complex problems. Understanding the data structure also helps you to
build ML models and algorithms in a much more efficient way than other ML professionals. In this
topic, "Data Structure for Machine Learning", we will discuss various concepts of data structure
used in Machine Learning, along with the relationship between data structure and ML. So, let's start
with a quick overview of Data structure and Machine Learning.
The data structure is defined as the basic building block of computer programming that helps us
to organize, manage and store data for efficient search and retrieval.
In other words, the data structure is the collection of data type 'values' which are stored and organized
in such a way that it allows for efficient access and modification.
The data structure is the ordered sequence of data, and it tells the compiler how a programmer is using
the data such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.
Now let's discuss popular data structures used for Machine Learning:
The linear data structure is a special type of data structure that helps to organize and manage data in a
specific order where the elements are attached adjacently.
Array:
An array is one of the most basic and common data structures used in Machine Learning. It is also used
in linear algebra to solve complex mathematical problems. You will use arrays constantly in machine
learning, whether it's:
o To convert the column of a data frame into a list format in pre-processing analysis
o To order the frequency of words present in datasets.
o Using a list of tokenized words to begin clustering topics.
o In word embedding, by creating multi-dimensional matrices.
An array contains index numbers to represent an element starting from 0. The lowest index is arr[0] and
corresponds to the first element.
Let's take an example of a Python array used in machine learning. Although the Python array is quite
different from arrays in other programming languages, the Python list is more popular because it offers
flexibility in data types and length. If you are using Python for ML algorithms, it is better
to start your journey with arrays.
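As a small illustration of the bullet points above, the following sketch (the 'review' column and sample
rows are made up purely for illustration) converts a data-frame column into a list, tokenizes it, and
orders the words by frequency:
import numpy as np
import pandas as pd

df = pd.DataFrame({'review': ['good movie', 'bad movie', 'good plot']})
reviews = df['review'].tolist()          # data-frame column converted to a list
tokens = ' '.join(reviews).split()       # tokenized words
words, counts = np.unique(tokens, return_counts=True)
order = np.argsort(-counts)              # order words by frequency
print(list(zip(words[order], counts[order])))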
Some commonly used Python list methods:
Method     Description
count()    Returns the number of elements with the specified value.
extend()   Adds the elements of a list to the end of the current list.
index()    Returns the index of the first element with the specified value.
pop()      Removes an element from a specified position using an index number.
Stacks:
Stacks are based on the concept of LIFO (Last In, First Out), also described as FILO (First In, Last Out).
It is used for binary classification in deep learning. Although stacks are easy to learn and implement in
ML models, having a good grasp of them also helps in many computer science areas such as parsing
grammars.
Stacks enable the undo and redo buttons on your computer, as they function like a stack of blog posts:
there is no sense in adding a post at the bottom of the stack; we can only check the
most recent one that has been added. Addition and removal occur at the top of the stack.
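A Python list already behaves like a stack; a minimal sketch of an undo history:
undo_stack = []
undo_stack.append('typed "hello"')    # push
undo_stack.append('deleted a word')   # push
last_action = undo_stack.pop()        # pop: the most recent action is undone first
print(last_action)                    # -> 'deleted a word'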
Linked List:
A linked list is a type of collection having several separately allocated nodes. In other words, a linked
list is a collection of data elements (nodes), where each node consists of a value and a pointer that
points to the next node in the list.
In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing
a value is slow and often requires scanning the list. A linked list is therefore useful where a dynamic
collection is needed and the shifting of elements (as in an array) would be costly. An element can be
inserted at the head, middle or tail position, although reaching a middle position first requires a scan.
Linked lists are easy to splice together and split apart, and a list can be converted to a fixed-length
array for fast access.
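A minimal sketch of a singly linked list in Python:
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None          # pointer to the next node

head = Node(1)
head.next = Node(2)
head.next.next = Node(3)

node = head
while node is not None:           # accessing a value requires scanning from the head
    print(node.value)
    node = node.next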
Queue:
A queue is a "FIFO" (first in, first out) data structure. It is useful for modelling queuing scenarios in
real-time programs, such as people waiting in line to withdraw cash at a bank. Hence, the queue is
significant in a program where multiple lists of tasks need to be processed in order.
The queue data structure can also be used to record the split times of a car in F1 racing.
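A minimal sketch using collections.deque as a FIFO queue of pending requests:
from collections import deque

queue = deque()
queue.append('request-1')    # enqueue at the rear
queue.append('request-2')
first = queue.popleft()      # dequeue from the front (first in, first out)
print(first)                 # -> 'request-1'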
As the name suggests, in non-linear data structures, elements are not arranged in any sequence. All the
elements are arranged and linked with each other in a hierarchical manner, where one element can be
linked with one or more elements.
1) Trees
Binary Tree:
The concept of a binary tree is very similar to that of a linked list; the only difference lies in the nodes
and their pointers. In a linked list, each node contains a data value with a pointer that points to the next
node in the list, whereas in a binary tree, each node has two pointers to subsequent (child) nodes instead
of just one.
Binary trees are sorted, so insertion and deletion operations can be easily done with O(log N) time
complexity. Similar to the linked list, a binary tree can also be converted to an array on the basis of tree
sorting.
In a binary (search) tree there are child and parent nodes. The value of a left child node is always less
than the value of its parent node, while the value of a right child node is always greater than that of the
parent node. Hence, in such a binary tree structure, the data stays sorted
automatically, which makes insertion and deletion efficient.
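A minimal sketch of binary-search-tree insertion, following the left < parent < right rule described above:
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)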
2) Graphs
A graph data structure is also very useful in machine learning, for example for link prediction. A graph
consists of nodes connected by edges, which may be directed (ordered pairs of nodes) or undirected
(unordered pairs). Hence, you should have good exposure to the graph data structure for machine
learning and deep learning.
3) Maps
Maps are a popular data structure in the programming world, mostly useful for minimizing the run
time of algorithms and for fast data lookup. A map stores data in the form of (key, value) pairs, where
each key must be unique while values can be duplicated. Each key corresponds to, or maps to, a
value; hence the name Map.
In different programming languages, core libraries have built-in maps or, rather, HashMaps with
different names for each implementation.
o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.
Python Dictionaries are very useful in machine learning and data science as various functions and
algorithms return the dictionary as an output. Dictionaries are also much used for implementing sparse
matrices, which is very common in Machine Learning.
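A minimal sketch of a dictionary used as a sparse feature vector, where only the non-zero entries are stored:
sparse_vector = {0: 3, 7: 1, 42: 2}     # feature index -> count
value_at_7 = sparse_vector.get(7, 0)    # fast lookup by key
value_at_9 = sparse_vector.get(9, 0)    # missing keys default to 0
print(value_at_7, value_at_9)           # -> 1 0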
A heap is a hierarchically ordered data structure. The heap data structure is similar to a tree, but it is
based on vertical ordering rather than horizontal ordering.
Ordering in a heap is applied along the hierarchy but not across it: in a max-heap, the value of a parent
node is always greater than that of its child nodes, whether on the left or the right side.
Insertion and deletion are performed on the basis of promotion: an element is first inserted at the next
available position (the end of the heap), then compared with its parent and promoted (swapped upward)
until it reaches its correct position. Most heap data structures can be stored in
an array, with the relationships between the elements implied by their positions.
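Python's built-in heapq module implements a min-heap (smallest element at the root) inside a plain list; a minimal sketch:
import heapq

heap = []
for value in [7, 2, 9, 1]:
    heapq.heappush(heap, value)   # insert and promote to the correct position
smallest = heapq.heappop(heap)    # the root (smallest value in a min-heap) comes out first
print(smallest)                   # -> 1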
The multidimensional array (ndarray) is one of the most important data structures used in linear algebra;
it covers 1-D, 2-D, 3-D and even 4-D arrays for matrix arithmetic. Working with it requires good
exposure to Python libraries such as NumPy when programming deep learning models.
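A minimal sketch of NumPy ndarrays used for matrix arithmetic:
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array (matrix)
v = np.array([0.5, -1.0])                # 1-D array (vector)
print(A @ v)                             # matrix-vector product
print(A.T)                               # transpose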
For a machine learning professional, apart from machine learning skills, mastery of data structures and
algorithms is also required.
When we use machine learning to solve a problem, we need to evaluate the model's performance, i.e.,
which model is fastest and requires the smallest amount of space and resources while staying accurate.
Moreover, if a model is built using algorithms, comparing and contrasting two algorithms to determine
the best one for the job is crucial to the machine learning professional. For such cases, skills in data
structures become important for ML professionals: with knowledge of data structures and algorithms, we
can reason easily about the time and memory trade-offs of our models.
Conclusion
In this article, we have discussed how Data structure is helpful in building Machine Learning
algorithms. A data structure is a key player in the programming world to solve most of the computing
problems, and gaining the knowledge of data structure and implementing the best algorithm gives you
the best and optimum solution for an ML problem. Further, having a strong knowledge of data structure
will help you to build a strong foundation and use the skills to create a better Project in Machine
Learning.
Ex.No:2
Implement data preprocessing techniques on real time dataset
Date:
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step when creating a machine learning model.
When working on a machine learning project, we do not always come across clean and formatted
data. Before doing any operation on the data, it is necessary to clean it and put it into a
formatted shape; for this, we use the data preprocessing step.
Real-world data generally contains noise and missing values, and may be in an unusable format
which cannot be used directly by machine learning models. Data preprocessing is the task of
cleaning the data and making it suitable for a machine learning model, which also increases the
accuracy and efficiency of the model.
To create a machine learning model, the first thing we require is a dataset, as a machine learning model
works entirely on data. The collected data for a particular problem, in a proper format, is known as
the dataset.
Datasets come in different formats for different purposes. For example, a dataset for a business
problem will be different from the dataset required for a liver-patient prediction problem; each dataset
is different from another. To use the dataset in our code, we usually put
it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save
tabular data, such as spreadsheets. It is useful for huge datasets, and such datasets can be used directly in
programs. Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded
from "https://www.superdatascience.com/pages/machine-learning". For real-world problems,
we can also create our own dataset by gathering data using various APIs with Python and put that data
into a .csv file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python
libraries. These libraries are used to perform some specific jobs. There are three specific libraries that
we will use for data preprocessing, which are:
Numpy: The Numpy library is used for including any type of mathematical operation in the code. It
is the fundamental package for scientific computation in Python. It also adds support for large,
multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this
library, we need to import a sub-library pyplot. This library is used to plot any type of charts in Python
for the code. It will be imported as below:
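import matplotlib.pyplot as mtp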
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and
used for importing and managing the datasets. It is an open-source data manipulation and analysis
library. It will be imported as below:
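import pandas as pd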
Here, we have used pd as a short name for this library. Consider the below image:
Now we need to import the datasets which we have collected for our machine learning project. But
before importing a dataset, we need to set the current directory as a working directory. To set a working
directory in Spyder IDE, we need to follow the below steps:
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, in the below image, we can see the Python file along with required dataset. Now, the current
folder is set as a working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to
read a csv file and perform various operations on it. Using this function, we can read a csv file locally as
well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we have passed
the name of our dataset. Once we execute the above line of code, it will successfully import the dataset
in our code. We can also check the imported dataset by clicking on the section variable explorer, and
then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python. We can also
change the format of our dataset by clicking on the format option.
In machine learning, it is important to distinguish the matrix of features (independent variables) and
dependent variables from dataset. In our dataset, there are three independent variables that
are Country, Age, and Salary, and one is a dependent variable which is Purchased.
To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to extract the
required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the
columns. Here we have used :-1, because we don't want to take the last column as it contains the
dependent variable. So by doing this, we will get the matrix of features.
As we can see in the above output, there are only three variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory, but for R
language it is not required.
4) Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains
some missing data, then it may create a huge problem for our machine learning model. Hence it is
necessary to handle missing values present in the dataset.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. Here,
we just delete the specific row or column which consists of null values. But this way is not very efficient,
and removing data may lead to loss of information, which will not give an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which
contains the missing values and put it in place of the missing value. This strategy is useful for
features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use Scikit-learn library in our code, which contains various libraries
for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
# Note: recent scikit-learn versions replace the Imputer class below with
# sklearn.impute.SimpleImputer; the older API from the original walkthrough is shown here.
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
#Fitting the imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the means of rest
column values.
Categorical data is data which has categories; in our dataset, there are two categorical
variables, Country and Purchased.
A machine learning model works entirely on mathematics and numbers, so if our dataset has
a categorical variable, it may create trouble while building the model. It is therefore necessary to
encode these categorical variables into numbers.
Firstly, we will convert the country variables into categorical data. So to do this, we will
use LabelEncoder() class from preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has
successfully encoded the variables into digits.
But in our case, the Country variable has three categories, and as we can see in the above output, they
are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is
some ordering or correlation between these categories, which will produce the wrong output. To remove
this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables which take only the values 0 or 1. A value of 1 indicates the presence of
that category in a particular column, while the remaining dummy columns are 0. With dummy
encoding, we will have a number of columns equal to the number of categories.
In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values. For
Dummy Encoding, we will use OneHotEncoder class of preprocessing library.
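The encoding code itself is not shown here; with the older scikit-learn interface used in this walkthrough
it would look roughly like this (newer versions use ColumnTransformer instead, as shown at the end of
this experiment):
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(categorical_features=[0])
x = onehot_encoder.fit_transform(x).toarray()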
Output:
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided into
three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class.
Here we are not using the OneHotEncoder class because the Purchased variable has only two categories,
yes or no, which are automatically encoded into 0 and 1.
Output:
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set. This is one
of the crucial steps of data preprocessing as by doing this, we can enhance the performance of our
machine learning model.
Suppose we train our machine learning model on one dataset and then test it on a
completely different dataset. This will create difficulties for the model, because the
correlations it has learned may not hold on the new data.
If we train our model very well and its training accuracy is also very high, but we provide a new dataset
to it, then it will decrease the performance. So we always try to make a machine learning model which
performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set, model
predicts the output.
For splitting the dataset, we will use the below lines of code:
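The splitting code is not reproduced in this copy; a minimal version consistent with the explanation
below (the exact test_size and random_state values are assumptions) is:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)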
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random train and
test subsets.
o In the second line, we have used four variables for our output, which are:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays
of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2,
which gives the dividing ratio of the training and testing sets.
o The last parameter random_state sets a seed for the random generator so that you
always get the same result; the most commonly used value for it is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section.
As we can see in the above image, the x and y variables are divided into 4 different variables with
corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset so that they lie in a specific range. In feature
scaling, we put our variables on the same scale so that no single variable dominates the others.
As we can see, the Age and Salary column values are not on the same scale. Many machine learning
models are based on Euclidean distance, and if we do not scale the variables, this will cause issues in
our model.
If we compute distances using Age and Salary directly, the Salary values will dominate the Age values
and produce an incorrect result. To remove this issue, we need to perform feature scaling.
Standardization
Normalization
For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:
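from sklearn.preprocessing import StandardScaler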
Now, we will create the object of StandardScaler class for independent variables or features. And then
we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For test dataset, we will directly apply transform() function instead of fit_transform() because it is
already done in training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
x_test:
As we can see in the above output, all the variables are scaled between values -1 to 1.
Note: Here, we have not scaled the dependent variable because it takes only the two values 0 and 1. But
if a dependent variable has a wider range of values, we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
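Only the import and read_csv steps of the combined listing appear above. A minimal sketch of the
remaining steps, written against current scikit-learn APIs (SimpleImputer and ColumnTransformer in
place of the older Imputer and categorical_features interface used earlier), is:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

x = data_set.iloc[:, :-1].values          # matrix of features (Country, Age, Salary)
y = data_set.iloc[:, 3].values            # dependent variable (Purchased)

# handling missing data with the column means
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# encoding categorical data
x[:, 0] = LabelEncoder().fit_transform(x[:, 0])
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)
y = LabelEncoder().fit_transform(y)

# splitting into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# feature scaling
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)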
In the above code, we have included all the data preprocessing steps together. But there are some steps
or lines of code which are not necessary for all machine learning models. So we can exclude them from our code
to make it reusable for all models.
Ex.No:3
Implement Feature subset selection techniques
Date:
Feature Selection is the most critical pre-processing activity in any machine learning
process. It intends to select a subset of attributes or features that makes the most meaningful
contribution to a machine learning activity. In order to understand it, let us consider a small example
i.e., predict the weight of students based on past information about similar students, which is
captured inside a 'Student Weight' data set. The data set has four features: Roll Number, Age,
Height and Weight. Roll Number has no effect on the weight of the students, so we eliminate this
feature. The new data set will then have only three features. This subset of the data set is
expected to give better results than the full set.
Age    Height    Weight
12     1.1       23
11     1.05      21.6
13     1.2       24.7
11     1.07      21.3
14     1.24      25.2
12     1.12      23.4
The above data set is a reduced dataset. Before proceeding further, we should look at the
fact why we have reduced the dimensionality of the above dataset OR what are the issues in High
Dimensional Data?
High dimensional refers to a high number of variables, attributes or features present in certain
data sets, more so in domains like DNA analysis, geographic information systems (GIS), etc. Such data
may sometimes have hundreds or thousands of dimensions, which is not good from the machine
learning perspective because it can be a big challenge for any ML algorithm to handle. In addition,
a high amount of computational resources and time will be required, and a model
built on an extremely high number of features may be very difficult to understand. For these
reasons, it is necessary to take a subset of the features instead of the full set. So we can deduce
that the objectives of feature selection are:
1. Having a faster and more cost-effective (less need for computational resources) learning model
2. Having a better understanding of the underlying model that generates the data.
3. Improving the efficacy of the learning model.
a. Feature Relevance: In supervised learning, each input variable is expected to contribute information
that helps predict the target. A variable that contributes no useful information is said to be irrelevant; if
its information contribution to the prediction is very little, the variable is said to be weakly relevant. The
remaining variables, which make a significant contribution to the prediction task, are said to be strongly
relevant variables.
In the case of unsupervised learning, there is no training data set or labelled data. Similar
data instances are grouped together, and the similarity of data instances is evaluated based on the values
of different variables. Certain variables do not contribute any useful information for deciding the
similarity or dissimilarity of data instances. Hence, those variables make no significant contribution to the
grouping process. These variables are marked as irrelevant variables in the context of the
unsupervised machine learning task.
We can understand the concept by taking a real-world example: At the start of the article, we took a
random dataset of the student. In that, Roll Number doesn’t contribute any significant information in
predicting what the Weight of a student would be. Similarly, if we are trying to group together
students with similar academic capabilities, Roll No can really not contribute any information. So, in
the context of grouping students with similar academic merit, the variable Roll No is quite irrelevant.
Any feature which is irrelevant in the context of a machine learning task is a candidate for rejection
when we are selecting a subset of features.
b. Feature Redundancy: A feature may contribute to information that is similar to the information
contributed by one or more features. For example, in the Student Data-set, both the features Age &
Height contribute similar information. This is because, with an increase in age, weight is expected to
increase. Similarly, with the increase in Height also weight is expected to increase. So, in context to
that problem, Age and Height contribute similar information. In other words, irrespective of whether
the feature Height is present or not, the learning model will give the same results. In this kind of
situation where one feature is similar to another feature, the feature is said to be potentially
redundant in the context of a machine learning problem.
All features having potential redundancy are candidates for rejection in the final feature subset. Only
a few representative features out of a set of potentially redundant features are considered for being a
part of the final feature subset. So in short, the main objective of feature selection is to remove all
features which are irrelevant and take a representative subset of the features which are potentially
redundant. This leads to a meaningful feature subset in the context of a specific learning task.
The measure of feature relevance and redundancy
a. Measures of Feature Relevance: In the case of supervised learning, mutual information is
considered as a good measure of information contribution of a feature to decide the value of the class
label. That is why it is a good indicator of the relevance of a feature with respect to the class variable.
The higher the value of mutual information of a feature, the more relevant is that feature. Mutual
information can be calculated as follows:
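A standard way to write it is MI(C, f) = H(C) + H(f) - H(C, f), where H(C) is the entropy of the class
variable, H(f) the entropy of the feature, and H(C, f) their joint entropy.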
b. Measures of Feature Redundancy: Feature redundancy is evaluated by measuring the similarity
between pairs of features.
1. Correlation-Based Similarity Measure
Correlation is a measure of the linear relationship between two features. The correlation value ranges
between +1 and -1. A correlation of 1 (+/-) indicates perfect correlation. In
case the correlation is zero, the features have no linear relationship. Generally, for feature
selection problems, a threshold value is adopted to decide whether two features have adequate
similarity or not.
2. Distance-Based Similarity Measure
The most common distance measure is the Euclidean distance, which, between two features F1 and
F2, is calculated as
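In standard form, d(F1, F2) = sqrt( Σi (F1i - F2i)^2 ).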
Where the features represent an n-dimensional dataset. Let us consider that the dataset has
two features, Subjects (F1) and marks (F2) under consideration. The Euclidean distance between the
two features will be calculated like this:
F1    F2    F1 - F2    (F1 - F2)^2
2     6     -4         16
6     4     2          4
8     3     5          25
6     7     -1         1
7     6     1          1
8     6     2          4
9     7     2          4
The Euclidean distance between the two features is therefore sqrt(16 + 4 + 25 + 1 + 1 + 4 + 4) =
sqrt(55) ≈ 7.42.
A more generalized form of the Euclidean distance is the Minkowski distance, measured as
d(F1, F2) = ( Σi |F1i - F2i|^r )^(1/r).
The Minkowski distance takes the form of the Euclidean distance (also called the L2 norm) when r = 2;
at r = 1, it takes the form of the Manhattan distance (also called the L1 norm).
Jaccard distance:
The Jaccard index (or Jaccard similarity coefficient) between two binary features is the number of
positions where both values are 1, divided by the number of all positions except those where both values
are 0. The Jaccard distance is then 1 minus the Jaccard index.
Let us take an example to understand it better. Consider two features, F1 and F2, having values (0, 1,
1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0).
The cases where both values are 0 are excluded from the calculation of the Jaccard coefficient. Here
there are 2 positions where both features are 1 and 3 positions where exactly one of them is 1, so the
Jaccard coefficient is 2 / (2 + 3) = 0.4.
Therefore, the Jaccard distance between those two features is dj = (1 - 0.4) = 0.6
Note: One more measure of similarity using similarity coefficient calculation is Cosine Similarity.
For the sake of understanding, let us take the example of a text classification problem. The text
needs to be first transformed into features, with a word token being a feature and the number of times
the word occurs in a document being the value in each row. There are thousands of features in such
a text dataset. However, the data set is sparse in nature, as only a few words appear in any one document
and hence in a row of the data set. So each row has very few non-zero values. However, the non-zero
values can be any integer value, as the same word may occur any number of times. Also,
considering the sparsity of the dataset, the 0-0 matches need to be ignored. Cosine similarity, which
is one of the most popular measures in text classification, is calculated as:
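In standard form, cos(x, y) = (x . y) / (||x|| ||y||), i.e., the dot product of the two vectors divided by the
product of their magnitudes.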
Cosine similarity measures the angle between the x and y vectors. Hence, if the cosine similarity has a
value of 1, the angle between x and y is 0 degrees, which means x and y are the same except for
magnitude. If the cosine similarity is 0, the angle between x and y is 90 degrees, and they do not share
any similarity. In the case of the above example, the angle comes out to be 43.2 degrees.
Even after all these steps, there are a few more steps, which you can understand from the following
flowchart:
After the successful completion of this cycle, we get the desired features, and we have finally tested
them also.
Ex.No:4
Demonstrate how will you measure the performance of a machine learning model
Date:
we will discuss the various ways to check the performance of our machine learning or deep learning
model and why to use one in place of the other. We will discuss terms like:
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 score
7. Precision-Recall or PR curve
8. ROC (Receiver Operating Characteristic) curve
9. PR vs ROC curve.
For simplicity, we will mostly discuss things in terms of a binary classification problem where, let's
say, we have to find whether an image is of a cat or a dog, or whether a patient has cancer (positive) or is
found healthy (negative). A prediction can then be a true positive (TP), false positive (FP), true negative
(TN) or false negative (FN), depending on whether the predicted and actual labels agree.
Confusion matrix
It's just a representation of the above four parameters in a matrix format. Better visualization is always
good :)
Accuracy
The most commonly used metric to judge a model, but it is actually not a clear indicator of
performance, especially when the classes are imbalanced.
Take for example a cancer detection model. The chances of actually having cancer are very low. Let's
say out of 100 patients, 90 don't have cancer and the remaining 10 actually have it. We don't want
to miss a patient who has cancer but goes undetected (false negative). Yet detecting everyone as
not having cancer gives an accuracy of 90% straight away, even though the model did nothing but
predict every patient as cancer-free.
Precision
Percentage of positive instances out of the total predicted positive instances. Here the denominator is the
number of instances the model predicted as positive over the whole given dataset. Take it as finding out
'how much the model is right when it says it is right'.
Recall
Percentage of positive instances out of the total actual positive instances. Therefore the denominator (TP +
FN) here is the actual number of positive instances present in the dataset. Take it as finding out 'how
many of the actual right ones the model missed when it showed the right ones'.
Specificity
Percentage of negative instances out of the total actual negative instances. Therefore denominator (TN
+ FP) here is the actual number of negative instances present in the dataset. It is similar to recall but the
shift is on the negative instances. Like finding out how many healthy patients were not having cancer
and were told they don’t have cancer. Kind of a measure to see how separate the classes are.
F1 score
It is the harmonic mean of precision and recall. This takes the contribution of both, so higher the F1
score, the better. See that due to the product in the numerator if one goes low, the final F1 score goes
down significantly. So a model does well in F1 score if the positive predicted are actually positives
(precision) and doesn't miss out on positives and predicts them negative (recall).
One drawback is that both precision and recall are given equal importance due to which according to our
application we may need one higher than the other and F1 score may not be the exact metric for it.
Therefore either weighted-F1 score or seeing the PR or ROC curve can help.
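A minimal sketch of computing these metrics with scikit-learn (the label arrays below are made up
purely for illustration):
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))               # rows: actual, columns: predicted
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('F1 score :', f1_score(y_true, y_pred))         # harmonic mean of precision and recall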
PR curve
It is the curve between precision and recall for various threshold values. In the figure below we have 6
predictors showing their respective precision-recall curves for various threshold values. The top right part
of the graph is the ideal space, where we get high precision and recall. Based on our application we can
choose the predictor and the threshold value. PR AUC is just the area under the curve; the higher its
value, the better.
ROC curve
ROC stands for receiver operating characteristic, and the curve is plotted as TPR against FPR for
various threshold values. As TPR increases, FPR also increases. As you can see in the first figure, we
have four categories, and we want the threshold value that leads us closer to the top left corner.
Comparing different predictors (here 3) on a given dataset also becomes easy, as you can see in figure 2;
one can choose the threshold according to the application at hand. ROC AUC is just the area under the
ROC curve.
PR vs ROC curve
Due to the absence of TN in the precision-recall equation, PR-based metrics are useful for imbalanced
classes, i.e., when there is a majority negative class. The metric doesn't take much
into consideration the high number of TRUE NEGATIVES of the majority negative class,
giving better resistance to the imbalance. This is important when the detection of the positive class is
very important.
For example, detecting cancer patients involves a high class imbalance, because very few of all those
diagnosed actually have it. We certainly don't want to miss a person who has cancer and goes undetected
(recall), and we also want the positives we flag to actually be positive (precision).
Due to the consideration of TN, or the negative class, in the ROC equation, ROC is useful when both
classes are important to us, like the detection of cats and dogs. The importance of true negatives
makes sure that both classes are given importance, for example in the output of a CNN model
determining whether an image is of a cat or a dog.
Conclusion
The evaluation metric to use depends heavily on the task at hand. For a long time, accuracy was the only
measure I used, which is really a vague option. I hope this blog has been useful for you. That's all.
Ex.No:5
Write a program to implement the naïve Bayesian classifier for a sample
training dataset stored as a .CSV file. Compute the accuracy of the classifier,
Date: considering few test datasets.
import csv
import random
import math
def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        #converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def separateByClass(dataset):
    #groups the training instances by class value (needed by summarizeByClass below)
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        #summaries is a dict of tuples (mean, std) for each class value
        summaries[classValue] = summarize(instances)
    return summaries

def main():
    filename = '5data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)

main()
Output
confusion matrix is as
follows [[17 0 0]
[ 0 17 0]
[ 0 0 11]]
Accuracy metrics
precision recall f1-score support
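The listing above is an excerpt from a longer program; the data-splitting, prediction and accuracy
routines do not appear here. A compact, self-contained alternative using scikit-learn's GaussianNB,
assuming the same '5data.csv' layout with the class label in the last column, could look like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv('5data.csv', header=None)
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = GaussianNB().fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print('Accuracy:', accuracy_score(y_test, predictions))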
Ex.No:6
Write a program to construct a Bayesian network considering medical data. Use
this model to demonstrate the diagnosis of heart patients using the standard
Heart Disease Data Set.
Date:
Bronchitis = ConditionalProbabilityTable(
    [['True', 'True', 0.92],
     ['True', 'False', 0.08],
     ['False', 'True', 0.03],
     ['False', 'False', 0.98]], [smoking])
Tuberculosis_or_cancer = ConditionalProbabilityTable(
    [['True', 'True', 'True', 1.0],
     ['True', 'True', 'False', 0.0],
     ['True', 'False', 'True', 1.0],
     ['True', 'False', 'False', 0.0],
     ['False', 'True', 'True', 1.0],
     ['False', 'True', 'False', 0.0],
     ['False', 'False', 'True', 1.0],
     ['False', 'False', 'False', 0.0]], [tuberculosis, lung])
Xray = ConditionalProbabilityTable(
    [['True', 'True', 0.885],
     ['True', 'False', 0.115],
     ['False', 'True', 0.04],
     ['False', 'False', 0.96]], [Tuberculosis_or_cancer])   # last row and parent list reconstructed; truncated in the source
network = BayesianNetwork("asia")
network.add_nodes(s0, s1, s2)
network.add_edge(s0, s1)
network.add_edge(s1, s2)
network.bake()
print(network.predict_proba({'tuberculosis': 'True'}))
Ex.No:7
Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the
same dataset for clustering using k-Means algorithm. Compare the results
of these two algorithms and comment on the quality of clustering. You can
Date: add Java/Python MLlibrary classes/API in the program.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs   # samples_generator was removed in newer scikit-learn
X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)
X = X[:, ::-1]
# fragment of a helper that draws an ellipse for one Gaussian component
if covariance.shape == (2, 2):
    U, s, Vt = np.linalg.svd(covariance)
    angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
    width, height = 2 * np.sqrt(s)
else:
    angle = 0
    width, height = 2 * np.sqrt(covariance)
Output
[[1 ,0, 0, 0]
[0 ,0, 1, 0]
[1 ,0, 0, 0]
[1 ,0, 0, 0]
[1 ,0, 0, 0]]
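The listing above is only a fragment. A compact, self-contained sketch of the experiment, clustering the
same blobs with EM (GaussianMixture) and with k-means so the two label assignments can be compared,
could look like this:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)

gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)   # EM algorithm
kmeans_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

print("EM / GMM labels :", gmm_labels[:10])
print("k-means labels  :", kmeans_labels[:10])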
Ex.No:8
Write a program to implement k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes
Date:
can be used for this problem.
import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet, testSet):
    # assumes 4 numeric attribute columns followed by a class label, as in the iris data
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = [distances[x][0] for x in range(k)]
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('knndat.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
OUTPUT
Confusion matrix is as follows
[[11 0 0]
[0 9 1]
[0 1 8]]
Accuracy metrics0
Ex.No:9 Apply the technique of pruning for a noisy data monk2 data, and derive the
decision tree from this data. Analyze the results by comparing the structure of
pruned and unpruned tree.
Date:
Machine learning is a problem of trade-offs. The classic issue is over-fitting versus under-fitting. Over-
fitting happens when a model fits the training data so well that it fails to generalize, i.e., it also learns
the noise on top of the signal. Under-fitting is the opposite event: the model is too simple to find the
patterns in the data.
Decision trees are an extremely popular and useful model in machine learning, but they can easily
overfit. Pruning is one of the most commonly used techniques to avoid/overcome overfitting. In this
kernel we will discuss 2 commonly used pruning types.
1. Prepruning
2. Postpruning
In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import tree
from sklearn.metrics import accuracy_score,confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
data = '/kaggle/input/heart-disease-uci/heart.csv'
df = pd.read_csv(data)
df.head()
Out[2]:
We are not getting into the nitty-gritty details of this dataset. The main aim of this kernel is to show you
how to pre-prune and post-prune decision trees.
In [3]:
X = df.drop(columns=['target'])
y = df['target']
print(X.shape)
print(y.shape)
(303, 13)
(303,)
Splitting dataset to train and test
In [4]:
x_train,x_test,y_train,y_test = train_test_split(X,y,stratify=y)
print(x_train.shape)
print(x_test.shape)
(227, 13)
(76, 13)
First we will fit a normal decision tree without any fine tuning and check the results
In [5]:
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(x_train,y_train)
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)
Visualizing decision tree
In [6]:
plt.figure(figsize=(20,20))
features = df.columns
classes = ['Not heart disease','heart disease']
tree.plot_tree(clf,feature_names=features,class_names=classes,filled=True)
plt.show()
In [7]:
# helper function
def plot_confusionmatrix(y_train_pred, y_train, dom):
    print(f'{dom} Confusion matrix')
    cf = confusion_matrix(y_train_pred, y_train)
    sns.heatmap(cf, annot=True, yticklabels=classes,
                xticklabels=classes, cmap='Blues', fmt='g')
    plt.tight_layout()
    plt.show()
In [8]:
print(f'Train score {accuracy_score(y_train_pred,y_train)}')
print(f'Test score {accuracy_score(y_test_pred,y_test)}')
plot_confusionmatrix(y_train_pred,y_train,dom='Train')
plot_confusionmatrix(y_test_pred,y_test,dom='Test')
Train score 1.0
Test score 0.7631578947368421
Train Confusion matrix
We can see that on our train data we have 100% accuracy (100% precision and recall). But on test data
the model is not generalizing well; we have just 75% accuracy. Our model is clearly overfitting. We will
avoid overfitting through pruning; we will also do cost complexity pruning.
1. Pre pruning techniques
Pre-pruning is nothing but stopping the growth of the decision tree at an early stage. For that we can limit
the growth of the tree by setting constraints. We can limit parameters like max_depth, min_samples etc.
An effective way to do this is to grid search those parameters and choose the optimum values that
give better performance on test data.
In [9]:
params = {'max_depth': [2, 4, 6, 8, 10, 12],
          'min_samples_leaf': [1, 2],
          'min_samples_split': [2, 3, 4]}
clf = tree.DecisionTreeClassifier()
gcv = GridSearchCV(estimator=clf, param_grid=params)
gcv.fit(x_train, y_train)
Out[9]:
GridSearchCV(estimator=DecisionTreeClassifier(),
param_grid={'max_depth': [2, 4, 6, 8, 10, 12],
'min_samples_leaf': [1, 2],
'min_samples_split': [2, 3, 4]})
In [10]:
model = gcv.best_estimator_
model.fit(x_train,y_train)
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)
In [11]:
plt.figure(figsize=(20,20))
features = df.columns
classes = ['Not heart disease','heart disease']
tree.plot_tree(model,feature_names=features,class_names=classes,filled=True)
plt.show()
We can see that the tree is pruned and there is an improvement in test accuracy, but there is still scope
for improvement.
2. Post pruning techniques
There are several post pruning techniques. Cost complexity pruning is one of the important among
them.
Cost complexity pruning is all about finding the right parameter alpha. We will get the alpha values
for this tree and will check the accuracy with the pruned trees.
To know more about cost complexity pruning, watch the video on this topic by Josh Starmer.
In [12]:
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
[0. 0.00469897 0.00565617 0.00630757 0.00660793 0.00660793
0.00704846 0.00739486 0.0076652 0.0077917 0.00783162 0.00792164
0.00802391 0.00926791 0.01082349 0.01151248 0.01566324 0.02484071
0.04195511 0.04299238 0.13943465]
In [13]:
# For each alpha we will append our model to a list
clfs = []
for ccp_alpha in ccp_alphas:
clf = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
clf.fit(x_train, y_train)
clfs.append(clf)
We will remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node.
In [14]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
plt.scatter(ccp_alphas,node_counts)
plt.scatter(ccp_alphas,depth)
plt.plot(ccp_alphas,node_counts,label='no of nodes',drawstyle="steps-post")
plt.plot(ccp_alphas,depth,label='depth',drawstyle="steps-post")
plt.legend()
plt.show()
In [15]:
train_acc = []
test_acc = []
for c in clfs:
y_train_pred = c.predict(x_train)
y_test_pred = c.predict(x_test)
train_acc.append(accuracy_score(y_train_pred,y_train))
test_acc.append(accuracy_score(y_test_pred,y_test))
plt.scatter(ccp_alphas,train_acc)
plt.scatter(ccp_alphas,test_acc)
plt.plot(ccp_alphas,train_acc,label='train_accuracy',drawstyle="steps-post")
plt.plot(ccp_alphas,test_acc,label='test_accuracy',drawstyle="steps-post")
plt.legend()
plt.title('Accuracy vs alpha')
plt.show()
In [16]:
clf_ = tree.DecisionTreeClassifier(random_state=0,ccp_alpha=0.020)
clf_.fit(x_train,y_train)
y_train_pred = clf_.predict(x_train)
y_test_pred = clf_.predict(x_test)
We can see that now our model is not overfitting and performance on test data has improved.
In [17]:
plt.figure(figsize=(20,20))
features = df.columns
classes = ['Not heart disease','heart disease']
tree.plot_tree(clf_,feature_names=features,class_names=classes,filled=True)
plt.show()
We can see that the size of the decision tree got reduced significantly. Also, post-pruning is more
efficient than pre-pruning.
Ex.No:10
Build an Artificial Neural Network by implementing the Backpropagation
algorithm and test the same using appropriate data sets.
Date:
import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X,axis=0) # maximum of X array longitudinally
y = y/100
#Sigmoid Function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

#Derivative of the sigmoid (used during backpropagation below)
def derivatives_sigmoid(x):
    return x * (1 - x)
#Variable initialization
epoch=7000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2 #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layers neurons
output_neurons = 1 #number of neurons at output layer
#weight and bias initialization
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))
#draws a random range of numbers uniformly of dim x*y
for i in range(epoch):
    #Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    #Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  #how much the hidden layer weights contributed to the error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr  #dot product of next-layer error and current-layer output
    # bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr
print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
output
Input:
[[ 0.66666667  1.        ]
 [ 0.33333333  0.55555556]
 [ 1.          0.66666667]]
Actual Output:
[[ 0.92]
 [ 0.86]
 [ 0.89]]
Predicted Output:
[[ 0.89559591]
 [ 0.88142069]
 [ 0.8928407 ]]
Ex.No:11
Implement Support Vector Classification for linear kernels.
Date:
A linear kernel is used when the data is linearly separable, that is, when it can be separated using
a single line. It is one of the most common kernels and is mostly used when there is a large number of
features in a data set. One example with a lot of features is text classification, where each word (or token)
becomes a new feature, so a linear kernel is usually the first choice there.
Note: Internet Connection must be stable while running the below code because it involves
downloading data.
Imagine two sets of points, "Blue" points and "Yellow" points, that can be easily separated by a
straight line; in other words, they are linearly separable, so the linear kernel can be used.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]   # use only the first two features (sepal length and width)
y = iris.target
C = 1.0                # SVM regularisation parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)

# Build a mesh over the feature space so the decision regions can be plotted
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.show()
Output:
Here all the features are separated using simple lines, thus representing the Linear Kernel.
Ex.No:12
Implement Logistic Regression to classify problems such as spam detection,
diabetes prediction and so on.
Date:
The first example usually provided to explain how machine learning works is spam detection. Most machine
learning courses use the same example, but in how many of them do you actually get to implement the model?
Here we will build one.
Introduction
The idea of this exercise is to understand, step by step, how a spam filter works and how it makes everyone's
life easier. Also, the next time you see a "You have won a lottery" email, rather than simply ignoring it, you
will know why it belongs in spam.
As an overview of spam filtering: plenty of emails arrive every day, some go to spam and the rest stay in our
primary inbox (unless you have further categories defined). The machine learning model sits in the middle and
decides which mail is spam and which one is not.
Before we start talking about the algorithm and the code, take a step back and try relating that simple
explanation of spam detection to the number of monthly active Gmail accounts (approximately 1 billion).
The picture now seems pretty complicated, doesn't it? Let's get an overview of how Gmail uses its spam filter.
We all know that the data Google has is obviously not in paper files; it is kept in data centers which
maintain the customers' data. Before Google/Gmail decides to segregate an email into the spam or not-spam
category, and before it arrives in your mailbox, hundreds of rules are applied to that email in the data
centers. These rules describe the properties of a spam email. Some common types of spam filters are:
Blatant Blocking- deletes the emails even before they reach the inbox.
Bulk Email Filter- filters the emails that pass through the other categories but are still spam.
Category Filters- users can define their own rules which enable the filtering of messages.
Null Sender Disposition- disposes of all messages without an SMTP envelope sender address.
Remember when you get an email saying "Not delivered to xyz address"?
Null Sender Header Tag Validation- validates the messages by checking the security digital signature.
There are ways to avoid spam filtering and send your emails straight to the inbox. To learn more
about Gmail spam filter please watch this informational video from Google.
Moving on to our aim of creating our very own spam detector, let's talk about that machine learning model in
the middle. The model is like a small kid: unless you tell the kid the difference between salt and sugar, he or
she won't be able to recognise it. We apply the same idea to a machine learning model: we tell the model
beforehand which kind of email is spam and which is not. In order to do that we need to collect data from
users and ask them to tag a few emails as spam or not spam.
The dataset used here is a collection of tagged messages gathered for spam research. It contains one set of
5,574 messages in English, each tagged as legitimate (ham) or spam.
Now that we have data with tagged emails — Spam or Not Spam, what should we do next? We
would need to train the machine to make it smart enough to categorize the emails on its own. But, the
machine can’t read the full statement and start categorizing the emails. Here we will need to use our
NLP basics (check out my last blog).
We will first do some pre-processing on message text, like removing - punctuation and stop words.
import string
from nltk.corpus import stopwords

def text_preprocess(text):
    text = text.translate(str.maketrans('', '', string.punctuation))   # remove punctuation
    text = [word for word in text.split() if word.lower() not in stopwords.words('english')]   # remove stop words
    return " ".join(text)
Once the pre-processing is done, we need to vectorise the data, i.e. collect each word and its frequency so
that every message becomes a numeric feature vector. This vector matrix can then be used to create a
train/test split, which lets us train the model on one part of the data and evaluate it on the rest.
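A minimal sketch of this step is shown below. It assumes a small pandas DataFrame named messages with a 'text' column and a 'label' column (1 for spam, 0 for ham); the DataFrame here is a tiny made-up stand-in for the real tagged dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny made-up stand-in for the real tagged message dataset
messages = pd.DataFrame({
    'text': ["You have won a free lottery ticket", "Are we still meeting for lunch today",
             "Claim your free prize now", "Please send me the lab report"],
    'label': [1, 0, 1, 0]   # 1 = spam, 0 = ham
})

# Turn each cleaned message into a vector of word counts
vectorizer = CountVectorizer(preprocessor=text_preprocess)
X = vectorizer.fit_transform(messages['text'])
y = messages['label']

# Hold out part of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)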
Choosing a model
Now that we have a train/test split, we need to choose a model. There is a huge collection of models, but for
this particular exercise we will use logistic regression. Why?
Generally, when someone asks what logistic regression is, the usual answer is that it is an algorithm for
categorising things into two classes (most of the time), i.e. the result is measured using a dichotomous
variable. Logistic regression comes in several forms: binomial (2 possible values), multinomial (3 or more
possible values) and ordinal (ordered categories). Here we focus only on binomial logistic regression, i.e. the
case where the output variable has exactly two classes.
Logistic Regression
Logistic regression measures the relationship between the categorical dependent variable and one or more
independent variables by estimating probabilities using a logistic function.
From this definition the logistic function clearly plays an important role in classification, so we need to
understand what the logistic function is and how it helps in estimating the probability of being in a class.
The logistic function, also called the sigmoid function, is sigmoid(z) = 1 / (1 + e^(-z)), and its graph is the
S-shaped sigmoid curve. The output of the sigmoid function tends towards 1 as z → ∞ and towards 0 as
z → −∞. Hence the sigmoid/logistic function produces a value of the dependent variable that always lies in
[0, 1], i.e. the probability of being in a class.
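To make that boundary behaviour concrete, the short sketch below (a minimal illustration, not part of the original listing) evaluates the sigmoid at a few values of z.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large negative z maps close to 0, large positive z maps close to 1
for z in [-10, -1, 0, 1, 10]:
    print(z, round(float(sigmoid(z)), 4))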
Modelling
For the spam detection problem we have tagged messages, but we are not certain about new incoming
messages. We need a model that can tell us the probability of a message being spam or not spam. Assuming in
this example that 0 indicates the negative class (absence of spam) and 1 indicates the positive class
(presence of spam), we will use a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
So, first we define the model and then fit the training data; this phase is called training the model. Once
the training phase is finished we can use the test split and predict the results. To check the accuracy of the
model we can use the accuracy score metric, which compares the predicted results with the true results. After
running the code we got about 93% accuracy.
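A minimal sketch of this training and evaluation step, assuming the X_train, X_test, y_train, y_test split produced in the earlier vectorisation sketch (the exact accuracy will vary with the data and the split):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define and train the model on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out test split and compare with the true labels
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))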
In some cases 93% might seem a good score. There are many other things we can do with the collected data to
achieve more accurate results, such as stemming the words and normalising the message length.
Conclusion:
As we saw, we used previously collected data to train the model and predicted the category of new incoming
emails. This shows the importance of tagging the data in the right way. One mistake can make your machine
dumb. For example, in your Gmail or any other email account, when you receive an email you think is spam but
choose to ignore it, the next time you see that email you should report it as spam. This process helps many
other people who receive the same kind of email but are not aware of what spam is. Sometimes a wrong spam tag
can move a genuine email to the spam folder too, so you have to be careful before you tag an email as spam or
not spam.
1. Define machine learning
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of
data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Machine learning
is an important component of the growing field of data science.
2. Define supervised learning
Supervised learning is the type of machine learning in which machines are trained using well-labelled
training data, and on the basis of that data they predict the output. Supervised learning is the process of
providing input data as well as the correct output data to the machine learning model.
3. Define unsupervised learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabelled dataset
and are allowed to act on that data without any supervision.
7. What is classification
Machine learning is a field of study and is concerned with algorithms that learn from examples.
Classification is a task that requires the use of machine learning algorithms that learn how to assign a
class label to examples from the problem domain. An easy to understand example is classifying
emails as “spam” or “not spam.” There are many different types of classification tasks that you may
encounter in machine learning and specialized approaches to modeling that may be used for each.
8.What is clustering?
Cluster analysis, or clustering, is an unsupervised machine learning task. It involves automatically
discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering
algorithms only interpret the input data and find natural groups or clusters in feature space.
9.Define precision, accuracy and recall.
In pattern recognition, information retrieval and classification (machine learning), precision
(also called positive predictive value) is the fraction of relevant instances among the retrieved
instances, while recall (also known as sensitivity) is the fraction of relevant instances that were
retrieved. Machine learning model accuracy is the measurement used to determine which model is
best at identifying relationships and patterns between variables in a dataset based on the input, or
training, data.
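A small worked example may help here; the confusion counts below are invented purely for illustration.
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 40, 10, 20, 30

precision = TP / (TP + FP)                     # 40/50  = 0.80
recall    = TP / (TP + FN)                     # 40/60  ≈ 0.67
accuracy  = (TP + TN) / (TP + FP + FN + TN)    # 70/100 = 0.70
print(precision, recall, accuracy)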
10.Define entropy
In machine learning, entropy is a measure of the impurity or uncertainty in a set of examples. For a set S
whose classes occur with proportions p1, p2, ..., pc, the entropy is H(S) = -Σ pi log2(pi). Entropy is 0 when
all examples belong to a single class and is largest when the classes are equally likely; decision tree
algorithms such as ID3 use entropy (through information gain) to choose which attribute to split on.
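As a quick illustration (the class counts are invented for the example), the snippet below computes the entropy of a set with 9 positive and 5 negative examples, which is about 0.94 bits.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # ≈ 0.940 bits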
11.Define regression.
Regression analysis consists of a set of machine learning methods that allow us to predict a
continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of regression model is to build a mathematical equation that defines y as a function
of the x variables.
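A minimal sketch (with made-up data) of fitting such an equation with scikit-learn is shown below; the printed slope and intercept are the learned coefficients of the regression equation.
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is roughly 2*x + 1 with a little noise
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

reg = LinearRegression().fit(x, y)
print(reg.coef_, reg.intercept_)   # learned slope and intercept
print(reg.predict([[6]]))          # predicted y for a new x value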
The "more general than" relation forms a partial ordering over the hypothesis space.
The version-space algorithm that follows exploits this partial ordering to search for
hypotheses that are consistent with the training examples.
Given hypothesis space H and examples E, the version space is the subset of H that
is consistent with the examples.
The general boundary of a version space, G, is the set of maximally general
members of the version space (i.e., those members of the version space such that no
other element of the version space is more general). The specific boundary of a
version space, S, is the set of maximally specific members of the version space.
A target function, in machine learning, is a method for solving a problem that an AI algorithm
parses its training data to find. Once an algorithm finds its target function, that function can be used
to predict results (predictive analysis). The function can then be used to find output data related to
inputs for real problems where, unlike training sets, outputs are not included.
The target function is essentially the formula into which an algorithm feeds data in order to calculate
predictions. As in algebra, when training a model it is common to work in reverse, recovering the function
from known input-output pairs. The function f is applied to the input I to produce the output O,
therefore O = f(I).
➢ "A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
➢ It is also called a Bayes network, belief network, decision network, or Bayesian model.
➢ Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
➢ Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making
➢ In hard clustering, each data point either belongs to a cluster completely or not at all. For example,
in a customer-segmentation task each customer is put into exactly one of the groups.
➢ In soft clustering, instead of putting each data point into a single cluster, a probability or
likelihood of that data point belonging to each cluster is assigned (see the sketch below).
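A hedged illustration of the difference, using scikit-learn's KMeans for hard assignments and GaussianMixture for soft (probabilistic) assignments on made-up 2-D points:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Made-up 2-D points forming two loose groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard)                # one cluster label per point (hard assignment)
print(np.round(soft, 3))   # per-point membership probabilities (soft assignment)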
➢ Logistic Regression.
➢ Naïve Bayes.
➢ Stochastic Gradient Descent.
➢ K-Nearest Neighbours.
➢ Decision Tree.
➢ Random Forest.
➢ Support Vector Machine.
Both classification and clustering are used to categorise objects into one or more classes based on their
features. They appear to be similar processes, and the basic difference is subtle: in classification there
are predefined labels assigned to each input instance according to its properties, whereas in clustering
those labels are missing.
Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to each other than to data points in other groups. It is basically
a grouping of objects on the basis of the similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be treated as a single group, and in a
typical plot of this kind we might be able to identify, say, 3 such clusters by eye.
The term bias was first introduced by Tom Mitchell in 1980 in his paper titled "The need for biases in
learning generalizations". The idea of bias is that the model gives importance to some of the features in
order to generalise better to a larger dataset with various other attributes. Bias in ML does help us
generalise better and makes our model less sensitive to any single data point.