
ml-lab

lab manual

Uploaded by

muthu viknesh
Copyright
© © All Rights Reserved
Machine learning lab

Master of Computer Applications (Anna University)



Downloaded by Muthu Viknesh ([email protected])

MACHINE LEARNING
LABORATORY MANUAL


GNANAMANI COLLEGE OF TECHNOLOGY


DEPARTMENT OF MASTER OF COMPUTER APPLICATIONS
MC4311 - MACHINE LEARNING LABORATORY
LABORATORY MANUAL
SEM/YEAR: III/II

Machine learning
Machine learning is a subset of artificial intelligence in the field of computer science that often
uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve
performance on a specific task) with data, without being explicitly programmed. In the past
decade, machine learning has given us self-driving cars, practical speech recognition, effective
web search, and a vastly improved understanding of the human genome.

Machine learning tasks


Machine learning tasks are typically classified into two broad categories, depending on whether
there is a learning "signal" or "feedback" available to a learning system:

Supervised learning: The computer is presented with example inputs and their desired outputs,
given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. As
special cases, the input signal can be only partially available, or restricted to special feedback:

Semi-supervised learning: the computer is given only an incomplete training signal: a training set
with some (often many) of the target outputs missing.

Active learning: the computer can only obtain training labels for a limited set of instances (based
on a budget), and also has to optimize its choice of objects to acquire labels for. When used
interactively, these can be presented to the user for labeling.

Reinforcement learning: training data (in form of rewards and punishments) is given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing
a game against an opponent.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in
data) or a means towards an end (feature learning).

Supervised learning: Find-S algorithm, Candidate elimination algorithm, Decision tree algorithm,
Back propagation algorithm, Naïve Bayes algorithm, K nearest neighbour algorithm (lazy learning
algorithm)

Unsupervised learning: EM algorithm, K means algorithm

Instance based learning: Locally weighted regression algorithm


Machine learning applications


In classification, inputs are divided into two or more classes, and the learner must produce a model
that assigns unseen inputs to one or more (multi-label classification) of these classes. This is
typically tackled in a supervised manner. Spam filtering is an example of classification, where the
inputs are email (or other) messages and the classes are "spam" and "not spam". In regression, also
a supervised problem, the outputs are continuous rather than discrete.

In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task. Density estimation finds the
distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping them
into a lower-dimensional space. Topic modeling is a related problem, where a program is given
a list of human language documents and is tasked with finding out which documents cover similar
topics.

Machine learning Approaches

Decision tree learning

Decision tree learning uses a decision tree as a predictive model, which maps observations about
an item to conclusions about the item's target value.

Association rule learning

Association rule learning is a method for discovering interesting relations between variables in
large databases.

Artificial neural networks

An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is
a learning algorithm that is vaguely inspired by biological neural networks. Computations are
structured in terms of an interconnected group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural networks are non-linear statistical data
modeling tools. They are usually used to model complex relationships between inputs and outputs,
to find patterns in data, or to capture the statistical structure in an unknown joint probability
distribution between observed variables.

Deep learning

Falling hardware prices and the development of GPUs for personal use in the last few years have
contributed to the development of the concept of deep learning which consists of multiple hidden
layers in an artificial neural network. This approach tries to model the way the human brain
processes light and sound into vision and hearing. Some successful applications of deep learning
are computer vision and speech recognition.

Inductive logic programming


Inductive logic programming (ILP) is an approach to rule learning using logic programming as a
uniform representation for input examples, background knowledge, and hypotheses. Given an
encoding of the known background knowledge and a set of examples represented as a logical
database of facts, an ILP system will derive a hypothesized logic program that entails all positive
and no negative examples.


Support vector machines

Support vector machines (SVMs) are a set of related supervised learning methods used for
classification and regression. Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that predicts whether a new example
falls into one category or the other.

Clustering

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to some predesignated criterion or
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between members
of the same cluster) and separation between different clusters. Other methods are based on
estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a
common technique for statistical data analysis.

Bayesian networks

A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical
model that represents a set of random variables and their conditional independencies via a directed
acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the network can be used to
compute the probabilities of the presence of various diseases. Efficient algorithms exist that
perform inference and learning.

Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment
so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt
to find a policy that maps states of the world to the actions the agent ought to take in those states.
Reinforcement learning differs from the supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions explicitly corrected.

Similarity and metric learning


In this problem, the learning machine is given pairs of examples that are considered similar and
pairs of less similar objects. It then needs to learn a similarity function (or a distance metric
function) that can predict if new objects are similar. It is sometimes used in Recommendation
systems.

Genetic algorithms
A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and
uses methods such as mutation and crossover to generate new genotypes in the hope of finding
good solutions to a given problem. In machine learning, genetic algorithms found some uses in
the 1980s and 1990s. Conversely, machine learning techniques have been used to improve the
performance of genetic and evolutionary algorithms.


Rule-based machine learning

Rule-based machine learning is a general term for any machine learning method that identifies,
learns, or evolves "rules" to store, manipulate or apply, knowledge. The defining characteristic of
a rule-based machine learner is the identification and utilization of a set of relational rules that
collectively represent the knowledge captured by the system. This is in contrast to other machine
learners that commonly identify a singular model that can be universally applied to any instance
in order to make a prediction. Rule-based machine learning approaches include learning classifier
systems, association rule learning, and artificial immune systems.

Feature selection approach

Feature selection is the process of selecting an optimal subset of relevant features for use in model
construction. It is assumed the data contains some features that are either redundant or irrelevant,
and can thus be removed to reduce calculation cost without incurring much loss of information.
Common optimality criteria include accuracy, similarity and information measures.


MACHINE LEARNING LABORATORY


(Effective from the academic year 2022-2023)

SEMESTER – III

Subject Code: MC4311                        IA Marks: 20
Number of Lecture Hours/Week: 01I + 03P     Exam Marks: 80
Total Number of Lecture Hours: 60           Exam Hours: 03

Course objectives: This course will enable students to

• Understand data cleaning and data preprocessing.
• Become familiar with supervised learning algorithms and implement them in practical situations.
• Become familiar with unsupervised learning algorithms and carry out their implementation.
• Practice ML algorithms and techniques.
• Use algorithms on real-time data sets.

LAB REQUIREMENTS:
Python or any ML tools like R

LIST OF EXPERIMENTS :
1. Demonstrate how you structure data in Machine Learning.
2. Implement data preprocessing techniques on a real-time dataset.
3. Implement feature subset selection techniques.
4. Demonstrate how you will measure the performance of a machine learning model.
5. Write a program to implement the naïve Bayesian classifier for a sample training dataset.
Compute the accuracy of the classifier, considering a few test data sets.
6. Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set.
7. Apply the EM algorithm to cluster a set of data stored in a .CSV file.
8. Write a program to implement the k-Nearest Neighbor algorithm to classify the data set.
9. Apply the technique of pruning to noisy data (the monk2 data) and derive the decision tree from
this data. Analyze the results by comparing the structure of the pruned and unpruned trees.
10. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test
the same using appropriate data sets.
11. Implement Support Vector Classification for linear kernels.
12. Implement Logistic Regression to classify problems such as spam detection, diabetes
prediction and so on.

TOTAL: 60 PERIODS


Ex.No:1
Demonstrate how you structure data in Machine Learning
Date:

Machine Learning is widely used by data scientists and ML experts to build real-time projects.
However, machine learning skills alone are not sufficient for solving real-world problems and
designing a better product; you also need good exposure to data structures.

The data structures used for machine learning are much the same as those used in other areas of
software development. Machine Learning is a subset of artificial intelligence that includes
various complex algorithms for solving mathematical problems. Data structures help to build and
understand these algorithms, and understanding them helps you to build ML models and algorithms
more efficiently. This exercise, "Data Structure for Machine Learning", discusses the data
structures commonly used in Machine Learning, along with the relationship between data structures
and ML. We start with a quick overview of data structures.

What is Data Structure?

The data structure is defined as the basic building block of computer programming that helps us
to organize, manage and store data for efficient search and retrieval.

In other words, the data structure is the collection of data type 'values' which are stored and organized
in such a way that it allows for efficient access and modification.

Types of Data Structure

A data structure arranges data items in a defined order, and it tells the compiler how the
programmer intends to use data of types such as Integer, String, and Boolean.

There are two different types of data structures: Linear and Non-linear data structures.


Now let's discuss popular data structures used for Machine Learning:

1. Linear Data structure:

A linear data structure organizes and manages data in a specific order, with each element attached
to its adjacent elements.

There are mainly four types of linear data structure:

Array:

An array is one of the most basic and common data structures used in Machine Learning. It is also used
in linear algebra to solve complex mathematical problems. You will use arrays constantly in machine
learning, for example:

o To convert the column of a data frame into a list format in pre-processing analysis.
o To order the frequency of words present in datasets.
o To use a list of tokenized words to begin clustering topics.
o To create multi-dimensional matrices in word embedding.

An array uses index numbers starting from 0 to identify its elements. The lowest index is arr[0], which
corresponds to the first element.

Let's take an example of a Python array used in machine learning. Python arrays differ from arrays
in other programming languages, and the Python list is more popular because its elements can have
different data types and its length can change. If you are using Python for ML algorithms, arrays
are a good place to start.
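The indexing rule and a few common list operations can be sketched as follows. This is a minimal illustration using only built-in Python lists; the variable names and the word-frequency values are made up for the example.

```python
# A Python list used as an array: indexing starts at 0.
word_counts = [12, 7, 30, 7]      # hypothetical word frequencies

first = word_counts[0]            # lowest index -> first element
last = word_counts[-1]            # negative indices count from the end

word_counts.append(5)             # add one element at the end
word_counts.insert(1, 99)         # insert 99 at index 1
sevens = word_counts.count(7)     # number of elements equal to 7
position = word_counts.index(30)  # index of the first 30

# sorted() returns a new list; here the highest frequency comes first.
ordered = sorted(word_counts, reverse=True)

print(first, last, sevens, position)
print(ordered)
```

Using sorted() rather than the sort() method leaves the original list unchanged, which is convenient when the original order still matters.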


Python array (list) methods:

Method       Description

append()     Adds an element at the end of the list.
clear()      Removes all elements from the list.
copy()       Returns a copy of the list.
count()      Returns the number of elements with the specified value.
extend()     Adds the elements of another list to the end of the current list.
index()      Returns the index of the first element with the specified value.
insert()     Adds an element at a specific position using an index number.
pop()        Removes the element at a specified position using an index number.
remove()     Removes the first element with the specified value.
reverse()    Reverses the order of the list.
sort()       Sorts the list.

Stacks:

Stacks are based on the concept of LIFO (Last In, First Out), also called FILO (First In, Last Out).
Stacks are easy to learn and implement, and a good grasp of them helps in many areas of computer
science, such as parsing grammars.

Stacks enable the undo and redo buttons on your computer. They work like a stack of blog posts:
there is no sense in adding a post at the bottom of the stack, and only the most recently added
post can be checked. Addition and removal both occur at the top of the stack.
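The LIFO behaviour can be sketched with a plain Python list, where the end of the list plays the role of the top of the stack. The action strings below are hypothetical and stand in for an editor's undo history.

```python
# A stack built on a Python list: push and pop both act on the end (the top).
undo_stack = []

undo_stack.append("type 'a'")     # push
undo_stack.append("type 'b'")
undo_stack.append("delete 'b'")

top = undo_stack[-1]              # peek at the most recent action
undone = undo_stack.pop()         # pop removes the most recent action (LIFO)
remaining = len(undo_stack)

print(top, "|", undone, "|", remaining)
```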

Linked List:

A linked list is a collection of separately allocated nodes. In other words, it is a collection of
data elements, each consisting of a value and a pointer that points to the next node in the list.

In a linked list, inserting or deleting a node at a known position is a constant-time operation,
but accessing a value is slow and often requires scanning from the head. A linked list is therefore
a useful alternative to a dynamic array when insertions would otherwise require shifting elements.
An element can be inserted at the head, middle or tail position, although reaching a middle or tail
position first costs a traversal. Linked lists are also easy to splice together and split apart,
and a list can be converted to a fixed-length array for fast access.
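A minimal sketch of a singly linked list, assuming illustrative class and method names (Node, LinkedList, push_front, insert_after, find, to_list) that are not part of any library. It shows that insertion needs no shifting, while finding a value requires a scan.

```python
class Node:
    """One element of the list: a value plus a pointer to the next node."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # Constant time: only the head pointer changes, nothing is shifted.
        self.head = Node(value, self.head)

    def insert_after(self, node, value):
        # Constant time once the node is known.
        node.next = Node(value, node.next)

    def find(self, value):
        # Linear time: nodes are scanned one by one from the head.
        current = self.head
        while current is not None and current.value != value:
            current = current.next
        return current

    def to_list(self):
        # Convert to a Python list for fast indexed access.
        items = []
        current = self.head
        while current is not None:
            items.append(current.value)
            current = current.next
        return items


numbers = LinkedList()
for v in (3, 2, 1):
    numbers.push_front(v)          # list is now 1 -> 2 -> 3
numbers.insert_after(numbers.find(2), 99)
as_array = numbers.to_list()
print(as_array)
```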


Queue:

A queue follows the "FIFO" (first in, first out) principle. It is useful for modelling a queuing
scenario in real-time programs, such as people waiting in line to withdraw cash in a bank. Hence,
the queue is significant in programs where multiple tasks need to be processed in the order in
which they arrive.

The queue data structure can be used to record the split time of a car in F1 racing.
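The bank-line scenario can be sketched with collections.deque from the standard library, which supports efficient removal from the front. The customer names are hypothetical.

```python
from collections import deque

# People join at the back of the line and are served from the front (FIFO).
customers = deque()
customers.append("Asha")
customers.append("Ben")
customers.append("Chitra")

served_first = customers.popleft()    # the earliest arrival is served first
served_second = customers.popleft()
still_waiting = list(customers)

print(served_first, served_second, still_waiting)
```

A deque is used instead of a plain list because removing the first element of a list shifts every remaining element, whereas popleft() on a deque does not.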

2. Non-linear Data Structures

As the name suggests, in non-linear data structures, elements are not arranged in a single sequence.
Instead, the elements are arranged and linked with each other in a hierarchical manner, where one
element can be linked with one or more elements.

1) Trees

Binary Tree:

A binary tree is very similar to a linked list; the only difference lies in the nodes and their
pointers. In a linked list, each node contains a data value with a pointer that points to the next
node in the list, whereas in a binary tree, each node has two pointers to subsequent nodes instead
of just one.

When a binary tree is kept sorted (a binary search tree) and balanced, insertion and deletion
operations take O(log N) time. Similar to the linked list, a binary tree can also be converted to
an array by traversing it in sorted order.

In such a tree, each parent node has up to two child nodes. The value of the left child node is
always less than the value of the parent node, while the value of the right child node is always
more than that of the parent node. Hence, in this tree structure, the data stays sorted
automatically, which makes insertion and deletion efficient.
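The ordering rule described above (smaller values to the left, larger values to the right) can be sketched as a simple binary search tree. The names TreeNode, insert and in_order are illustrative, and this sketch does not rebalance the tree, so the O(log N) bound only holds when the insertion order keeps it reasonably balanced.

```python
class TreeNode:
    """A node with two pointers: one to the left child, one to the right."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(root, value):
    # Smaller values go to the left subtree, others to the right subtree.
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root


def in_order(root):
    # Left subtree, then the node, then the right subtree: yields sorted order.
    if root is None:
        return []
    return in_order(root.left) + [root.value] + in_order(root.right)


root = None
for v in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, v)

sorted_values = in_order(root)
print(sorted_values)
```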


2) Graphs

A graph data structure is also very useful in machine learning, for example for link prediction.
A graph consists of nodes and of edges that join pairs of nodes; the pairs are ordered in a directed
graph and unordered in an undirected graph. Hence, you should have good exposure to the graph data
structure for machine learning and deep learning.
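As a small illustration, an undirected graph can be stored as an adjacency list (a dictionary of neighbour sets). The common-neighbours count used below is one simple heuristic for link prediction, chosen here only as an example; the node labels and the function name common_neighbours are hypothetical.

```python
# Undirected graph as an adjacency list: each node maps to its set of neighbours.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def common_neighbours(g, u, v):
    # Nodes adjacent to both u and v; more shared neighbours suggests
    # that a link between u and v is more plausible.
    return g[u] & g[v]


# A and D are not directly linked, but they share the neighbour C.
shared = common_neighbours(graph, "A", "D")
print(shared)
```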

3) Maps

Maps are a popular data structure in the programming world, mostly used for reducing the run time
of algorithms and for fast searching of data. A map stores data in the form of (key, value) pairs,
where each key must be unique; however, values can be duplicated. Each key corresponds to, or maps
to, a value; hence it is named a Map.

In different programming languages, core libraries have built-in maps or, rather, HashMaps with
different names for each implementation.

o In Java: Map (for example, HashMap)
o In Python: Dictionaries
o In C++: hash_map, unordered_map, etc.

Python dictionaries are very useful in machine learning and data science, as various functions and
algorithms return a dictionary as output. Dictionaries are also widely used for implementing sparse
matrices, which are very common in Machine Learning.
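A minimal sketch of a dictionary-backed sparse matrix: only the non-zero entries are stored, keyed by (row, column). The helper names get and to_dense and the matrix values are assumptions made for this illustration.

```python
# Only non-zero entries are stored; the key is the (row, column) position.
sparse = {(0, 2): 3.0, (1, 0): 1.5, (3, 3): 4.0}


def get(matrix, row, col):
    # Any position that is not stored is treated as zero.
    return matrix.get((row, col), 0.0)


def to_dense(matrix, n_rows, n_cols):
    # Expand to a full list-of-lists matrix when dense access is needed.
    return [[get(matrix, r, c) for c in range(n_cols)] for r in range(n_rows)]


dense = to_dense(sparse, 4, 4)
for row in dense:
    print(row)
```

A 4 x 4 matrix has 16 cells, but only 3 entries are kept in memory here; the saving grows with the size and sparsity of the matrix.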

4) Heap data structure:

A heap is a hierarchically ordered data structure. It is very similar to a tree, but it is ordered
vertically (between parent and child) rather than horizontally (between siblings).

Ordering in a heap is applied along the hierarchy but not across it: in a max-heap, the value of the
parent node is always more than that of its child nodes, whether on the left or the right side.

Here, the insertion and deletion operations are performed on the basis of promotion. An element is
first inserted at the highest available position. After that, it is compared with its parent and
promoted until it reaches the correct ranking position. Most heap data structures can be stored in
an array, with the relationships between the elements implied by their positions.
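Python's standard heapq module stores a heap in an ordinary list, as described above. Note that heapq implements a min-heap (the smallest value sits at the root), so this sketch negates the values to obtain the max-heap ordering discussed in the text; the scores are made-up sample data.

```python
import heapq

scores = [5, 1, 8, 3, 9]          # hypothetical values

# heapq keeps the smallest item at index 0, so store negated values
# to make the largest original value sit at the root.
max_heap = []
for s in scores:
    heapq.heappush(max_heap, -s)  # each push promotes the item as needed

largest = -max_heap[0]            # the root holds the maximum
top_two = [-heapq.heappop(max_heap) for _ in range(2)]

print(largest, top_two)
```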

Dynamic array data structure:


This is one of the most important types of data structure in linear algebra, where 1-D, 2-D, 3-D
and even 4-D arrays are used for matrix arithmetic. Working with them requires good exposure to
Python libraries such as NumPy, particularly for programming in deep learning.

How is Data Structure used in Machine Learning?

For a machine learning professional, knowledge of machine learning alone is not enough; mastery of
data structures and algorithms is also required.

When we use machine learning to solve a problem, we need to evaluate model performance, i.e., which
model is fastest and requires the least space and resources while remaining accurate. Moreover, if
a model is built using algorithms, comparing and contrasting two algorithms to determine the best
one for the job is crucial. In such cases, skills in data structures become important for ML
professionals.

With the knowledge of data structures and algorithms with ML, we can answer the following questions
easily:

o How much memory is required to execute?
o How long will it take to run?
o With the business case on hand, which algorithm will offer the best performance?
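The first two questions can be explored empirically. The sketch below is an illustration only, assuming a membership-test workload chosen for this example: it compares a list with a set using the standard timeit and sys modules. Actual timings and sizes depend on the machine and Python version, so none are quoted here.

```python
import sys
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1                    # worst case for the list: the last element

list_hit = target in as_list      # linear scan through the list
set_hit = target in as_set        # hash lookup in the set

# How long will it take to run? Time 100 repetitions of each lookup.
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

# How much memory is required? getsizeof reports the container itself,
# not the integer objects it refers to.
list_bytes = sys.getsizeof(as_list)
set_bytes = sys.getsizeof(as_set)

print(f"list: {list_time:.4f} s, {list_bytes} bytes")
print(f"set:  {set_time:.4f} s, {set_bytes} bytes")
```

Both structures give the same answer; the choice between them is a trade-off between lookup time and memory, which is the kind of comparison the third question asks for.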

Conclusion

In this exercise, we have discussed how data structures help in building Machine Learning
algorithms. Data structures are key to solving most computing problems, and choosing a suitable
data structure and algorithm leads to a better solution for an ML problem. Further, a strong
knowledge of data structures helps you to build a strong foundation and to create better projects
in Machine Learning.


Ex.No:2
Implement data preprocessing techniques on real time dataset
Date:

Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and a crucial step in creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted data.
Before doing any operation with data, it is necessary to clean it and put it in a consistent
format. Data preprocessing is the task that does this.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format that
cannot be used directly by machine learning models. Data preprocessing cleans the data and makes it
suitable for a machine learning model, which also increases the accuracy and efficiency of the
model.

It involves the following steps:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding missing data
o Encoding categorical data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning
model works entirely on data. Data collected for a particular problem and stored in a proper format
is known as a dataset.

Datasets come in different formats for different purposes. For example, a dataset for a business
problem will differ from the dataset required for a liver patient study. So each dataset is
different from another. To use a dataset in our code, we usually put it into a CSV file. However,
sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values". It is a file format that allows us to save tabular data,
such as spreadsheets. It is useful for huge datasets, and programs can read these datasets easily.

Here we will use a demo dataset for data preprocessing. For practice, it can be downloaded from
"https://www.superdatascience.com/pages/machine-learning". For real-world problems, we can download
datasets online from various sources such as https://www.kaggle.com/uciml/datasets,
https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data using various APIs with Python and putting
that data into a .csv file.
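A minimal sketch of writing tabular data in CSV format and reading it back, using only the standard csv module. The column names and the three sample rows are taken from the demo dataset shown later in this exercise; an in-memory buffer stands in for a file such as Dataset.csv so that the sketch runs anywhere.

```python
import csv
import io

header = ["Country", "Age", "Salary", "Purchased"]
rows = [
    ["India", 38, 68000, "No"],
    ["France", 43, 45000, "Yes"],
    ["Germany", 40, "", "Yes"],   # an empty field marks a missing salary
]

# Stands in for: open("Dataset.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the same text back; every value comes back as a string.
records = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
print(records[2])
```

Because CSV stores everything as text, numeric columns have to be converted after reading, and empty fields have to be treated as missing values.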

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined Python
libraries. Each library performs a specific job. There are three libraries that we will use for
data preprocessing:

Numpy: The Numpy library is used for performing mathematical operations in the code. It is the
fundamental package for scientific calculation in Python. It also supports large, multidimensional
arrays and matrices. In Python, we can import it as:

import numpy as nm

Here we have used nm as a short name for Numpy (np is the more common alias), and it will be used
in the whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library. From this
library, we need to import the sub-library pyplot, which is used to plot any type of chart in
Python. It will be imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is Pandas, which is one of the most famous Python libraries and is used
for importing and managing datasets. It is an open-source data manipulation and analysis library.
It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning project. But
before importing a dataset, we need to set the current directory as the working directory. To set a
working directory in Spyder IDE, follow the steps below:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press the F5 key or click the run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required dataset.

With the Python file and the required dataset in the same folder, the current folder is set as the
working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which reads a
csv file into a data frame on which we can perform various operations. Using this function, we can
read a csv file locally as well as through a URL.

We can use the read_csv function as below:

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function, we
have passed the name of our dataset file. Once we execute the above line of code, it will import
the dataset into our code. We can also check the imported dataset by clicking on the variable
explorer section and then double-clicking on data_set.

In the variable explorer, indexing starts from 0, which is the default indexing in Python. We can
also change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables)
from the dependent variable in the dataset. In our dataset, there are three independent variables,
Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[] indexer of the Pandas library. It is
used to extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon (:) takes all the rows, and the second part (:-1) takes all the
columns except the last one. We leave out the last column because it contains the dependent
variable. By doing this, we get the matrix of features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

Note: If you are using Python for machine learning, then this extraction is mandatory, but for the
R language it is not required.

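The slicing used in the two iloc[] calls can be illustrated without pandas. The sketch below is a plain-Python analogy, not the pandas implementation: each row is a list, [:-1] keeps every column except the last, and index 3 picks the last column. The three rows are the first three records of the demo dataset.

```python
# Each inner list is one row: Country, Age, Salary, Purchased.
rows = [
    ["India", 38.0, 68000.0, "No"],
    ["France", 43.0, 45000.0, "Yes"],
    ["Germany", 30.0, 54000.0, "No"],
]

# Analogous to data_set.iloc[:, :-1]: all rows, all columns but the last.
x = [row[:-1] for row in rows]

# Analogous to data_set.iloc[:, 3]: all rows, only the column at index 3.
y = [row[3] for row in rows]

print(x)
print(y)
```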
4) Handling Missing data:

The next step of data preprocessing is to handle missing data in the dataset. If our dataset
contains missing data, it may create a serious problem for our machine learning model. Hence it is
necessary to handle the missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data:

By deleting the particular row: This is the common way to deal with null values. We simply delete
the specific row or column which contains null values. However, this approach is not very
efficient, and removing data may lead to a loss of information, which reduces the accuracy of the
output.

By calculating the mean: In this approach, we calculate the mean of the column or row which
contains a missing value and put it in place of the missing value. This strategy is useful for
features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.


To handle missing values, we will use the Scikit-learn library in our code, which contains various
modules for building machine learning models. Here we will use the SimpleImputer class of the
sklearn.impute module. (Older scikit-learn releases provided an Imputer class in
sklearn.preprocessing; it has since been removed.) Below is the code for it:

#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
#nm is the numpy alias imported earlier; nm.nan marks the missing entries
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the mean of the remaining values in each column.

5) Encoding Categorical data:

Categorical data is data that takes one of a fixed set of categories. In our dataset, there are two categorical variables, Country and Purchased.

Machine learning models work entirely on mathematics and numbers, so a categorical variable in our dataset may create trouble while building the model. It is therefore necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country names into numeric labels. To do this, we will use the LabelEncoder() class from the preprocessing library.

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])


Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the country names into digits.

But in our case, there are three countries, and as we can see in the above output, they are encoded as 0, 1, and 2. From these values, the machine learning model may assume that there is some order or correlation between the countries, which will produce the wrong output. So to remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables that take the value 0 or 1. A value of 1 indicates that the row belongs to the category of that column, and the remaining dummy columns are 0. With dummy encoding, we will have a number of columns equal to the number of categories.
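The idea of dummy encoding can be sketched without any library. The helper one_hot below is a hypothetical name used only for this illustration; the country values are those of the example dataset.

```python
def one_hot(values):
    """Return (sorted categories, one row of 0/1 flags per value)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

countries = ["India", "France", "Germany", "France", "Germany",
             "India", "Germany", "France", "India", "France"]
categories, encoded = one_hot(countries)
print(categories)  # ['France', 'Germany', 'India']
print(encoded[0])  # [0, 0, 1] -> the first row is India
```

Each row contains exactly one 1, so no artificial ordering between the countries is introduced.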

In our dataset, we have 3 categories so it will produce three columns having 0 and 1 values. For
Dummy Encoding, we will use OneHotEncoder class of preprocessing library.

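#for Country Variable
#Recent scikit-learn releases removed the categorical_features argument of OneHotEncoder;
#ColumnTransformer is used instead to select the column to encode.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#Encoding for dummy variables: column 0 (Country) is one-hot encoded, the other columns pass through
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)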

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided into
three columns.

It can be seen more clearly in the variables explorer section, by clicking on x option as:

For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use a labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes and no, which are automatically encoded into 1 and 0.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen as:

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing, because it lets us estimate how well our machine learning model performs on data it has not seen.

Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model may then have difficulty, because the correlations it learned from the first dataset may not hold in the second.

If we train our model very well and its training accuracy is very high, but its performance drops when we give it a new dataset, then the model has not generalized. So we always try to build a machine learning model that performs well on the training set and also on the test dataset. Here, we can define these datasets as:


Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.

Test set: A subset of the dataset used to test the machine learning model. The model predicts the output for the test set, and the predictions are compared with the known values.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line imports the train_test_split() function, which splits arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the fraction of the dataset reserved for testing.
o The last parameter, random_state, sets a seed for the random generator so that you always get the same result. Any fixed integer works; 0 (used here) and 42 are common choices.
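The roles of test_size and random_state can be illustrated with a small pure-Python sketch. The function split_rows is a hypothetical stand-in written for this illustration; it is not how scikit-learn implements train_test_split(), it only mimics the idea of a seeded shuffle followed by a cut.

```python
import random

def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for the ten rows of the dataset
train_a, test_a = split_rows(rows, test_size=0.2, seed=0)
train_b, test_b = split_rows(rows, test_size=0.2, seed=0)
print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True: the same seed gives the same split
```

With ten rows and test_size 0.2, eight rows go to training and two to testing, and repeating the call with the same seed reproduces the split.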

Output:

By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section.


As we can see in the above image, the x and y variables are divided into 4 different variables with
corresponding values.

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates another.

Consider the below dataset:


As we can see, the age and salary column values are not on the same scale. Many machine learning models are based on Euclidean distance, and if we do not scale the variables, the feature with the larger values will cause issues in such models.

Euclidean distance between two points (x1, y1) and (x2, y2) is given as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$


If we compute any two values from age and salary, then salary values will dominate the age values, and
it will produce an incorrect result. So to remove this issue, we need to perform feature scaling for
machine learning.
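A quick calculation shows how strongly salary dominates. The sketch below takes two (age, salary) rows from the example dataset and compares the full Euclidean distance with the salary difference alone; the function name euclidean is just a local helper.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

row_a = (38.0, 68000.0)  # (age, salary)
row_b = (48.0, 65000.0)

full_distance = euclidean(row_a, row_b)
salary_gap = abs(row_a[1] - row_b[1])
print(full_distance, salary_gap)
```

The ten-year age difference changes the distance by less than one unit out of roughly three thousand, so without scaling the age column is effectively ignored.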

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization

Here, we will use the standardization method for our dataset.

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:


from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features. And then
we will fit and transform the training dataset.

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

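For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set and the test set must be scaled with the same mean and standard deviation.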

x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test as:

x_train:

x_test:


As we can see in the above output, all the variables are standardized: on the training set each feature has mean 0 and standard deviation 1, so most values lie roughly between -1 and 1, although larger magnitudes are possible.

Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1. But if the dependent variable had a wider range of values, we might also need to scale it.
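What standardization does can be checked by hand. The sketch below is a plain-Python illustration with hypothetical helper names (fit_scaler, transform) and made-up test ages; it uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Learn the mean and population standard deviation from training values."""
    return mean(train), pstdev(train)

def transform(values, mu, sigma):
    """Standardize values with statistics learned from the training set."""
    return [(v - mu) / sigma for v in values]

train_age = [38.0, 43.0, 30.0, 48.0, 40.0, 35.0, 49.0, 50.0]
test_age = [37.0, 60.0]  # illustrative test values, not from the dataset

mu, sigma = fit_scaler(train_age)
scaled_train = transform(train_age, mu, sigma)
scaled_test = transform(test_age, mu, sigma)  # reuses the training statistics

print(mean(scaled_train), pstdev(scaled_train))
print(max(abs(v) for v in scaled_train + scaled_test))
```

The scaled training values have mean 0 and standard deviation 1, yet individual values can lie outside the interval from -1 to 1, especially in the test set.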

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more
understandable.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data(Replacing missing data with the mean value)
#SimpleImputer replaces the Imputer class removed from recent scikit-learn releases
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values= nm.nan, strategy='mean')

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable: encoding for dummy variables
#ColumnTransformer replaces the removed categorical_features argument of OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there are some steps
or lines of code which are not necessary for all machine learning models. So we can exclude them from our code
to make it reusable for all models.


Ex.No:3
Implement Feature subset selection techniques
Date:

Feature selection is one of the most critical pre-processing activities in any machine learning process. It intends to select a subset of attributes or features that makes the most meaningful contribution to a machine learning activity. To understand it, let us consider a small example: predict the weight of students based on past information about similar students, which is captured in a 'Student Weight' data set. The data set has four features: Roll Number, Age, Height and Weight. Roll Number has no effect on the weight of the students, so we eliminate this feature. The new data set then has only three features. This subset of the data set is expected to give better results than the full set.

Age    Height    Weight
12     1.1       23
11     1.05      21.6
13     1.2       24.7
11     1.07      21.3
14     1.24      25.2
12     1.12      23.4

The above data set is a reduced dataset. Before proceeding further, we should look at why we reduced the dimensionality of the dataset, that is, what are the issues with high-dimensional data?

High-dimensional refers to the high number of variables, attributes or features present in certain data sets, especially in domains like DNA analysis, geographic information systems (GIS), etc. Such data may have hundreds or thousands of dimensions, which is not good from the machine learning aspect because it may be a big challenge for any ML algorithm to handle. In addition, a large amount of computational resources and time will be required. Also, a model built on an extremely high number of features may be very difficult to understand. For these reasons, it is necessary to take a subset of the features instead of the full set. So we can deduce that the objectives of feature selection are:
1. Having a faster and more cost-effective (less need for computational resources) learning model
2. Having a better understanding of the underlying model that generates the data.
3. Improving the efficacy of the learning model.

Main Factors Affecting Feature Selection


a. Feature Relevance: In the case of supervised learning, the input data set (which is the training data set) has a class label attached. A model is induced from the training data set so that it can assign class labels to new, unlabeled data. Each of the predictor variables is expected to contribute information to decide the value of the class label. If a variable does not contribute any information, it is said to be irrelevant. If its information contribution for prediction is very little, the variable is said to be weakly relevant. The remaining variables, which make a significant contribution to the prediction task, are said to be strongly relevant variables.
In the case of unsupervised learning, there is no training data set or labelled data. Similar data instances are grouped together, and the similarity of data instances is evaluated based on the values of different variables. Certain variables do not contribute any useful information for deciding the similarity or dissimilarity of data instances. Hence, those variables make no significant contribution to the grouping process. These variables are marked as irrelevant variables in the context of the unsupervised machine learning task.
We can understand the concept by taking a real-world example: at the start of this exercise, we took a data set of students. In that, Roll Number doesn't contribute any significant information in predicting what the Weight of a student would be. Similarly, if we are trying to group together students with similar academic capabilities, Roll No cannot contribute any information. So, in the context of grouping students with similar academic merit, the variable Roll No is quite irrelevant.
Any feature which is irrelevant in the context of a machine learning task is a candidate for rejection when we are selecting a subset of features.
b. Feature Redundancy: A feature may contribute information that is similar to the information contributed by one or more other features. For example, in the Student data set, both the features Age and Height contribute similar information. This is because, with an increase in Age, Weight is expected to increase. Similarly, with an increase in Height, Weight is also expected to increase. So, in the context of that problem, Age and Height contribute similar information. In other words, irrespective of whether the feature Height is present or not, the learning model will give almost the same results. In this kind of situation, where one feature is similar to another feature, the feature is said to be potentially redundant in the context of a machine learning problem.
All features having potential redundancy are candidates for rejection in the final feature subset. Only
a few representative features out of a set of potentially redundant features are considered for being a
part of the final feature subset. So in short, the main objective of feature selection is to remove all
features which are irrelevant and take a representative subset of the features which are potentially
redundant. This leads to a meaningful feature subset in the context of a specific learning task.
The measure of feature relevance and redundancy
a. Measures of Feature Relevance: In the case of supervised learning, mutual information is considered a good measure of the information a feature contributes towards deciding the value of the class label. That is why it is a good indicator of the relevance of a feature with respect to the class variable. The higher the value of mutual information of a feature, the more relevant that feature is. Mutual information can be calculated as follows:

$$MI(C, f) = H(C) + H(f) - H(C, f)$$

where the marginal entropy of the class is

$$H(C) = -\sum_{i=1}^{K} p(C_i) \log_2 p(C_i)$$

and the marginal entropy of the feature 'x' is

$$H(f) = -\sum_{x} p(f = x) \log_2 p(f = x)$$

Here K = number of classes, C = class variable, and f = feature set that takes discrete values. In the case of unsupervised learning, there is no class variable. Hence, feature-to-class mutual information cannot be used to measure the information contribution of the features. In the case of unsupervised learning, the entropy of the set of features without one feature at a time is calculated for all features. Then the features are ranked in descending order of information gain from a feature, and the top β percentage (the value of β is a design parameter of the algorithm) of features are selected as relevant features.
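As an illustration of feature-to-class mutual information, the sketch below assumes the usual identity MI(C, f) = H(C) + H(f) - H(C, f) and estimates the probabilities from counts. The function names and the tiny feature and label lists are made up for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def mutual_information(feature, labels):
    """MI(C, f) = H(C) + H(f) - H(C, f), estimated from observed frequencies."""
    joint = list(zip(feature, labels))
    return entropy(labels) + entropy(feature) - entropy(joint)

labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # determines the class completely
uninformative = [0, 1, 0, 1]  # independent of the class

print(mutual_information(informative, labels))    # 1.0
print(mutual_information(uninformative, labels))  # 0.0
```

A feature that fully determines a balanced binary class carries one bit of mutual information, while a feature independent of the class carries none and would be ranked as irrelevant.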

b. Measures of Feature Redundancy: There are multiple measures of similarity of information contribution, the main ones are:
• Correlation-based Measures
• Distance-based Measures
• Other coefficient-based Measure


1. Correlation Based Similarity Measure


Correlation is a measure of linear dependency between two random variables. Pearson's product-moment correlation coefficient is one of the most popular and accepted measures of correlation between two random variables. For two random feature variables F1 and F2, it is defined as:

$$\rho(F_1, F_2) = \frac{\mathrm{cov}(F_1, F_2)}{\sqrt{\mathrm{var}(F_1)\,\mathrm{var}(F_2)}}$$

Correlation value ranges between +1 and -1. A correlation of 1 (+/-) indicates perfect correlation. In
case the correlation is zero, then the features seem to have no linear relationship. Generally for all
feature selection problems, a threshold value is adopted to decide whether two features have adequate
similarity or not.
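As a sketch of a correlation-based redundancy check, the code below computes Pearson's coefficient for the Age and Height columns of the student data set. The function name pearson and the threshold of 0.9 are assumptions made for this illustration; a suitable threshold depends on the problem.

```python
import math

def pearson(f1, f2):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(f1)
    mean1, mean2 = sum(f1) / n, sum(f2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(f1, f2))
    var1 = sum((a - mean1) ** 2 for a in f1)
    var2 = sum((b - mean2) ** 2 for b in f2)
    return cov / math.sqrt(var1 * var2)

age = [12, 11, 13, 11, 14, 12]
height = [1.1, 1.05, 1.2, 1.07, 1.24, 1.12]

r = pearson(age, height)
THRESHOLD = 0.9  # hypothetical cut-off chosen for this example
print(round(r, 3), abs(r) > THRESHOLD)
```

Age and Height are very strongly correlated in this data, so one of them could be treated as potentially redundant.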
2. Distance-Based Similarity Measure
The most common distance measure is the Euclidean distance, which, between two features F1 and F2, is calculated as

$$d(F_1, F_2) = \sqrt{\sum_{i=1}^{n} (F_{1i} - F_{2i})^2}$$

where the features represent an n-dimensional dataset. Let us consider that the dataset has two features, Subjects (F1) and Marks (F2), under consideration. The Euclidean distance between the two features will be calculated like this:

Subjects (F1)    Marks (F2)    (F1 - F2)    (F1 - F2)^2
2                6             -4           16
3                5.5           -2.5         6.25
6                4             2            4
7                2.5           4.5          20.25
8                3             5            25
6                5.5           0.5          0.25
6                7             -1           1
7                6             1            1
8                6             2            4
9                7             2            4


A more generalized form of the Euclidean distance is the Minkowski distance, measured as

$$d(F_1, F_2) = \left( \sum_{i=1}^{n} |F_{1i} - F_{2i}|^r \right)^{1/r}$$

Minkowski distance takes the form of Euclidean distance (also called L2 norm) when r = 2. At r = 1, it takes the form of Manhattan distance (also called L1 norm).
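The distance calculation for the Subjects and Marks table can be reproduced with a short sketch. The function name minkowski is a local helper written for this illustration; r = 2 gives the Euclidean distance and r = 1 the Manhattan distance.

```python
def minkowski(f1, f2, r):
    """Minkowski distance of order r between two equal-length sequences."""
    return sum(abs(a - b) ** r for a, b in zip(f1, f2)) ** (1 / r)

subjects = [2, 3, 6, 7, 8, 6, 6, 7, 8, 9]
marks = [6, 5.5, 4, 2.5, 3, 5.5, 7, 6, 6, 7]

euclidean = minkowski(subjects, marks, r=2)
manhattan = minkowski(subjects, marks, r=1)
print(round(euclidean, 2), manhattan)  # 9.04 24.5
```

The squared differences in the table add up to 81.75, and the Euclidean distance is the square root of that sum, approximately 9.04.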

3. Other Similarity Measures


Jaccard index/coefficient is used as a measure of similarity between two features. Jaccard distance, a measure of dissimilarity between two features, is the complement of the Jaccard index. For two features having binary values, the Jaccard index is measured as:

$$J = \frac{n_{11}}{n_{01} + n_{10} + n_{11}}$$

where
n11 = number of cases where both features have value 1,
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1,
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0.

Jaccard distance:

$$d_J = 1 - J$$

Let us take an example to understand it better. Consider two features, F1 and F2, having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0).

As shown in the above picture, the cases where both the values are 0 have been left out without border, as an indication of the fact that they will be excluded in the calculation of the Jaccard coefficient. Here n11 = 2, n01 = 1 and n10 = 2, so J = 2 / (1 + 2 + 2) = 0.4.
Therefore, the Jaccard distance between those two features is dJ = (1 – 0.4) = 0.6
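The Jaccard calculation for the two example features can be checked with a few lines of Python. The function name jaccard_index is a helper chosen for this sketch.

```python
def jaccard_index(f1, f2):
    """Jaccard index of two equal-length binary sequences; 0-0 matches are ignored."""
    n11 = sum(1 for a, b in zip(f1, f2) if a == 1 and b == 1)
    n01 = sum(1 for a, b in zip(f1, f2) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(f1, f2) if a == 1 and b == 0)
    return n11 / (n01 + n10 + n11)

f1 = (0, 1, 1, 0, 1, 0, 1, 0)
f2 = (1, 1, 0, 0, 1, 0, 0, 0)

index = jaccard_index(f1, f2)
distance = 1 - index
print(index, distance)  # 0.4 0.6
```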
Note: One more measure of similarity using similarity coefficient calculation is Cosine Similarity. For the sake of understanding, let us take the example of a text classification problem. The text needs to be first transformed into features, with a word token being a feature and the number of times the word occurs in a document being the value in each row. There are thousands of features in such a text dataset. However, the data set is sparse in nature, as only a few words appear in a document and hence in a row of the data set. So each row has very few non-zero values. However, the non-zero values can be any integer value, as the same word may occur any number of times. Also, considering the sparsity of the dataset, the 0-0 matches need to be ignored. Cosine similarity, which is one of the most popular measures in text classification, is calculated as:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

Cosine similarity measures the angle between the x and y vectors. Hence, if the cosine similarity has a value of 1, the angle between x and y is 0 degrees, which means x and y are the same except for the magnitude. If the cosine similarity is 0, the angle between x and y is 90°. Hence, they do not share any similarity. In the case of the above example, the angle comes out to be 43.2°.
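The two boundary cases described above can be checked with a short sketch. The word-count vectors below are made up for this illustration and are not the example the text refers to; the function names are local helpers.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def angle_degrees(x, y):
    """Angle between x and y; the cosine is clamped to guard against rounding."""
    cosine = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.degrees(math.acos(cosine))

doc_a = [3, 0, 1, 0]  # hypothetical word counts
doc_b = [6, 0, 2, 0]  # same proportions as doc_a, twice the length
doc_c = [0, 2, 0, 5]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b), angle_degrees(doc_a, doc_c))
```

The first pair differs only in magnitude and has a cosine similarity of (approximately) 1, while the second pair shares no words, giving a similarity of 0 and an angle of 90 degrees.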


Even after all these steps, a few more steps remain. They can be understood from the following flowchart:

Feature Selection Process

After the successful completion of this cycle, we get the desired features, and we have finally tested
them also.


Ex.No:4
Demonstrate how will you measure the performance of a machine learning model
Date:

We will discuss the various ways to check the performance of our machine learning or deep learning model and why to use one in place of the other. We will discuss terms like:

1. Confusion matrix

2. Accuracy

3. Precision

4. Recall

5. Specificity

6. F1 score

7. Precision-Recall or PR curve

8. ROC (Receiver Operating Characteristics) curve

9. PR vs ROC curve.

For simplicity, we will mostly discuss things in terms of a binary classification problem where, let's say, we have to find whether an image is of a cat or a dog, or whether a patient has cancer (positive) or is healthy (negative). Some common terms to be clear with are:

True positives (TP): Predicted positive and are actually positive.

False positives (FP): Predicted positive and are actually negative.

True negatives (TN): Predicted negative and are actually negative.

False negatives (FN): Predicted negative and are actually positive.


So let's get started!

Confusion matrix

It's just a representation of the above parameters in a matrix format. Better visualization is always good :)

Accuracy

Accuracy is the fraction of all predictions that are correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

It is the most commonly used metric to judge a model, but it is actually not a clear indicator of performance. It is at its worst when classes are imbalanced.

Take for example a cancer detection model. The chances of actually having cancer are very low. Let's say out of 100 patients, 90 don't have cancer and the remaining 10 actually have it. We don't want to miss a patient who has cancer but goes undetected (false negative). Yet detecting everyone as not having cancer gives an accuracy of 90% straight away. The model did nothing here but declare all 100 patients cancer-free.


We surely need better alternatives.

Precision

Percentage of actual positive instances out of the total predicted positive instances:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Here the denominator (TP + FP) is everything the model predicted as positive in the whole dataset. Think of it as finding out 'how often the model is right when it says it is right'.

Recall/Sensitivity/True Positive Rate

Percentage of correctly predicted positive instances out of the total actual positive instances:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The denominator (TP + FN) here is the actual number of positive instances present in the dataset. Think of it as finding out 'how many right ones the model missed when it showed the right ones'.

Specificity

Percentage of correctly predicted negative instances out of the total actual negative instances:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

The denominator (TN + FP) here is the actual number of negative instances present in the dataset. It is similar to recall, but the focus is on the negative instances: for example, finding out how many of the patients who do not have cancer were told they don't have cancer. It is a kind of measure of how well separated the classes are.

F1 score

It is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

This takes the contribution of both, so the higher the F1 score, the better. Note that, due to the product in the numerator, if one of them goes low, the final F1 score goes down significantly. So a model does well on the F1 score if the predicted positives are actually positive (precision) and it does not miss positives by predicting them negative (recall).

One drawback is that precision and recall are given equal importance, whereas our application may need one to be higher than the other, so the F1 score may not be the exact metric for it. In that case either a weighted F1 score or looking at the PR or ROC curve can help.
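The metrics discussed so far can be computed directly from the four confusion-matrix counts. The sketch below uses a hypothetical helper, classification_metrics, and applies it to the cancer example in which all 100 patients are predicted negative. Reporting 0.0 when a denominator is zero is a convention chosen for this sketch only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, specificity and F1 from confusion-matrix counts.
    A ratio with a zero denominator is reported as 0.0 (a convention for this sketch)."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# 100 patients, 10 with cancer, and a model that predicts "no cancer" for everyone.
always_negative = classification_metrics(tp=0, fp=0, tn=90, fn=10)
print(always_negative["accuracy"], always_negative["recall"], always_negative["f1"])  # 0.9 0.0 0.0
```

Accuracy looks good at 90%, while recall and F1 are both zero, which exposes that the model never detects a single cancer patient.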

PR curve

It is the curve between precision and recall for various threshold values. In the figure below we have 6 predictors showing their respective precision-recall curves for various threshold values. The top right part of the graph is the ideal space where we get high precision and recall. Based on our application we can choose the predictor and the threshold value. PR AUC is just the area under the curve. The higher its numerical value the better.


ROC curve

ROC stands for receiver operating characteristic, and the graph plots TPR (true positive rate, i.e. recall) against FPR (false positive rate, FP / (FP + TN)) for various threshold values. As TPR increases, FPR also increases. As you can see in the first figure, we have four categories and we want the threshold value that leads us closer to the top left corner. Comparing different predictors (here 3) on a given dataset also becomes easy, as you can see in figure 2, and one can choose the threshold according to the application at hand. ROC AUC is just the area under the curve; the higher its numerical value the better.
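How the points of a ROC curve arise from threshold values can be sketched in plain Python. The labels and scores below are invented for illustration, the function names roc_points and auc are local helpers, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical predicted probabilities

points = roc_points(labels, scores)
print(points[-1])          # (1.0, 1.0): the loosest threshold flags everything
print(round(auc(points), 3))
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1); a predictor that ranks every positive above every negative would reach an area of 1.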


Which one to use PR or ROC?

The answer lies in TRUE NEGATIVES.

Due to the absence of TN in the precision and recall equations, PR curves are useful for imbalanced classes. In the case of class imbalance where the negative class is in the majority, the metric does not take much account of the high number of TRUE NEGATIVES of the majority negative class, giving better resistance to the imbalance. This is important when the detection of the positive class is very important.

An example is detecting cancer patients, which has a high class imbalance because very few of all those diagnosed have it. We certainly don't want to miss a person who has cancer and goes undetected (recall), and we want to be sure that a detected person really has it (precision).

Due to the consideration of TN, or the negative class, in the ROC equation, it is useful when both the classes are important to us, like the detection of cats and dogs. The importance of true negatives makes sure that both the classes are given importance, like the output of a CNN model in determining whether the image is of a cat or a dog.

Conclusion

The evaluation metric to use depends heavily on the task at hand. Accuracy alone is a vague measure, so it should be complemented with the metrics discussed above.


Ex.No:5
Write a program to implement the naïve Bayesian classifier for a sample training dataset stored as a .CSV file. Compute the accuracy of the classifier, considering few test datasets.
Date:

import csv
import random
import math

def loadCsv(filename):
    #read every row of the CSV file
    with open(filename, "r") as csvFile:
        dataset = list(csv.reader(csvFile))
    for i in range(len(dataset)):
        #converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def splitDataset(dataset, splitRatio):
    #67% training size
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        #generate indices for the dataset list randomly to pick elements for training data
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

def separateByClass(dataset):
    separated = {}
    #creates a dictionary of classes 1 and 0 where the values are the instances belonging to each class
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    #sample standard deviation (divides by n-1)
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)


def summarize(dataset):
    #(mean, stdev) of every column; the last column is the class label, so its summary is dropped
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        #summaries is a dict of tuples (mean, std) for each class value
        summaries[classValue] = summarize(instances)
    return summaries

def calculateProbability(x, mean, stdev):
    #Gaussian (normal) probability density of x
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    #class and attribute information as mean and sd
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            #take mean and sd of every attribute for class 0 and 1 separately
            mean, stdev = classSummaries[i]
            #i-th attribute of the test vector
            x = inputVector[i]
            #use normal distribution
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    #assigns the class which has the highest probability
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions


def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    filename = '5data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy of the classifier is : {0}%'.format(accuracy))

main()
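
The listing multiplies one Gaussian density per attribute, which can underflow to zero when there are many attributes. A common variation, sketched here as an assumption rather than as part of the original program, is to add log-densities instead. The names gaussian_log_pdf, class_log_scores and the toy summaries are hypothetical.

```python
import math
from statistics import NormalDist

def gaussian_log_pdf(x, mean, stdev):
    """Log of the normal density used by calculateProbability."""
    return (-0.5 * math.log(2 * math.pi)
            - math.log(stdev)
            - (x - mean) ** 2 / (2 * stdev ** 2))

def class_log_scores(summaries, input_vector):
    """Sum of per-attribute log-densities for every class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        scores[class_value] = sum(
            gaussian_log_pdf(x, mean, stdev)
            for x, (mean, stdev) in zip(input_vector, class_summaries)
        )
    return scores

# Hypothetical (mean, stdev) summaries for two attributes and two classes.
toy_summaries = {0: [(1.0, 0.5), (10.0, 2.0)], 1: [(3.0, 0.5), (14.0, 2.0)]}
scores = class_log_scores(toy_summaries, [1.2, 11.0])
best_label = max(scores, key=scores.get)  # class with the largest log-score
print(best_label, scores)
```

Because the logarithm is monotonic, the class with the largest summed log-density is the same class that the product form would pick whenever the product does not underflow.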

Output

Confusion matrix is as follows
[[17  0  0]
 [ 0 17  0]
 [ 0  0 11]]
Accuracy metrics
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        17
          1       1.00      1.00      1.00        17
          2       1.00      1.00      1.00        11

avg / total       1.00      1.00      1.00        45


Ex.No:6
Write a program to construct a Bayesian network considering medical
data. Use this model to demonstrate the diagnosis of heart patients using
standard Heart Disease Data Set. You can use Java/Python ML library
classes/API.
Date:

from pomegranate import *

# Uses the pre-1.0 pomegranate API (DiscreteDistribution, State, bake).
# In every table the last column is a probability, and the rows that share
# the same parent values must sum to 1.

asia = DiscreteDistribution({'True': 0.5, 'False': 0.5})

tuberculosis = ConditionalProbabilityTable(
    [['True', 'True', 0.2],
     ['True', 'False', 0.8],
     ['False', 'True', 0.01],
     ['False', 'False', 0.99]], [asia])

smoking = DiscreteDistribution({'True': 0.5, 'False': 0.5})

lung = ConditionalProbabilityTable(
    [['True', 'True', 0.75],
     ['True', 'False', 0.25],
     ['False', 'True', 0.02],
     ['False', 'False', 0.98]], [smoking])

bronchitis = ConditionalProbabilityTable(
    [['True', 'True', 0.92],
     ['True', 'False', 0.08],
     ['False', 'True', 0.03],
     ['False', 'False', 0.97]], [smoking])

# Deterministic OR of its two parents
tuberculosis_or_cancer = ConditionalProbabilityTable(
    [['True', 'True', 'True', 1.0],
     ['True', 'True', 'False', 0.0],
     ['True', 'False', 'True', 1.0],
     ['True', 'False', 'False', 0.0],
     ['False', 'True', 'True', 1.0],
     ['False', 'True', 'False', 0.0],
     ['False', 'False', 'True', 0.0],
     ['False', 'False', 'False', 1.0]], [tuberculosis, lung])

xray = ConditionalProbabilityTable(
    [['True', 'True', 0.885],
     ['True', 'False', 0.115],
     ['False', 'True', 0.04],
     ['False', 'False', 0.96]], [tuberculosis_or_cancer])

dyspnea = ConditionalProbabilityTable(
    [['True', 'True', 'True', 0.96],
     ['True', 'True', 'False', 0.04],
     ['True', 'False', 'True', 0.89],
     ['True', 'False', 'False', 0.11],
     ['False', 'True', 'True', 0.96],
     ['False', 'True', 'False', 0.04],
     ['False', 'False', 'True', 0.89],
     ['False', 'False', 'False', 0.11]], [tuberculosis_or_cancer, bronchitis])

# One state per variable; the names are the keys used when giving evidence
s0 = State(asia, name="asia")
s1 = State(tuberculosis, name="tuberculosis")
s2 = State(smoking, name="smoker")
s3 = State(lung, name="lung")
s4 = State(bronchitis, name="bronchitis")
s5 = State(tuberculosis_or_cancer, name="tuberculosis_or_cancer")
s6 = State(xray, name="xray")
s7 = State(dyspnea, name="dyspnea")

network = BayesianNetwork("asia")
network.add_nodes(s0, s1, s2, s3, s4, s5, s6, s7)
# Edges follow the parents listed in each conditional probability table
network.add_edge(s0, s1)
network.add_edge(s2, s3)
network.add_edge(s2, s4)
network.add_edge(s1, s5)
network.add_edge(s3, s5)
network.add_edge(s5, s6)
network.add_edge(s5, s7)
network.add_edge(s4, s7)
network.bake()
print(network.predict_proba({'tuberculosis': 'True'}))
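
As a hand check of what the network computes for the asia node when tuberculosis is observed, the posterior can be worked out with Bayes' rule from the two numbers in the tables above (0.2 and 0.01). This is a minimal sketch in plain Python, independent of pomegranate; the variable names are illustrative only.

```python
# Prior P(asia) and P(tuberculosis=True | asia), taken from the tables above.
p_asia = {"True": 0.5, "False": 0.5}
p_tb_given_asia = {"True": 0.2, "False": 0.01}

# Joint P(asia, tuberculosis=True), then normalise by the evidence.
joint = {a: p_asia[a] * p_tb_given_asia[a] for a in p_asia}
evidence = sum(joint.values())          # P(tuberculosis=True)
posterior = {a: joint[a] / evidence for a in joint}

print(posterior)  # P(asia | tuberculosis=True)
```

With these numbers, observing tuberculosis raises the probability of asia being True from 0.5 to roughly 0.95.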


Ex.No:7
Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the
same dataset for clustering using k-Means algorithm. Compare the results
of these two algorithms and comment on the quality of clustering. You can
add Java/Python ML library classes/API in the program.
Date:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)
X = X[:, ::-1]  # flip axes for better plotting

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
probs = gmm.predict_proba(X)
print(probs[:5].round(3))
size = 50 * probs.max(1) ** 2  # square emphasizes differences
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size)

from matplotlib.patches import Ellipse

def draw_ellipse(position, covariance, ax=None, **kwargs):
    """Draw an ellipse with a given position and covariance"""
    ax = ax or plt.gca()
    # Convert covariance to principal axes
    if covariance.shape == (2, 2):
        U, s, Vt = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        angle = 0
        width, height = 2 * np.sqrt(covariance)
    # Draw the ellipse at 1, 2 and 3 standard deviations
    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height,
                             angle=angle, **kwargs))

def plot_gmm(gmm, X, label=True, ax=None):
    ax = ax or plt.gca()
    labels = gmm.fit(X).predict(X)
    if label:
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    w_factor = 0.2 / gmm.weights_.max()
    for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
        draw_ellipse(pos, covar, alpha=w * w_factor)

gmm = GaussianMixture(n_components=4, random_state=42)
plot_gmm(gmm, X)
gmm = GaussianMixture(n_components=4, covariance_type='full',
                      random_state=42)
plot_gmm(gmm, X)
plt.show()

Output

[[1 ,0, 0, 0]
[0 ,0, 1, 0]
[1 ,0, 0, 0]
[1 ,0, 0, 0]
[1 ,0, 0, 0]]
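
The exercise also asks for a comparison with k-Means. A minimal sketch of one way to do that is given below, assuming the same synthetic make_blobs data as the listing above; the choice of the adjusted Rand index and the silhouette score as quality measures is an assumption, not something the manual prescribes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Same synthetic data as in the EM listing
X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)

# Hard assignments from k-Means and from the EM-fitted Gaussian mixture
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit(X).predict(X)

for name, labels in [("k-Means", kmeans_labels), ("EM (GMM)", gmm_labels)]:
    ari = adjusted_rand_score(y_true, labels)   # agreement with the generating clusters
    sil = silhouette_score(X, labels)           # cohesion/separation, no ground truth needed
    print(f"{name}: adjusted Rand index = {ari:.3f}, silhouette = {sil:.3f}")
```

On well-separated, roughly spherical blobs the two methods are expected to give very similar scores; the mixture model tends to matter more when clusters are elongated or overlap, since it also yields the soft probabilities printed above.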


Ex.No:8
Write a program to implement k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes
can be used for this problem.
Date:

import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1  # ignore the class label in the last column
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # majority vote over the class labels of the neighbours
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('knndat.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()


OUTPUT
Confusion matrix is as follows
[[11  0  0]
 [ 0  9  1]
 [ 0  1  8]]

Accuracy metrics
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       0.90      0.90      0.90        10
          2       0.89      0.89      0.89         9

Avg/Total         0.93      0.93      0.93        30
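
Since the exercise asks for correct and wrong predictions to be reported separately, the following is a compact sketch of the same nearest-neighbour vote on a few hypothetical toy rows (not the iris file). The function knn_predict and the data are illustrative assumptions; math.dist needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training rows closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[:-1], query))[:k]
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: two numeric features followed by the class label.
train = [
    [1.0, 1.1, "a"], [1.2, 0.9, "a"], [0.8, 1.0, "a"],
    [3.0, 3.2, "b"], [3.1, 2.9, "b"], [2.9, 3.0, "b"],
]
test = [[1.1, 1.0, "a"], [3.0, 3.0, "b"], [2.1, 2.1, "a"]]

correct, wrong = [], []
for row in test:
    predicted = knn_predict(train, row[:-1], k=3)
    (correct if predicted == row[-1] else wrong).append((row, predicted))

print("Correct predictions:", correct)
print("Wrong predictions:", wrong)
```

The third test row lies slightly closer to the "b" group, so it is reported as a wrong prediction, which shows both lists being filled.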


Ex.No:9
Apply the technique of pruning for a noisy data monk2 data, and derive the
decision tree from this data. Analyze the results by comparing the structure of
pruned and unpruned tree.
Date:

Machine learning is a problem of trade-offs. The classic issue is overfitting versus underfitting.
Overfitting happens when a model fits the training data so closely that it fails to generalize, i.e. it learns
the noise on top of the signal. Underfitting is the opposite: the model is too simple to find the
patterns in the data.
Decision trees are an extremely popular and useful model in machine learning, but they overfit easily.
Pruning is one of the main techniques used to avoid or overcome overfitting. In this kernel we
discuss two commonly used types of pruning.

1. Prepruning
2. Postpruning

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn import tree
from sklearn.metrics import accuracy_score,confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
data = '/kaggle/input/heart-disease-uci/heart.csv'
df = pd.read_csv(data)
df.head()
Out[2]:

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1

We are not getting into the nitty-gritty details of this dataset. The main aim of this kernel is to show you
how to pre-prune and post-prune a decision tree.

In [3]:
X = df.drop(columns=['target'])
y = df['target']
print(X.shape)
print(y.shape)
(303, 13)
(303,)
Splitting dataset to train and test

In [4]:
x_train,x_test,y_train,y_test = train_test_split(X,y,stratify=y)
print(x_train.shape)
print(x_test.shape)
(227, 13)
(76, 13)
First we will fit a normal decision tree without any fine tuning and check the results

In [5]:
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(x_train,y_train)
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)
Visualizing decision tree
In [6]:
plt.figure(figsize=(20,20))
features = list(X.columns)  # the 13 input columns, without 'target'
classes = ['Not heart disease','heart disease']
tree.plot_tree(clf,feature_names=features,class_names=classes,filled=True)
plt.show()

In [7]:
# helper function
def plot_confusionmatrix(y_train_pred, y_train, dom):
    print(f'{dom} Confusion matrix')
    # confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predictions
    cf = confusion_matrix(y_train, y_train_pred)
    sns.heatmap(cf, annot=True, yticklabels=classes,
                xticklabels=classes, cmap='Blues', fmt='g')
    plt.tight_layout()
    plt.show()

In [8]:
print(f'Train score {accuracy_score(y_train_pred,y_train)}')
print(f'Test score {accuracy_score(y_test_pred,y_test)}')
plot_confusionmatrix(y_train_pred,y_train,dom='Train')
plot_confusionmatrix(y_test_pred,y_test,dom='Test')
Train score 1.0
Test score 0.7631578947368421
Train Confusion matrix

Test Confusion matrix

We can see that on the training data we have 100% accuracy (100% precision and recall), but the model
does not generalize well to the test data, where accuracy is only about 76%. Our model is clearly
overfitting. We will reduce overfitting through pruning, first pre-pruning and then cost complexity pruning.
1. Pre pruning techniques
Pre-pruning means stopping the growth of the decision tree at an early stage. We do this by setting
constraints on parameters such as max_depth and min_samples_split.

An effective approach is to grid search these parameters and choose the values that give the best
performance on held-out (cross-validation) data.

For now we will control these parameters:

• max_depth: the maximum depth of the decision tree.
• min_samples_split: the minimum number of samples required to split an internal node.
• min_samples_leaf: the minimum number of samples required to be at a leaf node.
In [9]:
params = {'max_depth': [2,4,6,8,10,12],
'min_samples_split': [2,3,4],
'min_samples_leaf': [1,2]}

clf = tree.DecisionTreeClassifier()
gcv = GridSearchCV(estimator=clf,param_grid=params)
gcv.fit(x_train,y_train)
Out[9]:
GridSearchCV(estimator=DecisionTreeClassifier(),
param_grid={'max_depth': [2, 4, 6, 8, 10, 12],
'min_samples_leaf': [1, 2],
'min_samples_split': [2, 3, 4]})
In [10]:
model = gcv.best_estimator_
model.fit(x_train,y_train)
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)


print(f'Train score {accuracy_score(y_train_pred,y_train)}')


print(f'Test score {accuracy_score(y_test_pred,y_test)}')
plot_confusionmatrix(y_train_pred,y_train,dom='Train')
plot_confusionmatrix(y_test_pred,y_test,dom='Test')
Train score 0.9647577092511013
Test score 0.7894736842105263
Train Confusion matrix

Test Confusion matrix

In [11]:
plt.figure(figsize=(20,20))
features = list(X.columns)  # the 13 input columns, without 'target'
classes = ['Not heart disease','heart disease']
tree.plot_tree(model,feature_names=features,class_names=classes,filled=True)
plt.show()

We can see that the tree is pruned and test accuracy has improved, but there is still scope for
improvement.
2. Post pruning techniques
There are several post-pruning techniques. Cost complexity pruning is one of the most important among
them.

Cost Complexity Pruning


Decision trees can easily overfit. One way to avoid this is to limit the growth of the tree by setting
constraints on parameters such as max_depth and min_samples_split. Often a more effective way is to
use a post-pruning method such as cost complexity pruning, which can improve test accuracy and give a
better model.

Cost complexity pruning is all about finding the right value of the parameter alpha. We will get the
candidate alpha values for this tree and check the accuracy of the corresponding pruned trees.
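
For reference, the quantity that alpha controls can be written as the cost-complexity measure below. The notation follows the usual textbook and scikit-learn convention and is an addition here, not something stated in the original kernel: R(T) is the total impurity (or misclassification rate) of the leaves of tree T, and the last factor counts the leaves.

```latex
R_\alpha(T) = R(T) + \alpha \,\lvert \widetilde{T} \rvert
```

With alpha = 0 the full tree minimises this measure; as alpha grows, each extra leaf costs more, so smaller subtrees become preferable and the tree is pruned further.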

To know more about cost complexity pruning, watch this video from Josh Starmer.

In [12]:
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
[0. 0.00469897 0.00565617 0.00630757 0.00660793 0.00660793
0.00704846 0.00739486 0.0076652 0.0077917 0.00783162 0.00792164
0.00802391 0.00926791 0.01082349 0.01151248 0.01566324 0.02484071
0.04195511 0.04299238 0.13943465]
In [13]:
# For each alpha, fit a pruned tree and append the model to a list
clfs = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)
We will remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node.

In [14]:

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
plt.scatter(ccp_alphas,node_counts)
plt.scatter(ccp_alphas,depth)
plt.plot(ccp_alphas,node_counts,label='no of nodes',drawstyle="steps-post")
plt.plot(ccp_alphas,depth,label='depth',drawstyle="steps-post")
plt.legend()
plt.show()

Observation: as alpha increases, the number of nodes and the depth of the tree decrease.

In [15]:
train_acc = []
test_acc = []
for c in clfs:
    y_train_pred = c.predict(x_train)
    y_test_pred = c.predict(x_test)
    train_acc.append(accuracy_score(y_train_pred,y_train))
    test_acc.append(accuracy_score(y_test_pred,y_test))

plt.scatter(ccp_alphas,train_acc)
plt.scatter(ccp_alphas,test_acc)
plt.plot(ccp_alphas,train_acc,label='train_accuracy',drawstyle="steps-post")
plt.plot(ccp_alphas,test_acc,label='test_accuracy',drawstyle="steps-post")
plt.legend()
plt.title('Accuracy vs alpha')
plt.show()

We can choose alpha = 0.020
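
Reading alpha off a test-accuracy plot uses the test set for model selection. An alternative, sketched here as an assumption rather than the kernel's own method, is to pick ccp_alpha by cross-validation on the training data only. Because heart.csv is not bundled with scikit-learn, the sketch uses the built-in breast cancer data as a stand-in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (assumption); any (X, y) classification data works the same way
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Candidate alphas from the pruning path of the unpruned tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x_train, y_train)
# Drop the last alpha (single-node tree) and clip tiny negative rounding errors
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# 5-fold cross-validation on the training split chooses alpha
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(x_train, y_train)
best_alpha = search.best_params_["ccp_alpha"]
print("best alpha:", best_alpha)
print("test accuracy:", search.score(x_test, y_test))
```

The test split is then used only once, to report the accuracy of the tree selected this way.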

In [16]:
clf_ = tree.DecisionTreeClassifier(random_state=0,ccp_alpha=0.020)
clf_.fit(x_train,y_train)
y_train_pred = clf_.predict(x_train)
y_test_pred = clf_.predict(x_test)

print(f'Train score {accuracy_score(y_train_pred,y_train)}')


print(f'Test score {accuracy_score(y_test_pred,y_test)}')
plot_confusionmatrix(y_train_pred,y_train,dom='Train')
plot_confusionmatrix(y_test_pred,y_test,dom='Test')
Train score 0.8502202643171806
Test score 0.8026315789473685
Train Confusion matrix

Test Confusion matrix

We can see that our model is no longer overfitting and that performance on the test data has improved.

In [17]:
plt.figure(figsize=(20,20))
features = list(X.columns)  # the 13 input columns, without 'target'
classes = ['Not heart disease','heart disease']
tree.plot_tree(clf_,feature_names=features,class_names=classes,filled=True)
plt.show()

We can see that the size of the decision tree is significantly reduced. On this data, post-pruning also
gave better test accuracy than pre-pruning.


Ex.No:10
Build an Artificial Neural Network by implementing the Backpropagation
algorithm and test the same using appropriate data sets.
Date:

import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)  # scale each column by its maximum
y = y/100

# Sigmoid Function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of Sigmoid Function (x is already a sigmoid output)
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000  # Setting training iterations
lr = 0.1  # Setting learning rate
inputlayer_neurons = 2  # number of features in data set
hiddenlayer_neurons = 3  # number of hidden layer neurons
output_neurons = 1  # number of neurons at output layer

# Weight and bias initialization: uniform random values of the given shape
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much hidden layer weights contributed to error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr  # dot product of next layer error and current layer output
    # bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    # bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)

Output

Input:
[[ 0.66666667 1. ]
 [ 0.33333333 0.55555556]
 [ 1. 0.66666667]]
Actual Output:
[[ 0.92]
 [ 0.86]
 [ 0.89]]
Predicted Output:
[[ 0.89559591]
 [ 0.88142069]
 [ 0.8928407 ]]
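
One detail of the program that is easy to miss: derivatives_sigmoid is applied to values that have already passed through the sigmoid (output and hlayer_act), so x * (1 - x) is the derivative expressed in terms of the activation. The short check below, a sketch using only the standard library, compares it with a numerical derivative at an arbitrarily chosen point.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def derivatives_sigmoid(a):
    # a is an activation sigmoid(z), not the pre-activation z
    return a * (1 - a)

z = 0.7        # arbitrary pre-activation value
h = 1e-6       # step for the central difference
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = derivatives_sigmoid(sigmoid(z))

print(numeric, analytic)
```

The two values agree to many decimal places, which is why the backpropagation step can reuse the stored activations instead of recomputing the sigmoid.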


Ex.No:11
Implement Support Vector Classification for linear kernels.
Date:

Linear Kernel is used when the data is linearly separable, that is, when it can be separated using
a single line (more generally, a hyperplane). It is one of the most commonly used kernels, mostly when
a data set has a large number of features. One example with a lot of features is text classification,
where each distinct word is a feature, so the linear kernel is the usual choice for text classification.
Note: the code below loads the iris data that is bundled with scikit-learn, so no download is
needed.

In the above image, there are two sets of points, the “Blue” points and the “Yellow”
points. Since these can be easily separated, in other words they are linearly separable, the
Linear Kernel can be used here.

Advantages of using Linear Kernel:


1. Training an SVM with a Linear Kernel is usually faster than with other kernels.
2. When training an SVM with a Linear Kernel, only the regularisation parameter C
needs to be optimised. With other kernels, such as RBF, the γ parameter must be optimised
as well, which means that a grid search will usually take more time.
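
The second point can be illustrated with a one-parameter grid search. This is a minimal sketch assuming scikit-learn's GridSearchCV and the same two iris features used in the program below; the candidate values of C are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target  # first two features only

# With a linear kernel, C is the only hyperparameter in the grid
c_values = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(svm.SVC(kernel="linear"), param_grid={"C": c_values}, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```

An RBF kernel would need a second axis for gamma, so the same search would fit len(c_values) times as many models per gamma value tried.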


# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Load the iris data set
iris = datasets.load_iris()

# Take only the first two features so that the decision regions can be plotted in 2D
X = iris.data[:, :2]
y = iris.target

# C is the SVM regularization parameter
C = 1.0

# Create an instance of SVM and fit the data.
# Data is not scaled so as to be able to plot the support vectors
svc = svm.SVC(kernel='linear', C=C).fit(X, y)

# Create a mesh to plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Plot the data for proper visual representation
plt.subplot(1, 1, 1)

# Predict the class of every mesh point
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')

# Output the plot
plt.show()


Output:

Here the classes are separated by straight lines, which is the kind of boundary the Linear Kernel produces.


Ex.No:12
Implement Logistic Regression to classify problems such as spam detection,
diabetes prediction and so on.
Date:

The first example used to explain how machine learning works is often “Spam Detection”. Most
machine learning courses use this example, but in how many of them do you actually get to
implement the model? We talk about how machine learning is involved in spam detection and then
just move on to other things.

Introduction

The idea of this post is to understand, step by step, how a spam filter works and how it makes
everyone's life easier. Also, the next time you see a “You have won a lottery” email, rather than
ignoring it, you might prefer to report it as spam.

Courtesy : Google images(Medium post)

The above image gives an overview of spam filtering. Plenty of emails arrive every day; some go to
spam and the rest stay in our primary inbox (unless you have defined further categories). The blue box
in the middle is the machine learning model. How does it decide which mail is spam and which is not?


Before we start talking about the algorithm and the code, take a step back and try relating that simple
explanation of spam detection to the number of monthly active Gmail accounts (approximately 1 billion).
The picture seems pretty complicated, doesn't it? Let's get an overview of how Gmail applies
filtering to such a huge number of accounts.

Gmail Spam Detection

The data Google holds is obviously not in paper files; Google has data centers which maintain its
customers' data. Before Google/Gmail classifies an email as spam or not spam, and before the email
arrives in your mailbox, hundreds of rules are applied to it in the data centers. These rules describe
the properties of a spam email. Common types of spam filters used by Gmail/Google are:

Blatant Blocking- Deletes emails even before they reach the inbox.

Bulk Email Filter- Filters emails that passed through the other categories but are spam.

Category Filters- Users can define their own rules to filter messages according to specific
content, email addresses, etc.

Null Sender Disposition- Disposes of all messages without an SMTP envelope sender address.
Remember when you get an email saying “Not delivered to xyz address”.

Null Sender Header Tag Validation- Validates messages by checking the security digital signature.

There are ways to avoid spam filtering and send your emails straight to the inbox. To learn more
about the Gmail spam filter, please watch this informational video from Google.

Create a Spam Detector : Pre-processing


Moving on to our aim of creating our very own spam detector, let's talk about that blue box in
the middle of the above image. The model is like a small kid: unless you tell the kid the difference
between salt and sugar, he or she won't be able to tell them apart. We apply the same idea to a machine
learning model: we tell the model beforehand what kind of email is spam and what is not. To do that,
we need to collect data from users and ask them to label a few emails as spam or not spam.

Kaggle Spam Detection Dataset

The above image is a snapshot of tagged emails that have been collected for spam research. It contains
one set of 5,574 messages in English, each tagged as legitimate (ham) or spam.

Now that we have data with tagged emails (spam or not spam), what should we do next? We need to
train the machine to make it smart enough to categorize emails on its own. But the machine can't read
the full text and start categorizing emails; here we need to use our NLP basics (check out my last blog).

We will first do some pre-processing on the message text, like removing punctuation and stop words.
import string
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

def text_preprocess(text):
    # strip punctuation, then drop English stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = [word for word in text.split() if word.lower() not in stopwords.words('english')]
    return " ".join(text)

Once the pre-processing is done, we need to vectorize the data, i.e. collect each word and its
(TF-IDF weighted) frequency in each email. The vectorization produces a matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
message_mat = vectorizer.fit_transform(message_data_copy)
message_mat
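
To make the shape of that matrix concrete, here is a minimal bag-of-words sketch in plain Python on three made-up messages. It shows raw counts only; TfidfVectorizer additionally reweights the counts, so the numbers it produces will differ. The messages and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical, already pre-processed messages
messages = ["win a free prize now", "are we meeting now", "free free prize"]

tokenized = [message.split() for message in messages]
# One column per distinct word, in alphabetical order
vocabulary = sorted({word for doc in tokenized for word in doc})
# One row per message, holding how often each vocabulary word occurs in it
count_matrix = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row describes one message and each column one word, which is the layout the classifier is trained on.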


This matrix can be used to create a train/test split, which lets us train the model and then test the
accuracy of its results.

from sklearn.model_selection import train_test_split

message_train, message_test, spam_nospam_train, spam_nospam_test = train_test_split(
    message_mat, message_data['Spam/Not_Spam'], test_size=0.3, random_state=20)

Choosing a model

Now that we have the train/test split, we need to choose a model. There is a huge collection of
models, but for this particular exercise we will use logistic regression. Why?

When someone asks what logistic regression is, the usual answer is that it is an algorithm for
categorizing things into two classes (most of the time), i.e. the result is measured using a
dichotomous variable. More generally, logistic regression comes in three forms: binomial (2 possible
values), multinomial (3 or more possible values) and ordinal (ordered categories). In this post we
focus only on binomial logistic regression, i.e. the outcome of the model falls into one of two classes.

Logistic Regression

According to the Wikipedia definition,

Logistic Regression measures the relationship between the categorical dependent variable and one or
more independent variables by estimating probabilities using a logistic function.

From the definition, the logistic function plays an important role in classification, so we need to
understand what the logistic function is and how it helps in estimating the probability of
being in a class.


Courtesy — Google image(Quora post)

The formula in the above image, σ(z) = 1 / (1 + e^(−z)), is known as the logistic function or sigmoid
function, and its curve is called the sigmoid curve. The sigmoid function gives an S-shaped curve. Its
output tends towards 1 as z → ∞ and towards 0 as z → −∞. Hence the sigmoid/logistic function
produces a value of the dependent variable that always lies between 0 and 1, i.e. the probability of
being in a class.
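
The following is a small sketch of how that probability is obtained for one message: a weighted sum of the features is passed through the sigmoid and compared with a threshold. The weights, bias, feature values and the 0.5 threshold are hypothetical values chosen for illustration, not parameters of the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def spam_probability(features, weights, bias):
    # Linear score z = w . x + b, squashed into the interval (0, 1)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical two-feature message and model parameters
weights, bias = [2.0, -1.0], -0.5
probability = spam_probability([1.0, 0.5], weights, bias)   # here z = 1.0
label = 1 if probability >= 0.5 else 0                      # 1 = spam, 0 = not spam

print(probability, label)
print(sigmoid(-10), sigmoid(0), sigmoid(10))  # close to 0, exactly 0.5, close to 1
```

Training logistic regression amounts to choosing the weights and bias so that these probabilities match the labelled data as well as possible.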

Modelling

For the spam detection problem, we have tagged messages but we are not certain about new
incoming messages. We need a model which can tell us the probability of a message being spam
or not spam. Assuming in this example that 0 indicates the negative class (absence of spam) and 1
indicates the positive class (presence of spam), we will use a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
Spam_model.fit(message_train, spam_nospam_train)
pred = Spam_model.predict(message_test)
accuracy_score(spam_nospam_test, pred)


So, first we define the model and then fit the training data; this phase is called training the model.
Once training is finished, we can use the test split to predict results. To check the accuracy of our
model we can use the accuracy score metric, which compares the predicted results with the true
results. After running the above code we got 93% accuracy.

In some cases 93% might seem a good score. There are many other things we can do with the
collected data to achieve more accurate results, like stemming the words and normalizing the
length.

Conclusion:

As we saw, we used previously collected data to train the model and predicted the category of new
incoming emails. This indicates the importance of tagging the data in the right way. One mistake can
make your machine dumb. For example, when you get an email in your Gmail or any other email
account that you think is spam but choose to ignore, maybe next time you see that email you should
report it as spam. This can help a lot of other people who receive the same kind of email but are not
aware of what spam is. Sometimes a wrong spam tag can move a genuine email to the spam folder too,
so be careful before you tag an email as spam or not spam.


Viva Questions with Answers:-


1. What is machine learning?

Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving
its accuracy. Machine learning is an important component of the growing field of data science.
2. Define supervised learning
Supervised learning is a type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict the output.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model.
3. Define unsupervised learning
Unsupervised learning is a type of machine learning in which models are trained
using an unlabeled dataset and are allowed to act on that data without any supervision.

4. Define semi supervised learning


Semi-supervised learning is an approach to machine learning that combines a small
amount of labeled data with a large amount of unlabeled data during training. Semi-
supervised learning falls between unsupervised learning (with no labeled training data) and
supervised learning (with only labeled training data).
5. Define reinforcement learning
Reinforcement learning (RL) is an area of machine learning concerned with how
intelligent agents ought to take actions in an environment in order to maximize the notion of
cumulative reward. Reinforcement learning is one of three basic machine learning
paradigms, alongside supervised learning and unsupervised learning.

6. What do you mean by hypotheses?

• Supervised machine learning is often described as the problem of approximating a
target function that maps inputs to outputs.
• A hypothesis is a candidate function (model) for that mapping; learning is characterized as
searching through and evaluating candidate hypotheses from a hypothesis space.
• The discussion of hypotheses in machine learning can be confusing for a beginner,
especially when “hypothesis” has a distinct, but related meaning in statistics (e.g.
statistical hypothesis testing) and more broadly in science (e.g. scientific hypothesis).

7. What is classification

Machine learning is a field of study and is concerned with algorithms that learn from examples.
Classification is a task that requires the use of machine learning algorithms that learn how to assign a
class label to examples from the problem domain. An easy to understand example is classifying
emails as “spam” or “not spam.” There are many different types of classification tasks that you may
encounter in machine learning and specialized approaches to modeling that may be used for each.


8. What is clustering?
Cluster analysis, or clustering, is an unsupervised machine learning task. It involves automatically
discovering natural groupings in data. Unlike supervised learning (such as predictive modeling), clustering
algorithms only interpret the input data and find natural groups or clusters in feature space.
9. Define precision, accuracy and recall.
In pattern recognition, information retrieval and classification (machine learning), precision
(also called positive predictive value) is the fraction of relevant instances among the retrieved
instances, while recall (also known as sensitivity) is the fraction of relevant instances that were
retrieved. Accuracy is the fraction of all predictions, positive and negative, that the model gets
right. In terms of true/false positives and negatives: precision = TP / (TP + FP),
recall = TP / (TP + FN) and accuracy = (TP + TN) / (TP + FP + FN + TN).
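
As a rough illustration, the three measures can be computed from the four confusion-matrix counts. The function name and the spam-filter numbers below are made up for this sketch and do not come from the lab exercises.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    # Guard against division by zero when nothing was predicted/actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical spam filter: 8 spam mails caught, 2 genuine mails flagged,
# 4 spam mails missed, 86 genuine mails passed through.
precision, recall, accuracy = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(precision, recall, accuracy)
```

Here accuracy looks high even though a third of the spam is missed, which is why precision and recall are usually reported alongside it.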

10. Define entropy

In machine learning, entropy is a measure of the impurity or uncertainty in a set of examples.
For a set S whose examples fall into classes with proportions p1, ..., pn, the entropy is
H(S) = -(p1 log2 p1 + ... + pn log2 pn). Entropy is 0 when all examples belong to one class and
is largest when the classes are evenly mixed. Decision tree algorithms such as ID3 use it to
compute information gain and choose the attribute to split on.
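
For this question, a minimal sketch of Shannon entropy over a list of class labels, H = -(sum of p * log2 p over the classes), may help. The function name and the sample labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum p * log2(p) over the classes that actually occur, then negate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = entropy(["yes", "no", "yes", "no"])   # two classes, evenly split
pure = entropy(["yes", "yes", "yes", "yes"])     # only one class present
print(balanced, pure)
```

An evenly split two-class set gives the maximum of 1 bit, while a set containing a single class gives 0.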

11. Define regression.

Regression analysis consists of a set of machine learning methods that allow us to predict a
continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function
of the x variables.
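
A small sketch of the simplest case, a straight line y = slope * x + intercept fitted by ordinary least squares to one predictor, is given below. The function name and the data points are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # predict y for an unseen x = 5
print(slope, intercept, prediction)
```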

12. How is KNN different from k-means clustering?

The k-means algorithm is an unsupervised clustering algorithm. It takes a set of unlabeled
points and tries to group them into “k” clusters. It is unsupervised because the points have
no external classification. The “k” in k-means denotes the number of clusters you want to have in the
end. If k = 5, you will have 5 clusters on the data set. The “k” in k-means clustering has nothing to
do with the “k” in the KNN algorithm, where it is the number of nearest neighbours consulted for
each prediction. k-means is an unsupervised learning algorithm that
is used for clustering, whereas KNN is a supervised learning algorithm used for classification.
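
The unsupervised side of this comparison can be sketched with a plain k-means loop on one-dimensional points. This is a simplified illustration, not the lab program: the function name, the fixed iteration count and the starting centroids are all assumptions, and no labels are used anywhere.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points; k is the number of starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

With k = 2 starting centroids the points settle into two groups, one around 2 and one around 11.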

13. What is concept learning?

Concept learning is the task of inferring a general, boolean-valued concept from labelled
positive and negative training examples. It is a form of inductive learning, one of several
ways of learning:

➢ Rote learning (memorization): Memorizing things without knowing the concept/logic
behind them.
➢ Passive learning (instructions): Learning from a teacher/expert.
➢ Analogy (experience): Learning new things from our past experience.
➢ Inductive learning (experience): On the basis of past experience, formulating a generalized
concept.
➢ Deductive learning: Deriving new facts from past facts.

14. Define specific boundary and general boundary

Hypothesis h1 is a more general hypothesis than hypothesis h2 if h2 implies h1. In
this case, h2 is a more specific hypothesis than h1. Any hypothesis is both more
general than itself and more specific than itself.

The "more general than" relation forms a partial ordering over the hypothesis space.
The version-space algorithm that follows exploits this partial ordering to search for
hypotheses that are consistent with the training examples.
Given hypothesis space H and examples E, the version space is the subset of H that
is consistent with the examples.
The general boundary of a version space, G, is the set of maximally general
members of the version space (i.e., those members of the version space such that no
other element of the version space is more general). The specific boundary of a
version space, S, is the set of maximally specific members of the version space.
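
The "more general than" ordering can be sketched for simple conjunctive hypotheses. The representation is an assumption made for this example: a hypothesis is a tuple of attribute constraints in which "?" accepts any value, and the attribute values shown are invented. The empty (reject-everything) constraint is not handled.

```python
def more_general_or_equal(h1, h2):
    """True if h1 covers every instance that h2 covers.

    Hypotheses are tuples of attribute constraints; "?" accepts any value.
    """
    return all(a == "?" or a == b for a, b in zip(h1, h2))


h_general = ("Sunny", "?", "?")
h_specific = ("Sunny", "Warm", "Normal")

print(more_general_or_equal(h_general, h_specific))   # general covers specific
print(more_general_or_equal(h_specific, h_general))   # but not the other way round
```

The members of the general boundary G are the consistent hypotheses for which no other consistent hypothesis passes this test against them, and the specific boundary S is defined the same way in the opposite direction.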

15.Define target function

A target function, in machine learning, is a method for solving a problem that an AI algorithm
parses its training data to find. Once an algorithm finds its target function, that function can be used
to predict results (predictive analysis). The function can then be used to find output data related to
inputs for real problems where, unlike training sets, outputs are not included.

The target function is essentially the formula that an algorithm feeds data to in order to calculate
predictions. As in algebra, it is common when training AI to find the variable from the solution,
working in reverse. The function f is applied to the input (I) to produce the output (O);
therefore O = f(I).

16. Define decision tree


❖ Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
❖ In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas Leaf
nodes are the outputs of those decisions and do not contain any further branches.
❖ The decisions or tests are performed on the basis of the features of the given dataset.
❖ It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.

17. What is ANN?

➢ Artificial neural networks (ANNs), or neural networks, are computational algorithms.
➢ They are intended to simulate the behavior of biological systems composed of “neurons”. ANNs are
computational models inspired by an animal’s central nervous system. They are capable
of machine learning as well as pattern recognition. They are presented as systems of
interconnected “neurons” which can compute values from inputs.
➢ A neural network is an oriented (directed) graph. It consists of nodes, which in the biological
analogy represent neurons, connected by arcs, which correspond to dendrites and synapses. Each arc
is associated with a weight. Each node applies an activation function to the values it receives
along its incoming arcs, adjusted by the weights of those arcs.
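
A single node of such a network can be sketched as a weighted sum of its inputs passed through an activation function. The sigmoid activation, the bias term and the numbers used here are assumptions for illustration; the text above does not fix any of them.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Two incoming arcs with weights 0.5 and -0.5, plus an assumed bias of -0.5.
out = neuron_output(inputs=[1.0, 0.0], weights=[0.5, -0.5], bias=-0.5)
print(out)
```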

18. Explain gradient descent approximation


Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable
function. Gradient descent is simply used to find the values of a function's parameters (coefficients)
that minimize a cost function as far as possible.
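
A minimal sketch of the idea on a one-parameter cost function follows. The cost f(x) = (x - 3)^2, the learning rate and the step count are assumptions picked so that the behaviour is easy to check by hand.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)   # move downhill by a small step
    return x


# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Each step shrinks the distance to the minimum by a constant factor here; too large a learning rate would instead make the iterates overshoot and diverge.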


19. State Bayes theorem.


Bayes' theorem is a method to determine conditional probabilities, that is, the probability of one
event occurring given that another event has already occurred. For events A and B with P(B) > 0,
it states that P(A|B) = P(B|A) P(A) / P(B). Conditional probabilities are therefore essential for
making accurate predictions in Machine Learning.
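
A short worked example of the theorem, P(A|B) = P(B|A) P(A) / P(B), is sketched below. The function name and all probabilities are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Total probability of the evidence B over both cases, A and not A.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam, and the word "offer" appears
# in 60% of spam mails and in 5% of genuine mails.
p_spam_given_offer = posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.05)
print(p_spam_given_offer)
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.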

20. Define Bayesian belief networks.

➢ "A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
➢ It is also called a Bayes network, belief network, decision network, or Bayesian model.
➢ Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
➢ Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making

21. Differentiate hard and soft clustering

➢ In hard clustering, each data point either belongs to a cluster completely or not. For example,
when customers are segmented into 10 groups, each customer is put into exactly one of them.
➢ In soft clustering, instead of putting each data point into a single cluster, a probability or
likelihood of that data point belonging to each cluster is assigned.
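
The difference can be sketched for one-dimensional points and fixed cluster centres. The soft weights below come from a simple exponential of the negative distance, normalised to sum to 1. That weighting is an assumption made for illustration and is not a specific named algorithm.

```python
import math


def hard_assign(point, centroids):
    """Index of the single nearest centroid."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


def soft_assign(point, centroids):
    """Membership weights that sum to 1; closer centroids get more weight."""
    scores = [math.exp(-abs(point - c)) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


centroids = [0.0, 10.0]
hard = hard_assign(4.0, centroids)   # one cluster index
soft = soft_assign(4.0, centroids)   # one weight per cluster
print(hard, soft)
```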

22. Define variance.


Variance is the error from sensitivity to small fluctuations in the training set. High variance can
cause an algorithm to model the random noise in the training data, rather than the intended outputs
(overfitting). Equivalently, variance measures how much a model's predictions change when it is
trained on different training sets.
23. What is inductive machine learning

➢ The Inductive Learning Algorithm (ILA) is an iterative, inductive machine learning
algorithm used for generating a set of classification rules of the form “IF-THEN”
from a set of examples. It produces rules at each iteration
and appends them to the set of rules.
➢ Basic idea: there are basically two methods for knowledge extraction: first from
domain experts, and then with machine learning.
➢ For a very large amount of data, domain experts are not very useful or reliable,
so we move towards the machine learning approach for this work.

24. Why K nearest neighbour algorithm is lazy learning algorithm.


The KNN algorithm is a classification algorithm, also called the K Nearest Neighbor
classifier. K-NN is a lazy learner because it doesn’t learn a discriminative function from the training
data but memorizes the training dataset instead. There is no training time in K-NN, but the prediction
step is expensive: each time we want to make a prediction, K-NN searches for the
nearest neighbors in the entire training set. An eager learner has a model fitting or training step; a
lazy learner does not have a training phase.
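
The lazy behaviour shows up clearly in a small sketch: there is no fit step, and all the work happens when a prediction is requested. The function name, the squared Euclidean distance and the toy data are assumptions for this example.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; nothing is fitted in advance.
    """
    # Every prediction scans and sorts the whole training set by distance.
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
label_near_a = knn_predict(train, (2, 2))
label_near_b = knn_predict(train, (7, 7))
print(label_near_a, label_near_b)
```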


25. Why naïve Bayes is naïve.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. It is not a single algorithm but a family of algorithms that share a
common principle: every pair of features being classified is assumed to be independent of each
other, given the class. The method is called “naive” because this independence assumption
rarely holds exactly in real data.

26. Mention classification algorithms

➢ Logistic Regression.
➢ Naïve Bayes.
➢ Stochastic Gradient Descent.
➢ K-Nearest Neighbours.
➢ Decision Tree.
➢ Random Forest.
➢ Support Vector Machine.

27. Define pruning.


Pruning is a data compression technique in machine learning and search algorithms that
reduces the size of decision trees by removing sections of the tree that are non-critical and redundant
to classify instances.

28. Differentiate Clustering and classification.

Both classification and clustering are used for the categorisation of objects into one or more
classes based on their features. They appear to be similar processes, but there is a basic difference: in
classification, predefined labels are assigned to each input instance according to
its properties, whereas in clustering those labels are missing.

29. Mention clustering algorithms

Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same group are more similar to each other than
to the data points in other groups. It is basically a grouping of objects on the basis of
similarity and dissimilarity between them. Commonly used clustering algorithms include:

➢ K-Means.
➢ Hierarchical (agglomerative) clustering.
➢ DBSCAN.
➢ Mean Shift.
➢ Gaussian Mixture Models (EM).

For example, the data points in the graph below that are clustered together can be classified into one
single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the
picture.


30. Define Bias.

The term bias was first introduced by Tom Mitchell in 1980 in his paper titled “The need for
biases in learning generalizations”. The idea of bias is that a model gives importance to
some of the features in order to generalize better to a larger dataset with various other attributes.
Bias in ML helps us generalize better and makes our model less sensitive to any single data
point.
