
Machine Learning

Definition of Machine Learning
• A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.
Course Objectives
• To introduce students to the basic concepts and techniques of
Machine Learning.
• To have a thorough understanding of the Supervised and
Unsupervised learning techniques
• To study the various probability-based learning techniques
Course Outcomes
• Distinguish between supervised, unsupervised and semi-supervised learning
• Understand algorithms for building classifiers applied on datasets of
non-linearly separable classes
• Understand the principles of evolutionary computing algorithms
• Design an ensemble to increase the classification accuracy
Books
Text Book
• Stephen Marsland, Machine Learning: An Algorithmic Perspective, Second Edition,
Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
Reference Books
• Tom M. Mitchell, Machine Learning, First Edition, McGraw Hill Education, 2013.
• Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense
of Data, First Edition, Cambridge University Press, 2012.
• Jason Bell, Machine Learning: Hands-On for Developers and Technical Professionals,
First Edition, Wiley, 2014.
• Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and
Machine Learning Series), Third Edition, MIT Press, 2014.
Unit-I
• Learning – Types of Machine Learning – Supervised Learning
• The Brain and the Neuron
• Design a Learning System – Perspectives and Issues in Machine
Learning
• Concept Learning Task – Concept Learning as Search – Finding a
Maximally Specific Hypothesis – Version Spaces and the Candidate
Elimination Algorithm
• Linear Discriminants: – Perceptron – Linear Separability – Linear
Regression.
Learning
• The key concept that we will need to think about for our machines is
learning from data.
• The important parts of human learning for this course are
remembering, adapting, and generalizing:
• Recognizing that last time we were in this situation (saw this data) we tried
out some particular action (gave this output) and it worked (was correct), so
we’ll try it again, or it didn’t work, so we’ll try something different.
• The last word, generalizing, is about recognizing similarity between different
situations, so that things that applied in one place can be used in another.
• This is what makes learning useful, because we can use our
knowledge in lots of different places.
Machine Learning
• Machine learning is about making computers modify or adapt their
actions (making predictions, or controlling a robot) so that these
actions get more accurate.
• Accuracy here is measured by how well the chosen actions reflect the
correct ones.
• The computational complexity of the machine learning methods will
also be of interest to us since what we are producing is algorithms.
• The complexity is often broken into two parts: the complexity of
training, and the complexity of applying the trained algorithm.
Types of Machine Learning
• Supervised learning:
• A training set of examples with the correct responses (targets) is provided.
• Based on this training set, the algorithm generalizes to respond correctly to all
possible inputs.
• This is also called learning from exemplars.
• Unsupervised learning:
• Correct responses are not provided.
• Instead the algorithm tries to identify similarities between the inputs.
• Inputs that have something in common are categorized together.
• The statistical approach to unsupervised learning is known as density
estimation.
Types of Machine Learning
• Reinforcement learning:
• This is somewhere between supervised and unsupervised learning.
• The algorithm gets told when the answer is wrong, but does not know how to correct
it.
• It has to explore and try out different possibilities until it works out how to get the
answer right.
• Reinforcement learning is sometimes called learning with a critic because of this
monitor that scores the answer, but does not suggest improvements.
• Evolutionary learning:
• Biological organisms adapt to improve their survival rates and chance of having
offspring in their environment.
• We’ll look at how we can model this in a computer, using an idea of fitness, which
corresponds to a score for how good the current solution is.
Supervised Learning
• Supervised learning uses a set of training data, usually written as pairs
(xi, ti), where the xi are the inputs and the ti are the targets.
• The thing that makes machine learning better is generalization: the
algorithm should produce sensible outputs for inputs that weren’t
encountered during learning.
• The algorithm must also be able to deal with noise: small inaccuracies
in the data that are inherent in measuring any real-world process.
Supervised Learning
• Regression
• Classification
Regression

(Figure: given a set of training points, the regression task is to predict the output at x = 0.44.)


Classification
• The classification problem consists of taking input vectors and deciding
which of N classes they belong to, based on training from examples of
each class.
The Brain and the Neuron
• If we can understand how the brain works, then there might be things
in there for us to copy and use for our machine learning systems.
• Each neuron can be viewed as a separate processor, performing a very
simple computation: deciding whether or not to fire.
• The brain is a massively parallel computer made up of 10^11 processing
elements.
• We should be able to model the same processing inside a computer and
end up with animal- or human-like intelligence in a machine.
The Brain and the Neuron
• Hebb’s Rule
• McCulloch and Pitts Neurons
• Limitations of the McCulloch and Pitts Neuronal Model
Hebb’s Rule
• Hebb’s rule says that the changes in the strength of synaptic
connections are proportional to the correlation in the firing of the two
connecting neurons.
• If two neurons consistently fire simultaneously, then any connection
between them will change in strength, becoming stronger.
• If two neurons never fire simultaneously, the connection between
them will die away.
• The idea is that if two neurons both respond to something, then they
should be connected.
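Hebb's rule amounts to a weight change proportional to the product of the two neurons' activations, Δw = η·x·y. A minimal sketch in Python (the learning rate and the activation values here are assumptions for illustration, not from the slides):

import numpy as np

# A minimal sketch of a Hebbian weight update (eta and the activations
# below are assumed values, chosen only for illustration).
eta = 0.1
x = np.array([1.0, 0.0, 1.0])  # firing of the pre-synaptic (input) neurons
y = 1.0                        # firing of the post-synaptic (output) neuron
w = np.zeros(3)

# Hebb's rule: each weight changes in proportion to the correlation
# between the firing of the two neurons it connects.
w += eta * x * y
print(w)  # weights grow only where input and output fire together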
McCulloch and Pitts Neurons
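(The slide's figure is not reproduced here.) In essence, a McCulloch and Pitts neuron computes a weighted sum of its inputs and fires if the sum exceeds a threshold. A minimal sketch, with assumed weights and threshold chosen so the neuron computes logical AND:

import numpy as np

def mcp_neuron(x, w, theta):
    """McCulloch and Pitts neuron: a weighted sum of the inputs
    followed by a hard threshold (fire or not fire)."""
    h = np.dot(w, x)              # linear summation of the inputs
    return 1 if h > theta else 0  # fire (1) only if the sum exceeds theta

# With these (assumed) weights and threshold the neuron acts as AND.
print(mcp_neuron(np.array([1, 1]), np.array([1.0, 1.0]), 1.5))  # -> 1
print(mcp_neuron(np.array([1, 0]), np.array([1.0, 1.0]), 1.5))  # -> 0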
Limitations of McCulloch and
Pitts Neurons
• Real neurons are much more complicated.
• The inputs to a real neuron are not necessarily summed linearly:
there may be non-linear summations.
• It is possible to improve the model to include many of these features.
• The picture is complicated enough already, and McCulloch and Pitts
neurons already provide a great deal of interesting behavior.
• Networks of McCulloch and Pitts neurons can memorise pictures and
learn to represent functions and classify data.
Designing A Learning System
Here we design a program to learn to play checkers so that we
understand some of the basic design issues and approaches to machine
learning. We take the following steps:
• Choosing the Training Experience
• Choosing the Target function
• Choosing a representation for the target function
• Choosing a function approximation algorithm
• Estimating Training values
• Adjusting the weights
• The Final Design
Choosing the Training
Experience
The type of training experience available can have a significant impact
on success or failure of the learner. There are three important
attributes of training experience. They are:
• Whether the training experience provides direct or indirect feedback
• The degree to which the learner controls the sequence of training
examples
• Representation of the distribution of examples
Direct or Indirect Feedback
• Direct Feedback: Learning from direct training examples consisting of
individual checkers board states and the correct moves for each.
• Indirect Feedback: Learning from indirect information consisting of
the move sequences and final outcomes of various games played.
• Indirect feedback has a problem of credit assignment.
Sequencing of Training
Examples (settings for learning)
• The learner might rely on the teacher to select informative board
states and to provide correct moves for each of them.
• The learner might propose board states that are confusing and ask
the teacher for the appropriate move.
• The learner might take complete control over both the board states
and the correct moves while playing against itself.
Representation of the
distribution of examples
• Learning is most reliable when the training examples follow a
distribution that is similar to the future test examples.
Choosing the Training
Experience
• In our design we decide that our system will train by playing games
against itself. Hence no external trainer needs to be present, and the
system can generate as much training data as time permits. Now we
have a fully specified learning task:
A checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won in the world
tournament
• Training experience E: playing practice games against itself
Choosing the Training
Experience
• In order to complete the design of the learning system, we must now
choose:
• The exact type of knowledge to be learned (choosing the target function).
• A representation for this target knowledge (choosing a representation for the
target function).
• A learning mechanism (choosing a function approximation algorithm).
Choosing the Target Function
• We now need to determine what type of knowledge will be learned
and how this will be used by the performance program.
• We begin with a checkers-playing program that can generate legal
moves from any board state.
• The program needs to learn how to choose the best move from
among the legal moves.
• Many optimization problems fall into this class.
Choosing the Target Function
• To solve this problem we need a function that chooses the best move
for any given board state. The target function is defined as follows:
ChooseMove: BM
• This function accepts as input any board from the set of legal board
states B and produces as output an appropriate move M from the set
of legal moves.
• Now the problem of improving performance P at task T is reduced to
the problem of learning some target function ChooseMove.
• Given the kind of indirect training experience the target function
ChooseMove is very difficult to learn.
Choosing the Target Function
• An alternative target function that will be easy to learn is an
evaluation function that assigns a numerical score to any given board
state. Let us call the function V and it is defined as follows:
V: BR
• This function maps a legal board state to some real value.
Choosing the Target Function
• Let us now define the target value V(b) for an arbitrary board state b in B.
• If b is a final board state that is won, then V(b) = 100
• If b is a final board state that is lost, then V(b) = -100
• If b is a final board state that is drawn, then V(b) = 0
• If b is not a final state in the game, then V(b)=V(b’), where b’ is the best final
board state that can be achieved starting from b and playing optimally until the
end of the game.
• This definition of the function is a nonoperational definition because
• The last case requires searching ahead for the optimal line of play until the end of
the game, and
• This definition is not efficiently computable
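As a small illustration of why the definition is nonoperational, the final-state cases translate directly into code while the non-final case does not (the board-state API used here, with methods like is_final() and won(), is hypothetical):

def v_target(b):
    """Target value V(b) for a board state b (sketch; board API assumed)."""
    if b.is_final():
        if b.won():
            return 100
        if b.lost():
            return -100
        return 0  # drawn
    # Non-final states require searching ahead for the optimal line of
    # play to the end of the game, which is not efficiently computable.
    raise NotImplementedError("requires optimal look-ahead")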
Choosing the Target Function
• The goal here is to discover an operational description of V; that is a
description that is useful in evaluating states and selecting moves for
the checkers game.
• Now the learning task is reduced to the problem of discovering an
operational description of the ideal target function V (which is very
difficult to learn).
• We actually expect learning algorithms to acquire only some
approximation to the target function; for this reason, learning a target
function is often called function approximation.
• We will use the symbol V̂ to refer to the function that is actually learned,
to distinguish it from the ideal target function V.
Choosing a Representation for
the Target Function
• The program can be allowed to represent V̂ using:
• A large table with a distinct entry specifying the value for each distinct board state.
• We could allow it to represent using a collection of rules that match against
features of the board state.
• A quadratic polynomial function of predefined board features.
• An artificial neural network.
• The choice of representation for the function approximation involves a tradeoff:
• A more expressive representation allows as close an approximation as possible
to the ideal target function.
• But the more expressive the representation, the more training data the program
will require to choose among the hypotheses it can represent.
Choosing a Representation for
the Target Function
• Here in this example we choose a simple representation in which, for any
given board state, the function V̂ will be calculated as a linear combination
of the following board features:
• X1: the number of black pieces on the board
• X2: the number of red pieces on the board
• X3: the number of black kings on the board
• X4: the number of red kings on the board
• X5: the number of black pieces threatened by red
• X6: the number of red pieces threatened by black.
• Our learning program will represent V̂(b) as a linear function of the form:
• V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
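As a minimal sketch, this representation can be evaluated directly (the weight values below are placeholders for illustration, not learned values):

import numpy as np

def v_hat(w, features):
    """Linear evaluation function: V_hat(b) = w0 + w1*x1 + ... + w6*x6,
    where features = [x1, ..., x6] are the board features listed above."""
    return w[0] + np.dot(w[1:], features)

w = np.array([0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5])  # placeholder weights
b_features = np.array([0, 4, 0, 1, 0, 0])             # x1..x6 for some board b
print(v_hat(w, b_features))  # -> -7.0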
Choosing a Representation for
the Target Function
• The partial design of the checkers learning program is as follows:
• Task T: playing checkers
• Performance measure P: percent of games won in the world
tournament
• Training Experience E: games played against itself
• Target function: V: BR
• Target function representation:
(b) = w0+w1x1+w2x2+w3x3+w4x4+w5x5+w6x6
Choosing a function
approximation algorithm
• A set of training examples, each describing a given board state b and its
training value Vtrain(b), is needed to learn the target function V̂.
• Each training example is represented as an ordered pair <b,Vtrain(b)>
• An example training pair is as follows:
<<x1=0,x2=4,x3=0,x4=1,x5=0,x6=0>,+100>
• There are two steps to be taken as part of the function approximation
algorithm
• Estimating training values
• Adjusting the weights
Estimating Training Values
• In the checkers example, the only information available to the learner is
whether the game was eventually won or lost.
• Here we require training examples that assign scores to individual board
states.
• The most difficult task is assigning scores to the intermediate board states
that occur before the game's end.
• The rule for estimating training values can be summarized as follows:
Vtrain(b) ← V̂(Successor(b))
Adjusting the Weights
• The remaining task is to specify the algorithm for choosing the
weights wi to best fit to the set of training examples <b,Vtrain(b)> just
estimated.
• One very common approach is to estimate the weights so that the
squared error E between the training values and the predicted values
is minimized:

E ≡ Σ (Vtrain(b) − V̂(b))², summed over the training examples <b, Vtrain(b)>

• Here we need to minimize E.
Adjusting the Weights
• Here we use an algorithm called the least mean squares (LMS) rule to
update the weights. For each training example <b, Vtrain(b)>, each
weight is updated as:

wi ← wi + η (Vtrain(b) − V̂(b)) xi

• Where η is a small constant that moderates the size of the weight
update.
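A minimal sketch of one LMS step under this rule (the feature values and the training value are illustrative):

import numpy as np

def lms_update(w, x, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i.
    x includes a leading 1 so that w[0] plays the role of the bias w0."""
    v_hat = np.dot(w, x)                 # current prediction V_hat(b)
    return w + eta * (v_train - v_hat) * x

w = np.zeros(7)                          # w0..w6
x = np.array([1, 0, 4, 0, 1, 0, 0])      # leading 1, then features x1..x6
w = lms_update(w, x, v_train=100.0)      # move the weights toward the target
print(w)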
The Final Design
• The Final design of our checkers learning system is characterized by
four distinct modules namely:
• The Performance System
• The Critic
• Generalizer
• Experiment Generator
The Final Design
Diagrammatic representation of the relationship between the four modules
of our checkers learning system
The Performance System
• This is the module that solves the given performance task like the
checkers game using the learned target function.
• It takes a new problem as input and produces a trace of the solution
as the output.
• In the case of our checkers game the performance system chooses
the next move which is decided by the evaluation function.
• The performance improves as the evaluation function becomes
increasingly accurate.
The Critic
• This module takes as input the solution trace of the game and
produces training examples of the target function as the output.
• In the case of our checkers game the critic corresponds to the rule for
producing the training examples.

Vtrain(b) (Successor(b))
The Generalizer
• This module takes the training examples as input and produces the
hypothesis which is the estimate of the target function as the output.
• It generalizes from the given specific examples.
• In the case of our checkers problem this module uses the LMS algorithm
and produces the hypothesis V̂ described by the learned weights w0, w1, w2, w3, w4, w5, w6.
Experiment Generator
• This module takes as input the current hypothesis and produces a
new problem as output for the performance system to explore.
• In the current checkers problem the current hypothesis is V̂, and the
new problem is a checkers game starting from the board's initial state.
• Our experiment generator is a very simple module which outputs the
same board with the initial state each time.
• A better experiment generator module would create board positions
to explore particular regions of the state space.
Summary of choices in
designing the checkers learning
problem
Perspectives and Issues in
Machine Learning
• In machine learning there is a very large space of possible hypotheses
to be searched.
• The job of a machine learning algorithm is to search for the hypothesis
that best fits the observed training examples and any prior knowledge
held by the learner.
• In the checkers game example, the LMS algorithm adjusts the weights
each time the hypothesized evaluation function predicts a value that
differs from the training value.
Perspectives and Issues in
Machine Learning
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific
training examples? In what settings will particular algorithms converge to
the desired function, given sufficient training data? Which algorithms
perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found
to relate the confidence in learned hypotheses to the amount of training
experience and the character of the learner's hypothesis space?
• When and how does the prior knowledge held by the learner guide the
process of generalizing from examples? Can prior knowledge be helpful
even when it is only approximately correct?
Perspectives and Issues in
Machine Learning
• What is the best strategy for choosing a useful next training
experience?
• How does the choice of this strategy alter the complexity of the
learning problem?
• What specific functions should the system attempt to learn? Can this
process be automated?
• How can the learner automatically alter its representation to improve
its ability to represent and learn the target function?
Concept Learning and the
General to Specific Ordering
• Introduction
• A Concept Learning Task
• Concept Learning as Search
• Find-S: Finding a Maximally Specific Hypothesis
• Version Spaces and the Candidate Elimination Algorithm
• Remarks on Version Spaces and Candidate Elimination
• Inductive Bias
Introduction
• Concept learning can be thought of as a problem of searching through
a predefined space of potential hypotheses for the hypothesis that
best fits the training examples.
• The activity of learning mostly involves acquiring general concepts
from specific training examples, which is called concept learning.
• Each concept can be thought of as a Boolean-valued function defined
over a large set (e.g., a function defined over all animals whose value is
true for birds and false for other animals).
• Concept learning is all about inferring a Boolean valued function from
training examples of its input and output.
A Concept Learning Task
• Description of Concept Learning Task
• Notation
• The Inductive Learning Hypothesis
Description of A Concept
Learning Task

• Let us consider the example task of learning the target concept "the days on which I
enjoy my favorite water sport". The above table describes a few examples.
• Let us look at the hypothesis representation for the learner.
• Here we have a very simple representation in which each hypothesis
consists of a conjunction of constraints on the instance attributes.
• For each attribute, the constraint in the hypothesis will be:
• a "?", indicating any value is acceptable for this attribute,
• a "∅", indicating no value is acceptable,
• or a single required value for the attribute.
• For example: (rainy, ?, high, ?, ?, ?), (?, ?, ?, ?, ?, ?) and (∅, ∅, ∅, ∅, ∅, ∅)
Description of A Concept
Learning Task
• Any concept learning task can be described by:
• The set of instances over which the target function is defined (X)
• The target function (c)
• The set of candidate hypothesis considered by the learner, and (H)
• The set of available training examples (D)
Notation
The Inductive Learning
Hypothesis
• The learning task in the previous example is to determine a hypothesis h
identical to the target concept c over the entire set of instances X.
• The information available about c is only its value over the training
examples.
• The inductive learning algorithms can at best guarantee that the output
hypothesis fits the target concept over the training data.
• The inductive learning hypothesis: any hypothesis found to approximate
the target function well over a sufficiently large set of training examples
will also approximate the target function well over other unobserved
examples.
Concept Learning as Search
• Concept Learning as Search (Description)
• General to Specific Ordering of Hypothesis
Description of Concept Learning
as Search
• Concept learning is a task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation.
• The goal is to find the hypothesis that best fits the training examples.
• It is the selection of the hypothesis representation that defines the space
of all hypotheses the program can ever represent, and therefore can ever
learn.
• Let us consider the set of instances X and the hypothesis space H in the
previous example of EnjoySport.
• The instance space contains a total of 3*2*2*2*2*2 = 96 distinct instances,
and the hypothesis space contains a total of 5*4*4*4*4*4 = 5120
syntactically distinct hypotheses.
Description of Concept Learning
as Search
• We can observe that every hypothesis containing one or more ∅
constraints represents the empty set of instances, classifying every
instance as negative.
• The number of semantically distinct hypotheses is therefore 1 + (4*3*3*3*3*3) = 973.
• If learning is viewed as a search problem, the study of learning algorithms
will examine different strategies for searching the hypothesis space.
• We will be particularly interested in algorithms capable of efficiently
searching very large or infinite hypothesis spaces to find the hypothesis
that best fits the training data.
General to Specific Ordering of
Hypothesis
• There is a naturally occurring structure for any concept learning
problem: a general-to-specific ordering of hypotheses.
• This structure helps us design learning algorithms that exhaustively
search even infinite hypothesis spaces without explicitly enumerating
every hypothesis.
• Let us consider two hypothesis:
• h1=(sunny,?,?,strong,?,?)
• h2=(sunny,?,?,?,?,?)
• h2 imposes fewer constraints on the instances and therefore classifies
more instances as positive. We can say that h2 is more general than h1.
General to Specific Ordering of
Hypothesis
• The "more general than" relationship between hypotheses can be defined
as follows:
• For any instance x in X and hypothesis h in H, x satisfies h iff h(x) = 1.
• Definition of "more general than or equal to": hj ≥g hk if and only if, for
every instance x in X, (hk(x) = 1) implies (hj(x) = 1).
• We then say that hj is (strictly) more general than hk, written hj >g hk,
iff (hj ≥g hk) and not (hk ≥g hj).
• Equivalently, we say that hk is more specific than hj.
General to Specific Ordering of
Hypothesis

The relation ≥g defines a partial order over the hypothesis space (the
relation is reflexive, antisymmetric and transitive).
The ≥g relation is important because it provides a useful structure
over the hypothesis space H for any concept learning problem.
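For the conjunctive hypotheses used here, the ≥g relation can be checked syntactically, attribute by attribute. A minimal sketch (this check assumes the more specific hypothesis contains no ∅ constraints):

def more_general_or_equal(hj, hk):
    """Syntactic check of hj >=g hk: every constraint of hj must be
    '?' or equal to the corresponding constraint of hk."""
    return all(a == '?' or a == b for a, b in zip(hj, hk))

h1 = ('sunny', '?', '?', 'strong', '?', '?')
h2 = ('sunny', '?', '?', '?', '?', '?')
print(more_general_or_equal(h2, h1))  # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False: 'strong' constrains h1 further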
Find-S: Finding A Maximally
Specific Hypothesis
• One way to use the "more general than" partial ordering to organize
the search for the best hypothesis is to begin with the most specific
possible hypothesis in H, then generalize it each time it fails to cover
an observed positive training example.
• Find-S is an algorithm which uses this partial ordering very effectively.
Find-S: Finding A Maximally
Specific Hypothesis
• The first step of Find-S is to initialize h to the most specific hypothesis in H (for
the EnjoySport example):
• h ← (∅, ∅, ∅, ∅, ∅, ∅)
• After observing the first training example, the algorithm replaces each ∅ with
the corresponding attribute value:
• h ← (sunny, warm, normal, strong, warm, same)
• After observing the second training example, the algorithm generalizes the
constraints that no longer fit:
• h ← (sunny, warm, ?, strong, warm, same)
• The third training example is ignored, because Find-S ignores every negative
example.
• The fourth example leads to a further generalization of h:
• h ← (sunny, warm, ?, strong, ?, ?)
Find-S: Finding A Maximally
Specific Hypothesis
• The Find-S algorithm is an algorithm which illustrates a way in which
the more general than partial ordering can be used to find an
acceptable hypothesis.
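A minimal sketch of Find-S on the EnjoySport data, reproducing the trace above ('0' stands in for the ∅ constraint):

def find_s(examples, n_attrs=6):
    """Find-S: start with the most specific hypothesis and minimally
    generalize it on each positive example (negatives are ignored)."""
    h = ['0'] * n_attrs                   # '0' = the 'no value' constraint
    for x, label in examples:
        if label != 'yes':
            continue                      # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == '0':
                h[i] = value              # first positive example copies values
            elif h[i] != value:
                h[i] = '?'                # conflicting values generalize to '?'
    return h

train = [
    (('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'yes'),
    (('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'yes'),
    (('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'no'),
    (('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'yes'),
]
print(find_s(train))  # ['sunny', 'warm', '?', 'strong', '?', '?']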
Find-S: Finding A Maximally
Specific Hypothesis
• There are a few questions left unanswered by the Find-S algorithm
such as:
• Has the learner converged to the correct target concept? (can it determine if
it has found the only hypothesis consistent with the data)
• Why prefer the most specific hypothesis?
• Are the training examples consistent? (training examples can mislead Find-S,
since it does not consider negative examples)
• What if there are several maximally specific consistent hypothesis? (it has no
facility to backtrack)
Version Spaces and the
CANDIDATE-ELIMINATION
Algorithm
• Introduction
• Representation
• The List-Then-Eliminate Algorithm
• A more compact representation for version spaces
• CANDIDATE-ELIMINATION Learning Algorithm
• An Illustrative Example
Version Spaces and the
CANDIDATE-ELIMINATION
Algorithm (Introduction)
• A new approach to concept learning is the CANDIDATE-ELIMINATION
algorithm, which addresses several limitations of the Find-S algorithm.
• The idea in the CANDIDATE-ELIMINATION algorithm is to output a
description of the set of all hypotheses consistent with the training
examples, without explicitly enumerating all of its members.
• Both FIND-S and CANDIDATE-ELIMINATION algorithms perform poorly
when given noisy training data.
• CANDIDATE-ELIMINATION provides a useful conceptual framework for
introducing several fundamental issues in machine learning.
Representation
• The CANDIDATE-ELIMINATION algorithm finds all describable
hypotheses that are consistent with the observed training examples.
• Let us first try to understand what it means to be consistent with the
observed training examples: a hypothesis h is consistent with a set of
training examples D iff h(x) = c(x) for each example <x, c(x)> in D.
• The CANDIDATE-ELIMINATION algorithm represents the set of all
hypotheses consistent with the training examples.
• This subset of hypotheses is called the version space with respect to
the hypothesis space H and the training examples D.
Representation
The LIST-THEN-ELIMINATE
Algorithm
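(The slide's listing is not reproduced here.) In outline, LIST-THEN-ELIMINATE initializes the version space to contain every hypothesis in H and then eliminates any hypothesis found inconsistent with some training example. A minimal sketch, where the classify helper that applies a hypothesis to an instance is an assumed argument:

def list_then_eliminate(hypotheses, examples, classify):
    """Keep only the hypotheses consistent with every training example."""
    version_space = list(hypotheses)
    for x, label in examples:
        # Remove any hypothesis whose prediction disagrees with the label.
        version_space = [h for h in version_space if classify(h, x) == label]
    return version_space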
A more Compact Representation
of Version Spaces
• The CANDIDATE-ELIMINATION algorithm employs a better representation
of the version space.
• The version space is represented by its most general and least general
members.
• These members form the general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
A more Compact Representation
of Version Spaces
A more Compact Representation
of Version Spaces
• It is intuitively plausible that we can represent the version space in
terms of its most specific and most general members.
• Here are the definitions for the boundary sets G and S.
CANDIDATE-ELIMINATION
Learning Algorithm
An Illustrative Example
Linear Discriminants
• Perceptron
• Linear Separability
• Linear Regression
Perceptron
The Perceptron is nothing more than a collection of McCulloch and Pitts neurons together
with a set of inputs and some weights to fasten the inputs to the neurons.
Perceptron
• In the McCulloch and Pitts neuron the weights were labelled wi, but
here we also need to note which neuron each input feeds into, so the
weight connecting input i to neuron j is labelled wij.
• We use the following equations for each neuron j:

hj = Σi wij xi and yj = g(hj) = 1 if hj > 0, else 0
Perceptron
• The output pattern is compared to the target values.
• For a neuron that is correct, we are happy, but any neuron that fired
when it shouldn’t have done, or failed to fire when it should, needs to
have its weights changed.
• The weights are changed as follows:

wij ← wij − η (yj − tj) xi
Perceptron
• The learning rate
• The bias input
• The perceptron learning algorithm
Learning Rate
• The learning rate η controls how much the weights change by.
• We could leave it out, which would be the same as setting it to 1.
• In that case the weights change a lot whenever there is a wrong answer,
which tends to make the network unstable, so that it never settles down.
• The cost of having a small learning rate is that the weights need to see
the inputs more often before they change significantly, so that the
network takes longer to learn.
• It will be more stable and resistant to noise (errors) and inaccuracies in
the data.
• We use a moderate learning rate, typically 0.1 < η < 0.4, depending upon
how much error we expect in the inputs.
The Bias Input
• Each neuron is given a firing threshold that determines what value it
needs to see before it fires.
• If we wanted one neuron to fire when all the inputs to the network
were zero, and another not to fire, then we would have a problem.
• To handle this, the input associated with the threshold is fixed at −1.
• The value of the weight will change to make the neuron fire or not
fire, whichever is correct when an input of all zeros is given, since the
input on the threshold is always -1, even when all the other inputs are
zero.
• This input is called a bias node.
The Perceptron Learning
Algorithm
An Example of Perceptron
Learning: Logic Functions
• The example we are going to use is something very simple: the logical
OR, with learning rate η = 0.25.
• We pick initial weights w0 = −0.05, w1 = −0.02, w2 = 0.02.
• For the input (0, 0): (−1 × −0.05) + (0 × −0.02) + (0 × 0.02) = 0.05 > 0,
so the neuron fires even though the target is 0, and the weights must
be updated.
An Example of Perceptron
Learning: Logic Functions
Implementation
• To implement a perceptron we need to design some data structures to hold
the variables, then write and test the program.
• Data structures are usually very basic for machine learning algorithms; here
we need an array to hold the inputs, another to hold the weights, and then
two more for the outputs and the targets.
• We need to present the data to the neural network in the form of input
vectors.
• The vector is a list of values that are presented to the Perceptron, with one
value for each of the nodes in the network.
• It is normal to arrange the data into a two-dimensional array, with each row of
the array being a datapoint.
• Let us look at the recall implementation which is used after training.
Implementation
• Python’s numerical library NumPy provides an alternative method, because
it can easily multiply arrays and matrices together.
• This means that we can write the code with fewer loops, making it rather
easier to read, and also means that we write less code.
• In the computer, matrices are represented as two-dimensional arrays.
• We can write the set of weights for the network in a matrix by making an
np.array that has m + 1 rows (the number of input nodes + 1 for the bias)
and n columns (the number of neurons).
• The element of the matrix at location (i,j) contains the weight connecting
input i to neuron j, which is what we had in the code above.
Implementation
• If we have matrices A and B, where A is of size m × n, then to compute
AB the size of B needs to be n × p, where p can be any number.

• NumPy can do this multiplication for us, using the np.dot() function.
• The np.array() function makes the NumPy array, which is actually a
matrix here, made up of an array of arrays: each row is a separate array.
Implementation
• We can put the input vectors into a two-dimensional array of size N×m ,
where N is the number of input vectors we have and m is the number of
inputs.
• The weights array is of size m×n , so we can multiply them together.
• The output will be an N×n matrix that holds the values of the sum that each
neuron computes for each of the N input vectors.
• Now we just need to compute the activations based on these sums.
• NumPy has another useful function for us here, which is
np.where(condition,x,y), (condition is a logical condition and x and y are
values) that returns a matrix that has value x where condition is true and
value y everywhere else.
Implementation
• Now the code for recall section of the implementation becomes:
# Compute activations
activations = np.dot(inputs,self.weights)
# Threshold the activations
return np.where(activations>0,1,0)
Implementation
• The first part of the training algorithm is the same as the recall
computation, so we can put them into a shared function (pcnfwd).
• We now just need to compute the weight updates.
• The weights are in an m×n matrix, the activations are in an N×n matrix
(as are the targets) and the inputs are in an N×m matrix.
• To do the multiplication np.dot(inputs,targets-activations) we need to
turn the inputs matrix around so that it is m × N.
• This is done using the np.transpose() function, which swaps the rows
and columns over.
• np.transpose(a)
Implementation
• The weight update for the entire network can be done in one line.
• weights-= eta*np.dot(np.transpose(inputs),activations-targets)
• Now we need to add those extra −1’s onto the input vectors for the bias
node, and to decide what values we should put into the weights to start
with.
• The first of these can be done using the np.concatenate() function,
making an N×1 array that contains −1 in every element, and adding it
onto the inputs array:
• inputs_bias = np.concatenate((inputs,-np.ones((np.shape(inputs)[0],1))),
axis=1)
Implementation
• The last thing we need to do is to give initial values to the weights.
• It is possible to set them all to be zero, and the algorithm will get to the
right answer.
• Instead we will assign small random numbers to the weights.
• NumPy has a nice way to do this, using the built-in random number
generator.
• weights=np.random.rand(nIn+1,nOut)*0.1-0.05
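Pulling the fragments above together, here is a minimal self-contained sketch of the whole Perceptron (an illustration, not the book's pcn module; the class and method names are assumptions):

import numpy as np

class Perceptron:
    def __init__(self, n_in, n_out):
        # Small random initial weights, with one extra row for the bias input.
        self.weights = np.random.rand(n_in + 1, n_out) * 0.1 - 0.05

    def _add_bias(self, inputs):
        # Append the constant -1 bias input to every input vector.
        return np.concatenate((inputs, -np.ones((inputs.shape[0], 1))), axis=1)

    def forward(self, inputs):
        # Recall: weighted sums, then threshold the activations at zero.
        activations = np.dot(self._add_bias(inputs), self.weights)
        return np.where(activations > 0, 1, 0)

    def train(self, inputs, targets, eta=0.25, n_iter=20):
        x = self._add_bias(inputs)
        for _ in range(n_iter):
            activations = np.where(np.dot(x, self.weights) > 0, 1, 0)
            # Batch form of the Perceptron update rule from the text.
            self.weights -= eta * np.dot(x.T, activations - targets)

# Logical OR is linearly separable, so the Perceptron should converge.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([[0], [1], [1], [1]])
p = Perceptron(2, 1)
p.train(X, t)
print(p.forward(X).ravel())  # should converge to [0 1 1 1]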
Linear Separability
• The Perceptron Convergence Theorem
• The Exclusive OR function
• A Useful Insight
• Another Example: The Pima Indian Dataset
• Preprocessing: Data Preparation
Linear Separability
• The Perceptron tries to find a straight line in 2D, a plane in 3D, or a
hyperplane in higher dimensions, such that the neuron fires on one
side of it and does not fire on the other.
• This line is called the decision boundary or discriminant function.
Linear Separability
• Let us consider just one input vector x.
• The neuron fires if x·wᵀ ≥ 0 (where w is the row of the weight matrix W
that connects the inputs to one particular neuron).
• The inner product a·b is computed as ||a|| ||b|| cos θ, where θ is the
angle between a and b.
• It can be computed using np.inner().
• The boundary case is where we find an input vector x1 with x1·wᵀ = 0.
• Let us assume there is another vector x2 that also satisfies x2·wᵀ = 0.
Linear Separability
• x1·wᵀ = x2·wᵀ = 0, so (x1 − x2)·wᵀ = 0.
• For this inner product to be zero, cos θ = 0, which means θ = π/2 or −π/2.
• x1 − x2 is a vector lying along the decision boundary (the line between
two points on it), so the weight vector must be perpendicular to the
decision boundary.
Linear Separability
• Given some data, and the associated target outputs, the Perceptron
simply tries to find a straight line that divides the examples where
each neuron fires from those where it does not.
• The cases where there is a straight line are called linearly separable
cases.
• Now let us see what happens when we have more than one output
neuron.
• The weights for each neuron separately describe a straight line, so by
putting together several neurons we get several straight lines that
each try to separate different parts of the space.
Linear Separability
The Perceptron Convergence
Theorem
• Given a linearly separable dataset, the Perceptron will converge to a
solution that separates the classes, and it will do so after a finite
number of iterations.
• The number of iterations is bounded by 1/γ², where γ is the distance
between the separating hyperplane and the datapoint closest to it.
The Exclusive OR Function
• The XOR has the same four input points as the OR function but it is very
clear that we cannot draw a straight line on the graph that separates true
from false.
• The XOR function is not linearly separable.
• The Perceptron therefore fails to get the correct answer using the
perceptron code we have written.
A Useful Insight
• To solve the XOR problem we need to rewrite it in three dimensions.
• If the inputs are embedded in 3D as in the figure, the four points become
linearly separable and the XOR problem can be solved (see the sketch below).
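To make this concrete, here is a small check; the choice of third coordinate (the product x1 × x2) and the separating plane are assumptions for illustration, not necessarily the construction in the figure:

import numpy as np

# XOR inputs in 2D, then embedded in 3D by adding x3 = x1 * x2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
X3 = np.column_stack((X, X[:, 0] * X[:, 1]))

# In 3D, the plane x1 + x2 - 2*x3 = 0.5 separates the two XOR classes.
outputs = np.where(X3 @ np.array([1.0, 1.0, -2.0]) - 0.5 > 0, 1, 0)
print(outputs)  # [0 1 1 0], matching the XOR truth table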
Another Example: The Pima
Indian Dataset
• The UCI Machine Learning Repository holds lots of datasets that are
used to demonstrate and test machine learning algorithms.
• It provides eight measurements of a group of American Pima Indians
(Pima) living in Arizona in the USA, and the classification is whether or
not each person had diabetes.
• import pylab as pl
• import numpy as np
• import pcn
• pima = np.genfromtxt('pima-indians-diabetes.data', delimiter=',')
• np.shape(pima)
Another Example: The Pima
Indian Dataset
• There are eight dimensions of data, with the class being the ninth
element of each line.
• In order to see the two different classes in the data in your plot, we
need to work out how to use the np.where command.
• indices0 = np.where(pima[:,8]==0)
• indices1 = np.where(pima[:,8]==1)
• We can plot any two-dimensional subset of the data.
• pl.plot(pima[indices0,0],pima[indices0,1],'go')
• pl.plot(pima[indices1,0],pima[indices1,1],'rx')
• pl.show()
Another Example: The Pima
Indian Dataset
• p =pcn.pcn(pima[:,:8],pima[:,8:9])
• p.pcntrain(pima[:,:8],pima[:,8:9],0.25,100)
• trainin = pima[::2,:8]
• testin = pima[1::2,:8]
• traintgt = pima[::2,8:9]
• testtgt = pima[1::2,8:9]
Preprocessing: Data
Preparation
• Machine learning algorithms tend to learn much more effectively if the
inputs and targets are prepared for analysis before the network is
trained.
• The neurons that we are using give outputs of 0 and 1, and so if the
target values are not 0 and 1, then they should be transformed so that
they are 0 and 1.
• The most common approach to scaling the input data is to treat each
data dimension independently.
• We either make each dimension have zero mean and unit variance, or
simply scale the values so that the maximum is 1 and the minimum is −1.
Preprocessing: Data
Preparation
• These scalings are commonly referred to as data normalisation, or
sometimes standardisation.
• In NumPy it is very easy to perform the normalisation using the built-in
np.mean() and np.std() functions (dividing by the standard deviation is
what gives unit variance).
• As far as the axis argument is concerned, axis=0 works down the
columns and axis=1 works across the rows.
• data = (data - data.mean(axis=0))/data.std(axis=0)
• targets = (targets - targets.mean(axis=0))/targets.std(axis=0)
• It is a good idea to normalise the dataset before splitting it into training
and testing.
Preprocessing: Data
Preparation
• There is useful preprocessing that can be done by looking at the data.
• Taking the pregnancy variable first, there are relatively few subjects
who were pregnant eight or more times, so any such value is replaced
by 8.
• The age is better quantised into a set of ranges such as 21–30,
31–40, etc.:
• pima[np.where(pima[:,0]>8),0] = 8
• pima[np.where(pima[:,7]<=30),7] = 1
• pima[np.where((pima[:,7]>30) & (pima[:,7]<=40)),7] = 2
Preprocessing: Data
Preparation
• The last preprocessing step we discuss is feature selection, which is one
method of dimensionality reduction: try leaving out each feature in turn
and retraining.
• If leaving out a feature improves the results, then drop it completely
and try leaving out others as well.
• This is a simplistic way of testing for correlation between the output
and each of the features.
• We can also consider other methods of dimensionality reduction,
which produce lower-dimensional representations of the data that
still include the relevant information.
