Unit-I
V_train(b) ← V̂(Successor(b))
The Generalizer
• This module takes the training examples as input and produces the
hypothesis which is the estimate of the target function as the output.
• It generalizes from the given specific examples.
• In the case of our checkers problem this module uses the LMS algorithm,
and the output hypothesis is the function V̂ described by the learned
weights w0, w1, w2, w3, w4, w5, w6.
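The LMS step used by the Generalizer can be sketched as follows. This is a minimal illustration, assuming the linear evaluation function V̂(b) = w0 + w1·x1 + ... + w6·x6; the function names, the learning rate eta and the sample feature values are hypothetical and not taken from the slides.

```python
import numpy as np

def v_hat(weights, features):
    """Linear evaluation function: w0 + w1*x1 + ... + w6*x6."""
    return weights[0] + np.dot(weights[1:], features)

def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    error = v_train - v_hat(weights, features)
    x = np.concatenate(([1.0], features))  # x0 = 1 pairs with the constant weight w0
    return weights + eta * error * x

weights = np.zeros(7)                                 # w0..w6
features = np.array([3.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # hypothetical board features x1..x6
new_weights = lms_update(weights, features, v_train=100.0, eta=0.01)
```

Each weight moves in proportion to the prediction error and to its own feature value, so features that are zero on this board leave their weights unchanged.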
Experiment Generator
• This module takes as input the current hypothesis and produces a
new problem as output for the performance system to explore.
• In the current checkers problem the current hypothesis is the learned
evaluation function V̂ and the new problem is the checkers game with the
board’s initial state.
• Our experiment generator is a very simple module which outputs the
same board with the initial state each time.
• A better experiment generator module would create board positions
to explore particular regions of the state space.
Summary of choices in
designing the checkers learning
problem
Perspectives and Issues in
Machine Learning
• In machine learning there is a very large space of possible hypotheses
to be searched.
• The job of a machine learning algorithm is to search for the hypothesis
that best fits the observed data and any prior knowledge held by
the learner.
• In the checkers game example the LMS algorithm adjusts the weights
each time the hypothesized evaluation function predicts a value that
differs from the training value.
Issues in Machine Learning
• What algorithms exist for learning general target functions from specific
training examples? In what settings will particular algorithms converge to
the desired function, given sufficient training data? Which algorithms
perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found
to relate the confidence in learned hypotheses to the amount of training
experience and the character of the learner's hypothesis space?
• When and how does the prior knowledge held by the learner guide the
process of generalizing from examples? Can prior knowledge be helpful
even when it is only approximately correct?
• What is the best strategy for choosing a useful next training
experience?
• How does the choice of this strategy alter the complexity of the
learning problem?
• What specific functions should the system attempt to learn? Can this
process be automated?
• How can the learner automatically alter its representation to improve
its ability to represent and learn the target function?
Concept Learning and the
General to Specific Ordering
• Introduction
• A Concept Learning Task
• Concept Learning as Search
• Find-S: Finding a Maximally Specific Hypothesis
• Version Spaces and the Candidate Elimination Algorithm
• Remarks on Version Spaces and Candidate Elimination
• Inductive Bias
Introduction
• Concept learning can be thought of as a problem of searching through
a predefined space of potential hypotheses for the hypothesis that
best fits the training examples.
• Much of learning involves acquiring general concepts from specific
training examples; this is called concept learning.
• Each concept can be thought of as a Boolean-valued function defined
over a large set. (Eg: a function defined over all animals, whose value
is defined as true for birds and false for other animals).
• Concept learning is all about inferring a Boolean valued function from
training examples of its input and output.
A Concept Learning Task
• Description of Concept Learning Task
• Notation
• The Inductive Learning Hypothesis
Description of A Concept
Learning Task
• Let us consider the example task of learning the target concept “the days on which I
enjoy my favorite water sport”. The above table describes a few examples.
• Let us look at the hypothesis representation for the learner.
• Here we have a very simple representation in which each hypothesis
consists of a conjunction of constraints on the instance attributes.
• For each attribute the hypothesis will either:
• indicate by a “?” that any value is acceptable for this attribute,
• indicate by a “∅” that no value is acceptable, or
• specify a single required value for the attribute.
• For example: (rainy,?,high,?,?,?), (?,?,?,?,?,?) and (∅,∅,∅,∅,∅,∅)
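The hypothesis representation above can be sketched in a few lines. This is only an illustration: the names ANY, NONE and satisfies are hypothetical, Python's None stands in for the ∅ constraint, and the sample instance is made up.

```python
ANY = "?"      # any value is acceptable for this attribute
NONE = None    # stands in for the "no value is acceptable" constraint

def satisfies(hypothesis, instance):
    """Return True if the instance meets every attribute constraint."""
    return all(
        c is not NONE and (c == ANY or c == v)
        for c, v in zip(hypothesis, instance)
    )

# Hypothetical instance with six attribute values
x = ("rainy", "cold", "high", "strong", "warm", "change")
h = ("rainy", ANY, "high", ANY, ANY, ANY)

print(satisfies(h, x))            # True: both required values match
print(satisfies((ANY,) * 6, x))   # True: the most general hypothesis accepts everything
print(satisfies((NONE,) * 6, x))  # False: the most specific hypothesis accepts nothing
```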
• Any concept learning task can be described by:
• The set of instances over which the target function is defined (X)
• The target function (c)
• The set of candidate hypotheses considered by the learner (H), and
• The set of available training examples (D)
Notation
The Inductive Learning
Hypothesis
• The learning task in the previous example is to determine a hypothesis h
identical to the target concept c over the entire set of instances X.
• The information available about c is only its value over the training
examples.
• The inductive learning algorithms can at best guarantee that the output
hypothesis fits the target concept over the training data.
• The inductive learning hypothesis states that any hypothesis found to
approximate the target function well over a sufficiently large set of
training examples will also approximate the target function well over
other unobserved examples.
Concept Learning as Search
• Concept Learning as Search (Description)
• General to Specific Ordering of Hypotheses
Description of Concept Learning
as Search
• Concept learning is the task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation.
• The goal is to find the hypothesis that best fits the training examples.
• It is the selection of the hypothesis representation that defines the space
of all hypotheses the program can ever represent and therefore can ever
learn.
• Let us consider the set of instances X and hypotheses H in the previous
example of EnjoySport.
• The instance space contains a total of 3*2*2*2*2*2 = 96 distinct instances
and the hypothesis space contains a total of 5*4*4*4*4*4 = 5120
syntactically distinct hypotheses.
• Every hypothesis containing one or more “∅” represents the empty set of
instances, that is, it classifies every instance as negative.
• The number of semantically distinct hypotheses is therefore
1+(4*3*3*3*3*3) = 973.
• If learning is viewed as a search problem, the study of learning algorithms
examines different strategies for searching the hypothesis space.
• We are particularly interested in algorithms capable of efficiently
searching very large or infinite hypothesis spaces to find the
hypotheses that best fit the training data.
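The three counts quoted for EnjoySport can be checked with a short calculation. The sketch below assumes only the attribute sizes implied by the product 3*2*2*2*2*2 (one attribute with three values, five with two); the variable names are illustrative.

```python
from math import prod

values_per_attribute = [3, 2, 2, 2, 2, 2]

# Every combination of attribute values is one instance.
n_instances = prod(values_per_attribute)

# Syntactically, each attribute may also be "?" or the empty constraint: +2 choices.
n_syntactic = prod(v + 2 for v in values_per_attribute)

# Semantically, all hypotheses containing an empty constraint collapse into one
# (the empty concept); the others allow "?" or a specific value: +1 choice each.
n_semantic = 1 + prod(v + 1 for v in values_per_attribute)

print(n_instances, n_syntactic, n_semantic)  # 96 5120 973
```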
General to Specific Ordering of
Hypotheses
• There is a naturally occurring structure that exists for any concept
learning problem: a general-to-specific ordering of hypotheses.
• This structure lets us design learning algorithms that exhaustively
search even infinite hypothesis spaces without explicitly enumerating
every hypothesis.
• Let us consider two hypotheses:
• h1=(sunny,?,?,strong,?,?)
• h2=(sunny,?,?,?,?,?)
• h2 imposes fewer constraints on the instances and therefore classifies
more instances as positive. We can say that h2 is more general than h1.
• The “more general than” relationship between hypotheses can be defined
as follows:
• For any instance x in X and hypothesis h in H, x satisfies h iff h(x)=1.
• Definition of “more general than or equal to”: hj >=g hk iff every
instance that satisfies hk also satisfies hj, that is, for all x in X,
hk(x)=1 implies hj(x)=1.
• We say that hj is strictly more general than hk, written hj >g hk, iff
(hj >=g hk) and not (hk >=g hj).
• In that case hk is more specific than hj and hj is more general than hk.
• The relation >=g defines a partial order over the hypothesis space (the
relation is reflexive, antisymmetric and transitive).
• The >=g relation is important because it provides a useful structure
over the hypothesis space H for any concept learning problem.
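For conjunctive hypotheses the >=g test can be carried out attribute by attribute. The following is a minimal sketch under that assumption; the function names are hypothetical, None stands in for the ∅ constraint, and h3 is an extra hypothesis added only to show that some pairs are not comparable.

```python
ANY, NONE = "?", None

def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance that satisfies hk also satisfies hj."""
    if NONE in hk:
        return True  # hk covers no instance, so the condition holds trivially
    return all(cj == ANY or cj == ck for cj, ck in zip(hj, hk))

def more_general(hj, hk):
    """Strict version: hj >=g hk but not hk >=g hj."""
    return more_general_or_equal(hj, hk) and not more_general_or_equal(hk, hj)

h1 = ("sunny", ANY, ANY, "strong", ANY, ANY)
h2 = ("sunny", ANY, ANY, ANY, ANY, ANY)
h3 = (ANY, "warm", ANY, ANY, ANY, ANY)   # hypothetical third hypothesis

print(more_general(h2, h1))  # True: h2 drops the constraint on the fourth attribute
print(more_general(h1, h2))  # False
print(more_general_or_equal(h1, h3), more_general_or_equal(h3, h1))  # False False
```

The last line shows why the ordering is only partial: neither h1 nor h3 is at least as general as the other.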
Find-S: Finding A Maximally
Specific Hypothesis
• One way to use the “more general than” partial ordering to
organize the search for the best hypothesis is to begin with the most
specific possible hypothesis in H, then generalize it each time it fails to
cover an observed positive training example.
• Find-S is an algorithm which uses this partial ordering very effectively.
• The first step of Find-S is to initialize h to the most specific hypothesis in H (for
the example EnjoySport).
• h ← (∅,∅,∅,∅,∅,∅)
• After observing the first training example, which is positive, the algorithm
replaces each ∅ with the attribute value that fits the example.
• h ← (sunny,warm,normal,strong,warm,same)
• After observing the second training example the algorithm finds a more
general constraint.
• h ← (sunny,warm,?,strong,warm,same)
• The third training example is ignored because the Find-S algorithm ignores
every negative example.
• The fourth example leads to a further generalization of h.
• h ← (sunny,warm,?,strong,?,?)
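The trace above can be reproduced with a short sketch of Find-S. The four training examples below are an assumption: they are reconstructed to match the trace, since the training table itself does not appear in this text. None stands in for the ∅ constraint and the function name is illustrative.

```python
NONE, ANY = None, "?"

def find_s(examples):
    """Return the maximally specific conjunctive hypothesis covering all positives."""
    h = [NONE] * len(examples[0][0])
    for x, is_positive in examples:
        if not is_positive:
            continue                  # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is NONE:
                h[i] = value          # first positive example: copy its values
            elif h[i] != value:
                h[i] = ANY            # conflicting values: generalize to "?"
    return tuple(h)

# Assumed training data, chosen to be consistent with the trace
examples = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]

print(find_s(examples))  # ('sunny', 'warm', '?', 'strong', '?', '?')
```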
• The Find-S algorithm illustrates one way in which the “more general
than” partial ordering can be used to organize the search for an
acceptable hypothesis.
• There are a few questions left unanswered by the Find-S algorithm
such as:
• Has the learner converged to the correct target concept? (Find-S cannot determine
whether it has found the only hypothesis consistent with the data.)
• Why prefer the most specific hypothesis?
• Are the training examples consistent? (Errors in the training examples can
mislead Find-S, because it ignores negative examples.)
• What if there are several maximally specific consistent hypotheses? (Find-S has no
facility to backtrack.)
Version Spaces and the
CANDIDATE-ELIMINATION
Algorithm
• Introduction
• Representation
• The List-Then-Eliminate Algorithm
• A more compact representation for version spaces
• CANDIDATE-ELIMINATION Learning Algorithm
• An Illustrative Example
Version Spaces and the
CANDIDATE-ELIMINATION
Algorithm (Introduction)
• A second approach to concept learning is the CANDIDATE-ELIMINATION
algorithm, which addresses several limitations of the Find-S algorithm.
• The idea in CANDIDATE-ELIMINATION algorithm is to output a
description of the set of all hypotheses consistent with the training
examples without explicitly enumerating all of its members.
• Both FIND-S and CANDIDATE-ELIMINATION algorithms perform poorly
when given noisy training data.
• CANDIDATE-ELIMINATION provides a useful conceptual framework for
introducing several fundamental issues in machine learning.
Representation
• The CANDIDATE-ELIMINATION algorithm finds all describable
hypotheses that are consistent with the observed training examples.
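• Let us first define what it means for a hypothesis to be consistent with
the observed training examples.
• A hypothesis h is consistent with a set of training examples D iff
h(x)=c(x) for each example (x, c(x)) in D.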
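To make "the set of all hypotheses consistent with the training examples" concrete, the sketch below simply enumerates every conjunctive hypothesis for EnjoySport and keeps the consistent ones. This is a brute-force illustration, not the CANDIDATE-ELIMINATION algorithm itself. The attribute values and the four training examples are assumptions (the training table is not part of this text), and all names are hypothetical.

```python
from itertools import product

ANY = "?"
# Assumed attribute domains: one attribute with three values, five with two
DOMAINS = [
    ("sunny", "cloudy", "rainy"),
    ("warm", "cold"),
    ("normal", "high"),
    ("strong", "weak"),
    ("warm", "cool"),
    ("same", "change"),
]

def predict(h, x):
    """h(x) for a conjunctive hypothesis without empty constraints."""
    return all(c == ANY or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """h is consistent with D iff h(x) == c(x) for every example in D."""
    return all(predict(h, x) == label for x, label in examples)

def version_space(examples):
    # The all-empty hypothesis is skipped: it labels everything negative, so it
    # cannot be consistent once the data contain a positive example.
    candidates = product(*[(ANY,) + d for d in DOMAINS])
    return [h for h in candidates if consistent(h, examples)]

examples = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), True),
    (("sunny", "warm", "high", "strong", "warm", "same"), True),
    (("rainy", "cold", "high", "strong", "warm", "change"), False),
    (("sunny", "warm", "high", "strong", "cool", "change"), True),
]

vs = version_space(examples)
print(len(vs))  # 6
```

Enumeration works here only because the hypothesis space is tiny (972 candidates); describing the same set without listing its members is the point of the more compact representation.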
Perceptron
• The output pattern is compared to the target values.
• For a neuron that is correct, we are happy, but any neuron that fired
when it shouldn’t have done, or failed to fire when it should, needs to
have its weights changed.
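• The weights will be changed as follows:
wij ← wij − η (yj − tj) xi
where yj is the output of neuron j, tj is its target, xi is input i and η is
the learning rate.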
• The learning rate
• The bias input
• The perceptron learning algorithm
Learning Rate
• The weight update rule tells us how to change the weights, with the parameter
η (the learning rate) controlling how much to change the weights by.
• We could miss it out, which would be the same as setting it to 1.
• If we do, the weights change a lot whenever there is a wrong answer,
which tends to make the network unstable, so that it never settles down.
• The cost of having a small learning rate is that the weights need to see
the inputs more often before they change significantly, so that the
network takes longer to learn.
• In return, the network will be more stable and resistant to noise (errors)
and inaccuracies in the data.
• We use a moderate learning rate, typically 0.1 < η < 0.4, depending upon
how much error we expect in the inputs.
The Bias Input
• Each neuron is given a firing threshold that determines what value it
needs before it should fire.
• If we wanted one neuron to fire when all the inputs to the network
were zero, and another not to fire, then we would have a problem: with
all-zero inputs the weights have no effect, so the threshold itself has to
be adjustable.
• To make it adjustable, the threshold is fixed at zero and each neuron is
given an extra input whose value is always -1, connected by its own weight.
• The value of this weight will change to make the neuron fire or not
fire, whichever is correct when an input of all zeros is given, since the
extra input is always -1, even when all the other inputs are zero.
• This extra input is called a bias node.
The Perceptron Learning
Algorithm
An Example of Perceptron
Learning: Logic Functions
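• The example we are going to use is something very simple, the logical
OR, with learning rate η = 0.25.
• We pick w0 = −0.05, w1 = −0.02, w2 = 0.02.
• For the input (0, 0) the neuron computes
(−1) × (−0.05) + (−0.02) × 0 + 0.02 × 0 = 0.05 > 0, so it fires even though
the target is 0, and its weights must be updated.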
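The first step of this OR example can be worked through numerically. The sketch below assumes the update rule w ← w − η (y − t) x, with the bias input fixed at −1 and paired with w0; the variable names are illustrative.

```python
import numpy as np

eta = 0.25
weights = np.array([-0.05, -0.02, 0.02])   # w0 (bias weight), w1, w2
x = np.array([-1.0, 0.0, 0.0])             # bias input -1, then the inputs (0, 0)
target = 0                                 # OR(0, 0) = 0

activation = np.dot(weights, x)            # (-1)(-0.05) + 0 + 0 = 0.05
output = 1 if activation > 0 else 0        # the neuron fires, which is wrong here

# Only w0 changes, because the other two inputs are zero.
weights_new = weights - eta * (output - target) * x
print(activation, output, weights_new)
```

After this step w0 becomes 0.2, so the same input now gives −0.2 and the neuron no longer fires for (0, 0).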
Implementation
• To implement a perceptron we need to design some data structures to hold
the variables, then write and test the program.
• Data structures are usually very basic for machine learning algorithms; here
we need an array to hold the inputs, another to hold the weights, and then
two more for the outputs and the targets.
• We need to present the data to the neural network in the form of input
vectors.
• The vector is a list of values that are presented to the Perceptron, with one
value for each of the nodes in the network.
• It is normal to arrange the data into a two-dimensional array, with each row of
the array being a datapoint.
• Let us look at the recall implementation which is used after training.
• Python’s numerical library NumPy provides an alternative method, because
it can easily multiply arrays and matrices together.
• This means that we can write the code with fewer loops, making it rather
easier to read, and also means that we write less code.
• In a computer, matrices are represented as two-dimensional arrays.
• We can write the set of weights for the network in a matrix by making an
np.array that has m + 1 rows (the number of input nodes + 1 for the bias)
and n columns (the number of neurons).
• The element of the matrix at location (i,j) contains the weight connecting
input i to neuron j, which is what we had in the code above.
• To compute the product AB of matrices A and B, where A is of size m × n,
B needs to be of size n × p, where p can be any number; the result is of
size m × p.
• NumPy can do this multiplication for us, using the np.dot() function.
• The np.array() function makes the NumPy array, which is actually a
matrix here, made up of an array of arrays: each row is a separate array.
• We can put the input vectors into a two-dimensional array of size N×m ,
where N is the number of input vectors we have and m is the number of
inputs.
• The weights array is of size m×n (with the bias input counted among the
m inputs), so we can multiply them together.
• The output will be an N×n matrix that holds the values of the sum that each
neuron computes for each of the N input vectors.
• Now we just need to compute the activations based on these sums.
• NumPy has another useful function for us here, which is
np.where(condition,x,y), (condition is a logical condition and x and y are
values) that returns a matrix that has value x where condition is true and
value y everywhere else.
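The two NumPy steps described above, multiplying the inputs by the weights and thresholding the sums, can be tried on a small example. The weight values below are made up for illustration, and the last input column is the bias input of −1.

```python
import numpy as np

# N = 4 input vectors, each with 2 inputs plus the bias input (-1)
inputs = np.array([[0, 0, -1],
                   [0, 1, -1],
                   [1, 0, -1],
                   [1, 1, -1]])

# 3 x 1 weight matrix: one column per neuron (hypothetical values)
weights = np.array([[0.5],
                    [0.5],
                    [0.2]])

sums = np.dot(inputs, weights)          # shape (4, 1): one sum per input vector
outputs = np.where(sums > 0, 1, 0)      # 1 where the neuron fires, 0 elsewhere

print(sums.ravel())     # [-0.2  0.3  0.3  0.8]
print(outputs.ravel())  # [0 1 1 1]
```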
• Now the code for recall section of the implementation becomes:
# Compute activations
activations = np.dot(inputs,self.weights)
# Threshold the activations
return np.where(activations>0,1,0)
• The first part of the training algorithm is the same as the recall
computation, so we can put them into a function (pcnfwd).
• We now just need to compute the weight updates.
• The weights are in an m×n matrix, the activations are in an N×n matrix
(as are the targets) and the inputs are in an N×m matrix.
• To do the multiplication np.dot(inputs,activations-targets) we need to
turn the inputs matrix around so that it is m × N.
• This is done using the np.transpose() function, which swaps the rows
and columns over.
• np.transpose(a)
• The weight update for the entire network can be done in one line.
• weights-= eta*np.dot(np.transpose(inputs),activations-targets)
• Now we need to add those extra −1’s onto the input vectors for the bias
node, and to decide what values we should put into the weights to start
with.
• The first of these can be done using the np.concatenate() function,
making a single-column array (of size N × 1) that contains -1 as all of its
elements, and adding it on to the inputs array.
• inputs_bias = np.concatenate((inputs,-np.ones((np.shape(inputs)[0],1))),
axis=1)
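The effect of the concatenation is easiest to see on a small array. The inputs below are the four points of a two-input logic function, used only for illustration.

```python
import numpy as np

inputs = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

# One -1 per input vector, shaped N x 1 so it can be appended as a new column
bias_column = -np.ones((np.shape(inputs)[0], 1))
inputs_bias = np.concatenate((inputs, bias_column), axis=1)

print(inputs_bias.shape)  # (4, 3)
print(inputs_bias[:, -1]) # [-1. -1. -1. -1.]
```

With axis=1 the arrays are joined side by side, so the number of rows stays N and the bias input becomes the last column.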
• The last thing we need to do is to give initial values to the weights.
• It is possible to set them all to be zero, and the algorithm will get to the
right answer.
• Instead we will assign small random numbers to the weights.
• NumPy has a nice way to do this, using the built-in random number
generator.
• weights=np.random.rand(nIn+1,nOut)*0.1-0.05
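Putting the pieces together, a batch Perceptron can be sketched as below. This is an illustrative reconstruction from the steps described in these slides, not the code of the pcn module: the function signatures, the optional weights argument and the seed are assumptions. The starting weights are passed in explicitly so that the run is repeatable.

```python
import numpy as np

def pcnfwd(inputs_bias, weights):
    """Recall: weighted sums followed by the 0/1 threshold."""
    return np.where(np.dot(inputs_bias, weights) > 0, 1, 0)

def pcntrain(inputs, targets, eta, n_iterations, weights=None, seed=0):
    """Batch perceptron training; returns the learned (m+1) x n weight matrix."""
    n_in, n_out = inputs.shape[1], targets.shape[1]
    if weights is None:
        rng = np.random.default_rng(seed)
        weights = rng.random((n_in + 1, n_out)) * 0.1 - 0.05  # small random values
    weights = np.array(weights, dtype=float)                  # work on a copy
    inputs_bias = np.concatenate((inputs, -np.ones((inputs.shape[0], 1))), axis=1)
    for _ in range(n_iterations):
        activations = pcnfwd(inputs_bias, weights)
        weights -= eta * np.dot(np.transpose(inputs_bias), activations - targets)
    return weights

# Logical OR, starting from w1 = -0.02, w2 = 0.02 and bias weight -0.05
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [1]])
start = np.array([[-0.02], [0.02], [-0.05]])

weights = pcntrain(inputs, targets, eta=0.25, n_iterations=10, weights=start)
outputs = pcnfwd(np.concatenate((inputs, -np.ones((4, 1))), axis=1), weights)
print(outputs.ravel())  # [0 1 1 1]
```

Because the bias column is appended last, the bias weight is the last row of the weight matrix here.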
Linear Separability
• The Perceptron Convergence Theorem
• The Exclusive OR function
• A Useful Insight
• Another Example: The Pima Indian Dataset
• Preprocessing: Data Preparation
Linear Separability
• The perceptron actually tries to find a straight line in 2d, a plane in 3d
and a hyperplane in higher dimensions where the neuron fires on one
side and doesn’t on the other.
• This line is called the decision boundary or discriminant function.
• Let us consider just one input vector x.
• The neuron fires if x·wT >= 0, where w is the vector of weights that
connects the inputs to one particular neuron.
• The inner product a·b is computed as ||a|| ||b|| cos θ, where θ is the
angle between a and b.
• It can be computed using np.inner().
• The boundary case is the case where we find an input vector x1 that
has x1·wT = 0.
• Let us assume there is one more vector x2 that satisfies x2·wT = 0.
• x1·wT = x2·wT
• (x1 − x2)·wT = 0
• For the inner product of two nonzero vectors to be zero we need
cos θ = 0, which means θ = π/2 or −π/2.
• x1 − x2 is a straight line between two points that lie on the decision
boundary, so the weight vector must be perpendicular to the boundary.
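The perpendicularity argument can be checked numerically with np.inner(). The weight vector and the two boundary points below are made-up values chosen so that both points satisfy x·w = 0.

```python
import numpy as np

w = np.array([2.0, 1.0])     # hypothetical weight vector
x1 = np.array([1.0, -2.0])   # x1 . w = 2 - 2 = 0, so x1 lies on the boundary
x2 = np.array([-2.0, 4.0])   # x2 . w = -4 + 4 = 0, so x2 lies on the boundary

d = x1 - x2                  # direction along the decision boundary
cos_theta = np.inner(d, w) / (np.linalg.norm(d) * np.linalg.norm(w))
theta = np.arccos(cos_theta)

print(np.inner(x1, w), np.inner(x2, w), np.inner(d, w))  # 0.0 0.0 0.0
print(np.isclose(theta, np.pi / 2))                      # True
```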
• Given some data, and the associated target outputs, the Perceptron
simply tries to find a straight line that divides the examples where
each neuron fires from those where it does not.
• The cases where there is a straight line are called linearly separable
cases.
• Now let us see what happens when we have more than one output
neuron.
• The weights for each neuron separately describe a straight line, so by
putting together several neurons we get several straight lines that
each try to separate different parts of the space.
The Perceptron Convergence
Theorem
• Given a linearly separable dataset, the Perceptron will converge to a
solution that separates the classes, and it will do so after a finite
number of iterations.
• The number of iterations is bounded by 1/γ², where γ is the distance
between the separating hyperplane and the closest datapoint to it
(assuming the input vectors are scaled to have length at most 1).
The Exclusive OR Function
• The XOR has the same four input points as the OR function but it is very
clear that we cannot draw a straight line on the graph that separates true
from false.
• The XOR function is not linearly separable.
• The perceptron will therefore fail to get all four answers correct,
however long the perceptron code written above is trained.
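The failure on XOR can be observed directly. The sketch below uses a compact batch perceptron loop written for this illustration (it is not the pcn module); the seed, learning rate and number of iterations are arbitrary choices. It records how many of the four outputs are wrong before each update.

```python
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])                  # XOR
inputs_bias = np.concatenate((inputs, -np.ones((4, 1))), axis=1)

rng = np.random.default_rng(0)
weights = rng.random((3, 1)) * 0.1 - 0.05                 # small random start

errors_per_iteration = []
for _ in range(50):
    outputs = np.where(np.dot(inputs_bias, weights) > 0, 1, 0)
    errors_per_iteration.append(int(np.sum(outputs != targets)))
    weights -= 0.25 * np.dot(inputs_bias.T, outputs - targets)

print(min(errors_per_iteration))  # never reaches 0
```

No choice of weights can make the error count reach zero, because no single straight line separates the XOR outputs; more iterations do not help.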
A Useful Insight
• To solve the problem of XOR we need to rewrite the problem in three
dimensions.
• If the inputs are given as in the figure the XOR problem can be solved.
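One way to see why a third dimension helps is sketched below. The choice of third input (the product of the two original inputs) and the hand-picked weights are assumptions made for this illustration; they are not necessarily the encoding shown in the figure.

```python
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([0, 1, 1, 0])                          # XOR

# Assumed third input: x3 = x1 * x2, which lifts the point (1, 1) out of the plane
third = inputs[:, 0:1] * inputs[:, 1:2]
inputs_3d = np.concatenate((inputs, third, -np.ones((4, 1))), axis=1)

# Hand-picked weights for x1, x2, x3 and the bias input (-1)
weights = np.array([1.0, 1.0, -2.0, 0.5])
outputs = np.where(np.dot(inputs_3d, weights) > 0, 1, 0)

print(outputs)  # [0 1 1 0]
```

In three dimensions a single plane separates the two classes, so the problem becomes linearly separable and a Perceptron can represent it.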
Another Example: The Pima
Indian Dataset
• The UCI Machine Learning Repository holds lots of datasets that are
used to demonstrate and test machine learning algorithms.
• One of them, the Pima Indian diabetes dataset, provides eight measurements
of a group of American Pima Indians living in Arizona in the USA, and the
classification is whether or not each person had diabetes.
• import pylab as pl
• import numpy as np
• import pcn
• pima = np.genfromtxt('pima-indians-diabetes.data',delimiter=',')
• np.shape(pima)
• There are eight dimensions of data, with the class being the ninth
element of each line.
• In order to see the two different classes in the data in a plot, we
need to work out how to use the np.where command.
• indices0 = np.where(pima[:,8]==0)[0]
• indices1 = np.where(pima[:,8]==1)[0]
• We can plot any two-dimensional subset of the data.
• pl.plot(pima[indices0,0],pima[indices0,1],'go')
• pl.plot(pima[indices1,0],pima[indices1,1],'rx')
• pl.show()
• p =pcn.pcn(pima[:,:8],pima[:,8:9])
• p.pcntrain(pima[:,:8],pima[:,8:9],0.25,100)
• trainin = pima[::2,:8]
• testin = pima[1::2,:8]
• traintgt = pima[::2,8:9]
• testtgt = pima[1::2,8:9]
Preprocessing: Data
Preparation
• Machine learning algorithms tend to learn much more effectively if the
inputs and targets are prepared for analysis before the network is
trained.
• The neurons that we are using give outputs of 0 and 1, and so if the
target values are not 0 and 1, then they should be transformed so that
they are 0 and 1.
• The most common approach to scaling the input data is to treat each
data dimension independently.
• We either make each dimension have zero mean and unit variance, or
simply scale it so that the maximum value is 1 and the minimum is -1.
• These scalings are commonly referred to as data normalisation, or
sometimes standardisation.
• In NumPy it is very easy to perform the normalisation by using the
built-in np.mean() and np.std() functions; dividing by the standard
deviation is what gives each dimension unit variance.
• As far as the axis for these functions is concerned, axis=0 works down
the columns and axis=1 works across the rows.
• data = (data - data.mean(axis=0))/data.std(axis=0)
• targets = (targets - targets.mean(axis=0))/targets.std(axis=0)
• It is a good idea to normalise the dataset before splitting it into training
and testing.
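Both scalings mentioned above can be tried on a tiny made-up array with two columns on very different scales. Dividing by the standard deviation (np.std) gives each column unit variance; the second version maps each column onto the range −1 to 1.

```python
import numpy as np

# Hypothetical data: 3 datapoints, 2 dimensions
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# Zero mean and unit variance, treating each column independently (axis=0)
normalised = (data - data.mean(axis=0)) / data.std(axis=0)

# Alternative: rescale each column so its minimum is -1 and its maximum is 1
col_min, col_max = data.min(axis=0), data.max(axis=0)
scaled = 2 * (data - col_min) / (col_max - col_min) - 1

print(normalised.mean(axis=0), normalised.var(axis=0))
print(scaled)
```

After either scaling the two columns are directly comparable, even though the raw values differ by a factor of 200.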
• There is useful preprocessing that can be done by looking at the data.
• Taking the pregnancy variable first, there are relatively few subjects
that were pregnant 8 or more times, so any value above 8 is replaced
by 8.
• The age would be better quantised into a set of ranges such as 21–30,
31–40, etc.
• pima[np.where(pima[:,0]>8),0] = 8
• pima[np.where(pima[:,7]<=30),7] = 1
• pima[np.where((pima[:,7]>30) & (pima[:,7]<=40)),7] = 2
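The same capping and quantising can be tried on a few made-up rows. This sketch uses boolean masks, which select the same rows as the np.where form above; the sample values are hypothetical, with column 0 holding the number of pregnancies and column 7 the age, as in the lines above.

```python
import numpy as np

sample = np.zeros((5, 9))
sample[:, 0] = [0, 3, 8, 10, 15]      # hypothetical pregnancy counts
sample[:, 7] = [21, 30, 31, 40, 25]   # hypothetical ages

# Cap the pregnancy count at 8
sample[sample[:, 0] > 8, 0] = 8

# Quantise age into range codes; work from a copy so that codes already
# written cannot be matched again by a later range test
ages = sample[:, 7].copy()
sample[ages <= 30, 7] = 1
sample[(ages > 30) & (ages <= 40), 7] = 2

print(sample[:, 0])  # [0. 3. 8. 8. 8.]
print(sample[:, 7])  # [1. 1. 2. 2. 1.]
```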
• The last thing that we can discuss now in preprocessing is feature
selection, which is one of the methods of dimensionality reduction.
• A simple approach is to train the classifier with each feature left out
in turn and compare the results.
• If missing out one feature does improve the results, then leave it out
completely and try missing out others as well.
• This is a simplistic way of testing for correlation between the output
and each of the features.
• We can also consider other methods of dimensionality reduction,
which produce lower dimensional representations of the data that
still include the relevant information.