machine learning UNIT-I notes
3. A robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance travelled before an error (as judged
by human overseer)
• Training experience E: a sequence of images and steering commands recorded
while observing a human driver
There are three attributes that impact the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
For example, in checkers game:
In learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each.
Alternatively, it might learn from indirect training examples consisting of the move
sequences and final outcomes of various games played. In the indirect case, the
information about the correctness of specific moves early in the game
must be inferred from the fact that the game was eventually won or lost.
Here the learner faces an additional problem of credit assignment, or determining the degree
to which each move in the sequence deserves credit or blame for the final outcome.
Credit assignment can be a particularly difficult problem because the game can be lost even
when early moves are optimal, if these are followed later by poor moves. Hence, learning from
direct training feedback is typically easier than learning from indirect feedback.
2. The degree to which the learner controls the sequence of training examples
For example, in checkers game:
The learner might depend on the teacher to select informative board states and to
provide the correct move for each.
Alternatively, the learner might itself propose board states that it finds particularly
confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training
classifications, as it does when it learns by playing against itself with no teacher present.
3. How well it represents the distribution of examples over which the final system
performance P must be measured
For example, in checkers game:
In checkers learning scenario, the performance metric P is the percent of games the
system wins in the world tournament.
If its training experience E consists only of games played against itself, there is a danger
that this training experience might not be fully representative of the distribution of
situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is
somewhat different from the one on which the final system will ultimately be evaluated.
2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and
how this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board
state.
The program needs only to learn how to choose the best move from among these legal
moves. Since we must learn to choose among the legal moves, the most obvious choice for
the type of information to be learned is a program, or function, that chooses the best move
for any given board state.
1. Let ChooseMove be the target function, with the notation

ChooseMove : B → M

which indicates that this function accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M.
ChooseMove is a natural choice for the target function in the checkers example, but this
function will turn out to be very difficult to learn given the kind of indirect training
experience available to our system.
2. An alternative target function is an evaluation function that assigns a numerical score
to any given board state.
Let the target function be V, with the notation

V : B → R

which denotes that V maps any legal board state from the set B to some real value. We intend
for this target function V to assign higher scores to better board states. If the system can
successfully learn such a target function V, then it can easily use it to select the best move
from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
• If b is a final board state that is won, then V(b) = 100
• If b is a final board state that is lost, then V(b) = -100
• If b is a final board state that is drawn, then V(b) = 0
• If b is not a final state in the game, then V(b) = V(b'),
where b' is the best final board state that can be achieved starting from b and playing
optimally until the end of the game.
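The final-state cases above amount to a small lookup; the non-final case requires searching ahead to the best reachable final state, which is only noted as a comment in this sketch (the `outcome` encoding is an illustrative assumption, not from the notes):

```python
def target_value(outcome):
    """V(b) for a *final* board state b, keyed by its game outcome.
    For a non-final b, V(b) = V(b') where b' is the best final state
    reachable from b under optimal play -- computing that requires
    game-tree search and is omitted here."""
    return {"won": 100, "lost": -100, "draw": 0}[outcome]
```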
3. Choosing a Representation for the Target Function
Let’s choose a simple representation: for any given board state b, the function V̂(b)
will be calculated as a linear combination of the following board features:
• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
• x6: the number of red pieces threatened by black

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Where,
• w0 through w6 are numerical coefficients, or weights, to be chosen by
the learning algorithm.
• Learned values for the weights w1 through w6 will determine the relative
importance of the various board features in determining the value of the
board
• The weight w0 will provide an additive constant to the board value
In order to learn the target function V̂ we require a set of training examples, each
describing a specific board state b and the training value Vtrain(b) for b.
For instance, the following training example describes a board state b in which
black has won the game (note x2 = 0 indicates that red has no remaining pieces)
and for which the target function value Vtrain(b) is therefore +100:

⟨(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100⟩
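A minimal Python sketch of this linear evaluation; the weight values in the test usage are made up for illustration, and feature extraction from an actual board is assumed rather than shown:

```python
def v_hat(weights, features):
    """Linear evaluation: V-hat(b) = w0 + w1*x1 + ... + w6*x6.
    weights  -- [w0, w1, ..., w6], chosen by the learning algorithm
    features -- [x1, ..., x6], the board features of state b"""
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(ws, features))
```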
One simple rule for estimating the training value Vtrain(b) for an intermediate board
state b is to use the learner's current evaluation of b's successor:

Vtrain(b) ← V̂(Successor(b))

Where:
• V̂ is the learner's current approximation to V
• Successor(b) denotes the next board state following b for which it
is again the program's turn to move
2. Adjusting the weights

Specify the learning algorithm for choosing the weights wi to best fit the set
of training examples {(b, Vtrain(b))}.
A first step is to define what we mean by the best fit to the training data. One
common approach is to define the best hypothesis as the one that minimizes the
squared error E between the training values and the values predicted by the hypothesis:

E ≡ Σ (Vtrain(b) − V̂(b))²,  summed over the training examples (b, Vtrain(b))

Several algorithms are known for finding weights of a linear function that
minimize E. One such algorithm is called the least mean squares, or LMS
training rule. For each observed training example it adjusts the weights a
small amount in the direction that reduces the error on that training example:

For each training example (b, Vtrain(b)):
• Use the current weights to calculate V̂(b)
• For each weight wi, update it as wi ← wi + η (Vtrain(b) − V̂(b)) xi

Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.
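The LMS update step above can be sketched in a few lines of Python; the feature vector and training value in the usage are illustrative, not from the notes:

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (Vtrain(b) - Vhat(b)) * x_i.
    weights  -- [w0, ..., w6]; features -- [x1, ..., x6] for board b.
    A constant feature x0 = 1 is prepended so w0 is updated uniformly."""
    x = [1.0] + list(features)                    # x0 = 1 pairs with w0
    v_hat = sum(w * xi for w, xi in zip(weights, x))
    error = v_train - v_hat                       # Vtrain(b) - Vhat(b)
    return [w + eta * error * xi for w, xi in zip(weights, x)]
```

Starting from zero weights, a board with Vtrain(b) = 100 pulls every weight whose feature is nonzero toward a positive value, by an amount proportional to that feature.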
1. The Performance System is the module that must solve the given
performance task by using the learned target function(s). It takes an
instance of a new problem (new game) as input and produces a trace of its
solution (game history) as output.
2. The Critic takes as input the history or trace of the game and produces as
output a set of training examples of the target function.
3. The Generalizer takes as input the training examples and produces an output
hypothesis that is its estimate of the target function.
4. The Experiment Generator takes as input the current hypothesis (the currently
learned function) and outputs a new problem (i.e., an initial board state) for the
Performance System to explore.
The sequence of design choices made for the checkers program is summarized in the figure below.
1.3 PERSPECTIVES AND ISSUES IN MACHINE LEARNING:
• The LMS algorithm for fitting weights achieves this goal by iteratively tuning
the weights, adding a correction to each weight each time the hypothesized
evaluation function predicts a value that differs from the training value.
• This algorithm works well when the hypothesis representation considered by
the learner defines a continuously parameterized space of potential
hypotheses.
CONCEPT LEARNING:
2.1 INTRODUCTION:
Consider the example task of learning the target concept "Days on which Aldo
enjoys his favorite water sport."
Table 2.1 describes a set of example days, each represented by a set of attributes.
The attribute EnjoySport indicates whether or not Aldo enjoys his favorite
water sport on this day.
The task is to learn to predict the value of EnjoySport for an arbitrary day,
based on the values of its other attributes.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

Table 2.1: Positive and negative training examples for the target concept EnjoySport.
What hypothesis representation is provided to the learner?
Let each hypothesis be a vector of six constraints, one for each attribute, where each
constraint may be "?" (any value is acceptable), a single specific value (e.g., Cold), or
"Φ" (no value is acceptable). For example, the hypothesis that Aldo enjoys his favorite
sport only on cold days with high humidity is represented by the expression

(?, Cold, High, ?, ?, ?)
Notation
• The set of items over which the concept is defined is called the set of
instances, which is denoted by X.
Example: X is the set of all possible days, each represented by the attributes Sky,
AirTemp, Humidity, Wind, Water, and Forecast.
• The concept or function to be learned is called the target concept, denoted by c.
In general, c can be any Boolean-valued function defined over the instances X:

c: X → {0, 1}

Example: the target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
• Instances for which c(x) = 1 are called positive examples, or members of the target
concept.
• Instances for which c(x) = 0 are called negative examples, or non-members
of the target concept.
• The ordered pair (x, c(x)) describes the training example consisting of the
instance x and its target concept value c(x).
• The symbol D denotes the set of available training examples.
• The symbol H denotes the set of all possible hypotheses that the learner may
consider regarding the identity of the target concept. Each hypothesis h in H
represents a Boolean-valued function defined over X:

h: X → {0, 1}

The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
• Given:
• Instances X: Possible days, each described by the attributes
• Sky (with possible values Sunny, Cloudy, and Rainy),
• AirTemp (with values Warm and Cold),
• Humidity (with values Normal and High),
• Wind (with values Strong and Weak),
• Water (with values Warm and Cool),
• Forecast (with values Same and Change).
• Determine:
• A hypothesis h in H such that h(x) = c(x) for all x in X.
The inductive learning hypothesis:
Notice that although the learning task is to determine a hypothesis h identical to the
target concept c over the entire set of instances X, the only information available about c
is its value over the training examples.
Therefore, inductive learning algorithms can at best guarantee that the output
hypothesis fits the target concept over the training data. Lacking any further
information, our assumption is that the best hypothesis regarding unseen instances is
the hypothesis that best fits the observed training data.
The inductive learning hypothesis: Any hypothesis found to approximate the target
function well over a sufficiently large set of training examples will also approximate
the target function well over other unobserved examples.
Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The
attribute Sky has three possible values, and AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values. Therefore the instance space X contains exactly

3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.

Allowing each attribute constraint to also take the values "?" and "Φ", there are

5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.

However, every hypothesis containing one or more "Φ" symbols represents the same
empty set of instances; that is, it classifies every instance as negative. Counting
the empty concept once and otherwise allowing only "?" as the extra constraint value,
there are only

1 + (4 · 3 · 3 · 3 · 3 · 3) = 973 semantically distinct hypotheses.
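The three counts can be checked with a few lines of Python:

```python
from math import prod

# Values per attribute: Sky has 3 possible values; the other five have 2 each.
attribute_values = [3, 2, 2, 2, 2, 2]

# Distinct instances: the product of the per-attribute value counts.
num_instances = prod(attribute_values)                    # 3*2*2*2*2*2 = 96

# Syntactically distinct hypotheses: each constraint may also be '?' or 'Φ'.
num_syntactic = prod(v + 2 for v in attribute_values)     # 5*4*4*4*4*4 = 5120

# Semantically distinct: every hypothesis containing a 'Φ' denotes the same
# empty concept, so count it once; otherwise only '?' is the extra value.
num_semantic = 1 + prod(v + 1 for v in attribute_values)  # 1 + 4*3^5 = 973
```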
General-to-Specific Ordering of Hypotheses
Consider the two hypotheses
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• Consider the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as
positive. Any instance classified positive by h1 will also be classified
positive by h2. Therefore, h2 is more general than h1.

Definition: Given hypotheses hj and hk, hj is more_general_than_or_equal_to hk if and
only if any instance that satisfies hk also satisfies hj:

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
• In the figure, the box on the left represents the set X of all instances, and the
box on the right the set H of all hypotheses.
• Each hypothesis corresponds to some subset of X: the subset of instances
that it classifies positive.
• The arrows connecting hypotheses represent the more-general-than
relation, with the arrow pointing toward the less general hypothesis.
• Note that the subset of instances characterized by h2 subsumes the subset
characterized by h1; hence h2 is more-general-than h1.
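For conjunctive hypotheses, the more-general-than-or-equal relation can be checked constraint by constraint, as in this sketch (the two example hypotheses in the test are h1 = (Sunny, ?, ?, Strong, ?, ?) and h2 = (Sunny, ?, ?, ?, ?, ?)):

```python
def satisfies(h, x):
    """True if instance x satisfies hypothesis h: '?' matches any value,
    and a 'Φ' anywhere means h matches no instance at all."""
    if "Φ" in h:
        return False
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=g hk for conjunctive hypotheses: every constraint of hj is
    '?' or equals the corresponding constraint of hk.
    (The special case where hk contains 'Φ' is ignored for brevity.)"""
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))
```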
2.4 FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS :
How can we use the more-general-than partial ordering to organize the search for
a hypothesis consistent with the observed training examples?
One way is to begin with the most specific possible hypothesis in H, then
generalize this hypothesis each time it fails to cover an observed positive
training example.
(We say that a hypothesis "covers" a positive example if it correctly classifies the
example as positive.)
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   For each attribute constraint ai in h:
   If the constraint ai is satisfied by x, then do nothing;
   else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

• The first step of FIND-S is to initialize h to the most specific hypothesis in H:

h ← (Ø, Ø, Ø, Ø, Ø, Ø)
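A minimal sketch of FIND-S in Python, exercised on the standard EnjoySport training days (assumed here from the textbook version of Table 2.1):

```python
def find_s(examples):
    """FIND-S: start with the most specific hypothesis (Ø, ..., Ø), then
    minimally generalize on each positive example; negatives are ignored."""
    h = None                                    # None stands for (Ø, ..., Ø)
    for x, positive in examples:
        if not positive:
            continue                            # FIND-S ignores negative examples
        if h is None:
            h = list(x)                         # first positive: adopt it verbatim
        else:                                   # generalize mismatches to '?'
            h = [c if c == v else "?" for c, v in zip(h, x)]
    return tuple(h) if h is not None else None

# The four EnjoySport training days (Sky, AirTemp, Humidity, Wind, Water, Forecast).
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
```

On this data the final hypothesis is (Sunny, Warm, ?, Strong, ?, ?).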
The key property of the FIND-S algorithm
• FIND-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
• FIND-S algorithm’s final hypothesis will also be consistent with the negative
examples provided the correct target concept is contained in H, and provided
the training examples are correct.
Unanswered by FIND-S:
• Has the learner converged to the correct target concept?
• Why prefer the most specific hypothesis?
• Are the training examples consistent?
• What if there are several maximally specific consistent hypotheses?
2.5 VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM :
• One limitation of the FIND-S algorithm is that it outputs just one
hypothesis consistent with the training data – there might be many.
• To overcome this, we introduce the notion of a version space and algorithms to
compute it.
• A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
• The version space VS(H,D), with respect to hypothesis space H and training
examples D, is the subset of hypotheses from H consistent with the training
examples in D.
Representation:
One obvious way to represent the version space is simply to list all of its members.
This leads to a simple learning algorithm, which we might call the
LIST-THEN-ELIMINATE algorithm, defined in Table 2.4.
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain all hypotheses in H
and then eliminates any hypothesis found inconsistent with any training example.
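The algorithm can be sketched in Python; since listing all of H is only feasible for tiny spaces, the usage below restricts the EnjoySport attributes to just Sky and AirTemp (an illustrative simplification):

```python
from itertools import product

def list_then_eliminate(hypotheses, examples, satisfies):
    """LIST-THEN-ELIMINATE: the version space starts as all of H; any
    hypothesis inconsistent with any training example is removed."""
    version_space = list(hypotheses)
    for x, label in examples:
        version_space = [h for h in version_space if satisfies(h, x) == label]
    return version_space

# Tiny two-attribute space (Sky, AirTemp), without 'Φ' for brevity.
def sat(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

H = list(product(["Sunny", "Rainy", "?"], ["Warm", "Cold", "?"]))
D = [(("Sunny", "Warm"), True), (("Rainy", "Cold"), False)]
```

Here the surviving hypotheses are (Sunny, Warm), (Sunny, ?), and (?, Warm); the fully general (?, ?) is eliminated by the negative example.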
• The version space of candidate hypotheses thus shrinks as more examples
are observed, until ideally just one hypothesis remains that is consistent
with all the observed examples.
• In fact, this is just one of six different hypotheses from H that are consistent with
these training examples. All six hypotheses are shown in Figure 2.3.
• They constitute the version space relative to this set of data and this hypothesis
representation.
• The arrows among these six hypotheses in Figure 2.3 indicate instances of the
more-general-than relation.
• The CANDIDATE-ELIMINATION algorithm represents the version space by storing
only its most general members (labeled G in Figure 2.3) and its most specific
members (labeled S in the figure).
• Given only these two sets S and G, it is possible to enumerate all members of
the version space as needed by generating the hypotheses that lie between
these two sets in the general-to-specific partial ordering over hypotheses. It is
intuitively plausible that we can represent the version space in terms of its
most specific and most general members.
• Below we define the boundary sets G and S precisely and prove that these sets do
in fact represent the version space.
CANDIDATE-ELIMINATION Learning Algorithm
These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0. As each
training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found
inconsistent with the new training example. After all examples have been
processed, the computed version space contains all the hypotheses
consistent with these examples and only these hypotheses. This algorithm is
summarized in Table 2.5:

Initialize G to the set of maximally general hypotheses in H
Initialize S to the set of maximally specific hypotheses in H
For each training example d, do
• If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S that is not consistent with d
    • Remove s from S
    • Add to S all minimal generalizations h of s such that h is consistent
      with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent with d
    • Remove g from G
    • Add to G all minimal specializations h of g such that h is consistent
      with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G

The CANDIDATE-ELIMINATION algorithm computes the version space containing all hypotheses from
H that are consistent with an observed sequence of training examples.
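A simplified Python sketch of CANDIDATE-ELIMINATION for conjunctive hypothesis spaces, run on the standard EnjoySport data (an assumption, as the table is from the textbook). S is kept as a single hypothesis, which suffices for this space, and subsumption pruning within G is omitted for brevity:

```python
def satisfies(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(g, s):
    return all(gc == "?" or gc == sc for gc, sc in zip(g, s))

def candidate_elimination(examples, domains):
    """domains: per-attribute lists of legal values. Returns (S, G)."""
    n = len(domains)
    S = None                               # stands for the most specific (Ø, ..., Ø)
    G = [tuple(["?"] * n)]                 # the most general (?, ..., ?)
    for x, positive in examples:
        if positive:
            G = [g for g in G if satisfies(g, x)]      # drop inconsistent g
            if S is None:
                S = tuple(x)                           # first positive example
            else:                                      # minimal generalization
                S = tuple(sc if sc == xc else "?" for sc, xc in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not satisfies(g, x):                # already rules x out
                    new_G.append(g)
                    continue
                for i in range(n):                     # minimal specializations:
                    if g[i] != "?":                    # pin one '?' to a value
                        continue                       # that excludes x
                    for v in domains[i]:
                        if v == x[i]:
                            continue
                        g2 = g[:i] + (v,) + g[i + 1:]
                        if S is None or more_general_or_equal(g2, S):
                            new_G.append(g2)
            G = new_G
    return S, G

domains = [
    ["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
    ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"],
]
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
```

On this data it reaches S4 = (Sunny, Warm, ?, Strong, ?, ?) and G4 = {(Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?)}.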
The detailed implementation of these operations will depend, of course, on the
specific representations for instances and hypotheses. However, the algorithm itself
can be applied to any concept learning task and hypothesis space for which these
operations are well-defined.
In the following example trace of this algorithm, we see how such operations can
be implemented for the representations used in the EnjoySport example problem.
An Illustrative Example
• As illustrated by these first two steps, positive training examples may force
the S boundary of the version space to become increasingly general.
Negative training examples play the complementary role of forcing the G
boundary to become increasingly specific.
• Consider the third training example, shown in Figure 2.5. This negative
example reveals that the G boundary of the version space is overly
general; that is, the hypothesis in G incorrectly predicts that this new
example is a positive example.
• The hypothesis in the G boundary must therefore be specialized until it
correctly classifies this new negative example.
• When the second training example is observed, it has a similar effect
of generalizing S further to S2, leaving G again unchanged, i.e., G2 =
G1 = G0.
• Consider the third training example. This negative example reveals that
the G boundary of the version space is overly general; that is, the
hypothesis in G incorrectly predicts that this new example is a positive
example.
• The hypothesis in the G boundary must therefore be specialized until it
correctly classifies this new negative example.

Given that there are six attributes that could be specified to specialize G2, why are
there only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of
G2 that correctly labels the new example as a negative example, but it is not
included in G3. The reason this hypothesis is excluded is that it is inconsistent with
the previously encountered positive examples.
• The fourth training example, a positive one, further generalizes the S boundary of
the version space. It also results in removing one member of the G boundary, because
this member fails to cover the new positive example.
After processing these four examples, the boundary sets S4 and G4 delimit the
version space of all hypotheses consistent with the set of incrementally observed
training examples.
2.6 REMARKS ON VERSION SPACES AND CANDIDATE-ELIMINATION
The target concept is exactly learned when the S and G boundary sets
converge to a single, identical hypothesis.
What will happen if the training data contains errors?
• Suppose, for example, that the second training example above is incorrectly
presented as a negative example instead of a positive example.
• Unfortunately, in this case the algorithm is certain to remove the correct
target concept from the version space.
• Because it will remove every hypothesis that is inconsistent with each training
example, it will eliminate the true target concept from the version space as
soon as this false negative example is encountered.
• Of course, given sufficient additional training data, the learner will
eventually detect an inconsistency by noticing that the S and G boundary
sets converge to an empty version space.
• Such an empty version space indicates that there is no hypothesis in H
consistent with all observed training examples.
• A similar symptom will appear when the training examples are correct, but the
target concept cannot be described in the hypothesis representation.
• For now, we consider only the case in which the training examples are
correct and the true target concept is present in the hypothesis space.
INDUCTIVE BIAS
An Unbiased Learner
• The solution to the problem of assuring that the target concept is in the
hypothesis space H is to provide a hypothesis space capable of representing
every teachable concept, that is, every possible subset of the instances X.
• The set of all subsets of a set X is called the power set of X. In the EnjoySport
task |X| = 96, so the power set contains 2^96 (roughly 10^28) distinct target concepts.
• Such a hypothesis space can be obtained by allowing arbitrary disjunctions,
conjunctions, and negations of the earlier hypotheses. For instance, the target
concept "Sky = Sunny or Sky = Cloudy" could be described as

(Sunny, ?, ?, ?, ?, ?) ∨ (Cloudy, ?, ?, ?, ?, ?)
Our concept learning algorithm is now completely unable to generalize beyond
the observed examples!
To see why, suppose we present three positive examples (x1, x2, x3) and two
negative examples (x4, x5) to the learner.
At this point, the S boundary of the version space will contain the hypothesis
which is just the disjunction of the positive examples,

S: { (x1 ∨ x2 ∨ x3) }

while the G boundary will contain the hypothesis that rules out only the observed
negative examples,

G: { ¬(x4 ∨ x5) }
To see the reason, note that when H is the power set of X and x is some previously
unobserved instance,
then for any hypothesis h in the version space that covers x, there will be another
hypothesis h' in the power set that is identical to h except for its classification of x.
And of course if h is in the version space, then h' will be as well, because it agrees
with h on all the observed training examples.
The Futility of Bias-Free Learning
What, then, is the inductive bias of the CANDIDATE-ELIMINATION algorithm?
To answer this, let us specify L(xi, Dc) exactly for this algorithm: given a set of
data Dc, the CANDIDATE-ELIMINATION algorithm will first compute the
version space VS(H,Dc), then classify the new instance xi by a vote among the
hypotheses in this version space. Here let us assume that it will output a
classification for xi only if this vote among version space hypotheses is
unanimously positive or negative, and that it will not output a classification
otherwise. Given this definition of L(xi, Dc) for the CANDIDATE-ELIMINATION
algorithm, what is its inductive bias? It is simply the assumption c ∈ H. Given
this assumption, each inductive inference performed by the
CANDIDATE-ELIMINATION algorithm can be justified deductively.
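The unanimous-vote rule is easy to state in code; the tiny version space and `sat` helper below are illustrative assumptions, not from the notes:

```python
def classify_by_vote(version_space, x, satisfies):
    """Classify x only if every hypothesis in the version space agrees;
    otherwise return None (the learner refuses to classify)."""
    votes = {satisfies(h, x) for h in version_space}
    return votes.pop() if len(votes) == 1 else None

def sat(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

# A small version space over two attributes (Sky, AirTemp) for illustration.
vs = [("Sunny", "?"), ("?", "Warm"), ("Sunny", "Warm")]
```

An instance covered by all hypotheses is labeled positive, one covered by none is labeled negative, and a split vote yields no classification.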
Inductive bias of the CANDIDATE-ELIMINATION algorithm: the target concept c is
contained in the given hypothesis space H.
Figure 2.8 summarizes the situation schematically. The inductive
CANDIDATE-ELIMINATION algorithm at the top of the figure takes two inputs: the
training examples and a new instance to be classified. At the bottom of the figure, a
deductive theorem prover is given these same two inputs plus the assertion "H contains
the target concept." These two systems will in principle produce identical outputs for
every possible input set of training examples and every possible new instance in X.
One advantage of viewing inductive inference systems in terms of their inductive
bias is that:
• It provides a nonprocedural means of characterizing their policy for generalizing
beyond the observed data.
• A second advantage is that it allows comparison of different learners according to
the strength of the inductive bias they employ. Consider, for example, the
following three learning algorithms, which are listed from weakest to strongest
bias.
1. ROTE-LEARNER: Learning corresponds simply to storing each observed training
example in memory. Subsequent instances are classified by looking them up in
memory: if the instance is found, the stored classification is returned;
otherwise, the system refuses to classify the new instance.
2. CANDIDATE-ELIMINATION algorithm: New instances are classified only in the case
where all members of the current version space agree on the classification;
otherwise, the system refuses to classify the new instance.
3. FIND-S: This algorithm, described earlier, finds the most specific hypothesis
consistent with the training examples. It then uses this hypothesis to classify
all subsequent instances.