MACHINE LEARNING UNIT-I NOTES

1.1 WELL-POSED LEARNING PROBLEMS:


Definition: A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.
To have a well-defined learning problem, three features need to be identified:
1. The class of tasks
2. The measure of performance to be improved
3. The source of experience
Examples
1. Checkers game: A computer program that learns to play checkers might improve
its performance, as measured by its ability to win at the class of tasks involving
playing checkers games, through experience obtained by playing games against itself.

A checkers learning problem:


• Task T: playing checkers
• Performance measure P: percent of games won against opponents

• Training experience E: playing practice games against itself


2. A handwriting recognition learning problem:
• Task T: recognizing and classifying handwritten words within images
• Performance measure P: percent of words correctly classified

• Training experience E: a database of handwritten words with given classifications

3. A robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance travelled before an error (as judged
by human overseer)
• Training experience E: a sequence of images and steering commands recorded
while observing a human driver

1.2 DESIGNING A LEARNING SYSTEM:


The basic design issues and approaches to machine learning are illustrated by designing a
program to learn to play checkers, with the goal of entering it in the world checkers
tournament:
1. Choosing the Training Experience
2. Choosing the Target Function
3. Choosing a Representation for the Target Function
4. Choosing a Function Approximation Algorithm
1. Estimating training values
2. Adjusting the weights
5. The Final Design
1. Choosing the Training Experience
• The first design choice is to choose the type of training experience from which the
system will learn.
• The type of training experience available can have a significant impact on success or
failure of the learner.

There are three attributes that impact the success or failure of the learner:
1. Whether the training experience provides direct or indirect feedback regarding the
choices made by the performance system.
For example, in checkers game:
In learning to play checkers, the system might learn from direct training examples
consisting of individual checkers board states and the correct move for each, or from
indirect training examples consisting of the move sequences and final outcomes of various
games played. In the indirect case, the information about the correctness of specific moves
early in the game must be inferred from the fact that the game was eventually won or lost.
Here the learner faces an additional problem of credit assignment, or determining the degree
to which each move in the sequence deserves credit or blame for the final outcome.
Credit assignment can be a particularly difficult problem because the game can be lost even
when early moves are optimal, if these are followed later by poor moves. Hence, learning from
direct training feedback is typically easier than learning from indirect feedback.
2. The degree to which the learner controls the sequence of training examples
For example, in checkers game:
The learner might depend on the teacher to select informative board states and to
provide the correct move for each.
Alternatively, the learner might itself propose board states that it finds particularly
confusing and ask the teacher for the correct move.
The learner may have complete control over both the board states and (indirect) training
classifications, as it does when it learns by playing against itself with no teacher present
3. How well it represents the distribution of examples over which the final system
performance P must be measured
For example, in checkers game:
In checkers learning scenario, the performance metric P is the percent of games the
system wins in the world tournament.
If its training experience E consists only of games played against itself, there is a danger
that this training experience might not be fully representative of the distribution of
situations over which it will later be tested.
In practice, it is often necessary to learn from a distribution of examples that is
somewhat different from those on which the final system will be evaluated.

2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and
how this will be used by the performance program.
Let’s consider a checkers-playing program that can generate the legal moves from any board
state.
The program needs only to learn how to choose the best move from among these legal
moves. Since it must learn to choose among the legal moves, the most obvious choice for
the type of information to be learned is a program, or function, that chooses the best move
for any given board state.
1. Let ChooseMove be the target function, with notation

ChooseMove : B → M

which indicates that this function accepts as input any board from the set of legal board
states B and produces as output some move from the set of legal moves M.
ChooseMove is a natural choice for the target function in the checkers example, but this
function will turn out to be very difficult to learn given the kind of indirect training
experience available to our system.
2. An alternative target function is an evaluation function that assigns a numerical score
to any given board state. Let V be the target function, with notation

V : B → R

which denotes that V maps any legal board state from the set B to some real value. We intend
for this target function V to assign higher scores to better board states. If the system can
successfully learn such a target function V, then it can easily use it to select the best move
from any current board position.
Let us define the target value V(b) for an arbitrary board state b in B, as follows:
• If b is a final board state that is won, then V(b) = 100
• If b is a final board state that is lost, then V(b) = -100
• If b is a final board state that is drawn, then V(b) = 0
• If b is not a final state in the game, then V(b) = V(b'),
where b' is the best final board state that can be achieved starting from b and playing
optimally until the end of the game.

3. Choosing a Representation for the Target Function

Let’s choose a simple representation: for any given board state, the function V̂
will be calculated as a linear combination of the following board features:

• x1: the number of black pieces on the board
• x2: the number of red pieces on the board
• x3: the number of black kings on the board
• x4: the number of red kings on the board
• x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn)
• x6: the number of red pieces threatened by black

Thus, the learning program will represent V̂(b) as a linear function of the form

V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

where:
• w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
• Learned values for the weights w1 through w6 determine the relative importance of the various board features in determining the value of the board.
• The weight w0 provides an additive constant to the board value.
4. Choosing a Function Approximation Algorithm

In order to learn the target function V̂ we require a set of training examples, each
describing a specific board state b and the training value Vtrain(b) for b.

Each training example is an ordered pair of the form (b, Vtrain(b)).

For instance, the following training example describes a board state b in which
black has won the game (note x2 = 0 indicates that red has no remaining pieces)
and for which the target function value Vtrain(b) is therefore +100.

((x1=3, x2=0, x3=1, x4=0, x5=0, x6=0), +100)


Function Approximation Procedure

1. Derive training examples from the indirect training experience available to the learner
2. Adjust the weights wi to best fit these training examples

1. Estimating training values

A simple approach for estimating the training value Vtrain(b) for any intermediate
board state b is to assign it the value V̂(Successor(b)), where:

• V̂ is the learner's current approximation to V
• Successor(b) denotes the next board state following b for which it is again the program's turn to move

Rule for estimating training values:

Vtrain(b) ← V̂(Successor(b))

2. Adjusting the weights
Specify the learning algorithm for choosing the weights wi to best fit the set
of training examples {(b, Vtrain(b))}

A first step is to define what we mean by the best fit to the training data.

One common approach is to define the best hypothesis, or set of weights, as
that which minimizes the squared error E between the training values and
the values predicted by the hypothesis:

E ≡ Σ (Vtrain(b) − V̂(b))², summed over all training examples (b, Vtrain(b))

Several algorithms are known for finding weights of a linear function that
minimize E. One such algorithm is called the least mean squares, or LMS
training rule. For each observed training example it adjusts the weights a
small amount in the direction that reduces the error on this training example

LMS weight update rule: For each training example (b, Vtrain(b)):

• Use the current weights to calculate V̂(b)
• For each weight wi, update it as

wi ← wi + η (Vtrain(b) − V̂(b)) xi

Here η is a small constant (e.g., 0.1) that moderates the size of the weight update.

Working of weight update rule

• When the error (Vtrain(b) − V̂(b)) is zero, no weights are changed.
• When (Vtrain(b) − V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of V̂(b), reducing the error.
• If the value of some feature xi is zero, then its weight is not altered regardless of the error, so the only weights updated are those whose features actually occur on the training example board.
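The LMS rule above translates almost line for line into code. The sketch below is a minimal Python version assuming the v_hat function from the earlier sketch; the function names and the (features, Vtrain) data layout are assumptions for the example, and eta defaults to 0.1 as in the text.

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi.

    The constant weight w0 is updated as if its feature x0 were always 1,
    so a zero-valued feature xi leaves its weight wi unchanged, as noted above.
    """
    error = v_train - v_hat(weights, features)
    xs = (1,) + tuple(features)  # x0 = 1 for the additive constant w0
    return tuple(w + eta * error * x for w, x in zip(weights, xs))

def lms_train(weights, examples, eta=0.1):
    """Sweep over training examples, where each example is a (features,
    Vtrain) pair and Vtrain(b) has been estimated as V̂(Successor(b))."""
    for features, v_train in examples:
        weights = lms_update(weights, features, v_train, eta)
    return weights
```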
5. The Final Design
The final design of the checkers learning system can be described by four distinct
program modules that represent the central components in many learning systems.

1. The Performance System is the module that must solve the given
performance task by using the learned target function(s). It takes an
instance of a new problem (new game) as input and produces a trace of its
solution (game history) as output.

2. The Critic takes as input the history or trace of the game and produces as
output a set of training examples of the target function

3. The Generalizer takes as input the training examples and produces an
output hypothesis that is its estimate of the target function. It generalizes
from the specific training examples, hypothesizing a general function that
covers these examples and other cases beyond the training examples.

4. The Experiment Generator takes as input the current hypothesis and
outputs a new problem (i.e., initial board state) for the Performance
System to explore. Its role is to pick new practice problems that will
maximize the learning rate of the overall system.

The sequence of design choices made for the checkers program is summarized in the figure below.

1.3 PERSPECTIVES AND ISSUES IN MACHINE LEARNING:

• One useful perspective on machine learning is that it involves searching a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
• For example, consider the space of hypotheses that could in principle be output by the above checkers learner. This hypothesis space consists of all evaluation functions that can be represented by some choice of values for the weights w0 through w6.
• The learner's task is thus to search through this vast space to locate the hypothesis that is most consistent with the available training examples.
• The LMS algorithm for fitting weights achieves this goal by iteratively tuning the weights, adding a correction to each weight each time the hypothesized evaluation function predicts a value that differs from the training value.
• This algorithm works well when the hypothesis representation considered by the learner defines a continuously parameterized space of potential hypotheses.

Issues in Machine Learning


The field of machine learning, and much of this book, is concerned with answering
questions such as the following
• What algorithms exist for learning general target functions from specific
training examples? In what settings will particular algorithms converge to
the desired function, given sufficient training data? Which algorithms
perform best for which types of problems and representations?
• How much training data is sufficient? What general bounds can be found to
relate the confidence in learned hypotheses to the amount of training
experience and the character of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it
is only approximately correct?
• What is the best strategy for choosing a useful next training experience, and
how does the choice of this strategy alter the complexity of the learning
problem?
• What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should
the system attempt to learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability
to represent and learn the target function?

CONCEPT LEARNING:

2.1 INTRODUCTION:

• Learning involves acquiring general concepts from specific training examples.
Example: People continually learn general concepts or categories such as
"bird," "car," "situations in which I should study more in order to pass the
exam," etc.
• Each such concept can be viewed as describing some subset of objects or
events defined over a larger set
• Alternatively, each concept can be thought of as a Boolean-valued function
defined over this larger set. (Example: A function defined over all animals,
whose value is true for birds and false for other animals).

Definition: Concept learning is inferring a Boolean-valued function from training
examples of its input and output.

2.2 A CONCEPT LEARNING TASK:

To ground our discussion of concept learning, consider the example task of
learning the target concept "days on which my friend Aldo enjoys his favorite
water sport."

Table 2.1 describes a set of example days, each represented by a set of attributes.

The attribute EnjoySport indicates whether or not Aldo enjoys his favorite
water sport on this day.

The task is to learn to predict the value of EnjoySport for an arbitrary day,
based on the values of its other attributes.


Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

Table 2.1: Positive and negative training examples for the target concept EnjoySport.

What hypothesis representation is provided to the learner?

• Let’s consider a simple representation in which each hypothesis consists of a conjunction of constraints on the instance attributes.
• Let each hypothesis be a vector of six constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.

For each attribute, the hypothesis will either:

• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "Φ" that no value is acceptable.

If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a


positive example (h(x) = 1).

The hypothesis that Aldo enjoys his favorite sport only on cold days with high
humidity is represented by the expression
(?, Cold, High, ?, ?, ?)

The most general hypothesis, that every day is a positive example, is represented by

(?, ?, ?, ?, ?, ?)

The most specific possible hypothesis, that no day is a positive example, is represented by

(Φ, Φ, Φ, Φ, Φ, Φ)

Notation

• The set of items over which the concept is defined is called the set of
instances, which is denoted by X.

Example: X is the set of all possible days, each represented by the attributes: Sky,
AirTemp, Humidity, Wind, Water, and Forecast

• The concept or function to be learned is called the target concept, denoted by c. In general, c can be any Boolean-valued function defined over the instances X; that is,

c : X → {0, 1}

Example: The target concept corresponds to the value of the attribute EnjoySport
(i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).

• Instances for which c(x) = 1 are called positive examples, or members of the target
concept.
• Instances for which c(x) = 0 are called negative examples, or non-members
of the target concept.
• The ordered pair (x, c(x)) describes a training example consisting of the instance x and its target concept value c(x).
• The symbol D denotes the set of available training examples.

• The symbol H denotes the set of all possible hypotheses that the learner may
consider regarding the identity of the target concept. Each hypothesis h in H
represents a Boolean-valued function defined over X:
h : X → {0, 1}

The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.

• Given:
• Instances X: Possible days, each described by the attributes
• Sky (with possible values Sunny, Cloudy, and Rainy),
• AirTemp (with values Warm and Cold),
• Humidity (with values Normal and High),
• Wind (with values Strong and Weak),
• Water (with values Warm and Cool),
• Forecast (with values Same and Change).

• Hypotheses H: Each hypothesis is described by a conjunction of constraints
on the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. The
constraints may be "?" (any value is acceptable), "Φ" (no value is
acceptable), or a specific value.

• Target concept c: EnjoySport : X → {0, 1}


• Training examples D: Positive and negative examples of the target function

• Determine:
• A hypothesis h in H such that h(x) = c(x) for all x in X.

Table 2.2: The EnjoySport concept learning task.

The inductive learning hypothesis:

Notice that although the learning task is to determine a hypothesis h identical to the
target concept c over the entire set of instances X, the only information available about c
is its value over the training examples.
Therefore, inductive learning algorithms can at best guarantee that the output
hypothesis fits the target concept over the training data. Lacking any further
information, our assumption is that the best hypothesis regarding unseen instances is
the hypothesis that best fits the observed training data.

The inductive learning hypothesis: Any hypothesis found to approximate the target
function well over a sufficiently large set of training examples will also approximate
the target function well over other unobserved examples.

2.3 CONCEPT LEARNING AS SEARCH:

• Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training examples.

Example:
Consider the instances X and hypotheses H in the EnjoySport learning task. The
attribute Sky has three possible values, and AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values, so the instance space X contains exactly
3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances. Since each attribute in a hypothesis can
additionally take the values "?" and "Φ", there are 5 · 4 · 4 · 4 · 4 · 4 = 5120
syntactically distinct hypotheses within H.

Every hypothesis containing one or more "Φ" symbols represents the empty set of
instances; that is, it classifies every instance as negative. Therefore the number of
semantically distinct hypotheses is only 1 + (4 · 3 · 3 · 3 · 3 · 3) = 973.
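These counts are easy to verify; a few lines of Python reproduce the arithmetic above.

```python
# Instance space: Sky has 3 values; the other five attributes have 2 each.
instances = 3 * 2 * 2 * 2 * 2 * 2       # 96

# Syntactically distinct hypotheses: each attribute also allows "?" and "Φ".
syntactic = 5 * 4 * 4 * 4 * 4 * 4       # 5120

# Semantically distinct: every hypothesis containing "Φ" denotes the same
# empty concept, so count it once plus the "Φ"-free hypotheses.
semantic = 1 + 4 * 3 * 3 * 3 * 3 * 3    # 973

print(instances, syntactic, semantic)   # 96 5120 973
```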

General-to-Specific Ordering of Hypotheses

Consider the two hypotheses

h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)

• Consider the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. Any instance classified positive by h1 will also be classified positive by h2. Therefore, h2 is more general than h1.

Given hypotheses hj and hk, hj is more-general-than-or-equal-to hk if and only if any
instance that satisfies hk also satisfies hj.

Definition: Let hj and hk be Boolean-valued functions defined over X. Then hj is
more-general-than-or-equal-to hk (written hj ≥ hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

• In the figure, the box on the left represents the set X of all instances, and the box on the right the set H of all hypotheses.
• Each hypothesis corresponds to some subset of X: the subset of instances that it classifies positive.
• The arrows connecting hypotheses represent the more-general-than relation, with the arrow pointing toward the less general hypothesis.
• Note that the subset of instances characterized by h2 subsumes the subset characterized by h1; hence h2 is more-general-than h1.
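For the conjunctive representation used in these notes, the more-general-than-or-equal relation can be tested constraint by constraint rather than by enumerating instances. A minimal Python sketch follows; the function name is an assumption for the example.

```python
def more_general_or_equal(hj, hk):
    """True if hj >= hk: every instance satisfying hk also satisfies hj."""
    # A hypothesis containing "Φ" covers no instance, so hj >= hk trivially.
    if "Φ" in hk:
        return True
    # Otherwise each constraint of hj must admit everything hk admits.
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))  # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))  # False
```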

2.4 FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS:

How can we use the more-general-than partial ordering to organize the search for
a hypothesis consistent with the observed training examples?
One way is to begin with the most specific possible hypothesis in H, then
generalize this hypothesis each time it fails to cover an observed positive
training example.
(We say that a hypothesis "covers" a positive example if it correctly classifies the
example as positive.)
FIND-S Algorithm

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
   For each attribute constraint ai in h:
      If the constraint ai is satisfied by x, then do nothing
      Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

Table 2.3: The FIND-S algorithm.

To illustrate this algorithm, assume the learner is given the sequence of
training examples from the EnjoySport task:

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

• The first step of FIND-S is to initialize h to the most specific hypothesis in H:

h0 = (Φ, Φ, Φ, Φ, Φ, Φ)

• Consider the first training example


x1 = <Sunny Warm Normal Strong Warm Same>, +

Observing the first training example, it is clear that hypothesis h is
too specific. None of the "Φ" constraints in h are satisfied by this
example, so each is replaced by the next more general constraint that
fits the example:

h1 = <Sunny, Warm, Normal, Strong, Warm, Same>

• Consider the second training example


x2 = <Sunny, Warm, High, Strong, Warm, Same>, +

The second training example forces the algorithm to further
generalize h, this time substituting a "?" in place of any attribute
value in h that is not satisfied by the new example:

h2 = <Sunny, Warm, ?, Strong, Warm, Same>
• Consider the third training example
x3 = <Rainy, Cold, High, Strong, Warm, Change>, -

Upon encountering the third training example, the algorithm makes no change to h.
The FIND-S algorithm simply ignores every negative example.

h3 = <Sunny, Warm, ?, Strong, Warm, Same>

• Consider the fourth training example

x4 = <Sunny, Warm, High, Strong, Cool, Change>, +

The fourth example leads to a further generalization of h:

h4 = <Sunny, Warm, ?, Strong, ?, ?>
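The whole trace above can be reproduced with a short Python version of Table 2.3. The function name and the (instance, label) data layout are assumptions for the example; the logic follows the algorithm step by step.

```python
def find_s(examples):
    """FIND-S for conjunctive hypotheses, as in Table 2.3.

    `examples` is a list of (instance, label) pairs with labels "Yes"/"No".
    """
    h = ["Φ"] * 6                       # the most specific hypothesis in H
    for instance, label in examples:
        if label != "Yes":              # FIND-S ignores every negative example
            continue
        for i, value in enumerate(instance):
            if h[i] == "Φ":             # first positive example: copy its values
                h[i] = value
            elif h[i] != value:         # constraint not satisfied: generalize
                h[i] = "?"
    return tuple(h)

training_data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(training_data))
# -> ('Sunny', 'Warm', '?', 'Strong', '?', '?'), matching h4 above
```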

The key property of the FIND-S algorithm
• FIND-S is guaranteed to output the most specific hypothesis within H that is
consistent with the positive training examples
• FIND-S algorithm’s final hypothesis will also be consistent with the negative
examples provided the correct target concept is contained in H, and provided
the training examples are correct.

Unanswered by FIND-S

1. Has the learner converged to the correct target concept?


2. Why prefer the most specific hypothesis?
3. Are the training examples consistent?
4. What if there are several maximally specific consistent hypotheses?

2.5 VERSION SPACES AND THE CANDIDATE-ELIMINATION ALGORITHM:

• This section describes a second approach to concept learning, the CANDIDATE-ELIMINATION algorithm, which addresses several of the limitations of FIND-S.
• Notice that although FIND-S outputs a hypothesis from H that is consistent with the training examples, this is just one of many hypotheses from H that might fit the training data equally well.
• The key idea in the CANDIDATE-ELIMINATION algorithm is to output a description of the set of all hypotheses consistent with the training examples.
• Surprisingly, the CANDIDATE-ELIMINATION algorithm computes the description of this set without explicitly enumerating all of its members.
• This is accomplished by again using the more-general-than partial ordering, this time to maintain a compact representation of the set of consistent hypotheses and to incrementally refine this representation as each new training example is encountered.
• The CANDIDATE-ELIMINATION algorithm has been applied to problems such as learning regularities in chemical mass spectroscopy (Mitchell 1979) and learning control rules for heuristic search (Mitchell et al. 1983).
• Nevertheless, practical applications of the CANDIDATE-ELIMINATION and FIND-S algorithms are limited by the fact that they both perform poorly when given noisy training data.
• More importantly for our purposes here, the CANDIDATE-ELIMINATION algorithm provides a useful conceptual framework for introducing several fundamental issues in machine learning. In the remainder of this chapter we present the algorithm and discuss these issues. Beginning with the next chapter, we will examine learning algorithms that are used more frequently with noisy training data.

• One limitation of the FIND-S algorithm is that it outputs just one hypothesis consistent with the training data, when there might be many.
• To overcome this, we introduce the notion of a version space and algorithms to compute it.

Representation:

One obvious way to represent the version space is simply to list all of its members.
This leads to a simple learning algorithm, which we might call the LIST-THEN-ELIMINATE
algorithm, defined in Table 2.4:

1. VersionSpace ← a list containing every hypothesis in H
2. For each training example <x, c(x)>: remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

Table 2.4: The LIST-THEN-ELIMINATE algorithm. It first initializes the version space to
contain all hypotheses in H and then eliminates any hypothesis found inconsistent with
any training example.

• The version space of candidate hypotheses thus shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples.
• This, presumably, is the desired target concept. If insufficient data is available to narrow the version space to a single hypothesis, then the algorithm can output the entire set of hypotheses consistent with the observed data.

• In principle, the LIST-THEN-ELIMINATE algorithm can be applied whenever the hypothesis space H is finite. It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data. Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
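As a concrete sketch, LIST-THEN-ELIMINATE is only a few lines of Python when H is small enough to enumerate, as it is for EnjoySport. The helper names are assumptions, and the demo reuses training_data from the FIND-S sketch above.

```python
from itertools import product

def satisfies(h, x):
    """True if conjunctive hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def list_then_eliminate(H, examples):
    """Keep every hypothesis consistent with all training examples."""
    version_space = list(H)
    for x, label in examples:
        wanted = (label == "Yes")
        version_space = [h for h in version_space if satisfies(h, x) == wanted]
    return version_space

# Enumerate all 5120 syntactically distinct hypotheses for EnjoySport.
domains = [["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]
H = list(product(*[d + ["?", "Φ"] for d in domains]))

vs = list_then_eliminate(H, training_data)
print(len(vs))  # 6 hypotheses remain: the version space shown in Figure 2.3
```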
A More Compact Representation for Version Spaces:

• LIST-THEN-ELIMINATE works in principle, so long as the version space is finite.
• However, since it requires exhaustive enumeration of all hypotheses, in practice it is not feasible.

The CANDIDATE-ELIMINATION algorithm works on the same principle as the above
LIST-THEN-ELIMINATE algorithm. However, it employs a much more compact
representation of the version space. The version space is represented by its most
general and least general members. These members form general and specific boundary
sets that delimit the version space within the partially ordered hypothesis space.

• In fact, the hypothesis output by FIND-S for the EnjoySport examples is just one of six different hypotheses from H that are consistent with these training examples. All six hypotheses are shown in Figure 2.3.
• They constitute the version space relative to this set of data and this hypothesis representation.
• The arrows among these six hypotheses in Figure 2.3 indicate instances of the more-general-than relation.
• The CANDIDATE-ELIMINATION algorithm represents the version space by storing only its most general members (labeled G in Figure 2.3) and its most specific (labeled S in the figure).

• Given only these two sets S and G, it is possible to enumerate all members of
the version space as needed by generating the hypotheses that lie between
these two sets in the general-to-specific partial ordering over hypotheses. It is
intuitively plausible that we can represent the version space in terms of its
most specific and most general members.

• Below we define the boundary sets G and S precisely and prove that these sets do
in fact represent the version space.

Definition: The general boundary G, with respect to hypothesis space H and training
data D, is the set of maximally general members of H consistent with D.

Definition: The specific boundary S, with respect to hypothesis space H and training
data D, is the set of maximally specific (minimally general) members of H consistent with D.

The version space representation theorem states that the version space is precisely the
set of hypotheses h in H for which some member of S is more specific than or equal to h
and some member of G is more general than or equal to h; the theorem is proved by
supposing some consistent hypothesis lies outside these bounds and showing that this
leads to an inconsistency.
CANDIDATE-ELIMINATION Learning Algorithm

The CANDIDATE-ELIMINATION algorithm computes the version space containing
all hypotheses from H that are consistent with an observed sequence of training
examples. It begins by initializing the version space to the set of all hypotheses in
H; that is, by initializing the G boundary set to contain the most general hypothesis
in H,

G0 ← {(?, ?, ?, ?, ?, ?)}

and initializing the S boundary set to contain the most specific hypothesis,

S0 ← {(Φ, Φ, Φ, Φ, Φ, Φ)}

These two boundary sets delimit the entire hypothesis space, because every other
hypothesis in H is both more general than S0 and more specific than G0. As each
training example is considered, the S and G boundary sets are generalized and
specialized, respectively, to eliminate from the version space any hypotheses found
inconsistent with the new training example. After all examples have been processed,
the computed version space contains all the hypotheses consistent with these
examples, and only these hypotheses. This algorithm is summarized in Table 2.5.

Table 2.5: The CANDIDATE-ELIMINATION algorithm computes the version space containing
all hypotheses from H that are consistent with an observed sequence of training examples.
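Since the table itself is not reproduced here, the following Python sketch shows one way the algorithm can be implemented for the conjunctive representation used in these notes. The helper names and data layout are assumptions; the S and G updates follow the description above. One pruning step (removing members of S more general than other members of S) is omitted, which is harmless here because S stays a singleton for conjunctive hypotheses. The demo reuses training_data and domains from the earlier sketches.

```python
def satisfies(h, x):
    """True if conjunctive hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """True if every instance satisfying hk also satisfies hj."""
    return "Φ" in hk or all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

def min_generalization(s, x):
    """Minimal generalization of s that covers the positive instance x."""
    return tuple(v if c == "Φ" else (c if c == v else "?")
                 for c, v in zip(s, x))

def min_specializations(g, domains, x):
    """Minimal specializations of g that exclude the negative instance x."""
    out = []
    for i, c in enumerate(g):
        if c == "?":
            out += [g[:i] + (v,) + g[i + 1:] for v in domains[i] if v != x[i]]
    return out

def candidate_elimination(examples, domains):
    S = {("Φ",) * len(domains)}  # S0: the most specific hypothesis
    G = {("?",) * len(domains)}  # G0: the most general hypothesis
    for x, label in examples:
        if label == "Yes":       # positive example: generalize S, prune G
            G = {g for g in G if satisfies(g, x)}
            S = {min_generalization(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
        else:                    # negative example: prune S, specialize G
            S = {s for s in S if not satisfies(s, x)}
            new_G = set()
            for g in G:
                if not satisfies(g, x):
                    new_G.add(g)
                else:
                    new_G |= {h for h in min_specializations(g, domains, x)
                              if any(more_general_or_equal(h, s) for s in S)}
            # keep only the maximally general members of G
            G = {g for g in new_G
                 if not any(g2 != g and more_general_or_equal(g2, g)
                            for g2 in new_G)}
    return S, G

S, G = candidate_elimination(training_data, domains)
print(S)  # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print(G)  # {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}
```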

The detailed implementation of these operations will depend, of course, on the
specific representations for instances and hypotheses. However, the algorithm itself
can be applied to any concept learning task and hypothesis space for which these
operations are well-defined.
In the following example trace of this algorithm, we see how such operations can
be implemented for the representations used in the EnjoySport example problem.

An Illustrative Example

• Figure 2.4 traces the CANDIDATE-ELIMINATION algorithm applied to the first two training examples from Table 2.1. As described above, the boundary sets are first initialized to G0 and S0, the most general and most specific hypotheses in H, respectively.
• When the first training example is presented (a positive example in this case), the CANDIDATE-ELIMINATION algorithm checks the S boundary and finds that it is overly specific: it fails to cover the positive example.
• The S boundary is therefore revised by moving it to the least more general hypothesis that covers this new example.
• This revised boundary is shown as S1 in Figure 2.4. No update of the G boundary is needed in response to this training example because G0 correctly covers this example. When the second training example (also positive) is observed, it has a similar effect of generalizing S further to S2, leaving G again unchanged (i.e., G2 = G1 = G0).

Notice the processing of these first two positive examples is very similar to the
processing performed by the FIND-S algorithm.

• As illustrated by these first two steps, positive training examples may force the S boundary of the version space to become increasingly general. Negative training examples play the complementary role of forcing the G boundary to become increasingly specific.

• Consider the third training example, shown in Figure 2.5. This negative
example reveals that the G boundary of the version space is overly
general; that is, the hypothesis in G incorrectly predicts that this new
example is a positive example.
• The hypothesis in the G boundary must therefore be specialized until it
correctly classifies this new negative example.

• As shown in Figure 2.5, there are several alternative minimally more specific hypotheses. All of these become members of the new G3 boundary set.


Given that there are six attributes that could be specified to specialize G2, why are
there only three new hypotheses in G3?
For example, the hypothesis h = (?, ?, Normal, ?, ?, ?) is a minimal specialization of
G2 that correctly labels the new example as a negative example, but it is not
included in G3. The reason this hypothesis is excluded is that it is inconsistent with
the previously encountered positive examples.

Consider the fourth training example.

• This positive example further generalizes the S boundary of the version space. It
also results in removing one member of the G boundary, because this member
fails to cover the new positive example

After processing these four examples, the boundary sets S4 and G4 delimit the
version space of all hypotheses consistent with the set of incrementally observed
training examples:

S4 = {<Sunny, Warm, ?, Strong, ?, ?>}
G4 = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}

2.6 REMARKS ON VERSION SPACES AND CANDIDATE-ELIMINATION

Will the CANDIDATE-ELIMINATION Algorithm Converge to the Correct Hypothesis?

The version space learned by the CANDIDATE-ELIMINATION algorithm will converge
toward the hypothesis that correctly describes the target concept, provided:

1. There are no errors in the training examples
2. There is some hypothesis in H that correctly describes the target concept.

In fact, as new training examples are observed, the version space can be monitored
to determine the remaining ambiguity regarding the true concept and to determine
when sufficient training examples have been observed to unambiguously identify the
target concept. The target concept is exactly learned when the S and G boundary sets
converge to a single, identical hypothesis.

What will happen if the training data contains errors?

• Suppose, for example, that the second training example above is incorrectly presented as a negative example instead of a positive example.
• Unfortunately, in this case the algorithm is certain to remove the correct
target concept from the version space.
• Because it will remove every hypothesis that is inconsistent with each training
example, it will eliminate the true target concept from the version space as
soon as this false negative example is encountered.
• Of course, given sufficient additional training data, the learner will
eventually detect an inconsistency by noticing that the S and G boundary
sets eventually converge to an empty version space.
• Such an empty version space indicates that there is no hypothesis in H
consistent with all observed training examples.
• A similar symptom will appear when the training examples are correct, but the
target concept cannot be described in the hypothesis representation.
• For now, we consider only the case in which the training examples are correct and the true target concept is present in the hypothesis space.

2.7 INDUCTIVE BIAS:

The fundamental questions for inductive inference

1. What if the target concept is not contained in the hypothesis space?


2. Can we avoid this difficulty by using a hypothesis space that
includes every possible hypothesis?
3. How does the size of this hypothesis space influence the ability of
the algorithm to generalize to unobserved instances?
4. How does the size of the hypothesis space influence the number of
training examples that must be observed?

An Unbiased Learner

• The solution to the problem of assuring that the target concept is in the hypothesis space H is to provide a hypothesis space capable of representing every teachable concept, that is, every possible subset of the instances X.
• The set of all subsets of a set X is called the power set of X

• In the EnjoySport learning task, the size of the instance space X of days described by the six attributes is 96 instances.
• Thus, there are 2^96 distinct target concepts that could be defined over this instance space, any of which the learner might be called upon to learn.
• The conjunctive hypothesis space is able to represent only 973 of these: a biased hypothesis space indeed.

• Let us reformulate the EnjoySport learning task in an unbiased way by defining a new hypothesis space H' that can represent every subset of instances.
• The target concept "Sky = Sunny or Sky = Cloudy" could then be described as

(Sunny, ?, ?, ?, ?, ?) ∨ (Cloudy, ?, ?, ?, ?, ?)
Our concept learning algorithm is now completely unable to generalize beyond the
observed examples! To see why, suppose we present three positive examples (x1, x2, x3)
and two negative examples (x4, x5) to the learner. At this point, the S boundary of the
version space will contain the hypothesis that is just the disjunction of the positive
examples, S = {(x1 ∨ x2 ∨ x3)}, while the G boundary will contain the hypothesis that
rules out only the observed negative examples, G = {¬(x4 ∨ x5)}.

To see the reason, note that when H is the power set of X and x is some previously
unobserved instance, then for any hypothesis h in the version space that covers x, there
will be another hypothesis h' in the power set that is identical to h except for its
classification of x. And of course if h is in the version space, then h' will be as well,
because it agrees with h on all the observed training examples.
The Futility of Bias-Free Learning: a learner that makes no a priori assumptions
regarding the identity of the target concept has no rational basis for classifying
any unseen instances.
What, then, is the inductive bias of the CANDIDATE-ELIMINATION algorithm?

To answer this, let us specify L(xi, Dc) exactly for this algorithm: given a set of
data Dc, the CANDIDATE-ELIMINATION algorithm will first compute the version space
VS(H, Dc), then classify the new instance xi by a vote among the hypotheses in this
version space. Here let us assume that it will output a classification for xi only if
this vote among version space hypotheses is unanimously positive or negative, and that
it will not output a classification otherwise. Given this definition of L(xi, Dc) for
the CANDIDATE-ELIMINATION algorithm, what is its inductive bias? It is simply the
assumption c ∈ H. Given this assumption, each inductive inference performed by the
CANDIDATE-ELIMINATION algorithm can be justified deductively.
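For the boundary-set representation, this unanimous vote can be computed without enumerating the version space: every member classifies x positive exactly when every hypothesis in S covers x, and every member classifies x negative exactly when no hypothesis in G covers x. A small Python sketch follows, with the S4 and G4 boundaries from the EnjoySport trace; the function names are assumptions.

```python
def covers(h, x):
    """True if conjunctive hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def classify(S, G, x):
    """Version-space vote using only the S and G boundary sets."""
    if all(covers(s, x) for s in S):
        return "Yes"                 # unanimously positive
    if not any(covers(g, x) for g in G):
        return "No"                  # unanimously negative
    return None                      # members disagree: refuse to classify

S4 = {("Sunny", "Warm", "?", "Strong", "?", "?")}
G4 = {("Sunny", "?", "?", "?", "?", "?"), ("?", "Warm", "?", "?", "?", "?")}
print(classify(S4, G4, ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")))  # Yes
print(classify(S4, G4, ("Rainy", "Cold", "Normal", "Weak", "Warm", "Same")))    # No
print(classify(S4, G4, ("Sunny", "Cold", "Normal", "Strong", "Warm", "Same")))  # None
```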

Inductive bias of the CANDIDATE-ELIMINATION algorithm: the target concept c is
contained in the given hypothesis space H. Figure 2.8 summarizes the situation
schematically. The inductive CANDIDATE-ELIMINATION algorithm at the top of the
figure takes two inputs: the training examples and a new instance to be classified. At the
bottom of the figure, a deductive theorem prover is given these same two inputs plus the
assertion "H contains the target concept." These two systems will in principle produce
identical outputs for every possible input set of training examples and every possible
new instance in X.
One advantage of viewing inductive inference systems in terms of their inductive
bias is that:
• It provides a nonprocedural means of characterizing their policy for generalizing
beyond the observed data.
• A second advantage is that it allows comparison of different learners according to
the strength of the inductive bias they employ. Consider, for example, the
following three learning algorithms, which are listed from weakest to strongest
bias.

1. ROTE-LEARNER: Learning corresponds simply to storing each observed
training example in memory. Subsequent instances are classified by looking
them up in memory. If the instance is found in memory, the stored
classification is returned. Otherwise, the system refuses to classify the new
instance.

2. CANDIDATE-ELIMINATION algorithm: New instances are classified only in the
case where all members of the current version space agree on the
classification. Otherwise, the system refuses to classify the new instance.

3. FIND-S: This algorithm, described earlier, finds the most specific hypothesis
consistent with the training examples. It then uses this hypothesis to classify
all subsequent instances.
