
Planning and Learning

Planning
• The task of coming up with a sequence of actions that will achieve a goal is called planning
• Classic planning environment
• Fully observable
• Deterministic
• Finite
• Static
• Discrete

Languages for planning problems

• STRIPS (Stanford Research Institute Problem Solver)

• ADL (Action Description Language)

STRIPS
• A state is a conjunction of positive, ground, function-free literals
• At(Home) AND IsAt(Umbrella, Home) AND CanBeCarried(Umbrella) AND IsUmbrella(Umbrella) AND HandEmpty AND Dry
• Any literal not mentioned is assumed false
– Other languages make different assumptions, e.g., negative literals are part of the state, or unmentioned literals are unknown
• Actions are represented with
– Action: Fly(P, from, to)
– Preconditions: At(P, from) AND Plane(P) AND Airport(from) AND Airport(to)
– Effect: NOT At(P, from) AND At(P, to)
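To make this representation concrete, here is a minimal sketch (assumed, not from the slides) of a ground STRIPS action as data, with the standard applicability test and successor computation; the names Action, applicable and apply are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconds: frozenset   # positive literals that must hold in the state
    add_list: frozenset   # positive effects
    del_list: frozenset   # the "NOT ..." effects, removed from the state

def applicable(state: frozenset, action: Action) -> bool:
    # STRIPS: an action applies when every precondition literal is in the state.
    return action.preconds <= state

def apply(state: frozenset, action: Action) -> frozenset:
    # Successor state = (state - delete list) + add list.
    return (state - action.del_list) | action.add_list

# The ground Fly action from the slide, for a hypothetical plane P1:
fly = Action(
    name="Fly(P1, JFK, SFO)",
    preconds=frozenset({"At(P1,JFK)", "Plane(P1)", "Airport(JFK)", "Airport(SFO)"}),
    add_list=frozenset({"At(P1,SFO)"}),
    del_list=frozenset({"At(P1,JFK)"}),
)
state = frozenset({"At(P1,JFK)", "Plane(P1)", "Airport(JFK)", "Airport(SFO)"})
if applicable(state, fly):
    state = apply(state, fly)   # state now contains At(P1,SFO)
```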

An action: TakeObject
• TakeObject(location, x)
• Preconditions:
• HandEmpty
• CanBeCarried(x)
• At(location)
• IsAt(x, location)
• Effects (“NOT something” means that something is removed from the state):
• Holding(x)
• NOT(HandEmpty)
• NOT(IsAt(x, location))
Another action
• WalkWithUmbrella(location1, location2, umbr)
• Preconditions:
• At(location1)
• Holding(umbr)
• IsUmbrella(umbr)
• Effects:
• At(location2)
• NOT(At(location1))
Yet another action
• WalkWithoutUmbrella(location1, location2)
• Preconditions:
• At(location1)
• Effects:
• At(location2)
• NOT(At(location1))
• NOT(Dry)

Comparison Between STRIPS and ADL

ADL
• Action: Fly(P: Plane, from: Airport, to: Airport)
• Preconditions: At(P, from) AND Plane(P) AND Airport(from) AND Airport(to)
• Effect: NOT At(P, from) AND At(P, to)

Planning Methods: State-Space Search & Goal Stack
Method # 1. Planning with State-Space Search:
• The most straightforward approach is to use state-space search. Because the descriptions of actions in a planning problem specify both preconditions and effects, it is possible to search in either direction: forward from the initial state or backward from the goal, as shown in the figure. We can also use the explicit action and goal representations to derive effective heuristics automatically.

Forward State-Space Search:
• Planning with forward state-space search is similar to the problem-solving approach. It is sometimes called progression
planning, because it moves in the forward direction.
• We start with the problem’s initial state, considering sequences of actions until we reach a goal state.
The formulation of a planning problem as a state-space search problem is as follows:
i. The initial state of the search is the initial state of the planning problem. In general, each state will be a set of positive ground literals; literals not appearing are false.

ii. The actions which are applicable to a state are all those whose preconditions are satisfied. The successor state resulting
from an action is generated by adding the positive effect literals and deleting the negative effect literals.

iii. The goal test checks whether the state satisfies the goal of the planning problem.

iv. The step cost of each action is typically 1. Although it would be easy to allow different costs for different actions, this was
seldom done by STRIPS planners.

Since function symbols are not present, the state space of a planning problem is finite and, therefore, any graph search algorithm, such as A*, will be a complete planning algorithm.
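As a concrete illustration of progression planning, here is a minimal sketch (assumed, not from the slides) of forward state-space search over STRIPS states, reusing the Action helpers sketched earlier. Breadth-first search stands in for A*: with unit action costs it coincides with uniform-cost search.

```python
from collections import deque

def forward_search(init: frozenset, goal: frozenset, actions):
    """Return a list of action names reaching a state that satisfies `goal`."""
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                 # goal test: every goal literal holds
            return plan
        for a in actions:                 # branching = all applicable actions
            if applicable(state, a):
                nxt = apply(state, a)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None                           # no plan exists
```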
• From the early days of planning research it is known that forward state-space
search is too inefficient to be practical.
• Mainly, this is because of a big branching factor, since forward search does not consider only relevant actions (all applicable actions are considered).
• Consider for example, an air cargo problem with 10 airports, where each
airport has 5 planes and 20 pieces of cargo.
• The goal is to move all the cargo at airport A to airport B.
• There is a simple solution to the problem: load the 20 pieces of cargo into one of the
planes at A, fly the plane to B, and unload the cargo. But finding the solution can be
difficult because the average branching factor is huge: each of the 50 planes can fly to 9
other airports, and each of the 200 packages can be either unloaded (if it is loaded), or
loaded into any plane at its airport (if it is unloaded).
• On average, let’s say there are about 1000 possible actions, so the search tree up to the depth of the obvious 41-step solution has about 1000^41 nodes. It is thus clear that a very accurate heuristic will be needed to make this kind of search efficient.
ii. Backward State-Space Search:
• Backward search can be difficult to implement when the goal states are described by a set of
constraints which are not listed explicitly.
• In particular, it is not always obvious how to generate a description of the possible
predecessors of the set of goal states. The STRIPS representation makes this quite easy
because sets of states can be described by the literals which must be true in those states.
• The main advantage of backward search is that it allows us to consider only relevant actions.
• An action is relevant to a conjunctive goal if it achieves one of the conjuncts of the goal. For example, the goal in our 10-airport air cargo problem is to have the 20 pieces of cargo at airport B, or more precisely:
• At(C1, B) ˄ At(C2, B) ˄ … ˄ At(C20, B)
• Now consider the conjunct At (C1, B). Working backwards, we can seek those actions which
have this as an effect,

There is only one:
• Unload(C1, p, B), where the plane p is unspecified.
• We may note that there are many irrelevant actions which can also lead to a goal
state.
• For example, we can fly an empty plane from Mumbai to Chennai; this action reaches
a goal state from a predecessor state in which the plane is at Mumbai and all the goal
conjuncts are satisfied.
• A backward search which allows irrelevant actions will still be complete, but it will be much less efficient. If a solution exists, it will also be found by a backward search which allows only relevant actions.
• This restriction to relevant actions only means that backward search often has a much
lower branching factor than forward search. For example, our air cargo problem has
about 1000 actions leading forward from the initial state, but only 20 actions working
backward from the goal. Hence backward search is more efficient than forward
searching.
• Searching backwards is also called regression planning. The principal question in regression planning is: what are the states from which applying a given action leads to the goal? Computing the description of these states is called regressing the goal through the action. To see how it works, once again consider the air cargo goal:
• At(C1, B) ˄ At(C2, B) ˄ … ˄ At(C20, B)
• The relevant action Unload(C1, p, B) achieves the first conjunct. The action will work only if its preconditions are satisfied. Therefore, any predecessor state must include In(C1, p) ˄ At(p, B) as sub-goals. Moreover, the sub-goal At(C1, B) should not already be true in the predecessor state (a state in which it holds would itself be a goal state, just not a relevant one).
Thus, the predecessor description is:
• In(C1, p) ˄ At(p, B) ˄ At(C2, B) ˄ … ˄ At(C20, B)
• In addition to insisting that actions achieve some desired literal, we must
insist that the actions do not undo any desired literals.
• An action which satisfies this restriction is called consistent. For example, the action Load(C2, p) would not be consistent with the current goal, because it would negate the literal At(C2, B).
Given a goal description G, let A be an action which is relevant and consistent.
The corresponding predecessor is constructed as follows:
I. Any positive effects of A which appear in G are deleted.
II. Each precondition literal of A is added, unless it already appears.
• For example, the predecessor description in the preceding paragraph is satisfied by the initial state
• In(C1, P12) ˄ At(P12, B) ˄ At(C2, B) ˄ … ˄ At(C20, B)
with the substitution {p/P12}.
• The substitution must be applied to the action leading from the state to the goal, producing the solution [Unload(C1, P12, B)].
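Rules I and II translate directly into set operations on ground STRIPS actions. A minimal sketch (assumed, not from the slides), reusing the Action type from the earlier sketch:

```python
def relevant(goal: frozenset, a: Action) -> bool:
    # Relevant: achieves at least one goal conjunct; consistent: clobbers none.
    return bool(a.add_list & goal) and not (a.del_list & goal)

def regress(goal: frozenset, a: Action) -> frozenset:
    # I.  Any positive effects of A which appear in G are deleted.
    # II. Each precondition literal of A is added, unless it already appears.
    return (goal - a.add_list) | a.preconds
```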

Method # 2. Goal Stack Planning:
• This was perhaps the first method used to solve problems in which goals interact, and was the approach used by STRIPS.
• The planner uses a single stack which contains both goals and operators which are proposed to satisfy those goals.
• It also depends on a database which describes the current situation, and a set of operators described with PRECONDITION, ADD, and DELETE lists.
• Let us illustrate the working of this method with the help of an example from the blocks world, shown in Fig. 8.6.
• At the start of the solution, the goal stack is simply:
ON(C, A) ˄ ON(B, D) ˄ ONTABLE(A) ˄ ONTABLE(D)

• But we want to separate this problem into four sub-problems, one for each
component of the original goal.
• Two of the sub-problems, ONTABLE(A) and ONTABLE(D), are already true in the initial state. So we need to work on the remaining two.
• There are two goal stacks depending on the order of tackling the sub problems.

• where OTAD is the abbreviation for ONTABLE(A) ˄ ONTABLE(D). A single line below an operator represents the goal it is meant to satisfy.
• Let us recapitulate the process of finding the goal in STRIPS.
• In each succeeding step of the planner, the top goal on the stack is pursued.
• When a sequence of operators which satisfies the goal is found, that sequence is applied to the state description, yielding a new description.
• Next, the goal which is then at the top of the stack is explored and an attempt is made to satisfy it, starting from the situation which was produced as a result of satisfying the first goal.
• Hence ON(C, A) is replaced by STACK(C, A), yielding:
STACK(C, A)
ON(B, D)
ON(C, A) ˄ ON(B, D) ˄ OTAD
• But in order to apply STACK(C, A), its preconditions must hold, and these now become sub-goals.
• The new compound sub-goal CLEAR(A) ˄ HOLDING(C) must be broken into components, and we must decide on the order in which to tackle them.
• HOLDING(x) is very easy to achieve: put down something else and then pick up the desired object.
• But in order to do anything else, the robot needs to use the arm; so if we achieve HOLDING first and then try to do something else, HOLDING will no longer be true by the end.

So the heuristic used is:
• If HOLDING is one of several goals to be achieved at once, it should be tackled last; the other sub-goal, CLEAR(A), should be tackled first.
So the new goal stack becomes:
CLEAR(A)
HOLDING(C)
CLEAR(A) ˄ HOLDING(C)
STACK(C, A)
ON(B, D)
ON(C, A) ˄ ON(B, D) ˄ OTAD
• This kind of heuristic information could be contained in the precondition list itself by stating the predicates in the order in which they
should be achieved.
• Next, whether CLEAR(A) is true is checked. It is not; the only operator that makes it true is UNSTACK(B, A).
• Its preconditions form, the sub-goals, so the new goal stack becomes:
ON (B, A)
CLEAR (B)
ARMEMPTY
ON (B, A) ˄ CLEAR (B) ˄ ARMEMPTY
UNSTACK (B, A)
HOLDING(C)
CLEAR (A) ˄ HOLDING(C)
STACK (C, A)
ON (B, D)
ON (C, A) ˄ ON (B, D) ˄ OTAD
• Now, comparing the top element of the goal stack, ON(B, A), with the initial state of the blocks world problem, it is found to be satisfied.
• So it is popped off and the next goal, CLEAR(B), is considered. It is also already true, so this goal can also be popped from the stack.
• The third pre-condition for UNSTACK(B, A), ARMEMPTY, also holds good; hence it can be popped off the stack.
• The next element of the stack is the combined goal representing all of the preconditions for UNSTACK(B, A).
• It is also satisfied, so it too can be popped off the stack. Now the top element of the stack is the operator UNSTACK(B, A).
• Since its preconditions are satisfied, it can be applied to produce a new world model from which the rest of the problem-solving process can continue. This is done by using the ADD and DELETE lists specified for UNSTACK. We also note that UNSTACK(B, A) is the first operator of the proposed solution sequence.

Now the database corresponding to the blocks world model is:
ONTABLE (A) ˄ ONTABLE(C) ˄ ONTABLE (D) ˄ HOLDING (B) ˄ CLEAR (A).

The goal stack now is:


HOLDING(C)
CLEAR (A) ˄ HOLDING(C)
STACK(C, A)
ON (B, D)
ON(C, A) ˄ ON (B, D) ˄ OTAD

The next goal to be satisfied is HOLDING(C), and this is made true by two operators: PICKUP(C) and UNSTACK(C, x), where x could be any block from which block C could be un-stacked.
Using these two operators, the two branches of the search tree are:

Now the question is: which of the two alternatives should be selected?
• Alternative 1 is better than alternative 2, because block C is not on anything, so PICKUP(C) can be used directly.
• The goal stack would then become:
CLEAR(D)
HOLDING(B)
CLEAR(D) ˄ HOLDING(B)
STACK(B, D)
ONTABLE(C) ˄ CLEAR(C) ˄ ARMEMPTY
PICKUP(C)
CLEAR(A) ˄ HOLDING(C)
STACK(C, A)
ON(B, D)
ON(C, A) ˄ ON(B, D) ˄ OTAD
CLEAR (D) and HOLDING (B) are both true; the operation STACK (B, D) can be performed,
producing the model.
ONTABLE (A) ˄ ONTABLE(C) ˄ ONTABLE (D) ˄ ON (B, D) ˄ ARMEMPTY.

• All of the pre-conditions for PICKUP(C) are now satisfied, so it too can be executed. Since all the pre-conditions for STACK(C, A) are true, it also can be executed.
• Now, consider the second part of the original goal, ON(B, D). But this has already been satisfied by the operations which were used to satisfy the first sub-goal. The reader should ascertain for themselves that ON(B, D) can be popped off the goal stack.
• We then check the combined goal, the last step towards finding the
solution:
• ON(C, A) ˄ ON(B, D) ˄ ONTABLE(A) ˄ ONTABLE(D)
• The answer produced by the planner (the order of the operators) will be:
i. UNSTACK (B, A)
ii. STACK (B, D)
iii. PICKUP(C)
iv. STACK(C, A)

Steps in goal stack planning
• The problem solver uses a single stack that contains both the goals and the operators proposed to satisfy those goals.
• The problem solver relies on a database that describes the current situation, and a set of operators described as pre-condition, add and delete lists.
• It attacks problems involving conjoined goals by solving the goals one
at a time, in order.
• A plan generated by this method contains the sequence of operators for attaining the first goal, followed by the sequence for the second goal, and so on.
• This process continues till goal stack is empty.
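A minimal sketch (assumed, not from the slides) of this loop, reusing the Action/apply helpers from the STRIPS sketch earlier; goals are strings (single literals) or frozensets (conjunctions). It is deliberately naive: it picks the first achieving operator and can loop on strongly interacting goals (the Sussman anomaly).

```python
def goal_stack_plan(state: frozenset, goal_literals, operators) -> list:
    stack = [frozenset(goal_literals)]     # start with the combined goal
    plan = []
    while stack:
        top = stack.pop()
        if isinstance(top, Action):
            # All its preconditions were satisfied just above it on the stack.
            state = apply(state, top)
            plan.append(top.name)
        elif isinstance(top, frozenset):   # a conjunctive goal
            if not top <= state:
                stack.append(top)          # re-check it after the sub-goals
                for lit in top:
                    stack.append(lit)      # naive sub-goal ordering
        elif top not in state:             # a single unsatisfied literal
            # First operator whose ADD list achieves it (may fail if none).
            op = next(o for o in operators if top in o.add_list)
            stack.append(op)               # operator below ...
            stack.append(op.preconds)      # ... its precondition goal on top
    return plan
```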

Partial Order Planning (POP)
• State-space search
• Yields totally ordered plans (linear plans)
• POP
• Works on subproblems independently, then combines subplans
• Example
• Goal(RightShoeOn ˄ LeftShoeOn)
• Init()
• Action(RightShoe, PRECOND: RightSockOn, EFFECT: RightShoeOn)
• Action(RightSock, EFFECT: RightSockOn)
• Action(LeftShoe, PRECOND: LeftSockOn, EFFECT: LeftShoeOn)
• Action(LeftSock, EFFECT: LeftSockOn)

POP Example & its linearization

Components of a Plan
1. A set of actions
2. A set of ordering constraints
• A ≺ B reads “A before B”, but not necessarily immediately before B
• Alert: watch out for cycles, e.g. A ≺ B and B ≺ A
3. A set of causal links (protection intervals) between actions
• A →p B reads “A achieves p for B”, and p must remain true from the time A is applied to the time B is applied
• Example: RightSock →RightSockOn RightShoe
4. A set of open preconditions
• Planners work to reduce the set of open preconditions to the empty set
without introducing contradictions

Consistent Plan (POP)
• Consistent plan is a plan that has
• No cycle in the ordering constraints
• No conflicts with the causal links
• Solution
• Is a consistent plan with no open preconditions
• To solve a conflict between a causal link A →p B and an action C (that clobbers, i.e. threatens, the causal link), we force C to occur outside the “protection interval” by adding
• the constraint C ≺ A (demoting C), or
• the constraint B ≺ C (promoting C)
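Demotion and promotion can be sketched as a small branching step. The following assumed sketch (not from the slides) tries both added orderings and keeps those that leave the “before” relation acyclic; has_cycle and resolve_threat are illustrative names.

```python
from collections import defaultdict

def has_cycle(orderings) -> bool:
    # Depth-first search for a cycle in the "before" relation.
    adj = defaultdict(list)
    for x, y in orderings:
        adj[x].append(y)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(adj))

def resolve_threat(orderings: set, causal_link, c):
    # c threatens the link (a, p, b): it could delete p between a and b.
    a, p, b = causal_link
    for extra in [(c, a), (b, c)]:          # demotion, then promotion
        candidate = orderings | {extra}
        if not has_cycle(candidate):        # keep only consistent orderings
            yield candidate
```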

Setting up the PoP
• Add dummy actions Start and Finish
• Start
– Has no preconditions
– Its effects are the literals of the initial state
• Finish
– Its preconditions are the literals of the goal state
– Has no effects
• Initial Plan:
– Actions: {Start, Finish}
– Ordering constraints: {Start ≺ Finish}
– Causal links: {}
– Open preconditions: {LeftShoeOn, RightShoeOn}

POP as a Search Problem
• The successor function arbitrarily picks one open precondition p on an action B
• For every possible consistent action A that achieves p:
– It generates a successor plan adding the causal link A →p B and the ordering constraint A ≺ B
– If A was not already in the plan, it adds Start ≺ A and A ≺ Finish
– It resolves all conflicts between the new causal link and all existing actions, and between A and all existing causal links
– Then it adds the successor states for each combination of resolved conflicts
• It repeats until no open precondition exists

Some assumptions with planning
• The world is accessible, static, deterministic.
• Action descriptions are correct & complete with
exact stated consequences.

• However, the real world is not that perfect. So, how can we handle a partially accessible, dynamic, non-deterministic world with incomplete information?
• What do we usually do?
Planning under non-deterministic domains
• Sensorless Planning
• Conditional Planning
• Execution Monitoring and Replanning
• Continuous Planning

Sensorless Planning

• Also called conformant planning
• Constructs sequential plans to be executed without perception
• Achieves the goals in all possible circumstances
• Example:
• Initial state – chairs and tables
• Goal: paint both with the same colour
• Plan: just open the can and paint both without checking their colours

Conditional planning
• It is also called contingency planning
• Plan differently for different contingencies
• The agent finds out which part of the plan to execute by sensing
• Example: if airport A1 is operational, then fly there, else fly to A2
– Sense the colour of the table and chair; if they are the same colour, the goal is reached – stop
– else, if there is a can with the colour of the chair, paint the table
– else, if there is a can with the colour of the table, paint the chair
– else, paint both
Execution Monitoring and Replanning
• Use the preceding planning techniques to construct a plan
• Use execution monitoring to judge whether the plan has a provision for the current situation or needs to be revised
• For example, if some parts are not painted, repaint them

Continuous planning
• Designed to persist over a lifetime
• Can handle unexpected circumstances in the environment
• Example: plan to go out for dinner, then postpone the painting

Continuous Planning
• Planner and execution monitor => single process
• Monitors the world continuously and updates world model from
percepts
• Example: Blocks world problem
Action: Move(x, y)
Pre-condition: clear(x) ˄ clear(y) ˄ on(x,z)
Effect: on(x,y) ˄ clear(z) ˄ ¬on(x,z) ˄ ¬clear(y)

[Figure: a continuous-planning view of a blocks world problem. START supplies the current percepts onTable(A), on(B,E), on(C,F), on(D,G), clear(A), clear(B), clear(C), clear(D); the step Move(D,B) achieves on(D,B) and clear(G) for FINISH.]
• on(D,G) and clear(B) are negative literals to be dropped, and on(D,B) and clear(G) are positive literals.
Multi-Agent Planning
• Co-operative environment, e.g. team planning – cricket
• Requires co-ordination by communication
• Co-operation: joint goals and plans
– Agents[A, B] – 2 agents
– Actions are mentioned for each agent
– The solution is a joint plan => an action for each agent
– Plan 1: A: [Go(A, RightBaseline), Hit(A, Ball)]; B: [NoOp(B)]
– Plan 2: A: [Go(A, LeftNet), NoOp(A)]; B: [Go(B, RightBaseline), Hit(B, Ball)]
• If A chooses Plan 2 and B chooses Plan 1, the result is failure
Utility learning

[The following slides contain figures only. Surviving example captions: Eg. robot navigation; Eg. travelling salesman problem; Eg. use of alpha-beta in game playing; Eg. conversion from Pascal to C and C++.]
Example: Learning When an Object is a Cup
Target Concept: cup(C) :- premise(C).
Domain Theory:
cup(X) :- liftable(X),
holds_liquid(X).
holds_liquid(Z) :- part(Z,W), concave(W), points_up(W).
liftable(X) :- light(X), part(X,handle).
light(X) :- small(X).
light(X) :- made_of(X,feathers).
Note that the domain theory includes the knowledge needed to determine when something is a cup.
We want an explicit rule that specifies when something is a cup.
Training Example: An Instance of a Cup
cup(obj1) small(obj1) part(obj1,handle)
owns(bob,obj1) part(obj1,bottom)
part(obj1,bowl)
points_up(bowl)
concave(bowl)
color(obj1,red)

• First: Prove that obj1 is a cup.

Next: Generalize the proof to: X is a cup. To generalize, we generalize all constants that depend solely on the training example. So bowl and obj1 are constants found in the training example but not in the domain theory.

Next: Add the new chunk of knowledge to the domain theory.
cup(X) :- small(X),
part(X,handle),
part(X,W),
concave(W),
points_up(W).
Note that none of the irrelevant information in the training example has made it into the proof or
into the new knowledge.

Learning from Examples: Inductive Learning

• It is used in inquiry-based and project-based learning where the goal is
to learn through observation rather than being ‘told’ the answers by the
teacher.
• It is consistent with a constructivist approach to learning as it holds that
knowledge should be constructed in the mind rather
than transferred from the teacher to student.
• It is argued that learning with the inductive approach results in deep
cognitive processing of information, creative independent thinking, and
a rich understanding of the concepts involved.
• It can also lead to long memory retention and strong transferability of
knowledge to other situations.

Inductive Learning vs Deductive Learning
• Generally, inductive learning is a bottom-up approach meaning the
observations precede the conclusions. It involves making
observations, recognizing patterns, and forming generalizations.
• On the other hand, deductive learning is a top-down approach
meaning that it involves a teacher presenting general principles which
are then examined using scientific research.
• Both are legitimate methods, and in fact, despite its limitations, many
students get a lot of pleasure out of doing deductive research in a
physics or chemistry class.

Inductive Learning vs Deductive Learning (comparison):

• Learning Approach
– Inductive: Bottom-up approach starting with examples and experiences.
– Deductive: Top-down approach starting with general principles and theories.
• Reasoning Process
– Inductive: Students go from specific examples and observations to concluding with general principles or rules.
– Deductive: Students move from general principles or rules (e.g. theories, hypotheses, and presuppositions) to specific examples in order to test the theories.
• Teacher’s Role
– Inductive: The teacher facilitates discovery and exploration of new concepts and ideas in an inquiry-based classroom environment.
– Deductive: The teacher presents an idea, then guides students through exploring and testing concepts and ideas.
• Learner’s Role
– Inductive: The student is an active participant in the learning process, discovering new information on their own.
– Deductive: The student starts as a passive receiver of information, but the act of testing theories is active and still involves critique and analysis.
• Thinking Skills
– Inductive: Inductive reasoning, creative thinking, critical thinking, hypothesizing.
– Deductive: Deductive reasoning, analyzing, debunking, critical thinking.
• Real-life Applications
– Inductive: More suitable for real-life situations where students must use trial-and-error to find solutions.
– Deductive: More suitable for abstract and theoretical concepts where students must apply principles and rules to specific examples.
Strengths of Inductive Learning:
• Encourages Active Learning: Students must learn through experimentation, observation, and trial-and-error.
• Helps Develop Critical Thinking Skills: Students are encouraged to think critically and actively analyze what they observe in their experiments.
• Can Lead to New Insights: Because students aren’t given the information at the outset, students often come to conclusions that are surprising and innovative.

Limitations of Inductive Learning:
• Requires Extensive Time and Effort: Teachers have minimal time to present concepts in a crowded curriculum. Often, it makes more sense to use deductive learning, especially if it leads to the same learning outcomes.
• May Not Provide Clear Guidelines for Learning: One of the biggest challenges I’ve faced as both a learner and teacher is ensuring students understand the direction and point of each lesson. The teacher wants students to discover information for themselves, but the students also need guidance and scaffolding to stay on track.
• May Not End with Correct Answers: When students construct information themselves, they may use faulty logic or methodologies. To address this, the teacher needs to set in place strong guidelines on how to observe and experiment while still leaving open possibilities for surprising conclusions.
A Blocks World Learning Example -- Winston (1975)
• The goal is to construct representation of the definitions of concepts in this domain.
• Concepts such as a house – a brick (rectangular block) with a wedge (triangular block) suitably placed on top of it, a tent – 2 wedges touching side by side, or an arch – two non-touching bricks supporting a third wedge or brick, were learned.
• The idea of near miss objects -- similar to actual instances was introduced.
• Input was a line drawing of a blocks world structure.
• Input processed to produce a semantic net representation of the structural description of
the object
• Links in network include left-of, right-of, does-not-marry, supported-by, has-part, and isa.
• The marry relation is important -- two objects with a common touching edge are said to
marry. Marrying is assumed unless does-not-marry stated.

There are three basic steps to the problem of concept formulation:
1. Select one known instance of the concept. Call this the concept definition.
2. Examine definitions of other known instances of the concept. Generalise the definition to include them.
3. Examine descriptions of near misses. Restrict the definition to exclude these.
• Both steps 2 and 3 rely on comparison and both similarities and
differences need to be identified.

Version Space
• A version space is a hierarchical representation of knowledge that enables you to keep track of all the useful information
supplied by a sequence of learning examples without remembering any of the examples.
• The version space method is a concept learning process accomplished by managing multiple models within a version
space.
Version Space Characteristics
• Tentative heuristics are represented using version spaces.
• A version space represents all the alternative plausible descriptions of a heuristic.
A plausible description is one that is applicable to all known positive examples and no known negative example.
• A version space description consists of two complementary trees:
• One that contains nodes connected to overly general models, and
• One that contains nodes connected to overly specific models.
• Node values/attributes are discrete.
Fundamental Assumptions
• The data is correct; there are no erroneous instances.
• A correct description is a conjunction of some of the attributes with values.

Diagrammatical Guidelines
• There is a generalization tree and a specialization tree.
• Each node is connected to a model.
• Nodes in the generalization tree are connected to a model that matches
everything in its subtree.
• Nodes in the specialization tree are connected to a model that matches
only one thing in its subtree.
• Links between nodes and their models denote
• generalization relations in a generalization tree, and
• specialization relations in a specialization tree.

Diagram of a Version Space
In the diagram below, the specialization tree is colored red, and the generalization tree is colored green.

Problem 1:
• Learning the concept of "Japanese Economy Car"
• Features: ( Country of Origin, Manufacturer, Color, Decade, Type )
Origin Manufacturer Color Decade Type Example Type
Japan Honda Blue 1980 Economy Positive
Japan Toyota Green 1970 Sports Negative
Japan Toyota Blue 1990 Economy Positive
USA Chrysler Red 1980 Economy Negative
Japan Honda White 1980 Economy Positive

Solution:
1. Positive Example: (Japan, Honda, Blue, 1980, Economy)

• Initialize G to a singleton set that includes everything: G = { (?, ?, ?, ?, ?) }
• Initialize S to a singleton set that includes the first positive example: S = { (Japan, Honda, Blue, 1980, Economy) }

These models represent the most general and the most specific heuristics one might learn.
The actual heuristic to be learned, "Japanese Economy Car", probably lies between them somewhere within the version space.
2. Negative Example: (Japan, Toyota, Green, 1970, Sports)
Specialize G to exclude the negative example.

G = { (?, Honda, ?, ?, ?),
      (?, ?, Blue, ?, ?),
      (?, ?, ?, 1980, ?),
      (?, ?, ?, ?, Economy) }
S = { (Japan, Honda, Blue, 1980, Economy) }

Refinement occurs by generalizing S or specializing G, until the heuristic hopefully converges to one that works well.
• 3. Positive Example: (Japan, Toyota, Blue, 1990, Economy)
Prune G to exclude descriptions inconsistent with the positive example.
Generalize S to include the positive example.

G = { (?, ?, Blue, ?, ?),
      (?, ?, ?, ?, Economy) }
S = { (Japan, ?, Blue, ?, Economy) }

4. Negative Example: (USA, Chrysler, Red, 1980, Economy)
Specialize G to exclude the negative example (but stay consistent with S)

G = { (?, ?, Blue, ?, ?),
      (Japan, ?, ?, ?, Economy) }
S = { (Japan, ?, Blue, ?, Economy) }

5. Positive Example: (Japan, Honda, White, 1980, Economy)
Prune G to exclude descriptions inconsistent with the positive example.
Generalize S to include the positive example.
G = { (Japan, ?, ?, ?, Economy) }
S = { (Japan, ?, ?, ?, Economy) }
• G and S are singleton sets and S = G: the algorithm has converged. There is no more data, so the algorithm stops.
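The G/S updates used in steps 1–5 can be written compactly. Below is a minimal sketch (assumed, not from the slides) of the two update operations for conjunctive hypotheses over attribute tuples, with '?' as the wildcard; matches, generalize_S and specialize_G are illustrative names.

```python
def matches(h, x):
    # A hypothesis matches an example if every non-wildcard attribute agrees.
    return all(a == '?' or a == v for a, v in zip(h, x))

def generalize_S(s, x):
    # Minimal generalization toward positive example x: wildcard disagreements.
    return tuple(a if a == v else '?' for a, v in zip(s, x))

def specialize_G(g, s, x):
    # Minimal specializations of g that exclude negative example x,
    # staying consistent with the specific hypothesis s.
    out = []
    for i, a in enumerate(g):
        if a == '?' and s[i] != x[i]:
            h = list(g)
            h[i] = s[i]
            out.append(tuple(h))
    return out

s = ('Japan', 'Honda', 'Blue', '1980', 'Economy')
g = ('?', '?', '?', '?', '?')
print(specialize_G(g, s, ('Japan', 'Toyota', 'Green', '1970', 'Sports')))
# -> the four specializations listed in step 2 above
```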

Decision Tree

Tree-Based Methods
• The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.

• Tree-based methods for regression and classification involve stratifying or segmenting the
predictor space into a number of simple regions.

• Since the set of splitting rules used to segment the predictor space can be summarized in a tree,
these types of machine learning approaches are known as decision tree methods.

• The basic idea of these methods is to partition the space and identify some representative
centroids.

Regression Trees
• One way to make predictions in a regression problem is to divide the predictor
space (i.e. all the possible values for X1, X2,…, Xp) into distinct regions, say R1, R2,
…,Rk (terminal nodes or leaves).

• Then, for every X that falls into a particular region (say Rj) we make the same
prediction.

• Decision trees are typically drawn upside down, in the sense that the leaves are
at the bottom of the tree.

• The points along the tree where the predictor space is split are referred to as
internal nodes. The segments of the trees that connect the nodes are branches.
Regression Trees (cont.)
• Suppose we have two regions R1 and R2, with mean responses 10 and 20 respectively.
• For any value of X that falls into R1, we would predict 10; for any value of X that falls into R2, we would predict 20.
• In this illustrative example, we have two predictors and five distinct regions. Depending on which region the new X comes from, we would make one of five possible predictions for Y.
• Generally, we create the partitions by iteratively
splitting one of the X variables into two regions.

• First split on X1 = t1

• If X1 < t1, split on X2 = t2

• If X1 > t1, split on X1 = t3

• If X1 > t3, split on X2 = t4
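Written out as code, the partition above is nothing more than nested comparisons. A minimal sketch (assumed, not from the slides); t1…t4 are the thresholds from the slide and the region names R1…R5 are arbitrary labels.

```python
def region(x1, x2, t1, t2, t3, t4):
    # First split on X1 = t1
    if x1 < t1:
        return "R1" if x2 < t2 else "R2"    # left side splits on X2 = t2
    # Right side splits again on X1 = t3
    if x1 < t3:
        return "R3"
    return "R4" if x2 < t4 else "R5"        # and finally on X2 = t4
```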

• When we create partitions this way, we can always represent them using a tree structure.
• As a result, this provides a very simple way to explain the model to a non-expert.

Classification Trees
• Classification trees are a hierarchical way of partitioning the space.

• We start with the entire space and recursively divide it into smaller regions.

• At the end, every region is assigned with a class label.

• We start with a medical example to get a rough idea about classification trees.

Classification Trees (cont.)
• One big advantage of decision trees is that the classifier generated is highly interpretable. For physicians, this is an especially desirable feature.

• In this example, patients are classified into one of two classes: high risk versus low
risk.

• It is predicted that the high risk patients would not survive at least 30 days based on
the initial 24-hour data.

• There are 19 measurements taken from each patient during the first 24 hours. These
include blood pressure, age, etc.
Classification Trees (cont.)
• Here, we generate a tree-structured classification rule, which can be interpreted as follows:

• Only three measurements are looked at by this classifier. For some patients, only one measurement
determines the final result.
• Classification trees operate similarly to a doctor's examination.
Classification Trees (cont.)
• First we look at the minimum systolic blood pressure within the initial 24 hours and determine
whether it is above 91.

• If the answer is no, the patient is classified as high-risk. We don't need to look at the other
measurements for this patient.

• If the answer is yes, then we can't make a decision yet. The classifier will then look at whether the
patient's age is greater than 62.5 years old.

• If the answer is no, the patient is classified as low risk. However, if the patient is over 62.5 years old,
we still cannot make a decision and then look at the third measurement, specifically, whether sinus
tachycardia is present.

• If the answer is no, the patient is classified as low risk. If the answer is yes, the patient is classified as high risk.
Classification Trees (cont.)
Business marketing: predict whether a person will buy a
computer?

Classification Trees (cont.)
Notation:
• We will denote the feature space by X. Normally X is a
multidimensional Euclidean space.

• However, sometimes some variables may be categorical such as


gender (male or female).

• Classification and Regression Trees (CART) have the advantage of


treating real variables and categorical variables in a unified manner.

• This is not so for many other classification methods, such as LDA.

Classification Trees (cont.)
Notation (cont’d):
• The input vector is indicated by X, which contains the p features X1, …, Xp.

• Tree structured classifiers are constructed by repeated splits of the space X into
smaller and smaller subsets, beginning with X itself.

• Node: Any of the nodes in the classification tree, where each corresponds to a
region in the original space.

• Terminal Node: The final node resulting from successive splits, where each is
assigned a unique class.
Classification Trees (cont.)
Notation (cont’d):
• Parent Node: those nodes that are split into two child nodes.

• Child Node: these result from the splitting of a parent node. Two
child nodes are two different regions. Together they occupy the same
region of the parent node.

Notation (cont’d):
• One thing that we need to keep in mind is that the tree represents recursive splitting of
the space.

• Therefore, every node of interest corresponds to one region in the original space.

• Two child nodes will occupy two different regions and if we put the two together, we get
the same region as that of the parent node.

• In the end, every leaf node is assigned with a class and a test point is assigned with the
class of the leaf node it lands in.

Notation (cont’d):
• A node is denoted by t. We will also denote the left child node by tL and the right
one by tR.

• We denote the collection of all the nodes in the tree by T and the collection of all the leaf nodes by T̃.

• A split will be denoted by s, and the set of splits is denoted by S.

• Let’s next take a look at how these splits can take place, where the whole space is
represented by X.
Types of Decision Trees
• Categorical Variable Decision Tree: a decision tree which has a categorical target variable is called a categorical variable decision tree.
• Continuous Variable Decision Tree: a decision tree which has a continuous target variable is called a continuous variable decision tree.
• Example:-
 Let’s say we have a problem to predict whether a customer will pay his renewal
premium with an insurance company (yes/ no).
Here we know that the income of customers is a significant variable but the insurance
company does not have income details for all customers.
Now, as we know this is an important variable, then we can build a decision tree to
predict customer income based on occupation, product, and various other variables. In
this case, we are predicting values for the continuous variables.
• The construction of a tree involves the following three elements:
• The selection of the splits, i.e., how do we decide which node (region) to split and how to
split it?
• If we know how to make splits or 'grow' the tree, how do we decide when to declare a
node terminal and stop splitting?
• We have to assign each terminal node to a class. How do we assign these class labels?

Assumptions while creating a Decision Tree
Below are some of the assumptions we make while using Decision tree:
• In the beginning, the whole training set is considered as the root.
• Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
• Records are distributed recursively on the basis of attribute values.
• The order of placing attributes as the root or as internal nodes of the tree is decided by using some statistical approach.

Algorithms used in Decision Trees:

ID3 → (extension of D3)


C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)

ID3 Algorithm (Iterative Dichotomiser 3)
• The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches with no backtracking.
• A greedy algorithm, as the name suggests, always makes the choice that seems to be the best
at that moment.(Iteratively divides)
• Steps in ID3 algorithm:
 It begins with the original set S as the root node.
 On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of this attribute.
 It then selects the attribute which has the smallest Entropy or Largest Information gain.
 The set S is then split by the selected attribute to produce a subset of the data.
 The algorithm continues to recur on each subset, considering only attributes never selected before.
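The steps above translate almost directly into code. Below is a minimal sketch (assumed, not from the slides) for categorical data stored as a list of dicts; entropy, info_gain and id3 are illustrative names.

```python
import math
from collections import Counter

def entropy(rows, target):
    # H = sum over classes of -p * log2(p)
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    # Entropy before the split minus weighted entropy after the split.
    total = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:
        return classes.pop()                    # pure node -> leaf
    if not attrs:                               # no attributes left: majority
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, attrs - {best}, target)
    return tree
```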

Attribute Selection Measures
• If the dataset consists of N attributes then deciding which attribute to place at the root or at
different levels of the tree as internal nodes is a complicated step.
• By just randomly selecting any node to be the root can’t solve the issue.
• If we follow a random approach, it may give us bad results with low accuracy.
• For solving this attribute selection problem, researchers worked and devised some solutions.
They suggested using some criteria like :
 Entropy
 Information gain,
 Gini index,
 Gain Ratio,
 Reduction in Variance
 Chi-Square
 These criteria will calculate values for every attribute. The values are sorted, and attributes are
placed in the tree by following the order i.e, the attribute with a high value(in case of
information gain) is placed at the root.

 While using Information Gain as a criterion, we assume attributes to be categorical, and for the Gini index, attributes are assumed to be continuous.
Entropy
• Entropy is a measure of the randomness in the information being processed.
• The higher the entropy, the harder it is to draw any conclusions from that information.
• Flipping a fair coin is an example of an action that provides random information: the two outcomes have P(X=H) = P(X=T) = 1/2, so H(X) = −[(1/2)·log₂(1/2) + (1/2)·log₂(1/2)] = 1.
• If instead P(X=H) = 1 and P(X=T) = 0 (or vice versa), both terms are zero, so the entropy is 0.

• ID3 follows the rule — A branch with an entropy of zero is a leaf
node and A branch with entropy more than zero needs further
splitting.
• Mathematically, entropy for 1 attribute is represented as:

E(S) = Σᵢ −pᵢ log₂ pᵢ

where S → current state, and pᵢ → probability of an event i of state S, or the percentage of class i in a node of state S.
• Example (Play Golf data): P(play golf) = 9/14 = 0.64 and P(don't play golf) = 5/14 = 0.36.
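As a quick sanity check, the figure above can be reproduced with the entropy() helper from the earlier ID3 sketch; the rows list below is simply 9 "yes" and 5 "no" records.

```python
# 9 play-golf days and 5 non-play days, as in the example above.
rows = [{"play": "yes"}] * 9 + [{"play": "no"}] * 5
print(round(entropy(rows, "play"), 3))   # 0.94, the entropy of a 9/5 split
```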

• Mathematically, entropy for multiple attributes is represented as:

E(T, X) = Σ_{c ∈ X} P(c) · E(c)

where T → current state and X → selected attribute.
• Example (per-value entropies for one attribute):
E(3,2) = −[(3/5)·log₂(3/5) + (2/5)·log₂(2/5)]
E(4,0) = −[(4/4)·log₂(4/4) + (0/4)·log₂(0/4)] = 0
E(2,3) = −[(2/5)·log₂(2/5) + (3/5)·log₂(3/5)]

Information Gain
• Information gain or IG is a statistical property that measures how well a
given attribute separates the training examples according to their target
classification.
• Constructing a decision tree is all about finding an attribute that returns
the highest information gain and the smallest entropy.

• Information gain is a decrease in entropy.
• It computes the difference between entropy before split and average
entropy after split of the dataset based on given attribute values.
• ID3 (Iterative Dichotomiser) decision tree algorithm uses information
gain.
• Mathematically, IG is represented as: IG(T, X) = E(T) − E(T, X)

In a much simpler way, we can conclude that:

IG = E(before) − Σⱼ₌₁ᴷ E(j, after)

where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
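Reusing info_gain() from the earlier ID3 sketch, the sketch below reproduces the Gain(S_sunny, ·) values computed on the later slides, using the five Outlook = Sunny rows of the tennis table that follows; the single-letter keys are shorthand for Temperature, Humidity and Wind.

```python
# The five Outlook = Sunny rows (examples 1, 2, 8, 9, 11): 2 "+", 3 "-".
sunny = [
    {"T": "H", "H": "H", "W": "W", "play": "-"},
    {"T": "H", "H": "H", "W": "S", "play": "-"},
    {"T": "M", "H": "H", "W": "W", "play": "-"},
    {"T": "C", "H": "N", "W": "W", "play": "+"},
    {"T": "M", "H": "N", "W": "S", "play": "+"},
]
for attr in ("T", "H", "W"):
    print(attr, round(info_gain(sunny, attr, "play"), 2))
# -> T 0.57, H 0.97, W 0.02  (matching Gain(S_sunny, ·) on the later slides)
```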

Will I play tennis today? – ID3 Algorithm
#   O  T  H  W  Play?
1   S  H  H  W  -
2   S  H  H  S  -
3   O  H  H  W  +
4   R  M  H  W  +
5   R  C  N  W  +
6   R  C  N  S  -
7   O  C  N  S  +
8   S  M  H  W  -
9   S  C  N  W  +
10  R  M  N  W  +
11  S  M  N  S  +
12  O  M  H  S  +
13  O  H  N  W  +
14  R  M  H  S  -

Outlook: S(unny), O(vercast), R(ainy); Temperature: H(ot), M(edium), C(ool); Humidity: H(igh), N(ormal), L(ow); Wind: S(trong), W(eak)

Basic Decision Trees Learning Algorithm
• Data is processed in batch (i.e. all the data is available).
• Algorithm: recursively build a decision tree top down.

Resulting tree for the data above:

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes

• Gain(S_sunny, Temperature)
– Temperature = Hot: 2 examples, both −; Entropy(T = H) = 0
– Temperature = Medium: 2 examples, 1 +, 1 −; Entropy(T = M) = 1
– Temperature = Cool: 1 example, +; Entropy(T = C) = 0

Expected entropy = (2/5) × 0 + (2/5) × 1 + (1/5) × 0 = 2/5 = 0.4
Information gain = 0.971 − 0.4 = 0.57

Gain(S_sunny, Temp) ≈ 0.57
• Gain(S_sunny, Humidity)
– Humidity = High: 3 examples, all −; Entropy(H = H) = 0
– Humidity = Normal: 2 examples, both +; Entropy(H = N) = 0

Expected entropy = (3/5) × 0 + (2/5) × 0 = 0
Information gain(Sunny, Humidity) = 0.971 − 0 = 0.971
• Gain(S_sunny, Wind)
– Wind = Strong: 2 examples, 1 +, 1 −; Entropy(W = S) = 1
– Wind = Weak: 3 examples, 1 +, 2 −; Entropy(W = W) = 0.918

Expected entropy = (2/5) × 1 + (3/5) × 0.918 = 0.952
Information gain(Sunny, Wind) = 0.971 − 0.952 = 0.02
Gain(S_sunny, Humidity) = 0.97 − (3/5)·0 − (2/5)·0 = 0.97
Gain(S_sunny, Temp) = 0.97 − (2/5)·0 − (2/5)·1 − (1/5)·0 = 0.57
Gain(S_sunny, Wind) = 0.97 − (2/5)·1 − (3/5)·0.92 = 0.02
An Illustrative Example (V)

Outlook
├─ Sunny: examples 1,2,8,9,11 (2+, 3−) → ?
├─ Overcast: examples 3,7,12,13 (4+, 0−) → Yes
└─ Rain: examples 4,5,6,10,14 (3+, 2−) → ?
An Illustrative Example (V)

Outlook
├─ Sunny: examples 1,2,8,9,11 (2+, 3−) → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast: examples 3,7,12,13 (4+, 0−) → Yes
└─ Rain: examples 4,5,6,10,14 (3+, 2−) → ?
An Illustrative Example (VI)

Outlook
├─ Sunny: examples 1,2,8,9,11 (2+, 3−) → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast: examples 3,7,12,13 (4+, 0−) → Yes
└─ Rain: examples 4,5,6,10,14 (3+, 2−) → Wind
    ├─ Strong → No
    └─ Weak → Yes
Statistical Learning

• As intuitive as it sounds from its name, statistical machine learning


involves using statistical techniques to develop models that can learn
from data and make predictions or decisions.
• You might have heard technical terms such as supervised, unsupervised and semi-supervised learning – they all rely on a solid statistical foundation.
• In essence, statistical machine learning merges the computational
efficiency and adaptability of machine learning algorithms with statistical
inference and modeling capabilities.
• By employing statistical methods, we can extract significant patterns,
relationships, and insights from intricate datasets, thereby promoting the
effectiveness of machine learning algorithms.
1. Bayesian Learning

• Example (candy-bag hypotheses h1…h5 with priors 0.1, 0.2, 0.4, 0.2, 0.1): the prediction is a weighted average over all hypotheses,

P(lime | d) = Σᵢ P(lime | hᵢ) · P(hᵢ | d), with P(hᵢ | d) ∝ P(d | hᵢ) · P(hᵢ)

• After d = 10 lime observations, the unnormalized posterior terms P(d | hᵢ) · P(hᵢ) are (0.1 × 0), (0.2 × 0.25^10), (0.4 × 0.5^10), (0.2 × 0.75^10 ≈ 0.2 × 0.0563) and (0.1 × 1).
Bayesian Learning
• Optimal results
• No overfitting
• Hypothesis space is large
• Large summation problem

2. Maximum A Posteriori Learning (MAP)
• Make the prediction using the single most probable hypothesis, hMAP
• hMAP = argmax P(hᵢ | d)
• P(cherry | d) ≈ P(cherry | hMAP)
• Unnormalized posteriors P(d | hᵢ) · P(hᵢ) after observing one lime:
• P(h1 | d) ∝ 0 × 0.1 = 0
• P(h2 | d) ∝ 0.25 × 0.2 = 0.05
• P(h3 | d) ∝ 0.5 × 0.4 = 0.2
• P(h4 | d) ∝ 0.75 × 0.2 = 0.15
• P(h5 | d) ∝ 1 × 0.1 = 0.1
• hMAP = h3 (largest value, 0.2), so the prediction is P(cherry | hMAP) = P(cherry | h3) = 0.5
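A minimal sketch (assumed, not from the slides) contrasting full Bayesian prediction with the MAP shortcut on the candy example: priors 0.1 … 0.1 for h1…h5 and the corresponding lime probabilities 0, 0.25, 0.5, 0.75, 1.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def posterior(n_limes):
    # P(h_i | d) proportional to P(d | h_i) * P(h_i), then normalized.
    w = [p * (pl ** n_limes) for p, pl in zip(priors, p_lime)]
    z = sum(w)
    return [x / z for x in w]

post = posterior(1)                      # after observing one lime
# Full Bayesian prediction: weighted average over all hypotheses.
bayes = sum(po * (1 - pl) for po, pl in zip(post, p_lime))
# MAP: commit to the single most probable hypothesis.
h_map = max(range(5), key=lambda i: post[i])
map_pred = 1 - p_lime[h_map]
print(h_map + 1, round(bayes, 2), map_pred)   # 3  0.35  0.5
```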
• Features (MAP)
• It is less accurate than Bayesian learning
• It relies on a single hypothesis only
• No overfitting

3. Maximum Likelihood Estimation (MLE)
• Simplifies maximum a posteriori learning by assuming a uniform prior probability P(hᵢ)
• hML = argmax P(d | h)
• P(lime | h5) = 1 is the maximum, so hML = h5
• It is less accurate than Bayesian learning and MAP
• Ignores prior information
• Relies on one hypothesis alone

4. Naïve Bayesian Classifier
• Suppose Y is a class variable and X = {X1, X2, …, Xn} is a set of attributes; each training record is an instance of X labelled with a value of Y.

INPUT (X)            CLASS (Y)
x11 x12 … x1n        y1
x21 x22 … x2n        y2
…                    …

• The classification problem can then be expressed as computing the class-conditional probability P(Y | X1, X2, …, Xn).
Naïve Bayesian Classifier
• The Naïve Bayesian classifier calculates this posterior probability using Bayes' theorem, which is as follows.
• From Bayes' theorem on conditional probability, we have

P(Y | X) = P(X | Y) · P(Y) / P(X)

where P(X) = Σᵢ P(X | Y = yᵢ) · P(Y = yᵢ).

Note:
 P(X) is called the evidence (also the total probability) and it is a constant.
 The probability P(Y|X) (also called the class conditional probability) is therefore proportional to P(X|Y) · P(Y).
 Thus, P(Y|X) can be taken as a measure of Y given X: P(Y|X) ∝ P(X|Y) · P(Y).
Naïve Bayesian Classifier
• Suppose, for a given instance of X (say x = (x1, x2, …, xn)), there are two class conditional probabilities, P(Y = y1 | X = x) and P(Y = y2 | X = x).
• If P(Y = y1 | X = x) > P(Y = y2 | X = x), then we say that y1 is stronger than y2 for the instance X = x.
• The strongest yᵢ gives the classification for the instance X = x.
Naïve Bayesian Classifier
• Example: With reference to the Air Traffic Dataset mentioned earlier, let us tabulate all the posterior and prior probabilities as shown below.

Days      Season  Fog     Rain    Class
Weekday   Spring  None    None    On Time
Weekday   Winter  None    Slight  On Time
Weekday   Winter  None    None    On Time
Holiday   Winter  High    Slight  Late
Saturday  Summer  Normal  None    On Time
Weekday   Autumn  Normal  None    Very Late
Holiday   Summer  High    Slight  On Time
Sunday    Summer  Normal  None    On Time
Weekday   Winter  High    Heavy   Very Late
Weekday   Summer  None    Slight  On Time
Saturday  Spring  High    Heavy   Cancelled
Weekday   Summer  High    Slight  On Time
Weekday   Winter  Normal  None    Late
Weekday   Summer  High    None    On Time
Weekday   Winter  Normal  Heavy   Very Late
Saturday  Autumn  High    Slight  On Time
Weekday   Autumn  None    Heavy   On Time
Holiday   Spring  Normal  Slight  On Time
Weekday   Spring  Normal  None    On Time
Weekday   Spring  Normal  Heavy   On Time

Class-conditional probabilities (Day, Season):

Attribute         On Time      Late       Very Late   Cancelled
Day: Weekday      9/14 = 0.64  1/2 = 0.5  3/3 = 1     0/1 = 0
Day: Saturday     2/14 = 0.14  1/2 = 0.5  0/3 = 0     1/1 = 1
Day: Sunday       1/14 = 0.07  0/2 = 0    0/3 = 0     0/1 = 0
Day: Holiday      2/14 = 0.14  0/2 = 0    0/3 = 0     0/1 = 0
Season: Spring    4/14 = 0.29  0/2 = 0    0/3 = 0     0/1 = 0
Season: Summer    6/14 = 0.43  0/2 = 0    0/3 = 0     0/1 = 0
Season: Autumn    2/14 = 0.14  0/2 = 0    1/3 = 0.33  0/1 = 0
Season: Winter    2/14 = 0.14  2/2 = 1    2/3 = 0.67  0/1 = 0
Naïve Bayesian Classifier

Class-conditional probabilities (Fog, Rain) and prior probabilities:

Attribute          On Time       Late        Very Late    Cancelled
Fog: None          5/14 = 0.36   0/2 = 0     0/3 = 0      0/1 = 0
Fog: High          4/14 = 0.29   1/2 = 0.5   1/3 = 0.33   1/1 = 1
Fog: Normal        5/14 = 0.36   1/2 = 0.5   2/3 = 0.67   0/1 = 0
Rain: None         5/14 = 0.36   1/2 = 0.5   1/3 = 0.33   0/1 = 0
Rain: Slight       8/14 = 0.57   0/2 = 0     0/3 = 0      0/1 = 0
Rain: Heavy        1/14 = 0.07   1/2 = 0.5   2/3 = 0.67   1/1 = 1
Prior Probability  14/20 = 0.70  2/20 = 0.10 3/20 = 0.15  1/20 = 0.05
Naïve Bayesian Classifier
Instance:

Weekday  Winter  High  Heavy  ???

Case 1: Class = On Time  : 0.70 × 0.64 × 0.14 × 0.29 × 0.07 = 0.0013
Case 2: Class = Late     : 0.10 × 0.50 × 1.0 × 0.50 × 0.50 = 0.0125
Case 3: Class = Very Late: 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
Case 4: Class = Cancelled: 0.05 × 0.0 × 0.0 × 1.0 × 1.0 = 0.0000

Case 3 is the strongest; hence the correct classification is Very Late.
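The four cases above are just prior × per-attribute conditionals. A minimal sketch (assumed, not from the slides) that reproduces them, with the probabilities read off the tables above:

```python
priors = {"On Time": 0.70, "Late": 0.10, "Very Late": 0.15, "Cancelled": 0.05}
cond = {  # P(attribute value | class)
    "On Time":   {"Weekday": 0.64, "Winter": 0.14, "High": 0.29, "Heavy": 0.07},
    "Late":      {"Weekday": 0.50, "Winter": 1.00, "High": 0.50, "Heavy": 0.50},
    "Very Late": {"Weekday": 1.00, "Winter": 0.67, "High": 0.33, "Heavy": 0.67},
    "Cancelled": {"Weekday": 0.00, "Winter": 0.00, "High": 1.00, "Heavy": 1.00},
}

def classify(instance):
    # Score each class by prior times the product of its conditionals.
    scores = {}
    for c in priors:
        s = priors[c]
        for v in instance:
            s *= cond[c][v]
        scores[c] = s
    return max(scores, key=scores.get), scores

label, scores = classify(["Weekday", "Winter", "High", "Heavy"])
print(label)   # Very Late
```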

Naïve Bayesian Classifier
Algorithm: Naïve Bayesian Classification

Input: a set of k mutually exclusive and exhaustive classes C = {C1, C2, …, Ck}, which have prior probabilities P(C1), P(C2), …, P(Ck). There is an n-attribute set A = {A1, A2, …, An}, which for a given instance has values A1 = a1, A2 = a2, …, An = an.

Step: for each Cᵢ, calculate the class-conditional value pᵢ = P(Cᵢ) · Πⱼ P(Aⱼ = aⱼ | Cᵢ), for i = 1, 2, …, k.

Output: the class Cₓ with the maximum pₓ is the classification.

Note: Σᵢ pᵢ ≠ 1, because the pᵢ are not probabilities but values proportional to the posterior probabilities.
Naïve Bayesian Classifier
Pros and Cons
• The Naïve Bayes’ approach is a very popular one, which often works well.

• However, it has a number of potential problems

 It relies on all attributes being categorical.

 If the data is less, then it estimates poorly.

Naïve Bayesian Classifier
M-estimate of Conditional Probability

• The M-estimate is used to deal with a potential problem of the Naïve Bayesian Classifier when the training data size is too small.
• If the posterior probability for one of the attributes is zero, then the overall class-conditional probability for the class vanishes.

• In other words, if training data do not cover many of the attribute values, then we may not
be able to classify some of the test records.

• This problem can be addressed by using the M-estimate approach.

M-estimate Approach
• The M-estimate approach can be stated as follows:

P(Aⱼ = aⱼ | Cᵢ) = (nc + m·p) / (n + m)

where
n = total number of instances from class Cᵢ,
nc = number of training examples from class Cᵢ that take the value Aⱼ = aⱼ,
m = a parameter known as the equivalent sample size, and
p = a user-specified prior estimate.

Note: if n = 0, that is, if there is no training set available, then P(Aⱼ = aⱼ | Cᵢ) = p; this gives a default value in the absence of sample data.
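A minimal sketch (assumed) of the formula, applied to one zero-count cell from the tables above (Rain = Slight given Late, 0 of 2 examples), with illustrative values m = 3 and p = 1/3:

```python
def m_estimate(nc: int, n: int, m: float, p: float) -> float:
    """P(attribute value | class) smoothed toward the prior p with weight m."""
    return (nc + m * p) / (n + m)

# Rain = Slight given Late: nc = 0, n = 2; m = 3 and p = 1/3 are assumptions.
print(round(m_estimate(0, 2, 3, 1/3), 2))   # 0.2 instead of 0
```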
A Practice Example (Example 8.4)

Class: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data instance X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
A Practice Example (cont.)
• P(Cᵢ):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Cᵢ) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• P(X|Cᵢ) · P(Cᵢ):
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore, X belongs to the class buys_computer = "yes".
5. Reinforcement Learning

• Reinforcement Learning is a feedback-based Machine learning technique in


which an agent learns to behave in an environment by performing the actions
and seeing the results of actions.
• For each good action, the agent gets positive feedback, and for each bad action,
the agent gets negative feedback or penalty.
• In reinforcement learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience only.
• RL solves a specific type of problem where decision making is sequential, and the
goal is long-term, such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in
reinforcement learning is to improve the performance by getting the maximum positive rewards.

• The agent learns through a process of trial and error, and based on the experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.

• It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.

• Example: Suppose there is an AI agent present within a maze environment, and his goal is to find the
diamond. The agent interacts with the environment by performing some actions, and based on those
actions, the state of the agent gets changed, and it also receives a reward or penalty as feedback.

• The agent continues doing these three things (take action, change state/remain in the same state, and
get feedback), and by doing these actions, he learns and explores the environment.

24/12/2024 177
• The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties.

Applications:
• Robotics: Robots can learn to perform tasks in the physical world using this technique.
• Video gameplay: Reinforcement learning has been used to teach bots to play a number of video games.
• Resource management: Given finite resources and a defined goal, reinforcement learning can help enterprises plan how to allocate resources.
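To make the agent/state/action/reward loop concrete, here is a small tabular Q-learning sketch for a maze-like grid world; the grid size, rewards, and learning parameters are hypothetical values chosen for illustration, not prescribed by the slides:

import random

# A tiny 3x3 maze: the agent starts at (0, 0) and earns +10 for
# reaching the diamond at (2, 2); each move costs a small penalty.
SIZE, GOAL = 3, (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2         # learning rate, discount, exploration

def step(state, action):
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < SIZE and 0 <= c < SIZE):  # bumping a wall: stay put, penalty
        return state, -1.0
    return (r, c), (10.0 if (r, c) == GOAL else -0.1)

for episode in range(500):
    state = (0, 0)
    while state != GOAL:
        if random.random() < epsilon:                          # explore
            action = random.choice(ACTIONS)
        else:                                                  # exploit current knowledge
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Q-learning update: move Q toward reward + discounted best future value
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))  # greedy first move after learning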
6. Artificial Neural Network
An artificial neural network consists of a pool of simple
processing units which communicate by sending signals to
each other over a large number of weighted connections.
A set of major aspects of a parallel distributed model includes:
 a set of processing units (cells).
 a state of activation for every unit, which is equivalent to the output of the unit.
 connections between the units. Generally, each connection is defined by a weight.
 a propagation rule, which determines the effective input of a unit from its external inputs.
 an activation function, which determines the new level of activation based on the effective input and the current activation.
 an external input for each unit.
 a method for information gathering (the learning rule).
 an environment within which the system must operate, providing input signals and, if necessary, error signals.
Computers vs. Neural Networks

"Standard" Computers          Neural Networks
one CPU                       highly parallel processing
fast processing units         slow processing units
reliable units                unreliable units
static infrastructure         dynamic infrastructure
Why Artificial Neural Networks?
There are two basic reasons why we are interested in building artificial neural networks (ANNs):

• Technical viewpoint: Some problems, such as character recognition or the prediction of future states of a system, require massively parallel and adaptive processing.

• Biological viewpoint: ANNs can be used to replicate and simulate components of the human (or animal) brain, thereby giving us insight into natural information processing.
Artificial Neural Networks
• The “building blocks” of neural networks are the
neurons.
• In technical systems, we also refer to them as units or nodes.
• Basically, each neuron
 receives input from many other neurons.
 changes its internal state (activation) based on the current
input.
 sends one output signal to many other neurons, possibly
including its input neurons (recurrent network).
Artificial Neural Networks
• Information is transmitted as a series of electric impulses, so-called spikes.
• The frequency and phase of these spikes encode the information.
• In biological systems, one neuron can be connected to as many as 10,000 other neurons.
• Usually, a neuron receives its information from other neurons in a confined area, its so-called receptive field.
How do ANNs work?
 An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to
simulate the information processing capabilities of its biological
exemplar. ANNs are typically composed of a great number of
interconnected artificial neurons. The artificial neurons are
simplified models of their biological counterparts.
 ANN is a technique for solving problems by constructing software
that works like our brains.
How do our brains work?
 The brain is a massively parallel information processing system.
 Our brains are a huge network of processing elements. A typical brain contains a network of 10 billion neurons.
 A processing element:
Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
 A neuron is connected to other neurons through about 10,000 synapses.
 A neuron receives input from other neurons. Inputs are combined.
 Once input exceeds a critical level, the neuron discharges a spike: an electrical pulse that travels from the body, down the axon, to the next neuron(s).
 The axon endings almost touch the dendrites or cell body of the next neuron.
 Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters.
 Neurotransmitters are chemicals which are released from the first neuron and which bind to the second.
 This link is called a synapse. The strength of the signal that reaches the next neuron depends on factors such as the amount of neurotransmitter available.
How do ANNs work?
An artificial neuron is an imitation of a human neuron.
• Now, let us have a look at the model of an artificial neuron.

• In the simplest model, the inputs are simply summed:
Input: x1, x2, …, xm
Processing: ∑ = x1 + x2 + … + xm = y
Output: y

• Not all inputs are equal. Each input xi is scaled by a weight wi:
Input: x1, x2, …, xm with weights w1, w2, …, wm
Processing: ∑ = x1w1 + x2w2 + … + xmwm = y
Output: y

• The signal is not passed down to the next neuron verbatim. The weighted sum vk is passed through a transfer function f(vk) (the activation function) before it becomes the output y.
Why do we use activation functions with neural networks?
• An activation function is used to determine the output of the neural network, for example yes or no.
• It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).
• A positive value fires the perceptron; a negative value inhibits it.

The output depends on the input and the weights. Common activation functions are:

1. Linear activation function: the output is proportional to the total weighted input.

2. Threshold function: the output is set to one of two values, depending on whether the total weighted input is greater than or less than some threshold value.

3. Sigmoid function: the output varies continuously and nonlinearly with the input.
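A small Python sketch of these three activation functions applied to a neuron's weighted sum; the helper names are ours, chosen for illustration:

import math

def linear(v):                   # 1. output proportional to the total weighted input
    return v

def threshold(v, theta=0.0):     # 2. one of two values, depending on the threshold
    return 1 if v > theta else 0

def sigmoid(v):                  # 3. continuous, nonlinear output in (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def neuron(xs, ws, f):           # y = f(x1*w1 + x2*w2 + ... + xm*wm)
    return f(sum(x * w for x, w in zip(xs, ws)))

print(neuron([1, 0], [0.7, 0.3], sigmoid))    # ~0.668
print(neuron([1, 1], [0.7, 0.3], threshold))  # 1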
Several perceptrons can be combined to compute complex functions.
Artificial Neural Networks
An ANN can:
1. compute any computable function, by the appropriate selection of the network topology and weight values.
2. learn from experience!
 Specifically, by trial and error.
Learning by trial and error
A continuous process of:
Trial: processing an input to produce an output (in terms of an ANN: compute the output function of a given input).
Evaluate: evaluating this output by comparing the actual output with the expected output.
Adjust: adjusting the weights.
What can a single perceptron learn to do?

Linearly separable problems:
• A line that separates one class from another.
• The output is 1 if the input belongs to one class and 0 if it belongs to the other.

Let X = {x1, x2, x3, …, xn} be the input vector.
Weighted summation function: g(x) = ∑ wixi
Output function: o(x) = 1 if g(x) > 0, and 0 otherwise.

For 2 inputs:
g(x) = w0 + w1x1 + w2x2
If g(x) = 0, then w0 + w1x1 + w2x2 = 0
w2x2 = -w0 - w1x1
x2 = (-w0 - w1x1) / w2
The location of the line is determined by w0, w1, w2.
Such a line acts as a decision surface.

For a perceptron with many inputs, the decision surface is a hyperplane.
Learning rule
 If the perceptron fires when it should not fire, make each wi smaller by an amount proportional to xi.
 If it fails to fire when it should, make each wi larger proportionally.

 Implementation of AND:
Let w0 = 0, w1 = 1 and w2 = 1, so g(x) = w0 + w1x1 + w2x2
(0,0): g(x) = 0
(0,1): g(x) = 1
(1,0): g(x) = 1
(1,1): g(x) = 2
With the threshold set to 1.5, only the input (1,1) fires, so the decision line separates (1,1) from the other three points, as the sketch below verifies.
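A quick Python check of the AND perceptron above, plus one possible reading of the learning rule; the learning rate eta is an assumed parameter, not given in the slides:

def g(x1, x2, w=(0.0, 1.0, 1.0)):            # g(x) = w0 + w1*x1 + w2*x2
    return w[0] + w[1] * x1 + w[2] * x2

THRESHOLD = 1.5
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "fires" if g(x1, x2) > THRESHOLD else "does not fire")
# only (1, 1) fires, implementing AND

def update(w, x, target, output, eta=0.1):
    # fired when it should not (target - output < 0): each wi shrinks by eta*xi;
    # failed to fire when it should (target - output > 0): each wi grows by eta*xi
    error = target - output
    return tuple(wi + eta * error * xi for wi, xi in zip(w, (1,) + x))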
Learning rule
 Implementation of XOR:
 Here we cannot draw a single linear decision boundary: the points (0,1) and (1,0) belong to one class, while (0,0) and (1,1) belong to the other.
 So we construct a multilayer perceptron.

Multilayer perceptron for the XOR gate
• Hidden unit (detects AND(x1, x2)): bias input x0 = 1 with weight w0 = -1.5, and input weights w1 = 1, w2 = 1. It outputs 1 if its weighted sum is >= 0, else 0.
• Output unit: bias input x0 = 1 with weight w0 = -0.5, input weights w1 = 1, w2 = 1, and a weight of -9 on the hidden unit's output. It outputs 1 if its weighted sum is > 0, else 0.

x1  x2  Hidden unit (output1)        Output unit (output2)
0   0   -1.5 + 0 + 0 = -1.5 => 0     -0.5 + 0 + 0 - 0 = -0.5 => 0
0   1   -1.5 + 0 + 1 = -0.5 => 0     -0.5 + 0 + 1 - 0 = 0.5 => 1
1   0   -1.5 + 1 + 0 = -0.5 => 0     -0.5 + 1 + 0 - 0 = 0.5 => 1
1   1   -1.5 + 1 + 1 = 0.5 => 1      -0.5 + 1 + 1 - 9 = -7.5 => 0
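The table can be checked in a few lines of Python; this sketch hard-codes the weights from the diagram above:

def step(v, strict=False):                   # threshold activation
    return 1 if (v > 0 if strict else v >= 0) else 0

def xor_net(x1, x2):
    hidden = step(-1.5 + x1 + x2)            # fires only for (1, 1): an AND detector
    return step(-0.5 + x1 + x2 - 9 * hidden, strict=True)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))   # prints 0, 1, 1, 0: XOR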
Genetic Learning
• A genetic algorithm is an adaptive heuristic search algorithm inspired by Darwin's theory of evolution in nature.
• It is used to solve optimization problems in machine learning.
• It is an important algorithm, as it helps solve complex problems that would otherwise take a long time to solve.
• Genetic algorithms are widely used in real-world applications, for example designing electronic circuits, code-breaking, image processing, and artificial creativity.
• Before studying the genetic algorithm, let's first understand the basic terminology:
 Population: The population is the subset of all possible or probable solutions that can solve the given problem.
 Chromosome: A chromosome is one of the solutions in the population for the given problem; a collection of genes makes up a chromosome.
 Gene: A chromosome is divided into different genes; a gene is an element of the chromosome.
 Fitness function: The fitness function is used to determine an individual's fitness level in the population, i.e., the ability of an individual to compete with other individuals. In every iteration, individuals are evaluated by their fitness function.
 Genetic operators: In a genetic algorithm, the best individuals mate to produce offspring better than their parents. Genetic operators change the genetic composition of the next generation.
 Selection: After calculating the fitness of every individual in the population, a selection process determines which individuals get to reproduce and produce the offspring that will form the coming generation.
How does a Genetic Algorithm work?
• The genetic algorithm works on an evolutionary generational cycle to generate high-quality solutions.
• It applies operations that either enhance or replace the population to give an improved fit solution.
• It basically involves five phases to solve complex optimization problems, which are given below:
• Initialization
• Fitness assignment
• Selection
• Reproduction
• Termination
1. Initialization
• The process of a genetic algorithm starts by generating a set of individuals, called the population.
• Each individual is a solution to the given problem.
• An individual is characterized by a set of parameters called genes.
• Genes are joined into a string to form a chromosome, which encodes a solution to the problem. One of the most popular initialization techniques is the use of random binary strings, as in the sketch below.
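For instance, random binary-string initialization might look like this in Python; the population size and chromosome length are illustrative values:

import random

def init_population(pop_size=4, n_genes=3):
    # each chromosome is a random binary string of n_genes genes
    return [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]

print(init_population())   # e.g. [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]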
2. Fitness Assignment
• The fitness function is used to determine how fit an individual is, i.e., its ability to compete with other individuals.
• In every iteration, individuals are evaluated by their fitness function.
• The fitness function assigns a fitness score to each individual.
• This score determines the probability of being selected for reproduction.
• The higher the fitness score, the higher the chance of being selected for reproduction.

3. Selection
• The selection phase involves selecting individuals for the reproduction of offspring.
• The selected individuals are arranged in pairs of two to enhance reproduction.
• These individuals then transfer their genes to the next generation.
4. Reproduction
• After the selection process, the creation of children occurs in the reproduction step.
• In this step, the genetic algorithm uses two variation operators, which are applied to the parent population.
• The two operators involved in the reproduction phase are given below:
a. Crossover:
 Crossover plays the most significant role in the reproduction phase of the genetic algorithm. A crossover point is selected at random within the genes; the crossover operator then swaps the genetic information of two parents from the current generation to produce a new individual representing the offspring.
 Common variants: one-point crossover, multi-point crossover, and uniform crossover.
 The genes of the parents are exchanged among themselves until the crossover point is reached. The newly generated offspring are added to the population. This process is also called recombination.
b. Mutation
 The mutation operator inserts random genes into the offspring (new child) to maintain diversity in the population. It can be done by flipping some bits in the chromosome.
• Mutation helps solve the problem of premature convergence and enhances diversification.
 Common variants: swap mutation, scramble mutation, and inversion mutation. (A sketch of the two operators follows this list.)
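A minimal sketch of one-point crossover and bit-flip mutation, assuming chromosomes are lists of bits; the mutation rate is an assumed parameter:

import random

def one_point_crossover(p1, p2):
    # choose a crossover point at random and swap the parents' tails
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def bit_flip_mutation(chrom, rate=0.1):
    # flip each gene with a small probability to maintain diversity
    return [1 - gene if random.random() < rate else gene for gene in chrom]

print(one_point_crossover([1, 0, 1], [1, 1, 0]))  # e.g. ([1, 0, 0], [1, 1, 1])
print(bit_flip_mutation([0, 1, 0]))               # e.g. [0, 1, 1]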
5. Termination
After the reproduction phase, a stopping criterion decides when to terminate. The algorithm terminates once a threshold fitness solution is reached, and it reports the best individual in the final population as the final solution.
Example: Job shop scheduling
• There are 3 manufacturing units: M1, M2, M3.
• Quantity of products manufactured: {20, 35, 15}.
• Profit earned per product: {0.75, 0.5, 0.5}; with all units ON, total profit = 20×0.75 + 35×0.5 + 15×0.5 = 40.
• In a chromosome, 1 means the unit is ON and 0 means OFF.
• The chromosome {1, 0, 1} represents M1 and M3 ON and M2 OFF.
• We need to find the optimal solution that yields maximum profit.

Applying the genetic algorithm
• Represent a solution as a chromosome, e.g. {1, 0, 1}.
• Define the fitness function as the decimal value of the concatenated genes read as a binary number:
f(solution) = concat(G1, G2, G3), where G represents a gene
f(1 0 1) = (101)2 = 5
• Select a population of solutions and calculate their fitness values:

Solution  Si        Fitness fi = concat(G1, G2, G3)
1         {0 0 1}   1
2         {0 1 0}   2
3         {1 1 0}   6
4         {1 0 1}   5
% of total fitness = fi / (sum of fi) × 100; expected count Ec = N × % of total fitness, where N = no. of solutions = 4:

Solution  Si        Fitness fi  % of total fitness     Expected count Ec
1         {0 0 1}   1           1/14 × 100 = 7.14      4 × 7.14 = 28.6
2         {0 1 0}   2           2/14 × 100 = 14.29     4 × 14.29 = 57.1
3         {1 1 0}   6           6/14 × 100 = 42.86     4 × 42.86 = 171.4
4         {1 0 1}   5           5/14 × 100 = 35.71     4 × 35.71 = 142.9
Total               14          100                    400

Reproduction
• Assume 1000 chromosomes per unit of expected count, i.e. 4000 chromosomes in total:
• Expected count for S1 = 28.6 × 10 = 286
• Expected count for S2 = 57.1 × 10 = 571
• Expected count for S3 = 171.4 × 10 = 1714
• Expected count for S4 = 142.9 × 10 = 1429
• To select a chromosome, choose a random number in (0-999) and look it up in the cumulative ranges below (e.g. 70 selects S1):

Solution  Si        Range (per 1000)   Expected count
1         {0 0 1}   0 - 71             286
2         {0 1 0}   71 - 214           571
3         {1 1 0}   214 - 643          1714
4         {1 0 1}   643 - 999          1429
Total                                  4000
Reproduction
• Generate 4 random numbers to select 4 chromosomes for the next generation:
265 selects {1 1 0}
801 selects {1 0 1}
515 selects {1 1 0}
85 selects {0 1 0}
• This mating pool is what we apply the genetic operators to.

a. Crossover
 Swap portions of two selected chromosomes:
{1 0 1} × {1 1 0}, crossing over at the last gene, gives {1 0 0} and {1 1 1}
{1 1 0} × {0 1 0}, crossing over at the first gene, gives {0 1 0} and {1 1 0}

b. Mutation
• Flip randomly selected gene values in a chromosome:
{0 1 0} can mutate to {1 1 0}, {0 0 0} or {0 1 1}

Next generation:
1 0 0 => 4
1 1 1 => 7 => maximum, so it is selected
0 1 0 => 2
1 1 0 => 6
The best chromosome {1 1 1} switches all three units ON, which yields the maximum profit of 40. A compact end-to-end sketch of this loop follows.
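Putting the pieces together, here is a compact Python sketch of the whole loop for this job-shop example; the mutation rate, generation count, and helper names are our assumptions, chosen for illustration:

import random

def fitness(chrom):
    # decimal value of the concatenated genes, e.g. f([1, 0, 1]) = 5
    return int("".join(map(str, chrom)), 2)

def roulette_select(population):
    total = sum(fitness(c) for c in population)
    if total == 0:                               # degenerate all-zero population
        return random.choice(population)
    r = random.uniform(0, total)                 # spin the roulette wheel
    acc = 0.0
    for c in population:
        acc += fitness(c)
        if r <= acc:
            return c
    return population[-1]

population = [[0, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 1]]
for generation in range(20):
    next_gen = []
    while len(next_gen) < len(population):
        p1, p2 = roulette_select(population), roulette_select(population)
        point = random.randint(1, 2)             # one-point crossover
        child = p1[:point] + p2[point:]
        if random.random() < 0.1:                # occasional bit-flip mutation
            i = random.randrange(len(child))
            child[i] = 1 - child[i]
        next_gen.append(child)
    population = next_gen

best = max(population, key=fitness)
print(best, fitness(best))   # tends toward [1, 1, 1] with fitness 7 (all units ON)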
General Workflow of a Simple Genetic Algorithm
(Figure: the generational loop: initialization → fitness assignment → selection → crossover → mutation → termination check.)
Advantages of Genetic Algorithms
• Genetic algorithms have excellent parallel capabilities.
• They help optimize various problems, such as discrete functions, multi-objective problems, and continuous functions.
• They provide a solution to a problem that improves over time.
• A genetic algorithm does not need derivative information.

Limitations of Genetic Algorithms
• Genetic algorithms are not efficient for solving simple problems.
• They do not guarantee the quality of the final solution to a problem.
• Repetitive calculation of fitness values may create computational challenges.
Difference between Genetic Algorithms and Traditional Algorithms
• A search space is the set of all possible solutions to a problem. A traditional algorithm maintains only one set of solutions, whereas a genetic algorithm uses several sets of solutions in the search space.
• Traditional algorithms need more information to perform a search, whereas genetic algorithms need only an objective function to calculate the fitness of an individual.
• Traditional algorithms cannot work in parallel, whereas genetic algorithms can (calculating the fitness of the individuals is independent).
• Rather than operating directly on candidate solutions, genetic algorithms operate on their representations (or encodings), frequently referred to as chromosomes.
• Traditional algorithms can only generate one result in the end, whereas genetic algorithms can generate multiple optimal results from different generations.
• Genetic algorithms do not guarantee globally optimal results, but because they use genetic operators such as crossover and mutation, there is a good chance of finding an optimal or near-optimal result for a problem, which a traditional algorithm is less likely to do.
• Traditional algorithms are deterministic in nature, whereas genetic algorithms are probabilistic and stochastic in nature.