Main Learning Algorithms
Find-S
Candidate-Elimination
ID3
Gradient descent and Backpropagation
Genetic Algorithms
Bayesian Learning
Q Learning
EM and K-means
AdaBoost
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
     For each attribute constraint a_i in h
       If the constraint a_i in h is satisfied by x
       Then do nothing
       Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
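The steps above can be sketched in Python. This is a minimal sketch assuming nominal attributes and the usual hypothesis representation ('?' for "any value", '0' for "no value accepted"); the EnjoySport-style training data is purely illustrative.

```python
def find_s(examples, n_attrs):
    """Find-S for conjunctive hypotheses over nominal attributes.

    Each hypothesis slot is a specific value, '?' (any value),
    or '0' (no value, i.e. maximally specific).
    """
    h = ['0'] * n_attrs                      # most specific hypothesis
    for x, label in examples:
        if not label:                        # Find-S ignores negative examples
            continue
        for i, a in enumerate(x):
            if h[i] == '0':                  # first positive: copy its values
                h[i] = a
            elif h[i] != a:                  # constraint violated: generalize
                h[i] = '?'
    return tuple(h)

# Illustrative EnjoySport-style data
train = [
    (('Sunny', 'Warm', 'Normal', 'Strong'), True),
    (('Sunny', 'Warm', 'High',   'Strong'), True),
    (('Rainy', 'Cold', 'High',   'Strong'), False),
    (('Sunny', 'Warm', 'High',   'Strong'), True),
]
print(find_s(train, 4))   # ('Sunny', 'Warm', '?', 'Strong')
```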
Candidate Elimination Algorithm
G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d, do
  If d is a positive example
    Remove from G any hypothesis inconsistent with d
    For each hypothesis s in S that is not consistent with d
      Remove s from S
      Add to S all minimal generalizations h of s such that
        - h is consistent with d, and
        - some member of G is more general than h
      Remove from S any hypothesis that is more general than another hypothesis in S
  If d is a negative example
    Remove from S any hypothesis inconsistent with d
    For each hypothesis g in G that is not consistent with d
      Remove g from G
      Add to G all minimal specializations h of g such that
        - h is consistent with d, and
        - some member of S is more specific than h
      Remove from G any hypothesis that is less general than another hypothesis in G
ID3 Algorithm
Input: Examples, Target attribute, Attributes
1. Create a Root node for the tree
2. If all Examples are positive, then return the node Root with label +
3. If all Examples are negative, then return the node Root with label −
4. If Attributes is empty, then return the node Root with label = most common value of Target attribute in Examples
5. Otherwise
   A ← the best decision attribute for Examples
   Assign A as decision attribute for Root
   For each value v_i of A
     - add a new branch from Root corresponding to the test A = v_i
     - let Examples_{v_i} be the subset of Examples that have value v_i for A
     - if Examples_{v_i} is empty then add a leaf node with label = most common value of Target attribute in Examples
     - else add the subtree ID3(Examples_{v_i}, Target attribute, Attributes − {A})
Entropy and Information Gain
Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
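These two quantities can be checked with a short Python sketch for boolean-labelled samples; the tiny dataset is illustrative.

```python
from math import log2

def entropy(labels):
    """Entropy of a boolean-labelled sample (0·log 0 taken as 0)."""
    p = sum(labels) / len(labels)
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

def gain(examples, attr):
    """Information gain of splitting `examples` on attribute index `attr`."""
    total = entropy([y for _, y in examples])
    for v in {x[attr] for x, _ in examples}:
        sv = [y for x, y in examples if x[attr] == v]
        total -= len(sv) / len(examples) * entropy(sv)
    return total

# One attribute that separates the labels perfectly
data = [(('Sunny',), False), (('Sunny',), False),
        (('Rain',), True), (('Rain',), True)]
print(entropy([y for _, y in data]))   # 1.0 (evenly split labels)
print(gain(data, 0))                   # 1.0 (split removes all uncertainty)
```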
Definitions of Error

error_T(h) ≡ Pr_{x∈T}[f(x) ≠ h(x)]

error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

bias ≡ E[error_S(h)] − error_T(h)
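The sample error error_S(h) is just a disagreement frequency; a minimal sketch (the target f, hypothesis h, and sample S below are illustrative):

```python
def sample_error(h, f, S):
    """error_S(h): fraction of points in S where h disagrees with f."""
    return sum(f(x) != h(x) for x in S) / len(S)

f = lambda x: x >= 0          # true target concept
h = lambda x: x >= 1          # hypothesis, wrong on 0 <= x < 1
S = [-2, -1, 0, 1, 2]
print(sample_error(h, f, S))  # 0.2 (disagreement only at x = 0)
```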
PAC Learning
Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions T over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_T(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c).
Gradient Descent
Gradient-Descent(training examples, η)
Each training example is a pair ⟨x, t⟩, where x is the vector of input values, t is the target output, and η is the learning rate (e.g. 0.05)

Initialize each w_i to some small random value
Until the termination condition is met, Do
  1. Initialize each Δw_i to zero.
  2. For each ⟨x, t⟩ in training examples, Do
       Input the instance x to the unit and compute the output o
       For each linear unit weight w_i, Do
         Δw_i ← Δw_i + η(t − o)x_i
  3. For each linear unit weight w_i, Do
       w_i ← w_i + Δw_i
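The batch delta rule above can be sketched directly in Python for a linear unit o = w·x; the target function, learning rate, and epoch count below are illustrative choices.

```python
import random

def train_linear_unit(examples, eta=0.05, epochs=500):
    """Batch gradient descent (delta rule) for a linear unit o = w . x.

    `examples` is a list of (x, t) pairs; x carries a leading 1.0 so
    that w[0] acts as the bias weight.
    """
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        dw = [0.0] * n                        # accumulated weight updates
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                dw[i] += eta * (t - o) * x[i]
        for i in range(n):                    # apply updates after full pass
            w[i] += dw[i]
    return w

random.seed(0)
# Realizable target t = 2*x1 - 1, encoded as x = (1, x1)
data = [((1.0, x), 2.0 * x - 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
w = train_linear_unit(data)
print([round(wi, 2) for wi in w])    # ≈ [-1.0, 2.0]
```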
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k, compute
       δ_k = o_k(1 − o_k)(t_k − o_k)
  3. For each hidden unit h, compute
       δ_h = o_h(1 − o_h) Σ_{k ∈ outputs} w_{kh} δ_k
  4. Update each network weight w_{ji}
       w_{ji} ← w_{ji} + Δw_{ji}, where Δw_{ji} = η δ_j x_{ji}
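A small self-contained sketch of stochastic backpropagation for one sigmoid hidden layer and a single output unit. The OR target, network size, learning rate, seed, and epoch count are all illustrative (OR is linearly separable, so training is reliable).

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_net(examples, n_in, n_hidden, eta=0.5, epochs=10000):
    """Stochastic backprop for a 1-hidden-layer sigmoid network.

    Returns a forward function; each unit has a trailing bias weight.
    """
    rnd = random.Random(1)
    W = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
         for _ in range(n_hidden)]                          # hidden weights
    v = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]  # output weights

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0])))
             for row in W]
        o = sigmoid(sum(w * hi for w, hi in zip(v, h + [1.0])))
        return h, o

    for _ in range(epochs):
        for x, t in examples:
            h, o = forward(x)
            delta_o = o * (1 - o) * (t - o)          # output error term
            delta_h = [hi * (1 - hi) * v[j] * delta_o
                       for j, hi in enumerate(h)]    # hidden error terms
            for j, hi in enumerate(h + [1.0]):       # update output weights
                v[j] += eta * delta_o * hi
            for j in range(n_hidden):                # update hidden weights
                for i, xi in enumerate(x + [1.0]):
                    W[j][i] += eta * delta_h[j] * xi
    return forward

# Learn boolean OR
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = train_net(data, n_in=2, n_hidden=2)
print([round(net(x)[1]) for x, _ in data])
```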
Genetic Algorithm
Input:
  Fitness: evaluation function of hypotheses
  Fitness_threshold: the threshold used as termination criterion
  p: size of the population
  r: fraction of population to be replaced by Crossover
  m: mutation rate
Initialize: P ← {h_1, ..., h_p}, p random hypotheses
Evaluate: for each h in P, compute Fitness(h)
While max_{h∈P} Fitness(h) < Fitness_threshold
  Create a new generation P_S
  P ← P_S
Return the hypothesis from P that has the highest fitness.

Create a new generation P_S:
1. Select: Probabilistically select (1 − r)·p members of P to add to P_S, using

     Pr(h_i) = Fitness(h_i) / Σ_{j=1}^{p} Fitness(h_j)

2. Crossover: Probabilistically select (r·p)/2 pairs of hypotheses from P. For each pair ⟨h_1, h_2⟩, produce two offspring by applying the Crossover operator. Add all offspring to P_S.
3. Mutate: Probabilistically select m·p members of P_S and invert a randomly selected bit.
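The generation loop above can be sketched in Python with fitness-proportional selection, single-point crossover, and bitwise mutation. The OneMax objective (count of 1 bits, offset by 1 so all fitnesses are positive for roulette selection), population size, and rates are illustrative choices.

```python
import random

def genetic_algorithm(fitness, n_bits, p=20, r=0.6, m=0.05,
                      threshold=None, max_gens=200, seed=0):
    """GA sketch: roulette selection, 1-point crossover, bit mutation.

    Maximizes `fitness` (assumed positive) over bit strings.
    """
    rnd = random.Random(seed)
    pop = [[rnd.randint(0, 1) for _ in range(n_bits)] for _ in range(p)]
    for _ in range(max_gens):
        scores = [fitness(h) for h in pop]
        if threshold is not None and max(scores) >= threshold:
            break
        total = sum(scores)

        def select():                       # fitness-proportional (roulette)
            x = rnd.uniform(0, total)
            for h, s in zip(pop, scores):
                x -= s
                if x <= 0:
                    return h
            return pop[-1]

        new = [select()[:] for _ in range(int((1 - r) * p))]   # survivors
        while len(new) < p:                                    # crossover
            a, b = select(), select()
            cut = rnd.randrange(1, n_bits)
            new += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        for h in new:                                          # mutation
            if rnd.random() < m:
                i = rnd.randrange(n_bits)
                h[i] = 1 - h[i]
        pop = new[:p]
    return max(pop, key=fitness)

# OneMax: fitness = number of 1 bits (+1 to keep it strictly positive)
best = genetic_algorithm(lambda h: sum(h) + 1, n_bits=12, threshold=13)
print(sum(best))
```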
Basic Formulas for Probabilities
Product Rule:
  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Sum Rule:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then

  P(B) = Σ_{i=1}^{n} P(B|A_i)P(A_i)
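A quick numeric check of the total-probability formula; the probabilities below are made up for illustration.

```python
# Two mutually exclusive, exhaustive events A1, A2 (hypothetical numbers)
P_A = [0.3, 0.7]                    # P(A1), P(A2); they sum to 1
P_B_given_A = [0.9, 0.2]            # P(B|A1), P(B|A2)

# Theorem of total probability: P(B) = sum_i P(B|Ai) P(Ai)
P_B = sum(pb * pa for pb, pa in zip(P_B_given_A, P_A))
print(round(P_B, 2))   # 0.9*0.3 + 0.2*0.7 = 0.41
```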
Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D)

Conditional independence: X is c.i. of Y given Z if
  P(X|Y, Z) = P(X|Z)

Bayes classifiers

Maximum a posteriori (MAP) hypothesis:
  h_MAP = argmax_{h∈H} P(h|D)

Bayes Optimal Classifier:
  v_OB = argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

Naive Bayes Classifier:
  v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)
Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)
  v_NB = argmax_{v_j∈V} P̂(v_j) Π_{a_i∈x} P̂(a_i|v_j)
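A minimal Python sketch using relative-frequency estimates (no smoothing, so unseen attribute values get probability 0); the toy weather data is illustrative.

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """Estimate P(v_j) and P(a_i|v_j) by relative frequencies."""
    prior = Counter(y for _, y in examples)          # class counts
    cond = defaultdict(Counter)                      # per-class (attr, value) counts
    for x, y in examples:
        for i, a in enumerate(x):
            cond[y][(i, a)] += 1
    n = len(examples)

    def classify(x):
        def score(v):                                # P(v) * prod_i P(a_i|v)
            p = prior[v] / n
            for i, a in enumerate(x):
                p *= cond[v][(i, a)] / prior[v]
            return p
        return max(prior, key=score)
    return classify

# Toy (Outlook, Temperature) -> PlayTennis data, purely illustrative
train = [(('Sunny', 'Hot'),  'No'),  (('Sunny', 'Mild'), 'No'),
         (('Rain',  'Mild'), 'Yes'), (('Rain',  'Cool'), 'Yes'),
         (('Sunny', 'Cool'), 'Yes')]
classify = naive_bayes_learn(train)
print(classify(('Sunny', 'Hot')))   # No
```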
Q Learning for Deterministic Worlds
For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:
  Select an action a and execute it
  Receive immediate reward r
  Observe the new state s′
  Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  s ← s′
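A deterministic-world sketch in Python on a tiny corridor environment. To keep the run deterministic it backs up every action at each visited state rather than using ε-greedy exploration; the environment, γ, and episode budget are illustrative.

```python
def q_learning(n_states, actions, step, gamma=0.9, episodes=50):
    """Tabular Q-learning for a deterministic world:
    Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        for s0 in range(n_states):          # sweep over start states
            s = s0
            for _ in range(2 * n_states):   # bounded episode length
                if s is None:               # absorbing state reached
                    break
                for a in actions:           # back up every action at s
                    r, s2 = step(s, a)
                    future = (0.0 if s2 is None
                              else max(Q[(s2, b)] for b in actions))
                    Q[(s, a)] = r + gamma * future
                best = max(actions, key=lambda a: Q[(s, a)])
                s = step(s, best)[1]        # follow the greedy action
    return Q

# Corridor of states 0-1-2; moving right from 2 reaches the goal,
# yields reward 100, and ends the episode.
def step(s, a):
    s2 = max(0, min(3, s + a))
    if s2 == 3:
        return 100.0, None
    return 0.0, s2

Q = q_learning(3, actions=(-1, 1), step=step)
print(round(Q[(0, 1)], 6))   # 81.0, i.e. 0.9^2 * 100 from state 0
```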
EM Algorithm for mixture of Gaussians
Pick random initial h = ⟨μ_1, ..., μ_k⟩
Repeat until termination condition:

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, ..., μ_k⟩ holds.

  E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{l=1}^{k} p(x = x_i | μ = μ_l)
          = e^{−(x_i − μ_j)² / 2σ²} / Σ_{l=1}^{k} e^{−(x_i − μ_l)² / 2σ²}

M step: Calculate a new maximum likelihood hypothesis h′ = ⟨μ′_1, ..., μ′_k⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Replace h = ⟨μ_1, ..., μ_k⟩ by h′ = ⟨μ′_1, ..., μ′_k⟩.

  μ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
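A direct Python sketch of the two steps for 1-D data, estimating only the means with σ fixed (as in the algorithm above); the two-cluster data and initial guess are illustrative.

```python
import math

def em_means(xs, mu, sigma=1.0, iters=50):
    """EM for a mixture of k equal-variance 1-D Gaussians,
    estimating only the means. `mu` is the initial guess."""
    for _ in range(iters):
        # E step: expected membership E[z_ij] of point i in component j
        E = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            E.append([wj / s for wj in w])
        # M step: each mean becomes an E[z]-weighted average of the data
        mu = [sum(E[i][j] * xs[i] for i in range(len(xs))) /
              sum(E[i][j] for i in range(len(xs)))
              for j in range(len(mu))]
    return mu

# Two well-separated clusters around 0 and 10 (illustrative data)
xs = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
mu = em_means(xs, mu=[1.0, 8.0])
print([round(m, 2) for m in mu])   # ≈ [0.0, 10.0]
```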
General EM Algorithm
Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.

  Q(h′|h) ← E[ln P(Y|h′) | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function.

  h ← argmax_{h′} Q(h′|h)
K-means
A variant of EM computing only k means of data generated from k Gaussian
distributions.
Step 1. Begin with a decision on the value of k = number of clusters.
Step 2. Choose any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters
2. Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroid of the cluster gaining the new sample and the cluster losing the sample.
Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
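A compact 1-D Python sketch of this procedure. It seeds with the first k samples as in step 2 but recomputes centroids after a full pass (Lloyd-style) rather than after each individual reassignment as the text describes; the data is illustrative.

```python
def k_means(xs, k, iters=100):
    """K-means in 1-D: seed with the first k samples, then alternate
    nearest-centroid assignment and centroid updates until stable."""
    centroids = xs[:k]                       # steps 1-2: initial clusters
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:                         # step 3: nearest-centroid assignment
            j = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[j].append(x)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                 # step 4: stop when nothing moves
            break
        centroids = new
    return centroids

xs = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(k_means(xs, 2))   # [1.5, 8.5]
```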
AdaBoost
Given (x_1, y_1), ..., (x_m, y_m), where x_i ∈ X, y_i ∈ Y = {−1, +1}
Initialize D_1(i) = 1/m, i = 1, ..., m.
For t = 1, ..., T:
  Train base learner using distribution D_t
  Get base classifier h_t : X → ℝ
  Choose α_t ∈ ℝ
  Update:

    D_{t+1}(i) = (1/Z_t) · D_t(i) · e^{−α_t y_i h_t(x_i)}

  where Z_t is a normalization factor
Output the final classifier:

  H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )
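The loop above can be sketched in Python with 1-D decision stumps as the base learner and the standard choice α_t = ½ ln((1 − ε_t)/ε_t); the stump representation and the toy dataset (which no single threshold can separate) are illustrative.

```python
import math

def best_stump(xs, ys, weights):
    """Lowest weighted-error threshold stump h(x) = d if x > theta else -d."""
    best = None
    for theta in sorted(set(xs)):
        for d in (1, -1):
            h = lambda x, t=theta, d=d: d if x > t else -d
            err = sum(w for x, y, w in zip(xs, ys, weights) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best

def adaboost(xs, ys, T=5):
    m = len(xs)
    D = [1.0 / m] * m                         # initial distribution D_1
    ensemble = []                             # list of (alpha_t, h_t)
    for _ in range(T):
        err, h = best_stump(xs, ys, D)
        err = max(err, 1e-10)                 # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        D = [d * math.exp(-alpha * y * h(x))  # reweight the examples
             for d, x, y in zip(D, xs, ys)]
        Z = sum(D)                            # normalization factor Z_t
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# A labelling no single threshold separates: -, +, +, - along the line
xs = [1.0, 2.0, 3.0, 4.0]
ys = [-1, 1, 1, -1]
H = adaboost(xs, ys)
print([H(x) for x in xs])   # [-1, 1, 1, -1]
```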