Day & Time: Monday (10am-11am & 3pm-4pm)
Tuesday (10am-11am)
Wednesday (10am-11am & 3pm-4pm)
Friday (9am-10am, 11am-12pm, 2pm-3pm)
Dr. Srinivasa L. Chakravarthy
&
Smt. Jyotsna Rani Thota
Department of CSE
GITAM Institute of Technology (GIT)
Visakhapatnam – 530045
Email: [email protected] & [email protected]
Course: EID 403 - Machine Learning
Course objectives
● Explore the various disciplines connected with ML.
● Explore the efficiency of learning with inductive bias.
● Explore ML algorithms such as decision tree learning.
● Explore algorithms such as artificial neural networks, genetic
programming, Bayesian learning, nearest neighbour, and hidden
Markov models.
Learning Outcomes
● Identify the various applications connected with ML.
● Classify the efficiency of ML algorithms using the inductive bias
technique.
● Discriminate the purpose of each ML algorithm.
● Analyze an application and correlate it with the available ML
algorithms.
● Choose an ML algorithm to develop a project.
Syllabus
Reference book 1. Title: Machine Learning; Author: Tom M. Mitchell
Reference book 2. Title: Introduction to Machine Learning; Author: Ethem Alpaydin
Module 3 (Chapter 7)
It includes:
Chapter 6 - Bayesian Learning
&
Chapter 9 - Computational Learning Theory
● Probably learning an approximately correct hypothesis
● Sample complexity
● Finite and infinite hypothesis spaces
● Mistake bound model of learning
Introduction
This chapter introduces:
- A characterization of the difficulty of several types of ML problems.
- The capabilities of several types of ML algorithms.
- Two specific frameworks for analysing ML algorithms:
1. Probably approximately correct (PAC) framework - here we identify classes of
hypotheses that can and cannot be learned from a polynomial number of
training examples.
2. Mistake bound framework - here we examine the number of training errors
made by a learner before it determines the correct hypothesis.
Introduction (cont.)
The goal in this chapter is to answer questions from computational learning theory
such as:
1. Sample complexity - How many training examples are needed for a
learner to converge (with high probability) to a successful hypothesis?
2. Computational complexity - How much computational effort is needed for
a learner to converge (with high probability) to a successful hypothesis?
3. Mistake bound - How many training examples will the learner misclassify
before converging to a successful hypothesis?
Probably learning an approximately correct hypothesis-
The Problem Setting-
● Set of instances X, described by attributes such as age (young/old) and height (short/tall)
● Set of hypotheses H
● Set of possible target concepts C (i.e., c ∈ C where c: X → {0, 1})
● Training instances generated by drawing instances from X according to a fixed, unknown probability distribution D
For example, D might be the distribution of instances generated by observing
“people who walk out of the largest sports store.”
Probably learning an approximately correct hypothesis(cont.)
The Problem Setting-(cont.)
The learner L considers some set H of possible hypotheses when attempting
to learn the target concept.
After observing a sequence of training examples of the target concept c, L must
output some hypothesis h from H, which is its estimate of c.
We evaluate the success of L by the performance of h over new instances
drawn randomly from X according to D.
With this setting, we are interested in characterizing the performance of various
learners L using various hypothesis spaces H, when learning individual target concepts
drawn from various classes C.
Probably learning an approximately correct hypothesis(cont.)
Error of a Hypothesis-
To identify how closely the learner's output hypothesis h approximates the actual
target concept c, we define the true error of a hypothesis h with respect to target
concept c and distribution D as
error_D(h) ≡ Pr_{x∈D}[ c(x) ≠ h(x) ],
i.e., the probability that h misclassifies an instance drawn at random according to D.
Probably learning an approximately correct hypothesis(cont.)
Error of a Hypothesis-(cont.)
● Two notions of error: the training error of h (the fraction of the training examples misclassified by h) and the true error of h (the probability that h misclassifies an instance drawn at random according to D).
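As a rough sketch (notation assumed for illustration): the training error of h over a set of m training examples is the fraction of those examples that h misclassifies,
error_train(h) = (1/m) |{ x in the training set : h(x) ≠ c(x) }|,
while the true error is error_D(h) = Pr_{x∈D}[ h(x) ≠ c(x) ]. The training error is observable to the learner; the true error is the quantity PAC learning seeks to bound.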
Probably learning an approximately correct hypothesis(cont.)
PAC Learnability-
To characterize classes of target concepts that can be reliably learned,
what kind of statements about learnability can we expect to hold TRUE?
Generally, there are two difficulties:
1. Unless we provide training examples corresponding to every possible instance
in X, there may be multiple hypotheses consistent with the training data, and
the learner cannot be certain to pick the one corresponding to the target concept.
2. Given that the training examples are drawn randomly, there is always a
nonzero probability that the training examples encountered by the learner will be
misleading.
Probably learning an approximately correct hypothesis(cont.)
PAC Learnability-(cont.)
To overcome these difficulties:
1. We do not require the learner to output a zero-error hypothesis; instead we
require only that its error be bounded by some constant ε that can be made
arbitrarily small.
2. We do not require the learner to succeed for every sequence of randomly
drawn training examples; instead we require only that its probability of failure be
bounded by some constant δ that can be made arbitrarily small.
In short, we require only that the learner probably learn a hypothesis that is
approximately correct - hence the term probably approximately correct (PAC) learning.
Probably learning an approximately correct hypothesis(cont.)
PAC Learnability-(cont.)
Definition: C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X,
all ε with 0 < ε < 1/2 and all δ with 0 < δ < 1/2, learner L will with probability at least
(1 - δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial
in 1/ε, 1/δ, n, and size(c).
Here, n is the size of instances in X and size(c) is the encoding length of c in C.
For example, if C is the class of conjunctions of k boolean features, then size(c) is the
number of boolean features actually used to describe c.
Sample complexity for Finite hypothesis spaces-
PAC learnability is largely determined by the number of training examples required by the learner.
The growth in the required number of training examples with problem size is called the
"sample complexity" of the learning problem.
In practice, the factor that most limits the success of a learner is the limited availability of
training data.
We can derive a general bound on the sample complexity for a very broad class of learners,
called consistent learners. A learner is consistent if it outputs hypotheses that perfectly
fit the training data.
Sample complexity for Finite hypothesis spaces-(cont.)
Recall that the version space VS_{H,D} is the set of all hypotheses h ∈ H that
correctly classify the training examples D.
Every consistent learner outputs a hypothesis belonging to the version space.
Therefore, to bound the number of examples needed by any consistent learner, we
need only bound the number of examples needed to assure that the version space
contains no unacceptable hypotheses.
Sample complexity for Finite hypothesis spaces-(cont.)
Exhausting the Version space
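A sketch of the key definition and theorem, following the standard treatment:
The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has true error less than ε with respect to c and D.
Theorem (ε-exhausting the version space): if the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent, randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} is not ε-exhausted (with respect to c) is at most |H| e^(-εm).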
Sample complexity for Finite hypothesis spaces-(cont.)
How many examples will ε-exhaust the VS?
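Setting |H| e^(-εm) ≤ δ and solving for m gives the standard sample-complexity result (a sketch, with δ denoting the allowed probability of failure):
m ≥ (1/ε)(ln|H| + ln(1/δ)).
This many randomly drawn training examples suffice to ensure that, with probability at least (1 - δ), every hypothesis remaining in the version space has true error less than ε.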
Sample complexity for Finite hypothesis spaces-(cont.)
Conjunctions of Boolean Literals Are PAC-Learnable
Applying the sample-complexity bound above:
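A sketch of the argument, under the standard counting assumption that each of the n variables may appear as a positive literal, as a negated literal, or not at all, so that |H| = 3^n: substituting into the bound above gives
m ≥ (1/ε)(n ln 3 + ln(1/δ)),
which is polynomial in n, 1/ε and 1/δ, so the class is PAC-learnable by any consistent learner (e.g. FIND-S). A small Python helper (hypothetical function name, for illustration only) to evaluate the bound numerically:

import math

def pac_sample_bound(n_literals, epsilon, delta):
    # m >= (1/eps) * (n ln 3 + ln(1/delta)) for conjunctions of up to n boolean literals
    m = (1.0 / epsilon) * (n_literals * math.log(3) + math.log(1.0 / delta))
    return math.ceil(m)

print(pac_sample_bound(10, epsilon=0.1, delta=0.05))  # -> 140 examples suffice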
Sample complexity for Finite hypothesis spaces-(cont.)
Agnostic Learning
If H does not contain the target concept c, then a zero-error hypothesis cannot
always be found.
In this case, we might ask our learner to output the hypothesis from H that has the
minimum error over the training examples.
A learner that makes no assumption that the target concept is representable by H
and that simply finds the hypothesis with minimum training error is often called an
agnostic learner.
Sample complexity for Finite hypothesis spaces-(cont.)
Agnostic Learning
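A sketch of the corresponding bound, based on the Hoeffding-style argument standard for this case: to guarantee with probability at least (1 - δ) that the hypothesis with minimum training error has true error within ε of its training error, it suffices to have
m ≥ (1/(2ε^2))(ln|H| + ln(1/δ)).
Note that m now grows as the square of 1/ε, rather than linearly in 1/ε.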
Sample complexity for Infinite hypothesis spaces-
So far we have seen that the sample complexity for PAC learning grows with the
logarithm of the size of the hypothesis space. However, the drawbacks of
characterizing sample complexity in terms of |H| are:
1. It can lead to quite weak bounds.
2. In the case of infinite hypothesis spaces we cannot apply the bound
m ≥ (1/ε)(ln|H| + ln(1/δ)) at all.
Here we consider a second measure of the complexity of H, called the Vapnik-Chervonenkis
dimension, or VC dimension, VC(H).
We use VC(H) instead of |H| to state bounds on sample complexity.
Sample complexity for Infinite hypothesis spaces-(cont.)
Shattering a set of Instances
The VC dimension measures the complexity of the hypothesis space H not by the
number of distinct hypotheses |H|, but by the number of distinct instances
from X that can be completely discriminated using H.
Definition: a set of instances S is shattered by hypothesis space H if and only if,
for every dichotomy of S, there exists some hypothesis in H consistent with this dichotomy.
[Figure] A set of 3 instances shattered by eight hypotheses: for every possible
dichotomy of the instances, there exists a corresponding hypothesis.
Sample complexity for Infinite hypothesis spaces-(cont.)
Vapnik-Chervonenkis Dimension
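A sketch of the standard definition: the Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.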
Note: for any finite H, VC(H) ≤ log2|H|. To see this, suppose VC(H) = d; then H must
contain at least 2^d distinct hypotheses in order to shatter d instances.
Hence, 2^d ≤ |H| and d = VC(H) ≤ log2|H|.
Sample complexity for Infinite hypothesis spaces-(cont.)
Example: VC dimension for conjunctions of boolean literals
Suppose each instance in X is described by the conjunction of exactly 3
boolean literals, and
suppose that each hypothesis in H is described by a conjunction of up to 3
boolean literals. What is VC(H)?
We can show that VC(H) ≥ 3 by exhibiting a set of 3 instances that H shatters.
Represent each instance by a 3-bit string, where bit i corresponds to literal l_i:
Instance 1: 100
Instance 2: 010
Instance 3: 001
Sample complexity for Infinite hypothesis spaces-(cont.)
Example: VC dimension for linear decision surfaces
The VC dimension for linear decision surfaces in the x, y plane is 3.
[Figure] (a) A set of 3 points that can be shattered using linear decision surfaces.
(b) A set of 3 points that cannot be shattered.
NOTE: Returning to the boolean literal example above, the VC dimension for
conjunctions of n boolean literals is at least n. In fact it is exactly n, though showing
this is more difficult because it requires demonstrating that no set of n+1 instances
can be shattered.
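To make the boolean literal example concrete, here is a minimal Python sketch (helper names are illustrative, not from the chapter) that brute-forces every dichotomy of the three instances {100, 010, 001} and confirms that conjunctions of up to 3 literals shatter them:

from itertools import combinations, product

def conjunction_hypotheses(n):
    # All conjunctions of up to n literals; each chosen variable is required
    # to be 1 (positive literal) or 0 (negated literal).
    hyps = []
    for k in range(n + 1):
        for idxs in combinations(range(n), k):
            for values in product([1, 0], repeat=k):
                hyps.append(lambda x, idxs=idxs, values=values:
                            all(x[i] == v for i, v in zip(idxs, values)))
    return hyps

def shatters(hyps, instances):
    # H shatters the instances iff its hypotheses realize all 2^|S| dichotomies.
    dichotomies = {tuple(h(x) for x in instances) for h in hyps}
    return len(dichotomies) == 2 ** len(instances)

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(shatters(conjunction_hypotheses(3), instances))  # True, so VC(H) >= 3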
Sample complexity for Infinite hypothesis spaces-(cont.)
Sample complexity and VC Dimension
Earlier, we considered the question: how many randomly drawn training examples
suffice to probably approximately learn any target concept in C?
The answer was m ≥ (1/ε)(ln|H| + ln(1/δ)); here the number of training examples grows
logarithmically in |H|.
Now, using VC(H) as the measure of the complexity of H, it is possible to derive an
alternative bound, namely
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)).
Here the number of training examples grows as (1/ε) times log(1/ε).
Mistake Bound model
So far, the learning settings we have discussed differ in aspects such as:
1. How the training examples are generated.
2. Noise in the data.
3. The definition of success (whether the target concept must be learned exactly,
or only probably and approximately).
4. The measure according to which the learner is evaluated.
Now consider the mistake bound model of learning, in which
the learner is evaluated by the total number of mistakes it makes before it converges
to the correct hypothesis.
Mistake Bound model(cont.)
The mistake bound learning problem may be studied in various specific settings.
For example, we might count the number of mistakes made before PAC-learning the
target concept.
In the examples below, we instead count the number of mistakes made before
learning the target concept exactly, meaning converging to a hypothesis h such that
(∀x) h(x) = c(x).
Mistake Bound for FIND-S Algorithm
Assume C ⊆ H and the training data is noise-free; then FIND-S converges to a hypothesis that makes no errors. The question is how many mistakes it makes along the way.
Mistake Bound for FIND-S Algorithm(cont.)
FIND-S begins with the most specific hypothesis and generalizes it as positive
training examples are observed.
Therefore, FIND-S can never mistakenly classify a negative example as positive.
So, to bound the number of mistakes made by FIND-S, we need only count the number
of times it misclassifies truly positive examples as negative. For conjunctions of n
boolean literals this is at most n + 1: the first mistake removes up to n of the 2n
literals in the initial hypothesis, and each subsequent mistake removes at least one more literal.
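As a minimal sketch (function and variable names are illustrative, not from the chapter), FIND-S for conjunctions of boolean literals can be written as:

def find_s(examples, n):
    # FIND-S over n boolean attributes.
    # examples: list of (x, label) pairs, x a tuple of n bits, label True/False.
    # The hypothesis stores, per attribute: 'empty' (maximally specific, nothing
    # accepted yet), 1 or 0 (required literal value), or None (literal dropped).
    h = ['empty'] * n
    for x, label in examples:
        if not label:
            continue                      # negative examples are ignored
        for i in range(n):
            if h[i] == 'empty':
                h[i] = x[i]               # first positive example fixes the literal
            elif h[i] is not None and h[i] != x[i]:
                h[i] = None               # conflicting value: drop the literal
    return h

print(find_s([((1, 0, 1), True), ((1, 1, 1), True), ((0, 0, 0), False)], 3))
# -> [1, None, 1], i.e. the conjunction l1 AND l3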
Mistake Bound for Halving Algorithm
● The Halving algorithm maintains the version space and classifies each new instance
by a majority vote of the version space hypotheses: if the majority classify the new
instance as positive, then this prediction is output by the learner.
● The Halving algorithm makes a mistake only when the majority of hypotheses in its
current version space misclassify the new instance.
● In that case, once the correct classification is revealed, all of those (majority)
incorrect hypotheses are eliminated, so the version space is reduced to at most half
its current size. Hence the number of mistakes made before exactly learning the
target concept is at most log2|H|.
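A minimal sketch of the idea in Python (names are illustrative; hypotheses are assumed to be plain callables over instances):

def halving_algorithm(hypotheses, stream):
    # hypotheses: finite list of callables h(x) -> bool (the initial version space).
    # stream: iterable of (x, true_label) pairs presented online.
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 2 * votes >= len(version_space)   # majority vote (ties -> positive)
        if prediction != y:
            mistakes += 1      # majority was wrong: at least half the version space goes
        version_space = [h for h in version_space if h(x) == y]
    return version_space, mistakes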
Optimal Mistake Bounds
Therefore, M_FIND-S(C) = n + 1 and M_Halving(C) ≤ log2(|C|).
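A brief sketch of the surrounding definition and result, as usually stated: the optimal mistake bound Opt(C) is the minimum, over all possible learning algorithms, of the maximum number of mistakes made to exactly learn any concept in C. Littlestone (1987) showed that for any non-empty concept class C,
VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|).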
END OF CHAPTER 7 & MODULE 3