UNIT-IV
Computational Learning
Probability Learning
Probability learning is a machine
learning technique that uses
probability theory to make
predictions and decisions.
It's a statistical approach that
models uncertainty in data by using
probability distributions.
A probability
distribution describes the possible
values and the corresponding
likelihoods that a random variable
can take. For example, the
probabilities of observing 0, 1, 2, …,
100 heads, respectively, in 100
tosses of a coin.
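The heads-count example can be made concrete with the binomial distribution; this is a minimal sketch (the helper name `binomial_pmf` is ours, not from the text) that computes the probability of each possible head count in 100 fair-coin tosses:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# The distribution over the number of heads in 100 tosses of a fair coin:
dist = {k: binomial_pmf(k, 100, 0.5) for k in range(101)}

print(round(dist[50], 4))            # 0.0796 -- the most likely count
print(round(sum(dist.values()), 4))  # 1.0    -- probabilities sum to 1
```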
A frequentist calculates probabilities
from the relative frequencies of
specific events out of the total
number of trials. For example, after
observing 56 heads in 100 tosses,
P(Head) = 56/100 = 0.56.
Bayesian
A Bayesian updates a prior belief with
current experiment data using
Bayesian inference. For example, a
Bayesian can combine a prior belief
that the coin is fair with the current
experiment data (56 heads out of
100) to form a new, posterior belief.
A Frequentist estimates the most
likely value for P(head) (a point
estimate). But a Bayesian tracks all
possibilities with the corresponding
certainties. This calculation is
complex but contains richer
information for further computation.
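The contrast can be sketched in code. The Beta(50, 50) prior below is one assumed way to encode "the coin is fair"; the text does not specify a prior, so the numbers are for illustration only:

```python
# Frequentist point estimate vs. Bayesian posterior for P(head),
# using the 56-heads-out-of-100 experiment from the text.
heads, tosses = 56, 100

# Frequentist: the single most likely value (maximum likelihood).
p_mle = heads / tosses  # 0.56

# Bayesian: start from a Beta(a, b) prior encoding "the coin is fair"
# and update it with the data. The Beta prior is conjugate to coin
# flips, so the posterior is again a Beta distribution.
a_prior, b_prior = 50, 50            # a fairly strong belief in fairness
a_post = a_prior + heads             # 106
b_post = b_prior + (tosses - heads)  # 94

# The posterior mean sits between the prior (0.5) and the data (0.56):
posterior_mean = a_post / (a_post + b_post)
print(p_mle, round(posterior_mean, 3))  # 0.56 0.53
```

The Bayesian answer is a whole distribution (Beta(106, 94)), not just the mean, which is the "richer information" mentioned above.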
Hypothesis
The hypothesis is one of the
commonly used concepts of
statistics in Machine Learning.
It is specifically used in Supervised
Machine learning, where an ML
model learns a function that best
maps the input to corresponding
outputs with the help of an available
dataset.
Hypothesis
There are some common methods
used to find a possible hypothesis
from the hypothesis space, where
the hypothesis space is represented
by uppercase-h (H) and a hypothesis
by lowercase-h (h). These are
defined as follows:
The hypothesis (h) can be formulated in
machine learning as follows:
y = mx + c
Where,
y: range (the predicted output)
m: slope of the line which divides the test
data, i.e. the change in y divided by the
change in x
x: domain (the input)
c: intercept (constant)
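As a minimal illustration, each choice of the parameters m and c picks out one hypothesis h from the space of all lines:

```python
def hypothesis(x: float, m: float, c: float) -> float:
    """One hypothesis h from the space of lines: y = m*x + c."""
    return m * x + c

# Two different hypotheses drawn from the same hypothesis space H:
print(hypothesis(2.0, m=3.0, c=1.0))   # 7.0
print(hypothesis(2.0, m=-1.0, c=0.5))  # -1.5
```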
Hypothesis space (H):
Hypothesis space is defined as a
set of all possible legal
hypotheses; hence it is also
known as a hypothesis set.
Sample complexity
Sample complexity is a concept
in machine learning that determines the
number of data samples required to
achieve a certain level of learning
performance.
Its importance lies in its ability to assess
the efficiency of a learning algorithm.
A more efficient algorithm needs fewer
samples to learn effectively, reducing the
resources required for data acquisition
and storage.
There are two types of sample complexities
that are often referenced: worst-case sample
complexity and average-case sample
complexity.
Worst-case sample complexity refers to the
maximum number of samples required to
reach a specific learning goal, irrespective of
the data distribution.
Average-case sample complexity, on the
other hand, considers the average number of
samples needed, assuming the data follows a
certain distribution.
Mathematical backbone of
sample complexity
Probably Approximately Correct
(PAC) learning theory provides a
framework to relate VC dimension to
sample complexity. PAC learning
seeks to identify the minimum
sample size that will, with high
probability, produce a hypothesis
within a specified error tolerance of
the best possible hypothesis.
The PAC learning bound is given by:
N >= (1/ε) * (ln|H| + ln(1/δ)),
Where,
N is the sample size,
ε is the maximum acceptable error
(the 'approximately correct' part),
|H| is the size of the hypothesis
space (related to VC dimension),
δ is the acceptable failure probability
(the 'probably' part).
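The bound can be evaluated directly; `pac_sample_size` below is a hypothetical helper that simply plugs values into the formula above, using natural logarithms:

```python
from math import ceil, log

def pac_sample_size(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer N satisfying N >= (1/eps) * (ln|H| + ln(1/delta))."""
    return ceil((1.0 / epsilon) * (log(h_size) + log(1.0 / delta)))

# |H| = 2^10 hypotheses, 5% error tolerance, 95% confidence (delta = 0.05):
print(pac_sample_size(2**10, epsilon=0.05, delta=0.05))  # 199
```

Note how N grows only logarithmically in |H|: doubling the hypothesis space adds a constant number of samples.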
Finite hypothesis space
A finite hypothesis space consists of
a limited, countable set of
hypotheses (models or functions)
that can be selected to explain or
predict outcomes based on the input
data.
Example: Decision trees with a
limited depth, a fixed number of
linear classifiers, or a specific set of
rules in rule-based learning.
Implications
Easier to manage and evaluate since
the number of hypotheses is small.
Risk of underfitting if the hypothesis
space is too constrained and does
not capture the underlying data
distribution.
It can be easier to ensure
generalization since overfitting can
be less of a concern due to the
limited complexity.
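A small finite hypothesis space can even be enumerated exhaustively. This sketch (the labelled dataset is made up for illustration) lists all 8 monotone conjunctions over 3 boolean inputs and keeps those consistent with the data:

```python
from itertools import combinations

# A finite hypothesis space: all monotone conjunctions over 3 boolean
# inputs (including the empty conjunction, which always predicts True).
VARS = (0, 1, 2)
H = [subset for r in range(4) for subset in combinations(VARS, r)]
print(len(H))  # 8 hypotheses in total

def predict(h, x):
    """h is a tuple of variable indices; predict True iff all are set in x."""
    return all(x[i] for i in h)

# Keep every hypothesis consistent with a small labelled dataset:
data = [((1, 1, 0), True), ((1, 0, 0), False), ((0, 1, 1), False)]
consistent = [h for h in H if all(predict(h, x) == y for x, y in data)]
print(consistent)  # [(0, 1)]  -- only "x1 AND x2" fits all three examples
```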
Infinite hypothesis space
An infinite hypothesis space contains
an unbounded or uncountable set of
hypotheses. This means there are
potentially limitless models that could
be considered for fitting the data.
Example: Linear regression with any
real-valued coefficients, neural
networks with varying architectures
and parameters, or kernel methods in
support vector machines (SVMs).
Implications
Greater flexibility and the ability to
capture complex patterns in the
data.
Higher risk of overfitting, as the
model can become too complex and
fit noise in the data rather than the
underlying distribution.
Requires more advanced techniques
for regularization and validation to
ensure that the model generalizes
well.
Mistake bound model
The mistake bound (MB) model in machine
learning is a model that evaluates a learner
based on the total number of mistakes it
makes before reaching the correct
hypothesis.
The model is used in online learning
scenarios, where the learning process is
made up of rounds.
In each round, the learner is asked about an
aspect of the learned phenomenon, makes a
prediction, and is told if it was correct.
The goal of the learner is to make
the minimum number of mistakes
possible in the learning process.
Mistake bound model
algorithm
An algorithm A is said to learn C in the
mistake bound model if for any concept
c ∈ C, and for any ordering of
examples consistent with c, the total
number of mistakes ever made by A is
bounded by p(n, size(c)), where p is a
polynomial. We say that A is a
polynomial time learning algorithm if
its running time per stage is also
polynomial in n and size(c).
Conjunctions
Let us assume that we know that the
target concept c will be a
conjunction of a set of (possibly
negated) variables, with an example
space of n-bit strings. Consider the
following algorithm:
Algorithm for MB
1. Initialize the hypothesis h to the
conjunction of every literal and its
negation: x1 ¬x1 x2 ¬x2 . . . xn ¬xn.
2. Predict using h(x).
3. If the prediction is False but the label is
actually True, remove all the literals in h
which are False in x. (So if the first mistake
is on 1001, the new h will be x1 ¬x2 ¬x3 x4.)
4. If the prediction is True but the label is
actually False, then output “no consistent
conjunction”.
5. Return to step 2.
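The steps above can be sketched in Python; representing each literal as a pair `(i, v)` ("bit i must equal v") is an implementation choice, not part of the original algorithm statement:

```python
def learn_conjunction(examples, n):
    """Mistake-bound learner for conjunctions over n-bit strings."""
    # Step 1: start with both literals for every bit, so h predicts
    # False on everything until the first positive example arrives.
    h = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x, label in examples:
        pred = all(x[i] == v for i, v in h)          # step 2: predict
        if pred != label:
            mistakes += 1
            if label:                                # step 3: False on a True example
                h = {(i, v) for i, v in h if x[i] == v}  # drop falsified literals
            else:                                    # step 4: True on a False example
                raise ValueError("no consistent conjunction")
    return h, mistakes

# Target concept: x1 AND NOT x2 (0-indexed: bit 0 set, bit 1 clear).
examples = [((1, 0, 0), True), ((1, 0, 1), True), ((0, 0, 1), False)]
h, mistakes = learn_conjunction(examples, n=3)
print(sorted(h), mistakes)  # [(0, 1), (1, 0)] 2
```

Each mistake on a positive example removes at least one literal, and there are only 2n literals to remove, which is where the polynomial mistake bound comes from.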
Lower Bound: In fact no deterministic
algorithm can achieve a mistake
bound better than n in the worst
case.
This can be seen by considering the
sequence of n examples in which the
ith example has all bits except the
ith bit set to 1.
The target concept c will be a monotone
conjunction constructed by including xi only
if the algorithm predicts the ith example to
be True (in which case the ith example’s
label will be False).
(If the algorithm predicts the ith example to
be False, then the target concept will not
include xi , and so the true label will be
True.) The algorithm will have made n
mistakes by the time all of these n examples
are processed.
Learning set of rules
A learning rule set in machine learning
is a collection of rules that describe a
dataset. The process of creating these
rules from data is called rule learning.
Rule types
The most common type of rule learning
is inductive rule learning, also known as
rule induction. Other types of rules
include association rules, which are used
to express relationships in large datasets.
Rule form
The basic form of a rule is "IF
PREMISE THEN CONSEQUENT". This
means that the consequent is true
whenever the premise is true.
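A rule of this form can be represented directly in code; the attribute names used here (`outlook`, `humidity`, `play`) are made-up illustration values, not from the text:

```python
# A rule "IF PREMISE THEN CONSEQUENT" as a premise predicate
# paired with a consequent (attribute, value) conclusion.
rule = {
    "premise": lambda r: r["outlook"] == "sunny" and r["humidity"] == "high",
    "consequent": ("play", "no"),
}

def apply_rule(rule, record):
    """Return the consequent if the premise holds for the record, else None."""
    return rule["consequent"] if rule["premise"](record) else None

record = {"outlook": "sunny", "humidity": "high"}
print(apply_rule(rule, record))                              # ('play', 'no')
print(apply_rule(rule, {"outlook": "rain", "humidity": "high"}))  # None
```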
Learning strategies
Some strategies for learning rule
sets include:
Learn-One-Rule: Searches from
general to specific.
Find-S: Searches from specific to
general.
FOIL: Learns one rule at a time,
removing positive examples covered
by the learned rule before
attempting to learn another rule.
Sequential Covering
Algorithm
Sequential Covering is a popular
algorithm based on Rule-Based
Classification used for learning a
disjunctive set of rules.
The basic idea here is to learn one
rule, remove the data that it covers,
then repeat the same process.
In this way, it covers all the rules
involved with it in a sequential
manner during the training phase.
Sequential Covering
Algorithm
The algorithm involves a set of ‘ordered
rules’ or ‘list of decisions’ to be made.
Step 1 – create an empty decision list, ‘R’.
Step 2 – The ‘Learn-One-Rule’ algorithm
extracts the best rule for a particular class
‘y’, where a rule covers a subset of the
training examples.
In the beginning,
Step 2.a – if a training example ∈ class
‘y’, it is treated as a positive example.
Step 2.b – else, if a training example ∉
class ‘y’, it is treated as a negative
example.
Step 3 – The rule becomes
‘desirable’ when it covers a
majority of the positive examples.
Step 4 – When this rule is obtained,
delete all the training examples
covered by that rule (i.e. when the
rule is applied to the dataset, the
examples it covers are removed).
Step 5 – The new rule is added to
the bottom of decision list, ‘R’.
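The five steps can be sketched as follows; `learn_one_rule` here is a deliberately crude stand-in that greedily picks the single attribute test covering the most positive and no negative examples, and the weather data is made up for illustration:

```python
def learn_one_rule(data):
    """Best single (attribute, value) test covering only positives."""
    best, best_cover = None, 0
    tests = {(a, r[a]) for r, _ in data for a in r}
    for a, v in tests:
        covered = [(r, y) for r, y in data if r[a] == v]
        if covered and all(y for _, y in covered) and len(covered) > best_cover:
            best, best_cover = (a, v), len(covered)
    return best

def sequential_covering(data):
    R = []                                   # step 1: empty decision list
    while any(y for _, y in data):           # while positives remain
        rule = learn_one_rule(data)          # step 2: best rule for class y
        if rule is None:
            break                            # no rule covers only positives
        R.append(rule)                       # step 5: add rule to the list
        a, v = rule
        data = [(r, y) for r, y in data if r[a] != v]  # step 4: remove covered
    return R

data = [
    ({"sky": "sunny", "wind": "weak"}, True),
    ({"sky": "sunny", "wind": "strong"}, True),
    ({"sky": "rainy", "wind": "weak"}, False),
]
print(sequential_covering(data))  # [('sky', 'sunny')]
```

One rule suffices here because a single test separates all positives; on harder data the loop would keep learning rules until no positives remain uncovered.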