DWDM Unit 3 Part 2
INTRODUCTION
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities such as the probability that a given tuple belongs to a particular class. Bayesian
classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier,
known as the naive Bayesian classifier, to be comparable in performance with decision tree and
selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy
and speed when applied to large databases. Naive Bayesian classifiers assume that the effect of
an attribute value on a given class is independent of the values of the other attributes. This
assumption is called class conditional independence. It is made to simplify the computations
involved and, in this sense, is considered “naive.”
Bayesian belief networks, unlike naive Bayesian classifiers, do not assume class
conditional independence; that is, they do not require that, given the class label of a tuple, the
values of the attributes be conditionally independent of one another, and can therefore model
dependencies among attributes.
BAYES’ THEOREM
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who
did early work in probability and decision theory during the 18th century.
Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is
described by measurements made on a set of n attributes. Let H be some hypothesis such as
that the data tuple X belongs to a specified class C. For classification problems, we want to
determine P(H/X), the probability that the hypothesis H holds given the “evidence” or observed
data tuple X. In other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
P(H/X)is the posterior probability, or a posteriori probability, of H conditioned on X. For
example, suppose our world of data tuples is confined to customers described by the attributes
age and income, respectively, and that X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer. Then P(H/X)reflects
the probability that customer X will buy a computer given that we know the customer’s age and
income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example,
this is the probability that any given customer will buy a computer, regardless of age, income,
or any other information, for that matter. The posterior probability, P(H/X), is based on more
information (e.g., customer information) than the prior probability, P(H), which is independent
of X.
Similarly, P(X/H) is the posterior probability of X conditioned on H. That is, it is the
probability that a customer, X, is 35 years old and earns $40,000, given that we know the
customer will buy a computer.
P(X) is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.
P(H), P(X/H) and P(X) may be estimated from the given data, as we shall see next. Bayes’
theorem is useful in that it provides a way of calculating the posterior probability, P(H/X), from
P(H), P(X/H) and P(X).
Bayes’ theorem is
P(H/X) = P(X/H) P(H) / P(X)
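As a quick illustration of how the theorem combines these three quantities, the short Python sketch below computes a posterior from a prior, a likelihood, and an evidence probability. The numeric values are assumptions made only for this illustration; they are not taken from the notes.

# A minimal sketch of Bayes' theorem: P(H/X) = P(X/H) * P(H) / P(X).
# All numeric values below are illustrative assumptions, not figures from the notes.

def posterior(p_h, p_x_given_h, p_x):
    """Return P(H/X) given the prior P(H), the likelihood P(X/H), and P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical example: H = "customer buys a computer",
# X = "customer is 35 years old and earns $40,000".
p_h = 0.5          # assumed prior P(H)
p_x_given_h = 0.3  # assumed likelihood P(X/H)
p_x = 0.25         # assumed evidence probability P(X)

print(posterior(p_h, p_x_given_h, p_x))  # prints 0.6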
NAIVE BAYESIAN CLASSIFICATION
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple
is represented by an n-dimensional attribute vector, X = (x1, x2, . . . , xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, . . . , An.
2. Suppose that there are m classes, C1, C2, . . . , Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the Naive Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if
P(Ci/X) > P(Cj/X)   for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci/X). The class Ci for which P(Ci/X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci/X) = P(X/Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X/Ci)P(Ci) needs to be maximized. If the
class prior probabilities are not known, then it is commonly assumed that the classes
are equally likely, that is, P(C1) = P(C2) = . . . = P(Cm) and we would therefore
maximize P(X/Ci).
Otherwise, we maximize P(X/Ci)P(Ci). Note that the class prior probabilities
may be estimated by P(Ci) = |Ci,D|/ |D|, where |Ci,D| is the number of
training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X/Ci). To reduce computation in evaluating P(X/Ci), the
naive assumption of class-conditional independence is made. This presumes that
the attributes’ values are conditionally independent of one another, given the class
label of the tuple (i.e., that there are no dependence relationships among the
attributes). Thus,
P(X/Ci) = ∏ (k = 1 to n) P(xk/Ci) = P(x1/Ci) × P(x2/Ci) × . . . × P(xn/Ci).
5. To predict the class label of X, P(X/Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if
P(X/Ci) P(Ci) > P(X/Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i.
In other words, the predicted class label is the class Ci for which P(X/Ci)P(Ci) is the maximum.
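The five steps above can be realized directly for categorical attributes. The Python sketch below is a minimal illustration under that assumption; the function names train and classify and the data layout are choices made for this example, not part of the notes, and no zero-count correction is applied.

from collections import Counter, defaultdict

def train(tuples, labels):
    """Estimate the class counts |Ci,D| and per-attribute value counts for P(xk/Ci) from D."""
    priors = Counter(labels)                 # |Ci,D| for each class Ci
    cond = defaultdict(Counter)              # (class, attribute index) -> value counts
    for x, c in zip(tuples, labels):
        for k, value in enumerate(x):
            cond[(c, k)][value] += 1
    return priors, cond, len(labels)

def classify(x, priors, cond, total):
    """Return the class Ci that maximizes P(X/Ci) P(Ci) under the naive assumption."""
    best_class, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / total          # P(Ci) = |Ci,D| / |D|
        for k, value in enumerate(x):
            score *= cond[(c, k)][value] / class_count   # P(xk/Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Because the class-conditional probabilities are multiplied together, a single zero count would force the whole product to zero; in practice a Laplacian correction (adding one to each count) is commonly used to avoid this.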
Example:
Consider the following training data set, D, of class-labeled tuples:

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys_computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys_computer = yes and C2 correspond to
buys_computer = no.
The tuple to classify is
X = (age = youth, income = medium, student = yes, credit_rating = fair).
We need to maximize P(X/Ci)P(Ci) for i = 1, 2. The prior probability of each class is
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
The conditional probabilities of the attribute values of X for each class are
P(age = youth / buys_computer = yes) = 2/9 = 0.222
P(age = youth / buys_computer = no) = 3/5 = 0.600
P(income = medium / buys_computer = yes) = 4/9 = 0.444
P(income = medium / buys_computer = no) = 2/5 = 0.400
P(student = yes / buys_computer = yes) = 6/9 = 0.667
P(student = yes / buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair / buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair / buys_computer = no) = 2/5 = 0.400
Using these probabilities,
P(X / buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X / buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
To find the class Ci that maximizes P(X/Ci)P(Ci), we compute
P(X / buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X / buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
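The same computation can be reproduced with the hypothetical train and classify sketches shown after step 5 above; the data encoding below simply restates the training table in that format.

# Training tuples (age, income, student, credit_rating) from the table above.
data = [
    ("youth", "high", "no", "fair"),         ("youth", "high", "no", "excellent"),
    ("middle_aged", "high", "no", "fair"),   ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"),        ("senior", "low", "yes", "excellent"),
    ("middle_aged", "low", "yes", "excellent"), ("youth", "medium", "no", "fair"),
    ("youth", "low", "yes", "fair"),         ("senior", "medium", "yes", "fair"),
    ("youth", "medium", "yes", "excellent"), ("middle_aged", "medium", "no", "excellent"),
    ("middle_aged", "high", "yes", "fair"),  ("senior", "medium", "no", "excellent"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

priors, cond, total = train(data, labels)
x = ("youth", "medium", "yes", "fair")
print(classify(x, priors, cond, total))     # prints "yes"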
BAYESIAN BELIEF NETWORKS
A Bayesian belief network specifies joint conditional probability distributions. It is defined by
two components: a directed acyclic graph, in which each node represents a random variable and
each arc represents a probabilistic dependence (if an arc is drawn from node Y to node Z, then Y
is a parent of Z), and a conditional probability table (CPT) for each variable. A standard
illustration uses the Boolean variables FamilyHistory, Smoker, LungCancer, Emphysema,
PositiveXRay, and Dyspnea, where FamilyHistory and Smoker are the parents of LungCancer.
The CPT for LungCancer is:

                    FH, S    FH, ~S    ~FH, S    ~FH, ~S
LungCancer = yes    0.8      0.5       0.7       0.1
LungCancer = no     0.2      0.5       0.3       0.9

(FH = FamilyHistory, S = Smoker, and ~ denotes the value “no”.)
The conditional probability for each known value of LungCancer is given for each
possible combination of the values of its parents. For instance, from the upper
leftmost and bottom rightmost entries, respectively, we see that
P(LungCancer = yes / FamilyHistory = yes, Smoker = yes) = 0.8
P(LungCancer = no / FamilyHistory = no, Smoker = no) = 0.9
Let X = (x1, x2, . . . , xn) be a data tuple described by the variables or attributes Y1,Y2, . . . ,Yn
respectively. Recall that each variable is conditionally independent of its non-descendants in
the network graph, given its parents. This allows the network to provide a complete
representation of the existing joint probability distribution with the following equation:
P(x1, x2, . . . , xn) = ∏ (i = 1 to n) P(xi / Parents(Yi)),
where P(x1, x2, . . . , xn) is the probability of a particular combination of values of X, and the
values for P(xi / Parents(Yi)) correspond to the entries in the CPT for Yi .
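To make the factorization concrete, the Python sketch below computes the probability of one assignment over the three variables FamilyHistory, Smoker, and LungCancer. The LungCancer entries follow the CPT given above, while the prior probabilities assumed for FamilyHistory and Smoker are made up purely for the illustration.

# Joint probability via P(x1, ..., xn) = prod_i P(xi / Parents(Yi)).
p_fh = {"yes": 0.2, "no": 0.8}   # ASSUMED prior P(FamilyHistory), for illustration only
p_s  = {"yes": 0.3, "no": 0.7}   # ASSUMED prior P(Smoker), for illustration only
# P(LungCancer = yes / FamilyHistory, Smoker), from the CPT above.
p_lc = {
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no", "yes"):  0.7,
    ("no", "no"):   0.1,
}

def joint(fh, s, lc):
    """P(FamilyHistory = fh, Smoker = s, LungCancer = lc)."""
    p_lc_given_parents = p_lc[(fh, s)] if lc == "yes" else 1.0 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(joint("yes", "yes", "yes"))   # 0.2 * 0.3 * 0.8 = 0.048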
Each CPT entry can be viewed as a weight wijk = P(Yi = yij / Ui = uik), where Yi is a
variable, yij is one of its values, Ui lists the parents of Yi, and uik is one combination of
values of those parents. For example, if wijk is the upper leftmost entry of the CPT for
LungCancer shown above, then Yi is LungCancer; yij is its value, “yes”;
Ui lists the parent nodes of Yi, namely, {FamilyHistory, Smoker};
and uik lists the values of the parent nodes, namely, {“yes”, “yes”}.
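To connect this notation with the earlier sketch, the fragment below (reusing the hypothetical p_lc dictionary from the previous code) pulls out the entry that plays the role of this particular wijk.

# wijk = P(Yi = yij / Ui = uik) with Yi = LungCancer, yij = "yes",
# Ui = {FamilyHistory, Smoker}, and uik = ("yes", "yes").
w_ijk = p_lc[("yes", "yes")]   # 0.8, the upper leftmost CPT entry
print(w_ijk)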
EXAMPLE
Calculate
Note: Since P(T) = 0.001, P(~T) = 1 − P(T) = 0.999; similarly, since P(E) = 0.002, P(~E) = 1 − P(E) = 0.998.
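If T and E are root nodes of the network (so each has no parents and, by the factorization above, they are independent of each other), the joint probability of any assignment to them is just a product of the corresponding entries. The sketch below shows this under that assumption, without assuming anything about the rest of the network or the original query.

# T and E treated as parentless (root) nodes, so P(T, E) factors as a product.
p_t, p_e = 0.001, 0.002
p_not_t, p_not_e = 1 - p_t, 1 - p_e     # 0.999 and 0.998, as in the note above
print(p_not_t * p_not_e)                # P(~T, ~E) = 0.999 * 0.998 = 0.997002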