DATA WAREHOUSING AND DATA MINING (R16)

Unit – III Part 2

Classification: Alternative Techniques


 Bayes’ Theorem
 Naïve Bayesian Classification
 Bayesian Belief Networks

INTRODUCTION
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities such as the probability that a given tuple belongs to a particular class. Bayesian
classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found that a simple Bayesian classifier
known as the Naive Bayesian Classifier is comparable in performance with decision tree and
selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy
and speed when applied to large databases. Naive Bayesian classifiers assume that the effect of
an attribute value on a given class is independent of the values of the other attributes. This
assumption is called class conditional independence. It is made to simplify the computations
involved and, in this sense, is considered “naive.”
Bayesian Belief Networks, unlike naive Bayesian classifiers, do not assume class
conditional independence; they allow dependencies to be represented among subsets of
attributes.

BAYES’ THEOREM
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who
did early work in probability and decision theory during the 18th century.
Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is
described by measurements made on a set of n attributes. Let H be some hypothesis such as
that the data tuple X belongs to a specified class C. For classification problems, we want to
determine P(H/X), the probability that the hypothesis H holds given the “evidence” or observed
data tuple X. In other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
P(H/X) is the posterior probability, or a posteriori probability, of H conditioned on X. For
example, suppose our world of data tuples is confined to customers described by the attributes
age and income, respectively, and that X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer. Then P(H/X) reflects
the probability that customer X will buy a computer given that we know the customer’s age and
income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example,
this is the probability that any given customer will buy a computer, regardless of age, income,
or any other information, for that matter. The posterior probability, P(H/X), is based on more
information (e.g., customer information) than the prior probability, P(H), which is independent
of X.
Similarly, P(X/H) is the posterior probability of X conditioned on H. That is, it is the
probability that a customer, X, is 35 years old and earns $40,000, given that we know the
customer will buy a computer.
P(X) is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.

P(H), P(X/H) and P(X) may be estimated from the given data, as we shall see next. Bayes’
theorem is useful in that it provides a way of calculating the posterior probability, P(H/X), from
P(H), P(X/H) and P(X).
Bayes’ theorem is:

P(H/X) = [P(X/H) P(H)] / P(X)
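As a quick illustration of how the theorem combines these three quantities, here is a minimal Python sketch; the numeric values are purely hypothetical assumptions for the buys-computer scenario, not figures from the text.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H/X) = P(X/H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical values (assumed for illustration only):
# P(H)   = 0.5  -> prior: half of all customers buy a computer
# P(X/H) = 0.2  -> likelihood: fraction of buyers who are 35 and earn $40,000
# P(X)   = 0.15 -> evidence: fraction of all customers who are 35 and earn $40,000
print(posterior(0.5, 0.2, 0.15))   # P(H/X) is about 0.667
```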

NAÏVE BAYESIAN CLASSIFICATION


The Naive Bayesian Classifier, or Simple Bayesian Classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple
is represented by an n-dimensional attribute vector, X = (x1, x2, . . . , xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, . . . , An.
2. Suppose that there are m classes, C1, C2, . . . , Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the Naive Bayesian classifier predicts that tuple X belongs
to the class Ci if and only if

P(Ci/X) > P(Cj/X)   for 1 ≤ j ≤ m, j ≠ i

Thus, we maximize P(Ci/X). The class Ci for which P(Ci/X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci/X) = [P(X/Ci) P(Ci)] / P(X)
3. As P(X) is constant for all classes, only P(X/Ci)P(Ci) needs to be maximized. If the
class prior probabilities are not known, then it is commonly assumed that the classes
are equally likely, that is, P(C1) = P(C2) = . . . = P(Cm) and we would therefore
maximize P(X/Ci).
Otherwise, we maximize P(X/Ci)P(Ci). Note that the class prior probabilities
may be estimated by P(Ci) = |Ci,D|/ |D|, where |Ci,D| is the number of
training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X/Ci). To reduce computation in evaluating P(X/Ci), the
naive assumption of class-conditional independence is made. This presumes that
the attributes’ values are conditionally independent of one another, given the class
label of the tuple (i.e., that there are no dependence relationships among the
attributes). Thus,

P(X/Ci) = P(x1/Ci) × P(x2/Ci) × . . . × P(xn/Ci)

The probabilities P(x1/Ci), P(x2/Ci), . . . , P(xn/Ci) can be calculated from the
training tuples. Here, xk refers to the value of attribute Ak for tuple X. For each attribute,
we look at whether the attribute is categorical or continuous-valued. For instance, to
compute P(X/Ci), we consider the following:
(a) If Ak is categorical, then P(xk/Ci) is the number of tuples of class Ci in D having
the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, then we need to do a bit more work, but the
calculation is pretty straightforward. A continuous-valued attribute is typically
assumed to have a Gaussian (normal) distribution with mean μ and standard
deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

so that P(xk/Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean (i.e., average)
and standard deviation, respectively, of the values of attribute Ak for the training
tuples of class Ci.

5. To predict the class label of X, P(X/Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

P(X/Ci) P(Ci) > P(X/Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i

In other words, the predicted class label is the class Ci for which P(X/Ci)P(Ci) is the maximum.
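The five steps above translate directly into a small amount of code. Below is a minimal Python sketch of a naive Bayesian classifier for categorical attributes (for a continuous-valued attribute, the likelihood would instead be computed with the Gaussian density of step 4(b)). The function names and data layout are illustrative assumptions, not part of the original text.

```python
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Estimate P(Ci) and the per-attribute likelihoods P(xk/Ci) from categorical training data."""
    n = len(tuples)
    class_sizes = Counter(labels)
    priors = {c: count / n for c, count in class_sizes.items()}   # P(Ci) = |Ci,D| / |D|
    counts = defaultdict(Counter)      # counts[(attr_index, class)][value] = frequency
    for x, c in zip(tuples, labels):
        for k, xk in enumerate(x):
            counts[(k, c)][xk] += 1

    def likelihood(k, xk, c):
        # P(xk/Ci): tuples of class Ci having value xk for attribute Ak, divided by |Ci,D|
        return counts[(k, c)][xk] / class_sizes[c]

    return priors, likelihood

def predict(x, priors, likelihood):
    """Return the class Ci that maximizes P(X/Ci) * P(Ci), under class-conditional independence."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, xk in enumerate(x):
            score *= likelihood(k, xk, c)   # P(X/Ci) = product of the P(xk/Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```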

Example:
Consider the following data set:

Predict a class label using naive Bayesian classification.

 The data tuples are described by the attributes age, income, student, and credit rating.
 The class label attribute, buys_computer, has two distinct values (namely, {yes, no}).
 C1 corresponds to the class buys_computer=yes and C2 corresponds to buys_computer=no.

 Tuple to classify is
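The data table and the worked probability calculations for this example appear as figures in the original document. Purely to illustrate the mechanics, the sketch above can be applied to a small hypothetical training set in the same style; the rows and the tuple below are invented for illustration and are not the rows of the original table.

```python
# Hypothetical tuples: (age, income, student, credit_rating)
data = [
    ("youth", "high", "no", "fair"),               ("senior", "medium", "no", "fair"),
    ("youth", "medium", "yes", "excellent"),       ("middle_aged", "high", "no", "fair"),
    ("youth", "medium", "yes", "fair"),            ("senior", "low", "yes", "fair"),
    ("middle_aged", "medium", "yes", "excellent"), ("senior", "medium", "no", "fair"),
]
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

priors, likelihood = train_naive_bayes(data, labels)
X = ("youth", "medium", "yes", "fair")      # hypothetical tuple to classify
print(predict(X, priors, likelihood))       # prints the class with the largest P(X/Ci)P(Ci)
```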

BAYESIAN BELIEF NETWORKS


 The naive Bayesian classifier makes the assumption of class conditional independence,
that is, given the class label of a tuple, the values of the attributes are assumed to be
conditionally independent of one another.
 Bayesian belief networks specify joint conditional probability distributions. They allow
class conditional independencies to be defined between subsets of variables.
 Bayesian belief networks are also known as Belief Networks, Bayesian Networks, and
Probabilistic Networks.
 A belief network is defined by two components: (1) a directed acyclic graph and (2) a set
of conditional probability tables.
(1) Each node in the directed acyclic graph represents a random variable. The
variables may be discrete or continuous-valued. They may correspond to actual
attributes given in the data or to “hidden variables” believed to form a
relationship (e.g., in the case of medical data, a hidden variable may indicate a
syndrome, representing a number of symptoms that, together, characterize a
specific disease). Each arc represents a probabilistic dependence. If an arc is
drawn from a node Y to a node Z, then Y is a parent or immediate predecessor
of Z, and Z is a descendant of Y. Each variable is conditionally independent of its
non-descendants in the graph, given its parents.

The arcs in the above directed acyclic graph allow a representation of causal
knowledge. For example, having lung cancer is influenced by a person’s family
history of lung cancer, as well as whether or not the person is a smoker. Note
that the variable PositiveXRay is independent of whether the patient has a family
history of lung cancer or is a smoker, given that we know the patient has lung
cancer. In other words, once we know the outcome of the variable LungCancer,
then the variables FamilyHistory and Smoker do not provide any additional
information regarding PositiveXRay. The arcs also show that the variable
LungCancer is conditionally independent of Emphysema, given its parents,
FamilyHistory and Smoker.
(2) A belief network has one Conditional Probability Table (CPT) for each variable.
The CPT for a variable Y specifies the conditional distribution P(Y/Parents(Y)),
where Parents(Y) are the parents of Y. A CPT for the variable LungCancer is shown below.

The conditional probability for each known value of LungCancer is given for each
possible combination of the values of its parents. For instance, from the upper
leftmost and bottom rightmost entries, respectively, we see that

Let X = (x1, x2, . . . , xn) be a data tuple described by the variables or attributes Y1,Y2, . . . ,Yn
respectively. Recall that each variable is conditionally independent of its non-descendants in
the network graph, given its parents. This allows the network to provide a complete
representation of the existing joint probability distribution with the following equation:

P(x1, x2, . . . , xn) = P(x1/Parents(Y1)) × P(x2/Parents(Y2)) × . . . × P(xn/Parents(Yn))
where P(x1, x2, . . . , xn) is the probability of a particular combination of values of X, and the
values for P(xi / Parents(Yi)) correspond to the entries in the CPT for Yi .
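As a sketch of how this factorization is used in practice, a belief network can be stored as a mapping from each variable to its parents and its CPT, and the joint probability evaluated as a product of CPT lookups. The variable names and all probability values below are hypothetical assumptions, not figures from the text.

```python
# Hypothetical fragment of a network: FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC).
parents = {"FH": (), "S": (), "LC": ("FH", "S")}
cpt = {
    "FH": {(): {"yes": 0.3, "no": 0.7}},
    "S":  {(): {"yes": 0.4, "no": 0.6}},
    "LC": {
        ("yes", "yes"): {"yes": 0.8, "no": 0.2},
        ("yes", "no"):  {"yes": 0.5, "no": 0.5},
        ("no", "yes"):  {"yes": 0.7, "no": 0.3},
        ("no", "no"):   {"yes": 0.1, "no": 0.9},
    },
}

def joint_probability(assignment):
    """P(x1, ..., xn) = product over all variables of P(xi / Parents(Yi)), read off the CPTs."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpt[var][parent_values][value]
    return prob

print(joint_probability({"FH": "yes", "S": "no", "LC": "yes"}))   # 0.3 * 0.6 * 0.5 = 0.09
```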

Training Bayesian Belief Networks


 Let D be a training set of data tuples, X1, X2, . . . , X|D|.
 Training the belief network means that we must learn the values of the CPT entries.
 Let wijk be a CPT entry for the variable Yi = yij having the parents Ui = uik, where
wijk ≡ P(Yi = yij | Ui = uik).

For example, consider the following belief network

The Conditional Probability Table (CPT) for LungCancer is as follows:

 If wijk is the upper leftmost CPT entry of the above Conditional Probability
Table (CPT) for LungCancer.
 Then Yi is LungCancer; yij is its value, “yes”;
 Ui lists the parent nodes of Yi , namely, {FamilyHistory, Smoker};
 uik lists the values of the parent nodes, namely, {“yes”, “yes”}.
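In terms of the hypothetical CPT structure sketched in the previous section, this entry is a single lookup; the value 0.8 below is an assumed number, not one read from the original table.

```python
# Yi = LungCancer, yij = "yes", Ui = (FamilyHistory, Smoker), uik = ("yes", "yes")
w_ijk = cpt["LC"][("yes", "yes")]["yes"]   # P(LungCancer = yes | FamilyHistory = yes, Smoker = yes)
print(w_ijk)                               # 0.8 under the assumed CPT above
```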

EXAMPLE

Calculate P(J ^ M ^ A ^ ~T ^ ~E).

Since each variable is conditionally independent of its non-descendants given its parents, the joint probability factors over the network:

P(J ^ M ^ A ^ ~T ^ ~E) = P(J/ A) P(M/ A) P(A/ ~T ^ ~E) P(~T) P(~E)


= 0.9 * 0.7 * 0.001 * 0.999 * 0.998
≈ 0.000628

Note: Since P(T) = 0.001, P(~T) = 0.999; similarly, since P(E) = 0.002, P(~E) = 0.998.
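The arithmetic can be checked with a few lines of Python. Reading J, M, A, T, and E in the usual burglar-alarm fashion (JohnCalls, MaryCalls, Alarm, Theft, Earthquake) is an assumption based on the figure in the original; the probability values are the ones quoted above.

```python
# CPT values from the worked example; entries for other parent combinations of
# Alarm are not needed for this particular joint probability.
P_T = 0.001                      # P(T): prior probability of a theft
P_E = 0.002                      # P(E): prior probability of an earthquake
P_A_given_notT_notE = 0.001      # P(A / ~T ^ ~E)
P_J_given_A = 0.9                # P(J / A)
P_M_given_A = 0.7                # P(M / A)

joint = (P_J_given_A * P_M_given_A * P_A_given_notT_notE
         * (1 - P_T) * (1 - P_E))
print(round(joint, 6))           # 0.000628
```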
