Data Mining Practical 8
(VI SEMESTER)
Date:
Student Name:
Student Enrollment No:
EXPERIMENT NO: 8
THEORY:
Imagine you only ever do four things at the weekend: go shopping, watch a movie, play tennis or just stay
in. What you do depends on three things: the weather (windy, rainy or sunny), how much money you
have (rich or poor), and whether your parents are visiting. You say to yourself: if my parents are
visiting, we'll go to the cinema. If they're not visiting and it's sunny, then I'll play tennis, but if it's windy
and I'm rich, then I'll go shopping. If they're not visiting, it's windy and I'm poor, then I will go to the
cinema. If they're not visiting and it's rainy, then I'll stay in.
To remember all this, you draw a flowchart which will enable you to read off your decision. We call such
diagrams decision trees. A suitable decision tree for the weekend decision choices would be as follows:
Figure 1: A decision tree for the weekend decision choices.
We can see why such diagrams are called trees, because, while they are admittedly upside down, they
start from a root and have branches leading to leaves (the tips of the graph at the bottom). Note that the
leaves are always decisions, and a particular decision might be at the end of multiple branches (for
example, we could choose to go to the cinema for two different reasons).
Armed with our decision tree, on Saturday morning, when we wake up, all we need to do is check (a) the
weather, (b) how much money we have, and (c) whether our parents' car is parked in the drive. The
decision tree will then enable us to make our decision. Suppose, for example, that the parents haven't
turned up and the sun is shining. Then this path through our decision tree will tell us what to do:
Figure 2: The path through the decision tree when the parents are not visiting and it is sunny.
and hence we run off to play tennis because our decision tree told us to. Note that the decision tree covers
all eventualities. That is, there are no values that the weather, the parents turning up or the money
situation could take which aren't catered for in the decision tree. Note that, in this practical, we will be
looking at how to automatically generate decision trees from examples, not at how to turn thought
processes into decision trees.
There is a link between decision tree representations and logical representations, which can be exploited
to make it easier to understand (read) learned decision trees. If we think about it, every decision tree is
actually a disjunction of implications (if ... then statements), and the implications are Horn clauses: a
conjunction of literals implying a single literal. In the above tree, we can see this by reading from the root
node to each leaf node:
If the parents are not visiting and it is windy and you're poor, then go to the cinema
Or
If the parents are not visiting and it is rainy, then stay in.
Of course, this is just a re-statement of the original mental decision making process we described.
Remember, however, that we will be programming an agent to learn decision trees from examples, so this
kind of situation will not occur, as we will start with only example situations. It will therefore be important
for us to be able to read the decision tree the agent suggests.
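As a quick illustration, the weekend tree can be written out directly as if ... then rules in code. The sketch below is only illustrative; the attribute names and values are taken from the example above.

# The weekend decision tree hand-written as if/then rules (Python, illustrative only).
# weather is one of "sunny", "windy", "rainy"; money is "rich" or "poor".
def weekend_decision(parents_visiting, weather, money):
    if parents_visiting:
        return "cinema"
    if weather == "sunny":
        return "play tennis"
    if weather == "windy":
        return "shopping" if money == "rich" else "cinema"
    if weather == "rainy":
        return "stay in"
    raise ValueError("unexpected attribute value")

# Saturday morning: the parents have not turned up and the sun is shining.
print(weekend_decision(False, "sunny", "poor"))   # -> play tennis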
ID3 Algorithm
Start from the root node of the decision tree, testing the attribute specified by this node, then moving down
the tree branch according to the attribute's value in the given example. This process is then repeated at the
sub-tree level.
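As a minimal sketch, assuming a learned tree is stored as nested dictionaries (an internal node maps the tested attribute to a dictionary of value-to-subtree branches, and a leaf is simply a class label), the walk looks like this; the function name and the sample tree are illustrative only.

def classify(tree, instance):
    # Walk the tree from the root, following the branch that matches the
    # instance's value for the attribute tested at the current node.
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested at this node
        tree = tree[attribute][instance[attribute]]   # follow the matching branch
    return tree                                       # reached a leaf: the decision

# Illustrative tree; it anticipates the play-baseball example developed later.
example_tree = {"Outlook": {"overcast": "yes",
                            "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
                            "rain": {"Wind": {"weak": "yes", "strong": "no"}}}}
print(classify(example_tree, {"Outlook": "sunny", "Humidity": "normal"}))   # -> yes

ID3 is best suited to problems with the following characteristics: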
1. Instances are represented as attribute-value pairs. For example, the attribute 'Temperature' may take
the values 'hot', 'mild' or 'cool'. Extending attribute-value pairs to continuous-valued (numeric)
data is also of interest.
2. The target function has discrete output values. ID3 easily deals with instances that are assigned
a Boolean decision, such as 'true' and 'false', or 'p (positive)' and 'n (negative)'. Although it is possible
to extend the target to real-valued outputs, that issue is not covered in this experiment.
3. The training data may contain errors. These can be dealt with using pruning techniques, which we will
not cover here.
Three widely used decision tree learning algorithms are ID3, ASSISTANT and C4.5. We will cover ID3
in this experiment.
Decision tree learning is attractive for three reasons (Paul Utgoff & Carla Brodley, 1990):
1. A decision tree is a good generalization for unobserved instances, provided the instances are described
in terms of features that are correlated with the target concept.
2. The methods are computationally efficient, with cost proportional to the number of observed training
instances.
3. The resulting decision tree provides a representation of the concept that appeals to humans, because
it renders the classification process self-evident.
ID3 is a non-incremental algorithm, meaning it derives its classes from a fixed set of training instances. An
incremental algorithm, by contrast, revises the current concept definition, if necessary, as new samples arrive.
The classes created by ID3 are inductive: given a small set of training instances, the specific classes created by
ID3 are expected to work for all future instances. The distribution of the unknowns must be the same as
that of the test cases. Inductive classes cannot be proven to work in every case, since they may have to classify
an infinite number of instances. Note that ID3 (or any inductive algorithm) may misclassify data.
Data Description
The sample data used by ID3 has certain requirements, which are:
Attribute-value description - the same attributes must describe each example and have a fixed
number of values.
Predefined classes - an example's class must already be defined in advance, that is, the classes are not learned
by ID3.
Discrete classes - classes must be sharply delineated. Continuous classes broken up into vague
categories such as a metal being "hard, quite hard, flexible, soft, quite soft" are suspect.
Sufficient examples - since inductive generalization is used (i.e. the result is not provable), there must be
enough examples to distinguish valid patterns from chance occurrences.
Attribute Selection
How does ID3 decide which attribute is the best? A statistical property, called information gain, is
used. Gain measures how well a given attribute separates training examples into targeted classes.
The attribute with the highest information gain (i.e. the one most useful for classification) is
selected. In order to define gain, we first borrow an idea from information theory, called entropy.
Entropy measures the amount of information in an attribute.
Given a sample set S containing examples from c classes:
Entropy(S) = sum over i = 1 to c of -pi * log2(pi)
Where:
pi is the proportion of S belonging to class i.
The sum is over the c classes.
log2 is log base 2.
Note that S is not an attribute but the entire sample set.
Example 1
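For instance, taking a collection S of 14 examples with 9 YES and 5 NO labels (the same split used in Example 2 below), the definition gives:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940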
Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified).
The range of entropy is 0 ("perfectly classified") to 1 ("totally random").
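The same calculation is easy to carry out in code. The helper below is only a sketch; the function name and interface are assumptions, not part of the handout.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over the classes of -p_i * log2(p_i).
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

print(entropy(["yes"] * 14))                         # 0.0  (perfectly classified)
print(entropy(["yes"] * 7 + ["no"] * 7))             # 1.0  (totally random, two classes)
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))   # 0.94 (the 14-example set of Example 2)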
Gain(S, A), the information gain of an example set S on an attribute A, is defined as:
Gain(S, A) = Entropy(S) - sum over each value v of A of (|Sv| / |S|) * Entropy(Sv)
Where:
v ranges over the possible values of attribute A.
Sv is the subset of S for which attribute A has value v.
|Sv| and |S| are the number of elements in Sv and S respectively.
Example 2
Suppose S is a set of 14 examples in which one of the attributes is wind speed. The values of
Wind can be Weak or Strong. Of these 14 examples, 9 are classified YES and 5 are classified NO. For
attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6 occurrences of Wind =
Strong. For Wind = Weak, 6 of the examples are YES and 2 are NO. For Wind = Strong, 3 are
YES and 3 are NO. Therefore:
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy(Sweak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
Entropy(Sstrong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.000
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(Sweak) - (6/14) Entropy(Sstrong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048
For each attribute, the gain is calculated and the highest gain is used in the decision node.
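Continuing the sketch, information gain can be computed on top of the entropy helper above; representing examples as dictionaries is an assumption made only for illustration.

def information_gain(examples, attribute, target="class"):
    # Gain(S, A) = Entropy(S) - sum over each value v of A of (|Sv| / |S|) * Entropy(Sv).
    total_entropy = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return total_entropy - remainder

# The 14 examples of Example 2: 8 Weak (6 YES / 2 NO) and 6 Strong (3 YES / 3 NO).
wind_examples = ([{"Wind": "weak", "class": "yes"}] * 6 +
                 [{"Wind": "weak", "class": "no"}] * 2 +
                 [{"Wind": "strong", "class": "yes"}] * 3 +
                 [{"Wind": "strong", "class": "no"}] * 3)
print(round(information_gain(wind_examples, "Wind"), 3))   # -> 0.048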
Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2
weeks, data is collected to help ID3 build a decision tree (see table 1).
The target classification is "should we play baseball?" which can be yes or no.
The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following
values:
Table 1: Weather data collected over two weeks.
We need to find which attribute will be the root node in our decision tree. The gain is calculated for all
four attributes:
Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node.
Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The
next question is "what attribute should be tested at the Sunny branch node?" Since we've used Outlook
at the root, we only need to decide on the remaining three attributes: Humidity, Temperature, or Wind.
Ssunny = {D1, D2, D8, D9, D11} = 5 examples from table 1 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all
data is classified perfectly or we run out of attributes.
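Putting the pieces together, the loop described above can be sketched as a recursive function that builds the nested-dictionary trees used by classify() earlier. This is an illustrative outline built on the entropy and information_gain helpers, not a complete implementation (it does no pruning and assumes every attribute value seen at a node appears in its examples).

from collections import Counter

def id3(examples, attributes, target="class"):
    # Build a decision tree as nested dictionaries.
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # all data classified perfectly
        return labels[0]
    if not attributes:                   # ran out of attributes: return the majority class
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the highest information gain as this decision node.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

Called on the 14 weather examples with attributes Outlook, Temperature, Humidity and Wind, such a function would select Outlook at the root and Humidity under the Sunny branch, exactly as computed above.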
ID3 has been incorporated in a number of commercial rule-induction packages. Some specific
applications include medical diagnosis, credit risk assessment of loan applications, classification of
equipment malfunctions by their cause, classification of soybean diseases, and web search classification.
EXERCISE:
Consider the customer database described below where an application for a credit card is either approved
or rejected. Construct a decision tree (with Approved as the decision variable) using the entropy measure.
EVALUATION:
Observation & Implementation: 4
Timely completion: 2
Viva: 4
Total: 10