Decision Tree
exploring and preparing the data
Data preparation – creating random training and test datasets
• Usually, when data has been sorted in a random order, we can simply divide the dataset into two portions by taking the first 90 percent of records for training and the remaining 10 percent for testing.
• In contrast, the credit dataset is not randomly ordered, making the prior approach
unwise.
• Suppose that the bank had sorted the data by the loan amount, with the largest
loans at the end of the file.
• If we used the first 90 percent for training and the remaining 10 percent for testing,
we would be training a model on only the small loans and testing the model on the
big loans. Obviously, this could be problematic.
• We'll solve this problem by using a random sample of the credit data for training.
• A random sample is simply a process that selects a subset of records at random.
• In R, the sample() function is used to perform random sampling.
• However, before putting it into action, a common practice is to set a seed value, which causes the randomization process to follow a sequence that can be replicated later if desired, as sketched below.
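A minimal sketch of this sampling step, assuming the data frame is named credit, contains 1,000 rows, and that 100 records (10 percent) are held out for testing; the seed value 123 is arbitrary:

# make the randomization reproducible
set.seed(123)

# draw 900 row indices at random from the 1,000 records
train_sample <- sample(1000, 900)

# split into training and test sets
credit_train <- credit[train_sample, ]
credit_test  <- credit[-train_sample, ]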
training a model on the data
• We will use the C5.0 algorithm in the
C50 package to train our decision tree
model.
• For the first iteration of our credit
approval model, we'll use the default
C5.0 configuration, as shown in the
following code.
• The 17th column in credit_train is the class variable, default, so we need to exclude it from the training data frame but supply it as the target factor vector for classification.
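A sketch of the training call, assuming the class variable is the factor default stored in the 17th column of credit_train (both names are taken from the slides above):

library(C50)

# train the tree on all predictors, excluding column 17 (the class variable),
# and supply that column separately as the target factor vector
credit_model <- C5.0(credit_train[-17], credit_train$default)

# display the tree's decisions
summary(credit_model)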
The first few decisions in the resulting tree can be read in plain language as:
1. If the checking account balance is unknown or greater than 200 DM, then classify as "not likely to default."
2. Otherwise, if the checking account balance is less than zero DM or between one and 200 DM,
3. and the credit history is perfect or very good, then classify as "likely to default."
evaluating model performance
• credit_pred <- predict(credit_model, credit_test)
• This creates a vector of predicted class values, which we can compare
to the actual class values using the CrossTable() function in the
gmodels package.
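A sketch of this comparison, assuming the actual class values are stored in credit_test$default:

library(gmodels)

# cross-tabulate actual versus predicted values;
# the extra proportions are suppressed to keep the table readable
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c("actual default", "predicted default"))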
Results
Out of the 100 test loan application
records, our model correctly predicted that
59 did not default and 14 did default,
resulting in an accuracy of 73 percent and
an error rate of 27 percent.
Also note that the model only correctly
predicted 14 of the 33 actual loan defaults
in the test data, or 42 percent.
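The same figures can be recovered directly from the prediction vector; this is a sketch, assuming credit_test$default holds the actual outcomes and the positive class is labelled "yes":

# overall accuracy and error rate: (59 + 14) / 100 = 0.73, so error = 0.27
accuracy <- mean(credit_pred == credit_test$default)
error_rate <- 1 - accuracy

# of the 33 actual defaults, 14 were correctly predicted: 14 / 33, or about 42 percent
defaults_caught <- sum(credit_pred == "yes" & credit_test$default == "yes")
actual_defaults <- sum(credit_test$default == "yes")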