Unit 2
YEAR/SEM: IV / VIII
Regulation: 2017
Part - A
UNIT II CLUSTERING AND CLASSIFICATION
Advanced Analytical Theory and Methods: Overview of Clustering - K-means - Use Cases -
Overview of the Method - Determining the Number of Clusters - Diagnostics - Reasons to
Choose and Cautions - Classification: Decision Trees - Overview of a Decision Tree - The
General Algorithm - Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in
R - Naïve Bayes - Bayes' Theorem - Naïve Bayes Classifier.
1. What is Clustering?
Given a set of data points, we can use a clustering algorithm to classify each
data point into a specific group.
Data points that are in the same group should have similar properties and/or
features, while data points in different groups should have highly dissimilar
properties and/or features.
Clustering is used, for example, to identify groups of automobile insurance policy customers.
5. Define k-means clustering algorithm.
k-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem.
The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.
The first step is to define k centres, one for each cluster. Since different placements produce different results, the better choice is to place them as far away from each other as possible.
The next step is to take each point belonging to the given data set and associate it with the nearest centre.
When no point is pending, the first step is completed and an early grouping is done.
At this point we need to re-calculate k new centroids as the barycentres of the clusters resulting from the previous step.
After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centre.
A loop has thus been generated. As a result of this loop, we may notice that the k centres change their location step by step until no more changes are made, or in other words the centres do not move any more.
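The steps above can be sketched directly in R; the two-dimensional data, the value of k and the iteration cap below are assumptions made purely for illustration:
# From-scratch k-means sketch (illustrative only)
set.seed(42)
x <- matrix(rnorm(100), ncol = 2)                    # 50 hypothetical 2-D points
k <- 3
centers <- x[sample(nrow(x), k), ]                   # initial centres chosen at random
for (iter in 1:20) {
  # associate every point with its nearest centre
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # re-calculate the k centroids as barycentres of the resulting groups
  new_centers <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))
  if (anyNA(new_centers)) break                      # guard: an empty cluster appeared
  if (all(abs(new_centers - centers) < 1e-8)) break  # centres no longer move: stop
  centers <- new_centers
}
table(cluster)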
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions.
7. Describe Tree pruning methods.
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
Tree pruning methods address this problem of overfitting the data.
Two approaches:
Pre-pruning
Post-pruning
The major advantage of using decision trees is that they are intuitively very
easy to explain.
They can be displayed graphically, and they can easily handle qualitative
predictors without the need to create dummy variables.
However, decision trees generally do not have the same level of predictive accuracy as other approaches, since they are not very robust: a small change in the data can cause a large change in the final estimated tree.
As in bagging, you build a number of decision trees on bootstrapped training
samples.
But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
This is the main difference between random forests and bagging, because in bagging the choice is m = p, i.e. all predictors are considered at every split.
Bernoulli Naive Bayes:
This is similar to the multinomial Naive Bayes, but the predictors are Boolean variables. The parameters that we use to predict the class variable take up only the values yes or no, for example whether a word occurs in the text or not.
Gaussian Naive Bayes:
When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a Gaussian distribution.
Multi-class prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
13. Differentiate classification and clustering.
The efficiency of CLARA depends upon the size of the representative data set.
CLARA does not work properly if any of the selected representative data sets does not contain the best k-medoids.
To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works by performing a randomized search on a graph in which each node is a set of k medoids and neighbouring nodes differ in exactly one medoid.
Part – B
1. What is clustering, briefly explain various types of clustering, algorithms and
applications.
Table of Contents
1. Overview
2. Types of Clustering
3. Types of Clustering Algorithms
4. K Means Clustering
5. Hierarchical Clustering
6. Difference between K Means and Hierarchical clustering
7. Applications of Clustering
8. Improving Supervised Learning algorithms with clustering
1. Overview
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the same
group than those in other groups.
In simple words, the aim is to segregate groups with similar traits and assign them into
clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business.
Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not.
But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups.
And this is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
2. Types of Clustering
Hard Clustering: In hard clustering, each data point either belongs to a cluster
completely or not. For example, in the above example each customer is put into one
group out of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the above scenario each customer is assigned a probability of being in any of the 10 clusters of the retail store.
3. Types of Clustering Algorithms
Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty.
Every methodology follows a different set of rules for defining the 'similarity' among data points. In fact, there are more than 100 clustering algorithms known.
But only a few of the algorithms are used popularly; let's look at them in detail:
Connectivity models
As the name suggests, these models are based on the notion that the data
points closer in data space exhibit more similarity to each other than the data
points lying farther away.
These models can follow two approaches. In the first approach, they start with
classifying all data points into separate clusters & then aggregating them as
the distance decreases.
In the second approach, all data points are classified as a single cluster and
then partitioned as the distance increases.
Also, the choice of distance function is subjective. These models are very easy
to interpret but lack scalability for handling big datasets.
Examples of these models are hierarchical clustering algorithm and its
variants.
Centroid models
These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the cluster. The K-means clustering algorithm, discussed in detail below, is a popular algorithm that falls into this category.
Distribution models
These clustering models are based on the notion of how probable is it that all data
points in the cluster belong to the same distribution (For example: Normal,
Gaussian).
These models often suffer from overfitting. A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
Density Models
These models search the data space for areas of varied density of data points in
the data space.
They isolate the various density regions and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
Now I will be taking you through two of the most popular clustering algorithms in detail: K-means clustering and hierarchical clustering. Let's begin.
4. K Means Clustering
K-means is an iterative clustering algorithm that aims to find local optima in each iteration.
1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-
D space.
2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1 shown
using red color and two points in cluster 2 shown using grey colors.
3. Compute cluster centroids: The centroid of data points in the red cluster is shown using
Red Cross and those in grey cluster using grey cross.
4. Re-assign each point to the closest cluster centroid: Note that only the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: Now, re-computing the centroids for both the clusters.
Similarly, we repeat the 4th and 5th steps until convergence: when there is no further switching of data points between the two clusters for two successive iterations, the algorithm terminates (unless another stopping criterion is explicitly specified).
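As a quick hands-on illustration, K-means can be run in R with the built-in kmeans() function; the simulated data and the choice k = 2 below are assumptions for illustration only:
# k-means with the built-in kmeans() function (illustrative data)
set.seed(123)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))
km <- kmeans(pts, centers = 2, nstart = 20)   # 20 random restarts guard against a bad initialization
km$centers                                    # final cluster centroids
table(km$cluster)                             # cluster sizes
plot(pts, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 8, cex = 2)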
5. Hierarchical Clustering
Hierarchical (agglomerative) clustering starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there is only a single cluster left.
The dendrogram can be interpreted as:
At the bottom, we start with 25 data points, each assigned to separate clusters.
Two closest clusters are then merged till we have just one cluster at the top.
The height in the dendrogram at which two clusters are merged represents the
distance between two clusters in the data space.
The decision of the no. of clusters that can best depict different groups can be
chosen by observing the dendrogram.
The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red
horizontal line in the dendrogram below covers maximum vertical distance AB.
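For reference, a dendrogram like the one described can be produced in R with hclust(); the 25 simulated points below merely stand in for the data behind the figure:
# Agglomerative hierarchical clustering and its dendrogram (illustrative data)
set.seed(1)
pts <- matrix(rnorm(50), ncol = 2)           # 25 hypothetical 2-D points
hc  <- hclust(dist(pts), method = "complete")
plot(hc, main = "Dendrogram")                # merge height = distance between the merged clusters
clusters <- cutree(hc, k = 4)                # cut the tree into 4 clusters
table(clusters)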
6. Difference between K Means and Hierarchical clustering
Two important things that you should know about hierarchical clustering are:
Hierarchical clustering can't handle big data well, but K-means clustering can. This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n^2).
In K-means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times might differ, whereas the results are reproducible in hierarchical clustering.
K-means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
7. Applications of Clustering
Clustering has a large no. of applications spread across various domains. Some of the most
popular applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
8. Improving Supervised Learning algorithms with clustering
Let's check out the impact of clustering on the accuracy of our model for a classification problem that uses 3000 observations with 100 predictors of stock data to predict whether the stock will go up or down, using R, as sketched below.
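As a sketch of the general recipe (the stock_data frame, its 100 numeric predictors and the binary Direction column below are hypothetical placeholders):
# Sketch: use a k-means cluster label as an extra feature for a classifier
X <- scale(stock_data[, setdiff(names(stock_data), "Direction")])  # the 100 numeric predictors
km <- kmeans(X, centers = 5, nstart = 20)                          # centers = 5 is an arbitrary choice
stock_data$cluster <- factor(km$cluster)                           # new categorical feature
fit <- glm(Direction ~ ., data = stock_data, family = binomial)    # any classifier could be used here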
2. Briefly explain K-means Clustering algorithm with example.
K-means clustering is one of the simplest algorithms that uses the unsupervised learning method to solve well-known clustering issues. K-means clustering requires the following two inputs:
1. k = number of clusters
2. Training set (m) = {x1, x2, x3, ..., xm}
Let’s say you have an unlabeled data set like the one shown below and you want to group this
data into clusters.
Now, the important question is how should you choose the optimum number of clusters? There
are two possible ways for choosing the number of clusters.
In the first approach, known as the elbow method, you draw a curve between the WSS (within-cluster sum of squares) and the number of clusters.
It is called the elbow method because the curve looks like a human arm and the elbow point gives us the optimum number of clusters.
As you can see, after the elbow point there is a very slow change in the value of WSS, so you should take the elbow point value as the final number of clusters.
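One common way to draw this curve in R is to compute the total within-cluster sum of squares for a range of k values; the simulated data below is for illustration only:
# Elbow method: total within-cluster sum of squares (WSS) versus k
set.seed(7)
pts <- matrix(rnorm(200), ncol = 2)
wss <- sapply(1:10, function(k) kmeans(pts, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "WSS")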
In the second approach, you run the k-means clustering algorithm to get different clusters based on a variety of purposes.
You can partition the data on different metrics and see how well it performs for that particular case. Let's take an example of marketing T-shirts of different sizes.
You can partition the dataset into a different number of clusters depending upon the purpose that you want to meet.
In the following example, I have taken two different criteria, price and comfort.
1. K=3: If you want to provide only 3 sizes(S, M, L) so that prices are cheaper, you will
divide the data set into 3 clusters.
2. K=5: Now, if you want to provide more comfort and variety to your customers with more
sizes (XS, S, M, L, XL), then you will divide the data set into 5 clusters.
Now, once we have the value of k with us, let’s understand its execution.
Initialization
Firstly, you need to randomly initialize two points called the cluster centroids.
Here, you need to make sure that the number of cluster centroids, depicted by the orange and blue crosses in the image, is less than the number of training data points, depicted by the navy blue dots.
The k-means clustering algorithm is an iterative algorithm and it follows the next two steps iteratively.
Once you are done with the initialization, let’s move on to the next step.
Cluster Assignment
In this step, it will go through all the navy blue data points to compute the distance
between the data points and the cluster centroid initialized in the previous step.
Now, depending upon whether the minimum distance is from the orange cluster centroid or the blue cluster centroid, each data point will group itself into that particular cluster.
So, data points are divided into two groups, one represented by orange color and the other
one in blue color as shown in the graph.
Since these cluster formations are not the optimized clusters, let's move ahead and see how to get the final clusters.
Move Centroid
Now, you will take the above two cluster centroids and iteratively reposition them
for optimization.
You will take all the blue dots, compute their average and move the current cluster centroid to this new location.
Similarly, you will move orange cluster centroid to the average of orange data
points.
Therefore, the new cluster centroids will look as shown in the graph.
Moving forward, let’s see how we can optimize clusters which will give us better
insight.
Optimization
You need to repeat above two steps iteratively till the cluster centroids stop
changing their positions and become static.
Once the clusters become static, the k-means clustering algorithm is said to have converged.
Convergence
Finally, k-means clustering algorithm converges and divides the data points into
two clusters clearly visible in orange and blue.
It can happen that k-means may end up converging with different solutions
depending on how the clusters were initialized.
As you can see in the graph below, the three clusters are clearly visible but you
might end up having different clusters depending upon your choice of cluster
centroids.
Below shown are some other possibilities of cluster partitioning based on the different
choice of cluster centroids.
You may end up having any of these groupings based on your requirements and the goal
that you are trying to achieve.
Now that you have understood the concepts of clustering, let's do some hands-on work in R.
Let's say you have a movie dataset with 28 different attributes, ranging from director Facebook likes, movie likes and actor likes to budget and gross, and you want to find out the movies with maximum popularity amongst the viewers.
You can achieve this by k-means clustering and divide the entire data into different
clusters and do further analysis based on the popularity.
For this, I have taken a movie dataset of 5000 values with 28 attributes.
Step 1. Load and view the dataset:
View(movie_metadata)
Step 2. As you can see, there are many NA values in this data, so I will clean the dataset and remove all the null values from it.
Step 3. In this example, I have taken the first 500 values from the data set for analysis.
Step 4. Further, with the R code below, you can take two attributes budget and gross from the
dataset to make clusters.
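A sketch of what these steps might look like, assuming the movie_metadata data frame loaded above:
# Step 2: remove rows containing NA values
movie <- na.omit(movie_metadata)
# Step 3: keep the first 500 rows for the analysis
movie <- movie[1:500, ]
# Step 4: cluster on the budget and gross attributes (centers = 5 is an arbitrary illustrative choice)
fit <- kmeans(scale(movie[, c("budget", "gross")]), centers = 5, nstart = 20)
movie$cluster <- fit$cluster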
3. What is a Decision Tree? How to Create a Perfect Decision Tree?
A decision tree typically contains three types of nodes:
Chance nodes
Decision nodes
End nodes
A chance node, represented by a circle, shows the probabilities of certain results.
A decision node, represented by a square, shows a decision to be made, and an end node
shows the final outcome of a decision path.
Advantages
Decision trees are intuitively easy to explain and closely mirror human decision-making. They can be displayed graphically and can handle qualitative predictors without the need to create dummy variables.
Disadvantages
Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
Decision trees can be computationally expensive to train.
The process of growing a decision tree is computationally expensive. At each node, each
candidate splitting field must be sorted before its best split can be found.
In some algorithms, combinations of fields are used and a search must be made for
optimal combining weights.
Pruning algorithms can also be expensive since many candidate sub-trees must be formed
and compared.
Suppose astronomers have discovered a new planet. Now the question is whether it could be 'the next Earth'. The answer to this question will revolutionize the way people live. Well, literally!
There are a number of deciding factors which need to be thoroughly researched to take an intelligent decision.
These factors can be whether water is present on the planet, what the temperature is, whether the surface is prone to continuous storms, whether flora and fauna survive the climate, etc.
Let us create a decision tree to find out whether we have discovered a new habitat.
The habitable temperature falls into the range 0 to 100 degrees Celsius (273 to 373 K).
Whether flora and fauna flourish?
Classification Rules:
Classification rules are the cases in which all the scenarios are taken into
consideration and a class variable is assigned to each.
Class Variable:
Each leaf node is assigned a class-variable. A class-variable is the final output
which leads to our decision.
Let us derive the classification rules from the Decision Tree created (a small code sketch of these rules follows the list):
1. If Temperature is not between 273 and 373 K -> Survival Difficult
2. If Temperature is between 273 and 373 K, and water is not present -> Survival Difficult
3. If Temperature is between 273 and 373 K, water is present, and flora and fauna are not present -> Survival Difficult
4. If Temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy surface is not present -> Survival Probable
5. If Temperature is between 273 and 373 K, water is present, flora and fauna are present, and a stormy surface is present -> Survival Difficult
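These rules can also be written out directly as a small R function; the argument names and logical encodings below are purely illustrative:
# Classification rules from the habitability decision tree (illustrative)
classify_planet <- function(temp_k, water, flora_fauna, stormy) {
  if (temp_k < 273 || temp_k > 373) return("Survival Difficult")  # rule 1
  if (!water)                       return("Survival Difficult")  # rule 2
  if (!flora_fauna)                 return("Survival Difficult")  # rule 3
  if (!stormy)                      return("Survival Probable")   # rule 4
  "Survival Difficult"                                            # rule 5
}
classify_planet(temp_k = 300, water = TRUE, flora_fauna = TRUE, stormy = FALSE)  # "Survival Probable"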
Decision Tree
Root Node: The factor of ‘temperature’ is considered as the root in this case.
Internal Node: The nodes with one incoming edge and 2 or more outgoing edges.
Leaf Node: This is the terminal node with no out-going edge.
As the decision tree is now constructed, starting from the root-node we check the test
condition and assign the control to one of the outgoing edges, and so the condition is
again tested and a node is assigned.
The decision tree is said to be complete when all the test conditions lead to a leaf node.
The leaf node contains the class-labels, which vote in favor or against the decision.
Let's imagine you are playing a game of Twenty Questions.
Your opponent has secretly chosen a subject, and you must figure out what he/she chose.
At each turn, you may ask a yes-or-no question, and your opponent must answer
truthfully.
How do you find out the secret in the fewest number of questions?
Let's identify important terminologies on Decision Tree, looking at the image above:
Root Node represents the entire population or sample. It further gets divided into two or
more homogeneous sets.
Splitting is a process of dividing a node into two or more sub-nodes.
When a sub-node splits into further sub-nodes, it is called a Decision Node.
Nodes that do not split are called Terminal Nodes or Leaves.
When you remove sub-nodes of a decision node, this process is called Pruning. The
opposite of pruning is Splitting.
A sub-section of an entire tree is called Branch.
A node, which is divided into sub-nodes is called a parent node of the sub-nodes;
whereas the sub-nodes are called the child of the parent node.
4. Briefly explain the various types of decision trees.
Regression Trees
Let's take a look at the image below, which helps visualize the nature of
partitioning carried out by a Regression Tree.
This shows an unpruned tree and a regression tree fit to a random dataset. Both
the visualizations show a series of splitting rules, starting at the top of the tree.
Notice that every split of the domain is aligned with one of the feature axes. The
concept of axis parallel splitting generalizes straightforwardly to dimensions
greater than two.
For a feature space of size p, a subset of R^p, the space is divided into M regions, R_m, each of which is a p-dimensional "hyperblock".
In order to build a regression tree, you first use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
Recursive Binary Splitting is a greedy and top-down algorithm used to minimize
the Residual Sum of Squares (RSS), an error measure also used in linear
regression settings.
The RSS, in the case of a partitioned feature space with M partitions is given by:
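In its standard form, the quantity minimized is
$\mathrm{RSS} = \sum_{m=1}^{M} \sum_{i \in R_m} (y_i - \hat{y}_{R_m})^2$,
where $\hat{y}_{R_m}$ is the mean response of the training observations falling in region $R_m$.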
Beginning at the top of the tree, you split it into 2 branches, creating a partition of 2
spaces.
You then carry out this particular split at the top of the tree multiple times and choose the
split of the features that minimizes the (current) RSS.
Next, you apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
The basic idea here is to introduce an additional tuning parameter, denoted by α, that balances the depth of the tree and its goodness of fit to the training data.
You can use K-fold cross-validation to choose α. This technique simply involves dividing the training observations into K folds to estimate the test error rate of the subtrees.
Your goal is to select the one that leads to the lowest error rate.
Classification Trees
A classification tree is very similar to a regression tree, except that it is used to predict a
qualitative response rather than a quantitative one.
Recall that for a regression tree, the predicted response for an observation is given by the
mean response of the training observations that belong to the same terminal node.
In contrast, for a classification tree, you predict that each observation belongs to the most
commonly occurring class of training observations in the region to which it belongs.
In interpreting the results of a classification tree, you are often interested not only in the
class prediction corresponding to a particular terminal node region, but also in the class
proportions among the training observations that fall into that region.
The task of growing a classification tree is quite similar to the task of growing a
regression tree.
Just as in the regression setting, you use recursive binary splitting to grow a classification
tree.
However, in the classification setting, Residual Sum of Squares cannot be used as a
criterion for making the binary splits.
Classification Error Rate: Rather than seeing how far a numerical response is from the mean value, as in the regression setting, you can instead define the classification error rate as the fraction of training observations in a particular region that do not belong to the most widely occurring class.
The error is given by this equation:
$E = 1 - \max_c(\hat{\pi}_{mc})$,
in which $\hat{\pi}_{mc}$ represents the fraction of training data in region $R_m$ that belong to class c.
Gini Index:
The Gini Index is an alternative error metric that is designed to show how "pure" a region
is. "Purity" in this case means how much of the training data in a particular region
belongs to a single class.
If a region Rm contains data that is mostly from a single class c then the Gini Index value
will be small:
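The Gini index for region $R_m$ is usually written as
$G_m = \sum_{c=1}^{C} \hat{\pi}_{mc}(1 - \hat{\pi}_{mc})$,
which is small when each $\hat{\pi}_{mc}$ is close to 0 or 1, i.e. when the region is nearly pure.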
Cross-Entropy:
A third alternative, which is similar to the Gini Index, is known as the Cross-Entropy or
Deviance:
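In its usual form the cross-entropy (deviance) for region $R_m$ is
$D_m = -\sum_{c=1}^{C} \hat{\pi}_{mc} \log \hat{\pi}_{mc}$.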
The cross-entropy will take on a value near zero if the $\hat{\pi}_{mc}$ values are all near 0 or near 1.
Therefore, like the Gini index, the cross-entropy will take on a small value if the mth
node is pure.
In fact, it turns out that the Gini index and the cross-entropy are quite similar
numerically.
When building a classification tree, either the Gini index or the cross-entropy are
typically used to evaluate the quality of a particular split, since they are more sensitive to
node purity than is the classification error rate.
Any of these 3 approaches might be used when pruning the tree, but the classification
error rate is preferable if prediction accuracy of the final pruned tree is the goal.
The major advantage of using decision trees is that they are intuitively very
easy to explain.
They closely mirror human decision-making compared to other regression and
classification approaches.
They can be displayed graphically, and they can easily handle qualitative
predictors without the need to create dummy variables.
However, decision trees generally do not have the same level of predictive accuracy as other approaches, since they are not very robust: a small change in the data can cause a large change in the final estimated tree.
By aggregating many decision trees, using methods like bagging, random
forests, and boosting, the predictive performance of decision trees can be
substantially improved.
Tree-Based Methods
Bagging
The decision trees discussed above suffer from high variance, meaning if you split the
training data into 2 parts at random, and fit a decision tree to both halves, the results that
you get could be quite different.
In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct datasets.
Bagging, or bootstrap aggregation, is a technique used to reduce the variance of your
predictions by combining the result of multiple classifiers modeled on different sub-
samples of the same dataset. Here is the equation for bagging:
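The equation referred to here is the standard bagged estimate, the average of the B bootstrap fits:
$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$,
where $\hat{f}^{*b}$ is the model fitted to the b-th bootstrapped training set.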
The visual below shows the 3 different steps in bagging:
Step 1: Here you replace the original data with new data. The new data usually have a
fraction of the original data's columns and rows, which then can be used as hyper-
parameters in the bagging model.
Step 2: You build classifiers on each dataset. Generally, you can use the same classifier
for making models and predictions.
Step 3: Lastly, you use an average value to combine the predictions of all the classifiers,
depending on the problem. Generally, these combined values are more robust than a
single model.
While bagging can improve predictions for many regression and classification methods, it
is particularly useful for decision trees.
To apply bagging to regression/classification trees, you simply construct B regression/classification trees using B bootstrapped training sets, and average the resulting predictions.
These trees are grown deep and are not pruned. Hence each individual tree has high variance but low bias. Averaging these B trees reduces the variance.
Broadly speaking, bagging has been demonstrated to give impressive improvements in
accuracy by combining together hundreds or even thousands of trees into a single
procedure.
Random Forests
If there are p input variables, specify a number m < p such that at each node, you can select m random variables out of the p. The best split on these m variables is used to split the node.
Each tree is subsequently grown to the largest extent possible and no pruning is needed.
Finally, aggregate the predictions of all the trees to predict new data.
Random Forests is very effective at estimating missing data and maintaining accuracy
when a large proportion of the data is missing.
It can also balance errors in datasets where the classes are imbalanced. Most
importantly, it can handle massive datasets with large dimensionality.
However, one disadvantage of using random forests is that you might easily overfit noisy datasets, especially in the case of doing regression.
Boosting
Boosting is another approach to improve the predictions resulting from a decision tree.
Like bagging and random forests, it is a general approach that can be applied to many
statistical learning methods for regression or classification.
Recall that bagging involves creating multiple copies of the original training dataset using
the bootstrap, fitting a separate decision tree to each copy, and then combining all of the
trees in order to create a single predictive model.
Notably, each tree is built on a bootstrapped dataset, independent of the other trees.
Boosting works in a similar way, except that the trees are grown sequentially: each tree is
grown using information from previously grown trees.
Boosting does not involve bootstrap sampling; instead, each tree is fitted on a modified
version of the original dataset.
For regression and classification trees, boosting works like this:
Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.
Given the current model, you fit a decision tree to the residuals from the model. That is, you fit a tree using the current residuals, rather than the outcome Y, as the response.
You then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, you slowly improve the fitted function $\hat{f}$ in areas where it does not perform well.
The shrinkage parameter ν slows the process down even further, allowing more and differently shaped trees to attack the residuals.
Boosting is very useful when you have a lot of data and you expect the decision trees to
be very complex.
Boosting has been used to solve many challenging classification and regression problems,
including risk analysis, sentiment analysis, predictive advertising, price modeling, sales
estimation and patient diagnosis, among others.
5. Illustrate the concepts of Decision Trees in R.
Classification Trees
For this part, you work with the Carseats dataset using the tree package in R.
Mind that you need to install the ISLR and tree packages in your R Studio environment first. Let's first load the Carseats data frame from the ISLR package.
library (ISLR)
data (package="ISLR")
carseats<-Carseats
require(tree)
The Carseats dataset is a data frame with 400 observations on the following 11 variables:
Sales: unit sales in thousands
CompPrice: price charged by the competitor at each location
Income: community income level in 1000s of dollars
Advertising: local ad budget at each location in 1000s of dollars
Population: regional population in thousands
Price: price for car seats at each site
ShelveLoc: Bad, Good or Medium indicates the quality of the shelving location
Age: age level of the population
Education: ed level at location
Urban: Yes/No
US: Yes/No
names(carseats)
hist(carseats$Sales)
Observe that Sales is a quantitative variable.
You want to demonstrate it using trees with a binary response.
To do so, you turn Sales into a binary variable, which will be called high.
If the sales are less than 8, it will be not high. Otherwise, it will be high.
Then you can put that new variable High back into the data frame.
Now let's fit a model using decision trees. Of course, you can't have the Sales variable here, because your response variable High was created from Sales. Thus, let's exclude it and fit the tree, as sketched below.
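A sketch of these steps, following the usual ISLR convention of treating sales of 8 or less as "not high" and coding High as a factor:
High = as.factor(ifelse(carseats$Sales <= 8, "No", "Yes"))   # binary response
carseats = data.frame(carseats, High)                        # put it back into the data frame
tree.carseats = tree(High ~ . - Sales, data = carseats)      # fit the tree, excluding Sales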
summary(tree.carseats)
You can see the variables involved, the number of terminal nodes, the residual mean
deviance, as well as the misclassification error rate. To make it more visual, let's plot the
tree as well, then annotate it using the handy text function:
plot(tree.carseats)
text(tree.carseats, pretty = 0)
There are so many variables, making it very complicated to look at the tree. At least, you
can see that at each of the terminal nodes, they're labeled Yes or No. At each splitting
node, the variables and the value of the splitting choice are shown (for example, Price <
92.5 or Advertising < 13.5).
For a detailed summary of the tree, simply print it. It'll be handy if you want to extract details from the tree for other purposes:
tree.carseats
It's time to prune the tree down. Let's create a training set and a test by splitting
the carseats dataframe into 250 training and 150 test samples.
First, you set a seed to make the results reproducible. Then you take a random sample of
the ID (index) numbers of the samples.
Specifically here, you sample from the set 1 to the number of rows of carseats, which is 400. You want a sample of size 250 (by default, sample draws without replacement).
set.seed(101)
train=sample(1:nrow(carseats), 250)
So now you get this index of train, which indexes 250 of the 400 observations.
You can refit the model with tree, using the same formula but telling tree to use the subset equal to train.
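A sketch of that refit, reusing the formula from above:
tree.carseats = tree(High ~ . - Sales, data = carseats, subset = train)
Then let's make a plot: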
plot(tree.carseats)
text(tree.carseats, pretty=0)
The plot looks a bit different because of the slightly different dataset. Nevertheless, the
complexity of the tree looks roughly the same.
Now you're going to take this tree and use it to predict on the test set, using the predict method for trees. Here you'll want to actually predict the class labels, as sketched below.
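A sketch of the prediction and the confusion table described next:
tree.pred = predict(tree.carseats, carseats[-train, ], type = "class")
with(carseats[-train, ], table(tree.pred, High))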
On the diagonals are the correct classifications, while off the diagonals are the incorrect ones. You only want to record the correct ones.
To do that, you can take the sum of the 2 diagonals divided by the total (150 test
observations).
(72 + 43) / 150
When growing a big bushy tree, it could have too much variance.
Thus, let's use cross-validation to prune the tree optimally. Using cv.tree, you'll use the
misclassification error as the basis for doing the pruning.
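A sketch of that call, using the tree package's prune.misclass option:
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)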
cv.carseats
Printing out the results shows the details of the path of the cross-validation.
You can see the sizes of the trees as they were pruned back, the deviances as the pruning
proceeded, as well as the cost complexity parameter used in the process.
plot(cv.carseats)
Looking at the plot, you see a downward spiral part because of the misclassification error
on 250 cross-validated points. So let's pick a value in the downward steps (12).
Then, let's prune the tree to a size of 12 to identify that tree. Finally, let's plot and
annotate that tree to see the outcome.
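A sketch of the pruning call described above (the plotting calls then follow):
prune.carseats = prune.misclass(tree.carseats, best = 12)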
plot(prune.carseats)
text(prune.carseats, pretty=0)
It's a bit shallower than previous trees, and you can actually read the labels. Let's evaluate
it on the test dataset again.
tree.pred = predict(prune.carseats, carseats[-train,], type="class")
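Tabulating the pruned tree's predictions against the true labels, as before:
with(carseats[-train, ], table(tree.pred, High))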
Seems like the correct classifications dropped a little bit. It has done about the same as
your original tree, so pruning did not hurt much with respect to misclassification errors,
and gave a simpler tree.
Often, trees don't give very good prediction errors, so let's go ahead and take a look at random forests and boosting, which tend to outperform trees as far as prediction and misclassification are concerned.
Random Forests
For this part, you will use the Boston housing data to explore random forests and
boosting. The dataset is located in the MASS package.
It gives housing values and other statistics in each of 506 suburbs of Boston based on a
1970 census.
library(MASS)
data(package="MASS")
boston<-Boston
dim(boston)
names(boston)
require(randomForest)
To prepare data for random forest, let's set the seed and create a sample training set of
300 observations.
set.seed(101)
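The sampling step itself might look like:
train = sample(1:nrow(boston), 300)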
In this dataset, there are 506 suburbs of Boston. For each suburb, you have variables such as crime per capita, types of industry, average number of rooms per dwelling, average proportion of the age of the houses, etc.
Let's use medv, the median value of owner-occupied homes, for each of these suburbs as the response variable.
Let's fit a random forest and see how well it performs. As mentioned, you use the response medv, the median housing value (in $1K dollars), and the training sample set; a sketch of the call follows.
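Assuming the train index created above:
rf.boston = randomForest(medv ~ ., data = boston, subset = train)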
rf.boston
Printing out the random forest gives its summary: the # of trees (500 were grown), the
mean squared residuals (MSR), and the percentage of variance explained.
The MSR and % variance explained are based on the out-of-bag estimates, a very clever
device in random forests to get honest error estimates.
The only tuning parameter in random forests is the argument called mtry, which is the number of variables that are selected as split candidates at each split of each tree.
As seen here, mtry is 4 of the 13 explanatory variables (excluding medv) in the Boston housing data, meaning that each time the tree comes to split a node, 4 variables are selected at random, and the split is confined to 1 of those 4 variables. That's how randomForest de-correlates the trees.
You're going to fit a series of random forests. There are 13 variables, so let's
have mtry range from 1 to 13:
In order to record the errors, you set up 2 variables oob.err and test.err.
In a loop of mtry from 1 to 13, you first fit the randomForest with that value of mtry on
the train dataset, restricting the number of trees to be 350.
Then you extract the mean-squared-error on the object (the out-of-bag error).
Then you predict on the test dataset (boston[-train]) using fit (the fit of randomForest).
Lastly, you compute the test error: the mean squared error, which equals mean((medv - pred)^2).
oob.err = double(13)
test.err = double(13)
for(mtry in 1:13){
  fit = randomForest(medv ~ ., data = boston, subset = train, mtry = mtry, ntree = 350)
  oob.err[mtry] = fit$mse[350]
  pred = predict(fit, boston[-train, ])
  test.err[mtry] = with(boston[-train, ], mean((medv - pred)^2))
}
Basically you just grew 4550 trees (13 times 350). Now let's make a plot using
the matplot command.
The test error and the out-of-bag error are bound together to make a 2-column matrix.
There are a few other arguments in the matrix, including the plotting character values
(pch = 23 means filled diamond), colors (red and blue), type equals both (plotting both
points and connecting them with the lines), and name of y-axis (Mean Squared Error).
You can also put a legend at the top right corner of the plot.
matplot(1:mtry, cbind(test.err, oob.err), pch = 23, col = c("red", "blue"), type = "b", ylab="Mean
Squared Error")
Ideally, these 2 curves should line up, but it seems like the test error is a bit lower.
However, there's a lot of variability in these test error estimates.
Since the out-of-bag error estimate was computed on one dataset and the test error
estimate was computed on another dataset, these differences are pretty much well within
the standard errors.
Notice that the red curve is smoothly above the blue curve? These error estimates are
very correlated, because the random Forest with mtry = 4 is very similar to the one
with mtry = 5.
That's why each of the curves is quite smooth. What you see is that mtry around 4 seems
to be the most optimal choice, at least for the test error. This value of mtry for the out-of-
bag error equals 9.
So with very little tuning, you have fitted a very powerful prediction model using random forests. How so? The left-hand side of the plot shows the performance of a single tree.
The mean squared error on out-of-bag is 26, and you've dropped down to about 15 (just a
bit above half). This means you reduced the error by half.
Likewise for the test error, you reduced the error from 20 to 12.
Boosting
Compared to random forests, boosting grows smaller and stubbier trees and goes at the
bias. You will use the package GBM (Gradient Boosted Modeling), in R.
require(gbm)
GBM asks for the distribution, which is Gaussian, because you'll be doing squared error
loss. You're going to ask GBM for 10,000 trees, which sounds like a lot, but these are
going to be shallow trees.
Interaction depth is the number of splits, so you want 4 splits in each tree. Shrinkage is
0.01, which is how much you're going to shrink the tree step back.
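Putting those choices together, the fit might look like this:
boost.boston = gbm(medv ~ ., data = boston[train, ], distribution = "gaussian",
                   n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)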
summary(boost.boston)
The summary function gives a variable importance plot. It seems like there are 2
variables that have high relative importance: rm (number of rooms) and lstat (percentage
of lower economic status people in the community). Let's plot these 2 variables:
plot(boost.boston,i="lstat")
plot(boost.boston,i="rm")
The 1st plot shows that the higher the proportion of lower status people in the suburb, the
lower the value of the housing prices. The 2nd plot shows the reversed relationship with
the number of rooms: the average number of rooms in the house increases as the price
increases.
It's time to predict a boosted model on the test dataset. Let's look at the test performance
as a function of the number of trees:
First, you make a grid of number of trees in steps of 100 from 100 to 10,000.
Then, you run the predict function on the boosted model. It takes n.trees as an argument,
and produces a matrix of predictions on the test data.
The dimensions of the matrix are 206 test observations and 100 different predict vectors
at the 100 different values of tree.
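A sketch of those two steps:
n.trees = seq(from = 100, to = 10000, by = 100)
predmat = predict(boost.boston, newdata = boston[-train, ], n.trees = n.trees)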
dim(predmat)
It's time to compute the test error for each of the predict vectors:
predmat is a matrix, medv is a vector, thus (predmat - medv) is a matrix of differences.
You can use the apply function to the columns of these square differences (the mean).
That would compute the column-wise mean squared error for the predict vectors.
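That computation might look like:
boost.err = with(boston[-train, ], apply((predmat - medv)^2, 2, mean))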
Then you make a plot using similar parameters to that one used for Random Forest. It
would show a boosting error plot.
plot(n.trees, boost.err, pch = 23, ylab = "Mean Squared Error", xlab = "# Trees", main =
"Boosting Test Error")
The boosting error pretty much drops down as the number of trees increases.
This is evidence showing that boosting is reluctant to overfit.
Let's also include the best test error from the random Forest into the plot.
Boosting actually gets a reasonable amount below the test error for random Forest.
6. What is a classifier in big data analytics? Explain in detail.
A classifier is a machine learning model that is used to discriminate different objects based
on certain features.
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on the Bayes theorem.
Bayes Theorem:
Using Bayes theorem, we can find the probability of A happening, given that B has
occurred. Here, B is the evidence and A is the hypothesis.
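In symbols, the theorem reads
$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$,
where $P(A \mid B)$ is the posterior probability of the hypothesis A given the evidence B.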
The assumption made here is that the predictors/features are independent. That is
presence of one particular feature does not affect the other. Hence it is called naive.
Example:
Let us take an example to get some better intuition. Consider the problem of playing golf.
The dataset is represented as below.
We classify whether the day is suitable for playing golf, given the features of the day.
The columns represent these features and the rows represent individual entries.
If we take the first row of the dataset, we can observe that it is not suitable for playing golf if the outlook is rainy, the temperature is hot, the humidity is high and it is not windy.
We make two assumptions here. First, as stated above, we consider that these predictors are independent. That is, if the temperature is hot, it does not necessarily mean that the humidity is high.
Another assumption made here is that all the predictors have an equal effect on the
outcome. That is, the day being windy does not have more importance in deciding to play
golf or not.
The variable y is the class variable (play golf), which represents if it is suitable to play golf
or not given the conditions. Variable X represents the parameters/features.
X is given as X = (x_1, x_2, ..., x_n).
Here x_1, x_2, ..., x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule, we get:
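Under the independence assumption the expansion takes the familiar form
$P(y \mid x_1, \ldots, x_n) = \dfrac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1)\, P(x_2) \cdots P(x_n)}$.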
Now, you can obtain the values for each term by looking at the dataset and substituting them into the equation.
For all entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be removed and proportionality can be introduced.
In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the classification is multi-class.
Therefore, we need to find the class y with the maximum probability.
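This maximization is usually written as
$y = \arg\max_{y} \, P(y) \prod_{i=1}^{n} P(x_i \mid y)$.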
Using the above function, we can obtain the class, given the predictors.
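In R this can be tried out with the naiveBayes() function from the e1071 package; the tiny play-golf style data frame below is made up purely for illustration:
# Naive Bayes on a toy play-golf data set (illustrative)
library(e1071)
golf <- data.frame(
  outlook  = c("rainy", "rainy", "overcast", "sunny", "sunny", "sunny"),
  temp     = c("hot", "hot", "hot", "mild", "cool", "cool"),
  humidity = c("high", "high", "high", "high", "normal", "normal"),
  windy    = c("no", "yes", "no", "no", "no", "yes"),
  play     = c("no", "no", "yes", "yes", "yes", "no"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(play ~ ., data = golf)                 # class-conditional tables and priors
predict(model, data.frame(outlook = "sunny", temp = "cool",
                          humidity = "high", windy = "yes"))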
Multinomial Naive Bayes:
This is mostly used for the document classification problem, i.e. whether a document belongs to the category of sports, politics, technology, etc.
The features/predictors used by the classifier are the frequencies of the words present in the document.
Bernoulli Naive Bayes:
This is similar to the multinomial Naive Bayes, but the predictors are Boolean variables. The parameters that we use to predict the class variable take up only the values yes or no, for example whether a word occurs in the text or not.
Gaussian Naive Bayes:
When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a Gaussian distribution.
Gaussian Distribution (Normal Distribution)
Since the way the values are present in the dataset changes, the formula for the conditional probability changes to:
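The usual Gaussian form of the class-conditional likelihood is
$P(x_i \mid y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$,
where $\mu_y$ and $\sigma_y^2$ are the mean and variance of the feature computed from the training examples of class y.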