Machine Learning
Samatrix Consulting Pvt Ltd
Resampling Methods
• Resampling methods involve repeatedly drawing samples from a training set and refitting the model of interest on each sample. In this process, we obtain additional information about the fitted model.
• This information would not be available if we fit the model only once using the original training sample.
• Because resampling methods involve fitting a machine learning method multiple times, using different subsets of the training data, they can be computationally expensive.
• However, due to recent advances in computing power, the computational requirements of resampling methods are generally no longer prohibitive.

Resampling Methods
• The most commonly used resampling methods are cross-validation and the bootstrap.
• For example, we can use cross-validation to estimate the test error associated with a machine learning method, either to evaluate its performance or to select its level of flexibility.
• The process of evaluating the performance of a model is called model assessment, whereas the process of selecting the level of flexibility of a model is called model selection.
• We use the bootstrap to measure the accuracy of a given machine learning method.

Resampling Methods
• In the previous chapters, we discussed how the test error rate and the training error rate can differ from each other.
• If we do not have a very large test data set with which to estimate the test error rate, we can use various techniques to estimate it from the available training data.
• One class of methods estimates the test error rate by holding out a subset of the training observations from the fitting process and then applying the machine learning method to the held-out observations.

Validation Set Approach
• The validation set approach is used to estimate the test error rate.
• In this approach, we randomly divide the available set of observations into two parts: a training set and a validation set (or hold-out set).
• We use the training set to fit the model and then use the fitted model to predict the responses for the observations in the validation set.
• The resulting validation set error rate provides an estimate of the test error rate.

Validation Set Approach
• Figure 1 shows a schematic display of the validation set approach.
• A set of $n$ observations is randomly split into a training set and a validation set.
• The training set is shown in blue and contains observations 7, 22, and 13, among others.
• The validation set is shown in beige and contains observation 91, among others.
• We fit a machine learning model on the training set and use the validation set to evaluate its performance.

Validation Set Approach
• Even though the validation set approach is conceptually simple and easy to implement, it suffers from two drawbacks (see the sketch below):
• The validation estimate of the test error rate can be highly variable, because it depends on exactly how the observations are split between the training set and the validation set.
• Only a subset of the observations is used to fit the model. Machine learning models tend to perform worse when trained on fewer observations, so the validation set error rate may overestimate the test error rate for a model fit on the entire data set.
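A minimal sketch of the validation set approach, assuming Python with scikit-learn and a small simulated data set (the library choice and the data are illustrative, not part of the original example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated data: 100 observations from a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=100)

# Randomly divide the observations into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Fit on the training set, then predict the held-out responses
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation set estimate of the test MSE: {val_mse:.3f}")
```

Re-running this with a different random_state generally produces a different estimate, which is exactly the variability drawback described above.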
Leave-one-out Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV) is similar to the validation set approach, but it tries to address the drawbacks of that approach.
• Like the validation set approach, LOOCV splits the data set into two parts. However, instead of creating two subsets of comparable size, we use a single observation $(x_1, y_1)$ as the validation set and the remaining $n - 1$ observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ as the training set.
• We fit a machine learning model on the $n - 1$ training observations, make a prediction $\hat{y}_1$ for the excluded observation $x_1$, and calculate $MSE_1 = (y_1 - \hat{y}_1)^2$.
• However, $MSE_1$ is highly variable because it is based on a single observation.

Leave-one-out Cross-Validation
• We then repeat the procedure by selecting $(x_2, y_2)$ as the validation data, training the machine learning model on the $n - 1$ observations $\{(x_1, y_1), (x_3, y_3), \ldots, (x_n, y_n)\}$, and computing $MSE_2 = (y_2 - \hat{y}_2)^2$.
• Repeating this approach $n$ times produces $n$ squared errors, $MSE_1, \ldots, MSE_n$.
• The LOOCV estimate of the test MSE is the average of these $n$ test error estimates:

$$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_i$$

Leave-one-out Cross-Validation
• LOOCV has two major advantages over the validation set approach.
• First, LOOCV has far less bias. In LOOCV, we repeatedly fit the machine learning model on $n - 1$ training observations, almost as many as are in the entire data set, whereas the validation set approach splits the data into two parts of comparable size.
• Hence LOOCV does not tend to overestimate the test error rate as much as the validation set approach does.

Leave-one-out Cross-Validation
• Second, because of the randomness in the training/validation splits, the validation set approach yields different results when applied repeatedly. Performing LOOCV multiple times, in contrast, always gives the same results, because there is no randomness in the splits.
• On the other hand, LOOCV is computationally expensive because the model has to be fit $n$ times.
• If $n$ is large and each individual model is slow to fit, this can be very time consuming.

Leave-one-out Cross-Validation
• LOOCV is a general method that can be used with any kind of predictive model, including logistic regression and linear discriminant analysis. (A code sketch follows the $k$-fold discussion below.)

k-fold Cross Validation
• $k$-fold cross-validation is an alternative to LOOCV. In this method, we randomly divide the data set into $k$ groups, or folds, of approximately equal size.
• We treat the first fold as a validation set and fit the machine learning method on the remaining $k - 1$ folds.
• We calculate the mean squared error, $MSE_1$, on the observations in the held-out fold.
• We repeat the procedure $k$ times, each time treating a different group of observations as the validation set.

k-fold Cross Validation
• This yields $k$ estimates of the test error, $MSE_1, MSE_2, \ldots, MSE_k$. The $k$-fold CV estimate is computed by averaging these values:

$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i$$

• A schematic of the $k$-fold cross-validation approach is shown in Figure 3.

k-fold Cross Validation
• LOOCV is a special case of $k$-fold CV in which $k = n$.
• In practice, we typically perform $k$-fold CV using $k = 5$ or $k = 10$.
• A natural question is what advantage there is in using $k = 5$ or $k = 10$ instead of $k = n$. One major advantage is computational: LOOCV requires fitting the learning method $n$ times, which makes it expensive compared with $k$-fold CV using $k = 5$ or $k = 10$.
• For example, with $k = 10$, 10-fold CV requires fitting the learning method only 10 times, which may be much more feasible.
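A minimal LOOCV sketch, again assuming Python with scikit-learn and the same kind of simulated data (illustrative choices, not from the original deck):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

# Simulated data: 100 observations from a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=100)

# Fit the model n times, each time holding out a single observation
squared_errors = []
for train_idx, val_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[val_idx])[0]   # prediction for the held-out point
    squared_errors.append((y[val_idx][0] - y_hat) ** 2)

# CV(n) is the average of the n squared errors
cv_n = np.mean(squared_errors)
print(f"LOOCV estimate of the test MSE: {cv_n:.3f}")
```

Running this twice gives identical output, since there is no randomness in the splits.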
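A corresponding $k$-fold sketch with $k = 10$, under the same assumptions; note that only 10 model fits are required rather than $n = 100$:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Simulated data as before
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, size=100)

# Randomly divide the observations into k = 10 folds
kf = KFold(n_splits=10, shuffle=True, random_state=1)

fold_mses = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[val_idx] - model.predict(X[val_idx])
    fold_mses.append(np.mean(resid ** 2))  # MSE_i on the held-out fold

# CV(k) is the average of the k fold-level MSEs
cv_k = np.mean(fold_mses)
print(f"10-fold CV estimate of the test MSE: {cv_k:.3f}")
```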
The Bootstrap
• The bootstrap is a widely applicable and very powerful tool.
• It is used to quantify the uncertainty associated with a given machine learning method.
• For example, we can use the bootstrap to estimate the standard errors of the coefficients from a linear regression fit.
• The bootstrap is not particularly useful in that case, because standard statistical software such as R and Python automatically provides the standard errors. For a wide range of other machine learning methods, however, standard errors are difficult to obtain, and we can use the bootstrap for those methods.

The Bootstrap
• The bootstrap estimates quantities about a population by averaging estimates from many small data samples.
• The samples are constructed by drawing observations from a larger data sample one at a time, returning each observation to the sample after it has been chosen.
• This allows a given observation to be included in a given sample more than once.
• This approach to sampling is known as sampling with replacement.

The Bootstrap
• In this example, we have a simple data set, which we call $Z$, containing only $n = 3$ observations.
• From this data set, we randomly select $n$ observations to produce a bootstrap data set $Z^{*1}$.
• Because the sampling is performed with replacement, the same observation can appear in the bootstrap data set more than once.
• In the current example, the $Z^{*1}$ data set contains the first observation once and the third observation twice, while the second observation is not present at all.

The Bootstrap
• We can use $Z^{*1}$ to calculate a bootstrap estimate of $\alpha$, which we call $\hat{\alpha}^{*1}$.
• We repeat this procedure $B$ times for some large value of $B$, producing $B$ different bootstrap data sets, $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and $B$ corresponding estimates, $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$.
• We can then compute the standard error of these bootstrap estimates. (A code sketch appears at the end of the deck.)

Subset Selection
• There are two reasons why we may not be satisfied with the least squares estimates.
• The first is prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking some coefficients, or setting them to zero. By doing so we sacrifice a little bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
• The second reason is interpretation. With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the "big picture", we are willing to sacrifice some of the small details.

Shrinkage Methods
• By retaining a subset of the predictors and discarding the rest, subset selection produces a model that is interpretable and possibly has lower prediction error than the full model.
• However, because subset selection is a discrete process (variables are either retained or discarded), it often exhibits high variance, and so may fail to reduce the prediction error of the full model.
• Shrinkage methods are more continuous and do not suffer as much from high variability. (A second sketch at the end of the deck contrasts least squares with two shrinkage fits.)

Thanks
Samatrix Consulting Pvt Ltd
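A minimal sketch of the bootstrap standard-error procedure, assuming Python with NumPy; the data and the statistic $\hat{\alpha}$ (taken here to be the sample median) are purely illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data set Z; the statistic alpha_hat below is the sample
# median, chosen only for illustration
Z = rng.normal(loc=5.0, scale=2.0, size=100)

def alpha_hat(sample):
    return np.median(sample)

B = 1000                       # number of bootstrap data sets
n = len(Z)
boot_estimates = np.empty(B)
for b in range(B):
    # Draw n observations from Z with replacement to form Z*b
    Z_star = rng.choice(Z, size=n, replace=True)
    boot_estimates[b] = alpha_hat(Z_star)

# The standard deviation of the B bootstrap estimates approximates
# the standard error of alpha_hat
se_boot = boot_estimates.std(ddof=1)
print(f"Bootstrap estimate of SE(alpha_hat): {se_boot:.4f}")
```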
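A minimal sketch contrasting least squares with ridge regression and the lasso, assuming scikit-learn and simulated data in which only a few predictors truly matter; ridge shrinks all coefficients toward zero, while the lasso can set some exactly to zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Simulated data: 10 predictors, only the first 3 have nonzero effects
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ beta + rng.normal(scale=1.0, size=80)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)    # can set coefficients exactly to zero

print("OLS:  ", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
```

The lasso output typically shows several coefficients driven exactly to zero, which makes visible the continuous-versus-discrete contrast with subset selection drawn above.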