Based on the lecture and the recommended videos by Prof. Gary, I was able to learn the intricacies of various cross-validation approaches and their usefulness when dealing with smaller datasets, similar to the Project 1 dataset.
It all boils down to how we interpret the test error rate in comparison to the training error rate of a model. As we may have observed, the training error rate of a polynomial regression function, for example, usually falls as the degree of the polynomial increases. However, the test error rate falls initially before rising again as the model grows more complex; this is the signature of overfitting. To address this, I learned that there are various cross-validation approaches, as mentioned below:
- The Validation Set Approach
This process entails randomly dividing the available set of observations into two segments: a training set and a validation set (or hold-out set). The model is trained on the training set, and the resulting model is employed to predict responses for the observations in the validation set. The ensuing validation set error rate, often evaluated using Mean Squared Error (MSE) for quantitative responses, serves as an approximation of the test error rate. However, this approach comes with certain limitations.
The validation estimate of the test error rate can exhibit high variability, depending on which observations end up in the training and validation sets. In addition, the validation approach uses only a subset of the observations (those in the training set, rather than the validation set) to train the model. Since statistical methods generally perform worse when trained on fewer observations, the validation set error rate tends to overestimate the test error rate for the model fitted on the entire dataset.
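To make this concrete, here is a minimal sketch of the validation set approach, assuming a hypothetical noisy quadratic dataset `X, y` and scikit-learn; the 50/50 split and the range of polynomial degrees are just illustrative choices, not something prescribed in the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Hypothetical data: a noisy quadratic relationship.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=100)

# Randomly split the observations into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Fit polynomial regressions of increasing degree on the training set only.
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # The validation MSE serves as an approximation of the test error rate.
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree}: validation MSE = {val_mse:.3f}")
```

Re-running this with a different `random_state` illustrates the first limitation above: the estimated MSEs can shift noticeably with the particular split.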
- Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation (LOOCV) is a technique used to assess the performance of predictive models by systematically leaving out one observation at a time for validation while training the model on the remaining data. This process is repeated for each observation in the dataset, so the model is validated against every observation exactly once. While LOOCV provides a thorough evaluation of model performance and utilizes nearly all available data for each fit, it can be computationally expensive for large datasets, since it requires fitting the model n times, once per observation. Despite this, LOOCV is commonly employed when the dataset size is limited, providing a robust assessment of model generalization.
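Here is a minimal LOOCV sketch using scikit-learn's `LeaveOneOut` splitter, on the same hypothetical data as above; the degree-2 model is again just an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same hypothetical noisy quadratic data as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=100)

# LOOCV: n folds, each holding out exactly one observation.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")

# cross_val_score returns negated MSEs; average the n held-out
# errors (with the sign flipped) to get the LOOCV test MSE estimate.
print(f"LOOCV estimate of test MSE: {-scores.mean():.3f}")
```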
- k-Fold Cross-Validation
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE_1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE_1, MSE_2, ..., MSE_k.
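Averaging these k fold-level errors then gives the overall k-fold CV estimate of the test error, following the standard definition:

CV_(k) = (1/k) * (MSE_1 + MSE_2 + ... + MSE_k)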
The most important advantage of k-fold CV is that it often gives more accurate estimates of the test error rate than LOOCV does. This has to do with a bias-variance trade-off: LOOCV yields a nearly unbiased estimate of the test error, but because its n fitted models are trained on almost identical data and are therefore highly correlated, the resulting estimate has higher variance; k-fold CV with k = 5 or 10 accepts slightly more bias in exchange for lower variance.
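As a final sketch, here is 10-fold CV with scikit-learn's `KFold` on the same hypothetical data, used to compare polynomial degrees; k = 10 is a common default, not something the lecture prescribed.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same hypothetical noisy quadratic data as before.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=100)

# 10-fold CV: each fold serves as the validation set exactly once.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    # CV_(k): the average of the k held-out MSEs for this degree.
    print(f"degree={degree}: 10-fold CV MSE = {-scores.mean():.3f}")
```

Note that this requires only 10 model fits per degree instead of the n fits LOOCV would need, which is exactly the computational advantage discussed above.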