K-Fold Cross Validation – Intro to Machine Learning
So Katie, you told everybody about training and test sets, and I hope people exercised it quite a bit. Is that correct? >> Yes, that's right. >> So now I'm going to talk about something that slightly generalizes this, called cross validation.

To get into cross validation, let's first talk about problems with splitting a data set into training and testing data. Suppose this is your data. By doing what Katie told you, you now have to say what fraction of the data is testing and what is training. The dilemma you're running into is that you would like to maximize both of these sets. You want as many data points as possible in the training set to get the best learning results, and the maximum number of data points in your test set to get the best validation. But obviously there's an inherent trade-off here: every data point you take out of the training set and put into the test set is lost to training. So we have to resolve this trade-off, and this is where cross validation comes into the picture.

The basic idea is that you partition the data set into k bins of equal size. For example, if you have 200 data points and you have ten bins, very quickly, what's the number of data points per bin? Quite obviously, it's 20. So you will have 20 data points in each of the 10 bins.

So here's the picture. In the work that Katie showed you, you just pick one of those bins as the testing bin and the others as the training bins. In k-fold cross validation, you run k separate learning experiments. In each of those, you pick one of the k subsets as your testing set. The remaining k minus one bins are put together into the training set; then you train your machine learning algorithm and, just like before, you test its performance on the testing set. The key thing in cross validation is that you run this multiple times.
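The procedure described above can be sketched with scikit-learn's `KFold` splitter. The dataset and the decision tree classifier here are illustrative choices, not part of the lesson; the point is the loop: each of the k bins is held out once for testing while the other k minus one bins are used for training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: 200 points split into 10 bins -> 20 points per bin,
# matching the numbers in the example above.
X, y = make_classification(n_samples=200, random_state=42)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])                  # train on k-1 bins
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out bin

print(len(scores))       # 10 separate learning experiments
print(np.mean(scores))   # averaged test-set performance
```

Each pass through the loop is one of the k learning experiments; averaging `scores` at the end gives the overall assessment.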
In this case ten times, and then you average the ten different testing-set performances for the ten different hold-out sets; that is, you average the test results from those k experiments. Obviously this takes more compute time, because you now have to run k separate learning experiments, but the assessment of the learning algorithm will be more accurate. And in a way, you've used all your data for training and all your data for testing, which is kind of cool.

Let me just ask one question. Suppose you have a choice between the static train/test methodology that Katie told you about and, say, 10-fold cross validation (CV). You might really care about minimizing training time, minimizing run time after training (that is, the time it takes your trained machine learning algorithm to produce its outputs), or maximizing accuracy. In each of these three situations, you might pick either train/test or 10-fold cross validation. Give me your best guess: which one would you pick? So for each of minimum training time, minimum run time, and maximum accuracy, pick one of the two over here on the right side.
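To make the compute trade-off in the question concrete, here is a sketch (again with an illustrative dataset and classifier, not from the lesson) contrasting a single static split against 10-fold CV, using scikit-learn's `cross_val_score` to do the k experiments and the averaging in one call:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
clf = DecisionTreeClassifier(random_state=42)

# Static train/test: one learning experiment. Fastest to train, but the
# accuracy estimate depends on which points happened to land in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
single_score = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold CV: ten learning experiments, so roughly 10x the training time,
# but a more reliable averaged assessment of the algorithm.
cv_scores = cross_val_score(clf, X, y, cv=10)

print(single_score)
print(cv_scores.mean())
```

Note that the trained model's prediction time is the same either way: cross validation changes how long training and assessment take, not how fast the final classifier runs on new data.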