Thread: Criss-cross validation
05-19-2013, 06:54 AM
 Elroch Invited Guest Join Date: Mar 2013 Posts: 143
Criss-cross validation

Cross validation is a terrific technique which squeezes a tremendous amount out of data, but is there any room for improvement?

My question relates to a comment by someone somewhere (sorry, I can't remember where: you know who you are!) about the idea of doing 10-fold cross validation 10 times to get better results.

It seems intuitively clear to me that this gives better results, as it removes noise from the information cross-validation provides about the quality of the available options (e.g. hyperparameters), so the question is whether it is an optimal use of computing time. It requires 10 times as much computing as a single 10-fold cross validation, so is there some better way to use the time?

Firstly, I suspect there is a simple yes answer, at least with relatively small data sets, as there is some scope to be more precise about the selection of the runs. Within a single cross-validation, the out-of-sample data sets are chosen so that there are no overlaps. This principle should be extended as far as possible if the cross validations are repeated. Really, 10x(10-fold cross-validation) should be regarded as simply 100 runs with a 90:10 split of the data, with those runs then chosen to be optimal. Of course, it is now impossible to have no overlaps between the validation sets, but the overlaps can be minimised.
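To make the "100 runs with a 90:10 split" view concrete, here is a minimal sketch that builds the validation sets for plain repeated 10-fold cross-validation with independent random shuffles and measures how much they overlap. The function names, the seed and the data set size of 1000 are assumptions chosen for illustration, not anything prescribed above.

```python
import random
from itertools import combinations

def repeated_kfold_validation_sets(n_points, k=10, repeats=10, seed=0):
    """Return (repeat_index, validation_set) pairs for repeated k-fold CV."""
    rng = random.Random(seed)
    runs = []
    for r in range(repeats):
        order = list(range(n_points))
        rng.shuffle(order)
        # Fold j takes every k-th point of the shuffled order, starting at j.
        for j in range(k):
            runs.append((r, frozenset(order[j::k])))
    return runs

def overlap_fractions(runs):
    """Pairwise overlaps as a fraction of a validation set, split by same/different repeat."""
    same, cross = [], []
    for (ra, a), (rb, b) in combinations(runs, 2):
        frac = len(a & b) / len(a)
        (same if ra == rb else cross).append(frac)
    return same, cross

runs = repeated_kfold_validation_sets(n_points=1000)
same, cross = overlap_fractions(runs)
print("runs:", len(runs))
print("max overlap within one repeat:", max(same))
print("overlap across repeats: min", min(cross), "mean", sum(cross) / len(cross), "max", max(cross))
```

Within one repeat the folds are disjoint by construction. Across repeats the mean overlap is pinned at 10% of a validation set (each point sits in exactly one fold per repeat), but individual pairs scatter around that value, which is the unevenness a more methodical selection could remove.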

One idea is to break the data into many more than 10 subsets and then combine them into larger subsets in a methodical way to minimise overlaps. Imagine you want to do 30 validation runs with a 90:10 split (three 10-fold cross validations) on a data set. You could split the data set into 1000 parts, number them from 000 to 999, and use the first digit to define the first 10 validation sets, and the second and third digits to define the rest. The overlap between two out-of-sample data sets ranges from 0% (same digit position) to 10% of a validation set (different digit positions). Not too bad. This is where the title of this post comes from.
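The digit construction can be sketched in a few lines. This is an illustrative reading of the scheme, with hypothetical function and variable names: each validation set collects the parts whose digit in a given position equals a given value.

```python
from itertools import combinations

def criss_cross_validation_sets(n_digits=3, base=10):
    """One validation set per (digit position, digit value) over base**n_digits parts."""
    n_parts = base ** n_digits
    sets = []
    for pos in range(n_digits):
        # Divisor that shifts the digit at position `pos` (0 = leading digit) to the units place.
        divisor = base ** (n_digits - 1 - pos)
        for value in range(base):
            members = frozenset(
                p for p in range(n_parts) if (p // divisor) % base == value
            )
            sets.append((pos, value, members))
    return sets

sets = criss_cross_validation_sets()
overlaps = [len(a & b) / len(a) for (_, _, a), (_, _, b) in combinations(sets, 2)]

print("validation sets:", len(sets))
print("distinct overlap fractions:", sorted(set(overlaps)))
print("mean overlap:", sum(overlaps) / len(overlaps))
```

Each digit position yields an ordinary 10-fold partition, so sets from the same position never overlap, while sets from different positions share exactly 10 of their 100 parts.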

Q1: What is the absolute limit of this approach?

For a given fraction of the data to be devoted to validation and a given number of runs, how small can the overlap between the validation data sets be made, and can the overlaps be kept to the same size (rather than varying from 0% to 10% as above)? [Of course, if the number of runs is no more than the reciprocal of the validation fraction, as in normal cross-validation, the overlaps can easily be made zero, but with more runs this is not possible.]
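One way to get a feel for Q1 is a standard double-counting bound on the average overlap. This is only a sketch and not necessarily the answer alluded to in the edit note at the end of the post. The notation is assumed here: n data points, N runs, validation fraction f, V_j the validation set of run j (so |V_j| = fn), and c_i the number of validation sets that contain point i.

```latex
\begin{align*}
\sum_{i=1}^{n} c_i &= N f n, \\
\sum_{j<k} \lvert V_j \cap V_k \rvert
  &= \sum_{i=1}^{n} \binom{c_i}{2}
  \;\ge\; n \,\frac{N f \,(N f - 1)}{2}, \\
\frac{1}{\binom{N}{2}} \sum_{j<k} \frac{\lvert V_j \cap V_k \rvert}{f n}
  &\ge \frac{N f - 1}{N - 1}.
\end{align*}
```

The inequality in the second line is Jensen's inequality applied to the convex function x(x-1)/2, with equality when every point is validated equally often. For N = 30 and f = 0.1 the bound is 2/29, about 6.9% of a validation set, which the digit scheme above meets on average (300 of its 435 pairs overlap by 10%, the rest by 0%). For N = 100 and f = 0.1 it is 1/11, about 9.1%. When Nf is at most 1 the bound is not positive, consistent with zero overlap being achievable. The bound says nothing about whether all overlaps can be made equal.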

Q2: Which variation is the best use of computing time?

Instead of doing 100 runs with 10% of the data being used for validation, we might prefer to do 100-fold cross validation, which would use a similar amount of computing time but only use 1% of the data each time for validation, with no overlaps. [Note: to be precise, rather more computing time is usually needed because of the larger training set.] The pros of the 100-fold version are that the training set is a bit bigger and validation data points are never re-used (albeit with different hypotheses); the con is that each validation set is a lot smaller.
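The trade-off can be tabulated with some simple arithmetic. The helper below is a hypothetical sketch that assumes the number of points is divisible by the number of folds and, as a crude proxy for computing time, counts the total number of training points processed (which only reflects cost if training scales roughly linearly with training set size).

```python
def cv_budget(n_points, k, repeats=1):
    """Summarise the cost profile of `repeats` rounds of k-fold cross-validation."""
    val_size = n_points // k          # assumes n_points is divisible by k
    train_size = n_points - val_size
    fits = k * repeats
    return {
        "fits": fits,
        "train_size": train_size,
        "val_size": val_size,
        "times_each_point_validated": repeats,
        "training_points_processed": fits * train_size,  # crude cost proxy
    }

repeated_10_fold = cv_budget(n_points=1000, k=10, repeats=10)
single_100_fold = cv_budget(n_points=1000, k=100)

for name, budget in [("10 x 10-fold", repeated_10_fold), ("100-fold", single_100_fold)]:
    print(name, budget)
```

With 1000 points, both variants need 100 model fits. The 100-fold version trains on 990 points instead of 900 per fit (10% more under the linear-cost assumption) and validates each point once rather than ten times, but each validation set has only 10 points.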

These are not simple theoretical questions, as the answers depend on things like the correlations between the out-of-sample predictions of two hypotheses when they are created using overlapping data. This depends on the complexity of the hypothesis set and, statistically, on the precise training method.

[EDIT: I can now see how to get the answer to the first question. It's quite neat and may be useful: I'll elaborate later]