LFD Book Forum Criss-cross validation

#1
05-19-2013, 06:54 AM
 Elroch Invited Guest Join Date: Mar 2013 Posts: 143
Criss-cross validation

Cross validation is a terrific technique which squeezes a tremendous amount out of data, but is there any room for improvement?

My question relates to a comment by someone somewhere (sorry, I can't remember where: you know who you are!) about the idea of doing 10-fold cross validation 10 times to get better results.

It seems intuitively clear to me that this works in getting better results, as it removes noise in the information that cross-validation provides about the quality of the options available (e.g. hyperparameters), so the question is whether it is an optimal use of computing time. This requires 10 times as much computing as a single 10-fold cross-validation, so is there some better way to use the time?

Firstly, I suspect there is a simple yes answer, at least with relatively small data sets, as there is some scope to be more precise about the selection of the runs. When selecting the out-of-sample data for cross-validation, the folds are chosen so there are no overlaps. This principle should be extended as far as possible when the cross-validations are repeated. Really, 10 x (10-fold cross-validation) should be regarded simply as 100 runs with a 90:10 split of the data, and those runs should then be selected to be optimal. Of course, getting no overlaps at all in the validation sets is now impossible, but the overlaps can be minimised.
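The "100 runs with a 90:10 split" view can be sketched directly. Below is a minimal plain-Python illustration (the function name `repeated_kfold` and the seed are my own, hypothetical choices): 10 independently shuffled 10-fold partitions, i.e. exactly 100 runs, each holding out 10% of the data.

```python
import random

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, val_idx) pairs: `repeats` independently
    shuffled k-fold partitions, i.e. repeats*k runs, each with a
    (k-1)/k : 1/k train/validation split."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(repeats):
        rng.shuffle(idx)
        for fold in range(k):
            val = idx[fold::k]            # every k-th index: ~n/k points
            held = set(val)
            train = [i for i in idx if i not in held]
            yield train, val

runs = list(repeated_kfold(100))
print(len(runs))                           # 100 runs in total
print(len(runs[0][0]), len(runs[0][1]))    # 90 training / 10 validation points
```

Within any one repetition the 10 validation sets partition the data with no overlap; across repetitions the overlaps are left to chance, which is exactly what the scheme below tries to improve on.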

One idea is to break the data into many more than 10 subsets and then select larger subsets in a methodical way to minimise overlaps. Imagine you want to do 30 10-fold cross-validations on a data set. You could split the data set into 1000 parts, number them from 0 to 999, and use the first digit to define the first 10 validation sets, with the second and third digits defining the rest. The overlap between the out-of-sample data sets then ranges from 0% to 10%. Not too bad. This is where the title of this post comes from.

Q1: What is the absolute limit of this approach?

For a given fraction of the data devoted to validation and a given number of runs, how small can the overlap between the validation data sets be made, and can the overlaps be kept to the same size (rather than varying from 0% to 10% as above)? [Of course, if the number of runs is no more than the number of folds, as in normal cross-validation, the overlaps can easily be made zero, but with more runs this is not possible.]

Q2: Which variation is the best use of computing time?

Instead of doing 100 runs with 10% of the data used for validation, we might prefer to do 100-fold cross-validation, which would use a similar amount of computing time but only 1% of the data each time for validation, with no overlaps. [Note: to be precise, rather more computing time is usually needed because of the larger training set.] The pros of the 100-fold version are that the training data is a bit bigger and validation data points are never re-used (albeit tested against different hypotheses); the con is that each validation set is a lot smaller.
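One way to see the trade-off concretely is to count how often each point serves as validation data under the two schemes. A plain-Python sketch on a hypothetical data set of 100 points (the variable names are my own):

```python
import random

n = 100

# Scheme A: 100 runs, each holding out a random 10% for validation.
uses_repeated = [0] * n
rng = random.Random(0)
for _ in range(100):
    for i in rng.sample(range(n), n // 10):
        uses_repeated[i] += 1

# Scheme B: 100-fold CV -- each point is validated exactly once.
uses_100fold = [1] * n

print(sum(uses_repeated), sum(uses_100fold))  # 1000 vs 100 OOS observations
```

Scheme A yields ten times as many out-of-sample observations (each point validated about 10 times on average, against 10 different hypotheses), while Scheme B trains on 99% of the data each run but rests its error estimate on a single observation per point.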

These are not simple theoretical questions, as the answers depend on things like the correlation between the out-of-sample predictions of two hypotheses that were trained on overlapping data. This depends on the complexity of the hypothesis set and, statistically, on the precise training method.

[EDIT: I can now see how to get the answer to the first question. It's quite neat and may be useful: I'll elaborate later]
#2
05-20-2013, 01:33 AM
 htlin NTU Join Date: Aug 2009 Location: Taipei, Taiwan Posts: 601
Re: Criss-cross validation

For your reference, this is one very early work of mine on using repeated CV on a very small data set:

http://www.csie.ntu.edu.tw/~htlin/pa...pkdd05sage.pdf

Hope this helps.
__________________
When one teaches, two learn.
#3
05-20-2013, 08:34 AM
 Elroch Invited Guest Join Date: Mar 2013 Posts: 143
Re: Criss-cross validation

Thanks, Prof. Lin, that is indeed relevant. Am I right that everywhere later in the paper where 10-fold cross-validation is referred to, it means 10 x (10-fold cross-validation)?

The instability of LOO cross-validation is interesting. I see two reasons for this instability. The first is that the individual training runs differ on only two data points, so the hypotheses generated may be highly correlated. (This is likely to be mitigated by higher correlation with the hypothesis generated when finally using all the data.) In addition, there is no scope for error reduction with LOO by repeating the cross-validation and averaging errors. So, in your study, LOO errors were based on only 90 OOS observations, compared with 10 x 90 in the 10 x (10-fold cross-validation). What is your experience of the nature of the instability?

I observe that in a circumstance where the second reason is the dominant one, "leave-two-out" cross-validation offers the potential to improve this enormously, with up to 90*89 OOS observations. The question is how much damage is done by the correlation between hypotheses generated from data sets that differ in only 2 points (as well as those differing in 4 points).
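The 90*89 count above can be verified by enumerating the held-out pairs; a short sketch using the standard library's `itertools.combinations` (90 training points, as in the study discussed):

```python
from itertools import combinations

n = 90
runs = list(combinations(range(n), 2))   # every way to hold out two points
n_obs = 2 * len(runs)                    # two OOS observations per run

print(len(runs), n_obs)   # 4005 runs, 8010 = 90*89 observations
```

So leave-two-out needs C(90, 2) = 4005 training runs, roughly 45 times the cost of LOO, in exchange for 89 times as many out-of-sample observations.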

Another observation is that it is perfectly reasonable to combine cross-validations with different fractions of data, again in order to reduce noise in the cross-validation procedure.
