LFD Book Forum  

#1
05-19-2013, 06:54 AM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Criss-cross validation

Cross validation is a terrific technique which squeezes a tremendous amount out of data, but is there any room for improvement?

My question relates to a comment by someone somewhere (sorry, I can't remember where: you know who you are!) about the idea of doing 10-fold cross validation 10 times to get better results.

It seems intuitively clear to me that this does give better results, as it removes noise from the information that cross-validation provides about the quality of the available options (e.g. hyperparameters), so the question is whether it is an optimal use of computing time. It requires 10 times as much computing as a single 10-fold cross validation, so is there some better way to use the time?

Firstly, I suspect there is a simple yes answer, at least with relatively small data sets, as there is some scope to be more precise about the selection of the runs. Within a single cross-validation, the out-of-sample sets are chosen so that they do not overlap. This principle should be extended as far as possible if the cross validations are repeated. Really, 10 x (10-fold cross-validation) should be regarded as simply 100 runs with a 90:10 split of the data, with those runs then selected to be optimal. Of course, getting no overlaps at all between the validation sets is now impossible, but the overlaps can be minimised.
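To make the "100 runs with a 90:10 split" view concrete, here is a minimal sketch that builds the validation sets of an ordinary repeated 10-fold cross-validation with independent random shuffles and measures how much sets from different repeats overlap. The function names, the seed and the choice of 1000 data points are assumptions for illustration only, not anything from the discussion above.

```python
import random


def repeated_kfold_validation_sets(n_points, k=10, repeats=10, seed=0):
    """Return (repeat, fold, validation_indices) for every run of repeated k-fold CV."""
    rng = random.Random(seed)
    runs = []
    for repeat in range(repeats):
        indices = list(range(n_points))
        rng.shuffle(indices)  # a fresh random partition for each repeat
        for fold in range(k):
            # Every k-th shuffled index goes to this fold, so folds within
            # one repeat are disjoint and together cover all points.
            runs.append((repeat, fold, frozenset(indices[fold::k])))
    return runs


def overlap_range_between_repeats(runs):
    """Min and max overlap, as a fraction of a validation set, across different repeats."""
    fractions = [
        len(a & b) / len(a)
        for i, (repeat_a, _, a) in enumerate(runs)
        for (repeat_b, _, b) in runs[i + 1:]
        if repeat_a != repeat_b
    ]
    return min(fractions), max(fractions)


if __name__ == "__main__":
    runs = repeated_kfold_validation_sets(1000)
    print(len(runs), overlap_range_between_repeats(runs))
```

With purely random repeats the overlaps between validation sets are left to chance, which is the looseness that a more deliberate selection of runs would try to remove.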

One idea is to break the data into many more than 10 subsets and then combine them into larger validation sets in a methodical way that minimises overlaps. Imagine you want 30 validation runs, organised as three 10-fold cross validations. You could split the data set into 1000 parts, number them from 000 to 999, and use the first digit to define the first 10 validation sets and the second and third digits to define the other 20. Two validation sets defined by the same digit position do not overlap at all, while two defined by different digit positions share 10 of their 100 parts, so the overlap between the out-of-sample datasets ranges from 0% to 10% of a validation set. Not too bad. This is where the title of this post comes from.
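The digit scheme can be checked directly. The following is a small sketch, assuming the 1000 parts are simply the integers 0 to 999 and that each validation set collects the parts sharing one decimal digit in one position; the function names are hypothetical.

```python
from itertools import combinations


def criss_cross_validation_sets(n_digits=3):
    """Return (position, digit, parts) for each validation set of the digit scheme."""
    n_parts = 10 ** n_digits
    sets = []
    for position in range(n_digits):  # 0 = first (most significant) digit
        divisor = 10 ** (n_digits - 1 - position)
        for digit in range(10):
            parts = frozenset(
                p for p in range(n_parts) if (p // divisor) % 10 == digit
            )
            sets.append((position, digit, parts))
    return sets


def overlap_fractions(sets):
    """Overlap of every pair of validation sets, as a fraction of one validation set."""
    return [
        len(a & b) / len(a)
        for (_, _, a), (_, _, b) in combinations(sets, 2)
    ]


if __name__ == "__main__":
    sets = criss_cross_validation_sets()
    fractions = overlap_fractions(sets)
    print(len(sets), min(fractions), max(fractions))
```

Each of the 30 sets holds 100 of the 1000 parts, and every part lands in exactly three validation sets, one per digit position.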

Q1: What is the absolute limit of this approach?

For a given fraction of the data f to be devoted to validation and a given number of runs R, how small can the overlap between the validation data sets be made, and can the overlaps be kept to the same size (rather than varying from 0% to 10% as above)? [Of course if f \times R \leq 1, as in normal cross-validation, the overlaps can easily be made zero but with more runs this is not possible.]
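One way to get a handle on Q1 is a simple counting bound on the average overlap. This is only a sketch and is not necessarily the answer the author has in mind. The notation is an assumption introduced here: N data points, validation sets V_1, ..., V_R each containing exactly fN points, and c_i the number of validation sets that contain point i.

```latex
\sum_{i=1}^{N} c_i = R f N,
\qquad
\sum_{r<s} \lvert V_r \cap V_s \rvert
  = \sum_{i=1}^{N} \frac{c_i (c_i - 1)}{2}
  \;\ge\; N \, \frac{R f \,(R f - 1)}{2}
  \quad \text{(convexity of } x(x-1)/2 \text{)} .

\text{Dividing by the } \tfrac{R(R-1)}{2} \text{ pairs and by the set size } fN
\text{ (for } f \times R \ge 1\text{):}
\qquad
\text{average overlap fraction} \;\ge\; \frac{R f - 1}{R - 1} .
```

For R = 30 and f = 0.1 the bound is 2/29, about 6.9% of a validation set. The three-digit scheme meets this average exactly, because every part lies in exactly three validation sets, but its individual overlaps are 0% or 10% and not all equal. Whether some scheme attains the bound with every overlap the same size depends on N, f and R, and this argument does not settle that.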

Q2: Which variation is the best use of computing time?

Instead of doing 100 runs with 10% of the data held out for validation each time, we might prefer to do 100-fold cross validation, which would use a similar amount of computing time but hold out only 1% of the data each time, with no overlaps. [Note: to be precise, rather more computing time is usually needed because of the larger input dataset]. The pros of the 100-fold version are that the input data is a bit bigger and validation data points are never re-used (albeit with different hypotheses); the con is that each validation set is a lot smaller.
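The bookkeeping behind this comparison can be written down in a few lines. This is only an arithmetic sketch: the data set size of 1000 points and the function name `scheme_summary` are assumptions for illustration, and it says nothing about which scheme gives the better estimate.

```python
def scheme_summary(n_points, folds, repeats=1):
    """Count runs, split sizes and validation re-use for repeated k-fold CV."""
    if n_points % folds != 0:
        raise ValueError("this sketch assumes folds divides n_points exactly")
    validation_size = n_points // folds
    return {
        "runs": folds * repeats,
        "train_size": n_points - validation_size,
        "validation_size": validation_size,
        # Each point is held out exactly once per repeat.
        "times_each_point_validated": repeats,
        "total_validation_predictions": n_points * repeats,
    }


if __name__ == "__main__":
    n = 1000  # assumed data set size
    print("10 x 10-fold:", scheme_summary(n, folds=10, repeats=10))
    print("100-fold:    ", scheme_summary(n, folds=100, repeats=1))
```

Both schemes train 100 hypotheses. Under these assumptions the repeated scheme trains on 900 points and validates each point 10 times, while the 100-fold scheme trains on 990 points and validates each point once.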

These are not simple theoretical questions, as the answers depend on things like the correlations between the out of sample predictions of two hypotheses when they are created using overlapping data. This depends on the complexity of the hypothesis set, and statistically on the precise training method.

[EDIT: I can now see how to get the answer to the first question. It's quite neat and may be useful: I'll elaborate later]
#2
05-20-2013, 01:33 AM
htlin
NTU

Join Date: Aug 2009
Location: Taipei, Taiwan
Posts: 601
Re: Criss-cross validation

For your reference, this is one very early work of mine on using repeated CV on a very small data set:

http://www.csie.ntu.edu.tw/~htlin/pa...pkdd05sage.pdf

Hope this helps.
__________________
When one teaches, two learn.
#3
05-20-2013, 08:34 AM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143
Re: Criss-cross validation

Thanks, Prof. Lin, that is indeed relevant. Am I right that wherever the paper later refers to 10-fold cross-validation, it means 10 x (10-fold cross-validation)?

The instability of LOO cross-validation is interesting. I see two reasons for this instability. The first is that the training sets of any two runs differ on only two data points, so the hypotheses generated may be highly correlated. (This is likely to be mitigated by higher correlation with the hypothesis generated when finally using all the data.) In addition, there is no scope for error reduction with LOO by repeating the cross validation and averaging errors. So, in your study, LOO errors were based on only 90 OOS observations, compared with 10 x 90 in the 10 x (10-fold cross-validation). What is your experience of the nature of the instability?

I observe that in a circumstance where the second reason is the important one, "Leave-two-out" cross-validation provides the potential to improve this enormously, with up to 90*89 OOS observations. The question is how much damage is done as a result of the correlation between hypotheses generated with data that differs on only 2 points (as well as those differing on 4 points).
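The 90*89 figure can be verified by counting. Below is a minimal sketch, assuming 90 data points as in the study discussed; the helper names are hypothetical.

```python
from itertools import combinations
from math import comb


def leave_p_out_counts(n_points, p):
    """Return (number of runs, total out-of-sample observations) for leave-p-out CV."""
    n_runs = comb(n_points, p)
    return n_runs, n_runs * p


def leave_p_out_splits(n_points, p):
    """Yield (train_indices, validation_indices) for every leave-p-out run."""
    all_points = set(range(n_points))
    for held_out in combinations(range(n_points), p):
        yield sorted(all_points - set(held_out)), list(held_out)


if __name__ == "__main__":
    print("leave-one-out:", leave_p_out_counts(90, 1))
    print("leave-two-out:", leave_p_out_counts(90, 2))
```

Leave-two-out on 90 points has 4005 runs with two held-out points each, giving 8010 = 90*89 OOS observations, against 90 for LOO. The cost is 4005 training runs instead of 90, which is why "up to" matters: a subset of the pairs could be used instead.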

Another observation is that it is perfectly reasonable to combine cross-validations with different fractions of data, again in order to reduce noise in the cross-validation procedure.