 itooam 08-27-2012 07:34 AM

Training v Testing set size rules of thumb...

I have read comments elsewhere like "for learning it is best to use say 40% of your whole dataset for training, 30% for validation and say 30% for testing". Given cross-validation using the "leave one/many out" technique, is there a rule of thumb for the relative sizes of the training and test sets?
Would I be correct in answering as follows: the test set should be larger than the minimum indicated by VC, effectively 10 * degrees of freedom (using the other rule of thumb)? Beyond that, maybe it is just trial and error as to how much of the remaining data to apportion to the test set?

 htlin 08-27-2012 08:53 AM

Re: Training v Testing set size rules of thumb...

The number of folds used for cross validation is most commonly between 3 and 10; more than 20 is really rare. For single-shot validation, I've seen anywhere from 5% up to 40% of the data reserved for validation. Hope this helps.
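
To make the fold counts concrete: with k folds, each validation fold holds roughly 1/k of the data, so 3, 10 and 20 folds correspond to validating on about 33%, 10% and 5% at a time. The following is a minimal sketch only, using the standard library; the helper name `kfold_indices`, the dataset size of 600 and the seed are assumptions made for illustration, not anything from the course.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train_idx, val_idx) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, val_idx in enumerate(folds):
        # Everything outside fold i is used for training in this round.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits

# Hypothetical dataset of 600 points: compare the fold counts mentioned above.
for k in (3, 10, 20):
    train_idx, val_idx = kfold_indices(n_samples=600, n_folds=k)[0]
    print(k, "folds:", len(train_idx), "train /", len(val_idx), "validation")
```

Every point is used for validation exactly once across the k rounds, which is why a large single validation set is not needed when cross validation is used.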

 itooam 08-28-2012 02:42 AM

Re: Training v Testing set size rules of thumb...

If I have understood correctly, once you start using cross validation you only need to partition your data into two sets (as opposed to the training/validation/testing model, where you partition your data into three sets). One set is used for both training and cross validation, and the other set is used for testing. The "test" set is the one you lock away and don't look at until you have decided on the best hypothesis to use, i.e., it shows how well the model generalises to independent data. I was wondering what % you should allocate to each of these two sets?

When you wrote:
"For single-shot validation, I've seen 5% up to 40% reserved for validation"

I assume that by "validation" set you mean the "test" set, since cross validation is already in place?
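
The two-way partition described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `holdout_split` and `cv_folds`, the 20% test fraction, the 5 folds and the dataset size of 1000 are all hypothetical choices, not recommended values.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Split indices into a development set and a locked-away test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    # First n_test shuffled indices form the test set; the rest are for development.
    return indices[n_test:], indices[:n_test]

def cv_folds(dev_idx, n_folds=5):
    """Yield (train_idx, val_idx) pairs drawn only from the development set."""
    folds = [dev_idx[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

# Assumed sizes: 1000 points, 20% held out for the final test.
dev_idx, test_idx = holdout_split(1000, test_fraction=0.2)
for train_idx, val_idx in cv_folds(dev_idx, n_folds=5):
    # Model selection happens here: fit on train_idx, score on val_idx.
    # The test indices never appear in either list.
    assert not (set(train_idx) | set(val_idx)) & set(test_idx)

# Only after a final hypothesis is chosen would test_idx be used, once.
print(len(dev_idx), "development points,", len(test_idx), "test points")
```

The point of the sketch is the separation of roles: cross validation reuses the development set for both training and validation, while the test set is touched a single time at the end.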

