09-14-2012, 02:49 AM
Andrs
Join Date: Jul 2012
Posts: 47
Separating CV and test data, and class distribution

I am not sure whether this question belongs here, but I will try!
Here is a training scenario and a question about separating the data:
You have a fair amount of data (approximately 1200 data points). The data is divided into 4 classes, and you want to train a classifier to distinguish those 4 classes.
One approach is to set aside roughly 10% of the data as a test set (about 150 data points) that you lock in the safe! The standard rule is to select the test data randomly. Purely random selection seems reasonable when there are two classes. Should I also use purely random selection when there are 4 classes, or should I ensure that the test data has a class distribution similar to that of the full data set?
Here are the scenarios:
1) Selecting the test data
The original data has a "certain" distribution over the 4 classes. When I select the test data, should I make sure that it contains the same class distribution as the original data? Or is that considered snooping?
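
To make scenario 1 concrete, here is a minimal sketch of what "same class distribution in the test set" would look like in practice (usually called a stratified split). It is only an illustration: the class sizes (480/360/240/120), the 12.5% test fraction, and the helper name `stratified_split` are all made up for the example, and only the class labels are used to decide which points go into the test set. Libraries such as scikit-learn offer the same idea through the `stratify` argument of `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same share in the test set as in the full data set."""
    rng = random.Random(seed)

    # Group the point indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # random choice *within* each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced 4-class data set with 1200 points (assumed sizes).
labels = [0] * 480 + [1] * 360 + [2] * 240 + [3] * 120

train_idx, test_idx = stratified_split(labels, test_fraction=0.125)
test_counts = Counter(labels[i] for i in test_idx)

print(len(train_idx), len(test_idx))
print(dict(sorted(test_counts.items())))
```

With these assumed sizes the test set holds 150 points and the 40/30/20/10 percent class mix of the full data carries over exactly; with a purely random draw the mix would only match on average.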

2) Selecting the folds for 10-fold CV
When I divide the training data into folds for cross-validation, should I ensure that the same class distribution is present in each fold?
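
For scenario 2, the same idea applied to the folds can be sketched as follows (stratified k-fold). Again this is just an illustration under assumptions: the 1050 training points and their class sizes are invented, and `stratified_folds` is a hypothetical helper, not a library function. scikit-learn provides a ready-made version as `StratifiedKFold`.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split point indices into k folds, dealing each class out
    round-robin so every fold gets a near-equal share of each class."""
    rng = random.Random(seed)

    # Group the point indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # random order within the class
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal like cards, one per fold
    return folds


# Hypothetical training set of 1050 points left after holding out a test set.
labels = [0] * 420 + [1] * 315 + [2] * 210 + [3] * 105

folds = stratified_folds(labels, k=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

for n, counts in enumerate(fold_counts):
    print(n, dict(sorted(counts.items())))
```

Each fold then serves once as the validation set while the other nine are used for training. Because classes are dealt out one point at a time, the per-class counts in any two folds differ by at most one.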