I am not sure whether this question belongs here, but I will try!
Here is a training scenario and a question about how to split the data.

You have a fair amount of data (approximately 1200 data points) divided into 4 classes, and you want to train a classifier to detect those 4 classes. One approach is to select approximately 10% of the data as a test set (150 data points) that you lock in the safe! The standard rule is to select the test data randomly. Random selection seems reasonable if I have two classes. Should I also use purely random selection if I have 4 classes, or should I ensure that the test data has a similar class distribution?

Here are the scenarios:

1) Selecting the test data: The original data has a "certain" distribution over the 4 classes. When I select the test data, should I make sure that it contains the same class distribution as the original data? Or is that considered snooping?

2) Selecting the folds for 10-fold CV: When I divide the training data into folds for cross-validation, should I ensure that the same class distribution is present in each fold?
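To make the two scenarios concrete, here is a minimal sketch of what I mean by "keeping the class distribution" in the test set and in the folds. The class counts (480/360/240/120) are made up for illustration, and the helper names `stratified_split` and `stratified_folds` are my own, not from any library. Only the labels are used to form the splits, never the features.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), sampling the same fraction from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # random selection, but within one class at a time
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(indices, labels, k=10, seed=0):
    """Deal the shuffled indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    pos = 0  # running position, so fold sizes stay balanced across classes
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for i in idx:
            folds[pos % k].append(i)
            pos += 1
    return folds


# Hypothetical labels: 1200 points in 4 unevenly sized classes.
labels = [0] * 480 + [1] * 360 + [2] * 240 + [3] * 120

# Scenario 1: hold out 150 of 1200 points (fraction 0.125), class by class.
train_idx, test_idx = stratified_split(labels, test_fraction=0.125)
test_counts = Counter(labels[i] for i in test_idx)

# Scenario 2: split the remaining training data into 10 stratified folds.
folds = stratified_folds(train_idx, labels, k=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

print("test set:", dict(test_counts))
for n, counts in enumerate(fold_counts):
    print("fold", n, dict(sorted(counts.items())))
```

If scikit-learn is available, I believe the same thing can be done with `train_test_split(..., stratify=y)` and `StratifiedKFold`; the hand-rolled version above is only meant to show the idea.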
Tags: cross-validation, test