LFD Book Forum


Andrs 09-14-2012 01:49 AM

Separating CV/test data and class distribution
 
I am not sure whether this question belongs here, but I will try!
Here is a training scenario and a question about separating the data:
You have a fair amount of data (approximately 1200 data points) divided into 4 classes, and you want to train a classifier to distinguish those 4 classes.
One approach is to select approx. 10% of the data as a test set (150 data points) that you lock in the safe! The standard rule here is to select the test data randomly. This random selection seems reasonable if I have two classes. Should I also use purely random selection if I have 4 classes, or should I ensure that the test data has a class distribution similar to that of the full data set?
Here are the scenarios:
1) Selecting test data
The original data has a certain distribution over the 4 classes. When I select the test data, should I make sure that it contains the same class distribution as the original data? Or would that be considered snooping?

2) Selecting the folds for 10-fold CV:
When I divide the training data into folds for cross-validation, should I ensure that each fold has the same class distribution?
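
To make the two scenarios concrete, here is a rough sketch of what "keeping the same class distribution" (often called stratified selection) could look like in Python. It is only an illustration of the mechanics and does not settle whether doing so counts as snooping. The function names `stratified_split` and `stratified_folds` and the class sizes in `labels` are made up for the example. The 0.125 fraction is chosen only so that the test set has 150 of the 1200 points.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)  # random choice *within* each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, indices, k, seed=0):
    """Deal the given indices into k folds, class by class, in round-robin order."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    pos = 0  # keeps running across classes so fold sizes stay balanced
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for i in idx:
            folds[pos % k].append(i)
            pos += 1
    return folds


# Hypothetical labels: 1200 points in 4 classes (class sizes are invented).
labels = [0] * 480 + [1] * 360 + [2] * 240 + [3] * 120

# Scenario 1: lock away a test set with the same class proportions.
train_idx, test_idx = stratified_split(labels, test_fraction=0.125)

# Scenario 2: split the remaining training data into 10 class-balanced folds.
folds = stratified_folds(labels, train_idx, k=10)
```

Note that this kind of selection looks only at the class labels to decide group membership. Within each class the choice is still random.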

