LFD Book Forum  

Andrs
09-14-2012, 01:49 AM
Separating CV and test data, and class distribution

I am not sure whether this question belongs here, but I will try.
Here is a training scenario and a question about how to separate the data.
You have a fair amount of data (approximately 1200 data points) divided into 4 classes, and you want to train a classifier to distinguish those 4 classes.
One approach is to set aside roughly 10% of the data as a test set (150 data points, strictly 12.5% of 1200) and lock it in the safe. The standard rule is to select the test data at random. Pure random selection seems reasonable with two classes. With 4 classes, should I still use pure random selection, or should I ensure that the test data has a class distribution similar to that of the full data set?
Here are the two scenarios:
1) Selecting the test data
The original data has a certain distribution over the 4 classes. When I select the test data, should I make sure that it has the same class distribution as the original data? Or would that be considered data snooping?

2) Selecting the folds for 10-fold CV
When I divide the training data into folds for cross-validation, should I ensure that each fold has the same class distribution?
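
To make the two scenarios concrete, here is a minimal sketch of what "keeping the same class distribution" could look like in practice (often called stratified sampling). It only shows the mechanics and does not settle whether stratifying is the right choice. The class sizes are hypothetical (the post only says about 1200 points in 4 classes), and the function names stratified_split and stratified_folds are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Take the same fraction from every class (rounded to an integer).
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, indices, k, seed=0):
    """Deal each class's shuffled indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    pos = 0  # running position, so leftovers are spread over different folds
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for i in idx:
            folds[pos % k].append(i)
            pos += 1
    return folds


# Hypothetical class sizes (an assumption; they only need to sum to 1200).
class_sizes = {"A": 600, "B": 300, "C": 200, "D": 100}
labels = [c for c, n in class_sizes.items() for _ in range(n)]

# Scenario 1: lock away a test set of about 150 of the 1200 points.
train_idx, test_idx = stratified_split(labels, test_fraction=0.125)
test_counts = Counter(labels[i] for i in test_idx)

# Scenario 2: 10 folds over the remaining training data.
folds = stratified_folds(labels, train_idx, k=10)
fold_counts = [Counter(labels[i] for i in f) for f in folds]

print("test set:", dict(test_counts))
print("fold 0:  ", dict(fold_counts[0]))
```

Both functions look only at the class labels, never at the inputs or at any model output. If scikit-learn is available, train_test_split with its stratify argument and StratifiedKFold cover the same two cases.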

Tags
cross-validation, test

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.