LFD Book Forum  

Old 09-15-2012, 01:09 AM
Andrs
Member
 
Join Date: Jul 2012
Posts: 47
Selecting "representative" test data

I posted a similar question in another forum, but maybe it belongs in this general forum.
I have data with multiple classes and I want to divide it into a big chunk for training and a smaller chunk for testing. The original data has a certain distribution over the classes (i.e. x% for class 1, y% for class 2, z% for class 3).
How should I select the "test data" (and the "training set") from this multi-class input? The basic assumption is that there is enough data to start with! If I use a purely random selection to create the two sets, the test set may not contain all the classes and may not be representative, because the test set is much smaller than the training set. An alternative is to find the class distribution in the data and ensure that the test data has approximately the same distribution. But then I am really looking into the data, and there is a risk of snooping. Of course, I might not use this class distribution information in the training process, but...
Is this a relevant question, or is there a misunderstanding on my side? I would like to discuss this issue: what are the risks here, and what has worked best in your experience?
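
To make the two alternatives concrete, here is a minimal Python sketch, not a recommendation. It assumes labels are held in a plain list; the function names `prob_class_missing` and `stratified_split`, the 70% / 25% / 5% class mix, and the 10% test fraction are all hypothetical choices made for illustration. The first function uses the hypergeometric formula to estimate how likely a purely random test set is to miss a class entirely; the second samples each class separately so that the test set keeps roughly the original proportions. Only the labels are consulted when partitioning, never the input features. Whether that is enough to rule out snooping is exactly the open question in this post.

```python
import math
import random
from collections import Counter, defaultdict


def prob_class_missing(n_total, n_class, n_test):
    """Probability that a uniformly random test set of size n_test
    (drawn without replacement) contains no example of a class that
    has n_class members among n_total examples (hypergeometric)."""
    return math.comb(n_total - n_class, n_test) / math.comb(n_total, n_test)


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so the
    test set roughly preserves the class proportions. Only the labels are
    consulted; the input features are never inspected."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)
        # At least one test example per class; keep at least one for
        # training when the class has two or more members.
        k = max(1, round(test_fraction * len(idx)))
        if len(idx) > 1:
            k = min(k, len(idx) - 1)
        test_idx.extend(idx[:k])
        train_idx.extend(idx[k:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical data: 3 classes with 70% / 25% / 5% of 200 examples.
labels = [1] * 140 + [2] * 50 + [3] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)
test_counts = Counter(labels[i] for i in test_idx)
p_miss = prob_class_missing(n_total=200, n_class=10, n_test=20)
```

With these made-up numbers, a purely random 20-example test set would miss the rarest class about a third of the time, while the stratified split always puts at least one example of every class in the test set.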

Tags
data snooping, test, training set

All times are GMT -7.

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.