LFD Book Forum


Andrs 09-15-2012 01:09 AM

Selecting "representative" test data
 
I had a similar question posted in another forum, but maybe it belongs in this general forum.
I have data with multiple classes, and I want to divide it into a "big chunk for training" and a "smaller chunk for testing". The original data has a certain distribution over the classes (i.e., x% for class 1, y% for class 2, z% for class 3).
How should I select the "test data" (and "training set") from this multi-class data? The basic assumption is that there is enough data to start with! If I use pure random selection to create the two sets, the "test set" may not contain all the classes and may not be representative (the test set is much smaller than the training set). An alternative is to find the class distribution in the data and ensure that the "test data" contains approximately the same distribution. But then I am really looking into the data, and there is a risk of snooping. Of course, I may not use this class-distribution information in the training process, but...
Is this a relevant question, or is there a misunderstanding on my side? I would like to discuss this issue: what are the risks here, and what are the best practices?
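For concreteness, the two splitting strategies described above might look like this in scikit-learn (a hypothetical sketch; the toy data and class proportions are made up for illustration):

Code:

from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 points, 3 classes with an imbalanced distribution.
rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.25, 0.05])
X = rng.normal(size=(1000, 2))

# Pure random split: a small test set may under- or over-represent
# the rare class purely by chance.
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.1, random_state=0)

# Stratified split: the test set preserves the class proportions
# (x%, y%, z%) of the original data.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

print(Counter(y_test_rand), Counter(y_test_strat))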

magdon 09-15-2012 07:03 AM

Re: Selecting "representative" test data
 
You really have no option but to select randomly for both the test and training data. The problem that the test set may not be representative is not a problem with the selection of the data but with the size of the test set. In that case, your statement that the test data may not be representative (due to statistical fluctuations) means that you could not trust the result on it anyway (even if it happened to contain the right proportion of each class).

A better option for you is to move to the cross-validation framework, which even allows you to use a "test set" of size 1. (See Chapter 4 for more details.)
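To make the cross-validation suggestion concrete, here is a minimal sketch using scikit-learn's leave-one-out estimator, where each point serves once as a "test set" of size 1 (the library choice, toy data, and model are illustrative assumptions, not from the book):

Code:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy 3-class problem standing in for the data described above.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# Leave-one-out: each point is held out once; the average error over
# all points estimates the out-of-sample error.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("E_cv estimate:", 1 - scores.mean())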

Quote:

Originally Posted by Andrs (Post 5309)
I had a similar question posted in another forum, but maybe it belongs in this general forum.
I have data with multiple classes, and I want to divide it into a "big chunk for training" and a "smaller chunk for testing". The original data has a certain distribution over the classes (i.e., x% for class 1, y% for class 2, z% for class 3).
How should I select the "test data" (and "training set") from this multi-class data? The basic assumption is that there is enough data to start with! If I use pure random selection to create the two sets, the "test set" may not contain all the classes and may not be representative (the test set is much smaller than the training set). An alternative is to find the class distribution in the data and ensure that the "test data" contains approximately the same distribution. But then I am really looking into the data, and there is a risk of snooping. Of course, I may not use this class-distribution information in the training process, but...
Is this a relevant question, or is there a misunderstanding on my side? I would like to discuss this issue: what are the risks here, and what are the best practices?


Andrs 09-15-2012 12:02 PM

Re: Selecting "representative" test data
 
Quote:

Originally Posted by magdon (Post 5315)
You really have no option but to select randomly for both the test and training data. The problem that the test set may not be representative is not a problem with the selection of the data but with the size of the test set. In that case, your statement that the test data may not be representative (due to statistical fluctuations) means that you could not trust the result on it anyway (even if it happened to contain the right proportion of each class).

A better option for you is to move to the cross-validation framework, which even allows you to use a "test set" of size 1. (See Chapter 4 for more details.)

Thanks Magdon!
I will use cross validation as my basic approach. However, I was also thinking of putting aside some data for testing. The reason is that I am new to the area, and there is a risk that I will overuse the CV data to define my hypothesis/parameters (yielding too optimistic results). The test set would be my real proof of generalization (a limit on E_out that could increase my confidence in the results). Of course, one could question the value of this limited, randomly selected test set as an upper limit on E_out compared to E_cv. Maybe E_cv is the best bet in the end... if I do not overuse it :clueless:
Hopefully my data (in-sample) is randomly selected and representative of the out-of-sample population. I do not know the out-of-sample distribution, and the best I can do is use random selection to pick a "test sample" (as you suggested). The question is whether 10% of my in-sample data will be enough for testing.
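One way to set up the combination described above, sketched under illustrative assumptions (synthetic data, a logistic-regression model, 10-fold CV, and a 10% holdout; none of these come from the thread):

Code:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Toy multi-class data standing in for the in-sample set.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# Set aside 10% once, at random, and never touch it during selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Use cross validation on the remaining 90% to choose the model/parameters.
model = LogisticRegression(max_iter=1000)
e_cv = 1 - cross_val_score(model, X_dev, y_dev, cv=10).mean()

# Only after all choices are fixed, evaluate once on the held-out 10%.
model.fit(X_dev, y_dev)
e_test = 1 - model.score(X_test, y_test)
print(f"E_cv = {e_cv:.3f}, E_test = {e_test:.3f}")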

