08-09-2012, 06:34 AM
magdon
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 597
Re: Data snooping (test vs. train data)

You can do anything you want with the training data. Here is a very simple prescription that you can use and it will never let you down:

Take your test data and lock it up in a password-protected, encrypted file to which only your client has the password. (Note: you can be your own client.)

Now do whatever you want with the training data to obtain your final hypothesis g. Give it to the client. When the client asks you what performance to expect with g, you ask her to open the test data file and run your g on that file. The result on the test data is the performance to expect. The client is now stuck with that g and that test performance. You are not allowed to change g any more.
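The prescription above can be sketched in a few lines. This is only an illustration under assumptions that are not part of the original post: the synthetic data generator `make_data`, the 10% label noise, the 800/200 split, and the grid of threshold hypotheses are all made up for the example. The point is the order of operations: split first, touch only `train` while choosing g, and evaluate on `locked_test` exactly once.

```python
import random

def make_data(n, rng):
    # Hypothetical synthetic data: y = +1 when x > 0.3, with 10% label noise.
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x > 0.3 else -1
        if rng.random() < 0.1:
            y = -y
        data.append((x, y))
    return data

def error(t, data):
    # Fraction of points misclassified by h(x) = +1 if x > t else -1.
    return sum(1 for x, y in data if (1 if x > t else -1) != y) / len(data)

rng = random.Random(0)
data = make_data(1000, rng)

# Step 1: lock the test data away before doing anything else.
train, locked_test = data[:800], data[800:]

# Step 2: do whatever you want, but only with the training data.
# Here "whatever" is picking the best threshold from a small grid.
thresholds = [-1.0 + 0.05 * k for k in range(41)]
g = min(thresholds, key=lambda t: error(t, train))
e_in = error(g, train)

# Step 3: the client opens the file and runs g once. g is now frozen.
e_test = error(g, locked_test)

print("g: threshold =", g)
print("training error:", e_in)
print("test error (what the client should expect):", e_test)
```

If you go back and change g after seeing `e_test`, the number no longer means what you told the client it means.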

Now let's reexamine the statement "whatever you want with the training data". You may want to be careful with your choice of "whatever" if you want to have some idea, before she examines the test data, of whether your client will fire you. That is, if you want your performance on the training data to give you some indication of what the client will see on the test data, then use a smaller hypothesis set (for example).
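To make the remark about the hypothesis set concrete, here is a small sketch under the same kind of made-up assumptions (synthetic 1-D data with 10% label noise, a 1000/1000 split; none of this is from the original post). It compares a small hypothesis set (41 thresholds) with a very rich one (a 1-nearest-neighbor rule, which can memorize the training labels) and reports how far the training error is from the test error in each case.

```python
import bisect
import random

def make_data(n, rng):
    # Hypothetical synthetic data: y = +1 when x > 0.3, with 10% label noise.
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x > 0.3 else -1
        if rng.random() < 0.1:
            y = -y
        data.append((x, y))
    return data

def error(predict, data):
    # Fraction of points on which the hypothesis disagrees with the label.
    return sum(1 for x, y in data if predict(x) != y) / len(data)

def fit_threshold(train):
    # Small hypothesis set: 41 thresholds on a fixed grid.
    def make(t):
        return lambda x: 1 if x > t else -1
    grid = [-1.0 + 0.05 * k for k in range(41)]
    best = min(grid, key=lambda t: error(make(t), train))
    return make(best)

def fit_nearest_neighbor(train):
    # Very rich hypothesis set: predict the label of the closest training point.
    pts = sorted(train)
    xs = [x for x, _ in pts]
    def predict(x):
        i = bisect.bisect_left(xs, x)
        neighbors = [j for j in (i - 1, i) if 0 <= j < len(xs)]
        j = min(neighbors, key=lambda k: abs(xs[k] - x))
        return pts[j][1]
    return predict

rng = random.Random(1)
data = make_data(2000, rng)
train, locked_test = data[:1000], data[1000:]

g_small = fit_threshold(train)
g_large = fit_nearest_neighbor(train)

gap_small = abs(error(g_small, locked_test) - error(g_small, train))
gap_large = abs(error(g_large, locked_test) - error(g_large, train))

print("small hypothesis set: |E_test - E_in| =", gap_small)
print("rich hypothesis set:  |E_test - E_in| =", gap_large)
```

Both choices are legal under the prescription, since neither touches the test data. The difference is only in how much the training error tells you ahead of time.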

Originally Posted by rainbow
Do I understand the issue of data snooping correctly if I take it to be an issue related only to the test data itself? For example, when the inspection of test data affects the learning in some way:
- The test data has been used for estimation.
- The learning model is changed after evaluating the performance on the test data.

How does data snooping relate to the train data (if at all)? "How much" can you look into this data? Is it a violation wrt. data snooping to look at the target variable y if you are interested in exploratory data analysis such as PCA, or if you want to create features? For example, if you want to create a non-linear feature by cutting a continuous variable such as age into a discrete feature with respect to y?
Have faith in probability