LFD Book Forum  

LFD Book Forum > Book Feedback - Learning From Data > Chapter 5 - Three Learning Principles

#1
Old 08-05-2012, 02:44 AM
rainbow
Member
Join Date: Jul 2012
Posts: 41
Data snooping (test vs. train data)

Do I understand data snooping correctly if I take it to be an issue related only to the test data itself, i.e., cases where inspection of the test data affects the learning in some way? For example:
- The test data has been used for estimation.
- The learning model is changed after evaluating the performance on the test data.

How does data snooping relate to the training data (if at all)? "How much" can you look into this data? Is it a violation, with respect to data snooping, to look at the target variable y while doing exploratory data analysis such as PCA, or while creating features? For example, when creating a non-linear feature by cutting a continuous variable such as age into a discrete feature with respect to y?
#2
Old 08-09-2012, 05:34 AM
magdon
RPI
Join Date: Aug 2009
Location: Troy, NY, USA
Posts: 592
Re: Data snooping (test vs. train data)

You can do anything you want with the training data. Here is a very simple prescription that you can use and it will never let you down:

Take your test data and lock it up in a password protected encrypted file to which only your client has the password. (Note: you can be your own client.)

Now do whatever you want with the training data to obtain your final hypothesis g. Give it to the client. When the client asks you what performance to expect with g, you ask her to open the test data file and run your g on that file. The result on the test data is the performance to expect. The client is now stuck with that g and that test performance. You are not allowed to change g any more.

Now let's reexamine the statement "whatever you want with the training data". You may want to be careful here with your choice of "whatever" if you want to have some idea of whether your client will fire you or not after examining the test data. That is, if you want your performance on the training data to give you some indication of what the client will see in the test performance, then use a smaller hypothesis set (for example).
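This prescription can be sketched in code. The data, split sizes, and least-squares fit below are invented for illustration, not a recipe from the book; the point is only the workflow: the test set is set aside first, never consulted while learning, and evaluated exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two features, labels from a noisy linear target.
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200))

# Split once, up front. The test set is "locked up": it is never looked at
# while choosing the model or the final hypothesis g.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Do whatever you want with the training data (here, a least-squares fit).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
g = lambda X: np.sign(X @ w)

# One evaluation on the test set, at the very end. g is frozen afterwards;
# E_test is the performance the client should expect.
E_test = float(np.mean(g(X_test) != y_test))
```

If g were changed after seeing E_test, the test set would start acting like a second training set and its estimate would no longer be trustworthy.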

__________________
Have faith in probability
#3
Old 08-09-2012, 09:23 AM
rainbow
Member
Join Date: Jul 2012
Posts: 41
Re: Data snooping (test vs. train data)

Thanks. This was helpful.
#4
Old 08-09-2012, 01:02 PM
rseiter
Junior Member
Join Date: Jul 2012
Posts: 2
Re: Data snooping (test vs. train data)

Thanks @magdon. To help my understanding, I'd like to translate this into a more concrete example. For the heart attack / discrete age bins example from the lecture, I see at least three different approaches. Here is my attempt to assess how d_vc changes by approach. I would appreciate any feedback you can offer.

1. The number of bins and cutoff ages are added as variable parameters for learning. I would expect this to add to d_vc as the number of parameters we add.
2. I decide on the number of bins and cutoff ages by looking at the training data. I would expect this to add to d_vc as the number of parameters we add. Is this exactly comparable to case 1? Is it possible that d_vc would be even higher if I considered adding more parameters but decided the data did not justify it?
3. I decide on the number of bins and cutoff ages based on my problem domain knowledge (without looking at my current set of training or test data). If I understand the statement at the end of lecture 9 correctly this complexity would not be charged to d_vc. Could d_vc even be considered to have decreased if the bin (a less complex measure since it has fewer alternatives?) replaces the age in the feature set?
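To make cases 1 and 2 concrete, here is a hypothetical sketch (the ages, cutoff grid, and noise level are all invented). In case 1 the algorithm searches candidate cutoffs on the training data; in case 2 a human picks a cutoff after eyeballing the same data. In VC terms both should be charged for the full set of cutoffs considered, since the effective hypothesis set contains one threshold hypothesis per candidate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy data: risk becomes +1 above age 55, with 10% label noise.
age = rng.uniform(20, 80, size=100)
y = np.where(age > 55, 1, -1)
flip = rng.random(100) < 0.1
y[flip] *= -1

# Case 1: the algorithm treats the cutoff t as a learned parameter and
# searches a grid of candidates on the training data.
cutoffs = np.arange(20, 81)
train_error = np.array([np.mean(np.where(age > t, 1, -1) != y) for t in cutoffs])
t_star = int(cutoffs[np.argmin(train_error)])

# Case 2 would be a human looking at the same training data and picking,
# say, t = 55 by eye. The search happened in the human's head rather than
# in a loop, but the set of hypotheses effectively explored is the same.
```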

Thanks for any help. As noted in the lecture this seems like an important practical question.
#5
Old 08-10-2012, 02:32 PM
yaser
Caltech
Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,472
Re: Data snooping (test vs. train data)

Just to clarify: By bins and cutoff, you mean taking the input variable "age" which is a real number and discretizing it into a finite number of values? In general, processing the inputs of a data set without looking at the outputs does not contaminate the data.
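As an illustration of input-only processing (a sketch with invented numbers, not from the lecture): the bin edges below are computed from the observed ages alone, so the outputs y are never consulted anywhere in the preprocessing.

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(20, 80, size=100)

# Bin edges derived from the inputs only: the quartiles of the observed ages.
# No label y appears anywhere in this preprocessing step.
edges = np.quantile(age, [0.25, 0.5, 0.75])
age_binned = np.digitize(age, edges)   # discrete feature with values 0..3
```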
__________________
Where everyone thinks alike, no one thinks very much
#6
Old 08-10-2012, 09:10 PM
rseiter
Junior Member
Join Date: Jul 2012
Posts: 2
Re: Data snooping (test vs. train data)

Yes. The three cases I am trying to distinguish (i.e., understand how they compare in their effect on d_vc) are:
1. The learning algorithm chooses the discretization to use.
2. I choose the discretization to use based on looking at the data (snooping).
3. I choose the discretization to use based on my prior knowledge (without looking at the data).

Based on your last sentence, case 3 does not adversely impact d_vc because I do not look at the data. Is there any change in d_vc because the discretized age loses the ability to distinguish some of the data points? I'm having trouble thinking about how different types of features (say integer valued, real valued, discretized ages, and multiple binary flags for different age ranges) affect d_vc.
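On the question of whether discretization can decrease complexity, here is a toy count of my own (using single-threshold hypotheses on age, not anything from the lecture). On n distinct continuous ages a threshold can realize n + 1 different labelings; after binning into k distinct values it can realize at most k + 1, so the effective hypothesis set on the data can only shrink.

```python
import numpy as np

def n_threshold_labelings(x):
    # Number of distinct labelings that h_t(x) = sign(x - t) can produce
    # on the points x as the threshold t varies: one more than the number
    # of distinct values of x.
    return len(np.unique(x)) + 1

age = np.array([23.0, 31.5, 40.2, 47.9, 55.1, 63.4, 71.8])
age_binned = np.digitize(age, [35.0, 50.0, 65.0])   # 4 bins: 0..3

n_cont = n_threshold_labelings(age)         # 7 distinct ages -> 8 labelings
n_disc = n_threshold_labelings(age_binned)  # 4 distinct bins -> 5 labelings
```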

My understanding is that cases 1 and 2 are the same (assuming the same hypothesis set) because the VC analysis depends only on the hypothesis set and not the learning algorithm. Are there any subtleties I'm missing here?

Thank you!
#7
Old 08-10-2012, 10:34 PM
yaser
Caltech
Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,472
Re: Data snooping (test vs. train data)

You are right that case 3 patently has no snooping. It seems to me that for both 1 and 2 you can depend entirely on the inputs of the data set without looking at the outputs (labels), so that also would not involve snooping.
__________________
Where everyone thinks alike, no one thinks very much


The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.