#1
Do I understand the issue of data snooping correctly if it is only an issue related to the test data itself? For example, when inspection of the test data affects the learning in some way:

- The test data has been used for estimation.
- The learning model is changed after evaluating its performance on the test data.

How does data snooping relate to the training data (if at all)? "How much" can you look into this data? Is it a violation, with respect to data snooping, to look at the target variable y if you are interested in exploratory data analysis such as PCA, or if you want to create features? For example, suppose you want to create a non-linear feature by cutting a continuous variable such as age into a discrete feature with respect to y.
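To illustrate that last example concretely, here is a hypothetical sketch (the data and all names in it are my own invention, assuming scikit-learn) of target-aware binning, where the cutoffs are derived using y but fitted on the training split only:

Code:
# Hypothetical illustration: creating a discrete age feature *using y*
# (supervised discretization). Done on the full dataset before splitting,
# this would be data snooping; done on the training split only, it is
# ordinary learning.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=500)
y = ((age > 55) ^ (rng.random(500) < 0.1)).astype(int)  # noisy target

age_tr, age_te, y_tr, y_te = train_test_split(age, y, random_state=0)

# A shallow tree fit on age alone picks cutoffs that are informative about y.
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
tree.fit(age_tr.reshape(-1, 1), y_tr)  # cutoffs come from training data only
cutoffs = sorted(t for t in tree.tree_.threshold if t > 0)
age_te_binned = np.digitize(age_te, cutoffs)  # reuse the same edges on test data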
#2
You can do anything you want with the training data. Here is a very simple prescription that you can use and it will never let you down:
Take your test data and lock it up in a password-protected encrypted file to which only your client has the password. (Note: you can be your own client.) Now do whatever you want with the training data to obtain your final hypothesis g. Hand g to your client, who unlocks the test data and evaluates the test error E_test(g); since the test data played no part in obtaining g, E_test(g) is an unbiased estimate of the out-of-sample error E_out(g).

Now let's reexamine the statement "whatever you want with the training data". You may want to be careful here with your choice of "whatever" if you want to have some idea whether your client will fire you or not, after examining the test data.
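As a concrete illustration of this prescription (a minimal sketch, assuming scikit-learn; the dataset and the parameter grid are arbitrary choices of mine, not part of the prescription itself):

Code:
# A minimal sketch of the "lock up the test set" prescription.
# All snooping-prone activity (feature engineering, model selection,
# cross-validation) happens on the training split; the test split is
# read exactly once, for the final report.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "Whatever you want with the training data": here, cross-validated
# model selection over the regularization strength.
search = GridSearchCV(LogisticRegression(max_iter=5000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# The one and only look at the test data: the client's unbiased estimate.
print("estimated out-of-sample accuracy:", search.score(X_te, y_te))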
__________________
Have faith in probability
#3
Thanks. This was helpful.
#4
Thanks @magdon. To help my understanding, I'd like to translate this into a more concrete example. For the heart attack / discrete age bins example from the lectures, I see at least three different approaches. Here is my attempt to assess how d_vc changes with each approach; I would appreciate any feedback you can offer.

1. The number of bins and the cutoff ages are added as variable parameters for learning (see the sketch after this list). I would expect this to add to d_vc by the number of parameters we add.

2. I decide on the number of bins and cutoff ages by looking at the training data. I would expect this to add to d_vc by the number of parameters we add as well. Is this exactly comparable to case 1? Is it possible that d_vc would be even higher if I considered adding more parameters but decided the data did not justify it?

3. I decide on the number of bins and cutoff ages based on my problem domain knowledge (without looking at my current set of training or test data). If I understand the statement at the end of Lecture 9 correctly, this complexity would not be charged to d_vc. Could d_vc even be considered to have decreased if the bin (a less complex feature, since it has fewer alternatives) replaces the age in the feature set?

Thanks for any help. As noted in the lecture, this seems like an important practical question.
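To make approach 1 concrete, here is a hypothetical sketch (the data, the KBinsDiscretizer-based pipeline, and the parameter grid are all my own illustration, not from the lectures) in which the number of bins is treated as just another parameter selected by cross-validation on the training data only:

Code:
# Hypothetical sketch of approach 1: the number of age bins is a parameter
# of the learning process itself, chosen by cross-validation on the
# training data only, so its flexibility is charged to learning.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(1)
age = rng.uniform(20, 80, size=(400, 1))
y = ((age[:, 0] > 50) ^ (rng.random(400) < 0.15)).astype(int)

pipe = Pipeline([
    ("bins", KBinsDiscretizer(encode="onehot-dense", strategy="quantile")),
    ("clf", LogisticRegression(max_iter=1000)),
])
# The bin count is searched over like any other parameter.
search = GridSearchCV(pipe, {"bins__n_bins": [2, 3, 5, 8]}, cv=5)
search.fit(age, y)
print("chosen number of bins:", search.best_params_["bins__n_bins"])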
#5
Quote:
3. I decide on the number of bins and cutoff ages based on my problem domain knowledge (without looking at my current set of training or test data)...

Your assessment is essentially right. In cases 1 and 2, the discretization is effectively chosen using the data, so that flexibility must be charged to the learning process: explicitly through the extra parameters in case 1, and implicitly through data snooping in case 2. In case 3, since you decided on the discretization without looking at the data, it is not charged to d_vc.
__________________
Where everyone thinks alike, no one thinks very much
#6
Quote:
1. The learning algorithm chooses the discretization to use.
2. I choose the discretization to use based on looking at the data (snooping).
3. I choose the discretization to use based on my prior knowledge (without looking at the data).

Based on your last sentence, case 3 does not adversely impact d_vc because I do not look at the data. Is there any change in d_vc because the discretized age loses the ability to distinguish some of the data points (a small sketch below tries to illustrate this)? I'm having trouble thinking about how different types of features (say, integer-valued, real-valued, discretized ages, and multiple binary flags for different age ranges) affect d_vc.

My understanding is that cases 1 and 2 are the same (assuming the same hypothesis set) because the VC analysis depends only on the hypothesis set and not on the learning algorithm. Are there any subtleties I'm missing here? Thank you!
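To see one facet of this concretely, here is a small illustrative sketch (my own construction, not from the course) that enumerates the dichotomies a simple threshold hypothesis set h(x) = sign(x - t) can realize on raw ages versus binned ages; binning merges points, so it can only shrink the set:

Code:
# Illustrative sketch: dichotomies realizable by threshold classifiers
# h(x) = sign(x - t) on raw ages versus the same ages after binning.
# Binning merges points into the same bin, so fewer dichotomies remain.
import numpy as np

ages = np.array([23.0, 31.0, 47.0, 62.0])
bins = np.digitize(ages, [40.0])  # two bins: under/over 40

def threshold_dichotomies(x):
    """All labelings produced by sign(x - t) over every threshold t."""
    labelings = set()
    for t in np.concatenate(([x.min() - 1], x + 0.5)):
        labelings.add(tuple(np.sign(x - t) >= 0))
    return labelings

print(len(threshold_dichotomies(ages)))                 # 5 dichotomies on raw ages
print(len(threshold_dichotomies(bins.astype(float))))   # only 3 after binning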
#7
Quote:
My understanding is that cases 1 and 2 are the same (assuming the same hypothesis set) because the VC analysis depends only on the hypothesis set and not on the learning algorithm.

Your understanding is correct: the VC dimension is a property of the hypothesis set, not of the learning algorithm, so cases 1 and 2 come out the same once you account for the full hypothesis set that was effectively explored. As for the discretized age, merging input values can only reduce the dichotomies that the hypothesis set can realize on your data, so it cannot increase the complexity; it may well decrease it.
__________________
Where everyone thinks alike, no one thinks very much