Data snooping (test vs. train data)
Do I understand the issue of data snooping correctly if I take it to be an issue related only to the test data itself? For example, when inspecting the test data affects the learning in some way:
- The test data has been used for estimation.
- The learning model is changed after evaluating its performance on the test data.
How does data snooping relate to the training data (if at all)? How much can you look into this data? Is it a violation with respect to data snooping to look at the target variable y if you are interested in exploratory data analysis such as PCA, or if you want to create features? For example, if you want to create a non-linear feature by cutting a continuous variable such as age into a discrete feature with respect to y? |
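To make the first scenario above concrete (test data used for estimation), here is a minimal sketch and not a definitive experiment. It assumes pure-noise features and random labels, so no method can truly beat 50% out of sample. The names `run_trial` and `mean_accuracy`, the sizes `n`, `d`, `k`, and the simple sign-vote classifier are all choices made for this illustration, not anything from the lectures. The only difference between the two settings is whether the feature-selection step is allowed to look at y on the held-out rows.

```python
import numpy as np

def run_trial(rng, snoop, n=100, d=2000, k=20):
    """One experiment on pure-noise data; returns accuracy on the held-out half."""
    X = rng.standard_normal((n, d))          # features carry no information about y
    y = rng.choice([-1.0, 1.0], size=n)      # labels are coin flips
    train, test = np.arange(n // 2), np.arange(n // 2, n)

    # Rows whose labels the feature-selection step is allowed to see.
    seen = np.arange(n) if snoop else train
    score = X[seen].T @ y[seen] / len(seen)  # average of x * y per feature
    chosen = np.argsort(-np.abs(score))[:k]  # keep the k strongest-looking features

    # The weights are always fitted on the training half only.
    w = np.sign(X[train][:, chosen].T @ y[train])
    pred = np.sign(X[test][:, chosen] @ w)
    return float(np.mean(pred == y[test]))

def mean_accuracy(snoop, trials=20, seed=0):
    """Average held-out accuracy over several independent trials."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run_trial(rng, snoop) for _ in range(trials)]))
```

Calling `mean_accuracy(snoop=False)` and `mean_accuracy(snoop=True)` lets you compare the two settings. Since the data are noise by construction, any held-out accuracy noticeably above one half in the snooped setting comes from the selection step having seen the held-out labels, not from learning.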
Re: Data snooping (test vs. train data)
Thanks. This was helpful.
|
Re: Data snooping (test vs. train data)
Thanks @magdon. To help my understanding, I'd like to translate this into a more concrete example. For the heart attack/discrete age bins example from the lecture, I see at least three different approaches. Here is my attempt to assess how d_vc changes with each approach. I would appreciate any feedback you can offer.
1. The number of bins and the cutoff ages are added as variable parameters for learning. I would expect this to add to d_vc according to the number of parameters we add.
2. I decide on the number of bins and the cutoff ages by looking at the training data. I would expect this to add to d_vc according to the number of parameters we add. Is this exactly comparable to case 1? Is it possible that d_vc would be even higher if I considered adding more parameters but decided the data did not justify it?
3. I decide on the number of bins and the cutoff ages based on my problem domain knowledge (without looking at my current set of training or test data). If I understand the statement at the end of lecture 9 correctly, this complexity would not be charged to d_vc. Could d_vc even be considered to have decreased if the bin (a less complex measure, since it has fewer alternatives?) replaces the age in the feature set?
Thanks for any help. As noted in the lecture, this seems like an important practical question. |
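The difference between the approaches above can be sketched in code. This is only an illustration under stated assumptions: the ages, labels, and cutoffs below are made-up toy values, and the helper names `bin_with_fixed_cutoffs` and `pick_cutoff_from_data` are hypothetical. The first helper corresponds to case 3 (cutoffs fixed before seeing any data). The second corresponds to cases 1 and 2 in the simplest setting of a single cutoff: it searches the candidate cutoffs using the training labels, and it reports how many candidates it considered, since that search is what the analysis would have to account for.

```python
import numpy as np

def bin_with_fixed_cutoffs(age, cutoffs):
    """Case 3: cutoffs come from outside the data (hypothetical prior knowledge)."""
    # np.digitize returns, for each age, the index of the bin it falls into.
    return np.digitize(age, cutoffs)

def pick_cutoff_from_data(age, y):
    """Cases 1 and 2: pick one cutoff by searching with the training labels.

    Returns (best cutoff, its training error, number of candidates tried).
    """
    distinct = np.sort(np.unique(age))
    candidates = (distinct[:-1] + distinct[1:]) / 2.0   # midpoints between ages
    # Training error of the rule "predict +1 if age > cutoff, else -1".
    errors = [np.mean(np.where(age > c, 1, -1) != y) for c in candidates]
    best = int(np.argmin(errors))
    return float(candidates[best]), float(errors[best]), len(candidates)

# Toy data, invented for this sketch only.
age = np.array([25, 32, 41, 47, 53, 58, 64, 70])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

fixed_bins = bin_with_fixed_cutoffs(age, [40, 60])
cutoff, train_error, n_candidates = pick_cutoff_from_data(age, y)
```

Note that `bin_with_fixed_cutoffs` never receives `y`, while `pick_cutoff_from_data` cannot run without it. Whether a person or an algorithm performs that search, the labels have been used to choose the feature.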
Re: Data snooping (test vs. train data)
|
Re: Data snooping (test vs. train data)
1. The learning algorithm chooses the discretization to use.
2. I choose the discretization to use based on looking at the data (snooping).
3. I choose the discretization to use based on my prior knowledge (without looking at the data).
Based on your last sentence, case 3 does not adversely impact d_vc because I do not look at the data. Is there any change in d_vc because the discretized age loses the ability to distinguish some of the data points? I'm having trouble thinking about how different types of features (say integer valued, real valued, discretized ages, and multiple binary flags for different age ranges) affect d_vc.
My understanding is that cases 1 and 2 are the same (assuming the same hypothesis set) because the VC analysis depends only on the hypothesis set and not the learning algorithm. Are there any subtleties I'm missing here? Thank you! |
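One way to probe the question about points that a discretized feature can no longer distinguish is a small brute-force check. This is a sketch under a strong assumption that is not stated in the thread: the hypothesis set is taken to be every possible assignment of a ±1 label to each bin (a lookup table over bins). The names `dichotomies` and `is_shattered` are hypothetical helpers. For other hypothesis sets, such as a perceptron applied to the binned feature, the counts would differ.

```python
from itertools import product

def dichotomies(bin_ids, n_bins):
    """All labelings of the given points produced by per-bin lookup tables.

    bin_ids[i] is the bin that point i falls into; a hypothesis is a tuple
    `table` with one label in {-1, +1} per bin.
    """
    return {
        tuple(table[b] for b in bin_ids)
        for table in product((-1, 1), repeat=n_bins)
    }

def is_shattered(bin_ids, n_bins):
    """True if the hypothesis set realizes all 2^N labelings of the N points."""
    return len(dichotomies(bin_ids, n_bins)) == 2 ** len(bin_ids)
```

For example, `is_shattered([0, 1, 2], 3)` checks three points in three different bins, while `is_shattered([0, 1, 1], 3)` checks a set where two points share a bin. The check depends only on the hypothesis set and the points, not on how a learning algorithm would pick among the hypotheses.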
Re: Data snooping (test vs. train data)
|
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.