  #5  
08-10-2012, 03:32 PM
yaser
Caltech
 
Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,477
Re: Data snooping (test vs. train data)

Quote:
Originally Posted by rseiter
Thanks @magdon. To help my understanding, I'd like to translate this into a more concrete example. For the heart attack/discrete age bins lecture example, I see at least three different approaches. Here is my attempt to assess how d_vc changes by approach. I would appreciate any feedback you can offer.

1. The number of bins and cutoff ages are added as variable parameters for learning. I would expect this to add to d_vc as the number of parameters we add.
2. I decide on the number of bins and cutoff ages by looking at the training data. I would expect this to add to d_vc as the number of parameters we add. Is this exactly comparable to case 1? Is it possible that d_vc would be even higher if I considered adding more parameters but decided the data did not justify it?
3. I decide on the number of bins and cutoff ages based on my problem domain knowledge (without looking at my current set of training or test data). If I understand the statement at the end of lecture 9 correctly, this complexity would not be charged to d_vc. Could d_vc even be considered to have decreased if the bin (a less complex measure since it has fewer alternatives?) replaces the age in the feature set?

Thanks for any help. As noted in the lecture this seems like an important practical question.

Just to clarify: by bins and cutoffs, do you mean taking the input variable "age", which is a real number, and discretizing it into a finite number of values? In general, processing the inputs of a data set without looking at the outputs does not contaminate the data.
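
To make the last point concrete, here is a minimal Python sketch of discretizing age using the inputs only. The cutoff ages (40, 55, 70), the function names, and the sample data are all made up for illustration and are not from the lecture. The only thing it is meant to show is that the outputs never enter the transformation.

```python
from bisect import bisect_right

# Hypothetical cutoff ages, assumed to be fixed in advance
# (not taken from the lecture and not derived from any data).
AGE_CUTOFFS = [40, 55, 70]


def age_to_bin(age, cutoffs=AGE_CUTOFFS):
    """Map a real-valued age to a bin index in 0..len(cutoffs)."""
    # bisect_right counts how many cutoffs are <= age.
    return bisect_right(cutoffs, age)


def discretize_inputs(ages, cutoffs=AGE_CUTOFFS):
    """Transform the inputs only; the outputs y are never passed in."""
    return [age_to_bin(a, cutoffs) for a in ages]


ages = [23.5, 40.0, 54.9, 61.2, 88.0]  # made-up inputs
labels = [0, 0, 1, 0, 1]               # made-up outputs, deliberately unused
binned = discretize_inputs(ages)
print(binned)
```

With these assumed cutoffs, an age of exactly 40 lands in bin 1, because bisect_right puts a value equal to a cutoff into the upper bin.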
__________________
Where everyone thinks alike, no one thinks very much