Quote:
Originally Posted by yaser
Just to clarify: By bins and cutoff, you mean taking the input variable "age" which is a real number and discretizing it into a finite number of values? In general, processing the inputs of a data set without looking at the outputs does not contaminate the data.
Yes. The three cases I am trying to distinguish, in terms of how each one affects d_vc, are:
1. The learning algorithm chooses the discretization to use.
2. I choose the discretization to use based on looking at the data (snooping).
3. I choose the discretization to use based on my prior knowledge (without looking at the data).
Based on your last sentence, case 3 does not adversely impact d_vc because I do not look at the data. Is there any change in d_vc because the discretized age loses the ability to distinguish some of the data points? I'm having trouble thinking about how different types of features (say integer valued, real valued, discretized ages, and multiple binary flags for different age ranges) affect d_vc.
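To make the question about lost resolution concrete, here is a minimal sketch of what I have in mind. It assumes a 1-D threshold hypothesis set h(x) = sign(x - a), and the ages and bin edges are made-up values chosen without reference to any outputs (case 3). It only counts the dichotomies this hypothesis set produces on one sample before and after binning; it is not a computation of d_vc itself.

```python
import numpy as np

def count_dichotomies(x):
    """Count distinct labelings of the points x under h(x) = sign(x - a)."""
    x = np.asarray(x, dtype=float)
    values = np.unique(x)  # sorted distinct feature values
    # One threshold below all values, one between each adjacent pair, one above all.
    thresholds = np.concatenate((
        [values[0] - 1.0],
        (values[:-1] + values[1:]) / 2.0,
        [values[-1] + 1.0],
    ))
    labelings = {tuple(np.where(x > a, 1, -1)) for a in thresholds}
    return len(labelings)

# Hypothetical sample of real-valued ages (illustrative only).
ages = np.array([23.5, 31.2, 38.9, 47.0, 52.3, 64.8])

# Hypothetical bin edges fixed in advance from prior knowledge.
bin_edges = np.array([30.0, 50.0])
binned = np.digitize(ages, bin_edges)  # bin index for each age

raw_count = count_dichotomies(ages)       # 6 distinct values
binned_count = count_dichotomies(binned)  # 3 distinct bins

print("dichotomies on raw ages:   ", raw_count)
print("dichotomies on binned ages:", binned_count)
```

With these values, the binned feature collapses six distinct ages into three bins, so the threshold hypotheses can realize fewer dichotomies on this sample than they can on the raw ages.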
My understanding is that cases 1 and 2 are the same (assuming the same hypothesis set) because the VC analysis depends only on the hypothesis set and not the learning algorithm. Are there any subtleties I'm missing here?
Thank you!