I also have the same questions, and I have read your replies.
Please check whether my conclusions below are correct:
1- We cannot plug "g" in for "h" in the Hoeffding inequality, because g depends on the sample we have already selected; in other words, we choose it deliberately (as the h with the lowest error inside D), like selecting the bin which has the minimum frequency of heads.
So, what if we select "g" randomly (say, uniformly over the hs), or select a bin randomly? Can we then use the Hoeffding inequality for "g", or do we still have to consider M, the size of H? (I try to check both cases in the sketches below.)
2- Which of the following interpretations of equation 1.6 is correct?
- The only function that has zero error both inside and outside D is f, so if the number of hypotheses increases, the chance of selecting "f" (the correct function, or a better approximation of it) becomes lower. (However, I feel this is not what you are saying.)
- Or maybe, when we increase the number of hypotheses, we increase the chance that the data behave differently inside and outside D. For example, if we restrict ourselves to a single hypothesis, we may have a high error, but we reduce the difference between E_in and E_out. Similarly, if we use only one feature, we have limited the number of hypotheses, so when we evaluate h outside D it is not flexible enough to have fit minor quirks of the sample, and its E_out stays closer to E_in?
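To convince myself about point 1 and the role of M in equation 1.6 (which I understand as P[|E_in(g) - E_out(g)| > eps] <= 2*M*exp(-2*eps^2*N)), I wrote the small Python sketch below. It is only my own illustration, not something from the book: each "bin" is a fair coin flipped N times, and I compare a coin fixed in advance with the coin that happens to show the fewest heads.

import numpy as np

# My own sketch (not from the book): each "bin" is a fair coin (mu = 0.5),
# and nu is the fraction of heads it shows in N flips.
# Hoeffding for ONE coin fixed in advance: P[|nu - mu| > eps] <= 2*exp(-2*eps^2*N).
rng = np.random.default_rng(0)
N, M, eps, trials = 10, 1000, 0.4, 2000
single_bound = 2 * np.exp(-2 * eps**2 * N)   # bound for one pre-chosen coin
# As I read eq. 1.6, selecting among M coins multiplies this bound by M
# (2*M*exp(-2*eps^2*N) > 1 here, so it is vacuous for such a small N, but that is the form).

deviations_fixed = 0   # coin 0, chosen before looking at any flips
deviations_min = 0     # the coin with the fewest heads, chosen by looking at the flips
for _ in range(trials):
    nu = rng.binomial(N, 0.5, size=M) / N    # heads fraction of each of the M coins
    deviations_fixed += abs(nu[0] - 0.5) > eps
    deviations_min += abs(nu.min() - 0.5) > eps

print("pre-chosen coin deviation freq:", deviations_fixed / trials,
      " <= single-coin bound", round(single_bound, 4))
print("min-heads coin  deviation freq:", deviations_min / trials,
      " -> far above the single-coin bound")

At least in this toy setting, the deliberately selected coin violates the single-coin bound by a wide margin, which (if I understand correctly) is exactly why we cannot plug g in for h and why the bound for g has to carry the factor M.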
=====================================
Second question:
In "h is fixed
before you generate the data set"
I also can't understand your emphasis on "before".
Do you want to say that
h shouldn't change?
because I feel
h is independent from
D then "before" or "after" doesn't mean much. We don't need to have an
h in mind to be able to generate D, we can select D, then decide which
h to use, then evaluate h over D, but we should use the same h for the test set, right? or maybe
h is used somehow in generating D?! Anyway, I think you may mean it should be selected independently from D
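To check my reading of "fixed before" (independence from D rather than chronology), here is a second sketch in the same spirit, again just my own assumption of the setup: the coin is picked uniformly at random after the flips have been generated, but without looking at them, so the single-coin Hoeffding bound should still apply.

import numpy as np

# My interpretation: h is chosen AFTER D exists, but independently of D.
rng = np.random.default_rng(1)
N, M, eps, trials = 10, 1000, 0.4, 2000
single_bound = 2 * np.exp(-2 * eps**2 * N)

deviations_random = 0
for _ in range(trials):
    nu = rng.binomial(N, 0.5, size=M) / N   # generate the "data set" first
    pick = rng.integers(M)                  # then pick one coin at random, ignoring nu
    deviations_random += abs(nu[pick] - 0.5) > eps

print("randomly picked coin deviation freq:", deviations_random / trials,
      " <= single-coin bound", round(single_bound, 4))

If that is right, then the chronological "before" is not really the point; what matters is that the choice of h does not use D, which supports the "selected independently from D" reading.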