The following point was raised in the previous thread:
Quote:
Originally Posted by grozhd
For example no matter how much data you give me I can always come up with a polynomial of degree equal to the number of training examples which is perfect in sample. Suppose you gave me 20 points from linear function and I generate a polynomial of degree 20. I think it will be really bad out of sample.

You are right! But this is a different problem and I will try to explain it in the next paragraph. In short, here you are using the
set of all polynomials to choose one polynomial based on the data (not to mention the 20 dimensional weight space you will search to get the fit). Because you are 'optimizing', you will have to pay the price and then the whole game of VC dimensions and modified Hoeffding's inequality (addition of multiple bins in the lecture) need to be considered.
1. The process of searching H to get g has a cost. When you chose one h over another h, you have to pay in terms of the guarantee available.
Let's take a new coin tossing experiment. Assume we have
10000 coins and we toss each coin 10 times (10000 coins can be thought of as 10000 bins and N = 10!). Let's assume that P(head) = 1/2. We now plot a histogram of "number of heads in 10 tosses". So you will have some coins which gave 0 Heads, more which gave 1, even more which gave 2, while most gave between 38 heads. This will follow a binomial distribution. Getting 0 heads in 10 tosses is rare but not impossible. Notice P(0 heads) value goes up with n or the no of coins.
2. Thus we see that as the number of options for a bad event to happen increase, the p of that event happening
(atleast once) goes up! This is an analog to what Prof explained: As you have more and more h in your H, it is probable ( notice that he is modelling the
worst case scenario) that a bad event will happen. By bad event, I mean those cases where the bound is broken and you observe a large deviation.
3. The modified inequality tells us that we should be careful about choosing H. If H is too large (let's say if H is space of all polynomial), the P of a bad event (read overfitting) will be high. The new formulation of the inequality accounts for the complexity.
4. The inequality gives us an upper bound on the worst case performance. By adding M, we ensure that the worse case scenario of this given complexity is accounted for.
In summary, the next stage of the problem is to think about the complexity of your hypothesis set. As you can imagine, a 20th order polynomial is too complex for 20 data points; try calculating the rhs of modified Hoeffding's inequality and you will be convinced that this is the case! In fact, even a linear classifier with
weight adjustment needs to be treated carefully as we are searching the entire 3 dimensional weight space to get the weights right!
Once we can separate the 2 concepts:
a. that of cooking up a g(x)
absolutely randomly (based on the population of India, speed or Earth's rotation and what my mother cooked for supper ) and then testing it on my X to get Ein as a prediction of Eout given a N.
b. and that of choosing a classifier based on adjusting by searching my hypothesis space for which we will have to pay a price in terms of the worst case theoretical guarantee.
..... it should be understandable.