View Single Post
Old 04-05-2013, 12:56 PM
magdon's Avatar
magdon magdon is offline
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 597
Default Re: Ch. 1.3.1 vs. Ch. 1.3.2

The truth table in section 1.3.1 is emphasizing that if you had your choice of those 8 functions, then each is as good as the other. On any test point, half will be correct, and half wrong.

To link with the coin, you must narrow down from all 8 functions to just one of them, say f_1 (this function is picked before you look at the data). Now when you look at the data (five points), you get agreement on all 5 points. Assuming that each set of 5 data points is equally likely to have been picked, the agreement or disagreement is like tossing the coin (with a small modification that repeated data points should have been allowed to have independence).

It is still the case that if any of those previous 8 functions could be the true target function, then on a test point f_1 is equally likely to be correct as wrong. However, the point being made by the Hoeffding bound is that you must have been extremely unlucky to get those particular points on which you agreed with the target, when in fact you disagreed with the target elsewhere. Or, equivalently, on a normal day you would have made some mistakes on your data, so the fact that you didn't means that either you are having a very abnormal day or your f_1 is good.

Originally Posted by Michael Reach View Post
The book shows that in the case given in the truth table in 1.3.1 (not in the lectures), any hypothesis you choose from the first five points is no better than a random guess on the last three data points.

Whereas, it says in 1.3.2 that we can have a good probability that a hypothesis is close, using the Hoeffding inequality.

The book doesn't work that out for the earlier example given, so I thought I'd try to understand it. But let me make the case a little more extreme:

Let's say that our space of results comes from flipping a coin 100 times. Our "sample" data is the first 97 flips, N=97. We would like to develop a hypothesis that will do a good job on the last three flips.

Now, that isn't going to work; as the book points out for its example, no hypothesis is better than any other on the last three random flips. The average probability of error will be 0.5?

So here we are with our (somehow) carefully chosen hypothesis, which does a perfect job on the first 97 flips.

Pick \epsilon=0.4: not so stringent. Doesn't the Hoeffding inequality tell us that the probability of our in-sample error (=0) being different from the out-of-sample error (~0.5 on average) by more than 0.4 is less than 2e^{(-2(0.4^2)*97)}? Which is real small, 10^{(-13)} or so.


Where did I go wrong? My guess (I'd appreciate someone with the correct answer):

I said "(somehow) carefully chosen hypothesis". What I was supposed to do was choose it randomly. If I had, I would have almost always gotten a hypothesis that does a terrible job both on the sample and on the remaining three. Probably it would get about 50% on each, and Hoeffding would be right.

At the other extreme, I could have gone through some complex procedure for fitting a 97th degree polynomial through my sample, and thereby "memorized" the data, getting a hypothesis that does a great job on the sample. When I did that, I was effectively pawing through all the possible such polynomials (capital H) looking for the ones that do best. The professor called that M - the number of possible hypotheses. I guess for this job I'd need something like 97 parameters, each of which should have (at least) two possible values, so M is something like 2^{97}. Multiply that into Hoeffding's inequality and the rhs isn't small any more.

Anyhow, I'd appreciate it if someone has a clearer response.
Have faith in probability
Reply With Quote