The book shows that in the case given in the truth table in 1.3.1 (not in the lectures), any hypothesis you choose from the first five points is no better than a random guess on the last three data points.

Whereas, it says in 1.3.2 that we can have a good probability that a hypothesis is close, using the Hoeffding inequality.

The book doesn't work that out for the earlier example given, so I thought I'd try to understand it. But let me make the case a little more extreme:

Let's say that our space of results comes from flipping a coin 100 times. Our "sample" data is the first 97 flips, N=97. We would like to develop a hypothesis that will do a good job on the last three flips.

Now, that isn't going to work; as the book points out for its example, no hypothesis is better than any other on the last three random flips. The average probability of error will be 0.5?

So here we are with our (somehow) carefully chosen hypothesis, which does a perfect job on the first 97 flips.

Pick

: not so stringent. Doesn't the Hoeffding inequality tell us that the probability of our in-sample error (=0) being different from the out-of-sample error (~0.5 on average) by more than 0.4 is less than

? Which is real small,

or so.

---------------------------------------

Where did I go wrong? My guess (I'd appreciate someone with the correct answer):

I said "(somehow) carefully chosen hypothesis". What I was supposed to do was choose it randomly. If I had, I would have almost always gotten a hypothesis that does a terrible job both on the sample and on the remaining three. Probably it would get about 50% on each, and Hoeffding would be right.

At the other extreme, I could have gone through some complex procedure for fitting a 97th degree polynomial through my sample, and thereby "memorized" the data, getting a hypothesis that does a great job on the sample. When I did that, I was effectively pawing through all the possible such polynomials (capital H) looking for the ones that do best. The professor called that M - the number of possible hypotheses. I guess for this job I'd need something like 97 parameters, each of which should have (at least) two possible values, so M is something like

. Multiply that into Hoeffding's inequality and the rhs isn't small any more.

Anyhow, I'd appreciate it if someone has a clearer response.