 LFD Book Forum Ch. 1.3.1 vs. Ch. 1.3.2
 Register FAQ Calendar Mark Forums Read

#1
 Michael Reach Senior Member Join Date: Apr 2013 Location: Baltimore, Maryland, USA Posts: 71 Ch. 1.3.1 vs. Ch. 1.3.2

The book shows that in the case given in the truth table in 1.3.1 (not in the lectures), any hypothesis you choose from the first five points is no better than a random guess on the last three data points.

Whereas, it says in 1.3.2 that we can have a good probability that a hypothesis is close, using the Hoeffding inequality.

The book doesn't work that out for the earlier example given, so I thought I'd try to understand it. But let me make the case a little more extreme:

Let's say that our space of results comes from flipping a coin 100 times. Our "sample" data is the first 97 flips, N=97. We would like to develop a hypothesis that will do a good job on the last three flips.

Now, that isn't going to work; as the book points out for its example, no hypothesis is better than any other on the last three random flips. The average probability of error will be 0.5?

So here we are with our (somehow) carefully chosen hypothesis, which does a perfect job on the first 97 flips.

Pick : not so stringent. Doesn't the Hoeffding inequality tell us that the probability of our in-sample error (=0) being different from the out-of-sample error (~0.5 on average) by more than 0.4 is less than ? Which is real small, or so.

---------------------------------------

Where did I go wrong? My guess (I'd appreciate someone with the correct answer):

I said "(somehow) carefully chosen hypothesis". What I was supposed to do was choose it randomly. If I had, I would have almost always gotten a hypothesis that does a terrible job both on the sample and on the remaining three. Probably it would get about 50% on each, and Hoeffding would be right.

At the other extreme, I could have gone through some complex procedure for fitting a 97th degree polynomial through my sample, and thereby "memorized" the data, getting a hypothesis that does a great job on the sample. When I did that, I was effectively pawing through all the possible such polynomials (capital H) looking for the ones that do best. The professor called that M - the number of possible hypotheses. I guess for this job I'd need something like 97 parameters, each of which should have (at least) two possible values, so M is something like . Multiply that into Hoeffding's inequality and the rhs isn't small any more.

Anyhow, I'd appreciate it if someone has a clearer response.
#2
 Elroch Invited Guest Join Date: Mar 2013 Posts: 143 Re: Ch. 1.3.1 vs. Ch. 1.3.2

I suspect this will be clear!

You have indicated that a coin toss is a trial and you have had 97 of them. The only hypotheses for the toss of a coin are "heads" and "tails". If you have been dead right on 97 trials, you have a double headed coin (or a double-tailed one). This will give you confidence for the next 3 trials.

Your example reminded me that Hoeffding requires independent, identically distributed examples.
#3 magdon RPI Join Date: Aug 2009 Location: Troy, NY, USA. Posts: 597 Re: Ch. 1.3.1 vs. Ch. 1.3.2

The truth table in section 1.3.1 is emphasizing that if you had your choice of those 8 functions, then each is as good as the other. On any test point, half will be correct, and half wrong.

To link with the coin, you must narrow down from all 8 functions to just one of them, say (this function is picked before you look at the data). Now when you look at the data (five points), you get agreement on all 5 points. Assuming that each set of 5 data points is equally likely to have been picked, the agreement or disagreement is like tossing the coin (with a small modification that repeated data points should have been allowed to have independence).

It is still the case that if any of those previous 8 functions could be the true target function, then on a test point is equally likely to be correct as wrong. However, the point being made by the Hoeffding bound is that you must have been extremely unlucky to get those particular points on which you agreed with the target, when in fact you disagreed with the target elsewhere. Or, equivalently, on a normal day you would have made some mistakes on your data, so the fact that you didn't means that either you are having a very abnormal day or your is good.

Quote:
 Originally Posted by Michael Reach The book shows that in the case given in the truth table in 1.3.1 (not in the lectures), any hypothesis you choose from the first five points is no better than a random guess on the last three data points. Whereas, it says in 1.3.2 that we can have a good probability that a hypothesis is close, using the Hoeffding inequality. The book doesn't work that out for the earlier example given, so I thought I'd try to understand it. But let me make the case a little more extreme: Let's say that our space of results comes from flipping a coin 100 times. Our "sample" data is the first 97 flips, N=97. We would like to develop a hypothesis that will do a good job on the last three flips. Now, that isn't going to work; as the book points out for its example, no hypothesis is better than any other on the last three random flips. The average probability of error will be 0.5? So here we are with our (somehow) carefully chosen hypothesis, which does a perfect job on the first 97 flips. Pick : not so stringent. Doesn't the Hoeffding inequality tell us that the probability of our in-sample error (=0) being different from the out-of-sample error (~0.5 on average) by more than 0.4 is less than ? Which is real small, or so. --------------------------------------- Where did I go wrong? My guess (I'd appreciate someone with the correct answer): I said "(somehow) carefully chosen hypothesis". What I was supposed to do was choose it randomly. If I had, I would have almost always gotten a hypothesis that does a terrible job both on the sample and on the remaining three. Probably it would get about 50% on each, and Hoeffding would be right. At the other extreme, I could have gone through some complex procedure for fitting a 97th degree polynomial through my sample, and thereby "memorized" the data, getting a hypothesis that does a great job on the sample. When I did that, I was effectively pawing through all the possible such polynomials (capital H) looking for the ones that do best. The professor called that M - the number of possible hypotheses. I guess for this job I'd need something like 97 parameters, each of which should have (at least) two possible values, so M is something like . Multiply that into Hoeffding's inequality and the rhs isn't small any more. Anyhow, I'd appreciate it if someone has a clearer response.
__________________
Have faith in probability

 Thread Tools Show Printable Version Email this Page Display Modes Linear Mode Switch to Hybrid Mode Switch to Threaded Mode Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home General     General Discussion of Machine Learning     Free Additional Material         Dynamic e-Chapters         Dynamic e-Appendices Course Discussions     Online LFD course         General comments on the course         Homework 1         Homework 2         Homework 3         Homework 4         Homework 5         Homework 6         Homework 7         Homework 8         The Final         Create New Homework Problems Book Feedback - Learning From Data     General comments on the book     Chapter 1 - The Learning Problem     Chapter 2 - Training versus Testing     Chapter 3 - The Linear Model     Chapter 4 - Overfitting     Chapter 5 - Three Learning Principles     e-Chapter 6 - Similarity Based Methods     e-Chapter 7 - Neural Networks     e-Chapter 8 - Support Vector Machines     e-Chapter 9 - Learning Aides     Appendix and Notation     e-Appendices

All times are GMT -7. The time now is 03:08 PM. The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.