LFD Book Forum  

LFD Book Forum > Book Feedback - Learning From Data > Chapter 1 - The Learning Problem

#1 — 04-04-2013, 12:30 PM
Michael Reach
Senior Member

Join Date: Apr 2013
Location: Baltimore, Maryland, USA
Posts: 71

Ch. 1.3.1 vs. Ch. 1.3.2

The book shows, in the truth table of Section 1.3.1 (not in the lectures), that any hypothesis you choose based on the first five data points is no better than a random guess on the last three.

Whereas Section 1.3.2 says that, using the Hoeffding inequality, we can have high probability that a hypothesis's in-sample error is close to its out-of-sample error.

The book doesn't work that out for the earlier example given, so I thought I'd try to understand it. But let me make the case a little more extreme:

Let's say that our space of results comes from flipping a coin 100 times. Our "sample" data is the first 97 flips, N=97. We would like to develop a hypothesis that will do a good job on the last three flips.

Now, that isn't going to work; as the book points out for its example, no hypothesis is better than any other on the last three random flips. The average probability of error will be 0.5.

So here we are with our (somehow) carefully chosen hypothesis, which does a perfect job on the first 97 flips.

Pick \epsilon = 0.4: not so stringent. Doesn't the Hoeffding inequality tell us that the probability of our in-sample error (= 0) differing from the out-of-sample error (\approx 0.5 on average) by more than 0.4 is at most 2e^{-2\epsilon^2 N} = 2e^{-2(0.4)^2(97)}? Which is really small, 10^{-13} or so.
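For a single hypothesis fixed before the data is seen, the bound can be checked numerically. A quick sketch in Python, using N and \epsilon from the post:

```python
import math

# Hoeffding bound for ONE hypothesis fixed before seeing the data:
#   P[|E_in - E_out| > eps] <= 2 * exp(-2 * eps^2 * N)
N = 97      # sample size (the first 97 flips)
eps = 0.4   # tolerance
bound = 2 * math.exp(-2 * eps**2 * N)
print(bound)  # about 6.6e-14, i.e. on the order of 10^-13
```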

---------------------------------------

Where did I go wrong? Here is my guess (I'd appreciate a correction if it's off):

I said "(somehow) carefully chosen hypothesis". What I was supposed to do was choose it randomly. If I had, I would have almost always gotten a hypothesis that does a terrible job both on the sample and on the remaining three. Probably it would get about 50% on each, and Hoeffding would be right.
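This can be checked by simulation: fix one guessing rule in advance and measure its in-sample error against the true 50% error rate. A Monte Carlo sketch (the trial count is arbitrary):

```python
import random

random.seed(0)
N, eps, trials = 97, 0.4, 20_000
e_out = 0.5  # any fixed guess is wrong on a fair coin half the time
violations = 0
for _ in range(trials):
    # each of the N flips disagrees with the fixed guess w.p. 1/2
    e_in = sum(random.random() < 0.5 for _ in range(N)) / N
    if abs(e_in - e_out) > eps:
        violations += 1
print(violations / trials)  # 0.0 in practice; Hoeffding caps it at ~6.6e-14
```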

At the other extreme, I could have gone through some complex procedure to fit a degree-96 polynomial exactly through my 97 sample points, thereby "memorizing" the data and getting a hypothesis that does a great job on the sample. In doing that, I was effectively pawing through all the possible such polynomials (the hypothesis set \mathcal{H}) looking for the ones that do best. The professor called its size M, the number of possible hypotheses. For this job I'd need something like 97 coefficients, each with (at least) two possible values, so M is something like 2^{97}. Multiply that into Hoeffding's inequality and the right-hand side isn't small any more.
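Plugging the post's rough count M \approx 2^{97} into the union bound M \cdot 2e^{-2\epsilon^2 N} shows how the guarantee evaporates. A sketch with the same numbers:

```python
import math

N, eps = 97, 0.4
single = 2 * math.exp(-2 * eps**2 * N)  # bound for one fixed hypothesis
M = 2 ** 97                             # rough size of the searched hypothesis set
union = M * single                      # union-bound right-hand side
print(single)  # ~6.6e-14: tiny for one hypothesis
print(union)   # ~1e16: far above 1, so the bound says nothing
```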

Anyhow, I'd appreciate it if someone has a clearer response.
#2 — 04-04-2013, 04:09 PM
Elroch
Invited Guest

Join Date: Mar 2013
Posts: 143

Re: Ch. 1.3.1 vs. Ch. 1.3.2

I suspect this will be clear!

You have indicated that a coin toss is a trial, and you have had 97 of them. The only hypotheses for the toss of a coin are "heads" and "tails". If you have been dead right on 97 trials, you have a double-headed coin (or a double-tailed one). That should give you confidence for the next 3 trials.

Your example reminded me that Hoeffding requires independent, identically distributed examples.
#3 — 04-05-2013, 12:56 PM
magdon
RPI

Join Date: Aug 2009
Location: Troy, NY, USA
Posts: 595

Re: Ch. 1.3.1 vs. Ch. 1.3.2

The truth table in Section 1.3.1 is emphasizing that if you had your choice of those 8 functions, each is as good as any other: on any test point, half will be correct and half wrong.

To link this with the coin, you must narrow down from all 8 functions to just one of them, say f_1 (this function is picked before you look at the data). Now when you look at the data (five points), you get agreement on all 5 points. Assuming each set of 5 data points was equally likely to be picked, the agreement or disagreement at each point is like tossing the coin (with the small modification that repeated data points would have to be allowed for the tosses to be truly independent).

It is still the case that if any of those 8 functions could be the true target function, then on a test point f_1 is equally likely to be correct as wrong. However, the point being made by the Hoeffding bound is that you must have been extremely unlucky to have agreed with the target on exactly those sample points while disagreeing with it elsewhere. Equivalently, on a normal day you would have made some mistakes on your data; the fact that you didn't means either you are having a very abnormal day or your f_1 is good.
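The "abnormal day" can be quantified: if f_1 is fixed before the data and each of the 5 points agrees with the target independently with probability 1/2, full agreement happens only about 1/32 of the time. A small simulation sketch of that symmetry argument:

```python
import random

random.seed(1)
trials = 200_000
full_agreement = sum(
    # 5 independent point-wise agreements, each w.p. 1/2
    all(random.random() < 0.5 for _ in range(5))
    for _ in range(trials)
)
print(full_agreement / trials)  # should be close to 1/32 = 0.03125
```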

__________________
Have faith in probability
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.