LFD Book Forum > Book Feedback - Learning From Data > Chapter 1 - The Learning Problem

#1 - 01-31-2013, 12:57 AM
scottedwards2000 (Junior Member; Join Date: Jan 2013; Posts: 9)
Is the Hoeffding Inequality really valid for each bin despite non-random sampling?

The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).

Thanks for your help!
#2 - 02-05-2013, 10:18 PM
yaser (Caltech; Join Date: Aug 2009; Location: Pasadena, California, USA; Posts: 1,477)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
Originally Posted by scottedwards2000
The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).
I see your concern. Here is one way to argue about it.

Assume the marbles have numbers to identify the x point they are associated with, in addition to being green or red. Start with one bin, pick the marbles at random, and look at the colors in sample and out of sample. It is clear that the Hoeffding inequality holds for this bin, since the experiment is that of a single bin, regardless of the numbers on the marbles in sample.
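For reference, the single-bin Hoeffding inequality being invoked here, in the book's notation, is

P\left[\,|\nu - \mu| > \epsilon\,\right] \le 2 e^{-2\epsilon^2 N}

for any \epsilon > 0, where N is the size of the random sample drawn from that bin.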

Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.
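To make this concrete, here is a minimal simulation sketch; the finite input space, the target, and the hypotheses below are all made up for illustration. One sample of points is drawn and then reused by every "bin" (hypothesis), and the frequency of |\nu - \mu| > \epsilon is checked, bin by bin, against the single-bin Hoeffding bound.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a finite input space, a made-up target, and M fixed hypotheses.
X_SIZE, N, M, EPS, TRIALS = 10_000, 100, 20, 0.1, 2_000

f = rng.integers(0, 2, X_SIZE)                      # target value f(x) for every point x
H = rng.integers(0, 2, (M, X_SIZE))                 # M hypotheses = M "bins"
mu = (H != f).mean(axis=1)                          # true fraction of "red" marbles per bin

violations = np.zeros(M)
for _ in range(TRIALS):
    sample = rng.integers(0, X_SIZE, N)             # ONE random sample of x's ...
    nu = (H[:, sample] != f[sample]).mean(axis=1)   # ... reused by every bin to get its nu
    violations += np.abs(nu - mu) > EPS             # per-bin "bad events"

print("per-bin violation rates:", np.round(violations / TRIALS, 3))
print("single-bin Hoeffding bound:", 2 * np.exp(-2 * EPS**2 * N))

Each bin's violation rate should stay below 2e^{-2\epsilon^2 N} even though all the bins share the same sample; only the joint behavior of the bins is affected by that sharing.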
__________________
Where everyone thinks alike, no one thinks very much
#3 - 02-07-2013, 10:49 AM
scottedwards2000 (Junior Member; Join Date: Jan 2013; Posts: 9)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual Hoeffding Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being violated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound, right? Thanks again!
#4 - 02-07-2013, 12:45 PM
yaser (Caltech; Join Date: Aug 2009; Location: Pasadena, California, USA; Posts: 1,477)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
Originally Posted by scottedwards2000
Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual Hoeffding Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being violated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound, right? Thanks again!
Correct. The behavior of each sample is indeed dependent on the other bin samples, but the Hoeffding inequality is valid for each bin by itself, and the union bound is valid whether they are independent or dependent.
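For reference, the union bound in question needs no independence assumption at all: for any events B_1, ..., B_M (here B_m is the event that bin m has |\nu_m - \mu_m| > \epsilon),

P[B_1 \text{ or } B_2 \text{ or } \cdots \text{ or } B_M] \le P[B_1] + P[B_2] + \cdots + P[B_M] \le 2M e^{-2\epsilon^2 N},

which is exactly the multiple-bin bound, however the bins happen to be correlated.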
__________________
Where everyone thinks alike, no one thinks very much
#5 - 04-06-2013, 01:24 PM
scottedwards2000 (Junior Member; Join Date: Jan 2013; Posts: 9)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
Originally Posted by yaser
Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.
Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above? How does the fact that the first bin is arbitrary help us, when we have no idea if the bin we choose is the first bin? I can see how, if we ran the experiment an infinite number of times, we would eventually choose the first bin, where Hoeffding applies, but we aren't running it an infinite number of times, are we?

Thanks for your help!
#6 - 04-06-2013, 01:40 PM
yaser (Caltech; Join Date: Aug 2009; Location: Pasadena, California, USA; Posts: 1,477)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
Originally Posted by scottedwards2000
Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above?
The sentence addresses the following concern. Hoeffding assumes a random sample from a bin. The different samples from the multiple bins in the learning analogy are not "totally random," since they all depend on one sample of input points (which is random), but then each of them gets red/green values on that same sample according to the agreement/disagreement of its respective hypothesis with the target function.

The resolution of this dilemma is that, while this is true, each bin in isolation still sees the sample as random, and therefore each bin obeys the Hoeffding inequality by itself. When we consider all the bins at once, it does not matter how they are correlated, since we invoke the union bound, which is valid regardless of the correlations.
__________________
Where everyone thinks alike, no one thinks very much
#7 - 04-12-2013, 03:58 PM
grozhd (Junior Member; Join Date: Apr 2013; Posts: 4)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

I have the same concern as scottedwards2000 and I still don't understand how it is resolved.

As I understand it, the bin symbolizes the probability space of all possible inputs X. A sample of balls drawn randomly from the bin symbolizes our training set \bar{x_0}=(x_1,...,x_N).

Now we pick a hypothesis h_1 (suppose we are running PLA). We look at our sample \bar{x_0}, compute \nu_1 and use Hoeffding's Inequality. We then do one step of PLA and come up with a new hypothesis h_2, which automatically gives us \nu_2, and the professor is saying that we can write down the Hoeffding inequality for \nu_2 and \mu_2?

I guess we can. But that inequality tells us something about the random variable \nu_2: X^N \rightarrow [0,1], i.e. about P\left( |\nu_2(\bar{x}) - \mu_2| > \varepsilon \right), where \bar{x} is a random sample. But it seems like we are using \nu_2(\bar{x_0}), where \bar{x_0} is hardly random with regard to h_2, since we built h_2 using that sample.

Here is an example that illustrates my point: say we tried some random h_1, compared it with the target function f on our training sample (x_1,...,x_N), and wrote down Hoeffding's inequality. Now let's construct h_2 as follows: h_2(x_i) = f(x_i), \ i=1,...,N and h_2(x) = h_1(x), \ x \neq x_i, \ i=1,...,N. Let's write down Hoeffding's inequality for this hypothesis. If we are indeed using \nu_2(\bar{x_0}), then here it would be equal to 1, since h_2 = f on \bar{x_0}, and we would have: P\left( |1 - \mu_2| > \varepsilon \right) is small. So somehow we are saying, with high probability, that h_2 does an excellent job out of sample, though we didn't change it much from h_1. This example shouldn't be correct, right? And if it isn't, how is the one with PLA correct?
#8 - 04-12-2013, 11:16 PM
yaser (Caltech; Join Date: Aug 2009; Location: Pasadena, California, USA; Posts: 1,477)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
Originally Posted by grozhd
I have the same concern as scottedwards2000 and I still don't understand how it is resolved.

As I understand it, the bin symbolizes the probability space of all possible inputs X. A sample of balls drawn randomly from the bin symbolizes our training set \bar{x_0}=(x_1,...,x_N).

Now we pick a hypothesis h_1 (suppose we are running PLA). We look at our sample \bar{x_0}, compute \nu_1 and use Hoeffding's Inequality. We then do one step of PLA and come up with a new hypothesis h_2, which automatically gives us \nu_2, and the professor is saying that we can write down the Hoeffding inequality for \nu_2 and \mu_2?

I guess we can. But that inequality tells us something about the random variable \nu_2: X^N \rightarrow [0,1], i.e. about P\left( |\nu_2(\bar{x}) - \mu_2| > \varepsilon \right), where \bar{x} is a random sample. But it seems like we are using \nu_2(\bar{x_0}), where \bar{x_0} is hardly random with regard to h_2, since we built h_2 using that sample.

Here is an example that illustrates my point: say we tried some random h_1, compared it with the target function f on our training sample (x_1,...,x_N), and wrote down Hoeffding's inequality. Now let's construct h_2 as follows: h_2(x_i) = f(x_i), \ i=1,...,N and h_2(x) = h_1(x), \ x \neq x_i, \ i=1,...,N. Let's write down Hoeffding's inequality for this hypothesis. If we are indeed using \nu_2(\bar{x_0}), then here it would be equal to 1, since h_2 = f on \bar{x_0}, and we would have: P\left( |1 - \mu_2| > \varepsilon \right) is small. So somehow we are saying, with high probability, that h_2 does an excellent job out of sample, though we didn't change it much from h_1. This example shouldn't be correct, right? And if it isn't, how is the one with PLA correct?
It is a subtle point, so let me try to explain it in the terms you outlined. Let us take the sample {\cal D} (what you call \bar{x_0}; I use {\cal D} just to follow the book notation). Now evaluate \nu for all hypotheses h in your model {\cal H}. We didn't start at one h and move to another; we just evaluated \nu for all h\in{\cal H}. The question is, does the Hoeffding inequality apply to each of these h's by itself? The answer is clearly yes, since each of them could in principle be the hypothesis you started with (which you called h_1).

Hoeffding states what the probabilities are before the sample is drawn. When you choose one of these hypotheses because of its small \nu, as in the scenario you point out, the probability that applies now is conditioned on the sample having small \nu. We can try to get a conditional version of Hoeffding to deal with this situation, or we can try to get a version of Hoeffding that applies regardless of which h we choose and how we choose it. The latter is what we did using the union bound.

Finally, regarding the example you illustrated, any hypothesis you use has to be in {\cal H} (which is decided before the sample is drawn). The one you constructed is not guaranteed to be in {\cal H}. Of course, you can guarantee that it is in {\cal H} by taking {\cal H} to be the set of all possible hypotheses, but in this case M is truly infinite and the multiple-bin Hoeffding bound does not guarantee anything at all.
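To see numerically why the probability changes once a hypothesis is chosen for its in-sample performance, here is a small simulation sketch in the spirit of the lecture's coin experiment (all numbers are illustrative): every coin is a fair "bin" with \mu = 0.5, Hoeffding holds for any single coin fixed in advance, but not for the coin selected after looking at the outcomes.

import numpy as np

rng = np.random.default_rng(1)

N, M, EPS, TRIALS = 10, 1000, 0.4, 10_000     # 10 tosses, 1000 fair coins, illustrative epsilon
mu = 0.5

fixed_viol = 0    # bad events for one coin fixed in advance
chosen_viol = 0   # bad events for the coin picked because its nu looks most extreme
for _ in range(TRIALS):
    nu = rng.binomial(N, mu, M) / N           # in-sample head frequencies of all M coins
    fixed_viol += abs(nu[0] - mu) > EPS       # Hoeffding applies to this fixed coin
    chosen_viol += abs(nu.max() - mu) > EPS   # ... but not to the coin chosen after looking

print("fixed coin violation rate :", fixed_viol / TRIALS)
print("chosen coin violation rate:", chosen_viol / TRIALS)
print("single-bin Hoeffding bound:", 2 * np.exp(-2 * EPS**2 * N))

The fixed coin violates the bound rarely, well within 2e^{-2\epsilon^2 N}; the coin chosen for its extreme \nu violates it in a large fraction of the runs, which is why the extra factor of M is needed when a choice is made after looking at the data.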
__________________
Where everyone thinks alike, no one thinks very much
#9 - 04-12-2013, 11:29 PM
Rahul Sinha (Junior Member; Join Date: Apr 2013; Posts: 9)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Another thread discusses a similar question. I am posting my reply here, with some elaboration, to avoid cross-posting.

Let's take the easiest example: coin tossing. Assume you know that a coin is unbiased (P(h) = P(t) = .5). You give the coin to me and pay me some money to toss the coin 10K times and diligently record the outcome of each toss. These outcomes form X, your data set. Now, you ask me to report P(h). Well, I, being a simple laborer, have no clue. But I have learned that "mu of X" closely follows P(h) due to some mysterious equation called Hoeffding's inequality. So I report this to you, and it is about .4998. You are happy that your assessment of non-bias was correct, and you reward me with more dollars.
Let us try to decode the experiment. Here x is a binary random variable which is either heads or tails in a single toss. X is your data at hand, i.e., the outcome of the 10K tosses. y = P(h) is the target function which we are trying to learn. We use mu, the fraction of heads in the entire data set at hand, as an estimator of y. By definition, P(h) is continuous (it can take any value in [0,1]) but mu is discrete, so you would need an infinite number of tosses to get P(h) (that is, E(mu)) exactly. Usually we have neither the time nor the energy to flip a coin an infinite number of times, so we rely on a statistical assumption to play God and make predictions. The assumption, in our case, is that for a large N (10K), the probability of P(h) being far away from mu is small. How small? That is given by our good old Hoeffding inequality. It is possible, however, that in my 10K tosses I got mu = .05 while P(h) = .5! The right-hand side of Hoeffding's inequality gives you a predictive bound, and the prediction can go wrong on some occasions. For a given N, the bound might evaluate to something like .000001; that is very, very small, but it is greater than 0, and hence the "what can go wrong will go wrong" philosophy might apply. But in general it won't! For example, you can say that it is very rare to find a girl who is 9 ft tall, but that doesn't mean it cannot happen in the lifetime of our species. So if an alien sampled all humans alive today, it could predict the height of any Homo sapiens it is to encounter fairly accurately (but it might miss on some 9 ft girl yet to be born in the year 2890).
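A quick sketch of this experiment, using the hypothetical numbers above:

import numpy as np

rng = np.random.default_rng(2)

N, p_head, eps = 10_000, 0.5, 0.01
tosses = rng.random(N) < p_head            # one data set of 10K tosses
mu = tosses.mean()                         # fraction of heads that gets reported

bound = 2 * np.exp(-2 * eps**2 * N)        # Hoeffding bound on P[|mu - P(h)| > eps]
print(f"reported mu = {mu:.4f}")
print(f"P[|mu - P(h)| > {eps}] <= {bound:.4f}")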

So we learned an estimate of P(h). In normal machine learning, for example in a regression problem, we will learn something else. ASSUMING (the assumption is extremely important) a linear classifier, we learn the weights. To be precise, x is your data point on the x axis, y is the true f(x), which is your target, and g(x) is given by the estimate of the weights you learn. Within the assumptions of this setting, Hoeffding will give you a good estimate (but of what? ERROR!). Specifically, if you get an error of .0005 on a data set of size 10K (X is a data set of 10K input values whose y you already know), you can be quite confident that your learned weights behave almost like the original weights. Why? Because your estimate of the error should closely follow the true error on the entire space, and vice versa, given that N is large. So you learn an estimate of the error by calculating the error on your sample data (Ein) and use this estimate to predict the real error on new data points (Eout).
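Here is a minimal sketch of that Ein-versus-Eout idea, under the assumption, flagged in the note below, that the hypothesis is fixed without looking at the data; the target, the hypothesis, and all numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical target and hypothesis: both are fixed linear separators in 2D,
# and the hypothesis is chosen WITHOUT looking at the data (see the note below).
def target(x):
    return np.sign(x @ np.array([1.5, -2.0]) + 0.3)

def hypothesis(x):
    return np.sign(x @ np.array([1.0, -1.0]) + 0.1)

X_in = rng.uniform(-1, 1, (10_000, 2))        # the data set at hand (N = 10K)
X_out = rng.uniform(-1, 1, (1_000_000, 2))    # a large fresh sample standing in for "out of sample"

E_in = np.mean(hypothesis(X_in) != target(X_in))
E_out = np.mean(hypothesis(X_out) != target(X_out))
print(f"Ein  = {E_in:.4f}")
print(f"Eout = {E_out:.4f}   (close to Ein, as Hoeffding predicts for a fixed hypothesis)")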

(I actually cheated a bit when I said you can choose the weights and still use Hoeffding. The original Hoeffding inequality only applies if you randomly guess some weights and calculate the error on the data (Ein) to predict Eout. This is not a big problem, however, and it should become clear when you read my next post.)
#10 - 04-13-2013, 12:00 AM
Rahul Sinha (Junior Member; Join Date: Apr 2013; Posts: 9)
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

The following point was raised in the previous thread:
Quote:
Originally Posted by grozhd
For example, no matter how much data you give me, I can always come up with a polynomial of degree equal to the number of training examples which is perfect in sample. Suppose you gave me 20 points (x, y) from a linear function and I generate a polynomial of degree 20. I think it will be really bad out of sample.
You are right! But this is a different problem, and I will try to explain it below. In short, here you are using the set of all polynomials to choose one polynomial based on the data (not to mention the 20-dimensional weight space you will search to get the fit). Because you are 'optimizing', you have to pay a price, and then the whole game of VC dimensions and the modified Hoeffding inequality (the addition of multiple bins in the lecture) needs to be considered.

1. The process of searching H to get g has a cost. When you choose one h over another h, you have to pay in terms of the guarantee available.
Let's take a new coin-tossing experiment. Assume we have 10,000 coins and we toss each coin 10 times (the 10,000 coins can be thought of as 10,000 bins, with N = 10). Let's assume that P(head) = 1/2. We now plot a histogram of "number of heads in 10 tosses". You will have some coins which gave 0 heads, more which gave 1, even more which gave 2, while most gave between 3 and 8 heads. This follows a binomial distribution. Getting 0 heads in 10 tosses is rare but not impossible, and notice that the probability of seeing 0 heads on at least one coin goes up with the number of coins (see the simulation sketch after this list).

2. Thus we see that as the number of opportunities for a bad event to happen increases, the probability of that event happening (at least once) goes up! This is analogous to what the Professor explained: as you have more and more h in your H, it becomes more probable (notice that he is modelling the worst-case scenario) that a bad event will happen. By a bad event, I mean those cases where the bound is broken and you observe a large deviation.

3. The modified inequality tells us that we should be careful about choosing H. If H is too large (let's say H is the space of all polynomials), the probability of a bad event (read: over-fitting) will be high. The new formulation of the inequality accounts for this complexity.

4. The inequality gives us an upper bound on the worst-case performance. By adding the factor M, we ensure that the worst-case scenario for the given complexity is accounted for.
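A small simulation sketch of the 10,000-coin experiment from point 1 (illustrative numbers only):

import numpy as np

rng = np.random.default_rng(4)

n_coins, n_tosses, p = 10_000, 10, 0.5
heads = rng.binomial(n_tosses, p, n_coins)        # number of heads for each of the 10,000 coins

# Histogram of "number of heads in 10 tosses" across all coins (binomial shape).
for k, count in enumerate(np.bincount(heads, minlength=n_tosses + 1)):
    print(f"{k:2d} heads: {count:5d} coins")

# The chance that at least one coin shows 0 heads grows with the number of coins.
p_zero_somewhere = 1 - (1 - 0.5**n_tosses) ** n_coins
print(f"P(some coin shows 0 heads) = {p_zero_somewhere:.4f}")

With a single coin, 0 heads in 10 tosses has probability about 1/1024; with 10,000 coins, seeing it on at least one coin is a near certainty.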

In summary, the next stage of the problem is to think about the complexity of your hypothesis set. As you can imagine, a 20th-order polynomial is too complex for 20 data points; try calculating the right-hand side of the modified Hoeffding inequality and you will be convinced that this is the case (a quick calculation is sketched below)! In fact, even a linear classifier with weight adjustment needs to be treated carefully, as we are searching the entire 3-dimensional weight space to get the weights right!
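As a rough illustration of that calculation, with made-up numbers N = 20 and eps = 0.3, and M standing in for the effective number of hypotheses:

import numpy as np

N, eps = 20, 0.3
for M in (1, 100, 10_000, 1_000_000):
    rhs = 2 * M * np.exp(-2 * eps**2 * N)     # right-hand side of the multiple-bin bound
    print(f"M = {M:>9}: 2*M*exp(-2*eps^2*N) = {rhs:.3f}")

Even for modest M the right-hand side exceeds 1 and the bound says nothing, which is the sense in which a too-complex H buys you no guarantee.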

Once we can separate the two concepts:
a. that of cooking up a g(x) absolutely randomly (based on the population of India, the speed of Earth's rotation, and what my mother cooked for supper) and then testing it on my X to get Ein as a prediction of Eout for a given N,
b. and that of choosing a classifier by searching my hypothesis space, for which we will have to pay a price in terms of the worst-case theoretical guarantee,
..... it should all be understandable.