LFD Book Forum Is the Hoeffding Inequality really valid for each bin despite non-random sampling?
 Register FAQ Calendar Mark Forums Read

#1
01-31-2013, 12:57 AM
 scottedwards2000 Junior Member Join Date: Jan 2013 Posts: 9
Is the Hoeffding Inequality really valid for each bin despite non-random sampling?

The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).

#2
02-05-2013, 10:18 PM
 yaser Caltech Join Date: Aug 2009 Location: Pasadena, California, USA Posts: 1,478
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).!
I see your concern. Here is one way to argue about it.

Assume the marbles have numbers to identify the point they are associated with, in addition to being green and red. Start with one bin, pick the marbles at random and look at the colors in sample and out of sample. It is clear that the Hoeffding inequality holds for this bin since the experiment is that of a single bin, regardless of the numbers on the marbles in sample.

Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.
__________________
Where everyone thinks alike, no one thinks very much
#3
02-07-2013, 10:49 AM
 scottedwards2000 Junior Member Join Date: Jan 2013 Posts: 9
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual H. Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being validated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound right? Thanks again!
#4
02-07-2013, 12:45 PM
 yaser Caltech Join Date: Aug 2009 Location: Pasadena, California, USA Posts: 1,478
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual H. Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being validated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound right? Thanks again!
Correct. The behavior of each sample is indeed dependent on the other bin samples, but the Hoeffding inequality is valid for each bin by itself, and the union bound is valid whether they are independent or dependent.
__________________
Where everyone thinks alike, no one thinks very much
#5
04-06-2013, 01:24 PM
 scottedwards2000 Junior Member Join Date: Jan 2013 Posts: 9
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by yaser Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.
Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above? How does the fact that the first bin is arbitrary help us, when we have no idea if the bin we choose is the first bin? I can see how if we ran the experiment an infinite amount of times, we would eventually choose the first bin, where Hoeffding applies, but we aren't running it an infinite amount of times are we?

#6
04-06-2013, 01:40 PM
 yaser Caltech Join Date: Aug 2009 Location: Pasadena, California, USA Posts: 1,478
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above?
The sentence addresses the following concern. Hoeffding assumes a random sample from a bin. The different samples from multiple bins in the learning analogy are not "totally random" since they all depend on one sample of input points (which is random) but then all of them give red/geen values on that same sample according to the agreement/disagreement of their respective hypothesis with the target function.

The resolution of this dilemma is that while it is true, each bin in isolation sees the sample as random, and therefore each bin obeys Hoeffding inequality by itself. When we consider all the bins at once, it does not matter how they are correlated since we invoked the union bound which is always valid ragardless of the correlations.
__________________
Where everyone thinks alike, no one thinks very much
#7
04-12-2013, 03:58 PM
 grozhd Junior Member Join Date: Apr 2013 Posts: 4
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

I have the same concern as scottedwards2000 and I still don't understand how it is resolved.

As I understand the bin symbolizes the probability space of all possible inputs . Sample of balls drawn randomly from the bin symbolizes our training set .

Now we pick a hypothesis (suppose we are running PLA). We look at our sample , compute and use Hoeffding's Inequality. We do one step of PLA and come up with new hypothesis which automatically gives us and professor is saying that we can write down Hoeffding inequality for and ?

I guess, we can. But that inequality tells us something about random variable , i.e. about: where is a random sample. But it seems like we are using where is hardly random with regard to since we built using that sample.

Here is an example that illustrates my point: say we tried some random , compared it with target function on our training sample , wrote down Hoeffding's inequality. Now let's construct as follows: and . Let's write down Hoeffding's ineqaulity for this hypothesis. If we are indeed using then here it would be equal to 1 since on and we would have: is small. So somehow we are saying with high probability that does an excellent job out of sample though we didn't change it much from . This example shouldn't be correct, right? If it isn't how is the one with PLA correct?
#8
04-12-2013, 11:16 PM
 yaser Caltech Join Date: Aug 2009 Location: Pasadena, California, USA Posts: 1,478
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by grozhd I have the same concern as scottedwards2000 and I still don't understand how it is resolved. As I understand the bin symbolizes the probability space of all possible inputs . Sample of balls drawn randomly from the bin symbolizes our training set . Now we pick a hypothesis (suppose we are running PLA). We look at our sample , compute and use Hoeffding's Inequality. We do one step of PLA and come up with new hypothesis which automatically gives us and professor is saying that we can write down Hoeffding inequality for and ? I guess, we can. But that inequality tells us something about random variable , i.e. about: where is a random sample. But it seems like we are using where is hardly random with regard to since we built using that sample. Here is an example that illustrates my point: say we tried some random , compared it with target function on our training sample , wrote down Hoeffding's inequality. Now let's construct as follows: and . Let's write down Hoeffding's ineqaulity for this hypothesis. If we are indeed using then here it would be equal to 1 since on and we would have: is small. So somehow we are saying with high probability that does an excellent job out of sample though we didn't change it much from . This example shouldn't be correct, right? If it isn't how is the one with PLA correct?
It is a subtle point, so let me try to explain it in the terms you outlined. Let us take the sample (what you call , just to follow the book notation). Now evaluate for all hypotheses in your model . We didn't start at one and moved to another. We just evaluated for all . The question is, does Hoeffding inequality apply to each of these 's by itself? The answer is clearly yes since each of them could be in principle the hypothesis you started with (which you called ).

Hoeffding states what the probabilities are before the sample is drawn. When you choose one of these hypotheses because of its small , as in the scenario you point out, the probability that applies now is conditioned on the sample having small . We can try to get conditional version of Hoeffding to deal with the situation, or we can try to get a version of Hoeffding that applies regardless of which we choose and how we choose it. The latter is what we did using the union bound.

Finally, taking the example you illustrated, any hypothesis you use has to be in (which is decided before the sample is drawn). The one you constructed is not guaranteed to be in . Of course you can guarantee that it is in by taking to be the set of all possible hypotheses, but in this case, is thoroughly infinite and the multiple-bin Hoeffding does not guarantee anything at all.
__________________
Where everyone thinks alike, no one thinks very much
#9
04-12-2013, 11:29 PM
 Rahul Sinha Junior Member Join Date: Apr 2013 Posts: 9
Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Another thread discusses a similar question. Trying to post my reply here and elaborating to avoid cross posting.

Let's take the easiest example of coin tossing. Let's assume you know that a coin is unbiased (P(h)=P(t)=.5). You give the coin to me and pay me some money to toss the coin 10K times and diligently record the outcome of each toss. These outcomes form X or your data set. Now, you ask me to report the P(h). Well, I being a stupid labor have no clue. But I have learned that "mu of X" closely follows P(h) due to some mysterious equation called Hoeffding's inequality. So I report this to you which is about .4998. You are happy that your assessment of non-bias was correct and you reward me with more Dollars .
Let us try and decode the experiment: Here x is a binary random variable which is either head or tail in a single toss. X is your data at hand; i.e the Outcome of 10K tosses. y = P(h) is the target function which we are trying to learn. We use mu or the fraction of heads in the entire data set at hand as an estimator of y. By definition, P(h) is continuous (It can take any value) but mu is discrete and hence you need an infinite number of tosses to get P or E(mu). Usually, we will neither have the time nor the energy to flip a coin infinite number of times. So we rely on some statistical assumption to play God and make predictions. The assumption, in our case being, that for a large N (10K), the P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality. It is possible however that in my 10K tosses, I got mu = .05 while P(h)=.5! The right hand side of the Hoeffding's gives you a predictive bound and the prediction can go wrong on some occasions. So given a N, the inequality might evaluate to P(.000001); P(.000001) is very very small but it is > P(0) and hence "what can go wrong will go wrong" philosophy might apply. But in general it won't! For eg, you can say that it is very rare to find a girl who is 9ft tall but that doesn't mean it can not happen in the lifetime of our species! So if an Alien samples all humans alive today, they can approximately predict the height of any Homo sapiens it is to encounter considerably accurately (but they might miss on some 9ft miss yet to be born in the year 2890 ).

So we learned an estimate of P(h). In normal machine learning, for example in a regression problem, we will learn something else. ASSUMING (The assumption is extremely important) a linear classifier, we learn the weights. To be precise, x is your data point on the x axis. y is the true f(x) which is your target. g(x) is given by the estimate of weights you will learn. Within the assumptions of this setting, Hoeffding's will give you a good estimate (but of what? ERROR!). Specifically, if you get an error of .0005 on a data set of size 10K ( X is dataset of 10K input values on the x axis whose y you already know) you can be almost certain that your learned weights are close to the original weights. Why? because your estimate of error should closely follow original error on the entire space and vice-versa, given that N is large. So you learn an estimate of the error by calculating the error on your sample data (Ein) and using this estimate to predict real errors on new data points (Eout).

<I actually cheated a bit when I said you can chose the weights and still use Hoeffding. The original Hoeffding will only apply if you randomly guess some weight and calculate the error on data (Ein) to predict Eout. This is not a big problem however and it should be clear when you read my next post.>

 Thread Tools Display Modes Hybrid Mode

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home General     General Discussion of Machine Learning     Free Additional Material         Dynamic e-Chapters         Dynamic e-Appendices Course Discussions     Online LFD course         General comments on the course         Homework 1         Homework 2         Homework 3         Homework 4         Homework 5         Homework 6         Homework 7         Homework 8         The Final         Create New Homework Problems Book Feedback - Learning From Data     General comments on the book     Chapter 1 - The Learning Problem     Chapter 2 - Training versus Testing     Chapter 3 - The Linear Model     Chapter 4 - Overfitting     Chapter 5 - Three Learning Principles     e-Chapter 6 - Similarity Based Methods     e-Chapter 7 - Neural Networks     e-Chapter 8 - Support Vector Machines     e-Chapter 9 - Learning Aides     Appendix and Notation     e-Appendices

All times are GMT -7. The time now is 02:35 PM.