LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Chapter 1 - The Learning Problem (http://book.caltech.edu/bookforum/forumdisplay.php?f=108)
-   -   Is the Hoeffding Inequality really valid for each bin despite non-random sampling? (http://book.caltech.edu/bookforum/showthread.php?t=3926)

 scottedwards2000 01-31-2013 12:57 AM

Is the Hoeffding Inequality really valid for each bin despite non-random sampling?

The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).

 yaser 02-05-2013 10:18 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 (Post 9074) The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).!
I see your concern. Here is one way to argue about it.

Assume the marbles have numbers to identify the point they are associated with, in addition to being green and red. Start with one bin, pick the marbles at random and look at the colors in sample and out of sample. It is clear that the Hoeffding inequality holds for this bin since the experiment is that of a single bin, regardless of the numbers on the marbles in sample.

Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.

 scottedwards2000 02-07-2013 10:49 AM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual H. Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being validated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound right? Thanks again!

 yaser 02-07-2013 12:45 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 (Post 9237) Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual H. Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being validated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound right? Thanks again!
Correct. The behavior of each sample is indeed dependent on the other bin samples, but the Hoeffding inequality is valid for each bin by itself, and the union bound is valid whether they are independent or dependent.

 scottedwards2000 04-06-2013 01:24 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by yaser (Post 9215) Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in sample from that bin for the rest of the bins. Hoeffding still applies to that original given bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to all of the bins individually.
Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above? How does the fact that the first bin is arbitrary help us, when we have no idea if the bin we choose is the first bin? I can see how if we ran the experiment an infinite amount of times, we would eventually choose the first bin, where Hoeffding applies, but we aren't running it an infinite amount of times are we? :clueless:

 yaser 04-06-2013 01:40 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by scottedwards2000 (Post 10169) Actually, now that I re-read this part (I'm taking the course again!), I realize that I don't really follow this logic. Can you please expand on your last sentence above?
The sentence addresses the following concern. Hoeffding assumes a random sample from a bin. The different samples from multiple bins in the learning analogy are not "totally random" since they all depend on one sample of input points (which is random) but then all of them give red/geen values on that same sample according to the agreement/disagreement of their respective hypothesis with the target function.

The resolution of this dilemma is that while it is true, each bin in isolation sees the sample as random, and therefore each bin obeys Hoeffding inequality by itself. When we consider all the bins at once, it does not matter how they are correlated since we invoked the union bound which is always valid ragardless of the correlations.

 grozhd 04-12-2013 03:58 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

I have the same concern as scottedwards2000 and I still don't understand how it is resolved.

As I understand the bin symbolizes the probability space of all possible inputs . Sample of balls drawn randomly from the bin symbolizes our training set .

Now we pick a hypothesis (suppose we are running PLA). We look at our sample , compute and use Hoeffding's Inequality. We do one step of PLA and come up with new hypothesis which automatically gives us and professor is saying that we can write down Hoeffding inequality for and ?

I guess, we can. But that inequality tells us something about random variable , i.e. about: where is a random sample. But it seems like we are using where is hardly random with regard to since we built using that sample.

Here is an example that illustrates my point: say we tried some random , compared it with target function on our training sample , wrote down Hoeffding's inequality. Now let's construct as follows: and . Let's write down Hoeffding's ineqaulity for this hypothesis. If we are indeed using then here it would be equal to 1 since on and we would have: is small. So somehow we are saying with high probability that does an excellent job out of sample though we didn't change it much from . This example shouldn't be correct, right? If it isn't how is the one with PLA correct?

 yaser 04-12-2013 11:16 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Quote:
 Originally Posted by grozhd (Post 10357) I have the same concern as scottedwards2000 and I still don't understand how it is resolved. As I understand the bin symbolizes the probability space of all possible inputs . Sample of balls drawn randomly from the bin symbolizes our training set . Now we pick a hypothesis (suppose we are running PLA). We look at our sample , compute and use Hoeffding's Inequality. We do one step of PLA and come up with new hypothesis which automatically gives us and professor is saying that we can write down Hoeffding inequality for and ? I guess, we can. But that inequality tells us something about random variable , i.e. about: where is a random sample. But it seems like we are using where is hardly random with regard to since we built using that sample. Here is an example that illustrates my point: say we tried some random , compared it with target function on our training sample , wrote down Hoeffding's inequality. Now let's construct as follows: and . Let's write down Hoeffding's ineqaulity for this hypothesis. If we are indeed using then here it would be equal to 1 since on and we would have: is small. So somehow we are saying with high probability that does an excellent job out of sample though we didn't change it much from . This example shouldn't be correct, right? If it isn't how is the one with PLA correct?
It is a subtle point, so let me try to explain it in the terms you outlined. Let us take the sample (what you call , just to follow the book notation). Now evaluate for all hypotheses in your model . We didn't start at one and moved to another. We just evaluated for all . The question is, does Hoeffding inequality apply to each of these 's by itself? The answer is clearly yes since each of them could be in principle the hypothesis you started with (which you called ).

Hoeffding states what the probabilities are before the sample is drawn. When you choose one of these hypotheses because of its small , as in the scenario you point out, the probability that applies now is conditioned on the sample having small . We can try to get conditional version of Hoeffding to deal with the situation, or we can try to get a version of Hoeffding that applies regardless of which we choose and how we choose it. The latter is what we did using the union bound.

Finally, taking the example you illustrated, any hypothesis you use has to be in (which is decided before the sample is drawn). The one you constructed is not guaranteed to be in . Of course you can guarantee that it is in by taking to be the set of all possible hypotheses, but in this case, is thoroughly infinite :) and the multiple-bin Hoeffding does not guarantee anything at all.

 Rahul Sinha 04-12-2013 11:29 PM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

Another thread discusses a similar question. Trying to post my reply here and elaborating to avoid cross posting.

Let's take the easiest example of coin tossing. Let's assume you know that a coin is unbiased (P(h)=P(t)=.5). You give the coin to me and pay me some money to toss the coin 10K times and diligently record the outcome of each toss. These outcomes form X or your data set. Now, you ask me to report the P(h). Well, I being a stupid labor have no clue. But I have learned that "mu of X" closely follows P(h) due to some mysterious equation called Hoeffding's inequality. So I report this to you which is about .4998. You are happy that your assessment of non-bias was correct and you reward me with more Dollars .
Let us try and decode the experiment: Here x is a binary random variable which is either head or tail in a single toss. X is your data at hand; i.e the Outcome of 10K tosses. y = P(h) is the target function which we are trying to learn. We use mu or the fraction of heads in the entire data set at hand as an estimator of y. By definition, P(h) is continuous (It can take any value) but mu is discrete and hence you need an infinite number of tosses to get P or E(mu). Usually, we will neither have the time nor the energy to flip a coin infinite number of times. So we rely on some statistical assumption to play God and make predictions. The assumption, in our case being, that for a large N (10K), the P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality. It is possible however that in my 10K tosses, I got mu = .05 while P(h)=.5! The right hand side of the Hoeffding's gives you a predictive bound and the prediction can go wrong on some occasions. So given a N, the inequality might evaluate to P(.000001); P(.000001) is very very small but it is > P(0) and hence "what can go wrong will go wrong" philosophy might apply. But in general it won't! For eg, you can say that it is very rare to find a girl who is 9ft tall but that doesn't mean it can not happen in the lifetime of our species! So if an Alien samples all humans alive today, they can approximately predict the height of any Homo sapiens it is to encounter considerably accurately (but they might miss on some 9ft miss yet to be born in the year 2890 ).

So we learned an estimate of P(h). In normal machine learning, for example in a regression problem, we will learn something else. ASSUMING (The assumption is extremely important) a linear classifier, we learn the weights. To be precise, x is your data point on the x axis. y is the true f(x) which is your target. g(x) is given by the estimate of weights you will learn. Within the assumptions of this setting, Hoeffding's will give you a good estimate (but of what? ERROR!). Specifically, if you get an error of .0005 on a data set of size 10K ( X is dataset of 10K input values on the x axis whose y you already know) you can be almost certain that your learned weights are close to the original weights. Why? because your estimate of error should closely follow original error on the entire space and vice-versa, given that N is large. So you learn an estimate of the error by calculating the error on your sample data (Ein) and using this estimate to predict real errors on new data points (Eout).

<I actually cheated a bit when I said you can chose the weights and still use Hoeffding. The original Hoeffding will only apply if you randomly guess some weight and calculate the error on data (Ein) to predict Eout. This is not a big problem however and it should be clear when you read my next post.>

 Rahul Sinha 04-13-2013 12:00 AM

Re: Is the Hoeffding Inequality really valid for each bin despite non-random sampling

The following point was raised in the previous thread:
Quote:
 Originally Posted by grozhd (Post 10358) For example no matter how much data you give me I can always come up with a polynomial of degree equal to the number of training examples which is perfect in sample. Suppose you gave me 20 points from linear function and I generate a polynomial of degree 20. I think it will be really bad out of sample.
You are right! But this is a different problem and I will try to explain it in the next paragraph. In short, here you are using the set of all polynomials to choose one polynomial based on the data (not to mention the 20 dimensional weight space you will search to get the fit). Because you are 'optimizing', you will have to pay the price and then the whole game of VC dimensions and modified Hoeffding's inequality (addition of multiple bins in the lecture) need to be considered.

1. The process of searching H to get g has a cost. When you chose one h over another h, you have to pay in terms of the guarantee available.
Let's take a new coin tossing experiment. Assume we have 10000 coins and we toss each coin 10 times (10000 coins can be thought of as 10000 bins and N = 10!). Let's assume﻿ that P(head) = 1/2. We now plot a histogram of "number of heads in 10 tosses". So you will have some coins which gave 0 Heads, more which gave 1, even more which gave 2, while most gave between 3-8 heads. This will follow a binomial distribution. Getting 0 heads in 10 tosses is rare but not impossible. Notice P(0 heads) value goes up with n or the no of coins.

2. Thus we see that as the number of options for a bad event to happen increase, the p of that event happening (at-least once) goes up! This is an analog to what Prof explained: As you have more and more h in your H, it is probable ( notice that he is modelling the worst case scenario) that a bad event will happen. By bad event, I mean those cases where the bound is broken and you observe a large deviation.

3. The modified inequality tells us that we should be careful about choosing H. If H is too large (let's say if H is space of all polynomial), the P of a bad event (read over-fitting) will be high. The new formulation of the inequality accounts for the complexity.

4. The inequality gives us an upper bound on the worst case performance. By adding M, we ensure that the worse case scenario of this given complexity is accounted for.

In summary, the next stage of the problem is to think about the complexity of your hypothesis set. As you can imagine, a 20th order polynomial is too complex for 20 data points; try calculating the rhs of modified Hoeffding's inequality and you will be convinced that this is the case! In fact, even a linear classifier with weight adjustment needs to be treated carefully as we are searching the entire 3 dimensional weight space to get the weights right!

Once we can separate the 2 concepts:
a. that of cooking up a g(x) absolutely randomly (based on the population of India, speed or Earth's rotation and what my mother cooked for supper ) and then testing it on my X to get Ein as a prediction of Eout given a N.
b. and that of choosing a classifier based on adjusting by searching my hypothesis space for which we will have to pay a price in terms of the worst case theoretical guarantee.
..... it should be understandable. :)

All times are GMT -7. The time now is 06:12 AM.