Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling?
The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then repicking the same marbles from each bin (yes, the redgreen colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).
Thanks for your help! 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual H. Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. This means that they are more likely than disperse points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being validated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound right? Thanks again!

Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Quote:

Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Quote:
Thanks for your help! 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Quote:
The resolution of this dilemma is that while it is true, each bin in isolation sees the sample as random, and therefore each bin obeys Hoeffding inequality by itself. When we consider all the bins at once, it does not matter how they are correlated since we invoked the union bound which is always valid ragardless of the correlations. 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
I have the same concern as scottedwards2000 and I still don't understand how it is resolved.
As I understand the bin symbolizes the probability space of all possible inputs . Sample of balls drawn randomly from the bin symbolizes our training set . Now we pick a hypothesis (suppose we are running PLA). We look at our sample , compute and use Hoeffding's Inequality. We do one step of PLA and come up with new hypothesis which automatically gives us and professor is saying that we can write down Hoeffding inequality for and ? I guess, we can. But that inequality tells us something about random variable , i.e. about: where is a random sample. But it seems like we are using where is hardly random with regard to since we built using that sample. Here is an example that illustrates my point: say we tried some random , compared it with target function on our training sample , wrote down Hoeffding's inequality. Now let's construct as follows: and . Let's write down Hoeffding's ineqaulity for this hypothesis. If we are indeed using then here it would be equal to 1 since on and we would have: is small. So somehow we are saying with high probability that does an excellent job out of sample though we didn't change it much from . This example shouldn't be correct, right? If it isn't how is the one with PLA correct? 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Quote:
Hoeffding states what the probabilities are before the sample is drawn. When you choose one of these hypotheses because of its small , as in the scenario you point out, the probability that applies now is conditioned on the sample having small . We can try to get conditional version of Hoeffding to deal with the situation, or we can try to get a version of Hoeffding that applies regardless of which we choose and how we choose it. The latter is what we did using the union bound. Finally, taking the example you illustrated, any hypothesis you use has to be in (which is decided before the sample is drawn). The one you constructed is not guaranteed to be in . Of course you can guarantee that it is in by taking to be the set of all possible hypotheses, but in this case, is thoroughly infinite :) and the multiplebin Hoeffding does not guarantee anything at all. 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
Another thread discusses a similar question. Trying to post my reply here and elaborating to avoid cross posting.
Let's take the easiest example of coin tossing. Let's assume you know that a coin is unbiased (P(h)=P(t)=.5). You give the coin to me and pay me some money to toss the coin 10K times and diligently record the outcome of each toss. These outcomes form X or your data set. Now, you ask me to report the P(h). Well, I being a stupid labor have no clue. But I have learned that "mu of X" closely follows P(h) due to some mysterious equation called Hoeffding's inequality. So I report this to you which is about .4998. You are happy that your assessment of nonbias was correct and you reward me with more Dollars . Let us try and decode the experiment: Here x is a binary random variable which is either head or tail in a single toss. X is your data at hand; i.e the Outcome of 10K tosses. y = P(h) is the target function which we are trying to learn. We use mu or the fraction of heads in the entire data set at hand as an estimator of y. By definition, P(h) is continuous (It can take any value) but mu is discrete and hence you need an infinite number of tosses to get P or E(mu). Usually, we will neither have the time nor the energy to flip a coin infinite number of times. So we rely on some statistical assumption to play God and make predictions. The assumption, in our case being, that for a large N (10K), the P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality. It is possible however that in my 10K tosses, I got mu = .05 while P(h)=.5! The right hand side of the Hoeffding's gives you a predictive bound and the prediction can go wrong on some occasions. So given a N, the inequality might evaluate to P(.000001); P(.000001) is very very small but it is > P(0) and hence "what can go wrong will go wrong" philosophy might apply. But in general it won't! For eg, you can say that it is very rare to find a girl who is 9ft tall but that doesn't mean it can not happen in the lifetime of our species! So if an Alien samples all humans alive today, they can approximately predict the height of any Homo sapiens it is to encounter considerably accurately (but they might miss on some 9ft miss yet to be born in the year 2890 ). So we learned an estimate of P(h). In normal machine learning, for example in a regression problem, we will learn something else. ASSUMING (The assumption is extremely important) a linear classifier, we learn the weights. To be precise, x is your data point on the x axis. y is the true f(x) which is your target. g(x) is given by the estimate of weights you will learn. Within the assumptions of this setting, Hoeffding's will give you a good estimate (but of what? ERROR!). Specifically, if you get an error of .0005 on a data set of size 10K ( X is dataset of 10K input values on the x axis whose y you already know) you can be almost certain that your learned weights are close to the original weights. Why? because your estimate of error should closely follow original error on the entire space and viceversa, given that N is large. So you learn an estimate of the error by calculating the error on your sample data (Ein) and using this estimate to predict real errors on new data points (Eout). <I actually cheated a bit when I said you can chose the weights and still use Hoeffding. The original Hoeffding will only apply if you randomly guess some weight and calculate the error on data (Ein) to predict Eout. This is not a big problem however and it should be clear when you read my next post.> 
Re: Is the Hoeffding Inequality really valid for each bin despite nonrandom sampling
The following point was raised in the previous thread:
Quote:
1. The process of searching H to get g has a cost. When you chose one h over another h, you have to pay in terms of the guarantee available. Let's take a new coin tossing experiment. Assume we have 10000 coins and we toss each coin 10 times (10000 coins can be thought of as 10000 bins and N = 10!). Let's assume that P(head) = 1/2. We now plot a histogram of "number of heads in 10 tosses". So you will have some coins which gave 0 Heads, more which gave 1, even more which gave 2, while most gave between 38 heads. This will follow a binomial distribution. Getting 0 heads in 10 tosses is rare but not impossible. Notice P(0 heads) value goes up with n or the no of coins. 2. Thus we see that as the number of options for a bad event to happen increase, the p of that event happening (atleast once) goes up! This is an analog to what Prof explained: As you have more and more h in your H, it is probable ( notice that he is modelling the worst case scenario) that a bad event will happen. By bad event, I mean those cases where the bound is broken and you observe a large deviation. 3. The modified inequality tells us that we should be careful about choosing H. If H is too large (let's say if H is space of all polynomial), the P of a bad event (read overfitting) will be high. The new formulation of the inequality accounts for the complexity. 4. The inequality gives us an upper bound on the worst case performance. By adding M, we ensure that the worse case scenario of this given complexity is accounted for. In summary, the next stage of the problem is to think about the complexity of your hypothesis set. As you can imagine, a 20th order polynomial is too complex for 20 data points; try calculating the rhs of modified Hoeffding's inequality and you will be convinced that this is the case! In fact, even a linear classifier with weight adjustment needs to be treated carefully as we are searching the entire 3 dimensional weight space to get the weights right! Once we can separate the 2 concepts: a. that of cooking up a g(x) absolutely randomly (based on the population of India, speed or Earth's rotation and what my mother cooked for supper ) and then testing it on my X to get Ein as a prediction of Eout given a N. b. and that of choosing a classifier based on adjusting by searching my hypothesis space for which we will have to pay a price in terms of the worst case theoretical guarantee. ..... it should be understandable. :) 
All times are GMT 7. The time now is 12:24 AM. 
Powered by vBulletin® Version 3.8.3
Copyright ©2000  2021, Jelsoft Enterprises Ltd.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. AbuMostafa, Malik MagdonIsmail, and HsuanTien Lin, and participants in the Learning From Data MOOC by Yaser S. AbuMostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.