#1
The multiple bin analogy of picking the best h is a very helpful way of visualizing the situation, and I totally get how the union bound sets an upper limit on the probability of exceeding the error threshold. What I am actually questioning is whether those individual probabilities that compose the union bound are correct. I can see that they are just the individual Hoeffding Inequalities for each h, but is the Hoeffding Inequality really valid for all those h's in spite of the fact that we are NOT taking random samples from each "bin"? We are only picking our marbles (x's) ONCE, and then re-picking the same marbles from each bin (yes, the red-green colors of those marbles can change, based on the specific h, but aren't they the same marbles (x's)?).
Thanks for your help!
#2
Assume the marbles have numbers to identify the x's. Now if you reuse these numbers to do something else somewhere else, that will not alter the applicability of Hoeffding to the bin we started with, right? You can now view the multiple-bin experiment as starting with any given bin, then reusing the numbers you got in the sample from that bin for the rest of the bins. Hoeffding still applies to that original bin, regardless of which one it is. Since that bin is arbitrary, Hoeffding must apply to each of the bins individually.
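Here is a minimal simulation sketch of this "one sample, reused by every bin" picture. The threshold hypotheses h_t(x) = [x <= t], the target f(x) = [x <= 0.5], and all parameters are illustrative assumptions, not part of the lecture setup; the point is only that each bin's violation rate stays below its own Hoeffding bound even though the x's are drawn once and then re-coloured by every h.

```python
# Sketch (assumed setup): one sample of x's is drawn per trial, then the SAME
# x's are re-coloured by each hypothesis (bin). We check the per-bin Hoeffding
# bound for every bin individually.
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 100, 0.1, 20000

# Hypothetical bins: h_t(x) = [x <= t], target f(x) = [x <= 0.5] on uniform x,
# so the true error (mu) of bin t is |t - 0.5|.
thresholds = np.array([0.2, 0.35, 0.5, 0.65, 0.8])
mus = np.abs(thresholds - 0.5)

violations = np.zeros(len(thresholds))          # per-bin count of |nu - mu| > eps
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, N)                # marbles are drawn ONCE per trial
    for i, t in enumerate(thresholds):
        nu = np.mean((x <= t) != (x <= 0.5))    # same x's, re-coloured by each h_t
        violations[i] += abs(nu - mus[i]) > eps

print("per-bin violation rate :", violations / trials)
print("Hoeffding bound per bin:", 2 * np.exp(-2 * eps**2 * N))
```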
__________________
Where everyone thinks alike, no one thinks very much
#3
Thanks for responding. I think I see what you mean. However, I think it does impact the independence of those individual Hoeffding Inequalities. As a simple example, using the perceptron, imagine that your sample picks points that are very close together. They are then more likely than dispersed points to be either all correctly classified together or all misclassified together. Therefore it would seem that the chance of at least one error bound being violated goes up, as compared to picking a separate sample for each bin (which is like the coin experiment you had us do). But I guess you deal with this by using the union bound, right? Thanks again!
#4
__________________
Where everyone thinks alike, no one thinks very much
#5
Thanks for your help!
#6
The resolution of this dilemma is that, while it is true that the bins are correlated through the shared sample, each bin in isolation sees that sample as random, and therefore each bin obeys the Hoeffding Inequality by itself. When we consider all the bins at once, it does not matter how they are correlated, since we invoke the union bound, which is always valid regardless of the correlations.
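A small illustrative check of this point (the hypotheses and parameters below are assumed for the sketch, not taken from the course): even though all the bins share one sample and are therefore correlated, the frequency of "at least one bin is off by more than epsilon" stays below the sum of the individual Hoeffding bounds.

```python
# Illustrative check of the union bound under correlated bins: the bins share
# one sample, yet P[at least one bin has |nu - mu| > eps] is still bounded by
# the sum of the per-bin Hoeffding bounds.
import numpy as np

rng = np.random.default_rng(1)
N, eps, trials = 200, 0.1, 20000
thresholds = np.linspace(0.1, 0.9, 9)     # hypothetical h_t(x) = [x <= t]
mus = np.abs(thresholds - 0.5)            # true error vs. target f(x) = [x <= 0.5]

any_violation = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, N)          # one shared sample for all bins
    nus = np.array([np.mean((x <= t) != (x <= 0.5)) for t in thresholds])
    any_violation += np.any(np.abs(nus - mus) > eps)

union_bound = len(thresholds) * 2 * np.exp(-2 * eps**2 * N)
print("P[some bin off by > eps] ~", any_violation / trials)
print("union bound             :", min(union_bound, 1.0))
```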
__________________
Where everyone thinks alike, no one thinks very much
#7
I have the same concern as scottedwards2000 and I still don't understand how it is resolved.
As I understand it, the bin symbolizes the probability space of all possible inputs X, and each marble is one input point x. Now we pick a hypothesis h; it colors a marble green when h agrees with the target f on that x and red when it disagrees, and we ask whether we can apply Hoeffding's inequality to bound how far the sample frequency nu can be from the bin frequency mu.

I guess we can. But that inequality tells us something about the random variable nu only when the sample is drawn at random for that one fixed h. Once we reuse the same sample to evaluate other hypotheses and to choose among them, it is not clear to me what the inequality still guarantees.

Here is an example that illustrates my point: say we tried some random h_1 and its nu on the sample was large, so we kept adjusting it until we arrived at an h_2 whose nu on the very same sample is close to zero. Since h_2 was selected precisely because of how it performs on this particular sample, can we still apply Hoeffding's inequality to h_2 and trust that its mu is close to its nu?
#8
Hoeffding applies to each of h_1, h_2, ..., h_M individually: each bin, taken by itself, receives a random sample, so P[|nu - mu| > epsilon] <= 2 exp(-2 epsilon^2 N) holds for every single one of them. Hoeffding states what the probabilities are before the sample is drawn. When you choose one of these hypotheses because of its small nu on the sample you have, the choice itself depends on that sample, so the single-bin statement no longer covers the chosen hypothesis on its own; that is exactly what the union bound over all M hypotheses is for. Finally, taking the example you illustrated, any hypothesis you use has to be in the hypothesis set that was fixed before the data was generated, so your h_2 is one of the h_m's, and the bound with the factor M already accounts for it.
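A sketch of that selection effect, loosely echoing the coin experiment mentioned earlier in the thread (the numbers M = 1000, N = 10, and epsilon = 0.3 are assumed for illustration): a coin fixed before the flips respects the single-bin Hoeffding bound, while the coin chosen afterwards for its extreme nu does not, yet it is still covered by the union bound over all M coins.

```python
# Selection effect sketch (assumed parameters): flip M fair coins N times each,
# then pick the coin with the fewest heads. The fixed coin obeys the single-bin
# Hoeffding bound; the chosen coin does not, but the union bound still holds.
import numpy as np

rng = np.random.default_rng(2)
M, N, eps, trials = 1000, 10, 0.3, 2000
mu = 0.5                                      # every coin is fair

viol_fixed, viol_chosen = 0, 0
for _ in range(trials):
    nus = rng.binomial(N, mu, size=M) / N     # head frequency of each coin
    viol_fixed  += abs(nus[0] - mu) > eps     # coin fixed before the flips
    viol_chosen += abs(nus.min() - mu) > eps  # coin chosen for its small nu

single_bound = 2 * np.exp(-2 * eps**2 * N)
print("fixed coin : ", viol_fixed / trials, " <= ", single_bound)
print("chosen coin: ", viol_chosen / trials, " <= ", min(1.0, M * single_bound))
```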
__________________
Where everyone thinks alike, no one thinks very much
#9
Another thread discusses a similar question. I am posting my reply here, with some elaboration, to avoid cross-posting.
Let's take the easiest example: coin tossing. Assume you know that a coin is unbiased (P(h) = P(t) = 0.5). You give the coin to me and pay me some money to toss it 10K times and diligently record the outcome of each toss. These outcomes form X, your data set. Now you ask me to report P(h). Well, being a clueless laborer, I have no idea. But I have learned that mu, the fraction of heads in X, closely follows P(h) thanks to a mysterious equation called Hoeffding's inequality. So I report that to you: it is about 0.4998. You are happy that your assessment of non-bias was correct, and you reward me with more dollars.

Let us try to decode the experiment. Here x is a binary random variable, either heads or tails in a single toss. X is your data at hand, i.e. the outcome of the 10K tosses. y = P(h) is the target we are trying to learn. We use mu, the fraction of heads in the data set at hand, as an estimator of y. P(h) is continuous (it can take any value) but mu is discrete, so you would need an infinite number of tosses to recover P(h) exactly as E(mu). Usually we have neither the time nor the energy to flip a coin an infinite number of times, so we rely on a statistical guarantee to play God and make predictions. The guarantee, in our case, is that for a large N (10K), the probability of P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality.

It is possible, however, that in my 10K tosses I got mu = 0.05 while P(h) = 0.5! The right-hand side of Hoeffding's inequality gives you a probabilistic bound, and the prediction can go wrong on some occasions. For a given N, the bound might evaluate to 0.000001; that is very, very small, but it is still greater than 0, so the "what can go wrong will go wrong" philosophy might apply. In general, though, it won't! For example, it is very rare to find a girl who is 9 ft tall, but that doesn't mean it cannot happen in the lifetime of our species. If an alien sampled all humans alive today, it could predict the height of any Homo sapiens it is going to encounter fairly accurately (though it might miss the 9 ft miss yet to be born in the year 2890).

So we learned an estimate of P(h). In ordinary machine learning, for example in a regression problem, we learn something else. ASSUMING (the assumption is extremely important) a linear model, we learn the weights. To be precise, x is your data point on the x-axis, y is the true f(x), which is your target, and g(x) is given by the estimate of the weights you learn. Within the assumptions of this setting, Hoeffding gives you a good estimate (of what? ERROR!). Specifically, if you get an error of 0.0005 on a data set of size 10K (X is a data set of 10K input values on the x-axis whose y you already know), you can be almost certain that your learned weights are close to the original weights. Why? Because your estimate of the error should closely follow the true error on the entire space, and vice versa, given that N is large. So you estimate the error by calculating it on your sample data (Ein) and use this estimate to predict the real error on new data points (Eout).

<I actually cheated a bit when I said you can choose the weights and still use Hoeffding. The original Hoeffding inequality only applies if you guess some weights without looking at the data and then calculate the error on the data (Ein) to predict Eout. This is not a big problem, however, and it should be clear when you read my next post.>
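As a quick numeric companion to the 10K-toss story (the tolerance epsilon = 0.02 is my own assumed value; the post does not specify one), the Hoeffding bound can be evaluated directly:

```python
# Numeric check of the 10K-toss example. The tolerance eps is an assumed value,
# just to put a number on the Hoeffding bound.
import math

N   = 10_000      # tosses recorded
nu  = 0.4998      # observed fraction of heads (from the post)
eps = 0.02        # assumed tolerance on |nu - P(h)|

bound = 2 * math.exp(-2 * eps**2 * N)    # Hoeffding: P[|nu - P(h)| > eps] <= bound
print(f"observed nu = {nu}")
print(f"P[|nu - P(h)| > {eps}] <= {bound:.6f}")   # ~0.000671: P(h) is almost surely within 0.02 of nu
```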