Another thread discusses a similar question. I'm posting my reply here, with some elaboration, to avoid cross-posting.

Let's take the easiest example: coin tossing. Assume you know that a coin is unbiased (P(h) = P(t) = 0.5). You give the coin to me and pay me to toss it 10K times and diligently record the outcome of each toss. These outcomes form X, your data set. Now you ask me to report P(h). Well, I, being a simple labourer, have no clue. But I have learned that mu, the fraction of heads in X, closely follows P(h), thanks to a mysterious equation called Hoeffding's inequality. So I report that value to you: about 0.4998. You are happy that your assessment of unbiasedness was correct, and you reward me with more dollars.
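The experiment above is easy to simulate. This is just a sketch (the true p_heads is of course unknown to the labourer in the story; here we set it to 0.5 to play the role of the unbiased coin):

```python
import random

random.seed(0)  # fixed seed so the "experiment" is reproducible

N = 10_000       # number of tosses
p_heads = 0.5    # the true P(h), unknown to the person tossing

# Simulate N tosses and record each outcome (1 = head, 0 = tail).
# This list of outcomes is the data set X.
X = [1 if random.random() < p_heads else 0 for _ in range(N)]

# mu = fraction of heads in the sample, our estimate of P(h).
mu = sum(X) / N
print(mu)  # a value close to 0.5
```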

Let us try to decode the experiment. Here:

- **x is a binary random variable**: the outcome, head or tail, of a single toss.
- **X is your data at hand**: the outcomes of the 10K tosses.
- **y = P(h) is the target function** we are trying to learn.
- **mu, the fraction of heads in the entire data set at hand**, is our __estimator of y__.

P(h) can take any value in [0, 1], but for a finite sample mu can only take the discrete values 0/N, 1/N, ..., N/N, so you would need an infinite number of tosses for mu to converge exactly to P(h) (i.e. for E(mu) to equal P(h)). Usually we have neither the time nor the energy to flip a coin an infinite number of times, so we rely on a statistical assumption to play God and make predictions. The assumption, in our case, is that for a large N (10K), the probability of P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality. It is still possible, however, that in my 10K tosses I got mu = 0.05 while P(h) = 0.5! The right-hand side of Hoeffding's inequality gives you a

*predictive bound*, and the prediction can go wrong on some occasions. For a given N, the inequality might evaluate to, say, 0.000001; that probability is very, very small, but it is still greater than 0, so the "what can go wrong will go wrong" philosophy might apply. In general, though, it won't. For example, you can say that it is very rare to find a girl who is 9 ft tall, but that doesn't mean it cannot happen in the lifetime of our species. So if an alien sampled all humans alive today, it could predict the height of any Homo sapiens it is about to encounter quite accurately (but it might miss on some 9 ft miss yet to be born in the year 2890).
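To see just how tight the bound is for N = 10K, we can evaluate the right-hand side of Hoeffding's inequality, P(|mu - P(h)| > eps) <= 2 exp(-2 eps^2 N), directly (a sketch; eps = 0.05 is an arbitrary choice of tolerance):

```python
import math

def hoeffding_bound(eps, N):
    # Right-hand side of Hoeffding's inequality:
    # P(|mu - P(h)| > eps) <= 2 * exp(-2 * eps^2 * N)
    return 2 * math.exp(-2 * eps**2 * N)

# Probability that mu is more than 0.05 away from P(h) after 10K tosses.
print(hoeffding_bound(0.05, 10_000))  # astronomically small, on the order of 1e-22
```

The bound is nonzero, which is exactly the point above: the prediction can go wrong, it is just overwhelmingly unlikely to.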

So we **learned an estimate of P(h)**. In normal machine learning, for example in a regression problem, we learn something else. ASSUMING (the assumption is extremely important) a linear model, we *learn* the weights. To be precise: x is your input data point, y is the true f(x), which is your target, and g(x) is determined by the estimate of the weights you *learn*. Within the assumptions of this setting, Hoeffding's inequality will give you a good estimate (but of what? ERROR!). Specifically, if you get an error of 0.0005 on a data set of size 10K (X is a data set of 10K input values whose y you already know), you can be almost certain that your learned weights are close to the original weights. Why? Because your estimate of the error should closely follow the true error over the entire input space, and vice versa, given that N is large. So you

**learn an estimate of the error** by calculating the error on your sample data (Ein) and using this estimate to predict the real error on new data points (Eout).
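The Ein/Eout story can be sketched with a toy regression. Here the target f(x) = 2x + 1 and the noise level are made-up for illustration: we fit the weights by least squares on one sample, measure Ein on that sample, and measure Eout on fresh data the fit never saw.

```python
import random

random.seed(1)
N = 10_000

# Hypothetical target function f(x) = 2x + 1, observed with Gaussian noise.
def f(x):
    return 2 * x + 1

xs = [random.uniform(-1, 1) for _ in range(N)]
ys = [f(x) + random.gauss(0, 0.1) for x in xs]

# Least-squares fit of g(x) = w1*x + w0 (closed form for one input variable).
mx = sum(xs) / N
my = sum(ys) / N
w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
w0 = my - w1 * mx

def mse(xs, ys):
    # Mean squared error of the learned g on a given data set.
    return sum((w1 * x + w0 - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

E_in = mse(xs, ys)  # error on the sample we trained on

# Fresh points standing in for "out of sample" data.
xs_new = [random.uniform(-1, 1) for _ in range(N)]
ys_new = [f(x) + random.gauss(0, 0.1) for x in xs_new]
E_out = mse(xs_new, ys_new)

print(E_in, E_out)  # both near the noise variance, and close to each other
```

With N this large, Ein tracks Eout closely, which is the whole point: the error you can measure stands in for the error you cannot.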

<I actually cheated a bit when I said you can choose the weights and still use Hoeffding's inequality. The original Hoeffding's inequality only applies if you guess some weights in advance, before looking at the data, and then calculate the error on the data (Ein) to predict Eout. This is not a big problem, however, and it should become clear when you read my next post.>