Quote:
Originally Posted by grozhd
As far as I understand Hoeffding's inequality tells us about the probability of "big" differences between a random variable value (mean of N independent r.v.'s) and a number (it's expected value).
Since it is a probabilistic inequality it means that if we try to do the expirement we will get this "big" difference only with certain limited probability. And "doing the experiment" assumes that we have a probability space X with a probability function P, then some outcome x from X happens, our random variables have some value on this x.
But when we are talking about learning we have a fixed data set so it seems to me that we don't observe a new outcome x each time, but we use one fixed which corresponds to our learning data set. And as we go from first hypothesis to second we can't use Hoeffding's inequality no more unless we select another random sample. For example is sample is finite and I have big hypothesis set I will easily go from one hypothesis to another and find one which is equal to target function on that sample. But as professor said in first lecture I didn't learn anything, I just memorized the data. But in lecture 2 it seems that we do the same and we are saying that inequality should work which is obviously not true for my example.
I know that I may have stated it fuzzily but that's because I don't actually understand the formalism here: what is our probability space, what is P, what are the random variables,  so that's why I can't formulate my question precisely. But I have a feeling that using this same data set somehow violates the probabilistic picture.
So, if anyone could clarify that it would be great, thanks in advance.

I will try and summarize what I know and you are encouraged to correct me If I don't make any sense or if I am wrong.
Let's take the easiest example of coin tossing. Let's assume you know that a coin is unbiased (P(h)=P(t)=.5). You give the coin to me and pay me some money to toss the coin 10K times and diligently record the outcome of each toss. These outcomes form X or your data set. Now, you ask me to report the P(h). Well, I being a stupid labor have no clue. But I have learned that "mu of X"
closely follows P(h) due to some mysterious equation called Hoeffding's inequality. So I report this to you which is about .4998. You are happy that your assessment of nonbias was correct and you reward me with more Dollars
.
Let us try and decode the experiment: X is not the whole input space but just a fraction of it. card(X) = 10000 and not infinity. Usually, we will neither have the time nor the energy to flip a coin infinite number of times. So we rely on some statistical assumption to play God and make predictions. The assumption, in our case being, that for a large N (10K), the P(h) being far away from mu is small. How small? That's given by our good old Hoeffding's inequality. It is possible however that mu is .05 while P(h)=.5 ! P(.00001) for example is very very small but it is > P(0) and hence "what can go wrong will go wrong" philosophy might apply. But in general it won't! For eg, you can say that it is very rare to find a girl who is 9ft tall but that doesn't mean it can not happen in the lifetime of our species! So if an Alien samples all humans alive today, they can approximately predict the height of any Homo sapiens it is to encounter considerably accurately (but they might miss on some 9ft miss yet to be born in the year 2890
).
Yes, you did memorize the mu but if N is large, it can not be too bad! That's the key. But if you sampled just 5 Asians, you will be wrong mostly in Europe!
Quote:
Originally Posted by grozhd
But when we are talking about learning we have a fixed data set so it seems to me that we don't observe a new outcome x each time, but we use one fixed which corresponds to our learning data set.

I will think of things differently. x describes each toss of coin. X is 10K x's. So you don't have one x, but 10K of them. The assumption being, each x is iid and derived from P. Let's say the value of MU = .4998. You do another 10K tosses and derive a MU. You do another 10K... You do this infinite number of times. Each iteration of 10K tosses will give you a MU value. In this case, MU can be considered to be a random variable. And this will follow a normal distribution about E(MU) or the Population mean. The inequality gives you a good approximation of the relationship between
any such MU = mu (Read
the data set obtained with 10K tosses for which you paid me) and E(MU) or P.
Google Sampling distribution of mean! That's the central idea.