Section 1.3 argument for feasibility of learning is fundamentally flawed
Section 1.3 takes on an important question: When is it
reasonable to claim that a hypothesis produced by a learning
algorithm is (probably) a good approximation to the target function?
The section's answer is summed up succinctly in its final sentence
(p. 27): "As long as we make sure that the complexity of
H [ the set of hypotheses from which we will select an
approximator ] gives us a good Hoeffding bound, our success
or failure in learning f can be determined by our success or
failure in fitting the training data."
If this claim were true, it would be a nice answer to the
question of when we can reasonably claim to learn. However,
it is not true.
Consider the simple case of a large finite input space X and a simple
hypothesis set H consisting of exactly two hypotheses, one that always
outputs +1 and one that always outputs 1. Assume that
the number N of random examples drawn is sufficient to guarantee
that, with high probability per the Hoeffding bound, the error on the
training data is similar to the error over the entire input space.
Then, according to the quote above, our "success or failure in learning
f can be determined by our success or failure in fitting the training
data." For instance, if we draw a Hoeffdingsufficient number of
training examples and, say, 90% of them are classified as +1,
Section 1.3 indicates that it is reasonable to conclude that we have
learned the target f, that the +1 hypothesis will have relatively low
error on the unseen inputs.
The problem is this: f might be a function that is +1 with
probability exactly 1/2. Then every time we draw training
data such that one of the two hypotheses in H seems to fit the data
reasonably well, the error of that hypothesis over the unseen data
will be large. In fact, the expected error over unseen
inputs will be slightly greater than 1/2 (since the error is less than 1/2 on
the seen inputs and exactly 1/2 overall), which is worse than random
guessing would be! Thus, in this setting, a good Hoeffding bound and
success in fitting the training data never implies success in
learning f, contradicting the claim of Section 1.3.
Now, it is true that, for sufficiently large N,
Hoeffding tells us that it is unlikely that the overall error
will differ much from from the training error. In the context of this example,
this implies that it is unlikely that we will draw a training set on which
we will see low training error, and therefore it is unlikely that any given
run of the algorithm will claim to have learned f. The problem is, if and
when the algorithm does claim to have learned f, it will be wrong, every
single time.
To summarize, the key claim of Section 1.3 is that if the choice of H
and N imply a good Hoeffding guarantee, then good performance of
hypothesis g on the training data implies good performance of g
over the entire input space. However, a simple example demonstrates
that we can have a good Hoeffding guarantee and yet good performance
of g on the training data does not imply good performance of g
over rest of the input space. So, the argument of Section 1.3 is
fundamentally flawed.
