Section 1.3 takes on an important question: When is it reasonable to claim that a hypothesis produced by a learning algorithm is (probably) a good approximation to the target function? The section's answer is summed up succinctly in its final sentence (p. 27): "As long as we make sure that the complexity of H [the set of hypotheses from which we will select an approximator] gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data." If this claim were true, it would be a nice answer to the question of when we can reasonably claim to learn. However, it is not true.

Consider the simple case of a large finite input space X and a simple hypothesis set H consisting of exactly two hypotheses, one that always outputs +1 and one that always outputs -1. Assume that the number N of random examples drawn is sufficient to guarantee that, with high probability per the Hoeffding bound, the error on the training data is similar to the error over the entire input space. Then, according to the quote above, our "success or failure in learning f can be determined by our success or failure in fitting the training data." For instance, if we draw a Hoeffding-sufficient number of training examples and, say, 90% of them are classified as +1, Section 1.3 indicates that it is reasonable to conclude that we have learned the target f, i.e., that the +1 hypothesis will have relatively low error on the unseen inputs.

The problem is this: f might be a function that is +1 with probability exactly 1/2. Then every time we draw training data such that one of the two hypotheses in H seems to fit the data reasonably well, the error of that hypothesis over the unseen data will be large. In fact, the expected error over the unseen inputs will be slightly greater than 1/2 (since the error is less than 1/2 on the seen inputs and exactly 1/2 overall), which is worse than random guessing would be! Thus, in this setting, a good Hoeffding bound together with success in fitting the training data never implies success in learning f, contradicting the claim of Section 1.3.

Now, it is true that, for sufficiently large N, Hoeffding tells us that it is unlikely that the overall error will differ much from the training error. In the context of this example, this implies that it is unlikely that we will draw a training set on which we will see low training error, and therefore it is unlikely that any given run of the algorithm will claim to have learned f. The problem is, if and when the algorithm does claim to have learned f, it will be wrong, every single time.

To summarize, the key claim of Section 1.3 is that if the choice of H and N implies a good Hoeffding guarantee, then good performance of hypothesis g on the training data implies good performance of g over the entire input space. However, a simple example demonstrates that we can have a good Hoeffding guarantee and yet good performance of g on the training data does not imply good performance of g over the rest of the input space. So, the argument of Section 1.3 is fundamentally flawed.
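To make the numbers concrete, here is a minimal Python sketch (mine, not from the book) that assumes N = 1000 training examples, a tolerance EPS = 0.05, and the 10% training error from the 90%-of-+1 scenario above; these parameter choices are illustrative only. It checks the three ingredients of the argument: the Hoeffding bound (with a union bound over the two constant hypotheses) is good, the chance of ever drawing a training set that one of them fits to within 10% is astronomically small, and on any run where that does happen the gap between training error and true error (which stays at 1/2) is at least 0.4, so the claim of having learned f is wrong.

```python
from math import comb, exp

N = 1000          # number of i.i.d. training examples (illustrative choice)
EPS = 0.05        # tolerance in the Hoeffding bound (illustrative choice)
FIT_ERR = 0.10    # training error at which we would claim to have "learned" f
H_SIZE = 2        # |H| = {always +1, always -1}

# 1. The Hoeffding bound (with a union bound over H) is good: the probability
#    that ANY h in H has |training error - true error| > EPS is small.
hoeffding = H_SIZE * 2 * exp(-2 * EPS**2 * N)
print(f"Union-of-Hoeffding bound on a misleading training set: {hoeffding:.4f}")

# 2. If f is +1 with probability exactly 1/2, the true error of either
#    constant hypothesis is 1/2, regardless of the training data.
true_error = 0.5

# 3. Probability of ever being in a position to claim success, i.e. one of
#    the two constant hypotheses has training error <= FIT_ERR.  Each label
#    is a fair coin flip, so this is 2 * P(Binomial(N, 1/2) <= FIT_ERR * N).
k_max = int(FIT_ERR * N)
p_claim = 2 * sum(comb(N, k) for k in range(k_max + 1)) / 2**N
print(f"P(some h in H fits 90% of the training data): {p_claim:.3e}")

# 4. Conditioned on that rare event, the chosen hypothesis's true error is
#    still 1/2, so its training error understates its true error by at least:
gap = true_error - FIT_ERR
print(f"Gap between training and true error on every such run: >= {gap:.2f}")
# ... far larger than EPS, so each such claim of having learned f is wrong.
```

Running it shows a small Hoeffding bound alongside a vanishingly small chance of ever fitting the data that well, and a fixed gap of 0.4 whenever that does happen, which is exactly the situation described above.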