#1




Section 1.3 argument for feasibility of learning is fundamentally flawed
Section 1.3 takes on an important question: When is it
reasonable to claim that a hypothesis produced by a learning algorithm is (probably) a good approximation to the target function? The section's answer is summed up succinctly in its final sentence (p. 27): "As long as we make sure that the complexity of H [ the set of hypotheses from which we will select an approximator ] gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data." If this claim were true, it would be a nice answer to the question of when we can reasonably claim to learn. However, it is not true. Consider the simple case of a large finite input space X and a simple hypothesis set H consisting of exactly two hypotheses, one that always outputs +1 and one that always outputs 1. Assume that the number N of random examples drawn is sufficient to guarantee that, with high probability per the Hoeffding bound, the error on the training data is similar to the error over the entire input space. Then, according to the quote above, our "success or failure in learning f can be determined by our success or failure in fitting the training data." For instance, if we draw a Hoeffdingsufficient number of training examples and, say, 90% of them are classified as +1, Section 1.3 indicates that it is reasonable to conclude that we have learned the target f, that the +1 hypothesis will have relatively low error on the unseen inputs. The problem is this: f might be a function that is +1 with probability exactly 1/2. Then every time we draw training data such that one of the two hypotheses in H seems to fit the data reasonably well, the error of that hypothesis over the unseen data will be large. In fact, the expected error over unseen inputs will be slightly greater than 1/2 (since the error is less than 1/2 on the seen inputs and exactly 1/2 overall), which is worse than random guessing would be! Thus, in this setting, a good Hoeffding bound and success in fitting the training data never implies success in learning f, contradicting the claim of Section 1.3. Now, it is true that, for sufficiently large N, Hoeffding tells us that it is unlikely that the overall error will differ much from from the training error. In the context of this example, this implies that it is unlikely that we will draw a training set on which we will see low training error, and therefore it is unlikely that any given run of the algorithm will claim to have learned f. The problem is, if and when the algorithm does claim to have learned f, it will be wrong, every single time. To summarize, the key claim of Section 1.3 is that if the choice of H and N imply a good Hoeffding guarantee, then good performance of hypothesis g on the training data implies good performance of g over the entire input space. However, a simple example demonstrates that we can have a good Hoeffding guarantee and yet good performance of g on the training data does not imply good performance of g over rest of the input space. So, the argument of Section 1.3 is fundamentally flawed. 
#2




Re: Section 1.3 argument for feasibility of learning is fundamentally flawed
You are correct. It is possible to be in an adversarial ML setting (bad f w.r.t. your H) in which Eout is bad (independent of what Ein is).
What Hoeffding gives you for the feasibility of learning is that when you output your g because Ein is small: Either your g is good or you generated an unreliable data set (that is true as a tautology). What Hoeffding gives is that the probability of generating reliable data sets is high. What this means for the learning scenario in this thread is that most of the time Ein will not be low and you will say you failed. Very rarely you will think you did not fail when in fact you failed. In general, with high probability you will: Either say you failed or produce a good g.
__________________
Have faith in probability 
#3




Re: Section 1.3 argument for feasibility of learning is fundamentally flawed
Malik, I'm glad to hear that we agree that Section 1.3 is not a valid defense of learning. Now let me sketch what I believe is a reasonable defense.
Essentially, the idea is to extend the argument of Section 1.3 with two additional observations:
Regarding the second observation, randomized algorithms are procedures for solving problems much as regular (deterministic) algorithms are. However, any given run of a randomized algorithm makes random choices and has a small chance that these random choices will cause it to produce an incorrect answer. Probably the bestknown randomized algorithm, and one of the earliest to be discovered, is the MillerRabin primality testing algorithm, which has been widely used in browsers in support of secure communication. If this algorithm is used as part of secure communication of, say, credit card information, and if the algorithm incorrectly certifies that a number is prime, then the result could be insecure communication of that credit card information. So, since MillerRabin has a small chance of error, why would browser manufacturers take a chance with my credit card information by using this potentially faulty algorithm? Because, in practice, the probability of randomization error can be driven so small that it is in effect more likely that the hardware running the algorithm will incorrectly execute the algorithm and thereby produce an erroneous output than it is that the randomization of the algorithm will lead to an erroneous output. Thus, since it is generally considered reasonable to ignore the smallbut nonzeropossibility of hardware errors producing incorrect outputs, it should similarly be considered reasonable to ignore the unlikely possibility of sufficiently small randomization error producing incorrect outputs. What does this discussion of randomized algorithms have to do with learning? A learning algorithm can in some sense be viewed as a form of randomized algorithm in which the randomization comes from a presumed random choice of the training data rather than from any random choices made by the algorithm itself. Hoeffding bounds the probability that the algorithm will be misled by its random input and therefore produce an erroneous output, that is, will output a hypothesis with a claim that it approximates the target when in fact the hypothesis is far from the target. Thus, given that it is reasonable to accept the results of randomized algorithms run with sufficiently low probability of error, it would seem to also be reasonable to accept the results of running a learning algorithm with sufficiently low Hoeffding probability bound (and given the standard assumption that the training data is drawn at random). That is, it would seem that if for a given learning problem we have such that is sufficiently small, and if the learning algorithm run on this problem claims that its hypothesis is a good approximation to the target, then we should accept this claim. However, it should be noted that this position has proved to be controversial, as ultimately it runs counter to the widelyheld belief that Bayesian decision theory is the preferred way to make rational decisions. In particular, see Section 5.6 of David Wolpert's paper The Lack of A Priori Distinctions Between Learning Algorithms (Neural Computation 8, pp. 134190, 1996), which tacitly assumes that the Bayesian view is the only reasonable view in the learning setting and then, based on this assumption, shows that an argument such as the one above fails to establish the feasibility of learning. So, in the end, to accept the defense of learning outlined above is to accept that Bayesian decision theory is not (always) the preferred approach to rational decision making, despite the apparently strong theoretical foundations for Bayesian theory, its success in many practical applications, and its widespread acceptance. I believe that I can, in fact, show that there is a better theory that encompasses both traditional Bayesian decision theory and the defense of learning presented above. But that discussion won't fit in the margins of this post ;) So, for now, I leave it to the reader to decide which he or she prefers: the argument above which implies that learning free of target assumptions can be feasible for problems with sufficiently large , or strict adherence to Bayesian decision theory which implies that targetassumptionfree learning is impossible. 
#4




Re: Section 1.3 argument for feasibility of learning is fundamentally flawed
Quote:
Quote:
Quote:
1. Ensure you are generating a reliable dataset with HIGH probability (possible given Hoeffding). 2. Proceed as though the data set is reliable, which is not guaranteed, but a reasonable assumption given 1.
__________________
Have faith in probability 
#5




Re: Section 1.3 argument for feasibility of learning is fundamentally flawed
Quote:
In short, while I'm glad that it appears that you agree with my argument, Malik, it's my argument, not the book's. The argument found in Section 1.3 attempts to use Hoeffding and assumes that an algorithm will output a hypothesis only if the error on the training set is small, so in those ways it is similar to my argument. But, as you've already agreed in your first post in this thread, the argument of Section 1.3 is fundamentally flawed. You're not saying that about my argument. Ergo, the arguments are different. 
Thread Tools  
Display Modes  

