- **Chapter 1 - The Learning Problem**
(*http://book.caltech.edu/bookforum/forumdisplay.php?f=108*)

- - **Section 1.3 argument for feasibility of learning is fundamentally flawed**
(*http://book.caltech.edu/bookforum/showthread.php?t=4689*)

Section 1.3 argument for feasibility of learning is fundamentally flawedSection 1.3 takes on an important question: When is it
reasonable to claim that a hypothesis produced by a learning algorithm is (probably) a good approximation to the target function? The section's answer is summed up succinctly in its final sentence (p. 27): "As long as we make sure that the complexity of H [ the set of hypotheses from which we will select an approximator ] gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data." If this claim were true, it would be a nice answer to the question of when we can reasonably claim to learn. However, it is not true. Consider the simple case of a large finite input space X and a simple hypothesis set H consisting of exactly two hypotheses, one that always outputs +1 and one that always outputs -1. Assume that the number N of random examples drawn is sufficient to guarantee that, with high probability per the Hoeffding bound, the error on the training data is similar to the error over the entire input space. Then, according to the quote above, our "success or failure in learning f can be determined by our success or failure in fitting the training data." For instance, if we draw a Hoeffding-sufficient number of training examples and, say, 90% of them are classified as +1, Section 1.3 indicates that it is reasonable to conclude that we have learned the target f, that the +1 hypothesis will have relatively low error on the unseen inputs. The problem is this: f might be a function that is +1 with probability exactly 1/2. Then every time we draw trainingdata such that one of the two hypotheses in H seems to fit the data reasonably well, the error of that hypothesis over the unseen data will be large. In fact, the expected error over unseen inputs will be slightly greater than 1/2 (since the error is less than 1/2 on the seen inputs and exactly 1/2 overall), which is worse than random guessing would be! Thus, in this setting, a good Hoeffding bound and success in fitting the training data never implies success in learning f, contradicting the claim of Section 1.3. Now, it is true that, for sufficiently large N, Hoeffding tells us that it is unlikely that the overall error will differ much from from the training error. In the context of this example, this implies that it is unlikely that we will draw a training set on which we will see low training error, and therefore it is unlikely that any given run of the algorithm will claim to have learned f. The problem is, if and when the algorithm does claim to have learned f, it will be wrong, every single time. To summarize, the key claim of Section 1.3 is that if the choice of H and N imply a good Hoeffding guarantee, then good performance of hypothesis g on the training data implies good performance of g over the entire input space. However, a simple example demonstrates that we can have a good Hoeffding guarantee and yet good performance of g on the training data does not imply good performance of g over rest of the input space. So, the argument of Section 1.3 is fundamentally flawed. |

Re: Section 1.3 argument for feasibility of learning is fundamentally flawedYou are correct. It is possible to be in an adversarial ML setting (bad f w.r.t. your H) in which Eout is bad (independent of what Ein is).
What Hoeffding gives you for the feasibility of learning is that when you output your g because Ein is small: Either your g is good or you generated an unreliable data set (that is true as a tautology). What Hoeffding gives is that the probability of generating reliable data sets is high. What this means for the learning scenario in this thread is that most of the time Ein will not be low and you will say you failed. Very rarely you will think you did not fail when in fact you failed. In general, with high probability you will: Either say you failed or produce a good g. |

Re: Section 1.3 argument for feasibility of learning is fundamentally flawedMalik, I'm glad to hear that we agree that Section 1.3 is not a valid defense of learning. Now let me sketch what I believe is a reasonable defense.
Essentially, the idea is to extend the argument of Section 1.3 with two additional observations: - The sample size of the Hoeffding bound depends only logarithmically on the desired probability bound.
- Most computer scientists "trust" randomized algorithms when the probability of error of such an algorithm is extremely small, such as, say, .
Regarding the second observation, randomized algorithms are procedures for solving problems much as regular (deterministic) algorithms are. However, any given run of a randomized algorithm makes random choices and has a small chance that these random choices will cause it to produce an incorrect answer. Probably the best-known randomized algorithm, and one of the earliest to be discovered, is the Miller-Rabin primality testing algorithm, which has been widely used in browsers in support of secure communication. If this algorithm is used as part of secure communication of, say, credit card information, and if the algorithm incorrectly certifies that a number is prime, then the result could be insecure communication of that credit card information. So, since Miller-Rabin has a small chance of error, why would browser manufacturers take a chance with my credit card information by using this potentially faulty algorithm? Because, in practice, the probability of randomization error can be driven so small that it is in effect more likely that the hardware running the algorithm will incorrectly execute the algorithm and thereby produce an erroneous output than it is that the randomization of the algorithm will lead to an erroneous output. Thus, since it is generally considered reasonable to ignore the small--but nonzero--possibility of hardware errors producing incorrect outputs, it should similarly be considered reasonable to ignore the unlikely possibility of sufficiently small randomization error producing incorrect outputs. What does this discussion of randomized algorithms have to do with learning? A learning algorithm can in some sense be viewed as a form of randomized algorithm in which the randomization comes from a presumed random choice of the training data rather than from any random choices made by the algorithm itself. Hoeffding bounds the probability that the algorithm will be misled by its random input and therefore produce an erroneous output, that is, will output a hypothesis with a claim that it approximates the target when in fact the hypothesis is far from the target. Thus, given that it is reasonable to accept the results of randomized algorithms run with sufficiently low probability of error, it would seem to also be reasonable to accept the results of running a learning algorithm with sufficiently low Hoeffding probability bound (and given the standard assumption that the training data is drawn at random). That is, it would seem that if for a given learning problem we have such that is sufficiently small, and if the learning algorithm run on this problem claims that its hypothesis is a good approximation to the target, then we should accept this claim. However, it should be noted that this position has proved to be controversial, as ultimately it runs counter to the widely-held belief that Bayesian decision theory is the preferred way to make rational decisions. In particular, see Section 5.6 of David Wolpert's paper The Lack of A Priori Distinctions Between Learning Algorithms (Neural Computation 8, pp. 1341-90, 1996), which tacitly assumes that the Bayesian view is the only reasonable view in the learning setting and then, based on this assumption, shows that an argument such as the one above fails to establish the feasibility of learning.So, in the end, to accept the defense of learning outlined above is to accept that Bayesian decision theory is not (always) the preferred approach to rational decision making, despite the apparently strong theoretical foundations for Bayesian theory, its success in many practical applications, and its widespread acceptance. I believe that I can, in fact, show that there is a better theory that encompasses both traditional Bayesian decision theory and the defense of learning presented above. But that discussion won't fit in the margins of this post ;-) So, for now, I leave it to the reader to decide which he or she prefers: the argument above which implies that learning free of target assumptions can be feasible for problems with sufficiently large , or strict adherence to Bayesian decision theory which implies that target-assumption-free learning is impossible. |

Re: Section 1.3 argument for feasibility of learning is fundamentally flawedQuote:
- Section 1.3 contains no discussion of how small the Hoeffding should be in order to make meaningful learning claims.
To the contrary, the textbook seems content to use relatively large values for (see Problem 2.1, for instance, which assumes ). But if we want to make meaningful target-assumption-free learning claims, will not work. The book should make this clear; it does not.
- Section 1.3 makes no mention of "safely" assuming that the data set is reliable.
To the contrary, rather than making concrete assumptions about the data set, Section 1.3 makes probabilistic claims such as (p. 20) "knowing that we are within of most of the time is a significant improvement over not knowing anything at all." But if we want to make meaningful target-assumption-free learning claims, knowing that the data set is reliable "most of the time" is not enough. We need to be certain that the data set is not misleading, analogous to being certain that our computer hardware has not produced a misleading output. The book should make this clear; it does not.
- Section 1.3 fails to mention that deciding to assume that the data set is reliable is not a decision based in probability theory.
To the contrary, Section 1.3 claims that "[b]y adopting the probabilistic view ['alone' is implied], we get a positive answer to the feasibility question." But if we want to make meaningful target-assumption-free learning claims, probability theory alone will not give us a positive answer to the feasibility question, because probability theory alone will give us Wolpert's No Free Lunch theorems that apparently imply that target assumptions must be made. No, in addition to Hoeffding we need something that is not justified by probability theory*per se*in order to justify learning. But making such a non-probability-justified assumption is controversial because of the resulting conflict with conclusions drawn from Bayesian (probabilistic) decision theory. The book should make this clear; it does not.
In short, while I'm glad that it appears that you agree with my argument, Malik, it's my argument, not the book's. The argument found in Section 1.3 attempts to use Hoeffding and assumes that an algorithm will output a hypothesis only if the error on the training set is small, so in those ways it is similar to my argument. But, as you've already agreed in your first post in this thread, the argument of Section 1.3 is fundamentally flawed. You're not saying that about my argument. Ergo, the arguments are different. |

All times are GMT -7. The time now is 04:51 PM. |

Powered by vBulletin® Version 3.8.3

Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.