#3, 08-01-2016, 01:18 PM
jeffjackson (Junior Member; Join Date: Jun 2016; Posts: 5)
Re: Section 1.3 argument for feasibility of learning is fundamentally flawed

Malik, I'm glad to hear that we agree that Section 1.3 is not a valid defense of learning. Now let me sketch what I believe is a reasonable defense.

Essentially, the idea is to extend the argument of Section 1.3 with two additional observations:
  1. The sample size required by the Hoeffding bound depends only logarithmically on the desired probability bound.
  2. Most computer scientists "trust" randomized algorithms when the probability of error of such an algorithm is extremely small, such as, say, 10^{-15}.
The first observation implies that we can drive the probability of seeing a misleading training set down to extremely small values with a relatively modest increase in training set size and/or in the allowable error of the hypothesis. For instance, the standard (1-hypothesis) Hoeffding bound 2e^{-2N\epsilon^2} \le \delta is equivalent to requiring the product N\epsilon^2 to be at least \ln(2/\delta)/2. For \delta=0.01 this lower bound on N\epsilon^2 is approximately 2.65, while for \delta=10^{-15} it is approximately 17.6, an increase of less than a factor of 7. To be sure, such an increase in N\epsilon^2 might not be feasible for many learning problems. But for those problems where something like this increase is feasible, we may be able to say something stronger than Hoeffding alone says about the outcome of learning.
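
For concreteness, here is a minimal Python check of the numbers above; the two \delta values are just the ones used in this example:

```python
import math

# Rearranging the single-hypothesis Hoeffding bound 2*exp(-2*N*eps^2) <= delta
# gives the requirement N*eps^2 >= ln(2/delta)/2.
for delta in (0.01, 1e-15):
    print(f"delta = {delta:g}: need N*eps^2 >= {math.log(2 / delta) / 2:.2f}")
# delta = 0.01  -> need N*eps^2 >= 2.65
# delta = 1e-15 -> need N*eps^2 >= 17.62
```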

Regarding the second observation, randomized algorithms are procedures for solving problems much as regular (deterministic) algorithms are, except that any given run of a randomized algorithm makes random choices, and there is a small chance that these choices will cause it to produce an incorrect answer. Probably the best-known randomized algorithm, and one of the earliest to be discovered, is the Miller-Rabin primality test, which has been widely used in browsers in support of secure communication. If this algorithm is used as part of securely communicating, say, credit card information, and it incorrectly certifies that a number is prime, the result could be insecure communication of that credit card information. So, since Miller-Rabin has a small chance of error, why would browser manufacturers take a chance with my credit card information by using this potentially faulty algorithm? Because, in practice, the probability of a randomization error can be driven so low that a hardware fault corrupting the computation is, in effect, more likely than the algorithm's random choices leading to a wrong answer. Thus, since it is generally considered reasonable to ignore the small (but nonzero) possibility of hardware errors producing incorrect outputs, it should similarly be considered reasonable to ignore the unlikely possibility of a sufficiently small randomization error producing incorrect outputs.
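
To make the error-probability point concrete, here is a minimal sketch of Miller-Rabin; the function name and the choice of 64 rounds are mine for illustration, not how any particular browser or library implements it:

```python
import random

def is_probable_prime(n, rounds=64):
    """Miller-Rabin test: never rejects a prime; accepts a composite
    with probability at most 4**(-rounds) over its random choices."""
    if n < 2:
        return False
    # Handle small primes and obvious composites directly.
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 = d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)   # random base for this round
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                 # a witnesses that n is composite
    return True                          # no witness found: "probably prime"
```

With 64 rounds the chance of accepting a composite is at most 4^{-64}, roughly 10^{-38}, far below the 10^{-15} figure above and well below plausible hardware error rates.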

What does this discussion of randomized algorithms have to do with learning? A learning algorithm can in some sense be viewed as a form of randomized algorithm in which the randomization comes from a presumed random choice of the training data rather than from any random choices made by the algorithm itself. Hoeffding bounds the probability that the algorithm will be misled by its random input and therefore produce an erroneous output, that is, output a hypothesis together with the claim that it approximates the target when in fact the hypothesis is far from the target. Thus, given that it is reasonable to accept the results of randomized algorithms run with a sufficiently low probability of error, it would seem reasonable to also accept the results of running a learning algorithm with a sufficiently low Hoeffding probability bound (given the standard assumption that the training data is drawn at random). That is, if for a given learning problem we have N\epsilon^2 large enough that \delta is sufficiently small, and the learning algorithm run on this problem claims that its hypothesis is a good approximation to the target, then we should accept this claim.
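
As an illustration of this "randomized algorithm whose randomness is the training sample" view, here is a small simulation sketch for a single fixed hypothesis; the true error \mu, the sample size N, and \epsilon are made-up values for this example:

```python
import math
import random

# One fixed hypothesis h with true (out-of-sample) error mu. Each trial draws a
# fresh "training set" of N i.i.d. examples, computes E_in (the fraction h gets
# wrong on the sample), and checks whether the sample is misleading, i.e.
# |E_in - E_out| > eps. Compare the frequency with the bound 2*exp(-2*N*eps^2).
mu, N, eps, trials = 0.3, 200, 0.1, 100_000

misleading = 0
for _ in range(trials):
    e_in = sum(random.random() < mu for _ in range(N)) / N
    if abs(e_in - mu) > eps:
        misleading += 1

print("empirical P[|E_in - E_out| > eps]:", misleading / trials)
print("Hoeffding bound 2*exp(-2*N*eps^2):", 2 * math.exp(-2 * N * eps ** 2))
```

The empirical frequency comes out well under the bound, as it must; the point is only that, as with Miller-Rabin, the failure probability can be made as small as we like, here by increasing N\epsilon^2.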

However, it should be noted that this position has proved to be controversial, as ultimately it runs counter to the widely held belief that Bayesian decision theory is the preferred way to make rational decisions. In particular, see Section 5.6 of David Wolpert's paper The Lack of A Priori Distinctions Between Learning Algorithms (Neural Computation 8, pp. 1341-1390, 1996), which tacitly assumes that the Bayesian view is the only reasonable view in the learning setting and then, based on this assumption, shows that an argument such as the one above fails to establish the feasibility of learning.

So, in the end, to accept the defense of learning outlined above is to accept that Bayesian decision theory is not (always) the preferred approach to rational decision making, despite the apparently strong theoretical foundations of Bayesian theory, its success in many practical applications, and its widespread acceptance. I believe that I can, in fact, show that there is a better theory that encompasses both traditional Bayesian decision theory and the defense of learning presented above. But that discussion won't fit in the margins of this post ;-) So, for now, I leave it to the reader to decide which he or she prefers: the argument above, which implies that learning free of target assumptions can be feasible for problems with sufficiently large N\epsilon^2, or strict adherence to Bayesian decision theory, which implies that target-assumption-free learning is impossible.