LFD Book Forum > Book Feedback - Learning From Data > Chapter 1 - The Learning Problem
#1 | 07-19-2016, 01:00 PM
jeffjackson (Junior Member; Join Date: Jun 2016; Posts: 5)

Section 1.3 argument for feasibility of learning is fundamentally flawed

Section 1.3 takes on an important question: When is it
reasonable to claim that a hypothesis produced by a learning
algorithm is (probably) a good approximation to the target function?

The section's answer is summed up succinctly in its final sentence
(p. 27): "As long as we make sure that the complexity of
H [ the set of hypotheses from which we will select an
approximator ] gives us a good Hoeffding bound, our success
or failure in learning f can be determined by our success or
failure in fitting the training data."

If this claim were true, it would be a nice answer to the
question of when we can reasonably claim to learn. However,
it is not true.

Consider a simple case: a large finite input space X and a
hypothesis set H consisting of exactly two hypotheses, one that always
outputs +1 and one that always outputs -1. Assume that
the number N of random examples drawn is sufficient to guarantee
that, with high probability per the Hoeffding bound, the error on the
training data is similar to the error over the entire input space.
Then, according to the quote above, our "success or failure in learning
f can be determined by our success or failure in fitting the training
data." For instance, if we draw a Hoeffding-sufficient number of
training examples and, say, 90% of them are classified as +1,
Section 1.3 indicates that it is reasonable to conclude that we have
learned the target f, that the +1 hypothesis will have relatively low
error on the unseen inputs.

The problem is this: f might be a function that is +1 with
probability exactly 1/2. Then every time we draw training
data such that one of the two hypotheses in H seems to fit the data
reasonably well, the error of that hypothesis over the unseen data
will be large. In fact, the expected error over unseen
inputs will be slightly greater than 1/2 (since the error is less than 1/2 on
the seen inputs and exactly 1/2 overall), which is worse than random
guessing! Thus, in this setting, a good Hoeffding bound together with
success in fitting the training data never implies success in
learning f, contradicting the claim of Section 1.3.
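
For the skeptical reader, here is a minimal simulation of this example. It is only a sketch: the population size, sample size, trial count, and the Ein threshold for claiming success are my own illustrative choices, not anything from the book.

[CODE]
import random

# Sketch of the two-constant-hypothesis example: |X| = 1000, f = +1 on
# exactly half of X, N = 50 training points drawn without replacement.
# The "algorithm" claims to have learned f whenever the better of the
# two constant hypotheses has Ein <= 0.4.
random.seed(0)
M, N, EIN_THRESHOLD, TRIALS = 1000, 50, 0.4, 20000

labels = [+1] * (M // 2) + [-1] * (M // 2)  # f is +1 with probability exactly 1/2

claims, eout_sum = 0, 0.0
for _ in range(TRIALS):
    random.shuffle(labels)
    train, unseen = labels[:N], labels[N:]
    n_plus = sum(1 for y in train if y == +1)
    g = +1 if n_plus >= N - n_plus else -1   # better-fitting constant hypothesis
    ein = min(n_plus, N - n_plus) / N
    if ein <= EIN_THRESHOLD:                 # the algorithm claims success
        claims += 1
        eout_sum += sum(1 for y in unseen if y != g) / len(unseen)

print(f"claimed success in {claims / TRIALS:.1%} of runs")
if claims:
    print(f"average error on unseen inputs given a claim: {eout_sum / claims:.4f}")
# The conditional error comes out slightly above 0.5: whenever the
# algorithm claims success, it is wrong.
[/CODE]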

Now, it is true that, for sufficiently large N,
Hoeffding tells us that it is unlikely that the overall error
will differ much from the training error. In the context of this example,
this implies that it is unlikely that we will draw a training set on which
we will see low training error, and therefore it is unlikely that any given
run of the algorithm will claim to have learned f. The problem is, if and
when the algorithm does claim to have learned f, it will be wrong, every
single time.

To summarize, the key claim of Section 1.3 is that if the choice of H
and N imply a good Hoeffding guarantee, then good performance of
hypothesis g on the training data implies good performance of g
over the entire input space. However, a simple example demonstrates
that we can have a good Hoeffding guarantee and yet good performance
of g on the training data does not imply good performance of g
over the rest of the input space. So, the argument of Section 1.3 is
fundamentally flawed.

#2 | 08-01-2016, 10:05 AM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)

Re: Section 1.3 argument for feasibility of learning is fundamentally flawed

You are correct. It is possible to be in an adversarial ML setting (bad f w.r.t. your H) in which Eout is bad (independent of what Ein is).

What Hoeffding gives you for the feasibility of learning is that when you output your g because Ein is small:

Either your g is good or you generated an unreliable data set

(that is true as a tautology). What Hoeffding adds is that the probability of generating a reliable data set is high. What this means for the learning scenario in this thread is that most of the time Ein will not be low and you will say you failed. Very rarely, you will think you did not fail when in fact you failed.

In general, with high probability you will:

Either say you failed or produce a good g.
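
In symbols, with \epsilon_0 the Ein threshold below which you declare success: by Hoeffding plus the union bound, with probability at least 1 - 2|H|e^{-2\epsilon^2 N} every h in H satisfies |Ein(h) - Eout(h)| \le \epsilon. On that event, either Ein(g) > \epsilon_0 and you say you failed, or Eout(g) \le \epsilon_0 + \epsilon and g is good.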
__________________
Have faith in probability

#3 | 08-01-2016, 12:18 PM
jeffjackson (Junior Member; Join Date: Jun 2016; Posts: 5)

Re: Section 1.3 argument for feasibility of learning is fundamentally flawed

Malik, I'm glad to hear that we agree that Section 1.3 is not a valid defense of learning. Now let me sketch what I believe is a reasonable defense.

Essentially, the idea is to extend the argument of Section 1.3 with two additional observations:
  1. The sample size of the Hoeffding bound depends only logarithmically on the desired probability bound.
  2. Most computer scientists "trust" randomized algorithms when the probability of error of such an algorithm is extremely small, such as, say, 10^{-15}.
The first observation implies that we can drive the probability of seeing a misleading training set down to extremely small values with a relatively modest increase in training set size and/or in the allowable error of the hypothesis. For instance, in the standard (1-hypothesis) Hoeffding bound, we want the product N\epsilon^2 to be greater than \ln(2/\delta)/2. For \delta=0.01 the lower bound on N\epsilon^2 is approximately 2.65, while for \delta=10^{-15} the bound is approximately 17.6, an increase of less than a factor of 7. To be sure, such an increase in N\epsilon^2 might not be feasible for many learning problems. But for those problems where something like this increase is feasible, we may be able to say something stronger than Hoeffding alone says about the outcome of learning.
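
To make that arithmetic easy to check, here is a short sketch; the value \epsilon = 0.05 is an arbitrary illustrative choice.

[CODE]
import math

# The single-hypothesis Hoeffding bound P[|Ein - Eout| > eps] <= 2 exp(-2 eps^2 N)
# is at most delta whenever N * eps^2 >= ln(2/delta) / 2.

def required_N_eps_sq(delta):
    return math.log(2 / delta) / 2

eps = 0.05  # illustrative accuracy target
for delta in (0.01, 1e-15):
    bound = required_N_eps_sq(delta)
    print(f"delta = {delta:g}: N*eps^2 >= {bound:.2f}, so N >= {math.ceil(bound / eps**2)}")
# delta = 0.01:  N*eps^2 >= 2.65,  so N >= 1060
# delta = 1e-15: N*eps^2 >= 17.62, so N >= 7047
[/CODE]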

Regarding the second observation, randomized algorithms are procedures for solving problems much as regular (deterministic) algorithms are. However, any given run of a randomized algorithm makes random choices and has a small chance that these random choices will cause it to produce an incorrect answer. Probably the best-known randomized algorithm, and one of the earliest to be discovered, is the Miller-Rabin primality testing algorithm, which has been widely used in browsers in support of secure communication. If this algorithm is used as part of secure communication of, say, credit card information, and if the algorithm incorrectly certifies that a number is prime, then the result could be insecure communication of that credit card information.

So, since Miller-Rabin has a small chance of error, why would browser manufacturers take a chance with my credit card information by using this potentially faulty algorithm? Because, in practice, the probability of randomization error can be driven so small that it is in effect more likely that the hardware running the algorithm will incorrectly execute the algorithm and thereby produce an erroneous output than it is that the randomization of the algorithm will lead to an erroneous output. Thus, since it is generally considered reasonable to ignore the small--but nonzero--possibility of hardware errors producing incorrect outputs, it should similarly be considered reasonable to ignore the unlikely possibility of sufficiently small randomization error producing incorrect outputs.
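
For concreteness, here is a compact sketch of the standard Miller-Rabin test. The choice of 25 rounds is illustrative: each round errs on a composite with probability at most 1/4, and 4^{-25} < 10^{-15}.

[CODE]
import random

def is_probably_prime(n, rounds=25):
    # Handle small numbers and small factors by trial division.
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 = 2^r * d with d odd.
    r, d = 0, n - 1
    while d % 2 == 0:
        r, d = r + 1, d // 2
    for _ in range(rounds):
        a = random.randrange(2, n - 1)  # random base
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                # a witnesses that n is composite
    # No witness found: n is prime, except with probability < 4^(-rounds).
    return True
[/CODE]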

What does this discussion of randomized algorithms have to do with learning? A learning algorithm can in some sense be viewed as a form of randomized algorithm in which the randomization comes from a presumed random choice of the training data rather than from any random choices made by the algorithm itself. Hoeffding bounds the probability that the algorithm will be misled by its random input and therefore produce an erroneous output, that is, will output a hypothesis with a claim that it approximates the target when in fact the hypothesis is far from the target. Thus, given that it is reasonable to accept the results of randomized algorithms run with sufficiently low probability of error, it would seem to also be reasonable to accept the results of running a learning algorithm with sufficiently low Hoeffding probability bound (and given the standard assumption that the training data is drawn at random). That is, it would seem that if for a given learning problem we have N\epsilon^2 such that \delta is sufficiently small, and if the learning algorithm run on this problem claims that its hypothesis is a good approximation to the target, then we should accept this claim.
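
A sketch of the acceptance rule I have in mind follows. The function names and the \epsilon, \delta, and Ein thresholds are hypothetical choices of my own, not anything from the book; the bound used is the single-hypothesis Hoeffding bound.

[CODE]
import math

def hoeffding_delta(N, eps):
    # Single-hypothesis bound on P[|Ein - Eout| > eps].
    # (For |H| hypotheses, multiply by |H| via the union bound.)
    return 2 * math.exp(-2 * eps**2 * N)

def accept_hypothesis(ein, N, eps=0.05, delta_max=1e-15, ein_max=0.1):
    # Claim that g approximates f only when a misleading data set is
    # astronomically unlikely AND g fits the training data well;
    # otherwise report failure.
    return hoeffding_delta(N, eps) <= delta_max and ein <= ein_max

print(accept_hypothesis(ein=0.02, N=7047))  # True: delta < 10^(-15)
print(accept_hypothesis(ein=0.02, N=1000))  # False: delta ~ 0.013, too large
[/CODE]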

However, it should be noted that this position has proved to be controversial, as ultimately it runs counter to the widely-held belief that Bayesian decision theory is the preferred way to make rational decisions. In particular, see Section 5.6 of David Wolpert's paper The Lack of A Priori Distinctions Between Learning Algorithms (Neural Computation 8, pp. 1341-90, 1996), which tacitly assumes that the Bayesian view is the only reasonable view in the learning setting and then, based on this assumption, shows that an argument such as the one above fails to establish the feasibility of learning.

So, in the end, to accept the defense of learning outlined above is to accept that Bayesian decision theory is not (always) the preferred approach to rational decision making, despite the apparently strong theoretical foundations for Bayesian theory, its success in many practical applications, and its widespread acceptance. I believe that I can, in fact, show that there is a better theory that encompasses both traditional Bayesian decision theory and the defense of learning presented above. But that discussion won't fit in the margins of this post ;-) So, for now, I leave it to the reader to decide which he or she prefers: the argument above which implies that learning free of target assumptions can be feasible for problems with sufficiently large N\epsilon^2, or strict adherence to Bayesian decision theory which implies that target-assumption-free learning is impossible.

#4 | 08-01-2016, 03:02 PM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)

Re: Section 1.3 argument for feasibility of learning is fundamentally flawed

Quote:
JJ: That is, it would seem that if for a given learning problem we have N\epsilon^2 such that \delta is sufficiently small, and if the learning algorithm run on this problem claims that its hypothesis is a good approximation to the target, then we should accept this claim.
Perhaps I am missing some subtlety but this is exactly what is being said in Section 1.3.

Quote:
MMI: Either your g is good or you generated an unreliable data set
Hoeffding says that the probability of an unreliable data set is (say) 10^{-15} and so you can "safely" assume the data set was reliable and trust what Ein says.

Quote:
JJ: I believe that I can, in fact, show that there is a better theory that encompasses both traditional Bayesian decision theory and the defense of learning presented above.
That would be very interesting, though perhaps a little beyond the scope of this book. Our approach is to view the feasibility of learning in two steps:

1. Ensure you are generating a reliable dataset with HIGH probability (possible given Hoeffding).
2. Proceed as though the data set is reliable, which is not guaranteed, but a reasonable assumption given 1.
__________________
Have faith in probability

#5 | 08-02-2016, 01:16 PM
jeffjackson (Junior Member; Join Date: Jun 2016; Posts: 5)

Re: Section 1.3 argument for feasibility of learning is fundamentally flawed

Quote:
Originally Posted by magdon
Perhaps I am missing some subtlety but this is exactly what is being said in Section 1.3.

Hoeffding says that the probability of an unreliable data set is (say) 10^{-15} and so you can "safely" assume the data set was reliable and trust what Ein says.
Hoeffding will tell us the probability that a data set is unreliable, and Section 1.3 does present Hoeffding. But beyond that, yes, you're missing a lot, Malik ;-).
  1. Section 1.3 contains no discussion of how small the Hoeffding \delta should be in order to make meaningful learning claims.

    To the contrary, the textbook seems content to use relatively large values for \delta (see Problem 2.1, for instance, which assumes \delta=0.03). But if we want to make meaningful target-assumption-free learning claims, \delta=0.03 will not work. The book should make this clear; it does not.

  2. Section 1.3 makes no mention of "safely" assuming that the data set is reliable.

    To the contrary, rather than making concrete assumptions about the data set, Section 1.3 makes probabilistic claims such as (p. 20) "knowing that we are within \pm \epsilon of \mu most of the time is a significant improvement over not knowing anything at all." But if we want to make meaningful target-assumption-free learning claims, knowing that the data set is reliable "most of the time" is not enough. We need to be certain that the data set is not misleading, analogous to being certain that our computer hardware has not produced a misleading output. The book should make this clear; it does not.

  3. Section 1.3 fails to mention that deciding to assume that the data set is reliable is not a decision based in probability theory.

    To the contrary, Section 1.3 claims that "[b]y adopting the probabilistic view ['alone' is implied], we get a positive answer to the feasibility question." But if we want to make meaningful target-assumption-free learning claims, probability theory alone will not give us a positive answer to the feasibility question, because probability theory alone will give us Wolpert's No Free Lunch theorems that apparently imply that target assumptions must be made. No, in addition to Hoeffding we need something that is not justified by probability theory per se in order to justify learning. But making such a non-probability-justified assumption is controversial because of the resulting conflict with conclusions drawn from Bayesian (probabilistic) decision theory. The book should make this clear; it does not.

In short, while I'm glad that it appears that you agree with my argument, Malik, it's my argument, not the book's. The argument found in Section 1.3 attempts to use Hoeffding and assumes that an algorithm will output a hypothesis only if the error on the training set is small, so in those ways it is similar to my argument. But, as you've already agreed in your first post in this thread, the argument of Section 1.3 is fundamentally flawed. You're not saying that about my argument. Ergo, the arguments are different.