 LFD Book Forum Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

#1
 sptripathi Junior Member Join Date: Apr 2013 Posts: 8 Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

We have:
total-noise =
  var (overfitting-noise?)
+ bias (deterministic-noise)
+ stochastic-noise

Qs:

1. Is overfitting-noise the var part alone? From Prof's lecture, I tend to conclude that it is var caused by the attempt to fit stochastic-noise, i.e. overfitting-noise really is an interplay of (stochastic-noise -> variance). Need help in interpreting it.

2. When we try to arrest the overfitting, using brakes (regularization) and/or validation, are we really working on overfitting alone?
In the case of validation, we will have a measure of total-error: is it that comparing total-errors across choices of model-complexity (e.g. H2 vs H10) gives us a relative measure of overfitting across hypothesis-complexities?
In the case of brakes (regularization): will the brake really be applied to overfitting alone, and not to other parts of the total-error, especially the bias part?

3. Consider a case in which the target-complexity is a 2nd-order polynomial and we choose a 2nd-order (H2) and a 10th-order polynomial (H10) to fit it. How will the overfit and bias vary for the two hypotheses (as N grows on the x-axis)?
Specifically, will H10 have overfitting (with or without stochastic noise)? Also, should H10 have higher bias compared to H2?

4. Is there a notion of underfitting w.r.t. the Target-Function? When we try to fit a 10th-order polynomial target-function with a 2nd-order polynomial hypothesis, are we not underfitting? If so, can we associate underfitting with bias then? If not, what else?

Thanks
#2
 Elroch Invited Guest Join Date: Mar 2013 Posts: 143 Re: Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

Quote:
 Originally Posted by sptripathi We have: total-noise = var (overfitting-noise ? ) + bias (deterministic-noise) + stochastic-noise Qs: 1. Is overfitting-noise the var part alone? From Profs lecture, I tend to conclude that it is var caused because of attempt to fit stochastic-noise i.e. overfitting-noise really is an interplay of (stochastic-noise -> variance). Need help in interpreting it. 2. When we try to arrest the overfitting, using brakes(regularization) and/or validation, are we really working with overfitting alone ? In case of validation, we will have a measure of total-error : Is it that the relativity of total-errors across choice of model-complexity(e.g. H2 Vs H10), is giving us an estimate of relative measure of overfitting across choices of hypothesis-complexity? In case of brakes(regularization) : will the brake really be applied on overfitting alone, and not other parts of total-error, esp bias part ? 3. Consider a case in which target-complexity is 2nd order polynomial and we chose a 2nd order(H2) and a 10th order polynomial(H10) to fit it. How will the overfit and bias vary for the two hypothesis (as N grows on the x-axis)? Specifically, will the H10 have overfitting (with or without stochastic noise)? Also, H10 should have higher bias compared to H2 ? 4. Is there a notion of underfitting wrt Target-Function ? When we try to fit a 10th order polynomial target-function, with a 2nd order polynomial hypothesis, are we not underfitting ? If so, can we associate underfitting to bias then ? If not, what else ? Thanks
Impressive list of questions! I'll try to shed light on some of them.

1. I think it is fair to say deterministic noise (bias) can lead to overfitting as well. For example, suppose you try to model sine functions with a hypothesis set made up of positive constant functions only. This is such a bad hypothesis set for the job that, however many data points you use and however much regularization you use, you'd be better off in general using the single hypothesis consisting of the zero function. I would say this is a clear case of overfitting of the bias.
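This example can be checked numerically. The sketch below is my own illustration, not from the thread: the interval [0, 2*pi], the sample size, and the trial count are all arbitrary choices. It fits the best positive constant to samples of sin(x) and compares its out-of-sample squared error to that of the zero function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Test grid over a full period of sin; the mean of sin over [0, 2*pi]
# is 0, so the zero function is the best constant overall.
x_test = np.linspace(0, 2 * np.pi, 1000)
y_test = np.sin(x_test)

errs_const, errs_zero = [], []
for _ in range(200):
    x = rng.uniform(0, 2 * np.pi, 20)   # 20 training points, no stochastic noise
    y = np.sin(x)
    # Least-squares constant is the sample mean, clipped to stay positive.
    c = max(np.mean(y), 1e-6)
    errs_const.append(np.mean((y_test - c) ** 2))
    errs_zero.append(np.mean(y_test ** 2))

print(np.mean(errs_const) > np.mean(errs_zero))  # → True: the zero function wins
```

The positive-constant fit always pays an extra c^2 on top of the zero function's error, no matter how many clean data points it sees, which is the sense in which the bad hypothesis set can never recover.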

4. As I understand it, underfitting and overfitting can only ever be defined by contrast with what is possible; hence your first two questions in paragraph 4 are not well-posed.

A crucial point emphasised in the lectures is that the appropriate approximation technique (i.e. combination of hypothesis set and regularization) is determined by the data that is available to a greater extent than by the actual form of the function. For example, fitting a 10th-order polynomial with a 2nd-order polynomial hypothesis (without regularization) may easily be overfitting if the data provided is only 3 points.

Pondering these issues a bit, I realise that the missing piece of the jigsaw needed to make these issues precise and quantitative is the distribution of possible actual functions that we are trying to approximate. I say "distribution" rather than "set" because how likely each function is dramatically affects the optimal combination of hypothesis set and regularization, as does the data that is available.

Say, for example, all possible 10th-order polynomials on a unit interval are possibilities for some unknown function. Suppose, however, that anything very far from a quadratic is very unlikely, and increasingly unlikely as the coefficients get bigger (excuse my vagueness; the idea is that the actual function is a 10th-order polynomial, but it is extremely unlikely to be much different from a quadratic).

Now assume that we are given 3 points and asked to approximate the actual function. If the actual function had been a quadratic, we could just fit it perfectly. Since we know it is very close to a quadratic, we can still be sure that almost all the time a quadratic fit is going to be pretty good.

This is in contrast with the situation where the chance of the original function being far from quadratic is high, and using a quadratic to fit 3 points can be wildly overfitting. In this situation, I believe only severe regularization might justify the use of a quadratic at all, and using a simpler model might make more sense.
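The 3-point case is easy to simulate. The sketch below is my own illustration, not from the thread: the quadratic target, noise level, and trial count are made up. It interpolates 3 noisy points exactly with a quadratic and compares the average out-of-sample error to a crude constant fit:

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")              # near-collinear samples trigger RankWarning
rng = np.random.default_rng(1)

f = lambda x: 1 - 2 * x + 3 * x ** 2         # the (unknown) target: a quadratic
x_test = np.linspace(-1, 1, 200)

def avg_eout(model_deg, n_trials=2000):
    """Average out-of-sample squared error when fitting 3 noisy points."""
    errs = []
    for _ in range(n_trials):
        x = rng.uniform(-1, 1, 3)            # only 3 data points
        y = f(x) + rng.normal(0, 1.0, 3)     # noisy observations
        coef = np.polyfit(x, y, model_deg)   # least-squares polynomial fit
        errs.append(np.mean((np.polyval(coef, x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

e2 = avg_eout(2)   # degree 2: interpolates the 3 points, chases the noise
e0 = avg_eout(0)   # degree 0: biased but far more stable
print(e2 > e0)     # → True
```

Even though the target itself is a quadratic, the interpolating quadratic loses badly on average: with zero degrees of freedom left over, all the stochastic noise goes straight into the fit.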
#3
 sptripathi Junior Member Join Date: Apr 2013 Posts: 8 Re: Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

[ Just one clarification to my first set of Qs. Let's say that we always have 'sufficient' data-points to learn from, for any choice of the order of polynomial in the hypothesis set - i.e. for H2 we have >>20 and for H10 we have >>100 points, and likewise for any other order ]

Quote:
 Originally Posted by Elroch 1. I think it is fair to say deterministic noise (bias) can lead to overfitting as well. For example, suppose you try to model sine functions with a hypothesis set made up of positive constant functions only. This is such a bad hypothesis set for the job that, however many data points you use and however much regularization you use, you'd be better off in general using the single hypothesis consisting of the zero function. I would say this is a clear case of overfitting of the bias.
Here and in your last paragraph, you seem to suggest that bias is one form of overfitting. This is where I'm struggling. For instance, in your example above, a constant hypothesis sounds more like underfitting than overfitting. So it definitely has bias in that sense, but is that inability (bias) really overfitting?

Quote:
 Originally Posted by Elroch Say, for example, all possible 10th order polynomials on a unit interval are possibilities for some unknown function. Suppose however, that anything that is very far from a quadratic is very unlikely, and increasingly unlikely as the coefficients get bigger (excuse my vagueness, but the idea that the actual function is a 10th order polynomial, but it is extremely unlikely to be much different from a quadratic).
It is indeed interesting to think about a probability-distribution on the order of the target-function polynomial, w.r.t. the order of the hypothesis-polynomial.
However, if we had a probability-distribution on the target-function's complexity, then a given instance of it would still be a fixed-order polynomial, albeit one whose order we may not know. So we will use a validation-set to gauge which order of hypothesis-polynomial seems more promising. Right?

Quote:
 Originally Posted by Elroch For example, fitting a 10th order polynomial with a 2nd order polynomial hypothesis (without regularization) may easily be overfitting if the data provided is only 3 points.
Ok. Now we augment it with sufficient data points. Given that, H2 can never do as well as H10 in approximating a 10th-order polynomial (target function). So H2 clearly has higher bias than H10, but is that inability an underfitting or an overfitting issue? Apologies for repeating the Q.
#4
 magdon RPI Join Date: Aug 2009 Location: Troy, NY, USA. Posts: 595 Re: Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

Very thoughtful questions.

1. Yes, you are correct. The term which overfitting is responsible for is the var term. It can occur when there is either deterministic or stochastic noise.

One way to look at this is as follows. Suppose you picked the hypothesis that was truly the best. What would your error be? To a good approximation, it would be: stochastic-noise + bias. This is because, for most normal learning models, the best hypothesis is approximately the average hypothesis (see Problem 3.14(a) for an example). That being said, these first two terms in the bias-variance decomposition are inevitable, and we can view them as the direct impact of the noise (stochastic and deterministic).

So the additional term contributing to the error must result from our inability to pick the best hypothesis. But why are we unable to pick the best hypothesis? Because we are being misled by the data. That is, the best hypothesis on the data (having minimum Ein) is not the best hypothesis out-of-sample (which must have higher Ein). By going for the lower-Ein hypothesis we are getting a higher-Eout hypothesis - we are overfitting the data.

So you can view the var term as the indirect impact of the noise. It is not inevitable per se, but exists because of your 'ability' to be misled by the data (i.e. overfit). The complexity of your model plays a heavy role in your 'ability' to be misled, since if your model is complicated you have more ways in which to be misled. If the number of data points goes up, approaching infinity, you will not significantly change the direct contribution of the noise; it is the var term that will go down, eventually approaching 0.
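This behavior of the var term can be seen in a small simulation. The sketch below is my own, not from the text: the quadratic target, noise level sigma, H10 model, and sample sizes are all arbitrary. It estimates the bias and var of the average hypothesis at several N:

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")              # ignore polyfit RankWarnings
rng = np.random.default_rng(2)

f = lambda x: 1 + x - 2 * x ** 2             # a 2nd-order target
x_grid = np.linspace(-1, 1, 100)
sigma = 0.5                                  # stochastic-noise level

def bias_var(n, degree=10, n_trials=500):
    """Estimate (bias, var) of degree-`degree` fits trained on n noisy points."""
    preds = np.empty((n_trials, x_grid.size))
    for t in range(n_trials):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_grid)
    g_bar = preds.mean(axis=0)               # the average hypothesis
    bias = float(np.mean((g_bar - f(x_grid)) ** 2))
    var = float(np.mean((preds - g_bar) ** 2))
    return bias, var

results = {n: bias_var(n) for n in (20, 50, 200)}
for n, (b, v) in results.items():
    print(f"N={n}: bias={b:.3f} var={v:.3f}")
```

As N grows, the var column shrinks toward 0 (the indirect impact fades), while the bias stays near its limiting value: the direct impact of the noise does not go away with more data.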

The answers to your remaining questions are related to the above discussion as well as to later material in the text.

2. Let's take regularization (validation is a little more complicated). In Chapter 4 we will make an explicit connection between regularization and using a 'smaller' hypothesis set. So at the end of the day, most methods for 'braking' effectively result in using a smaller hypothesis set. Regularization does this in a more flexible and 'soft' way than simply picking H2 versus H10.

And then, you are right. There is a tradeoff when you reduce the size of the model. You will increase the bias (direct impact) but decrease the var (indirect impact). One of these effects wins, and this determines whether you should increase or decrease your model size. In small-sample, high-noise settings with complex models, the indirect impact wins and so it pays to regularize.
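The tradeoff can be illustrated with a quick ridge-regression sketch. This is my own example, not from the text: the target, N, noise level, and the two lambda values are arbitrary choices. A small brake on a 10th-order fit reduces Eout in a small-N, noisy setting:

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: 1 - 2 * x + x ** 2              # quadratic target
x_grid = np.linspace(-1, 1, 200)
Zg = np.vander(x_grid, 11)                    # 10th-order polynomial features

def avg_eout(lam, n=15, n_trials=500):
    """Average Eout of a ridge-regularized 10th-order polynomial fit."""
    errs = []
    for _ in range(n_trials):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, 0.5, n)      # few, noisy data points
        Z = np.vander(x, 11)
        # Ridge solution: w = (Z^T Z + lam*I)^{-1} Z^T y
        w = np.linalg.solve(Z.T @ Z + lam * np.eye(11), Z.T @ y)
        errs.append(np.mean((Zg @ w - f(x_grid)) ** 2))
    return float(np.mean(errs))

e_none = avg_eout(0.0)    # no brake
e_reg = avg_eout(0.1)     # moderate brake
print(e_none > e_reg)     # → True
```

The brake does increase the bias a little (it shrinks even the 'correct' quadratic coefficients), but in this regime the drop in var dominates, so the regularized fit wins on Eout.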

3. I highly recommend thinking about Exercise 4.3.

4. Yes, there is such a thing as underfitting (see Chapter 4). It usually happens when it is the direct impact (bias) that wins over the indirect impact (var). And so, you should increase the size of your hypothesis set to reduce the direct impact at the expense of a small increase in the indirect impact. Underfitting occurs when the quality and quantity of your data is very high in relation to your model size.

Quote:
 Originally Posted by sptripathi We have: total-noise = var (overfitting-noise ? ) + bias (deterministic-noise) + stochastic-noise Qs: 1. Is overfitting-noise the var part alone? From Prof’s lecture, I tend to conclude that it is var caused because of attempt to fit stochastic-noise i.e. overfitting-noise really is an interplay of (stochastic-noise -> variance). Need help in interpreting it. 2. When we try to arrest the overfitting, using brakes(regularization) and/or validation, are we really working with overfitting alone ? In case of validation, we will have a measure of total-error : Is it that the relativity of total-errors across choice of model-complexity(e.g. H2 Vs H10), is giving us an estimate of relative measure of overfitting across choices of hypothesis-complexity? In case of brakes(regularization) : will the brake really be applied on overfitting alone, and not other parts of total-error, esp bias part ? 3. Consider a case in which target-complexity is 2nd order polynomial and we chose a 2nd order(H2) and a 10th order polynomial(H10) to fit it. How will the overfit and bias vary for the two hypothesis (as N grows on the x-axis)? Specifically, will the H10 have overfitting (with or without stochastic noise)? Also, H10 should have higher bias compared to H2 ? 4. Is there a notion of underfitting wrt Target-Function ? When we try to fit a 10th order polynomial target-function, with a 2nd order polynomial hypothesis, are we not underfitting ? If so, can we associate underfitting to bias then ? If not, what else ? Thanks
__________________
Have faith in probability
#5
 sptripathi Junior Member Join Date: Apr 2013 Posts: 8 Re: Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

Thanks a lot, Prof Magdon. It feels much better now. Sure - I'll take the exercises as you suggested.


The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.