Elroch (Invited Guest, Join Date: Mar 2013, Posts: 143)
Re: Lec-11: Overfitting in terms of (bias, var, stochastic-noise)

Quote:
Originally Posted by sptripathi

We have: total-noise = var (overfitting-noise?) + bias (deterministic-noise) + stochastic-noise

Qs:

1. Is overfitting-noise the var part alone? From Profs lecture, I tend to conclude that it is var caused because of attempt to fit stochastic-noise, i.e. overfitting-noise really is an interplay of (stochastic-noise -> variance). Need help in interpreting it.

2. When we try to arrest the overfitting, using brakes (regularization) and/or validation, are we really working with overfitting alone? In case of validation, we will have a measure of total-error: Is it that the relativity of total-errors across choice of model-complexity (e.g. H2 vs H10) is giving us an estimate of relative measure of overfitting across choices of hypothesis-complexity? In case of brakes (regularization): will the brake really be applied on overfitting alone, and not other parts of total-error, esp. bias part?

3. Consider a case in which target-complexity is 2nd order polynomial and we chose a 2nd order (H2) and a 10th order polynomial (H10) to fit it. How will the overfit and bias vary for the two hypothesis (as N grows on the x-axis)? Specifically, will the H10 have overfitting (with or without stochastic noise)? Also, H10 should have higher bias compared to H2?

4. Is there a notion of underfitting wrt Target-Function? When we try to fit a 10th order polynomial target-function with a 2nd order polynomial hypothesis, are we not underfitting? If so, can we associate underfitting to bias then? If not, what else?

Thanks

Impressive list of questions! I'll try to shed light on some of them.

1. I think it is fair to say deterministic noise or bias can lead to overfitting as well. For example, suppose you try to model sine functions on some interval with a hypothesis set made up only of positive constant functions. This is such a bad hypothesis set for the job that, however many data points and however much regularization you use, you would in general be better off using the single hypothesis consisting of the zero function. I would say this is a clear case of overfitting of the bias.
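To put rough numbers on the sine example, here is a minimal sketch. The specifics are assumptions for illustration, not part of the example as stated: the target is sin(pi x) with x uniform on [-1, 1], there is no stochastic noise, and the "positive constants only" fit is the least-squares constant clipped to a small positive value `eps`. Under those assumptions the out-of-sample squared error of a constant c is exactly c^2 + 1/2, so no positive constant can beat the zero function. The function names are hypothetical.

```python
import numpy as np

def positive_constant_fit(y, eps=1e-6):
    """Least-squares constant restricted to c > 0.

    The unconstrained minimiser is mean(y); if that is not positive, the
    constrained infimum sits at 0, so we clip to a small eps (an assumption).
    """
    return max(float(np.mean(y)), eps)

def e_out_constant(c):
    """Exact out-of-sample squared error of h(x) = c against sin(pi x),
    x ~ U[-1, 1]: E[(c - sin)^2] = c^2 - 2c*E[sin] + E[sin^2] = c^2 + 1/2."""
    return c ** 2 + 0.5

def average_e_out(n_points, n_trials=2000, seed=0):
    """Average E_out of the fitted positive constant over many data sets."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        y = np.sin(np.pi * x)  # noise-free samples of the assumed target
        errors.append(e_out_constant(positive_constant_fit(y)))
    return float(np.mean(errors))

if __name__ == "__main__":
    print("zero function E_out:", e_out_constant(0.0))
    for n in (2, 10, 100):
        print(f"N={n}: fitted positive constant, average E_out =", average_e_out(n))
```

In this setup the fitted constant is only ever pulled away from zero by the particular sample, which is one way to read "overfitting of the bias".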

4. As I understand it, underfitting and overfitting can only ever be defined by contrast with what is possible, hence your first two questions in paragraph 4 are not well-posed.

A crucial point emphasised in the lectures is that the appropriate approximation technique (i.e. the combination of hypothesis set and regularization) is determined more by the data that is available than by the actual form of the target function. For example, fitting a 10th order polynomial target with a 2nd order polynomial hypothesis (without regularization) may easily be overfitting if only 3 data points are provided.
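A rough simulation of the 3-point case, as a sketch under stated assumptions: the target below is one arbitrary 10th order polynomial on [-1, 1] (Chebyshev terms T_2 + 0.5 T_10, chosen only for illustration), inputs are uniform, there is no noise, and E_out is approximated on a dense grid. It compares an unregularized quadratic fit with a constant fit for N = 3 and N = 50. All names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev

# Assumed target: T_2(x) + 0.5 * T_10(x), a 10th order polynomial on [-1, 1].
TARGET_COEFFS = np.zeros(11)
TARGET_COEFFS[2] = 1.0
TARGET_COEFFS[10] = 0.5

def target(x):
    return chebyshev.chebval(x, TARGET_COEFFS)

# Dense grid used to approximate the out-of-sample error.
X_TEST = np.linspace(-1.0, 1.0, 2001)
Y_TEST = target(X_TEST)

def mean_e_out(degree, n_points, n_trials=1000, seed=0):
    """Average squared error on the grid of a least-squares polynomial fit
    of the given degree, over many random data sets of size n_points."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        coeffs = np.polyfit(x, target(x), degree)
        total += np.mean((np.polyval(coeffs, X_TEST) - Y_TEST) ** 2)
    return total / n_trials

if __name__ == "__main__":
    for n in (3, 50):
        print(f"N={n}: constant {mean_e_out(0, n):.3f}, "
              f"quadratic {mean_e_out(2, n):.3f}")
```

With 3 points the quadratic interpolates the data exactly and its average out-of-sample error is dominated by a few wild fits, so the constant does better on average; with 50 points the ordering reverses. The target is the same in both cases, only the amount of data changes.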

Pondering these issues a bit, I realise that the missing piece of the jigsaw needed to make them precise and quantitative is the distribution of possible actual functions that we are trying to approximate. I say "distribution" rather than "set" because how likely each function is dramatically affects the optimal combination of hypothesis set and regularization, as does the data that is available.

Say, for example, all possible 10th order polynomials on a unit interval are possibilities for some unknown function. Suppose, however, that anything very far from a quadratic is very unlikely, and increasingly unlikely as the coefficients get bigger (excuse my vagueness, but the idea is that the actual function is a 10th order polynomial that is extremely unlikely to be much different from a quadratic).

Now assume that we are given 3 points and asked to approximate the actual function. If the actual function had been a quadratic, we could just fit it perfectly. Since we know it is very close to a quadratic, we can still be sure that almost all the time a quadratic fit is going to be pretty good.

This contrasts with the situation where the chance of the original function being far from quadratic is high, in which case using a quadratic to fit 3 points can overfit wildly. In that situation, I believe only severe regularization might justify the use of a quadratic at all, and using a simpler model might make more sense.
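The contrast between the two situations can be sketched numerically. Everything specific here is an assumption made to pin down the vague "distribution of functions": targets are random 10th order polynomials on the unit interval, written in a Chebyshev basis with independent Gaussian coefficients, where the coefficients of orders 3 to 10 have standard deviation `high_order_scale` (small means "almost surely close to a quadratic"). Each target is sampled at 3 uniform points without noise and fitted with a quadratic or a constant; the median out-of-sample error over many targets is reported. Names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev

X_TEST = np.linspace(0.0, 1.0, 2001)  # dense grid on the unit interval

def random_target(rng, high_order_scale):
    """Draw one 10th order polynomial. Orders 0-2 have unit-scale
    coefficients; orders 3-10 are scaled by high_order_scale (assumed prior)."""
    scales = np.ones(11)
    scales[3:] = high_order_scale
    coeffs = rng.normal(0.0, scales)
    # Map [0, 1] onto [-1, 1], the natural domain of the Chebyshev basis.
    return lambda x: chebyshev.chebval(2.0 * x - 1.0, coeffs)

def median_e_out(high_order_scale, degree, n_points=3, n_trials=1000, seed=0):
    """Median squared error on the grid of a degree-`degree` least-squares
    fit to n_points noise-free samples, over random targets and data sets."""
    rng = np.random.default_rng(seed)
    errors = np.empty(n_trials)
    for t in range(n_trials):
        f = random_target(rng, high_order_scale)
        x = rng.uniform(0.0, 1.0, size=n_points)
        fit = np.polyfit(x, f(x), degree)
        errors[t] = np.mean((np.polyval(fit, X_TEST) - f(X_TEST)) ** 2)
    return float(np.median(errors))

if __name__ == "__main__":
    for scale in (0.01, 1.0):
        print(f"high-order scale {scale}: "
              f"constant {median_e_out(scale, 0):.4f}, "
              f"quadratic {median_e_out(scale, 2):.4f}")
```

The median is used rather than the mean to match the "almost all the time" phrasing: when the targets are nearly quadratic, the 3-point quadratic fit is typically far better than a constant, and when the high-order coefficients are as large as the low-order ones, the simpler model typically wins.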