Lec11: Overfitting in terms of (bias, var, stochastic noise)
We have:
total error = var (overfitting noise?) + bias (deterministic noise) + stochastic noise

Questions:

1. Is the overfitting noise the var part alone? From the Prof's lecture, I tend to conclude that it is variance caused by the attempt to fit the stochastic noise, i.e. the overfitting noise really is an interplay (stochastic noise -> variance). Need help interpreting this.

2. When we try to arrest overfitting using brakes (regularization) and/or validation, are we really working with overfitting alone? In the case of validation, we will have a measure of the total error: is it that the relative total errors across choices of model complexity (e.g. H2 vs H10) give us an estimate of the relative amount of overfitting across choices of hypothesis complexity? In the case of brakes (regularization): will the brake really be applied to overfitting alone, and not to other parts of the total error, especially the bias part?

3. Consider a case in which the target complexity is a 2nd-order polynomial and we choose a 2nd-order polynomial (H2) and a 10th-order polynomial (H10) to fit it. How will the overfitting and bias vary for the two hypothesis sets (as N grows on the x-axis)? Specifically, will H10 overfit (with or without stochastic noise)? Also, should H10 have higher bias compared to H2?

4. Is there a notion of underfitting w.r.t. the target function? When we try to fit a 10th-order polynomial target function with a 2nd-order polynomial hypothesis, are we not underfitting? If so, can we associate underfitting with bias? If not, with what else?

Thanks
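The H2-vs-H10 scenario in question 3 can be checked numerically. Below is a minimal Monte Carlo sketch of the bias-variance decomposition; the target (f(x) = x^2), noise level, sample sizes, and trial counts are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_var(degree, N, trials=200, sigma=0.5):
    """Estimate bias and var for polynomial fits of a given degree.

    Assumed setup: target f(x) = x^2 on [-1, 1], Gaussian noise of
    standard deviation sigma added to the training labels.
    """
    x_test = np.linspace(-1, 1, 101)
    f_test = x_test ** 2
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        # Draw a fresh dataset and fit it by least squares
        x = rng.uniform(-1, 1, N)
        y = x ** 2 + rng.normal(0, sigma, N)
        coef = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coef, x_test)
    g_bar = preds.mean(axis=0)               # average hypothesis g-bar
    bias = np.mean((g_bar - f_test) ** 2)    # (g_bar - f)^2, averaged over x
    var = np.mean(preds.var(axis=0))         # spread of fits around g_bar
    return bias, var

for N in (20, 200):
    b2, v2 = bias_var(2, N)
    b10, v10 = bias_var(10, N)
    print(f"N={N}: H2 bias={b2:.3f} var={v2:.3f} | H10 bias={b10:.3f} var={v10:.3f}")
```

In this setup both H2 and H10 contain the target, so bias is near zero for both; the difference shows up in var, which is larger for H10 and shrinks for both models as N grows.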
Re: Lec11: Overfitting in terms of (bias, var, stochastic noise)
Thanks Elroch for your detailed reply (and your patience therein). That helped.
[Just one clarification to my first set of questions. Let's say that we always have 'sufficient' data points to learn from, for any choice of the order of polynomial in the hypothesis set, i.e. for H2 we have >> 20 points and for H10 we have >> 100 points, and likewise for any other order.]
However, if we had a probability distribution on the target function's complexity, then a given instance of it would still be a fixed-order polynomial, albeit one whose order we may not know. So we would use a validation set to gauge which order of polynomial in the hypothesis set seems more promising. Right?
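That validation-set comparison of polynomial orders can be sketched as follows. The target (a 3rd-order polynomial), the noise level, and the train/validation split are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples from an (assumed, unknown-to-the-learner) 3rd-order target
x = rng.uniform(-1, 1, 200)
y = x ** 3 - x + rng.normal(0, 0.1, 200)

# Hold out part of the data as a validation set
x_tr, y_tr = x[:120], y[:120]
x_val, y_val = x[120:], y[120:]

errors = {}
for degree in (1, 2, 3, 5, 10):
    # Fit on the training split only, score on the held-out split
    coef = np.polyfit(x_tr, y_tr, degree)
    errors[degree] = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    print(f"H{degree}: validation error = {errors[degree]:.4f}")

best = min(errors, key=errors.get)
print("order chosen by validation:", best)
```

The relative validation errors across orders are what guide the choice, exactly in the spirit of the question: validation measures total error, and comparing it across hypothesis complexities is what reveals which order is promising.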

Re: Lec11: Overfitting in terms of (bias, var, stochastic noise)
Very thoughtful questions.
1. Yes, you are correct. The term which overfitting is responsible for is the var term. It can occur when there is either deterministic or stochastic noise. One way to look at this is as follows. Suppose you picked the hypothesis that was truly the best. What would your error be? To a good approximation, it would be

E_out ≈ stochastic noise + bias,

because for most normal learning models, the best hypothesis is approximately the average hypothesis (see Problem 3.14(a) for an example). That being said, these first two terms in the bias-variance decomposition are inevitable, and we can view them as the direct impact of the noise (stochastic and deterministic). So the additional var term contributing to the error must result from our inability to pick the best hypothesis.

But why are we unable to pick the best hypothesis? Because we are being misled by the data. That is, the best hypothesis on the data (having minimum Ein) is not the best hypothesis out-of-sample (which must have higher Ein). By going for the lower-Ein hypothesis we are getting a higher-Eout hypothesis: we are overfitting the data. So you can view the var term as the indirect impact of the noise. It is not inevitable per se, but exists because of your `ability' to be misled by the data (i.e. to overfit). The complexity of your model plays a heavy role in your `ability' to be misled, since if your model is complicated you have more ways in which to be misled. If the number of data points goes up, approaching infinity, you will not significantly change the direct contribution of the noise; it is the var term that will go down, eventually approaching 0.

The answers to your remaining questions are related to the above discussion as well as to later material in the text.

2. Let's take regularization (validation is a little more complicated). In chapter 4 we will make an explicit connection between regularization and using a `smaller' hypothesis set.
So at the end of the day, most methods for `braking' effectively result in using a smaller hypothesis set. Regularization does this in a more flexible and `soft' way than simply picking H2 versus H10. And then, you are right: there is a tradeoff when you reduce the size of the model. You will increase the bias (direct impact) but decrease the var (indirect impact). One of these effects wins, and this determines whether you should increase or decrease your model size. In small-N, high-noise settings with complex models, the indirect impact wins and so it pays to regularize.

3. I highly recommend thinking about Exercise 4.3 :)

4. Yes, there is such a thing as underfitting (see chapter 4). It usually happens when it is the direct impact (bias) that wins over the indirect impact (var). And so, you should increase the size of H to reduce the direct impact at the expense of a small increase in the indirect impact. Underfitting occurs when the quality and quantity of your data is very high in relation to your model size.
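The `soft brake' view of regularization can be illustrated with weight decay on a 10th-order polynomial fit. This is a sketch under assumed data (the 2nd-order target, noise level, and lambda values are made up for illustration): as the regularization parameter grows, the weights shrink, effectively squeezing H10 toward a smaller hypothesis set.

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_poly_fit(x, y, degree, lam):
    """Fit a polynomial of the given degree with weight decay (ridge).

    Minimizes ||Z w - y||^2 + lam * ||w||^2, where Z is the
    polynomial feature matrix. lam = 0 recovers the plain fit.
    """
    Z = np.vander(x, degree + 1)                 # columns [x^d, ..., x, 1]
    A = Z.T @ Z + lam * np.eye(degree + 1)       # regularized normal equations
    return np.linalg.solve(A, Z.T @ y)

# Few noisy samples from an assumed 2nd-order target: H10 can easily overfit
x = rng.uniform(-1, 1, 15)
y = x ** 2 + rng.normal(0, 0.3, 15)

for lam in (0.0, 0.01, 1.0):
    w = ridge_poly_fit(x, y, 10, lam)
    print(f"lam={lam}: ||w|| = {np.linalg.norm(w):.2f}")
```

With lam = 0 the 10th-order fit chases the noise and the weight norm is large; increasing lam applies the brake smoothly rather than forcing a discrete jump from H10 down to H2.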

Re: Lec11: Overfitting in terms of (bias, var, stochastic noise)
Thanks a lot, Prof Magdon. It feels much better now. :)
Sure, I'll work through the exercises as you suggested.
The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.