Thanks very much for the detailed reply!
Quote:
Originally Posted by magdon
Doing stuff that looks good in-sample but leads to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error -- you picked the \lambda with lower E_{in} but it gave higher E_{out}. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter.
This is a helpful distinction. The idea of being "led astray" has also been nice for intuition.
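To make the regularization example concrete for myself, I put together a minimal numerical sketch (my own toy setup, not anything from the book: the sine target, the degree-10 polynomial model, the noise level, N = 20, and the \lambda grid are all just assumptions). The idea is that as \lambda shrinks, E_{in} keeps dropping while E_{out} eventually turns around.

Code:
# Minimal sketch: sweep the regularization parameter lambda for a
# ridge-regularized polynomial fit and compare E_in with E_out.
# Target, noise level, degree, N, and the lambda grid are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)              # assumed target function f

def poly_features(x, degree=10):
    # Monomial features 1, x, ..., x^degree
    return np.vstack([x**k for k in range(degree + 1)]).T

def fit_ridge(Z, y, lam):
    # Standard ridge / weight-decay solution: (Z^T Z + lam I)^{-1} Z^T y
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

N = 20
x_train = rng.uniform(-1, 1, N)
y_train = target(x_train) + 0.3 * rng.normal(size=N)   # stochastic noise
x_test = np.linspace(-1, 1, 1000)
y_test = target(x_test)                                 # noiseless E_out proxy

Z_train, Z_test = poly_features(x_train), poly_features(x_test)
for lam in [0.0, 1e-4, 1e-2, 1e-1, 1.0, 10.0]:
    w = fit_ridge(Z_train, y_train, lam)
    E_in = np.mean((Z_train @ w - y_train) ** 2)
    E_out = np.mean((Z_test @ w - y_test) ** 2)
    print(f"lambda={lam:<8} E_in={E_in:.4f}  E_out={E_out:.4f}")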
Quote:
Originally Posted by magdon
To understand what is going on, the Bias Variance decomposition helps (bottom of page 125 in the textbook).
\sigma^2 is the direct impact of the stochastic noise. Bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through the learned hypothesis g^{(\mathcal{D})}. The var term is mostly controlled by the size of \mathcal{H} in relation to the number of data points. So getting back to the point, if you make \mathcal{H} more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact).
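(Writing the decomposition out for reference, the way I read the bottom of page 125:

E_D[ E_{out}(g^{(\mathcal{D})}) ] = \sigma^2 + bias + var
bias(x) = ( \bar{g}(x) - f(x) )^2
var(x)  = E_D[ ( g^{(\mathcal{D})}(x) - \bar{g}(x) )^2 ]

with bias and var being the averages of bias(x) and var(x) over x.)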
This makes perfect sense as well, and is how I had been thinking of the major impact of deterministic noise in causing overfitting. What spurred me to think about this is in fact the exercise on page 125, and the hint that, as \mathcal{H} becomes more complex, there are two factors affecting overfitting. The bias/variance trade-off -- and thus the indirect impact of deterministic noise -- is clear, but it is a little confusing that deterministic noise (bias) would directly cause overfitting.
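To check that picture numerically, this is the rough Monte Carlo experiment I have in mind (again my own construction; the sine target, N = 20, the noise level, and the polynomial degrees standing in for increasingly complex \mathcal{H} are all assumptions): estimate \bar{g} by averaging the fits over many data sets, then read off bias and var as the degree grows.

Code:
# Rough sketch: Monte Carlo estimate of bias and var as H_d (polynomials
# of degree d) gets more complex, at a fixed N.  Target, noise level, N,
# and the number of data sets are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                   # assumed target function
    return np.sin(2 * np.pi * x)

N, n_datasets, noise = 20, 500, 0.2
x_grid = np.linspace(-1, 1, 400)            # where bias/var are measured

for d in [1, 2, 4, 7, 10]:
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        y = f(x) + noise * rng.normal(size=N)
        preds[i] = np.polyval(np.polyfit(x, y, d), x_grid)  # g^(D) on the grid
    g_bar = preds.mean(axis=0)                    # estimate of g-bar
    bias = np.mean((g_bar - f(x_grid)) ** 2)      # ~ deterministic noise
    var = np.mean(preds.var(axis=0))              # indirect impact of the noise
    print(f"degree={d:<3} bias={bias:.4f}  var={var:.4f}")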
What I am curious about is how we can be "led astray" if \mathcal{H} and f must stay fixed, and in my mind I keep coming back to the precise definition of \bar{g}: if N (the size of the training data set) is very small, variance will suffer, but also \bar{g} will differ from the best hypothesis in \mathcal{H}, leading to higher deterministic noise; if N is big enough, \bar{g} will match the best hypothesis closely, and both variance and deterministic noise will shrink. So, even in cases of very large deterministic noise, if N is very big and gives us a near-perfect shape of the target function, we are not "led astray" at all (and indeed the g we learn would track h^* very well). It seems like that wiggle room in the deterministic noise tracks a bigger change in the variance. Does this make sense?
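For what it's worth, the concrete experiment behind my question is the same style of Monte Carlo sketch as above, but now sweeping N instead of the degree (again my own construction, nothing from the book): \mathcal{H} is fixed to degree-2 polynomials against a sine target so the deterministic noise is large and fixed, h^* is approximated by a noiseless fit on a dense grid, and I only vary N to see whether \bar{g} closes in on h^* while var dies out.

Code:
# Sketch: H is fixed to degree-2 polynomials, the target is a sine, so the
# deterministic noise is large and does not depend on N.  Only N varies.
# h* is approximated by fitting H to the noiseless target on a dense grid;
# the noise level, grid, and N values are assumptions.
import numpy as np

rng = np.random.default_rng(2)

def f(x):                                    # assumed target, well outside H
    return np.sin(2 * np.pi * x)

degree, noise, n_datasets = 2, 0.2, 500
x_grid = np.linspace(-1, 1, 400)

# Stand-in for the best hypothesis h* in H (noiseless least-squares fit).
h_star = np.polyval(np.polyfit(x_grid, f(x_grid), degree), x_grid)

for N in [5, 20, 100, 500]:
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        y = f(x) + noise * rng.normal(size=N)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_grid)
    g_bar = preds.mean(axis=0)
    gap = np.mean((g_bar - h_star) ** 2)          # does g-bar approach h*?
    det = np.mean((h_star - f(x_grid)) ** 2)      # fixed deterministic noise
    var = np.mean(preds.var(axis=0))              # should shrink with N
    print(f"N={N:<4} ||g_bar - h*||^2={gap:.4f}  det.noise={det:.4f}  var={var:.4f}")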