Thanks very much for the detailed reply!
Quote:
Originally Posted by magdon
Doing stuff that looks good in-sample that leads to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error: you picked the $\lambda$ with lower $E_{\text{in}}$ but it gave higher $E_{\text{out}}$. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter.

This is a helpful distinction. The idea of being "led astray" has also been nice for intuition.
Quote:
Originally Posted by magdon
To understand what is going on, the bias-variance decomposition helps (bottom of page 125 in the textbook).
$\sigma^2$ is the direct impact of the stochastic noise. bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through $\mathcal{H}$. The var term is mostly controlled by the size of $\mathcal{H}$ in relation to the number of data points. So getting back to the point, if you make $\mathcal{H}$ more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact).

This makes perfect sense as well, and is how I had been thinking of the major impact of deterministic noise in causing overfitting. What spurred me to think about this is in fact the exercise on page 125, and the hint that, as $\mathcal{H}$ becomes more complex, there are two factors affecting overfitting. The bias/variance tradeoff, and thus the indirect impact of deterministic noise, is clear, but the idea that deterministic noise (bias) would directly cause overfitting is a little confusing.
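For concreteness, this is the decomposition I have in mind, written in what I take to be the textbook's notation (here $g^{(\mathcal{D})}$ is the hypothesis learned from data set $\mathcal{D}$, and $\bar{g}$ is its average over data sets of a fixed size $N$):

$$
\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}\!\left(g^{(\mathcal{D})}\right)\right] = \sigma^2 + \text{bias} + \text{var},
\qquad
\text{bias} = \mathbb{E}_{x}\!\left[\left(\bar{g}(x) - f(x)\right)^2\right],
\qquad
\text{var} = \mathbb{E}_{x}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - \bar{g}(x)\right)^2\right]\right].
$$

Note that $\bar{g}$ is an average over data sets of size $N$, so in this form the bias term can depend on $N$ as well as on $\mathcal{H}$ and $f$.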
What I am curious about is how we can be "led astray" if $\mathcal{H}$ and $f$ must stay fixed, and in my mind, I keep coming back to the precise definition of $\bar{g}$: if $N$ (size of training data set) is very small, variance will suffer, but also $\bar{g}$ will differ from the best hypothesis in $\mathcal{H}$, leading to higher deterministic noise; if $N$ is big enough, $\bar{g}$ will match the best hypothesis closely, and both variance and deterministic noise will shrink. So, even in cases of very large deterministic noise, if $N$ is very big and gives us a near-perfect shape of the target function, we are not "led astray" at all (and indeed $E_{\text{in}}$ would track $E_{\text{out}}$ very well). It seems like that wiggle room in the deterministic noise tracks a bigger change in the variance. Does this make sense?
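
To check my intuition, here is a rough simulation sketch. Everything in it is my own assumption rather than anything from the textbook exercise: the noiseless target sin(pi x) on [-1, 1], straight lines fitted by least squares as the hypothesis set, and the helper name `bias_variance`. It estimates bias and var by Monte Carlo for several values of N while the hypothesis set and the target stay fixed.

```python
import numpy as np

def bias_variance(n_points, n_datasets=2000, degree=1, seed=0):
    """Monte Carlo estimate of (bias, var) for least-squares polynomial
    fits of the given degree to the noiseless target f(x) = sin(pi x)."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 201)      # grid approximating E_x
    f_test = np.sin(np.pi * x_test)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, n_points)  # one data set of size N
        y = np.sin(np.pi * x)                 # no stochastic noise
        coeffs = np.polyfit(x, y, degree)     # least-squares fit in H
        preds[d] = np.polyval(coeffs, x_test)
    g_bar = preds.mean(axis=0)                # average hypothesis
    bias = np.mean((g_bar - f_test) ** 2)     # E_x[(g_bar - f)^2]
    var = np.mean(preds.var(axis=0))          # E_x[E_D[(g - g_bar)^2]]
    return bias, var

for n in (2, 5, 20, 100):
    bias, var = bias_variance(n)
    print(f"N={n:4d}  bias={bias:.3f}  var={var:.3f}")
```

If the picture above is right, the bias column should move only a little as N grows, while the var column should shrink a lot.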