#3
05-17-2012, 05:28 PM
 mic00 Invited Guest Join Date: Apr 2012 Posts: 49
Re: How does deterministic noise cause overfitting?

Thanks very much for the detailed reply!

Quote:
 Originally Posted by magdon: Doing stuff that looks good in-sample that leads to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error - you picked the $\lambda$ with lower $E_{\text{in}}$ but it gave higher $E_{\text{out}}$. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter.
This is a helpful distinction. The idea of being "led astray" has also been nice for intuition.

Quote:
 Originally Posted by magdon: To understand what is going on, the Bias Variance decomposition helps (bottom of page 125 in the textbook). $\sigma^2$ is the direct impact of the stochastic noise. bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through $\mathcal{H}$. The var term is mostly controlled by the size of $\mathcal{H}$ in relation to the number of data points. So getting back to the point, if you make $\mathcal{H}$ more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact).
This makes perfect sense as well, and is how I had been thinking of the major impact of deterministic noise in causing overfitting. What spurred me to think about this is in fact the exercise on page 125, and the hint that, as $\mathcal{H}$ becomes more complex, there are two factors affecting overfitting. The bias/variance trade-off -- and thus the indirect impact of deterministic noise -- is clear, but that deterministic noise (bias) would directly cause overfitting is a little confusing.
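
For reference, the decomposition under discussion can be written out as below. This is the standard form for squared error with zero-mean stochastic noise of variance $\sigma^2$. The notation is an assumption on my part rather than something copied from the thread: $g^{(\mathcal{D})}$ is the hypothesis learned from data set $\mathcal{D}$, and $\bar{g}$ is the average hypothesis.

```latex
\begin{align}
\mathbb{E}_{\mathcal{D}}\!\left[E_{\text{out}}\!\left(g^{(\mathcal{D})}\right)\right]
  &= \sigma^2 + \text{bias} + \text{var}, \\
\bar{g}(x) &= \mathbb{E}_{\mathcal{D}}\!\left[g^{(\mathcal{D})}(x)\right], \\
\text{bias} &= \mathbb{E}_{x}\!\left[\left(\bar{g}(x) - f(x)\right)^2\right], \\
\text{var} &= \mathbb{E}_{x}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(g^{(\mathcal{D})}(x) - \bar{g}(x)\right)^2\right]\right].
\end{align}
```

Note that $\bar{g}$ is an average over data sets of a given size, so both bias and var depend implicitly on the number of data points.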

What I am curious about is how we can be "led astray" if $\mathcal{H}$ and $f$ must stay fixed. In my mind, I keep coming back to the precise definition of $\bar{g}$: if $N$ (the size of the training data set) is very small, variance will suffer, but $\bar{g}$ will also differ from the best hypothesis in $\mathcal{H}$, leading to higher deterministic noise; if $N$ is big enough, $\bar{g}$ will match the best hypothesis closely, and both variance and deterministic noise will shrink. So, even in cases of very large deterministic noise, if $N$ is very big and gives us a near-perfect shape of the target function, we are not "led astray" at all (and indeed $E_{\text{in}}$ would track $E_{\text{out}}$ very well). It seems like that wiggle room in the deterministic noise tracks a bigger change in the variance. Does this make sense?
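
A rough way to check this intuition numerically is sketched below. Everything in it is an assumption made for illustration, not something from this thread: the noiseless target sin(pi x) on [-1, 1], a hypothesis set of straight lines fit by least squares, and the helper names `target`, `fit_line`, and `bias_var`. Since the target is noiseless, any error beyond var comes from deterministic noise. The sketch estimates bias and var by averaging over many random data sets of a given size.

```python
import math
import random


def target(x):
    # Hypothetical noiseless target, chosen only for illustration.
    return math.sin(math.pi * x)


def fit_line(xs, ys):
    # Ordinary least-squares fit of y = a*x + b; returns (a, b).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate data set (all x equal): fall back to a constant.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def bias_var(n_points, n_datasets=1000, n_test=200, seed=0):
    # Monte Carlo estimate of (bias, var) for data sets of size n_points.
    rng = random.Random(seed)
    fits = []
    for _ in range(n_datasets):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
        ys = [target(x) for x in xs]
        fits.append(fit_line(xs, ys))

    # The model is linear in (a, b), so averaging the parameters
    # gives the average hypothesis g_bar.
    a_bar = sum(a for a, _ in fits) / n_datasets
    b_bar = sum(b for _, b in fits) / n_datasets

    # Evenly spaced test points approximate the expectation over x.
    test_xs = [-1.0 + 2.0 * (i + 0.5) / n_test for i in range(n_test)]
    bias = sum((a_bar * x + b_bar - target(x)) ** 2 for x in test_xs) / n_test
    var = sum(
        sum((a * x + b - (a_bar * x + b_bar)) ** 2 for a, b in fits) / n_datasets
        for x in test_xs
    ) / n_test
    return bias, var


if __name__ == "__main__":
    for n in (2, 5, 20, 100):
        bias, var = bias_var(n)
        print(f"N={n:4d}  bias={bias:.3f}  var={var:.3f}")
```

Comparing the printed rows for small and large N shows how much each term moves as the data set grows, which is the comparison the question above is asking about.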