LFD Book Forum  


#1, 05-16-2012, 05:57 AM
mic00 (Invited Guest; Join Date: Apr 2012; Posts: 49)
How does deterministic noise cause overfitting?

I am still a little confused about this. It's clear to me that reducing deterministic noise can lead to overfitting (if there is not enough in-sample data), but the presence of deterministic noise itself seems (to me) to cause underfitting. Am I just being pedantic?

(I contrast this with stochastic noise: it cannot be reduced, and clearly any attempt to fit it is overfitting, because no amount of in-sample data will clarify its shape).
#2, 05-17-2012, 11:54 AM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 595)
Re: How does deterministic noise cause overfitting?

This is a very subtle question!

The most important thing to realize is that in learning, \cal{H} is fixed and \cal{D} is given, and so can be assumed fixed. Now we can ask what is going on in this learning scenario. Here is what we can say:

i) If there is stochastic noise with 'magnitude' \sigma^2, then you are in trouble.

ii) If there is deterministic noise, then you are in trouble.

The stochastic noise can be viewed as one part of the data generation process (e.g. measurement errors). The deterministic noise can similarly be viewed as another part of the data generation process, namely f. The deterministic and stochastic noise are fixed. In your analogy, you can increase the stochastic noise by increasing the noise variance and you get into deeper trouble. Similarly, you can increase the deterministic noise by making f more complex and you will get into deeper trouble.
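To make the two "knobs" concrete, here is a tiny sketch (my own toy setup in Python/numpy, not from the book): sigma scales the stochastic noise added to the measurements, and Q_f, the number of sinusoidal terms in the target, is a stand-in for the complexity of f, i.e. for how much deterministic noise a fixed \cal{H} will see.

[code]
import numpy as np

rng = np.random.default_rng(0)

def generate_data(N, sigma, Q_f):
    """Draw N points from y = f(x) + stochastic noise.

    sigma : standard deviation of the stochastic noise
    Q_f   : number of sinusoidal terms in f (a proxy for target complexity)
    """
    x = rng.uniform(-1, 1, N)
    f_x = sum(np.sin(q * np.pi * x) for q in range(1, Q_f + 1))
    y = f_x + sigma * rng.normal(size=N)
    return x, y

# Turning up either knob makes learning harder for the same fixed H:
x1, y1 = generate_data(N=20, sigma=0.1, Q_f=1)   # little noise of either kind
x2, y2 = generate_data(N=20, sigma=1.0, Q_f=5)   # more stochastic and more deterministic noise
[/code]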

I just need to tell you what 'trouble' means. Well, we actually use another word instead of 'trouble' - overfitting. This means you are likely to make an inferior choice over a superior one because the inferior choice has lower in-sample error. Doing things that look good in-sample but lead to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error - you picked the \lambda with lower E_{in} but it gave higher E_{out}. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter. Once the regularization parameter gets too high, picking a higher \lambda gives you both higher E_{in} and higher E_{out}. It also turns out that this means you over-regularized and obtained an over-simplistic g - i.e. you 'underfitted', you didn't fit the data enough. Underfitting and overfitting are just terms. The substance of what is going on under the hood is how the deterministic and stochastic noise affect what you should and should not do in-sample.
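As a rough illustration of the \lambda-selection point, here is a sketch under my own assumptions (degree-10 polynomial features, ridge regression, a sin(\pi x) target with added noise - not the book's experiment). With a small noisy sample, the smallest \lambda typically wins on E_in while a moderate \lambda wins on E_out; exact numbers depend on the random seed.

[code]
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)                      # stand-in target f (my choice)

def poly_features(x, degree=10):
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# small noisy training set -> prone to overfitting
x_train = rng.uniform(-1, 1, 15)
y_train = target(x_train) + 0.3 * rng.normal(size=15)

# large held-out set to estimate E_out
x_test = rng.uniform(-1, 1, 10_000)
y_test = target(x_test) + 0.3 * rng.normal(size=10_000)

for lam in [1e-6, 1e-3, 1e-1, 1.0, 10.0]:
    w = ridge_fit(poly_features(x_train), y_train, lam)
    e_in = np.mean((poly_features(x_train) @ w - y_train) ** 2)
    e_out = np.mean((poly_features(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam:8.0e}  E_in={e_in:.3f}  E_out={e_out:.3f}")
[/code]

If the printout shows E_in rising with \lambda while E_out first falls and then rises, the left end of that curve is the overfitting regime and the right end the underfitting regime described above.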

Now let's get back to the subtle part of your question. There is actually another way to decrease the deterministic noise - increase the complexity of \cal{H} (the other way is to decrease the complexity of f, which we discussed above). Here is where the difference with stochastic noise pops up. Stochastic noise either goes up or down; if down, then things get better. With deterministic noise, if you just tell me that it went down, I need to ask you *how*. Did your target function get simpler - if yes, then great, it is just as if the stochastic noise went down. If it is because your \cal{H} got more complicated, then things get interesting. To understand what is going on, the Bias Variance decomposition helps (bottom of page 125 in the textbook).

E_{out}=\sigma^2+bias+var

\sigma^2 is the direct impact of the stochastic noise. bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through \cal{H}. The var term is mostly controlled by the size of \cal{H} in relation to the number of data points. So getting back to the point, if you make \cal{H} more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact). Usually the latter dominates (overfitting, not because of the direct impact of the noise, but because of its indirect impact) ... unless you are in the underfitting regime when the former dominates.
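Here is a minimal numerical sketch of that decomposition (again my own toy setup, not the book's code: target sin(\pi x), N=5 noiseless points per data set, least-squares polynomial fits). Averaging over many data sets estimates \bar g, bias and var, and sweeping the polynomial degree plays the role of making \cal{H} more complex.

[code]
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)        # noiseless target, so E_out = bias + var

N = 5                                   # points per data set (my choice)
n_datasets = 2000
x_grid = np.linspace(-1, 1, 500)        # grid used to average over x

for degree in [0, 1, 2, 3]:             # increasing complexity of H
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        coeffs = np.polyfit(x, f(x), degree)
        preds[d] = np.polyval(coeffs, x_grid)
    g_bar = preds.mean(axis=0)          # the average hypothesis \bar g
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean((preds - g_bar) ** 2)
    print(f"degree={degree}  bias={bias:.3f}  var={var:.3f}  bias+var={bias + var:.3f}")
[/code]

The trend to look for is bias falling and var rising as the degree goes up; which of the two dominates bias+var depends on N and varies with the seed, which is the direct vs. indirect impact trade-off described above.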




__________________
Have faith in probability
#3, 05-17-2012, 05:28 PM
mic00 (Invited Guest; Join Date: Apr 2012; Posts: 49)
Re: How does deterministic noise cause overfitting?

Thanks very much for the detailed reply!

Quote:
Originally Posted by magdon View Post
Doing things that look good in-sample but lead to disasters out-of-sample is the essence of overfitting. An example of this is trying to choose the regularization parameter. If you pick a lower regularization parameter, then you have lower in-sample error, but it leads to higher out-of-sample error - you picked the \lambda with lower E_{in} but it gave higher E_{out}. We call that overfitting. Underfitting is just the name we give to the opposite process in the context of picking the regularization parameter.
This is a helpful distinction. The idea of being "led astray" has also been nice for intuition.

Quote:
Originally Posted by magdon View Post
To understand what is going on, the Bias Variance decomposition helps (bottom of page 125 in the textbook).

E_{out}=\sigma^2+bias+var

\sigma^2 is the direct impact of the stochastic noise. bias is the direct impact of the deterministic noise. The var term is interesting and is the indirect impact of the noise, through \cal{H}. The var term is mostly controlled by the size of \cal{H} in relation to the number of data points. So getting back to the point, if you make \cal{H} more complex, you will decrease the det. noise (bias) but you will increase the var (its indirect impact).
This makes perfect sense as well, and is how I had been thinking of the major impact of deterministic noise in causing overfitting. What spurred me to think about this is in fact the exercise on page 125, and the hint that, as \cal{H} becomes more complex, there are two factors affecting overfitting. The bias/variance trade-off -- and thus the indirect impact of deterministic noise -- is clear, but that deterministic noise (bias) would directly cause overfitting is a little confusing.

What I am curious about is how we can be "led astray" if \cal{H} and f must stay fixed, and in my mind, I keep coming back to the precise definition of \bar g: if N (size of training data set) is very small, variance will suffer, but also \bar g will differ from the best hypothesis in \cal{H}, leading to higher deterministic noise; if N is big enough, \bar g will match the best hypothesis closely, and both variance and deterministic noise will shrink. So, even in cases of very large deterministic noise, if N is very big and gives us a near-perfect shape of the target function, we are not "led astray" at all (and indeed E_{in} would track E_{out} very well). It seems like that wiggle room in the deterministic noise tracks a bigger change in the variance. Does this make sense?
#4, 05-17-2012, 10:17 PM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 595)
Re: How does deterministic noise cause overfitting?

The "being led astray" refers to the "noise" in the finite data set leading the learning algorithm in the wrong direction and outputting the wrong final hypothesis (though {\cal{H}},f are fixed). This tendency to be led astray is worse for more complex \cal{H} because it has more flexibility to (over)fit the noise and hence be led astray. This is what contributes most to the var term in the bias variance decomposition. Different data sets (with noise) will lead the learning astray in wildly different directions resulting in high var.

We didn't precisely define deterministic noise; we just gave the intuitive idea. bias is closely related to it, though not exactly the same. Indeed, though \bar g might be worse for smaller N, its dependence on N is mild. See, for example, Problem 3.14 as evidence that the bias has only a mild dependence on N. In practice, \bar g is close to h^* no matter what N is, and so the bias is more or less the deterministic noise.
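A quick way to check this claim numerically (a sketch with my own choices: \cal{H} = lines, noiseless sin(\pi x) target, not the book's code) is to hold \cal{H} fixed and sweep N; the bias estimate should stay roughly at the level of the deterministic noise while var shrinks.

[code]
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
x_grid = np.linspace(-1, 1, 500)
n_datasets = 2000

for N in [3, 5, 10, 50, 200]:
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, N)
        coeffs = np.polyfit(x, f(x), 1)          # H = lines, fixed throughout
        preds[d] = np.polyval(coeffs, x_grid)
    g_bar = preds.mean(axis=0)                   # estimate of \bar g
    bias = np.mean((g_bar - f(x_grid)) ** 2)     # roughly the deterministic noise
    var = np.mean((preds - g_bar) ** 2)
    print(f"N={N:4d}  bias={bias:.3f}  var={var:.3f}")
[/code]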

__________________
Have faith in probability
