LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Chapter 4 - Overfitting (http://book.caltech.edu/bookforum/forumdisplay.php?f=111)
-   -   overfitting and spurious final hypothesis (http://book.caltech.edu/bookforum/showthread.php?t=4483)

 sasin324 05-19-2014 11:20 PM

overfitting and spurious final hypothesis

Based on the book, pages 124-125:
"On a finite data set, the algorithm inadvertently uses some of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis."
I have some questions based on this sentence:
1. What is a spurious hypothesis? How can we identify one?
2. Is there a relationship between the overfitting phenomenon and a spurious hypothesis?
3. Does a spurious hypothesis come from the impact of deterministic noise in the data set?

I have been stuck for a while trying to define a spurious hypothesis and to identify it in a model.

Best Regards,

 yaser 05-20-2014 01:34 PM

Re: overfitting and spurious final hypothesis

Quote:
 Originally Posted by sasin324 (Post 11674) …
The expression "spurious final hypothesis" is informal. When you fit the noise in sample, whether it is stochastic or deterministic, this takes you away from the desired hypothesis out of sample, since the 'extrapolation' of noise has nothing to do with the desired hypothesis. What you end up with is a spurious (not genuine or authentic) hypothesis.

This is indeed an overfitting phenomenon, since fitting the noise is what overfitting is about. Validation can identify overfitting by detecting that the error is getting worse out of sample while the fit is getting better in sample.
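To make this concrete, here is a minimal sketch (my own illustration, not from the book): a small noisy sample is drawn from an arbitrary target sin(πx), and two polynomial models are fit by least squares. The complex model achieves a better fit in sample precisely by fitting the noise, and a held-out set reveals the worse out-of-sample error. The target function, noise level, and polynomial degrees are all made-up choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)            # the "true" function behind the data

# Small noisy sample (in-sample) and a clean grid for out-of-sample error.
x_train = rng.uniform(-1, 1, 12)
y_train = target(x_train) + rng.normal(0.0, 0.3, 12)
x_test = np.linspace(-1, 1, 200)
y_test = target(x_test)

def in_out_errors(degree):
    """Least-squares polynomial fit; returns (E_in, E_out) as mean squared errors."""
    coeffs = np.polyfit(x_train, y_train, degree)
    e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    e_out = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return e_in, e_out

e_in_3, e_out_3 = in_out_errors(3)      # simple model
e_in_11, e_out_11 = in_out_errors(11)   # complex model: enough parameters to fit the noise

# The complex fit always looks at least as good in sample (it fits the noise too)...
assert e_in_11 < e_in_3
# ...but with this seed its "extrapolated noise" makes it far worse out of sample.
print(f"degree 3:  E_in={e_in_3:.4f}  E_out={e_out_3:.4f}")
print(f"degree 11: E_in={e_in_11:.4f}  E_out={e_out_11:.4f}")
```

This is exactly the signature validation looks for: E_in going down while the held-out error goes up.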

 sasin324 05-20-2014 03:57 PM

Re: overfitting and spurious final hypothesis

Thanks for your response. That is a very clear answer to my questions.
However, I am still a bit confused about overfitting and noise.

Suppose I fit the noise in the sample. Does this noise always introduce additional parameters into my model, i.e., does the model end up with unnecessary parameters that overfit the sample?

Is it possible that an additional parameter in a model comes from a spurious relationship (between variables) that appears in a sample only by chance, but not in out-of-sample data, and that this leads to overfitting? For example, in the sample, people born in December might appear to have a higher chance of getting cancer, even though the pattern does not hold out of sample.

Could feature selection help mitigate the overfitting problem?

Best Regards

 magdon 06-19-2014 07:37 AM

Re: overfitting and spurious final hypothesis

The number of parameters in your model (used to describe a hypothesis) is fixed before you see the data. A more complex model with many parameters increases your ability to fit the noise (usually more so than your ability to fit the true information in the data). This leads to overfitting.

One effect of feature selection is to reduce the number of parameters which usually helps with overfitting.

Quote:
 Originally Posted by sasin324 (Post 11676) …
