10-05-2013, 07:02 AM
magdon
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 597
Re: Feature dimensionality, regularization and generalization

This is a very important point you raise. Feature selection and regularization play different roles.

Feature selection is used to construct the 'right' input that is useful for predicting the output. With respect to the right features, the target function will be simple (for example, nearly linear). Feature selection should always be used if possible, no matter how many data points or how many dimensions you have. Again, the role of feature selection is to get the target function into a simpler form; that is, for the simple hypothesis set you plan to use, the deterministic noise is reduced. Some might use feature selection as a way of reducing dimension to control the variance, but that is not its primary role. You can always do systematic dimension reduction after feature selection if you need better generalization.

Once you have determined your features and selected your hypothesis set, and only then looked at your data, there will likely still be deterministic noise and almost always stochastic noise. The role of regularization is to help you deal with that noise.

If you have bad features, there will typically be lots of deterministic noise and you will need lots of regularization to combat it. If you have good features, then you may need only a little regularization, primarily to combat the stochastic noise.
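To make the contrast concrete, here is a minimal sketch you could run to compare the two situations. Everything in it is my own assumption for illustration, not something from the question: the target sin(pi x), the sample size, the noise level, the values of lam, and the helper names (ridge_fit, good_features, bad_features, avg_test_error). The 'good' features make the target exactly linear, so only stochastic noise is left; the 'bad' features leave a line trying to fit a sinusoid, so deterministic noise remains. I deliberately do not quote numbers; which lam comes out best for each feature set is something to check by running it.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Weight-decay (ridge) solution w = (Z^T Z + lam*I)^{-1} Z^T y.

    For simplicity the bias weight is penalized too (an assumption of this sketch).
    """
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def good_features(x):
    # Hypothetical 'right' features: the target is exactly linear in these.
    return np.column_stack([np.ones_like(x), np.sin(np.pi * x)])

def bad_features(x):
    # Raw input only: a line cannot represent sin(pi x),
    # so deterministic noise remains for this hypothesis set.
    return np.column_stack([np.ones_like(x), x])

def avg_test_error(features, lam, n_train=5, sigma=0.1, trials=200, seed=0):
    """Average squared error against the noiseless target over many small data sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 201)
    f_test = np.sin(np.pi * x_test)
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, n_train)
        y = np.sin(np.pi * x) + sigma * rng.normal(size=n_train)  # stochastic noise
        w = ridge_fit(features(x), y, lam)
        errs.append(np.mean((features(x_test) @ w - f_test) ** 2))
    return float(np.mean(errs))

# Sweep the amount of regularization for each feature set.
for name, feats in [("good", good_features), ("bad", bad_features)]:
    for lam in (0.001, 0.1, 1.0, 10.0):
        print(f"{name:4s} features, lam={lam:<6}: error {avg_test_error(feats, lam):.4f}")
```

The point of the sweep is only to let you see, for your own choice of target and noise level, how much regularization each feature set seems to want. It is a toy setup, not a general result.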

Summary: features and regularization address different things. Good features reduce deterministic noise. Regularization combats noise, both deterministic and stochastic. Don't underestimate the role of either.

But as you see, to some extent, regularization can combat the extra deterministic noise when you have bad features. However, if you have lots of noise, that places a fundamental limit on learning. And using a larger hypothesis set as a way to combat deterministic noise is usually not a good idea, because you suffer the disproportionate indirect impact of any noise through the variance term in the bias-variance decomposition.
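For reference, the decomposition being referred to can be written as follows for squared error with a noisy target. The notation is my assumption, since the post does not fix one: f is the target, g^{(D)} is the hypothesis learned from data set D, \bar{g} is the average hypothesis over data sets, and \sigma^2 is the variance of the stochastic noise.

```latex
\mathbb{E}_{D}\!\left[E_{\text{out}}\!\left(g^{(D)}\right)\right]
  = \sigma^{2} + \text{bias} + \text{var},
\qquad
\text{bias} = \mathbb{E}_{x}\!\left[\left(\bar{g}(x) - f(x)\right)^{2}\right],
\qquad
\text{var} = \mathbb{E}_{x}\,\mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^{2}\right]
```

Roughly speaking, the deterministic noise shows up in the bias term, and a larger hypothesis set can lower that term. The catch is the var term: with a larger hypothesis set the learned hypothesis has more freedom to fit whatever noise is in the data, so var tends to grow.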

Originally Posted by hsolo
I had a couple of conceptual questions:

The VC result and the bias-variance result imply that if the number of features is very large, then unless the number of training samples is high there is the specter of overfitting. So feature selection has to be done systematically and carefully.

However, it seems that if one uses regularization in some form, it can serve as a generic antidote to overfitting, and consequently one can ignore the feature dimensionality (assuming for a moment that the computing overhead of a large feature set can be ignored). I got that impression from online notes from a couple of courses, and I also saw in a recent Google paper that they used logistic regression with regularization on a billion-dimension (highly sparse) feature set.

Is this a correct notion from a statistical standpoint: that if one uses regularization and is willing to pay the computing costs, one can be lax about feature selection?

Is there a theoretical result about the above notion (feature dimensionality and regularization effect on generalization error)?
Have faith in probability