Quote:
Originally Posted by Haowen
I have a general question regarding weight decay regularization.
Since w_0 is a component inside the regularization term, it looks like it is possible to trade off distance from the origin for model complexity, e.g., I can have more complex models closer to the origin.
For this to make intuitive sense, so that the regularization correctly "charges" the hypothesis a cost for being more complex, it seems to me that all the features must be normalized to have zero mean. Otherwise, for example, if all the data points are in a ball far from the origin, regularization could fail in the sense that a "good" classifier would have large w_0 and all other w small, but potentially a poor (overfitting) classifier could have small w_0 and other w large and achieve the same regularization cost.
I'm not sure about this reasoning, is it correct? Is this a concern in practice? Thanks!

Hi,
If one matches the regularization criterion to a given problem, the regularizer may be more specific than general weight decay. For instance, when we discuss SVM next week, the regularizer will indeed exclude w_0. However, if your criterion is that the zero hypothesis is the simplest hypothesis in linear regression, then w_0 should be included in the regularizer.
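To make the two choices concrete, here is a minimal sketch (not from the lecture) of linear regression with weight decay where w_0 is either included in or left out of the regularizer. The function name `ridge_fit`, the synthetic data, and the value of λ are all assumptions chosen for illustration; the targets are deliberately offset from zero so that a good fit needs a large w_0.

```python
import numpy as np

def ridge_fit(X, y, lam, penalize_bias):
    """Minimize ||Z w - y||^2 + lam * w^T D w, with Z = [1, X].

    D is the identity when w_0 is penalized; otherwise its top-left
    entry is zeroed so that w_0 is left out of the regularizer.
    (Hypothetical helper, written only for this illustration.)
    """
    Z = np.column_stack([np.ones(len(X)), X])
    D = np.eye(Z.shape[1])
    if not penalize_bias:
        D[0, 0] = 0.0
    return np.linalg.solve(Z.T @ Z + lam * D, Z.T @ y)

# Synthetic data (an assumption): the targets are centered around 10,
# so the "good" hypothesis has a large w_0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 10.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

lam = 50.0  # deliberately large so the effect is easy to see
w_with = ridge_fit(X, y, lam, penalize_bias=True)
w_without = ridge_fit(X, y, lam, penalize_bias=False)

print("w_0 penalized:    ", w_with)
print("w_0 not penalized:", w_without)
```

With w_0 in the regularizer, the intercept is pulled toward zero along with the other weights, which is appropriate only if the zero hypothesis really is the "simplest" one for the problem. With w_0 left out, the intercept is free to follow the offset in the data and only the slope is shrunk.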
As emphasized in the lecture, the choice of a regularizer in a real situation is largely heuristic. If you have information in a particular situation that suggests that one form of regularizer is more plausible than the other, then that overrules the general choices that are developed for a different, idealized situation.
In all of these cases, the amount of regularization (λ), which is determined through validation (discussed in the next lecture), is key to making sure that we are getting the most benefit (or the least harm if we choose a bad regularizer) from regularization.
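As a rough sketch of what determining λ through validation can look like, the snippet below fits a weight-decay model for several candidate values of λ on a training split and keeps the one with the smallest error on a held-out split. The target function, the polynomial features, the split sizes, and the candidate grid are all assumptions made up for this illustration, not something prescribed in the lecture.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Weight decay on all weights: minimize ||Z w - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# Synthetic noisy data (an assumption for the example).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=60)

# 10th-order polynomial features; column 0 is the constant feature.
Z = np.vander(x, 11, increasing=True)

# Hold out the last 20 points for validation.
Z_train, y_train = Z[:40], y[:40]
Z_val, y_val = Z[40:], y[40:]

lambdas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
val_errors = []
for lam in lambdas:
    w = ridge_fit(Z_train, y_train, lam)
    val_errors.append(float(np.mean((Z_val @ w - y_val) ** 2)))

best_lam = lambdas[int(np.argmin(val_errors))]
print("validation errors:", val_errors)
print("selected lambda:  ", best_lam)
```

The same loop works for any form of regularizer: if the chosen form is a poor match for the problem, validation will tend to select a small λ, which limits the harm.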