11-07-2012, 01:29 AM
yaser
Join Date: Aug 2009
Location: Pasadena, California, USA
Posts: 1,477
Re: Out of syllabus question on Regularization vs Priors

Originally Posted by hashable
Since taking this course in Summer 2012, I have tried to read up more about regularization and found out that there are different approaches. The more commonly used ones are L1 and L2 regularization (the latter covered in class under the name 'weight decay').

There appears to be some mathematical equivalence between using regularization and using prior probabilities (in the Bayesian approach). From what I understand, imposing an L2 penalty is the same as imposing a Gaussian prior on the unknown weights. Similarly, L1 corresponds to imposing a Laplacian prior.
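
For reference, the correspondence described above can be written out as a maximum a posteriori (MAP) estimate. This is only a sketch under assumptions not stated in the post: a linear model with i.i.d. Gaussian noise of variance \sigma^2, a zero-mean prior on the weights, and the symbols \tau and b introduced here for the prior scales.

```latex
% MAP estimate: maximize posterior = minimize negative log-likelihood minus log-prior
\hat{w}_{\text{MAP}}
  = \arg\max_{w} \; p(\mathcal{D} \mid w)\, p(w)
  = \arg\min_{w} \; \Big[ \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - w^{\top} x_n\big)^2 \;-\; \log p(w) \Big]

% Gaussian prior  p(w) \propto \exp\!\big(-\|w\|_2^2 / (2\tau^2)\big)  gives an L2 penalty:
\hat{w}_{\text{MAP}}
  = \arg\min_{w} \; \Big[ \sum_{n=1}^{N} \big(y_n - w^{\top} x_n\big)^2 + \lambda \|w\|_2^2 \Big],
  \qquad \lambda = \frac{\sigma^2}{\tau^2}

% Laplacian prior  p(w) \propto \exp\!\big(-\|w\|_1 / b\big)  gives an L1 penalty:
\hat{w}_{\text{MAP}}
  = \arg\min_{w} \; \Big[ \sum_{n=1}^{N} \big(y_n - w^{\top} x_n\big)^2 + \lambda \|w\|_1 \Big],
  \qquad \lambda = \frac{2\sigma^2}{b}
```

In this reading, fixing \lambda in advance amounts to fixing the ratio between the noise scale and the prior scale.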

In the concluding lectures, Professor YAM mentioned that we have to be careful in verifying that our assumptions on priors are valid when going with the Bayesian approach.

If my understanding is correct, the "danger" introduced by choosing priors is (mathematically) identical to the "danger" introduced by choosing some arbitrary regularization technique. In other words, we have to be just as careful about using the right regularization technique as we are about choosing the right prior.

Is my understanding correct? In other words, does the Bayesian approach warrant any more caution in particular, or do both approaches warrant the same amount and kind of caution?

PS: For future versions of the class, it would be great if another lecture were added to introduce various regularization techniques, since in practice it appears that L1 is being used everywhere in "big data" for its sparsity benefits.
Thank you for this important post.

The equivalence you mention would hold if there were no regularization parameter to be determined using validation techniques. The parameter \lambda can be thought of as a reality check (data check) on the assumption that the chosen form of regularization is valid. This parameter can completely overrule the assumption (by taking \lambda=0) if need be. The parameter can also be incorporated in the Bayesian analysis as a hyperparameter in a "hyperprior."
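
As a rough illustration of \lambda acting as a data check, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the synthetic data, the train/validation split, the candidate grid, and the helper names `ridge_fit`, `mse`, and `select_lambda`. The point is only that \lambda=0 is among the candidates, so the validation data can switch the regularizer off entirely.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form weight-decay (L2) solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

def select_lambda(X_train, y_train, X_val, y_val, candidates):
    """Fit once per candidate lambda and keep the one with the lowest validation error."""
    errors = {lam: mse(X_val, y_val, ridge_fit(X_train, y_train, lam))
              for lam in candidates}
    best = min(errors, key=errors.get)
    return best, errors

# Hypothetical synthetic data: a noisy linear target with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(scale=1.0, size=40)

# Hypothetical split: 25 points for training, 15 for validation.
X_train, y_train = X[:25], y[:25]
X_val, y_val = X[25:], y[25:]

# lambda = 0 is included, so validation may overrule the regularizer completely.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, val_errors = select_lambda(X_train, y_train, X_val, y_val, candidates)

print("selected lambda:", best_lam)
for lam, err in val_errors.items():
    print(f"lambda={lam:<6} validation MSE={err:.4f}")
```

If the selected value comes out as 0, the data are effectively rejecting the assumed form of regularization for this problem; a fixed prior with no such knob has no comparable escape hatch.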

Certainly more time could have been spent on regularization in the course (as well as on other deserving topics). However, I feel that the time constraint was in fact beneficial in forcing us to focus on the essentials. The main message in regularization is that it is fundamentally a heuristic, albeit with some mathematical backbone. As you mention, different regularizers are suited to different situations, and which one suits a given situation is determined in practice rather than in theory. This message in and of itself is perhaps the most essential one to convey.
Where everyone thinks alike, no one thinks very much