A majorization-minimization algorithm for (multiple) hyperparameter learning (2009)

Authors

Abstract

We present a general Bayesian framework for hyperparameter tuning in $L_2$-regularized supervised learning models. Paradoxically, our algorithm works by first analytically integrating out the hyperparameters from the model. We solve the resulting non-convex optimization problem efficiently using a majorization-minimization (MM) algorithm, in which the non-convex problem is reduced to a series of convex $L_2$-regularized parameter estimation tasks. The principal appeal of our method is its simplicity: the updates for choosing the $L_2$-regularized subproblems in each step are trivial to implement (or even perform by hand), and each subproblem can be efficiently solved by adapting existing solvers. Empirical results on a variety of supervised learning models show that our algorithm is competitive with both grid-search and gradient-based algorithms, but is more efficient and far easier to implement.

Discussion

Yoann Fabre, 2009/07/02 10:06

Hi,

You make multiple references to SVM-style (L2-regularized) objectives in your paper, but you test your algorithm only on smooth probabilistic objectives. So my question is basically: what's the catch, if any? :-) From a practical point of view, it seems to me that your wrapper could still be applied, but perhaps the majorization-minimization scheme is not so effective for the non-smooth SVM problem? (You also lose the direct Bayesian interpretation via the Gamma prior, since the loss function is not (directly) a log-likelihood.) Have you tried it?

Chuan-Sheng Foo, 2009/07/06 07:26

Yes, you are right: the catch is that the algorithm no longer has a Bayesian interpretation if it is applied to max-margin (SVM) style models. However, the algorithm can also be viewed as a way to optimize a log-L2 style regularizer (e.g. log(1+0.5*||w||^2)), and from this point of view the algorithm is justified. We are not sure whether using such regularizers will in general yield benefits over standard L2-norm regularization; this is something that could be explored further.
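
To make the "log-L2 regularizer" view concrete, here is a minimal sketch of a majorization-minimization loop for that kind of penalty. It is an illustration under stated assumptions, not the paper's implementation: the squared-error loss, the scale factor `alpha`, and the function names `objective` and `mm_log_l2` are all hypothetical choices made for this example. Because log(1 + 0.5*s) is concave in s = ||w||^2, its tangent at the current iterate is an upper bound, so each MM step reduces to an ordinary ridge regression with weight alpha / (1 + 0.5*||w_t||^2).

```python
import numpy as np


def objective(w, X, y, alpha):
    """Squared-error loss plus the log-L2 penalty alpha * log(1 + 0.5 * ||w||^2)."""
    residual = X @ w - y
    return 0.5 * residual @ residual + alpha * np.log1p(0.5 * w @ w)


def mm_log_l2(X, y, alpha=1.0, n_iter=50, tol=1e-10):
    """MM loop: each step solves a ridge problem that majorizes the objective.

    Assumption for this sketch: squared-error loss, so the ridge
    subproblem has a closed-form solution.
    """
    n_features = X.shape[1]
    w = np.zeros(n_features)
    gram = X.T @ X
    Xty = X.T @ y
    history = [objective(w, X, y, alpha)]
    for _ in range(n_iter):
        # Tangent of the concave penalty at the current ||w||^2 gives the
        # ridge weight for the next subproblem: (lam / 2) * ||w||^2.
        lam = alpha / (1.0 + 0.5 * w @ w)
        # Minimize 0.5 * ||Xw - y||^2 + (lam / 2) * ||w||^2 in closed form.
        w = np.linalg.solve(gram + lam * np.eye(n_features), Xty)
        history.append(objective(w, X, y, alpha))
        if history[-2] - history[-1] < tol:
            break
    return w, history


if __name__ == "__main__":
    # Synthetic data, made up purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
    w, history = mm_log_l2(X, y, alpha=2.0)
    print("weights:", w)
    print("objective per iteration:", history)
```

The recorded objective values should never increase from one iteration to the next, which is the usual MM guarantee. For a non-smooth loss such as the hinge loss, the closed-form solve would have to be replaced by a call to an existing L2-regularized solver with the updated weight.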

 
paper/2009/20.txt · Last modified: 2009/05/24 18:42 (external edit)
 