Function factorization using warped Gaussian processes (2009)

Authors

Abstract

We introduce a new approach to non-linear regression called function factorization, which is suitable for problems where an output variable can reasonably be modeled by a number of multiplicative interaction terms between non-linear functions of the inputs. The idea is to approximate a complicated function on a high-dimensional space by the sum of products of simpler functions on lower-dimensional subspaces. Function factorization can be seen as a generalization of matrix and tensor factorization methods, in which the data are approximated by the sum of outer products of vectors. We present a non-parametric Bayesian approach to function factorization where the priors over the factorizing functions are warped Gaussian processes, and we do inference using Hamiltonian Markov chain Monte Carlo. We demonstrate the superior predictive performance of the method on a food science data set compared to Gaussian process regression and tensor factorization using PARAFAC and GEMANOVA models.
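
The "sum of products of simpler functions" described in the abstract can be written out as a rough sketch. The notation here (K terms, I factors, input subsets x_i, factor functions f_{k,i}) is assumed for illustration and is not taken from the paper:

```latex
% Function factorization: K terms, each a product of I factor functions,
% where \mathbf{x}_i denotes the subset of inputs seen by the i-th factor.
y(\mathbf{x}) \approx \sum_{k=1}^{K} \prod_{i=1}^{I} f_{k,i}(\mathbf{x}_i)

% Matrix factorization as a special case (I = 2, inputs are the row and
% column indices m and n): a sum of K outer products of vectors.
X_{mn} \approx \sum_{k=1}^{K} a_{mk} \, b_{nk}
```

In the second line the factor "functions" are just vectors indexed by row or column, which is the sense in which function factorization generalizes matrix and tensor factorization.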

Discussion

Iain Murray, 2009/06/24 18:48

This is an interesting paper that suggests one way of putting a type of factorized structure into a flexible regression model.

I don't think Equation (26) is quite right: it should use a Kronecker delta rather than a Dirac delta. Also, multiplying the covariance by Latex seems more appropriate and gives better results (at least with a standard GP on the meat dataset).
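
For reference, the distinction can be illustrated with the usual form of a GP covariance that includes independent observation noise. Equation (26) itself is not reproduced on this page, so the symbols below (k, \sigma_n^2) are assumptions rather than the paper's notation:

```latex
% Covariance between observations i and j: kernel term plus noise that
% only contributes when the two observations are the same data point.
\operatorname{cov}(y_i, y_j) = k(\mathbf{x}_i, \mathbf{x}_j) + \sigma_n^2 \, \delta_{ij},
\qquad
\delta_{ij} =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{otherwise.}
\end{cases}
```

The Kronecker delta \delta_{ij} acts on the indices and adds a finite variance to the diagonal of the covariance matrix. A Dirac delta \delta(\mathbf{x}_i - \mathbf{x}_j) is not finite at zero, so it cannot play this role.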

As pointed out in the discussion of Figure 2, the proposed model will obviously do better than a standard Gaussian process (GP) regression model if its modelling assumptions are more accurate. However, I don't see evidence that this is the case for the food science dataset explored in the paper.

If I guess the hyper-parameters of a GP by setting the lengthscales to the ranges of the inputs and the signal variance to the empirical variance of the targets, then I get a better result than that reported (an RMSE of around 1.5). Optimizing the likelihood using conjugate gradients to convergence didn't tend to improve things much and sometimes made things worse. The RMSE of around 1.8 reported in the paper is possibly consistent with being really unlucky with the choice of validation folds, or with using the covariance in (26). Using either early stopping or penalized maximum likelihood I get an RMSE of around 1.36. My first run using MCMC to sum over the hyper-parameters of a standard GP with a dumb uninformative prior gave an RMSE of 1.45, coincidentally the same as that reported in the paper for the proposed method. I did another run using a slightly more sensible prior and obtained 1.33. Disclaimer: I haven't monitored convergence or performed any other responsible checks.
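
The hyper-parameter guess described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the numbers in this comment: the squared-exponential kernel, the fixed `noise_var`, the function names, and the synthetic data are all assumptions, and the meat dataset is not used.

```python
import numpy as np


def se_ard_kernel(A, B, lengthscales, signal_var):
    """Squared-exponential kernel with one lengthscale per input dimension."""
    A = A / lengthscales
    B = B / lengthscales
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def heuristic_gp_predict(X_train, y_train, X_test, noise_var):
    """GP posterior mean with hyper-parameters guessed from the data."""
    # Lengthscales: the range of each input (assumes no input is constant).
    lengthscales = X_train.max(axis=0) - X_train.min(axis=0)
    # Signal variance: the empirical variance of the targets.
    signal_var = np.var(y_train)
    y_mean = y_train.mean()

    K = se_ard_kernel(X_train, X_train, lengthscales, signal_var)
    # Noise only on the diagonal (assumed value; not specified in the comment).
    K += noise_var * np.eye(len(X_train))
    K_star = se_ard_kernel(X_test, X_train, lengthscales, signal_var)

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    return y_mean + K_star @ alpha


def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# Synthetic stand-in data with a multiplicative interaction between inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 3))
y = np.sin(3.0 * X[:, 0]) * (1.0 + X[:, 1]) + 0.1 * rng.standard_normal(60)

pred = heuristic_gp_predict(X[:40], y[:40], X[40:], noise_var=0.01)
print("held-out RMSE:", rmse(pred, y[40:]))
```

Nothing is optimized here, so the sketch only gives the kind of baseline that any tuned or fully Bayesian fit would be compared against.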

(Detail: The intrinsic noise level of the individual meat colour measurements seems higher than the reported RMSEs. Therefore I assumed that the RMSEs reported were for predicting the average colour level over the 5 repetitions at each condition. One could model and use the variability of the repetitions too, but as this wasn't mentioned, I didn't. Apologies if it turns out that I have not done the right comparison.)

Although I haven't run my comparison very carefully, it seems that good performance can be obtained with a standard GP combined with full Bayesian modelling or sensible regularization. It would be interesting to see if there really is evidence of useful factorial structure in the data, or to find other cases where function factorization might be useful.

paper/2009/323.txt · Last modified: 2009/05/24 18:43 (external edit)
 