Approximation Methods for Gaussian Process Regression

The theory for approximation methods for GPR when faced with large datasets is given in section 8.3. We provide implementations of three methods: subset of datapoints (SD), subset of regressors (SR) and projected process (PP). The SD method is implemented by simply calling the function gpr.m with a subset of the training data. The SR and PP methods are implemented by the function gprSRPP.m. In the example below the subsets are selected randomly, although they can also be selected by greedy algorithms, as discussed in section 8.3.

This page contains 2 topics
  1. Description of the function gprSRPP.m
  2. A demonstration using the gprSRPP.m function

1. Description of the function gprSRPP.m

The function gprSRPP.m is based on the function gpr.m. The specification of the covariance function is the same as in gpr.m, except that the covariance function is assumed to be a covSum, with covNoise as the last entry of the sum.

  [mu, S2SR, S2PP] = gprSRPP(logtheta, covfunc, x, INDEX, y, xstar)
inputs
  logtheta   a (column) vector containing the logarithm of the hyperparameters of the covariance function
  covfunc    the covariance function, which is assumed to be a covSum, and the last entry of the sum is covNoise
  x          a n by D matrix of training inputs
  INDEX      a row vector of length m (with m <= n) used to specify which inputs are used in the active set
  y          a (column) vector of training set targets (of length n)
  xstar      a nstar by D matrix of test inputs
outputs
  mu         (column) vector (of length nstar) of predictive means
  S2SR       (column) vector (of length nstar) of predictive variances from the SR algorithm (including the noise variance)
  S2PP       (column) vector (of length nstar) of predictive variances from the PP algorithm (including the noise variance)

The SR method is implemented using equation 8.14 for mu and equation 8.15 for S2SR (but also adding on the noise variance). The PP method has the same predictive mean, but a different predictive variance S2PP, given by equation 8.27 (again adding on the noise variance).
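To make these equations concrete, here is a naive sketch of how the SR and PP predictions for a single test input could be computed directly. The variable names Kmm, Kmn, km, kss and sigma2n below are purely illustrative; they are not the names used inside gprSRPP.m, which uses a numerically more careful implementation:
  % Kmm:     m by m (noise-free) covariance matrix of the active set
  % Kmn:     m by n covariances between the active set and all training inputs
  % km:      m by 1 covariances between the active set and the test input
  % kss:     prior (noise-free) variance at the test input
  % sigma2n: noise variance
  A    = sigma2n*Kmm + Kmn*Kmn';           % matrix appearing in eqs. 8.14/8.15
  muSR = km'*(A\(Kmn*y));                  % eq. 8.14: predictive mean (SR and PP)
  S2SR = sigma2n*(km'*(A\km)) + sigma2n;   % eq. 8.15, plus the noise variance
  S2PP = kss - km'*(Kmm\km) + sigma2n*(km'*(A\km)) + sigma2n;  % eq. 8.27, plus noise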

2. A demonstration using the gprSRPP.m function

We use the Boston housing data of D. Harrison and D. L. Rubinfeld, "Hedonic housing prices and the demand for clean air", Journal of Environmental Economics and Management 5, 81-102 (1978). This dataset has 13 input variables and one output target. A split of 455 training points and 51 test points is used. We use Gaussian process regression with a squared exponential covariance function, and allow a separate lengthscale for each input dimension, as in eqs. 5.1 and 5.2.

Run the script demo_gprsparse.m to produce the results shown below.

The training and test data is contained in the file data_boston.mat. The data has been scaled so that each variable has approximately zero mean and unit variance. Assuming that the current directory is gpml-demo we need to add the path of the code, and load the data:
  addpath('../gpml');    % add path for the code
  load data_boston
As usual the training inputs are stored in X, the training targets in y, the test inputs in Xstar and the test targets in ystar.
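The quantities n (number of training cases), D (input dimension) and nstar (number of test cases) used below are not set explicitly in the snippets shown here; they can be obtained from the data, for example:
  [n, D] = size(X);         % number of training cases and input dimension
  nstar  = size(Xstar,1);   % number of test cases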

A random subset of the training data points is selected using the randperm function. This subset has size m = 200.
  m = 200;
  perm = randperm(n);
  INDEX = perm(1:m);
  Xm = X(INDEX,:);
  ym = y(INDEX);
We use a covariance function made up of the sum of a squared exponential (SE) covariance term with ARD, and independent noise. Thus, the covariance function is specified as:
  covfunc = {'covSum', {'covSEard','covNoise'}};
The hyperparameters are initialized to logtheta0=[0; 0; ... 0; 0; -1.15]
  logtheta0 = zeros(D+2,1); 
  logtheta0(D+2) = -1.15; % starting value for log(noise std dev)
Note that the noise standard deviation is set to exp(-1.15) corresponding to a noise variance of 0.1.
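As a quick check, the implied noise variance can be computed directly from the initial hyperparameter vector:
  exp(2*logtheta0(D+2))     % noise variance = (noise std dev)^2 = exp(-2.3), approx 0.1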

The hyperparameters are trained by maximizing the approximate marginal likelihood of the SD method given in eq. 8.31. This simply computes the marginal likelihood of the subset of size m.
  [logtheta, fvals, iter] = minimize(logtheta0, 'gpr', -100, covfunc, Xm, ym);  % max. 100 function evaluations
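For reference, the quantity being minimized is simply the usual negative log marginal likelihood of a GP, evaluated on the subset (Xm, ym) only. A sketch of what this amounts to, with Kmm_noisy denoting the m by m covariance matrix of the subset including the noise term (i.e. what covfunc evaluates to on the active set), is:
  L     = chol(Kmm_noisy)';              % lower Cholesky factor
  alpha = L'\(L\ym);                     % Kmm_noisy \ ym
  nlml  = 0.5*ym'*alpha + sum(log(diag(L))) + 0.5*m*log(2*pi);  % negative of eq. 8.31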
Predictions can now be made using the 3 methods. SD is implemented simply by calling gpr on the reduced training set,
  [fstarSD S2SD] = gpr(logtheta, covfunc, Xm, ym, Xstar);
The outputs are the mean predictions fstarSD and the predictive variances S2SD (these include the noise variance, since covNoise is part of the covariance function). The SR and PP methods are called using the function gprSRPP; note that the INDEX vector is passed:
  [fstarSRPP S2SR S2PP] = gprSRPP(logtheta, covfunc, X, INDEX, y, Xstar);
gprSRPP returns the predictive mean fstarSRPP (which is identical for both methods), and the predictive variances S2SR and S2PP (both including the noise variance). For comparison we also make predictions using gpr.m on the full training dataset, and a dumb predictor that just predicts the mean and variance of the training data.
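For concreteness, these two baselines could be computed along the following lines (the variable names here are illustrative; see demo_gprsparse.m for the exact code used to produce the results):
  [fstarFull S2Full] = gpr(logtheta, covfunc, X, y, Xstar);  % full GP predictions
  fstarDumb = mean(y)*ones(nstar,1);    % dumb predictor: training target mean
  S2Dumb    = var(y)*ones(nstar,1);     % ... and training target variance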

We compute the residuals, the mean squared error (mse) and the predictive log likelihood (pll) for all methods. Note that the predictive variance for ystar includes the noise variance, as explained on p. 18. Thus for example we have
  resSR = fstarSRPP-ystar;
  mseSR = sum(resSR.^2)/nstar;
  pllSR = -0.5*mean(log(2*pi*S2SR)+resSR.^2./S2SR);
The test results are:
  mse_full 0.118924	 pll_full -0.414019
  mse_SD   0.179692	 pll_SD   -0.54915
  mse_SR   0.138302	 pll_SR   -0.658645
  mse_PP   0.138302	 pll_PP   -0.395815
  mse_dumb 1.11464	 pll_dumb -1.4769
where mse denotes mean squared error and pll denotes predictive log likelihood. A higher (less negative) pll is more desirable. Note that the mse for the SR and PP methods is identical as expected. The SR and PP methods outperform SD on mse, and are close to the full mse. On pll, the PP method does slightly better than the full predictor, followed by the SD and SR methods.

You can experiment further by varying the value of the subset size m in the script demo_gprsparse.m.
