Gaussian Process Regression

This page contains 4 sections
  1. Description of the GP regression function gpr.m
  2. 1-d demo using gpr.m
  3. Multi-dimensional demo using gpr.m with Automatic Relevance Determination (ARD)
  4. References

1. Description of the GP regression function gpr.m

The basic computations needed for standard Gaussian process regression (GPR) are straightforward to implement in Matlab. Several implementations are possible; here we present one closely resembling Algorithm 2.1 (p. 19), with three exceptions. Firstly, the predictive variance returned is the variance for noisy test cases, whereas Algorithm 2.1 gives the variance for the noise-free latent function; conversion between the two variances is done by simply adding (or subtracting) the noise variance. Secondly, the negative log marginal likelihood is returned, and thirdly the partial derivatives of the negative log marginal likelihood, equation (5.9), section 5.4.1, page 114, are also computed.

Algorithm 2.1 (section 2.2, page 19) gives the pseudo-code for these computations.

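The following toy sketch (with made-up data; it is not the gpr.m source) illustrates the core of this computation: the negative log marginal likelihood obtained from a Cholesky factorization of the training covariance, as in Algorithm 2.1.

  % minimal toy sketch of the negative log marginal likelihood computation
  n = 5; x = (1:n)';                                 % toy 1-d training inputs
  D = repmat(x,1,n) - repmat(x',n,1);                % pairwise input differences
  K = exp(-0.5*D.^2) + 0.01*eye(n);                  % SE covariance plus noise variance
  y = randn(n,1);                                    % toy training targets
  L = chol(K)';                                      % lower triangular factor, K = L*L'
  alpha = L'\(L\y);                                  % solves K*alpha = y
  nlml = 0.5*y'*alpha + sum(log(diag(L))) + 0.5*n*log(2*pi)
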
A simple implementation of a Gaussian process for regression is provided by the gpr.m program (which can conveniently be used together with minimize.m for optimization of the hyperparameters). The program can do one of two things:
  1. compute the negative log marginal likelihood and its partial derivatives wrt. the hyperparameters, usage

    [nlml dnlml] = gpr(logtheta, covfunc, x, y)
    
    which is used when "training" the hyperparameters, or

  2. compute the (marginal) predictive distribution of test inputs, usage

    [mu S2]  = gpr(logtheta, covfunc, x, y, xstar)
    
Selection between the two modes is indicated by the presence (or absence) of test cases, xstar. The arguments to the gpr.m function are:
inputs:
  logtheta  a (column) vector containing the logarithm of the hyperparameters
  covfunc   the covariance function
  x         a n by D matrix of training inputs
  y         a (column) vector of training set targets (of length n)
  xstar     a nn by D matrix of test inputs

outputs:
  nlml      the negative log marginal likelihood
  dnlml     a (column) vector with the partial derivatives of the negative log
            marginal likelihood wrt. the logarithm of the hyperparameters
  mu        a (column) vector (of length nn) of predictive means
  S2        a (column) vector (of length nn) of predictive variances
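
Putting the two modes together, a typical session consists of a training step followed by a prediction step. The following is only a sketch: it assumes covfunc, x, y and xstar have been set up as in the demos below, and uses an illustrative starting point for the three hyperparameters of the SE-plus-noise covariance discussed next.

  loghyper0 = [0; 0; log(0.1)];                                  % illustrative initial values
  loghyper  = minimize(loghyper0, 'gpr', -100, covfunc, x, y);   % train: minimize the nlml
  [mu S2]   = gpr(loghyper, covfunc, x, y, xstar);               % predict at the test inputs
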

The covfunc argument specifies the function to be called to evaluate covariances. It can be specified either as a string naming the function to be used or as a cell array. A number of covariance functions are provided; see covFunctions.m for a more complete description. A commonly used covariance function is the sum of a squared exponential (SE) contribution and independent noise. This can be specified as:
  covfunc = {'covSum', {'covSEiso','covNoise'}};
where covSum merely does some bookkeeping and calls the squared exponential (SE) covariance function covSEiso.m and the independent noise covariance covNoise.m. The squared exponential (SE) covariance function (also called the radial basis function (RBF) or Gaussian covariance function) is given in equation (2.31), section 2.3, page 19, for the scalar input case, and equation (5.1), section 5.1, page 106, for multivariate inputs.
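
As a concrete illustration of what this combination evaluates, the following sketch computes the covariance between two scalar inputs. The inputs xp, xq and the hyperparameter values are made up; the hyperparameter ordering [log(ell); log(sigma_f); log(sigma_n)] is the one used in the demo below.

  loghyper = [log(1.0); log(1.0); log(0.1)];    % [log(ell); log(sigma_f); log(sigma_n)]
  xp = 0.3; xq = -1.2;                          % two scalar inputs (illustrative)
  ell = exp(loghyper(1));                       % characteristic length-scale
  sf  = exp(loghyper(2));                       % signal standard deviation
  sn  = exp(loghyper(3));                       % noise standard deviation
  k = sf^2*exp(-(xp-xq)^2/(2*ell^2)) + sn^2*(xp==xq)   % (xp==xq) plays the role of the Kronecker delta
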

2. 1-d demo using gpr.m

You can either follow the example below or run a short script demo_gpr.m.

First, add the directory containing the gpml code to your path (assuming that your current directory is gpml-demo):
  addpath('../gpml')
Generate some data from a noisy Gaussian process (it is not essential to understand the details of this):
  n = 20;
  rand('state',18);
  randn('state',20);
  covfunc = {'covSum', {'covSEiso','covNoise'}};
  loghyper = [log(1.0); log(1.0); log(0.1)];
  x = 15*(rand(n,1)-0.5);
  y = chol(feval(covfunc{:}, loghyper, x))'*randn(n,1);      % Cholesky decomp.
Take a look at the data:
  plot(x, y, 'k+', 'MarkerSize', 17)

Now, we compute the predictions for this data using a Gaussian process with a covariance function being the sum of a squared exponential (SE) contribution and an independent noise contribution:
  covfunc = {'covSum', {'covSEiso','covNoise'}};
To verify how many hyperparameters this covariance function expects:
  feval(covfunc{:})
which returns '2+1', i.e. two hyperparameters for the SE part (a characteristic length-scale parameter and a signal magnitude parameter) and a single parameter for the noise (noise magnitude). We specify the hyperparameters ell=1.0 (characteristic length-scale), sigma2f=1.0 (signal variance) and sigma2n=0.01 (noise variance):
  loghyper = [log(1.0); log(sqrt(1.0)); log(sqrt(0.01))]; % this is for figure 2.5(a)
We place 201 test points evenly distributed in the interval [-7.5, 7.5] and make predictions
  xstar = linspace(-7.5,7.5,201)';
  [mu S2] = gpr(loghyper, covfunc, x, y, xstar);
The predictive variance is computed for the distribution of test targets, which are assumed to be as noisy as the training targets. If we are interested instead in the distribution of the noise-free function, we subtract the noise variance from S2
  S2 = S2 - exp(2*loghyper(3));
To plot the mean and two standard error (95% confidence), noise-free, pointwise errorbars:
  hold on
  errorbar(xstar, mu, 2*sqrt(S2), 'g');
Alternatively, you can also display the error range in gray-scale:
  clf
  f = [mu+2*sqrt(S2);flipdim(mu-2*sqrt(S2),1)];
  fill([xstar; flipdim(xstar,1)], f, [7 7 7]/8, 'EdgeColor', [7 7 7]/8);
  hold on
  plot(xstar,mu,'k-','LineWidth',2);
  plot(x, y, 'k+', 'MarkerSize', 17);
which produces a plot reproducing Figure 2.5(a), section 2.3, page 20. Note that the two standard deviation error bars are for the noise-free function. If you want the 95% confidence region for the noisy observations instead, you should not subtract the noise variance sigma2n from the predictive variances S2. To reproduce Figs 2.5(b) and (c), change the log hyperparameters when calling the prediction program to:
  loghyper = [log(0.3); log(1.08); log(5e-5)];  % this is for figure 2.5(b)
  loghyper = [log(3.0); log(1.16); log(0.89)];  % this is for figure 2.5(c)
respectively. To learn (or optimize) the hyperparameters by maximizing the marginal likelihood, using the derivatives from equation (5.9), section 5.4.1, page 114, and minimize.m, do:
  loghyper = minimize([-1; -1; -1], 'gpr', -100, covfunc, x, y);
  exp(loghyper)
Here, we initialize the log of the hyperparameters to be all at minus one. We minimize the negative log marginal likelihood wrt. the log of the hyperparameters, allowing 100 evaluations of the function (which is more than adequate), and report the hyperparameters:
  1.3659
  1.5461
  0.1443
Note that the hyperparameters learned here are close, but not identical, to the hyperparameters 1.0, 1.0 and 0.1 used when generating the data. The discrepancy is partially due to the small training sample size, and partially due to the fact that we only get information about the process in a very limited range of input values. We now predict with the learned hyperparameters.
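This reuses the prediction and plotting commands from above, now with the learned loghyper, and again subtracts the noise variance to obtain noise-free error bars:

  [mu S2] = gpr(loghyper, covfunc, x, y, xstar);
  S2 = S2 - exp(2*loghyper(3));                  % noise-free predictive variance
  clf
  f = [mu+2*sqrt(S2); flipdim(mu-2*sqrt(S2),1)];
  fill([xstar; flipdim(xstar,1)], f, [7 7 7]/8, 'EdgeColor', [7 7 7]/8);
  hold on
  plot(xstar, mu, 'k-', 'LineWidth', 2);
  plot(x, y, 'k+', 'MarkerSize', 17);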

The resulting plot is not very different from the plot above (which was based on the hyperparameters used to generate the data). Repeating the experiment with more training points distributed over a wider range leads to more accurate estimates.

Note that above we have used the same functional form for the covariance function as was used to generate the data. In practice things are seldom so simple, and one may have to try different covariance functions. Here, we explore how a Matern form with a shape parameter of 3/2 (see eq. 4.17) would do.
  covfunc2 = {'covSum',{'covMatern3iso','covNoise'}};
  loghyper2 = minimize([-1; -1; -1], 'gpr', -100, covfunc2, x, y);
Comparing the log marginal likelihoods for the two models
  -gpr(loghyper, covfunc, x, y)
  -gpr(loghyper2, covfunc2, x, y)
with values of -15.6 for SE and -18.0 for Matern3, shows that the SE covariance function is about exp(18.0-15.6)=11 times more probable than the Matern form for these data (in agreement with the data generating process). The predictions from the worse Matern-based model
  [mu S2] = gpr(loghyper2, covfunc2, x, y, xstar);
  S2 = S2 - exp(2*loghyper2(3));
look like this:

Notice how the uncertainty grows more rapidly in the vicinity of data points, reflecting the property that for the Matern class with a shape parameter of 3/2 the stochastic process is not twice mean-square differentiable (and is thus much less smooth than a process with the SE covariance function).
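
For reference, the Matern covariance with shape parameter 3/2 (cf. eq. 4.17) has the form sketched below. The snippet is only an illustrative evaluation with made-up values, not the covMatern3iso source; ell is the length-scale, sf the signal standard deviation and r the distance between two inputs.

  ell = 1.0; sf = 1.0; r = 0.5;                          % illustrative values
  k = sf^2 * (1 + sqrt(3)*r/ell) * exp(-sqrt(3)*r/ell)   % Matern covariance, shape parameter 3/2
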

3. GPR demo using gpr.m with multi-dimensional input space and ARD

This demonstration illustrates the use of Gaussian process regression for a multi-dimensional input, together with automatic relevance determination (ARD). It is based on Williams and Rasmussen (1996).

You can either follow the example below or run the script demo_gparm.m.

We initially consider a 2-d nonlinear robot arm mapping problem as defined in MacKay (1992):
  f(x_1,x_2) = r_1 cos (x_1) + r_2 cos(x_1 + x_2)
where x_1 is chosen randomly in [-1.932, -0.453], x_2 is chosen randomly in [0.534, 3.142], and r_1 = 2.0, r_2 = 1.3. The target values are obtained by adding Gaussian noise of variance 0.0025 to f(x_1,x_2). Following Neal (1996) we add four further inputs, two of which (x_3 and x_4) are copies of x_1 and x_2 corrupted by additive Gaussian noise of standard deviation 0.02, and two of which (x_5 and x_6) are N(0,1) Gaussian noise variables. Our dataset has n=200 training points and nstar=200 test points.
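
The demo below loads the supplied data file rather than regenerating the data, but the generative process just described can be sketched as follows. The uniform sampling of x_1 and x_2 and all variable names here are assumptions made for illustration only.

  n  = 200; r1 = 2.0; r2 = 1.3;
  x1 = -1.932 + (-0.453 + 1.932)*rand(n,1);      % x_1 uniform in [-1.932, -0.453] (assumed)
  x2 =  0.534 + (3.142 - 0.534)*rand(n,1);       % x_2 uniform in [0.534, 3.142] (assumed)
  f  = r1*cos(x1) + r2*cos(x1 + x2);             % noise-free robot arm mapping
  y  = f + sqrt(0.0025)*randn(n,1);              % targets: add Gaussian noise of variance 0.0025
  x3 = x1 + 0.02*randn(n,1);  x4 = x2 + 0.02*randn(n,1);   % noisy copies of x_1 and x_2
  x5 = randn(n,1);  x6 = randn(n,1);             % two pure N(0,1) noise inputs
  X  = [x1 x2 x3 x4 x5 x6];                      % n by D=6 input matrix
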

The training and test data are contained in the file data_6darm.mat. The raw training data is in the input matrix X (n=200 by D=6) and the target vector y (200 by 1). Assuming that the current directory is gpml-demo, we need to add the path of the code and load the data:
  addpath('../gpml');    % add path for the code
  load data_6darm
We first check the scaling of these variables using
  mean(X), std(X), mean(y), std(y)
In particular we might be concerned if the standard deviation were very different for different input dimensions; however, that is not the case here, so we do not rescale X. On the other hand, y has a non-zero mean, which is not appropriate if we assume a zero-mean GP. We could add a constant onto the SE covariance function, corresponding to a prior on constant offsets, but here we instead centre y by setting:
  offset = mean(y);
  y = y - offset;         % centre targets around 0.
We use Gaussian process regression with a squared exponential covariance function, and allow a separate lengthscale for each input dimension, as in eqs. 5.1 and 5.2. These lengthscales (and the other hyperparameters sigma_f and sigma_n) are adapted by maximizing the marginal likelihood (eq. 5.8) w.r.t. the hyperparameters. The covariance function is specified by
  covfunc = {'covSum', {'covSEard','covNoise'}};
We now wish to train the GP by optimizing the hyperparameters. The hyperparameters are stored as logtheta = [log(ell_1), log(ell_2), ... log(ell_6), log(sigma_f), log(sigma_n)] (as D = 6), and are initialized to logtheta0 = [0; 0; 0; 0; 0; 0; 0; log(sqrt(0.1))]. The last value means that the initial noise variance sigma^2_n is set to 0.1. The marginal likelihood is optimized using the minimize function.
  logtheta0 = [0; 0; 0; 0; 0; 0; 0; log(sqrt(0.1))];
  [logtheta, fvals, iter] = minimize(logtheta0, 'gpr', -100, covfunc, X, y);
  exp(logtheta) 
By doing exp(logtheta) we find that:
  ell_1      1.804377
  ell_2      1.963956
  ell_3      8.884361
  ell_4     34.417657
  ell_5   1081.610451
  ell_6    375.445823
  sigma_f    2.379139
  sigma_n    0.050835
Notice that the lengthscales ell_1 and ell_2 are short, indicating that inputs x_1 and x_2 are relevant to the task. The noisy inputs x_3 and x_4 have longer lengthscales, indicating that they are less relevant, and the pure noise inputs x_5 and x_6 have very long lengthscales, so they are effectively irrelevant to the problem, as indeed we would hope. The process standard deviation sigma_f is similar in magnitude to the standard deviation of the data, std(y) = 1.2186. The learned noise standard deviation sigma_n is almost exactly equal to the generative noise level sqrt(0.0025)=0.05.
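
Since under ARD a short learned lengthscale signals a relevant input, a quick informal check is to rank the inputs by their learned lengthscales (this simply reuses logtheta from above; its first six entries are the log lengthscales):

  ell = exp(logtheta(1:6));      % learned lengthscales of the six inputs
  [s, order] = sort(ell)         % shortest (most relevant) first; order gives the input indices
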

We now make predictions on the test points and assess the accuracy of these predictions. The test data is in Xstar and ystar. Recall that the training targets were centred; thus we must adjust the predictions by adding back the offset:
  [fstar S2] = gpr(logtheta, covfunc, X, y, Xstar);
  fstar = fstar + offset;  % add back offset to get true prediction
We compute the residuals, the mean squared error (mse) and the predictive log likelihood (pll). Note that the predictive variance for ystar includes the noise variance, as explained on p. 18.
  res = fstar-ystar;  % residuals
  mse = mean(res.^2)
  pll = -0.5*mean(log(2*pi*S2)+res.^2./S2)
The mean squared error is 0.002489. Note that this is almost equal to 0.0025, the value that would be obtained by a perfect predictor, due to the added noise of variance 0.0025.

We can also plot the residuals and the predictive standard deviation for each test case. Note that the order along the x-axis is arbitrary.
  subplot(211), plot(res,'.'), ylabel('residuals'), xlabel('test case')
  subplot(212), plot(sqrt(S2),'.'), ylabel('predictive std deviation'), xlabel('test case')

4. References

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian Processes for Regression. In Advances in Neural Information Processing Systems 8. MIT Press.
MacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448-472.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer.
