Gaussian Process Regression code to use on EPOCH PIC code outputs.
The code assumes a zero (null) mean function; however, it supports several features:
- Custom Kernel - Can define a custom kernel from a combination of defined models.
- Input warping - Using Kumaraswamy distribution.
- Output Warping - Using a set of defined transformations. Can also handle composite output warping (several layers of warping).
- Heteroscedastic model - Can train a model with varying noise across the input space.
- Input/Output/Hyper-parameter optimisation - Performed by minimising the negative log marginal likelihood.
To Add - Kernel gradients; gradient of the log marginal likelihood with respect to input, output, and hyper-parameters.
Gaussian process models are stored in the Model directory. Example scripts of how to run the code are in the test_example directory.
To initialise the Gaussian process, access the class from either Model/GP_model.py (constant noise) or Model/GP_het_scedat.py (varying noise):

```python
gp = GP_class(X, y, kern, kern_ops, iw, ow_model)
gp = GP_hetscedat_class(X, y, y_var, kern, kern_ops, kern_var, kern_var_ops, iw, ow_model, ow_noise)
```

- `X` - Inputs (ndarray)
- `y` - Outputs (1d array)
- `y_var` - Variance of outputs (1d array)
- `kern` - List of kernels to use (see below)
- `kern_ops` - List of kernel operations
- `kern_var` - List of kernels to use (for variance model)
- `kern_var_ops` - List of kernel operations (for variance model)
- `iw` - Flag to turn input warping on or off
- `ow_model` - List of output warping to perform (see below)
- `ow_noise` - List of output warping to perform (for variance model)
The code can handle a variety of versions of the Radial Basis Function (RBF), Exponential (EXP), Matern 3/2, 5/2 (MATERN_3_2/MATERN_5_2), and Rational Quadratic (RAT_QUAD) kernels.
The main types of kernels that can be built are Isotropic, Separable/Summation, Non-separable, and Automatic Relevance Detection (ARD) kernels. A combination of these can also be initialised using the kern_ops variable.
Isotropic kernels depend only on the Euclidean distance between two input points: a single length-scale parameter is shared across all input dimensions.
For an isotropic RBF kernel:
- `kern = ['RBF']` - Isotropic across all input dimensions.
- `kern = ['RBF_ISO_[1,2,3]']` - Isotropic across input dimensions 1, 2, and 3 only.
Separable kernels can be written as the product of one-dimensional kernel functions acting on individual input dimensions:
Example for 3D inputs:
kern = ['RBF_[1]', 'MATERN_3_2_[2]', 'MATERN_5_2_[3]']
kern_ops = "k_1 * k_2 * k_3"

A kernel using multiple dimensions for a 2D model:
kern = ['RBF_[1]', 'RBF_[2]', 'RBF_[1,2]']
kern_ops = "k_1 + k_2 + k_3"

where the last kernel is an RBF kernel acting jointly on input dimensions 1 and 2.
Summation kernels combine multiple kernels:
Use kern_ops = "k_1 + k_2 + ... + k_n" to define summation operations.
ARD kernels introduce separate length scales for each input dimension:
- `kern = ['RBF_ARD']` - ARD across all input dimensions.
- `kern = ['RBF_ARD_[1,2,3]']` - ARD across input dimensions 1, 2, and 3 only.
A non-separable kernel is based on the Mahalanobis distance:

$$r^2(\mathbf{x}, \mathbf{x}') = (\mathbf{x} - \mathbf{x}')^{T} \Lambda^{-1} (\mathbf{x} - \mathbf{x}')$$

where $\Lambda^{-1}$ is a full positive-definite matrix, so correlations between input dimensions can be captured.

For the RBF kernel:

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^{T} \Lambda^{-1} (\mathbf{x} - \mathbf{x}')\right)$$

Defined in code as:
- `kern = ['RBF_NS']` - Non-separable across all input dimensions.
- `kern = ['RBF_NS_[1,2,3]']` - Non-separable across input dimensions 1, 2, and 3 only.
During optimisation, we assume $\Lambda^{-1}$ is symmetric positive definite, which can be enforced by parameterising it through its Cholesky factor.
Combinations of kernels are allowed using the kern_ops variable, with the following rules:
- All input dimensions must be used at least once.
- Non-separable and ARD flags cannot be used for 1D inputs.
- Only the operations `*` and `+` are allowed.
Thus more complicated kernels can be defined such as:
kern_ops = "k_1 + (k_2 * k_3) + k_1 * k_4"
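As a minimal numpy sketch of the kernel types above (the function names here are illustrative, not the repository's API), an isotropic kernel shares one length scale, an ARD kernel has one per dimension, and a separable kernel is a product of per-dimension kernels:

```python
import numpy as np

def rbf_iso(X1, X2, var, ell):
    """Isotropic RBF: a single length scale shared by all input dimensions."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return var * np.exp(-0.5 * d2 / ell ** 2)

def rbf_ard(X1, X2, var, ells):
    """ARD RBF: one length scale per input dimension."""
    diff = (X1[:, None, :] - X2[None, :, :]) / ells
    return var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

# Separable product kernel on a 2D input, mirroring
# kern = ['RBF_[1]', 'RBF_[2]'] with kern_ops = "k_1 * k_2".
X = np.random.default_rng(0).normal(size=(5, 2))
k1 = rbf_iso(X[:, [0]], X[:, [0]], 1.0, 0.7)
k2 = rbf_iso(X[:, [1]], X[:, [1]], 1.0, 1.3)
K = k1 * k2
```

Note that an ARD kernel with all length scales equal reduces to the isotropic case.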
Gaussian processes assume a stationary covariance function, but real-world functions often vary across the input space. Input warping adapts to such variations. This implementation uses the CDF of the Kumaraswamy distribution:

$$\varphi(x) = 1 - (1 - x^{a})^{b}$$

where $x$ is an input scaled to $[0, 1]$ and $a, b > 0$ are warping parameters learned alongside the kernel hyper-parameters.
To enable input warping in the code, set `iw=True`.
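A short sketch of the Kumaraswamy CDF warp (the function name is illustrative): it is monotonic on $[0, 1]$ and reduces to the identity when $a = b = 1$.

```python
import numpy as np

def kumaraswamy_warp(x, a, b):
    """Kumaraswamy CDF warp: maps [0, 1] -> [0, 1] monotonically; a, b > 0."""
    return 1.0 - (1.0 - x ** a) ** b

# Inputs are assumed to be scaled to [0, 1] before warping.
x = np.linspace(0.0, 1.0, 101)
w = kumaraswamy_warp(x, a=2.0, b=0.5)
```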
Standard Gaussian processes assume Gaussian-distributed outputs, which may not always hold. Output warping transforms outputs to better fit this assumption. Common transformations include:
- **Affine Transformation**: Scales and shifts the output:

  $$\phi(y) = a y + b$$

  Inverse:

  $$\phi^{-1}(\tilde{y}) = \frac{\tilde{y} - b}{a}$$

  Jacobian:

  $$\frac{d\phi}{dy} = a$$

- **Logarithmic Transformation**: Useful for positive outputs:

  $$\phi(y) = \ln(y)$$

  Inverse:

  $$\phi^{-1}(\tilde{y}) = \exp(\tilde{y})$$

  Jacobian:

  $$\frac{d\phi}{dy} = \frac{1}{y}$$

- **Box-Cox Transformation**: Handles skewed data:

  $$\phi(y) = \frac{\text{sgn}(y)|y|^{a-1} - 1}{a-1}$$

  Inverse:

  $$\phi^{-1}(\tilde{y}) = \text{sgn}((a-1)\tilde{y} + 1)\,|(a-1)\tilde{y} + 1|^{1/(a-1)}$$

  Jacobian:

  $$\frac{d\phi}{dy} = |y|^{a-2}$$

- **Sinh-Arcsinh Transformation**: Adjusts tail behaviour:

  $$\phi(y) = \sinh(b\sinh^{-1}(y) - a)$$

  Inverse:

  $$\phi^{-1}(\tilde{y}) = \sinh\left(\frac{\sinh^{-1}(\tilde{y}) + a}{b}\right)$$

  Jacobian:

  $$\frac{d\phi}{dy} = \frac{b \cosh(b \sinh^{-1}(y) - a)}{\sqrt{1 + y^2}}$$
For high-dimensional problems, a single transformation may not be sufficient. Composite warping applies multiple transformations in sequence:
Affine transformations are typically included to standardize the output to zero mean and unit variance. If a nonzero mean function is used, the Affine transform parameters can be adjusted accordingly.
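A minimal sketch of composite warping (illustrative function names, not the repository's API): an affine standardisation layer followed by a sinh-arcsinh layer, with the inverse applying the layers in reverse order.

```python
import numpy as np

def affine(y, a, b):
    return a * y + b

def affine_inv(t, a, b):
    return (t - b) / a

def sinharcsinh(y, a, b):
    return np.sinh(b * np.arcsinh(y) - a)

def sinharcsinh_inv(t, a, b):
    return np.sinh((np.arcsinh(t) + a) / b)

y = np.random.default_rng(1).gamma(2.0, size=200)
# Layer 1: affine standardisation to zero mean, unit variance ('meanstd').
a1, b1 = 1.0 / y.std(), -y.mean() / y.std()
# Layer 2: sinh-arcsinh to adjust the tails.
t = sinharcsinh(affine(y, a1, b1), a=0.3, b=1.2)
# The inverse map applies the layers in reverse order.
y_back = affine_inv(sinharcsinh_inv(t, 0.3, 1.2), a1, b1)
```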
To set the output warping in the code, provide a list of warping models in ow_model, or ow_noise for the noise model GP.
The allowed string flags are ('affine', 'nat_log', 'boxcox', 'sinharcsinh', 'meanstd', 'zero_mean', 'unit_var').
The order in which you place these labels controls the ordering of the output warping.
Given the warped training data $(\tilde{X}, \mathbf{\tilde{y}})$, the posterior predictive distribution at test inputs $\tilde{X}^*$ is given by:

- **Posterior Mean**:

  $$\mu(\tilde{X}^*) = \mathbf{K}(\tilde{X}^*, \tilde{X})\left[\mathbf{K}(\tilde{X}, \tilde{X}) + \sigma^2_{n}\mathbf{I}\right]^{-1}\mathbf{\tilde{y}}$$

- **Posterior Variance**:

  $$\text{Var}(\tilde{X}^*) = \mathbf{K}(\tilde{X}^*, \tilde{X}^*) - \mathbf{K}(\tilde{X}^*, \tilde{X})\left[\mathbf{K}(\tilde{X}, \tilde{X}) + \sigma^2_{n}\mathbf{I}\right]^{-1}\mathbf{K}(\tilde{X}, \tilde{X}^*)$$

Here, $\mathbf{K}(\cdot, \cdot)$ denotes the covariance matrix built from the chosen kernel, $\sigma^2_{n}$ is the noise variance, and $\mathbf{I}$ is the identity matrix.
To improve numerical stability, Cholesky decomposition is applied to avoid explicit matrix inversion:

$$\mathbf{K}(\tilde{X}, \tilde{X}) + \sigma^2_{n}\mathbf{I} = \mathbf{L}\mathbf{L}^{T}$$

where $\mathbf{L}$ is lower triangular. Solving $\mathbf{L}\mathbf{L}^{T}\boldsymbol{\alpha} = \mathbf{\tilde{y}}$ by forward and backward substitution, the mean becomes:

$$\mu(\tilde{X}^*) = \mathbf{K}(\tilde{X}^*, \tilde{X})\,\boldsymbol{\alpha}$$

Similarly, for the variance, with $\mathbf{v} = \mathbf{L}^{-1}\mathbf{K}(\tilde{X}, \tilde{X}^*)$:

$$\text{Var}(\tilde{X}^*) = \mathbf{K}(\tilde{X}^*, \tilde{X}^*) - \mathbf{v}^{T}\mathbf{v}$$
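The Cholesky-based prediction can be sketched with scipy (illustrative function name; `K`, `Ks`, `Kss` are the train/train, test/train, and test/test covariance matrices):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def gp_predict(K, Ks, Kss, y, noise_var):
    """Posterior mean and variance via a Cholesky factorisation,
    avoiding an explicit matrix inverse."""
    L, low = cho_factor(K + noise_var * np.eye(len(y)), lower=True)
    alpha = cho_solve((L, low), y)               # [K + sigma_n^2 I]^{-1} y
    mu = Ks @ alpha                              # posterior mean
    v = solve_triangular(L, Ks.T, lower=True)    # L^{-1} K(X, X*)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)  # posterior variance
    return mu, var
```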
Finally, the predictions are converted back to the original (non-warped) output space, using Gaussian quadrature to approximate the integral:

$$\mathbb{E}[y^*] = \int \phi^{-1}(\tilde{y})\,\mathcal{N}\!\left(\tilde{y};\,\mu, \sigma^2\right) d\tilde{y} \approx \frac{1}{\sqrt{\pi}} \sum_{i=1}^{n} w_i\,\phi^{-1}\!\left(\sqrt{2}\,\sigma\, x_i + \mu\right)$$

where $x_i$ and $w_i$ are the Gauss-Hermite quadrature nodes and weights, and $\mu$, $\sigma^2$ are the posterior mean and variance at the test point.
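A sketch of this unwarping step with numpy's Gauss-Hermite nodes (illustrative function name):

```python
import numpy as np

def unwarp_mean(mu, var, phi_inv, n=32):
    """Approximate E[phi^{-1}(t)] for t ~ N(mu, var) by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n)
    return (w * phi_inv(np.sqrt(2.0 * var) * x + mu)).sum() / np.sqrt(np.pi)
```

For example, with a log output warp, `phi_inv` is `np.exp` and the result matches the known mean of a log-normal distribution.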
The code implements the varying noise model in Model/GP_het_scedat.py; a worked example is provided in the test_example directory.
The kernel hyperparameters are optimised by maximising the log marginal likelihood, which is done computationally by minimising the negative log marginal likelihood. To ensure the optimisation is performed in the original space (even when using warped data), we perform a change of variables on the probability density:

$$p(\mathbf{y}) = p(\mathbf{\tilde{y}}) \prod_{i=1}^{n} \left|\frac{d\phi}{dy}\right|_{y_i}$$

The log marginal likelihood is then defined as:

$$\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2}\mathbf{\tilde{y}}^{T}\left[\mathbf{K} + \sigma^2_{n}\mathbf{I}\right]^{-1}\mathbf{\tilde{y}} - \frac{1}{2}\log\left|\mathbf{K} + \sigma^2_{n}\mathbf{I}\right| - \frac{n}{2}\log 2\pi + \sum_{i=1}^{n} \log\left|\frac{d\phi}{dy}\right|_{y_i}$$
This optimisation is typically performed using scipy.optimize.minimize with the L-BFGS-B method, or scipy.optimize.differential_evolution for stochastic global optimisation. The differential evolution method requires more function evaluations and is better suited to problems with fewer hyperparameters. For the minimize function, multiple restarts are recommended to avoid local minima. Performance could be improved by supplying the derivative of the log marginal likelihood, though this is not yet implemented.
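A self-contained sketch of this optimisation loop for a 1D RBF kernel without output warping (synthetic data and function names are illustrative, not the repository's implementation):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(25, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=25)

def neg_log_marginal_likelihood(theta):
    # Work with log-parameters so the optimiser sees an unconstrained space.
    var, ell, noise = np.exp(theta)
    K = var * np.exp(-0.5 * (X - X.T) ** 2 / ell ** 2) + noise * np.eye(len(y))
    L, low = cho_factor(K, lower=True)
    alpha = cho_solve((L, low), y)
    # 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2 pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2.0 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), method="L-BFGS-B")
```

With multiple restarts, `x0` would be drawn from several random starting points and the best `res.fun` kept.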