Bayesian Stopping Criteria

The blog Why Add Q to MC? explained the advantages of carefully chosen, low discrepancy sampling sites for approximating multivariate integrals, or equivalently, expectations of functions of multivariate random variables:

\begin{align}
\mu &:= \int_{[0,1]^d} f(\boldsymbol{x}) \, \text{d}\boldsymbol{x} = {\mathbb{E}}[f(\boldsymbol{x})], \; \text{where } \boldsymbol{x} \sim \mathcal{U}[0,1]^d, \\
\mu \approx \widehat{\mu}_n  &:= \frac{1}{n} \sum_{i=1}^n f(\boldsymbol{x}_i),
\; \text{where } \boldsymbol{x}_1, \boldsymbol{x}_2, \ldots \text{ are the sampling sites or nodes}.
\end{align}

The question of when enough samples have been taken to satisfy the specified error criterion, $\left\lvert \mu - \widehat{\mu}_n \right\rvert \le \varepsilon$, is an important one. Stopping criteria based on Bayesian credible intervals are one good way of answering this question.
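Before turning to the stopping criteria, here is a minimal NumPy-only sketch of the estimator $\widehat{\mu}_n$ itself, using a hypothetical toy integrand (not the Keister function) and a small rank-1 lattice whose generating vector is chosen purely for illustration:

```python
import numpy as np

# Toy integrand on [0,1]^2 with known integral: f(x) = x_1 + x_2, so mu = 1.
def f(x):
    return x[:, 0] + x[:, 1]

# Nodes of a small rank-1 lattice, x_i = i*z/n mod 1 (illustrative z)
n, z = 64, np.array([1, 27])
x = (np.arange(n)[:, None] * z % n) / n

# The estimator mu_hat_n is just the sample mean of f over the nodes
mu_hat = f(x).mean()
print(abs(mu_hat - 1.0))  # absolute error of the approximation
```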

Traditionally, Bayesian credible intervals take $\mathcal{O}(n^3)$ operations to construct, where $n$ is the sample size. This is due to the need to invert an $n \times n$ (covariance) matrix. In contrast, computing $\widehat{\mu}_n$ requires only $\mathcal{O}(n)$ operations. Fortuitously, there is a way to reduce the computational cost of constructing the Bayesian credible interval to $\mathcal{O}(n \log (n))$ when using lattice or digital net low discrepancy samples. These Bayesian stopping criteria are implemented as CubBayesLatticeG and CubBayesNetG. This blog post explains the key ideas.

The Bayesian Approach

The Bayesian approach to approximating integrals assumes that the integrand, $f:[0,1]^d \to \mathbb{R}$, is drawn from a Gaussian stochastic process parameterized by a (constant) mean and a covariance function defined by scale and shape parameters. This is denoted $f \sim \mathcal{GP}(m,s^2 C_{\boldsymbol{\theta}})$, and means that
\begin{align*}
{\mathbb{E}}[f(\boldsymbol{x})] = m, \; {\textup{cov}}\bigl(f(\boldsymbol{t}),f(\boldsymbol{x}) \bigr) = {\mathbb{E}}[(f(\boldsymbol{t}) - m)(f(\boldsymbol{x}) - m)] = s^2 C_{\boldsymbol{\theta}}(\boldsymbol{t},\boldsymbol{x}), \; \forall \boldsymbol{t}, \boldsymbol{x} \in [0,1]^d.
\end{align*}
Here, $m$, $s$, and $C_{\boldsymbol{\theta}}$ must be specified or estimated. Because $f$ is a Gaussian process, the multivariate integral of $f$, i.e., $\mu$, has a Gaussian distribution. Automatic Bayesian cubature uses these assumptions to increase $n$ until we reach
$
\mathbb{P}_f [\left\lvert \mu - \widehat{\mu}_n \right\rvert \le \varepsilon ] \ge 99\%.
$

The Bayesian credible interval, which depends on the sampling nodes and the parameters defining the Gaussian process, is
\begin{equation*}
\mathbb{P}_f \left[
|\mu-\widehat{\mu}_n| \leq \text{err}_\text{CI}
\right] = 99\%, \;
{\text{err}_{\text{CI}}} := 2.58 \; s \; \sqrt{c_{0,{\boldsymbol{\theta}}} - {\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})}^T \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})^{-1}\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})}
\end{equation*}
assuming that the mean of the Gaussian process, $m$, is zero. The formulae for the case of general $m$ are more complicated, but similar. Here,
\begin{gather*}
\label{eqn:fGaussDist_c0}
\mathsf{X} = ({\boldsymbol{x}}_1, \ldots, {\boldsymbol{x}}_n)^T, \qquad c_{0,{\boldsymbol{\theta}}} := \int_{[0,1]^{d} \times [0,1]^{d}} C_{\boldsymbol{\theta}} (\boldsymbol{t},\boldsymbol{x}) \, \text{d}{\boldsymbol{t}} \, \text{d}{\boldsymbol{x}} , \\
\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X}) := \left( \int_{[0,1]^d} C_{\boldsymbol{\theta}} (\boldsymbol{t},\boldsymbol{x}_i) \, \text{d}{\boldsymbol{t}} \right)_{i=1}^n, \qquad \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X}) = \left( C_{\boldsymbol{\theta}}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j) \right)_{i,j=1}^n,
\end{gather*}
and $2.58$ is the $99.5\%$ quantile of the standard Gaussian distribution.
Bayesian cubature proceeds by increasing $n$ until $\text{err}_{\text{CI}}$ is no greater than the error tolerance.
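As a hedged NumPy sketch of this stopping rule (not the QMCPy implementation), consider the $r=1$ shift-invariant kernel discussed later in this post, $K(\boldsymbol{x}) = \prod_l [1 + \eta B_2(x_l)]$ with $B_2(x) = x^2 - x + 1/6$. For this kernel $c_{0,\boldsymbol{\theta}} = 1$ and $\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})$ is the all-ones vector, because $B_2$ integrates to zero over $[0,1]$; the hyperparameters $\eta$ and $s$ are held fixed rather than tuned, purely for illustration:

```python
import numpy as np

# B_2 is the degree-2 Bernoulli polynomial
def B2(x):
    return x**2 - x + 1/6

def err_CI(X, eta=1.0, s=1.0):
    # Gram matrix C(X,X) for the kernel K(x) = prod_l [1 + eta*B_2(x_l)]
    D = (X[:, None, :] - X[None, :, :]) % 1.0
    C = np.prod(1 + eta * B2(D), axis=2)
    c = np.ones(len(X))  # c(X) is all ones for this kernel, and c_0 = 1
    disc = max(1.0 - c @ np.linalg.solve(C, c), 0.0)
    return 2.58 * s * np.sqrt(disc)  # credible-interval half-width

# increase n until the credible-interval width meets the tolerance
tol, n = 1e-2, 8
while True:
    X = np.arange(n)[:, None] / n  # equally spaced nodes, d = 1
    if err_CI(X) <= tol:
        break
    n *= 2
```

Doubling $n$ roughly halves $\text{err}_{\text{CI}}$ in this one-dimensional example, so the loop terminates after a few iterations.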

The Bayesian credible interval as a stopping criterion assumes that the integrand, $f$, is not an outlier with respect to this Gaussian process. To improve the chances that this is the case, the hyperparameters $m$, $s$, and $\boldsymbol{\theta}$ should be tuned using the function data. In particular, $s$ should depend on the magnitude of the fluctuations of $f$. The standard approaches for tuning the hyperparameters require optimization, for which each iteration involves an eigendecomposition of $ C_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$.
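Part of this tuning has a convenient closed form: for a zero-mean Gaussian process with fixed shape parameters, the maximum-likelihood estimate of the scale is $s^2 = \boldsymbol{y}^T \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})^{-1} \boldsymbol{y} / n$, where $\boldsymbol{y}$ is the vector of function values. A hedged sketch with a toy kernel and toy data (not the QMCPy implementation):

```python
import numpy as np

def B2(x):
    return x**2 - x + 1/6

def gram(X, eta):
    # Gram matrix for the shift-invariant kernel prod_l [1 + eta*B_2(x_l)]
    D = (X[:, None, :] - X[None, :, :]) % 1.0
    return np.prod(1 + eta * B2(D), axis=2)

def s2_mle(y, X, eta):
    # closed-form MLE of the scale s^2 for fixed eta, assuming m = 0
    C = gram(X, eta)
    return y @ np.linalg.solve(C, y) / len(y)

# toy data: equally spaced nodes in d = 1 and a smooth integrand
X = np.arange(16)[:, None] / 16
y = np.sin(2 * np.pi * X[:, 0])
print(s2_mle(y, X, eta=1.0))
```

Tuning the shape parameters then reduces to a low-dimensional search over the profile likelihood, each evaluation of which requires a factorization of the Gram matrix.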

Fast Bayesian Cubature

The computation of $\text{err}_{\text{CI}}$ for a single $\mathsf{X}$ and $\boldsymbol{\theta}$ requires inversion of the covariance matrix $ C_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$, which typically costs $\mathcal{O}(n^3)$. An eigendecomposition of $ C_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$ costs at least this much. The way to overcome this computational burden is to use families of covariance kernels, $C_{\boldsymbol{\theta}}$, that match the design, $\mathsf{X}$, in such a way that the eigenvectors and eigenvalues of the matrix $C_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$ can be obtained via fast transforms. For integration lattices and kernels $C_{\boldsymbol{\theta}}$ such as
\begin{align*}
%\label{eq:shInv}
C_\boldsymbol{\theta}(\boldsymbol{t},\boldsymbol{x}) = K_\boldsymbol{\theta}(\boldsymbol{t} - \boldsymbol{x} \bmod \boldsymbol{1}),
\end{align*} where
\begin{align*}
\label{the_kernel_eqn_bernoulli}
K_\boldsymbol{\theta}(\boldsymbol{x}) =
\prod_{l=1}^d \biggl[
1 - (-1)^{r} \eta B_{2r}(x_l) \biggr], \;
\forall \boldsymbol{t},\boldsymbol{x} \in [0,1]^d, \ \boldsymbol{\theta} = (r,\eta), \ r \in \mathbb{N}, \ \eta > 0,
\end{align*}
and $B_{2r}$ is the Bernoulli polynomial of degree $2r$,
the fast transform corresponds to a fast Fourier transform. For digital nets (see our blog Digital Sequences, the Niederreiter Construction) and $C_\boldsymbol{\theta}$ that are digitally-shift invariant, the fast transform corresponds to a fast Walsh transform. In both cases, the computational burden attributable to tuning the hyperparameters and computing the width of the credible interval is a reasonable $\mathcal{O}(n \log(n))$.
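The lattice case can be checked numerically with a short sketch (NumPy only, not QMCPy code): for a rank-1 lattice $\boldsymbol{x}_i = i\boldsymbol{z}/n \bmod \boldsymbol{1}$ and a shift-invariant kernel, $\boldsymbol{x}_i - \boldsymbol{x}_j \bmod \boldsymbol{1} = \boldsymbol{x}_{(i-j) \bmod n}$, so the Gram matrix is circulant and its eigenvalues are the FFT of its first column:

```python
import numpy as np

# B_2 is the degree-2 Bernoulli polynomial (r = 1 case of the kernel)
def B2(x):
    return x**2 - x + 1/6

n, z, eta = 16, np.array([1, 5]), 0.8        # illustrative lattice and eta
x = (np.arange(n)[:, None] * z % n) / n      # rank-1 lattice nodes

# First column of the Gram matrix: C[i,0] = K(x_i - x_0) = K(x_i)
first_col = np.prod(1 + eta * B2(x), axis=1)
lam_fft = np.fft.fft(first_col).real         # eigenvalues in O(n log n)

# Dense check: build C(X,X) explicitly and eigendecompose in O(n^3)
D = (x[:, None, :] - x[None, :, :]) % 1.0
C = np.prod(1 + eta * B2(D), axis=2)
lam_dense = np.linalg.eigvalsh(C)
assert np.allclose(np.sort(lam_fft), np.sort(lam_dense))
```

The eigenvalues are real because the kernel is symmetric, so the first column satisfies $c_k = c_{n-k}$.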

The following code shows how to use a Bayesian stopping criterion for Keister’s example.

>>> from qmcpy import Keister, Lattice, CubBayesLatticeG
>>> 
>>> tol = .005
>>> 
>>> integrand = Keister(Lattice(dimension=2, order='linear'))
>>> keister_2d_exact = integrand.exact_integ()
>>> solution, data = CubBayesLatticeG(integrand, abs_tol=tol, n_init=2 ** 5).integrate()
>>> print('Integration error: ', abs(solution - keister_2d_exact) < tol)

Listing 1: Example usage of Lattice Bayesian Cubature algorithm.

This example can be run in Google Colab without any installation using this notebook.

References

  1. Hickernell, F. J. Blog: Why Add Q to MC? https://qmcpy.org/2020/06/25/why_add_q_to_mc/ (2020).
  2. Choi, S.-C. T., Hickernell, F., McCourt, M. & Sorokin, A. QMCPy: A quasi-Monte Carlo Python Library https://qmcsoftware.github.io/QMCSoftware/. 2020.
  3. Jagadeeswaran, R. & Hickernell, F. Fast Automatic Bayesian Cubature Using Lattice Sampling. Stat. Comput. 29, 1215–1229 (2019).
  4. Sloan, I. H. & Joe, S. Lattice Methods for Multiple Integration (Oxford University Press, Oxford, 1994).
  5. Hickernell, F. & Niederreiter, H. The Existence of Good Extensible Rank-1 Lattices. J. Complexity 19, 286–300 (2003).
  6. Cooley, J. W. & Tukey, J. W. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19, 297–301 (1965).
  7. Ebert, A. Blog: Digital Sequences, the Niederreiter Construction. https://qmcpy.org/2021/06/04/digital-sequences-the-niederreiter-construction/ (2021).
  8. Jagadeeswaran, R. Fast Automatic Bayesian Cubature Using Matching Kernels and Designs PhD thesis (Illinois Institute of Technology, 2019).
  9. Fino, B. J. & Algazi, V. R. Unified Matrix Treatment of the Fast Walsh-Hadamard Transform. IEEE Transactions on Computers C-25, 1142–1146 (1976).

 


Fred Hickernell is Professor of Applied Mathematics and Vice Provost for Research at Illinois Institute of Technology. His research spans computational mathematics and statistics.
