# Bayesian Stopping Criteria

The blog Why Add Q to MC? explained the advantages of carefully chosen, low discrepancy sampling sites for approximating multivariate integrals, or equivalently, expectations of functions of multivariate random variables:

\begin{align}
\mu &:= \int_{[0,1]^d} f(\boldsymbol{x}) \, \text{d}{\boldsymbol{x}} = {\mathbb{E}}[f(\boldsymbol{x})], \; \text{where } \boldsymbol{x} \sim \mathcal{U}[0,1]^d, \\
\mu \approx \widehat{\mu}_n &:= \frac 1n \sum_{i=1}^n f(\boldsymbol{x}_i),
\text{where } \boldsymbol{x}_1, \boldsymbol{x}_2, \ldots \text{ are the sampling sites or nodes}.
\end{align}
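
As a small, self-contained illustration of the estimator $\widehat{\mu}_n$ (a sketch in plain NumPy, not QMCPy code), the following compares a rank-1 lattice rule with plain Monte Carlo on a smooth periodic integrand. The sample size $n = 55$ and generating vector $(1, 34)$, a Fibonacci lattice, are choices made for this example.

```python
import numpy as np

# Illustrative sketch (not QMCPy): approximate the integral of the smooth,
# periodic integrand f(x, y) = (1 + sin(2 pi x))(1 + sin(2 pi y)) over [0,1]^2,
# whose true value is 1, with a rank-1 lattice rule versus plain Monte Carlo.
f = lambda x: np.prod(1 + np.sin(2 * np.pi * x), axis=1)

n = 55                          # Fibonacci number F_10 (an illustrative choice)
z = np.array([1, 34])           # generating vector (1, F_9): a Fibonacci lattice
i = np.arange(n)[:, None]
lattice_pts = (i * z % n) / n   # lattice nodes x_i = (i * z mod n) / n
qmc_est = f(lattice_pts).mean()

rng = np.random.default_rng(7)  # plain Monte Carlo with the same sample budget
mc_est = f(rng.random((n, 2))).mean()

print(abs(qmc_est - 1))         # near machine precision for this integrand
print(abs(mc_est - 1))          # typically on the order of 0.1 at n = 55
```

For this particular integrand the lattice rule integrates every nonconstant Fourier term exactly, so its error is essentially floating-point roundoff, while plain Monte Carlo error decays only like $\mathcal{O}(n^{-1/2})$.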

The question of when enough samples have been taken to satisfy the specified error criterion, $\left \lvert{\mu - \widehat{\mu}_n}\right \rvert \le \varepsilon$, is an important one. Stopping criteria based on Bayesian credible intervals are one good way of answering this question.

Traditionally, Bayesian credible intervals take $\mathcal{O}(n^3)$ operations to construct, where $n$ is the sample size. This is due to the need to invert an $n \times n$ (covariance) matrix. In contrast, computing $\widehat{\mu}_n$ requires only $\mathcal{O}(n)$ operations. Fortuitously, there is a way to reduce the computational cost of constructing the Bayesian credible interval to $\mathcal{O}(n \log (n))$ when using lattice or digital net low discrepancy samples. These Bayesian stopping criteria are implemented in QMCPy as CubBayesLatticeG and CubBayesNetG. This blog post explains the key points.

#### The Bayesian Approach

The Bayesian approach to approximating integrals assumes that the integrand, $f:[0,1]^d \to \mathbb{R}$, is drawn from a Gaussian stochastic process parameterized by a (constant) mean and a covariance function defined by scale and shape parameters. This is denoted $f \sim \mathcal{GP}(m,s^2 C_{\boldsymbol{\theta}})$, and means that
\begin{align*}
{\mathbb{E}}[f(\boldsymbol{x})] = m, \; {\textup{cov}}\bigl(f(\boldsymbol{t}),f(\boldsymbol{x}) \bigr) = {\mathbb{E}}[(f(\boldsymbol{t}) - m)(f(\boldsymbol{x}) - m)] = s^2 C_{\boldsymbol{\theta}}(\boldsymbol{t},\boldsymbol{x}), \; \forall \boldsymbol{t}, \boldsymbol{x} \in [0,1]^d.
\end{align*}
Here, $m$, $s$, and $C_{\boldsymbol{\theta}}$ must be specified or estimated. Because $f$ is a Gaussian process, the multivariate integral of $f$, i.e., $\mu$, has a Gaussian distribution. Automatic Bayesian cubature uses these assumptions to increase $n$ until we reach
$\mathbb{P}_f [\left\lvert { \mu - \widehat{\mu}_n} \right\rvert \le \varepsilon ] \ge 99\%.$

The Bayesian credible interval, which depends on the sampling nodes and the parameters defining the Gaussian process, is
\begin{equation*}
\mathbb{P}_f \left[
|\mu-\widehat{\mu}_n| \leq \text{err}_\text{CI}
\right] = 99\%, \;
{\text{err}_{\text{CI}}} := 2.58 \; s \; \sqrt{c_{0,{\boldsymbol{\theta}}} - {\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})}^T \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})^{-1}\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})}
\end{equation*}
assuming that the mean of the Gaussian process, $m$, is zero. The formulae for the case of general $m$ are more complicated, but similar. Here,
\begin{gather*}
\label{eqn:fGaussDist_c0}
\mathsf{X} = ({\boldsymbol{x}}_1, \ldots, {\boldsymbol{x}}_n)^T, \qquad c_{0,{\boldsymbol{\theta}}} := \int_{[0,1]^{d} \times [0,1]^{d}} C_{\boldsymbol{\theta}} (\boldsymbol{t},\boldsymbol{x}) \, \text{d}{\boldsymbol{t}} \, \text{d}{\boldsymbol{x}} , \\
\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X}) := \left( \int_{[0,1]^d} C_{\boldsymbol{\theta}} (\boldsymbol{t},\boldsymbol{x}_i) \, \text{d}{\boldsymbol{t}} \right)_{i=1}^n, \qquad \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X}) = \left( C_{\boldsymbol{\theta}}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j) \right)_{i,j=1}^n,
\end{gather*}
and $2.58$ is the $99.5\%$ quantile of the standard Gaussian distribution.
Bayesian cubature proceeds by increasing $n$ until $\text{err}_{\text{CI}}$ is no greater than the error tolerance.
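
To make the credible-interval width concrete, here is a numeric sketch of $\text{err}_{\text{CI}}$ in $d = 1$, using the $r = 1$ case of the shift-invariant kernel introduced later in this post; for that kernel the integrals defining $c_{0,\boldsymbol{\theta}}$ and $\boldsymbol{c}_{\boldsymbol{\theta}}(\mathsf{X})$ both equal one. The values $\eta = s = 1$ and $n = 16$ are assumptions chosen for illustration.

```python
import numpy as np

# Sketch of err_CI in d = 1 with the r = 1 shift-invariant kernel
# C(t, x) = 1 + eta * B2((t - x) mod 1), where B2(x) = x^2 - x + 1/6.
# For this kernel, int_0^1 C(t, x) dt = 1, so c0 = 1 and c(X) is a ones vector.
# The hyperparameter values eta = s = 1 and n = 16 are illustrative assumptions.
B2 = lambda x: x**2 - x + 1.0 / 6.0
eta, s, n = 1.0, 1.0, 16
x = np.arange(n) / n                               # 1d lattice nodes i/n
C = 1 + eta * B2((x[:, None] - x[None, :]) % 1)    # Gram matrix C(X, X)
ones = np.ones(n)
err_CI = 2.58 * s * np.sqrt(1 - ones @ np.linalg.solve(C, ones))
print(err_CI)   # half-width of the 99% credible interval; shrinks as n grows
```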

The Bayesian credible interval as a stopping criterion assumes that the integrand, $f$, is not an outlier with respect to this Gaussian process. To improve the chances that this is the case, the hyperparameters $m$, $s$, and $\boldsymbol{\theta}$ should be tuned using the function data. In particular, $s$ should depend on the magnitude of the fluctuations of $f$. The standard approaches for tuning the hyperparameters require optimization, for which each iteration involves an eigendecomposition of $\mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$.
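
The following sketch shows one piece of such tuning in the zero-mean case: the standard Gaussian-process maximum-likelihood estimate of the scale is $s^2 = \boldsymbol{y}^T \mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})^{-1} \boldsymbol{y} / n$, where $\boldsymbol{y}$ holds the sampled function values. The kernel choice, $\eta = 1$, and the test integrand $\sin(2\pi x)$ below are illustrative assumptions.

```python
import numpy as np

# Sketch of empirical-Bayes scale estimation (zero-mean case): the standard GP
# maximum-likelihood estimate is s^2 = y^T C^{-1} y / n. The kernel, eta = 1,
# and the test integrand sin(2 pi x) are assumptions made for illustration.
B2 = lambda x: x**2 - x + 1.0 / 6.0
n, eta = 16, 1.0
x = np.arange(n) / n                               # 1d lattice nodes i/n
C = 1 + eta * B2((x[:, None] - x[None, :]) % 1)    # Gram matrix C(X, X)
y = np.sin(2 * np.pi * x)                          # sampled integrand values
s2 = y @ np.linalg.solve(C, y) / n                 # MLE of the scale s^2
print(np.sqrt(s2))          # estimated s, which is then plugged into err_CI
```

This makes $s$ track the magnitude of the fluctuations of $f$: a flat integrand yields a small $s$ and a narrow credible interval, while a wildly varying one widens it.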

#### Fast Bayesian Cubature

The computation of $\text{err}_{\text{CI}}$ for a single $\mathsf{X}$ and $\boldsymbol{\theta}$ requires inversion of the covariance matrix $\mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$, which typically costs $\mathcal{O}(n^3)$. An eigendecomposition of $\mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$ costs at least this much. The way to overcome this computational burden is to use families of covariance kernels, $C_{\boldsymbol{\theta}}$, that match the design, $\mathsf{X}$, in such a way that the eigenvectors and eigenvalues of the matrix $\mathsf{C}_{\boldsymbol{\theta}}(\mathsf{X},\mathsf{X})$ can be obtained via fast transforms. For integration lattices and kernels $C_{\boldsymbol{\theta}}$ such as
\begin{align*}
%\label{eq:shInv}
C_\boldsymbol{\theta}(\boldsymbol{t},\boldsymbol{x}) = K_\boldsymbol{\theta}(\boldsymbol{t} - \boldsymbol{x} \bmod \boldsymbol{1}),
\end{align*} where
\begin{align*}
\label{the_kernel_eqn_bernoulli}
K_\boldsymbol{\theta}(\boldsymbol{x}) =
\prod_{l=1}^d \biggl[
1 - (-1)^{r} \eta B_{2r}(x_l) \biggr], \;
\forall \boldsymbol{t},\boldsymbol{x} \in [0,1]^d, \ \boldsymbol{\theta} = (r,\eta), \ r \in \mathbb{N}, \ \eta > 0,
\end{align*}
and $B_{2r}$ is the Bernoulli polynomial of (even) degree $2r$,
the fast transform corresponds to a fast Fourier transform. For digital nets (see our blog Digital Sequences, the Niederreiter Construction) and $C_\boldsymbol{\theta}$ that are digitally-shift invariant, the fast transform corresponds to a fast Walsh transform. In both cases, the computational burden attributable to tuning the hyperparameters and computing the width of the credible interval is a reasonable $\mathcal{O}(n \log(n))$.
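
A sketch of why the fast transform works, in an illustrative $d = 1$, $r = 1$ setup with nodes $i/n$: a shift-invariant kernel on these nodes yields a circulant Gram matrix, so its full spectrum is the FFT of its first row.

```python
import numpy as np

# Why O(n log n) is plausible (illustrative d = 1, r = 1 setup): for lattice
# nodes i/n and a shift-invariant kernel, the Gram matrix is circulant, so its
# eigenvalues are the FFT of its first row; no O(n^3) eigendecomposition needed.
B2 = lambda x: x**2 - x + 1.0 / 6.0        # Bernoulli polynomial of degree 2
n = 64
x = np.arange(n) / n
C = 1 + B2((x[:, None] - x[None, :]) % 1)  # circulant Gram matrix C(X, X)
fft_eigs = np.fft.fft(C[0]).real           # eigenvalues via FFT: O(n log n)
dense_eigs = np.linalg.eigvalsh(C)         # dense eigendecomposition: O(n^3)
print(np.allclose(np.sort(fft_eigs), dense_eigs))  # True: same spectrum
```

The same structure makes the quadratic form inside $\text{err}_{\text{CI}}$ and the likelihood terms for hyperparameter tuning computable from these eigenvalues at $\mathcal{O}(n \log(n))$ cost.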

The following code shows how to use a Bayesian stopping criterion for Keister’s example.

```python
>>> from qmcpy import Keister, Lattice, CubBayesLatticeG
>>>
>>> tol = .005
>>>
>>> integrand = Keister(Lattice(dimension=2, order='linear'))
>>> keister_2d_exact = integrand.exact_integ()
>>> solution, data = CubBayesLatticeG(integrand, abs_tol=tol, n_init=2 ** 5).integrate()
>>> print('Integration error: ', abs(solution - keister_2d_exact) < tol)
```


Listing 1: Example usage of the lattice Bayesian cubature algorithm.

This example can be run in Google Colab without any installation using this notebook.

#### References

1. Hickernell, F. J. Blog: Why Add Q to MC? https://qmcpy.org/2020/06/25/why_add_q_to_mc/ (2020).
2. Choi, S.-C. T., Hickernell, F. J., McCourt, M. & Sorokin, A. QMCPy: A Quasi-Monte Carlo Python Library. https://qmcsoftware.github.io/QMCSoftware/ (2020).
3. Jagadeeswaran, R. & Hickernell, F. Fast Automatic Bayesian Cubature Using Lattice Sampling. Stat. Comput. 29, 1215–1229 (2019).
4. Sloan, I. H. & Joe, S. Lattice Methods for Multiple Integration (Oxford University Press, Oxford, 1994).
5. Hickernell, F. & Niederreiter, H. The Existence of Good Extensible Rank-1 Lattices. J. Complexity 19, 286–300 (2003).
6. Cooley, J. W. & Tukey, J. W. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19, 297–301 (1965).
7. Ebert, A. Blog: Digital Sequences, the Niederreiter Construction. https://qmcpy.org/2021/06/04/digital-sequences-the-niederreiter-construction/ (2021).
8. Jagadeeswaran, R. Fast Automatic Bayesian Cubature Using Matching Kernels and Designs PhD thesis (Illinois Institute of Technology, 2019).
9. Fino, B. J. & Algazi, V. R. Unified Matrix Treatment of the Fast Walsh-Hadamard Transform. IEEE Transactions on Computers C-25, 1142–1146 (1976).