Breaking the iid Barrier: How Stochastic Gradients Are Going Spatial
Authors:
(1) Mohamed A. Abba, Department of Statistics, North Carolina State University;
(2) Brian J. Reich, Department of Statistics, North Carolina State University;
(3) Reetam Majumder, Southeast Climate Adaptation Science Center, North Carolina State University;
(4) Brandon Feng, Department of Statistics, North Carolina State University.
Table of Links
Abstract and 1 Introduction
1.1 Methods to handle large spatial datasets
1.2 Review of stochastic gradient methods
2 Matérn Gaussian Process Model and its Approximations
2.1 The Vecchia approximation
3 The SG-MCMC Algorithm and 3.1 SG Langevin Dynamics
3.2 Derivation of gradients and Fisher information for SGRLD
4 Simulation Study and 4.1 Data generation
4.2 Competing methods and metrics
4.3 Results
5 Analysis of Global Ocean Temperature Data
6 Discussion, Acknowledgements, and References
Appendix A.1: Computational Details
Appendix A.2: Additional Results
1.2 Review of stochastic gradient methods
When dealing with large datasets, stochastic gradient (SG) methods (Robbins and Monro, 1951) have become the default choice in machine learning (Hardt et al., 2016). Rather than computing a costly gradient based on the full dataset, SG methods require only an unbiased, possibly noisy, estimate computed from a subsample of the data. When the data are independent and identically distributed (iid), properly scaling the gradient computed on a subsample yields an unbiased estimate of the full gradient. The popularity and success of SG methods in optimization eventually led to their adoption for scalable Bayesian inference (Nemeth and Fearnhead, 2021). Scalable SG Markov chain Monte Carlo (SGMCMC) methods for posterior sampling in the iid setting have been proposed (Welling and Teh, 2011; Chen et al., 2015; Ma et al., 2015; Dubey et al., 2016; Baker et al., 2019), and their convergence has received considerable attention: under mild conditions, SGMCMC methods produce approximate samples from the posterior (Teh et al., 2016; Durmus and Moulines, 2017; Dalalyan and Karagulyan, 2019).
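To fix ideas, the minimal sketch below (our illustration, not code from the paper; an iid Gaussian-mean model with unit variance and all function names are our assumptions) shows the standard rescaling: the gradient computed on m of N points, scaled by N/m, is an unbiased estimate of the full-data gradient precisely because the iid log-likelihood is a sum over observations.

```python
import numpy as np

def full_grad(theta, y):
    # Gradient w.r.t. theta of sum_i log N(y_i | theta, 1) = sum_i (y_i - theta).
    return np.sum(y - theta)

def minibatch_grad(theta, y, m, rng):
    # Draw m of N points uniformly without replacement and rescale by N/m;
    # since the iid log-likelihood is a sum over observations, the
    # expectation of this estimate equals the full gradient.
    N = len(y)
    idx = rng.choice(N, size=m, replace=False)
    return (N / m) * np.sum(y[idx] - theta)

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, size=10_000)
theta = 0.0
est = np.mean([minibatch_grad(theta, y, 100, rng) for _ in range(2_000)])
print(full_grad(theta, y), est)  # nearly identical, up to Monte Carlo error
```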
Although SG methods are widely used in the iid setting, their use in the correlated setting remains largely unexplored. A naive application of SGMCMC methods to correlated data would overlook critical dependencies during subsampling, and the resulting gradient estimates cannot be guaranteed to be unbiased. To the best of our knowledge, subsampling methods for spatial data that yield unbiased gradient estimates have not been addressed. Chen et al. (2020) studied the performance and theoretical guarantees of SG optimization for GP models. Although the gradient based on a minibatch of the data gives a biased estimate of the full gradient of the log-likelihood, Chen et al. (2020) established convergence guarantees for recovering the noise variance and spatial process variance in the case of the exponential covariance function. In their work, the length-scale parameter, which controls the degree of correlation between distinct points, is assumed known, and no convergence result is provided for it. Recent work has considered other types of dependent data. For network data, Li et al. (2016b) developed an SGMCMC algorithm for mixed-membership stochastic block models. Ma et al. (2017) leveraged the short-term dependencies in hidden Markov models to construct a gradient estimate with controlled bias using non-overlapping subsequences of the data; this approach was later extended to linear and non-linear state space models (Aicher et al., 2019, 2021). The core difficulty is that a GP log-likelihood does not decompose into a sum of independent per-observation terms, so the naive iid rescaling fails; the sketch below illustrates the resulting bias.
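A minimal numerical illustration (ours, not from the paper): for a small exponential-covariance GP we enumerate all size-m subsamples, so the average below is the exact expectation of the naively rescaled subsample gradient of the log-likelihood with respect to the process variance, and it differs from the full-data gradient.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Simulate a small GP with exponential covariance K = sigma2 * exp(-d / rho).
N, sigma2, rho = 12, 1.0, 0.3
s = np.sort(rng.uniform(size=N))
D = np.abs(s[:, None] - s[None, :])
R = np.exp(-D / rho)
y = np.linalg.cholesky(sigma2 * R) @ rng.normal(size=N)

def grad_sigma2(y_sub, R_sub):
    # d/d(sigma2) of log N(y | 0, sigma2 * R):
    #   0.5 * (y' K^{-1} R K^{-1} y - tr(K^{-1} R)),  with K = sigma2 * R.
    K_sub = sigma2 * R_sub
    a = np.linalg.solve(K_sub, y_sub)
    return 0.5 * (a @ R_sub @ a - np.trace(np.linalg.solve(K_sub, R_sub)))

full = grad_sigma2(y, R)

# Naive minibatch estimate: (N/m)-scaled subsample gradient, averaged over
# ALL size-m subsamples, i.e., its exact expectation given the observed y.
m = 4
est = np.mean([(N / m) * grad_sigma2(y[list(S)], R[np.ix_(S, S)])
               for S in combinations(range(N), m)])
print(full, est)  # the expectation does not match the full gradient: bias
```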
SGMCMC methods can be divided into two main groups, based on either Hamiltonian dynamics (Chen et al., 2014) or Langevin dynamics (Welling and Teh, 2011). In this work we use the Langevin dynamics (LD) method because it has fewer hyperparameters; our approach can be extended to Hamiltonian dynamics with minor modifications. We extend the SGLD method to non-iid data using the Vecchia approximation and provide a method that accounts for local curvature to improve convergence.
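For reference, a single SGLD update adds Gaussian noise, with variance matched to the step size, to a stochastic-gradient step on the log-posterior. The sketch below is a generic rendering of the vanilla update of Welling and Teh (2011), not the preconditioned (SGRLD) variant developed in this paper, and the function names are our assumptions.

```python
import numpy as np

def sgld_step(theta, grad_log_post_hat, eps, rng):
    # One SGLD update (Welling and Teh, 2011):
    #   theta <- theta + (eps / 2) * ghat + Normal(0, eps * I),
    # where ghat is an unbiased estimate of the gradient of the log-posterior.
    # Matching the injected-noise variance (eps) to the step size is what lets
    # the chain approximately target the posterior as eps -> 0.
    ghat = grad_log_post_hat(theta)
    noise = rng.normal(scale=np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * ghat + noise
```

The paper's SGRLD method additionally rescales both the gradient step and the injected noise by a curvature matrix (the Fisher information derived in Section 3.2); only the unpreconditioned update is shown here.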
In the remainder of this paper, Section 2 discusses the Matérn Gaussian process model and the Vecchia approximation used to obtain unbiased gradients. Section 3 presents the derived SGMCMC algorithm for Gaussian process learning. We test our proposed method using a simulation study in Section 4, and present a case study for ocean temperature data in Section 5; Section 6 concludes. A modification of our approach into a stochastic gradient Fisher scoring method for GPs is discussed in the Supplementary Material, alongside its performance for maximum likelihood estimation.