--- title: "Multivariate Gaussian marginal likelihood via THAMES" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{multivariate-gaussian} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` ```{r setup} set.seed(2023) library(thames) ``` To use the function `thames()` we only need two things 1. A $T\times d$ matrix of parameters drawn from the posterior distribution. The columns are the parameters (dimension $d$) and the rows are the $T$ different posterior draws. 2. The vector of unnormalized log posterior values of length $T$ (sum of the log prior and the log likelihood for each drawn parameter). To illustrate the use of the `thames()` function we give a toy example on multivariate Gaussian data. For more details on the method, see the paper [Metodiev M, Perrot-Dockès M, Ouadah S, Irons N. J., Raftery A. E. (2023), Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator.](https://arxiv.org/abs/2305.08952) Here the data $y_i, i=1,\ldots,n$ are drawn independently from a multivariate normal distribution: \begin{eqnarray*} y_i|\mu & \stackrel{\rm iid}{\sim} & {\rm MVN}_d(\mu, I_d), \;\; i=1,\ldots, n, \end{eqnarray*} along with a prior distribution on the mean vector $\mu$: \begin{equation*} p(\mu)= {\rm MVN}_d(\mu; 0_d, s_0 I_d), \end{equation*} with $s_0 > 0$. It can be shown that the posterior distribution of the mean vector $\mu$ given the data $D=\{y_1, \ldots, y_n\}$ is given by: \begin{equation}\label{eq:postMultiGauss} p(\mu|D) = {\rm MVN}_d(\mu; m_n,s_n I_d), \end{equation} where $m_n=n\bar{y}/(n+1/s_0)$, $\bar{y}=(1/n)\sum_{i=1}^n y_i$, and $s_n=1/(n +1/s_0)$. ### Toy example: d = 1, n = 20 We fix $s_0$ (the variance of the prior) and $\mu$ to 1 (note that the results are similar for some other values). ```{r} s0 <- 1 mu_star <- 1 n <- 20 d <- 1 ``` We simulate values of Y and calculate the associated $s_n$ and $m_n$ ```{r} library(mvtnorm) Y = rmvnorm(n, mu_star, diag(d)) sn = 1/(n+1/s0) mn = sum(Y)/(n + 1/s0) ``` To use `thames()` we need a sample drawn from the posterior of the parameters. In this toy example the only parameter is $\mu$, and we can draw from the posterior exactly. (More generally, MCMC can be used to obtain approximate posterior samples.) Here we take $2000$ samples. ```{r} mu_sample = rmvnorm(2000, mean=mn, sn*diag(d)) ``` Now we calculate the unnormalized log posterior for each $\mu^{(i)}$. 
Now we calculate the unnormalized log posterior for each draw $\mu^{(i)}$. We first calculate the log prior of each sample:

```{r}
reg_log_prior <- function(param, sig2) {
  d <- length(param)
  dmvnorm(param, rep(0, d), sig2*diag(d), log = TRUE)
}
log_prior <- apply(mu_sample, 1, reg_log_prior, s0)
```

and the log likelihood of the data for each sampled parameter:

```{r}
reg_log_likelihood <- function(param, X) {
  d <- length(param)
  sum(dmvnorm(X, param, diag(d), log = TRUE))
}
log_likelihood <- apply(mu_sample, 1, reg_log_likelihood, Y)
```

and then sum the two to get the unnormalized log posterior:

```{r}
log_post <- log_prior + log_likelihood
```

We can now estimate the marginal likelihood using THAMES:

```{r}
result <- thames(log_post, mu_sample)
```

The THAMES estimate of the log marginal likelihood is then

```{r}
-result$log_zhat_inv
```

The upper and lower bounds of a confidence interval based on asymptotic normality of the estimator (a 95% interval by default) are

```{r}
-result$log_zhat_inv_L
-result$log_zhat_inv_U
```

If we instead want a 90% confidence interval, we specify a lower quantile of $0.05$:

```{r}
result_90 <- thames(log_post, mu_sample, p = 0.05)
-result_90$log_zhat_inv_L
-result_90$log_zhat_inv_U
```

To check our estimate, we can calculate the Gaussian log marginal likelihood analytically as
$$
\ell(y) = -\frac{nd}{2}\log(2\pi)-\frac{d}{2}\log(s_0 n+1)-\frac{1}{2}\sum_{i=1}^n\|y_i\|^2+\frac{n^2}{2(n+1/s_0)}\|\bar{y}\|^2.
$$

```{r}
-n*d*log(2*pi)/2 - d*log(s0*n + 1)/2 - sum(Y^2)/2 + sum(colSums(Y)^2)/(2*(n + 1/s0))
```

### In higher dimensions

We can do exactly the same thing if $y_i \in \mathbb{R}^d$ with $d>1$:

```{r}
s0 <- 1
n <- 20
d <- 2
mu_star <- rep(1, d)
Y <- rmvnorm(n, mu_star, diag(d))
sn <- 1/(n + 1/s0)
mn <- colSums(Y)/(n + 1/s0)
mu_sample <- rmvnorm(2000, mean = mn, sigma = sn*diag(d))

# Log prior plus log likelihood, evaluated in one function.
mvg_log_post <- function(param, X, sig2) {
  d <- length(param)
  l <- sum(dmvnorm(X, param, diag(d), log = TRUE))
  p <- dmvnorm(param, rep(0, d), sig2*diag(d), log = TRUE)
  p + l
}
log_post <- apply(mu_sample, 1, mvg_log_post, Y, s0)
result <- thames(log_post, mu_sample)
-result$log_zhat_inv
```

To check our estimate, we again calculate the log marginal likelihood analytically:

```{r}
-n*d*log(2*pi)/2 - d*log(s0*n + 1)/2 - sum(Y^2)/2 + sum(colSums(Y)^2)/(2*(n + 1/s0))
```
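To make the comparison concrete, the following sketch gathers the THAMES point estimate, its confidence bounds, and the analytic value side by side. The helper `log_ml_analytic` is introduced here only for illustration; it is not part of the `thames` package, just the closed-form expression above wrapped in a function:

```{r}
# Hypothetical helper (not in the thames package): the analytic log
# marginal likelihood of the multivariate Gaussian model above.
log_ml_analytic <- function(Y, s0) {
  n <- nrow(Y)
  d <- ncol(Y)
  -n*d*log(2*pi)/2 - d*log(s0*n + 1)/2 - sum(Y^2)/2 +
    sum(colSums(Y)^2)/(2*(n + 1/s0))
}
# Negation flips the interval: -log_zhat_inv_U is the lower bound
# for the log marginal likelihood, -log_zhat_inv_L the upper bound.
c(estimate = -result$log_zhat_inv,
  lower    = -result$log_zhat_inv_U,
  upper    = -result$log_zhat_inv_L,
  analytic = log_ml_analytic(Y, s0))
```

If the estimator is well calibrated, the analytic value should usually fall between `lower` and `upper`.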