STA465/STA2016: Week 2 Questions and Homework 1 (UPDATED)
Daniel Simpson
2019-01-18
Please submit your answers to Question 1 (Parts A-D) as a single pdf document via Quercus. It is due Wednesday 30 January 2019 at 12pm (midday). Late submissions will be heavily penalized.
Question 1 (For assessment)
There has been a lot of work done in Bayesian statistics on the effect of prior distributions. In this question,
we’re going to explore a framework for constructing priors that contain, in some sense, minimal information.
These priors have the advantage of being automatic (there are no parameters that need to be set by considering
the problem). On the other hand, they can have poor empirical behaviour for many problems. Nevertheless,
they are commonly used for parameters that are well-identified by the data, such as the (fixed) intercept
term in a linear regression model.
One of the most popular examples of this type of prior is the Jeffreys prior, defined for a univariate parameter θ in the model y | θ ~ p(y | θ) as

p(θ) ∝ √I(θ),

where I(θ) = −E[∂² log p(y | θ)/∂θ²] is the Fisher information.
(Health warning: Although the Jeffreys priors mentioned in this question sometimes do nice things, they
should only be used with extreme care in real analyses. They can do very weird things if you’re not careful!
Nevertheless it’s important to know they exist and get a feeling for how they work.)
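(As a quick unassessed warm-up: if y | θ ~ Poisson(θ), then log p(y | θ) = y log θ − θ − log(y!), so ∂² log p(y | θ)/∂θ² = −y/θ² and I(θ) = E[y | θ]/θ² = 1/θ. The Jeffreys prior is therefore p(θ) ∝ θ^(−1/2), which is also improper.)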
Part A. If y | μ ~ N(μ, σ²) for some fixed value of σ², show that the Jeffreys prior for μ in this model is the improper uniform prior p(μ) ∝ 1.
Part B. Consider the model

yi | μ ~ N(μ, σ²), i = 1, . . . , n,
p(μ) ∝ 1,

where σ² is fixed and known. Show that although p(μ) ∝ 1 is not a proper prior distribution (it does not integrate to one), formally applying the Bayesian machinery leads to a proper posterior distribution. Compute this distribution.
Part C. Consider the similar model

yi | μ ~ N(μ, σ²), i = 1, . . . , n,
μ ~ N(μ0, τ²),

where σ², μ0, and τ² are fixed and known. Compute the posterior distribution p(μ | y) under this model and write a sentence or two comparing it to the posterior computed using the Jeffreys prior for μ. Are they ever the same?
Part D. Do you expect an improper prior to always yield a proper posterior?
Question 2 (Not for assessment)
In this question we will look at a very simple non-spatial example that will help you become familiar with the R-INLA software. The aim of this question is to walk you through an analysis with INLA.
Please work through this question step by step. There are three questions at the end.
To start with, we need to load the library
library(INLA)
In the event that you don’t already have INLA installed, you can install it by running this command:
install.packages("INLA",
                 repos = c(getOption("repos"),
                           INLA = "https://inla.r-inla-download.org/R/testing"),
                 dep = TRUE)
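To check that the installation worked, you can ask the package for its version information (this assumes INLA installed cleanly):

inla.version()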
For this example we’re going to look at a classical experiment that attempted to work out the effect that a
root extract had on the germination probability of certain types of seeds. More details can be found here:
http://www.openbugs.net/Examples/Seeds.html
The experimental set-up is a 2×2 factorial design split over 21 growing plates (the (0,1) treatment combination is replicated six times, while the other three combinations are replicated five times).
The data looks like this.
data(Seeds)
head(Seeds)
## r n x1 x2 plate
## 1 10 39 0 0 1
## 2 23 62 0 0 2
## 3 23 81 0 0 3
## 4 26 51 0 0 4
## 5 17 39 0 0 5
## 6 5 6 0 1 6
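Before fitting anything, it can be useful to eyeball the raw germination proportions r/n in each treatment combination. Here is one way to do that in base R (a quick sketch, nothing INLA-specific):

# Mean germination proportion for each (x1, x2) combination
with(Seeds, tapply(r / n, list(x1 = x1, x2 = x2), mean))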
We model the number of seeds that germinate, ri, out of ni seeds as

ri ~ Binomial(ni, pi).

We want to see how the probability of germination varies with the two treatments (labelled x1 and x2 in the data). Because pi is constrained to be between zero and one (it’s a probability!), we need to first transform it to an unconstrained real number, as this will allow us to do regression. The classical transformation is the logit transform, which gives the linear model

logit(pi) = β0 + β1 x1i + β2 x2i + β12 x1i x2i.
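In R, the logit transform and its inverse are available as qlogis() and plogis(), which is handy for moving between the probability scale and the linear-predictor scale:

qlogis(0.9)          # logit: maps (0, 1) to the whole real line
plogis(qlogis(0.9))  # inverse logit: recovers 0.9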
If each individual experiment were performed identically, we might just want to fit this model. However, each repeated experiment took place on a different growing plate, and it is possible that this introduces some extra variation into the data. To account for this extra variation, we add a random effect ui, so our model becomes

logit(pi) = β0 + β1 x1i + β2 x2i + β12 x1i x2i + ui
ui ~ N(0, τ²)
τ² ~ Inverse-Gamma(1, 1e−5).
The priors specified in this model are the default priors in INLA. For the moment we will keep them.
We can specify this model in INLA using the formula
formula_seeds = r ~ 1 + x1 + x2 + x1*x2 + f(plate, model="iid")
The formula in INLA works the same way as it does in lm, with just one little twist. The r on the left-hand side of ~ is the response variable, while the first four terms on the right-hand side denote the regression model. (NB: All of these terms could be replaced with x1*x2, as R automatically expands this to include the intercept and main effects.)
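For instance, the following shorter formula should specify exactly the same model:

formula_seeds_short = r ~ x1*x2 + f(plate, model="iid")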
The one little twist is the f() function, which is what we use to tell INLA that there is something special going on. In this case, it’s the random effect ui. The arguments to f are:
- The name of the variable in the data (this must be unique).
- The structure of the random effect, here model = "iid". This option tells INLA that the ui are independent and identically distributed N(0, τ²). We will become familiar with other options later.
(There are a pile of other optional arguments that f can take, and we’ll discuss them as appropriate.)
With the formula in hand, we can fit the model using the following command:
fit_seeds = inla(formula_seeds, data=Seeds, family="binomial", Ntrials=n)
The first two arguments to the inla function specify the formula and the data (just like with lm). The family argument tells inla that the data have a Binomial distribution (it automatically uses the logit transformation), while the Ntrials argument tells inla where in the data to find the first argument of the Binomial distribution (recall, ri ~ Binomial(ni, pi)).
This code runs fairly fast and produces the following output.
summary(fit_seeds)
##
## Call:
## c("inla(formula = formula_seeds, family = \"binomial\", data =
## Seeds, ", " Ntrials = n)")
## Time used:
## Pre = 2.09, Running = 0.216, Post = 0.121, Total = 2.43
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) -0.557 0.129 -0.813 -0.557 -0.306 -0.555 0
## x1 0.143 0.227 -0.307 0.144 0.586 0.146 0
## x2 1.321 0.182 0.968 1.320 1.682 1.318 0
## x1:x2 -0.782 0.312 -1.395 -0.781 -0.169 -0.781 0
##
## Random effects:
## Name Model
## plate IID model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Precision for plate 19386.16 19788.68 153.07 13211.66 72720.49 9.78
##
## Expected number of effective parameters(stdev): 4.41(1.41)
## Number of equivalent replicates : 4.77
##
## Marginal log-Likelihood: -72.08
One thing that we see here is that the precision for the random effect is very high (median 13212). This
means that the median of the standard deviation is around 0.009, which is very small.
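If you want the full posterior of the standard deviation rather than a single number, INLA can transform the marginal of the precision for you. A sketch (assuming the hyperparameter is stored under the name "Precision for plate", as the summary above suggests):

# Transform the precision marginal into a standard-deviation marginal
prec_marg = fit_seeds$marginals.hyperpar[["Precision for plate"]]
sd_marg = inla.tmarginal(function(tau) 1 / sqrt(tau), prec_marg)
inla.qmarginal(0.5, sd_marg)  # posterior median of the standard deviation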
Question: In the previous paragraph, I used the median rather than the mean because medians are easy to
transform. Show that if a random variable Z has median m, then f(Z) has median f(m) as long as f(·) is a
monotone function. Is this true for a general quantile?
It is also possible to plot the posteriors by running plot(fit_seeds). This produces multiple plots that you
can scroll through using the arrows above the plotting window in RStudio.
If you look at the object fit_seeds you will see it’s a list of lists. It contains all of the information that you need to work with the model, in relatively well-named places.
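A quick way to see the top-level structure of the object without drowning in output is:

str(fit_seeds, max.level = 1)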
Question: Find the following quantities in the fit_seeds object:
- The mean of the intercept (hint: covariates are also called “fixed effects” and the mean is a summary).
- The median of the coefficient of x1.
- The full distribution for x1; plot it (hint: it’s called a “marginal distribution”).
Question: How does this model differ from the model without the random effect?
Question 3 (Not for assessment)
For this question we will look at the PM2.5 data, which can be downloaded from Quercus or with the following command (which may not work on all systems):
load(url("https://github.com/jgabry/bayes-vis-paper/blob/master/bayes-vis.RData?raw=True"))
class(GM)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
As that command shows, this is a special type of data frame called a SpatialPointsDataFrame. This is inconvenient for today’s analysis (although it will become convenient later), so for the moment we will extract the “slot” of the SpatialPointsDataFrame that contains the stuff we want. (We access a slot using an @ rather than a $ because R has its own internal logic.)
pm25_data = GM@data
There are three tasks for this question:
1. Use INLA to fit the linear model log(pm25) ~ log(sat_2014).
2. By modifying the exercise you did in Question 2, use INLA to fit a model with a random intercept
using the default priors, i.e.

log(pm25ij) | σ, β, μj ~ N(μj + β log(sat_2014ij), σ²)
μj | μ, τ ~ N(μ, τ²).
Using the summary.fixed and summary.random entries in the output list, find an estimate for the median
intercept in each region. (Hint: Use the numeric vector super_region rather than the Factor vector
super_region_name to index the random intercept.)
3. Fit the full multilevel model using INLA (with the default priors). This model is
log(pm25ij) | σ, βj, μj ~ N(μj + βj log(sat_2014ij), σ²)
μj | μ, τμ ~ N(μ, τμ²)
βj | β, τβ ~ N(β, τβ²).
One of the challenges you will have with this third exercise is that a formula cannot use the same
variable in the data more than once. To get around this, I suggest copying the variable as follows:
pm25_data_part3 = data.frame(pm25 = pm25_data$pm25,
                             sat_2014 = pm25_data$sat_2014,
                             super_region_intercept = pm25_data$super_region,
                             super_region_sat = pm25_data$super_region)
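With those copies in place, one way to write the formula is sketched below. This is only one possible encoding (it treats the region-specific slopes as βj = β + bj, and uses the second argument of f() as a weight on the iid effect); you should check it against your own understanding of the model:

# Precompute the log covariate so it can be reused as a weight inside f()
pm25_data_part3$log_sat = log(pm25_data_part3$sat_2014)
formula_part3 = log(pm25) ~ 1 + log_sat +
  f(super_region_intercept, model = "iid") +
  f(super_region_sat, log_sat, model = "iid")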
