Note: Windows users in both R and Python might need RTools installed (it provides the C++ toolchain needed to compile Stan models). Martin has a flash drive with RTools installers so that you don’t have to wait for the large download.
In R, we will use the cmdstanr package; follow https://mc-stan.org/cmdstanr/articles/cmdstanr.html
In Python, we will use the CmdStanPy package; follow https://mc-stan.org/cmdstanpy/installation.html
Create a file called initial.stan and paste the following simple model:
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
}
model {
  // likelihood
  target += normal_lpdf(y | mu, 1);
  // prior
  target += normal_lpdf(mu | 0, 1);
}
Note that we express all contributions to the target density explicitly with target +=.
The corresponding statistical model is
\[ y_i \sim N(\mu, 1) \\ \mu \sim N(0,1) \]
i.e. we are estimating the mean of a normal distribution with known standard deviation and a somewhat informative prior on the mean.
Note: You’ll notice that many Stan models use statements like y ~ normal(mu, 1) instead of target += normal_lpdf(y | mu, 1). We will discuss this alternative syntax and the differences in the next lesson.
Simulate data for the model with 10 data points, using the same structure as the model assumes, i.e. draw \(\mu\) from its prior, then draw \(y\) (using rnorm in R, or numpy.random.Generator.normal in Python).
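A minimal Python sketch of this simulation (the seed and variable names are just suggestions):

import numpy as np

rng = np.random.default_rng(seed=2024)  # arbitrary seed, just for reproducibility

N = 10
mu_true = rng.normal(loc=0, scale=1)           # draw mu from its prior N(0, 1)
y = rng.normal(loc=mu_true, scale=1, size=N)   # draw y_i ~ N(mu, 1)
print(f"true mu: {mu_true:.3f}")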
Compile the model, prepare the data in a format for the model, fit the model and display a summary:
In R, you can use options(mc.cores = 4) to compute chains in parallel.
In Python, convert the fit to InferenceData via arviz.from_cmdstanpy and then use arviz.summary.
Inspect the summary - how big is the uncertainty? Are the “true” values we used in simulation recovered?
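For reference, a minimal Python sketch of this compile-and-fit workflow, assuming N and y come from the simulation above (exact argument names can differ slightly between CmdStanPy versions):

import arviz as az
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="initial.stan")   # compiles the model on first use
data = {"N": N, "y": y}                          # names must match the Stan data block
fit = model.sample(data=data, chains=4, parallel_chains=4)

idata = az.from_cmdstanpy(posterior=fit)
print(az.summary(idata))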
Add a sigma parameter of type real to model the standard deviation, with a half-normal prior, i.e. the new mathematical model is:
\[ y_i \sim N(\mu, \sigma) \\ \mu \sim N(0,1) \\ \sigma \sim HalfN(0, 2) \]
Since the standard deviation cannot be negative, add a <lower=0> constraint to sigma.
HMC needs the parameter space to be Euclidean (i.e. \(\mathbb{R}^K\) for some \(K\)), so when a parameter is constrained, Stan applies an implicit transformation and will work with log(sigma) internally.
There is no built-in half-normal distribution in Stan. But it turns out that using a normal prior on a variable constrained to be positive is the same as using a half-normal distribution. We will go into the reasons for this and some nuances next time.
Generate data according to this model, i.e. first draw \(\mu\) and \(\sigma\) from their priors, then draw \(y\). You can draw from the half-normal distribution simply by taking the absolute value of a draw from the normal distribution.
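For example, in Python (seed chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(seed=1)

N = 10
mu_true = rng.normal(0, 1)             # mu ~ N(0, 1)
sigma_true = abs(rng.normal(0, 2))     # sigma ~ HalfN(0, 2) via |N(0, 2)|
y = rng.normal(mu_true, sigma_true, size=N)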
Explore the summary, do you recover the true parameters?
Use the model from the previous task with a dataset of size 1. What happens?
Is there a reason the model may not work well with just a single datapoint?
Does the model from Task 2 also have issues with a dataset of size 1? Why?
Bonus: Try this if you feel it won’t take you too much time. Assume that the single data point is \(y = 1\), and make a contour and/or heatmap plot of the posterior density as a function of \(\mu\) and \(\sigma\) (in R use dnorm(log = TRUE), in Python use scipy.stats.norm and the logpdf method).
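A Python sketch of this bonus plot, assuming the priors from the sigma model (\(\mu \sim N(0,1)\), \(\sigma \sim HalfN(0,2)\)). It plots the unnormalized log posterior on a grid; exponentiate log_post if you prefer the density itself:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

y_obs = 1.0
mu_grid = np.linspace(-3, 3, 200)
sigma_grid = np.linspace(0.01, 5, 200)      # sigma must stay positive
mu, sigma = np.meshgrid(mu_grid, sigma_grid)

# Unnormalized log posterior = log likelihood + log priors.
# The half-normal prior on sigma differs from the full normal logpdf only by a
# constant (log 2), which does not change the shape of the contours.
log_post = (
    norm.logpdf(y_obs, loc=mu, scale=sigma)   # likelihood of the single data point
    + norm.logpdf(mu, loc=0, scale=1)         # prior: mu ~ N(0, 1)
    + norm.logpdf(sigma, loc=0, scale=2)      # prior: sigma ~ HalfN(0, 2), up to a constant
)

plt.contourf(mu, sigma, log_post, levels=30)
plt.xlabel("mu")
plt.ylabel("sigma")
plt.colorbar(label="unnormalized log posterior")
plt.show()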
Generally, to get a basic understanding of warnings about convergence from Stan, the guide at https://mc-stan.org/misc/warnings is IMHO quite good (I wrote most of it :-D).
Add two new data elements of type real, mu_prior_mean and mu_prior_sd, to define the location and scale of the normal prior for \(\mu\).
Should those data have some constraints attached? If yes, add them as well.
Test that the model now works with some of the datasets you generated previously.
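In Python, the only change is that the data dictionary now also carries the prior hyperparameters (the file name, seed and values below are placeholders; the dictionary names must match your data block):

import numpy as np
from cmdstanpy import CmdStanModel

rng = np.random.default_rng(seed=3)
y = rng.normal(0.5, 1.0, size=10)                # any of the datasets simulated earlier works here

model = CmdStanModel(stan_file="initial.stan")   # assuming you edited the same file
data = {
    "N": len(y),
    "y": y,
    "mu_prior_mean": 0.0,   # location of the normal prior on mu
    "mu_prior_sd": 1.0,     # scale of the normal prior on mu (must be positive)
}
fit = model.sample(data=data, chains=4)
print(fit.summary())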
Generate a dataset of size 10 with the prior \(\mu \sim N(0,1), \sigma \sim HalfN(0, 2)\).
Find the least extreme values of mu_prior_mean and mu_prior_sd that make the posterior generated by Stan exclude the true value of \(\mu\) from your simulation.
Try the same task with a dataset of size 3 and a dataset of size 100.
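One possible brute-force search in Python, shown only for mu_prior_mean with the prior sd held at 1 (the file name and search grid are assumptions):

import numpy as np
from cmdstanpy import CmdStanModel

rng = np.random.default_rng(seed=5)
mu_true = rng.normal(0, 1)
sigma_true = abs(rng.normal(0, 2))
y = rng.normal(mu_true, sigma_true, size=10)

model = CmdStanModel(stan_file="initial.stan")   # the version with prior hyperparameters as data

# Walk the prior location away from 0 until the central 95% posterior interval
# no longer contains the true mu.
for prior_mean in np.arange(0.0, 20.5, 0.5):
    data = {"N": len(y), "y": y, "mu_prior_mean": prior_mean, "mu_prior_sd": 1.0}
    fit = model.sample(data=data, chains=4, show_progress=False)
    lo, hi = np.quantile(fit.stan_variable("mu"), [0.025, 0.975])
    if not (lo < mu_true < hi):
        print(f"Posterior excludes the true mu once mu_prior_mean = {prior_mean}")
        break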
Expand the model so that you now have two sets of data (possibly of unequal size), each belonging to a different group, and estimate the mean in the first group and the difference in means between the groups. More specifically, the model should look like this:
\[ y_A \sim N(\mu, \sigma_A) \\ y_B \sim N(\mu + \delta, \sigma_B) \\ \mu \sim N(0, 1)\\ \delta \sim N\left(0, \frac{1}{2}\right) \\ \sigma_A, \sigma_B \sim HalfN(0, 1) \]
(you can keep the location and scale of the prior as data, or you can hardcode them)
Repeatedly simulate data according to the model and fit it. How big do the groups need to be, so that your posterior interval for \(\delta\) excludes 0 most of the time?
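A possible simulation loop in Python; the file name two_groups.stan and the data variable names (N_A, N_B, y_A, y_B) are assumptions and should be adjusted to match your own model:

import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="two_groups.stan")
rng = np.random.default_rng(seed=7)

def delta_interval_excludes_zero(n_per_group):
    """Simulate one dataset from the priors, fit, and check the 95% interval for delta."""
    mu = rng.normal(0, 1)
    delta = rng.normal(0, 0.5)
    sigma_A, sigma_B = abs(rng.normal(0, 1)), abs(rng.normal(0, 1))
    y_A = rng.normal(mu, sigma_A, size=n_per_group)
    y_B = rng.normal(mu + delta, sigma_B, size=n_per_group)
    data = {"N_A": n_per_group, "N_B": n_per_group, "y_A": y_A, "y_B": y_B}
    fit = model.sample(data=data, chains=4, show_progress=False)
    lo, hi = np.quantile(fit.stan_variable("delta"), [0.025, 0.975])
    return not (lo < 0 < hi)

for n in (10, 30, 100, 300):
    hits = sum(delta_interval_excludes_zero(n) for _ in range(20))  # 20 simulations per group size
    print(f"group size {n}: interval excluded 0 in {hits}/20 simulations")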