## Friday, July 29, 2011

### How long should an MCMC chain be to get a stable HDI?

In the book, the default length of an MCMC chain differs from one program to another, ranging from around 2,000 to 10,000 total steps combined across chains (e.g., 3 chains with 1,000 steps each is 3,000 total steps). But how many steps are needed for a stable estimate of the HDI limits? Is 2,000 enough? Is 10,000 enough?

The issue motivating the question is that the MCMC chain is just one random sample from the true underlying posterior distribution. If we seeded the random chain differently (or used a few more or fewer burn-in steps, or thinned slightly more or less), we would have a different random MCMC chain, with slightly different HDI limits, because of random noise in the MCMC sample. When the MCMC chain gets very long, then in the limit it approximates the true posterior extremely well. But for chains of realistic length (not the unobtainable "in the limit" chain), how stable is the MCMC estimate of the HDI?

This blog entry reports some Monte Carlo simulations to illustrate the answer. Suppose that the true posterior is a known distribution, so we can easily generate random values from it. What we'll do is generate an MCMC chain from the true posterior, and estimate the (95%) HDI from the MCMC chain. We'll do that many times and keep track of the estimated HDI's, so we can see how much the estimated HDI's vary. For each MCMC chain we'll also determine the 2.5%ile and the 97.5%ile, to compare with the 95% HDI.
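The estimation procedure itself is simple to state: the HDI is estimated as the narrowest interval that contains 95% of the sorted chain values (the book's R programs do this in HDIofMCMC.R). Here is an illustrative Python sketch of that computation, compared against the equal-tailed percentile interval; the gamma shape and scale are arbitrary stand-ins, not the exact distribution used in the figures below:

```python
import random

def hdi_from_sample(sample, cred_mass=0.95):
    """Estimate the HDI as the narrowest interval containing
    cred_mass of the sorted sample values."""
    s = sorted(sample)
    n = len(s)
    k = int(cred_mass * n)  # number of steps spanned by the interval
    widths = [s[i + k] - s[i] for i in range(n - k)]
    i_min = widths.index(min(widths))
    return s[i_min], s[i_min + k]

def equal_tailed(sample, cred_mass=0.95):
    """The 2.5%ile-to-97.5%ile interval, for comparison."""
    s = sorted(sample)
    n = len(s)
    lo = s[int((1 - cred_mass) / 2 * n)]
    hi = s[int((1 + cred_mass) / 2 * n) - 1]
    return lo, hi

random.seed(1)
# Stand-in for one "MCMC chain": 1,000 draws from a skewed gamma.
chain = [random.gammavariate(2.0, 1.0) for _ in range(1000)]
print("95% HDI:", hdi_from_sample(chain))
print("95% equal-tailed:", equal_tailed(chain))
```

Because the gamma is right-skewed, the HDI sits to the left of the equal-tailed interval and is no wider, which previews the first point below.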

Here are results when the true posterior is a gamma distribution and the MCMC chain has length 1,000:
The top panel above shows the true gamma distribution from which the MCMC chains were generated. Each MCMC chain had 1,000 values in it, and for each chain the 95% HDI was estimated. This was done 20,000 times. The middle row of histograms shows the HDI limits across the 20,000 replications. The bottom row of histograms shows the percentiles across the 20,000 replications. The vertical dashed lines in the histograms show the true values of the HDI.
• Notice first that the 2.5%ile and 97.5%ile, in the bottom row, are very poor estimates of the HDI. This is because the true distribution is skewed, and the equal-tailed credible interval is quite different from the HDI.
• Notice next that there is indeed quite a lot of variation across the 20,000 replications. For example, the histogram of "HDI low" spans the true value of 0.0799, but different chains give different estimates. (The histogram displays the 95% HDI across the 20,000 replications going from 0.02559 to 0.1976. This is not the HDI of the MCMC chain! It is the HDI, across replications, of the estimated lower value of the HDI of the MCMC chain.)
• Notice finally that there is a bit of bias in the estimate of the limits of the HDI. Specifically, the estimated HDI is a little bit narrower than the true HDI. This makes intuitive sense: The estimated HDI finds the narrowest 95% interval in the finite MCMC sample. Because the tails of the distribution have more points sampled just inside the HDI than just outside it, there is, for short chains, a higher chance of finding a narrowest 95% interval just inside the true HDI. As the chain length gets larger, this small-sample bias is reduced, as can be seen in the next figures.
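The shrinking variability can be checked with a small stand-alone simulation. This Python sketch uses a stand-in gamma(shape=2, scale=1) "posterior" and only 200 replications (the figures above used 20,000), and shows that the spread of the estimated lower limit is smaller for longer chains:

```python
import random, statistics

def hdi95(sample):
    # Narrowest interval containing 95% of the sorted sample.
    s = sorted(sample)
    k = int(0.95 * len(s))
    i = min(range(len(s) - k), key=lambda i: s[i + k] - s[i])
    return s[i], s[i + k]

random.seed(42)
results = {}
for n in (1000, 10000):          # chain lengths to compare
    lows, widths = [], []
    for _ in range(200):         # replications (blog used 20,000)
        chain = [random.gammavariate(2.0, 1.0) for _ in range(n)]
        lo, hi = hdi95(chain)
        lows.append(lo)
        widths.append(hi - lo)
    # (SD of estimated lower limit, mean estimated width)
    results[n] = (statistics.stdev(lows), statistics.mean(widths))
    print(n, results[n])
```

The SD of the estimated lower limit across replications is several times smaller for chains of 10,000 than for chains of 1,000, consistent with the histograms above.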

Here are the results when the chain length is 5,000:

And here are the results when the chain length is 10,000:
When the chains have a length of 10,000, the bias in the HDI estimate is only slight. There is still, of course, some imprecision in the estimate of the HDI limits (and width). It is up to the user to ascertain whether this amount of imprecision is tolerable for the particular application, or whether a longer MCMC chain should be computed.

Conclusion: When the limits of the HDI are close to the decision boundary (such as the ROPE), or when the HDI width is only marginally narrow enough for the desired precision, then be sure to use a very long MCMC chain so that the HDI limits are stable. The graphs above help illustrate how long is "very long." In general, while the book uses chains on the order of 5,000 steps for pedagogical purposes, real research reports should probably use at least 10,000 steps, or many more as patience permits.

## Tuesday, July 26, 2011

### BRugs for Linux users

A colleague (who prefers to remain anonymous) provided the following very informative message, which I have posted here with his permission:

After your inspiring tutorial at CogSci 2011, I pushed a bit further on Linux compatibility. Martyn Plummer, the author of the related project JAGS, confirmed that both OpenBUGS and BRugs have been upgraded to work on Linux. The source and packaged binaries of OpenBUGS 3.2.1 are available for several distributions from http://www.openbugs.info/w/Downloads and the BRugs 0.7.1 tarball from http://www.openbugs.info/w/UserContributedCode. I installed on 32-bit Ubuntu 11.04, using

wajig force openbugs_3.2.1-1_i386.deb

to override the unneeded dependency on libc6-i386, a package for providing 32-bit support for libc6 on a 64-bit installation. From the commandline, I then issued

R CMD INSTALL BRugs_0.7.1.tar.gz

It compiled and installed cleanly. Starting R, I issued

windows <- x11

to remap the output and then ran a couple of your scripts for testing:

source("TutorialBRugsTest.R")

ran perfectly;

source("ANOVAtwowayBRugsSTZ.R")

generated some errors due to a couple of restatements of windows() in the script -- these appear to be unnecessary and can be removed, or one can add the "windows <- x11" remapping after each occurrence. A global replace also fixes the problem:

sed -i -r 's/windows/x11/' *R

A residual error is generated by savePlot's type="eps" (output to file):

Error in match.arg(type) :
'arg' should be one of “png”, “jpeg”, “tiff”, “bmp”

I'm guessing this is because my setup doesn't support eps output by default, but haven't investigated further. A quick fix is to replace with type jpg or png -- something like this should be safe:

sed -i -r 's/(.*)(.eps)(.*type=\")(eps)(\".*)/\1.jpg\3jpg\5/' *.R

In short, OpenBUGS and BRugs work great on Linux, and your scripts encounter difficulties only with the output. For an overview, see http://rwiki.sciviews.org/doku.php?id=tips:graphics-misc:export.

I very much appreciate all the work you've done with Bayesian data analysis and will spend some time learning the methods you have developed.

Best wishes,
...

P.S. In the medium to long run, you may want to look into switching from OpenBUGS to JAGS (and from BRugs to rjags). OpenBUGS is written in Component Pascal, a dead language with no debugging support, while JAGS is written in standard C++ and is deservedly getting the lion's share of developer energy.

## Monday, July 25, 2011

### Autocorrelation in Bayesian ANOVA: Fixing the book's overkill

In hierarchical Bayesian ANOVA (Ch's 18, 19, and 22 of the book), the model expresses the cell means in terms of deflections from baseline, such that the deflections sum to zero. The sum-to-zero constraint is only imposed after BUGS does its work; i.e., the deflection parameters in the BUGS model are not intrinsically constrained to sum to zero. The deflection parameters in the model can therefore become strongly autocorrelated, and extensive thinning may be necessary to attenuate autocorrelation in the unconstrained parameters.

But we don't care about the unconstrained parameters. What we care about is the sum-to-zero versions of the parameters, because those are the parameters we use to interpret the data. It turns out that autocorrelation in the sum-to-zero parameters is much less severe. I've known about this for quite a while, but haven't gotten around to modifying the programs until now.
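The reparameterization itself is simple. Here is a minimal Python sketch of the one-factor case (the actual programs do this in R/BUGS, and the numbers here are made up for illustration): the baseline absorbs the mean deflection, so the cell means are unchanged while the deflections sum to zero.

```python
def sum_to_zero(a0, a):
    """Convert an unconstrained baseline a0 and deflections a[j]
    (as sampled by BUGS at one MCMC step) to the sum-to-zero
    versions b0 and b[j]."""
    mean_a = sum(a) / len(a)
    b0 = a0 + mean_a                 # baseline absorbs the mean deflection
    b = [aj - mean_a for aj in a]    # deflections now sum to zero
    return b0, b

a0, a = 50.0, [3.0, -1.0, 4.0]       # made-up values for one MCMC step
b0, b = sum_to_zero(a0, a)
print(b0, b)   # cell means a0 + a[j] == b0 + b[j] are preserved
```

The two-factor case is analogous: row, column, and interaction deflections each have their marginal means subtracted out and folded into the lower-order terms.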

Here's an example. This comes from the two-factor hierarchical Bayesian ANOVA in Section 19.1.3 (p. 520) of the book, regarding salaries. The deflection parameters in the BUGS model are labeled a1[], a2[], and a1a2[,]. The sum-to-zero versions of the parameters are labeled b1[], b2[], and b1b2[,]. Here are the a1[] chains when the thinning is set to 75:

Obviously there is nasty autocorrelation, and much more thinning would be needed if we cared about these parameters. (Added Nov. 7, 2011: Regarding thinning, see a later post which points out that thinning is not advisable. But the remainder of this post, below, is still important for better interpretation of the parameters in Bayesian ANOVA.)

But we don't care about a1[], instead we care about the sum-to-zero reparameterization, b1[]. Here are the b1[] chains derived from the chains above:

There is minimal autocorrelation in the sum-to-zero versions of the parameters.

Therefore, the versions of the programs that accompany the book, which assess and attenuate autocorrelation in the non-sum-to-zero parameters, thin far more than is necessary. The results are not wrong, merely overkill.

I have created new versions of two of the ANOVA programs such that the sum-to-zero constraint is computed inside the BUGS model statement, and therefore the sum-to-zero parameters can be assayed using the plotChains() command. The two new programs are ANOVAonewayBRugsSTZ.R and ANOVAtwowayBRugsSTZ.R. These programs use far less thinning than the original programs to achieve the same results. Other ANOVA programs, such as those that allow heterogeneous variances and the Poisson exponential ANOVA, can be modified analogously.

## Wednesday, July 6, 2011

### Another example of null-value assessment by estimation or model comparison

The recent article in Perspectives on Psychological Science (see blog post here) showed an example of estimating the bias of a coin, and assessing the credibility that the coin is fair. In that article I showed that the Bayes factor can change dramatically when the alternative hypothesis is changed, but the Bayesian estimate of the bias barely changes at all.

Here I show an analogous example of estimating the mean and standard deviation for data described by a normal distribution. In other words, it's an example to which frequentists would apply a single-sample t test. The issue is assessing whether the mean is credibly non-zero. The results show that the Bayes factor can change dramatically when the alternative hypothesis is changed, but the Bayesian estimate of the mean barely changes at all. This fact is not news; what is new is illustrating it in BUGS as hierarchical model comparison, using the style of programming used in the book.

One other novelty in this post is a demonstration of the model comparison preferring the null even when the uncertainty in the estimate is large. That conclusion seems very suspicious, and once again the estimation approach yields the more informative conclusion.

For purposes of illustration, I set up a hierarchical model comparison in BUGS, analogous to the example in Section 10.2.1 (p. 244) of the book. Both models are the ordinary Bayesian estimation of mu and sigma of a normal likelihood, exactly as in Chapter 15 (e.g., p. 396) of the book. All that differs between models is the prior on mu. One model represents the null hypothesis and puts an extremely narrow "spike" prior on mu, normal(mean=0,SD=0.01). The other model represents an alternative hypothesis and puts a relatively wide prior on mu, such as normal(mean=0,SD=20). For both models, the prior on sigma is uniform(low=0,high=10), which is diffuse relative to the data that will be entered. Altogether, the hierarchical model estimates five parameters: mu_alt and sigma_alt in the alternative model, mu_null and sigma_null in the null model (with mu_null residing close to zero because of the spike prior), and the model index parameter (which is given a 50/50 prior).

The data set was merely a random sample of 40 values from a normal distribution, re-scaled so that the sample had a mean of 0.8 and a standard deviation of 2.0. The resulting posterior distribution (20,000 steps in the MCMC chain) looks like this:
Notice that the model index, displayed in the central plot, substantially prefers the null model, 0.751 to 0.249. In other words, the Bayes factor is 0.751/0.249 in favor of the null hypothesis. But notice that the estimate of mu, shown in the upper right, has a 95% HDI from 0.1814 to 1.453, clearly excluding the null value. From these results, should we conclude that the null hypothesis is credible, as suggested by the Bayes factor, or should we conclude that the null value is not among the credible values, as suggested by the parameter estimate? To me, the latter makes more sense.

When the alternative prior is changed to be a bit less diffuse, as normal(mean=0,SD=1.5), the posterior instead looks like this:
Now the model index substantially prefers the alternative model: the posterior probabilities are 0.225 for the null and 0.775 for the alternative, so the Bayes factor now favors the alternative. Thus the model comparison has completely flip-flopped. But the 95% HDI of the estimated mu (upper right) has changed only slightly, now going from 0.1413 to 1.377.
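The flip-flop can be anticipated with a back-of-the-envelope calculation. If we simplify by treating sigma as known (this is an approximation for intuition, not the hierarchical BUGS computation above), the Bayes factor for the sample mean has a closed form, and plugging in the same data summary shows the reversal as the prior SD on mu changes:

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf_null_vs_alt(ybar, sigma, n, prior_sd):
    """Bayes factor (null/alternative) for the sample mean, treating
    sigma as known: under the null, ybar ~ N(0, sigma^2/n); under the
    alternative with prior mu ~ N(0, prior_sd^2),
    ybar ~ N(0, prior_sd^2 + sigma^2/n)."""
    se2 = sigma ** 2 / n
    return (normal_pdf(ybar, math.sqrt(se2))
            / normal_pdf(ybar, math.sqrt(prior_sd ** 2 + se2)))

# Same data summary as in the post: N=40, sample mean 0.8, sample SD 2.0.
for prior_sd in (20.0, 1.5):
    # BF ~ 2.6 (favors null) for the diffuse prior,
    # BF ~ 0.23 (favors alternative) for the moderate prior.
    print(prior_sd, bf_null_vs_alt(0.8, 2.0, 40, prior_sd))
```

The Bayes factor swings from favoring the null to favoring the alternative merely by changing the width of the alternative prior, while the data (and hence the estimate of mu) stay fixed.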

As a final demo, consider a case in which there is a small sample (N=10) with a sample mean of zero:
Here, the model comparison overwhelmingly prefers the null hypothesis (with a Bayes factor of 0.967/0.033), even though the estimate of mu (upper right) is extremely uncertain, with a 95% HDI ranging from -1.6 to +1.6, which is 1.6 SD's of the sample data! From these results, do we strongly believe the null hypothesis, as the Bayes factor suggests, or do we instead believe that the null value is among the credible values, with a large uncertainty? To me, the latter makes more sense.

## Friday, July 1, 2011

### Horses, donkeys, and mules

This anti-Bayesian joke is floating around various sites on the internet:
"A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."
In that case, everyone is a Bayesian. Researchers in perceptual science have shown over and over that background knowledge and expectation can strongly influence perception, especially when the visual stimulus is unclear. Anecdotal cases of this abound; e.g., the 1950s TV show in which people say (roughly) "Look, up in the sky, it's a bird, no, it's a plane, no, it's Superman!" Their prior knowledge indicates that flying things should be birds or airplanes, and it's only with accumulating visual data that the prior is overwhelmed.

And there is an error in the joke. If an actual Bayesian vaguely expects a horse and catches only a glimpse of a donkey, then s/he believes s/he has seen a mule only with great uncertainty, not strongly.

### A Bayesian and a frequentist are out for a drive...

(Photo by Cary Wolinsky)

A Bayesian and a frequentist are out for a springtime drive in the country. The Bayesian notices a flock of sheep on the hillside, and says, "Look, the sheep have been shorn." The frequentist replies: "Well, on the sides facing us!"

[Meaning: The Bayesian is using prior knowledge to infer that the sides of the sheep away from the viewer have also been shorn, but the frequentist refuses to use prior knowledge.]