Tuesday, December 22, 2015

Bayesian Item Response Theory in JAGS: A Hierarchical Two Parameter Logistic Model

I recently created a hierarchical two-parameter logistic model for item response theory (IRT). The JAGS script is now in the folder of scripts that accompany the book (available at the book's web site; click book cover at right). 

Below are slides that accompany my presentation of the material. I hope the slides are self-explanatory for those of you who are already familiar with IRT, and maybe even for those of you who are not. Maybe one day I'll record a narration over the slides and post a video. Meanwhile, I hope the slides below are useful.

By the way, if you find this program to be useful and adapt it to your own real data, please let me know about the results. (And if you write up the results, it's okay to cite this blog post and the book :-)

All analyses begin with the data. The data are correct/wrong answers for items on a test. The JAGS program will use long format for the data:
(Click on any image to enlarge.)
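For readers unfamiliar with long format, here is a minimal sketch in Python (not the JAGS/R code itself; the subject and item labels are hypothetical, invented for illustration). Each row pairs one subject with one item and records the 1/0 response:

```python
# Hypothetical long-format data: one row per subject-item response.
# Columns: subject ID, item ID, correct (1) / wrong (0).
data = [
    ("S1", "Item1", 1),
    ("S1", "Item2", 0),
    ("S2", "Item1", 1),
    ("S2", "Item3", 1),
]

# Subjects need not answer every item; a missing subject-item pair
# simply has no row, which long format handles naturally.
subjects = sorted({row[0] for row in data})
items = sorted({row[1] for row in data})
print(subjects)  # ['S1', 'S2']
print(items)     # ['Item1', 'Item2', 'Item3']
```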

Each item is modeled as producing a 1/0 response with probability that depends on the item's difficulty and the subject's ability:
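To make the dependence concrete, here is a small Python sketch of the two-parameter logistic response probability (a generic 2PL form, not a transcription of the JAGS script; the particular ability, difficulty, and discrimination values below are made up):

```python
import math

def p_correct(ability, difficulty, discrimination):
    """Two-parameter logistic (2PL) probability of a correct response:
    P(y = 1) = 1 / (1 + exp(-discrimination * (ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# When ability equals difficulty, the probability of success is exactly 0.5:
print(p_correct(50.0, 50.0, 0.1))  # 0.5

# Higher discrimination makes the curve steeper around the difficulty,
# so an above-difficulty subject succeeds with higher probability:
print(p_correct(60.0, 50.0, 0.5) > p_correct(60.0, 50.0, 0.1))  # True
```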

There is indeterminacy in the difficulty/ability scale, and two points must be arbitrarily pinned down. I chose the following: The most difficult not-impossible item is given a difficulty of 100, and the easiest non-trivial item is given a difficulty of 0. Then I opted to put the subject abilities and item discriminations under higher-level distributions, to get the benefits of shrinkage in a hierarchical model:

Here is the JAGS script itself (again, see the program folder that accompanies the book):

Now for the data used to demonstrate the model:

Here are results from the JAGS run. First, the MCMC diagnostics all look good:
(Click on any image to enlarge.)

Here are the estimates of the individual items, listed in order of estimated difficulty. Notice that the easiest item has its difficulty fixed at 0, and the most difficult has its difficulty fixed at 100.

Subject abilities:


Two of the (simulated) subjects have data from only a single item. Shrinkage from the hierarchical model keeps their estimated abilities from being unbounded, and the Bayesian estimation provides explicit degrees of uncertainty on the estimates:

And in Bayesian estimation it is easy to make comparisons of item difficulties, or of item discriminations, or of subject abilities:
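Such comparisons amount to taking step-by-step differences of the MCMC chains. Here is a hedged Python sketch using simulated stand-in chains (real chains would come from the JAGS output; the normal draws and the difficulty values 40 and 55 are hypothetical):

```python
import random

random.seed(1)
# Stand-in MCMC chains for two item difficulties (normal draws for demo).
chain_a = [random.gauss(40.0, 5.0) for _ in range(10000)]
chain_b = [random.gauss(55.0, 5.0) for _ in range(10000)]

# A comparison is just the step-by-step difference of the chains;
# the resulting sample is the posterior of the difference in difficulty.
diff = [b - a for a, b in zip(chain_a, chain_b)]
mean_diff = sum(diff) / len(diff)
prob_b_harder = sum(d > 0 for d in diff) / len(diff)

print(mean_diff)             # posterior mean difference, near 15
print(prob_b_harder > 0.9)   # True: item B is credibly harder than item A
```

The same subtraction works for discriminations or subject abilities; only the chains being differenced change.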


  1. Dear John,

    Very interesting stuff. Have you thought at all about how you might go about assessing the absolute fit of the model? In frequentist IRT analysis people use various types of LRTs, comparing the expected and observed deviations against a chi-squared distribution. However, as samples get large (e.g., n > 10000), interpretation becomes difficult, as even the smallest deviation results in highly significant p values. Could the method you propose here be used to look at posterior predictive fits to get an idea about absolute model fit? And if yes, do you have any recommendations on how to approach this in principle?



  2. Gelman et al.'s Bayesian Data Analysis contains extensive details on evaluating Bayesian models. You can do posterior predictive checks, such as posterior predictive p-values. But in the simplest case, you can simulate data and check interval coverage. If the model is well calibrated, you'll get the appropriate coverage.
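    A coverage check of that kind can be sketched in a few lines of Python. This is a deliberately simple stand-in (a normal mean with known sigma and a flat prior, not the IRT model itself); the idea is the same: draw a true parameter, simulate data, and count how often the central 95% posterior interval covers the truth.

    ```python
    import random

    random.seed(0)

    def coverage(n_sims=2000, n_obs=20, sigma=1.0):
        """Fraction of simulations in which the 95% posterior interval
        covers the true mean. Under a flat prior with known sigma, the
        posterior is Normal(xbar, sigma^2 / n_obs)."""
        hits = 0
        for _ in range(n_sims):
            mu_true = random.gauss(0.0, 1.0)          # draw a "true" mean
            xs = [random.gauss(mu_true, sigma) for _ in range(n_obs)]
            xbar = sum(xs) / n_obs
            se = sigma / n_obs ** 0.5
            lo, hi = xbar - 1.96 * se, xbar + 1.96 * se  # central 95% interval
            hits += lo <= mu_true <= hi
        return hits / n_sims

    print(0.92 < coverage() < 0.98)  # True: close to the nominal 95%
    ```

    For the IRT model the same logic applies, just with simulated abilities and difficulties and intervals taken from the MCMC output.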