Tuesday, April 7, 2015

Sex, death, and unbalanced designs (done Bayesian)

In classical analysis of variance (ANOVA), data with unequal samples sizes in the different cells can be a headache. These are often called "unbalanced" designs. But Bayesian analysis works identically for unbalanced designs as for balanced designs. This post shows an example. We'll see that the small-N cells naturally have less precise estimates and more shrinkage than the large-N cells.

Consider a study of the life span (i.e., longevity measured in days) of male subjects as a function of how much sexual activity they endured. The males were put into five different treatment groups, each group allowed different levels of sexual activity. You can look in Section 19.3.2 of DBDA2E, p. 561, for more details, but one more detail is that the species being studied was the fruit fly.

The analysis of the original balanced design yields this result (as in Figure 19.3, p. 563):

In particular, here are the estimates of the means for Pregnant1 and Pregnant8:

Notice above that the widths of the HDIs are about the same (i.e., about 11 days). The widths are about the same because (well, at least in part because) the two groups have the same amount of data (N=25).

Now we will unbalance the design by arbitrarily shifting most of the Pregnant8 data into the Pregnant1 group. Using the identical analysis program, with no changes except to the data, here is the result:

Notice above that the HDI for mu of Pregnant1 is narrower than before, because the sample size for that cell is now larger, while the HDI for mu of Pregnant8 is wider than before, because the sample size for that cell is now smaller.

Notice also the shrinkage on mu for Pregnant8: Although the data (see the figure) have a sample mean near 80, the estimated mu has a mode near 72. The shrinkage is caused by the hierarchical structure of the model.

In summary, it's neat that Bayesian models don't care about whether there are equal sample sizes in the cells. The model works the same way regardless, and the posterior distribution makes sense: The cell with small sample size has greater uncertainty and more shrinkage.

Appendix: Code for generating the results above.

The basic analysis is from the script Jags-Ymet-Xnom1grp-Mnormal-Example.R. To generate the graphs of mu, I added the following to the very end of the script:

mcmcMat = as.matrix( mcmcCoda )
p1idx = which(levels(myDataFrame[,xName])=="Pregnant1")
p8idx = which(levels(myDataFrame[,xName])=="Pregnant8")
layout(matrix(1:2, nrow=1))
plotPost( mcmcMat[,paste0("m[",p1idx,"]")] , main="Pregnant1" , xlab="mu" , xlim=c(50,100) )
plotPost( mcmcMat[,paste0("m[",p8idx,"]")] , main="Pregnant8" , xlab="mu" , xlim=c(50,100) )

To recode the group labels of the data, I used the following extra lines immediately after the data file is read in near the top of the script:

myDataFrame = read.csv( file="FruitflyDataReduced.csv" )
# Take lots of data out of Pregnant8 and recode it as Pregnant1:
p1 = myDataFrame[,"CompanionNumber"]=="Pregnant1"
p8 = myDataFrame[,"CompanionNumber"]=="Pregnant8"
myDataFrame[p8,"CompanionNumber"][1:22] = "Pregnant1"


  1. Dear Prof,

    what about if my data are unbalanced by choice and one of the levels of a factor is completely missing observations? Should I correct the model structure, for example setting the interaction for that level factor with other factor levels to 0? Should I re-adapt the sum to 0 par too?


  2. Well, if you have a very small design with empty cells, it would probably be better to reconceptualize the factors; e.g., a 2x2 with one empty cell would be better to treat as a 3-group design because you can't really estimate the main effects and interaction.

    But a larger design with missing cells is no problem. Chapter 20 of DBDA2E has an extended example; see Figure 20.3