Wednesday, November 16, 2016

Is it legitimate to view the data and then decide on a distribution for the dependent variable?

An emailer asks,
In Bayesian parameter estimation, is it legitimate to view the data and then decide on a distribution for the dependent variable? I have heard that this is not “fully Bayesian”.
The shortest questions often probe some of the most difficult issues; this is one of those questions.

Let me try to fill in some details of what this questioner may have in mind. First, some examples:
  • Suppose we have some response-time data. Is it okay to look at the response-time data, notice they are very skewed, and therefore model them with, say, a Weibull distribution? Or must we stick with a normal distribution because that was the mindless default distribution we might have used before looking at the data? Or, having noticed the skew in the data and decided to use a skewed model, must we now obtain a completely new data set for the analysis? (A small code sketch of this normal-versus-Weibull comparison follows this list.)
  • As another example, is it okay to look at a scatter plot of population-by-time data, notice they have a non-linear trend, and therefore model them with, say, an exponential growth trend? Or must we stick with a linear trend because that was the mindless default trend we might have used before looking at the data? Or, having noticed the non-linear trend in the data and decided to use a non-linear model, must we now obtain a completely new data set for the analysis?
  • As a third example, suppose I’m looking at data about the positions of planets and asteroids against the background of stars. I’m trying to fit a Ptolemaic model with lots of epicycles. After considering the data for a long time, I realize that a completely different model, involving elliptical orbits with the sun at a focus, would describe the data nicely. Must I stick with the Ptolemaic model because it’s what I had in mind at first? Or, having noticed the Keplerian trend in the data, must I now obtain a completely new data set for the analysis?
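
To make the first example concrete, here is a minimal sketch (in Python with PyMC and ArviZ; every prior, variable name, and data set below is an illustrative assumption, not anything from the emailer) of fitting the same fake skewed response-time data with a default normal model and with a Weibull model, then comparing the two with LOO:

```python
# Minimal sketch (illustrative assumptions throughout): fit fake, positively
# skewed "response times" with a normal and a Weibull likelihood, then compare.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(seed=1)
rt = rng.weibull(1.5, size=200) * 0.4        # fake skewed RTs, in seconds

with pm.Model() as m_normal:                 # the mindless default model
    mu = pm.Normal("mu", 0.5, 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=rt)
    id_normal = pm.sample(idata_kwargs={"log_likelihood": True})

with pm.Model() as m_weibull:                # the model chosen after seeing the skew
    alpha = pm.HalfNormal("alpha", 5.0)      # shape
    beta = pm.HalfNormal("beta", 1.0)        # scale
    pm.Weibull("y", alpha=alpha, beta=beta, observed=rt)
    id_weibull = pm.sample(idata_kwargs={"log_likelihood": True})

# Leave-one-out comparison of expected predictive accuracy:
print(az.compare({"normal": id_normal, "weibull": id_weibull}))
```

Whether a better LOO score for the Weibull model reflects real structure or a fluke in this one data set is exactly the worry taken up next.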
What is the worry of the unnamed person who said it would not be “fully Bayesian” to look at the data and then decide on the model? I can think of a few possible worries:

One worry might be that selecting a model after considering the data is HARKing (hypothesizing after the results are known; Kerr 1998, http://psr.sagepub.com/content/2/3/196.short). Kerr even discusses Bayesian treatments of HARKing (pp. 206-207), but this is not a uniquely Bayesian problem. In that famous article, Kerr discusses why HARKing may be inadvisable. In particular, HARKing can transform Type I errors (false alarms) into apparently confirmed hypotheses, ergo “facts.” With respect to the three examples above, the skew in the RT data might be a random fluke, the non-linear trend in the population data might be a random fluke, and the better fit by the heliocentric ellipse might be a random fluke. Well, the only cure for Type I errors (false alarms) is replication (as Kerr mentions). Pre-registered replication. There are lots of disincentives to replication attempts, but these disincentives are gradually being mitigated by recent innovations like registered replication reports (https://www.psychologicalscience.org/publications/replication). In the three examples above, there are lots of known replications of skewed RT distributions, exponential growth curves, and elliptical orbits.

Another worry might be that the analyst had a tacit but strong theoretical commitment to a particular model before collecting the data, and then reneged on that commitment by sneaking a peek at the data. With respect to the three examples above, it may have been the case that the researcher had a strong theoretical commitment to normally distributed data, but, having noticed the skew, failed to mention that theoretical commitment and used a skewed distribution instead. And analogously for the other examples. But I think this mischaracterizes the usual situation of generic data analysis. The usual situation is that the analyst has no strong commitment to a particular model and is trying to get a reasonable summary description of the data. To be “fully Bayesian” in this situation, the analyst should set up a vast model space that includes all sorts of plausible descriptive models, including many different noise distributions and many different trends, because this would more accurately capture the prior uncertainty of the analyst than a prior with only a single default model. But doing that would be very difficult in practice, as the space of possible models is infinite. Instead, we start with some small model space, and refine the model as needed.

Bayesian analysis is always conditional on the assumed model space. Often the assumed model space is merely a convenient default. The default is convenient because it is familiar to both the analyst and the audience of the analysis, but the default need not be a theoretical commitment. There are also different goals for data analysis: describing the one set of data in hand, and generalizing to the population from which the data were sampled. Various methods for penalizing overfitting of noise are aimed at finding a statistical compromise between describing the data in hand and generalizing to other data. Ultimately, I think the tension is only resolved by (pre-registered) replication.
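
Continuing the hypothetical sketch above, LOO is one such penalizing method: it estimates out-of-sample predictive accuracy by, in effect, subtracting an effective-parameter penalty from the in-sample fit.

```python
# Continuing the earlier sketch (illustrative): LOO penalizes overfitting.
import arviz as az

loo_weibull = az.loo(id_weibull)
print(loo_weibull.elpd_loo)  # penalized predictive score (higher is better)
print(loo_weibull.p_loo)     # estimated effective number of parameters
```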

This is a big issue over which much ink has been spilled, and the above remarks are only a few off-the-cuff thoughts. What do you think is a good answer to the emailer's question?

4 comments:

  1. Hi John,
    In my opinion, ensuring a well-fitting model is very important. In the multilevel frequentist literature, one often uses model selection to choose the most appropriate distribution. If my data are zero-inflated, do I really want a Poisson model that does not take this into account? In Bayes, LOO can be used to produce a measure of fit, while posterior predictive checks provide visual checks (sketched below).

    Like all things, this approach can be abused: for instance, selecting a distribution under which some credible interval now excludes zero.

    I end by stating, strongly, that researchers should spend more time modeling their data.
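
    (A quick sketch of such a posterior predictive check, continuing the hypothetical PyMC models from the post above; purely illustrative:)

    ```python
    # Posterior predictive check of the normal model from the earlier sketch.
    import arviz as az
    import pymc as pm

    with m_normal:
        pm.sample_posterior_predictive(id_normal, extend_inferencedata=True)
    az.plot_ppc(id_normal)  # visual check: does the model reproduce the skew?
    ```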

    Replies
    1. In addition, through modeling, one can vary assumptions to see how robust the findings are.

      A similar argument could be made about priors: should these be registered as well? Outside of a registered replication, many models could be fit to see how the evidence varies with different priors, or any other assumption.

  2. If you have data to spare, holding out a test set is always a good way to sanity-check your choices.
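
    A minimal sketch of that idea (hypothetical data and names, quick maximum-likelihood fits via SciPy just to keep it short):

    ```python
    # Fit on a training split, then score the held-out split under each model.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=2)
    data = rng.weibull(1.5, size=300) * 0.4
    train, test = data[:200], data[200:]

    norm_params = stats.norm.fit(train)                 # (loc, scale)
    weib_params = stats.weibull_min.fit(train, floc=0)  # (shape, loc, scale)

    print("normal  held-out log-lik:", stats.norm.logpdf(test, *norm_params).sum())
    print("weibull held-out log-lik:", stats.weibull_min.logpdf(test, *weib_params).sum())
    ```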

  3. My view on this is here: http://simkovic.github.io/2016/11/21/With-pre-registered-fully-bayesian-analyses-against-the-pre-registered-replications.html
