Saturday, December 12, 2015

Lessons from Bayesian disease diagnosis: Don't over-interpret the Bayes factor

This post has been revised and is now superseded by the new post linked >HERE<.

"Captain, the prior probability of this
character dying is extremely small."
A primary example of Bayes' rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here's one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes' rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.
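The arithmetic in this example is easy to check in R (the variable names here are mine, chosen for this illustration):

```r
# Stated rates for the disease-diagnosis example:
pDis = 0.01  # prior probability of having the disease
hit  = 0.97  # p( D=pos | M=disease ), the detection rate
fa   = 0.05  # p( D=pos | M=healthy ), the false-alarm rate
# Marginal probability of testing positive:
pPos = hit * pDis + fa * (1 - pDis)
# Bayes' rule for the posterior probability of disease given a positive test:
pDisGivenPos = hit * pDis / pPos
round( pDisGivenPos , 3 )  # 0.164, i.e., about 16%
```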

The ratio of the detection and false-alarm rates is the Bayes factor, hence the lesson of the primary example is: Do not use the Bayes factor alone to make decisions; instead always incorporate the prior probabilities of the options. Unfortunately, this lesson is often forgotten or at least de-emphasized in general model-selection settings. In model-selection settings, some people compute the Bayes factor and use it alone as the basis for model selection. The tacit assumption is that the models all have equal prior probabilities. But this assumption of equal prior probabilities seems implausible in most situations, and should be explicitly justified.

If we should not rely on the Bayes factor alone to make decisions in disease diagnosis or drug testing, why should we rely on the Bayes factor alone to make decisions in other model-selection applications?

In this post I first explain the role of the Bayes factor in the usual simple disease-diagnosis or drug-screening situation. Then I extend to a situation in which the prior probabilities are not certain. I show that as uncertainty in the prior probabilities grows, the posterior odds move toward the Bayes factor, though even with utter uncertainty they do not equal it. Either way, the decision should not be based on the Bayes factor alone.


Bayes' rule applied to disease diagnosis

A person is randomly selected from a population in which the prior probability of having a particular disease is 1%. Denote the condition of the person by the variable \(M\), which can have two values, either \(M=\mbox{healthy}\) or \(M=\mbox{disease}\). The prior probability can be written as \(p(M=\mbox{disease}) = 0.01\), and hence \(p(M=\mbox{healthy}) = 0.99\).

A diagnostic test for the disease can show a "positive" result, suggesting the disease is present, or a "negative" result, suggesting the disease is absent. The result of the test is the datum, denoted \(D\), which can have two values \(D=\mbox{pos}\) or \(D=\mbox{neg}\). The detection rate of the test is 97%, meaning \(p( D=\mbox{pos} | M=\mbox{disease} ) = 0.97\). This is also called the "sensitivity" of the test. The false-alarm rate of the test is 5%, meaning \(p( D=\mbox{pos} | M=\mbox{healthy} ) = 0.05\). The complement of the false alarm rate is called the "specificity" of the test, which in this example is 95%.

Applying Bayes' rule yields: \[
p(  M=\mbox{disease}| D=\mbox{pos} ) = p( D=\mbox{pos} | M=\mbox{disease} ) p( M=\mbox{disease} ) / p( D=\mbox{pos} )
\] and \[
p(  M=\mbox{healthy}| D=\mbox{pos} ) = p( D=\mbox{pos} | M=\mbox{healthy} ) p( M=\mbox{healthy} ) / p( D=\mbox{pos} )
\] In the above two equations, the denominator \( p( D=\mbox{pos} ) \) is the same, computed as \( \sum_m p( D=\mbox{pos} | M=m ) \, p( M=m ) \). The ratio of the two equations is \[
\underbrace{\frac{p( M=\mbox{disease} | D=\mbox{pos} )}{p( M=\mbox{healthy} | D=\mbox{pos} )}}_{\mbox{posterior odds}}
=
\underbrace{\frac{p( D=\mbox{pos} | M=\mbox{disease} )}{p( D=\mbox{pos} | M=\mbox{healthy} )}}_{\mbox{Bayes factor}} \times
\underbrace{\frac{p( M=\mbox{disease} )}{p( M=\mbox{healthy} )}}_{\mbox{prior odds}}
\] Notice from the above equation that the Bayes factor is the ratio of the detection rate to the false alarm rate (when the datum is \(D=\mbox{pos}\)). The Bayes factor converts the prior odds (i.e., the ratio of prior probabilities) to the posterior odds (i.e., the ratio of posterior probabilities).

When we plug in the numerical values of the example, we get \[
\underbrace{\frac{0.164}{0.836}}_{\mbox{posterior odds}}
=
\underbrace{\frac{0.97}{0.05}}_{\mbox{Bayes factor}} \times
\underbrace{\frac{0.01}{0.99}}_{\mbox{prior odds}}
\] that is, \[
\underbrace{0.196}_{\mbox{posterior odds}}
=
\underbrace{19.4}_{\mbox{Bayes factor}} \times
\underbrace{0.0101}_{\mbox{prior odds}}
\] (the displayed values are rounded, whereas the underlying calculations retained many decimal places). Thus, even though the Bayes factor is 19.4 in favor of having the disease, the posterior probability of having the disease is only 0.164, because the prior probability of the disease was only 0.01.
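The same numbers can be verified in odds form with a few lines of R (variable names are mine):

```r
hit = 0.97 ; fa = 0.05 ; pDis = 0.01
BF        = hit / fa           # Bayes factor: 19.4
priorOdds = pDis / (1 - pDis)  # prior odds: about 0.0101
postOdds  = BF * priorOdds     # posterior odds: about 0.196
# Converting posterior odds back to a posterior probability:
postProb  = postOdds / (1 + postOdds)  # about 0.164
```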

The real-life lesson of these sorts of examples is that when doing random disease screening or random drug testing, it is not appropriate to use the Bayes factor alone to interpret the result of the diagnostic test. The prior probabilities of the conditions (disease or healthy, using drug or not) must be factored in, and the decision should be based on the posterior probability.

Disease diagnosis is model comparison

Disease diagnosis is just a case of model comparison. In model comparison, we have some data and we infer the posterior probabilities of the models given the data. For disease diagnosis, the data consist of a single dichotomous value, namely \(D=\mbox{pos}\) or \(D=\mbox{neg}\), and the two models are merely two possible underlying states of the world, namely \(M=\mbox{disease}\) or \(M=\mbox{healthy}\). Disease diagnosis is simply asking which model, disease or healthy, is a better model of the outcome of the diagnostic test.

The moral of the disease-diagnosis example ---that the Bayes factor alone is not appropriate for selecting the model--- applies to model comparison generally. Model comparison should be based on the posterior probabilities of the models.

Decision rule in model comparison sometimes uses only the Bayes factor --- but should not

One practice in model comparison is to base decisions on the Bayes factor, forgetting about or at least de-emphasizing the prior probabilities (and hence the posterior probabilities) of the models. A typical decision rule is,
If the Bayes factor is >3 in favor of one model relative to the other, accept the better model and reject the other. 
Applying the Bayes-factor-only decision rule to disease diagnosis would lead us to decide that a person has the disease whenever the diagnostic test is positive, because the Bayes factor is 19.4, which is much greater than 3. Using only the Bayes factor for decision making, instead of the posterior probabilities of the models, ignores Bayes' rule and is not Bayesian despite the terminology. A Bayes factor is only Bayesian in the full context of Bayes' rule, which converts prior probabilities to posterior probabilities.

The usual justification for using the Bayes factor alone is twofold. First, if the prior probabilities of the models are all equal, then the posterior odds algebraically equal the Bayes factor. (You can see that for yourself from the equations above.) Setting the prior probabilities to be equal seems to be an innocuous default for many model-comparison settings, even if it's an absurd assumption for random screening of diseases or drug usage. Second, even if unequal prior probabilities for the models are assumed, the Bayes factor tells us how to modify those prior odds to arrive at the corresponding posterior odds. Therefore it is informative to report the Bayes factor.

The two justifications must be used carefully. First, I doubt that most model-comparison settings really are well represented by assuming equal prior probabilities of the models. If this were true, it would be asserting that the researcher is utterly ignorant of the models and has no prior data, theory, or knowledge of any sort to suggest that one model is more or less plausible than the other. [Replaced 2015-Dec-13 9:04am with the following sentence:] Setting the prior probabilities equal to each other is an assertion of equality, not an expression of uncertainty; the next section discusses uncertainty. Second, even if the Bayes factor is reported to summarize the analysis, the Bayes factor should not be the basis for decision making. The decision should be based on the posterior probabilities of the models.

Modeling uncertainty in the prior probabilities of the models

The usual formal set-up, as in the equations above, assumes specific values for the prior probabilities of the models. That is to say, \(p(M=m)\) has a specific value, such as 0.01 for the "disease" model. But what if we are uncertain about the prior probabilities of the models?


Modeling uncertainty is exactly what Bayesian analysis is all about. It is straightforward to model uncertainty in the prior probabilities of the models. (I do not recall seeing this analysis in the literature, but I would be quite surprised if the following analysis has not appeared somewhere before. Maybe even in this blog.) Denote \(p(M=1)\) by the parameter \(\theta\), which can take on any value from 0 to 1. We will express our uncertainty in \(\theta\) as a beta distribution that has mode \(\omega\) and concentration \(\kappa\). (Hence the shape parameters for the beta distribution are \(a=\omega (\kappa-2) + 1\) and \(b= (1-\omega) (\kappa-2) + 1\).) That is, \(p(\theta)=dbeta(\theta\,|\,\omega,\kappa)\). If we are extremely certain about the prior probability of the model, then \(\kappa\) has a very large value, meaning that the prior distribution on \(\theta\) is tightly concentrated at \(\omega\), and the analysis converges to the point value assumed in the equations above. If we are uncertain about the prior probability of the model, then \(\kappa\) is a smallish value, with \(\kappa \ge 2\), where \(\kappa=2\) indicates a uniform distribution.
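The mode-and-concentration parameterization can be checked with a couple of lines of R; for shape parameters \(a,b > 1\), the mode of a beta density is \((a-1)/(a+b-2)\), which recovers \(\omega\) exactly:

```r
omega = 0.01 ; kappa = 102          # mode and concentration of the prior on theta
a = omega * (kappa - 2) + 1         # shape1: 2
b = (1 - omega) * (kappa - 2) + 1   # shape2: 100
# Mode of a beta(a,b) density when a,b > 1:
betaMode = (a - 1) / (a + b - 2)    # (2-1)/(2+100-2) = 0.01, i.e., omega
```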

For the two-model case, the posterior probabilities of the models are \[
p(M=1|D) = \int_0^1 d\theta \, p(\theta) \, p(D|M=1) \, \theta \, / \, [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]
\] and \[
p(M=2|D) = \int_0^1 d\theta \, p(\theta) \, p(D|M=2) \, (1-\theta) \, / \, [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]
\] The ratio of those equations becomes \[
\underbrace{\frac{p(M=1|D)}{p(M=2|D)}}_{\mbox{Posterior odds}}
=
\underbrace{\frac{p(D|M=1)}{p(D|M=2)}}_{\mbox{Bayes factor}}
\times
\frac{\int_0^1 d\theta \, p(\theta) \, \theta \, / \, [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]}{\int_0^1 d\theta \, p(\theta) \, (1-\theta) \, / \, [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]}
\]

Heartier souls should feel free to derive an analytical simplification of the integrals above, but I'll just perform a numerical approximation in R. The code is appended below.

Here are some numerical results for disease diagnosis when the test result is positive. In all cases, the mode of the prior on \(M=1\) is \(\omega=0.01\), reflecting the assumed rarity of the disease. When \(\kappa\) is very large, at 10002, the posterior odds of having the disease are 0.197, essentially matching the analysis presented earlier that used \(p(M=\mbox{dis})=0.01\) exactly. When \(\kappa\) is set to a moderate value of 102, the posterior odds of having the disease are 0.345. When \(\kappa\) is set to a value of 12 that expresses great uncertainty, the posterior odds of having the disease are 1.23. Finally, when \(\kappa\) is set to a value of 2 that expresses essentially total ignorance about the prior probability of the disease, meaning that the prior probability of having the disease could uniformly be anything from 0 to 1, the posterior odds of having the disease are 7.65, which is getting close to the Bayes factor of 19.4. (Intuition tells me the result should equal the Bayes factor, which means either my intuition is wrong or the mathematical derivation is wrong or my R code is wrong. At this point I trust the numerical result and suspect my intuition. Perhaps a kind soul will gently enlighten me. UPDATE 25-Dec-2015: I am re-thinking this, and now believe that something is wrong with the derivation, and that the intuition is right after all. Stay tuned...!)

[Paragraph added 2015-Dec-13 12:50pm:] I think the math and numbers are correct. The second factor in the posterior odds is \(\int_0^1 d\theta \, p(\theta) \, \theta / [ p(D|M=1) \, \theta + p(D|M=2) \, (1-\theta) ]\) divided by the corresponding integral with \(1-\theta\) in the numerator. Both integrals involve \(p(D|M=1)\) and \(p(D|M=2)\), so their ratio will not generally equal 1 even when \(p(\theta)\) is uniform.

All of which is to say that if the prior probabilities of the models are uncertain, then the uncertainty can be incorporated into the analysis, but you must still specify a particular degree of uncertainty. The more uncertain you are about the prior probabilities of the models, the stronger is the influence of the Bayes factor. But notice that even with fairly high uncertainty, when \(\kappa=12\), the posterior odds were quite different from the Bayes factor. [Next sentence added 2015-Dec-13 9:12am] Moreover, even when there is complete uncertainty about the prior probabilities of the models, when \(\kappa=2\), the posterior odds are not as extreme as the Bayes factor.
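The numerical results above can be reproduced by wrapping the grid approximation (the same one used in the appended R code) in a small function; the function name is mine:

```r
# Posterior odds of disease given a positive test, with a beta(omega,kappa)
# prior on theta = p(M=disease), computed by grid approximation:
postOddsGivenPos = function( omega , kappa , hit=0.97 , fa=0.05 , dTheta=0.0001 ) {
  theta  = seq( 0 , 1 , by=dTheta )
  pTheta = dbeta( theta , shape1=omega*(kappa-2)+1 , shape2=(1-omega)*(kappa-2)+1 )
  pTheta = ( pTheta / sum(pTheta) ) / dTheta   # normalize density over the grid
  denom  = hit*theta + fa*(1-theta)            # p( D=pos | theta )
  integTheta         = sum( dTheta * pTheta * theta     / denom )
  integOneMinusTheta = sum( dTheta * pTheta * (1-theta) / denom )
  ( hit * integTheta ) / ( fa * integOneMinusTheta )
}
for ( kappa in c( 10002 , 102 , 12 , 2 ) ) {
  print( c( kappa , postOddsGivenPos( omega=0.01 , kappa=kappa ) ) )
}
# posterior odds: about 0.197, 0.345, 1.23, and 7.65, respectively
```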

R code:
# Specify distribution on prior probability of Model 1, having the disease:
omega=0.01 # Modal prior probability of disease.
kappa=c(2,12,102,10002)[4] # >= 2.0.
# Specify false alarm rate and hit rate of diagnostic test:
FA = 0.05  # p( Test=+ | healthy )
Hit = 0.97 # p( Test=+ | disease )
BF = Hit/FA # Bayes factor
# Create grid on theta, p(disease):
dTheta = 0.0001 # differential interval on theta.
Theta = seq(0,1,dTheta) # grid on theta for approximating the integral.
# Create prior distribution on theta, as a beta distribution:
pTheta = dbeta( Theta , shape1=omega*(kappa-2)+1 , shape2=(1-omega)*(kappa-2)+1 )
pTheta = (pTheta / sum(pTheta)) / dTheta # normalize so the density integrates to 1.0 over the grid
# Compute the two integrals from the posterior-odds equation:
integTheta = sum( dTheta * pTheta * Theta/(FA+(Hit-FA)*Theta) )
integOneMinusTheta = sum( dTheta * pTheta * (1-Theta)/(FA+(Hit-FA)*Theta) )
# Ratio of integrals that multiplies the Bayes factor:
priorOdds = integTheta / integOneMinusTheta
# Compute posterior probabilities of disease and healthy, given Test=+:
pDisGivenTestPos = Hit * integTheta
pHealthGivenTestPos = FA * integOneMinusTheta
show( c( pDisGivenTestPos , pHealthGivenTestPos ,
         pDisGivenTestPos / pHealthGivenTestPos ) )
show( c( priorOdds , BF , priorOdds * BF )  )


  1. Hello John.

    I have a couple of thoughts on your example; perhaps I am missing something.

    The Bayes factor when comparing models is the ratio of the *marginal* likelihoods.
    By ignoring the prior over the models, it assumes that the two competing models are equally probable a priori. In many cases this is reasonable, in my view.
    In my experience, the Bayes factor generally strikes a good balance between fit and complexity.

    In this example you consider as Bayes factor the ratio of likelihoods:
    p(healthy | pos) vs p(disease | pos).
    This is different from the ratio of marginal likelihoods.
    By ignoring the prior over the models, it assumes that the two outcomes (healthy and disease) are equally probable a priori. This is clearly untrue, hence the disappointing behaviour of this indicator.

    I wonder whether the BF is the ratio of marginal likelihoods, or more generally the ratio of two likelihoods?

    thanks, Giorgio

  2. Hi Giorgio:

    The ratio, p(D|M1)/p(D|M2), is the Bayes factor by definition. Whether or not it is calculated by marginalizing over parameters is irrelevant. If the model has parameters, then p(D|M) is calculated by marginalizing (i.e., integrating) over its parameters. If a model has no parameters, then there is no marginalizing needed.

    Regarding priors on the models: If you have a situation in which you strongly believe that p(M1)=p(M2)=0.5, then it is perfectly appropriate to set the priors that way. But it's important to re-state here that setting p(M1)=p(M2)=0.5 is a strong assertion of equality; it is not an expression of uncertainty. Setting p(M1)=0.50 and p(M2)=0.50 should be based on as strong of prior evidence as we use to set priors in the disease diagnosis case to, say, p(M1)=0.01 and p(M2)=0.99.

  3. Hi,

    I know a little about Bayesian inference in psychology, but I don't yet understand Bayesian statistical analysis. I feel I might make some progress if I understood why my attempt at a very simple form of Bayesian t test would be invalid. The idea is that there are two not-unreasonable hypotheses to consider: One in which the distribution of t statistics is centered on zero (the null hypothesis) and one in which it is centered precisely on the value of t that one has obtained from conducting a standard t test ('t obtained'). Assuming uninformative priors, why wouldn't a valid Bayes factor be the ratio of the two probability densities at the point, 't obtained'? This is illustrated in the image at the following link:

    Rich Anderson
    Bowling Green State University

  4. Rich:

    First, this post was superseded by a follow-up post; see the link at the top of this post.

    Next: It's easy to do a model comparison of two point-value hypotheses (instead of a point-value null hypothesis vs a diffuse alternative hypothesis). But the issue is where the second point value comes from. If it comes from prior theory, that's fine. But if it comes from the new data, then it's using the new data twice -- once to exactly inform the prior and then once to do the model comparison. I suppose you could do this sort of model comparison as a worst case scenario -- i.e., the comparison that is worst for the null hypothesis. This sort of thing is considered in the Bayesian literature -- I think there are articles by Jim Berger and others. For general background about Bayesian model comparison and null hypothesis testing, see Ch's 10 and 12 of DBDA2E.

    Thanks, I will take a look at those chapters! (Note, however, that I was thinking of the situation in which all priors are presumed equal; thus the "new data" would only be used to do the model comparison and not to inform the prior.) --Rich A