The reference class problem has plagued thinkers for centuries, but as probability and statistics become part of everyday parlance, the issue becomes ever more pressing. If we want to draw valuable conclusions, we must take the self out of statistics, writes Aubrey Clayton.
The COVID-19 pandemic has, sadly, made armchair epidemiologists and statisticians of us all. Just to understand the daily news, people with little or no previous expertise have had to quickly become conversant in technical argot including terms like “basic reproduction number (R0),” “positive predictive value,” and “case fatality rate,” among others. Predictably, the introduction of this vocabulary of ideas to the general public has also invited a number of rookie mistakes and elementary fallacies. For example, supposing a particular test for the novel coronavirus had a 99% specificity rate, meaning 99 out of every 100 truly virus-free people will test negative, what is the chance that someone who tests positive actually has the virus? If, before attempting an answer, you don't immediately think to ask what the overall incidence rate of the virus is in the population, then you have committed a well-worn fallacy called Base Rate Neglect.
The correct assembly guide for the relevant pieces of information, as always when reasoning about uncertainty, is Bayes’ Theorem, the simple equation that relates the prior probability of a statement (that is, the probability we assign before considering some new piece of evidence) to its posterior probability (the new probability we give it in light of the evidence). The rule dictates that the posterior probability is proportional to the product of the prior probability and the likelihood: the conditional probability of making the given observation assuming the statement were true. In our above testing scenario, the likelihood reflects the accuracy rates of the test (how often it turns up positive if the person does or does not have the virus), and the posterior probability is the one we care about: how likely a person is to have the virus after testing positive. Bayesian Reasoning 101 tells us that we also need to know the chance the person has the virus before considering the test, that is, the base incidence rate of the virus for that person’s population.
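The arithmetic behind that recipe can be sketched in a few lines. The 99% specificity comes from the example above; the 90% sensitivity and the 1% base rate are hypothetical figures chosen purely for illustration:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Posterior probability of infection given a positive test, via Bayes' Theorem.

    prevalence:  prior probability of infection (the base rate)
    sensitivity: P(test positive | infected)
    specificity: P(test negative | not infected)
    """
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# With the 99% specificity from the example, an assumed 90% sensitivity,
# and an assumed 1% base rate, a positive test still leaves the posterior
# probability of infection below 50%.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.99)
print(f"P(infected | positive) = {ppv:.1%}")
```

Despite the test being wrong on only 1 in 100 virus-free people, the low base rate means most positives in this sketch are false positives, which is exactly the intuition that base rate neglect misses.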
However, concealed in that last statement is a subtlety of probabilistic thinking not many people have adequately grappled with yet: How do we determine the relevant population to compare an individual to? Say I’m a 40-year-old man in Boston who’s been practicing moderately good social distancing and is not currently experiencing any symptoms of COVID-19. To establish my base rate of having the coronavirus for the purpose of interpreting any test result, should I take the rate from among other Bostonians (perhaps only in my neighborhood?), restricted to other 40-year-old men, only asymptomatic people, only those whose social behaviors exactly match my own, etc.? Or should I perhaps cast a conservatively wide net and compare myself to all Americans or all people in the world? The same considerations apply when interpreting the fatality rate of the virus. Given daunting statistics like 1 percent of infected people dying, before taking that figure to represent our own personal risk, we might first want to know the death rate broken down by factors like age, race, socioeconomic status, severity of symptoms, presence of other health conditions, access to medical care, and so on.
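To see how much the choice of reference class matters, the same posterior can be recomputed under several different base rates. The groups and prevalence figures below are invented for illustration, not real estimates for Boston or anywhere else, and the test's 90% sensitivity and 99% specificity are likewise assumed:

```python
def posterior_given_positive(prevalence, sensitivity=0.90, specificity=0.99):
    """Bayes' Theorem: P(infected | positive test) for a given base rate.
    The default sensitivity and specificity are illustrative assumptions."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical reference classes with made-up base rates:
reference_classes = {
    "everyone worldwide": 0.002,
    "all Bostonians": 0.010,
    "asymptomatic, distancing 40-year-olds": 0.001,
    "symptomatic close contacts": 0.200,
}

for group, base_rate in reference_classes.items():
    posterior = posterior_given_positive(base_rate)
    print(f"{group:>40}: prior {base_rate:.1%} -> posterior {posterior:.1%}")
```

The test result is identical in every row; only the reference class changes. Yet the meaning of that result swings from "probably a false positive" to "almost certainly infected," which is the reference class problem in miniature.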
Call this Bayesian Reasoning 201.