Wednesday, July 18, 2012

The Joy of Confidence Intervals


In my last blog I mentioned that Null Hypothesis Significance Testing (NHST) was a bad idea (despite most of us having been taught it, using it and possibly teaching it to future generations). I also said that confidence intervals are poorly understood. Coincidentally, a colleague of mine, knowing that I was of the ‘burn NHST at the stake’ brigade, recommended this book by Geoff Cumming. It turns out that, within the first 5 pages, it gives the most beautiful example of why confidence intervals tell us more than NHST. I’m going to steal Geoff’s argument blatantly, but with the proviso that anyone reading this blog buy his book, preferably two copies.
OK, imagine you’ve read Chapter 8 of my SPSS/SAS or R book in which I suggest that rather than cast rash judgments on a man for placing an eel up his anus to cure constipation, we use science to evaluate the efficacy of the man’s preferred intervention. You randomly allocate people with constipation to a treatment as usual group (TAU) or to placing an eel up their anus (intervention). You then find a good lawyer.
Imagine there were 10 studies (you can assume they are of a suitably high quality with no systematic differences between them) that had reported such scientific endeavors. They have a measure of constipation as their outcome (let’s assume it’s a continuous measure). A positive difference between means indicates that the intervention was better than the control group at reducing constipation.
Here are the results:

Study       Difference between means    t        p
Study 1     4.193                       3.229    0.002*
Study 2     2.082                       1.743    0.086
Study 3     1.546                       1.336    0.187
Study 4     1.509                       0.890    0.384
Study 5     3.991                       2.894    0.006*
Study 6     4.141                       3.551    0.001*
Study 7     4.323                       3.745    0.000*
Study 8     2.035                       1.479    0.155
Study 9     6.246                       4.889    0.000*
Study 10    0.863                       0.565    0.577

* p < .05

OK, here's a quiz. Which of these statements best reflects your interpretation of these data:
  •  A. The evidence is equivocal, we need more research.
  •  B. All of the mean differences show a positive effect of the intervention, therefore, we have consistent evidence that the treatment works.
  •  C. Five of the studies show a significant result (p < .05), but the other 5 do not. Therefore, the studies are inconclusive: some suggest that the intervention is better than TAU, but others suggest there's no difference. The fact that half of the studies showed no significant effect means that the treatment is not (on balance) more successful in reducing symptoms than the control.
  •  D. I want to go for C, but I have a feeling it's a trick question.

Some of you, or at least those of you brought up to worship at the shrine of NHST, probably went for C. If you didn't, then good for you. If you did, then don't feel bad, because if you believe in NHST then that's exactly the answer you should give.
Now let's look at the 95% confidence intervals for the mean differences in each study:
Note the mean differences correspond to those we have already seen (I haven't been cunning and changed the data). Thinking about what confidence intervals show us, which of the statements A to D above best fits your view?
Hopefully, many of you who thought C before now think B. If you still think C, then I will explain why you should go for B:
A 95% confidence interval is constructed so that, if we repeated the study many times, 95 out of 100 such intervals would contain the actual population value and 5 out of 100 would miss it. In other words, each interval reflects the likely true population value. Looking at our 10 studies, only 3 of the 10 contain zero (studies 3, 8 and 10) and for two of them (studies 3 and 10) they only just contain zero. Therefore, in 7 of the 10 studies the evidence suggests that the population difference between group means is NOT zero. In other words, there is an effect in the population (zero would mean no difference between the groups). So, 7 out of 10 studies suggest that the population value, the actual real difference between groups, is NOT ZERO. What's more, even the 3 that do contain zero show a positive difference, and only a relatively small portion of the tail of the CI is below zero. So, even in the three studies that have confidence intervals crossing zero, it is more likely than not that the population value is greater than zero. As such, across all 10 studies there is strong and consistent evidence that the population difference between means is greater than zero, reflecting a positive effect of the intervention compared to the TAU.
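If the '95 out of 100' idea feels slippery, a quick simulation can make it concrete. The sketch below is not from Geoff's book: it assumes a made-up population difference of 3 with a known standard deviation of 10, draws lots of imaginary studies of 50 difference scores each, and counts how often the 95% interval around each study's mean captures the true value. All the names and numbers in it are hypothetical, and it uses a normal (z) interval with known sigma purely to keep things simple.

```python
import random
from statistics import NormalDist, mean

def coverage(true_diff=3.0, sigma=10.0, n=50, studies=2000, seed=1):
    """Proportion of simulated 95% CIs that contain the true difference.

    Assumption: sigma is known, so a z-based interval is used
    rather than the t-based interval a real study would report.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)       # roughly 1.96
    half_width = z * sigma / n ** 0.5     # same for every study (sigma known)
    hits = 0
    for _ in range(studies):
        # One imaginary study: n difference scores from the population.
        sample = [rng.gauss(true_diff, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_diff <= m + half_width:
            hits += 1
    return hits / studies

if __name__ == "__main__":
    print(f"Proportion of intervals containing the true value: {coverage():.3f}")
```

The proportion should land close to 0.95. Note that the population value never moves; it is the intervals that bounce around from study to study.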
The main point that Cumming makes (he talks about meta-analysis too, but I'm bored of typing now) is that the dichotomous significant/non-significant thinking fostered by NHST can lead you to radically different conclusions to those you would make if you simply look at the data with a nice, informative confidence interval. In short, confidence intervals rule, and NHST sucks.
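For anyone less bored than me, here is a rough sketch of what pooling the 10 studies might look like. It is my own back-of-envelope version, not Cumming's analysis: it assumes each t in the table is simply the mean difference divided by its standard error (so SE = difference / t), uses a fixed-effect, inverse-variance weighted average, and builds the interval from a normal approximation. The function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# (difference between means, t) for Studies 1 to 10, copied from the table.
STUDIES = [
    (4.193, 3.229), (2.082, 1.743), (1.546, 1.336), (1.509, 0.890),
    (3.991, 2.894), (4.141, 3.551), (4.323, 3.745), (2.035, 1.479),
    (6.246, 4.889), (0.863, 0.565),
]

def pool_fixed_effect(studies, level=0.95):
    """Fixed-effect pooled difference with an approximate confidence interval.

    Assumptions: SE = difference / t for each study, and a normal
    approximation for the pooled interval.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Inverse-variance weights: 1 / SE^2 = (t / difference)^2.
    weights = [(t / d) ** 2 for d, t in studies]
    total = sum(weights)
    pooled = sum(w * d for w, (d, _) in zip(weights, studies)) / total
    se = 1 / sqrt(total)
    return pooled, pooled - z * se, pooled + z * se

if __name__ == "__main__":
    est, lower, upper = pool_fixed_effect(STUDIES)
    print(f"Pooled difference: {est:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Under those assumptions the pooled interval sits entirely above zero, which is the same story the individual confidence intervals tell and not the one you get from counting significant results.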
More important, it should not be the case that the way we picture the data/results completely alters our conclusions. Given we're stuck with NHST at least for now, we could do worse than use CIs as the necessary pinch of salt required when interpreting significance tests.
Hopefully, that explains some of the comments in my previous blog. I'm off to buy a second copy of Geoff's book ...