An interesting piece in the news section of Nature this week about the problems of statistical tests using p values to determine significance.
The underlying nature of the piece is that p values are unreliable. The headlines of the figure forming the major argument in the text states: “A P value measures whether an observed result can be attributed to chance. But it cannot answer a researcher’s real question: what are the odds that a hypothesis is correct? Those odds depend on how strong the result was and, most importantly, on how plausible the hypothesis is in the ¬first place.”
However, this argument appears to relate to p-values calculated from a test such as a chi-squared or Fisher’s exact, and only these tests, as I’ll explain. Such a test may be something along these lines. You set up a classic choice experiment with a fish and some bait. The fish swims through a tube and reaches a branch in the tube (left or right). The food is in the left branch. You run the experiment 100 times and the fish turns left 73 times. Is this down to chance?
If only chance operated, then in theory, the fish has a 50:50 chance of going either way each time (ignore memory or any other preferences for now). In practice, it is unlikely to go left exactly 50 times and right exactly 50 times. If the left:right ratio was 49:51, it would be a good fit. So would 48:52, but would 73:27? You can test this statistically, and it will tell you if these results are down to chance or not. For this kind of test, the argument in the article holds – see figure here:
This is a weak test, with no replication. There are a whole range of reasons the fish might turn left more, which have nothing to do with the hypothesis of sensing food. However, even if you overcome the shortcomings of the method (the scientist’s job, not the statistician), it is easy to see how a different fish might respond differently (unlike most tests, replication isn’t really ingrained in these in the same way). If we got a p value < 0.05, however, it would tell us that we are 95% certain that this fish turned left more than what would be expected by our guess of what should happen by chance (i.e. chance may not actually be 50:50, even if we think it is). In this sense, the test is robust.
In addition to this, most statistical tests are different. Most are based around linear models (ANOVAs or regressions). The purpose of an ANOVA is to tell us if there is a difference in sample means between categories (are there more snails in site 1, than site 2 or site 3?). If we count every snail at the site, then we have the true population, and there’s no need for a test. However, if we sample, and work out the number per quadrat (the replicate), then we are only estimating the population mean with a sample mean (the mean is the average number per quadrat).
All an ANOVA does is to work out if having a mean for each category gives a better fit to the data than an overall mean for all categories. If so, we can conclude there is a ‘significant difference in the means’. So a p < 0.05 for an ANOVA means we are 95% sure that at least one of the category means is different from the others. Given you have taken a representative sample of sufficient size, this is robust, and repeatable, although you wouldn’t necessarily expect to get the exact same p-value each time, and with a 95% confidence, consideration of type 1 and 2 errors are of course relevant).
Equally, a regression is very similar to an ANOVA. A p < 0.05 means that if you draw a line of best fit through the points, it is a better fit to the data than a horizontal line running through the mean of the samples.
For these tests the assumption that: “A p value measures whether an observed result can be attributed to chance” is not the case. It measures if there is a better way of fitting a model to the data than just looking at how samples cluster around the mean.
There are some other issues in the article, such as the lack of association between significance and effect size in tests such as ANOVA, which do need addressing (although a nice graph of the results helps in this case). The r2 in a regression also gives a good measure of fit, normally providing more meaning than a p value. Additional information is good, but this doesn’t mean the p value is bad.
However, overall, I simply don’t buy the argument that p values are unreliable. Knowing what a statistics test does is important. Interpreting results and hypotheses in the knowledge of what a stats test does is important. But p values are very useful.
As scientists, we normally want to know if an effect can be attributed to chance. Surely we all know that this doesn’t prove our hypothesis? This just sounds like the classic rehash of not understanding cause and effect, which if you’re not easily offended is best described here:
However, removing p values removes the ability to test if something occurs by chance. I’ve recently been working with GLMMs, and the fact that there is no reliable p value is an issue for me. I can’t tell if what I have found is likely to be a real effect or not. I’m sure those with much more understanding of GLMMs can tell me how to proceed. But let’s be honest, I’m lost, and I’m sure anyone who is a non-statistician reading my paper would be as well. With a bit of common sense, they shouldn’t be lost if there are p-values.
For science to progress, we need to have a greater understanding of what statistics do, but we also need to make sure that these statistics are easy to use and easy to understand for the average scientist, even if not for the general public (but wouldn’t it be nice if we could explain it to the public too?). The p value is fit for purpose, we just need to understand what it tells us. Losing it would put scientific understanding back a long way.
Regina Nuzzo (2014). Statistical Errors Nature, 506, 150-152 DOI: 10.1038/506150a