What Do P-Values Really Mean?

This month’s blog theme is nursing research, so I’m discussing the meaning of some statistical concepts to help you interpret the research studies you are reading. I’m going to talk about several concepts this month that I have found both undergraduate and graduate students struggle to really understand. Last week I presented the difference between statistical significance and clinical significance. Today’s post is all about alpha levels and p-values, two closely related ideas that are easy to mix up.

“Statistics means never having to say you’re certain.”  Anonymous

Learning statistics (AKA stats) is the bane of many a nursing student’s life! Not only do you have to have a basic stats course before you are admitted to a baccalaureate nursing program, but you have to take statistics again in your graduate program!

Why do faculty make you learn about the research process, hypothesis testing, data analysis, likelihood ratios, relative risks, p-values, and the like?

Because as a professional nurse you are expected to care for patients using the most current, evidence-based knowledge and interventions to promote positive patient outcomes. 

Part of that responsibility entails you critically examining the research literature upon which your nursing practice is based. 

You can’t confidently examine that literature without a basic understanding of the research process and the methods researchers use to ensure accurate results.  And you have to have a basic understanding of what the results mean.

Before we talk about the p-value, let’s understand what the researchers are testing, first. Quickly, let’s review hypothesis testing. 

What is Hypothesis Testing?

Scientists conduct research to, hopefully, show that their intervention, for example, is more effective or better than the status quo. So they come up with their research hypothesis that A (their intervention) is not the same as the standard of care (SOC). The assumption that the intervention is no different from the status quo or SOC is called the null hypothesis (H0).

The research or alternative hypothesis is identified as H1. The research hypothesis can be written as different than the SOC (H1 ≠ H0) if the researcher is not sure in which direction the finding will go (positive or negative). Or with previous research support, the researcher could declare the research hypothesis as going in a certain direction, such as better than the SOC (H1 > H0). 

The null hypothesis is always assumed to be true. So how does a researcher demonstrate that H1 is different from H0? They have to disprove the assumption that the null hypothesis is true! So the researcher wants to reject the null hypothesis so that they can accept the research hypothesis. Does that make sense? Because negatively worded phrases can be confusing – we understand positively worded phrases better – this understanding is a big stumbling block for students.

So basically, the researcher tests these hypotheses to see which condition is true for their study. Is the null hypothesis that the SOC and the intervention have the same effect true (H1 = H0)? If so, the test statistic should have a p-value greater than the a priori alpha level. The researcher can’t declare the intervention different than the SOC, and the research finding is reported as not significant.

Or is the research hypothesis that the intervention is different than the SOC true (H1 ≠ H0)? If the intervention is different or better, then the p-value of the test statistic should be less than the alpha level; the researcher will declare victory and label the research finding as statistically significant.

Finding out the answers to these questions is called hypothesis testing. You compare the null hypothesis to the research hypothesis.  There are only two decisions that the researcher can make when testing hypotheses: either the null is true and the intervention is deemed not to work differently than the SOC or the intervention works differently and the null is false.

Research and the “Truth”

While this seems like a fairly straightforward process, what you should remember is that whatever decision the researchers make may be wrong! The reason is that the researchers are using estimates. And even though the probability may be low that the research result is a function of chance alone for p<0.05, there is still a chance that that particular study is one of the 5 in 100 (or 1 in 100, 1 in 1000, etc. for other p-values) that is significant because of random error.

Here’s the kicker: the researchers will never know whether their study is one of 5 in 100 that shows a false positive! All they can do is conduct their study as rigorously as possible, enter the data into the statistical program they are using, run the statistical tests, and report the results.  They will report what they get and make their conclusions based on their a priori alpha levels (the results were not statistically significant; there was a statistically significant difference …). The way to generate confidence that the results are true is to replicate the study and hope for similar statistical results. 

What happens if a statistical test comes back with a p-value of 0.05? If the a priori alpha level is p < 0.05, then the researcher cannot reject the null hypothesis because the result is not less than 0.05, right? P-values near 0.05 are frequently considered “weak” evidence against the null hypothesis, but the American Statistical Association (ASA) has rebutted this misconception by stating that a p-value by itself should not be used as evidence for or against the null hypothesis (ASA, 2016).

But, apparently, there is a rising trend among researchers, especially in the social sciences, to declare p-values in the range of 0.05-0.10 as “marginally significant.” They may also use the term “approaching significance” (APS, 2016; Pritschet, Powell, & Horne, 2016). The Pritschet et al. study presents reasons for how this type of thinking can lead researchers astray. 

The cut-off the researcher sets for how much risk of a chance finding they are willing to accept is called the significance level.

Probability and Significance

In last week’s post on significance,  I defined statistically significant as “probably caused by something other than mere chance.”

The significance level is also known as the alpha level. It is the cut-off the researcher uses to decide whether a result was probably “caused by something other than mere chance.” P is short for probability, hence p-values.

The alpha level is the probability of making the wrong decision (i.e., to reject the null) when the null hypothesis is really true. The researcher sets the alpha level before the study begins as an objective measure of how much uncertainty they will accept. The p-value, by contrast, is calculated from the study data and then compared with that alpha level.

Researchers want their research findings to be significant, of course! Statistically significant research results are more likely to get funded and published than research studies with non-significant findings (Frost, 2014). So that p-value wields great power! 

How Does the Alpha Level Get Chosen?

The alpha level is the cut-off at which you reject the null hypothesis. It answers the question: how much evidence against the null hypothesis do you need? The evidence is the data you enter from your study and submit to statistical testing. The p-value that comes out of the test is compared with this single cut-off point to decide whether to reject the null hypothesis.

The choice of which alpha level to use is arbitrary – there is no universal rule for when to choose p < 0.05 or p < 0.01: the researcher chooses the level they want. The researcher needs to decide, before the study begins, how much uncertainty they are willing to allow when declaring their intervention a success once the data have been entered and the statistical tests have been run. There are some criteria the researcher considers when choosing the alpha level for their study (like which statistical error is more important to prevent, a Type I error or a Type II error), but we will talk about this concept another time.

Many researchers, especially in nursing and the social sciences, set the alpha level at p < 0.05. What does this mean? This means the researcher is willing to take a chance that their statistical result will be wrong 5% of the time. They are willing to incorrectly state that their intervention is different from the SOC, when it’s really not, 5 out of 100 times. So if the study was replicated 20 times and the null hypothesis was really true, they would expect to reject it incorrectly about once. If the study was repeated 100 times, they would expect to incorrectly reject the null about five times.
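If you like to see ideas in action, here is a rough simulation of that "5 out of 100" idea. It is only a sketch under assumptions I made up for illustration: two groups of 30 drawn from the same normal distribution (so the null hypothesis is true by construction), a simple z-test with a known standard deviation of 1, and 2,000 pretend studies. The function names are hypothetical, not from any statistics package.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (sigma assumed known)."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha=0.05, n_studies=2000, n_per_group=30, seed=1):
    """Share of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the SAME distribution, so there is no real effect.
        soc = [rng.gauss(0, 1) for _ in range(n_per_group)]
        new = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(new, soc) < alpha:
            rejections += 1  # a false positive (Type I error)
    return rejections / n_studies

rate = false_positive_rate()
print(f"Share of studies that wrongly rejected the null: {rate:.3f}")
```

The printed share should land close to 0.05, though not exactly on it, because the simulation itself is subject to chance. Lowering alpha to 0.01 in the call lowers the share of false positives accordingly.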

When the outcome stakes are higher, as in a drug study, you may see the alpha level set at p < 0.01 or less. The researchers will want to decrease the uncertainty of a false positive because a higher level of uncertainty could mean that an ineffective drug may take the place of an effective drug. So if the study was replicated 100 times and the null hypothesis was, in reality, true, they would expect to reject it incorrectly only about once.

One more thing: let me clarify “Reality” or “Truth.” Unless you can test the entire population with a specific disease — heart failure, let’s say — we can’t know for sure how the entire population will react to an intervention.  This would be a very expensive study to conduct and it just wouldn’t happen, so we can’t know the population mean from which to make decisions for our patients. So instead we use representative samples and statistical testing to get to a result that we can then generalize to the entire population.  

Samples are only an estimate of the population, and the statistic is also only an estimate; therefore, if you drew a different sample, you’d get a different value. No matter how carefully this sample is selected to be a fair and unbiased representation of the population, relying on information from a sample will always lead to some level of uncertainty (Simon, n.d.). The p-value helps the researcher put a number on that uncertainty, but it doesn’t eliminate uncertainty altogether.

P-Values and Uncertainty

There are two accepted understandings of the p-value: (a) that the p-value shows how strong the evidence is against the null hypothesis (for example, a p-value of < 0.0001 is stronger than a p-value of < 0.001 and shows a more significant effect); and (b) that it shows how likely one would be to get a result as large as the one found in the study due to chance alone (Frost, 2014; Halsey, Curran-Everett, Vowler, & Drummond, 2015).

The traditional understandings of what p-values represent have been under fire though to the point that the American Statistical Association has put out a statement to clarify the use of p-values for statistical significance (ASA, 2016; Royal Statistical Society, 2016). They have reframed the understanding of the p-value as evidence of strength or truth of a hypothesis to a vaguer principle of incompatibility with the null (Altman & Krzywinski, 2017). Principle 1 reads as “P-values can indicate how incompatible the data are with a specified statistical model” (ASA, 2016).

A p-value in the significant range tells you that either

  • There is a real difference between the groups OR
  • The effect size is very big (even if the sample is small) OR
  • The sample size is very big (even if the effect size is tiny)

A non-significant p-value tells you that either

  • There is no difference between the groups OR
  • That there were too few subjects to demonstrate such a difference if it existed

but it does not tell you which of these cases is true (Greenhalgh, 2006).

P-values do not tell you how precise a result is, how big the effect size is, or whether the finding itself is important (Royal Statistical Society, 2016).  
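To see why a p-value can't tell you how big an effect is, here is a small sketch. It assumes a simple two-group z-test with equal group sizes and a known standard deviation, and it pretends the observed difference equals the stated effect size exactly. The effect sizes and sample sizes are hypothetical numbers I picked for illustration.

```python
from statistics import NormalDist

def p_value_for_effect(effect_size, n_per_group):
    """Two-sided p-value when the observed standardized mean difference
    between two equal-sized groups equals effect_size exactly."""
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A tiny effect (2% of a standard deviation) at growing sample sizes.
tiny_effect = 0.02
for n in (100, 10_000, 100_000):
    print(f"effect = {tiny_effect}, n per group = {n}: p = {p_value_for_effect(tiny_effect, n):.6f}")

# A very large effect with only 10 subjects per group.
print(f"effect = 1.5, n per group = 10: p = {p_value_for_effect(1.5, 10):.6f}")
```

The same tiny effect is nowhere near significant with 100 subjects per group but falls well below 0.05 with 100,000 per group, while a large effect is significant even with 10 per group. The p-value alone can't tell these situations apart.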

P-values ONLY tell you if the result you got in your study was likely to have happened by chance: this is known as the all-or-none principle. This is crucial for you to understand.

All the p-value tells you is how likely a difference as large as the one reported in the study was merely a result of chance.

Example

Let’s pull this all together and look at how the p-value is interpreted. 

The researchers want to study a new protocol for mobility in the ICU. Here is how they set up their hypotheses:

  • The null hypothesis is that the new protocol is not different from the SOC mobility protocol (H1 = H0).
    • This is the hypothesis that the researchers are testing.
    • This is the hypothesis the researchers want to REJECT. 
  • The research or alternative hypothesis is that the new protocol is better than the SOC protocol (H1 > H0).
    • This is the hypothesis the researchers hope to ACCEPT.
  • The alpha level is set at p < 0.05.

The researchers do a power analysis and choose a representative sample. They conduct their study and collect their data.  The data are entered into a statistical program and the statistical tests are run.  The researchers will report the test statistic results they get and make their conclusions based on their a priori alpha levels (the results were not statistically significant; there was a statistically significant difference …).
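For the curious, here is a minimal sketch of what "running the statistical test" might look like for this example. I'm assuming an exact one-sided permutation test (one of many possible tests, and not necessarily the one a real study would use) and completely made-up mobility scores for five patients per group, where higher is better. The function name and the numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean

def one_sided_permutation_p(new_scores, soc_scores):
    """Exact one-sided permutation p-value for H1: new protocol > SOC.

    Counts how many ways of re-splitting the pooled scores give a
    difference in means at least as large as the one observed.
    """
    observed = mean(new_scores) - mean(soc_scores)
    pooled = list(new_scores) + list(soc_scores)
    n_new = len(new_scores)
    n_soc = len(pooled) - n_new
    total = sum(pooled)
    at_least_as_large = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_new):
        new_sum = sum(pooled[i] for i in indices)
        diff = new_sum / n_new - (total - new_sum) / n_soc
        if diff >= observed - 1e-12:  # small tolerance for rounding
            at_least_as_large += 1
        n_splits += 1
    return at_least_as_large / n_splits

# Made-up mobility scores, for illustration only.
new_protocol = [7, 9, 8, 10, 6]
soc_protocol = [5, 6, 4, 7, 3]

p = one_sided_permutation_p(new_protocol, soc_protocol)
print(f"p = {p:.4f}")
```

With these invented numbers, 5 of the 252 possible splits give a difference at least as large as the observed one, so p = 5/252, or about 0.0198. That is below an alpha level of 0.05, so these pretend researchers would reject the null hypothesis.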

Research data is analyzed with computers. Therefore, the statistical programs researchers use will give them an actual p-value instead of just a “< 0.05” result. For example, a p of 0.028, 0.052, or 0.18 will most likely be reported as the actual result. We still interpret statistical significance of that result as compared to the alpha level set by the researcher.

For example, a statistical test provides a p of 0.028. This would be considered statistically significant if the alpha level was set as p < 0.05 (because 0.028 is less than 0.05, right?); however, if the alpha level was set at p < 0.01, then p = 0.028 would not be statistically significant (because 0.028 is greater than 0.01). The values of 0.052 and 0.18 would not be statistically significant using either alpha level because they are both greater than both 0.05 and 0.01. Remember that “marginally significant” or “approaching significance” doesn’t fly!
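The decision rule in that paragraph is simple enough to write down in a few lines. This is just a sketch of the comparison described above; the function name is my own.

```python
def is_statistically_significant(p_value, alpha):
    """Significant only if the reported p-value is strictly below alpha."""
    return p_value < alpha

for p_value in (0.028, 0.052, 0.18):
    for alpha in (0.05, 0.01):
        verdict = "significant" if is_statistically_significant(p_value, alpha) else "not significant"
        print(f"p = {p_value}, alpha = {alpha}: {verdict}")
```

Notice that the rule uses a strict "less than," so a p-value of exactly 0.05 is not significant at an alpha level of p < 0.05, and there is no in-between category for "almost."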

Don’t be fooled into thinking that a smaller p-value, such as p = 0.0001, proves more than a larger one, such as p = 0.001, just because it looks stronger or “more significant.”

All this result tells you is that, for the observed sample, a study result of p = 0.0001 is 10 times less likely to be a result of chance than one of p = 0.001 (0.001 ÷ 0.0001 = 10). It is more incompatible with the truth of the null hypothesis than the p of 0.001, but it doesn’t mean that the research hypothesis is definitely true!

P-values do NOT tell you that one hypothesis is True and the other not! The only thing the p-value tells you is how improbable the research result is in the context of an assumed true null hypothesis (Altman & Krzywinski, 2017; Frost, 2014).

If you fail to reject the null hypothesis, it just means that the evidence from your study was not strong enough to disprove the assumption that the intervention is no different from the SOC (Shorten & Shorten, 2013).

Nursing Application of The Principle of Uncertainty

Because we don’t know what the “Truth” about a population really is (that is, the population mean), and we are depending on statistical estimates, you should NOT use the results of just one “statistically significant” study to change your nursing practice!

How do we make a decision about using research in practice? We look for more than one study with similar findings. The more studies that point toward the same result, the more confident you can be: while one study may be the result of random error, two studies showing the same statistically significant results are less likely to be random error, three studies even less likely, etc.
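Here is a back-of-the-envelope sketch of why replication builds confidence. It rests on simplifying assumptions that real research rarely meets perfectly: the studies are independent, each uses the same alpha level, and the null hypothesis is true in every one of them. The function name is hypothetical.

```python
def chance_all_false_positives(alpha, n_studies):
    """Probability that every one of n independent studies rejects a
    true null hypothesis, assuming each uses the same alpha level."""
    return alpha ** n_studies

for n_studies in (1, 2, 3):
    chance = chance_all_false_positives(0.05, n_studies)
    print(f"{n_studies} significant study/studies by chance alone: {chance:.6f}")
```

Under these assumptions, the chance drops from 5 in 100 for one study to 25 in 10,000 for two and about 1 in 8,000 for three, which is why a body of consistent evidence is more convincing than any single result.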

What are your questions after reading this post about p-values? Let me know in the comments!

Sources

Altman, N., & Krzywinski, M. (2017). Points of significance: Interpreting P values. Nature Methods, 14, 213–214. doi:10.1038/nmeth.4210

Association for Psychological Science (APS). (2016, May 20). Rise in reporting p-values as “marginally significant.”  Retrieved from http://www.psychologicalscience.org/publications/observer/obsonline/rise-in-reporting-p-values-as-marginally-significant.html#.WPZREYjyuCg

Frost, J. (2014, April 17). How to correctly interpret p values. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values

Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12, 179–185. doi:10.1038/nmeth.3288

Pritschet, L., Powell, D., & Horne, Z. (2016). Marginally significant effects as evidence for hypotheses: Changing attitudes over four decades [Abstract]. Psychological Science, 27, 1036–1042.

Royal Statistical Society. (2016, November 15). ASA statement on P-values and statistical significance: Development and impact [YouTube Presentation]. Retrieved from https://www.youtube.com/watch?v=B7mvbOK1ipA

Shorten, A., & Shorten, B. (2013). Hypothesis testing and p values: How to interpret results and reach the right conclusions. Evidence-Based Nursing, 16(2), 36-37. doi:10.1136/eb-2013-101255