Statistical Conclusion Errors in Hypothesis Testing

You learn about statistical conclusion errors in every basic nursing research class and are expected to understand what these errors mean. Wait! Did I learn about this, you ask? Yes, yes, you did. Remember Type I and Type II errors? Those are statistical conclusion errors.

But I’ll tell you that statistical conclusion errors are a hard concept for many students to understand because hypothesis testing uses the principle of negative inference, which is a process of rejection instead of acceptance. And to be honest, it took me a long time before I could explain Type I and Type II errors to my research students without reading from my notes!  

Statistical conclusion errors are a result of the testing of the null hypothesis for a research study. The tricky part is that the researcher does NOT know if they’ve made a Type I or Type II error when they’ve completed their study! I’ll explain this later – but first let’s just refresh our memory about hypotheses, before we talk about hypothesis testing. 

What is a Hypothesis?

Hypotheses are simply statistical statements, and a hypothesis is really just an educated guess! Based on your knowledge of what you are researching, you make a statement about where you think the results of your research will lead, provided there is enough literature for you to make a prediction. That statement is called the research hypothesis, or alternative hypothesis, designated H1. The alternative hypothesis is a statement of change, and you may have more than one alternative hypothesis.

The null hypothesis is a statement of no change or no difference.  

For example:

Hø (null hypothesis): There will be no change or no difference in length of stay as a result of X, where X is the intervention being tested. (Mean length of stay with X = mean length of stay with the standard of care.)

H1 (alternative hypothesis): There will be a change in length of stay as a result of X. (Mean length of stay with X ≠ mean length of stay with the standard of care.)

OR 

There will be a reduction in length of stay as a result of X. (Mean length of stay with X < mean length of stay with the standard of care.)

You’ve probably had to practice writing null and alternative hypotheses in your research class. 

What is the Principle of Negative Inference?

The principle of negative inference refers to how researchers make decisions about their research results. It is a process of rejection – contrary to the way humans tend to think – which is why it is hard for some to get their heads around this concept. 

I can’t remember where I got this from, but the principle of negative inference says basically that there is “no effect until proven otherwise.” Sound familiar? Innocent until proven guilty uses a similar principle (Hopkins, 2002a).

In research, we can disprove, but not prove, things. Therefore, we need something to disprove (Hopkins, 2002b). In any research, Hø is either true (there is no relationship between the independent variable and dependent variable) or Hø is false (there is a relationship). 

So How Does This Principle Really Work? 

Key point: the null hypothesis is assumed to be TRUE. Because researchers are usually trying to show that their intervention/process/test is different from or better than the standard, the goal of the researcher/scientist is to REJECT the null hypothesis.

The logic: if the null hypothesis of no difference is assumed to be true, then rejecting the null means the null is not true and there is a difference!

Again, most researchers are looking to show that their intervention is better than the standard of care (SOC), so they are hoping to reject the null. What is their “hope” based on? Whether or not their research results are statistically significant!

If the results are statistically significant – YAY! Reject the null and say that your intervention is better than the standard of care.

If the resulting p-value is greater than the alpha level set by the researchers – oh well. Fail to reject the null (often loosely described as “accepting” the null) and say that your intervention was not better than the standard of care. Do another study with a larger sample size (too small a sample is the most likely reason for not finding statistical significance) and see what happens!

Behind the Scenes: Hypothesis Testing and Statistical Decision-Making

Hypothesis testing is the process of determining how “true” a statistical statement is (Macha & McDonough, 2012). All hypothesis testing is testing of the NULL hypothesis!

There are only two decisions that researchers can make from quantitative research results – either their experimental intervention/test/process is different than the standard, or it’s not.

Hypothesis testing just means that the researcher will conduct their research study and make a decision about the veracity of the null hypothesis based on the research results.

There are Five Basic Steps to Hypothesis Testing (Cook, n.d.):

  1. State your null and alternative hypotheses.
  2. Calculate the test statistic according to the research plan (e.g., t-test, Chi-square, ANOVA, etc.). (Usually done via a statistical software program, like SPSS)
  3. Find the p-value for the test statistic once the test is run. 
  4. Make your decision to REJECT or FAIL TO REJECT (often loosely called “accept”) the NULL hypothesis by determining whether the p-value for the test statistic is less than or greater than the a priori p-value set before the study began (AKA the significance level or alpha level). 
  5. Make a logical conclusion about the significance of your study. 

Steps 1-4 are easy, and step 5 logically follows from them; however, step 5 is where the possibility of error comes in. The error is not in the computer's calculation of the test statistic; it's in the researcher's decision based on that test statistic.
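To make the five steps concrete, here is a minimal sketch in Python (a stand-in for what a statistical program like SPSS does behind the scenes) for a simple two-group comparison of length of stay, using an independent-samples t-test from the scipy library. The group names and length-of-stay numbers are made up purely for illustration; they are not from any real study.

```python
# Minimal sketch of the five hypothesis-testing steps for a two-group comparison.
# The length-of-stay values below are invented for illustration only.
from scipy import stats

# Step 1: State the hypotheses.
#   Hø: mean length of stay with intervention X = mean length of stay with the SOC
#   H1: mean length of stay with intervention X ≠ mean length of stay with the SOC
alpha = 0.05  # a priori significance level, set BEFORE the study begins

soc_group = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3, 5.9, 6.7]           # standard of care
intervention_group = [4.9, 5.0, 4.2, 5.8, 4.6, 5.1, 5.4, 4.8]  # intervention X

# Steps 2 and 3: Calculate the test statistic and find its p-value.
result = stats.ttest_ind(intervention_group, soc_group)

# Step 4: Compare the p-value to alpha and decide about the NULL hypothesis.
if result.pvalue < alpha:
    decision = "Reject the null hypothesis (statistically significant)"
else:
    decision = "Fail to reject the null hypothesis (not statistically significant)"

# Step 5: Draw the conclusion. This is the step where a Type I or Type II
# error can hide, because we never know the underlying reality.
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f} -> {decision}")
```

The software only produces the numbers in steps 2 through 4; whether the conclusion in step 5 turns out to be a Type I or Type II error depends on the unknowable reality, not on the calculation.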

Statistical Conclusion Errors

Because the researcher's decision about their study is based on statistics calculated from research samples, the statistical results are subject to chance – and therefore, the decisions made by the researcher may be in error.

No researcher wants to make statistical conclusion errors.  Statistical conclusion errors are those errors that researchers make unknowingly.  I know that doesn’t seem possible, but let me try to explain why it is something to think about when reading the results of every study.

Definitions

There are two basic types of statistical conclusion errors that can be made: a Type I error and a Type II error. In both of these errors, a decision is made that turns out to be a mistake. 

Type I error: Mistake or error of rejecting a true null hypothesis (saying there are differences when there really are not). Analogous to a False Alarm. 

Type II error: Mistake or error of failing to reject a false null hypothesis (saying there are no differences when there really are). Analogous to a Failed Alarm. 

Reality is that intangible “TRUTH” that we can’t know. When you look at the matrix below, reality is defined as either the null is true and no differences exist or that the null is false and differences really do exist.

If the researcher gets a p-value for the test statistic that is greater than 0.05 (for this example), that result would NOT be considered statistically significant, and the decision will be to fail to reject (loosely, “accept”) the null hypothesis. If the reality is that the intervention would not make a difference in real life, then the researcher made a correct decision.

If the researcher gets a p-value for the test statistic that is less than 0.05 (for this example), that result WOULD be considered statistically significant, and the decision will be to reject the null hypothesis. If the reality is that the intervention would make a difference in real life, then the researcher made a correct decision. The power of a test is its ability to find a difference if one really exists.

Statistical Conclusion Errors Matrix

The matrix below lays out the four possible outcomes when the significance level is set at p < 0.05:

  • Reality: the null is true (no real difference) + Decision: reject the null (p < 0.05) = Type I error (false alarm)
  • Reality: the null is true (no real difference) + Decision: fail to reject the null (p ≥ 0.05) = correct decision
  • Reality: the null is false (a real difference exists) + Decision: reject the null (p < 0.05) = correct decision (this is power)
  • Reality: the null is false (a real difference exists) + Decision: fail to reject the null (p ≥ 0.05) = Type II error (failed alarm)

Type I error (probability of which is the alpha level) = The resulting p-value is less than the researcher-set alpha level, therefore they interpret the result as a significant finding and that their intervention works – when in reality it really doesn’t work. Wrong decision, but they don’t know that!

Type II error (probability of which is beta) = the resulting p-value is greater than the researcher-set alpha level, therefore they interpret the result as a non-significant finding and that their intervention did not work – when in reality it really does work. Wrong decision, but they don’t know that!
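One way to see why these two probabilities are called alpha and beta is to simulate many studies where we secretly know the truth. The sketch below is my own illustration (not from the sources cited here): it generates thousands of studies in which the null really is true and counts how often a t-test comes out “significant” anyway. Roughly alpha (about 5%) of those studies commit a Type I error, even though nothing is actually going on.

```python
# Simulation: how often do we make a Type I error when the null really is true?
# Illustrative sketch only; the population values are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_studies = 10_000
n_per_group = 30

false_alarms = 0
for _ in range(n_studies):
    # Reality: the null is TRUE -- both groups come from the same population.
    control = rng.normal(loc=6.0, scale=1.5, size=n_per_group)
    treatment = rng.normal(loc=6.0, scale=1.5, size=n_per_group)
    if stats.ttest_ind(treatment, control).pvalue < alpha:
        false_alarms += 1  # "significant" despite no real difference = Type I error

print(f"Observed Type I error rate: {false_alarms / n_studies:.3f} (expected ≈ {alpha})")
```

Running the same loop with a real difference built into the treatment group (say, loc=5.5 instead of 6.0) and counting the non-significant results would estimate beta, the Type II error rate, in exactly the same way.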

How to Reduce Statistical Conclusion Errors

Since one doesn’t know whether Hø is true or false, researchers try to protect themselves against the possibility of statistical conclusion errors.

Which error to protect against depends on the implications of the research.

If the research is to test the effects of a drug or patient care intervention — the consequences of incorrectly making a decision saying the drug or intervention is better than the SOC, when in reality it is not, could be severe. In that case, you want to protect against a Type I error.

If the research is to test the effects of an orientation program for new graduates on retention, the consequences of incorrectly making a decision saying the program is different/better than the SOC, when in reality it is not, is not as potentially dangerous. Money will be spent on a program that doesn’t increase retention rates.

On the other hand, the consequences of incorrectly making a decision saying the new orientation program is not different than the SOC, when in reality it is, would deprive the hospital of an effective program that will save them money in recruitment costs. So in that case, guarding against a Type II error would be considered more important. 

  • You determine the probability of committing a Type I error by setting the alpha level (significance level) low (p < 0.05, p < 0.01, p < 0.001, p < 0.0001). Realize that the more you try to protect yourself from falsely rejecting a true null, the more likely you are to accept a false null. The lower the alpha level, the harder it will be to reject the null if there really is a difference. So decreasing the possibility of a Type I error increases the possibility of a Type II error.
  • Type II error probability is influenced by alpha level, sample size, effect size, and extraneous variables.
    • Higher alpha levels = less probability of Type II error (p < 0.20, p < 0.10), but a higher probability of committing a Type I error. 
    • Increase the sample size to get closer to the population mean. Not having enough people in the sample is the most frequent cause of a Type II error. The more people in the sample, the easier it is to find a difference if one exists. 
    • Increase effect size (ah, could be hard to do in real life). The greater the effect size the easier it is to find a difference if one really exists, and – bonus – you don’t need as many people in the sample to find a big difference!
    • Reduce extraneous variability so that you're more likely to find differences. 
    • Increase the power. The power of the test is 1 - beta. So if you set beta at .20 (the traditionally acceptable rate for beta, i.e., a 20% chance of missing the smallest worthwhile effect), there will be an 80% probability that you'll find a difference if one really exists. A beta of .10 = 90% power. But the more power you want, the larger the sample size will need to be, too. (The sketch after this list shows how alpha, effect size, sample size, and power fit together.)
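To show how alpha, effect size, sample size, and power all pull on each other, here is a hedged sketch using the power calculations in the statsmodels Python package (assuming it is installed); the effect sizes and sample sizes are arbitrary examples, not recommendations for any particular study.

```python
# Sketch: how alpha, effect size, and sample size trade off against power (1 - beta).
# Requires the statsmodels package; all numbers are arbitrary examples.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many people per group are needed for 80% power (beta = .20)
# to detect a medium effect (Cohen's d = 0.5) at alpha = 0.05?
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n per group for 80% power: {n_needed:.0f}")

# Same effect size and sample size, but a stricter alpha: power drops.
# Protecting harder against a Type I error raises the risk of a Type II error.
power_05 = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05)
power_01 = analysis.power(effect_size=0.5, nobs1=64, alpha=0.01)
print(f"Power at alpha = 0.05: {power_05:.2f}; at alpha = 0.01: {power_01:.2f}")

# A larger effect needs far fewer people for the same power.
n_large_effect = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.80)
print(f"n per group for 80% power with a large effect: {n_large_effect:.0f}")
```

None of this changes the underlying logic; it just shifts how likely each kind of statistical conclusion error is.
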
What is Truth?

“The presumption is that there is a true state of nature [that is] unknown to the researcher” (Macha & McDonough, 2012, p. 42).

Here’s the tricky part about “Truth” or “Reality”: TRUTH is an abstract concept! We use statistics because we are working with samples, not the whole population.  We then make inferences about the population based on the results we get from our samples.   

Because we don’t test every single person (the whole population of people with X), we don’t know what the TRUTH or REALITY about X really is!  And there is nothing that alerts the researchers that they made a statistical error! 

There will ALWAYS be the possibility of statistical conclusion errors in any research study. “You can never be 100% certain that your conclusions are correct” (Andale, 2017a). 

Realize that researchers DO NOT interpret their results incorrectly on purpose – they conduct their study, enter the data, apply statistical tests, and get a result. They then take that result and see if the p-value is below their a priori significance level. If it is, then in good faith they declare their intervention a success. If in reality there is no difference (that is, their results were just a product of chance), then they will have made a Type I error: saying their intervention is better than the standard of care when in reality it is not.

And because you are now clear on Type I and Type II statistical conclusion errors, here are a few more, less common, statistical conclusion errors to think about. 

Type III error (aka Type 0 error): When you correctly reject the null hypothesis for the wrong reason,  e.g., when the rationale or conclusion for the study is not supported by the data. You may have worded your hypotheses incorrectly or you could be wrong regarding the cause or direction of the difference. These errors are rare, not serious, and usually caused by random chance (Andale, 2017b). 

Type IV error: The Type IV error is a subset of the Type III error: you correctly reject the null hypothesis, but you misinterpret the results (Andale, 2017b). 

Type IIIIIIIII error: When the correct statistical analysis is performed, but the person funding the study (i.e., the client) doesn’t like the answer and suggests an inappropriate analysis to get the answer he wants. The consultant curses the client for suggesting such an analysis! (I can’t find the cite right now!) 🙂

There are a lot of reasons why error can occur in research studies, which is why the more studies that point to the same result, the more confident you can be that the result is true (and not a statistical conclusion error).

Disclaimer: I am not a biostatistician! I’m sharing how I learned and then taught my undergraduate and graduate students these concepts. Hopefully, I’m not outright wrong in how I’m explaining these – but if I am, please email me and let me know and I’ll fix the problem. (Just be kind)

References

Andale. (2017a, April 1). Type I and type II errors: Easy definition, examples. Retrieved from http://www.statisticshowto.com/type-i-and-type-ii-errors-definition-examples/

Andale. (2017b, April 1). Type III error and type IV error in statistical tests. Retrieved from http://www.statisticshowto.com/type-iii-error-in-statistical-tests/

Cook, P. F. (n.d.). Statistics for the terrified. [Lecture Notes]. The University of Colorado Anschutz Medical Center, Denver, CO.

Hopkins, W. G. (2002a). Hypothesis testing. Retrieved from http://www.sportsci.org/resource/stats/pvalues.html#hypothesis

Hopkins, W. G. (2002b). Making inferences: Clinical vs statistical significance [Slideshow]. Sportscience 6, sportsci.org/jour/0201/Statistical_vs_clinical.ppt (updated December 2006).

Macha, K., & McDonough, J. P. (2012). Epidemiology for advanced nursing practice. Sudbury, MA: Jones & Bartlett Learning.