What Does “Grading the Evidence” Mean in Evidence-Based Practice?
When reviewing a clinical practice guideline (CPG) or systematic review (SR) with practice recommendations, you’ll notice two scales that are used in conjunction with each practice recommendation offered: a levels of evidence scale and a grading scale.
I talked about levels of evidence (LOE) hierarchies in last week’s post. This post will discuss grading scales used to rate practice recommendations – what they are and how to interpret them.
What is the Difference Between a Grading Scale and Levels of Evidence Scale for Evidence-Based Practice?
First, let’s review why we use a levels of evidence scale in EBP. LOE scales rate the strength of the evidence, which may be judged by the type of research design and/or the quality of the methods (usually presented in a commentary, which includes a statement of how confident you can be of the results). Levels of evidence scales can be confusing because some use numbers to label the quality and some use letters; there is no common scale that everyone uses.
In addition to identifying the strength of the evidence in a levels of evidence scale, some resources also will provide a grade for the evidence. Practice recommendations are synthesized from the evidence into succinct points for easy translation into practice. Each practice recommendation should then be graded as to the scope and quality of the evidence that supports the recommendation.
The idea behind the grade is to give the clinician an idea of how ready the evidence is to be put into practice. “Ready” means the degree of confidence the clinician can have in the strength and quality of the cumulative evidence underlying the practice recommendation for the purpose of translating (i.e., implementing) the recommendation into practice. Practice recommendations, in clinical practice guidelines for example, are based on the cumulative evidence to date.
Practice recommendations are “graded” to give clinicians a degree of confidence in the strength and quality of the cumulative evidence underlying the recommendation.
As with LOE scales, there is no universal grading scale, so when you see practice grades offered, be sure to find the section in the article or CPG that explains what the grades mean. Grading scales for practice recommendations and levels of evidence hierarchies are similar ideas, but they are not the same thing and are not interchangeable, though some scales do seem like hybrids.
How are Practice Recommendations Graded?
Recommendations for or against healthcare interventions, diagnostic and other clinical decisions are not made in a cavalier manner. Rather, the group making the recommendations (i.e., the authors or organization writing the CPG or SR) considers a lot of factors. Quality, quantity or scope, and consistency are major factors in assigning practice grades (Ebell et al., 2004).
The quality of the research methodology is paramount because if you can’t trust the methods the researchers used to conduct the study, you can’t have confidence in or trust in the results of the study! Quality is assessed by critically appraising the evidence and evaluating the potential for bias in a study. The quantity of the research on a specific topic, as well as the specific research designs and number of subjects, is another factor. It is preferable to have more than one study upon which to make a recommendation. And the consistency of the research results is important, too. Consistency is when the results of primary studies and/or systematic reviews or meta-analyses come to a similar or coherent conclusion (Ebell et al., 2004).
The American Family Physician (AFP) SORT taxonomy (presented below) also considers which outcomes are measured; the scope of the body of evidence in terms of how many studies, how consistent are the results, and how coherent are the results as a whole; as well as comparing the benefits, harms, and costs of the practice recommendation (Ebell et al., 2004).
Within a grading framework, the outcomes related to the implementation of the evidence should be considered. Patient-oriented outcomes give us a more human perspective by measuring outcomes that matter to patients; disease-oriented outcomes don’t always reflect a positive impact on patient outcomes (Ebell et al., 2004). The GRADE approach (presented below) considers the tradeoffs between desirable and undesirable effects, the quality of the evidence evaluated, values and preferences of the patient and clinicians, and information related to resource use (Guyatt et al., 2008).
Examples of Grading Scales for Practice Recommendations
Let’s look at some examples of Grading Scales used for both nursing and medical research; they are known by a variety of names! I’ll present some of the associated LOE scales also to help you compare the scales.
Nursing Grading Scales
The Johns Hopkins Nursing Evidence-Based Practice toolkit includes Quality Guides (their name for grading the evidence) and a Levels of Evidence scale.
Evidence grades are called Quality Guides in this system and identified as High quality (A), Good quality (B), and Low quality or major flaws (C). The definitions for quality are the same for Levels I-III; Levels IV and V have different definitions of quality.
For example, the definition of high-quality evidence for Levels I-III is evidence that is “consistent, generalizable results; sufficient sample size for the study design; adequate control; definitive conclusions; consistent recommendations based on a comprehensive literature review that includes thorough reference to scientific evidence.”
High-quality evidence for Level IV is defined as: “material officially sponsored by a professional, public, private organization, or government agency; documentation of a systematic literature search strategy; consistent results with sufficient numbers of well-designed studies; criteria-based evaluation of overall scientific strength and quality of included studies and definitive conclusions; national expertise is clearly evident; developed or revised within the last 5 years.”
Quality for Level V is divided into organizational experience and non-research evidence definitions (The Johns Hopkins Nursing EBP [JHNEBP] Evidence Level and Quality Guide).
The JHNEBP LOE scale has 5 levels identified as I – V. Level I is the highest level; Level V acknowledges organizational evidence and non-research evidence. Note that this LOE scale includes qualitative research and non-research sources of evidence – different from the medical levels of evidence scales.
|LEVEL OF EVIDENCE||JOHNS HOPKINS DEFINITION|
|I||Experimental study, randomized controlled trial (RCT); systematic review of RCTs, with or without meta-analysis.|
|II||Systematic review of a combination of RCTs and quasi-experimental studies, or quasi-experimental studies only, with or without meta-analysis.|
|III||Systematic review of a combination of RCTs, quasi-experimental, and non-experimental studies, or non-experimental studies only, with or without meta-analysis; qualitative study or systematic review, with or without meta-analysis.|
|IV||Opinion of respected authorities and/or nationally recognized expert committees/consensus panels based on scientific evidence. Includes clinical practice guidelines and consensus panels.|
|V||Based on experiential and non-research evidence. Includes literature reviews; quality improvement, program, or financial evaluation; case reports; and opinion of nationally recognized expert(s) based on experiential evidence.|
The Joanna Briggs Institute (JBI; 2014) uses a grading scale based on the GRADE working group (see below). They have only two grades for recommendations: Grade A is a strong recommendation and Grade B is a weak recommendation. The definitions are from the JBI Grades of Recommendation document.
- Grade A: A ‘strong’ recommendation for a certain health management strategy where (1) it is clear that desirable effects outweigh undesirable effects of the strategy; (2) where there is evidence of adequate quality supporting its use; (3) there is a benefit or no impact on resource use, and (4) values, preferences and the patient experience have been taken into account.
- Grade B: A ‘weak’ recommendation for a certain health management strategy where (1) desirable effects appear to outweigh undesirable effects of the strategy, although this is not as clear; (2) where there is evidence supporting its use, although this may not be of high quality; (3) there is a benefit, no impact or minimal impact on resource use, and (4) values, preferences and the patient experience may or may not have been taken into account.
JBI also has an LOE scale for therapeutic interventions, diagnostic studies, prognosis, economic evaluations, and meaningfulness studies (i.e., qualitative studies).
The therapeutic interventions levels are used for the following research designs:
- Levels 1a-d (experimental research),
- Levels 2a-d (quasi-experimental research),
- Levels 3a-e (observational-analytic research),
- Levels 4a-d (observational-descriptive research)
- Levels 5a-c (expert opinion and bench research)
The Registered Nurses’ Association of Ontario (RNAO) Best Practice Guidelines are excellent and provide both an LOE and a grade of the quality or scope of the evidence underlying the recommendations. The grading scale is:
- A: There is good evidence to recommend the clinical preventive action.
- B: There is fair evidence to recommend the clinical preventive action.
- C: The existing evidence is conflicting and does not allow making a recommendation for or against use of the clinical preventive action; however other factors may influence decision-making.
- D: There is fair evidence to recommend against the clinical preventive action.
- E: There is good evidence to recommend against the clinical preventive action.
- I: There is insufficient evidence (in quantity and/or quality) to make a recommendation, however other factors may influence decision-making.
The RNAO LOE scale is categorized by type of research design.
- Ia: Evidence obtained from meta-analysis of randomized controlled trials.
- Ib: Evidence obtained from at least one randomized controlled trial.
- IIa: Evidence obtained from at least one well-designed controlled study without randomization.
- IIb: Evidence obtained from at least one other type of well-designed quasi-experimental study, without randomization.
- III: Evidence obtained from well-designed non-experimental descriptive studies, such as comparative studies, correlation studies, and case studies.
- IV: Evidence obtained from expert committee reports or opinions and/or clinical experiences of respected authorities.
Melnyk and Fineout-Overholt (2015) do not have a separate grading scale for practice recommendations, but their scale for rating the levels of evidence for therapeutic interventions is widely used in nursing.
- Level I: Evidence from a systematic review or meta-analysis of all relevant RCTs
- Level II: Evidence obtained from well-designed RCTs
- Level III: Evidence obtained from well-designed controlled trials without randomization
- Level IV: Evidence from well-designed case–control and cohort studies
- Level V: Evidence from systematic reviews of descriptive and qualitative studies
- Level VI: Evidence from single descriptive or qualitative studies
- Level VII: Evidence from the opinion of authorities and/or reports of expert committees
Examples of Medical Grading Scales
There are many grading tools used in health care. The Grading of Recommendations Assessment, Development, and Evaluation (GRADE) tool is used by many clinicians for systematic reviews and clinical practice guidelines and is outlined on the GRADE Group website. The GRADE group developed and modified the grading scale to increase consistency in rating a body of evidence: “[the] approach provides a system for rating quality of evidence and strength of recommendations that is explicit, comprehensive, transparent, and pragmatic and is increasingly being adopted by organisations worldwide” (Guyatt et al., 2008, p. 928).
The GRADE approach is endorsed by many evidence-based healthcare entities, such as the Cochrane Collaboration, the World Health Organization, the Agency for Healthcare Research and Quality (AHRQ), UpToDate, and the Joanna Briggs Institute (JBI). The GRADE system starts by classifying the evidence according to study design and then rates the quality of the evidence up or down based on a number of factors, including critical appraisal of the risk of bias (study limitations), evidence of reporting or publication bias, inconsistency of results, indirectness of the evidence, imprecision of the research findings, treatment effect size, dose-response relationships, and residual confounders (Guyatt et al., 2008; JBI, 2014). Thus, the GRADE tool gives clinicians a level of confidence in the quality of the body of evidence behind a recommendation. There are two qualifiers for the strength of recommendations: strong and weak (also called conditional, discretionary, or qualified). These definitions are from the GRADE handbook.
- A strong recommendation is one for which the guideline panel is confident that the desirable effects of an intervention outweigh its undesirable effects (strong recommendation for an intervention) or that the undesirable effects of an intervention outweigh its desirable effects (strong recommendation against an intervention).
- A weak or conditional recommendation is one for which the desirable effects probably outweigh the undesirable effects (weak recommendation for an intervention) or undesirable effects probably outweigh the desirable effects (weak recommendation against an intervention) but appreciable uncertainty exists.
- A weak recommendation implies that not all individuals will be best served by the recommended course of action. There is a need to consider more carefully than usual the individual patient’s circumstances, preferences, and values. When there are weak recommendations caregivers need to allocate more time to shared decision making, making sure that they clearly and comprehensively explain the potential benefits and harms to a patient.
GRADE Quality of Evidence Table
|Grade||Quality of Evidence||Definition|
|A||High||Further research is very unlikely to change our confidence in the estimate of effect.|
|B||Moderate||Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.|
|C||Low||Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.|
|D||Very Low||Any estimate of effect is very uncertain.|
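If it helps to see the rating logic written out, here is a minimal Python sketch of how a body of evidence might move between the four quality levels in the table above. It assumes the usual GRADE starting points (randomized trials start at High, observational studies start at Low) and simplifies each concern or strength to exactly one level of movement, which the real GRADE process does not require. The function name `grade_quality` and its parameters are hypothetical and are not part of any GRADE software.

```python
# Quality levels from lowest to highest, matching the table above.
LEVELS = ["Very Low", "Low", "Moderate", "High"]

# Letter grades as shown in the GRADE Quality of Evidence table.
GRADE_LETTER = {"High": "A", "Moderate": "B", "Low": "C", "Very Low": "D"}


def grade_quality(randomized: bool, rate_down: int = 0, rate_up: int = 0) -> str:
    """Return an illustrative GRADE quality level for a body of evidence.

    randomized: True if the evidence comes from randomized trials.
    rate_down:  number of levels to rate down (e.g., risk of bias,
                inconsistency, indirectness, imprecision, publication bias).
    rate_up:    number of levels to rate up (e.g., large effect size,
                dose-response relationship, residual confounding).

    Assumption: each concern moves the rating by one level; this is a
    simplification for illustration only.
    """
    if rate_down < 0 or rate_up < 0:
        raise ValueError("rate_down and rate_up must be non-negative")
    start = 3 if randomized else 1  # High for RCTs, Low for observational
    index = max(0, min(len(LEVELS) - 1, start - rate_down + rate_up))
    return LEVELS[index]


if __name__ == "__main__":
    level = grade_quality(randomized=True, rate_down=1)
    print(level, GRADE_LETTER[level])
```

In this sketch, randomized evidence with one serious concern lands at Moderate (B), and observational evidence with one reason to rate up also lands at Moderate (B). A real guideline panel makes these judgments qualitatively rather than by counting.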
Each practice recommendation in an American Heart Association/American College of Cardiology CPG identifies a Level of Evidence rating for evidence quality and a Class of Recommendation grade of the strength of the evidence for the recommendation (Halperin et al., 2016).
The Class of Recommendation scale is a grading scale that rates the strength of the evidence from Strong to Weak, No benefit to Harmful.
- CLASS I is STRONG and delineates the fact that there is a great benefit in implementing the practice recommendation (Benefit >>> Risk),
- CLASS IIa is MODERATE benefit (Benefit >> Risk),
- CLASS IIb is WEAK (Benefit ≥ Risk),
- CLASS III signifies No Benefit (MODERATE) (Benefit = Risk), and
- CLASS III Harm (STRONG) (Risk > Benefit)
The level of evidence scale goes from the highest level, Level A, to Level B-R (randomized studies), B-NR (non-randomized studies), C-LD (limited data studies), and the lowest level is Level C-EO (expert opinion). For example, Level A is defined as “high-quality evidence from more than 1 RCT, meta-analyses of high-quality RCTs, or one or more RCTs corroborated by high-quality registry studies.” The middle levels are categorized as hierarchies of study designs from RCTs to non-randomized observational trials with limitations. Level C-EO is defined as “consensus of expert opinion based on clinical experience.”
The American Family Physician journal articles about clinical topics use a grading scale of the practice recommendations for diagnosis and management based on the evidence – they call this a Strength-of-Recommendation Taxonomy (SORT). They make a point to emphasize patient-oriented evidence, which is described as outcome measures “that matter to patients: morbidity, mortality, symptom improvement, cost reduction, and quality of life. Disease-oriented evidence measures intermediate, physiologic, or surrogate endpoints that may or may not reflect improvements in patient outcomes (e.g., blood pressure, blood chemistry, physiologic function, pathologic findings).”
|STRENGTH OF RECOMMENDATION||DEFINITION||IMPLICATION FOR PRACTICE|
|A||Recommendation based on consistent and good quality patient-oriented evidence||You should do this unless there is a compelling reason not to.|
|B||Recommendation based on inconsistent or limited quality patient-oriented evidence||You should strongly consider doing this.|
|C||Recommendation based on consensus, usual practice, opinion, disease-oriented evidence, or case series for studies of diagnosis, treatment, prevention, or screening.||The evidence that this improves patient outcomes is weaker for this recommendation.|
This journal also provides an LOE scale for how their authors should assess evidence quality and consistency to make their practice recommendation for diagnosis, treatment/screening/prevention, or prognosis for a particular clinical topic.
Level 1 is their highest level of study quality and is defined as “good-quality, patient-oriented evidence.” For a treatment recommendation, this means a systematic review (SR) or meta-analysis of randomized controlled trials (RCTs) with consistent findings, a high-quality individual RCT, or an all-or-none study. High quality is defined in the legend: for example, a high-quality RCT has concealed treatment allocation, blinding if possible, intention-to-treat analysis, adequate statistical power, and adequate follow-up (greater than 80 percent).
Consistent evidence is defined as “Most studies found similar or at least coherent conclusions (coherence means that differences are explainable) or if high-quality and up-to-date systematic reviews or meta-analyses exist, they support the recommendation.”
Once the author decides on the quality and consistency of the body of evidence for the practice recommendation, they then assign a SORT grade for the practice recommendation itself.
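The SORT table above can be read as a small decision rule, so here is a minimal Python sketch of it. It is based only on the three definitions in the table, not on the full AFP algorithm handout. The function name `sort_grade` and its three yes/no inputs are hypothetical simplifications for illustration.

```python
def sort_grade(patient_oriented: bool, good_quality: bool, consistent: bool) -> str:
    """Return an illustrative SORT strength-of-recommendation letter.

    patient_oriented: True if the evidence measures outcomes that matter
                      to patients (morbidity, mortality, quality of life).
    good_quality:     True if the patient-oriented evidence is good quality.
    consistent:       True if the studies reach similar or coherent conclusions.

    Assumption: the three inputs are simple yes/no judgments; real SORT
    grading involves more nuance than this sketch captures.
    """
    if not patient_oriented:
        # Consensus, usual practice, opinion, disease-oriented evidence,
        # or case series all fall under grade C.
        return "C"
    if good_quality and consistent:
        return "A"
    # Inconsistent or limited-quality patient-oriented evidence.
    return "B"


if __name__ == "__main__":
    print(sort_grade(patient_oriented=True, good_quality=True, consistent=False))
```

For example, good-quality patient-oriented evidence with inconsistent findings comes out as a B in this sketch, which matches the "inconsistent or limited quality" wording in the table.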
I like the AFP journal as an example because they clearly define their LOE scale and their grading scale definitions. If you are an advanced practice nurse or APRN nursing student, this is a great journal, by the way. I used it a lot for required readings when I taught Advanced Pathophysiology in a master’s program. You might be interested in reviewing their 2-page handout that explains their process and provides an algorithm to help their authors be consistent in their grade recommendations.
The U.S. Preventive Services Task Force (USPSTF) is supported by the Agency for Healthcare Research and Quality (AHRQ). The Guide to Clinical Preventive Services includes the “USPSTF recommendations on screening, counseling, and preventive medication topics and includes clinical recommendations for each topic. This new pocket guide provides family physicians, internists, pediatricians, nurses, nurse practitioners, physician assistants, and other clinicians with an authoritative source for making decisions about preventive services.” Each clinical recommendation is graded using the following scale:
USPSTF Guide to Clinical Preventive Services (AHRQ, 2014):
|Grade||Definition||Suggestions for Practice|
|A||The USPSTF recommends the service. There is high certainty that the net benefit is substantial.||Offer or provide this service.|
|B||The USPSTF recommends the service. There is high certainty that the net benefit is moderate or there is moderate certainty that the net benefit is moderate to substantial.||Offer or provide this service.|
|C||The USPSTF recommends selectively offering or providing this service to individual patients based on professional judgment and patient preferences. There is at least moderate certainty that the net benefit is small.||Offer or provide this service for selected patients depending on individual circumstances.|
|D||The USPSTF recommends against the service. There is moderate or high certainty that the service has no net benefit or that the harms outweigh the benefits.||Discourage the use of this service.|
|I Statement||The USPSTF concludes that the current evidence is insufficient to assess the balance of benefits and harms of the service. Evidence is lacking, of poor quality, or conflicting, and the balance of benefits and harms cannot be determined.||Read the clinical considerations section of USPSTF Recommendation Statement. If the service is offered, patients should understand the uncertainty about the balance of benefits and harms.|
The USPSTF also identifies Levels of Certainty Regarding the Net Benefit of preventive services as High, Moderate, and Low. You can find the definitions of each level in the USPSTF document.
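To see how certainty and net benefit combine into a letter grade, here is a minimal Python sketch built from the definitions in the USPSTF table above. The function name `uspstf_grade` and the string labels for net benefit are hypothetical. Mapping low certainty to an I statement is my assumption based on the "insufficient evidence" definition, so treat this as an illustration rather than the Task Force's official procedure.

```python
def uspstf_grade(certainty: str, net_benefit: str) -> str:
    """Return an illustrative USPSTF grade (A, B, C, D, or I).

    certainty:   "high", "moderate", or "low"
    net_benefit: "substantial", "moderate", "small", or "none"
                 ("none" covers no net benefit or harms outweighing benefits)

    Assumption: low certainty always yields an I statement.
    """
    certainty = certainty.lower()
    net_benefit = net_benefit.lower()

    if certainty not in ("high", "moderate", "low"):
        raise ValueError(f"unknown certainty: {certainty}")
    if net_benefit not in ("substantial", "moderate", "small", "none"):
        raise ValueError(f"unknown net benefit: {net_benefit}")

    if certainty == "low":
        return "I"  # evidence is insufficient to assess benefits and harms
    if net_benefit == "none":
        return "D"  # moderate or high certainty of no net benefit or net harm
    if net_benefit == "small":
        return "C"  # at least moderate certainty that net benefit is small
    if net_benefit == "substantial" and certainty == "high":
        return "A"  # high certainty that net benefit is substantial
    # High certainty of moderate benefit, or moderate certainty of
    # moderate-to-substantial benefit.
    return "B"


if __name__ == "__main__":
    print(uspstf_grade("moderate", "substantial"))
```

Here, moderate certainty of a substantial net benefit comes out as a B, while high certainty of a substantial net benefit is the only combination that earns an A.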
Bottom Line: Different sources use different LOE and grading scales, and the two kinds of scales can be hard to tell apart. Always check how the scale and each rating are described so you can tell whether the scale is telling you about the quality of the research discussed (i.e., research design and/or quality of the methods = LOE scale) or giving you confidence in the practice recommendation itself because of the amount and/or strength of the evidence (= grading scale).
References
Andrews, J., Guyatt, G., Oxman, A. D., Alderson, P., Dahm, P., Falck-Ytter, Y., … Schünemann, H. J. (2013). GRADE guidelines: 14. Going from evidence to recommendations: The significance and presentation of recommendations. Journal of Clinical Epidemiology, 66(7), 719-25.
Agency for Healthcare Research and Quality. (2014, June). How the U.S. Preventive Services Task Force grades its recommendations. Rockville, MD: Author. http://www.ahrq.gov/professionals/clinicians-providers/guidelines-recommendations/guide/appendix-a.html
DiCenso, A., & Guyatt, G. (2005). Interpreting levels of evidence and grades of health care recommendation. In A. DiCenso, G. Guyatt, & D. Ciliska (Eds.), Evidence-based nursing: A guide to clinical practice (pp. 508–525). Philadelphia, PA: Elsevier Mosby.
Ebell, M. H., Siwek, J., Weiss, B. D., Woolf, S. H., Susman, J., Ewigman, B., & Bowman, M. (2004). Strength of recommendation taxonomy (SORT): A patient-centered approach to grading evidence in the medical literature. American Family Physician, 69(3), 548-56.
GRADE working group. (2004). Grading quality of evidence and strength of recommendations. British Medical Journal, 328(7454), 1490-1497. http://bmj.bmjjournals.com/cgi/reprint/328/7454/1490
Guyatt, G. H., Oxman, A. D., Vist, G., Kunz, R., Falck-Ytter, Y., Alonso-Coello, P., & Schünemann, H. J. (2008). GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. British Medical Journal, 336(7651), 926-928. doi: https://doi.org/10.1136/bmj.39489.470347.AD
The Joanna Briggs Institute Levels of Evidence and Grades of Recommendation Working Party. (2014). Supporting document for the Joanna Briggs Institute Levels of Evidence and Grades of Recommendation. The Joanna Briggs Institute. Retrieved from http://joannabriggs.org/assets/docs/approach/Levels-of-Evidence-SupportingDocuments.pdf