Understanding Statistical Significance

See the attached article, which is the basis for the message summary assignment below.

Write a 205-word message in which you summarize the major points made by Hayat and explain why it is not appropriate to state that a study’s results prove relationships.

Understanding Statistical Significance

Matthew J. Hayat

Background: Statistical significance is often misinterpreted as proof or scientific evidence of importance. This article addresses the most common statistical reporting error in the biomedical literature, namely, confusing statistical significance with clinical importance.

Objective: The aim of this study was to clarify the confusion between statistical significance and clinical importance by providing a historical perspective of significance testing, presenting a correct understanding of the information given by p values and significance testing, and offering recommendations for the correct use and reporting of statistical results.

Approach: The correct interpretation of p values and statistical significance is given, and the recommendations provided include a description of the currently recommended guidelines for statistical reporting of the size of an effect.

Results: This article provides a comprehensive overview of p values and significance testing and an understanding of the need for measures of importance and magnitude in statistical reporting.

Discussion: Statistical significance is not an objective measure and does not provide an escape from the requirement for the researcher to think carefully and judge the clinical and practical importance of a study's results.

Key Words: effect size, p value, statistical significance

Statistical significance often is used mistakenly as the standard for assessing the importance of a measured effect. This practice has a long history and is a consequence of the desire for an objective method for stating importance. However, significance testing is not an objective procedure and does not alleviate the need for careful thought and judgment relevant to the subject matter being studied. The search for an objective measure of importance is a daunting task. The researcher makes subjective decisions about a research topic to the best of his or her ability given the circumstances, resources, and design options.

Each research study is limited by subjectivity related to these factors, as well as time and decisions about what to measure. In the midst of this subjectivity, the researcher often uses statistics to claim proof and scientific breakthrough in the form of statistical significance. It is common practice to consider a p value less than .05 as a form of objective scientific evidence of an effect. Unfortunately, this judgment is subjective and flawed and leads to erroneous conclusions and interpretations (Ziliak & McCloskey, 2008).

Confusing statistical significance with clinical importance is the most common statistical reporting error in the biomedical literature (Lang, 2007). Significance is defined by Merriam-Webster (2009) as "the quality of being important," but statistical significance is not a measure of importance.

The purpose of this article was to provide a historical perspective on significance testing, present a correct understanding of the information given by p values and significance testing, and offer recommendations for the correct use and reporting of statistical results as a means of scientific evidence.

History of Significance Testing

Florence Nightingale was a statistician and a pioneer in the field of medical statistics (Cohen, 1984; Grier, 1978; Hogg, 1989; Kopf, 1978). She is credited with coining the phrase applied statistics and contributed to the field of statistics with innovative new methods for visualizing the presentation of information and new types of statistical graphics (Lewi, 2006). Nightingale was one of a group of scientists developing the field of vital statistics in the 19th century. Although significance testing had not been introduced, Nightingale's work helped set the stage for the growth and development in the first half of the 20th century of statistical theory, significance testing, and the use of randomization in scientific experiments.


Matthew J. Hayat, PhD, is Assistant Professor and Biostatistician, School of Nursing, Johns Hopkins University, Baltimore, Maryland.


In 1925, the statistician and geneticist Ronald A. Fisher introduced p values, the concept of significance testing, and the .05 level of significance (Fisher, 1925). Ziliak and McCloskey (2008) presented a review of the history of significance testing and the adoption of significance testing by the field of psychology as the new standard for claiming the importance of a study result. Between 1940 and 1955, inferential statistics and reporting of p values gained widespread acceptance in the field of psychology (Gigerenzer & Murray, 1987; Hubbard & Ryan, 2000). However, p values are not a measure of importance of a study result. This had a direct and negative impact on the scientific value of publications in virtually every scientific discipline.

The first edition of the Publication Manual of the American Psychological Association (1952) included the following guideline:

Extensive tables of non-significant results are seldom required. For example, if only 2 of 20 correlations are significantly different from zero, the two significant correlations may be mentioned in the text, and the rest dismissed with a few words (p. 414).

(This guideline was removed in the second edition of the Publication Manual of the American Psychological Association, 1974.) This guideline led to an emphasis on statistical significance in publications. Results failing to reach statistical significance were deemed unworthy of publication, which led to publication bias (selective publication of statistically significant results) in the published literature. The flaw was to use statistical significance as an objective measure of scientific evidence and importance.

Sterling (1959) studied the four leading psychology journals at the time and quantified the use of significance testing. The results of his study are displayed in Table 1. More than 81% (294/362) of the articles included significance testing as the method of choice for reporting study results. In 97% (286/294) of these articles, the null hypothesis was rejected and a statistically significant result was found. Researchers adhered to the arbitrary .05 level of significance.

The guidelines in the second edition of the Publication Manual of the American Psychological Association (1974) still focused on the use of p values and significance testing. The following is an excerpt from the second edition, cautioning the researcher to avoid attributing importance or magnitude based on a p value:

Caution: do not infer trends from data that fail by a small margin to meet the usual levels of significance. Such results are best interpreted as caused by chance and are best reported as such. Treat the result section like an income tax return. Take what’s coming to you, but no more (p. 19).

Over the past 40 years, many articles have been published by reputable statisticians citing the pitfalls and misuses of significance testing and encouraging the reporting of measures of effect size (Berger, 2003; Matthews, 1998). The recommendations in the Publication Manual of the American Psychological Association (2010) are to publish statistics describing the size of an effect and suggest including effect size estimates for nonstatistically significant results. It will take time and awareness for reviewers to take note of the statistical reporting guideline changes in the current edition of the Publication Manual of the American Psychological Association. Recommendations in the current version suggest a focus on reporting of confidence intervals (CIs) and other measures of importance.

Significance Testing

Significance testing is the process of comparing the p value derived from sample data to a study's predetermined level of significance. The level of significance is the probability of committing a Type I error, whereby one rejects the null hypothesis when it is actually true. The conventional level of significance used in most studies is .05, which corresponds to rejecting the null hypothesis incorrectly in approximately 1 out of every 20 experiments.

The researcher decides the significance level before the study is executed and before data are collected. If the p value for a statistical test is less than the level of significance, the conclusion is made that the result is statistically significant. Thus, there are two possible outcomes to a significance test: reject the null hypothesis or fail to reject the null hypothesis. In essence, the outcome of a significance test is dichotomous in nature and has a nominal level of measurement. In other words, significance tests produce a qualitative outcome measure. This is ironic when considering the importance placed in evidence-based medicine on significance testing as a means of quantitative scientific evidence.
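In code, the whole procedure reduces to a single comparison of p against the predetermined level. A minimal sketch follows; Python with SciPy is an assumption here (the article does not prescribe any package), and the data are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05  # level of significance, fixed before the data are collected

# Illustrative data: two groups of 30 observations each (simulated)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# The outcome is dichotomous: reject, or fail to reject, the null hypothesis
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis (inconclusive)")
```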

Furthermore, the level of significance (e.g., .05) is a subjective quantity chosen by the researcher, and its selection determines the result of a significance test. This suggests that significance testing is in reality a subjective procedure. Therefore, contrary to popular belief, a significance test result fails to provide an objective measure of scientific evidence.

TABLE 1. Sterling's (1959) Report of Significance Testing in the Psychology Literature

Journal | Total no. of research articles | No. (%) using significance tests | Of those using significance tests, no. (%) that reject null hypothesis at ≤ .05
Experimental Psychology (1955) | 124 | 106 (85%) | 105 (99%)
Comparative and Physiological Psychology (1956) | 118 | 94 (80%) | 91 (97%)
Clinical Psychology (1955) | 81 | 62 (77%) | 59 (95%)
Social Psychology (1955) | 39 | 32 (82%) | 31 (97%)
Total | 362 | 294 (81%) | 286 (97%)

Note. From Sterling (1959). Reprinted with permission from the Journal of the American Statistical Association. Copyright 1959 by the American Statistical Association. All rights reserved.


Defining the p Value

Researchers may be interested in the probability of a hypothesis being true. However, the p value does not give this information. The p value gives the probability of observing the sample data or something more extreme, assuming the null hypothesis is true. It is the likelihood of observing the measured, or more extreme, effect in the sample data, assuming no effect actually exists. Thus, if the measured effect is larger than would be expected by chance, then, because the calculation begins by assuming that there is no effect, the calculated p value will be small. This would lead to rejecting the null hypothesis and concluding that the effect is more than expected by chance alone.
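The definition can be made concrete by simulation. In this hedged sketch (the group size, observed difference, and use of Python are all illustrative assumptions), many datasets are generated under a true null hypothesis, and we count how often a result at least as extreme as the observed one arises by chance alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 25                # hypothetical group size
observed_diff = 0.6   # hypothetical observed difference in group means

# Simulate many experiments in which the null hypothesis is true
# (both groups drawn from the same distribution) and count how often
# a difference at least as extreme as the observed one occurs.
count = 0
n_sims = 10_000
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if abs(a.mean() - b.mean()) >= observed_diff:
        count += 1

print(f"P(result at least this extreme | H0 true) ~ {count / n_sims:.3f}")
```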

Interpreting the p Value

The p value does not have any clinical or practical value for a researcher. As an example, suppose a level of significance of .05 is assumed for a study. Two similar studies are conducted; the result for Study A gives p = .04, and the result for Study B gives p = .000000001. In each case, the p value is less than the predetermined level of significance, and so statistical significance is concluded for both Study A and Study B. No further inferences can be made with the p value alone. A p value quantifies the probability of the observed effect, or something more extreme, being due to chance, so a smaller p value corresponds to an observed result being less likely to have arisen by chance. However, the ability to add that an observed result is less likely due to chance is valueless from a practical standpoint. The result of Study B is not more statistically significant than the result for Study A. Regardless of Study A's p value being relatively closer to .05, the evidence of an effect in Study A is not any less meaningful than that for Study B.

Furthermore, suppose we carry out a third experiment, Study C, yielding a p value of .06. It is common in the literature for an author to interpret a p value close to, but slightly larger than, the level of significance incorrectly. For example, some researchers describe p = .06 as evidence of marginal significance or proof of a trend toward significance. This type of conclusion is untrue and leads to incorrect interpretations of the statistical test results. The only inference that can be made is that of failing to reject the null hypothesis and thus concluding that the result is inconclusive.

The p value and statistical significance are influenced by sample size. A small effect in a large study can have the same p value as a large effect in a small one (Goodman, 1999). The practical and clinical importance of a measured effect in a study must be addressed by the researcher. If the sample size is very large, the p value for even a trivial effect will be very small. An increasingly large sample size yields a decreasingly small p value; thus, a sufficiently large sample size leads to a statistically significant result, regardless of scientific importance. Also, a statistically significant effect based on a small sample is more impressive than a statistically significant effect based on a large sample (Knapp, 1996), because a large sample size can buy statistical significance for an effect that is not scientifically important.
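This dependence is easy to demonstrate deterministically. The sketch below (assuming SciPy; the 0.2 SD difference is an arbitrary illustrative effect) holds the summary statistics fixed while only n grows, and the p value collapses toward zero.

```python
from scipy.stats import ttest_ind_from_stats

# The effect (a 0.2 SD mean difference) is identical at every n;
# only the sample size changes, yet p shrinks toward zero.
for n in (20, 100, 500, 2500):
    _, p = ttest_ind_from_stats(mean1=0.0, std1=1.0, nobs1=n,
                                mean2=0.2, std2=1.0, nobs2=n)
    print(f"n per group = {n:4d}  p = {p:.4f}")
```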

Failing to Reject the Null Hypothesis

Failing to reject the null hypothesis is not the same thing as accepting the null hypothesis. In other words, the null hypothesis is never accepted, and failing to find an effect is not the same thing as showing that there is no effect. If the null hypothesis is not rejected, a true null hypothesis can only be recognized as a possibility. Unfortunately, this technical detail is often missed in the translation and interpretation of nonstatistically significant results. If a significance test fails to reject the null hypothesis, it is incorrect to claim evidence of no treatment effect or no difference. At best, the hypothesis test has yielded an inconclusive result. It cannot be said for certain that there is no treatment effect. Interpreting a nonsignificant statistical result as no effect is equivalent to accepting the null hypothesis to be true, which is not possible with classic hypothesis testing. This vital point is often missed, and incorrect interpretations of significance test results when failing to reject the null hypothesis are common (Ziliak & McCloskey, 2008).
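The point can be demonstrated by simulation. In this hedged sketch (Python, with arbitrary illustrative numbers), a real effect exists in every simulated study, yet most small studies fail to reject the null hypothesis; their nonsignificant results are inconclusive, not evidence of no effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# A genuine effect (0.4 SD) is present in every simulated study,
# but each study enrolls only 12 participants per group.
misses = 0
n_sims = 10_000
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, 12)
    b = rng.normal(0.4, 1.0, 12)
    if stats.ttest_ind(a, b).pvalue >= 0.05:
        misses += 1

# Most of these underpowered studies fail to reject H0 even though
# the null hypothesis is false in every one of them.
print(f"{100 * misses / n_sims:.0f}% of studies failed to reject H0")
```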

Power, Effect Size, and Sample Size

The p value is dependent on statistical power, effect size, and sample size. Statistical power is the probability of detecting an effect when one actually exists. Thus, to quantify statistical power, the size of an effect needs to be defined by the researcher. This is the minimal practical or clinical difference needed to state that something of importance is occurring. Power and effect size have a close-knit relationship with sample size (the number of participants included in a study). Power and sample size have a positive relationship; to achieve more power in a study, a larger sample size is needed. Conversely, sample size and effect size have an inverse relationship, because power to detect a smaller effect size requires a relatively larger sample size. These quantities affect the value and interpretation of a p value.

Because the p value gives the probability of observing a specified effect size assuming that the null hypothesis is true, the effect size is used in the definition of a p value for a statistical test. If a small effect size is specified, a large sample size will be needed to detect such a small effect. This means that a statistically significant result will not be found if the sample size is not sufficiently large, and if the sample size is very large, a statistically significant result will be found. These properties are not desirable for a researcher, because statistical significance may be driven by the sample size and is not necessarily reflective of the phenomena under study. This dependence of the p value on sample size is a mathematical property of p values, unrelated to the topic of a research study, and further illuminates the gap between statistical significance and clinical importance.

An example of flawed thinking about statistical significance and importance is seen in the case of the painkiller Vioxx. Merck was the manufacturer of the painkiller Vioxx (Ziliak & McCloskey, 2008). First approved by the U.S. Food and Drug Administration in 1999, the distribution of this drug grew dramatically, and by 2003, Vioxx had a $2.5 billion international market. A clinical trial on Vioxx was conducted in 2000, and the findings of this clinical trial were published in the Annals of Internal Medicine (Lisse et al., 2003). A news article reported the following about the study: "five patients taking Vioxx had suffered heart attacks during the trial, compared with one taking naproxen (generic drug; control group), a difference that did not reach statistical significance" (Berenson, 2005).


The claim in the published study results was that there was no statistical difference at the .05 level in the number of heart attacks between the treatment (Vioxx) and control (generic drug) groups. Thus, the 5:1 ratio of heart attacks was ignored. In 2003, a 73-year-old woman died suddenly of a heart attack while taking her prescribed Vioxx pills as directed. As a result of her death, an investigation was opened, and attention was called to the obvious clinical significance of this 5:1 ratio. This is a dramatic example of incorrectly interpreting the failure to reject the null hypothesis as absence of an effect.
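The arithmetic of rare events makes this kind of failure easy to reproduce. The following sketch uses entirely hypothetical counts, not the actual trial data, to show how a 5:1 imbalance can fail to reach p < .05 when events are rare (assuming SciPy's Fisher exact test).

```python
from scipy.stats import fisher_exact

# Hypothetical 2 x 2 table (NOT the actual trial counts):
#                 heart attack   no heart attack
table = [[5, 1995],   # hypothetical treatment group, n = 2000
         [1, 1999]]   # hypothetical control group,   n = 2000

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio ~ {odds_ratio:.1f}, p = {p_value:.2f}")
# p lands well above .05 here, yet a fivefold imbalance in heart
# attacks may still be clinically important.
```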

Recommendations

Researchers usually are interested in the size of an effect, which can serve as a quantitative measure of importance. Because the p value does not quantify magnitude or importance, it is important to include appropriate statistical measures, in addition to p values, in the results of a study. Commonly used statistical software packages (e.g., Minitab, SAS, S-Plus, SPSS, Stata) produce output that includes p values for each statistical test. In addition, most procedures in these software packages offer the user the option to compute statistical measures to assess the magnitude and strength of association. The CI is an estimate of the magnitude of an effect. There are many other statistical measures of magnitude well worth examining in a data analysis, including the correlation coefficient, odds ratio, relative risk, hazard ratio, and regression coefficient; a small worked example of several of these measures appears later in this section. The current edition of the Publication Manual of the American Psychological Association states the following:

The inclusion of confidence intervals (for estimates of parameters, for functions of parameters such as differences in means, and for effect sizes) can be an extremely effective way of reporting results. Because confidence intervals combine information on location and precision and can often be directly used to infer significance levels, they are, in general, the best reporting strategy (p. 34).

A CI gives a range of values that we are confident contains the quantity of interest. The confidence level is a function of the level of significance (the probability of committing a Type I error, rejecting the null hypothesis when it is true) and is given by 100(1 − α)%. For example, if α = .05, the corresponding confidence level is 100(1 − 0.05)% = 95%. The CIs also quantify the uncertainty about the quantity of interest. Instead of estimating a parameter (such as the mean of a population) with a single number, CIs provide interval estimates for the quantity of interest. In other words, a CI is interpreted as telling a researcher that, with many hypothetical repetitions of an experiment, the population parameter can be expected to be contained in the CI in 100(1 − α)% of the experiments. For example, a clinical interpretation corresponding to a 95% CI of [a, b] for occurrence of a symptom could be: "We are 95% confident the population mean number of occurrences lies between a and b."

Lang and Secic (2006) describe differences between statistical significance and clinical importance and suggest using CIs to provide scientific evidence of the magnitude of an observed effect.
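As a small illustration of the magnitude measures listed earlier, the following sketch computes a relative risk, odds ratio, and risk difference from a hypothetical 2 × 2 table (the counts are invented for illustration, not taken from any study):

```python
# Hypothetical 2 x 2 table: exposure (rows) by outcome (columns)
a, b = 30, 70  # exposed:   30 with the outcome, 70 without
c, d = 15, 85  # unexposed: 15 with the outcome, 85 without

risk_exposed = a / (a + b)    # 0.30
risk_unexposed = c / (c + d)  # 0.15

relative_risk = risk_exposed / risk_unexposed    # 2.00
odds_ratio = (a * d) / (b * c)                   # ~2.43
risk_difference = risk_exposed - risk_unexposed  # 0.15

print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}, "
      f"RD = {risk_difference:.2f}")
```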

The current edition of the Publication Manual of the American Psychological Association also states:

For the reader to appreciate the magnitude or importance of a study's findings, it is almost always necessary to include some measure of effect size in the Results section. Whenever possible, provide a confidence interval for each effect size reported to indicate the precision of estimation of the effect size. Effect sizes may be expressed in the original units and are often most easily understood when reported in original units…. The general principle to be followed, however, is to provide the reader with enough information to assess the magnitude of the observed effect (p. 34).

In addition to using one or more measures of effect or strength to quantify evidence from a study, CIs should be presented for each measure selected. This will allow a researcher to make clear and valid statements about clinical significance and to support these statements with appropriate and corresponding statistical measures.
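A hedged sketch of this reporting style follows: an effect size (Cohen's d) paired with a percentile bootstrap CI. The data, group sizes, and bootstrap approach are illustrative assumptions, not a prescription from the article.

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, 50)  # hypothetical control group
b = rng.normal(0.6, 1.0, 50)  # hypothetical treatment group

def cohens_d(x, y):
    """Pooled-SD standardized mean difference."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) +
                  (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

# Bootstrap CI: resample each group with replacement many times
boot = [cohens_d(rng.choice(a, size=len(a)), rng.choice(b, size=len(b)))
        for _ in range(5000)]
low, high = np.percentile(boot, [2.5, 97.5])

print(f"d = {cohens_d(a, b):.2f}, 95% bootstrap CI = [{low:.2f}, {high:.2f}]")
```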

Discussion

Researchers use statistics to develop evidence-based implications for practice from information reported in research studies. Significance testing has been used to decide which information may be considered evidence supportive of a practice change. The reality is that there is no single objective method that is suitable for every topic or study. Judgment and subjectivity are necessary and part of the decision-making process. Statistical significance is not a measure of importance; it is a subjective and qualitative construct.

Researchers conducting quantitative analyses should quantify the magnitude of an effect. The value of the data collected should be assessed by examining study design, bias, and confounding variables, as well as meaningfulness of the results to the topic under study. Statistical significance does not imply causality and offers no information about the importance of an association.

The misuse of significance testing has led to flawed research findings and false claims. This practice can change with awareness and understanding. The current version of the Publication Manual of the American Psychological Association (2010) provides clear guidelines and suggestions for statistical reporting. The suggestions in this sixth edition correct many issues raised as a result of misconceptions about significance testing. The researcher can avoid the pitfalls of significance testing by reporting a measure of effect size and including a CI for each measure of interest. Study measures used in statistical tests that fail to meet statistical significance should be considered carefully, and measures of effect size should be examined for clinical or practical importance. Reporting of study results should entail a correct interpretation of p values and statistical significance; p values do not provide a measure of effect size or trend. In addition to these reporting guidelines, the researcher should approach each literature review with awareness of the limitations of significance testing and carefully evaluate its use in previously published material.

Accepted for publication: November 19, 2009. The author thanks Gayle Page, DNSc, RN, FAAN, Jerilyn Allen, ScD, RN, FAAN, and Lynn D. Torbeck, MS, for their thoughtful reviews.


Corresponding author: Matthew J. Hayat, PhD, School of Nursing, Johns Hopkins University, 525 N. Wolfe St., Room 532, Baltimore, MD 21205 (e-mail: mhayat2@son.jhmi.edu).

References

American Psychological Association. (1952). Publication Manual of the American Psychological Association (1st ed.). Washington, DC: Author.

American Psychological Association. (1974). Publication Manual of the American Psychological Association (2nd ed.). Washington, DC: Author.

American Psychological Association. (2010). Publication Manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Berenson, A. (2005, April 24). Newly disclosed e-mails add Vioxx wrinkle; patient's death during drug test downplayed. Chicago Tribune, Sect. 1, p. 14.

Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1), 1–32.

Cohen, I. B. (1984). Florence Nightingale. Scientific American, 250(3), 128–137.

Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.

Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.

Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995–1004.

Grier, M. R. (1978). Florence Nightingale: Saint or scientist? Research in Nursing & Health, 1(3), 91.

Hogg, R. V. (1989). How to cope with statistics. Journal of the American Statistical Association, 84, 1–5.

Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology—and its future prospects. Educational and Psychological Measurement, 60(5), 661–681.

Knapp, T. R. (1996). The overemphasis on power analysis. Nursing Research, 45(6), 379–381.

Kopf, E. W. (1978). Florence Nightingale as statistician. Research in Nursing & Health, 1(3), 93–102.

Lang, T. (2007). The need for accurate statistical reporting. A commentary on "Guidelines for reporting statistics in journals published by the American Physiological Society: The sequel." Advances in Physiology Education, 31(4), 299–307.

Lang, T., & Secic, M. (2006). How to report statistics in medicine: Annotated guidelines for authors, editors, and reviewers. Philadelphia: American College of Physicians.

Lewi, P. J. (2006). Speaking of graphics. Retrieved April 19, 2009, from http://www.datascope.be/sog.htm

Lisse, J. R., Perlman, M., Johansson, G., Shoemaker, J. R., Schechtman, J., Skalky, C. S., et al. (2003). Gastrointestinal tolerability and effectiveness of rofecoxib versus naproxen in the treatment of osteoarthritis: A randomized, controlled trial. Annals of Internal Medicine, 139(7), 539–546.

Matthews, R. (1998, September 13). The great health hoax. The Sunday Telegraph, Sunday Review Section, 1–2.

Merriam-Webster. (2009). Significance. Retrieved April 19, 2009, from http://www.merriam-webster.com/dictionary/significance

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34.

Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor, MI: The University of Michigan Press.

