Statistical methods are the hallmark of quantitative research. Examining whether a result is statistically significant is standard content in social work research statistics courses. Across the social sciences, statistical significance testing has been the accepted procedure for examining results. Despite ongoing efforts to encourage researchers to report some index of magnitude that is not directly affected by sample size, such as an effect size, statistical significance testing appears to remain the standard. In 1994, the APA Publication Manual "encouraged" authors to report effect sizes, but this encouragement had little impact (see, for example, Keselman et al., 1998; Kirk, 1996; Thompson & Snyder, 1998).
The current Publication Manual of the American Psychological Association (2001) states that "it is almost always necessary to include some index of effect size or strength of relationship" (p. 25). This position was influenced by the APA Task Force on Statistical Inference, which recommended that researchers report alpha levels and effect sizes (Wilkinson & Task Force on Statistical Inference, 1999). It was also the stance advocated by Jacob Cohen (1990), an expert on statistical power, who argued that "the primary product of a research inquiry is one or more measures of effect size, not p values" (p. 12). The field has responded slowly, but effect sizes are increasingly visible (Fidler et al., 2005; Vacha-Haase & Thompson, 2004). Indeed, some journal editors, especially in psychology, now require authors to report effect size measures (Harris, 2003; Snyder, 2000; Trusty, Thompson, & Petrocelli, 2004). The reporting of effect size measures is also increasing in social work journals; however, it is still common to find studies that report no effect size indices (for example, Claiborne, 2006; Engelhardt, Toseland, Goa, & Banks, 2006; Padgett, Gulcur, & Tsemberis, 2006; Perry, 2006). There do not appear to be any social work journals that require effect size reporting.
Whereas many researchers are familiar with the use of effect size measures for power estimation and sample size determination, they are less familiar with using such measures to interpret their findings. Many researchers do not realize that there are different types of effect size measures, that there are different methods of calculation within each type, and that the interpretation of an effect size depends on the measure used. The purpose of this column is to further a basic understanding of effect size measures (what they mean and how they differ) and to suggest one way of presenting outcome data for easier interpretation. Because effect sizes are becoming part and parcel of the social science research enterprise, they must be clearly understood. In particular, this column focuses on the use of effect size measures when presenting the results of intervention research.
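To illustrate how calculation methods differ even within one type of effect size, the minimal Python sketch below computes two standardized mean differences from the same hypothetical treatment and control scores: Cohen's d, which divides by the pooled standard deviation, and Glass's delta, which divides by the control group's standard deviation alone. It also converts d to a correlation-type measure r. The scores, the function names, and the choice of these particular measures are illustrative assumptions, not the column's own method; the d-to-r conversion shown assumes equal group sizes.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled (sample) standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd

def glass_delta(treatment, control):
    """Standardized mean difference using only the control group's standard deviation."""
    return (mean(treatment) - mean(control)) / stdev(control)

def d_to_r(d):
    """Convert d to a correlation-type effect size r (valid for equal group sizes)."""
    return d / sqrt(d ** 2 + 4)

if __name__ == "__main__":
    # Hypothetical outcome scores, invented purely for illustration.
    treatment = [12, 15, 14, 10, 13, 16, 11, 14]
    control = [10, 12, 9, 11, 13, 8, 10, 12]
    d = cohens_d(treatment, control)
    delta = glass_delta(treatment, control)
    print(f"Cohen's d = {d:.2f}, Glass's delta = {delta:.2f}, r = {d_to_r(d):.2f}")
```

With these made-up scores the control group varies less than the two groups combined, so Glass's delta comes out larger than Cohen's d for the same mean difference. A reader comparing effect sizes across studies therefore needs to know which formula was used.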
Measures of effect size provide critically different information than p values and alpha levels, because effect size addresses the practical importance of the results as assessed by the magnitude of the effect. It is well known that one can obtain a statistically significant result that is not practically significant. One basic misunderstanding in statistical analysis is thinking that an observed p value considered highly significant, for example p = .0001, also reflects a large effect. The p value is the probability of obtaining a result at least as extreme as the one observed if chance or sampling error alone were operating (that is, if the null hypothesis were true); it reveals nothing about the size of the effect. One important reason to use effect size measures is that they help the researcher consider the importance of his or her findings apart from statistical significance. This kind of analysis is too often neglected; as Abelson (1995) noted, statistical significance tests should be used for "guidance rather than sanctification" (p. 9).
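The point that a p value says nothing about effect size can be checked numerically. The sketch below holds a small standardized mean difference (d = 0.1, an assumed value) constant and computes an approximate two-sided p value for a two-group comparison at increasing sample sizes. It uses a large-sample normal (z) approximation, z = d·sqrt(n/2) with n participants per group, rather than an exact t test; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(d, n_per_group):
    """Approximate two-sided p value for a two-group comparison with
    standardized mean difference d and n_per_group participants in each group.
    Uses a normal (z) approximation, not an exact t test."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

if __name__ == "__main__":
    d = 0.1  # a small effect, held constant across sample sizes
    for n in (20, 200, 2000, 20000):
        print(f"n per group = {n:>5}: d = {d}, p = {two_sided_p(d, n):.4f}")
```

The effect size never changes, yet the p value shrinks from clearly nonsignificant to vanishingly small as the sample grows. A very small p value therefore reflects sample size at least as much as the magnitude of the effect.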
Rosenthal and colleagues (2000) demonstrated how effect sizes can be used in conjunction with significance levels to arrive at an inference. Consider a study in which the researcher obtained a nonsignificant p value but, without realizing it, also obtained a large effect size. In this instance the researcher would conclude that the intervention had no effect. If this were an assessment of the effectiveness of a new drug for reducing cancer cells, it would not take long to realize that the conclusion represents a serious mistake. Indeed, many studies are underpowered; that is, their samples are not large enough to detect a meaningful effect as statistically significant. Consequently, a researcher may falsely conclude that there is "no difference" when, in fact, an effect size would show a meaningful difference. A conclusion of no difference based on statistical significance testing might lead the researcher in the cancer drug study to stop his or her investigation, whereas a meaningful effect size might suggest just the opposite: the importance of replicating the study with a larger, sufficiently powered sample and of looking across similar studies to assess the reliability of the effect size findings. Rosenthal and colleagues described this problem in inference as the "failure to perceive the practical importance of nonsignificant results" (p. 4).
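The underpowered scenario can be made concrete with a rough power calculation. The sketch below, again assuming a two-sided comparison of two equal-sized groups and a normal approximation, estimates the power to detect a given effect size d and the number of participants per group needed to reach a target power. The function names are hypothetical. Calculations based on the t distribution (as in Cohen's power tables or dedicated power software) give slightly larger sample sizes, so these figures should be read as approximations.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal
    approximation; ignores the negligible chance of rejecting in the wrong tail)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)

def n_per_group_needed(d, power=0.80, alpha=0.05):
    """Smallest n per group whose approximate power reaches the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)

if __name__ == "__main__":
    # A large effect studied with a small sample is often missed.
    print(f"Power for d = 0.8 with 10 per group: {approx_power(0.8, 10):.2f}")
    # Sample size a replication would need for 80% power at alpha = .05.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group_needed(d)} per group")
```

Under these assumptions, a study with 10 participants per group has less than an even chance of declaring a large effect (d = 0.8) statistically significant, which is exactly the situation in which a nonsignificant p value paired with a large effect size argues for replication rather than abandonment.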
In contrast, a researcher might have a large sample and obtain a statistically significant finding but observe only a small effect. A common conclusion is that such a result is highly significant, when it may be of little practical importance (Abelson, 1995; Cohen, 1990). This is not to suggest that all small effects are of little practical importance; the practical relevance of an effect size must be judged within the context of the problem studied. We should not be fooled by...