Sample Size Planning with Effect Size Estimates
Abstract
The use of effect size estimates in planning the sample size necessary for a future study can introduce substantial bias in the sample size planning process. For instance, the uncertainty associated with the effect size estimate may result in average statistical power that is substantially lower than the nominal power specified in the calculation. The present manuscript examines methods for incorporating the uncertainty present in an effect size estimate into the sample size planning process for both statistical power and accuracy in parameter estimation (i.e., desired confidence interval width). Several illustrative examples are provided along with computer programs for implementing these procedures. Discussion focuses on the choices among different approaches to determining statistical power and accurate parameter estimation when planning the sample size for future studies.

Sample Size Planning with Effect Size Estimates

When designing a study or an experiment, a number of critical decisions need to be made based on incomplete or uncertain information before data collection begins. One of these critical decisions is planning a priori the sample size needed to achieve the researcher's goal. The goal of the sample size planning process may be adequate statistical power – the probability of correctly rejecting a false null hypothesis. Alternatively, the goal may be accurate parameter estimation – estimating the effect size with a specified level of precision. If a study has moderate to low statistical power, then there is a moderate to high probability that the time and resources spent on the study will yield a nonsignificant result. If a study results in a wide confidence interval for the effect size, then regardless of statistical significance, little information is gleaned regarding the actual magnitude of the effect. Consequently, it is good research practice – and indeed required by many granting agencies – to plan the sample size for a prospective study that will achieve the desired goal(s).

At first glance, study design is relatively simple, if computationally intensive, as there is a deterministic relationship among the criteria (i.e., statistical power or confidence interval width), sample size, the specified critical level (i.e., the Type I error rate α or the confidence level), and the population effect size. If any three of these quantities are known, then the fourth can be calculated exactly. In practice, the sample size for a prospective study is often calculated by setting the desired level of statistical power at a particular value such as .80 (e.g., Cohen, 1988, 1992) or the width of the standardized mean difference confidence interval to a certain level (e.g., .10 or .20) for a specified level of α. The necessary sample size for power may then be approximated from Cohen's (1988) tables or determined exactly using available software such as, for example, G*Power (Erdfelder, Faul, & Buchner, 1996), Statistica (Steiger, 1999), or SAS (O'Brien, 1998; SAS Institute Inc., 2003), among others. The necessary sample size for a specified confidence interval for the standardized mean difference can be determined, for instance, from tables presented in Kelley and Rausch (2006) or exactly from Kelley's (2007) MBESS program available in R (R Development Core Team, 2006).
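To make this deterministic relationship concrete, the per-group sample size for a two-group comparison can be computed in R, the platform referenced above. The following is a minimal sketch using base R's power.t.test; the commented MBESS line is the analogous accuracy-in-parameter-estimation call, with argument names assumed from Kelley's (2007) documentation:

# Solve for n given alpha, desired power, and an assumed population
# standardized mean difference (two-sided, two-sample t-test).
power.t.test(delta = 0.50,      # assumed population effect size
             sd = 1,            # sd = 1, so delta is standardized
             sig.level = 0.05,  # Type I error rate (alpha)
             power = 0.80)      # desired statistical power
# Yields n = 63.77, i.e., 64 participants per group after rounding up.

# Analogous sample size for a desired confidence interval width via
# MBESS (Kelley, 2007); arguments assumed from its documentation:
# MBESS::ss.aipe.smd(delta = 0.50, conf.level = 0.95, width = 0.35)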
As straightforward as this may initially seem, the fine print on this process contains critical details that are often glossed over (e.g., see Lenth, 2001, for a practical discussion of the issues involved in study design). Both statistical power and accurate parameter estimation require an estimated or hypothesized population effect size (e.g., see Muller & Benignus, 1992, p. 217). The requisite sample size calculated in this manner is conditional on the specified population effect size. In other words, the logic of this manner of power calculation is as follows: Assuming the population effect size is a specified value, then with sample size n, power will be .80. This presents a certain irony – if the effect size is already known, why conduct the study? In practice, the effect size is not known precisely and exactly, but estimates of the effect size may be available.

The present manuscript examines the relationships among statistical power and accurate parameter estimation, sample size, and estimates of the effect size. Specifically, we first examine the impact of estimated effect sizes on statistical power and then discuss how to use prior information and probability distributions on the effect size to increase design efficiency, improve confidence intervals, and better achieve the desired level of statistical power or accurate parameter estimation. The manuscript is organized as follows: First we discuss traditional approaches to sample size planning and how the use of standard effect size estimates without incorporating information about uncertainty can bias statistical power. We then discuss the benefits and rationale for incorporating a Bayesian perspective in the study design process and illustrate how to use this approach for statistical power calculations given effect size estimates with (a) no prior information and (b) prior information such as from a meta-analysis. We then discuss this approach when the criterion is accurate parameter estimation, i.e., a desired confidence interval width. Finally, we discuss conceptual and practical issues related to sample size planning. Note that definitions and notation are summarized in Table 1 and expanded upon in footnote 2; more extensive analytical details, as well as additional equations, are sequestered within footnotes.

Approaches to Specifying the Population Parameter

The population effect size parameter, for instance, δ, is a necessary input to the process of determining the sample size required for the desired level of statistical power or accurate parameter estimation. Since the parameter is not known, how then does one proceed? Consider how sample size planning is often initially taught. Two of the more widely adopted introductory statistics texts in psychology (Gravetter & Wallnau, 2006; Howell, 2007) present three approaches to determining the population effect size to use as the basis of planning sample size: (1) assessment of the minimal effect size that is important to detect, (2) Cohen's conventions, and (3) prior research.

1. Minimally important effect size. If the metric of the dependent variable is not arbitrary (e.g., blood pressure, cholesterol level, etc.) and there is a clear and well-defined clinical therapeutic level on that dependent variable, then sample size planning can be based around that clinical level.
Muller and colleagues present methods for power analysis to detect a specified level of change on the dependent variable that incorporate the uncertainty associated with estimates of the population standard deviation (e.g., Coffey & Muller, 1999; Muller, LaVange, Ramey, & Ramey, 1992; Taylor & Muller, 1995a). In psychology, the dependent variable often is not measured on such clean ratio-level scales, clearly demarcated therapeutic levels of change are not known, and consequently standardized effect sizes may be the only available metric. The use of standardized effect sizes in sample size planning is not without criticism (e.g., Lenth, 2001). In part, this criticism reflects concern about conflating the magnitude of an effect with actual importance – not unlike the confusion behind declaring that because two groups are statistically significantly different, the difference between the two groups is therefore practically significant. Yet in the absence of any viable alternative, the use of a standardized effect size often is the only option. However, in this context, the choice of which standardized effect size is sufficiently important to detect is arbitrary and may vary across researchers. This naturally leads to considering qualitative interpretations of the magnitude of standardized effect sizes and Cohen's conventions.

2. Cohen's conventions. Cohen provided rough qualitative interpretations of standardized effect sizes corresponding to small, medium, and large effects. For the standardized mean difference these are .2, .5, and .8, and for the correlation these are .1, .3, and .5, respectively. Examining statistical power for small, medium, and large effects is essentially equivalent to considering the entire power curve – the graph of how power changes as a function of effect size for a given sample size. Examining a power curve, although informative about the power-effect size relationship, does not provide a systematic or formal basis for how to proceed. For example, a researcher examining a traditional power curve that displays statistical power as a function of the effect size for a given sample size may conclude that power is quite reasonable for a medium to largish effect size. Another researcher may look at the same curve and conclude that the study is grossly overpowered given a large effect size. Yet another may conclude the study is grossly underpowered given a medium effect. This is an extremely subjective decision-making process with little formal justification for the choice of the effect size on which to base decisions. Indeed, many may not conduct power analyses at all given how subjective the process may appear.

3. Prior research. Following the recommendations of Wilkinson and the APA Task Force on Statistical Inference (1999), researchers have been encouraged to supplement traditional p-values with effect size estimates and confidence intervals. Providing and examining effect sizes and corresponding confidence intervals helps shift the research question from solely asking, "Is the effect different from zero?" to inquiring as well, "What is the estimated magnitude of the effect and the precision of that estimate?" (see Ozer, 2007, for a discussion of interpreting effect sizes). As a consequence of this shift in reporting practice, effect size estimates are more readily accessible. When engaged in sample size planning for a future study, researchers often will have estimate(s) of the effect size at hand.
These may come from previously published research, extensive internal pilot studies, conference presentations, unpublished manuscripts, or other sources. In this manuscript, we focus on this case – when there is some effect size estimate available that is relevant to the future study. That such estimates should be used in the sample size planning process is almost self-evident. For a researcher to assert the goal of achieving, for example, sufficient statistical power for a small to medium effect size (e.g., δ = .30) rests on the premise that a small-medium effect is actually meaningful. Even if that premise is warranted, using that criterion may be grossly inefficient if there is evidence that the effect size is in reality larger. This criticism holds as well for dependent variables measured on well-defined scales with clear therapeutic levels of change – if there is evidence that the effect is substantially larger than the minimum change needed to produce a therapeutic effect, designing a study to detect that minimum change may be inefficient and costly. All available information should be used in the study design process. The question that arises naturally is how to use that effect size estimate. As we will illustrate, naïvely using effect size estimates as their corresponding population parameters may introduce substantial bias into the sample size planning process.

How Naïve Use of Effect Size Estimates Biases Statistical Power

We now consider the impact of using effect size estimates in power calculations in a straightforward manner and how this can lead to bias in the actual average level of power. Consider a hypothetical researcher who wishes to replicate a two-group study in which an intervention designed to change attitudes towards littering is implemented and littering behavior is subsequently measured. The original study involved a total of 50 participants (i.e., n = 25 per group) with an estimated standardized mean difference of d = .50 between the treatment and the control conditions. It seems quite reasonable to use this effect size estimate to conduct a power analysis to determine the sample size needed for the subsequent study to have adequate statistical power. Indeed, using this estimated effect size our researcher determines that 64 subjects per group are needed to have power of .80 under the assumption that δ = .50.

At first glance it would seem logical to use effect size estimates to guide power analyses in this manner. Although sometimes sample estimates are above the population parameter and sometimes below, shouldn't statistical power calculated on effect size estimates average to .80 across different sample realizations of the same population effect size? Interestingly, the answer is no. Even if the effect size estimator is unbiased with a symmetric sampling distribution, sample size calculations based on that effect size estimate can result in average statistical power that is substantially lower than the nominal level used in the calculations. Bias in estimated statistical power from the use of estimated effect sizes emerges from the asymmetrical relationship between sample effect size estimates and actual statistical power (e.g., Gillett, 1994, 2002; Taylor & Muller, 1995b). This bias may in fact be quite substantial. Observed estimates below the population effect size will result in suggested sample sizes for future studies that result in power approaching 1.
In contrast, effect size estimates above the population value suggest sample sizes for future studies for which power drops down toward α, the Type I error rate, which is also the lower bound for power. This asymmetrical relationship results in average actual power across the sampling distribution of the effect size estimate that is less than the nominal power used in the calculations based on each observed effect size estimate.

To understand more clearly how average statistical power can differ from the nominal statistical power, consider the following thought experiment. A large number of researchers all examine the exact same effect using the same procedure and materials, drawing random samples from the same population where the effect size is δ = .20 with n1 = n2 = 25. Thus, each researcher has an independent sample from the sampling distribution of the standardized mean difference and uses this observed standardized mean difference to plan the required sample size necessary to achieve power of .80. Suppose one researcher observes d = .30 and uses this information as if it were the population effect size in a standard power analysis program, concluding that n should be 176 per group in the subsequent study to achieve power of .80. Another researcher observes d = .15 and determines that n should be 699 per group. Yet another researcher observes d = .60 and determines that n should be 45 per group, and so on. Researchers who observe a larger d will determine that they require a smaller sample size than those researchers who observe a smaller d.

Figure 1 graphs the sampling distribution of the standardized mean difference based on δ = .20 and n = 25, the sample size each hypothetical researcher determines is needed for the subsequent study when the observed effect size (d) is used as the population parameter to plan sample size, and finally the actual statistical power for each researcher's subsequent study based on that sample size given that δ is actually .20. Only when the sample estimate is |d| = δ = .20, the population standardized mean difference, does the actual power for a subsequent replication equal .80. Thus large observed standardized mean differences result in low statistical power since researchers will conclude that they require a relatively small sample size for the subsequent study. On average, across the sampling distribution of the effect size estimate for this example, statistical power is only .61 – even though each sample size calculation was based on a nominal power of .80. Average statistical power is calculated by numerically integrating over the product of the sampling distribution of the standardized mean difference and the power curve in Figure 1 (a simulation-based version of this calculation is sketched below).

This bias in average statistical power is reduced both when the initial effect size estimate is measured with greater precision (e.g., based on larger sample sizes) and when the population effect size is larger. This can be seen in Figure 2, which graphs the average statistical power across the sampling distribution of the standardized mean difference as a function of the population standardized mean difference and the sample size. The bias in statistical power is defined as the difference between the average statistical power across the sampling distribution and the nominal power used for each power calculation to determine sample size.
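The average power of .61 reported above can be approximated by Monte Carlo simulation rather than numerical integration. The following R sketch is our illustration, not the manuscript's original program: it draws observed d values from their sampling distribution under δ = .20 with n = 25 per group, lets each simulated researcher plan a new per-group sample size for nominal power of .80 based on their own d, and then computes the actual power of each planned study given the true δ:

# Sampling distribution of d for two groups of size n:
#   d = t * sqrt(2/n), where t is noncentral t with df = 2n - 2
#   and noncentrality parameter delta * sqrt(n/2).
set.seed(123)
delta  <- 0.20    # true population standardized mean difference
n_orig <- 25      # per-group n in the original study
reps   <- 10000

d_obs <- rt(reps, df = 2 * n_orig - 2,
            ncp = delta * sqrt(n_orig / 2)) * sqrt(2 / n_orig)

# Each researcher plans n for nominal power .80 using their own d.
# |d| is floored at .05 purely for numerical stability; such tiny
# estimates imply enormous planned n and actual power near 1 anyway.
planned_n <- sapply(pmax(abs(d_obs), 0.05), function(d)
  ceiling(power.t.test(delta = d, power = 0.80)$n))

# Actual power of each planned study given the true delta = .20.
actual_power <- sapply(planned_n, function(nn)
  power.t.test(n = nn, delta = delta)$power)

mean(actual_power)   # approximately .61, not the nominal .80

The asymmetry is visible in the simulation: small observed d values produce huge planned samples (power near 1), while large observed d values produce small planned samples whose power can fall far below .80, and the average lands well under the nominal level.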
The implications of blindly using effect size estimates in statistical power calculations and the resulting bias warrant incorporating information regarding the sampling variability of the effect size estimate into the study design process. Clearly, the simple use of an effect size estimate in the sample size planning process is not justifiable. We now discuss how to use effect size estimates – and all of the information associated with the estimate – in the sample size planning process.

A Formal Basis for Sample Size Planning using Effect Size Estimates

A population effect size is a necessary input to the process when planning the sample size for a future study, whether the goal is a specified level of power or a specified level of precision for the effect size estimate. The present manuscript adopts a Bayesian perspective on the population effect size during the study design process; however, inferences and/or estimation are based solely on the data collected in the future study. Further discourse on amalgamating Bayesian and frequentist perspectives is deferred to the discussion.

Adopting the Bayesian perspective for considering the population effect size is a pragmatic solution to the vexing problem of how to use estimates of effect sizes in the sample size planning process. As we have seen, simply using the effect size estimate as a proxy for the parameter value results in levels of statistical power that are lower than specified in the planning process. In contrast to examining a single parameter value, the Bayesian perspective instead provides a probability distribution of parameter values known as the posterior distribution. The posterior distribution is the distribution of plausible parameter values given the observed effect size estimate and is a function of the likelihood of the observed data given a parameter value and the prior distribution of the parameter value. In other words, the posterior distribution provides a whole distribution of parameter values to consider during the planning process.

Using the Bayesian framework, we can therefore perform a statistical power calculation or accuracy in parameter estimation calculation based on a given sample size and examine the consequent distribution of statistical power or interval precision as a function of the posterior distribution. In this way, the Bayesian framework provides a formal mechanism for incorporating the imprecision associated with the effect size estimate when planning sample size. The specific steps are as follows (see the sketch after this list):

1. Determine the posterior distribution of the population effect size parameter given observed data (e.g., an effect size estimate). The posterior distribution can be thought of as representing the uncertainty associated with the observed effect size estimate, as it is the distribution of plausible values of the parameter given the observed data.

2. Use the posterior distribution as input in the study design process to determine the posterior predictive distribution of the test statistic for a specified future sample size. This represents the distribution of test statistics for a given sample size across the plausible values of the population parameter.

3. The posterior predictive distribution of a test statistic thus incorporates the uncertainty associated with estimated effect sizes. It is then straightforward to determine the sample size needed to achieve the expected (average) statistical power or desired confidence interval width.
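As a rough illustration of these three steps, the following R sketch approximates the posterior of δ under a vague prior by a normal distribution centered on the observed d with standard deviation equal to the usual large-sample standard error of d. This normal approximation is our simplification for illustration, not the exact posterior developed in the manuscript; expected power is then the average of the conditional power values across the posterior draws:

# Expected power for a future two-group study, given d = .50
# observed with n = 25 per group in the original study.
set.seed(123)
d_obs  <- 0.50
n_orig <- 25
n_new  <- 64      # planned per-group n from the naive calculation

# Step 1: approximate posterior of delta (normal approximation
# using the large-sample standard error of d).
se_d <- sqrt(2 / n_orig + d_obs^2 / (4 * n_orig))
delta_draws <- rnorm(10000, mean = d_obs, sd = se_d)

# Steps 2-3: conditional power of the future study for each
# plausible delta; expected power averages over the draws.
cond_power <- sapply(delta_draws, function(dl)
  power.t.test(n = n_new, delta = abs(dl))$power)

mean(cond_power)  # roughly .67 under this approximation,
                  # well below the nominal .80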
For instance, power is simply the proportion of the posterior predictive distribution that is larger in magnitude than the critical t-values. Expected power (EP), determined by averaging across the posterior distribution, provides a formal basis for making definitive statements about the probability of the future study reaching the desired goal (i.e., significance or accurate parameter estimation). However, by adopting a Bayesian perspective, there is an implicit change in the nature and interpretation of probabilities from conventional power calculations.

To illustrate, consider the earlier example where a researcher has an effect size estimate of d = .50 based on n = 25. The traditional power calculation based on δ = d = .50 resulted in n = 64 to achieve power of .80. This is a probability statement about repeatedly conducting the exact same experiment an infinite number of times on samples from the same population: 80 percent of future studies based on n = 64 will be significant if δ = .50. In contrast, the Bayesian concept of expected power provides a different probability. As we illustrate shortly, with no additional information, using n = 64 results in expected power of only .67. This is not a statement about what would happen if the researcher repeated the experiment an infinite number of times. Instead, expected power is a statement about the proportion of researchers, examining different topics, in different populations, using different techniques, who, based on the same observed effect size estimate of .50 and no other information (i.e., different parameter values are all essentially equally likely), all conduct a future study based on n = 64. Sixty-seven percent of these researchers would obtain significant results in the future study. This is a subtle conceptual shift in the definition of power that we revisit and expand upon later, after illustrating the actual mechanics and process of calculating expected power.

The difficulty in applying Bayes' Theorem and calculating expected power lies in determining the prior distribution of the parameter. Different choices of prior distributions yield different posterior distributions, resulting in the criticism that the researcher's subjectivity influences the Bayesian analysis. We first discuss and illustrate the non-informative prior case before examining several techniques for incorporating additional information into the posterior distribution.

Power calculations based on an effect size estimate and a non-informative prior. Much work has been done to determine prior distributions that are not subjective, allow the observed data to dominate the calculation of the posterior distribution, and thereby minimize the impact of the prior distribution. These non-subjective priors (see Bernardo, 1997, for a deeper philosophical discussion) are also termed "probability matching priors" in that they ensure the frequentist validity of the Bayesian credible intervals based on the posterior distribution. In some cases this probability matching may be asymptotic (e.g., see Datta & Mukerjee, 2004, for a review) whereas, as we will demonstrate, for the effect size estimates d and r this probability match can be exact (Berger & Sun, 2008; Lecoutre, 1999, 2007; Naddeo, 2004).
In other words, as discussed in more detail in Biesanz (2010), the Bayesian credible intervals considered in this manuscript under the non-informative prior distribution correspond exactly to confidence intervals for effect sizes calculated following the procedures outlined in Cumming and Finch (2001), Kelley (2007), Steiger and Fouladi (1997), and Smithson (2001). With an exact match between the traditional frequentist confidence interval and the Bayesian credible interval in this context, the posterior distribution represents exactly the same inferential information and uncertainty contained in traditional p-values. Differences between the two perspectives are solely philosophical and interpretational.

Suppose that a researcher has an effect size estimate d, as in our attitude-behavior example, or an observed correlation r, but no other sources of information to guide the power analysis such as relevant meta-analyses or comparable studies on the same topic. Under a non-informative prior, the posterior distribution of the standardized mean