Nonsignificance plus high power does not imply support for the null over the alternative

This article summarizes arguments against the use of power to analyze data, and illustrates a key pitfall: Lack of statistical significance (e.g., p > .05) combined with high power (e.g., 90%) can occur even if the data support the alternative more than the null. This problem arises via selective choice of parameters at which power is calculated, but can also arise if one computes power at a prespecified alternative. As noted by earlier authors, power computed using sample estimates ("observed power") replaces this problem with even more counterintuitive behavior, because observed power effectively double counts the data and increases as the P value declines. Use of power to analyze and interpret data thus needs more extensive discouragement.
Ann Epidemiol 2012;22:364–368. © 2012 Elsevier Inc. All rights reserved.
KEY WORDS: Counternull, Power, Significance, Statistical Methods, Statistical Testing.
From the Department of Epidemiology and Department of Statistics, University of California, Los Angeles, Los Angeles, CA. Address correspondence to: Sander Greenland, MA, MS, DrPH, University of California, Department of Epidemiology and Department of Statistics, Campus 177220, Los Angeles, CA 90095-1772. Received October 28, 2011. Accepted February 3, 2012. Published online.

Use of power for data analysis (post hoc power) has a long history in epidemiology. Over the decades, however, many authors have criticized such use, noting that power provides no valid information beyond that seen in P values and confidence limits. Despite these criticisms, recommendations favoring post hoc power have appeared in many textbooks, articles, and journal instructions, especially as a purported aid for interpreting a "nonsignificant" test of the null. Although such recommendations have dwindled in mainstream journals, as Hoenig and Heisey (6) note, a search on "power" through journal archives reveals that the practice and its encouragement survive. Furthermore, it is still common in internal reports, especially for litigation, where it may be used to buttress claims of study adequacy when in fact the study has inadequate numbers to reach any conclusion.

Statistical power is the probability of rejection ("significance") when a given non-null value (the alternative) is correct. That is, power is the probability that p ≤ α under the alternative, where α is a given maximum allowable type I error (false-positive) rate. Among the problems with power computed from completed studies are these:

1. Irrelevance: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.

2. Arbitrariness: There is no convention governing the free parameters (parameters that must be specified by the analyst) in power calculations beyond the α-level.

3. Opacity: Power is more counterintuitive to interpret correctly than P values and confidence limits. In particular, high power plus "nonsignificance" does not imply that the data or evidence favor the null.

The charge of irrelevance can be made against all frequentist statistics (which refer to frequencies in hypothetical repetitions), but can be deflected somewhat by noting that confidence intervals and one-sided p values have straightforward single-sample likelihood and Bayesian posterior interpretations. I therefore review the arbitrariness and opacity issues with the goal of illustrating them in simple numerical terms. I then review how "observed power" (power computed using sample estimates), which is supposed to address the arbitrariness issue, aggravates the opacity issue. Like many predecessors, I conclude that post hoc power is unsalvageable as an analytic tool, despite any value it has for study planning.

A P value has no free parameter and a confidence interval has only one, α, which is inevitably taken to be 0.05. In contrast, power requires specification of an alternative and at least one background parameter (e.g., baseline incidence); because there is no convention regarding their choice, power can be manipulated far more easily than a p value or a confidence interval. The reason for lack of convention is not hard to understand: The alternative and any background parameter are too context specific (even more context specific than an α-level).
The following example, although extreme, is real and illustrates the plasticity of power calculations compared with P values and confidence intervals. While serving as a plaintiff statistical expert concerning data on the relation of gabapentin to suicidality, I was asked to review pooled data from randomized trials as used in a U.S. Food and Drug Administration (FDA) alert and report regarding suicidality risk from anti-epileptics (the class of drugs to which gabapentin belongs) (13), and defense expert calculations (14). The defense expert statistician (a full professor of biostatistics at a major university and ASA Fellow) wrote:

    Assuming that the base-rate of suicidality among placebo controlled subjects is 0.22% as stated in the FDA alert, we would have power of 80% to detect a statistically significant effect of gabapentin relative to placebo for gabapentin alone in the 4932 subjects (2903 on drug and 2029 on placebo) used by FDA in their analysis, once the rate for gabapentin reached 0.70%, or a relative risk of 3.18. This computation reveals that even for the subset of gabapentin data used by FDA in their analysis, a significant difference between gabapentin and placebo would have been consistently detected for gabapentin alone, once the incidence was approximately three times higher in gabapentin treated subjects.

The computation and conclusion do not withstand scrutiny. With regard to problem 2 above, note that:

(a) There were only 3 cases observed in the 28 placebo-controlled gabapentin trials contributing to these numbers, and only one case among the placebo groups; thus, the actual observed baseline rate in the gabapentin trials was 1/2029 = 0.05%. The figure of 0.22% used in the expert's calculation was more than four times this rate; it is not from placebo-controlled trials of gabapentin, but is instead from all 16,029 placebo controls in 199 randomized trials of all types of anti-epileptics. The gabapentin trial controls are only 2029 of 16,029, or 13%, of these controls; furthermore, only 7% of the gabapentin trial patients were psychiatric (high suicide risk), compared with 29% of patients in other trials (13, Table 8), so the lower rate in gabapentin controls is unsurprising.

(b) The value of the relative risk (RR) as 3.18 in the power calculation is back-calculated to produce 80% power, rather than determined from context; for example, there was no plaintiff claim that an effect this large was present. In many legal contexts, a guideline used for tort decisions is instead RR = 2, based on the common notion that this represents a (2 − 1)/2 = 50% individual probability of causation. This notion is incorrect in general, but tends to err on the low side of the actual probability of causation at RR = 2 (15-17); thus, RR = 2 is still useful as a pragmatic upper bound on the RR needed to yield 50% probability of causation.

If one uses the baseline rate of 0.22% cited by the expert, the power for detecting RR = 2 is under 25%; if one uses instead the 0.05% seen in the gabapentin trials, the power for detecting RR = 2 is under 10%. Thus the power reported by the defense expert was maximized by first taking the higher-risk population as the source of the baseline rate, and then finding an RR that would yield the desired power.

Regardless of one's preference, the figures illustrate the dramatic sensitivity of the power calculations to debatable choices. Of course, all the powers are arguably irrelevant to inference (problem 1). The mid-P 95% odds-ratio confidence limits (8, Ch. 14) from the same combined data are 0.11 and 41, whereas the approximate risk-ratio limits (8, Ch. 14) after adding ½ to each cell are 0.15 and 8.8, both showing that there is almost no information in the gabapentin trials about the side effect at issue.
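The sensitivity of these power figures to the assumed baseline rate can be reproduced with a short calculation. Below is a minimal sketch (the function and variable names are mine, not the expert's) that uses the normal approximation to the log risk ratio, with expected event counts derived from the stated arm sizes and baseline rates; it is an illustration, not the expert's actual computation:

```python
from math import erf, log, sqrt

def norm_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_log_rr(rr_alt, p0, n1, n0, z_alpha=1.96):
    """Approximate power of the two-sided 0.05-level test of RR = 1 when
    the true risk ratio is rr_alt and the baseline risk is p0, using the
    normal approximation to the log risk-ratio estimator."""
    e1 = n1 * p0 * rr_alt                   # expected events, treated arm
    e0 = n0 * p0                            # expected events, placebo arm
    s = sqrt(1/e1 + 1/e0 - 1/n1 - 1/n0)     # approximate SD of log RR
    b = log(rr_alt)
    return norm_cdf(b/s - z_alpha) + norm_cdf(-b/s - z_alpha)

n1, n0 = 2903, 2029  # gabapentin and placebo arm sizes from the quoted report
power_022 = power_log_rr(2.0, 0.0022, n1, n0)  # baseline 0.22%: about 0.24
power_005 = power_log_rr(2.0, 0.0005, n1, n0)  # baseline 0.05%: about 0.09
```

Under this approximation the power for RR = 2 is indeed under 25% at the 0.22% baseline rate, and under 10% at the 0.05% rate actually observed in the gabapentin trial controls.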
In the previous example, the low adverse event rate in controls severely limited the actual (before-trial) power and after-trial precision. However, genuinely high power can coincide with nonsignificance, regardless of whether the power is computed before the study or from the data under analysis. This phenomenon seems to especially challenge intuitions. Hence, I provide a simple, hypothetical example (with reasonable rates for common safety-evaluation settings) in which there is high power for RR = 2 and the P value for testing RR = 1 (the null P value) exceeds the usual significance cutoff α of 0.05, yet standard statistical measures of evidence favor the alternative (RR = 2) over the null (RR = 1). The example is designed to exclude other issues such as bias, with a rare outcome and large case numbers to keep the computations simple (although the figures resemble those seen in large postmarketing studies).

Suppose a series of balanced trials randomize 1000 patients to a new treatment and 1000 to placebo treatment, with no protocol violations, losses, unmasking, and so on, and that the pooled results are as in Table 1.

TABLE 1. Hypothetical randomized trial data exhibiting "nonsignificance" and high power, yet evidential measures favor the alternative

                 Events    Patients    Risk
New treatment      48        1000      4.8%
Placebo            32        1000      3.2%
From conventional 2 × 2 table formulas treating the log RR estimate as an approximately normal variate (see the Appendix), P = .07 (and thus "not significant at the .05 level") for the null hypothesis that the RR is 1. Assuming the 32 events observed in the placebo arm were as expected in the placebo group, the power for RR = 2 at α = 0.05 computed from these data is over 85%.

Based on these results, do the data favor RR = 1 over RR = 2? Here are some relevant statistics to answer the question:

a) The RR estimate is 1.50; in proportional terms, 1.50 is closer to 2 than to 1.

b) The 95% confidence limits are 0.97 and 2.33; in proportional terms, 1 is closer to the lower limit than 2 is to the upper limit.

c) The likelihood ratio comparing RR = 2 vs. RR = 1 is about 2.3; that is, the data are more than twice as probable under the alternative as under the null.

d) The P value for RR = 2 is 0.20, 3 times the p value for RR = 1.

e) The value of RR having the same p value and likelihood as the null (the "counternull" (18)) is about 1.5² = 2.25, which is further from the RR estimate than is 2.

Thus, despite "nonsignificance" (p > .05 for RR = 1) and power approaching 90% for RR = 2 at α = 0.05, the results favor RR = 2 over RR = 1 whether one compares them using the point estimate, the confidence interval, their likelihoods, their p values, or the counternull value.
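These statistics can be checked numerically. A sketch using the normal-approximation formulas of the Appendix, with b̂ = ln(1.5) and s = (1/48 + 1/32 − 2/1000)^½ (all variable names are mine):

```python
from math import erf, exp, log, sqrt

def norm_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Observed log risk ratio and its standard deviation (48/1000 vs. 32/1000).
b_hat = log(48 / 32)
s = sqrt(1/48 + 1/32 - 2/1000)

rr = exp(b_hat)                                     # point estimate: 1.50
ci = (exp(b_hat - 1.96*s), exp(b_hat + 1.96*s))     # roughly (0.97, 2.33)
p_null = 2 * norm_cdf(-abs(b_hat - log(1)) / s)     # roughly 0.07
p_alt = 2 * norm_cdf(-abs(b_hat - log(2)) / s)      # roughly 0.20
power_rr2 = (norm_cdf(log(2)/s - 1.96)
             + norm_cdf(-log(2)/s - 1.96))          # roughly 0.87
lr_2_vs_1 = exp(-((log(2) - b_hat)**2
                  - (log(1) - b_hat)**2) / (2*s*s)) # roughly 2.26
counternull = exp(2 * b_hat)                        # 1.5**2 = 2.25
```

Every one of these measures, despite p_null > .05 and high power for RR = 2, favors the alternative over the null.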
To avoid the arbitrariness problem, post hoc power analyses often focus on "observed power," that is, the power computed using the point estimates of the parameters in the calculation (the baseline rate and effect size). One problem with observed power is that it will make most any study look underpowered (6): In approximately normal situations with α = 0.05, such as those common in epidemiologic studies and clinical trials, the observed power will usually be less than 50% when p > α (although moderate exceptions can occur). In the hypothetical example, the observed power is only about 45%.

Observed power is plagued by nonintuitive behavior, traceable to the fact that the alternative used in an observed power calculation varies randomly and may be contextually irrelevant; hence, the observed power is also random like a p value, rather than fixed in advance as in ordinary power calculations. One consequence is that, just as a p value can be far from the false-positive (type I error) rate of the test, so observed power can be far from the true-positive rate (sensitivity) of the test.

Even more startling is the "power approach paradox" detailed by Hoenig and Heisey (6): Among nonsignificant results, those with higher observed power are commonly interpreted as stronger evidence for the null, when in fact just the opposite is the case. Observed power is merely a fixed transform of the p value, which grows as the p value shrinks; thus, higher observed power corresponds with a lower P value and lower relative likelihood for the null. In other words, higher observed power implies more evidence against the null by common evidence measures, even if the evidence is "nonsignificant" by ordinary testing conventions.

Observed power also involves and encourages a double counting of data. To illustrate, consider the following statement: "We observed no significant difference (p = .10) despite high power." Introducing observed power alongside p gives the impression that one has two pieces of information relevant to the null. But because observed power is merely a fixed transform of the null p value, it adds no new statistical information; it is just an awkward rescaling of the null p value that is even harder to interpret correctly than that p value (which is notorious for its misinterpretation as the probability of the null, even though one-sided p values do have simple Bayesian interpretations). In contrast, confidence limits cannot be constructed from a single p value, and thus do supply additional and more easily interpreted information beyond the p value.
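The "fixed transform" point is easy to make concrete. The following sketch (my own illustration, assuming a two-sided normal test at α = 0.05 with the observed estimate taken as the alternative) recovers observed power directly from the null p value, with no other input:

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """Observed power of a two-sided normal test: power computed at the
    alternative equal to the observed estimate, recovered from the null
    p value via the observed |z| = Phi^-1(1 - p/2)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - p/2)         # observed |z| implied by the p value
    z_a = nd.inv_cdf(1 - alpha/2)   # 1.96 for alpha = 0.05
    return nd.cdf(z - z_a) + nd.cdf(-z - z_a)

print(observed_power(0.07))  # about 0.44, i.e., roughly the 45% cited above
print(observed_power(0.05))  # about 0.50: p at the cutoff gives ~50% power
# Monotone: a smaller p value always yields a larger observed power.
assert observed_power(0.01) > observed_power(0.05) > observed_power(0.20)
```

Because observed_power is a strictly decreasing function of p, reporting both quantities conveys no more information than reporting p alone.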
There are elements of arbitrariness in all analyses. For all their problems, conventions are an obstacle to manipulation of results. Thus, although a p value can vary tremendously depending on what value of a measure (such as RR) is being tested, convention has decreed the null p value (e.g., for RR = 1) as one that must be included if testing is done. Of course, such conventions have side effects, and arguably many of the objections to statistical testing and p values stem from the focus on null testing. But, as with power, these objections would be partially addressed if a conventional alternative value were always tested as well (e.g., RR = ½ or RR = 2, depending on the direction of the observed effect).
Likewise, the convention of fixing the test criterion α at 0.05 is arbitrary, but has likely prevented its manipulation. This convention has carried over into interval estimation as the nearly universal 95% level seen in both confidence intervals and posterior intervals, and has remained in place despite attempts to unseat it by using a 90% level (22). From a precision perspective, however, shifting to 90% has modest implications, as it narrows approximate normal intervals by only 1 − 1.645/1.960 = 16%; furthermore, the reader is warned of this narrowing by the statement of 90% accompanying the interval. In contrast, power changes arising from shifts in the baseline rate or alternative can have far more spectacular impact, and yet come with no reference point, simple calculation, or even intuition to warn of this impact.

The latter arbitrariness problem has led to use of observed power, which brings a host of its own problems. Nonetheless, one might ask if observed power or the like remains useful for speculating how much power a future study would have. I would question even that much utility: The observed data are almost never the only source of information on which to base such a forecast. The alternative of interest should be at least partly determined by what effect size is considered important or worth detecting, rather than the noisy and possibly biased estimate observed from existing data. Calculating power from data using a fixed alternative of genuine interest is a partial answer to the problems of observed power, but brings back the arbitrariness issue. And it still depends on study-peculiar features (such as the observed baseline rate and exposure allocation ratio or prevalence) that would be unlikely to apply to a different study population. In fact, it could be advantageous to alter these features for future studies, as power can be sensitive to design choices like allocation ratios (or case-control ratios in case-control studies), which can be improved relative to past studies.

In sum, use of power in data analysis and interpretation (as opposed to research proposals) is more prone to grave misinterpretation than are other statistics. Chief among the misinterpretations is the mistake that "high power" in the face of nonsignificance means the null is better supported than the alternative, a mistake still exploited in unpublished reports even if no longer common in epidemiologic articles. Thus, contrary to some articles but in agreement with many others, I argue that power analysis is useful only in discussing sample size requirements of further studies; if there are specific alternatives of interest in an analysis, the P value for those alternatives should be given in place of power. This means, in particular, that we need to accustom ourselves and students to the fact that concepts (such as power and smallest detectable difference) can be detrimental to inference from existing data even if they are useful for study planning.

The problem of "underpowered studies" that post hoc power is supposed to address is an artifact of focusing on whether p ≤ α (fixed-level testing) in individual studies. A study can contribute useful data no matter how small and underpowered it is, as long as it is interpreted with proper accounting for its final imprecision. Once its data are in, "underpowered" needs to be replaced by its post-trial analog, imprecision: a problem immediately evident and addressed when using confidence intervals. Unlike p values and power, those intervals also supply the minimum information needed to combine individual study results in a meta-analysis, which is the most direct way of addressing imprecision.

REFERENCES

1. Beaumont JJ, Breslow NE. Power considerations in epidemiologic studies of vinyl chloride workers. Am J Epidemiol. 1981;114:725–734.
2. Cox DR. The planning of experiments. New York: Wiley; 1958.
3. Greenland S. On sample-size and power calculations for studies using confidence intervals. Am J Epidemiol. 1988;128:231–237.
4. Smith AH, Bates M. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology.
5. Goodman SN, Berlin J. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results.
6. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55:19–24.
7. Senn S. Power is indeed irrelevant in interpreting completed studies.
8. Rothman KJ, Greenland S, Lash TL, eds. Modern epidemiology. 3rd ed. Philadelphia: Lippincott-Wolters-Kluwer; 2008.
9. Hooper R. The Bayesian interpretation of a P-value depends only weakly on statistical power in realistic situations. J Clin Epidemiol. 2009;62.
10. Halpern SD, Barton TD, Gross R, Hennessy S, Berlin JA, Strom BL. Epidemiologic studies of adverse effects of anti-retroviral drugs: how well is statistical power reported? Pharmacoepidemiol Drug Safety. 2005;14.
11. Cox DR, Hinkley DV. Theoretical statistics. New York: Chapman and Hall.
12. Casella G, Berger RL. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. J Am Stat Assoc. 1987;82:106–111.
13. Office of Biostatistics. Statistical review and evaluation: antiepileptic drugs and suicidality. Bethesda, MD: U.S. Food and Drug Administration; 2008.
14. Gibbons RD. Supplemental expert report of March 19, 2009 in re: Neurontin Marketing, Sales and Liability Litigation, U.S. District Court of Massachusetts (Case 1:04-cv-10981-PBS).
15. Robins JM, Greenland S. The probability of causation under a stochastic model for individual risks. Biometrics. 1989;46:1125–1138.
16. Greenland S. The relation of the probability of causation to the relative risk and the doubling dose: a methodologic error that has become a social problem. Am J Public Health. 1999;89:1166–1169.
17. Greenland S, Robins JM. Epidemiology, justice, and the probability of causation. Jurimetrics. 2000;40:321–340.
18. Rosenthal R, Rubin DB. The counternull value of an effect size: a new statistic. Psychol Sci. 1994;5:329–334.
19. Sellke T, Bayarri MJ, Berger JO. Calibration of p values for testing precise null hypotheses. Am Stat. 2001;55:62–71.
20. Goodman SN. A dirty dozen: twelve P-value misconceptions. Semin Hematol.
21. Greenland S, Poole C. Problems in common interpretations of statistics in scientific articles, expert reports, and testimony. Jurimetrics. 2011;51.
22. Rothman KJ. Modern epidemiology. Boston: Little Brown; 1986.
23. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA. 2001;285:1987–1991.
24. Poole C. Low P values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12:291–294.

APPENDIX

Statistics for Table 1 were computed from the usual normal approximation to the log risk-ratio estimator b̂, where b is the log risk-ratio parameter ln(RR) (8, Ch. 14). Suppose the sample (observed) log risk ratio is b̂ and the estimated asymptotic standard deviation of b̂ is s. Let Φ(z) be the standard cumulative normal distribution function (area below z); then Φ(−z) = 1 − Φ(z) is its complement, and the following approximations are useful for tables in which the outcome is rare:

1) The 95% confidence limits for RR are exp(b̂ ± 1.96s).
2) The one-sided P values for RR < e^b and RR > e^b are Φ((b − b̂)/s) and Φ((b̂ − b)/s), respectively.
3) The two-sided P value for RR = e^b is 2Φ(−|b̂ − b|/s).
4) The rejection rates of the one-sided 0.025-level tests of RR < 1 and RR > 1, given RR = e^b, are Φ(b/s − 1.96) and Φ(−b/s − 1.96).
5) The power of the two-sided 0.05-level test of RR = 1 given RR = e^b is the sum of the one-sided 0.025-level rejection rates, Φ(b/s − 1.96) + Φ(−b/s − 1.96).
6) The likelihood ratio for RR₂ = exp(b₂) relative to RR₁ = exp(b₁) is exp(−[(b₂ − b̂)² − (b₁ − b̂)²]/2s²).

Statistics for Table 1 were computed using b̂ = ln(1.5) and s = (1/48 + 1/32 − 2/1000)^½ in these formulas. Because of the large case numbers, using the two-binomial likelihood for the table instead of the normal approximation changes the answers only slightly; for example, the ratio of likelihoods for RR = 2 versus RR = 1 is then about 2.3, whereas the normal approximation gives about 2.26.
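The closing comparison can be checked without the normal approximation. One convenient device (my own, not spelled out in the Appendix) uses the fact that with a rare outcome and equal arm sizes, conditional on the total case count the treated-arm count is approximately binomial with success probability RR/(RR + 1); binomial coefficients cancel in the likelihood ratio:

```python
from math import exp, log

def cond_loglik(rr, cases_treated=48, total_cases=80):
    """Approximate log likelihood for the risk ratio rr, conditioning on
    the 80 total cases split between two equal-size arms: the treated-arm
    count is then roughly Binomial(80, rr/(rr + 1)) for a rare outcome."""
    p = rr / (rr + 1.0)   # probability a given case falls in the treated arm
    return cases_treated * log(p) + (total_cases - cases_treated) * log(1 - p)

# Likelihood ratio for RR = 2 versus RR = 1 from the conditional likelihood:
lr_cond = exp(cond_loglik(2.0) - cond_loglik(1.0))   # about 2.30

# Normal-approximation counterpart from Appendix formula 6:
b_hat, s2 = log(1.5), (1/48 + 1/32 - 2/1000)
lr_norm = exp(-((log(2) - b_hat)**2 - (0 - b_hat)**2) / (2 * s2))  # about 2.26
```

The two figures agree to the stated precision, illustrating that for these case numbers the normal approximation changes the likelihood comparison only slightly.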