by Stephen T. Ziliak and Deirdre N. McCloskey*
(forthcoming, Journal of Socio-Economics)
* Ziliak: Roosevelt University; McCloskey: University of Illinois at Chicago and Erasmus
University at Rotterdam. We thank for their amazed attention to the present paper
audiences at Baruch College (CUNY), the University of Colorado at Boulder, the Georgia
Institute of Technology, the University of Georgia, the University of Illinois at Chicago, the
annual meeting of the Eastern Economic Association (2003), and the ICAPE Conference on
“The Future of Heterodox Economics” (University of Missouri-Kansas City). Cory Bilton,
David McClough, and Noel Winter provided excellent research assistance. Our special
Abstract. Significance testing as used has no theoretical justification. Our article in the
Journal of Economic Literature (1996) showed that of the 182 full-length papers published in
the 1980s in the American Economic Review 70% did not distinguish economic from
statistical significance. Since 1996 many colleagues have told us that practice has
improved. We interpret their response as an empirical claim, a judgment about a fact. Our
colleagues, unhappily, are mistaken: significance testing is getting worse. We find here
that in the next decade, the 1990s, of the 137 papers using a test of statistical significance in
the AER fully 82% mistook a merely statistically significant finding for an economically
significant finding. A super majority (81%) believed that looking at the sign of a coefficient
sufficed for science, ignoring size. The mistake is causing economic damage: losses of jobs
and justice, and indeed of human lives (especially in, to mention another field enchanted
with statistical significance as against substantive significance, medical science). The
confusion between fit and importance is causing false hypotheses to be accepted and true
hypotheses to be rejected. We propose a publication standard for the future: “Tell me the
oomph of your coefficient; and do not confuse it with merely statistical significance.”
Corresponding author (before Aug. 1, ’03): Steve Ziliak, School of Economics, Georgia
Institute of Technology, Atlanta, GA, 30332-0615; email: stephen.ziliak@econ.gatech.edu;
phone: 404.894.4912 fax: 404.894.1890; (after Aug. 1, ’03) Faculty of Economics, School of
Policy Studies, Roosevelt University, 430 S. Michigan, Chicago, IL 60605.
Sophisticated, hurried readers continue to judge works on the
sophistication of their surfaces. . . . I mean only to utter darkly that
in the present confusion of technical sophistication and
significance, an emperor or two might slip by with no clothes.
New York: Harper and Row, 1988 ed., p. 31.
Seven years ago, in "The Standard Error of Regressions," we showed how
significance testing was used during the 1980s in the leading general interest
journal of the economics profession, the American Economic Review (McCloskey
and Ziliak 1996). The paper reported results from a 19-item “questionnaire”
applied to all of the full-length papers using regression analysis. Of the 182
papers 70% did not distinguish statistical significance from policy or scientific
significance---that is, from what we call “economic significance” (Question 16,
Table 1, p. 105). And fully 96 percent misused a statistical test in some (shall we
say) significant way or another. Of the 70% that flatly mistook statistical
significance for economic significance, further, again about 70% failed to report
even the magnitudes of influence between the economic variables they
investigated (1996, p. 106). In other words, during the 1980s about one half of the
empirical papers published in the AER did not establish their claims as economically significant.
Some economists have reacted to our finding by saying in effect, “Yes, we
know it’s silly to think that fit is the same thing as substantive importance; but
we don’t do it: only bad economists do.” (Such as, it would seem, the bad ones
who publish in the AER, an implied evaluation of our colleagues that we do not
accept.) And repeatedly in the several score of seminars we have given together
and individually on the subject since 1996 we have heard the claim that "After
the 1980s, the decade you examined in your 1996 paper, best practice has improved."
All the better econometricians we have encountered, of course, agree with
our point in substance. This is unsurprising, since the point is obviously true: fit
is not the same thing as scientific importance; a merely statistical significance
cannot substitute for the judgment of a scientist and her community about the
largeness or smallness of a coefficient by standards of scientific or policy oomph.
As Harold Jeffreys remarked long ago, to reject a hypothesis because the data
show “large” departures from the prediction “requires a quantitative criterion of
what is to be considered a large departure” (Jeffreys 1967, p. 384, quoted in
Zellner 1984, p. 277n). Just so. Scientific judgment requires quantitative
judgment, not endlessly more machinery. As lovely and useful as the machinery
is, at the end, having skillfully used it, the economic scientist needs to judge its
output. But the economists and calculators reply, “Don’t fret: things are getting
better. Look for example at this wonderful new test I have devised.”
We are very willing to believe that our colleagues have since the 1980s
stopped making an elementary error. But like them we are empirical scientists.
And so we applied the same 19-item questionnaire of our 1996 paper to all the
full-length empirical papers of the next decade of the AER, just finished, the 1990s.
Significance testing violating the common sense of first-year statistics and
the refined common sense of advanced decision theory, we find here, is not in
fact getting better. It is getting worse. Of the 137 relevant papers in the 1990s,
82% mistook statistically significant coefficients for economically significant
coefficients (as against 70% in the earlier decade). In the 1980s 53% had relied
exclusively on statistical significance as a criterion of importance at its first use; in the 1990s fully 64% did (Question 7).
“Significance Testing Is a Cheap Way to Get Marketable Results”
William Kruskal, an eminent statistician long at the University of Chicago,
an editor of the International Encyclopedia of the Social Sciences, and a former
president of the American Statistical Association, agrees. “What happened?” we
asked him in a recent interview at his home (William Kruskal 2002). "Why did
significance testing get so badly mixed up, even in the hands of professional
statisticians?" "Well," said Kruskal, who long ago had published in the
Encyclopedia a devastating survey on “significance” in theory and practice
(Kruskal 1968a), “I guess it's a cheap way to get marketable results.”
Bingo. Finding statistical significance is simple, and publishing
statistically significant coefficients survives at least that market test. But cheap t-
tests, becoming steadily cheaper with the Moore’s-Law fall in computation cost,
have in equilibrium a marginal scientific product equal to their cost. Entry
ensures it. In the 1996 paper we discussed the history of statistical versus
economic significance. Viewed from the sociology and economics of the
discipline the notion of statistical significance has been a smashing success.
Many careers have prospered on testing, testing, testing (as David Hendry likes
to put it). But intellectually the testing has been a disaster, as indeed Edgeworth
had warned at the dawn.1 He corrected Jevons, who had concluded that a “3 or 4
per cent” difference in the volume of commercial bills is not economically
important: “[b]ut for the purpose of science, the discovery of a difference in
condition, a difference of 3 per cent and much less may well be important”
(Edgeworth 1885, p. 208). It is easy to see why: a statistically insignificant
coefficient in a financial model, for example, may nonetheless give its discoverer
an edge in making a fortune; and a statistically significant coefficient in the same
model may be offset in its exploitation by transactions costs. Statistical
significance, to put it shortly, is neither necessary nor sufficient for a finding to
be economically important. Yet an overwhelming majority of economists, we have
shown for the 1980s and now again still more for the 1990s, believe statistical
significance is necessary; and a simple majority believe it is sufficient.
Economists are skeptics, members of the tribe of Hume. But Ronald
Aylmer Fisher (1890-1962), who codified the usage we are objecting to, was a
rhetorical magician (as Kruskal once noted, the inventor of such enchanting
phrases as “analysis of variance” or “efficiency”; “significance” was older).
1 Edgeworth (1885, p. 187), we believe, is the first source of the word “significance” in a context of hypothesis testing. Our earlier paper claimed erroneously that John Venn was first (McCloskey and Ziliak 1996, p. 97; see Baird 1988, p. 468). Anyway, the 1880s: for some purposes not a meaningful difference.
Long-lived and persistent, he managed to implant for example a “rule of 2” in
the minds of economic and other scientists. In 1925, for example, listen to him
computing for the masses a first test of significance in his Statistical Methods for Research Workers:
The value for which P=.05, or 1 in 20, is 1.96 or nearly 2; it is
convenient to take this point as a limit in judging whether a
deviation is to be considered significant or not. Deviations
exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a
false indication only once in 22 trials, even if the statistics were the
only guide available. Small effects will still escape notice if the
data are insufficiently numerous to bring them out, but no
lowering of the standard of significance would meet this difficulty
(Fisher 1925 [1941], p. 42; emphasis added).
Notice how a standard of “convenience” rapidly became in Fisher’s prose
an item to be “formally regarded.” With Fisher there’s no loss function. There’s
no thinking beyond the statistic. We’re “to take this point as a limit.” Fisher’s
famous and influential book nowhere confronts the difference between statistical
and substantive significance (pp. 123-124, 139-140, concerning soporific drugs
and algae growth). He provided (and then stoutly defended for the rest of his
long life against the decision-theoretic ideas of Neyman, Pearson, and Wald) the
mechanical procedure of significance testing, shorn of any loss function, that most
economists still follow.
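The arithmetic behind Fisher's numbers is easily reproduced. The short computation below (in Python, and purely illustrative) shows where the 1.96, the "1 in 20," and the "once in 22 trials" come from; notice that nothing in it speaks to the size of the deviation that matters.

# Illustrative only: where Fisher's "1.96", "1 in 20", and "1 in 22" come from.
# Uses the standard normal distribution; erf gives the two-sided tail probability.
from math import erf, sqrt

def two_sided_p(z: float) -> float:
    # P(|Z| > z) for a standard normal variable Z
    return 1.0 - erf(z / sqrt(2.0))

print(round(two_sided_p(1.96), 4))       # 0.05: Fisher's "1 in 20"
print(round(two_sided_p(2.0), 4))        # 0.0455: deviations beyond twice the standard deviation
print(round(1.0 / two_sided_p(2.0), 1))  # about 22: "once in 22 trials"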
Our policy recommendation is this: that the profession adopt the
standards set forth 120 years ago by Edgeworth, and in the years intervening by
a small but distinguished list of dissenters from the mechanical standard of 5% significance.
Practice Has Improved in a Few Ways, But Not in the Crucial Matter of Significance
Table 1 reports the results, distinguished by decade, for the 319 full-length
papers using regression published from January 1980 to December 1999. (We have at hand
the whole population, not a sample; the urn of nature is poured out before us;
unlike many of our colleagues, therefore, we will refrain from calculating
statistics relevant only to inference from samples to a population, such as the
“statistical significance” of the differences between the two decades.) Like Table
1 in McCloskey and Ziliak (1996) Table 1 here ranks in ascending order each
item of the questionnaire according to "Percent Yes." A "yes" means that the
paper took what every statistical theorist since Edgeworth (with the significant
exception of R. A. Fisher) has regarded as the “correct” action on the matter. For
example, in the 1980s 4.4% of the papers considered the power of the tests (and
we do not believe it accidental that every paper considering power also
considered “a quantitative criterion of what is to be considered a large
departure.”) That is, 4.4% did the correct thing by considering also the
probability of a Type II error. In the 1990s 8% did. That’s an encouraging trend.
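What "considering the power of the test" amounts to can be put in a few lines. The numbers below are hypothetical and the normal approximation is ours, not taken from any paper in the survey: given a standard error and a substantively interesting alternative, the calculation asks how often a 5 percent two-sided test would detect the effect.

# A minimal sketch of what "considering the power of the test" means.
# All numbers are hypothetical; a normal approximation is used throughout.
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sided(effect: float, se: float, alpha_z: float = 1.96) -> float:
    # Probability that a two-sided z-test rejects H0: beta = 0 when the true
    # coefficient equals `effect` and its standard error is `se`.
    shift = effect / se
    return (1.0 - normal_cdf(alpha_z - shift)) + normal_cdf(-alpha_z - shift)

# Suppose a coefficient of 0.10 would be economically important and the
# design yields a standard error of 0.08.  The chance of detecting it:
print(power_two_sided(effect=0.10, se=0.08))   # about 0.24: a 76% risk of a Type II error
# Quadrupling the sample roughly halves the standard error and raises power:
print(power_two_sided(effect=0.10, se=0.04))   # about 0.71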
The change in practice is more easily seen in Tables 2 and 3, which isolate
improvement and decline. In the 1980s only 44.5% of the papers paid careful
attention to the theoretical and accounting meaning of the regression coefficients
(Question 5). That is, in the 1980s the reader of an empirical paper in the AER
was nearly 6 times out of 10 left wondering how to interpret the economic
meaning of the coefficients. In the 1990s the share taking the correct action rose
to 81%, a net improvement of about 36 percentage points. (This is what we mean
by oomph: a big change, important for the science.) Similarly, the percentage of
papers reporting units and descriptive statistics for regression variables of
interest rose by 34 percentage points, from 32.4% to 66.4% (Question 2). And
gains of more than 20 percentage points were made in the share of papers
discussing the scientific conversation in which a coefficient would be judged
large or small, the share of papers keeping statistical and economic significance
distinct in the "conclusions" section, and the share of papers doing a simulation
to determine whether the estimated coefficients are reasonable. (Our definition
of “simulation” is broad. It includes papers that check the plausibility of the
regression results by making, for example, Harberger-Triangle-type calculations
on the basis of descriptive data. But a paper that uses statistical significance as
the sole criterion for including a coefficient in a later simulation is coded "No,”
which is to say that it does not do a simulation to determine whether the coefficients are economically reasonable.)
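A minimal sketch of such a Harberger-Triangle-type plausibility check from descriptive data follows. Every number in it is a hypothetical illustration, not a figure from any paper in the survey.

# A back-of-the-envelope deadweight-loss ("Harberger triangle") calculation.
# All numbers are hypothetical illustrations.

def harberger_dwl(elasticity: float, tax_rate: float, expenditure: float) -> float:
    # Deadweight-loss triangle: roughly 1/2 * |elasticity| * tax_rate^2 * (price x quantity),
    # with tax_rate expressed as a fraction of the price.
    return 0.5 * abs(elasticity) * tax_rate ** 2 * expenditure

# Suppose a regression implies a demand elasticity of -0.8 for a market with
# $10 billion of annual expenditure and a 10 percent tax.
dwl = harberger_dwl(elasticity=-0.8, tax_rate=0.10, expenditure=10e9)
print(f"Implied deadweight loss: ${dwl/1e6:.0f} million per year")  # about $40 million
# The question the simulation answers is economic, not statistical:
# is a loss of that size large or small for the policy being discussed?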
These few gains are commendable. Whether they are scientifically
significant is something only we scientists can judge, in serious conversation
with each other (for example: that 8% rather than 4% consider power is nice, but
still leaves 92 percent of the papers risking high levels of a Type II error). In
almost every question (that is, in all except perhaps Question 5 concerning the
interpretation of theoretical coefficients, in which the improvement approaches
levels that most people would agree are good practice) the improved levels of
performance are still less than impressive. For example, in the 1990s two-thirds of
the papers did not make calculations to determine whether the estimated
magnitude of the coefficients made sense (Question 17)---only a third, we found,
had simulated the effect of their coefficients with at least the elementary force of
Ec 1. Skepticism of alleged effects is by contrast normal practice in sciences like
chemistry and physics. (By the way, we have found by examining The Physical Review that physicists approximately never use tests of statistical significance; so
too, in the magazine Science, the chemists and geologists; many biologists
reporting their results in Science are less clear-minded on the matter; and in their
own journals the medical scientists, like the social scientists, are hopelessly
confused about substantive error as against sampling error. Bald examples of
this last may be found in the technical notes enclosed with prescription medicines.)
Milton Friedman from 1943 to 1945 was a statistician for the Statistical
Research Group of the Division of War Research at Columbia University (there is
still a non-parametric test named after him). Listen to his experience with
statistical vs. substantive significance:
One project for which we provided statistical assistance was the
development of high-temperature alloys for use as the lining of jet
engines and as blades of turbo superchargers---alloys mostly made
of chrome, nickel, and other metals. . . . Raising the temperature a
bit increases substantially the efficiency of the turbine, turbo
supercharger, or jet engine . . . . I computed a multiple regression
from a substantial body of data relating the strength of an alloy at
various temperatures to its composition. My hope was that I could
use the equations that I fitted to the data to determine the
composition that would give the best result. On paper, my results
were splendid. The equations fitted very well [note: statistically;
with high R2] and they suggested that a hitherto untried alloy
would be far stronger than any existing alloy. . . . The best of the
alloys at that time were breaking at about ten or twenty hours; my
equations predicted that the new alloys would last some two
hundred hours. Really astounding results! . . . So I phoned the
metallurgist we were working with at MIT and asked him to cook
up a couple of alloys according to my specifications and test them.
I had enough confidence in my equations to call them F1 and F2
but not enough to tell the metallurgist what breaking time the
equations predicted. That caution proved wise, because the first
one of those alloys broke in about two hours and the second one in about three.
Friedman learned that statistical significance is not the same as metallurgical significance.
The core confusion over the meaning of significance testing is reported in
Table 3. One problem, which is often taken to be our main objection (it is not,
though bad enough on its own), is that statistical nonsignificance is nonpublic.
In the 1990s only one fourth of the papers avoided choosing variables for inclusion (pretests, that is) solely on the basis of statistical significance, a net
decline in best practice of fully 43 percentage points (Question 14). As Kruskal put it:
Negative results are not so likely to reach publication as are
positive ones. In most significance-testing situations a negative
result is a result that is not statistically significant, and hence one
sees in published papers and books many more statistically
significant results than might be expected. . . . The effect of this is
to change the interpretation of published significance tests in a way
that is hard to analyze quantitatively (1968a, p. 245).
The response to Question 14 shows that economists made it hard in the 1990s to
analyze quantitatively, in Kruskal’s sense, the real-world relevance of their
“significant” results. It’s the problem of searching for significance, which
numerous economists have noted, in cynical amusement or despairing
indignation, is encouraged by the incentives to publish.
"Asterisk econometrics," the ranking of coefficients according to the
absolute value of the test statistic, and "sign econometrics," remarking on the
sign but not the size of coefficient, were widespread in the 1980s. But they are
now a plague. Eighty-one percent of the papers in the 1990s engaged in what we
called “sign econometrics” (in the 1980s 53% did [Question 11]). In their paper
"Tax-based Test of the Dividend Signaling Hypothesis" Bernheim and Wantz
(June 1995, p. 543) report that "the coefficients [in four regressions on their crucial
variable, high-rated bonds] are all negative . . . . However, the estimated values
of these coefficients," they remark, "are not statistically significant at
conventional levels of confidence." The basic problem with sign econometrics,
and with the practice of Bernheim and Wantz, can be imagined with two price
elasticities of demand for, say, insulin, both estimated tightly, one at size -0.1
and the other at –4.0. Both are negative, and would both be treated as “success”
in establishing that insulin use responded to price; but the policy difference
between the two estimates is of course enormous. Economically (and medically)
speaking, for most imaginable purposes -0.1 is virtually zero. But when you are
doing sign econometrics you ignore the size of the elasticity, or the dollar effect of
the bond rating, and say instead, "the sign is what we expected."
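The arithmetic of the insulin example makes the point concrete. The two elasticities come from the example just given; the baseline and the 20 percent price change are hypothetical.

# Worked illustration of the insulin example in the text.  The elasticities
# (-0.1 and -4.0) come from the text; the size of the price change is hypothetical.

def quantity_change(elasticity: float, pct_price_change: float) -> float:
    # Approximate percent change in quantity demanded for a given percent change
    # in price, using a constant-elasticity approximation.
    return elasticity * pct_price_change

for eps in (-0.1, -4.0):
    dq = quantity_change(eps, pct_price_change=20.0)
    print(f"elasticity {eps:+.1f}: a 20% price rise changes quantity by {dq:+.0f}%")
# elasticity -0.1: about  -2%  (virtually no response: "virtually zero" oomph)
# elasticity -4.0: about -80%  (a collapse in insulin use)
# Both coefficients carry the "expected sign"; only their sizes matter for policy.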
Sign econometrics is worse when the economist does not report
confidence intervals. Perhaps because they were not trained in the error-
regarding traditions of engineering or chemistry, economists seldom report
confidence intervals. Thus Hendricks and Porter, on "The Timing and Incidence
of Exploratory Drilling on Offshore Wildcat Tracts" (June 1996, p. 404): "In the
first year of the lease term, the coefficient of HERF is positive, but not significant.
This is consistent with asymmetries of lease holdings mitigating any
information externalities and enhancing coordination, and therefore reducing
any incentive to delay." Yet the reader does not know how much “HERF”---
Hendricks' and Porter's Herfindahl index of the dispersion of lease holdings
among bidders at auction---contributed to the probability the winners would
then engage in exploratory oil drilling. In Life on the Mississippi Mark Twain
noted that "when I was born [the city of] St. Paul had a population of three
persons; Minneapolis had just a third as many" (p. 390). The sign is what a St.-
Paul-enthusiast would want and expect. But the sign gives no guidance as to
whether a size of 1 is importantly different from 3. No oomph.
About two-thirds of the papers ranked the importance of their estimates
according to the absolute values of the test statistics, ignoring the estimated size
of the economic impact (Question 10). In other words, asterisk econometrics
(which is what we call this bizarre but widespread practice), became in the 1990s
a good deal more popular in economics (it has long been popular in psychology
and sociology), increasing over the previous decade by 43 percentage points.
Bernanke and Blinder (1992), Bernheim and Wantz (1995), and Kachelmeier and
Shehata (1992), for example, published tables featuring a hierarchy of p-, F-, and
t-statistics, the totems of asterisk econometrics (pp. 905, 909; p. 547; p. 1130). The
asterisk, the flickering star of *, has become a symbol of vitality and authority in
economic belief systems. Twenty years ago Arnold Zellner pointed out that
economists then (in a sample of 18 articles in 1978) never had “a discussion of
the relation between choice of significance levels and sample size” (one version
of the problem we emphasize here) and usually did not discuss how far from 5%
the test statistic was: “there is room for improvement in analysis of hypotheses
in economics and econometrics” (Zellner 1984, pp. 277-80). Yes.
What is most distressing about Table 3, however, is the rising conflation
of statistical and economic significance, indicated by the responses to Questions 16 and 7:
• 82% of the empirical papers published in the 1990s in the
American Economic Review did not distinguish statistical
significance from economic significance (Question 16). In
the 1980s, 70% did not---scandalous enough (McCloskey and Ziliak 1996).
• At the first use of statistical significance, typically in the
"Estimation" or "Results" section, 64% in the 1990s did not
consider anything but the size of the test statistics as a
criterion for the inclusion of variables in future work. In the
1980s, 53%---11 percentage points fewer papers---had done so (Question 7).
Following the Wrong Decision Rule Has Large Scientific Costs
Of course, not everyone gets it wrong. The American Economic Review is
filled with examples of superb economic science (in our opinion most of the
papers can be described this way---even though most of them, we have seen, make
elementary mistakes in the use of statistical significance; in other words, we do
not accept the opinion of one eminent econometrician we consulted, who
dismissed our case by remarking cynically that after all such idiocy is to be
expected in the AER). Table 4 reports the author rankings by economic
significance, in five brackets. If a paper chose between 15 and 19 actions
correctly, as Gary Solon's paper did (June 1992), then it is in the top bracket, the
best if not perfect practice. If the paper chose between 6 and 8 actions correctly,
as Gary Becker, Michael Grossman, and Kevin Murphy did (June 1994), then it is in one of the lower brackets.
Joshua D. Angrist does well in his "The Economic Returns to Schooling in
the West Bank and Gaza Strip" (Dec 1995 pp. 1065-1087). "Until 1972," Angrist
writes, "there were no institutions of higher education in these territories.
Beginning in 1972. . . . higher education began to open in the West Bank.
Previously, Palestinian residents of the territories had to obtain their advanced
schooling abroad. But by 1986, there were 20 institutions granting post-high
school degrees in the territories. As a consequence, in the early and mid 1980's,
the labor market was flooded with new college graduates. This paper studies
the impact of this dramatic influx of skilled workers on the distribution of wages
in the occupied territories" (p. 1064). In a first regression Angrist estimates the
magnitude of wage premia earned by Israelis and Palestinians who work in Israel:
The first column of Table 2 shows that the daily wage premium for
working in Israel fell from roughly 18 percent in 1981 to zero in
1984. Beginning in 1986, the Israel wage premium rose steeply. By
1989, daily wages paid to Palestinians working in Israel were 37
percent higher than local wages, nearly doubling the 1987 wage
differential. The monthly wage premium for working in Israel
increased similarly. These changes parallel the pattern of
Palestinian absences from work and are consistent with movements
along an inelastic demand curve for Palestinian labor (p. 1072).
The reader is told magnitudes. She knows the oomph.
Yet even Angrist falls back into asterisk econometrics. On page 1079 he is
testing alternative models, and emphasizes that:
The alternative tests are not significantly different in five out of
nine comparisons (p<0.02), but the joint test of coefficient equality
for the alternative estimates of [θt] leads to rejection of the null . . . .
To which his better nature would say, “So?”
David Zimmerman, in his “Regression Toward Mediocrity in Economic
Stature” (1992), and especially the well-named Gary Solon, in his
“Intergenerational Income Mobility in the United States” (1992), have set an
admirable if rare standard for the field. Line by line Solon asks the question
“How much?” and then gives an answer. How much, he wonders, is a son’s
economic well-being fated by that of his father? The sign, the star, the sign-and-
the-star-together, don’t tell. Previous estimates, observes Solon, had put the
father-son income correlation at about 0.2 (p. 394). A new estimate, a tightly fit
correlation of 0.20000000001***, would say nothing new of economic significance.
And a poorly fit correlation with the “expected sign” would say nothing.
Nothing at all. Solon’s attempts at a new estimate, on pages 397 to 405, refer
only once to statistical significance (p. 404). Instead, Solon writes 18 paragraphs
on economic significance: why he believes the “intergenerational income
correlation in the United States is [in fact] around 0.4” (p. 403) and how the
higher correlation changes American stories about mobility. Solon’s paper is
three standard deviations above the average of the AER.
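A rough illustration, under a simplifying assumption of ours rather than Solon's own estimating equation, of why 0.2 versus 0.4 matters: if each generation passes on a fraction rho of its income advantage, the two correlations imply very different rates at which advantages die out.

# Illustration of why 0.2 versus 0.4 matters economically.  The two correlations
# come from the text; modeling transmission as a simple first-order process in
# income advantage is our own simplifying assumption.

def expected_advantage(rho: float, generations: int, initial_gap_pct: float = 100.0) -> float:
    # Expected remaining income advantage (as a percent of the initial gap)
    # after a given number of generations, if each generation passes on rho of the gap.
    return initial_gap_pct * rho ** generations

for rho in (0.2, 0.4):
    gaps = [expected_advantage(rho, g) for g in (1, 2, 3)]
    print(f"rho = {rho}: {gaps[0]:.0f}%, {gaps[1]:.0f}%, {gaps[2]:.1f}% of the gap remains")
# rho = 0.2: 20%, 4%, 0.8%  -- advantages essentially vanish by the grandchildren
# rho = 0.4: 40%, 16%, 6.4% -- advantages persist several times longer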
“Minimum Wages and Employment: A Case Study of the Fast-Food
Industry in New Jersey and Pennsylvania” by David Card and Alan B. Krueger
(1994a), falls far below the median for cogency in statistical testing, though well
above the median in other features of scientific seriousness. Card and Krueger
designed their own surveys, collected their own data, talked on the telephone
with firms in their sample, and visited firms that did and did not respond to
their survey, all of which is most unusual among economists, and seems to have
raised scientific standards in the field. It matches the typical procedure in
economic history, for example, or the best in empirical sociology and
experimental physics. Their sample was designed to study prices, wages,
output, and employment in the fast food industry in Eastern Pennsylvania and
Western New Jersey before and after New Jersey raised its minimum wage
above the national and Pennsylvania levels. On pages 775-776 of the article (and
pages 30-33 in their widely cited book [1994b]), Card and Krueger report their
crucial test of the conventional labor market model. The chief prediction of the
conventional model is that full-time equivalent employment in New Jersey
relative to Pennsylvania would fall following the increase in the New Jersey
minimum wage. Specifically Card and Krueger’s null hypothesis says that the
difference-in-difference is zero---that “change in employment in New Jersey”
minus “change in employment in Pennsylvania” should equal zero if as they
suppose the minimum wage is not oomphul. If they find that the difference-in-
difference is zero (other things equal), then New Jersey gets the wage gains
without loss of employment: a good thing for workers. Otherwise, New Jersey
employment under the raised minimum wage will fall, perhaps by a lot: a bad
Yet Card and Krueger fail to test the null they claim. Instead they test two
distinct nulls, “change in employment in New Jersey = zero” and (in a separate
test) “change in employment in Pennsylvania = zero.” In other words, they
compute t-tests for each state, examining average full-time equivalent
employment before and after the increase in the minimum wage. But they do
not test the (relevant) difference-in-difference null of zero. Card and Krueger
report on page 776 a point estimate suggesting employment in New Jersey
increased by “0.6” of a worker per firm (from 20.4 to 21; rather than falling as
enemies of the minimum wage would have expected). Then they report a
second point estimate suggesting that employment in Pennsylvania fell by 2.1
workers per firm (from 23.3 to 21.2). "Despite the increase in wages,” they
conclude from the estimates, “full-time equivalent employment increased in New
Jersey relative to Pennsylvania. Whereas New Jersey stores were initially
smaller, employment gains in New Jersey coupled with losses in Pennsylvania
led to a small and statistically insignificant interstate difference in wave 2” (776;
their emphasis). The errors are multiple: Card and Krueger run the wrong test
(testing the wrong null, by the way, was less common in the AER during the
1980s [Table 1, Question 4]); they "reject" a null of zero change in employment in
New Jersey, having found an average difference, estimated noisily at t = 0.2, of
0.6 workers per firm; they do not discuss the power of their tests, though the
Pennsylvania sample is larger by a factor of 5; they practice asterisk
econometrics (with a “small and statistically insignificant interstate difference”);
and yet they emphasize acceptance of their favored alternative, with italics.
Further attempts to measure with multiple regression analysis the size of the
employment effect, the price effect, and the output effect, though technically
improved, are not argued in terms of economic significance. That's the main problem.
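The null the argument calls for is a single test of the difference-in-difference, not two separate state-by-state tests. The sketch below uses the point estimates quoted above (an increase of 0.6 workers per store in New Jersey, a fall of 2.1 in Pennsylvania); the standard errors are hypothetical placeholders, not Card and Krueger's.

# A minimal sketch of the relevant null: H0: difference-in-difference = 0.
# The point estimates (+0.6 workers per store in NJ, -2.1 in PA) come from the
# text; the standard errors below are hypothetical placeholders.
from math import sqrt

def diff_in_diff(d_nj: float, se_nj: float, d_pa: float, se_pa: float):
    # Difference-in-difference estimate and its standard error, treating the
    # two state samples as independent.
    did = d_nj - d_pa
    se = sqrt(se_nj ** 2 + se_pa ** 2)
    return did, se

did, se = diff_in_diff(d_nj=0.6, se_nj=1.1, d_pa=-2.1, se_pa=1.3)
print(f"diff-in-diff = {did:+.1f} workers per store, se = {se:.1f}, t = {did/se:.1f}")
# One test of the one relevant null, rather than separate t-tests of
# "change in NJ = 0" and "change in PA = 0".  And whatever the t-statistic,
# the economic question remains whether 2.7 workers per store is a large effect.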
The cost of following the wrong decision rule is especially clear in "An
Empirical Analysis of Cigarette Addiction" by Gary Becker, Michael Grossman,
and Kevin Murphy (June 1994; you can see that we are anxious not to be accused
of making our lives easy by picking on the less eminent economic scientists).
Sign econometrics and asterisk econometrics decide nearly everything in the
paper, but most importantly the “existence” of addiction.
Our estimation strategy is to begin with the myopic model. We
then test the myopic model by testing whether future prices are
significant predictors of current consumption as they would be in
the rational-addictive model, but not under the myopic model (p.
403). . . . According to the parameter estimates of the myopic
model presented in Table 2, cigarette smoking is inversely related
to current price and positively related to income.
And then: “The highly significant effects of the smuggling variables (ldtax,
sdimp, and sdexp) indicate the importance of interstate smuggling of cigarettes.”
But as Kruskal put it, echoing Neyman and Pearson from 1933, “The
adverb ‘statistically’ is often omitted, and this is unfortunate, since statistical
significance of a sample bears no necessary relationship to possible subject-
matter significance of whatever true departure from the null hypothesis might
obtain” (Kruskal 1968a, p. 240). At N = about 1,400 with high power they can
reject a nearby alternative to the null---an alternative different, but trivially different, from the null (at high sample sizes, after all s/√N approaches zero: all
hypotheses are rejected, and in mathematical fact, without having to look at the
data, you know they will be rejected at any pre-assigned level of significance).
Yet they conclude that “the positive and significant past-consumption coefficient
is consistent with the hypothesis that cigarette smoking is an addictive behavior”
(p. 404). It's sign econometrics, with policy implications. When sign
econometrics meets asterisk econometrics the mystification redoubles:
When the one-period lead of price is added to the 2SLS models in
Table 2, its coefficient is negative and significant at all conventional
levels. The absolute t ratio associated with the coefficient of this
variable is 5.06 in model (i), 5.54 in model (ii), and 6.45 in model
(iii). These results suggest that decisions about current
consumption depend on future price. They are inconsistent with a
myopic model of addiction, but consistent with a rational model of
this behavior in which a reduction in expected future price raises
expected future consumption, which in turn raises current
consumption. While the tests soundly reject the myopic model,
Eventually they report (though never interpret) the estimated magnitudes of the
price elasticities of demand for cigarettes. But their way of finding the elasticities
is erroneous. Cigarette smoking may be addictive. But Becker, Grossman, and
Murphy have not shown why, or how much. (They are, incidentally, inferring
individual behavior from state-wide data; sociologists call this the ecological
fallacy.) Perhaps what they have shown is that statistics play multiple roles:
There are some other roles that activities called “statistical” may,
unfortunately, play. Two such misguided roles are (1) to sanctify
or provide seals of approval (one hears, for example, of thesis
advisors or journal editors who insist on certain formal statistical
procedures, whether or not they are appropriate); (2) to impress,
obfuscate, or mystify (for example, some social science research
papers contain masses of undigested formulas [or tests of
significance] that serve no purpose except that of indicating what a . . . .
Table 5 shows what happens if statistical significance is the only criterion
of importance at first use. In a large number of cases, if only statistical
significance is said to be of importance at its first use, then statistical significance
tends to decide the entire empirical argument. Of the 137 full length papers in
the 1990s, 80 papers made both mistakes (Question 7=0 and Question 16=0). To
put it differently, of the 87 papers using only statistical significance as a criterion
of importance at first use, fully 80 considered statistical significance the last
word. Cross tabulations on the 1980s data reveal a similar though slightly better pattern.
We Are Not Original
We are not the first social scientists to make the distinction between
economic and statistical significance. One of us has been making the point since
1985 (McCloskey 1985a, 1985b, 1992, 1995), but she learned it from a long, long
line of distinguished, if lonely, protesters of the arbitrary procedures laid down
in the 1920s by the blessed Fisher. We have pointed out before that in the 1930s
Neyman and Pearson and then especially Abraham Wald had distinguished
sharply between practical and statistical significance (McCloskey and Ziliak
1996, pp. 97-98; McCloskey 1985a). But Wald died young, and Neyman and
Pearson carried the day only at the level of high-brow statistical theory (and
Fisher we have just noted failed to measure or mention the matters of
substantive significance that occupied Wald and Neyman and Pearson [Fisher
1925 (1941), pp. 42, 123-124, 138-140, 326-329]). Statistical practice on the ground
stayed with a predetermined level of 5% significance (mainly), regardless of the
loss function, misleading even the Supreme Court of the United States.
Yet some simple souls got it right. Educators have written about the
difference between substantive and statistical significance early and late (Tyler
1931; Shulman 1970; Carver 1978). Psychologists have known about the
difference for nearly a century, though most of them continue like economists to
ignore it (a committee of the American Psychological Association was recently
charged to re-open the question). In 1919 an eminent experimental psychologist,
the alarmingly named Edwin Boring, published an article unmasking the
confusion between arbitrarily-judged-statistical significance and practical
significance (Boring 1919). And empirical sociology would be less easy for
economists to sneer at if more realized that a good many sociologists grasped
the elementary statistical point decades before even a handful of the economists did.
Of late the protest has grown a little louder, but is still scattered (we
detailed in the 1996 paper the evidence that almost all econometrics textbooks
teach the students to ignore substantive significance in favor of testing without a
loss function and without substantive judgments of the size of coefficients).
James Berger and Robert Wolpert in 1988, though making a slightly different
point (the Bayesian one that Jeffreys and Zellner emphasize), noted the large
number of theoretical statisticians engaging in “discussions of important
practical issues such as `real world’ versus `statistical’ significance”: Edwards,
Lindman, and Savage (1963), I. J. Good (1981), and the like. What we find bizarre
is that in the mainstream statistical literature this “important” point is hardly
mentioned (we found in our 1996 article, though, some honorable exceptions,
such as the first edition of the elementary text by Freedman, Pisani, and Purves
[1978; we note with alarm that later editions have soft-pedaled the issue]).
Among economists the roll of honor is likewise short but distinguished. J. M.
Keynes (virtually), Arnold Zellner, Arthur Goldberger, A. C. Darnell, Clive
Granger, Edward Leamer, Milton Friedman, Robert Solow, Kenneth Arrow, Zvi
Griliches, Glen Cain, Gordon Tullock, Gary Solon, Daniel Hamermesh, Thomas
Mayer, David Colander, Jan Magnus, and Hugo Keuzenkamp are not dunces
and they haven’t minced words (Cain and Watts 1970, pp. 229, 231-231;
Keuzenkamp and Magnus 1995; McCloskey and Ziliak 1996, p. 99 and
numerous other references on pp. 112-14; McCloskey’s citations in her works
cited; Darnell’s comprehensive review of 1997; Hamermesh 1999; Colander 2000;
Keuzenkamp 2000, p. 266; and so forth). Recently, to pick one among the small,
bright stream of revisions of standard practice that appear in our mailboxes,
Clinton Greene (2003) has applied the argument to time-series econometrics,
showing that tests of cointegration based on arbitrary levels of significance miss
the economic point: they are neither necessary nor sufficient.
We are sometimes told that “You’re rehashing issues decided in the
1950s” or “Sure, sure: but the hot new issue is [such and such new form of
specification error, say]” or “I have a metaphysical argument for why a universe
should be viewed like a sample.” When we are able to get such people-in-a-
hurry to slow down and listen to what we are saying (which is not often), we
discover that in fact they do not grasp our main point, and their own practice
shows that they do not. It is dangerous, for example, to mention Bayes in this
connection, because the reflexive reply of most econometrically minded folk is
to say “1950s” and have done with it. Our point is not Bayesian (although we
honor the Bayesians such as Leamer and Zellner who have made similar---and
also some different---criticisms of econometric practice). Our (idiotically simple)
point has nothing to do with Bayes’ Theorem: it applies to the most virginal of frequentist procedures.
Our experience is that in the rare cases when people do grasp our point---
that fit and importance are not the same---they are appalled. They realize that
almost everything that has been done in econometrics since the beginning needs
to be redone. The wrong variables have been included, for example (which is to
say errors in specification have vitiated the conclusions); mistaken policies have been recommended.
We believe we have shown from our evidence in the American Economic Review over the last two decades what scientists from Edgeworth to Goldberger
have been saying: science is about magnitudes. Seldom is the magnitude of the
sampling error the chief scientific issue. (A sympathetic reader might reply it's
not the size that counts; it's what you do with it. But that too is mistaken. As
Friedman’s alloy regression and hundreds of other statistical experiments reveal,
what matters is size and what you do with it. Scientific judgment, like any
judgment, is about loss functions---what R. A. Fisher was most persistent in denying.)
What Should Economists Do?
We should act more like the Gary Solons and the Claudia Goldins. We
should be economic scientists, not machines of the walking dead recording 5%
levels of significance. In his acceptance speech for the Nobel Prize, Bob Solow said:
[Economists] should try very hard to be scientific with a small s.
By that I mean only that we should think logically and respect fact.
. . . Now I want to say something about fact. The austere view is
that “facts” are just time series of prices and quantities. The rest is
all hypothesis testing. I have seen a lot of those tests. They are
almost never convincing, primarily because one senses that they
have very low power against lots of alternatives. There are too
many ways to explain a bunch of time series. And sure enough,
the next journal will contain an article containing slightly different
functional forms, slightly different models. My hunch is that we
can make progress only by enlarging the class of eligible facts to
include, say, the opinions and casual generalizations of experts
and market participants, attitudinal surveys, institutional
regularities, even our own judgments of plausibility (Solow 1988).
Solow recommends we “try very hard to be scientific with a small s”; the authors
we have surveyed in the AER, by contrast, are trying very hard to be scientific
with a small t. As Solow says, it’s almost never convincing.
What to do? One of us was advised to remove the 1996 article from his
CV while job hunting—it wasn’t “serious” research. Shut up and follow R. A.
Fisher. The other served fleetingly on the editorial board of the AER. Each time
she saw the emperor had no clothes of oomph she said so (by the way, in the
original Danish of the story the child is not identified as to gender: we think it
was probably a little girl.) The behavior did not endear her to the editors. After
a while she and they decided amiably to part company.
The situation was strange: economic scientists, for example those who
submit and publish papers in the AER, or serve on hiring committees, routinely
violate elementary standards of statistical cogency. And yet it is the messengers
who are to be taken out and shot. This should stop. We should revise
publication standards, and cease shooting messengers who bring the old news
that fit is not the same thing as importance. If the AER were to test papers for
cogency, and refused to publish papers that used fit irrelevantly as a standard of
oomph, economics would in a few years be transformed into a field with
empirical standards. At present (we can say until someone starts claiming that in the 2000s practice has improved), we have shown, it has none. Ask: “Is the paper
mainly about showing and measuring economic significance?” If not, the editor
and referees should reject it. It will not reach correct scientific results. Its
findings will be biased by misspecification and mistaken as to oomph.
(Requiring referees to complete a 19-item questionnaire would probably go
against the libertarian grain of the field; a short form would do: “Does the paper
focus on the size of the effect it is trying to measure, or does it instead recur to
irrelevant tests of the coefficient’s statistical significance?”) To do otherwise---
continuing to decorate our papers with stars and signs while failing to interpret
size---is to discard our best unbiased estimators, and to renege on the promise of
empirical economics: measurement. No size, we should say, no significance.
American Economic Review. Jan. 1980 to Dec. 1999. The 319 full-length papers
using tests of statistical significance. May Supplement excluded.
Angrist, Joshua D. 1995. “The Economic Returns to Schooling in the West Bank and
Gaza Strip.” American Economic Review 85(5), pp. 1065-1086.
Baird, Davis. 1988. “Significance tests, history and logic.” Pp. 466-71, in S. Kotz
and N.L. Johnson, eds., Encyclopedia of Statistical Sciences 8. New York: John Wiley & Sons.
Becker, Gary S., Michael Grossman, and Kevin M. Murphy. 1994. “An Empirical
Analysis of Cigarette Addiction.” American Economic Review 84(3), pp. 396-418.
Berger, James O., and Robert L. Wolpert. 1988. The Likelihood Principle, 2nd ed.
Hayward, CA: Institute of Mathematical Statistics.
Bernanke, Ben S. and Alan S. Blinder. 1992. “The Federal Funds Rate and the Channels
of Monetary Transmission.” American Economic Review 82(4), pp. 901-921.
Bernheim, B. Douglas and Adam Wantz. 1995. “A Tax-Based Test of the Dividend
Signaling Hypothesis.” American Economic Review 85(3), pp. 532-551.
Boring, Edwin G. 1919. “Mathematical versus Scientific Significance.” Psychological Bulletin 16, pp. 335-338.
Cain, Glen G. and Harold W. Watts. 1970. “Problems in Making Policy Inferences from
the Coleman Report.” American Sociological Review, pp. 228-242.
Card, David and Alan B. Krueger. 1994a. “Minimum Wages and Employment: A Case
Study of the Fast-Food Industry in New Jersey and Pennsylvania.” American Economic Review 84(4), pp. 772-793.
Card, David and Alan B. Krueger. 1994b. Myth and Measurement: the New Economics of the Minimum Wage. Princeton: Princeton University Press.
Carver, Ronald P. 1978. “The Case Against Statistical Significance Testing.” Harvard Educational Review 48(3), pp. 378-398.
Colander, David. 2000. “New Millenium Economics: How Did It Get This Way, and
What Way Is It?” Journal of Economic Perspectives 14(1), pp. 121-132.
Darnell, A. C. 1997. “Imprecise Tests and Imprecise Hypotheses.” Scottish Journal of Political Economy 44 (3), pp. 247-268.
Edgeworth, Francis Y. 1885. “Methods of Statistics.” Jubilee Volume of the Statistical Society,
pp. 181-217. Royal Statistical Society of Britain, June 22-24.
Edwards, W., H. Lindman, and L. J. Savage. 1963. "Bayesian Statistical Inference for
Psychological Research." Psychological Review 70: 193-242.
Fisher, Ronald A. 1925 [1941]. Statistical Methods for Research Workers. New York: G. E. Stechert.
Friedman, Milton. 1985. Pp. 77-92, in W. Breit and R.W. Spencer, eds., Lives of the Laureates. Cambridge: MIT Press, 1990. Selection reprinted: M. Friedman and A.
J. Schwartz. 1991. “Alternative Approaches to Analyzing Data.” American Economic Review 81(1), pp. 39-49.
Good, I. J. 1981. "Some Logic and History of Hypothesis Testing." In J.C. Pitt, ed.,
Philosophy in Economics. Dordrecht, The Netherlands: Reidel.
Greene, Clinton A. 2003. “Towards Economic Measures of Cointegreation and Non-
Cointegration.” Unpublished paper, Department of Economics, University of
Missouri, St. Louis. April. Email: clinton_greene@umsl.edu
Hamermesh, Daniel S. 1999. “The Art of Labormetrics.” Cambridge, MA:
National Bureau of Economic Research, Inc.
Hendricks, Kenneth and Robert H. Porter. 1996. "The Timing and Incidence of
Exploratory Drilling on Offshore Wildcat Tracts.” American Economic Review 86(3).
Kachelmeier, Steven J. and Mohamed Shehata. 1992. “Examining Risk Preferences
Under High Monetary Incentives: Experimental Evidence from the People’s Republic
of China.” American Economic Review 82(5), pp. 1120-1141.
Keuzenkamp, Hugo A. 2000. Probability, Econometrics and Truth. Cambridge: Cambridge University Press.
Keuzenkamp, Hugo A. and Jan Magnus. 1995. “On Tests and Significance in
Econometrics.” Journal of Econometrics 67, pp. 103-28.
Kruskal, William S. 1968a. "Tests of Statistical Significance." Pp. 238-250, in David Sills,
ed., International Encyclopedia of the Social Sciences 14. New York: MacMillan.
Kruskal, William S. 2002. Personal interview. August 16, 2002, University of Chicago.
Kruskal, William S. 1968b. “Statistics: The Field." Pp. 206-224, in David Sills, ed.,
International Encyclopedia of the Social Sciences 15. New York: MacMillan.
McCloskey, Deirdre, and Stephen Ziliak. 1996. “The Standard Error of Regressions.”
Journal of Economic Literature, Mar 1996: pp. 97-114.
McCloskey, Deirdre. 1985a. “The Loss Function Has Been Mislaid: The Rhetoric of
Significance Tests.” American Economic Review, Supplement 75 (2, May): 201-205.
McCloskey, Deirdre. 1985b. The Rhetoric of Economics. Madison: University of Wisconsin Press. Especially chapters 8 and 9.
McCloskey, Deirdre. 1992. “The Bankruptcy of Statistical Significance,” Eastern Economic Journal 18 (Summer 1992): 359-361.
McCloskey, Deirdre. 1995. “The Insignificance of Statistical Significance,” Scientific American.
Morgenstern, Oskar. 1950 [1990]. On the Accuracy of Economic Observations. Princeton: Princeton University Press.
Morrison, Denton E. and Ramon E. Henkel. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine.
Shulman, L.S. 1970. “Reconstruction of Educational Research.” Review of Educational Research.
Solon, Gary. 1992. “Intergenerational Income Mobility in the United States.” American Economic Review 82(3), pp. 393-408.
Solow, Robert. 1988. In W. Breit and R.W. Spencer, eds., Lives of the Laureates. Cambridge: MIT Press.
Twain, Mark. 1883 [1946]. Life on the Mississippi. New York: Bantam.
Tyler, R.W. 1931. “What is Statistical Significance?” Educational Research Bulletin 10.
Zellner, Arnold. 1984. Basic Issues in Econometrics. Chicago: University of Chicago Press.
Zimmerman, David J. 1992. “Regression Toward Mediocrity in Economic Stature.”
American Economic Review 82 (3), pp. 409-429.
Table 1
The American Economic Review Had Numerous Errors
In the Use of Statistical Significance, 1980-1999

Does the paper . . .
8. Consider the power of the test?
6. Eschew reporting all standard errors, t-, p-, and F-statistics, when such information is irrelevant?
11. Eschew “sign econometrics,” remarking on the sign but not the size of the coefficients?
14. Avoid choosing variables for inclusion solely on the basis of statistical significance?
15. Use other criteria of importance besides statistical significance after the crescendo?
10. Eschew “asterisk econometrics,” the ranking of coefficients according to the absolute value of the test statistic?
17. Do a simulation to determine whether the estimated coefficients are reasonable?
7. At its first use, consider statistical significance to be one among other criteria of importance?
19. Avoid using the word “significance” in ambiguous ways?
18. In the conclusions, distinguish between statistical and economic significance?
13. Discuss the scientific conversation within which a coefficient would be judged large or small?
2. Report descriptive statistics for regression variables of interest? (66.4 percent yes in the 1990s)
1. Use a small number of observations, such that statistically significant differences are not found merely by choosing a very large sample?
12. Discuss the size of the coefficients?
5. Carefully interpret the theoretical meaning of the coefficients? For example, does it pay attention to the details of the units of measurement, and to the limitations of the data? (81.0 percent yes in the 1990s)
4. Test the null hypotheses that the authors themselves say are of interest?
3. Report coefficients in elasticities, or in some other useful form that addresses the problem at hand?
Source: All the full-length papers using tests of statistical significance and
published in the American Economic Review in the 1980s (N=182) and
1990s (N=137). Table 1 in McCloskey and Ziliak (1996) reports a small
number of papers for which some questions in the survey do not apply.
Notes: a Of the papers that mention the power of a test, this is the fraction
that examined the power function or otherwise corrected for power.
Table 2
The Economic Significance of the American Economic Review
(Measured by Net Percentage Difference, 1980-1999)

Does the paper . . .
5. Carefully interpret the theoretical meaning of the coefficients? For example, does it pay attention to the details of the units of measurement, and to the limitations of the data? (81.0 percent yes in the 1990s)
2. Report descriptive statistics for regression variables of interest? (66.4 percent yes in the 1990s)
13. Discuss the scientific conversation within which a coefficient would be judged large or small?
18. In the conclusions, distinguish between statistical and economic significance?
17. Do a simulation to determine whether the estimated coefficients are reasonable?
Table 3
. . . But the Essential Confusion of Statistical and Economic Significance
(Measured by Net Percentage Difference, 1980-1999)

Does the paper . . .
14. Avoid choosing variables for inclusion solely on the basis of statistical significance?
10. Eschew “asterisk econometrics,” the ranking of coefficients according to the absolute value of the test statistic?
11. Eschew “sign econometrics,” remarking on the sign but not the size of the coefficients?
1. Use a small number of observations, such that statistically significant differences are not found merely by choosing a very large sample?
4. Test the null hypotheses that the authors themselves say are of interest?
15. Use other criteria of importance besides statistical significance after the crescendo?
16. Consider more than statistical significance decisive in an empirical argument?
7. At its first use, consider statistical significance to be one among other criteria of importance?
Source: All the full-length papers using tests of statistical significance and
published in the American Economic Review in the 1980s (N=182) and
1990s (N=137). Table 1 in McCloskey and Ziliak (1996) reports a small
number of papers for which some questions in the survey do not apply.
Notes: a Of the papers that mention the power of a test, this is the fraction
that examined the power function or otherwise corrected for power.
Table 4
Author Rankings by Economic Significance
(in the 19-Question Survey of the 1990s)
[Year and Month of Publication in Brackets]
Forsythe, Nelson, Neumann, and Wright [9212]
Munnell, Tootell, Browne, and McEneany [9603]
Cukierman, Edwards, and Tabellini [9206]
Table 5
If Only Statistical Significance Is Said To Be Of Importance At Its First Use (Question 7),
Then Statistical Significance Tends To Decide The Entire Argument
Does not consider the test decisive (Question 16)
Notes: `0’ means “no, did the wrong thing;” `1’ means “yes, did the right thing.” In the
1980s data, when Question 7=0 and Question 16=0 the first row becomes 86-10-96
[McCloskey and Ziliak 1996, Table 1 and Table 5]; in other words, practice was in this
additional sense somewhat better in the 1980s.