“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019)–hereafter, ASA II(note)–opening a special online issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein.[2] As the ASA II(note) authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reform are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater,” is here.)

So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.

The Invitation to Broader Consideration and Debate

The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. (ASA II(note), p. 1)

The questions around reform need consideration and debate. (p. 9)

Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on ASA II(note) (~1000 words) if you send them to me.

My focus here is just on the intended positions of the ASA, not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)

♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

♦ It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p.2)

♦ “Statistically significant”– don’t say it and don’t use it. (p. 2)

(Wow!)

I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:

In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.

Since ASA II(note) will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The probability of observing differences even larger than the one observed, under the assumption of chance variability alone, is p.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.
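To fix ideas, here is a minimal sketch (my own illustration, not anything prescribed in ASA I or II) of just such a report: the attained P-value for an observed difference between two groups, given alongside a confidence interval for that difference. The data and numbers are invented, and the normal-approximation CI is only one of several standard choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=0.5, scale=1.0, size=50)   # invented data
control = rng.normal(loc=0.0, scale=1.0, size=50)   # invented data

# Attained (two-sided) P-value: the probability, under the null of no difference
# in means, of a test statistic at least as extreme as the one observed.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# Approximate 95% confidence interval for the difference in means
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"observed difference = {diff:.3f}")
print(f"attained P-value = {p_value:.4f}")
print(f"95% CI for the difference = ({ci_low:.3f}, {ci_high:.3f})")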

What’s the Relationship Between ASA I and ASA II(note)?

I assume, for this post, that ASA II(note) is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us try to take them at their word.

But right away we are struck by a conflict with Principle 1 of ASA I–which happens to be the only positive principle given. (See Note 5 for the six Principles of ASA I.)

Principle 1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. (ASA I, p. 131)

However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.

So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from ASA II(note)):

(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, ASA II(note)).

Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.

My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.

So, my first recommendation is:

Replace (1) with:

“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”

Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[4]
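As a hedged illustration of that last sentence (my own sketch, not ASA wording, with invented numbers): a statistically significant difference can be used to infer a lower bound on the population discrepancy indicated by the data, here via a one-sided confidence bound under a normal approximation.

from scipy import stats

observed_diff = 0.40   # invented observed mean difference
se = 0.15              # invented standard error of that difference

# Attained P-value for H0: delta = 0 against H1: delta > 0
z = observed_diff / se
p_value = 1 - stats.norm.cdf(z)

# The discrepancy delta_0 such that "delta > delta_0" is indicated at the 0.95 level
# (equivalently, a one-sided 95% lower confidence bound for delta)
lower_bound = observed_diff - stats.norm.ppf(0.95) * se

print(f"z = {z:.2f}, attained P-value = {p_value:.4f}")
print(f"the data indicate delta > {lower_bound:.2f} (0.95 lower bound)")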

My second friendly amendment concerns the second bulleted item:

(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

Focus just on “presence”. From this assertion it would seem to follow that no P-value[5], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. Again, we get a conflict with Principle 1 from ASA I. But I’m guessing, for now, the authors do not intend to say this. If you don’t mean it, don’t say it.

So, my second recommendation is to replace (2) with:

“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”

Without this friendly amendment, ASA II(note) is at loggerheads with ASA I, and they should not be advocating those 6 principles without changing either or both. Without this or a similar modification, moreover, any other statistical quantity or evidential measure is likewise unable to reveal these things. Or so many would argue. These modest revisions might prevent some readers from stopping after the first few pages, and that would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.

This leads to my third bulleted item from ASA II(note):

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information, and theories in planning and reaching conclusions, decisions, and proposed solutions to problems. I’m totally on board with the importance of backgrounds, and multiple steps relating data to scientific claims and problems. Here’s what I say in SIST:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II(note) guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action. It’s intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is presupposed by the very important Principle 4 of ASA I (see note [3]).

As lawyer Nathan Schachtman observes, in a recent conversation on ASA II(note):

By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference.) Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . .  No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)

So, my third recommendation is to replace (3) with (something like):

“failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

There’s much else that bears critical analysis and debate in ASA II(note); I’ll come back to it. I hope to hear from the authors of ASA II(note) about my very slight, constructive amendments (to avoid a conflict with Principle 1).

Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that P-value thresholds are arbitrary, and that whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)[6]

Please share your thoughts and any errors in the comments. I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this. Version (ii) of this post begins a list:

Nathan Schachtman (2019): Has the ASA Gone Post-Modern?

Cook et al. (2019), There is Still a Place for Significance Testing in Clinical Trials

NEJM Manuscript & Statistical Guidelines 2019

Harrington, New Guidelines for Statistical Reporting in the Journal, NEJM 2019

 

References:

Gelman, A. (2012) “Ethics and the Statistical Use of Prior Information”. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics5.pdf

Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Schachtman, N.  (2019).  (private communication)

Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)

Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19. (ASA II(note))

NOTES
[1]  I gave an invited paper at the conference (“A World Beyond…”) out of which the idea for this volume grew. I was in a session with a few other exiles to describe the contexts where statistical significance tests are of value. I was too involved in completing my book to write up my paper for this volume, nor did the others in our small group write up theirs. Links are here to: my slides and Yoav Benjamini’s slides. I did post notes to journalists on the Amrhein article here.

 

[2] Excerpts and mementos from SIST are here.

 

[3]  Principle 4 of ASA I asserts that “proper inference requires full reporting and transparency”: P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. … Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. (pp. 131-2)

 

[4] Consider, for example, a two-sided (symmetric) 95% confidence interval estimate of a Normal mean: [a, b]. This information can also be given in terms of observed significance levels.
  • CI-lower is the (parameter) value that the data x are just statistically significantly greater than, at the 0.025 level.
  • CI-upper is the (parameter) value that the data x are just statistically significantly smaller than, at the 0.025 level.
There’s a clear duality between statistical significance tests and confidence intervals. (The CI contains those parameter values that would not be rejected at the corresponding significance level, were they the hypotheses under test.) CIs were developed by the same man who co-developed Neyman-Pearson (N-P) tests in the same years (~1930): Jerzy Neyman. There are other ways to get indicated effect sizes such as with (attained) power analysis and the P-value distribution over different values of the parameter. The goal of assessing how severely tested a claim is serves to direct this analysis (Mayo 2018). However, the mathematical computations are well-known (see Fraser’s article in the collection), and continue to be extended in work on Confidence Distributions. See this blog or SIST for references.
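A numerical sketch of the duality just described (my own illustration; the observed mean, known standard deviation, and sample size are invented):

import numpy as np
from scipy import stats

xbar, sigma, n = 1.2, 1.0, 25          # invented: observed mean, known sd, sample size
se = sigma / np.sqrt(n)

# Two-sided 95% confidence interval for the Normal mean
a, b = xbar - 1.96 * se, xbar + 1.96 * se

# One-sided P-values at the endpoints:
# testing mu = a against mu > a, and mu = b against mu < b
p_at_lower = 1 - stats.norm.cdf((xbar - a) / se)   # ~0.025
p_at_upper = stats.norm.cdf((xbar - b) / se)       # ~0.025

print(f"95% CI = [{a:.3f}, {b:.3f}]")
print(f"P-value at CI-lower = {p_at_lower:.3f}, P-value at CI-upper = {p_at_upper:.3f}")

Each endpoint returns (approximately) 0.025, as the duality requires.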
      However, confidence intervals as currently used in reform movements inherit many of the weaknesses of N-P tests: they are dichotomous (inside/outside), adhere to a single confidence level, and are justified merely with a long-run performance (or coverage) rationale. By considering the P-values associated with different hypotheses (corresponding to parameter values in the interval), one can scotch all of these weaknesses. See Souvenir Z, Farewell Keepsake, from SIST.
      It is often claimed that anything tests can do CIs do better (sung to the tune of “Annie Get Your Gun”). Not so. (See SIST p. 356). It is odd and ironic that psychologists urging us to use CIs depict statistical tests as exclusively of the artificial “simple” Fisherian variety, with a “nil” null and no explicit alternative, given how Paul Meehl chastised this tendency donkey’s years ago, and given that Jacob Cohen advanced power analysis.
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)
See SIST, p. 323. For links to all of excursion 5 on power, see this post.
      Of course, the beauty of the simple Fisherian test shows itself when there is no explicit alternative, as when testing assumptions of models–models that all the alternative statistical methods on offer also employ. ASA I also limits itself to the simple Fisherian test: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power…” (p. 130)

 

[5] I assume they intend to make claims about valid P-values, not those that are discredited by failing “audits” due either to violated assumptions, or to multiple testing and other selection effects given in Principle 4 of ASA I. The largely unexceptional six principles of ASA I (2016) are:
    • P-values can indicate how incompatible the data are with a specified statistical model.
    • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
    • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
    • Proper inference requires full reporting and transparency.
    • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
    • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
[6] Just because P-values form a continuum, it doesn’t follow that we can’t use very high and very low P-values to distinguish rather lousy from fairly well indicated discrepancies. Beware the “Fallacy of the Continuum”. Would anyone use a confidence level of 0.5 or 0.6?
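A quick sketch of the point (my own, with invented numbers): moving along the continuum of P-values, the extremes license very different claims about indicated discrepancies.

from scipy import stats

se = 0.10   # invented standard error of an observed difference

for observed_diff in (0.45, 0.05):          # one extreme result, one middling result
    z = observed_diff / se
    p = 1 - stats.norm.cdf(z)               # test of delta = 0 against delta > 0
    lower_95 = observed_diff - 1.645 * se   # one-sided 0.95 lower bound for delta
    print(f"diff = {observed_diff:.2f}: P = {p:.5f}, 0.95 lower bound for delta = {lower_95:.3f}")

The first result indicates a positive discrepancy at the 0.95 level; the second does not (its lower bound is negative), even though both P-values live on the same continuum.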


97 thoughts on ““The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)(ii)”

  1. Charles R. Twardy

    Thanks for your public service on this one – clarifying terms may not be exciting, but it’s necessary, and it looks to me as if you’ve done a good job making minimal changes that obey the principle of charity and accomplish the probably-desired task.

    I’m less sanguine than Schachtman about phase III clinical trials. Perhaps I’m over-reacting to what we’ve learned about preclinical cancer trials (Begley 2012 etc.)?

    • Thank you! I hope that the authors of ASA II concur on making these slight changes to the document as soon as possible.

  2. Michael J Lew

    The clarity of the ASA statements would be significantly improved by your additions. (Yep, significantly.)

    • Thank you so much Michael. Maybe you can help to convince Wasserstein and the other authors.

  3. Steven McKinney

    Thank you for your further efforts Mayo. Exhausting as it is, few are as well equipped as you to deconstruct phrases and their underlying concepts and put them back together properly. Such is the mantle of the philosopher.

    Perhaps ASA III will live up to the adage “Third time’s a charm”. They’ll stand a chance if they pay attention to your good guidance.

    I suspect it will take a while, several years if past experience holds up, for enough people to learn this concept “The inference is qualified by using error probabilities to determine not ‘how probable,’ but rather, ‘how well-probed’ claims are, and what has been poorly probed. (SIST, p. 162)” which of course involves consideration of severity in appropriate circumstances.

    Naturally proper use of statistics in a disciplined fashion can lead to honest discussions of how well-probed claims are, including use of the maligned p-value. Some day sensible reviews of adequately specified conditions, and whether the data at hand can well-probe said conditions will return to journals and blogs that have currently wandered out into left field, banishing this or that statistic in some misguided faddish fashion.

  4. Just a quick comment about something being almost missing here.

    > together with a so-called “null hypothesis.”

    The overemphasis (or sole emphasis) on no effect rather than a range of sensible possible effects.

    Re-emphasizing p-values as compatibility assessments of a full range of sensible possible effects mostly breaks the connection with statistical significance being about just no effect (that I think is behind many of the don’ts).

    So on to compatibility/consonance/p-values/severity? curves rather than p-values for no effect and confidence intervals.

  5. Mark Burgman

    The three amendments seem sensible, to me.

    • Thank you Mark. As the editor of Conservation Biology, you are in a very important position to promote a balanced assessment of statistical reforms.

  6. Great post, I think your recommended edits to the ASA II bullets are spot on. It seems like ASA II succumbed to a temptation to say attention-grabbing things, without being careful at the same time to say exactly the *right* things.

    I have to admit that I’m puzzled why people have trouble talking, writing, and thinking clearly about all this… I’m just naive, I guess!

    • Thank you so much!
      You know, I think there is something in what you say, and it’s very important. The claims that I set out, and I could easily add to them, come across as wanting to shock. And of course they succeeded in getting a ton of airplay by announcing such starkly dramatic “new rules”. However, by coming across that way, they weaken their message of calling for humble, sagacious, thoughtful and fair-minded statistical practice. That is one of the reasons for my recommendations (which I shared with Ron Wasserstein soon after ASA II appeared). I feel they undermine their own goals. Even these minor revisions would avoid this.

  7. Christian Hennig

    Oh dear! I think I have a good understanding of p-values and if I’m doing “private science”, i.e. analysing data just on my own to find something out (which I have fun doing sometimes), not even having in mind publication, I use p-values all the time to have some “formal assistance” with the question whether what I see is compatible with meaningless random variation, but I’d never rely on a p-value on its own for making a public substantial statement. I still think that tests and p-values are amazing ideas and understanding them properly involves much understanding of reasoning under uncertainty, probability distributions, random variation etc. There are many innovative and promising developments based on statistical testing and p-values, for example Buja et al.’s “Statistical inference for exploratory data analysis and model diagnostics”

    Click to access 06-Buja-Cook-Hofmann-Lawrence-Lee-Swayne-Wickham.pdf

    Then I also think that people got too enthusiastic about tests and p-values and celebrating them too much made them a “standard” and made many people use them without thinking or replacing thinking, as the ASA correctly states. However I believe that this will happen to whatever principle becomes so popular that some journals and institutions start to raise it to a “standard” and make publication, funding etc. dependent on it.

    Many people inside and outside science want easy answers to their questions and they will jump at whatever promises to give them. If at some point any specific and reasonably simple (in format, not in understanding) alternative to p-values becomes a celebrated standard, we’ll have the same problems with that one all over again.

    Despite the fact that many sensible thoughts and statements can be found in ASA I and II, they seem to convey (and are and will be interpreted by many in this way) that significance and p-values are the problem and replacing them by something else could be the solution, carefully though adding side remarks here and there that are probably meant to keep those of us quiet who still defend them. Thanks for your effort to balance things a bit better! (“Don’t throw out the baby with the bathwater” is just the perfect comment on what goes on here.)

    One specific thing that struck me is that not only are binary decisions needed in some situations; whatever language interpretation is given of, for example, p=0.12 will by the very nature of language involve some categorisation. And “don’t say anything more than just p=0.12” will certainly not help non-statisticians understand what we can get out of stats (let alone “the posterior probability for event XXX is 8% given prior YYY”). Thoughtfulness is all fine but ultimately at least fairly simple messages need to be given to whoever will not dive deeply into statistics, be it decision makers, journalists, doctors, social scientists whose major competence is not quantitative, whoever. So at least at some point in the communication pipeline whatever we do will be boiled down. Pointing out problems with that is very worthwhile; quick fixes there are not (and neither are the ASA’s “don’t”s a quick fix).

    • Christian: Thanks so much for your comment and the link. If p-values can serve to distinguish incompatibility from random noise, even as a first step in interpreting results, then the claims of ASA II go overboard and should be modified.
      Your remark that the ASA carefully adds “side remarks here and there that are probably meant to keep those of us quiet who still defend them” is interesting.
      More on your comment later.

      • john byrd

        Deborah,
        “If p-values can serve to distinguish incompatibility from random noise, even as a first step in interpreting results, then the claims of ASA II go overboard and should be modified.” This is so fundamental to good science across the disciplines that I am a bit shocked at reading the ASA II paper. Sure, bad science in the form of cherry-picking, draconian NHST, etc. is not good. Poor practice is not DUE to the ideas behind significance testing, and Fisher was quite careful in how he used p-values, which the authors seem to ignore. Properly used, p-values are viewed in a much larger framework of tests, observations, experimental designs, etc. I think they are often (most often?) used properly and the ASA criticisms are misplaced. Another issue is that poor scientific practice can just as easily plague likelihoods, Bayesian stats, etc. The ASA papers imply we are better off with other choices as though they are not even more problematic, in my view. I hope they will adopt your recommendations.

        • Poor practice can even more easily plague methods that adhere to principles that downplay the sampling plan post data. Readers know how much the LP is discussed on this blog (Likelihood Principle).

          “The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation.” (Edwards, Lindman, and Savage 1963, p. 193)

          And of course priors can be, and often are, data dependent. I don’t want the predicament now to be seen as frequentist-Bayes, because that’s not it. It’s the fact that we don’t see these “P-value warnings” balanced by any appraisal of (old and new) background knowledge of the slings and arrows in using other methods. The stat wars are in danger of being seen as proxy wars between competing tribe leaders, each keen to advance one or another tool or school, be it confidence intervals (CIs), Bayes factors, Bayesian posteriors and a new “diagnostic-screening” (DS) model of tests.

          ASA II advocates replication, which is good, but the replications used to determine if your field is in crisis are based on statistical significance tests. If they are so uninformative, how can the statistical crisis in science be based on them? You don’t need a “bright line” to distinguish well warranted from lousy tests, but without being able to spot the extremes, there are really no tests at all.

    • Stuart Hurlbert

      Christian,
      Re where binary decisions may be needed, as in deciding whether to approve or market a new drug: these will never be made on the basis of a single P value, or estimated effect size, or even a single analysis or experiment. So one can set goalposts or requirements for all such measures, and how they will be weighted. But using phrases like “statistically significant” is completely superfluous and indeed misleading in those contexts.

      Yes, language will require categorization, but not binary categorization. Even when we are dealing with continuous variables, we don’t restrict ourselves to “hot water” and “cold water,” or to “tall trees” and “short trees.”

      People who can only understand “simple messages” will always be a problem; it will be important to keep them out of decisionmaker positions.

      Of all the recommendations made, the simple practical one with most widespread support is for journal editors to disallow use of the phrase “statistically significant.” That is not a “quick fix” by itself, but more than any other proposal it is one that would force scientists (incl. statisticians) to think and write more clearly, including especially for non-scientists and decisionmakers.

      • Stuart:
        Unfortunately, ASA II does not restrict itself to derogating binary categories, so far as I understand it: “To be clear, the problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups, based on arbitrary p-value thresholds.”

        In fact an “arbitrary” set of categories (how many do you want 25, 100?) enables distinguishing well warranted and terribly warranted claims. See Note [6]. We should avoid the fallacy of the continuum.

        • Stuart Hurlbert

          Mayo:

          I agree with both the quoted ASA II statement and your own and see no conflict between them. So I think none of us want an expanded and formalized list of categories, e.g. “unwarranted,” “moderately warranted,” “well warranted”, “terribly warranted,” etc., OR “no clear evidence”, “weak suggestion of…,” “moderate evidence of…,” “strong evidence of…” etc.

          So long as the outputs of analysis (e.g. effect sizes, P values, confidence intervals, severity curves, etc., etc.) are clearly presented, often preferably in tables and figures when there are many of them, authors should be offered considerable flexibility in how they are verbally characterized in, e.g., the text of a Results section. The only formal restriction required is that they be disallowed from using the phrase “statistically significant”, as that will drag along a century’s worth of historical baggage that will only serve to keep readers (and editors and referees!) as confused and misinformed as those of earlier generations.

          Your footnote (below) is the only place where I find mention of the “fallacy of the continuum,” so I am not sure exactly how you define it.

          As to your implication about atypical confidence levels, I respond with this quote from Hurlbert & Lombardi (2009), p. 331:

          “Presentation of multiple confidence intervals for individual estimates does have its champions and might sometimes be useful with very simple data sets. Rozeboom (1960) suggested that reports might with “some benefit… simultaneously present several confidence intervals for the parameter being estimated.” Salsburg (1985) proposed using 50, 80, and 99 percent confidence intervals simultaneously in clinical studies. Mayo & Cox (2006) state that “ the provision of confidence intervals, in principle at a range of probability levels, gives the most productive frequentist analysis.””

          *****************

          So long as exact P values are also given the main value of confidence intervals is to caution authors and readers about how imprecise their point estimates of effect sizes may actually be. So there is no need to stick with the traditional 95% CI. An additional consideration is that, again so long as the exact P for a treatment effect in a two-treatment experiment is given and if the expt is monitored over many dates, the most informative way of presenting the results graphically may be to show how the means of the two groups change over time and to present CIs for those means, not for the difference between them. In many such situations, especially where there are multiple treatments and/or multiple response variables, 50, 75 or 80 % CIs would greatly reduce the clutter (e.g. total length of lines & number of overlaps) in the figures. But there can be no universal recipe; everything will depend on context. This flexibility, I note, is not compatible with the main recommendations of Cumming’s New Statistics which seem to envisage only 95% CIs.

          “[6] Just because P-values form a continuum, it doesn’t follow that we can’t use very high and very low P-values to distinguish rather lousy from fairly well indicated discrepancies. Beware the “Fallacy of the Continuum”. Would anyone use a confidence level of 0.5 or 0.6?”

          • try fallacy of the heap

          • So you disagree with many points in ASA II, including dropping “statistical significance level” associated with an observed difference. Did you share that view with the “don’t say” group? I have no idea how the writing of the doc was handled, unlike ASA I, which I know a little bit about.
            You say:
            This flexibility, I note, is not compatible with the main recommendations of Cumming’s New Statistics which seem to envisage only 95% CIs.

            Yes, it makes no sense at all to declare “we’re against 0.05 being a benchmark” while advocating only 0.95 as the standard. But I wouldn’t say it was “incompatible”. I can see no reason for the CI advocates’ dismissal of the equivalent testing form of CIs, except some kind of insistence that we never construe the analysis as a test. But why? Note that in Cumming’s second ‘new stat’ book, p-values return, but marked with such grave doubts as to work against the reason they were put back. You were onto something in speaking of a CI “crusade”. But that diminishes the effort, and obstructs the proper understanding of tests.

  8. Bob Cousins

    As Deborah Mayo knows, I am an experimental high energy physicist with a career-long interest in foundations of statistics. During numerous bus and plane trips, I have read with interest “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars”, as well as about half the papers so far in the Am. Stat. special issue. I hope at some point in the next year to comment at length on both. (Spoiler: I think that something like severity is implicit in much of what we do.) For now, I just say that when I read a collection of papers such as those in Am. Stat., my main reaction is that I wish that more authors knew more about our statistical inference problems and their context in high energy physics. E.g., in our core physics models (“laws” of nature), “All models are wrong” does not capture the key issues; and we have point nulls that really are points on the relevant uncertainty scales; and in our research, tiny departures from the point null can be worth Nobel prizes. And no one would think of giving a p-value without also giving at least an approximate confidence interval. That does not mean that I am satisfied with all of our methods, or with the supposed 5-sigma convention in our field. I have written an introduction for statisticians that at least one prominent Bayesian found to be a good read: “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics”, https://arxiv.org/abs/1310.3791 , published in a special issue of Synthese on the Higgs boson discovery (with an erratum). Once one understands this introduction, I hope that one can then see how the various pronouncements in AmStat or elsewhere match up (or not) with practice in my field.

    • Thanks so much Bob, for your comment and the excellent link. I quite agree about the importance of a better understanding of how statistics is used in physics and in many other areas that link statistics to theories (whether local or high level). That is why I brought in such examples in SIST. Did statistical practitioners really veer that far away from the uses of statistical analysis as piecemeal steps in inquiry, as some allege? I severely doubt it, with exceptions, of course. But why let “the tail wag the dog” as Cook et al. 2019 put it?
      Your comments on the latest ‘reforms’ will be very valuable!

  9. I am not aware of a single person who interprets p<alpha to conclude a phenomenon is true, real, etc., without first considering previous work, the experimental/survey design, how alpha was set, any multiple adjustments, practical significance, and if the experiment was repeated. And if such a person does exist, that would be a misuse or misunderstanding of the method and not a problem of the method itself.

    As time goes on, I see Mayo's points more and more:
    -that these criticisms are rehashed year after year, and
    -that no real alternatives are offered, nor the cons of said alternatives discussed in same detail (which makes it appear that frequentism/NHST, etc. have all the cons)

    I probably mention these things somewhere in these:
    http://www.statisticool.com/objectionstofrequentism.htm
    http://www.statisticool.com/retirestatisticalsignificance.htm

    Justin

    • Justin: Thanks for your comment. You bring out two really important points:

      The results of error statistical tests cannot be appraised without considering “how alpha was set, any multiple adjustments, practical significance, and if the experiment was repeated” and so on.

      Everybody should look at Principle 4 of ASA I. It is in Note 3 of this current post. The fact that P-values, and other error statistical quantities, cannot be assessed without knowing about aspects of the data and hypothesis generation shows that these are CONTEXT DEPENDENT quantities, and the methodology lets us show how you can get your error probabilities very wrong by failing to take account of data-dredging, multiple testing, stopping rules etc.
      By contrast, accounts that adhere or purport to adhere to the Likelihood Principle do not take account of selection effects, at least not without supplementing them with principles that are not now standard.

      It is a serious gap that we do not see “the cons of said alternatives discussed in same detail (which make it appear that frequentism/NHST,etc. have all the cons)”

      The main alternatives: likelihoodism, Bayes Factors, subjective and default posteriors, and maybe the “diagnostic screening” model of tests should be appraised, because we have a good deal of background information about shortcomings. The ASA recommends looking at background information, and doing so is equally important in putting forward methods.

      It is wrong to throw up our hands and declare that there can be no agreement when only one subset of one school has been scrutinized (and not even the best examples from that school).

      • “Only looking at the cons of one”. Indeed. Shiny new systems often need their own epicycles.

        Likelihood-ism: That can’t be right? If it doesn’t account for the context or stopping rule, it has the wrong likelihood. The Monty Hall puzzle depends critically on what Monty knows, not just what he reveals. Protocol matters.

    • Miodrag Lovric

      Justin, there are thousands of examples, I will give you just one, say, from the book “Statistics for the Behavioral Sciences” (2017) by Gregory J. Privitera, page 269: “When we decide to retain the null hypothesis, we conclude that an effect does not exist in the population. When we decide to reject the null hypothesis, we conclude that an effect does exist in the population”.

        Miodrag, do they conclude *only* based on having pvalue less than alpha, or by using the full picture (experiment design, replication, etc.) in addition to having pvalue less than alpha? Just as taking only a Bayes factor greater than 5 as evidence, without the full picture, would be a misuse of Bayesian methods.

        Justin

        • Justin:
          The onus on error statistical testers isn’t to show texts don’t oversimplify or that no one abuses tests. The onus is on the alternative accounts to show they have principled ways to block the sources of irreplication. That P-values are invalidated by multiple testing, post-data subgroups, and a host of biasing selection effects, is an ASSET. Justin mentions Bayes factors, but BF advocates will tell you:

          “Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100)

          I’m not sure why Justin thinks that taking a BF (maybe of 10) as evidence is regarded as a misuse of Bayesian methods.
          As I’ve said, I do think for purposes of comparisons, we might begin by assuming the quantities aren’t terribly wrong–else we cannot begin the comparison. We should then consider, given the slings and arrows causing irreplication, how the different accounts handle them. I don’t see, for example, why it matters for a BF whether the alternative is data dependent. The same P-hacked hyp can occur in a BF. Selection effects change the sampling distribution, so error statisticians have a principled way to complain. I don’t see the principle by which the BF complains, because they are not computing error probabilities of methods. They might have another way, but they need to tell us.
          Another issue is that the BF compares two hypotheses that typically don’t exhaust the space of alternatives, so H’ might be much more probable than H”, but there may be poor evidence for both. The 0 null vs all other parameter values does exhaust the space, but we get the problem Greenland and I discuss about the spiked prior, and the choice of how to smear the rest of the prior over the alternative. They too have an implicit threshold in reporting, say, that there’s evidence for the null, as they do.

          There are really only a small handful of issues, but very few are discussed explicitly in the ASA survey.

  10. Stuart Hurlbert

    Good effort, Mayo!

    But I gather you are not quite ready to jump fully into the icy waters of neoFisherianism and advocate disallowance of the phrase “statistically significant” by journal editors. You say:

    “Since ASA II will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”.”

    Not too clear if you’d actually be supportive of that language, but if so the response would be that it would contain no more information than the simpler statement, “The P value is…..” and it would be using the phrase “statistically significant” in a way conflicting with historic usage of the term.

    Later you say:

    ‘So, my third recommendation is to replace (3) with (something like):“failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.” ‘

    Here you seemingly imply that, appropriately accompanied, “declarations of statistical significance” are fine, constitute good language. Is that your intent?

    You also state:

    ‘So, my second recommendation is to replace (2) with: “No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.””

    I would say that, assuming the statistical model is correct or reasonable and the analysis is conducted appropriately, a low P value indeed is evidence of “plausibility, presence or truth … of an association or effect.” You seem to say the same at least as far as “presence” is concerned – and it’s not clear that these three concepts are all that different from each other.

    Yes, surprisingly, statisticians collectively have a long way to go before they are in a position to give good advice to the courts, regulatory agencies, and legislatures. Those are areas where very clear language is as important as the statistical procedures themselves. One can wonder whether the high and increasing percentage of US statisticians for whom English is not their first language may not be part of the problem, even though, in my experience, plenty of foreign grad students can write better in English than many American grad students.

    • By the way, am I right to suspect “you are not quite ready to jump fully into the icy waters of” (apparent) antiFisherianism and advocate disallowance of the phrase “statistical significance level p” or the like? Statistical significance was Fisher’s baby. That’s why I said last year, when you asked if I wished to sign against saying “significant” while retaining “significance”, that it seemed at odds with your Fisherian position. As you know, of course, Fisherians like Cox, still use significance level in the Fisherian way, as the attained p-value.

      • Stuart Hurlbert

        I’d say there’s a world of difference between “statistical significance” (in real data analysis contexts a usually unneeded label or synonym for “P-value”, albeit one completely compatible with our label “neoFisherian significance assessment” and Cox’s usage) and any use of “statistically signifiCANT” (which will always seem to imply an alpha or critical P lurking in the background). Maybe it’s just the schoolmarm in me!

        • So you would retain significANCE.

          • Stuart Hurlbert

            Mayo:
            Considered in isolation, “retain significance” is as ambiguous as “retire significance,” and I oppose ambiguous usage, especially as it has been at the root of so many needless statistical controversies for so many decades. So I can’t answer your question.

            In commenting on their title, I said as much to the authors of the Nature commentary before it was published but they thought it was too late to consider a change.

            I certainly agree on the utility of calculating P-values and interpreting correctly as part of most statistical analyses.

  11. Stuart: Thanks for your comment. I just have time for one point, on my second recommendation, as I am scrambling to finish a paper on “The Stat Wars: Errors & Casualties” for a conference next week in Germany.

    I appreciate your point:
    “I would say that, assuming the statistical model is correct or reasonable and the analysis is conducted appropriately, a low P value indeed is evidence of “plausibility, presence or truth … of an association or effect.” You seem to say the same at least as far as “presence” is concerned – and its not clear that these three concepts are all that different from each other.”

    Excellent! Then you concur that ASA II as stated calls for modification. (I agree as well that the 3 words don’t substantially differ). Given you were a leader in the “do not say significant” movement over a year ago, your recommending this revision is bound to be taken seriously by the authors. Will you recommend this?

    All of the statistical measures must be assumed to be licit, and not seriously violated, in order to be talking about and appraising them. I am putting to one side for this purpose the comparative ease of warranting the various model assumptions to even get the measure, be it a confidence level, likelihood ratio, Bayes Factor etc.

  12. Stuart Hurlbert

    Yes. I try not to say things I can’t defend publicly!

    Have you considered developing for a TAS article a condensed list of the principles in ASA I and in Wasserstein et al. (2019) with appropriate rewordings or deletions, getting Wasserstein, Lazar and Schirm (and possibly others) on board as co-authors on an “ASA III”?

    As I’ve discussed with others (can’t remember all), one principle in ASA I that should simply be deleted is the one that said something like a P-value of .05 is only weak evidence against the null hypothesis.

    The statement is meaningless without any explicit explanation of “weak relative to what?” Of course the origin of the statement is almost certainly the notion of many Bayesians that a more accurate measure of evidence is a Bayesian posterior based on a so-called “objective” prior of 0.5. I think the inclusion of that “principle” was a sop to Bayesians participating in ASA I unhappy about P values not being sufficiently “dissed”!

    Of course radical revision is an alternative to blunt deletion.

    • Michael J Lew

      There were indeed compromises made in the interests of acceptability to the various statistical leanings of the participants during the drafting of ASA I. The phrase “the so-called null hypothesis” is one such compromise, but I do not see it as detracting from the messages.

      The idea that data that yield a P-value close to 0.05 contain only weak evidence against the (so-called) null hypothesis does not seem to be a compromise to me, and I do not recall heated debate on the issue. The weakness of the evidence does not depend on accepting any alternative statistical yardstick of evidence, as a simple plot of such data would convince most experienced eyes that P=0.05 is pretty weak evidence.

      On the topic of strength of evidence, it might be appropriate for me to remind you that evidence has more dimensions of interest than just the strong-weak axis. Evidence points towards and against various values of model parameters and it might be specific or diffuse. Statistical evidence is always contained within a statistical model and so we have to be mindful that a different model will give a different picture of the evidence from the same data.

      • Stuart Hurlbert

        Michael,
        Good to have you back online.

        But, sorry, “weak” is a relative term and a “simple plot” and “eyes” cannot obviate the need for a clearer standard. Certainly you’d agree that P=0.05 is stronger evidence than P=0.10….. So the ball remains in your court.

        As for the notion that the principle in question reflected micro-sabotage by Bayesians in ASA 1, our long critique of Bayesians in Hurlbert & Lombardi (2009) contains plenty of support. Here’s a brief excerpt:

        “Berger & Sellke (1987) give examples where the point null H0: δ = 0 is being tested against the standard composite H1: δ ≠ 0. They apply a so-called “objective” or “non-informative” Bayesian approach where H0 and H1 are both assigned a prior probability of 0.5. Not surprisingly, data sets yielding a P value of 0.05 yield Bayesian posterior probabilities several-fold higher. That is interpreted to mean that, despite the P of 0.05, “there is at best very weak evidence against” H0. They imply that the posterior probability is the true “magnitude of the evidence against H0.”
        “Bayesian priors can yield results reflecting not just an investigator’s true beliefs but also political, financial or religious motivations. Such results could damage science or society, at least in the short run, in the hands of statistically unsophisticated decision makers. Dennis (1996, 2004) gives examples of investigators concerned about management of rare species, pollutant concentrations in rivers downstream from mining operations, and efficacy of dietary supplements, and wonders whether such investigators’ ‘prior beliefs’ about those situations might vary according to where their salary or research funds were coming from. In a rather different sphere, Unwin (2003) uses a Bayesian approach to estimate the probability of the existence of god. A good ‘objective’ Bayesian, he gives ‘exists’ and ‘does not exist’ both a prior of 0.50, then evaluates a data set consisting of six facts, and ends up with a posterior probability of 0.67 in favor of ‘exists,’ which is then upgraded to 0.95 by an additional injection of personal belief. As Dawkins (2006:132) notes in his critique of this analysis, “It sounds like a joke, but that really is how he [Unwin] proceeds… I can’t get excited about personal opinions, whether Unwin’s or mine.” ”
        “One of the major logical incongruities here is that significance tests in all disciplines are mostly used where H0 is known or strongly suspected a priori to be false. So by Bayesian logic it should be assigned a low prior probability, e.g. 0.01 or 0.10. Casella & Berger (1987) note this would result in a much lower posterior probability. Berger & Sellke (1987), referring to a hypothetical example, suggest, however, that even using a prior as low as 0.15 for H0 would constitute “blatant bias toward H1 [and]… hardly be tolerated in a Bayesian analysis.” So much for the desirability of using priors to express prior information or personal belief. The “bias” responsible for low priors and such contradictions is better labeled the wisdom of the investigator in selecting for study, independent variables that indeed do influence or are correlated with the dependent variables of interest. As Royall (1997:73) has noted, attempts to find completely objective or ‘non-informative’ priors “have been unsuccessful for a simple reason – pure ignorance cannot be represented by a probability distribution.””

        • Michael J Lew

          Yes, P=0.05 does indicate stronger evidence than P=0.1, all else equal. However, is P=0.05 from n=4 the same strength of evidence as P=0.05 from n=400? From Student’s t-test as it is from a proportions test? From an ‘exact’ test as it is from an approximation test? From one-tailed and two-tailed tests? In the primary F-test of an ANOVA and the individual contrasts? From optional stopping and fixed sampling rules?

          P-values can be used as indices of evidence, but there is a non-linear relationship between the P-value and the strength of evidence, and the relationship is not invariant across all circumstances.

          If you insist on a formalisation of statistical evidence then I would suggest that the law of likelihood is our best option, but even that is not so bullet-proof that it can be used without caveats. However, I am confident that your eyeballing of evidence in simple datasets would be similar to mine, and similar to Mayo’s. Make graphs of simple datasets that yield P=0.5, P=0.05, P=0.005, P=0.0005, and P=0.00005 and you will see for yourself that P=0.05 is pretty weak evidence when calibrated by eye.

          • The same relativity to sample size holds in other methods, but it’s wrong to spoze overall methodologies of science have a single resource (of course n = 5 won’t pass tests of assumptions). A severity interpretation automatically takes into account sample size. Probability is one thing, measures of population effect size another, and a comparative appraisal, such as a likelihood ratio, is not a test. That H’ is “more likely” than H” (in the formal sense, which differs from our informal use of the word) doesn’t tell us that there’s good evidence for either, even putting aside selection effects. All comparativisms have this limitation, but again, I reject “Unitarianism”, where it’s assumed a single method must do everything when all of science abounds in majorly different strategies of inquiry. There’s as much bias in metastatistics (driven by the desire to have one’s hypothesis about stat method “win”) as there is in science generally. We strive for protections against such bias in science, and put safeguards in place to avoid them. What about in meta-stat or stat foundations or whatever you want to call it? Where are the safeguards?

            • Christian Hennig

              “(of course n = 5 won’t pass tests of assumptions)” – chances are n=5 will pass tests of assumptions (if not with any severity – but severity calculations for misspecification tests are rare or non-existent) because all such tests will be low powered. (Assumption testing is tricky!)
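              A quick simulation of this point, under an assumed setup of my own (Shapiro-Wilk applied to clearly non-normal, exponential samples): at n = 5 the test rarely rejects, so “passing” it carries little information.

              ```python
              # Assumed setup: exponential (grossly non-normal) samples of size 5,
              # tested for normality with Shapiro-Wilk at the 0.05 level.
              import numpy as np
              from scipy.stats import shapiro

              rng = np.random.default_rng(2)
              runs = 5000
              rejections = sum(shapiro(rng.exponential(size=5)).pvalue < 0.05 for _ in range(runs))
              print(rejections / runs)   # a small fraction of runs reject: low power at n = 5
              ```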

              • Christian:
                I see your point, but failing to reject a null hypothesis of “the assumption holds” is not evidence for the assumption holding. We have to look at the violations the test is capable of detecting, and there are practically none of interest with n = 5. See the typology attempt in SIST, p. 155. I agree that much more attention should be paid to testing assumptions. Discarding tests altogether wouldn’t be a good first move.

                • Christian Hennig

                  We cannot have evidence that “the assumption holds” anyway, because they never hold. (OK we may disagree here, or rather it depends on the definition of “hold”.)

                  In any case, if you require model assumptions to be tested with some severity (if this has been done explicitly in any situation I’d be grateful for a reference), doesn’t that mean that with n=5 you can’t do anything? I’d rather think in that case we need the model assumption to “do some work”, i.e., do something model-based despite not being able to check the model from the data with any reliability, and interpret accordingly. Once more, the key is accepting that whatever we do gives quite limited information, not refusing to do anything.

          • “P=0.05 is pretty weak evidence when calibrated by eye.”

            I have no idea at all what “by eye” means rigorously, and everyone has different eyes. I think at some point we have to defer to numbers from a reliable process to make it unbiased/fair, or something.

            Justin

    • You wrote:
      “As I’ve discussed with others (can’t remember all), one principle in ASA I that should simply be deleted is the one that said something like a P-value of .05 is only weak evidence against the null hypothesis.”

      It’s part of principle 6:
      “For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis.”

      I agree with you, but of course including “by itself” should make us say we don’t have a clue how to interpret it. But it is odd that they are prepared to classify something as “weak” which would seem to spoze a non-arbitrary classification. Of course the entire argument that we need to lower P-values is based on using a Bayes Factor classification scheme that does not hesitate to pigeonhole.

      “near” is vague, but because they’re doing 2-sided tests, we could have p = 0.026–needn’t be weak.

      It actually turns out the reforms based on Bayes factors will infer with high posterior an alternative that is poorly indicated by a significance tester or confidence interval user. This is in SIST e.g., p. 263.

      • Michael J Lew

        The evidence in the data that give a two-tailed P-value of 0.05 is exactly the same as the evidence in the data that give a one-tailed P-value of 0.025: they are the same data!

        • Yes, that is my point (putting aside the fact there’s been a ‘selection effect’). A 2 SE difference needn’t be weak. Is there weak evidence that the parameter exceeds the lower bound of the 95% confidence interval? It should be weak, according to this thesis, yet no one ever says that. So if 0, say, is excluded from the 95% interval, you’d need to conclude there’s only weak evidence the population effect exceeds 0.

        • Sander Greenland

          I disagree that “The evidence in the data that give a two-tailed P-value of 0.05 is exactly the same as the evidence in the data that give a one-tailed P-value of 0.025: they are the same data!”

          First, evidence depends entirely on the model for data generation (DGM); the same data can give contradictory evidence given contradictory models. E.g., P=0 for Y independent of X provides you no evidence about independence if your DGM is that it was found from units selected only when Y=X, but can be decisive if from a random sample.

          Second (and this may just reflect a wording error in the quote): In common apps in which the 2-sided P-value is taken from an absolute value or square of a signed statistic, the information used for the 1-sided P includes the sign bit, which is discarded by the 2-sided P-value. So then the evidence against the tested hypothesis used by them is not the same, but instead is degraded by 1 bit for the 2-sided P-value, as reflected in the doubling of min(1-sided) to get the 2-sided, which imposes the needed 1-bit penalty on the Shannon-information (surprisal, logworth, S-value) scale -log2(p). This is the penalty needed to account for letting the data choose which of the two 1-sided P-values to report.
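          A small arithmetic check of that 1-bit penalty on the S-value scale, with numbers chosen by me purely for illustration:

          ```python
          # Doubling a one-sided p-value to get the two-sided report subtracts exactly
          # one bit of Shannon information (surprisal): -log2(2p) = -log2(p) - 1.
          import math

          def s_value(p):
              return -math.log2(p)

          p_one_sided = 0.025              # assumed one-sided p-value
          p_two_sided = 2 * p_one_sided    # the corresponding two-sided report
          print(s_value(p_one_sided))      # about 5.32 bits
          print(s_value(p_two_sided))      # about 4.32 bits: exactly 1 bit less
          ```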

          Note I am using “side” because this refers to the hypothesis (the side of the deviation from the point tested), whereas “tail” refers to side of the test-statistic distribution which is often 1-tail even from a 2-sided test, e.g., the chi-squared test of equality of two binomial proportions in a 2×2 table is a 1-tail test of a 2-sided hypothesis.

          • Michael J Lew

            Sander, you seem to have missed the first phrase: “[t]he evidence in the data”. A statistical model is necessary to extract or interpret the statistical evidence in the data, but the model does not affect the data itself. Thus I can stand by my statement.

            • Sander Greenland

              I still disagree and think your response displays the objectivist “mind-projection fallacy” that Edwin Jaynes decried. There is no evidence in the data, any more than there is beauty, except in the eye of the beholder, at least by the definition of “data” used in data processing, where a dataset is just a collection of alphanumeric records. “Evidence” is a relation between those data and a target as determined by a model for the way they were generated. Again the data contain no evidence about a target-population structure if they were cherry-picked to create or obliterate that structure; the point of sampling designs is to maximize information. Saying there is “evidence in the data” is a dangerous shorthand that assumes we all agree on that model and target, which in my field is rarely the case!

    • Hurlbert wrote:

      Have you considered developing for a TAS article a condensed list of the principles in ASA I and in Wasserstein et al. (2019) with appropriate rewordings or deletions, getting Wasserstein, Lazar and Schirm (and possibly others) on board as co-authors on an ‘ASA III’?

      I think it’s extremely important for the authors of ASA II to make the revisions–at least (1) and (2). There is nothing in ASA I that is seriously misleading. But these points are. It would be painless and should be done soon. I think the statement (1) was inadvertent, so why not fix it?

      I’m sure they wouldn’t write ASA III with me, and I’m not expecting them to.

      • Stuart Hurlbert

        Mayo:

        I probably shouldn’t have used “ASA III” as shorthand for what I was envisaging, which was an article that would be expected to undergo the standard review process but that would NOT be one put forward as an officially approved ASA policy statement.

        I doubt that Wasserstein, Lazar and Schirm actually disagree w/ anything you or I (or many others here) agree on, and my thought was that they might be willing to sign on to the proposed article as individuals who’ve thought and written (and edited!) a lot about the issues but NOT as individuals officially representing ASA.

        Anyone could take the lead on this, but so far you seem to have the best combination of energy, writing skills and “fire in the belly”!

        • Thanks for your confidence in my ability, but I can’t envisage them wanting to. I’d be glad to hear of others who might be interested. It needn’t be a paper but rather a forum, say at a statistical meeting (JSM?), so that others (not invited to a “world beyond p<” conference) can have their views and background experience heard. This would have to be led by a statistician, which I am not.

    • Stuart Hurlbert

      I made an error in the preceding: “an objective prior of 0.05” should have read “an objective prior of 0.50”

  13. I keep seeing, “P values, if used appropriately….”

    Isn’t the core issue that we’re pretty sure they won’t be? If I understand right, Geoff Cumming has long argued for confidence intervals over p-values not because they are fundamentally different, but because he thinks psychologists are less likely to misuse them. Perhaps we can move the debate there.

    After all, the F-111’s backwards swing-wing control, if used appropriately, would not have crashed so many aircraft.

    • john byrd

      Clearly, if you want to draw an aircraft analogy for p-values, try the B-52, which is not perfect, but is still reliable and in service since the 1950’s. There is no other tool in our statistical kit that has proven so useful for so long.

      • There’s nothing subtle or tricky about the misinterpretations that result in irreplications of tests, and they occur equally for CIs: data dredging, multiple testing and biasing selection effects. Some think they can be ignored for CIs but they cannot. (Biasing selection effects create much worse problems for accounts that do not explicitly pick up on error probabilities and lack error control: Bayes factors, likelihood ratios, posterior probability assignments.) Confidence intervals are swell, but they are just inversions of N-P tests, and require a testing interpretation to avoid giving merely a long-run performance rationale, and to distinguish the evidence associated with different points in the interval. And of course, it’s imperative that one move away from the adherence to a single .95 confidence level. My book SIST has quite a lot on CIs. Confidence intervals are often interpreted as if the actual interval gets a probability assignment, which they don’t.

      • Thanks Stuart and Mayo, both good responses. John B, let us debate aircraft analogies over tea sometime.

        I think Stuart is closest to my intended point: “if used appropriately” all methods work. The problem is they aren’t. P-values in social science have been more abused than used. So, is the problem “between keyboard and chair”, or is it evidence of bad design, like rear-wheel steering, unguarded stick blenders, and overly transparent glass doors?

        If PEBKAC, maybe we can fix it by better education. Maybe, but history offers poor odds.

        If design, we could replace it with Bayes, or CI, or neoFisher, or MML, or…. But is this just fun scrapping, or is there clear evidence here that gains on one question are not offset by losses on others? Is any one method favored?

        Or is it just people being lazy? In the online bits of Severe Testing, I think I hear that method doesn’t matter if you don’t actually TEST. Advocates of other methods often imagine their new method would be “used appropriately”. But presumably so did Fisher and NP. It may be too much to ask of any default system.

        In many areas, some designs *are* better than others. Oxo peelers over standard ones, frequency format over Bayes’ theorem. Is anything known about the various approaches competing with p-values? None will be perfect, but one might have less costly mistakes.

        If the fundamental problem is amateurism, perhaps psychology should forsake teaching in-house statistics, and require experimenters to consult statisticians.

        • ctwardy:
          Consulting statisticians is wise, but it’s important to realize that the questionable research practices (QRPs) that cause problems don’t require complex statistics to grasp and avoid. Consumers know from day-to-day inundations by sales pitches to look out for the selective reporting of snake oil salesmen. They understand what Fisher called “the political principle” that anything can be proved by selective reporting. The WORST thing that can be done is to switch to methods where the data dredging and selective reporting become invisible. Accounts based on error probabilities of methods–I call them error statistical–have ways to pick up on these moves. That is why fraudbusting and replication use error statistical tools.

          Any alternative method, to be preferable or even acceptable, must show how it grapples with and avoids the two main problems causing unwarranted inferences and non-replication:
          (a) data-dredging, multiple testing, post-data subgroups and other biasing selection effects. Error probabilities pick up on these, and alleged error probabilities become invalidated (see the small simulation after (b) below).

          (b) taking a statistical inference as warranting a substantive claim H*, when in fact there are many ways H* can be false or unwarranted by the data that haven’t been probed at all. An abusive form of testing allows inferring from a statistically significant result to a substantive scientific hypothesis that entails or explains the stat sig effect: we know data underdetermine hypotheses and there are a multitude of rival ways to fit the data.
          Simple Fisherian tests, without an alternative, may appear to licence this, but it’s fallacious. The error probabilities no longer hold with respect to H*.
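          Here is the small simulation mentioned under (a), with an assumed setup of my own (20 independent one-sample t-tests, all nulls true): reporting only the “best” of the 20 invalidates the nominal 0.05 error probability.

          ```python
          # Assumed setup: 20 independent true-null t-tests per 'study'; we record how
          # often at least one nominal p < 0.05 turns up when only the minimum is reported.
          import numpy as np
          from scipy.stats import ttest_1samp

          rng = np.random.default_rng(1)
          trials, k, n = 2000, 20, 30
          hits = 0
          for _ in range(trials):
              pvals = [ttest_1samp(rng.normal(size=n), 0).pvalue for _ in range(k)]
              hits += min(pvals) < 0.05       # report only the smallest p-value
          print(hits / trials)                # roughly 1 - 0.95**20, about 0.64, not 0.05
          ```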

          So what can alternatives say about how they block the problems in (a), (b)?
          (a) Methods that don’t control error probabilities, e.g., likelihood ratios, Bayes Factors, posterior probabilities–would need to show how they block moves that lead to the problems in (a).
          (b) With respect to blocking the problem in (b), the problem that I see is that data confirm a hypothesis H by Bayes’ theorem by increasing its probability: Pr(H|x) > Pr(H). But if H entails x, then its posterior does go up (unless Pr(H) is already maximal). In other words, probabilistic affirming of the consequent goes through for Bayesians (see the numeric sketch below).
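          A minimal numeric sketch of that point, with made-up numbers: if H entails x, Bayes’ theorem boosts H whenever x was not already certain.

          ```python
          # If H entails x, then Pr(x | H) = 1, so Pr(H | x) = Pr(H) / Pr(x) >= Pr(H).
          pr_H = 0.2          # assumed prior for H
          pr_x = 0.5          # assumed marginal probability of the data x
          pr_x_given_H = 1.0  # H entails x
          pr_H_given_x = pr_x_given_H * pr_H / pr_x
          print(pr_H_given_x)  # 0.4 > 0.2: x "confirms" H merely because H entails it
          ```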

          Error statistical methods have ways to block and identify (a) and (b). The onus is on alternative methods to show they have principled ways to avoid these fallacious moves. It’s not sufficient to say “we thoughtfully consider background information, and that will prevent any counterintuitive inferences.”

          A paper came out recently in Clinical Trials (responding to the Amrhein et al. paper).
          The link is here: https://errorstatistics.com/2019/06/01/dont-let-the-tail-wag-the-dog-by-being-overly-influenced-by-flawed-statistical-inferences/

          “The proposals for abandoning p-values altogether often suggest adopting the exclusive use of Bayesian methods. For these proposals to be convincing, it is essential their presumed superior attributes be demonstrated without sacrificing the clear merits of the traditional framework. Many of us have dabbled with Bayesian approaches and find them to be useful for certain aspects of clinical trial design and analysis, but still tend to default to the conventional approach notwithstanding its limitations. While attractive in principle, the reality of regularly using Bayesian approaches on important clinical trials has been substantially less appealing – hence their lack of widespread uptake.” (p. 224)

    • Stuart Hurlbert

      This notion of our supposed incorrigible stupidity as grounds for demanding a switch from P values to CIs has been discussed in extenso in Hurlbert & Lombardi (2009), pp. 330-333. Here’s a brief excerpt:

      “Fiona Fidler (pers. comm.) kindly critiqued this manuscript for us and offers that we agree on most major points. We agree that rigid institutionalization or prohibition of any one technique would be counter-productive, that misuse and misinterpretation of significance tests has been the main problem, and that “best statistical practice requires consideration of the full range of possible statistical techniques and researchers’ informed judgement to choose the most appropriate design, measure and analyses to serve the particular research goals” (Fidler & Cumming 2008). Her one key disagreement is that she believes significance tests must be deemphasized and used less frequently because our students and colleagues will never learn to use them appropriately. We respond by asking that the neoFisherianism be given a chance. If our students and colleagues have not responded well to force-feeding with the paleoFisherian and Neyman-Pearsonian paradigms, perhaps that speaks well to their intelligence.”

      Most debates and re-debates amount only to wheel-spinning, because so many people prefer scrapping to spending days getting up to speed on the historical literature.

      • Miodrag Lovric

        Hi Stuart

        I have read your paper and you have raised many valuable points. One of these points is that p-values do not overstate the evidence against the point null hypothesis. However, not for the reasons that you mentioned. One of your arguments (p. 340) is the following: “Even more damning to Berger and Sellke’s claims is the fact that if the prior probability of H0 is set at < 0.35, and if the observed P value is 0.05, then the posterior probability for H0 will always be < 0.05 (Krueger 2001)."

        This is wrong because the posterior probability of H0 (and the Bayes factor in its favor) increases with the sample size. This leads to the famous Jeffreys-Lindley paradox. In short, for any prior probability strictly larger than 0 (in your case 0.35) and a fixed p-value (say 0.05), you can always find a sample size large enough that the posterior probability of H0 will be 0.95, which of course supports the null.

        I will give you one more example. In the case of testing a normal mean with known variance, if you set the prior probability of the null as low as, say, 0.035 (10 times smaller than yours), then for a sample of n=1000 and a p-value of 0.05, the posterior of the null is 0.14; if you increase the sample size to 13 million, the posterior probability of H0 for the same observed p-value becomes 0.95! Such a sample size is quite normal in the era of big data.
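        A short Python sketch of this kind of calculation; the spike-and-smear setup (H0: mu = 0 versus mu ~ N(0,1) under H1, sigma = 1, z = 1.96) is my assumption for illustration, and it gives figures close to the ones quoted above.

        ```python
        # Posterior probability of a point null H0: mu = 0, with prior mass prior_h0 on H0
        # and mu ~ N(0, tau^2) under H1, given a fixed z-statistic from n observations.
        import numpy as np
        from scipy.stats import norm

        def posterior_h0(z, n, prior_h0, tau=1.0, sigma=1.0):
            v = 1.0 + n * tau**2 / sigma**2                               # variance of z under H1
            bf01 = norm.pdf(z) / (norm.pdf(z / np.sqrt(v)) / np.sqrt(v))  # Bayes factor for H0
            post_odds = bf01 * prior_h0 / (1.0 - prior_h0)
            return post_odds / (1.0 + post_odds)

        z = 1.96                                   # two-sided p-value held fixed at 0.05
        for n in [1_000, 13_000_000]:
            print(n, round(posterior_h0(z, n, prior_h0=0.035), 2))
        # n = 1,000      -> posterior of H0 about 0.14
        # n = 13,000,000 -> posterior of H0 about 0.95
        ```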

        Once again, I enjoyed reading your paper.

  14. Miodrag Lovric

    Deborah

    As the Editor-in-Chief of the International Encyclopedia of Statistical Science (617 contributors from 105 countries), I was concerned about the crisis of statistics education in developing countries. However, since 2015 I have realized that we have a statistical crisis in science. I have conducted at least 15 seminars in Australia, Brazil, New Zealand, and now in the USA to establish a new paradigm in statistical testing. For decades, point-null hypothesis testing has produced countless criticisms and recently even a methodological crisis in some fields of science, and has done serious damage to the image of statistics and statisticians. For example, Hubbard recently stated in three papers (including in the American Statistician and the Journal of Applied Statistics) that many of the most eminent statisticians do not understand the difference between p-values and the type I error rate!

    For 70 years non-statisticians criticized significance testing and statisticians were silent. Now for the first time we have a response from the most respectable statistics society. However, what they have stated in ASA II is the following: “What you will NOT find in this issue is one solution that majestically replaces the outsized role that statistical significance has come to play.”

    So, they simply don’t know what to do. They had debated for two years and didn’t find a consensus. The reason, in my view, is that we don’t have authorities like Fisher, or Jeffreys, or Neyman. The new generation is forced to specialize in some non-significant areas (far from the foundational issues) in order not to perish. However, as you know, in order to understand these fundamental issues one has to spend many years and read hundreds of papers. They simply don’t have time for that.

    I expected that the ASA would have organized some powerful committee that would suggest a new approach to statistical testing, not just make some new quasi-religion of statistics. Their statement is nothing but a propagation of political correctness in science. I cannot subscribe to that.

    Have they considered what millions of researchers will be doing now? These people are utterly confused now, starting from Bob Cousins.

    Did they consider how professors will teach statistics without statistical significance and without rejection of a null hypothesis? Thousands of books have to be rewritten, but how? Hence, I think that the ASA should have spent several more years to establish something new, and only then announced their stance.

    We all need to sit together, frequentists, (neo)Fisherians, Bayesians, likelihoodists, etc., to harmonize statistical inference. Apparently, this is possible to achieve with confidence and credible intervals and one-sided testing. However, the San Andreas fault, i.e., point-null hypotheses, leads to the Jeffreys-Lindley paradox (my paper on the JLP will be published soon).

    Now, back to your blog.
    (a) Your first bulleted item corresponds to the last item on the ASA II bullet list “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).” I believe that this principle refers to the simple thing that was reiterated hundreds of times: statistical significance is not the same entity as practical significance.

    (b) Your second bulleted item “No p-value can reveal the plausibility, presence, truth, or importance of an association or effect” is wrong. If the p-value is zero we can be almost 100% sure that the null is wrong, because the p-value equals 0 only when the corresponding test statistic diverges to infinity. A similar opinion, for example, was shared by Karl Pearson:
    “There is only one case in which an hypothesis can be definitely rejected, namely when its probability is zero” (Pearson, quoted from the American Statistician, February 1994, Vol. 48, No. I, paper “Karl Pearson and R. A. Fisher on Statistical Tests: A 1935 Exchange from Nature”, p. 6, https://www.jstor.org/stable/pdf/2685077.pdf)

    Finally, I agree with your third recommendation “failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

    However, I also believe that we need to change the paradigm of point-null hypothesis testing. Otherwise, next year we can expect ASA III and ban of p-values, in the light of Lindley’s “we will all be Bayesians in 2020, and then we can be a united profession.”

    • Sander Greenland

      In all practice with which I am familiar, it is very wrong to say this is wrong:
      “No p-value can reveal the plausibility, presence, truth, or importance of an association or effect.”
      That statement is correct, at least if we adopt Mayo’s modification or similar, e.g.:
      “By itself, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect.”

      In contrast, this statement is false in practice, an example of mathematics reification:
      “If the p-value is zero we can be almost 100% sure that null is wrong because of p-value == 0 only when corresponding test statistic converges to infinity.”
      Why is this false? Because in practice we should not be 100% sure of all the auxiliary assumptions that went into calculating the P-value. There are examples of tightly controlled physics experiments whose results (with P near 0) had to be retracted because they turned out to be due to equipment failings, not the null being violated.

      In soft sciences like medicine, the operations surrounding clinical trials turn out to involve vast details on which results hinge. Considering them in conjunction reveals why even the smallest P-value needs to be taken as tentative and very uncertain, pending a level of scrutiny and validation which in reality is rarely applied. This uncertainty is why in practical terms P=0.04 does not signify much either way about the supposedly tested hypothesis, since all P-values are sensitive to deviations in various other assumptions (e.g., randomness of losses, measurement errors, etc.).

      All that is good reason to adopt Mayo’s modification.

      • Miodrag Lovric

        Sander, it is good to see you here. I read your recent paper about misapprehensions of p-values that you wrote together with Stephen J. SENN, Kenneth J. ROTHMAN, John B. CARLIN, Charles POOLE, Steven N. GOODMAN, and Douglas G. ALTMAN, titled “Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations”.

        You have listed “25 common misconceptions”. Now, you can add your own definition of p-value in your paper as the misconception # 26:

        “The P-value is then the probability that the chosen test statistic would have been at least as large
        as its observed value if every model assumption were correct, including the test hypothesis”

        Obviously, you wanted to give an original definition, but unfortunately, hundreds of others have tried the same and failed.

        A similar error was made recently by Gelman in his American Scientist paper “The statistical crisis in science”, when he defined p-value as “Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation”, and then corrected it at https://statmodeling.stat.columbia.edu/2014/10/14/didnt-say-part-2/

        At least I can agree with Gelman that there is a statistical crisis in science!

        My question to you all: why is the above definition deficient?

        • Sander Greenland

          Do you really want to defend traditional definitions even if they are clearly out of touch with practical realities?

          You are so far the only one to criticize our definition. And that’s because it’s correct; in fact, it’s the only fully correct unconditional definition I’ve seen. The usual definitions, including yours, are pure-logic fantasies based on conditions that are never verified in full and usually suspect in studies of human subjects in social, health, and medical sciences (which assume implicitly all sorts of dubious things like random loss and random measurement error). Then too, part of the auxiliary assumptions is that there is no violation of the model via QRPs (e.g., P-hacking), an assumption which as we well know is often violated.

          I am fighting the mathematical fantasy world of the usual definitions which math idolaters in statistics have burdened and damaged generations of human sciences with.

          As for physics (not that it matters for my work), check out the OPERA experiment reporting faster-than-light neutrinos – which turned out to be due to equipment faults:
          https://en.wikipedia.org/wiki/Faster-than-light_neutrino_anomaly
          -“OPERA scientists announced the results of the experiment in September 2011 with the stated intent of promoting further inquiry and debate. Later the team reported two flaws in their equipment set-up that had caused errors far outside their original confidence interval: a fiber optic cable attached improperly, which caused the apparently faster-than-light measurements, and a clock oscillator ticking too fast.[3] The errors were first confirmed by OPERA after a ScienceInsider report;[4] accounting for these two sources of error eliminated the faster-than-light results.[5][6]”

          • Miodrag Lovric

            Sander, I am not merely criticizing your definition; your (and the other six co-authors’) definition is wrong.

            Let us dissect your “improved” non-traditional definition of the p-value. (BTW, my prior is almost 1 that someone else wrote it, not you; I think I know who it is, since he made similar blunders on p-values in two of his papers.)

            “The P-value is then the probability
            that the chosen test statistic would have been at least as large
            as its observed value if every model assumption were correct,
            including the test hypothesis”

            So the “improvement” is that you used the words “at least as large” instead of the traditional “at least as extreme”. Why is this wrong? Because it covers only p-values when the alternative is one-sided and we use a right-tailed test. Only in this situation do we focus on the values of the test statistic that are at least as large as the one obtained.

            If we deal with a point-null hypothesis (as we usually do), your definition covers only the half of the p-value located in the right tail, not the values of the test statistic that are at least as SMALL as the one we have observed, with a negative sign. In other words, we also have to include all the values that are smaller than our t (because you mentioned the t-test in the sentence before the definition). Hence your definition is plainly wrong: it will give only p-value/2 and will finally show that Berger was right and that p-values overstate the evidence!

            Finally, your “non-traditional” definition is erroneous since it cannot handle p-values for a left-tailed test. Example: we test the claim that the average height of adult females in New Zealand (I know that you like New Zealand) is at most 170 cm. Hence, Ho: mu >= 170 cm, H1: mu < 170.

            Now, where is the critical region? Every student knows: on the left. So what is the p-value here? P(T <= -abs(t)), i.e., the integral from minus infinity up to the (typically negative) observed test statistic. In words, the area at least AS SMALL (not as large, as you stated) as the obtained test statistic!
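            A tiny sketch, with an assumed t-statistic and degrees of freedom (not taken from any real data), of how the tail direction changes the p-value computed from the same statistic:

            ```python
            # Left-tailed, right-tailed, and two-sided p-values from one observed t-statistic.
            from scipy.stats import t

            t_obs, df = -2.1, 24                # assumed observed value and degrees of freedom
            p_left = t.cdf(t_obs, df)           # H1: mu < 170 -> "at least as SMALL"
            p_right = t.sf(t_obs, df)           # H1: mu > 170 -> "at least as large"
            p_two = 2 * t.sf(abs(t_obs), df)    # two-sided    -> "at least as extreme"
            print(p_left, p_right, p_two)
            ```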

            I know this is elementary, but I have found this misconception about p-values in dozens of peer-reviewed journals.

            I am really surprised that I am the only one who has noticed this elementary but very common misapprehension of p-values! Perhaps that is because, as you know, I was the sole editor of the International Encyclopedia of Statistical Science and had to correct and improve at least 300 entries myself; there wasn’t anyone to help me. In any case, thank you once again for your contributed papers 🙂

            Even if you want to defend your position by saying that you were thinking of the absolute value of the test statistic, this is also erroneous because (1) we almost never use the absolute value of the Student test statistic, starting from Gosset’s original paper, and (2) that definition will not cover a left-tailed test.

            Recommendation: to avoid all the deficient points in your definition you can use Casella & Berger's Theorem 8.3.27, Statistical inference, p. 397: Let T(x) be a test statistic such that large values of T give evidence that H1 is true….and now you can copy your definition.

            Or you can say, like Berger & Delampady in their “Testing Precise Hypotheses” (1987, p. 317):
            Let T(X) be a test statistic, extreme values of which are deemed to be evidence against H0…. See, there is no "at least as large as" in this definition. BTW, Berger usually uses the previous definition.

            In summary, the paper by you and six other people (Stephen J. SENN, Kenneth J. ROTHMAN, John B. CARLIN, Charles POOLE, Steven N. GOODMAN, and Douglas G. ALTMAN), titled “Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations”, should include its own definition of the p-value as misapprehension #26.

            Finally, your paper was cited 368 times and downloaded 169,000 times, and not a single person noticed the major error? And the paper is devoted to almost universal misconceptions of p-values. This only tells me how deep the statistics crisis in science is.

            Best regards

            • Sander Greenland

              Miodrag: OK, I see I was quite wrong about what you were criticizing in our definition. For me the important “modern” part of it concerns being unconditional on the model (although I seem to recall that Fisher noted this interpretation in passing somewhere).

              Now on to the major error: You noted the paper was cited 368 times and downloaded 169,000 times. Also, it was read and commented on by several statisticians of sound repute (see the author list and the acknowledgments, which includes contributors to the current discussion). From these statistics we can’t be sure no one noticed a problem, but my conjecture is that our error went unremarked (if not unnoticed) because nobody read our English with the exacting precision you did (unsurprising perhaps because it was written for and likely most read by an audience a fair bit short of your technical level).

              In any event, thank you for pointing out the problem at last; I only regret that you had not done so earlier in a letter to the TAS editor or in e-mail to us.
              We can now repair the oversight (although below you will see that I reject Casella-R. Berger on this matter)…

              To start, consider that the “at least as large” wording might be traceable to our most prominent source of wisdom:
              Cox & Hinkley (CH) Theoretical Statistics and Cox’s related writings.
              On p. 66 CH defines a test statistic as a random variable T computed from the data for which
              a) the distribution of T is at least approximately known and the same for all simple hypotheses comprising H0
              (those are point models comprising a model manifold M in the data-expectation space, which defines our test model);
              and
              b) the larger the value of T the stronger the evidence of departure from H0
              (what I now call “the greater the refutational information”).
              CH then define the “level of significance” (P-value) as p_obs = Pr(T >= t_obs; H0)
              – Again, as they use it, H0 is composite if there are “nuisance” parameters in the problem (parameters other than the target that are not constrained to a point by H0), so in effect any rejection of their H0 would be a rejection of all fully-specified models within H0 and thus of the entire assumption set used to compute p_obs, not just the targeted parameter constraint (e.g., beta=0).

              Apparently the same definition (including “the larger is t(y), the stronger is the inconsistency* of y with H0”) can be found in sec. 2.1 of Cox 1977, The Role of Significance Tests with discussion by Emil Spjøtvoll, Søren Johansen, Willem R. van Zwet, J. F. Bithell, Ole Barndorff-Nielsen and M. Keuls (Scandinavian Journal of Statistics, 4(2), pp. 49-70), yet I don’t see where anyone objected to it. And so on in later work.

              – My big question to you is then, do you find anything objectionable in the CH definition? For it appears to me that the first part of their (b) is exactly what you object to, so if that is the case then the oversight reaches to a remarkably elite level of statistics. And if instead you are fine with it then you must explain further how it works for you whereas ours didn’t.

              Turning to alternative wordings that would meet your objection, Casella-Berger’s “Let T(x) be a test statistic such that large values of T give evidence that H1 is true” seems quite deficient if taken out of context (as your suggestion seems to advise doing):
              Consider a normal(mu,1) counterexample with H0: mu=0, H1: mu= -5, true mu= 5.
              Then for large enough samples T can be as large as you like but won’t be giving evidence that H1 is true (at least in any sense I can see).
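              A quick numerical check of this counterexample (the sample size and seed are my own choices): T is huge, yet the data favour H0: mu = 0 over H1: mu = -5 by an enormous likelihood margin.

              ```python
              # Simulate from the true mu = 5; compute the 'large' statistic T and compare
              # log-likelihoods under H0: mu = 0 and H1: mu = -5.
              import numpy as np
              from scipy.stats import norm

              rng = np.random.default_rng(0)
              n, true_mu = 100, 5.0
              x = rng.normal(true_mu, 1.0, n)
              T = np.sqrt(n) * x.mean()              # z-type statistic against H0: mu = 0
              ll_h0 = norm.logpdf(x, 0.0, 1.0).sum()
              ll_h1 = norm.logpdf(x, -5.0, 1.0).sum()
              print(T)              # very large
              print(ll_h0 - ll_h1)  # strongly positive: the data favour H0 over H1, not H1
              ```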

              So: If substituting “extreme” for “as large or larger” will satisfy you, then (pending other ideas) I think I’d settle for that correction, noting only that it may need elaboration to encompass “too good a fit” tests (e.g., the lower tail in Fisher’s analysis of whether Mendel’s data had been massaged), for which “extreme” would not automatically be understood to encompass “extremely small”, especially since it doesn’t work with less than 3 df.

              As for whether this oversight is partly responsible for the actual crisis of statistics in science: I confess I thought that was due to the other 25 items we listed (items like thinking the P-value is the probability of H0), not from trying to use the CH definition for a lower-tail test, but if you have an example of a research report clearly reaching a wrong conclusion because of it, please by all means send it. And again, thanks for pointing out the problem! We shall be more careful in the future. Looking forward to your response.

              *As a side note: CH say the random variable P corresponding to p-obs is uniform under H0 (they assume the test model is fixed, as in not being gamed to the data) and “We use p_obs as a measure of consistency of the data with H0…” – This is just the same as the “compatibility” idea but using a word (“consistency”) that they themselves later use to mean something entirely different (convergence to the parameter) and is also used differently in logic and for yet something else again in potential-outcome modeling. Hence I prefer “compatibility.”

              • Miodrag and Sander:
                Cox and Hinkley say we’re really involved with two 1-sided tests. “More commonly,” Cox and Hinkley (1974, p. 79) note, “large and small values of [the test statistic] indicate quite different kinds of departure… . Then it is best to regard the tests in the two directions as two different tests… one to examine the possibility” μ > μ0, the other μ < μ0. “…look at both [p+ and p-] and select the more significant of the two, i.e., the smaller. To preserve the physical meaning of the significance level, we make a correction for selection.” (ibid. p. 106) In the continuous case, we double the separate p-values.

                I think this is the better way to see the 2-sided test. I don’t see the problem with this aspect of Greenland’s definition. As he notes, he’s one of the few people to deny “p-values exaggerate evidence” (Senn, Hurlbert, me as well) and even gets the others to sign on to declare that the issue turns on one’s philosophy of statistics! I often quote this.

                We should try to keep this discussion closely to the specific topic, even if it's tempting to bring up various issues you'd want to ask someone here. We can do that elsewhere.

              • Miodrag Lovric

                Dear Sander, thanks so much for your kind recognition 🙂 I agree with almost everything in your reply, but I don’t think Mayo would be happy if I gave you a long reply here. However, I think that we need to dig as deep as we can into the rabbit hole of p-values.

                First, let me start with the following (wrong) definition:
                “It [p-value] can be written as
                P value = Pr(t(X) ≥ t(x)|H0),”
                (Armitage – Encyclopedia of Biostatistics, entry “p-values”),

                or the funny one

                “HISTORY (of p-values): The p-value was introduced by Gibbens and Pratt in 1975.”

                (The Concise Encyclopedia of Statistics, Yadolah Dodge, p. 434).

                Now back to your question. The CH book was written in very difficult times, when the statistics world was divided into Fisherian and NP camps. In those days a “hybrid” approach to statistical testing did not exist. Hence, they devoted one chapter (Chapter 3) to “pure” significance testing (the Fisherian kind, without an alternative hypothesis), and one (Chapter 4) to significance testing of the NP type.

                The definition of the p-value on page 66 that you quoted concerns a simple null hypothesis, that is, one that completely specifies the density. They used the expression “level of significance” instead of the modern term, p-value. After that, they start discussing composite null hypotheses, and on page 76 they say:

                “Small values of T indicate a departure from H0 in that direction. Given a very small value of

                p_obs = P(T ≤ t_obs; Ho) …”

                which means that in the case of a composite null they extended their original definition, reserved for a simple null, to allow p-values based on smaller values of a test statistic.

                Finally, on page 90 they gave their overall view of p-values by stating:

                “…we defined for every sample point a significance level [read p-value], the probability of obtaining under H0 a value of the test statistic as or more extreme than that observed”.

                Hence, the CH book is a little confusing, since it contains different definitions of p-values depending on the nature of the null hypothesis, but the one above is their final verdict, which is of course correct.

                Thanks again for your thoughtful reply.

      • Miodrag Lovric

        Sander, this depends upon the valid definition of the p-value. If we agree on the broader definition of the p-value, which means that it is based on the “truthfulness” of the entire null model (including, of course, the null hypothesis), then your auxiliary assumptions are irrelevant; we assume that they are satisfied. However, if we condition the p-value only on the truth of the null hypothesis, you are right.

        Finally, I think that the story of the equipment failings is an urban myth, but Bob can shed some more light on this.

      • Sander: We can’t evaluate any of these prescriptions if the quantities are illicit. Else we must say that no likelihood ratios, Bayes factors, confidence intervals, etc. are informative about what they purport to be. I think that would be entirely reasonable, but know that is not what the authors of ASA II mean (I asked). Thus, my insertion of “in itself” was mainly to recognize that isolated P-values (P-value reports without any reporting of the test statistic and hypotheses) do not reveal the presence of an effect. P-values that have passed audits reasonably well DO reveal the presence of effects, and this I take it is behind Principle 1 of ASA I.

        • Sander Greenland

          As usual I think we agree at the core of this (your edits seem to me on the right track anyway) but as per my particular applied background (observational epidemiology), my way of putting the problem is colored more cautiously:

          We should not use the usual evidential or inferential or decision interpretations of stats (as embodied in traditional definitions, like the usual P-value definition of testing a hypothesis) because they are implicitly conditioned on what may be scores of possibly uncertain auxiliary assumptions (like no loose cables in the equipment; no unaccounted-for selection of P-values; etc.), unless there are audits that reduce their uncertainties to the point that doubts about their conjunction (the background model) are no longer raised by any stakeholder. Only then can we say anything “reveals presence” of a targeted effect; otherwise it may only be revealing the presence of an uncontrolled, undesired effect, and the alternative explanations that cannot be ruled out to everyone’s satisfaction by the design and conduct description (including of audits) should remain part of the statistical interpretation.

          This kind of unconditionally cautious view and interpretation of stats has been a hallmark of good epidemiologic reporting since I was a student. Not that all follow good practice by any means, but it has become a normative standard for a large segment of the community since Rothman’s efforts to popularize some version of it began in the 1970-80s. My divergence there is mainly that replacing “significance levels”/P-values with “[over]confidence intervals” proved to be no panacea for abuses – an empirical fact that moved me back closer to the “neo-Fisherian” tester’s view (as I understood it from chapters 2-6 of Cox & Hinkley’s 1974 book and Cox’s subsequent 1977 and 1982 articles).

          • Sander: But then we can’t use the usual interpretations of any of the alternative methods either. I was trying to keep close to the framework these remarks assume, in order to try to be constructive. They do not, for example, question the many assumptions required to report odds ratios and Bayes factors; they take them at face value (the Bayes factor in favor of no effect is k). The reason I don’t go back to square one is that the ASA II, in its current form, is very problematic.

            You wrote:
            “My divergence there is mainly that replacing “significance levels”/P-values with “[over]confidence intervals” proved to be no panacea for abuses – an empirical fact that moved me back closer to the “neo-Fisherian” tester’s view (as I understood it from chapters 2-6 of Cox & Hinkley’s 1974 book and Cox’s subsequent 1977 and 1982 articles)”.

            What was found? That’s the kind of information that should be brought to bear on these stat reforms. Were they?

            • Sander Greenland

              Mayo: I was referring to the widespread use of CIs as NHSTs, I don’t have a citation at hand but have seen studies documenting how that use predominates in the literature they surveyed. Not really surprising when such use is in effect forced by (for example) JAMA editorial policy requiring every discussed association be declared “significant” or “not significant” based on the 0.05 cutoff (and without even requiring the addition of “statistically” ahead of it, an abuse that was decried a century ago).

              • Sander:
                I’m not sure I get your point, or your position now. If you can’t use statistical tests, then everyone will use CIs as tests. I seem to recall your writing that it will not do to replace tests with CIs, that we also need tests, and I wish I could remember where you said it. I would have to search dozens and dozens of papers. But it doesn’t matter if you no longer hold that view. So do you?

                If we need tests, as I think we do, then we need thresholds at least between warranted, shaky, and terrible.

                Of course, tests could be recovered by saying the data are “statistically incompatible” with, say, no association, at level .95 (or .05) when 0 is excluded from the .95 interval. You’d lose the information from the P-value unless several confidence levels, say from .5 to .99, were given. But you’d also need to distinguish the hypotheses corresponding to different points in the CI. I like CIs but only if they too are reconceived: they inherit the problems of N-P tests.
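                A rough sketch of reporting several confidence levels for one estimate rather than a single 0.95 interval used as a test; the estimate, standard error, and normal approximation are assumptions of mine, purely for illustration.

                ```python
                # Intervals at several confidence levels for one estimate (normal approximation).
                from scipy.stats import norm

                est, se = 0.30, 0.12   # assumed point estimate and standard error
                for level in [0.50, 0.80, 0.90, 0.95, 0.99]:
                    z = norm.isf((1 - level) / 2)
                    print(f"{level:.2f}: ({est - z * se:.3f}, {est + z * se:.3f})")
                ```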

                All of this is still largely irrelevant to the real sources of bad statistics. But our topic here is revising ASA II, and I still say it is worthwhile and very important to do so.

                • Sander Greenland

                  Mayo: I think if you read my 2019 TAS fest contribution, https://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
                  you will see I use the word “test” with no prejudice against the concept, which I think subsumes a useful set of tools. Unfortunately (and I think in part because statistical theory developed before computer-aided statistics, and in part because of human weakness for precise answers even in the face of great uncertainty) that toolkit became the obsession of statistical training and nonexpert practice, at the expense of estimation and capturing the remaining uncertainties about targets (which could be a costly added computational burden in a bygone era in which “computer” was a job title).

                  As for cutoffs for P-values, I see no pressing need for (and much harm in) them in analyses whose goal is summarization of information via an abstract statistical model derived from the data-generating process; given the space (now unlimited in online supplements) one can simply present all the resulting estimates and indications of their uncertainties, e.g., via P-values, likelihood ratios, posterior odds, etc., as long as those are given for multiple points of interest so that one can visualize the relevant portion of the P-value function, likelihood function, posterior distribution, etc. Intervals can help or hinder depending on how their arbitrary dichotomization is handled; as you note, that issue can be addressed by Cox’s suggestion to present intervals for several %-levels.

                  I only see particular cutoffs as a needed vice when decisions and hence value judgments are to be based on one final test. In those cases I still plead for attention to conflicts among interested parties, a complication Neyman noted as I reviewed in the 2016 TAS round,

                  Click to access utas_a_1154108_sm5079.pdf

                  and then at length in
                  https://doi.org/10.1093/aje/kwx259

                  This means (a) no one cutoff should be some general standard, as you and scores of others said in Lakens et al. (which I often cite for this point), and (b) the appropriate hypothesis to test for that decision may not be the null – “nil” to those addicted to Fisher’s trampling of ordinary English usage, a trampling matched in harm by Neyman’s misuse of “confidence” for his behavioral-decision intervals, and by the common misuse of “significance” (especially as mandated by JAMA, enforced without even the modifier “statistical”).

                  That said, again I agree that tuning of recommendations (ASA, yours or mine) is worthwhile, especially since rigidity and adherence to tradition (especially in terms and descriptions) has been a major contributor to misunderstanding, misuse and misinterpretation of statistics.

                  • You say: “I agree that tuning of recommendations (ASA, yours or mine) is worthwhile”. I hope, in that case, you will help me in pursuing them.

                    However, it’s also clear that other issues which have been unquestioned and unchallenged need to see the light of day in any such best practice guide.

                    I quickly read your American Journal of Epi article (I hadn’t seen it before–it’s not the one I had in mind where you explicitly say we need tests, not just CIs). It’s very good and clear. I just don’t see why these issues do not find their way into the ASA guides. They remain hidden (and will not vanish with the alternatives on offer). Notably, something you and I have talked about (and I reference you in SIST on this) is:

                    “The bias is built directly into Bayesian hypothesis testing in the form of spikes of prior probability placed on null hypotheses. Yet in soft sciences these spikes rarely have any basis in (and often conflict with) actual prior information (20–25).
                    Neyman himself recognized that nullism is an incorrect general view, noting that false negatives could be more costly than false positives for some stakeholders (27, pp. 104–108; 28).
                    One-sided P values can further help mitigate nullism by shifting the focus from a precise hypothesis (such as the null), which is unlikely to be exactly true, to the hypothesis or probability that the targeted parameter lies in a particular direction (23, 37).
                    Although Bayesians have raised important criticisms of significance testing, they often overlook limitations of Bayesian inference (43, 44) and sometimes claim that P values overstate evidence against the null (45–47). … it is based on a Bayesian standard of evidence (the Bayes factor) which is of doubtful validity for evaluating refutational measures like the frequentist P value (20, 48)”

                    All of these points are discussed in my book SIST (2018, CUP). ASA I, which is where P-values are defined, is based on the nil null. Why did no one object? On this blog alone, you, me, Hurlbert, Miodrag have raised this issue. Neyman, as you note, opposed nullism.

                    Your other point, especially on “result-driven analysis” should also arise in a scrutiny of today’s statistical practice:

                    “The unlimited sensitivity of effect estimates from bias models implies that any desired inference can be manufactured by back-calculating to the plausible-looking models and priors that produce it, thus providing an avenue for motivated statistical reasoning (54). Analysts can completely deceive readers (and themselves) by failing to report result-driven analysis selection.”

                    If Wasserstein and others held a forum to discuss these concerns, which remain hidden in the debates, he would find a lot more agreement than when everyone merely repeats the retread arguments against P-values, or puts forward their own favorite without giving it a hard time.

    • Miodrag:
      ASA II is said to be a group of recommendations that are open to broad consideration. We should not be seeing textbooks rewritten to follow a standpoint that is so unclear. My blogpost was the result of my trying to identify what ASA II is, and what is its relation to ASA I. We have seen it is not consistent with ASA I, or at least I have brought out conflicts that require attending to.

      The kind of forums that you recommend need to happen first. For the same reason, journal editors should not be asked to overhaul their requirements until there have been adequate forums to hear the background information that currently exists about error statistical and alternative methods–or so I would recommend.

      • Miodrag Lovric

        Thanks, Deborah. However, it seems to me that ASA II is more rigid. Look at this statement on page 8:

        “Statistics education will require major changes at all levels to move to a post “p < 0.05” world"

        My question to you is the following: If they are serious that we should say goodbye to statistical significance, should we also remove rejection of the null hypothesis from teaching? If we do, then should we also eliminate references to Type I and II errors, because we cannot commit them if we don’t make a decision about the fate of the null? Additionally, should we annihilate the topic of power since, again, it requires mention of rejecting a false null? This would make life much easier for certain people 🙂 These topics would then (posthumously) live only in the domain of pure theory.

        Next, this discussion didn’t include confidence intervals. Maybe you could include the following statement of theirs in your bullet list:

        "We need to stop using confidence intervals as another means of dichotomizing (based, on whether a null value falls within the interval)."

        Should we teach CI as "compatibility intervals"? On page 5 they say: "Amrhein, Trafimow, and Greenland (2019) and Greenland (2019) argue that the use of words like “significance” in conjunction with p-values and “confidence” with interval estimates misleads users into overconfident claims. They propose that researchers think of p-values as measuring the compatibility between hypotheses and data, and interpret interval estimates as “compatibility intervals.”

        Does this mean that ASA III will start with Don’t Say “Confidence interval”, because it has been “proved” that it is a fallacy to place confidence in confidence intervals (https://www.ncbi.nlm.nih.gov/pubmed/26450628)?

        • I agree they have much bolder and much less modest aspirations, but this goes against their own calls for modesty, humility, thoughtfulness, and bringing to bear background information, coupled with the claim that these are merely recommendations which call for broad analysis. To my knowledge, there has been no opportunity, not even a weak one, to invite those who have had experience with the broad alternatives being put forward as more acceptable than statistical significance tests to weigh in on the debate. If you’re truly trying to find something out, you do not limit the input to data that will agree with your broad position and not challenge it. It becomes, as you said, a piece of “political correctness”.

    • Christian Hennig

      There are arguments in favour and against p-values, in favour and against objective or subjective Bayes, arguments in favour and against likelihood inference, and there will be arguments against (and maybe in favour of) anything new that people come up with now, believing that they may be more intelligent than the brightest people in our profession for 100+ years. Because this is so, I don’t think we need “a new paradigm”. What we need is pluralism, which includes some scepticism and modesty about any approach – the best bit of ASA II is probably “Accept Uncertainty”!

      • Christian:
        My problem is that I don’t see “modesty” as regards the meta-stat task they set for themselves. I think the declarations are too strong, which is why I urge the modifications.

        • Christian Hennig

          Fair enough, I should’ve indicated that my posting was a reply to Miodrag in the first place; the “reply-hierarchy” seems exhausted.

    • Stuart Hurlbert

      Miodrag

      My thoughts on various points you are making. Recall that these are from someone who thinks there is no possibility of, or need for, proscribing or requiring (e.g., in an ASA document, in a journal’s “instructions to authors,” or in a textbook) particular statistical methodologies; the only possible consensus is that we should understand the methods we use, use them correctly, and interpret them appropriately. And as a first order of business we need to focus on those principles that apply universally, not those specific to only certain types or sizes of studies.

      1) I think it is not “point-null hypothesis testing” that “has done serious damage to the image,” etc. Rather, it has been the misuse, misinterpretation, and frequent overreliance – BY STATISTICIANS AND OTHER SCIENTISTS – that has done the damage.

      2) “ASA II” is a convenient but somewhat misleading label for a paper (Wasserstein et al. 2019) by three people serving to introduce 43 articles representing diverse viewpoints but never intended to be a consensus document, even if it pointed out a few points common to many of the articles. It was never intended to be an authoritative statement about “what to do.” Nevertheless, it successfully provoked a large number of people to read, think about, and debate the issues.

      There’s an analogy with the Ten Commandments which usefully names a whole bunch of bad behaviors and tells us not to engage in them (like ASA I) but gives almost no advice, other than honor your father and mother, as to how specifically to be a good person and lead a good life. The restraint obviously is appropriate because as long as you avoid the “nots”, the ways of leading a good life (i.e. of using statistics appropriately) are legion. And the sages knew that the way forward was not to take a little more time, create a bigger committee, and come up with the Hundred and Ten Commandments.

      3) Your point about the forces pushing statisticians to focus on “non-significant areas” to the neglect (or disdain) of the “foundational” ones is an excellent and highly relevant one.

      4) It would be a mistake I think for ASA to “organize some powerful Committee that would suggest a new approach to statistical testing.” Better that anyone claiming to have the “intellectual power” to merit a seat on that committee 1) demonstrate their “power” by putting out a better introductory statistics text and/or by writing a review article focused on one or more key issues, and 2) then submit all their published work (if any) on foundational issues for rigorous scrutiny by others. Lots of shovel work in the Augean stables is still needed before we call in the muralists, florists and odor control folks.

      5) More, smarter, and more public-spirited (i.e. willing and prompt) reviewers, and knowledgeable editors willing and able to referee protracted author-referee disputes, are an m.o. preferable to the formation of powerful committees.

      6) This is not the place to deal with the issue extensively, but I wonder if your new JLP paper advocating one-sided testing 1) confronts the long-standing arguments against such, as summarized in the Appendix of Lombardi, C.M. and S.H. Hurlbert, 2009, “Misprescription and misuse of one-tailed tests,” Austral Ecology 34: 447–468, and 2) avoids the common conflation of significance testing/assessment with scientific hypothesis testing?

      Stuart

      • Stuart:
        I’m replying to some of your remarks to Miodrag:
        “Recall that these are from someone who thinks there is no possibility of, or need for, proscribing or requiring (e.g., in an ASA document, in a journal’s “instructions to authors,” or in a textbook) particular statistical methodologies; the only possible consensus is that we should understand the methods we use, use them correctly, and interpret them appropriately.”

        Wait a second. Do you really mean this? If so, then you would not send around calls for people to sign on to “don’t use ‘significance’”, and you would not send around letters to journal editors urging/nudging/asking (whatever) them to adopt ASA II. Yet you have done both of these things. (I’ve received the first, and journal editors have reported on the second.) If this blogpost ends that behavior it will be more than worth it.

        The ASA may claim, of course, that no one is REQUIRING you to abide by this; it’s a mere recommendation. But everyone knows that there is very strong pressure on people in the field to go along with what the ASA wants; there are all kinds of career/professional implications. The ASA should invite discussion from other points of view, but so far I have not seen that. Correct me if I’m wrong.
        As Miodrag points out, it’s clear within this document that the ASA plans professional strategies to get journal editors and professional groups to accept its recommendations. At minimum, they should wish to make the modifications I recommend so that it is not inconsistent with ASA I. It is disingenuous for you to allege this isn’t intended to strongly urge its view on the field. If we take seriously what it says:
        ♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)
        It would stand to reason that these tests not be used at all.

        “’ASA II’ is a convenient but somewhat misleading label for a paper (Wasserstein et al. 2019) by three people serving to introduce 43 articles representing diverse viewpoints but never intended to be a consensus document, …”

        Perhaps that gives it plausible deniability. That’s precisely why I, and others, call on them to clarify this document. ASA I was intended as principles to be adopted, and ASA II suggests it is a continuation.
        You are skirting what ASA II DOES purport to do (and if this is not so, they must clarify; otherwise a landslide of lawsuits will, quite correctly, find support for freeing test abusers from culpability in (mis)interpreting clinical trials). It maintains:

        ♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof). (p. 1)
        ♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p.2)
        ♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)
        ♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

        (Since they’ve made it clear that any threshold, however small, is “arbitrary”, this is, again, to deny that p-values can be informative, conflicting with ASA I, Principle 1.)

        Please see my other comments such as:
        https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183696

        https://errorstatistics.com/2019/06/17/the-2019-asa-guide-to-p-values-and-statistical-significance-dont-say-what-you-dont-mean-some-recommendations/#comment-183724

  15. The American Statistical Association (ASA) has a blog or forum called ASA Connect. Below is Greenland’s response to me on ASA Connect, and my response to him. They differ from what has been said here, and relate directly to the problem of needed revisions in ASA II.

    Greenland:
    A problem with the proposal
    Replace (1) with: “Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”
    is that it sounds like an endorsement of using “statistical significance” (as opposed to merely allowing P-values a role), at odds with ASA II. That could be remedied by changing to
    Replace (1) with: “Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on a P-value (“statistical significance” or lack thereof).”
    —————————-

    Mayo to Greenland:

    Many of the “don’ts” include the term “statistically significant”. They could hardly be taken as endorsing its use.
    For example,

    “Don’t believe that an association or effect is absent just because it was not statistically significant.”

    How might it be replaced? “Don’t believe that an association or effect is absent just because the p-value of the observed difference is not sufficiently small to be deemed incompatible with H0.”? I was going for a minimalist change.

    Your recommendation is fine, but it’s not clear that “only on a P-value” conveys the point or the parenthetical (“statistical significance” or lack thereof). It might be better to revert to Principle 5 from ASA I (which it was intended to capture):

    “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.” (p. 132)

    This isn’t put quite right either, in that a p-value isn’t statistical significance. They must mean “A p-value, or observed statistical significance level, does not measure the size of an effect or the importance of a result.” (For Fisher, of course, this would be what’s meant by statistical significance level.)

    Perhaps they should remove “or statistical significance” and write:

    “A p-value does not measure the size of an effect or the importance of a result.”

    But this requires changing both ASA I and II.
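    As a minimal numerical sketch of that principle (hypothetical numbers, assuming a one-sample two-sided z-test with known σ = 1): the same negligible standardized effect of 0.05 gives p ≈ 0.62 at n = 100 and a vanishingly small p at n = 100,000, so a small p-value by itself says nothing about the size or importance of the effect.

    ```python
    import numpy as np
    from scipy import stats

    # Hypothetical numbers: the same observed standardized effect d = 0.05
    # (trivially small), one-sample two-sided z-test with known sigma = 1.
    d = 0.05
    for n in (100, 100_000):
        z = d * np.sqrt(n)            # z-statistic grows with sqrt(n) for a fixed effect
        p = 2 * stats.norm.sf(z)      # two-sided p-value
        print(n, round(p, 4))
    # n = 100     -> p ~ 0.62  (same tiny effect, no small p-value)
    # n = 100000  -> p ~ 0.0   (same tiny effect, vanishingly small p-value)
    ```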

  16. Professor Mayo, thanks for this illuminating blog post. This is especially helpful as I am working through your SIST. In particular, it seems to me that a lot of concerns about objectivity and subjectivity of statistical inference (p. 5-6 of ASA II) are already addressed in Excursion 4 Tour I of SIST. I am curious, though, what you think of ASA II’s appeal to the notion of openness/transparency. If I understand you correctly, you pointed out in SIST (p. 236) that there is a tendency to use “transparency” as a way to recognize the involvement of human judgment in statistical inference, but simply making one’s choices and reasons clear does not necessarily make for a sound epistemic practice, unless we have a rational system in place (p. 237). It would seem ASA II’s appeal to transparency is open to similar criticisms.

    ASA II states that transparency is a consequence of being thoughtful. Thoughtfulness is a core component of the third bulleted item, which says that a “declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge” (p. 4). The thought seems to be that, if one is thoughtful about one’s own epistemic practice, then she would recognize that her experimental and modeling decisions might be open to reasonable disagreement. So, to be a responsible and thoughtful member of the scientific community, one ought to refrain from declaring “statistical significance”, which, I assume, has a connotation of objectivity. This would make sense of ASA II’s suggestion of reporting p-values as “continuous descriptive statistics”—presumably, this allows everyone to make up their own minds about whether or not they are low enough. This notion of transparency is implied by the quote you posted here and in SIST (p. 234) from Gelman’s “Bayesians Want Everybody Else to be Non-Bayesian”, regarding the desire that everyone else report nothing but the likelihood (presumably on the assumption that likelihoods are nothing but an unsullied report of the observations). The idea is that everyone can then construct their own Bayesian analysis based on the likelihood.

    Does this work? I am not sure. As already pointed out in your SIST (p. 237), I still need a principled way to adjudicate between two experts who come to competing conclusions on the same data; otherwise I would just have to rely on questionable criteria, such as popularity or salesmanship. The issue gets even messier when we consider the role that statistical evidence plays in the public sphere. People often expect binary answers, and my fear is that a transparent but vague statistical report might cause more confusion than insight in this situation.

    • Lok: Thanks for your comment. This will not be a response to all your points–just arrived in Munich–but I want to point you to ASA I, Principle 4, on transparency about multiple testing, data-dependent subgroups, stopping rules and other biasing selection effects. I think the ASA guides are in some tension with themselves on this important principle. The principle gets its rationale from the fact that a valid P-value, or other error probability, depends on the sampling distribution, which these selection effects alter. Alternatives to P-values, e.g., BFs and likelihood ratios, do not assess error probabilities, and thus aren’t obviously concerned with such selection effects. You see this in such places as Excursion 1 Tour II (fully linked under excerpts on this blog), and throughout the book. There wasn’t agreement as to whether “alternative measures of evidence” also demand Principle 4 (in ASA I). It is in limbo in ASA II because it is not mentioned. Yes, “transparency” is occasionally mentioned, but it’s not clear they are alluding to the biasing selection effects delineated in ASA I–as a requirement for P-values and other error probabilities.
      I’ll have to return to your other points later in the week. I hope to blog parts of the conference here this week (“Statistical Reasoning and Scientific Error”).

      • Thanks for the pointer, Prof. Mayo. It is indeed interesting that Principle 4 from ASA I is not explicitly mentioned. I think I too quickly assumed that the notion of transparency in ASA II was a reference to Principle 4, but, as you pointed out, this is not at all clear. In particular, Principle 4 seems to imply that a researcher can responsibly say “statistically significant”, given that she fully and transparently reports her experimental decisions—this seems to be in conflict with ASA II’s flat-out declaration “don’t say statistically significant”.

        It certainly sounded like Principle 4 was written with p-values in mind. I know that in your 1996 book (ch. 10), you argued that posterior distributions can also be manipulated using stopping rules. Do you still hold this view on the matter?

        I hope the conference is going/went well!

        • Lok: I’ve been traveling, and unable to check my comments, sorry.
          Yes, the point is that ignoring optional stopping, in the context of the famous two-sided Normal testing example, is guaranteed to lead to a 2SE observed difference from Ho, even when Ho is true (see the sketch below). You will see in Excursion 1 Tour II Armitage’s point that, in the case of a matching prior, the Bayesian gives high probability to not-Ho, even though Ho is true. His point is that it would appear to be a problem for a Bayesian as well. That’s not how Savage sees it. In the spike-and-smear Bayesian move, Savage can say, as he does, that the stopping rule does not alter the posterior. But the fact that it doesn’t alter their posterior is not a good thing from the error statistician’s point of view. It will give a high posterior to the alternative (how high depends on the assignment), even though Ho is true. The equivalent thing happens with confidence intervals (CIs): the Bayesian is sure to exclude 0 (or whatever Ho is) even assuming it is true. Check pp. 430-1 of SIST, where Berger and Wolpert concede that the Bayesian is bound to be misled in this case (though they will try to say that, given their beliefs, they’re not really misled…). (Check the quote in SIST by R. Little pointing out that frequentists and Bayesians each point to this example as a case in their favor!)
          Readers who don’t have SIST can find a full excerpt of Excursion I Tour II on this blog, search “excerpt”.
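          A minimal simulation sketch of this optional-stopping point, assuming the simple known-σ Normal case (the code and numbers are illustrative only, not from SIST): under a true null, the proportion of runs whose running mean ever reaches 2 standard errors from 0 far exceeds the nominal 5%, and it keeps growing as the cap on the sample size grows (without a cap it tends to 1).

          ```python
          import numpy as np

          rng = np.random.default_rng(1)

          def ever_reaches_2se(max_n: int, n_sims: int = 1000) -> float:
              """Proportion of simulated runs (H0 true: N(0,1) data) in which the
              running mean reaches 2 standard errors from 0 at some n <= max_n."""
              x = rng.normal(size=(n_sims, max_n))
              ns = np.arange(1, max_n + 1)
              running_mean = np.cumsum(x, axis=1) / ns
              return float(np.mean(np.any(np.abs(running_mean) >= 2 / np.sqrt(ns), axis=1)))

          for cap in (100, 1000, 10000):
              # The nominal 5% error rate is badly inflated, and it grows with the cap.
              print(cap, ever_reaches_2se(cap))
          ```

          Since a mean at least 2SE from 0 is the same event as the ±2SE interval excluding 0, the corresponding confidence intervals computed at the stopping point exclude the true value with the same inflated frequency.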

          You wrote:
          “Principle 4 seems to imply that a researcher can responsibly say “statistically significant”, given she fully and transparently reports her experimental decisions—this seems to be in conflict with ASA 2’s flat out declaration “don’t say statistically significant”.”

          Yes, ASA II is inconsistent with ASA I, and the authors should announce just what its position is in relation to these two docs. I will say more on this in my reply to Hurlbert.

  17. Pingback: The NEJM Issues New Guidelines on Statistical Reporting: Is the ASA P-Value Project Backfiring? (i) | Error Statistics Philosophy

  18. Pingback: Palavering about Palavering about P-values | Error Statistics Philosophy

  19. Pingback: (Excerpts from) ‘P-Value Thresholds: Forfeit at Your Peril’ (free access) – Summer Seminar PhilStat

  20. Pingback: Bad Statistics is Their Product: Fighting Fire With Fire | Error Statistics Philosophy

  21. Pingback: 5 September, 2018 (w/updates) RSS 2018 – Significance Tests: Rethinking the Controversy | Error Statistics Philosophy

  22. Pingback: Schachtman Law » A Proclamation from the Task Force on Statistical Significance

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it so that the thread remains comprehensible.
