Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)


The October 2019 issue of the European Journal of Clinical Investigation came out today. It includes the PERSPECTIVE article by Tom Hardwicke and John Ioannidis, an invited editorial by Gelman, and one by me:

Petitions in scientific argumentation: Dissecting the request to retire statistical significance, by Tom Hardwicke and John Ioannidis

When we make recommendations for scientific practice, we are (at best) acting as social scientists, by Andrew Gelman

P-value thresholds: Forfeit at your peril, by Deborah Mayo

I blogged excerpts from my preprint, and some related posts, here.

All agree that there is disagreement on the statistical and metastatistical issues:

  • “Very different views have been expressed, and consensus is distinctly lacking among experts (e.g., see the 21 heterogeneous commentaries accompanying the American Statistical Association’s 2016 Statement on P‐Values [ASA I])” Hardwicke and Ioannidis.
  • “The foundations and the practice of statistics are in turmoil, with corresponding threads of argument in biology, economics, political science, psychology, public health, and other fields that rely on quantitative research in the presence of variation and uncertainty. Lots of people (myself included) have strong opinions on what should not be done, without always being clear on the best remedy” Gelman.
  • “The 43 papers in the special issue ‘Moving to a world beyond ‘p < 0.05’’ offer a cacophony of competing reforms” Mayo.

Despite the admitted disparate views, ASA representatives come out, in 2019, forcefully on the side of: Don’t use P-value thresholds (“at all”) in interpreting data, and Never describe results as attaining “statistical significance at level p”. Should the ASA, as an umbrella group, be striving to provide a relatively neutral forum for open, pressure-free discussion of different methods–their pros and cons? This is a leading question, true. As an outsider, I’m interested to know what both insiders and outsiders think.[i]

  • [i] It’s hard to imagine the American Philosophical Association coming out with a recommendation against one way of doing philosophy, but of course the situation with statistics is very different. (I do recall a push for “pluralism” in philosophy, which has taken on many meanings, and which I’m not up on.)

Links to ASA I and ASA II:

Wasserstein, R. and N. Lazar (2016). ASA Statement on P-Values and Statistical Significance (ASA I).

Wasserstein, R., Schirm, A. and N. Lazar (2019). “Moving to a World Beyond ‘p < 0.05’” (ASA II).

Categories: ASA Guide to P-values, P-values, stat wars and their casualties | 16 Comments


16 thoughts on “Hardwicke and Ioannidis, Gelman, and Mayo: P-values: Petitions, Practice, and Perils (and a question for readers)”

  1. I feel a bit guilty for being the last to get my editorial in. That’s because it came at the start of our summer Seminar in Phil Stat, and it would be over 2 weeks before I could even really start. I gathered my comments would be controversial, so I reread everything. I am very grateful to Ioannidis for his encouragement and enthusiasm. I’m also very grateful to those I acknowledge: Hand, Schachtman, and Spanos.

  2. We get divergent results when our models are not adequate to the complexity of the phenomena. I don’t see much progress ahead so long as diverse research isms continue to cling to their pet simplicities.

  3. Christian Hennig

    Despite my general p-value defending stance, I do think that “evidence language” is better (in the sense of having less potential for misunderstanding) than “significance language”. “The data show strong evidence against the null” says so much more clearly what we’ve learned than “deviation from the null is significant at level 1%”; also, we can talk about weak/some/strong/very strong evidence, getting away from oversimplifying binary thinking.

    • Christian: You’d need some thresholds beyond which to declare strong evidence, and this would result in all the worst problems, as a particular small p-value won’t always indicate the same degree of strength. In fact, the best thing about measuring statistical distance with statistical significance levels is that their meaning is fixed; they can then be interpreted according to discrepancies indicated, sample size, and assumptions.
      Moreover, there are different accounts of “evidence”. Some will say there’s evidence even when biasing selection effects would bar a valid p-value report. They (e.g., likelihoodists) might say they take such things (selection effects) into account at a later stage, for “action” or the like. But then illicit p-values could count as evidence, because some accounts don’t evaluate error probabilities in determining evidence. No, that would be a disaster.

      • Christian Hennig

        Fair enough. I agree thresholds are needed, as language for interpretation is necessarily discrete. I don’t disagree with anything you wrote there, I just say that if I use “evidence language”, chances are the person I’m talking to will have a clearer idea about the result than if I use “significance language” (there’s quite a bit of experience with students, clients, non-statistician collaborators behind this).

        • Christian: Note what I added to my reply as regards different notions of evidence. It’s precisely that p-values are their own special scale that enables interpreting them for the case at hand.

          • Christian Hennig

            This current discussion at Andrew’s may be related:
            https://statmodeling.stat.columbia.edu/2019/09/24/chow-and-greenland-unconditional-interpretations-of-statistics/
            My interpretation of evidence is what Chow and Greenland refer to as “unconditional”. When I say “evidence against the null” I don’t mean this conditionally on certain assumptions, but against the null model, all assumptions included (particularly meaning that the null model may be violated in ways other than what is of primary interest to the scientist – obviously one could run misspecification tests to detect some of that).

            You’re right that misunderstandings are also possible with this use of language. At the moment we leave pure mathematics, there are ambiguities and more or less subtle different uses of the same word. I stand by my preference for “evidence language” mainly because I have seen much more misuse of “significance language”.

  4. Christian: But you haven’t said how you’d recommend translating. In SIST, I first consider low P-values as “indicating” a discrepancy. Risks from violated assumptions & selection effects have to be checked before we infer evidence. That is the task of auditing.

  5. Evidence sounds like a positive thing, but what we have is the absence of an explanation. We are faced with a situation where the null hypothesis provides a very poor explanation for the observed event, meaning it could have happened on that hypothesis, but only with very low probability. So we’re advised to seek another explanation for what we observed.

    • The point of testing for statistical significance is merely to identify a difference not readily due to random or chance variability alone. One can then estimate the indicated magnitude. However, arriving at an “explanation” would generally require more.
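      The two-step reading described here (first ask whether the difference is readily due to chance variability, then estimate the indicated magnitude) can be sketched in code. This is only an illustration, not anything from the discussion above: the simulated data, group sizes, and the choice of a permutation test are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative simulated data: a treatment group shifted by 0.5 vs control
control = rng.normal(0.0, 1.0, size=50)
treated = rng.normal(0.5, 1.0, size=50)

# Step 1: is the observed difference readily due to chance variability alone?
# A permutation test asks how often random relabelling of the two groups
# yields a difference at least as large as the one observed.
observed = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])
n = len(treated)
perm_diffs = np.empty(10_000)
for i in range(perm_diffs.size):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))

# Step 2: estimate the indicated magnitude, rather than stopping
# at a significant/non-significant verdict
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / len(control))
print(f"p = {p_value:.4f}, estimated difference = {observed:.2f} (SE {se:.2f})")
```

      Note that the p-value answers only the "readily due to chance?" question; the magnitude estimate and its standard error do the separate, second-step work.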

      • Yes, once the null hypothesis is put in doubt the field is open for alternative explanations of the observed result. A mistake often made is to take the apparent lawfulness as self-explanatory.

  6. Brad Efron

    Mayo, thanks for those articles. I’ve been appalled by the “dump p-values” movement, for just the reasons you say. If it happens, which I don’t think it will, there’ll be less, not more, reproducibility. Brad

    • Brad:
      Thanks so much for your comment. I think it’s important for people to know that many, many statisticians disagree w/ the “dump p-values” movement–not just with their recommendations not to say “significance”, but the entire way in which this movement is being conducted. Maybe it’s just another example of the weird times we’re living through, but the damage is real.

    • Sander Greenland

      Brad, there’s a “dump p-values” movement which is completely disconnected from the ASA recs and what colleagues and I have been writing, which is “keep P-values but stop calling them significance levels; stop dichotomizing P-values, present them in full and let the reader apply their own cutoffs if they want to interpret them that way (as both Cox and Lehmann advised in their books); and stop calling intervals obtainable by inverting P-values ‘confidence intervals'” (which Bowley labeled a confidence trick back in 1934). Some people seem to leave everyone confused about the profound differences in these movements, which I find especially aggravating after having spent so much time defending P-values from those who ban them or want to.
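      The recommendation quoted here (report P-values in full; obtain intervals by inverting the test rather than treating "confidence" as a separate construct) can be sketched as follows. The normal approximation, the simulated data, and the helper function name are all assumptions of the sketch, not anything from the comment above.

```python
import math
import numpy as np

def p_value_function(data, mu0):
    """Two-sided p-value for H0: mean == mu0, via a normal approximation.
    (A t reference distribution would be more exact; this keeps to stdlib.)"""
    se = np.std(data, ddof=1) / math.sqrt(len(data))
    z = abs(np.mean(data) - mu0) / se
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=40)  # illustrative data

# Present the P-value in full for the null of interest, not a cutoff verdict
print(f"p(mu0 = 10) = {p_value_function(data, 10.0):.3f}")

# Obtain a 95% interval by inverting the test: the set of mu0 values
# that the data would NOT reject at the 0.05 level.
grid = np.linspace(8.0, 12.0, 2001)
kept = [m for m in grid if p_value_function(data, m) >= 0.05]
print(f"95% interval by test inversion: [{min(kept):.2f}, {max(kept):.2f}]")
```

      Readers who want a cutoff can still apply their own to the reported p-value; the interval falls out of the same p-value function rather than being a separate object.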

      • Sander:
        Nothing in my editorial speaks of a “dump p-values” movement, but I can’t think of an enterprise that is doing more to dump P-values than ASA II. I realize that ASA II quotes you (personal communication, January 25, 2019): “’The core human and systemic problems are not addressed by shifting blame to p-values and pushing alternatives as magic cures—especially alternatives that have been subject to little or no comparative evaluation in either classrooms or practice,’ Greenland said.” It’s too bad that ASA II seems unconcerned with the need for comparative evaluation, and didn’t temper its position in light of your remark. Nor are they led to anything but the most stilted and ungenerous interpretations and uses of tests.

  7. rkenett

    At the risk of being repetitive, below are my two cents:
    1. We need to be constructive
    The ASA-backed statements (I+II) appear destructive. They seem driven by a disconnect from practitioners. Glad to see Brad Efron’s realistic perspective.
    2. Generalisation of findings
    This is an untapped direction. Machine learners use it by comparing overfitted models to validation sets. Statisticians can do more. This, however, requires the ability to combine qualitative (conceptual) and quantitative thinking – not trivial for either statisticians or philosophers. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070

I welcome constructive comments that are of relevance to the post and the discussion, and discourage detours into irrelevant topics, however interesting, or unconstructive declarations that "you (or they) are just all wrong". If you want to correct or remove a comment, send me an e-mail. If readers have already replied to the comment, you may be asked to replace it to retain comprehension.
