This is an interesting post but the author’s usage of Lindley’s paradox seems to be unrelated to the Lindley’s paradox I’m familiar with:
> If we raise the power even further, we get to “Lindley’s paradox”, the fact that p-values in this bin can be less likely than they are under the null.
Lindley’s paradox as I know it (and as described by Wikipedia [1]) is about the potential for arbitrarily large disagreements between frequentist and Bayesian analyses of the same data. In particular, you can have an arbitrarily small p-value (p < epsilon) from the frequentist analysis while at the same time having arbitrarily large posterior probabilities for the null hypothesis model (P(M_0|X) > 1-epsilon) from the Bayesian analysis of the same data, without any particularly funky priors or anything like that.
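To make the disagreement concrete, here is a minimal numerical sketch (the numbers are illustrative choices of mine, not from the post): a two-sided z-test with a huge sample, comparing a point null against a N(0, 1) prior on the effect under the alternative, with 50/50 prior odds on the two models.

```python
import math

def norm_pdf(x, mu, sd):
    """Normal density at x with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def norm_sf(z):
    """Upper tail P(Z > z) of the standard normal, via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, sigma, tau = 10**6, 1.0, 1.0    # sample size, data sd, prior sd under H1
se = sigma / math.sqrt(n)          # standard error of the sample mean
z = 2.5                            # observed z-statistic
xbar = z * se                      # the sample mean that produced it

p_value = 2 * norm_sf(z)           # two-sided frequentist p-value

# Marginal likelihood of xbar under each model:
# H0 is the point null, H1 integrates over the N(0, tau^2) prior on the mean.
m0 = norm_pdf(xbar, 0.0, se)
m1 = norm_pdf(xbar, 0.0, math.sqrt(tau**2 + se**2))
bf01 = m0 / m1                     # Bayes factor in favor of the null
post_h0 = bf01 / (1 + bf01)        # posterior P(H0 | data) at 50/50 prior odds

print(p_value)   # ~0.0124: "significant" at the 5% level
print(post_h0)   # ~0.98: yet the posterior strongly favors the null
```

The same z = 2.5 that rejects at the 5% level corresponds to a tiny deviation in absolute terms once n is huge, so the point null explains the data far better than the diffuse alternative.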
I don’t see any relationship to the phenomenon given the name of Lindley’s paradox in the blog post.
Ultimately I think the paradox comes from mixing two paradigms that aren't really designed to be mixed.
That said, you can give a Bayesian argument for p-circling provided you have a prior on the power of the test. The details are nearly impossible to work out except by case-by-case calculation because, unless I'm mistaken, the shape of the p-value distribution when the null hypothesis does not hold is very ill-defined.
However, it's quite possible to give examples where, intuitively, a p-value of just below 0.05 would be highly suspicious. You just need to combine high-powered tests with borderline results. Say, for example, you're testing the existence of gravity with various objects and you get a p-value of about 0.04 against the null hypothesis that objects just stay in the air indefinitely.
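To put numbers on that intuition (my own illustrative setup, not from the thread): for a two-sided z-test, the bin p ∈ (0.04, 0.05) has probability 0.01 under the null by construction, but under a well-powered alternative it can be far less likely, so landing there actually counts as evidence for the null.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Two-sided z-test: p falls in (0.04, 0.05) exactly when |z| lies in
# (1.9600, 2.0537), the 0.975 and 0.98 standard-normal quantiles.
z_lo, z_hi = 1.9600, 2.0537

# Under H0, z ~ N(0, 1): by construction this band has probability ~0.01.
p_band_h0 = 2 * (norm_cdf(z_hi) - norm_cdf(z_lo))

# Under a high-powered alternative, z ~ N(delta, 1) with delta = 5
# (power against the 5% threshold is essentially 1).
delta = 5.0
p_band_h1 = (norm_cdf(z_hi - delta) - norm_cdf(z_lo - delta)) \
          + (norm_cdf(-z_lo - delta) - norm_cdf(-z_hi - delta))

print(p_band_h0)              # ~0.010
print(p_band_h1)              # ~0.0004
print(p_band_h0 / p_band_h1)  # ~24: the band is more likely under the null
```

With power this high, a "significant" result should usually come with a far smaller p-value, which is exactly why p just under 0.05 looks suspicious here.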
> One could specify a smallest effect size of interest and compare the plausibility of seeing the reported p-value under that distribution compared to the null distribution. Maier and Lakens (2022) suggest you could do this exercise when planning a test in order to justify your choice of alpha-level
Huh, I’d never thought to do that before. You pretty much have to choose a smallest effect size of interest in order to do a power analysis in the first place, to figure out how many samples to collect, so it's a neat next step to then base the significance level on it as well.
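As a sketch of how that exercise could look (all numbers here are my own hypothetical choices): pick a smallest effect size of interest, size the study from it, then ask how much more likely a given p-value is under that effect than under the null. For a one-sided z-test the observed p has density phi(z_p - mu) / phi(z_p) under the alternative, relative to the uniform null density, where z_p = Phi^-1(1 - p) and mu is the noncentrality at the smallest effect of interest.

```python
import math
from statistics import NormalDist

nd = NormalDist()  # standard normal

# Hypothetical planning numbers: one-sided z-test, smallest effect size
# of interest d = 0.3 (in sd units), alpha = 0.05, desired power = 0.80.
alpha, power, d = 0.05, 0.80, 0.3
z_a, z_b = nd.inv_cdf(1 - alpha), nd.inv_cdf(power)

# Standard sample-size formula for a one-sample, one-sided z-test.
n = math.ceil(((z_a + z_b) / d) ** 2)

def p_likelihood_ratio(p, mu):
    """Density of observing p under the alternative (noncentrality mu)
    relative to its uniform density under the null."""
    z_p = nd.inv_cdf(1 - p)
    return nd.pdf(z_p - mu) / nd.pdf(z_p)

mu = d * math.sqrt(n)                 # noncentrality at the SESOI
print(n)                              # required sample size: 69
print(p_likelihood_ratio(0.049, mu))  # ~2.8: weak evidence for the effect
print(p_likelihood_ratio(0.001, mu))  # ~99: much stronger evidence
```

The contrast between the two ratios is the Maier-and-Lakens-style point: at this power, p = 0.049 is only a few times more likely under the smallest interesting effect than under the null, which you could use to argue for a stricter alpha.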
In a perfect world everybody would be putting careful thought into their desired (acceptable) type I and type II error rates as part of the experimental design process before they ever collected any data.
Given rampant incentive misalignments (the goal in academic research is often to publish something as much as—or more than—to discover truth), having fixed significance levels as standards across whole fields may be superior in practice.
[1] https://en.wikipedia.org/wiki/Lindley%27s_paradox