5.1. General observations
Suppose X1, . . . , Xn are i.i.d. N(θ, σ²), σ² known. The Jeffreys–Lindley paradox shows that for inference about θ, P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions. Once again, as in Section 2, the evidence against H0 contained in P-values seems unrealistically high. We argue in this section that part of this conflict arises from the fact that different types of asymptotics are being used for the Bayes factors and the P-values. We begin with a quick review of the two relevant asymptotic frameworks in classical statistics for testing a sharp null hypothesis.
The standard asymptotics of classical statistics is based on what are called Pitman alternatives, namely, θn = θ0 + d/√n, at 1/√n-distance from the null. The Pitman alternatives are also called contiguous in the very general asymptotic theory developed by Le Cam (vide Roussas, 1972; Le Cam and Yang, 2000; Hájek and Šidák, 1967). The log-likelihood ratio of a contiguous alternative with respect to the null is stochastically bounded as n → ∞. On the other hand, for a fixed alternative, the log-likelihood ratio tends to −∞ (under the null) or ∞ (under the fixed alternative). If the probability of Type I error is 0 < α < 1, then the behavior of the likelihood ratio has the following implication: the probability of Type II error will converge to some 0 < β < 1 under a contiguous alternative θn, and to zero if θ is a fixed alternative. This means the fixed alternatives are relatively easy to detect. So in this framework it is assumed that the alternatives of importance are the contiguous alternatives. Let us call this theory Pitman type asymptotics.
There are several other frameworks in classical statistics, of which Bahadur's (Bahadur, 1971; Serfling, 1980, pp. 332–341) has been studied most. We focus on Bahadur's approach, but the one due to Rubin and Sethuraman (1965), based on moderate deviations, seems also relevant. In Bahadur's theory, the alternatives of importance are fixed and do not depend on n. Given a test statistic, Bahadur evaluates its performance at a fixed alternative by the limit (in probability or a.s.) of (1/n) log(P-value) when the alternative is true. Bahadur's exact slope is the negative of twice this limit. An equivalent description of the limiting value is to fix the probability of Type II error at the fixed alternative and calculate the limit of (1/n) log(probability of Type I error). This limit and the above limit of (1/n) log(P-value) coincide. Thus the limit measures how significant the test is, on average, under the alternative. It also measures how fast the probability of Type I error tends to zero. It may seem a little odd that the probability of Type II error is fixed rather than that of Type I error. But if the probability of Type I error is fixed and the other error at a fixed alternative is studied, one runs into a somewhat anomalous situation where the probability of Type II error gets much smaller than the probability of Type I error even though Type I error is supposed to be more important. Thus fixing α (at any conventional value) and letting n → ∞ violates the usual convention of treating Type I error as the more important of the two. One may also ask what value of α one should choose. Unfortunately, Bahadur's theory does not answer this question, but it does suggest α should tend to zero very fast and not be fixed at the conventional values.
Which of these two asymptotics is appropriate in a given situation should depend on how the sample size is chosen. If the sample size is chosen, as is often the case, such that the probability of Type I error is α and the probability of Type II error at a given alternative θ is β, where 0 < α, β < 1 are preassigned, then it is easy to verify that |θ − θ0| ∼ 1/√n, i.e., θ is essentially a contiguous alternative. On the other hand, if there is a fixed alternative θ of importance where protection against Type II error is desired, and n is chosen to be very large, much larger than the sample size in the previous case, θ is not a contiguous alternative. In this framework, fixing α at a conventional value of 0.05 or 0.01 would clearly lead to a β smaller than α. In this kind of problem, the limiting P-value of Bahadur is a good measure of performance.
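The 1/√n scaling can be checked numerically. The following Python sketch (our illustration; the function name is ours) computes, for a one-sided z-test of H0: θ = 0 with σ = 1, the alternative at which a level-α test attains power 1 − β:

```python
from math import sqrt
from statistics import NormalDist

# For a one-sided z-test of H0: theta = 0 with known sigma = 1, the sample
# size n achieving Type I error alpha and Type II error beta at alternative
# theta satisfies sqrt(n) * theta = z_{1-alpha} + z_{1-beta}.
z = NormalDist().inv_cdf

def detectable_alternative(n, alpha=0.05, beta=0.10):
    """Alternative at which the level-alpha one-sided z-test has power 1 - beta."""
    return (z(1 - alpha) + z(1 - beta)) / sqrt(n)

for n in (25, 100, 400, 1600):
    theta = detectable_alternative(n)
    # sqrt(n) * theta stays constant, so theta ~ 1/sqrt(n): a contiguous alternative.
    print(n, round(theta, 4), round(sqrt(n) * theta, 4))
```

Here √n·θ stays at z_{0.95} + z_{0.90} ≈ 2.93 for every n, so the alternative pinned down by preassigned (α, β) is indeed at 1/√n-distance from the null.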
5.2. Comparison of decisions via P-values and Bayes factors in Bahadur's asymptotics

In this and the next subsection, essentially Bahadur's approach is followed both for P-values and Bayes factors. In Subsection 5.4, a Pitman type asymptotics is used for both.
We first show that if the P-value is sufficiently small, as small as it typically is in Bahadur's theory, BF01 will tend to zero, calling for rejection of H0, i.e., the evidence in the P-value points in the same direction as that in the Bayes factor or posterior probability, removing the sense of paradox in the result of Jeffreys and Lindley. One could, therefore, argue that the P-values or the significance level α assumed in Section 4 are not small enough. The asymptotic framework chosen is not appropriate when contiguous alternatives are not singled out as alternatives of importance.
We now verify the claim about the limit of BF01. Without loss of generality, take θ0 = 0, σ² = 1. First note that

(33) log BF01 = −(n/2) X̄² + (1/2) log n + Rn,

where

Rn = −log π(X̄ | H1) − (1/2) log(2π) + o(1),

provided π(θ | H1) is a continuous function of θ and is positive at all θ. If we omit Rn from the right-hand side of (33), we have Schwarz's (1978) approximation to the Bayes factor via BIC.
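The claim that Rn is O(1) can be checked directly in a case where BF01 has a closed form; the following Python sketch (ours) takes π(θ | H1) to be N(0, 1) and compares the exact log BF01 with the BIC part of (33):

```python
from math import log, pi

# Exact Bayes factor BF01 for H0: theta = 0 vs H1 with prior N(0, 1),
# based on the sample mean xbar ~ N(theta, 1/n).  We verify that
# log BF01 = -(n/2) xbar^2 + (1/2) log n + R_n, where R_n approaches
# -log pi(xbar | H1) - (1/2) log(2*pi) = xbar^2 / 2 for this prior.
def log_bf01_exact(xbar, n):
    # log N(xbar; 0, 1/n) density minus log of the marginal N(xbar; 0, 1 + 1/n)
    log_m0 = -0.5 * log(2 * pi / n) - 0.5 * n * xbar**2
    v1 = 1 + 1 / n
    log_m1 = -0.5 * log(2 * pi * v1) - 0.5 * xbar**2 / v1
    return log_m0 - log_m1

xbar, n = 0.3, 10_000
bic_part = -0.5 * n * xbar**2 + 0.5 * log(n)
r_n = log_bf01_exact(xbar, n) - bic_part
print(r_n, xbar**2 / 2)  # the two values nearly coincide for large n
```

The remainder Rn stays bounded while the BIC part dominates, which is exactly Schwarz's point.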
The logarithm of the P-value corresponding to an observed X̄ is

(34) log p = log[2(1 − Φ(√n |X̄|))] = −(n/2) X̄² (1 + o(1))

by the standard approximation to a normal tail (vide Feller, 1973, p. 175, or Bahadur, 1971, p. 1). Thus (1/n) log p → −θ²/2 and, by (33), log BF01 → −∞. This result is true as long as |X̄| > c(log n/n)^{1/2}, c > √2. Such deviations are called moderate deviations, vide Rubin and Sethuraman (1965). Of course, even for such P-values, p ∼ BF01/n, so that P-values are smaller by an order of magnitude. The conflict in measuring evidence remains but the decisions are the same.
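These limits are easy to see numerically. The sketch below (our illustration) takes the fixed alternative θ = 0.5 and evaluates both measures at X̄ = θ, the typical value of X̄ under the alternative:

```python
from math import erfc, log, sqrt

# Fixed alternative theta = 0.5.  The two-sided P-value of the z-test at
# xbar = theta is 2(1 - Phi(sqrt(n) |xbar|)) = erfc(sqrt(n)|xbar|/sqrt(2)),
# and log BF01 is approximated by BIC as in (33) without R_n.
theta = 0.5
for n in (100, 400, 1600):
    log_p = log(erfc(sqrt(n) * theta / sqrt(2)))
    log_bf = -0.5 * n * theta**2 + 0.5 * log(n)   # Schwarz/BIC approximation
    # (1/n) log p approaches -theta^2/2 = -0.125; both log p and log BF01
    # tend to -infinity, so the decisions agree, but p is smaller than BF01
    # by roughly a factor n.
    print(n, round(log_p / n, 4), round(log_bf, 1), round(log_p - log_bf, 1))
```

Both measures call for rejection, while the gap log p − log BF01 grows like −log n, matching p ∼ BF01/n.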
In the next subsection we pursue the comparison of the three measures of evidence based on the likelihood ratio λ, the P-value based on the likelihood ratio test, and the Bayes factor BF01 under general regularity conditions.
5.3. Comparison of P-value with likelihood ratio and Bayes factor in Bahadur’s asymptotics
In this subsection, we consider i.i.d. observations with a density satisfying certain regularity conditions (vide Woodroofe, 1978).
In Bahadur's theory, the smaller the limit of the P-values (in the appropriate scale mentioned above), the better the test statistic. Bahadur and Raghavachari (1972) showed the likelihood ratio test is optimal under regularity conditions, i.e., the smallest limits of P-values are obtained when likelihood ratio test statistics are used. So we focus on such tests. We need the following large deviation result for the likelihood ratio test statistic (vide Woodroofe, 1978), which is valid under appropriate regularity conditions:
(35) PH0(−log λ ⩾ nε) ∼ (nε)^{k/2−1} e^{−nε}/Γ(k/2)

for ε = εn → 0 with nεn → ∞ as n → ∞, where k is the difference between the numbers of parameters under H1 and H0. To understand such results heuristically, notice that −2 log λ is asymptotically a χ² with k degrees of freedom. By a simple integration by parts (as for the normal tail),

P((1/2)χ²k > nε) ∼ (nε)^{k/2−1} e^{−nε}/Γ(k/2).
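For k = 4 the χ² tail is available in closed form, P(χ²₄ > t) = e^{−t/2}(1 + t/2), so the tail approximation can be checked directly (a numerical sketch of ours):

```python
from math import exp

# Check P((1/2) chi2_k > n*eps) ~ (n*eps)^{k/2-1} e^{-n*eps} / Gamma(k/2)
# for k = 4, where the exact tail is P(chi2_4 > t) = e^{-t/2} (1 + t/2)
# and n*eps corresponds to t/2.
k = 4
for t in (20.0, 50.0, 100.0):
    exact = exp(-t / 2) * (1 + t / 2)
    approx = (t / 2) * exp(-t / 2)   # Gamma(k/2) = Gamma(2) = 1
    print(t, exact / approx)         # ratio tends to 1 as t grows
```

The ratio approaches 1 in the moderate/large deviation range nε → ∞, as (35) asserts.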
Eq. (35) can be interpreted as p = exp(log λ) n^{k/2−1} O(1) if (log λ)/n is at most O(1). So

log p = log λ + (k/2 − 1) log n + O(1).
Using Schwarz's approximation by BIC (Schwarz, 1978; Ghosh and Ramamoorthi, 2003, Chapter 1), the Bayes factor can be expressed as

(36) log BF01 = log λ + (k/2) log n + O(1).
From (35) and (36), it follows that when the value of the test statistic −2 log λ is large, then of the three measures of evidence against H0, λ is smaller (larger) than the P-value for k > 2 (k < 2), and the P-value in turn is smaller than BF01, since −log λ can be at most O(n) whether the null or any (fixed) alternative is true. Incidentally, while the alternative hypothesis has a bearing on both λ and BF01, λ is obtained by maximizing with respect to θ whereas BF01 is obtained by integration.
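The ordering can be read directly off the two displayed approximations; the following sketch (ours, with the O(1) terms dropped) makes it concrete:

```python
from math import log

# Compare the three measures on the log scale using
#   log p    = log(lambda) + (k/2 - 1) log n + O(1)
#   log BF01 = log(lambda) + (k/2)     log n + O(1),
# ignoring the O(1) terms.
def measures(log_lam, k, n):
    log_p = log_lam + (k / 2 - 1) * log(n)
    log_bf = log_lam + (k / 2) * log(n)
    return log_lam, log_p, log_bf

n, log_lam = 10_000, -50.0
for k in (1, 4):
    ll, lp, lb = measures(log_lam, k, n)
    # For k > 2: lambda < p < BF01; for k < 2 the first inequality flips.
    print(k, ll, round(lp, 1), round(lb, 1))
```

The P-value is always smaller than BF01 (by the factor n in (36) versus the displayed log p), while its position relative to λ depends on the sign of k/2 − 1.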
We explore below whether any simple reasonable adjustment of λ or p can reduce the conflict with BF01. The mathematical similarity of these three quantities is quite striking, especially because they are conceptually so different.
One reason the P-value tends to be small is that, as a tail probability, it depends on deviations from H0 which are larger than those observed. It is tempting to think this can be corrected by using instead the density of what is observed. Since −2 log λ is a χ² variable with k degrees of freedom (approximately under regularity conditions, exactly under the normality assumption), the density agrees with the P-value up to O(1), so nothing is gained. The density may also be unsatisfactory since, unlike the P-value, it is not invariant under smooth one-to-one transformations.
A way out is to calculate the probability that −log λ lies between nε and n(ε + dε). This would agree with BF01 up to O(1). A more fruitful approach is to replace log λ by the penalized log-likelihood ratio, namely, Schwarz's BIC difference (∆BIC), given by

∆BIC = log λ + (k/2) log n.

As pointed out by Schwarz (1978), ∆BIC agrees with log BF01 up to O(1) but does not depend on the prior. For the testing problem considered in detail in Section 2, ∆BIC tends to be bigger than the logarithm of the calibrated P-value for n not too small, say, n > 20. The calibrated P-value is a sort of nonparametric lower bound to a Bayes factor, while ∆BIC may be bigger or smaller than the logarithm of a Bayes factor depending on the prior, the difference being negligible and of order o(1) if the Jeffreys prior is used as the conditional prior given the alternative hypothesis.
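Taking the calibrated P-value to be −e p log p, as in Sellke et al. (2001), the comparison at the p = 0.05 boundary for the normal problem (k = 1) can be sketched as follows (our illustration):

```python
from math import e, log

# Normal testing problem of Section 2, data held at the 0.05 significance
# boundary: sqrt(n) |xbar| = 1.96, so log lambda = -(n/2) xbar^2 = -1.96^2/2
# for every n.  Compare Delta-BIC with log(-e p log p), the Sellke et al.
# (2001) calibration of the P-value.
p = 0.05
log_lam = -0.5 * 1.96**2
log_calibrated = log(-e * p * log(p))    # log of the calibrated P-value
for n in (20, 100, 1000):
    delta_bic = log_lam + 0.5 * log(n)   # k = 1
    print(n, round(delta_bic, 3), round(log_calibrated, 3))
```

With p held fixed, ∆BIC grows like (1/2) log n and already exceeds the (constant) log calibrated P-value at n = 20, consistent with the claim above.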
5.4. Pitman alternative and rescaled priors
We consider once again the problem of testing H0: θ = 0 versus H1: θ ≠ 0 on the basis of a random sample from N(θ, 1). Suppose that the Pitman alternatives are the most important ones and the prior π(θ | H1) puts most of the mass on Pitman alternatives. One such prior is N(0, δ/n). Then

BF01 = √(δ + 1) exp(−(n/2) (δ/(δ + 1)) X̄²).

If the P-value is close to zero, √n |X̄| is large and therefore BF01 is also close to zero, i.e., for these priors there is no paradox. All three measures are of the same order, but the result of Berger and Sellke (1987) for symmetric unimodal priors still implies that the P-value is smaller than the Bayes factor. In particular, the comments of Sellke et al. (2001, Section 2) on the weak evidential value of P = 0.05 are still true.
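The absence of a paradox is easy to see numerically: with the rescaled prior, BF01 depends on the data only through z = √n |X̄|, so holding the P-value fixed also fixes BF01 for every n (sketch ours):

```python
from math import exp, sqrt

# BF01 for the rescaled prior N(0, delta/n) from the display above:
# BF01 = sqrt(delta + 1) * exp(-(n/2) * (delta/(delta+1)) * xbar^2),
# a function of the data only through z = sqrt(n) * |xbar|.
def bf01(z, delta):
    """z = sqrt(n) * |xbar|, the usual z-statistic."""
    return sqrt(delta + 1) * exp(-0.5 * (delta / (delta + 1)) * z**2)

# Hold the two-sided P-value at 0.05 (z = 1.96): BF01 is the same for
# every n, and for moderate delta it stays well above 0.05.
for delta in (1.0, 4.0, 10.0):
    print(delta, round(bf01(1.96, delta), 3))
```

BF01 no longer blows up with n at fixed p, yet it remains an order of magnitude larger than p = 0.05, in line with the Berger and Sellke (1987) bound.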
The points made above are quite different from those in Andrews (1994). Andrews (1994) essentially shows that for priors of this kind and for fairly general parametric families satisfying the usual regularity conditions, the Bayesian criterion is approximately a monotone function (with monotonicity in the right direction) of the classical test statistics. In the present case, it is clear from the above expression for BF01 that the monotonicity is exact. Such monotonicity does not establish that the scales of evidence corresponding to standard P-values suggested by Fisher and those suggested by Jeffreys are very similar, as claimed in this subsection. It may be pointed out that exact monotonicity would hold for several standard cases.