Objective Bayesian model selection methods

Part of the document: Handbook of Statistics Vol 25 Supp 1 (pages 137-159)

In this section, we introduce the eight default Bayesian approaches to model selection that will be considered in this chapter: the Well Calibrated Priors (WCP) approach, the Conventional Prior (CP) approach, the Intrinsic Bayes Factor (IBF) approach, the Intrinsic Prior (IPR) approach, the Expected Posterior Prior (EP) approach, the Fractional Bayes Factor (FBF) approach, asymptotic methods and the Bayesian Information Criterion (BIC), and Lower Bounds on Bayes Factors and Posterior Probabilities (LB). These approaches will be illustrated by examples.

At first sight, eight methods might appear to be an explosion of methodology. There are, however, deep connections among the methods. For instance, in the exposition the concept of a minimal training sample will be a building block for several of the methods.

We introduce the simplest version of the concept here.

For the q models M_1, ..., M_q, suppose that (ordinary, usually improper) noninformative priors π_i^N(θ_i), i = 1, ..., q, are available. Define the corresponding marginal or predictive densities of X,

m_i^N(x) = ∫ f_i(x | θ_i) π_i^N(θ_i) dθ_i.

Because we will consider several training samples, we index them byl.

DEFINITION 1. A (deterministic) training sample, x(l), is a subset of the sample x which is called proper if 0 < m_i^N(x(l)) < ∞ for all M_i, and minimal if it is proper and no subset is proper. Minimal Training Samples will be denoted MTS.

More general versions of training samples will be introduced in the sequel. This definition is the original one in (BP96a). Real and “imaginary” training samples date back at least to Lempers (1971) and Good (1950) and references therein. In the frequentist literature, the different methods can be written as:

Likelihood Ratio × Correction Factor.

A Bayesian version for grouping the different methods is:

Un-normalized Bayes Factor × Correction Factor, or in symbols,

Bayes Factor_ij = B_ij = (m_i^N(x) / m_j^N(x)) × CF_ji = B_ij^N × CF_ji.

Several of the methods described in this article can be written in this form.

How to judge and compare the methods? (BP96a) proposed the following (Bayesian) principle:

PRINCIPLE 1. Testing and model selection methods should correspond, in some sense, to actual Bayes factors, arising from reasonable default prior distributions.

It is natural for a Bayesian to accept that the best discriminator between procedures is the study of the prior distribution (if any) that gives rise to the procedure or that is implied by it. Other properties, like large-sample consistency, are rough as compared with the incisiveness of Principle 1. This is the case particularly in parametric problems, where such properties follow automatically when there is correspondence of a procedure with a real Bayesian one. One of the best ways of studying any biases in a procedure is by examining the corresponding prior for biases. It is of paramount importance that we know how to interpret Bayes factors as posterior-to-prior odds. Indeed, other “Bayesian” measures, different from Bayes Factors and Posterior Probabilities, have been put forward.

But then how to interpret them? One of the main advantages of the “canonical” Bayesian approach based on odds is its natural scientific and probabilistic interpretation. It is so natural that far too often practitioners misinterpret p-values as posterior probabilities.

The main message is: Bayesians have to live with Bayes Factors, so we had better learn how to use this powerful measure of evidence.

2.1. Well calibrated priors approach

When can we compare two models using Bayes factors? The short answer is: when we have (reasonable) proper priors (reasonable excluding, for example, “vague” proper priors or point masses).

There is, however, an important concept that allows one to compute exact Bayes Factors with fully improper or partially improper priors (i.e. priors which integrate to infinity for some parameters but integrate to a finite value with respect to the other parameters). The concept is “well calibrated priors”. An important sub-class of well calibrated priors are priors which are “predictively matched”. Other well calibrated kinds of priors will be introduced in the sequel.

DEFINITION 2. For the models M_i: f_i(y|θ_i) and M_j: f_j(y|θ_j), the priors π_i^N(θ_i) and π_j^N(θ_j) are predictively matched if, for any minimal training sample y(l), the following identity holds:

(6)  m_i^N(y(l)) = ∫ f_i(y(l) | θ_i) π_i^N(θ_i) dθ_i = m_j^N(y(l)) = ∫ f_j(y(l) | θ_j) π_j^N(θ_j) dθ_j.

If two models are predictively matched, it is perfectly sensible to compare them with Bayes Factors that use the improper priors for which they are predictively matched, since the scaling correction given by training samples (see Eq. (14)) cancels out and the correction becomes unity. Strikingly, there are large classes of well calibrated models and priors. For a general theory of predictively matched priors see Berger et al. (1998). Three substantial particular cases follow, namely location, scale, and location-scale models, from the following property:

PROPERTY 1.

(7)  f(y_l | µ) = f(y_l − µ),  π^N(µ) = 1,  ∫ f(y_l − µ) · 1 dµ = 1,

(8)  f(y_l | σ) = (1/σ) f(y_l/σ),  π^N(σ) = 1/σ,  ∫ f(y_l/σ) · dσ/σ² = c_0 · 1/|y_l|,

and

(9)  f(y_l | µ, σ) = (1/σ) f((y_l − µ)/σ),  π^N(µ, σ) = 1/σ,  ∫∫ f((y_l1 − µ)/σ) · f((y_l2 − µ)/σ) dµ dσ/σ³ = 1 / (2 · |y_l1 − y_l2|),

where the minimal training samples are one data point y_l in the location and scale cases and two distinct observations y_l1 ≠ y_l2 in the location-scale case. Finally, c_0 = ∫_0^∞ f(v) dv.

A direct application of property (9) is the “robustification” of the Normal distribution.

EXAMPLE 3. Let the base model M_0 be Normal but, since the existence of a percentage of outliers is suspected, an alternative Student-t model M_ν, with ν degrees of freedom, is proposed (for example ν = 1, i.e. the Cauchy model). Then the following Bayes Factor of the Normal vs. the Student-t is perfectly legal (Bayesianly) and appropriate:

(10)  B_(0,ν) = [∫∫ f_0(y | µ, σ) dµ dσ/σ] / [∫∫ f_ν(y | µ, σ) dµ dσ/σ].

If we now predict a future observation according to the Bayesian Model Averaging approach,

E(y_f | y) = E(y_f | M_0, y) P(M_0 | y) + E(y_f | M_ν, y) P(M_ν | y),

we have constructed a robustifier against outliers. See Spiegelhalter (1977) and Pericchi and Pérez (1994). It is also a powerful robustifier, switching smoothly from Normal towards Student-t inference as the presence of outliers becomes more pronounced. Other alternative models, e.g. asymmetrical ones, may be added to the location-scale set of candidate models. When a procedure is obtained in a reasonable Bayesian manner (subjective or objective), it is to be expected that the procedure is also efficient in a frequentist sense.

In this case the Bayes Factor above is optimal, a “Most Powerful Invariant Test Statistic” (Cox and Hinkley, 1974).
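As a numerical sketch of (10) with ν = 1 (the Cauchy model), both marginals can be approximated by direct double integration. The data, the finite integration limits, and the scipy routines below are illustrative assumptions, not part of the text:

```python
import numpy as np
from scipy import integrate, stats

def marginal(y, std_pdf):
    """m(y) = ∫∫ Π_i (1/σ) f((y_i − µ)/σ) × (1/σ) dµ dσ, the predictive
    density under the improper location-scale prior π(µ, σ) = 1/σ.
    Finite limits replace (−∞, ∞) × (0, ∞) as a numerical convenience."""
    def integrand(sigma, mu):
        return np.prod(std_pdf((y - mu) / sigma) / sigma) / sigma
    val, _ = integrate.dblquad(integrand, -30, 30,   # mu range
                               1e-6, 30)             # sigma range
    return val

y_clean = np.array([-0.5, 0.2, 0.8, 1.1])   # illustrative data
y_out = np.array([-0.5, 0.2, 0.8, 8.0])     # same data with an outlier

B_clean = marginal(y_clean, stats.norm.pdf) / marginal(y_clean, stats.cauchy.pdf)
B_out = marginal(y_out, stats.norm.pdf) / marginal(y_out, stats.cauchy.pdf)
```

With the illustrative data, the outlier moves the evidence toward the heavier-tailed model, which is exactly the smooth switching behavior described above.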

Do predictively matched priors obey Principle 1? To see that they obey the Principle in a rather strong way, consider formula (14) below. Applying it to a minimal training sample (MTS) results in B_ij^N(x(l)) ≡ 1 for any Minimal Training Sample x(l). As a consequence, the uncorrected B_ij^N(x) arises from any proper prior π(θ_k | x(l)), and such priors are typically sensible, since they are based on the same MTS for both models. Another way to see this is that it is identical to any Intrinsic Bayes Factor; see Section 2.3.

The next approach finds an important justification in the property of “well calibration”.

2.2. Conventional prior approach

It was Jeffreys (1961, Chapter 5) who recognized the problem of arbitrary constants arising in hypothesis testing problems, implied by the use of “Jeffreys' Rule” for choosing objective-invariant priors for estimation problems. For testing problems, then, a convention has to be established. His approach is based on: (i) using noninformative priors only for common parameters in the models, so that the arbitrary multiplicative constant for the priors would cancel in all Bayes factors, and (ii) using default proper priors for orthogonal parameters that would occur in one model but not the other. These priors are neither vague nor over-informative, but correspond to a definite but limited amount of information. He presented arguments justifying certain default proper priors in general, but mostly on a case-by-case basis. This line of development has been successfully followed by many others (for instance, by Zellner and Siow, 1980; see (BP01) for other references). Here I revisit some examples and formally justify the use of the conventional, partially proper priors, based on the property of well calibration and “predictively matched” priors.

EXAMPLE 4 (Normal mean, Jeffreys' conventional prior). Suppose the data is X = (X_1, ..., X_n), where the X_i are i.i.d. N(µ, σ_2²) under M_2. Under M_1, the X_i are N(0, σ_1²). Since the mean and variance are orthogonal in the sense of having a diagonal expected Fisher's information matrix, Jeffreys equated σ_1² = σ_2² = σ². Because of this, Jeffreys suggests that the variances can be assigned the same (improper) noninformative prior π^J(σ) = 1/σ, since the indeterminate multiplicative constant for the prior would cancel in the Bayes factor. (See below for a formal justification.)

Since the unknown mean µ occurs only in M_2, it needs to be assigned a proper prior. Jeffreys comes up with the following desiderata for such a prior, which in retrospect appear compelling: (i) it should be centered at zero (i.e. centered at the null hypothesis); (ii) it should have scale σ (i.e. have the information provided by one observation); (iii) it should be symmetric around zero; and (iv) it should have no moments. He then settles for the Cauchy prior Cauchy(0, σ_2²) as the simplest distribution that obeys the desiderata. In formulae, Jeffreys's conventional prior for this problem is:

(11)  π_1^J(σ_1) = 1/σ_1,  π_2^J(µ, σ_2) = (1/σ_2) · 1 / (π σ_2 (1 + µ²/σ_2²)).

This solution is justified as a Bayesian prior, as the following property shows.

PROPERTY 2. The priors (11) are predictively matched.

The priors in (11) are improper, but one data point will make them proper. Denote by y_l such a generic data point. It is clear that for M_1,

∫ f(y_l/σ) dσ/σ² = 1 / (2|y_l|),

using identity (8). For M_2, we use the fact that a Cauchy is a scale mixture of Normals, with a Gamma mixing distribution with parameters (1/2, 2): Ca(µ | 0, r²) = ∫ N(µ | 0, r²/λ) Ga(λ | 1/2, 2) dλ. Thus,

m_2(y_l) = ∫∫ N(y_l | µ, σ²) Ca(µ | 0, σ²) dµ dσ/σ

(expressing the Cauchy as a scale mixture and interchanging the orders of integration)

= ∫ [ ∫ N(y_l | 0, σ²(1 + 1/λ)) dσ/σ ] Ga(λ | 1/2, 2) dλ

and, after substitution and using identity (8),

= ∫ (1 / (2|y_l|)) Ga(λ | 1/2, 2) dλ = 1 / (2|y_l|).

In summary, had we used a training sample to “correct” the priors in (11), the correction would have exactly cancelled out, for whatever training sample y_l ≠ 0. Moreover, choosing the scale of the prior for µ to be σ_2 (the only available nonsubjective ‘scaling’ in the problem) and centering it at M_1 are natural choices, and the Cauchy prior is known to be robust in various ways. It is important to recall that the property of well calibration should be checked when using a conventional, partially improper prior. It is clear that the argument given in Property 2 would work with any scale mixture of Normal priors.

EXAMPLE 5 (Linear model, Zellner and Siow conventional priors). Zellner and Siow (1980) suggested a generalization of the above conventional Jeffreys prior for comparing two nested models within the normal linear model. Let X = [1 : Z_1 : Z_2] be the design matrix for the ‘full’ linear model under consideration, where 1 is the vector of 1's, and (without loss of generality) it is assumed that the regressors are measured in terms of deviations from their sample means, so that 1^t Z_j = 0, j = 1, 2. It is also assumed that the model has been parameterized in an orthogonal fashion, so that Z_1^t Z_2 = 0. The normal linear model, M_2, for n observations y = (y_1, ..., y_n)^t is

y = α 1 + Z_1 β_1 + Z_2 β_2 + ε,

where ε is N_n(0, σ² I_n), the n-variate normal distribution with mean vector 0 and covariance matrix σ² times the identity. Here, the dimensions of β_1 and β_2 are k_1 − 1 and p, respectively.

For comparison of M_2 with the model M_1: β_2 = 0, Zellner and Siow (1980) propose the following default conventional priors:

π_1^ZS(α, β_1, σ) = 1/σ,

π_2^ZS(α, β_1, σ, β_2) = h(β_2 | σ)/σ,

where h(β_2 | σ) is the Cauchy_p(0, Z_2^t Z_2/(n σ²)) density

h(β_2 | σ) = c · (|Z_2^t Z_2|^{1/2} / (n σ²)^{p/2}) · [1 + β_2^t Z_2^t Z_2 β_2 / (n σ²)]^{-(p+1)/2},

with c = Γ((p+1)/2)/π^{(p+1)/2}. Thus the improper priors of the “common” parameters (α, β_1, σ) are assumed to be the same for the two models (again justifiable by the property of being predictively matched, as in Example 4), while the conditional prior of the (unique to M_2) parameter β_2, given σ, is assumed to be the (proper) p-dimensional Cauchy distribution, with location at 0 (so that it is ‘centered’ at M_1) and scale matrix Z_2^t Z_2/(n σ²), “. . . a matrix suggested by the form of the information matrix,” to quote Zellner and Siow (1980). (But see criticisms about this choice, particularly for unbalanced designs, in the sequel and in detail in Berger and Pericchi, 2004.)

Again, using the fact that a Cauchy distribution can be written as a scale mixture of normal distributions, it is possible to compute the needed marginal distributions, m_i(y), with one-dimensional numerical integration; even an involved exact calculation is possible, as Jeffreys does. Alternatively, a Laplace or other approximation can be used. In fact, perhaps the best approximation to be used in this problem is (20) below.

When there are more than two models, or the models are nonnested, there are various possible extensions of the above strategy. Zellner and Siow (1980) utilize what is often called the ‘encompassing’ approach (introduced by Cox, 1961), where one compares each submodel, M_i, to the encompassing model, M_0, that contains all possible covariates from the submodels. One then obtains, using the above priors, the pairwise Bayes factors B_0i, i = 1, ..., q. The Bayes factor of M_j to M_i is then defined to be

(12)  B_ji = B_0i / B_0j.

EXAMPLE 5 (Continued: Linear model, conjugate g-priors). It is perplexing (since Zellner's suggestion of a g-prior was for estimation, while for testing he suggested other priors) that one of the most popular choices of prior for hypothesis testing in the normal linear model is the conjugate prior, called a g-prior in Zellner (1986). For a linear model

M: y = Xβ + ε,  ε ∼ N_n(0, σ² I_n),

where σ² and β = (β_1, ..., β_k)^t are unknown and X is an (n × k) given design matrix of rank k < n, the g-prior density is defined by

π(σ) = 1/σ,  π(β | σ) is N_k(0, g σ² (X^t X)^{-1}).

Often g = n is chosen (see also Shively et al., 1999), while sometimes g is estimated by empirical Bayes methods (see, e.g., George and Foster, 2000; Clyde and George, 2000).

A perceived advantage of g-priors is that the marginal density, m(y), is available in closed form; it is given by

m(y) = (Γ(n/2) / (2 π^{n/2} (1+g)^{k/2})) · [y^t y − (g/(1+g)) y^t X (X^t X)^{-1} X^t y]^{-n/2}.

Thus the Bayes factors and posterior model probabilities for comparing any two linear models are available in simple and closed form.

A major problem is that g-priors have some undesirable properties when used for model selection, as shown in (BP01); one of them may be called “finite sample inconsistency”. Suppose one is interested in comparing the linear model above with the null model M*: β = 0. It can be shown that, as the least squares estimate β̂ (and the noncentrality parameter) goes to infinity, so that one becomes certain that M* is wrong, the Bayes factor of M* to M goes to the nonzero constant (1+g)^{(k−n)/2}. It was this defect that caused Jeffreys (1961) to reject conjugate priors for model selection, in favor of the Cauchy priors discussed above (for which the Bayes factor does go to zero when the evidence is overwhelmingly against M*).
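This finite sample inconsistency is easy to exhibit numerically. The sketch below implements the closed-form m(y) above together with the null marginal under π(σ) = 1/σ; the design matrix, the choice g = 20, and the growing-signal loop are illustrative assumptions:

```python
import numpy as np
from scipy.special import gammaln

def log_m_gprior(y, X, g):
    """log m(y) from the closed-form g-prior expression above."""
    n, k = X.shape
    Q = y @ y - (g / (1 + g)) * (y @ X) @ np.linalg.solve(X.T @ X, X.T @ y)
    return (gammaln(n / 2) - np.log(2) - (n / 2) * np.log(np.pi)
            - (k / 2) * np.log1p(g) - (n / 2) * np.log(Q))

def log_m_null(y):
    """log m(y) under M*: β = 0, with π(σ) = 1/σ (same form, k = 0, g = 0)."""
    n = len(y)
    return (gammaln(n / 2) - np.log(2) - (n / 2) * np.log(np.pi)
            - (n / 2) * np.log(y @ y))

rng = np.random.default_rng(0)
n, k, g = 20, 3, 20
X = rng.standard_normal((n, k))
log_limit = (k - n) / 2 * np.log1p(g)   # log of (1+g)^{(k-n)/2}
for scale in (1.0, 1e3, 1e6):
    # signal grows without bound, so M* becomes certainly wrong
    y = X @ np.full(k, scale) + 0.1 * rng.standard_normal(n)
    logB = log_m_null(y) - log_m_gprior(y, X, g)
```

On the log scale the Bayes factor of M* to M is bounded below by (k−n)/2 · log(1+g), which it approaches as the signal grows, instead of tending to −∞.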

The conventional prior approach is appealing in principle. However, there seems to be no general method for determining such conventional priors. Nevertheless, it is a promising approach in the sense that, as knowledge of the behavior of Bayes Factors accumulates, the acquired wisdom may be used with advantage at the moment of assuming good Conventional Priors for substantial classes of problems. Furthermore, Conventional Priors obey Principle 1 when it can be established that the priors are well calibrated and also that they are free of the “finite sample inconsistency”.

The remaining methods discussed here have the advantage of applying automatically in quite general situations.

2.3. Intrinsic Bayes factor (IBF) approach

For the q models M_1, ..., M_q, suppose that (ordinary, usually improper) noninformative priors π_i^N(θ_i), i = 1, ..., q, have been chosen, preferably as ‘reference priors’ (see Berger and Bernardo, 1992), but other choices are possible. Define the corresponding marginal or predictive densities of X,

m_i^N(x) = ∫ f_i(x | θ_i) π_i^N(θ_i) dθ_i.

The general strategy for defining IBFs starts with the definition of a proper and minimal ‘deterministic training sample’ (Definition 1 above): simply subsets x(l), l = 1, ..., L, of size m, as small as possible, such that, using the training sample as the sample, the updated posteriors under each model become proper.

The advantage of employing a training sample to define Bayes factors is to use x(l) to “convert” the improper π_i^N(θ_i) to proper posteriors, π_i^N(θ_i | x(l)), and then use the latter to define Bayes factors for the rest of the data, denoted by x(−l). The result, for comparing M_j to M_i, is the following property (BP96a).

PROPERTY 3.

B_ji(l) = [∫ f_j(x(−l) | θ_j, x(l)) π_j^N(θ_j | x(l)) dθ_j] / [∫ f_i(x(−l) | θ_i, x(l)) π_i^N(θ_i | x(l)) dθ_i],

using Bayes rule, can be written as (assuming that all quantities involved exist)

B_ji(l) = B_ji^N(x) · B_ij^N(x(l)),

where

(13)  B_ji^N = B_ji^N(x) = m_j^N(x) / m_i^N(x)  and  B_ij^N(l) = B_ij^N(x(l)) = m_i^N(x(l)) / m_j^N(x(l))

are the Bayes factors that would be obtained for the full data x and training sample x(l), respectively, if one were to use π_i^N and π_j^N.

The corrected B_ji(l) no longer depends on the scales of π_j^N and π_i^N, but it does depend on the arbitrary choice of the (minimal) training sample x(l). To eliminate this dependence and to increase stability, the B_ji(l) are averaged over all possible training samples x(l), l = 1, ..., L. A variety of different averages are possible; here consideration is given only to the arithmetic IBF (AIBF) and the median IBF (MIBF), defined respectively as

(14)  B_ji^AI = B_ji^N · (1/L) Σ_{l=1}^L B_ij^N(l),  B_ji^MI = B_ji^N · Med[B_ij^N(l)],

where “Med” denotes median. (The Geometric IBF has also received attention; see Bernardo and Smith, 1994.) For the AIBF, it is typically necessary to place the more “complex” model in the numerator, i.e. to let M_j be the more complex model, and then define B_ij^AI by B_ij^AI = 1/B_ji^AI. The IBFs defined in (14) are resampling summaries of the evidence of the data for the comparison of models, since in the averages there is sample re-use. (Note that the Empirical EP-Prior approach, defined in Section 2.5, is also a resampling method.)

These IBFs were defined in (BP96a) along with alternate versions, such as the encompassing IBF and the expected IBF, which are recommended for certain scenarios. The MIBF is the most robust IBF (see Berger and Pericchi, 1998), although the AIBF is justified through the existence of an implicit intrinsic prior, at least (but not only) for nested models (see the next section).

EXAMPLE 1 (Continued: Exponential model, AIBF and MIBF). We have a minimal training sample x_l of size m = 1,

(15)  B_01^N(x_l) = f(x_l | λ_0) / m_1^N(x_l) = x_l λ_0 exp(−λ_0 x_l),

and then

B_10^AI = B_10^N · (1/n) Σ_{l=1}^n B_01^N(x_l) = (Γ(n) exp(λ_0 Σ x_i) / ((Σ x_i)^n λ_0^n)) · (1/n) Σ_{l=1}^n x_l λ_0 exp(−λ_0 x_l),

and for the Median IBF the arithmetic average is replaced by the median of the corrections.
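These formulas can be sketched directly in code. The data and λ_0 below are invented for illustration; Example 1's setup, an exponential likelihood with π^N(λ) = 1/λ under M_1, is assumed:

```python
import numpy as np
from scipy.special import gammaln

def exp_ibf(x, lam0):
    """AIBF and MIBF of M1 (rate λ unknown, π^N(λ) = 1/λ) versus
    M0: λ = λ0, following (15) and the display after it."""
    x = np.asarray(x, dtype=float)
    n, s = len(x), x.sum()
    # B10^N = Γ(n) exp(λ0 s) / (s^n λ0^n), computed on the log scale
    log_b10n = gammaln(n) + lam0 * s - n * np.log(s) - n * np.log(lam0)
    corrections = x * lam0 * np.exp(-lam0 * x)   # B01^N(x_l), eq. (15)
    return (np.exp(log_b10n) * corrections.mean(),
            np.exp(log_b10n) * np.median(corrections))

# Illustrative data far too concentrated near 0 for rate λ0 = 1
aibf, mibf = exp_ibf([0.10, 0.20, 0.15, 0.12, 0.18], lam0=1.0)
```

For data this concentrated relative to Exp(λ_0 = 1), both IBFs favor M_1, as expected.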

EXAMPLE 4 (Continued: Normal mean, AIBF and MIBF). Start with the noninformative priors π_1^N(σ_1) = 1/σ_1 and π_2^N(µ, σ_2) = 1/σ_2². Note that π_2^N is not the recommended reference prior, but π_2^N yields simpler expressions for illustrative purposes. It turns out that minimal training samples consist of any two distinct observations x(l) = (x_i, x_j), and integration shows that

m_1^N(x(l)) = 1 / (2π (x_i² + x_j²)),  m_2^N(x(l)) = 1 / (√π (x_i − x_j)²).

Standard integrals yield the following (unscaled) Bayes factor for data x, when using π_1^N and π_2^N directly as the priors:

(16)  B_21^N = √(2π/n) · (1 + n x̄²/s²)^{n/2},

where s² = Σ_{i=1}^n (x_i − x̄)². Using (14), the AIBF is then equal to

(17)  B_21^AI = B_21^N · (1/L) Σ_{l=1}^L (x_1(l) − x_2(l))² / (2√π [x_1²(l) + x_2²(l)]),

while the MIBF is given by replacing the arithmetic average by the median.
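The computations in (16)-(17) can be sketched as follows; the two small datasets are invented for illustration:

```python
import numpy as np
from itertools import combinations

def normal_mean_ibf(x):
    """AIBF and MIBF of M2 (unknown mean) versus M1: µ = 0, per (16)-(17);
    the training samples are all pairs of distinct observations."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    s2 = ((x - xbar) ** 2).sum()
    b21n = np.sqrt(2 * np.pi / n) * (1 + n * xbar ** 2 / s2) ** (n / 2)
    corr = np.array([(xi - xj) ** 2 / (2 * np.sqrt(np.pi) * (xi ** 2 + xj ** 2))
                     for xi, xj in combinations(x, 2)])
    return b21n * corr.mean(), b21n * np.median(corr)

aibf_far, _ = normal_mean_ibf([4.8, 5.2, 5.1, 4.9, 5.0])      # mean far from 0
aibf_near, _ = normal_mean_ibf([-0.5, 0.3, 0.1, -0.2, 0.25])  # mean near 0
```

A sample mean far from zero produces strong evidence for M_2, while a sample mean near zero does not.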

EXAMPLE 5 (Continued: Linear models, AIBF and MIBF). IBFs for linear and related models are developed in Berger and Pericchi (1996a, 1996b, 1997) and, for Dynamic Linear Models, in Rodríguez and Pericchi (2001). Suppose, for j = 1, ..., q, that model M_j for Y (n × 1) is the linear model

M_j: y = X_j β_j + ε_j,  ε_j ∼ N_n(0, σ_j² I_n),

where σ_j² and β_j = (β_j1, β_j2, ..., β_jk_j)^t are unknown, and X_j is an (n × k_j) given design matrix of rank k_j < n. We will consider priors of the form

π_j^N(β_j, σ_j) = σ_j^{−(1+q_j)},  q_j > −1.

Common choices of q_j are q_j = 0 (the reference prior) or q_j = k_j (Jeffreys rule prior). When comparing a model M_i nested in M_j, (BP96a) also consider a modified Jeffreys prior, having q_i = 0 and q_j = k_j − k_i. This is a prior between the reference and Jeffreys Rule priors.

For these priors, a minimal training sample y(l), with corresponding design matrix X(l) (under M_j), is a sample of size m = max{k_j} + 1 such that all (X_j^t(l) X_j(l)) are nonsingular. Calculation then yields

(18)  B_ji^N = (π^{(k_j−k_i)/2} / 2^{(q_i−q_j)/2}) · (Γ((n−k_j+q_j)/2) / Γ((n−k_i+q_i)/2)) · (|X_i^t X_i|^{1/2} / |X_j^t X_j|^{1/2}) · (R_i^{(n−k_i+q_i)/2} / R_j^{(n−k_j+q_j)/2}),

where R_i and R_j are the residual sums of squares under models M_i and M_j, respectively. Similarly, B_ij^N(l) is given by the inverse of this expression with n, X_i, X_j, R_i and R_j replaced by m, X_i(l), X_j(l), R_i(l) and R_j(l), respectively; here R_i(l) and R_j(l) are the residual sums of squares corresponding to the training sample y(l).

Plugging these expressions into (14) results in the Arithmetic and Median IBFs for the three default priors being considered. For instance, using the modified Jeffreys prior and defining p = k_j − k_i > 0, the AIBF is

(19)  B_ji^AI = (|X_i^t X_i|^{1/2} / |X_j^t X_j|^{1/2}) · (R_i/R_j)^{(n−k_i)/2} · (1/L) Σ_{l=1}^L (|X_j^t(l) X_j(l)|^{1/2} / |X_i^t(l) X_i(l)|^{1/2}) · (R_j(l)/R_i(l))^{(p+1)/2}.

To obtain the MIBF, replace the arithmetic average by the median. Note that the MIBF does not require M_i to be nested in M_j.
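A sketch of (19) for one nested pair, an intercept-only M_i inside a simple regression M_j; the simulated data are an illustrative assumption, and training samples of size m = k_j + 1 are enumerated exhaustively:

```python
import numpy as np
from itertools import combinations

def rss(y, X):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum())

def linear_aibf(y, Xi, Xj):
    """AIBF (19) for Mi nested in Mj under the modified Jeffreys prior."""
    n, ki = Xi.shape
    kj = Xj.shape[1]
    p = kj - ki
    lead = (np.sqrt(np.linalg.det(Xi.T @ Xi) / np.linalg.det(Xj.T @ Xj))
            * (rss(y, Xi) / rss(y, Xj)) ** ((n - ki) / 2))
    corr = []
    for idx in combinations(range(n), kj + 1):
        l = list(idx)
        Xil, Xjl, yl = Xi[l], Xj[l], y[l]
        if np.linalg.matrix_rank(Xjl) < kj:   # skip improper training samples
            continue
        corr.append(np.sqrt(np.linalg.det(Xjl.T @ Xjl)
                            / np.linalg.det(Xil.T @ Xil))
                    * (rss(yl, Xjl) / rss(yl, Xil)) ** ((p + 1) / 2))
    return lead * np.mean(corr)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 12)
Xi = np.ones((12, 1))                       # Mi: intercept only
Xj = np.column_stack([np.ones(12), x])      # Mj: adds one covariate
y = 1.0 + 5.0 * x + 0.05 * rng.standard_normal(12)
aibf = linear_aibf(y, Xi, Xj)
```

With a strong regression signal the AIBF decisively favors the larger model M_j, as consistency of the IBF would suggest.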

When multiple linear models are being compared, IBFs can have the unappealing feature of violating the basic Bayesian coherency condition B_jk = B_ji B_ik. To avoid this
