Guttman (1967) first suggested the idea of the posterior predictive distribution; he used the terminology density of a future observation to describe the concept. Rubin applied the idea of the posterior predictive distribution to model checking and assessment (Rubin, 1981) and gave a formal Bayesian definition of the technique (Rubin, 1984). Gelman et al. (1996b) provided additional generality by broadening the class of diagnostic measures used to assess the discrepancy between the data and the posited model.
4.1. Description of posterior predictive model checking
Let p(y|ω) denote the sampling or data distribution for a statistical model, where ω denotes the parameters in the model. Let p(ω) be the prior distribution on the parameters. Then the posterior distribution of ω is

p(ω|y) = p(y|ω)p(ω) / ∫ p(y|ω)p(ω) dω.

Let y^rep denote replicate data that one might observe if the process that generated the data y is replicated with the same value of ω that generated the observed data. Then y^rep is governed by the posterior predictive distribution (or the predictive distribution of replicated data conditional on the observed data),

(1)    p(y^rep|y) = ∫ p(y^rep|ω) p(ω|y) dω.
The posterior predictive approach to model checking proposes using this as a reference distribution for assessing whether various characteristics of the observed data are unusual. Gelman (2003) characterizes the inclusion of replicate data as a generalization of the model building process. Bayesian inference generalizes a probability model from the likelihood p(y|ω) to the joint distribution p(y, ω) ∝ p(y|ω)p(ω). Model checking further generalizes to p(y, y^rep, ω), with the posterior predictive approach using the factorization p(ω)p(y|ω)p(y^rep|ω) (in which y, y^rep are two exchangeable draws from the sampling distribution).
To carry out model checks, test quantities or discrepancy measures D(y, ω) are defined (Gelman et al., 1996b), and the posterior distribution of D(y, ω) is compared to the posterior predictive distribution of D(y^rep, ω), with any significant difference between them indicating a model failure. If D(y, ω) = D(y), then the discrepancy measure is a test statistic in the usual sense.
Model checking can be carried out by graphically examining the replicate data and the observed data, by graphically examining the joint distribution of D(y, ω) and D(y^rep, ω) (possibly for several different discrepancy measures), or by calculating a numerical summary of such distributions. Gelman et al. (2003) provide a detailed discussion of graphical posterior predictive checks. One numerical summary of the model diagnostic's posterior distribution is the tail-area probability or, as it is sometimes known, the posterior predictive p-value:
(2)    p_b = P(D(y^rep, ω) ⩾ D(y, ω) | y) = ∬ I[D(y^rep, ω) ⩾ D(y, ω)] p(y^rep|ω) p(ω|y) dy^rep dω,
where I[A] denotes the indicator function for the event A. Small or large tail-area probabilities indicate that D identifies an aspect of the observed data that replicate data generated under the model fail to reproduce.
Because of the difficulty in dealing with (1) or (2) analytically for all but the most simple problems, Rubin (1984) suggests simulating replicate data sets from the posterior predictive distribution. One draws L simulations ω^1, ω^2, ..., ω^L from the posterior distribution p(ω|y) of ω, and then draws y^rep,l from the sampling distribution p(y|ω^l), l = 1, 2, ..., L. The process results in L draws from the joint posterior distribution p(y^rep, ω|y). Then graphical or numerical model checks are carried out using these sampled values.
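The two-step simulation recipe can be sketched in a few lines. The following hypothetical example (not from the text) uses a deliberately simple model, y_i ~ N(ω, 1) with a flat prior on ω, so that the posterior ω | y ~ N(ȳ, 1/n) has closed form; in realistic problems the draws of ω would come from MCMC output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y_i ~ Normal(omega, 1) with a flat prior on omega,
# so the posterior is omega | y ~ Normal(ybar, 1/n).
y = rng.normal(0.0, 1.0, size=50)
n, ybar = y.size, y.mean()

L = 2000
# Step 1: draw omega^1, ..., omega^L from the posterior p(omega | y).
omega = rng.normal(ybar, 1.0 / np.sqrt(n), size=L)

# Step 2: draw one replicate data set y^rep,l from p(y | omega^l) for each l.
y_rep = rng.normal(omega[:, None], 1.0, size=(L, n))

# A check based on the discrepancy D(y) = max_i y_i: the estimated
# tail-area probability (2) is the fraction of replicates whose maximum
# meets or exceeds the observed maximum.
D_obs = y.max()
D_rep = y_rep.max(axis=1)
p_value = np.mean(D_rep >= D_obs)
```

Each row of y_rep, paired with the corresponding entry of omega, is one draw from the joint posterior p(y^rep, ω | y), so the same arrays support graphical checks as well as the tail-area estimate.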
4.2. Properties of posterior predictive p-values
Though the tail-area probability is only one possible summary of the model check, it has received a great deal of attention. The inner integral in (2) can be interpreted as a traditional p-value for assessing a hypothesis about a fixed value of ω given the test measure D. If viewed in this way, the various model checking approaches represent different ways of handling the parameter ω. The posterior predictive p-value is an average of the classical p-value over the posterior uncertainty about the true ω. The Box prior predictive approach averages over the prior distribution for the parameter.
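This averaging interpretation can be checked numerically. In the hypothetical sketch below (a normal-mean setup with a flat prior, chosen only because everything has closed form), the classical p-value P(max(y^rep) ⩾ max(y) | ω) = 1 − Φ(max(y) − ω)^n is averaged over posterior draws of ω and agrees with the p-value estimated directly from simulated replicates.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: y_i ~ Normal(omega, 1), flat prior on omega.
n = 40
y = rng.normal(0.0, 1.0, size=n)
ybar, D_obs = y.mean(), y.max()

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

L = 4000
omega = rng.normal(ybar, 1.0 / math.sqrt(n), size=L)  # posterior draws

# Classical p-value for each fixed omega: P(max of n N(omega,1) draws >= D_obs),
# then averaged over the posterior draws.
classical = np.array([1.0 - Phi(D_obs - w) ** n for w in omega])
p_via_average = classical.mean()

# The same quantity estimated directly from simulated replicate data sets.
y_rep = rng.normal(omega[:, None], 1.0, size=(L, n))
p_via_replicates = np.mean(y_rep.max(axis=1) >= D_obs)
```

Up to Monte Carlo error, the two estimates coincide, which is exactly the statement that the posterior predictive p-value averages the classical p-value over p(ω|y).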
Meng (1994) provides a theoretical comparison of classical and Bayesian p-values.
One unfortunate result is that posterior predictive p-values do not share some of the features of the classical p-values that dominate traditional significance testing.
In particular, they do not have a uniform distribution when the assumed model is true; instead, they are more concentrated around 0.5 than a uniform distribution.
This is a result of using the same data to define the reference distribution and the tail event measured by the p-value. For some (see, e.g., Bayarri and Berger, 2000;
Robins et al., 2000) the resulting conservatism of the p-values is a major disadvantage; this has motivated some of the alternative model checking strategies described earlier. Posterior predictive checks remain popular because the posterior predictive distributions of suitable discrepancy measures (not just the p-values) are easy to calculate and relevant to assessing model fitness.
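The concentration around 0.5 is easy to exhibit by simulation. This illustrative sketch (a hypothetical setup, not from the text) repeatedly generates data from a true normal model with a flat prior on the mean, computes the posterior predictive p-value of the sample maximum for each data set, and compares the spread of the resulting p-values with the variance 1/12 of a uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, n_datasets = 30, 500, 300

p_values = np.empty(n_datasets)
for k in range(n_datasets):
    # Generate a data set from the true model: y_i ~ Normal(0, 1).
    y = rng.normal(0.0, 1.0, size=n)
    # Posterior draws under a flat prior: omega | y ~ Normal(ybar, 1/n).
    omega = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=L)
    # Replicate data sets and the p-value for D(y) = max_i y_i.
    y_rep = rng.normal(omega[:, None], 1.0, size=(L, n))
    p_values[k] = np.mean(y_rep.max(axis=1) >= y.max())

# A uniform p-value would have variance 1/12 ≈ 0.083; the posterior
# predictive p-values are more tightly concentrated around 0.5 because
# the observed data both center the replicates and define the tail event.
spread = p_values.var()
```

The reduction in spread relative to 1/12 is the conservatism discussed above: extreme p-values occur less often than they would for a uniformly distributed classical p-value.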
4.3. Effect of prior distributions
Because posterior predictive checks are Bayesian by nature, a question arises about the sensitivity of the results obtained to the prior distribution on the model parameters.
Because posterior predictive checks are based on the posterior distribution, they are generally less sensitive to the choice of prior distribution than are prior predictive checks.
Model failures are detected only if the posterior inferences under the model seem flawed. Unsuitable prior distributions may still be judged acceptable if the posterior inferences are reasonable. For example, with large sample sizes the prior distribution has little effect on the posterior distribution, and hence on posterior predictive checks.
Gelman et al. (1996a) comment that if the parameters are well-estimated from the data, posterior predictive checks give results similar to classical model checking procedures for reasonable prior distributions. In such situations the focus is on assessing the fit of the likelihood part of the model.
Strongly informative prior distributions may of course have a large impact on the results of posterior predictive model checks. The replicated data sets obtained under strong incorrect prior specifications may be quite far from the observed data. In this way posterior predictive checks retain the capability of rejecting a probability model if the prior distribution is sufficiently poorly chosen to negatively impact the fit of the model to the data. Gelman et al. (1996a, p. 757) provide one such example. Conversely, a strong prior distribution, if trustworthy, can help a researcher assess the fit of the likelihood part of the model more effectively.
4.4. Definition of replications
In the description thus far the replications have been defined as data sets that are exchangeable with the original data under the model. This is implemented as an independent draw from the sampling distribution conditional on the model parameters.
This is a natural definition for models where the data y depend on a parameter vector ω which is given a (possibly noninformative) prior distribution. In hierarchical models, where the distribution of the data y depends on parameters ω and the parameters ω are given a prior or population distribution that depends on parameters α, there can be more than one possible definition of replications. To illustrate we consider a commonly occurring situation involving a hierarchical model.
Suppose that y are measurements (perhaps weights) with the measurement y_i of object i having a Gaussian distribution with mean equal to the true measurement ω_i and known variance. If a number of related objects are measured, then it is natural to model the elements of ω as independent Gaussian random variables conditional on parameters α (the population mean and variance). Then one possible definition of replicate data sets corresponds to taking new measurements of the same objects. This corresponds to the joint distribution p(α, ω, y, y^rep) having factorization p(α)p(ω|α)p(y|ω)p(y^rep|ω). The final term in the factorization reflects the assumption that, conditional on y and the parameters, y^rep depends only on the true measurements of the objects (ω). In practice the replicate data are obtained by simulating from the posterior distribution p(α, ω|y) and then p(y^rep|ω).

An alternative definition in this case would take the replications as corresponding to measurements of new objects from the same population. This corresponds to the joint distribution p(α, ω, y, y^rep) having factorization p(α)p(ω|α)p(y|ω)p(y^rep|α). The final term in the factorization now reflects the assumption that, conditional on y and the parameters, y^rep depends only on the population parameters (α) (because we have new objects from that population). In practice the replicate data for this case are obtained by simulating from the posterior distribution p(α|y), then simulating new “true” measurements from p(ω|α), and finally simulating replicated data sets based on these new “true” measurements p(y^rep|ω).
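As an illustration only, the two replication schemes for this Gaussian measurement model differ by a single line of code. The posterior draws of α = (μ, τ) and ω below are faked with arbitrary placeholder distributions; in practice they would come from actually fitting the hierarchical model (e.g., by MCMC).

```python
import numpy as np

rng = np.random.default_rng(2)

L, n, sigma = 1000, 8, 1.0  # posterior draws, objects, known measurement sd

# Placeholder "posterior" draws of alpha = (mu, tau) and of the true
# measurements omega -- stand-ins for real draws from p(alpha, omega | y).
mu = rng.normal(0.0, 0.2, size=L)
tau = np.abs(rng.normal(1.0, 0.1, size=L))
omega = rng.normal(mu[:, None], tau[:, None], size=(L, n))

# Definition 1: new measurements of the SAME objects --
# simulate from p(y^rep | omega), reusing the drawn omega.
y_rep_same = rng.normal(omega, sigma)

# Definition 2: measurements of NEW objects from the population --
# first draw new "true" values from p(omega | alpha), then measure them,
# which amounts to simulating from p(y^rep | alpha).
omega_new = rng.normal(mu[:, None], tau[:, None], size=(L, n))
y_rep_new = rng.normal(omega_new, sigma)
```

The only difference between the two schemes is whether ω is reused or redrawn from p(ω|α) before the measurement step, which is exactly the difference between the two factorizations above.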
Diagnostics geared at assessing the likelihood, e.g., assessing whether symmetry is a reasonable model, would likely use the first definition corresponding to repeated measurements on the same objects. Diagnostics geared at assessing the appropriateness of the population distribution would likely use the alternative definition to ask whether measurements of objects from the assumed population would look like our sample.
This one example is intended to demonstrate that, for sophisticated modeling efforts, it is important to carefully define the replicate data that serve as a reference distribution for the observed data.
4.5. Discrepancy measures
Technically, any function of the data and the parameters can play the role of a discrepancy measure in posterior predictive checks. The choice of discrepancy measures is very important. Virtually all models are wrong, and a statistical model applied to a data set usually explains certain aspects of the data adequately and some others inadequately.
The challenge to the researcher in model checking is to develop discrepancy measures that have the power to detect the aspects of the data that the model cannot explain satisfactorily. A key point is that discrepancy measures corresponding to features of the data that are directly addressed by model parameters will never detect a lack of fit. For example, asking whether a normal model produces data with the same mean as the observed data, that is, choosing D(y) = mean{y_i}, is sure to conclude that the model is adequate because the location parameter in the normal model will be fit to the observed data and the model will then generate replicate data centered in the same place. The particular model at hand may in fact fit quite poorly in the tails of the distribution, but that will not be uncovered by a poor choice of discrepancy. Discrepancy measures that relate to features of the data not directly addressed by the probability model are better able to detect model failures (Gelman et al., 2003). Thus a measure of tail size (perhaps based on quantiles of the observed and replicate data) is more likely to detect the lack of fit in the situation described above. Failure to develop suitable discrepancy measures may lead to the incorrect conclusion that the model fits the data adequately. For a practical problem, a good strategy is to examine a number of discrepancies corresponding to aspects of practical interest as well as some standard checks on overall fitness (Gelman et al., 1996a).
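The contrast can be sketched numerically. In this hypothetical example the data are heavy-tailed (t with 3 degrees of freedom) but are fit with a normal model whose scale is fixed at the sample standard deviation and whose mean gets a flat prior; the mean-based check is powerless by construction, while a tail-based discrepancy (here the sample maximum) can register the misfit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Heavy-tailed data fit with a normal model: y_i ~ Normal(omega, s^2),
# with s fixed at the sample sd and a flat prior on omega, so that
# omega | y ~ Normal(ybar, s^2/n).
n, L = 200, 2000
y = rng.standard_t(df=3, size=n)
s, ybar = y.std(ddof=1), y.mean()

omega = rng.normal(ybar, s / np.sqrt(n), size=L)
y_rep = rng.normal(omega[:, None], s, size=(L, n))

# D(y) = mean: directly matched by the location parameter, so this check
# is essentially guaranteed to look fine (p-value near 0.5).
p_mean = np.mean(y_rep.mean(axis=1) >= ybar)

# D(y) = max: a tail feature not directly addressed by the normal model,
# so it can flag the heavy tails (p-value typically much more extreme).
p_max = np.mean(y_rep.max(axis=1) >= y.max())
```

The mean-based p-value sits near 0.5 regardless of how badly the normal model describes the tails, while the maximum-based check has at least a chance of exposing the heavy-tailed misfit.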
4.6. Discussion
This section has outlined the theory behind posterior predictive model checks and some of the issues to be addressed before implementation. The fact that posterior predictive checks tend to be somewhat conservative, and the related fact that posterior predictive p-values are not uniformly distributed under the true model, have been cited as arguments against the use of posterior predictive checks. Despite this, a number of advantages have made them one of the more practical tools available for practicing Bayesian statisticians. Posterior predictive model checks are straightforward to carry out once the difficult task of generating simulations from the posterior distribution of the model parameters is done. One merely has to take the simulated parameter values and then simulate data according to the model's sampling distribution (often a common probability distribution) to obtain replicate data sets. The important conceptual tasks of defining discrepancy measures and replications will typically require interaction with subject matter experts. This too can be thought of as an advantage in that the probing of the model required to define suitable discrepancies is a useful exercise. Finally, posterior predictive checks lead to intuitive graphical and probabilistic summaries of the quality of the model fit.
The succeeding sections discuss two applications of posterior predictive model checks. Successful applications in the literature include Belin and Rubin (1995), Rubin and Wu (1997), Glickman and Stern (1998), Gelman et al. (2000, 2003), Fox and Glas (2003), Sorensen and Waagepetersen (2003), Gelman (2004), and Sinharay (in press, b).