Review of 3 study designs
1 Cross-sectional study:
• Information on the disease status (Y) and the exposure status (X) is obtained from a random sample at one time point: a snapshot of the population.
In this study, we measure a single observation for each variable of interest from each subject, represented as (Y_i, X_i) for i = 1, ..., n. To evaluate the relationship between the dependent variable Y and the independent variable X, we employ regression techniques, such as logistic regression when Y is binary.
exp(β_1) = (P[Y = 1|X = 1]/(1 − P[Y = 1|X = 1])) / (P[Y = 1|X = 0]/(1 − P[Y = 1|X = 0])),
i.e., β_1 = log odds-ratio between the exposure population (X = 1) and the non-exposure population (X = 0); β_1 > 0 ⟹ the exposure population has higher odds of disease.
• Data (Y_i, X_i) can be summarized in a 2 × 2 table:

            Y = 1   Y = 0
    X = 1   n_11    n_10
    X = 0   n_01    n_00

Then the MLE of β_1 is given by β̂_1 = log((n_11 n_00)/(n_10 n_01)).
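As a quick numeric illustration (hypothetical counts, not from any real study): if n_11 = 20, n_10 = 80, n_01 = 10, n_00 = 90, then

    β̂_1 = log((20 × 90)/(80 × 10)) = log(2.25) ≈ 0.81,

i.e., the estimated odds of disease among the exposed are about 2.25 times those among the unexposed.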
• Feature: all counts n_00, n_01, n_10, n_11 are random.
Causal inference cannot be established, and β̂_1 may be unstable, particularly if a cell count (e.g., n_11) is small. However, valuable public health insights can still be derived, including the prevalence of the disease within the population and the proportion of individuals exposed to relevant factors.
• Can account for confounders in the model.
2 Prospective cohort study (follow-up study):
• A cohort with known exposure status (X) is followed over time to obtain their disease status (Y ).
• A single observation of (Y ) may be observed (e.g., survival study) or multiple observations of (Y ) may be observed (longitudinal study).
• Stronger evidence for causal inference. Causal inference can be made if X is assigned randomly (e.g., if X is a treatment indicator in a clinical trial).
• When a single binary (0/1) Y is obtained, the data can be summarized as

            Y = 1   Y = 0   Total
    X = 1   n_11    n_10    n_1+
    X = 0   n_01    n_00    n_0+

Here, n_1+ and n_0+ are fixed (the sample sizes for the exposure and non-exposure groups).
3 Case-control study (retrospective study):
• A sample with known disease status (D) is drawn and their exposure history (E) is ascertained. Data can be summarized as
            D = 1   D = 0
    E = 1   n_11    n_10
    E = 0   n_01    n_00
    Total   n_+1    n_+0

where the margins n_+1 and n_+0 are fixed numbers.
• Assuming no bias in obtaining history information on E, the association between E and D can be estimated: n_11 ∼ Bin(n_+1, P[E|D]), n_10 ∼ Bin(n_+0, P[E|D̄]).
• Odds ratio: the estimate from this study, θ̂ = (n_11 n_00)/(n_10 n_01), estimates the quantity
    θ = (P[E|D]/(1 − P[E|D])) / (P[E|D̄]/(1 − P[E|D̄])),
which by Bayes' theorem equals the odds ratio of disease between exposed and unexposed, (P[D|E]/(1 − P[D|E])) / (P[D|Ē]/(1 − P[D|Ē])).
• If the disease is rare, i.e., P[D|E] ≈ 0 and P[D|Ē] ≈ 0, the relative risk of disease can be approximately obtained: θ ≈ P[D|E]/P[D|Ē].
More efficient than prospective cohort study in this case.
• Problem: recall bias! (It is difficult to ascertain exposure history accurately in retrospect.)
Introduction to longitudinal studies
A longitudinal study is a prospective cohort study where repeated measures are taken over time for each individual.
A longitudinal study is usually designed to answer the following questions:
1 How does the variable of interest change over time?
2 How is the (change of) variable of interest associated with treatment and other covariates?
3 How do observations of the variable of interest correlate with each other over time?
Data examples
Example 1: Framingham study
In the Framingham study, each of 2634 participants was examined every 2 years over a 10-year period for his/her cholesterol level.
1 How does cholesterol level change over time on average as people get older?
2 How is the change of cholesterol level associated with sex and baseline age?
3 Do males have more stable (true) baseline cholesterol level and change rate than females?
A subset of 200 subjects' data is used for illustrative purposes.
A glimpse of the raw data (variables: newid, id, cholst, sex, age, time; listing omitted).
[Figure: cholesterol level over time for a subset of 200 subjects from the Framingham study]
What we observed from this data set:
1 Cholesterol levels increase (linearly) over time for most individuals.
2 Each subject has his/her own trajectory line with a possibly different intercept and slope, implying two sources of variations: within and between subject variations.
3 Each subject has on average 5 observations (as opposed to one observation per subject in a cross-sectional study).
4 The data are not balanced: some individuals have missing observations (e.g., subject 2's cholesterol is missing at time = 2).
5 The inference is NOT limited to these 200 individuals. Instead, the inference is for the target population, and each subject is viewed as a random draw from that population.
Example 2: Indonesian children's health study
Each of 275 Indonesian preschool children was examined up to six consecutive quarters for the presence of respiratory infection (yes/no).
Information on age, sex, height for age, xerophthalmia (vitamin A deficiency) was also obtained.
• Was the risk of respiratory infection related to vitamin A deficiency after adjusting for age, sex, and height for age, etc.?
Features of this data set:
1 Outcome is whether or not a child has respiratory infection, i.e., binary outcome.
2 Some covariates (age, vitamin A deficiency, and height for age) are time-varying.
A glimpse of the infection data (variables: Obs, id, infect, xero, sex, visit, season; listing omitted).

[Table: proportions of respiratory infection and vitamin A deficiency; values omitted]
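Looking ahead to the GEE approach covered later in these notes, the infection question above would typically be addressed with a marginal logistic model. A minimal Proc Genmod sketch, assuming the variable names from the data glimpse and a data set name infection (the actual course program is not shown here):

    proc genmod data=infection descending;
      class id;
      /* marginal logistic model: infection vs vitamin A deficiency, */
      /* adjusting for age and sex */
      model infect = xero age sex / dist=bin link=logit;
      repeated subject=id / type=exch;  /* exchangeable working correlation */
    run;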
Example 3: Epileptic seizure counts from the progabide trial
In the progabide trial, 59 individuals with epilepsy were randomly assigned to receive either the anti-epileptic medication progabide or a placebo. Seizure counts were recorded over four consecutive two-week intervals, along with participants' ages and baseline seizure counts from the eight weeks prior to treatment assignment.
• What is the treatment effect adjusting for available covariates?
Features of this data set:
1 Outcome is count data, implying a Poisson regression.
2 Baseline seizure counts were for 8 weeks, as opposed to 2 weeks for other seizure counts.
3 Randomization may be taken into account in the data analysis.
A glimpse of the seizure data (variables: Obs, id, seize, trt, visit, interval, age; listing omitted).

[Figure: epileptic seizure counts from the progabide trial]
Features of longitudinal data
Common features of all examples:
• Each subject has multiple time-ordered observations of response.
• Responses from the same subjects may be “more alike” than others.
• Inference is NOT about the study subjects themselves, but about the population from which they are drawn.
• # of subjects >> # of observations/subject
• Source of variations – between and within subject variations.
• Different types of responses (continuous, binary, count).
• Objectives depend on the type of study – “mean” behavior, etc.
Typical data layout (for simplicity, the one-covariate case):

    Cross-sectional study:  subject i contributes a single (y_i, x_i).
    Longitudinal study:     subject i contributes y_i1, ..., y_in_i and
                            x_i1, ..., x_in_i at times t_i1, ..., t_in_i
                            (e.g., subject 2 with n_2 = 5: y_21, ..., y_25 at t_21, ..., t_25).
Why longitudinal studies?
1 A longitudinal study allows us to study the change of the variable of interest over time, either at population level or individual level.
2 A longitudinal study enables us to separately estimate the cross-sectional effect (e.g., cohort effect) and the longitudinal effect (e.g., aging effect):
With cross-sectional data (n_i = 1, a single observation per individual), we can only entertain the model
    y_i1 = β_0 + β_C age_i1 + ε_i1,
where y_ij denotes the observed outcome and age_ij the individual's age; β_C represents the cross-sectional effect of age on the outcome.
With longitudinal data (n_i > 1), we can entertain the model
    y_ij = β_0 + β_C age_i1 + β_L (age_ij − age_i1) + ε_ij.
Then y_i1 = β_0 + β_C age_i1 + ε_i1 (letting j = 1), and
    y_ij − y_i1 = β_L (age_ij − age_i1) + ε_ij − ε_i1.
That is, β_L is the longitudinal effect of age, and in general β_L ≠ β_C.
3 A longitudinal study is more powerful to detect an association of interest compared to a cross-sectional study, =⇒ more efficient, less sample size (number of subjects).
4 A longitudinal study allows us to study the within-subject and between-subject variations.
Suppose b ∼ (μ, σ_b²) is the true blood pressure in a patient population.
However, what we observe is Y = b + ε, where ε ∼ (0, σ_ε²) is the measurement error.
If we have only one observation Y_i for each subject from a sample of n patients, then we cannot separate σ_ε² and σ_b², since Var(Y_i) = σ_b² + σ_ε². Although we can use the data Y_1, Y_2, ..., Y_n to make inference on μ, we cannot make any inference on σ_b² by itself.
However, if we have repeated (or longitudinal) measurements Y_ij = b_i + ε_ij of blood pressure for each subject, then the within-subject variation identifies σ_ε² and the between-subject variation identifies σ_b². Now it is possible to make inference about all the quantities μ, σ_b², and σ_ε².
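This identification can be seen directly in software: a random intercept model (introduced later in these notes) estimates the two variance components separately. A minimal Proc Mixed sketch, assuming a data set bp with variables id and y (hypothetical names):

    proc mixed data=bp;
      class id;
      model y = / s;                  /* intercept-only mean: estimates mu */
      random intercept / subject=id;  /* between-subject variance sigma_b^2 */
    run;                              /* residual variance estimates sigma_eps^2 */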
5 A longitudinal study provides more evidence for possible causal interpretation.
Challenges in analyzing longitudinal data
Key assumption in a classical regression model: there is only one observation of the response per subject ⟹ responses are independent of each other. For example, when y = cholesterol level,
    y_i = β_0 + β_1 sex_i + β_2 age_i + ε_i.
In a longitudinal study, observations from the same subject are more alike than observations from different subjects, so responses from the same subject are not independent; observations from different subjects remain independent.
What happens if we treat observations as independent (i.e., ignore the correlation)?
1 In general, the estimation of the associations (regression coefficients) of the outcome and covariates is valid.
2 However, the variability measures (e.g., the SEs from a classical regression analysis) are not right: sometimes smaller, sometimes bigger than the true variability.
3 Therefore, the inference is not valid (e.g., results look more significant than they should be if the SE is too small).
Sources of variation and correlation in longitudinal data:
1 Between-subject variation: when measurements are taken within a short time frame, blood pressure can be modeled as
    y_ij = b_i + ε_ij,
where b_i is the true blood pressure of subject i, with variance σ_b², and ε_ij is independent random measurement error, with variance σ_ε², independent of b_i.
For j ≠ k,
    corr(y_ij, y_ik) = cov(y_ij, y_ik) / sqrt(var(y_ij) var(y_ik)) = σ_b² / (σ_b² + σ_ε²).
Therefore, if the between-subject variation σ_b² ≠ 0, then data from the same subject are correlated.
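For instance (hypothetical numbers): if σ_b² = 64 and σ_ε² = 16, then corr(y_ij, y_ik) = 64/(64 + 16) = 0.8, so two measurements on the same subject are strongly correlated even though the measurement errors themselves are independent.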
2 Serial correlation: when the time intervals between measurements are large, it may be unreasonable to assume a constant true blood pressure for each individual. The model can then be expressed as
    y_ij = b_i + U_i(t_ij) + ε_ij,
where b_i is the true long-term blood pressure level, U_i(t_ij) is a stochastic process reflecting biological fluctuation over time, and the ε_ij are independent random measurement errors. In this case the within-subject correlation arises from both b_i and the serial dependence in U_i(t).
In longitudinal studies of humans, where the number of observations per subject is small to moderate, there may be insufficient data to accurately assess serial correlation; consequently, most of the observed correlation can often be attributed to between-subject variation.
Methods for analyzing longitudinal data
1 Two-stage method: summarize each subject's outcomes with summary statistics and regress the summary statistics on one-time covariates; particularly suited to continuous longitudinal data. This approach is somewhat outdated, since mixed models accomplish the same goals more flexibly.
2 Mixed (effects) model approach: model fixed effects and random effects; use random effects to model correlation.
3 Generalized estimating equation (GEE) approach: model the dependence of the marginal mean on covariates. Correlation is not a main interest. Particularly good for discrete data.
4 Transition models: use history as covariates. Good for predicting a future response from the history.
Two-stage method for analyzing longitudinal data
• Outcome (usually continuous): y_i1, ..., y_in_i measured at times t_i1, ..., t_in_i; one-time covariates: x_i1, ..., x_ip.
• Two-stage analysis is conducted as follows:
In Stage 1, we reduce subject i's data y_i1, ..., y_in_i to summary statistics. This can be done by calculating the mean, ȳ_i = (y_i1 + ... + y_in_i)/n_i, or by fitting a linear regression for each subject:
    y_ij = b_i0 + b_i1 t_ij + ε_ij.
In this model we assume the true response of subject i follows a straight line over time, where b_i0 is the true response at baseline (t = 0), b_i1 is the change rate of the true response over time, and ε_ij is interpreted as measurement error.
In Stage 2, we treat the summary statistics as new responses and regress them on the one-time covariates. For example, after obtaining b̂_i0 and b̂_i1 from Stage 1, we can compute their means and standard errors and compare them between sexes, or fit the regressions
    b̂_i0 = α_0 + α_1 x_i1 + ... + α_p x_ip + e_i0,
    b̂_i1 = β_0 + β_1 x_i1 + ... + β_p x_ip + e_i1,
where α_k is the effect of covariate x_k on the true baseline response and β_k is the effect of x_k on the change rate of the true response.
Analyzing Framingham data using two-stage method
• Stage I: for each subject, fit y_ij = b_i0 + b_i1 t_ij + ε_ij and obtain the estimates b̂_i0 and b̂_i1.
The SAS program for Stage I sets line size and page size options, reads the data set cholst.dat (variables newid, id, cholst, sex, age, time), sorts by newid and time, and prints selected observations. Proc Reg is then run by newid to regress cholst on time, producing an output data set of per-subject intercepts and slopes. This output is merged with the original data set (keeping the first observation per newid, which carries the one-time covariates), and summary statistics and a correlation analysis for the intercepts and slopes are produced.
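A minimal reconstruction of such a Stage I program, based on the description above (option values and intermediate data set names are assumed):

    options ls=80 ps=60 nodate;
    data cholst;
      infile "cholst.dat";
      input newid id cholst sex age time;
    run;
    proc sort data=cholst; by newid time; run;
    /* subject-by-subject regressions of cholst on time */
    proc reg data=cholst outest=est noprint;
      by newid;
      model cholst = time;
    run;
    data est;
      set est (rename=(intercept=b0hat time=b1hat));
      keep newid b0hat b1hat;
    run;
    /* keep one record per subject from the raw data (for sex, age) */
    data first;
      set cholst;
      by newid;
      if first.newid;
    run;
    data stage1;
      merge est first;
      by newid;
    run;
    title "Summary statistics for intercepts and slopes";
    proc means data=stage1 n mean stderr var t prt min max;
      var b0hat b1hat;
    run;
    title "Correlation between intercepts and slopes";
    proc corr data=stage1;
      var b0hat b1hat;
    run;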
Part of output from the above SAS program:

Summary statistics for intercepts and slopes:

    Variable   N     Mean        Std Dev    Sum         Minimum      Maximum
    b0hat      200   220.68935   41.68917   44138       141.14286    360.16667
    b1hat      200   2.55025     3.62947    510.05058   -14.00000    11.74286

Correlation between intercepts and slopes (Pearson, Prob > |r| under H0: Rho = 0):

            b0hat      b1hat
    b0hat   1.00000    -0.26939
    b1hat   -0.26939   1.00000

These results can be summarized as:

    Parameter   Mean   SE      t    P[T ≥ |t|]
    b̂_0        221    3       75   < .0001
    b̂_1        2.55   0.257   10   < .0001

    estimated corr(b̂_0, b̂_1) = −0.27
1 We can use the sample means of b̂_0 and b̂_1 to estimate the population means of b_0 and b_1. In particular, the sample mean 2.55 of b̂_1, with standard error 0.257, addresses the first objective (the average change of cholesterol over time).
2 However, var(b̂_i0) and var(b̂_i1) contain extra variability due to estimating the true baseline response b_i0 and change rate b_i1 of individual i. Hence the sample variances S²_{b̂_0} and S²_{b̂_1}, while unbiased estimates of var(b̂_i0) and var(b̂_i1), would overestimate var(b_i0) and var(b_i1).
3 Similarly, corr(b̂_0, b̂_1) ≠ corr(b_0, b_1). Therefore, the estimated correlation −0.27 cannot be used to estimate the correlation between the true baseline response b_0 and the true change rate b_1.
4 We will use mixed model approach to address the above issues later.
1 Try to compare E(b_0) and E(b_1) between males and females.
2 Try to compare var(b_0) and var(b_1) between males and females.
3 Try to examine the effects of age and sex on b_0 using b̂_0 = α_0 + α_1 sex + α_2 age + e_0.
Technically, we should use b_0 instead of b̂_0. However, b̂_0 is an unbiased estimate of b_0 (and b_0 is not observable), so using b̂_0 is valid.
4 Try to examine the effects of age and sex on b_1 using b̂_1 = β_0 + β_1 sex + β_2 age + e_1. By the same argument, using b̂_1 here is valid.
The SAS program for Stage II runs a t-test comparing the means (and testing equality of the variances) of the intercepts and slopes between sexes, using the variables b0hat and b1hat, and then fits two regressions: the first regresses the intercept (b0hat) on sex and age, and the second regresses the slope (b1hat) on sex and age.
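A minimal reconstruction of the Stage II program, using the stage1 data set from the Stage I sketch above (details assumed):

    title "Test equality of mean and variance of intercepts and slopes between sexes";
    proc ttest data=stage1;
      class sex;
      var b0hat b1hat;
    run;
    title "Regression to look at the association between intercept and sex, age";
    proc reg data=stage1;
      model b0hat = sex age;
    run;
    title "Regression to look at the association between slope and sex, age";
    proc reg data=stage1;
      model b1hat = sex age;
    run;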
Part of output from the above SAS program:

Test equality of mean and variance of intercepts and slopes between sexes (Proc TTEST; most of the listing omitted):

    Variable b0hat:  Diff (1-2)   Mean 6.3629    Std Dev 41.6719   Std Err 5.8960
    Variable b1hat:  Diff (1-2)   Mean -1.5629   Std Dev 3.5529    Std Err 0.5027

(The listing also reports 95% confidence limits for the means and standard deviations, the t-tests under equal and unequal variances, and the F-test for equality of variances.)
Regression to look at the association between intercept and sex, age (Proc REG, dependent variable b0hat; analysis-of-variance and parameter-estimate tables appear in the listing, values omitted).

Regression to look at the association between slope and sex, age (Proc REG, dependent variable b1hat; Dependent Mean 2.55025, Adj R-Sq 0.0892, Coeff Var 135.82170; parameter estimates appear in the listing).
1 Comparison of E(b_0) and E(b_1) between males and females:
    Ê(b_0): 223.97 (female), 217.6 (male), p-value = 0.28.
    Ê(b_1): 1.75 (female), 3.31 (male), p-value = 0.002.
2 Comparison of var(b_0) and var(b_1) between males and females: the F-tests in the output compare var(b̂_0) and var(b̂_1) between sexes. However, these tests do NOT compare var(b_0) and var(b_1). We will use the mixed model approach to address this problem.
3 Model for the true baseline response b_0: b̂_0 = α_0 + α_1 sex + α_2 age + e_0.
The fit indicates that each additional year of baseline age is associated with about a 2-unit higher baseline cholesterol level, adjusting for sex, and that, adjusting for baseline age, males have a higher baseline cholesterol level than females.
4 Model for the change rate of the true response b_1: b̂_1 = β_0 + β_1 sex + β_2 age + e_1; β̂_0 = 6.14 (1.35), β̂_1 = 1.74 (0.5), β̂_2 = −0.11 (0.03).
That is, each additional year of baseline age is associated with a 0.11 lower change rate of cholesterol, adjusting for sex, and, adjusting for baseline age, males' change rate is 1.74 units higher than females'.
Some remarks on two-stage analysis:
1 The first stage model should be reasonably good for the second stage analysis to be valid and make sense.
2 Two-stage analysis can only be used when the covariates considered are one-time covariates (fixed over time).
3 Summary statistics of a time-varying covariate cannot be used in the second-stage analysis because of the error-in-variables issue.
4 When the covariates considered are time-varying, two-stage analysis is not appropriate; the mixed effects modeling or GEE approach can be used instead.
5 Two-stage analysis can be applied to discrete responses (binary or count data); however, mixed effects modeling or the GEE approach is more flexible.
6 Although the two-stage approach can be used to make inference on many quantities of interest, information is lost in the two-stage reduction. Therefore, the mixed model approach should be used whenever possible.
2 Linear mixed models for normal longitudinal data
• What is a linear mixed model?
1 Random intercept model
2 Random intercept and slope model
• Choose a variance matrix of the data
• Analyze Framingham data using linear mixed models
• GEE for mixed models, missing data issue
What is a linear mixed (effects) model?
A linear mixed model extends traditional linear regression to longitudinal data by incorporating both fixed and random effects. The fixed effects represent population-level covariate effects, while the random effects account for subject-specific deviations; the random effects also induce the correlation among repeated measures on the same subject.
Fixed effects are covariate effects that are common to all subjects in the population, and they are usually the effects of primary interest. For example, in the standard regression model y = α + xβ + ε, the regression coefficients α and β are fixed effects.

Random effects are subject-specific (deviations of) covariate effects. They vary from subject to subject, and they are random and unobservable because each subject is viewed as a random draw from a larger population.
[Figure: subject-specific outcome trajectories over time illustrating a random intercept]

I. Random intercept model:
    y_ij = β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + b_i + ε_ij,
where the β's are fixed effects of interest, b_i ∼ N(0, σ_b²) are random effects, and ε_ij ∼ N(0, σ_ε²) are independent (measurement) errors.
Interpretation of the model components:
1 β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp = population-average response at time t_ij.
2 β_k: average increase in y associated with a one-unit increase in x_k, the kth covariate, while the others are held fixed.
3 β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + b_i = true response for subject i at t_ij.
4 β_0 + b_i is the intercept for subject i ⟹ b_i = deviation of subject i's intercept from the population intercept β_0.
5 σ_b² = between-subject variance; σ_ε² = within-subject variance.
6 Total variance of y: Var(y_ij) = σ_b² + σ_ε², constant over time.
7 Correlation between y_ij and y_ij′: corr(y_ij, y_ij′) = σ_b²/(σ_b² + σ_ε²) = ρ.
8 The correlation is constant over time and positive.
By treating b_i as a random variable, we can draw inferences about the entire population from which the sample originates; treating b_i as a fixed parameter would limit conclusions to the specific study sample.

Moreover, in longitudinal studies n_i is typically small and the number of b_i's grows in proportion to the number of subjects (and hence to the total number of data points). Consequently, if the b_i were treated as fixed parameters, standard properties of the parameter estimates, such as consistency, would no longer be guaranteed.
When there are no x covariates, the random-intercept-only model reduces to y_ij = β_0 + β_1 t_ij + b_i + ε_ij.
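A minimal Proc Mixed sketch for fitting a random intercept model (the data set name mydata and variable names y, t, x, id are placeholders):

    proc mixed data=mydata;
      class id;
      model y = t x / s;              /* fixed effects beta_0, ..., beta_p */
      random intercept / subject=id;  /* random intercept b_i, variance sigma_b^2 */
    run;                              /* residual variance estimates sigma_eps^2 */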
II Random intercept and slope model:
[Figure: subject-specific trajectories over time illustrating a random intercept and slope]

The random intercept and slope model assumes
    y_ij = β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + b_i0 + b_i1 t_ij + ε_ij,
where the β_k are fixed effects as before and the random effects (b_i0, b_i1) are assumed to have a bivariate normal distribution:

    (b_i0, b_i1)^T ∼ N(0, D),   D = [ σ_00  σ_01 ]
                                    [ σ_01  σ_11 ].

Usually no constraint is imposed on the elements of D (i.e., D is unstructured); ε_ij ∼ N(0, σ_ε²) are independent errors.
Interpretation of the model components:
1 The mean structure is the same as before: E(y_ij) = β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp.
2 β_k: average increase in y associated with a one-unit increase in x_k, the kth covariate, while the others are held fixed.
3 β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + b_i0 + b_i1 t_ij = true response for subject i at t_ij.
4 β_0 + b_i0 = the intercept for subject i ⟹ b_i0 = deviation of subject i's intercept from the population intercept β_0.
5 β_1 + b_i1 = the slope for subject i ⟹ b_i1 = deviation of subject i's slope from the population slope β_1.
6 Var(b_i0 + b_i1 t_ij) = σ_00 + 2 t_ij σ_01 + t_ij² σ_11 = between-subject variance (varying over time).
7 σ_ε² = within-subject variance.
8 Total variance of y: Var(y_ij) = σ_00 + 2 t_ij σ_01 + t_ij² σ_11 + σ_ε², not constant over time.
9 Correlation between y_ij and y_ij′: not constant over time.

When there are no x covariates, the random intercept and slope model reduces to y_ij = β_0 + β_1 t_ij + b_i0 + b_i1 t_ij + ε_ij.
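The corresponding Proc Mixed sketch for the random intercept and slope model (placeholder data set and variable names again):

    proc mixed data=mydata;
      class id;
      model y = t x / s;                        /* fixed effects */
      random intercept t / subject=id type=un;  /* (b_i0, b_i1) with unstructured D */
    run;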
III. Correlated error models:
• A correlated error model assumes y_ij = β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + e_ij, where the e_ij are correlated (within the same subject) normal errors (and may contain random effects and ε_ij).
1 Compound symmetric (exchangeable) variance matrix. Here −1 < ρ < 1. A random intercept model is almost equivalent to this model.
2 AR(1) variance matrix. Here −1 < ρ < 1. It assumes that the error (e_i1, e_i2, e_i3)^T is an autoregressive process of order 1. This structure is more appropriate if y is measured at equally spaced time points.
3 Spatial power variance matrix. Here 0 < ρ < 1. This error structure reduces to AR(1) when y is measured at equally spaced time points, and is more appropriate when the measurement times are unequally spaced.
4 Unstructured variance matrix. Here no restriction is imposed on the σ_jk. This structure may be used only if the (potential) time points are the same for all subjects and their number is relatively small.
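For a subject with three measurement times, these standard structures take the following forms (a sketch; σ² denotes the common error variance):

    Compound symmetric:        AR(1):
        [ 1  ρ  ρ ]                [ 1   ρ   ρ² ]
     σ² [ ρ  1  ρ ]             σ² [ ρ   1   ρ  ]
        [ ρ  ρ  1 ]                [ ρ²  ρ   1  ]

    Spatial power:  cov(e_ij, e_ik) = σ² ρ^|t_ij − t_ik|, equal to AR(1) when the t_ij are equally spaced.
    Unstructured:   cov(e_ij, e_ik) = σ_jk with no restrictions.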
IV General linear mixed models
General model 1: fixed effects + random effects + pure measurement error:
For example, y_ij = β_0 + β_1 t_ij + β_2 x_ij + b_i0 + b_i1 t_ij + ε_ij, where ε_ij is the pure measurement error (independent errors with a constant variance).
Software to implement the above model: Proc Mixed in SAS:
    proc mixed data= method=;
      class id;
      model y = t x / s;                        /* specify t, x as fixed effects */
      random intercept t / subject=id type=un;  /* covariance matrix for random effects */
      repeated / subject=id type=vc;            /* variance structure for error */
    run;
General model 2: fixed effects + random effects + stochastic process
For example, y_ij = β_0 + β_1 t_ij + β_2 x_ij + b_i0 + b_i1 t_ij + U_i(t_ij), where U_i(t) is a stochastic process with an AR(1), spatial power, or other variance structure.
Software to implement the above model: Proc Mixed in SAS:
    proc mixed data= method=;
      class id;
      model y = t x / s;                        /* specify t, x as fixed effects */
      random intercept t / subject=id type=un;  /* covariance matrix for random effects */
      repeated / subject=id type=sp(pow)(t);    /* variance structure for U_i(t) */
    run;
When the time points are evenly spaced, the AR(1) variance structure for U_i(t) can be specified with the statement repeated cat_t / subject=id type=ar(1); where cat_t is the class (categorical) version of the time variable.
General model 3: fixed effects + random effects + stochastic process + pure measurement error
For example, y_ij = β_0 + β_1 t_ij + β_2 x_ij + b_i0 + b_i1 t_ij + U_i(t_ij) + ε_ij, where U_i(t) is a stochastic process with some variance structure (e.g., a spatial power variance structure) and ε_ij is the pure measurement error.
Software to implement the above model: Proc Mixed in SAS:
    proc mixed data= method=;
      class id;
      model y = t x / s;                            /* specify t, x as fixed effects */
      random intercept t / subject=id type=un;      /* covariance matrix for random effects */
      repeated / subject=id type=sp(pow)(t) local;  /* U_i(t) plus measurement error */
    run;
If the time points are equally spaced, we can use type=ar(1) in the repeated statement when assuming AR(1) for U_i(t): repeated cat_t / subject=id type=ar(1) local;
General model 4: fixed effects + unstructured error
For example, y_ij = β_0 + β_1 t_ij + β_2 x_ij + e_ij, where e_ij is the error with an unstructured variance matrix.
Software to implement the above model: Proc Mixed in SAS:
    proc mixed data= method=;
      class id cat_t;
      model y = t x / s;                    /* specify t, x as fixed effects */
      repeated cat_t / subject=id type=un;  /* unstructured error variance matrix */
    run;
Note that no random statement is used in this model. When the number of distinct time points is large, there will be too many variance parameters to estimate.
Estimation and inference for linear mixed models
Let θ consist of all parameters in the distributions of the random effects (e.g., b_i0, b_i1) and the errors (ε_ij).
We want to make inference on β and θ. There are two approaches:
1 Maximum likelihood (ML): maximize ℓ(β, θ; y) jointly with respect to β and θ to obtain their MLEs.
2 Restricted maximum likelihood (REML):
(a) Get the REML estimate of θ from the REML likelihood ℓ_REML(θ; y), which takes into account the estimation of β. This leads to a less biased θ̂_REML. For example, in a linear regression model,
    σ̂²_REML = (Residual Sum of Squares)/(n − p − 1).
(b) Given θ̂_REML, estimate β with θ fixed at θ̂_REML (e.g., by generalized least squares).
• After we fit a linear mixed model such as y_ij = β_0 + β_1 t_ij + β_2 x_ij2 + ... + β_p x_ijp + b_i0 + b_i1 t_ij + ε_ij,
SAS will output a test for each β k , including the estimate, SE, p-value (for testing H 0 : β k = 0), etc.
To test a contrast among the β_k in SAS, we can use the estimate statement in Proc Mixed, which outputs the estimate of the contrast, its SE, and the p-value for testing whether the contrast equals zero, as sketched below.
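For instance, in a model with fixed effects t and x (a hypothetical contrast), one could add

    estimate 'effect of t minus effect of x' t 1 x -1;

inside Proc Mixed to obtain the estimate, SE, and p-value for the contrast β_t − β_x.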
Programs 2 and 3 for Framingham data.
2.3 How to choose random effects and the error structure?
1 Use graphical representation to identify possible random effects.
2 Use biological knowledge to identify possible error structure.
3 Use information criteria to choose a final model:
AIC = −2{ℓ(β̂, θ̂; y) − q}, where q = the number of parameters in θ. Smaller AIC is preferred.
BIC = −2{ℓ(β̂, θ̂; y) − 0.5 q log(m)}, where m = the number of subjects. Again, smaller BIC is preferred.
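As a quick arithmetic check on how q enters: in the Framingham output shown later, the random intercept and slope model has −2 Res Log Likelihood = 9960.1 with q = 4 covariance parameters (σ_00, σ_01, σ_11, σ_ε²), so AIC = 9960.1 + 2 × 4 = 9968.1 and BIC = 9960.1 + 4 × log(200) ≈ 9981.3, matching the reported values.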
Analyze Framingham data using linear mixed models
• Model to address objective 1: How does cholesterol level change over time on average as people get older?
The basic model suggested by the data is
    y_ij = b*_i0 + b*_i1 t_ij + ε_ij,   (2.1)
where y_ij is the jth cholesterol measurement from subject i, t_ij is the number of years since entry into the study, (b*_i0, b*_i1) are random with a bivariate normal distribution

    (b*_i0, b*_i1)^T ∼ N((β_0, β_1)^T, D),   D = [ σ_00  σ_01 ]
                                                 [ σ_01  σ_11 ],

and the ε_ij are independent errors distributed as N(0, σ_ε²).
1 The model says each individual's true cholesterol level changes linearly over time, with a subject-specific intercept and slope; these are random because each subject is a random draw from the population.
2 b*_i0 is the true but unobserved cholesterol level of individual i at baseline (t = 0), and b*_i1 is the change rate of the true cholesterol level of individual i.
3 β_0 is the average baseline cholesterol level in the population, and β_1 is the average change rate of cholesterol as people age; that is, β_1 is the longitudinal (aging) effect.
4 σ_00 is the variance of the true baseline cholesterol level b*_i0; σ_11 is the variance of the change rate b*_i1 of the true cholesterol level; σ_01 is the covariance between b*_i0 and b*_i1.
• The random variables b*_i0 and b*_i1 can be re-written as b*_i0 = β_0 + b_i0 and b*_i1 = β_1 + b_i1, where (b_i0, b_i1)^T ∼ N(0, D).
• Model (2.1) can then be re-expressed as
    y_ij = β_0 + β_1 t_ij + b_i0 + b_i1 t_ij + ε_ij.   (2.2)
Therefore, β_0, β_1 are fixed effects and b_i0, b_i1 are random effects.
The SAS program for fitting the mixed model without covariates to the Framingham data uses Proc Mixed on the data set cholst, with newid as a class variable; the model statement specifies cholst as the response and time as the fixed effect; a random statement specifies a random intercept and time slope with an unstructured covariance matrix, subject = newid; and a repeated statement specifies a variance components (vc) error structure.
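A minimal reconstruction of this program from the description above (the g option, which prints the estimated G matrix shown in the output below, is assumed):

    title "Framingham data: mixed model without covariates";
    proc mixed data=cholst method=reml;
      class newid;
      model cholst = time / s;
      random intercept time / subject=newid type=un g;
      repeated / subject=newid type=vc;
    run;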
The following is the output from the above program:
Framingham data: mixed model without covariates

The Mixed Procedure — Model Information: dependent variable cholst; covariance structures: Unstructured, Variance Components; subject effects: newid; estimation method: REML; fixed effects SE method: model-based; degrees of freedom method: containment.

Class Level Information: newid has 200 levels. Total observations: 1044; observations not used: 0.

(The iteration history, Estimated G Matrix, and Covariance Parameter Estimates appear in the listing; values omitted here.)

Fit statistics:

    -2 Res Log Likelihood       9960.1
    AIC  (smaller is better)    9968.1
    AICC (smaller is better)    9968.2
    BIC  (smaller is better)    9981.3

Solution for Fixed Effects:

    Effect      Estimate   Std Error   DF    t Value   Pr > |t|
    Intercept   220.57     2.7172      198   81.18     < .0001

Type 3 Tests of Fixed Effects:

    Effect   Num DF   Den DF   F Value   Pr > F
    time     1        191      136.83    < .0001
Analyze seizure data using the GEE approach
• Marginal model for the seizure counts: let t_ij be the length (in weeks) of interval j for subject i and μ_ij be the mean seizure count. Then
    log(μ_ij) = log(t_ij) + β_0 + β_1 I(j > 1) + β_2 trt_i + β_3 trt_i I(j > 1).
Note that log(t_ij) is commonly referred to as an offset in the model.
• Interpretation of the β's — log of seizure rate λ by group and period:

    Group                 Before RAND (j = 1)   After RAND (j > 1)
    Placebo (trt = 0)     β_0                   β_0 + β_1
    Progabide (trt = 1)   β_0 + β_2             β_0 + β_1 + β_2 + β_3

Therefore, β_1 = time and placebo effect, β_2 = difference in seizure rates at baseline between the two groups, and β_3 = the treatment effect of interest (accounting for time and placebo effects).
• If randomization is taken into account (β_2 = 0), we can consider the reduced model
    log(μ_ij) = log(t_ij) + β_0 + β_1 I(j > 1) + β_2 trt_i I(j > 1).
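A sketch of how this reduced model could be fitted with Proc Genmod (the title and exact statements are assumed, following the pattern of the programs below):

    title "Model 2: reduced model assuming balanced randomization";
    proc genmod data=seizure;
      class id;
      model seize = assign assign*trt
            / dist=poisson link=log offset=logtime;
      repeated subject=id / type=exch corrw;
    run;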
• See the SAS program seize gee.sas and its output seize gee.lst for details.
First part of seize gee.sas:
    options ls= ps= nodate;
/* Proc Genmod to fit population average (marginal) */
/* model using GEE approach for the epileptic seizure */
    data seizure;
      infile "seize.dat";
      input id seize visit trt age;
      interval = 2;                    /* each follow-up interval is 2 weeks */
      if visit = 0 then interval = 8;  /* baseline count covers 8 weeks */
      logtime = log(interval);         /* offset variable */
      assign = (visit > 0);            /* post-randomization indicator I(j > 1) */
    run;
    title "Model 1: overall effect of the treatment";
    proc genmod data=seizure;
      class id;
      model seize = assign trt assign*trt
            / dist=poisson link=log offset=logtime;
      repeated subject=id / type=exch corrw;
    run;
SAS first uses the independence working correlation to obtain initial regression coefficient estimates.
Analysis Of Initial Parameter Estimates (independence working model):

    Parameter   DF   Estimate   Std Error   95% Confidence Limits   Chi-Square
    Intercept   1    1.3476     0.0341      1.2809    1.4144        1565.44
    (remaining rows omitted)

Analysis Of GEE Parameter Estimates (empirical standard errors):

    Parameter   Estimate   Std Error   95% Confidence Limits   Z
    Intercept   1.3476     0.1574      1.0392    1.6560         8.56
    (remaining rows omitted)

Note that the empirical SE (0.1574) is much larger than the initial independence-based SE (0.0341), reflecting the positive within-subject correlation.
A program to adjust for age:

    title "Model 3: adjusting for other covariates (age)";
    proc genmod data=seizure;
      class id;
      model seize = assign trt assign*trt age
            / dist=poisson link=log offset=logtime scale=pearson;
      repeated subject=id / type=exch corrw;
    run;
Output of the program adjusting for age:

Model 3: adjusting for other covariates. Algorithm converged. (Working correlation matrix omitted.)

Analysis Of GEE Parameter Estimates (empirical standard errors):

    Parameter   Estimate   Std Error   95% Confidence Limits   Z
    Intercept   2.2601     0.4330      1.4113    3.1088         5.22
    (remaining rows omitted)