What Is Longitudinal Data?
Longitudinal data is prevalent across many fields, including medicine, public health, education, business, economics, psychology, and biology. In economics, this type of data is referred to as panel data; other related terms include repeated measures and time series.
A basic longitudinal study takes a simple random sample of subjects and measures the same quantity repeatedly over time on each subject. The timing of these measurements matters, because readings taken close together in time are likely to be more similar than those taken further apart. For instance, in a weight loss study, a subject's weight may be recorded weekly, producing a longitudinal data set. Other examples include tracking patients' blood pressure at clinic visits or evaluating elementary school children's math achievement at the end of each school year over a six-year period.
Longitudinal data are inherently multivariate, a term with two meanings: one in statistics and one in applied fields. In statistics, multivariate means there are multiple responses per subject, while in applied fields it means that several variables are used in the analysis. For instance, multiple linear regression is univariate in the statistical sense but multivariate in the applied sense. Longitudinal analyses are multivariate under both definitions, but they differ from traditional multivariate data, where the same set of measurements is collected from every subject. In traditional settings, multiple measurements, such as height, weight, and head circumference, are analyzed together as a single multivariate outcome. Longitudinal data, by contrast, allow the number of measurements to vary from subject to subject.
Longitudinal data have a hierarchical structure, with individual observations nested within subjects. An observation is a single measurement on a subject, and a subject comprises all the measurements taken on that entity. While "subject" can refer to various entities such as cases or experimental units, we often say "person" since much of our data concerns human beings. If the subjects are animals or countries, we refer to them as such.
Longitudinal data consist of multiple observations on each subject, ordered by time. In longitudinal surveys of humans, calendar time, in months or years, is the key dimension separating these observations.
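The nesting of observations within subjects can be sketched in a few lines of Python. This is only an illustration: the subject identifiers, weeks, and weights are made-up values, and `group_by_subject` is a hypothetical helper, not something defined in this book. Note that the subjects contribute different numbers of observations.

```python
from collections import defaultdict

# Hypothetical long-format records: (subject_id, week, weight_kg).
# All values are invented for illustration.
records = [
    ("s1", 0, 82.0), ("s1", 1, 81.4), ("s1", 2, 80.9),
    ("s2", 0, 95.5), ("s2", 2, 94.1),
    ("s3", 0, 70.2), ("s3", 1, 70.0), ("s3", 2, 69.1), ("s3", 3, 68.8),
]

def group_by_subject(rows):
    """Nest observations within subjects, each profile ordered by time."""
    profiles = defaultdict(list)
    for subject, time, value in rows:
        profiles[subject].append((time, value))
    return {s: sorted(obs) for s, obs in profiles.items()}

profiles = group_by_subject(records)
# Number of observations contributed by each subject (unbalanced here).
counts = {s: len(obs) for s, obs in profiles.items()}
print(counts)
```

Keeping one row per observation, rather than one row per subject, is a common way to store data in which subjects are measured at different times or different numbers of times.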
Related Data Types
Long-term medical studies often have years between observations, while short-term studies can have intervals of seconds or minutes. Instead of time, the ordering dimension can be another linear metric such as distance, as when repeated measures are taken along a plant root at positions measured in millimeters from the tip. It can also be trial number, as in experiments where subjects, such as rats or humans, are timed on repeated attempts at a task. Some analyses work better when time is replaced by its natural logarithm. As long as the metric is linearly ordered, whether it be time, log(time), centimeters, or trials, we treat the data as longitudinal and refer to the dimension separating measurements as time.
Measures can be taken at various times, which may include all real numbers, only positive numbers, a specific range (such as from the start to the end of a study), or a subset of integers, particularly when data arises from consecutive trials or when time is measured in whole days or months since a particular event.
Longitudinal data are a specific type of repeated measures data, in which multiple measurements are taken on the same subject or experimental unit over time. Repeated measures can also be spread out in space rather than time, as when tracking groundwater contamination across a region or analyzing a 3D MRI of a rat brain. Repeated measures data can also be clustered within larger units, as in household surveys that collect information from all members of a household, or in educational research where children's test scores are clustered within classrooms. In these cases, the classroom, teacher, or school serves as the experimental unit for the repeated measures.
These types of repeated measures data differ in important ways. Longitudinal data can be arranged in a linear sequence, whereas spatial and three-dimensional repeated measures have no definitive linear order, although there are specific distances and directions between measurements. Clustered data are unordered, and the subunits stand in symmetrical relationships to one another. Each of these data types has its own characteristics and merits separate treatment.
Time series are another familiar type of repeated measures data, consisting of observations collected over an extended period on a single unit. Notable examples include daily sunspot counts and the annual lynx trapping figures in Canada from 1821 to 1934. Longitudinal data, in contrast, consist of many time series, one for each subject in a sample, and typically have far fewer repeated measurements per series than a traditional time series. While some subjects may have only two observations, meaningful longitudinal analysis usually needs three to four observations per subject, with typical averages ranging from three to ten or more.
Longitudinal repeated measures data can be complex, especially when several variables are measured at each time, such as heart rate before, during, and after exercise. For example, recording a participant's resting heart rate, peak heart rate, and recovery rate on each of ten days yields either 30 repeated measures or 10 longitudinally repeated trivariate observations. Clinical trials often produce such multivariate longitudinal data, in which multiple measurements are taken repeatedly. This book focuses primarily on univariate longitudinal data, where a single measurement is taken repeatedly over time. The trivariate heart rate example, for instance, can be reduced to a single measure such as peak rate minus resting rate, which can then be analyzed directly. Chapter 13 addresses bivariate longitudinal data, and Chapter 11 introduces repeated binary or count longitudinal data.
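As a small illustration of reducing a trivariate observation to a univariate one, the sketch below computes peak rate minus resting rate for each day. The heart rates are invented numbers and `peak_minus_resting` is a hypothetical helper name.

```python
# Hypothetical daily observations: (resting, peak, recovery) heart rates
# in beats per minute. The values are invented for illustration.
days = [(62, 151, 95), (60, 148, 92), (61, 150, 90)]

def peak_minus_resting(observations):
    """Reduce each trivariate observation to one number: peak minus resting."""
    return [peak - resting for resting, peak, _recovery in observations]

univariate = peak_minus_resting(days)
print(univariate)  # [89, 88, 89]
```

Three days of trivariate data give nine repeated measures in total, or three univariate longitudinal observations after the reduction.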
Inferences from Longitudinal Data
The Population Mean
With univariate observations \( y_i \) on a simple random sample of size \( n \) from a population under study, we may use the sample mean \( \bar{y} = \sum_{i=1}^{n} y_i / n \) of the observations to estimate the population mean \( \mu \), which is a scalar.
Longitudinal observations give us multiple data points on subject \( i \) over time, so the population mean is no longer a single value but a function \( \mu(t) \) of time. This function may be constant, trend upward or downward, or follow a more complex, possibly cyclical, pattern. Given observations \( Y_{ij} \) on a random sample of subjects \( i = 1, \ldots, n \) at a specific time \( t_j \), we can use the sample average \( \bar{Y}_j = \sum_{i=1}^{n} Y_{ij} / n \) of these observations to estimate \( \mu_j = \mu(t_j) \), the population mean at time \( t_j \).
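The time-specific sample average can be sketched as a column mean. The example assumes balanced data, meaning every subject is observed at every time, and uses invented responses; `mean_profile` is a hypothetical helper.

```python
def mean_profile(Y):
    """Sample mean at each time: Y[i][j] is the response of subject i at time j.

    Assumes balanced data (every subject observed at every time).
    """
    n = len(Y)
    n_times = len(Y[0])
    return [sum(row[j] for row in Y) / n for j in range(n_times)]

# Invented responses for n = 3 subjects at 3 times.
Y = [
    [10.0, 12.0, 15.0],
    [8.0, 11.0, 13.0],
    [12.0, 13.0, 17.0],
]
ybar = mean_profile(Y)
print(ybar)  # [10.0, 12.0, 15.0]
```

The result is a vector with one estimated mean per time point rather than a single scalar.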
Individual Variability
In a univariate simple random sample, the population variance σ² is a single scalar, estimated by the sample variance s² = Σ(yi - ȳ)² / (n - 1). The population standard deviation σ indicates how far observations typically fall from the population mean. For normally distributed data, approximately 68% of observations fall within one standard deviation of the mean and about 95% within two standard deviations; these percentages hold approximately whenever the distribution is roughly bell-shaped. The variance also determines the precision of our estimate of the population mean: the standard deviation of the sample mean is σ / √n.
In longitudinal data, the population variance is not a single value; there is a possibly different variance at each time \( t_j \), denoted \( \sigma_{jj} \), with corresponding standard deviation \( \sqrt{\sigma_{jj}} \). The population variance \( \sigma_{jj} \) describes the variability of the responses \( Y_{ij} \) at time \( t_j \). Using two subscripts on \( \sigma_{jj} \), with no superscript 2, prepares the notation for the covariance parameters introduced next.
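A minimal sketch of the univariate calculations follows: the sample variance with divisor n - 1, and the estimated standard deviation of the sample mean, s / √n. The data are invented and the function names are hypothetical.

```python
import math

def sample_variance(y):
    """Sample variance with divisor n - 1."""
    n = len(y)
    ybar = sum(y) / n
    return sum((v - ybar) ** 2 for v in y) / (n - 1)

def standard_error(y):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return math.sqrt(sample_variance(y) / len(y))

# Invented observations: mean 5, sum of squared deviations 32.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(y)   # 32 / 7
se = standard_error(y)    # sqrt((32 / 7) / 8)
print(s2, se)
```

Note that the divisor n - 1 belongs to the variance estimate, while the standard deviation of the mean divides by √n.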
Covariance and Correlation
The population covariance \( \sigma_{jl} = \sigma(t_j, t_l) \) between observations at two time points \( t_j \) and \( t_l \) measures how strongly those observations vary together. The covariance is a function of both time points rather than just one. The absolute value of \( \sigma_{jl} \) is bounded by the product of the standard deviations at \( t_j \) and \( t_l \): \( |\sigma_{jl}| \le \sigma_{jj}^{1/2} \sigma_{ll}^{1/2} \).
The correlation \( \rho_{jl} \) between observations at times \( t_j \) and \( t_l \) is the covariance \( \sigma_{jl} \) divided by the product of the square roots of the variances \( \sigma_{jj} \) and \( \sigma_{ll} \), that is, \( \rho_{jl} = \sigma_{jl} / (\sigma_{jj}^{1/2} \sigma_{ll}^{1/2}) \). This scaling keeps the correlation within the range -1 to 1.
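The conversion from covariances to correlations can be sketched as below. The 2 x 2 covariance matrix is an invented example and `cov_to_corr` is a hypothetical helper.

```python
import math

def cov_to_corr(sigma):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    k = len(sigma)
    sd = [math.sqrt(sigma[j][j]) for j in range(k)]
    return [[sigma[j][l] / (sd[j] * sd[l]) for l in range(k)] for j in range(k)]

# Invented covariance matrix: variances 4 and 9, covariance 3.
sigma = [[4.0, 3.0],
         [3.0, 9.0]]
rho = cov_to_corr(sigma)
print(rho[0][1])  # 3 / (2 * 3) = 0.5
```

The diagonal of the result is always 1, since each variance is divided by itself.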
A correlation is usually easier to interpret than a covariance because of the restricted range.
A positive correlation \( \rho_{jl} \) tells us that if observation \( Y_{ij} \) is greater than its population mean \( \mu_j \), then it is more likely than not that observation \( Y_{il} \) is greater than its population mean \( \mu_l \). The farther \( Y_{ij} \) is above its mean, the farther we expect \( Y_{il} \) to be above its mean. When \( Y_{ij} \) is at its population mean, \( Y_{il} \) is equally likely to be above or below its own population mean.
When the correlation between \( Y_{ij} \) and \( Y_{il} \) is zero, knowing the value of \( Y_{ij} \) tells us nothing about whether \( Y_{il} \) is above or below its mean; both are equally likely. Conversely, when the population correlation \( \rho_{jl} \) between observations at times \( t_j \) and \( t_l \) is very high and positive, we expect \( Y_{il} \) to be almost the same number of population standard deviations above \( \mu_l \) as \( Y_{ij} \) is above \( \mu_j \).
Let \( E[Y_{il} \mid Y_{ij}] \) be the expected value of observation \( Y_{il} \) at time \( t_l \) on subject \( i \) given that we know the value of the observation \( Y_{ij} \) at time \( t_j \). Knowing \( Y_{ij} \) changes our best guess, or expected value, of \( Y_{il} \) as long as the correlation is non-zero. In general, assuming multivariate normal data,
\( \frac{E[Y_{il} \mid Y_{ij}] - \mu_l}{\sigma_{ll}^{1/2}} = \rho_{jl} \, \frac{Y_{ij} - \mu_j}{\sigma_{jj}^{1/2}} \).
The factor multiplying \( \rho_{jl} \) on the right-hand side is the number of standard deviations that \( Y_{ij} \) is above its mean. The left-hand side is the expected number of standard deviations that \( Y_{il} \) is above its mean. Unless \( \rho_{jl} = 1 \), we do not expect \( Y_{il} \) to be the same number of standard deviations above its mean as \( Y_{ij} \).
Instead, when \( Y_{ij} \) is above its mean, the expected number of standard deviations for \( Y_{il} \) is shrunk toward zero by a factor equal to the correlation \( \rho_{jl} \). When \( \rho_{jl} \) is negative, the same formula holds, but we expect \( Y_{il} \) to be below its mean when \( Y_{ij} \) is above its mean.
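The shrinkage toward the mean can be illustrated numerically. The sketch below evaluates the conditional expectation of one observation given another under bivariate normality, using the standard result E[Y_il | Y_ij] = μ_l + ρ_jl σ_ll^{1/2} (Y_ij - μ_j) / σ_jj^{1/2}. All parameter values are invented and treated as known, and `conditional_mean` is a hypothetical helper.

```python
import math

def conditional_mean(y_j, mu_j, mu_l, var_j, var_l, rho):
    """E[Y_l | Y_j = y_j] under bivariate normality with known parameters."""
    z_j = (y_j - mu_j) / math.sqrt(var_j)  # SDs that y_j lies above its mean
    return mu_l + rho * math.sqrt(var_l) * z_j

# Invented parameters: means 100 and 110, variances 25 and 100, correlation 0.8.
expected = conditional_mean(y_j=110.0, mu_j=100.0, mu_l=110.0,
                            var_j=25.0, var_l=100.0, rho=0.8)
z_l = (expected - 110.0) / math.sqrt(100.0)
print(expected, z_l)
```

Here the first observation is 2 standard deviations above its mean, while the second is expected to be only 0.8 x 2 = 1.6 standard deviations above its own mean.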
The population profile is a plot of the population mean \( \mu_j \) against time \( t_j \), while a subject profile is a plot of the observations \( Y_{ij} \) for subject \( i \) against times \( t_j \). Assuming multivariate normality, the subject profiles vary around the population mean profile in a manner determined by the population variances and covariances. This is illustrated further in Chapter 8.
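We calculate the sample covariance \( s_{jl} \) between observations \( Y_{ij} \) and \( Y_{il} \) taken on subjects \( i = 1, \ldots, n \) at times \( t_j \) and \( t_l \) by \( s_{jl} = (n-1)^{-1} \sum_{i=1}^{n} (Y_{ij} - \bar{Y}_j)(Y_{il} - \bar{Y}_l) \).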
The sample variance at time \( t_j \) is \( s_{jj} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_{ij} - \bar{Y}_{j})^2 \), which is the formula for \( s_{jl} \) with \( l = j \). The sample correlation \( \hat{\rho}_{jl} = s_{jl} / (s_{jj} s_{ll})^{1/2} \) estimates \( \rho_{jl} \); the circumflex denotes an estimate of an unknown parameter.
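The sample covariance, variance, and correlation can be sketched as follows for balanced data. The responses are invented, and `sample_cov` and `sample_corr` are hypothetical helper names; the correlation is computed as the sample covariance divided by the square root of the product of the two sample variances.

```python
import math

def sample_cov(Y, j, l):
    """Sample covariance between times j and l; Y[i][j] is subject i at time j."""
    n = len(Y)
    ybar_j = sum(row[j] for row in Y) / n
    ybar_l = sum(row[l] for row in Y) / n
    return sum((row[j] - ybar_j) * (row[l] - ybar_l) for row in Y) / (n - 1)

def sample_corr(Y, j, l):
    """Sample correlation: covariance scaled by the two standard deviations."""
    return sample_cov(Y, j, l) / math.sqrt(sample_cov(Y, j, j) * sample_cov(Y, l, l))

# Invented responses for 3 subjects at 2 times.
Y = [[1.0, 2.0],
     [2.0, 4.0],
     [3.0, 9.0]]
s_01 = sample_cov(Y, 0, 1)   # 3.5
s_00 = sample_cov(Y, 0, 0)   # 1.0, the sample variance at the first time
r_01 = sample_corr(Y, 0, 1)  # 3.5 / sqrt(1 * 13)
print(s_01, s_00, r_01)
```

Calling `sample_cov` with the same index twice returns the sample variance, which mirrors the remark that the variance formula is the covariance formula with equal subscripts.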
Covariates
A regression model is built from a dataset containing a univariate response \( y_i \) and a vector of covariates \( x_i \) on a random sample of size \( n \), with the aim of describing how the distribution of \( y_i \) varies with \( x_i \). Standard multiple linear regression assumes that the mean of \( y_i \) depends on \( x_i \) while the variance of \( y_i \) is constant across values of \( x_i \). When covariates are measured along with longitudinal responses, the covariates can affect the population mean in both simple and complex ways. Covariates may also affect the variance and correlation functions, although such models are less developed and remain the subject of ongoing research.
Predictions
Longitudinal data models describe the profiles of individual subjects, which makes it possible to predict future outcomes for new subjects. Given the first few observations on a new subject, we can in effect identify earlier subjects with similar profiles and use them to forecast the new subject's future observations. The population mean is the starting point for these predictions; the variances and covariances of the measurements, combined with the early observations, then determine a subject-specific adjustment to the population mean.
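One way to make the adjustment concrete is the conditional mean of a multivariate normal distribution. The sketch below predicts a third observation from a new subject's first two, assuming multivariate normality and treating the mean vector and covariance matrix as known; in practice these would be estimated. The numbers are invented and `predict_third` is a hypothetical helper, not a method defined in this text.

```python
def predict_third(y_obs, mu, sigma):
    """Conditional mean of Y_3 given (Y_1, Y_2) under multivariate normality.

    mu is the length-3 population mean and sigma the 3 x 3 covariance matrix,
    both assumed known for this illustration.
    """
    a, b = sigma[0][0], sigma[0][1]
    c, d = sigma[1][0], sigma[1][1]
    det = a * d - b * c
    # Deviations of the early observations from the population mean.
    r1 = y_obs[0] - mu[0]
    r2 = y_obs[1] - mu[1]
    # Solve the 2 x 2 system (covariance of Y_1, Y_2) * w = r.
    w1 = (d * r1 - b * r2) / det
    w2 = (-c * r1 + a * r2) / det
    # Population mean plus a subject-specific adjustment.
    return mu[2] + sigma[2][0] * w1 + sigma[2][1] * w2

mu = [10.0, 12.0, 14.0]
sigma = [[4.0, 2.0, 1.0],
         [2.0, 4.0, 2.0],
         [1.0, 2.0, 4.0]]
prediction = predict_third([12.0, 14.0], mu, sigma)
print(prediction)  # approximately 15
```

A new subject whose first two observations are each 2 units above the population mean is predicted to be about 1 unit above the mean at the third time, so the prediction is pulled part of the way back toward the population mean.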
Contrasting Longitudinal and Cross-Sectional Data
A cross-sectional study is the main alternative to a longitudinal design. It collects a univariate response \( y_i \) and covariates \( x_i \) on subjects \( i = 1, \ldots, n \), where \( n \) is the sample size. Such data are typically analyzed by regression, with \( x_i \) a vector of elements \( x_{ik} \), \( k = 1, \ldots, K \), whose first element \( x_{i1} = 1 \) corresponds to the intercept. The standard linear regression model is \( y_i = x_i \alpha + \delta_i \) (1.1), where \( \alpha = (\alpha_1, \ldots, \alpha_K) \) is a vector of unknown regression coefficients and \( \delta_i \) is a random error with mean 0 and variance \( \sigma^2 \). Observations are assumed independent given the unknown parameters \( \alpha \) and \( \sigma^2 \), and the errors are often assumed to be normally distributed, \( \delta_i \sim N(0, \sigma^2) \).
Under model (1.1), the expected value of a response \( y_i \) with covariates \( x_i \) is \( E[y_i \mid x_i] = x_i \alpha \), the average response of the subpopulation with covariate vector \( x_i \). The coefficient \( \alpha_l \) for \( l > 1 \) is the change in the population average when we move from subjects with covariate vector \( x_i \) to subjects with \( x^*_i \), where \( x^*_{il} = x_{il} + 1 \) and all other covariates are unchanged. That is, \( \alpha_l \) is the difference in average responses between the subpopulation with covariate vector \( x_i \) and the subpopulation with \( x^*_i \). It does not describe the change in an individual's response \( y_i \) that would result from changing that individual's covariate value from \( x_{il} \) to \( x_{il} + 1 \).
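The interpretation of a coefficient as a difference between subpopulation averages can be checked on a toy example. The sketch below fits a least squares line with an intercept and one binary covariate to invented data; `ols_simple` is a hypothetical helper implementing the usual closed-form simple regression estimates.

```python
def ols_simple(x, y):
    """Least squares intercept and slope for a single covariate."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Invented data: covariate 0 for one group of subjects, 1 for another.
x = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
intercept, slope = ols_simple(x, y)

mean_0 = sum(v for a, v in zip(x, y) if a == 0) / 3
mean_1 = sum(v for a, v in zip(x, y) if a == 1) / 3
print(intercept, slope, mean_1 - mean_0)
```

The fitted slope equals the difference between the two group averages. It compares two sets of subjects; nothing in the calculation follows any one subject as the covariate changes.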
In randomized trials, subjects are assigned by the investigator to either a control group (\( x_{ik} = 0 \)) or a treatment group (\( x_{ik} = 1 \)), rather than on the basis of their own characteristics. In this setting, \( \alpha_k \) does represent the expected change in an individual's response if that individual were assigned to the treatment group instead of the control group.
In a cross-sectional study of subjects of various ages, the age coefficient in a regression reflects differences among age groups rather than the effect of aging on individuals. For instance, in a dental study measuring total cavities, the age coefficient is typically positive, indicating that older subjects tend to have more cavities. However, following today's 27-year-olds for a decade would not necessarily reveal the increase in cavities that the coefficient suggests. The age coefficient instead represents the difference in average cavities between the 27-year-olds and the 37-year-olds at the time of the study. Cavity counts depend on the characteristics of each age cohort, and younger cohorts may accumulate cavities differently because of changes over time in fluoridation, diet, dental hygiene, and education.
Regression coefficients, other than the coefficient of a randomized treatment assignment, do not tell us how individuals in the population would respond to changes in their covariate values. This limitation carries over to longitudinal data models, with one crucial exception: time.
Collecting repeated measurements over time allows us to assess the effects of aging on patients directly. With the right statistical tools, we can analyze and summarize patterns of change both at the population level and for individual patients. For instance, when studying cavities, we may observe a general trend of increasing cavities among all adults, with older adults having more cavities than younger adults at both the start and the end of the study.
A longitudinal design offers several advantages, particularly when comparing two groups, such as men and women or treatment and control groups. A cross-sectional study captures only the current difference between the groups, whereas a longitudinal study lets us follow the trend over time in each group. This allows us to ask a sequence of questions about how the groups differ:
• Is the average response equal in the two groups at all times?
• If not, is the average response pattern over time the same in the two groups, apart from a constant level shift?
• If not, are the differences increasing or decreasing over time? and
• If not monotone, just how do the differences change over time?
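The questions above can be explored descriptively by comparing the mean profiles of the two groups at each time. The sketch below assumes balanced data and uses invented responses; `mean_profile` and `difference_profile` are hypothetical helpers, and this is a descriptive summary rather than a formal test.

```python
def mean_profile(Y):
    """Sample mean at each time; Y[i][j] is the response of subject i at time j."""
    n = len(Y)
    return [sum(row[j] for row in Y) / n for j in range(len(Y[0]))]

def difference_profile(group_a, group_b):
    """Difference in group means (a minus b) at each time."""
    return [a - b for a, b in zip(mean_profile(group_a), mean_profile(group_b))]

# Invented responses: 2 subjects per group, 3 times each.
treatment = [[10.0, 9.0, 7.0],
             [12.0, 11.0, 9.0]]
control = [[10.0, 10.0, 10.0],
           [12.0, 12.0, 12.0]]
diff = difference_profile(treatment, control)
print(diff)  # [0.0, -1.0, -3.0]
```

In this invented example the groups start with equal averages, the difference is not a constant shift, and it grows in magnitude over time.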
Benefits of Longitudinal Data Analysis
Efficiency and Cost
In Section 6.7, we show that longitudinal data collection can be more cost-effective than simple random sampling. Collecting data costs money, and it is typically cheaper to gather additional measurements from existing subjects than to recruit new ones. Multiple measurements on the same individual also provide information that single measurements cannot.
Prediction
Unlike cross-sectional studies, longitudinal data collection makes it possible to predict trends for new subjects from prior measurements. For instance, if longitudinal data indicate that most participants increase from their initial levels, it is reasonable to predict that a new entrant will also trend upward in subsequent measurements.
Time Trends
Longitudinal data are essential for analyzing individual trends over time. They allow us to determine the rate of change for individuals and to see whether those who begin with lower values increase at the same pace as those who start at higher levels.
Time Frames for Observing Longitudinal Data
In this section, we consider a number of familiar measurements and various time intervals between measurements. These are listed in Table 1.1. We
Table 1.1
Measurement     Interval                  Comment
Human height    Minutes through days      No
                Weeks                     Perhaps newborns?
                Year(s)                   For children/adolescents
                Decades                   Not sure: maybe for adults/seniors?
Human weight    Minutes through hour(s)   No
                Days                      Maybe not, depending on accuracy of scale
                Week(s)                   Yes, in weight-loss program
                Months                    Certainly (kids)
Wage/salary