History of Longitudinal Analysis and its Progress
Many researchers continue to apply incorrect statistical methods when analyzing longitudinal data, often overlooking its unique characteristics. To improve their analyses, they can adopt advanced models specifically designed for longitudinal data, but must first verify, evaluate, and modify these methods accordingly. In health and aging research, for instance, accurately assessing changes in health status is crucial; using inappropriate methods can lead to significant bias in parameter estimates and predictions. Therefore, employing advanced models and methods in these fields is vital for reliable outcomes.
Discussions on the theory of random effects and growth date back to the nineteenth century, with notable contributions from Gompertz (1825); Ware and Liang (1996) trace this history. A significant milestone occurred in 1918, when Fisher introduced the first repeated measures analysis through his influential article on analysis of variance (ANOVA).
Fisher's introduction of variance-component models and the concept of “intraclass correlation” marked a significant advancement in statistical analysis. His work laid the groundwork for mixed modeling, leading to developments such as the split-plot design and multilevel ANOVA (Yates, 1935; Jackson, 1939). For many years, these variance decomposition methods were the primary tools for analyzing repeated measurements, providing a foundation for modern mixed modeling techniques. Additionally, early mathematical formulations of trajectories emerged to study patterns of change over time in biological and social research (Wishart, 1938; Baker, 1954; Rao, 1958; Bollen and Curran, 2006). However, until the early 1980s, the analysis of longitudinal data remained largely confined to traditional repeated measures analysis in biomedical contexts.
Traditional approaches to repeated measures analysis have significant limitations, prompting concerns about accurately measuring and analyzing patterns of change over time (Singer and Willett, 2003). Over the past three decades, longitudinal data analysis has advanced significantly, driven by developments in mixed-effects modeling, multilevel analysis, and individual growth perspectives. These statistical advancements have been complemented by progress in computer science, particularly with powerful statistical software packages. The availability of such software has enabled scientists to effectively analyze longitudinal data using complex statistical methods that were previously deemed unattainable (Singer and Willett, 2003).
Over the past three decades, the application of statistical techniques to longitudinal data has seen significant methodological advancements, particularly with the development of modern mixed-effects models. These models facilitate the analysis of longitudinal data through complex multivariate regression procedures, allowing for both fixed and random effects. This flexibility enables researchers to effectively model the autoregressive processes that characterize individual trajectories, capturing average changes over time as well as changes specific to each observational unit. By integrating measurable covariates and unobservable characteristics, mixed-effects models yield more reliable analytic results for longitudinal processes. Additionally, they are robust to missing data and accommodate irregularly spaced measurements, enhancing their utility in various research contexts.
Recent advancements in Bayes-type approximation methods have enhanced the estimation of parameters in longitudinal data analysis, particularly for nonlinear functions like proportions and counts. These flexible techniques allow researchers to model nonnormal outcome data effectively, enabling accurate estimation of complex random effects and nonlinear predictions. Consequently, statisticians and quantitative methodologists have utilized these methods to refine longitudinal models, significantly expanding the scope of mixed-effects modeling. Additionally, some methodologists have advanced growth curve modeling by integrating latent factors and classes within the structural equation modeling (SEM) framework.
Longitudinal Data Structures
Multivariate Data Structure
In experimental studies, classical repeated measures data are primarily utilized in ANOVA, typically structured in a multivariate format. This format features a single row of data for each subject, with repeated measurements recorded horizontally across various time points. For example, in a randomized controlled clinical trial assessing the effectiveness of acupuncture treatments on posttraumatic stress disorder (PTSD), the PTSD Checklist (PCL) score serves as the response variable, measuring the severity of PTSD symptoms with a 17-item summary scale at four distinct time points; the PCL score ranges from 17 to 85.
In the multivariate data format, the repeated measurements for each subject are specified as four outcome variables aligned in the same row, with time points indicated as suffixes attached to the variable name. Additionally, two covariates are included in the dataset: Age and Female (male = 0, female = 1). To identify the subject for further analysis, each individual's ID number is also incorporated. Below is the data matrix for the first five subjects in the multivariate data format.
Table 1.1 presents data for five subjects, each row containing four outcome variables (PCL1–PCL4), an ID number, and two covariates: Age and Female. Among the subjects, one is under 30, one is over 50, and the others are aged between 38 and 44, with a gender distribution of four men and one woman. The outcome variables are organized horizontally within the same row, reflecting a multivariate data structure of repeated measurements, also known as the wide table format. This format highlights that cross-sectional data is a specific instance of the multivariate structure, where the outcome variable is recorded at a single time point. The primary benefit of employing a multivariate data structure is its ability to capture complex relationships among multiple variables simultaneously.
Table 1.1 Multivariate Data of Repeated Measurements
ID PCL1 PCL2 PCL3 PCL4 Age Female
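As a sketch of this wide (multivariate) layout, the following Python/pandas fragment builds a small data frame in the shape of Table 1.1; the IDs, scores, and covariate values below are invented for illustration, not the actual table entries:

```python
import pandas as pd

# Wide (multivariate) format: one row per subject, with the repeated PCL
# measurements spread horizontally across columns PCL1..PCL4 (values invented).
wide = pd.DataFrame({
    "ID":     [1, 2, 3, 4, 5],
    "PCL1":   [52, 61, 44, 70, 38],
    "PCL2":   [48, 59, 40, 65, 35],
    "PCL3":   [45, 55, 39, 60, 33],
    "PCL4":   [41, 50, 37, 58, 30],
    "Age":    [28, 41, 39, 52, 44],
    "Female": [0, 0, 1, 0, 0],      # male = 0, female = 1
})

# Each subject occupies exactly one row in this structure.
assert len(wide) == wide["ID"].nunique()
print(wide.shape)  # (5, 7)
```

Note that time appears only implicitly, through the suffixes 1–4 on the column names, which is exactly the limitation discussed below.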
The empirical growth records of each subject can be visually analyzed, as demonstrated by Singer and Willett (2003). For instance, Table 1.1 allows for a straightforward comparison of each subject's trajectory through horizontal alignment of repeated measurements. This visual representation facilitates an in-depth examination of the response variable's change over time. Consequently, the convenience of this method has led to the development of various latent growth models, which are essential in the field of longitudinal data analysis.
The multivariate data structure presents notable challenges for longitudinal data analysis. Primarily, time serves as a crucial covariate for examining changes in the response variable; however, in a wide table format, it is only indirectly represented through suffixes attached to time points, making it difficult to analyze the time effect explicitly. Additionally, when intervals between successive waves are unevenly spaced—either by design or across subjects—the multivariate framework struggles to accommodate these variations. Furthermore, as covariate values may change over time, neglecting this time-varying aspect can lead to biased analytic results and inaccurate predictions. While there are complex methods to incorporate time-varying covariates within the multivariate structure, these approaches tend to be cumbersome and not user-friendly.
Univariate Data Structure
Modern longitudinal modeling often relies on univariate data structures due to the limitations of multivariate formats. In this approach, each subject is represented by multiple rows, with time explicitly identified as a key predictor in tracking individual developmental trajectories. Specifically, for n time points in a longitudinal analysis, each subject has n corresponding rows in the dataset, although some rows may be missing due to lost observations during follow-ups. Table 1.2 illustrates this data in the univariate format, contrasting with the multivariate presentation in Table 1.1.
In Table 1.2, each subject is represented by four rows of data corresponding to specific time points, creating a block design for the univariate longitudinal dataset. The PCL score measurements are organized vertically under a single name, with suffixes removed, while a new covariate, Time, indicates each time point. This structure allows for repeated measurements at four time points, with each subject's ID number and baseline predictors, such as Age and Female, repeated four times in the data matrix. Consequently, this format results in fewer columns and more rows compared to a multivariate data structure.
Therefore, the univariate longitudinal data structure is also referred to as the long table format.
The univariate longitudinal data matrix retains the same information as the multivariate format but differs in structure and includes a time factor. This format offers significant advantages, such as allowing researchers to effectively manage unequally spaced intervals by explicitly defining time as a predictor variable. Additionally, the vertical arrangement of covariate values for the same individual across multiple rows simplifies the specification of time-varying covariates. As a result, the univariate structure has emerged as the preferred format for longitudinal data analysis.
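The conversion from the wide table to the long table can be sketched in pandas; the column names mirror the PCL example, and the values are invented for illustration:

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject (values invented).
wide = pd.DataFrame({
    "ID": [1, 2], "Age": [28, 41], "Female": [0, 1],
    "PCL1": [52, 61], "PCL2": [48, 59], "PCL3": [45, 55], "PCL4": [41, 50],
})

# Reshape to the long (univariate) format: one row per subject per wave,
# with Time made an explicit column and the PCL suffixes removed.
long = pd.wide_to_long(wide, stubnames="PCL", i="ID", j="Time").reset_index()
long = long.sort_values(["ID", "Time"]).reset_index(drop=True)

print(len(long))  # 2 subjects x 4 waves = 8 rows
```

The long table has fewer columns and more rows, and time is now available as an ordinary predictor variable.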
Balanced and Unbalanced Longitudinal Data
In longitudinal data analysis, it is essential for researchers to assess whether the dataset is "balanced" or "unbalanced." Balanced repeated-measures data, as defined in the classical ANOVA model, feature an equal number of observations across all groups.
Table 1.2 Univariate Data of Repeated Measurements
ID Time PCL Age Female
In longitudinal data analysis, researchers differentiate between balanced and unbalanced designs based on the number of time points and the timing of observations. A balanced design features a consistent set of time points for all individuals, resulting in complete and uniform repeated measurements without missing data, commonly used in clinical studies where subjects are randomized for treatment evaluations. Conversely, an unbalanced design occurs when subjects have varying sets of time points for their measurements, often seen in aging and health research where attrition due to mortality or illness leads to a rapid decrease in sample size over time. To address this, researchers may replenish the sample by randomly selecting new participants at certain follow-up waves, resulting in longitudinal data that is inherently unbalanced. Additionally, unbalanced designs may also be implemented when measurement timing aligns with specific benchmark events, such as assessing body fat changes around menarche.
In longitudinal data analysis, while many designs aim for a balanced perspective, the reality is that most longitudinal data is unbalanced due to the common occurrence of missing data. When participants drop out of a study, they typically have fewer time points than those who complete it, leading to an unbalanced data structure. In clinical studies with a balanced design, some subjects may start later than the designated beginning, resulting in delayed entries that further contribute to this imbalance. Although observational studies often plan for all respondents to be reassessed at fixed intervals, attrition frequently disrupts this balance, making it challenging to achieve truly balanced longitudinal data.
Incomplete information and large-scale missing data present significant challenges in longitudinal data analysis, particularly in tracking individual changes over time. However, statisticians and quantitative methodologists have created various statistically efficient and robust methods to mitigate the effects of unbalanced data structures, which will be explored in the following chapters.
Missing Data Patterns and Mechanisms
Missing data in repeated measurements poses a significant challenge in longitudinal data analysis, as highlighted in Section 1.1. This issue often arises due to participant dropouts or unanswered survey items during longitudinal studies.
Longitudinal data are collected over a specific observational period, with outcomes and relevant variables recorded at predetermined intervals. Researchers can only analyze responses from participants available at each follow-up, leading to various types of missing data. Some missing data occur purely at random; more often, missingness is associated with observed factors such as age, gender, and health status. When those factors are accounted for, such missing data may not significantly affect the quality of longitudinal analysis. When missingness is systematically related to the outcome variable itself, however, neglecting it can harm the estimation and prediction of changes over time. Therefore, it is crucial for researchers to understand different missing data patterns and mechanisms before conducting formal longitudinal data analysis.
Missing data can be categorized based on specific patterns that indicate which values are observed and which are absent in the data matrix. These patterns include univariate, multivariate, monotone, nonmonotone, and file matching types. A univariate missing pattern occurs when data is missing from a single variable, while a multivariate pattern involves missing data across multiple variables, either for the entire unit or specific items in a questionnaire. A monotone missing pattern is characterized by a variable being absent for a subject at a specific time and all subsequent occasions, whereas a nonmonotone pattern occurs when data is missing at one time point but reappears later. Nonmonotone patterns can pose greater challenges in longitudinal data analysis compared to monotone patterns. Additionally, file-matching patterns arise when variables are not observed together. Various statistical techniques exist to address the impact of these missing data patterns on the quality of longitudinal data analysis.
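The monotone/nonmonotone distinction can be checked programmatically: a subject's pattern is monotone when a missing wave is followed only by missing waves. A small sketch with invented response values:

```python
import numpy as np

# Hypothetical response matrix: rows = subjects, columns = waves 1-4,
# np.nan marks a missing measurement.
Y = np.array([
    [55.0, 50.0, 47.0, 44.0],      # complete
    [60.0, 58.0, np.nan, np.nan],  # monotone: drops out after wave 2
    [48.0, np.nan, 45.0, 43.0],    # nonmonotone: returns after a miss
])

def is_monotone(row):
    """Missing at wave j implies missing at all later waves."""
    missing = np.isnan(row)
    # Once a True flag appears it must persist, i.e. the flags are sorted.
    return bool(np.all(missing == np.sort(missing)))

patterns = [is_monotone(r) for r in Y]
print(patterns)  # [True, True, False]
```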
Missing data mechanisms describe how missing information relates to the values of variables within a dataset, and they can be classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when the missingness is unrelated to both the missing responses and the observed responses, making the observed values representative of the entire sample. In contrast, MAR indicates that the missingness depends on the observed responses but not on the missing values themselves. In longitudinal data analysis, most missing data are classified as MAR, which is why many longitudinal models are designed under the MAR assumption for managing missing values.
In certain cases, missing data are classified as MNAR (missing not at random), in which the probability that a value is missing depends on the unobserved value itself. Ignoring this missing data mechanism in longitudinal data analysis can lead to significant bias and inaccurate predictions. Over time, researchers have created various robust models and methods to address MNAR in this context. However, these statistical models are still less developed than those for MAR, and research in this area continues to evolve.
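The practical consequence of the three mechanisms can be illustrated with a small simulation; the covariate, outcome model, and dropout probabilities below are all invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(75, 5, n)                        # observed covariate
y = 50 + 0.5 * (age - 75) + rng.normal(0, 5, n)   # outcome at follow-up

# Three hypothetical mechanisms for losing y at follow-up:
mcar = rng.random(n) < 0.3                             # unrelated to anything
mar = rng.random(n) < np.clip((age - 70) / 20, 0, 1)   # depends on observed age
mnar = rng.random(n) < np.clip((y - 45) / 20, 0, 1)    # depends on y itself

# Under MCAR the observed mean stays unbiased; under MNAR it does not,
# because high values of y are preferentially missing here.
print(round(y.mean(), 1))          # full-sample mean
print(round(y[~mcar].mean(), 1))   # close to the full-sample mean
print(round(y[~mnar].mean(), 1))   # shifted downward
```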
Chapter 14 of this book emphasizes the significance of missing data analysis in longitudinal studies, providing a comprehensive overview of the mathematical definitions and conditions related to MCAR, MAR, and MNAR. It also details various statistical models that effectively address different mechanisms of missing data.
Sources of Correlation in Longitudinal Processes
Longitudinal data are characterized by intraindividual correlation, which arises from repeated measurements of the same subject and violates the conditional independence assumption commonly used in multivariate regression modeling. This correlation necessitates that statisticians develop dedicated methods for effective longitudinal data analysis. Variability in longitudinal processes can be categorized into three components: between-subjects variability, within-subject variability, and random errors. The first two components represent the systematic variations in longitudinal data, allowing researchers to model intraindividual correlation by incorporating these patterns of variability. Random error is the term for uncertainty as regularly specified in general linear and generalized linear regression models, and it can therefore be estimated as regression residuals.
Between-subjects variability highlights the individual differences in unrecognized characteristics affecting response trajectories. While observable individual and contextual factors can address much of this variability, some differences stem from unobservable biological and environmental influences, such as genetic predispositions and physiological traits. For instance, individuals may respond differently to the same medication dose for a specific condition, and positive behavioral patterns established in childhood can slow the decline in functionality with age. Additionally, the social environment significantly impacts drug use relapse rates among adolescents transitioning from rehabilitation.
Ignoring unobserved heterogeneity in longitudinal data analysis can lead to significant bias in parameter estimates, particularly affecting standard error estimators (Diggle et al., 2002; Fitzmaurice et al., 2004; Verbeke and Molenberghs, 2000). To address between-subjects variability, researchers often use individual-specific random effects based on known distributions. This approach accounts for individual differences, thereby managing intraindividual correlation in longitudinal data and ensuring conditional independence in repeated measurements for the same observational unit. Various statistical models and methods in longitudinal data analysis are specifically designed to handle this intraindividual correlation through the specification of random effects.
Within-subject variations represent a key aspect of variability in longitudinal processes, as repeated measurements for the same individual often show greater similarity than those from randomly selected individuals due to biological, genetic, and environmental factors. This intraindividual consistency results in positive correlations over time, as changes in the response variable are influenced by a person's unique physical and environmental contexts. However, it is important to note that this correlation tends to decay over time, a phenomenon known as serial correlation. For instance, an individual's blood sugar levels are likely to be more consistent between successive measurements than between non-adjacent time points. Recognizing these correlation patterns allows researchers to effectively account for intraindividual correlation by establishing within-subject covariance structures in repeated-measures analyses.
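The decay of serial correlation with increasing lag can be demonstrated with a small simulation; the AR(1)-type error process and its parameter values below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_wave, rho = 5_000, 4, 0.7

# Simulate within-subject serial correlation with an AR(1)-type process:
# each wave partially carries over the previous wave's deviation.
e = np.empty((n_subj, n_wave))
e[:, 0] = rng.normal(0, 1, n_subj)
for j in range(1, n_wave):
    e[:, j] = rho * e[:, j - 1] + rng.normal(0, np.sqrt(1 - rho**2), n_subj)

# Correlation between waves decays with the lag, roughly as rho ** lag.
corr = np.corrcoef(e, rowvar=False)
print(round(corr[0, 1], 2))  # ~0.70 at lag 1
print(round(corr[0, 3], 2))  # ~0.34 at lag 3 (about 0.7 ** 3)
```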
Between-subjects and within-subject variations are closely interrelated, as clustering in repeated measurements for the same observational unit highlights individual differences in change patterns over time. This individual variability also indicates the homogeneity of repeated measurements within each unit. In longitudinal data analysis, researchers often focus on one source of systematic variability—either within-subject or between-subjects—to achieve conditional independence in the data. Due to the strong interaction between these two sources, longitudinal data may not allow for the simultaneous modeling of all components, especially when dealing with qualitative outcome variables. Only in specific scenarios will both types of variability need to be included in the same statistical model, which will be explored in subsequent chapters.
In longitudinal data analysis, if the subject-specific random effects and covariance structure effectively capture intraindividual correlation, the leftover within-subject variability is considered random and noninformative. This residual variability can be modeled as random errors, similar to traditional approaches in general and generalized linear models. However, for qualitative response variables, defining within-subject variability components is more challenging, necessitating complex statistical methods for accurate estimation of random errors.
Time Scale and the Number of Time Points
When designing longitudinal data analysis, researchers must carefully choose the time scale, which can vary from weeks to years, based on the study's nature and observation period. In clinical studies, outcome measurements may relate to both rapid and gradual changes in patient status. For instance, lung cancer research may focus on survival rates over six months, making weeks an appropriate time scale, while prostate cancer studies require longer observation periods, suggesting months as a better choice. In behavioral and social sciences, changes in status often occur gradually, with months or years being suitable time scales for tracking events like recovery from disability or changes in marital status. Additionally, in health services research, the time scale must align with the type of service, as patients in short-stay hospitals may only be there for days, while nursing home stays can extend for years.
When designing a longitudinal study, determining the optimal number of time points is crucial. While there are no strict guidelines, it is important to have more than two or three time points. A dataset with only two time points can reveal only linear patterns of change over time, regardless of the variation in response values between those points.
To accurately capture the trajectory of individuals, a minimum of four time points is necessary to reflect potential changes in response variables, as three time points can identify at most a quadratic pattern. While there is no definitive maximum number of time points for longitudinal data analysis, the ideal number should be determined by the needs of each study. Increasing the number of time points can enhance estimation precision, provided the sample size is large enough; however, it may also raise financial costs and complicate statistical modeling. For clinical experimental studies, I recommend using four to six time points, including a baseline and three to five follow-ups, to effectively assess the impact of new treatments. In contrast, observational studies with larger sample sizes should ideally incorporate six to ten time points to accurately describe time trends and group differences in response variables.
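The link between the number of time points and the patterns of change they can identify follows from polynomial interpolation: n waves determine at most a polynomial of degree n − 1, so two points pin down only a line and three only a quadratic. A numerical sketch with invented scores:

```python
import numpy as np

# Three waves are fit exactly by a quadratic: n time points can identify
# at most a polynomial of degree n - 1.
t = np.array([0.0, 1.0, 2.0])
y = np.array([55.0, 49.0, 47.0])   # hypothetical PCL-like scores

coef = np.polyfit(t, y, deg=2)     # quadratic through 3 points
fitted = np.polyval(coef, t)

print(np.allclose(fitted, y))  # True: the quadratic fit has zero residuals
```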
Basic Expressions of Longitudinal Modeling
In a typical longitudinal dataset, each subject is represented by multiple data entries corresponding to specific time points. For a random sample of N subjects, there are n predesigned time points, though the actual number of observed time points for each subject, denoted as n_i, may vary due to missing data. The response measurement for subject i at time point j is represented as Y_ij. Consequently, the repeated measurements of the response variable Y for subject i can be organized into an n_i × 1 column vector, referred to as Y_i.
In a longitudinal dataset with N subjects, there are N corresponding vectors. The response variable is typically analyzed through repeated measurements, taking into account the time factor and other relevant covariates. For cross-sectional data analysis, various regression models are employed, assuming conditional independence of observations given the defined model parameters.
If the response variable Y has a continuous scale, a linear regression model on Y can be written as

Y_ij = X_ij′β + ε_ij,   (1.1)

where X_ij is a 1 × M vector of covariates for subject i at time point j, β is an M × 1 vector of fixed regression coefficients, and ε_ij is a normally distributed random error term. In longitudinal data analysis, this traditional model often fails to yield efficient and consistent parameter estimates because random errors for the same subject are correlated across time points, even when observed covariates are included. To ensure the conditional independence of random errors, additional parameters that account for intraindividual correlation must be incorporated.
Addressing intraindividual correlation is crucial in longitudinal data analysis, leading to the development of advanced methodologies. According to prominent statisticians (Diggle, 1988; Diggle et al., 2002), modeling a longitudinal process must account for three key sources of variability: first, the average responses differ randomly among subjects, with some consistently higher or lower; second, a subject's measurement profile may reflect time-varying processes; and third, individual measurements involve subsampling, introducing additional variation. This breakdown of stochastic variations in repeated measurements aids in establishing correlations between measurements from the same subject. Extending the linear regression model in Equation (1.1), a comprehensive multivariate linear model can incorporate all three sources of variability:
Y_ij = X_ij′β + b_i + W_i(T_ij) + ε_ij,   (1.2)

where b_i indicates the variation in the average response between subjects, the term W_i(T_ij) describes independent stationary processes characterized by a specific type of serial correlation, T_ij denotes the time value for subject i at time point j, and ε_ij accounts for subsampling uncertainty within subjects. For analytic simplicity, the components b_i, W_i(T_ij), and ε_ij are typically assumed to follow normal distributions with defined variance or covariance functions. With the inclusion of b_i and W_i(T_ij), the random component ε_ij is conditionally independent of the errors at other time points, ensuring that the estimate of β remains unbiased. Consequently, the term X_ij′β represents the mean response for subject i at time point j.
In longitudinal data analysis, the three sources of variability are often interconnected, leading researchers to focus on a single source of systematic variability. A widely used method for analyzing such data is mixed-effects modeling, which incorporates between-subjects random effects to account for intraindividual correlation. In this context, let n_i represent the number of repeated measurements for subject i within a sample of N individuals, and let Y_i denote the corresponding vector of observed outcomes.
The repeated measurements of subject i are then analyzed as an n_i × 1 vector. A common approach is the linear mixed model with a single random-effect term:

Y_i = X_i β + b_i + e_i,   (1.3)

where X_i is a known n_i × M matrix of covariates whose first column takes the constant value 1, and the subject-specific random effect b_i is added to each of the n_i measurements.
In linear mixed models, β is a vector of unknown population parameters (the fixed effects), and b_i is the random effect for subject i. The primary predictor, time, is typically modeled as a continuous variable or through polynomials included in the design matrix X_i. This approach separates the fixed regression coefficients from the random effects, enhancing the efficiency and robustness of the regression coefficients compared to general linear models with potentially dependent residuals. In some cases, multiple random effects must be specified, such as those for both the intercept and the time factor, which are then expressed as a vector b_i. These extended mixed-effects models are discussed in later chapters.
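As a sketch of this random-intercept formulation, the following fragment fits such a model with statsmodels' `MixedLM` on simulated long-format data; the variable names and the true parameter values (intercept 50, slope −2) are invented for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_subj, n_wave = 200, 4

# Simulate a random-intercept process: Y_ij = 50 - 2*Time + b_i + e_ij.
ids = np.repeat(np.arange(n_subj), n_wave)
time = np.tile(np.arange(n_wave), n_subj)
b = rng.normal(0, 3, n_subj)          # between-subjects random intercepts
y = 50 - 2 * time + b[ids] + rng.normal(0, 2, ids.size)
data = pd.DataFrame({"ID": ids, "Time": time, "PCL": y})

# Linear mixed model: fixed effects for the intercept and Time,
# one random effect (the intercept) per subject.
model = smf.mixedlm("PCL ~ Time", data, groups=data["ID"])
result = model.fit()

print(round(result.params["Time"], 1))  # close to the true slope of -2
```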
Given the specification of b_i (or its vector extension b_i), the elements of e_i are assumed to be conditionally independent. Specifically, e_i is an n_i × 1 column vector of random errors for subject i, given by e_i = (ε_i1, ε_i2, …, ε_in_i)′.
In longitudinal modeling, a common alternative approach is to define the covariance structure of within-subject random errors in linear regression models while leaving between-subjects random effects unspecified. This method is essential when the random-effects approach fails to provide reliable results or when within-subject variability is significant compared to between-subjects variability. In this framework, the error vector is assumed to follow a multivariate normal distribution with a mean of zero and a covariance matrix reflecting repeated measures. The time factor is treated as a classification factor with discrete levels to capture these repeated effects. Over the years, statisticians have developed various covariance pattern models for empirical analyses. Although similar to classical repeated-measures ANOVA models, this linear regression model for repeated measurements is generally classified within the family of linear mixed models due to its ability to account for between-subjects variability.
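Two of the most common covariance patterns for the within-subject error vector can be sketched numerically; the variance and correlation values below are invented for illustration:

```python
import numpy as np

n_wave, sigma2 = 4, 4.0
# Matrix of lags |j - k| between wave j and wave k.
lags = np.abs(np.subtract.outer(np.arange(n_wave), np.arange(n_wave)))

# Compound symmetry: equal correlation rho between any two waves.
rho_cs = 0.5
cov_cs = sigma2 * np.where(lags == 0, 1.0, rho_cs)

# AR(1): correlation decays geometrically with the lag between waves.
rho_ar = 0.7
cov_ar1 = sigma2 * rho_ar ** lags

print(cov_cs[0])   # [4. 2. 2. 2.]
print(cov_ar1[0])  # [4.    2.8   1.96  1.372]
```

Compound symmetry treats all pairs of waves as equally correlated, while the AR(1) pattern encodes the serial-correlation decay discussed earlier; which pattern fits best is an empirical question for a given dataset.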
The two perspectives on continuous response variables can be effectively applied to statistical modeling for various nonnormal outcome variables, including rates, proportions, multinomial data, and count data.
These covariance pattern models share the general form Y_i = X_i β + b_i + e_i, with e_i comprising the within-subject error terms, and they underscore the importance of understanding the sources of variability in longitudinal datasets. Subsequent chapters provide detailed specifications and statistical inferences for both linear and nonlinear longitudinal models, supported by empirical examples that illustrate their practical applications.
Organization of the Book and Data Used for Illustrations
Randomized Controlled Clinical Trial on the Effectiveness of Acupuncture Treatment on PTSD
A randomized controlled clinical trial was conducted by the Department of Defense's Deployment Health Clinical Center at Walter Reed National Military Medical Center (WRNMMC) from February 2006 to October 2007, aiming to evaluate the effectiveness of a brief, 4-week acupuncture treatment (89-minute sessions) combined with standard PTSD care in active-duty military personnel diagnosed with PTSD. The study employed a two-arm design, comparing the outcomes of participants receiving adjunctive acupuncture with those receiving usual PTSD care alone. Recruitment included 68% from primary care clinics at WRNMMC, 19% from self-referrals through advertisements, and 13% from provider and patient referrals, totaling 55 subjects: 28 in the acupuncture group and 27 in the control group. Participants in the acupuncture group were randomly assigned to one of three licensed acupuncturists, and all participants were assessed at baseline and at 4, 8, and 12 weeks after randomization. Following the completion of study follow-ups, participants in the control group were offered the study acupuncture intervention, along with a list of local mental health services.
Asset and Health Dynamics among the Oldest Old (AHEAD)
The second dataset originates from the Survey of AHEAD, a comprehensive longitudinal study of older Americans conducted by the Institute for Social Research at the University of Michigan. Funded by the National Institute on Aging, this survey serves as a supplement to the Health and Retirement Study (HRS). Wave I of the AHEAD survey took place between October 1993 and April 1994, screening individuals aged 70 or older (born in 1923 or earlier) through the HRS; this process identified 9,473 households and a total of 11,965 individuals in the target demographic. The initial respondents have been followed through telephone interviews every two to three years, with proxy interviews conducted for those who died between waves. Currently, the AHEAD survey encompasses ten waves of data collection, conducted from 1993 through 2012, with the most recent waves in 2008, 2010, and 2012. As a longitudinal, multidisciplinary, and US population-based study, AHEAD provides a highly representative and reliable database for longitudinal data analysis of older Americans aged 70 years or older.
The AHEAD study collects comprehensive data on various domains, such as demographic characteristics, health status, healthcare usage, housing structure, disability, retirement plans, and insurance coverage. Survival data from follow-up waves are linked to the National Death Index (NDI). This book utilizes AHEAD data from six waves between 1998 and 2008, with the 1998 panel serving as the baseline. For empirical analysis, a random sample of 2,000 individuals from the baseline AHEAD sample is selected, and the weight factor for adjusting oversampling of certain subpopulations is not applied in the examples provided. Readers wishing to conduct their own analyses should use the complete AHEAD sample and include the weight variable to ensure unbiased results.
Methods and Applications of Longitudinal Data Analysis http://dx.doi.org/10.1016/B978-0-12-801342-7.00002-2
Copyright © 2016 Higher Education Press. Published by Elsevier Inc. All rights reserved.
2.1.3 Effect Size Between Two Means and Its Confidence Interval
2.1.4 Empirical Illustration: Descriptive Analysis on the Effectiveness of Acupuncture Treatment in Reduction of PTSD Symptom Severity
2.2.1 Specifications of One-Factor ANOVA
2.2.2 One-Factor Repeated Measures ANOVA
2.2.3 Specifications of Two-Factor Repeated Measures ANOVA
2.2.4 Empirical Illustration: A Two-Factor Repeated Measures ANOVA – The Effectiveness of Acupuncture Treatment on PCL Revisited
2.3.2 Hypothesis Testing on Effects in MANOVA
2.3.4 Empirical Illustration: A Two-Factor Repeated Measures MANOVA on the Effectiveness of Acupuncture Treatment on Two Psychiatric
Traditional methods in longitudinal analysis summarize key features of raw data without complex adjustments for intricate structures or missing observations. These methods include basic statistics, paired t-tests for outcome scores at two time points, the effect size (d), analysis of variance (ANOVA) for repeated measures, and repeated measures multivariate analysis of variance (MANOVA). While often seen as simplistic, these approaches are frequently utilized in biomedical and epidemiologic studies to draw conclusions. In randomized controlled clinical trials, for instance, paired t-test results are commonly used to summarize findings, especially when sample sizes are small and randomization partially accounts for effects in longitudinal processes.
Chapter 2: Traditional Methods of Longitudinal Data Analysis
This chapter presents key traditional descriptive methods, including time plots for trend analysis, paired t-tests, effect sizes, and their confidence intervals. An empirical example demonstrates the practical application of these descriptive techniques. Additionally, ANOVA is discussed with a corresponding empirical illustration, followed by an exploration of repeated measures MANOVA for analyzing longitudinal data with multiple response variables. The chapter concludes with a summary of these traditional methods, highlighting their advantages and limitations.
Time plots of trends, paired t-tests on two time points, effect sizes, and their confidence intervals are descriptive statistics that can be calculated by hand, but they fail to account for potential bias due to missing observations.
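As an illustration of how such descriptive statistics can be computed by hand, the following is a minimal sketch of the effect size between two group means (Cohen's d with a pooled standard deviation); the scores and group names are hypothetical, not drawn from the studies described in this book:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical outcome scores for two groups of equal size.
group_a = [52, 48, 50, 47, 53]
group_b = [44, 41, 45, 40, 45]

# Cohen's d: difference between the two means divided by the pooled
# standard deviation of the two groups.
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * variance(group_a) +
              (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
d = (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
print(round(d, 3))
```

Confidence intervals for d can then be built from its approximate standard error, as discussed in Section 2.1.3.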
In longitudinal data analysis, time plots are essential for visualizing how response scores evolve over time. This straightforward method plots outcome measurements against specific time points, illustrating trends and trajectories in the data. There are two primary perspectives on this technique: the first focuses on individual patterns of change, revealing intraindividual growth patterns through distinct trajectories for each subject. These growth curves highlight variability in response measurements among individuals. To compare growth patterns across different population groups, researchers can create separate intraindividual time plots based on discrete covariates like treatment, age, gender, or race/ethnicity. This approach is particularly valuable in fields such as medicine, public health, biology, psychology, and criminology, where identifying individuals with unusual response values or high risks for dynamic events is critical.
The second perspective on time plots involves analyzing time trends within a population by calculating the average response measurements at predetermined time points. These mean scores are then presented in a time plot, illustrating the overall pattern of change in the mean response score for the entire population over time.
Descriptive approaches can also compute standard errors and confidence intervals, displaying these statistics alongside time plots to illustrate population trends. This method is particularly beneficial in the social sciences, where the focus is often on understanding changes over time within a specific population rather than in individual cases. When analyzing large samples, which are typical in observational studies, individual growth curves can clutter time plots, hindering the ability to identify general patterns. By utilizing time plots to represent trends for a specific population, researchers can effectively summarize and evaluate changes, providing valuable insights for policy development. Additionally, comparing population trends across different groups in a single graph allows for the examination of variations in change patterns over time.
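The population-trend computation described above can be sketched as follows; the subjects, group labels, and scores are invented for illustration (not the AHEAD or acupuncture samples), with each subject measured at three fixed time points:

```python
from statistics import mean

# Hypothetical repeated measurements: each subject belongs to a group and
# has a response score at three fixed time points (e.g., baseline, week 4,
# week 8).
subjects = [
    {"group": "treatment", "scores": [60, 50, 42]},
    {"group": "treatment", "scores": [55, 48, 40]},
    {"group": "control",   "scores": [58, 56, 55]},
    {"group": "control",   "scores": [62, 60, 57]},
]

def group_trend(subjects, group):
    """Mean response score at each time point for one population group."""
    scores = [s["scores"] for s in subjects if s["group"] == group]
    # zip(*scores) transposes subjects-by-time into time-by-subjects.
    return [mean(col) for col in zip(*scores)]

print(group_trend(subjects, "treatment"))
print(group_trend(subjects, "control"))
```

Plotting each group's sequence of means against time yields exactly the population trend plot discussed above, with one line per group.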
The intraindividual change and population trend approaches differ significantly, as the latter allows for variations within a group over time due to factors like group exits, deaths, or new arrivals. While these approaches serve different scientific or policy objectives, they are interconnected, since aggregate changes arise from individual ones (Verbrugge and Liu, 2014). In Section 2.1.4, we will illustrate this relationship with an empirical example featuring two time plots: one showing subject-specific transitions and the other displaying mean response measurements for two population groups.
When utilizing raw longitudinal data, it is crucial to acknowledge certain limitations in plotting time trends. The basic time plots presented in this chapter serve as initial approaches to representing longitudinal data. However, substantial missing data can lead to incomplete intraindividual curves, resulting in potentially biased population trend plots. Additionally, the observed changes over time may be confounded by other covariates, creating a false association. Strong covariance among repeated measurements, both within individuals and across population groups, further complicates the accuracy of these time plots. Without addressing these issues, time trend representations can be misleading. To accurately depict time trends, more advanced statistical methods are necessary; these will be explored in subsequent chapters alongside specific mixed-effects models.
Time plots of trends offer a visual method to observe changes in response measurements over time, but the statistical significance of those changes must be verified using analytical methods. The paired t-test, commonly known as the pre–post paired t-test in biomedical research, is the simplest approach to assessing the statistical significance of a time trend. This test evaluates a statistic that follows a Student's t distribution if the null hypothesis is valid. Various types of t-tests exist for different scenarios, including one-sample location tests, unpaired two-sample tests, paired two-sample tests, and tests on regression line slopes; this chapter does not cover the last category.
The t-test statistic extends the z-score test to hypothesis testing when population variances are unknown. It is calculated as t = z/s, where z is the z-score under the null hypothesis and s is the sample estimate of the standard deviation; the squared sample standard deviation follows a chi-square distribution with the appropriate degrees of freedom. Once the t value is calculated, the corresponding p-value can be determined from the Student's t-distribution table. If the p-value is below the predefined significance level, the null hypothesis is rejected. In longitudinal studies, a t-test assesses whether the mean difference between two time points is zero; in biomedical research, for example, a patient's blood pressure may be measured before and after treatment. For correlated repeated measurements, a paired t-test is applied, with degrees of freedom N/2 − 1, where N is the total number of observations across the two time points.
To assess the impact of a medical treatment on patient responses, let Ȳ_pre and Ȳ_post denote the mean scores before and after treatment for a sample of N patients. The researcher aims to determine whether there is a significant difference between these two mean scores, leading to the null hypothesis H0: Ȳ_pre = Ȳ_post and the alternative hypothesis H1: Ȳ_pre ≠ Ȳ_post.
In a two-tailed test, we set the null hypothesis as H0: Ȳ_post = Ȳ_pre and the alternative hypothesis as H1: Ȳ_post ≠ Ȳ_pre. Assuming equal variances for the pre- and posttest scores and the same sample size at both time points, the paired t-test statistic is t = (Ȳ_post − Ȳ_pre) / (s_d / √N), where s_d is the standard deviation of the within-subject differences; under H0 this statistic follows a Student's t distribution with N − 1 degrees of freedom.
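The paired t-test formula can be checked numerically. The following is a minimal sketch computed on made-up pre/post scores (the data and variable names are illustrative, not from the acupuncture study):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical pre- and post-treatment scores for N = 5 patients.
pre  = [68, 72, 60, 75, 66]
post = [60, 65, 58, 70, 61]

# The paired t-test operates on the within-subject differences.
diffs = [b - a for a, b in zip(pre, post)]
n = len(diffs)
d_bar = mean(diffs)          # mean difference, equal to Y_post - Y_pre
s_d = stdev(diffs)           # sample standard deviation of the differences
t = d_bar / (s_d / sqrt(n))  # t statistic with n - 1 degrees of freedom
print(round(t, 3))
```

The resulting t value would then be compared against the Student's t distribution with n − 1 = 4 degrees of freedom to obtain the two-tailed p-value.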