Before deciding how to deal with missing data, you’ll find it useful to determine which variables have missing values, in what amounts, and in what combinations. In this section, we’ll review tabular, graphical, and correlational methods for exploring missing values patterns. Ultimately, you want to understand why the data is missing. The answer will have implications for how you proceed with further analyses.
15.3.1 Tabulating missing values
You’ve already seen a rudimentary approach to identifying missing values. You can use the complete.cases() function from section 15.2 to list cases that are complete or, conversely, cases that have one or more missing values. As the size of a dataset grows, though, this becomes a less attractive approach. In that case, you can turn to other R functions.
The md.pattern() function in the mice package will produce a tabulation of the missing data patterns in a matrix or data frame. Applying this function to the sleep dataset, you get the following:
> library(mice)
> data(sleep, package="VIM")
> md.pattern(sleep)
   BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
42       1        1    1   1      1     1    1    1     1    1  0
 2       1        1    1   1      1     1    0    1     1    1  1
 3       1        1    1   1      1     1    1    0     1    1  1
 9       1        1    1   1      1     1    1    1     0    0  2
 2       1        1    1   1      1     0    1    1     1    0  2
 1       1        1    1   1      1     1    0    0     1    1  2
 2       1        1    1   1      1     0    1    1     0    0  3
 1       1        1    1   1      1     1    0    1     0    0  3
         0        0    0   0      0     4    4    4    12   14 38
The 1’s and 0’s in the body of the table indicate the missing values patterns, with a 0 indicating a missing value for a given column variable and a 1 indicating a nonmissing value. The first row describes the pattern of “no missing values” (all elements are 1).
The second row describes the pattern “no missing values except for Span.” The first column indicates the number of cases in each missing data pattern, and the last column indicates the number of variables with missing values present in each pattern.
Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone. Nine cases are missing both NonD and Dream values. The dataset contains a total of (42 x 0) + (2 x 1) + … + (1 x 3) = 38 missing values. The last row gives the total number of missing values present on each variable.
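The totals in the last row and last column can be checked by hand. The following sketch re-enters the pattern table shown above and recomputes both sets of totals in base R. The object names (counts, patterns, and so on) are illustrative only and aren't part of the mice package.

```r
# Number of cases in each pattern (first column of the md.pattern() output)
counts <- c(42, 2, 3, 9, 2, 1, 2, 1)

# Pattern body re-entered from the table: 1 = observed, 0 = missing
vars <- c("BodyWgt", "BrainWgt", "Pred", "Exp", "Danger",
          "Sleep", "Span", "Gest", "Dream", "NonD")
patterns <- matrix(c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
  1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
  1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
  1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
  1, 1, 1, 1, 1, 1, 0, 1, 0, 0
), ncol = 10, byrow = TRUE, dimnames = list(NULL, vars))

# Number of variables missing in each pattern (last column of the table)
n_missing <- rowSums(patterns == 0)

# Missing values per variable (last row of the table):
# each pattern row is weighted by the number of cases showing it
per_variable <- colSums((patterns == 0) * counts)

# Total number of missing values in the dataset
total_missing <- sum(counts * n_missing)
```

Printing per_variable and total_missing should reproduce the last row of the table, and sum(counts) gives the number of cases in the dataset.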
15.3.2 Exploring missing data visually
Although the tabular output from the md.pattern() function is compact, I often find it easier to discern patterns visually. Luckily, the VIM package provides numerous functions for visualizing missing values patterns in datasets. In this section, we’ll review several, including aggr(), matrixplot(), and marginplot().
The aggr() function plots the number of missing values for each variable alone and for each combination of variables. For example, the code
library("VIM")
aggr(sleep, prop=FALSE, numbers=TRUE)
produces the graph in figure 15.2. (The VIM package opens a GUI. You can close it; we’ll be using code to accomplish the tasks in this chapter.)
Figure 15.2 aggr()-produced plot of missing values patterns for the sleep dataset. Panels: number of missings for each variable, and combinations of missing variables with their case counts (42, 9, 3, 2, 2, 2, 1, 1).
You can see that the variable NonD has the largest number of missing values (14), and that 2 mammals are missing NonD, Dream, and Sleep scores. Forty-two mammals have no missing data.
The statement aggr(sleep, prop=TRUE, numbers=TRUE) produces the same plot, but proportions are displayed instead of counts. The option numbers=FALSE (the default) suppresses the numeric labels.
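The quantities that aggr() plots can also be computed directly. The following is a minimal base R sketch on a small made-up data frame (toy and its values are hypothetical, used only so the example runs without VIM). It computes the per-variable counts, the corresponding proportions, and the frequency of each combination of missing variables.

```r
# Hypothetical data frame standing in for a real dataset such as sleep
toy <- data.frame(
  a = c(1, NA, 3, 4, NA, 6),
  b = c(NA, NA, 1, 2, 3, 4),
  c = c(1, 2, 3, 4, 5, 6)
)

# Count of missing values per variable (analogous to prop=FALSE)
miss_count <- colSums(is.na(toy))

# Proportion of missing values per variable (analogous to prop=TRUE)
miss_prop <- colMeans(is.na(toy))

# Frequency of each missingness combination: one string per row,
# with 1 marking a missing value and 0 an observed value
combo <- apply(is.na(toy) * 1, 1, paste, collapse = "")
combo_freq <- sort(table(combo), decreasing = TRUE)
```

Applied to the sleep data instead of toy, colSums(is.na(sleep)) should agree with the last row of the md.pattern() table.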
The matrixplot() function produces a plot displaying the data for each case.
A graph created using matrixplot(sleep) is displayed in figure 15.3. Here, the numeric data is rescaled to the interval [0, 1] and represented by grayscale colors, with lighter colors representing lower values and darker colors representing larger values.
By default, missing values are represented in red. Note that in figure 15.3, red has been replaced with crosshatching by hand, so that the missing values are viewable in grayscale. It will look different when you create the graph yourself.
The graph is interactive: clicking on a column will re-sort the matrix by that variable.
The rows in figure 15.3 are sorted in descending order by BodyWgt. A matrix plot allows you to see if the presence of missing values on one or more variables is related to the actual values of other variables. Here, you can see that there are no missing values on sleep variables (Dream, NonD, Sleep) for low values of body or brain weight (BodyWgt, BrainWgt).
Figure 15.3 Matrix plot of actual and missing values by case (row) for the sleep dataset. The matrix is sorted by BodyWgt.
The marginplot() function produces a scatter plot between two variables with information about missing values shown in the plot’s margins. Consider the relationship between amount of dream sleep and the length of a mammal’s gestation. The statement
marginplot(sleep[c("Gest","Dream")], pch=c(20), col=c("darkgray", "red", "blue"))
produces the graph in figure 15.4. The pch and col parameters are optional and provide control over the plotting symbols and colors used.
The body of the graph displays the scatter plot between Gest and Dream (based on complete cases for the two variables). In the left margin, box plots display the distribution of Dream for mammals with (dark gray) and without (red) Gest values.
Note that in grayscale, red is the darker shade. Four red dots represent the values of Dream for mammals missing Gest scores. In the bottom margin, the roles of Gest and Dream are reversed. You can see that a negative relationship exists between length of gestation and dream sleep and that dream sleep tends to be higher for mammals that are missing a gestation score. The number of observations with missing values on both variables at the same time is printed in blue at the intersection of both margins (bottom left).
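The comparison shown in the margins can also be made numerically. The sketch below uses a small made-up data frame (toy, with hypothetical columns gest and dream) to compare dream values for rows with and without a gestation value, and to count the rows that are missing both.

```r
# Hypothetical data; in practice you would use your own data frame
toy <- data.frame(
  gest  = c(40, NA, 120, 300, NA, 60, 200, NA),
  dream = c(3.0, 4.5, 2.0, 0.8, 3.9, 2.6, NA, NA)
)

# Group label: is the gestation value missing for this row?
gest_missing <- is.na(toy$gest)

# Median of dream within each group (missing dream values are dropped)
dream_by_gest <- tapply(toy$dream, gest_missing, median, na.rm = TRUE)

# Rows missing both variables (the count printed where the margins meet)
both_missing <- sum(is.na(toy$gest) & is.na(toy$dream))
```

In this made-up example, the median of dream is higher in the group that is missing gest, which is the kind of difference the box plots in the margin are meant to reveal.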
The VIM package has many graphs that can help you understand the role of missing data in a dataset and is well worth exploring. There are functions to produce scatter plots, box plots, histograms, scatter plot matrices, parallel plots, rug plots, and bubble plots that incorporate information about missing values.
Figure 15.4 Scatter plot between amount of dream sleep and length of gestation, with information about missing data in the margins (x-axis: Gest; y-axis: Dream; counts printed in the margins: 12, 4, and 0)
15.3.3 Using correlations to explore missing values
Before moving on, there’s one more approach worth noting. You can replace the data in a dataset with indicator variables, coded 1 for missing and 0 for present. The resulting matrix is sometimes called a shadow matrix. Correlating these indicator variables with each other and with the original (observed) variables can help you to see which variables tend to be missing together, as well as relationships between a variable’s “missingness” and the values of the other variables.
Consider the following code:
x <- as.data.frame(abs(is.na(sleep)))
The elements of data frame x are 1 if the corresponding element of sleep is missing and 0 otherwise. You can see this by viewing the first few rows of each:
> head(sleep, n=5)
   BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.000   5712.0   NA    NA   3.3 38.6  645    3   5      3
2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
3    3.385     44.5   NA    NA  12.5 14.0   60    1   1      1
4    0.920      5.7   NA    NA  16.5   NA   25    5   2      3
5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
> head(x, n=5)
  BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1       0        0    1     1     0    0    0    0   0      0
2       0        0    0     0     0    0    0    0   0      0
3       0        0    1     1     0    0    0    0   0      0
4       0        0    1     1     0    1    0    0   0      0
5       0        0    0     0     0    0    0    0   0      0
The statement
y <- x[which(sapply(x, sd) > 0)]
extracts the variables that have some (but not all) missing values, and
cor(y)
gives you the correlations among these indicator variables:
        NonD  Dream  Sleep   Span   Gest
NonD   1.000  0.907  0.486  0.015 -0.142
Dream  0.907  1.000  0.204  0.038 -0.129
Sleep  0.486  0.204  1.000 -0.069 -0.069
Span   0.015  0.038 -0.069  1.000  0.198
Gest  -0.142 -0.129 -0.069  0.198  1.000
Here, you can see that Dream and NonD tend to be missing together (r=0.91). To a lesser extent, Sleep and NonD tend to be missing together (r=0.49) and Sleep and Dream tend to be missing together (r=0.20).
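These steps can be wrapped in a small helper for reuse. In the sketch below, the function name shadow_cor is hypothetical and the data frame is made up for illustration. The function builds the shadow matrix, keeps only the indicators that vary, and returns their correlations.

```r
shadow_cor <- function(data) {
  # Shadow matrix: 1 where a value is missing, 0 where it is present
  shadow <- as.data.frame(is.na(data) * 1)
  # Keep indicators with nonzero standard deviation, that is,
  # variables with some (but not all) values missing
  varies <- vapply(shadow, function(col) sd(col) > 0, logical(1))
  cor(shadow[, varies, drop = FALSE])
}

# Hypothetical data: a and b are often missing together,
# c is never missing, d is missing once on its own
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, 5, 6),
  d = c(1, 2, NA, 4, 5, 6)
)
r <- shadow_cor(toy)
```

The fully observed variable c is dropped before correlating, because a constant indicator has zero standard deviation and its correlations are undefined.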
Finally, you can look at the relationship between the presence of missing values in a variable and the observed values on other variables:
> cor(sleep, y, use="pairwise.complete.obs")
           NonD  Dream   Sleep   Span   Gest
BodyWgt   0.227  0.223  0.0017 -0.058 -0.054
BrainWgt  0.179  0.163  0.0079 -0.079 -0.073
NonD         NA     NA      NA -0.043 -0.046
Dream    -0.189     NA -0.1890  0.117  0.228
Sleep    -0.080 -0.080      NA  0.096  0.040
Span      0.083  0.060  0.0052     NA -0.065
Gest      0.202  0.051  0.1597 -0.175     NA
Pred      0.048 -0.068  0.2025  0.023 -0.201
Exp       0.245  0.127  0.2608 -0.193 -0.193
Danger    0.065 -0.067  0.2089 -0.067 -0.204
Warning message:
In cor(sleep, y, use = "pairwise.complete.obs") :
  the standard deviation is zero
In this correlation matrix, the rows are observed variables, and the columns are indicator variables representing missingness. You can ignore the warning message and NA values in the correlation matrix; they’re artifacts of our approach.
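The NA values arise because, among the cases where a variable is observed, its own missingness indicator is always 0 and so has zero standard deviation. The following minimal sketch on made-up data (the data frame toy and its columns are hypothetical) reproduces the effect. suppressWarnings() is used only to keep the output quiet.

```r
# Hypothetical data: score is missing for three of eight cases
toy <- data.frame(
  weight = c(1, 2, 3, 4, 5, 6, 7, 8),
  score  = c(5, 3, 6, NA, 4, NA, NA, 7)
)

# Missingness indicator for score: 1 = missing, 0 = present
score_missing <- is.na(toy$score) * 1

# Rows are observed variables; the single column is the indicator.
# For the (score, score_missing) pair, only cases with score observed
# are used, so the indicator is constant and the correlation is NA.
r <- suppressWarnings(
  cor(toy, score_missing, use = "pairwise.complete.obs")
)
```

In this example, the weight row holds an ordinary correlation, whereas the score row is NA for the reason just described.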
From the first column of the correlation matrix, you can see that nondreaming sleep scores are more likely to be missing for mammals with higher body weight (r=0.227), gestation period (r=0.202), and sleeping exposure (r=0.245). Other columns are read in a similar fashion. None of the correlations in this table are particularly large or striking, which suggests that the data deviates minimally from MCAR and may be MAR.
Note that you can never rule out the possibility that the data is NMAR because you don’t know what the actual values would have been for data that is missing. For example, you don’t know if there’s a relationship between the amount of dreaming a mammal engages in and the probability of obtaining a missing value on this variable.
In the absence of strong external evidence to the contrary, we typically assume that data is either MCAR or MAR.