In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.
The data for this section come from the Arthritis dataset included with the vcd package. The data are from Koch & Edwards (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:
> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, we’ll create frequency and contingency tables (cross-classifications) from the data.
7.2.1 Generating frequency tables
R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.
Table 7.1 Functions for creating and manipulating contingency tables
Function Description
table(var1, var2, …, varN) Creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data) Creates an N-way contingency table based on a formula and a matrix or data frame
prop.table(table, margins) Expresses table entries as fractions of the marginal table defined by the margins
margin.table(table, margins) Computes the sum of table entries for a marginal table defined by the margins
addmargins(table, margins) Puts summary margins (sums by default) on a table
ftable(table) Creates a compact "flat" contingency table
In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or the xtabs() function, then manipulate it using the other functions.
ONE-WAY TABLES
You can generate simple frequency counts using the table() function. Here’s an example:
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28
You can turn these frequencies into proportions with prop.table():
> prop.table(mytable)
Improved
  None   Some Marked
 0.500  0.167  0.333
or into percentages, using prop.table()*100:
> prop.table(mytable)*100
Improved
  None   Some Marked
  50.0   16.7   33.3
Here you can see that 50 percent of study participants had some or marked improvement (16.7 + 33.3).
TWO-WAY TABLES
For two-way tables, the format for the table() function is
mytable <- table(A, B)
where A is the row variable, and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula-style input. The format is
mytable <- xtabs(~ A + B, data=mydata)
where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs.
If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).
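As a minimal sketch of the frequency-on-the-left form, the following rebuilds the Treatment x Improved table from counts that have already been tabulated. The data frame name counts and the object name tab are illustrative choices made for this example; the cell counts are the ones reported for the Arthritis data in this section.

```r
# Pre-tabulated cell counts (one row per Treatment x Improved cell).
# 'counts' and 'Freq' are names assumed for this sketch.
counts <- data.frame(
  Treatment = rep(c("Placebo", "Treated"), times = 3),
  Improved  = factor(rep(c("None", "Some", "Marked"), each = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 13, 7, 7, 7, 21)
)

# Freq on the left of ~ is treated as a vector of frequencies,
# so each data frame row contributes Freq observations to its cell.
tab <- xtabs(Freq ~ Treatment + Improved, data = counts)
tab
```

Setting the factor levels explicitly keeps the columns in the order None, Some, Marked; without it, the columns would be sorted alphabetically.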
For the Arthritis data, you have
> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have
> margin.table(mytable, 1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512
The index (1) refers to the first variable in the xtabs() formula (Treatment). Looking at the table, you can see that 51 percent of treated individuals had marked improvement, compared to 16 percent of those receiving a placebo.
For column sums and column proportions, you have
> margin.table(mytable, 2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable, 2)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
Here, the index (2) refers to the second variable in the xtabs() formula (Improved).
Cell proportions are obtained with this statement:
> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500
You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a sum row and column:
> addmargins(mytable)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000
When using addmargins(), the default is to create sum margins for all variables in a table. In contrast:
> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000
adds a sum column alone. Similarly,
> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
  Sum     1.000 1.000  1.000
adds a sum row. In the table, you see that 25 percent of those patients with marked improvement received a placebo.
NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".
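The effect of useNA can be seen on a small vector. This is a sketch with made-up values (the vector x is not taken from the Arthritis data); it compares the default behavior with useNA="ifany".

```r
# Hypothetical ratings with two missing values
x <- factor(c("None", "Some", NA, "Marked", "None", NA),
            levels = c("None", "Some", "Marked"))

no_na   <- table(x)                   # default: NAs are silently dropped
with_na <- table(x, useNA = "ifany")  # NA is counted as its own category

no_na
with_na
```

The first table accounts for four observations and the second for all six, so comparing the two totals is a quick way to see how many values were missing.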
A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. See listing 7.11 for an example.
Listing 7.11 Two-way table using CrossTable
> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|
The CrossTable() function has options to report percentages (row, column, cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.
If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.
MULTIDIMENSIONAL TABLES
Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in listing 7.12.
Listing 7.12 Three-way contingency table
> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> mytable                                   # (1) Cell frequencies
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
> margin.table(mytable, 1)                  # (2) Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable, c(1, 3))            # (3) Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable, c(1, 2)))      # (4) Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000
The code at callout 1 (the xtabs() call and the printed mytable) produces cell frequencies for the three-way classification. The listing also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.
The code at callout 2 (the three single-index margin.table() calls) produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.
The code at callout 3, margin.table(mytable, c(1, 3)), produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportions of patients with None, Some, and Marked improvement for each Treatment x Sex combination are provided at callout 4 by prop.table(mytable, c(1, 2)). Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved, in this case). You can see this in the last example, where you add a sum margin over the third index.
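The sum-to-one rule can be checked directly. The sketch below rebuilds the three-way table from the cell counts printed in listing 7.12 (so it runs without the vcd package) and sums the proportions over the index left out of the prop.table() call. The object names counts, tab3, and p are assumptions made for this example.

```r
# Cell counts from the three-way listing; the first index (Treatment) varies
# fastest, then Sex, then Improved.
counts <- array(
  c(19,  6, 10, 7,    # Improved = None:   Female (Placebo, Treated), Male (Placebo, Treated)
     7,  5,  0, 2,    # Improved = Some
     6, 16,  1, 5),   # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))
)
tab3 <- as.table(counts)

# Proportions within each Treatment x Sex combination
p <- prop.table(tab3, c(1, 2))

# Summing over the omitted index (Improved) gives 1 in every cell
apply(p, c(1, 2), sum)
```

If you instead condition on c(1, 3), the proportions sum to one over Sex, which is a useful sanity check whenever it is unclear which margin a table of proportions was computed on.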
If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100
would produce this table:
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9   18.8 100.0
          Male             90.9   0.0    9.1 100.0
Treated   Female           22.2  18.5   59.3 100.0
          Male             50.0  14.3   35.7 100.0
While contingency tables tell you the frequency or proportions of cases for each combination of the variables that make up the table, you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.
7.2.2 Tests of independence
R provides several methods of testing the independence of categorical variables. The three tests described in this section are the chi-square test of independence, the Fisher exact test, and the Cochran–Mantel–Haenszel test.
CHI-SQUARE TEST OF INDEPENDENCE
You can apply the function chisq.test() to a two-way table in order to produce a chi-square test of independence of the row and column variables. See this next listing for an example.
Listing 7.13 Chi-square test of independence
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)                       # (1) Treatment and Improved not independent

        Pearson’s Chi-squared test

data:  mytable
X-squared = 13.1, df = 2, p-value = 0.001463

> mytable <- xtabs(~Improved+Sex, data=Arthritis)
> chisq.test(mytable)                       # (2) Gender and Improved independent

        Pearson’s Chi-squared test

data:  mytable
X-squared = 4.84, df = 2, p-value = 0.0889

Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect
From the results of the first test (callout 1), there appears to be a relationship between treatment received and level of improvement (p < .01). But the second test (callout 2) gives no evidence of a relationship between patient sex and improvement (p > .05). Each p-value is the probability of obtaining results at least as extreme as those in the sample, assuming independence of the row and column variables in the population. Because the probability is small for the first test, you reject the hypothesis that treatment type and outcome are independent. Because the probability for the second test isn’t small, it’s not unreasonable to assume that outcome and gender are independent. The warning message in listing 7.13 is produced because one of the six cells in the table (male-some improvement) has an expected value less than five, which may invalidate the chi-square approximation.
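To see where the warning comes from, you can inspect the expected counts stored in the object that chisq.test() returns. The sketch below rebuilds the Improved x Sex table from the marginal counts shown earlier in this section so that it runs without vcd; the object names tab and res are assumptions made for this example.

```r
# Improved x Sex counts (Female: 25, 12, 22; Male: 17, 2, 6)
tab <- as.table(matrix(
  c(25, 12, 22,    # Female: None, Some, Marked
    17,  2,  6),   # Male:   None, Some, Marked
  nrow = 3,
  dimnames = list(Improved = c("None", "Some", "Marked"),
                  Sex      = c("Female", "Male"))
))

# Suppress the approximation warning so the expected counts can be examined
res <- suppressWarnings(chisq.test(tab))

# Expected count = row total * column total / grand total
round(res$expected, 2)

# Which cells fall below the usual rule-of-thumb threshold of 5?
which(res$expected < 5, arr.ind = TRUE)
```

Only the Some/Male cell falls below five (25 * 14 / 84 is about 4.17). When this happens, Fisher’s exact test, described next, is one alternative that doesn’t rely on the large-sample approximation.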
FISHER’S EXACT TEST
You can produce a Fisher’s exact test via the fisher.test() function. Fisher’s exact test evaluates the null hypothesis of independence of rows and columns in a contingency table with fixed marginals. The format is fisher.test(mytable), where mytable is a two-way table. Here’s an example:
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

        Fisher’s Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided
In contrast to many statistical packages, the fisher.test() function can be applied to any two-way table with two or more rows and columns, not just a 2x2 table.
COCHRAN–MANTEL–HAENSZEL TEST
The mantelhaen.test() function provides a Cochran–Mantel–Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent in each stratum of a third variable. The following code tests the hypothesis that the Treatment and Improved variables are independent within each level of Sex. The test assumes that there’s no three-way (Treatment x Improved x Sex) interaction.
> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)

        Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.6, df = 2, p-value = 0.0006647
The results suggest that the treatment received and the improvement reported aren’t independent within each level of sex (that is, treated individuals improved more than those receiving placebos when controlling for sex).
7.2.3 Measures of association
The significance tests in the previous section evaluated whether or not sufficient evidence existed to reject a null hypothesis of independence between variables. If you can reject the null hypothesis, your interest turns naturally to measures of association in order to gauge the strength of the relationships present. The assocstats() function in the vcd package can be used to calculate the phi coefficient, contingency coefficient, and Cramer’s V for a two-way table. An example is given in the following listing.
Listing 7.14 Measures of association for a two-way table
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> assocstats(mytable)
                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : 0.394
Contingency Coeff.: 0.367
Cramer’s V        : 0.394
In general, larger magnitudes indicate stronger associations. The vcd package also provides a Kappa() function that can calculate Cohen’s kappa and weighted kappa for a confusion matrix (for example, the degree of agreement between two judges classifying a set of objects into categories).
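The three measures reported above are simple functions of the Pearson chi-square statistic, so they can be reproduced by hand with base R. The following is a sketch using the standard textbook formulas, not the internals of assocstats(); the object names tab, x2, n, k, phi, cc, and v are assumptions made for this example.

```r
# Treatment x Improved counts from the Arthritis data (filled column by column)
tab <- as.table(matrix(
  c(29, 13,  7, 7,  7, 21),
  nrow = 2,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

x2 <- unname(chisq.test(tab, correct = FALSE)$statistic)  # Pearson chi-square
n  <- sum(tab)                                            # sample size
k  <- min(dim(tab)) - 1                                   # min(rows - 1, cols - 1)

phi <- sqrt(x2 / n)          # phi coefficient
cc  <- sqrt(x2 / (x2 + n))   # contingency coefficient
v   <- sqrt(x2 / (n * k))    # Cramer's V

round(c(phi = phi, contingency = cc, cramers_v = v), 3)
```

Because the smaller dimension of this table is 2, k equals 1 and Cramer’s V coincides with phi, which is why both are reported as 0.394.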
7.2.4 Visualizing results
R has mechanisms for visually exploring the relationships among categorical variables that go well beyond those found in most other statistical platforms. You typically use bar charts to visualize frequencies in one dimension (see chapter 6, section 6.1). The vcd package has excellent functions for visualizing relationships among categorical variables in multidimensional datasets using mosaic and association plots (see chapter 11, section 11.4). Finally, correspondence analysis functions in the ca package allow you to visually explore relationships between rows and columns in contingency tables using various geometric representations (Nenadic and Greenacre, 2007).
7.2.5 Converting tables to flat files
We’ll end this section with a topic that’s rarely covered in books on R but that can be very useful. What happens if you have a table but need the original raw data? For example, say you have the following:
                   Sex Female Male
Treatment Improved
Placebo   None             19   10
          Some              7    0
          Marked            6    1
Treated   None              6    7
          Some              5    2
          Marked           16    5
but you need this:
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
[78 more rows go here]
There are many statistical functions in R that expect the latter format rather than the former. You can use the function provided in the following listing to convert an R table back into a flat data file.
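Listing 7.15 Converting a table into a flat file via table2flat
table2flat <- function(mytable) {
  df <- as.data.frame(mytable)                 # one row per cell, counts in Freq
  rows <- dim(df)[1]
  cols <- dim(df)[2]
  freq <- as.numeric(as.character(df$Freq))    # counts may arrive as factor or character
  x <- NULL
  for (i in seq_len(rows)) {
    for (j in seq_len(freq[i])) {              # seq_len() skips cells with a zero count
      row <- df[i, c(1:(cols-1)), drop=FALSE]  # all columns except Freq
      x <- rbind(x, row)
    }
  }
  row.names(x) <- c(1:dim(x)[1])               # renumber rows 1..n
  return(x)
}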
This function takes an R table (with any number of rows and columns) and returns a data frame in flat file format. You can also use this function to input tables from published studies. For example, let’s say that you came across table 7.2 in a journal and wanted to save it into R as a flat file.
Table 7.2 Contingency table for treatment versus improvement from the Arthritis dataset
                 Improved
Treatment   None   Some   Marked
Placebo       29      7        7
Treated       13      7       21
This next listing describes a method that would do the trick.
Listing 7.16 Using the table2flat() function with published data
> treatment <- rep(c("Placebo", "Treated"), times=3)
> improved <- rep(c("None", "Some", "Marked"), each=2)
> Freq <- c(29,13,7,7,7,21)
> mytable <- data.frame(treatment, improved, Freq)
> mydata <- table2flat(mytable)
> head(mydata)
  treatment improved
1   Placebo     None
2   Placebo     None
3   Placebo     None
4   Placebo     None
5   Placebo     None
6   Placebo     None
[78 more rows go here]
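A quick way to confirm that a table-to-flat conversion is correct is to tabulate the flat data again and compare it with the original counts. The sketch below does this with a vectorized alternative based on rep(), which is offered as one possible approach rather than a replacement for table2flat(). The names counts, flat, and roundtrip are assumptions made for this example, and the counts are those of the Arthritis Treatment x Improved table shown earlier in this section.

```r
# Published cell counts, one row per cell
counts <- data.frame(
  treatment = rep(c("Placebo", "Treated"), times = 3),
  improved  = factor(rep(c("None", "Some", "Marked"), each = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 13, 7, 7, 7, 21)
)

# Repeat row i of the cell table Freq[i] times, then drop the Freq column
flat <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
               c("treatment", "improved")]
row.names(flat) <- NULL

nrow(flat)   # one row per patient

# Round trip: cross-tabulating the flat data should reproduce the counts
roundtrip <- table(flat$treatment, flat$improved)
roundtrip
```

Because rep() accepts a count of zero, empty cells are simply omitted, which sidesteps the need for an explicit loop over cells.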
This ends the discussion of contingency tables, until we take up more advanced topics in chapters 11 and 15. Next, let’s look at various types of correlation coefficients.