Frequency and contingency tables

In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.

The data for this section come from the Arthritis dataset included with the vcd package. The data are from Koch & Edwards (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:

> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, we’ll create frequency and contingency tables (cross-classifications) from the data.

7.2.1 Generating frequency tables

R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.

Table 7.1 Functions for creating and manipulating contingency tables

Function                        Description
table(var1, var2, …, varN)      Creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data)            Creates an N-way contingency table based on a formula and a matrix or data frame
prop.table(table, margins)      Expresses table entries as fractions of the marginal table defined by the margins
margin.table(table, margins)    Computes the sum of table entries for a marginal table defined by the margins
addmargins(table, margins)      Puts summary margins (sums by default) on a table
ftable(table)                   Creates a compact "flat" contingency table

In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or the xtabs() function, then manipulate it using the other functions.

ONE-WAY TABLES

You can generate simple frequency counts using the table() function. Here’s an example:

> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28

You can turn these frequencies into proportions with prop.table():

> prop.table(mytable)
Improved
  None   Some Marked
 0.500  0.167  0.333

or into percentages, using prop.table()*100:

> prop.table(mytable)*100
Improved
  None   Some Marked
  50.0   16.7   33.3

Here you can see that 50 percent of study participants had some or marked improvement (16.7 + 33.3).
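The 50 percent figure can be reproduced directly from the table. The following is a minimal sketch that rebuilds the one-way table from the counts printed above instead of loading vcd; the object names improved_counts, pct, and any_improvement are illustrative only.

```r
# Counts copied from the one-way table above (None, Some, Marked)
improved_counts <- as.table(c(None = 42, Some = 14, Marked = 28))

# Percentages rounded to one decimal place
pct <- round(100 * prop.table(improved_counts), 1)
pct

# Share of participants with any improvement (Some or Marked)
any_improvement <- sum(prop.table(improved_counts)[c("Some", "Marked")])
any_improvement
```

Summing the named entries avoids adding rounded percentages by hand.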

TWO-WAY TABLES

For two-way tables, the format for the table() function is

mytable <- table(A, B)

where A is the row variable, and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula style input. The format is

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs.

If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).
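As a hedged illustration of the left-hand-side form, the sketch below starts from already tabulated counts (copied from the Treatment by Improved table used in this section) and passes the count column on the left of the formula. The data frame name counts and the column name Freq are assumptions made for the example.

```r
# Pre-tabulated data: one row per cell, with the cell count in Freq
counts <- data.frame(
  Treatment = rep(c("Placebo", "Treated"), times = 3),
  Improved  = factor(rep(c("None", "Some", "Marked"), each = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 13, 7, 7, 7, 21)
)

# Freq on the left of ~ is treated as a vector of frequencies
tab <- xtabs(Freq ~ Treatment + Improved, data = counts)
tab
```

Declaring Improved as a factor with explicit levels keeps the columns in the None, Some, Marked order rather than alphabetical order.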

For the Arthritis data, you have

> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have

> margin.table(mytable, 1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512

The index (1) refers to the first variable in the xtabs() formula (Treatment). Looking at the table, you can see that 51 percent of treated individuals had marked improvement, compared to 16 percent of those receiving a placebo.

For column sums and column proportions, you have

> margin.table(mytable, 2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable, 2)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750

Here, the index (2) refers to the second variable in the xtabs() formula (Improved).
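To make the role of the margin index concrete, the following sketch computes row and column proportions by dividing each cell by its row or column total with sweep(), then compares the result with prop.table(). The table is rebuilt from the counts shown above so the example runs without vcd; tab, row_prop, and col_prop are illustrative names.

```r
# Treatment x Improved counts copied from the table above (filled column by column)
tab <- as.table(matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

# Margin 1: divide every cell by its row total, so each row sums to 1
row_prop <- sweep(tab, 1, margin.table(tab, 1), "/")

# Margin 2: divide every cell by its column total, so each column sums to 1
col_prop <- sweep(tab, 2, margin.table(tab, 2), "/")

# Same numbers as prop.table() with the matching index
all.equal(as.vector(row_prop), as.vector(prop.table(tab, 1)))
rowSums(row_prop)
colSums(col_prop)
```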

Cell proportions are obtained with this statement:

> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500

You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a sum row and column:

> addmargins(mytable)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000

When using addmargins(), the default is to create sum margins for all variables in a table. In contrast:

> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000

adds a sum column alone. Similarly,

> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
  Sum     1.000 1.000  1.000

adds a sum row. In the table, you see that 25 percent of those patients with marked improvement received a placebo.

NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".
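A small sketch of the difference, using a made-up vector (response is hypothetical data, not part of the Arthritis dataset):

```r
# Hypothetical responses with two missing values
response <- c("Some", "None", NA, "Marked", "None", NA)

# Default: NAs are silently dropped from the counts
default_counts <- table(response)
default_counts

# useNA = "ifany": NA gets its own category when at least one is present
with_na <- table(response, useNA = "ifany")
with_na
```

Comparing sum(default_counts) with length(response) is a quick way to check whether any observations were dropped.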

A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. See listing 7.11 for an example.

Listing 7.11 Two-way table using CrossTable

> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|

The CrossTable() function has options to report percentages (row, column, cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.

If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.

MULTIDIMENSIONAL TABLES

Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in listing 7.12.

Listing 7.12 Three-way contingency table

> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> mytable                                          # Cell frequencies
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
> margin.table(mytable, 1)                         # Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable, c(1, 3))                   # Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable, c(1, 2)))             # Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000

The first two statements produce cell frequencies for the three-way classification and show how the ftable() function can be used to print a more compact and attractive version of the table.

The three single-index margin.table() calls produce the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.

The margin.table(mytable, c(1, 3)) call produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The prop.table(mytable, c(1, 2)) call provides the proportion of patients with None, Some, and Marked improvement for each Treatment x Sex combination. Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved, in this case). You can see this in the last example, where you add a sum margin over the third index.
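The rule that proportions add to one over the omitted index can be checked numerically. The sketch below rebuilds the three-way table from the cell frequencies reported in this section (so it does not depend on vcd); the names cell_counts, tab3, and p are illustrative.

```r
# Cell frequencies, filled with Treatment varying fastest, then Sex, then Improved
cell_counts <- array(
  c(19, 6, 10, 7,    # Improved = None
     7, 5,  0, 2,    # Improved = Some
     6, 16, 1, 5),   # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))
)
tab3 <- as.table(cell_counts)

# Proportions within each Treatment x Sex combination
p <- prop.table(tab3, c(1, 2))

# Summing over the omitted index (Improved) gives 1 for every combination
margin.table(p, c(1, 2))

# Marked improvement among treated males
p["Treated", "Male", "Marked"]
```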

If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:

ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

would produce this table:

                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9   18.8 100.0
          Male             90.9   0.0    9.1 100.0
Treated   Female           22.2  18.5   59.3 100.0
          Male             50.0  14.3   35.7 100.0

While contingency tables tell you the frequency or proportions of cases for each combination of the variables that comprise the table, you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.

7.2.2 Tests of independence

R provides several methods of testing the independence of the categorical variables. The three tests described in this section are the chi-square test of independence, the Fisher exact test, and the Cochran–Mantel–Haenszel test.

CHI-SQUARE TEST OF INDEPENDENCE

You can apply the function chisq.test() to a two-way table in order to produce a chi-square test of independence of the row and column variables. See this next listing for an example.

Listing 7.13 Chi-square test of independence

> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)                      # Treatment and Improved not independent

        Pearson's Chi-squared test

data:  mytable
X-squared = 13.1, df = 2, p-value = 0.001463

> mytable <- xtabs(~Improved+Sex, data=Arthritis)
> chisq.test(mytable)                      # Gender and Improved independent

        Pearson's Chi-squared test

data:  mytable
X-squared = 4.84, df = 2, p-value = 0.0889

Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect

From the results of the first test, there appears to be a relationship between treatment received and level of improvement (p < .01). But the second test gives no evidence of a relationship between patient sex and improvement (p > .05). Each p-value is the probability of obtaining results at least as extreme as those observed, assuming independence of the row and column variables in the population. Because the probability is small for the first test, you reject the hypothesis that treatment type and outcome are independent. Because the probability for the second test isn’t small, it’s not unreasonable to assume that outcome and gender are independent. The warning message in listing 7.13 is produced because one of the six cells in the table (male-some improvement) has an expected value less than five, which may invalidate the chi-square approximation.
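To see which cell triggers the warning, you can inspect the expected counts stored in the object returned by chisq.test(). This is a minimal sketch that rebuilds the Improved by Sex table from the three-way cell frequencies reported earlier (summed over Treatment); tab and res are illustrative names.

```r
# Improved x Sex counts, derived from the three-way table (filled column by column)
tab <- as.table(matrix(c(25, 12, 22,    # Female: None, Some, Marked
                         17,  2,  6),   # Male:   None, Some, Marked
                       nrow = 3,
                       dimnames = list(Improved = c("None", "Some", "Marked"),
                                       Sex      = c("Female", "Male"))))

# Suppress the approximation warning so the expected counts can be examined
res <- suppressWarnings(chisq.test(tab))

# Expected counts under independence: (row total * column total) / n
res$expected

# Cells with an expected count below five
which(res$expected < 5, arr.ind = TRUE)
```

Only the Some/Male cell falls below five here, with an expected count of 14 x 25 / 84, or about 4.17.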

FISHER’S EXACT TEST

You can produce a Fisher’s exact test via the fisher.test() function. Fisher’s exact test evaluates the null hypothesis of independence of rows and columns in a contingency table with fixed marginals. The format is fisher.test(mytable), where mytable is a two-way table. Here’s an example:

> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

        Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided

In contrast to many statistical packages, the fisher.test() function can be applied to any two-way table with two or more rows and columns, not just a 2 x 2 table.

COCHRAN–MANTEL–HAENSZEL TEST

The mantelhaen.test() function provides a Cochran–Mantel–Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent in each stratum of a third variable. The following code tests the hypothesis that the Treatment and Improved variables are independent within each level of Sex. The test assumes that there’s no three-way (Treatment x Improved x Sex) interaction.

> mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis)
> mantelhaen.test(mytable)

        Cochran-Mantel-Haenszel test

data:  mytable
Cochran-Mantel-Haenszel M^2 = 14.6, df = 2, p-value = 0.0006647

The results suggest that the treatment received and the improvement reported aren’t independent within each level of sex (that is, treated individuals improved more than those receiving placebos when controlling for sex).
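Note that mantelhaen.test() treats the last dimension of the table as the stratifying variable, which is why Sex appears last in the formula. The sketch below shows the same layout built by hand from the cell frequencies reported earlier in this section, as an assumption-laden stand-in for the xtabs() call; strata_counts and cmh are illustrative names.

```r
# Treatment x Improved counts within each Sex stratum (Treatment varies fastest)
strata_counts <- array(
  c(19, 6, 7, 5, 6, 16,    # Sex = Female: None, Some, Marked
    10, 7, 0, 2, 1,  5),   # Sex = Male:   None, Some, Marked
  dim = c(2, 3, 2),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"),
                  Sex       = c("Female", "Male"))
)

# The third dimension (Sex) defines the strata
cmh <- mantelhaen.test(strata_counts)
cmh$parameter   # degrees of freedom: (2 - 1) * (3 - 1)
cmh$p.value
```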

7.2.3 Measures of association

The significance tests in the previous section evaluated whether or not sufficient evidence existed to reject a null hypothesis of independence between variables. If you can reject the null hypothesis, your interest turns naturally to measures of association in order to gauge the strength of the relationships present. The assocstats() function in the vcd package can be used to calculate the phi coefficient, contingency coefficient, and Cramer’s V for a two-way table. An example is given in the following listing.

Listing 7.14 Measures of association for a two-way table

> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> assocstats(mytable)
                    X^2 df  P(> X^2)
Likelihood Ratio 13.530  2 0.0011536
Pearson          13.055  2 0.0014626

Phi-Coefficient   : 0.394
Contingency Coeff.: 0.367
Cramer's V        : 0.394

In general, larger magnitudes indicate stronger associations. The vcd package also provides a Kappa() function that can calculate Cohen’s kappa and weighted kappa for a confusion matrix (for example, the degree of agreement between two judges classifying a set of objects into categories).
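The contingency coefficient and Cramer’s V are simple functions of the Pearson chi-square statistic, so they can be cross-checked in base R. The following sketch uses the standard textbook formulas, which are not necessarily how assocstats() is implemented internally, and rebuilds the Treatment by Improved table from the counts shown in this section; all object names are illustrative.

```r
# Treatment x Improved counts (filled column by column)
tab <- as.table(matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

x2 <- unname(chisq.test(tab)$statistic)   # Pearson chi-square
n  <- sum(tab)                            # total sample size
k  <- min(dim(tab)) - 1                   # min(rows, columns) - 1

# Contingency coefficient: sqrt(X^2 / (X^2 + n))
contingency_coef <- sqrt(x2 / (x2 + n))

# Cramer's V: sqrt(X^2 / (n * min(r - 1, c - 1)))
cramers_v <- sqrt(x2 / (n * k))

round(c(contingency = contingency_coef, cramers_v = cramers_v), 3)
```

Because one dimension of this table has only two levels, k equals 1 and Cramer’s V coincides with sqrt(X^2 / n), which matches the identical phi and V values in listing 7.14.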

7.2.4 Visualizing results

R has mechanisms for visually exploring the relationships among categorical variables that go well beyond those found in most other statistical platforms. You typically use bar charts to visualize frequencies in one dimension (see chapter 6, section 6.1). The vcd package has excellent functions for visualizing relationships among categorical variables in multidimensional datasets using mosaic and association plots (see chapter 11, section 11.4). Finally, correspondence analysis functions in the ca package allow you to visually explore relationships between rows and columns in contingency tables using various geometric representations (Nenadic and Greenacre, 2007).

7.2.5 Converting tables to flat files

We’ll end this section with a topic that’s rarely covered in books on R but that can be very useful. What happens if you have a table but need the original raw data? For example, say you have the following:

                   Sex Female Male
Treatment Improved
Placebo   None             19   10
          Some              7    0
          Marked            6    1
Treated   None              6    7
          Some              5    2
          Marked           16    5

but you need this:

  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
[78 more rows go here]

There are many statistical functions in R that expect the latter format rather than the former. You can use the function provided in the following listing to convert an R table back into a flat data file.

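Listing 7.15 Converting a table into a flat file via table2flat

table2flat <- function(mytable) {
  df <- as.data.frame(mytable)                 # one row per cell, counts in Freq
  rows <- dim(df)[1]
  cols <- dim(df)[2]
  freq <- as.numeric(as.character(df$Freq))    # counts may arrive as text or factors
  x <- NULL
  for (i in 1:rows) {
    for (j in seq_len(freq[i])) {              # seq_len() skips cells with a zero count
      row <- df[i, c(1:(cols-1))]              # all columns except the last (Freq)
      x <- rbind(x, row)
    }
  }
  row.names(x) <- c(1:dim(x)[1])
  return(x)
}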

This function takes an R table (with any number of rows and columns) and returns a data frame in flat file format. You can also use this function to input tables from published studies. For example, let’s say that you came across table 7.2 in a journal and wanted to save it into R as a flat file.

Table 7.2 Contingency table for treatment versus improvement from the Arthritis dataset

             Improved
Treatment    None   Some   Marked
Placebo        29      7        7
Treated        13      7       21

This next listing describes a method that would do the trick.

Listing 7.16 Using the table2flat() function with published data

> treatment <- rep(c("Placebo", "Treated"), times=3)
> improved <- rep(c("None", "Some", "Marked"), each=2)
> Freq <- c(29,13,7,7,7,21)
> mytable <- as.data.frame(cbind(treatment, improved, Freq))
> mydata <- table2flat(mytable)
> head(mydata)
  treatment improved
1   Placebo     None
2   Placebo     None
3   Placebo     None
4   Placebo     None
5   Placebo     None
6   Placebo     None
[78 more rows go here]
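As an alternative to the nested loops in table2flat(), the same expansion can be sketched with rep(), which repeats each row index as many times as its count. This is offered as one possible approach, not as the function used in the listings; the counts are those of the Treatment by Improved table from earlier in this section, and the names counts, idx, and flat are illustrative.

```r
# One row per cell, with the cell count in Freq
counts <- data.frame(
  treatment = rep(c("Placebo", "Treated"), times = 3),
  improved  = rep(c("None", "Some", "Marked"), each = 2),
  Freq      = c(29, 13, 7, 7, 7, 21)
)

# Repeat each row index Freq times; rows with a zero count are dropped
idx  <- rep(seq_len(nrow(counts)), times = counts$Freq)
flat <- counts[idx, c("treatment", "improved")]
row.names(flat) <- NULL   # renumber rows 1..n

nrow(flat)
head(flat)
```

Avoiding repeated rbind() calls matters mostly for large tables, where growing a data frame one row at a time becomes slow.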

This ends the discussion of contingency tables, until we take up more advanced topics in chapters 11 and 15. Next, let’s look at various types of correlation coefficients.
