Analytic packages for large datasets

R provides several packages for the analysis of large datasets:

■ The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner. They offer lm() and glm() type functionality when dealing with massive datasets.

■ Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.

■ The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.

■ The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
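To make the idea of memory-efficient model fitting from the first bullet more concrete, here is a minimal base R sketch that fits a linear model by accumulating cross-products one chunk of rows at a time. It is an illustration of the general idea only, not a description of how biglm or speedglm work internally. The simulated data, the chunk size, and all variable names are assumptions made for the example.

```r
# Simulated data standing in for a dataset too large to load at once
# (an assumption for this sketch; real data would be read chunk by chunk).
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

# Pretend the rows arrive in 10 chunks of 100 rows each.
chunks <- split(dat, rep(1:10, each = n / 10))

# Accumulate the cross-products X'X and X'y one chunk at a time,
# so only a 3 x 3 matrix and a 3 x 1 matrix persist between chunks.
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (chunk in chunks) {
  X <- model.matrix(~ x1 + x2, data = chunk)
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}

# Solve the normal equations for the coefficients.
beta_chunked <- drop(solve(xtx, xty))

# For comparison: the usual in-memory fit on the full data.
beta_full <- coef(lm(y ~ x1 + x2, data = dat))
```

The two coefficient vectors should agree up to numerical rounding. Solving the normal equations directly is less numerically stable than the decomposition-based methods used in production code, so treat this as a teaching device rather than a recommended approach.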

Working with datasets in the gigabyte to terabyte range can be challenging in any language. For more information on the methods available within R, see the CRAN Task View: High-Performance and Parallel Computing with R (


appendix H Updating an R installation

As consumers, we take for granted that we can update a piece of software via a “Check for updates…” option. In chapter 1, I noted that the update.packages() function can be used to download and install the most recent version of a contributed package. Unfortunately, there’s no corresponding function for updating the R installation itself. If you want to update an R installation from version 4.1.0 to 5.1.1, you must get creative. (As I write this, the current version is actually 2.13.0, but I want this book to appear hip and current for years to come.)
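Before deciding whether an upgrade is needed, it can help to check which version of R is currently running. The following is a small sketch using base R's getRversion() and package_version(). The reference version "4.1.0" is simply the example number used above, and the variable names are hypothetical.

```r
# Version of the R session that is currently running.
current <- getRversion()

# Hypothetical reference version to compare against.
reference <- package_version("4.1.0")

# TRUE if the running R is older than the reference version.
needs_upgrade <- current < reference
```

Version objects compare component by component, so "2.13.0" is correctly treated as older than "4.1.0", which a plain string comparison wouldn't guarantee in general.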

Downloading and installing the latest version of R from CRAN (http://cran.r- is relatively straightforward. The complicating factor is that customizations (including previously installed contributed packages) will not be included in the new installation. In my current set-up, I have 248 contributed packages installed. I really don’t want to have to write their names down and reinstall them by hand the next time I upgrade my R installation.

There has been much discussion on the web concerning the most elegant and efficient way to update an R installation. The method described below is neither elegant nor efficient, but I find that it works well on a variety of platforms (Windows, Mac, and Linux).

In this approach, the installed.packages() function is used to save a list of packages to a location outside of the R directory tree, and then the list is used with the install.packages() function to download and install the latest contributed packages into the new R installation. Here are the steps:

1 If you have a customized file (see appendix B), save a copy outside of R.

2 Launch your current version of R and issue the following statements

oldip <- installed.packages()[,1]

save(oldip, file="path/installedPackages.Rdata")

where path is a directory outside of R.

3 Download and install the newer version of R.

4 If you saved a customized version of the file in step 1, copy it into the new installation.

5 Launch the new version of R, and issue the following statements

load("path/installedPackages.Rdata")

newip <- installed.packages()[,1]

for(i in setdiff(oldip, newip)) install.packages(i)

where path is the location specified in step 2.

6 Delete the old installation (optional).
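Steps 2 and 5 can also be wrapped in a pair of small helper functions. The sketch below is one possible arrangement, not the only one. The function names save_package_list() and missing_packages() are hypothetical, and it uses saveRDS()/readRDS() in place of the save()/load() calls shown above, so that the list comes back as a return value instead of a variable restored by name.

```r
# Step 2 (run in the OLD R): write the names of installed packages to a file.
# 'file' should point to a location outside the R directory tree.
save_package_list <- function(file,
                              pkgs = unname(installed.packages()[, "Package"])) {
  saveRDS(pkgs, file)
  invisible(pkgs)
}

# Step 5 (run in the NEW R): return the saved packages that are not yet installed.
missing_packages <- function(file,
                             installed = unname(installed.packages()[, "Package"])) {
  setdiff(readRDS(file), installed)
}

# Typical use (commented out because installing needs a network connection):
# save_package_list("path/installedPackages.rds")            # in the old R
# todo <- missing_packages("path/installedPackages.rds")     # in the new R
# if (length(todo) > 0) install.packages(todo)
```

Because install.packages() accepts a character vector of package names, the for loop from step 5 can be replaced by a single call.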

This approach will install only packages that are available from CRAN. It won’t find packages obtained from other locations. You’ll have to find and download these separately. Luckily, the process will display a list of packages that can’t be installed. During my last installation, globaltest and Biobase couldn’t be found. Because I got them from the Bioconductor site, I was able to install them via the code

source( biocLite("globaltest")
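Rather than waiting for installation failures, you can sort the saved package names into those that CRAN offers and those that must come from elsewhere. The following is a minimal sketch. The helper name split_by_source() is hypothetical, and the list of CRAN package names is passed in as an argument so the function itself doesn't need a network connection.

```r
# Split a vector of wanted package names into those found in a set of
# known CRAN package names and those that must be obtained elsewhere.
split_by_source <- function(wanted, cran_names) {
  list(
    cran  = intersect(wanted, cran_names),
    other = setdiff(wanted, cran_names)
  )
}

# With a network connection, the CRAN names could be obtained via
#   cran_names <- rownames(available.packages())
# Packages in $other then need a separate installer. For Bioconductor
# packages, current Bioconductor documentation describes
# BiocManager::install(); that is not the mechanism shown in the text above.
example <- split_by_source(
  wanted     = c("ggplot2", "globaltest", "Biobase"),
  cran_names = c("ggplot2", "MASS")   # made-up subset, for illustration only
)
```

Here example$cran holds the names that can go to install.packages(), and example$other holds the ones to track down by hand.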


Step 6 involves the optional deletion of the old installation. On a Windows machine, more than one version of R can be installed at a time. If desired, uninstall the older version via Start > Control Panel > Uninstall a Program. On Mac and Linux platforms, the new version of R will overwrite the older version. To delete any remnants on a Mac, use the Finder to go to the /Library/Frameworks/R.framework/Versions/ directory and delete the folder representing the older version. On a Linux platform, it’s probably best to leave well enough alone.

Clearly, updating an existing version of R is more involved than is desirable for such a sophisticated piece of software. I’m hopeful that someday this appendix will simply say “Select the Check for Updates… option” to update an R installation.




