The smoothing parameters for regression splines encompass the degree of the splines, and the number and placement of the knots. For smoothing splines, we have only the penalty parameter λ to select, since the knots are at all the unique training X's, and cubic degree is almost always used in practice.
FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the elements of S as an image. The right panel shows the equivalent kernel or weighting function in detail for the indicated rows (12, 25, 50, 75, 100 and 115).
Selecting the placement and number of knots for regression splines can be a combinatorially complex task, unless some simplifications are enforced.
The MARS procedure in Chapter 9 uses a greedy algorithm with some additional approximations to achieve a practical compromise. We will not discuss this further here.
5.5.1 Fixing the Degrees of Freedom
Since dfλ = trace(Sλ) is monotone in λ for smoothing splines, we can invert the relationship and specify λ by fixing df. In practice this can be achieved by simple numerical methods. So, for example, in R one can use smooth.spline(x,y,df=6) to specify the amount of smoothing. This encourages a more traditional mode of model selection, where we might try a couple of different values of df, and select one based on approximate F-tests, residual plots and other more subjective criteria. Using df in this way provides a uniform approach to compare many different smoothing methods.
It is particularly useful in generalized additive models (Chapter 9), where several smoothing methods can be simultaneously used in one model.
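For concreteness, here is a minimal R sketch of specifying the smoothing via df; only the smooth.spline call itself is taken from the text above, and the simulated data are there purely so the snippet is self-contained.

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # illustrative data only

fit6 <- smooth.spline(x, y, df = 6)   # lambda is found numerically so that trace(S_lambda) = 6
fit9 <- smooth.spline(x, y, df = 9)   # a second candidate amount of smoothing

fit6$df       # effective degrees of freedom attained (close to 6)
fit6$lambda   # the penalty parameter lambda implied by df = 6

# Overlay the two fits as a subjective comparison of candidate df values.
plot(x, y)
lines(predict(fit6, sort(x)), col = "blue")
lines(predict(fit9, sort(x)), col = "red")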
5.5.2 The Bias–Variance Tradeoff
Figure 5.9 shows the effect of the choice of dfλ when using a smoothing spline on a simple example:
Y = f(X) + ε,     f(X) = sin(12(X + 0.2)) / (X + 0.2),     (5.22)

with X ∼ U[0,1] and ε ∼ N(0,1). Our training sample consists of N = 100 pairs xi, yi drawn independently from this model.
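A sketch of how such a training sample can be simulated in R, together with smoothing-spline fits at the three values of dfλ used in Figure 5.9:

set.seed(123)
N <- 100
f <- function(x) sin(12 * (x + 0.2)) / (x + 0.2)   # the true f of (5.22)
x <- runif(N)                                      # X ~ U[0, 1]
y <- f(x) + rnorm(N)                               # epsilon ~ N(0, 1)

# Smoothing-spline fits at the three df values shown in Figure 5.9.
fits <- lapply(c(5, 9, 15), function(d) smooth.spline(x, y, df = d))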
The fitted splines for three different values of dfλ are shown. The yellow shaded region in the figure represents the pointwise standard error of f̂λ, that is, we have shaded the region between f̂λ(x) ± 2·se(f̂λ(x)). Since f̂ = Sλy,

Cov(f̂) = Sλ Cov(y) Sλᵀ
        = Sλ Sλᵀ.                    (5.23)

The diagonal contains the pointwise variances at the training xi. The bias is given by

Bias(f̂) = f − E(f̂)
         = f − Sλf,                  (5.24)

where f is the (unknown) vector of evaluations of the true f at the training X's.
FIGURE 5.9. The top left panel shows the EPE(λ) and CV(λ) curves for a realization from a nonlinear additive error model (5.22). The remaining panels show the data, the true functions (in purple), and the fitted curves (in green) with yellow shaded ±2× standard error bands, for three different values of dfλ: 5, 9 and 15.
The expectations and variances are with respect to repeated draws of samples of size N = 100 from the model (5.22). In a similar fashion Var(f̂λ(x0)) and Bias(f̂λ(x0)) can be computed at any point x0 (Exercise 5.10). The three fits displayed in the figure give a visual demonstration of the bias–variance tradeoff associated with selecting the smoothing parameter.
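A sketch of how the shaded bands and the bias vector of (5.23)–(5.24) could be computed for the simulated sample above. It relies on the fact that, for a fixed df, λ (and hence Sλ) depends only on the x values, so the columns of Sλ can be recovered by smoothing unit vectors; this assumes the x's are unique, which holds with probability one here.

d <- 9
S <- sapply(1:N, function(i) {
  e <- rep(0, N); e[i] <- 1
  predict(smooth.spline(x, e, df = d), x)$y   # i-th column of S_lambda
})
fit  <- smooth.spline(x, y, df = d)
fhat <- predict(fit, x)$y

se.fit <- sqrt(rowSums(S^2))         # diag(S S^T) of (5.23), since Var(eps) = 1
bias   <- f(x) - drop(S %*% f(x))    # f - S_lambda f, as in (5.24)

# The +/- 2 standard-error band shaded in Figure 5.9.
upper <- fhat + 2 * se.fit
lower <- fhat - 2 * se.fit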
dfλ = 5: The spline underfits, and clearly trims down the hills and fills in the valleys. This leads to a bias that is most dramatic in regions of high curvature. The standard error band is very narrow, so we estimate a badly biased version of the true function with great reliability!

dfλ = 9: Here the fitted function is close to the true function, although a slight amount of bias seems evident. The variance has not increased appreciably.

dfλ = 15: The fitted function is somewhat wiggly, but close to the true function. The wiggliness also accounts for the increased width of the standard error bands: the curve is starting to follow some individual points too closely.
Note that in these figures we are seeing a single realization of data and hence fitted spline f̂ in each case, while the bias involves an expectation E(f̂). We leave it as an exercise (5.10) to compute similar figures where the bias is shown as well. The middle curve seems “just right,” in that it has achieved a good compromise between bias and variance.
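A rough sketch of the kind of computation Exercise 5.10 asks for: approximate E(f̂) and Var(f̂) by brute force, refitting on repeated draws from (5.22) (reusing the function f defined above).

nsim  <- 200
x0    <- seq(0, 1, length.out = 50)               # grid of evaluation points
fhat0 <- replicate(nsim, {
  xs <- runif(100); ys <- f(xs) + rnorm(100)      # a fresh training sample
  predict(smooth.spline(xs, ys, df = 9), x0)$y
})
bias0 <- f(x0) - rowMeans(fhat0)                  # Bias(fhat(x0)), as in (5.24)
var0  <- apply(fhat0, 1, var)                     # Var(fhat(x0))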
The integrated squared prediction error (EPE) combines both bias and variance in a single summary:
EPE(f̂λ) = E(Y − f̂λ(X))²
         = Var(Y) + E[Bias²(f̂λ(X)) + Var(f̂λ(X))]
         = σ² + MSE(f̂λ).                    (5.25)
Note that this is averaged both over the training sample (giving rise to f̂λ), and the values of the (independently chosen) prediction points (X, Y). EPE is a natural quantity of interest, and does create a tradeoff between bias and variance. The blue points in the top left panel of Figure 5.9 suggest that dfλ = 9 is spot on!
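A sketch of how curves like those in the top-left panel could be reproduced: a brute-force Monte Carlo estimate of EPE over a grid of df values (feasible only because the true f is known in this simulation), and the leave-one-out CV score for the single training sample, computed from the leverages (the diagonal of Sλ) that smooth.spline returns in its lev component. The alignment of y with fit$y assumes unique x values.

dfs <- 4:15

# Monte Carlo EPE: average squared prediction error over fresh training
# samples and independent test points, both drawn from model (5.22).
epe <- sapply(dfs, function(d) {
  mean(replicate(50, {
    xtr <- runif(100); ytr <- f(xtr) + rnorm(100)
    xte <- runif(100); yte <- f(xte) + rnorm(100)
    mean((yte - predict(smooth.spline(xtr, ytr, df = d), xte)$y)^2)
  }))
})

# Leave-one-out CV for the single training sample (x, y), via the usual
# shortcut for linear smoothers; fit$y and fit$lev are in sorted-x order.
ord <- order(x)
cv <- sapply(dfs, function(d) {
  fit <- smooth.spline(x, y, df = d)
  mean(((y[ord] - fit$y) / (1 - fit$lev))^2)
})

matplot(dfs, cbind(epe, cv), type = "b", pch = c("E", "C"),
        xlab = "df", ylab = "EPE and CV")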
Since we don’t know the true function, we do not have access to EPE, and need an estimate. This topic is discussed in some detail in Chapter 7, and techniques such as K-fold cross-validation, GCV and Cp are all in common use. In Figure 5.9 we include the N-fold (leave-one-out) cross-validation curve: