Predicting New y Values Using Regression

Part of: Ebook An Introduction to Statistical Methods and Data Analysis (6th Edition), Part 2 (pp. 23–27)

In all the regression analyses we have done so far, we have been summarizing and making inferences about relations in data that have already been observed. Thus, we have been predicting the past. One of the most important uses of regression is trying to forecast the future. In the road resurfacing example, the county highway director wants to predict the cost of a new contract that is up for bids. In a regression relating the change in systolic blood pressure for a specified dose of a drug, the doctor will want to predict the change in systolic blood pressure for a dose level not used in the study. In this section, we discuss how to make such regression predictions and how to determine prediction intervals, which convey our uncertainty in these predictions.

Confidence Interval for the Intercept $\beta_0$

The required degrees of freedom for the table value of $t_{\alpha/2}$ is $n - 2$, the error df.

$$\hat{\beta}_0 \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$


There are two possible interpretations of a y prediction based on a given x. Suppose that the highway director substitutes x = 6 miles in the regression equation $\hat{y} = 2.0 + 3.0x$ and gets $\hat{y} = 20$. This can be interpreted as either

"The average cost E(y) of all resurfacing contracts for 6 miles of road will be $20,000."

or

"The cost y of this specific resurfacing contract for 6 miles of road will be $20,000."

The best-guess prediction in either case is 20, but the plus or minus factor differs. It is easier to estimate an average value E(y) than predict an individual y value, so the plus or minus factor should be less for estimating an average. We discuss the plus or minus range for estimating an average first, with the understanding that this is an intermediate step toward solving the specific-value problem.

In the mean-value estimating problem, suppose that the value of x is known. Because the previous values of x have been designated $x_1, \ldots, x_n$, call the new value $x_{n+1}$. Then $\hat{y}_{n+1} = \hat{\beta}_0 + \hat{\beta}_1 x_{n+1}$ is used to predict $E(y_{n+1})$. Because $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased, $\hat{y}_{n+1}$ is an unbiased predictor of $E(y_{n+1})$. The standard error of the estimated value can be shown to be

$$s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$$

Here $S_{xx}$ is the sum of squared deviations of the original n values of $x_i$; it can be calculated from most computer outputs as

$$S_{xx} = \left(\frac{s_\varepsilon}{\mathrm{SE}(\hat{\beta}_1)}\right)^2$$

Again, t tables with $n - 2$ df (the error df) must be used. The usual approach to forming a confidence interval, namely estimate plus or minus $t_{\alpha/2}$ (standard error), yields a confidence interval for $E(y_{n+1})$. Some of the better statistical computer packages will calculate this confidence interval if a new x value is specified without specifying a corresponding y.
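The computation just described can be sketched in a few lines of Python. The soil pH and growth data below are hypothetical stand-ins (the original data of Example 11.4 are not reproduced here), so the fitted numbers differ from the Minitab output, but the steps — least squares, $S_{xx}$, $s_\varepsilon$, the standard error, and the interval — follow the formulas in the text.

```python
# Confidence interval for the mean response E(y_{n+1}) in simple linear
# regression -- a sketch with HYPOTHETICAL data, not the Example 11.4 data.
x = [3.0, 3.2, 3.3, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1,
     4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.2]
y = [24.0, 22.5, 21.8, 20.9, 19.6, 19.2, 18.0, 17.5, 16.4, 15.8,
     15.1, 14.0, 13.6, 12.4, 11.9, 11.0, 10.2, 9.5, 8.8, 7.3]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Residual standard deviation s_eps, with n - 2 error df
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = (sse / (n - 2)) ** 0.5

# 95% CI for E(y_{n+1}) at a new x; 2.101 is the tabled t_{.025}, 18 df
t_crit = 2.101
x_new = 4.0
y_hat = b0 + b1 * x_new
se_mean = s_eps * (1 / n + (x_new - xbar) ** 2 / Sxx) ** 0.5
ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
print(round(y_hat, 3), tuple(round(v, 3) for v in ci))
```

With real data, a statistics package would report the same interval directly (as the Minitab output below does), but the hand computation makes the role of $S_{xx}$ and the error df explicit.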


Confidence Interval for $E(y_{n+1})$

The degrees of freedom for the tabled t distribution are $n - 2$.

$$\hat{y}_{n+1} - t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}} \;\le\; E(y_{n+1}) \;\le\; \hat{y}_{n+1} + t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$$

For the tree growth retardation example, the computer output displayed here shows the estimated value of the average growth retardation, $E(y_{n+1})$, to be 16.038 when the soil pH is x = 4.0. The corresponding 95% confidence interval on $E(y_{n+1})$ is 14.759 to 17.318.

The plus or minus term in the confidence interval for $E(y_{n+1})$ depends on the sample size n and the standard deviation around the regression line, as one might expect. It also depends on the squared distance of $x_{n+1}$ from $\bar{x}$ (the mean of the previous $x_i$ values) relative to $S_{xx}$. As $x_{n+1}$ gets farther from $\bar{x}$, the term

$$\frac{(x_{n+1} - \bar{x})^2}{S_{xx}}$$

gets larger. When $x_{n+1}$ is far away from the other x values, so that this term is large, the prediction is a considerable extrapolation from the data. Small errors in estimating the regression line are magnified by the extrapolation. The term could be called an extrapolation penalty because it increases with the degree of extrapolation.

Extrapolation, predicting the results at independent variable values far from the data, is often tempting and always dangerous. Using it requires an assumption that the relation will continue to be linear far beyond the data. By definition, you have no data to check this assumption. For example, a firm might find a negative correlation between the number of employees (ranging between 1,200 and 1,400) in a quarter and the profitability in that quarter; the fewer the employees, the greater the profit. It would be spectacularly risky to conclude from this fact that cutting the number of employees to 600 would vastly improve profitability. (Do you suppose we could have a negative number of employees?) Sooner or later, the declining number of employees must adversely affect the business so that profitability turns downward. The extrapolation penalty term actually understates the risk of extrapolation. It is based on the assumption of a linear relation, and that assumption gets very shaky for large extrapolations.
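A quick numeric illustration of the penalty term, using hypothetical predictor values (evenly spaced pH readings, not the study data): near $\bar{x}$ the term $(x_{n+1}-\bar{x})^2/S_{xx}$ is negligible, and it grows quadratically as $x_{n+1}$ leaves the observed range.

```python
# Extrapolation penalty (x_new - xbar)^2 / Sxx for hypothetical x data.
x = [3.0 + 0.1 * i for i in range(20)]   # hypothetical x values: 3.0 .. 4.9
n = len(x)
xbar = sum(x) / n                        # 3.95
Sxx = sum((xi - xbar) ** 2 for xi in x)  # 6.65

# Penalty at a point inside the data, just outside it, and far outside it
penalty = {x_new: (x_new - xbar) ** 2 / Sxx for x_new in (4.0, 5.5, 8.0)}
for x_new, p in penalty.items():
    print(x_new, round(p, 3))
```

The jump from roughly 0 (inside the data) to well above 1 (far outside) is exactly the widening that the text warns about, and even that widening assumes the line stays straight out there.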


Regression Analysis: GrowthRet versus SoilpH

The regression equation is
GrowthRet = 47.5 - 7.86 SoilpH

Predictor     Coef  SE Coef      T      P
Constant    47.475    4.428  10.72  0.000
SoilpH      -7.859    1.090  -7.21  0.000

S = 2.72162   R-Sq = 74.3%   R-Sq(adj) = 72.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  385.28  385.28  52.01  0.000
Residual Error  18  133.33    7.41
Total           19  518.61

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
      1  16.038   0.609  (14.759, 17.318)  (10.179, 21.898)

Values of Predictors for New Observations

New Obs  SoilpH
      1    4.00

The confidence and prediction intervals also depend heavily on the assumption of constant variance. In some regression situations, the variability around the line increases as the predicted value increases, violating this assumption. In such a case, the confidence and prediction intervals will be too wide where there is relatively little variability and too narrow where there is relatively large variability. A scatterplot that shows a "fan" shape indicates nonconstant variance. In such a case, the confidence and prediction intervals are not very accurate.
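One informal check for the fan shape (a sketch, not a formal test): compare the spread of the residuals in the lower and upper halves of the fitted values. The data below are hypothetical and deliberately constructed so that the scatter around the line grows with x.

```python
# Informal check for nonconstant variance: residual spread in the lower
# vs. upper half of the fitted values. Hypothetical "fan-shaped" data.
x = list(range(1, 21))
noise = [(-1) ** i for i in range(20)]            # alternating +1 / -1
y = [2.0 + 3.0 * xi + 0.4 * xi * e for xi, e in zip(x, noise)]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

# Residuals ordered by fitted value, then split in half
pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
resid = [r for _, r in pairs]
half = n // 2
spread_low = max(map(abs, resid[:half]))
spread_high = max(map(abs, resid[half:]))
print(round(spread_low, 2), round(spread_high, 2))  # fan: high >> low
```

When the two spreads differ this sharply, the constant-variance intervals of this section should not be trusted; a residual plot would show the same fan visually.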

EXAMPLE 11.9

For the data of Example 11.4, and the following Minitab output from that data, obtain a 95% confidence interval for $E(y_{n+1})$ based on an assumed value for $x_{n+1}$ of 6.5. Compare the width of the interval to one based on an assumed value for $x_{n+1}$ of 4.0.

Regression Analysis: GrowthRet versus SoilpH

The regression equation is
GrowthRet = 47.5 - 7.86 SoilpH

Predictor     Coef  SE Coef      T      P
Constant    47.475    4.428  10.72  0.000
SoilpH      -7.859    1.090  -7.21  0.000

S = 2.72162   R-Sq = 74.3%   R-Sq(adj) = 72.9%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       1  385.28  385.28  52.01  0.000
Residual Error  18  133.33    7.41
Total           19  518.61

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI             95% PI
      1  16.038   0.609  (14.759, 17.318)   (10.179, 21.898)
      2  -3.610   2.765   (-9.418, 2.199)   (-11.761, 4.541)  XX

XX denotes a point that is an extreme outlier in the predictors.

Values of Predictors for New Observations

New Obs  SoilpH
      1    4.00
      2    6.50

Solution  For $x_{n+1} = 4.0$, the first of the two Fit entries shows an estimated value equal to 16.038. The confidence interval is shown as 14.759 to 17.318. For $x_{n+1} = 6.5$, the estimated value is -3.610, with a confidence interval of -9.418 to 2.199. The second interval has a width of 11.617, much larger than the first interval's width of 2.559. The value $x_{n+1} = 6.5$ is far outside the range of the x data; the extrapolation penalty makes the interval very wide compared to the width of intervals for values of $x_{n+1}$ within the range of the observed x data.
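The width comparison in the solution can be checked directly from the interval endpoints in the Minitab output:

```python
# Interval widths from the Minitab output: the CI at x = 6.5 is far wider
# than the CI at x = 4.0 because of the extrapolation penalty.
ci_in = (14.759, 17.318)    # x_{n+1} = 4.0, inside the observed pH range
ci_out = (-9.418, 2.199)    # x_{n+1} = 6.5, a large extrapolation
width_in = round(ci_in[1] - ci_in[0], 3)
width_out = round(ci_out[1] - ci_out[0], 3)
print(width_in, width_out)   # 2.559 11.617
```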

Usually, the more relevant forecasting problem is that of predicting an individual $y_{n+1}$ value rather than $E(y_{n+1})$. In most computer packages, the interval for predicting an individual value is called a prediction interval. The same best guess $\hat{y}_{n+1}$ is used, but the forecasting plus or minus term is larger when predicting $y_{n+1}$ than when estimating $E(y_{n+1})$. In fact, it can be shown that the plus or minus forecasting error using $\hat{y}_{n+1}$ to predict $y_{n+1}$ is as follows.

$$t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{S_{xx}}}$$

In the growth retardation example, the corresponding prediction limits for $y_{n+1}$ when the soil pH is x = 4 are 10.179 to 21.898 (see the output in Example 11.9).

The 95% confidence intervals for $E(y_{n+1})$ and the 95% prediction intervals for $y_{n+1}$ are plotted in Figure 11.14; the inner curves are for $E(y_{n+1})$ and the outer curves are for $y_{n+1}$.

The only difference between estimation of a mean $E(y_{n+1})$ and prediction of an individual $y_{n+1}$ is the extra term 1 in the standard error formula. The presence of this extra term indicates that predictions of individual values are less accurate than estimates of means. The extrapolation penalty term still applies, as does the warning that it understates the risk of extrapolation.
