LINEAR PREDICTIVE CODING (LPC) OF SPEECH

The linear predictive coding (LPC) method for speech analysis and synthesis is based on modeling the vocal tract as a linear all-pole (IIR) filter having the system function

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_p(k)\, z^{-k}} \qquad (12.27)$$

where p is the number of poles, G is the filter gain, and {ap(k)} are the parameters that determine the poles. There are two mutually exclusive excitation functions to model voiced and unvoiced speech sounds. On a short-time basis, voiced speech is periodic with a fundamental frequency F0, or a pitch period 1/F0, which depends on the speaker. Thus voiced speech is generated by exciting the all-pole filter model by a periodic impulse train with a period equal to the desired pitch period. Unvoiced speech sounds are generated by exciting the all-pole filter model by the output of a random-noise generator. This model is shown in Figure 12.12.

FIGURE 12.12 Block diagram model for the generation of a speech signal
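To make the Figure 12.12 model concrete, the following MATLAB sketch drives the all-pole filter of (12.27) with the two excitation types. The filter order, gain, coefficient, and pitch period used here are illustrative assumptions, not values taken from the text.

    Fs = 8000;                        % sampling rate (Hz)
    N  = 160;                         % one 20-ms frame
    G  = 0.1;                         % filter gain (assumed)
    A  = [1 -0.9];                    % denominator [1 ap(1) ... ap(p)] with p = 1 (assumed)
    Np = 64;                          % pitch period in samples, i.e., F0 = Fs/Np = 125 Hz

    xv = zeros(N,1);  xv(1:Np:N) = 1; % voiced: periodic impulse train
    xu = randn(N,1);                  % unvoiced: white-noise excitation

    sv = filter(G, A, xv);            % voiced frame from H(z) = G/(1 + sum ap(k) z^-k)
    su = filter(G, A, xu);            % unvoiced frame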

Given a short-time segment of a speech signal, usually about 20 ms or 160 samples at an 8 kHz sampling rate, the speech encoder at the transmitter must determine the proper excitation function, the pitch period for voiced speech, the gain parameter G, and the coefficients ap(k).

A block diagram that illustrates the speech encoding system is given in Figure 12.13. The parameters of the model are determined adaptively from the data and encoded into a binary sequence and transmitted to the receiver. At the receiver the speech signal is synthesized from the model and the excitation signal.

The parameters of the all-pole filter model are easily determined from the speech samples by means of linear prediction. To be specific, the output of the FIR linear prediction filter is

$$\hat{s}(n) = -\sum_{k=1}^{p} a_p(k)\, s(n-k) \qquad (12.28)$$

FIGURE 12.13 Encoder and decoder for LPC

and the corresponding error between the observed sample s(n) and the predicted value ŝ(n) is

$$e(n) = s(n) + \sum_{k=1}^{p} a_p(k)\, s(n-k) \qquad (12.29)$$

By minimizing the sum of squared errors, that is,

$$E = \sum_{n=0}^{N} e^2(n) = \sum_{n=0}^{N} \left[ s(n) + \sum_{k=1}^{p} a_p(k)\, s(n-k) \right]^2 \qquad (12.30)$$

we can determine the pole parameters {ap(k)} of the model. Differentiating E with respect to each of the parameters and equating the result to zero yields a set of p linear equations

$$\sum_{k=1}^{p} a_p(k)\, r_{ss}(m-k) = -r_{ss}(m), \qquad m = 1, 2, \ldots, p \qquad (12.31)$$

where rss(m) is the autocorrelation of the sequence s(n), defined as

$$r_{ss}(m) = \sum_{n=0}^{N} s(n)\, s(n+m) \qquad (12.32)$$

The linear equations (12.31) can be expressed in matrix form as

$$R_{ss}\, a = -r_{ss} \qquad (12.33)$$

where Rss is a p × p autocorrelation matrix, rss is a p × 1 autocorrelation vector, and a is a p × 1 vector of model parameters. Hence

$$a = -R_{ss}^{-1}\, r_{ss} \qquad (12.34)$$

These equations can also be solved recursively and most efficiently, without resorting to matrix inversion, by using the Levinson-Durbin algorithm [19]. However, in MATLAB it is convenient to use matrix inversion.
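As a minimal MATLAB sketch of (12.31)-(12.34) for one frame, assume s is a column vector containing the frame samples and p is the predictor order; the autocorrelation loop below simply stops at the frame boundary, and the levinson call (Signal Processing Toolbox) is the recursive alternative mentioned above.

    p = 10;                                % predictor order
    N = length(s);                         % frame length in samples
    rss = zeros(p+1,1);                    % rss(m) stored in rss(m+1), m = 0..p
    for m = 0:p
        rss(m+1) = sum(s(1:N-m) .* s(1+m:N));
    end
    Rss = toeplitz(rss(1:p));              % p-by-p autocorrelation matrix of (12.33)
    a   = -Rss \ rss(2:p+1);               % the {ap(k)} of (12.34)

    % Equivalent recursive solution, without forming or inverting Rss:
    % [A, Ep] = levinson(rss, p) returns A = [1 ap(1) ... ap(p)] and the
    % residual energy Ep.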

The all-pole filter parameters {ap(k)} can be converted to the all-pole lattice parameters {Ki} (called the reflection coefficients) using the MATLAB function dir2latc developed in Chapter 6.
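For example (dir2latc is the book's Chapter 6 function; poly2rc, shown here, is the comparable Signal Processing Toolbox routine and is an assumption about the toolboxes available):

    A = [1; a];          % prediction polynomial [1 ap(1) ... ap(p)]
    K = poly2rc(A);      % reflection coefficients {Ki}; |Ki| < 1 for a stable model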

The gain parameter of the filter can be obtained by noting that its input-output equation is

$$s(n) = -\sum_{k=1}^{p} a_p(k)\, s(n-k) + G\, x(n) \qquad (12.35)$$

where x(n) is the input sequence. Clearly,

$$G\, x(n) = s(n) + \sum_{k=1}^{p} a_p(k)\, s(n-k) = e(n)$$

Then

$$G^2 \sum_{n=0}^{N-1} x^2(n) = \sum_{n=0}^{N-1} e^2(n) \qquad (12.36)$$

If the input excitation is normalized to unit energy by design, then

$$G^2 = \sum_{n=0}^{N-1} e^2(n) = r_{ss}(0) + \sum_{k=1}^{p} a_p(k)\, r_{ss}(k) \qquad (12.37)$$

Thus G^2 is set equal to the residual energy resulting from the least-squares optimization.
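In MATLAB, with rss and a as computed in the earlier sketch (names assumed), (12.37) is a one-liner:

    G2 = rss(1) + a.' * rss(2:p+1);    % rss(1) holds rss(0); G2 is the residual energy
    G  = sqrt(G2);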

Once the LPC coefficients are computed, we can determine whether the input speech frame is voiced, and if so, what the pitch is. This is accomplished by computing the sequence

$$r_e(n) = \sum_{k=1}^{p} r_a(k)\, r_{ss}(n-k) \qquad (12.38)$$

where ra(k) is defined as

$$r_a(k) = \sum_{i=1}^{p} a_p(i)\, a_p(i+k) \qquad (12.39)$$

which is the autocorrelation sequence of the prediction coefficients.

The pitch is detected by finding the peak of the normalized sequence re(n)/re(0) in the time interval that corresponds to 3 to 15 ms in the 20-ms sampling frame. If the value of this peak is at least 0.25, the frame of speech is considered voiced with a pitch period equal to the value of n = Np, where re(Np)/re(0) is a maximum. If the peak value is less than 0.25, the frame of speech is considered unvoiced and the pitch is zero.
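Taking (12.38)-(12.39) literally, a MATLAB sketch of this voicing and pitch test might look as follows; it reuses a and p from the earlier sketches and assumes rss_full(m+1) = rss(m) has been computed for all lags m = 0, ..., N-1 of the frame.

    ra = zeros(p,1);                        % ra(k) of (12.39), k = 1..p
    for k = 1:p
        ra(k) = sum(a(1:p-k) .* a(1+k:p));  % terms with i+k > p are zero
    end
    nlo  = round(0.003*Fs);                 % 3 ms: lower edge of the pitch range
    nmax = round(0.015*Fs);                 % 15 ms: upper edge of the pitch range
    re = zeros(nmax+1,1);                   % re(n) of (12.38), n = 0..nmax
    for n = 0:nmax
        re(n+1) = sum(ra .* rss_full(abs(n - (1:p).') + 1));  % rss is even in its lag
    end
    [pk, imax] = max(re(nlo+1:nmax+1) / re(1));   % peak of re(n)/re(0) over 3-15 ms
    if pk >= 0.25
        Np = nlo + imax - 1;                % voiced: pitch period in samples
    else
        Np = 0;                             % unvoiced
    end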

The values of the LPC coefficients, the pitch period, and the type of excitation are transmitted to the receiver, where the decoder synthesizes the speech signal by passing the proper excitation through the all-pole filter model of the vocal tract. Typically, the pitch period requires 6 bits, and the gain parameter may be represented by 5 bits after its dynamic range is compressed logarithmically. If the prediction coefficients were to be coded, they would require between 8 and 10 bits per coefficient for accurate representation. The reason for such high accuracy is that relatively small changes in the prediction coefficients result in a large change in the pole positions of the filter model. The accuracy requirements are lessened by transmitting the reflection coefficients {Ki}, which have a smaller dynamic range; that is, |Ki| < 1. These are adequately represented by 6 bits per coefficient. Thus for a 10th-order predictor the total number of bits assigned to the model parameters per frame is 72. If the model parameters are changed every 20 ms, the resulting bit rate is 3,600 bps.
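As a check on these figures (one plausible accounting, since the text does not itemize the 72 bits): ten reflection coefficients at 6 bits each give 60 bits, the pitch period adds 6 bits, the gain adds 5 bits, and one bit can carry the voiced/unvoiced decision, for 60 + 6 + 5 + 1 = 72 bits per frame; at one frame every 20 ms (50 frames per second), 72 × 50 = 3,600 bps.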

Since the reflection coefficients are usually transmitted to the receiver, the synthesis filter at the receiver is implemented as an all-pole lattice filter, described in Chapter 6.
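A per-frame synthesis sketch for the decoder follows; the lattice structure itself is developed in Chapter 6, so for brevity this sketch uses the equivalent direct-form filter call, with A, G, Np, N, and Fs as in the earlier sketches.

    if Np > 0
        x = zeros(N,1);  x(1:Np:N) = 1;   % voiced: impulse train at the pitch period
    else
        x = randn(N,1);                   % unvoiced: white noise
    end
    x    = x / sqrt(sum(x.^2));           % normalize the excitation to unit energy
    shat = filter(G, A, x);               % synthesized frame; the all-pole lattice
                                          % driven by {Ki} produces the same output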

12.5.1 PROJECT 12.5: LPC

The objective of this project is to analyze a speech signal through an LPC coder and then to synthesize it through the corresponding LPC decoder. Use several .wav sound files (sampled at an 8000 samples/sec rate), which are available in MATLAB for this purpose. Divide the speech signals into short-time segments (with lengths between 120 and 150 samples) and process each segment to determine the proper excitation function (voiced or unvoiced), the pitch period for voiced speech, the coefficients {ap(k)} (p ≤ 10), and the gain G. The decoder that performs the synthesis is an all-pole lattice filter whose parameters are the reflection coefficients that can be determined from {ap(k)}. The output of this project is a synthetic speech signal that can be compared with the original speech signal.

The distortion effects due to LPC analysis/synthesis may be assessed qualitatively.
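One possible frame-by-frame skeleton for the project is sketched below; the file name, the frame length, and the helper steps in the comments are placeholders rather than requirements of the text.

    [s, Fs] = audioread('speech.wav');     % any 8000 samples/sec mono recording
    L  = 144;                              % frame length in the 120-150 sample range
    nf = floor(length(s)/L);
    shat = zeros(nf*L, 1);
    for m = 1:nf
        frame = s((m-1)*L+1 : m*L);
        % 1. LPC analysis of the frame: rss, {ap(k)}, and gain G (earlier sketches)
        % 2. Convert {ap(k)} to reflection coefficients {Ki}
        % 3. Voicing/pitch decision: Np
        % 4. Pass the chosen excitation through the all-pole (lattice) filter and
        %    store the result in shat((m-1)*L+1 : m*L)
    end
    soundsc(shat, Fs);                     % compare audibly with soundsc(s, Fs)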
