LOSS CONCEALMENT FOR WAVEFORM SPEECH CODECS

Part of the book Multimedia Over IP and Wireless Networks (pages 81-84)

When digital systems started replacing analog equipment a few decades ago, processing power was scarce and expensive, and coding techniques were still primitive. For those reasons, most early digital systems used a very simple coding scheme: PCM (Pulse Code Modulation). In this digital representation of speech, there is not really any coding in the compression sense. The signal is simply sampled and quantized. More specifically, the speech signal is typically sampled at 8 kHz, and each sample is encoded with 8-bit precision, using one of two quantization schemes, usually referred to as A-law and μ-law. This gives a total rate of 64 kbps. The PCM system used in telephony has been standardized by the ITU (International Telecommunication Union) in the standard G.711 [1]. For Voice over Internet Protocol (VoIP) or other packet network applications, the speech samples are grouped into frames (typically 10 ms in duration) and sent as packets across the network, one frame per packet. Note that a frame corresponds to a data unit in the terminology of Chapter 2. Also note that, since there is no real coding, there is no dependence across packets: packets can be received and decoded independently.

When G.711 was first adopted, the main motivation was quality: a digital signal was not subject to degradation. At the same time, a 64-kbps digital channel had a significant cost, and there was a strong push toward increased compression. With the evolution of speech compression technology and increased processing power, more complex speech codecs were also standardized (e.g., [3-6]), providing better compression. Curiously, today, in many applications bandwidth is no longer a significant constraint, and we are starting to see basic PCM-coded speech increasing in usage again. Furthermore, many error concealment techniques operate in the time domain, and therefore are best understood as applying to PCM-coded speech. For this reason, in this section we review the basic concept of packet loss as applied to speech and look at some common techniques for concealing loss in PCM-coded speech.

We assume speech samples are PCM coded and grouped into 10-ms frames before transmission. Since we assume packets are either received error free or not received at all, any loss incurred in the transmission process implies a missing segment of 10 ms (or a multiple thereof). Figure 3.1 shows a segment of a speech signal, typical of a voiced phoneme. Figure 3.1(a) shows the original signal, whereas 3.1(b) shows the same signal with 20 ms (i.e., two packets) missing. As can be inferred from the figure, a good concealment algorithm would try to replace the missing segment by extending the prior signal with new periods of similar waveforms. This can be done with different levels of complexity, yielding correspondingly different levels of artifacts. We will now investigate a simple concealment technique, described in Appendix I of Recommendation G.711 [2]. The results of applying that algorithm are illustrated in Figure 3.1(c).

FIGURE 3.1: (a) A typical speech signal. (b) Original signal with two missing frames. (c) Concealed loss using Appendix I of G.711.

3.2.1 A Simple Algorithm for Loss Concealment: G.711 Appendix I

The first modification needed in the G.711 decoder to allow for error concealment is to introduce a 30-sample delay. This delay is used to smooth the transition between the end of the original (received) segment and the start of the synthesized segment. The second modification is to maintain a circular buffer containing the last 390 samples (48.75 ms). The signal in this buffer is used to select a segment for replacing the lost frame(s).
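To make this buffering concrete, here is a minimal Python sketch of the decoder-side state just described. The class and method names are our own invention, not part of the Recommendation, and real decoders would operate on fixed-point PCM samples rather than Python lists:

```python
from collections import deque

HISTORY_LEN = 390   # 48.75 ms of signal history at 8 kHz
DELAY = 30          # output delay, in samples, used to smooth transitions
FRAME_LEN = 80      # one 10-ms frame at 8 kHz

class ConcealmentBuffer:
    """Decoder-side state for concealment: history buffer plus output delay."""

    def __init__(self):
        self.history = deque(maxlen=HISTORY_LEN)  # last 390 decoded samples
        self.delay_line = deque()                 # samples not yet played out

    def push_frame(self, frame):
        """Accept a decoded 10-ms frame and return samples ready for playout.

        Playout lags the decoder by DELAY samples, so when a loss is
        detected the last DELAY samples are still available to be
        cross-faded with the start of the synthesized segment.
        """
        self.history.extend(frame)
        self.delay_line.extend(frame)
        n_ready = len(self.delay_line) - DELAY
        return [self.delay_line.popleft() for _ in range(n_ready)]

buf = ConcealmentBuffer()
first = buf.push_frame([0.0] * FRAME_LEN)
print(len(first), len(buf.delay_line))  # → 50 30
```

Note that in steady state each new frame releases a full 80 samples for playout; only the very first frame is shortened by the 30-sample delay.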

When a loss is detected, the concealment algorithm starts by estimating the pitch period of the speech. This is done by finding the peak of the normalized cross-correlation between the most recent 20 ms of signal and the signal stored in the buffer. The peak is searched over lags of 40 to 120 samples, corresponding to pitch frequencies of 200 Hz down to 66 Hz.
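The pitch search can be sketched as follows. This is a simplified illustration of a normalized cross-correlation over the lag range above, not the exact procedure of Appendix I; the function name and window length are our own choices:

```python
import math

SAMPLE_RATE = 8000           # Hz, per G.711
MIN_LAG, MAX_LAG = 40, 120   # lag search range: 200 Hz down to ~66 Hz

def estimate_pitch(history, window=160):
    """Estimate the pitch period (in samples) of the signal in `history`.

    Compares the most recent `window` samples (20 ms at 8 kHz) against
    lag-shifted earlier segments of the buffer using a normalized
    cross-correlation, and returns the lag with the highest peak.
    """
    recent = history[-window:]
    best_lag, best_score = MIN_LAG, -1.0
    for lag in range(MIN_LAG, MAX_LAG + 1):
        shifted = history[-window - lag:-lag]
        num = sum(a * b for a, b in zip(recent, shifted))
        den = math.sqrt(sum(a * a for a in recent) *
                        sum(b * b for b in shifted))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A synthetic 100-Hz voiced-like signal: the true period is 80 samples.
sig = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(390)]
print(estimate_pitch(sig))  # → 80
```

The buffer must hold at least `window + MAX_LAG` samples, which is why the 390-sample (48.75-ms) history is maintained.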

After the pitch period has been estimated, a segment corresponding to 1.25 periods is taken from the buffer and used to conceal the missing segment. More specifically, the selected segment is overlap-added with the existing signal, with the overlap spanning 0.25 of the pitch period. Note that this overlap starts in the last few samples of the good frame (which is the reason the 30-sample delay was inserted in the signal). The process is repeated until enough samples to fill the gap are produced. The transition between the synthesized signal and the first good frame is also smoothed by using an overlap-add with the first several samples of the received frame.
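This replication step can be sketched as below. The sketch ignores the handling of longer losses and the attenuation described in the Recommendation, and the function name and the linear fade shape are our own simplifications:

```python
import math

def conceal(history, pitch, gap_len):
    """Synthesize `gap_len` samples by repeating the last pitch period.

    A 1.25-period segment is taken from the end of `history`; its first
    quarter period is cross-faded (overlap-added) with the last received
    samples, and one full pitch period is then repeated until the gap is
    filled. A linear fade is used here for simplicity.
    """
    overlap = pitch // 4
    segment = history[-(pitch + overlap):]   # 1.25 pitch periods
    tail = history[-overlap:]                # last received samples
    faded = [t * (1 - i / overlap) + s * (i / overlap)
             for i, (t, s) in enumerate(zip(tail, segment[:overlap]))]
    period = segment[overlap:]               # one full pitch period
    out = list(faded)
    while len(out) < gap_len + overlap:
        out.extend(period)
    return out[overlap:overlap + gap_len]    # samples that replace the gap

# For a perfectly periodic input, the synthesized signal continues it
# seamlessly: a 100-Hz sine at 8 kHz has a period of exactly 80 samples.
sig = [math.sin(math.pi * n / 40) for n in range(390)]
filled = conceal(sig, 80, 160)               # conceal two lost 10-ms frames
```

For real speech the extension is only approximately periodic, which is what makes the overlap-adds at both boundaries necessary.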

Special treatment is given to a number of situations. For example, if two or more consecutive frames are missing, the method uses a segment several pitch periods long as the replication unit, instead of repeating the same pitch period several times. Also, after the first 10 ms, the signal is progressively attenuated, such that after 60 ms the synthesized signal is zero. This can be seen in Figure 3.1(c), where the amplitude of the synthesized signal starts to decrease slightly after 160 samples, even though the synthesized signal is still based on the same (preceding) data segment. Also, note that since the period of the missing segment is not identical to that of the synthesized segment, the transition to the next good frame may present a very atypical pitch period, which can be observed in Figure 3.1(c) around sample 1000.
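The fade-out can be expressed as a simple gain curve applied to the synthesized samples. The linear ramp below matches the description above (full amplitude for the first 10 ms, silence from 60 ms on), though the exact shape used by Appendix I should be checked in the Recommendation itself:

```python
SAMPLE_RATE = 8000

def attenuation(n):
    """Gain applied to the n-th synthesized sample of a concealment.

    Full amplitude for the first 10 ms of loss, then a linear ramp down
    to complete silence at 60 ms (a sketch of the fade-out described in
    the text).
    """
    start = 10 * SAMPLE_RATE // 1000   # 80 samples: attenuation begins
    end = 60 * SAMPLE_RATE // 1000     # 480 samples: fully muted
    if n < start:
        return 1.0
    if n >= end:
        return 0.0
    return 1.0 - (n - start) / (end - start)

print(attenuation(0), attenuation(280), attenuation(480))  # → 1.0 0.5 0.0
```

Muting long concealments is important: repeating the same pitch period for much more than 60 ms produces an artificial "buzzy" sound that is more annoying than silence.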

The reader is directed to the ITU Recommendation [2] for more details of the algorithm. Results of the subjective tests performed with the algorithm, as well as some considerations about bandwidth expansion, can be found in [7]. Alternatively, the reader may refer to Chapter 16, which gives details of a related timescale modification procedure. For our purposes, it suffices to understand that the algorithm works by replicating pitch periods. Other important elements are the gradual muting when the loss is too long and the overlap-add used to smooth transitions. These elements will be present in most other concealment algorithms.

By the nature of the algorithm, it can be easily understood why it works well for single losses in the middle of voiced phonemes. As expected, the level of artifacts is higher for unvoiced phonemes and transitions. More elaborate concealment techniques will address each of these issues more carefully, further reducing the level of artifacts, at the cost of complexity. One possibility is to use an LPC filter and do the concealment in the "residual domain" [8,9]. Note that this is unrelated to the concealment of CELP codecs (which we will investigate in the next section). Here we simply use LPC to improve the extrapolation of the signal; the coefficients are actually computed at the decoder. In CELP codecs, we have to handle the problem of lost LPC coefficients.
