EMBEDDED SUBBIT PLANE ENTROPY CODING

The section of the quantized coefficients in a time slot is encoded by an embedded subbit plane entropy coder, which is one of the most complicated components in EAC. We will explain in detail the working of the subbit plane entropy coder in the following. First, we review the human auditory system in Section 6.6.1. Then we explain the implicit auditory masking approach in Section 6.6.2. We discuss the embedded coding unit (ECU) and the subbit plane entropy coder in Sections 6.6.3 and 6.6.4, respectively. We describe the arithmetic entropy coding unit in Section 6.6.5.

6.6.1 Human Auditory Masking

A detailed description of the human auditory system is beyond the scope of this chapter. The interested reader may refer to [7]. However, it is worth noting that the characteristic of the human auditory system that most affects audio compression is auditory masking.

The human auditory system can be roughly divided into 26 critical bands, each of which acts as a bandpass filter with a bandwidth on the order of 50 to 100 Hz for bands below 500 Hz and up to 5000 Hz for bands at high frequencies. Within each critical band, there is an auditory masking threshold, also referred to as the psychoacoustic masking threshold or the threshold of the just noticeable distortion (JND) [2]. Audio waveforms with an energy level below the JND threshold will not be audible. The auditory JND threshold is highly correlated to the spectral envelope of the signal. This is in contrast to the JND threshold in the human visual system, where the masking of a weak visual signal by a nearby strong signal occurs only over a very short range, and the dominant visual sensitivity is the same for a certain frequency regardless of the input signal.

Let the auditory JND threshold of critical band $k$ at time $i$ be $\mathrm{TH}_{i,k}$. The JND threshold can be calculated as the maximum of a quiet threshold and a masking threshold. The quiet threshold $\mathrm{TH\_ST}_k$ dictates the sensitivity of the auditory system for critical band $k$ without the presence of any strong audio signal. It can be calculated through an equal loudness curve, such as the Fletcher–Munson curve [7] shown as the solid line in Figure 6.3. According to the quiet threshold, the sensitivity of the ear is nearly constant over a large range (1–8 kHz) and drops dramatically below 500 Hz and above 10 kHz.

FIGURE 6.3: Auditory masking threshold: simultaneous masking.

Nevertheless, in audio compression, the auditory JND threshold is largely shaped by masking, which is an effect by which a low-level signal (the maskee) can be made inaudible by a simultaneously occurring strong signal (the masker) as long as the masker and the maskee are close enough to each other in time and frequency. The auditory masking threshold consists of three components: the simultaneous intra-band mask, the simultaneous inter-band mask, and the temporal mask.

The most basic form of auditory masking is simultaneous intra-band masking, where the maskee and the masker are at the same time instant and within the same critical band. The intra-band masking threshold $\mathrm{TH\_INTRA}_{i,k}$ is directly proportional to the average spectral energy $\mathrm{AVE}_{i,k}$ of the masker in critical band $k$ at the same time instant $i$, and can be expressed in dB as

$$\mathrm{TH\_INTRA}_{i,k}\,(\mathrm{dB}) = \mathrm{AVE}_{i,k}\,(\mathrm{dB}) - R_{\mathrm{fac}}, \qquad (6.12)$$

where $R_{\mathrm{fac}}$ is a constant offset value determined through psychoacoustic experimentation. The second form of masking is simultaneous inter-band masking, where the maskee and the masker are at the same time instant, but in neighboring critical bands. The level of such inter-band masking $\mathrm{TH\_INTER}_{i,k}$ can be formulated as

$$\mathrm{TH\_INTER}_{i,k} = \max\left(\mathrm{TH}_{i,k-1} - R_{\mathrm{high}},\; \mathrm{TH}_{i,k+1} - R_{\mathrm{low}}\right), \qquad (6.13)$$

where $R_{\mathrm{high}}$ and $R_{\mathrm{low}}$ are the attenuation factors toward the high- and low-frequency critical bands, respectively. The higher frequency coefficients are more easily masked; thus the attenuation $R_{\mathrm{high}}$ is smaller than $R_{\mathrm{low}}$. Combining quiet, intra- and inter-band auditory masking, the auditory masking threshold created by a strong audio signal identified as the “masker” is illustrated in Figure 6.3, where the auditory JND threshold is shown as the dashed line. Any signal below the JND threshold, for example, compression distortion, will not be audible by human ears.

The third form of masking is temporal masking, which dictates that a strong signal can also mask a weak signal in the same critical band, but in the immediately preceding or following time interval. The duration within which premasking applies is less than one-tenth that of postmasking, which is on the order of 50 to 200 ms. The temporal masking threshold $\mathrm{TH\_TIME}_{i,k}$ can be expressed as

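$$\mathrm{TH\_TIME}_{i,k} = \max\left(\mathrm{TH}_{i-1,k} - R_{\mathrm{post}},\; \mathrm{TH}_{i+1,k} - R_{\mathrm{pre}}\right), \qquad (6.14)$$

where $R_{\mathrm{pre}}$ and $R_{\mathrm{post}}$ are the attenuation factors for the preceding and following time intervals, respectively. A sample temporal masking generated by a masker is shown in Figure 6.4.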

FIGURE 6.4: Auditory masking threshold: temporal masking.

The combined auditory JND threshold is the maximum of the quiet threshold, the simultaneous intra- and inter-band masking thresholds, and the temporal masking threshold,

$$\mathrm{TH}_{i,k} = \max\left(\mathrm{TH\_ST}_k,\; \mathrm{TH\_INTRA}_{i,k},\; \mathrm{TH\_INTER}_{i,k},\; \mathrm{TH\_TIME}_{i,k}\right). \qquad (6.15)$$

Calculation of the JND threshold requires the iteration of (6.13)–(6.15). Thus, if the input audio consists of several strong maskers, the combined JND threshold will be the maximum of the masking thresholds generated by the individual maskers.
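The iteration of (6.13)–(6.15) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EAC implementation: the function name `jnd_thresholds`, the list-of-lists layout, the in-place relaxation until nothing changes, and the treatment of missing neighbors at the edges (they are simply ignored) are all choices made here for the example.

```python
def jnd_thresholds(ave_db, quiet_db, r_fac, r_high, r_low, r_post, r_pre,
                   max_iter=100):
    """Iterate (6.13)-(6.15) until the thresholds stop changing.

    ave_db[i][k] -- average spectral energy (dB) at time i, critical band k
    quiet_db[k]  -- quiet threshold (dB) of critical band k
    All attenuation factors are in dB and assumed to be positive.
    """
    n_time, n_band = len(ave_db), len(quiet_db)
    # Start from the quiet threshold and the intra-band mask of (6.12).
    th = [[max(quiet_db[k], ave_db[i][k] - r_fac) for k in range(n_band)]
          for i in range(n_time)]
    for _ in range(max_iter):
        changed = False
        for i in range(n_time):
            for k in range(n_band):
                candidates = [th[i][k]]
                if k > 0:               # masker in the lower band, (6.13)
                    candidates.append(th[i][k - 1] - r_high)
                if k < n_band - 1:      # masker in the higher band, (6.13)
                    candidates.append(th[i][k + 1] - r_low)
                if i > 0:               # postmasking, (6.14)
                    candidates.append(th[i - 1][k] - r_post)
                if i < n_time - 1:      # premasking, (6.14)
                    candidates.append(th[i + 1][k] - r_pre)
                new = max(candidates)   # combination, (6.15)
                if new > th[i][k]:
                    th[i][k] = new
                    changed = True
        if not changed:
            break
    return th
```

With made-up values (one time slot, a 60 dB masker in the middle of three bands, a 10 dB quiet threshold, and attenuations `r_fac=6`, `r_high=10`, `r_low=20`), the call `jnd_thresholds([[0, 60, 0]], [10, 10, 10], 6, 10, 20, 15, 40)` returns `[[34, 54, 44]]`: the band above the masker receives a higher threshold than the band below it, as expected when $R_{\mathrm{high}}$ is smaller than $R_{\mathrm{low}}$.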

6.6.2 Implicit Auditory Masking

Using the auditory masking effect, an audio coder can devote fewer bits to the coefficients that are less sensitive to the human ear and more bits to the auditorily sensitive coefficients, thus improving the quality of the coded audio. In EAC, the auditory masking module is integrated with the embedded entropy coding module. This integration has two distinctive features. First, the auditory JND threshold is derived from the partially coded coefficients and does not need to be transmitted. Second, the auditory JND threshold is used to determine the order in which the bits of the coefficients are encoded, rather than to change the coefficients (by adopting a different quantizing step size for different critical bands). We call the approach implicit auditory masking because the auditory JND threshold is implicitly derived during the coding process.

To illustrate this distinctiveness, we show the process of encoding using traditional auditory masking in Figure 6.5 and that of the implicit auditory masking in Figure 6.6.

FIGURE 6.5: Encoding using traditional auditory masking.

FIGURE 6.6: Encoding using implicit auditory masking.

In traditional auditory masking, the encoder calculates the JND threshold based on the spectral envelope of the input audio waveform. The JND threshold is then encoded as a part of the compressed bit stream and is transmitted to the decoder. The encoder also quantizes the transform coefficients with a step size proportional to the JND threshold, that is, the coefficients are quantized coarsely in the critical bands with a larger JND threshold and are quantized finely in those with a smaller JND threshold. The approach is simple and suits a nonscalable coder. In scalable audio coding, it is not efficient. First, sending the auditory JND threshold consumes a nontrivial number of bits, which can be as much as 10% of the total number of coded bits. Since the auditory masking module is applied before the entropy coding module, the JND threshold must be transmitted with the same precision regardless of the compression ratio. The JND threshold overhead thus eats significantly into the bit budget, especially if the compressed bit stream is later scaled to a low bit rate. Second, as shown in Section 6.6.1, the JND threshold is shaped by the energy distribution of the input audio, while the same energy distribution is revealed through the bit plane coding process of the embedded entropy coder. As a result, the information is coded twice, which wastes precious coding bits.

The framework of implicit auditory masking is shown in Figure 6.6. Compared to Figure 6.5, the auditory masking operation is now integrated into the loop of the entropy coding module and is performed as follows. We first set the initial auditory JND threshold to the quiet threshold. A portion of the transform coefficients, for example, the top bit planes, is then encoded. Afterward, an updated auditory JND threshold is calculated based on the spectral envelope of the partially coded transform coefficients. Since the decoder may derive the same auditory JND threshold from the same coded coefficients, the values of the auditory JND threshold need not be sent to the decoder. Using this implicitly calculated JND threshold, both the encoder and the decoder figure out which portion of the transform coefficients is to be encoded next. After the next portion of the coefficients has been encoded, the auditory JND threshold is updated again, which is then used to guide the coding order of the remaining portion of the coefficients. The process iterates among the operation of sending a portion of the quantized MDCT coefficients, updating the JND threshold, and using the updated JND threshold to determine the portions to be sent next. It only stops when a certain end criterion has been met, for example, the quantized coefficients have been encoded to the least significant bit plane (LSB), a desired coding bit rate has been reached, or a desired coding quality has been reached. By deriving the auditory masking threshold implicitly from the partially coded coefficients, bits normally required for the auditory JND threshold are saved. The saving can be especially significant at a low bit rate or when the coding bit stream is later truncated to a lower bit rate. Implicit auditory masking may thus significantly improve compression efficiency.

Moreover, in all existing audio coders, the auditory JND threshold is carried as a header in the bit stream. In contrast, implicit auditory masking does not have an error-sensitive header. The EAC-compressed bit stream is thus less susceptible to transmission errors and therefore more robust in a noisy channel, such as in a wireless environment. A third advantage of implicit auditory masking results from the fact that instead of coding the auditorily insensitive coefficients coarsely, the EAC encodes them at a later stage. By using auditory masking to govern the coding order, rather than to quantize the coefficients, the quality of the compressed audio becomes less sensitive to the accuracy of the JND threshold, as slight deviations in the threshold simply cause certain audio coefficients to be coded later.
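The loop of sending a portion of the coefficients, updating the threshold, and choosing the next portion can be sketched as a toy encoder and decoder that never exchange the threshold or the coding order. Everything specific in this sketch is an assumption made for illustration and is not taken from EAC: the threshold rule in `threshold` (half the log2 of the partially decoded energy minus an offset), the priority rule in `next_band`, coding a whole bit plane of one band at a time, and omitting signs and entropy coding. Bands are assumed to be non-empty lists of nonnegative integer magnitudes.

```python
import math

def threshold(partial, quiet, offset=2.0):
    """Hypothetical JND rule in bit plane units, from partially coded magnitudes."""
    energy = sum(m * m for m in partial) / len(partial)
    level = 0.5 * math.log2(energy) if energy > 0 else float("-inf")
    return max(quiet, level - offset)

def next_band(partials, progress, quiets):
    """Return the unfinished band with the largest (progress - threshold)."""
    best, best_priority = None, None
    for k, plane in enumerate(progress):
        if plane < 0:                 # every bit plane of this band is coded
            continue
        priority = plane - threshold(partials[k], quiets[k])
        if best is None or priority > best_priority:
            best, best_priority = k, priority
    return best

def encode(bands, n_planes, quiets):
    """Emit magnitude bits only; no side information about order is written."""
    partials = [[0] * len(band) for band in bands]
    progress = [n_planes - 1] * len(bands)
    stream = []
    while True:
        k = next_band(partials, progress, quiets)
        if k is None:
            return stream
        plane = progress[k]
        for j, m in enumerate(bands[k]):
            bit = (m >> plane) & 1
            stream.append(bit)
            partials[k][j] |= bit << plane   # exactly what the decoder will know
        progress[k] -= 1

def decode(stream, band_sizes, n_planes, quiets):
    """Mirror the encoder; stops cleanly when the stream is truncated."""
    partials = [[0] * n for n in band_sizes]
    progress = [n_planes - 1] * len(band_sizes)
    pos = 0
    while True:
        k = next_band(partials, progress, quiets)
        if k is None or pos + band_sizes[k] > len(stream):
            return partials
        plane = progress[k]
        for j in range(band_sizes[k]):
            partials[k][j] |= stream[pos + j] << plane
        pos += band_sizes[k]
        progress[k] -= 1
```

Because `next_band` only looks at values that both sides have already reconstructed, the decoder replays the same sequence of decisions as the encoder. Truncating `stream` at any point still decodes to a coarser version of the coefficients, which is the property the scalable coder relies on.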

6.6.3 Embedded Coding Unit

The section of quantized coefficients in a time slot is ultimately encoded by a subbit plane entropy coder. It encodes the audio coefficients bit by bit, in a rate-distortion optimized order.

The subbit plane entropy coder of EAC is a general version of the simple bit plane coder, which works as follows. Let $i$ index the time interval, $j$ index the frequency component, and $k$ index the critical band. Let $x_{i,j}$ be a coefficient at time interval $i$ and frequency $j$, and let $s_{i,k}$ be critical band $k$ at time interval $i$. Let each audio coefficient be represented in binary sign and magnitude form as

$$[\pm\, b_{L-1}\, b_{L-2} \cdots b_0], \qquad (6.16)$$

where $b_{L-1}$ is the most significant bit (MSB), $b_0$ is the least significant bit (LSB), and $\pm$ is the sign of the coefficient. A group of bits of the same significance from different coefficients forms a bit plane. For example, the bits $b_{L-1}$ of all coefficients form the most significant bit plane, bit plane $L-1$. The bit plane coder encodes the coefficients bit plane by bit plane: first the most significant bit plane, then the second most significant bit plane, and so on. This way, if the output compressed bit stream is truncated, at least part of each coefficient can be decoded.
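The sign and magnitude representation of (6.16) and the effect of truncating the bit plane sequence can be illustrated with a short Python sketch. The function names `to_bit_planes` and `from_bit_planes` are hypothetical, the coefficients are assumed to be a non-empty list of integers, and the sign convention (0 for plus, 1 for minus) is a choice made for the example.

```python
def to_bit_planes(coeffs):
    """Split integer coefficients into sign bits and magnitude bit planes.

    Returns (signs, planes), where planes[0] is the most significant
    bit plane (L-1) and planes[-1] is the least significant one (0).
    """
    mags = [abs(c) for c in coeffs]
    n_planes = max(max(mags).bit_length(), 1)    # L in (6.16)
    signs = [0 if c >= 0 else 1 for c in coeffs]
    planes = [[(m >> p) & 1 for m in mags]
              for p in range(n_planes - 1, -1, -1)]
    return signs, planes

def from_bit_planes(signs, planes, n_planes):
    """Rebuild coefficients from the first len(planes) bit planes.

    Bit planes that were not received are treated as zero.
    """
    mags = [0] * len(signs)
    for idx, plane in enumerate(planes):
        p = n_planes - 1 - idx                   # significance of this plane
        for j, bit in enumerate(plane):
            mags[j] |= bit << p
    return [-m if s else m for m, s in zip(mags, signs)]
```

For the coefficients `[5, -3, 0, 6]`, three bit planes are produced. Decoding only the top two planes gives `[4, -2, 0, 6]`, a coarser version of every coefficient, while decoding all three recovers the input exactly.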

The subbit plane coder in EAC goes one step further in recognizing that bits in the same bit plane can be different in their rate and distortion contributions. First, the coefficients represented by the bits may have different JND thresholds that lead to vastly different subjective distortions even if the objective distortions are the same. Second, the bits can be statistically different considering their neighbor coefficients and coding histories. An illustration of the subbit planes is shown in Figure 6.7. Since the coefficients in EAC are actually arranged in a 2D array indexed by the time interval $i$ and frequency $j$, the actual bit array is 3D. However, it is difficult to draw a 3D bit array; therefore, we show a slice of the bit array in 2D in Figure 6.7. Note that the sign of the coefficient is also part of the bit array, as the ‘plus’ and ‘minus’ signs can be represented by 0 and 1, respectively.

FIGURE 6.7: Subbit plane-embedded entropy coding.

Let $b_M$ be a bit in a coefficient $x$, which is to be encoded. If all more significant bits in the same coefficient $x$ are 0s, the coefficient $x$ is said to be insignificant (because if the bit stream is terminated right before bit $b_M$ has been coded, coefficient $x$ will be reconstructed as zero), and the current bit $b_M$ is to be encoded in the mode of significance identification. Otherwise, the coefficient is said to be significant, and the bit $b_M$ is to be encoded in the mode of refinement. We distinguish between significance identification and refinement because a significance identification bit has a very high probability of being 0, while a refinement bit is usually equally distributed between 0 and 1. The sign of the coefficient only needs to be encoded immediately after the coefficient turns significant, that is, after the first nonzero bit in the coefficient is encoded. For the bit array in Figure 6.7, the significance identification and the refinement bits are separated by a solid bar. For a critical band $s_{i,k}$, we call the band insignificant if all the coefficients in the critical band are insignificant. It becomes significant when at least one coefficient is significant.

EAC defines three subbit planes in a bit plane: the predicted significance (PS), the refinement (REF), and the predicted insignificance (PN). The PS subbit plane consists of bits of coefficients that are insignificant but have at least one neighbor known to be significant. The REF subbit plane consists of bits of coefficients that are already significant, that is, in the refinement mode. The PN subbit plane consists of bits of coefficients that are insignificant with no neighbors known to be significant. The subbit plane design is motivated by previous work on image coding [4] and the JPEG 2000 standard [14], which show that bits in different subbit planes contribute different decreases in average distortion per coding bit spent. For the sample bit array in Figure 6.7, we show the subbit plane types with different shades for the first three bit planes of the bit array.
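The three subbit plane classes can be illustrated with a small Python sketch that labels the bit of every coefficient in a given bit plane. Several simplifications are assumptions of this example rather than details from the text: the coefficients form a 1D row instead of the 2D time-frequency array, the neighbors of a coefficient are taken to be its immediate left and right entries, and significance is evaluated once at the start of the bit plane instead of being updated as bits are coded. The function name `classify_subbit_planes` is hypothetical.

```python
def classify_subbit_planes(mags, plane):
    """Label the bit of each coefficient in bit plane `plane`.

    mags  -- nonnegative integer magnitudes of the coefficients
    plane -- index of the bit plane about to be coded (0 is the LSB)
    Returns a list of 'PS', 'REF' or 'PN', one label per coefficient.
    """
    # A coefficient is significant if any more significant bit is nonzero.
    significant = [(m >> (plane + 1)) != 0 for m in mags]
    labels = []
    for j, is_sig in enumerate(significant):
        if is_sig:
            labels.append("REF")      # refinement mode
            continue
        left = j > 0 and significant[j - 1]
        right = j + 1 < len(significant) and significant[j + 1]
        # Insignificant: predicted significant only if a neighbor is significant.
        labels.append("PS" if (left or right) else "PN")
    return labels
```

At the top bit plane every label is `'PN'`, because no coefficient can be significant before its most significant bit has been coded. For the magnitudes `[4, 0, 0, 0, 1]` at bit plane 1, the labels are `['REF', 'PS', 'PN', 'PN', 'PN']`.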

We call a subbit plane of a critical band an embedded coding unit (ECU). The ECU is the smallest unit in the EAC reordering operation. The coding order of the ECUs is determined by the instantaneous JND threshold of the critical band.

First, the initial auditory JND thresholds are calculated by using the quiet threshold. Using the initial threshold, the coding order of the ECUs is determined, and a set of high-priority ECUs is encoded. After a number of ECUs have been encoded or after a certain update interval, the auditory JND threshold is recalculated by both the encoder and decoder based on the partially coded coefficients at the moment. The updated JND threshold is then used to determine the formation and the coding order of the remaining ECUs. The process iterates until a certain end condition is met.

Note that we deliberately chose to update the JND threshold infrequently rather than after the encoding of each ECU or even after the encoding of each bit. This reduces the computational cost of updating the JND threshold. Because a slightly outdated JND threshold in EAC only leads to a slightly nonoptimal coding order of the ECUs, its impact on compression performance is minimal.

We mark the identity of each ECU by the critical band the ECU resides in and an ID that identifies the subbit plane. The ID is a rational number whose integer part is the bit plane index and whose fractional part is assigned according to the subbit plane class. Currently, the PS, REF, and PN subbit planes are assigned the fractional values 0.875, 0.125, and 0.0, respectively. As an example, the ID of the PS subbit plane of bit plane 7 is 7.875. The fractional values are chosen in consideration of the average rate-distortion contribution of each subbit plane class. Within each critical band, EAC encodes the ECUs in descending order of their IDs. For a critical band with a total of $L$ bit planes, the first ECU to be encoded is the PN subbit plane of bit plane $L-1$ (ID: $L - 1.0$), because all coefficients are insignificant at bit plane $L-1$. The next three subbit planes to be encoded are the PS (ID: $L - 1.125$), REF (ID: $L - 1.875$), and PN (ID: $L - 2.0$) subbit planes of bit plane $L-2$; here, for instance, $L - 1.125 = (L-2) + 0.875$. Subsequently, the subbit planes of bit plane $L-3$ are encoded. With the order of ECUs within a critical band already determined, the implicit auditory masking process only needs to determine the order of the ECUs among different critical bands. Conveniently, this can be done by determining the critical bands whose ECUs are next in line to be coded.
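The ID assignment and the resulting coding order within one critical band can be checked with a few lines of Python. The fractional values are the ones quoted in the text; the names `FRACTION`, `ecu_id`, and `ecu_order` are hypothetical, and the sketch assumes that the top bit plane contributes only a PN subbit plane, since no coefficient is significant yet.

```python
# Fractional part of the ECU ID for each subbit plane class (from the text).
FRACTION = {"PS": 0.875, "REF": 0.125, "PN": 0.0}

def ecu_id(bit_plane, kind):
    """ID of an ECU: bit plane index plus the class-dependent fraction."""
    return bit_plane + FRACTION[kind]

def ecu_order(n_planes):
    """ECUs of one critical band with n_planes bit planes, in coding order.

    Each ECU is a (bit_plane, kind) pair; the list is sorted by
    descending ID, which is the order used within a critical band.
    """
    ecus = [(n_planes - 1, "PN")]    # top plane: all coefficients insignificant
    ecus += [(p, kind) for p in range(n_planes - 1) for kind in FRACTION]
    return sorted(ecus, key=lambda e: ecu_id(*e), reverse=True)
```

For `n_planes = 3`, the order is PN of plane 2 (ID 2.0), then PS, REF, and PN of plane 1 (IDs 1.875, 1.125, 1.0), then PS, REF, and PN of plane 0. With $L = 3$ these IDs match $L - 1.0$, $L - 1.125$, $L - 1.875$, and $L - 2.0$.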

We assign two important properties to each critical band: an instantaneous JND threshold and a progress indicator. The instantaneous JND thresholds are based on the partially reconstructed coefficient values of already coded coefficients, and the progress indicator records the ID of the next ECU to be encoded. It is the gap between the progress indicator and the instantaneous JND threshold that determines the coding order of ECUs. The coding process of the subbit plane entropy coder with implicit auditory masking can thus be described as follows.

1. Initialization.

The maximum bit plane $L$ of all coefficients is calculated. The progress indicators of all critical bands are set to the PN subbit plane of bit plane $L-1$ (with ID: $L-1$). The initial instantaneous JND threshold of each critical band is set according to the quiet threshold. We also mark all critical bands as insignificant.

2. Finding the current gap.

For each critical band, we calculate a gap between its progress indicator and the instantaneous JND threshold. The gap is closely related to the level of the coding noise over the auditory JND threshold, that is, the noise-to-mask ratio (NMR). The largest gap among all critical bands is defined as the current gap. The value of the current gap can be negative, which simply means that coefficients with a signal energy level below the auditory JND threshold are being encoded. It can be easily proven that the instantaneous JND threshold is monotonically increasing and the progress indicator is monotonically decreasing. Therefore, the current gap shrinks in every iteration.

3. Encoding all critical bands with gap equal to the current gap.

We encode all critical bands with gap value the same as the current gap in this iteration. Such a process leads to the encoding of the ECUs with the largest reduction of NMR per coding bit spent. This encoding step may further consist of the following substeps.

(a) Critical band skipping.

If a chosen critical band is insignificant (not a single coefficient is significant), a status bit is encoded to indicate whether the critical
