5.2.1 Spatial, Temporal, and SNR Coding Structures
There are three basic types of scalability in scalable video coding: spatial, temporal, and quality (or SNR) scalabilities. In a spatial scalable scheme, full decoding leads to high spatial resolution, while partial decoding leads to reduced spatial resolutions (reduction of the format). In a temporal scalable scheme, partial decoding provides lower decoded frame rates (temporal resolutions). In an SNR scalable scheme, temporal and spatial resolutions are kept the same, but the video quality (SNR) varies depending on how much of the bit stream is decoded.
Current standards, such as H.263, H.264, MPEG-2, and MPEG-4 (both part 2 and part 10), are based on a predictive video coding scheme (see Figure 5.1).
FIGURE 5.1: Predictive (hybrid) video coding scheme.
Although they were not initially designed to address these issues, current standards tried to upgrade their video coding schemes in order to include scalability functionalities. However, this integration generally came at the expense of coding efficiency (performance).
In a standard environment, scalability is achieved through a layered structure, where the encoded video information is divided into two or more separate bit streams corresponding to the different layers (see Figure 5.2).
• The base layer (BL) is generally highly and efficiently compressed by a nonscalable standard solution.
• The enhancement layer(s) (EL) encode(s) the residual signal to produce the expected scalability (it delivers, when combined with the base layer decoding, a progressive quality improvement in case of SNR scalability, a higher spatial resolution for spatial scalability, and a higher frame rate for temporal scalability).
To achieve spatial scalability in the hybrid scheme presented in Figure 5.3, the input video sequence is first spatially decimated to yield the lowest resolution layer, which is encoded by a standard encoder. A similar coding scheme is employed for the enhancement layer. To transmit a higher resolution version of the current frame, two predictions are formed: one is obtained by spatially interpolating the decoded lower resolution image of the current frame (spatial prediction) and the other by temporally compensating the higher resolution image of the predicted frame with motion information (temporal prediction). The two predictions are then adaptively combined for a better prediction and the residue after prediction is coded and transmitted. In Figure 5.3, a scheme with two resolution levels is depicted, but the same principle can be used to produce several spatial resolution enhancement levels. This solution corresponds to a Laplacian pyramid and is noncritically sampled, or redundant (the number of output samples is higher than the number of input samples).
The drawback of this approach is that the different encoding loops, each with its own motion estimation step, are used in parallel at the encoder side, and several motion compensation loops are necessary at the decoder side, thus increasing the computational complexity at both the encoder and the decoder.
A possible advantage of this scheme is the flexibility in choosing the downsampling/upsampling filters, in particular for reducing aliasing at lower resolutions.
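As a rough illustration, the two-level pyramid of Figure 5.3 can be sketched with the spatial prediction path only (temporal prediction, the adaptive combination of predictions, and residual coding are omitted; the box and replication filters below are illustrative stand-ins for real decimation and interpolation filters):

```python
import numpy as np

def downsample(frame):
    """Spatial decimation by 2 using a simple 2x2 box filter
    (an illustrative stand-in for the encoder's decimation filter)."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(frame):
    """Spatial interpolation by 2 via pixel replication
    (a real codec would use a better interpolation filter)."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_layers(frame):
    """Split a frame into a low-resolution base layer and an
    enhancement-layer residual (spatial prediction only)."""
    base = downsample(frame)
    spatial_prediction = upsample(base)
    residual = frame - spatial_prediction   # coded in the enhancement layer
    return base, residual

def reconstruct(base, residual):
    """Full-resolution decoding: base-layer prediction plus residual."""
    return upsample(base) + residual

frame = np.arange(64, dtype=float).reshape(8, 8)
base, residual = laplacian_layers(frame)
```

Note that the base layer (16 samples) plus the residual (64 samples) together exceed the 64 input samples, which is precisely the noncritical (redundant) sampling mentioned above.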
Related to spatial scalability is the issue of motion vector scalability.
Indeed, the different resolution levels will need motion vector fields with different resolutions and, possibly, accuracies. For the aforementioned Laplacian pyramid coding, the simplest approach is to estimate and encode the motion vectors, starting from the lowest resolution and going to the highest. From one layer to the next, the motion vector size needs to be doubled. Additionally, a refinement of the motion vector can be performed at higher resolutions. At this point, the precision and the accuracy of the motion can also be increased at higher levels. By precision we understand here the size of the block considered for motion estimation and compensation. When doubling the resolution, the dimensions of the block also double, and the motion representation loses in precision. Therefore, it may be convenient to split the block into smaller subblocks (two rectangular or four square ones) and look for refinement vectors in the subblocks. The decision to split or to keep the lower resolution precision may be taken based on a rate–distortion criterion. Once the lowest resolution motion vector field is encoded, the next levels can either be encoded independently, with a possible loss in efficiency, or only the refinement vector(s) can be encoded in the refinement layer. The interested reader is referred to [22] for a more detailed discussion of motion vector scalability and its impact on prediction complexity.

FIGURE 5.2: Global structure of a layered scalable video coding scheme.

FIGURE 5.3: Layered spatial scalability.
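The layer-to-layer doubling with an optional refinement described above amounts to little more than the following (function and variable names are illustrative):

```python
def scale_motion_vector(mv_low, refinement=(0, 0)):
    """Predict a motion vector at the next (doubled) resolution level:
    the lower-layer vector doubles with the sampling grid, and an
    optional refinement estimated at the higher resolution is added."""
    return (2 * mv_low[0] + refinement[0],
            2 * mv_low[1] + refinement[1])

mv_base = (3, -1)                              # estimated at the low resolution
mv_enh = scale_motion_vector(mv_base, refinement=(1, 0))
```

In a rate–distortion optimized encoder, the refinement (and the decision to split a block into subblocks) would be chosen per block by comparing the rate cost of the refinement bits against the distortion reduction.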
Temporal scalability involves partitioning of the group of pictures (GOP) into layers having the same spatial resolution. A simple way to achieve temporal scalability is to put some of the B frames from an IBBP . . . stream into one or several enhancement layers. This solution comes at no cost in terms of coding efficiency.
In a more general setting, the base layer may contain I, P, and B frames at the low frame rate, while the enhancement layers can only use frames from the immediately lower temporal layer and previous frames from the same enhancement layer for temporal prediction. Generally, temporal prediction from future frames in the same enhancement layer is prohibited in order to avoid reordering in the enhancement layers. An example with one enhancement layer is presented in Figure 5.4.
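One common dyadic frame-to-layer assignment can be sketched as follows (a simplification: real GOP structures are more general, and the function name is invented):

```python
def temporal_layer(frame_index, num_layers):
    """Dyadic temporal layering: layer 0 (the base layer) keeps every
    2**(num_layers-1)-th frame; each enhancement layer doubles the
    frame rate of the layers below it."""
    for layer in range(num_layers):
        step = 2 ** (num_layers - 1 - layer)
        if frame_index % step == 0:
            return layer
    return num_layers - 1

# Layer of each of the first 8 frames with 3 temporal layers.
layers = [temporal_layer(i, 3) for i in range(8)]
```

Decoding only the frames of layer 0 yields one quarter of the full frame rate; adding layer 1 yields half, and adding layer 2 restores the full rate.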
The layered solution can be seen as an upgrade of standard solutions in order to provide scalability. The main shortcoming of these schemes comes from the fact that the information redundancy between the different layers cannot be fully exploited. This functionality is thus achieved at the expense of implementation complexity and coding efficiency.
FIGURE 5.4: General framework for layered temporal scalability.
A general problem with introducing scalability in a predictive video coding scheme is the so-called drift effect. It occurs when the reference frame used for motion compensation in the encoding loop is not available, or not completely available, at the decoder side. Therefore, the encoder and the decoder have to stay synchronized on the same bit rate in the case of SNR scalability, on the same resolution level for spatial scalability, and on the same frame rate in the case of temporal scalability.
For SNR scalability, a layered encoder exploits correlations across subflows to achieve better overall compression: the input sequence is compressed into a number of discrete layers arranged in a hierarchy that provides progressive refinement. A strategy often used in the scalable extensions of current standards (e.g., in MPEG-2 and H.263) is to encode the base layer using a large quantization step, whereas the enhancement layers have a refinement goal and use finer quantizers to encode the base layer coding error. This solution is illustrated in Figure 5.5 and is discussed in more detail later.
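A minimal numerical sketch of this base-plus-refinement quantization strategy (the step sizes and coefficient values below are invented):

```python
import numpy as np

coeffs = np.array([78.0, -35.0, 12.0, -3.0])   # toy transform coefficients

# Base layer: coarse quantization with a large step size.
q_base = 16.0
base_idx = np.round(coeffs / q_base)
base_rec = base_idx * q_base          # what a base-layer-only decoder sees

# Enhancement layer: requantize the base-layer coding error
# with a finer step, as in the MPEG-2 / H.263 layered extensions.
q_enh = 4.0
enh_idx = np.round((coeffs - base_rec) / q_enh)
full_rec = base_rec + enh_idx * q_enh          # base + enhancement decoding
```

A decoder receiving only the base layer reconstructs `base_rec`; adding the enhancement layer tightens the maximum reconstruction error to half the finer step size.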
5.2.2 Successive Approximation Quantization and Bit Planes
To realize the SNR scalability concept discussed earlier, an important category of embedded scalar quantizers is the family of embedded dead zone scalar quantiz- ers. For this family, each transform coefficientxis quantized to an integer
ib=Qb(x)=
⎧⎨
⎩
sign(x)ã |x|
2b+ ξ 2b
, if |x| 2b+ ξ
2b>0,
0, otherwise,
whereadenotes the integer part ofa;ξ <1 determines the width of the dead zone; >0 is the basic quantization step size (basic partition cell size) of the quantizer family; and b∈Z+ indicates the quantizer level (granularity), with higher values ofb indicating coarser quantizers. In general,bis upper bounded by a valueBmax, selected to cover the dynamic range of the input signal. The reconstructed value is given by the inverse operation,
\[
y_{i_b} = Q_b^{-1}(i_b) =
\begin{cases}
0, & i_b = 0,\\[2pt]
\operatorname{sign}(i_b)\left( |i_b| - \dfrac{\xi}{2^{b}} + \delta \right) 2^{b}\Delta, & i_b \neq 0,
\end{cases}
\]

where 0 ≤ δ < 1 specifies the placement of the reconstructed value y_{i_b} within the corresponding uncertainty interval (partition cell), defined as C_i^b, and i is the partition cell index, which is bounded by a predefined value for each quantizer level (i.e., 0 ≤ i ≤ M_b − 1, for each b). Based on the aforementioned formulation, it is rather straightforward to show that the quantizer Q_0 has embedded within it
FIGURE 5.5: Layered SNR scalability.
all the uniform dead zone quantizers with step sizes 2^b Δ, b ∈ Z+. Moreover, it can be shown that, under the appropriate settings, the quantizer index obtained by dropping the b least-significant bits (LSBs) of i_0 is the same as that which would be obtained if the quantization was performed using a step size of 2^b Δ, b ∈ Z+, rather than Δ. This means that if the b LSBs of i_0 are not available, one can still dequantize at a lower level of quality using the inverse quantization formula.
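This embedding property can be checked numerically. The sketch below follows the formulas above with ξ = 0 (shown for a nonnegative input; in sign-magnitude form the bit shift applies to |i_0|); the helper names are invented:

```python
def edzq_index(x, delta, b, xi=0.0):
    """Embedded dead zone quantizer: index i_b at level b, i.e.,
    with step size 2**b * delta."""
    mag = abs(x) / (2 ** b * delta) + xi / 2 ** b
    i = int(mag)                       # integer part (floor, mag >= 0)
    return i if x >= 0 else -i

def edzq_reconstruct(i, delta, b, xi=0.0, d=0.5):
    """Inverse operation; d = 1/2 places the output at the cell midpoint
    (the SAQ choice)."""
    if i == 0:
        return 0.0
    sign = 1.0 if i > 0 else -1.0
    return sign * (abs(i) - xi / 2 ** b + d) * (2 ** b * delta)

delta, x, b = 1.0, 45.7, 3
i0 = edzq_index(x, delta, 0)     # finest-level index
ib = edzq_index(x, delta, b)     # index with step 2**b * delta
# i0 >> b equals ib: dropping the b LSBs of i_0 gives the level-b index,
# so a decoder missing those bits can still dequantize at lower quality.
y_low = edzq_reconstruct(ib, delta, b)
```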
The most common option for embedded scalar quantization is successive approximation quantization (SAQ). SAQ is a particular instance of the generalized family of embedded dead zone scalar quantizers defined earlier. For SAQ, M_{B_max} = M_{B_max−1} = ··· = M_0 = 2 and ξ = 0, which determines a dead zone width twice as wide as the other partition cells, and δ = 1/2, which implies that the output levels y_{i_b} are in the middle of the corresponding uncertainty intervals C_i^b. SAQ can be implemented via thresholding, by applying a monotonically decreasing set of thresholds of the form

\[
T_{b-1} = \frac{T_b}{2},
\]

with B_max ≥ b ≥ 1. The starting threshold T_{B_max} is of the form T_{B_max} = α x_max, where x_max is the highest coefficient magnitude in the input transform decomposition, and α is a constant that is taken as α ≥ 1/2.
Let us consider the case of using a spatial transform for the compression of the frames. By using SAQ, the significance of the transform coefficients with respect to any given threshold T_b is indicated in a corresponding binary map, denoted by W_b, called the significance map. Denote by x(k) the transform coefficient with coordinates k = (κ_1, κ_2) in the two-dimensional transform domain of a given input. The significance operator s_b(·) maps any value x(k) in the transform domain to a corresponding binary value w_b(k) in W_b, according to the rule

\[
w_b(k) = s_b(x(k)) =
\begin{cases}
0, & \text{if } |x(k)| < T_b,\\
1, & \text{if } |x(k)| \ge T_b.
\end{cases}
\]
In general, embedded coding of the input coefficients translates into coding the significance maps W_b, for every b with B_max ≥ b ≥ 0.
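As a toy illustration of the SAQ thresholds and the resulting significance maps (the coefficient values are invented, α = 1/2 is used, and the stopping threshold is chosen arbitrarily for the example):

```python
import numpy as np

coeffs = np.array([[34.0, -5.0],
                   [ 9.0, 60.0]])       # toy 2x2 block of transform coefficients

x_max = float(np.abs(coeffs).max())
T = x_max / 2                           # T_Bmax = alpha * x_max, alpha = 1/2
significance_maps = []                  # W_b for b = Bmax, Bmax-1, ...
while T >= 1.0:                         # example stopping point
    W = (np.abs(coeffs) >= T).astype(int)   # w_b(k) = 1 iff |x(k)| >= T_b
    significance_maps.append(W)
    T /= 2                              # T_{b-1} = T_b / 2
```

Each successive map refines the previous one: coefficients only ever change from nonsignificant to significant as the threshold halves.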
In most state-of-the-art embedded coders, for every b this is effectively performed based on several encoding passes, which can be summarized as follows:
Nonsignificance pass: encodes s_b(x(k)) for each coefficient in the list of nonsignificant coefficients (LNC). If significant, the coefficient coordinates k are transferred into the refinement list (RL).
Block significance pass: for a block of coefficients with coordinates k_block, this pass encodes s_b(x(k_block)) and sign(x(k_block)) if they have descendant blocks (under a quadtree decomposition structure) that were not significant compared to the previous bit plane.
Coefficient significance pass: if the coordinates of the coefficients of a significant block are not in the LNC, this pass encodes the significance of coefficients in blocks containing at least one significant coefficient. Also, the coordinates of new significant coefficients are placed into the RL. This pass also moves the coordinates of nonsignificant coefficients found in the block into the LNC for the next bit-plane level(s).
Refinement pass: for each coefficient in the RL (except those newly put into the RL during the last block pass), encode the next refinement of the significance map.
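A heavily simplified, coefficient-wise version of these passes (no block/quadtree significance pass and no LNC bookkeeping; function and variable names are illustrative) might look like:

```python
def bitplane_encode(coeffs, num_planes):
    """Simplified embedded bit-plane encoder over integer coefficients.
    Per bit plane: a significance pass (one significance bit per not-yet-
    significant coefficient, plus a sign bit for newly significant ones)
    followed by a refinement pass (one magnitude bit per coefficient that
    was already significant before this plane)."""
    T = 1 << (num_planes - 1)              # top threshold (power of two)
    significant = [False] * len(coeffs)
    stream = []
    while T >= 1:
        newly = []
        for k, c in enumerate(coeffs):     # significance pass
            if not significant[k]:
                if abs(c) >= T:
                    stream.append(1)                   # becomes significant
                    stream.append(0 if c >= 0 else 1)  # sign bit
                    newly.append(k)
                else:
                    stream.append(0)
        p = T.bit_length() - 1             # current bit-plane index
        for k, c in enumerate(coeffs):     # refinement pass
            if significant[k]:             # newly significant ones excluded
                stream.append((abs(c) >> p) & 1)
        for k in newly:
            significant[k] = True
        T >>= 1
    return stream

stream = bitplane_encode([5, -3, 0, 8], num_planes=4)
```

Truncating `stream` after any complete plane still lets a decoder reconstruct every coefficient to within the last threshold processed, which is exactly the embedded (SNR-scalable) behavior described above.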
5.2.3 Other Types of Scalability
In addition to the aforementioned scalabilities, other types of scalability have been proposed.
• Complexity scalability: the encoding/decoding algorithm has less complex- ity (CPU/memory requirements or memory access) with decreasing tempo- ral/spatial resolution or decreasing quality [40].
• Content (or object) scalability: a hierarchy of relevant objects is defined in the video scene and a progressive bit stream is created following this impor- tance order. Such methods of content selection may be related to arbitrary- shaped objects or even to rectangular blocks in block-based coders. The main problem of such techniques is how to automatically select and track visually important regions in video.
• Frequency scalability: this technique, popular in the context of transform coding, consists of allocating coefficients to different layers according to their frequency. Data partitioning techniques may be used to implement this functionality. The interested reader is referred to Chapter 2 of this book for more information on data partitioning.
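As a sketch of the frequency scalability idea (the zigzag-ordered coefficient values and the partition breakpoint below are invented):

```python
import numpy as np

# Zigzag-ordered transform coefficients of one block: low frequencies first.
zigzag_coeffs = np.array([120, -31, 18, 7, -4, 2, 1, 0, 0, 1])

# Data partitioning: the first `breakpoint` (low-frequency) coefficients go
# to the base partition, the rest to the enhancement partition.
breakpoint = 4
base_partition = zigzag_coeffs[:breakpoint]
enh_partition = zigzag_coeffs[breakpoint:]
```

A decoder receiving only the base partition reconstructs a blurred (low-frequency) version of the block; the enhancement partition restores the detail.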
Among existing standards, the first ones (MPEG-1 and H.261) did not provide any kind of scalability. H.263+ and H.264 provide temporal scalability through B-frame skipping.