An SDTV Decoder with HDTV Capability: An All-Format ATV
This paper describes techniques for implementing a video decoder that can decode MPEG-2 high-definition (HD) bit streams at a significantly lower cost than that for previously described high-definition video decoders. The subjective quality of the pictures produced by this "HD-capable" decoder is roughly comparable to current DBS-delivered standard-definition (SD) digital television pictures. The HD-capable decoder can decode SD bit streams with precisely the same results as a conventional standard-definition decoder. The MPEG term Main Profile at Main Level (MP@ML) is also used to refer to standard-definition video in the sequel.
The decoder makes use of a pre-parser circuit that examines the incoming bit stream in a bit-serial fashion and selectively discards coded symbols that are not important for reconstruction of pictures at reduced resolution. This pre-parsing process is performed so that the required channel buffer size and bandwidth are both significantly reduced. The pre-parser also allows the syntax parser (SP) and variable-length decoder (VLD) circuitry to be designed for lower performance levels.
The HD-capable decoder "downsamples" decoded picture data before storage in the frame memory, thereby permitting reduction of the memory size. This downsampling can be performed adaptively on a field or frame basis to maximize picture quality. Experiments have been carried out using different methods for downsampling with varying results. The combination of the pre-parser and picture downsampling enables the use of the same amount of memory as used in standard definition video decoders.
The decoder selects a subset of the 64 DCT coefficients of each block for processing and treats the remaining coefficients as having the value zero. This leads to simplified inverse quantization (IQ) and inverse discrete cosine transform (IDCT) circuits. A novel IDCT is described whereby the one-dimensional 8-point IDCT used for decoding standard definition pictures is used as the basis for performing a reduced complexity IDCT when processing high-definition bit streams.
A decoder employing the above techniques has been simulated using "C" with HDTV bit streams, and the results are described. Normal HDTV encoding practices were used in these experiments. The bit streams were decoded according to the concepts described herein, including pre-parsing, the effects of reduced memory sizes, simplified IDCT processing, and the various associated filtering steps. The pre-parser and resampling result in a certain amount of prediction "drift" in the decoder that depends on a number of factors, some of which are under the control of the encoder. Those who have viewed the resulting images agree that the decoding discussed in this paper produces images that meet performance expectations of SDTV quality.
The HD-capable video decoder, as simulated, can be expected to be implementable at a cost only marginally higher than that of a standard definition video decoder. The techniques described here could be applied to produce HD-capable decoders at many different price/performance points. By producing a range of consumer products that can all decode HDTV bit streams, a migration path to full HDTV is preserved while allowing a flexible mix of video formats to be transmitted at the initiation of digital television service.
There is an ongoing policy debate about SDTV and HDTV standards, about a broadcast mix of both formats, and about how a full range of digital television might evolve from a beginning that includes either SDTV or HDTV or both. This paper offers technical input to that debate, specifically regarding consumer receivers that could decode both SDTV and HDTV digital signals at a cost only marginally higher than that of SDTV alone.
There are at least two areas of policy debate in which these issues are relevant:
1. What is the right mix of HDTV and SDTV as the digital service evolves over time? There are a variety of introduction scenarios for digital television, ranging from HDTV only, to SDTV only, to various mixes of the two. To preserve the HDTV broadcast option no matter how digital television service is introduced, SDTV receivers must be able to decode the HDTV signal. It is assumed here that SDTV receivers with such HDTV-decoding capability are both practical and cost effective. It is thus entirely practical to preclude SDTV-only receivers. Therefore, the introduction of SDTV would not prevent later introduction of HDTV because fully capable digital receivers would already be in use.
2. How quickly can National Television System Committee (NTSC) broadcasting be discontinued? The receiver design approach described herein can be applied to low-cost set-top boxes that permit NTSC receivers to be used to view digital television broadcasts. The existence of such decoders at low cost is implicit in any scenario that terminates NTSC broadcast.
Cost and Complexity of Full-Resolution HDTV Decoder Components
The single most expensive element of a video decoder is the picture storage memory. A fully compliant video decoder for U.S. HDTV will require a minimum of 9 MBytes of RAM for picture storage. An HDTV decoder will also require at least 1 MByte of RAM for channel buffer memory to provide temporary storage of the compressed bit stream. It can be expected that practical HDTV video decoders will employ 12 to 16 MBytes of specialty DRAM, which will probably cost at least $300 to $400 for the next few years and may be expected to cost more than $100 for the foreseeable future.
The IDCT section performs a large number of arithmetic computations at a high rate and represents a significant portion of the decoder chip area. The inverse quantizer (IQ) performs a smaller number of computations at a high rate, but it may also represent significant complexity.
The SP and VLD logic may also represent a significant portion of the decoder chip area. At the speeds and data rates specified for U.S. HDTV, multiple SP/VLD logic units operating in parallel may be required in a full HDTV decoder.
Cost Reductions of HDTV Decoder
This section describes several techniques that can be applied to reduce the cost of an HD-capable decoder. The following decoder subunits are considered: picture storage memory, pre-parser and channel buffer, SP and VLD, inverse quantizer and inverse discrete cosine transform, and motion compensated prediction. The discussion refers to Figure 1, which is a block diagram of a conventional SDTV decoder, and Figure 2, which is a block diagram of an HD-capable decoder. The blocks that appear in Figure 2 but not in Figure 1 have been shaded to highlight the differences between an HD-capable decoder and a conventional SD decoder.
As described in Ng (1993), the amount of picture-storage memory needed in a decoder can be reduced by downsampling (i.e., subsampling horizontally and vertically) each picture within the decoding loop. Note in Figure 2 that residual or intra data downsampling takes place after the IDCT block and prediction downsampling is done following the half-pel interpolation blocks. The upsample operation shown in Figure 2 serves to restore the sampling lattice to its original scale, thus allowing the motion vectors to be applied at their original resolution. Although this view is functionally accurate, in actual hardware implementations the residual/intra downsampling operation would be merged with the IDCT operation, and the prediction downsample operation would be merged with the upsample and half-pel interpolation. In an efficient implementation the upsample/half-pel interpolation/downsample operation is implemented by appropriately weighting each of the reference samples extracted from the (reduced resolution) anchor frame buffers to form reduced resolution prediction references.
The weights used in this operation depend on the full-precision motion vectors extracted from the coded bit stream.
Experiments have shown that it is important that the prediction downsampling process be close to the inverse of the upsampling process, since even small differences become noticeable after many generations of predictions (e.g., after an unusually long group of pictures (GOP) that also contains many P-frames). There are two simple methods:
For both of these methods the concatenated upsample-downsample operation is the identity when the motion vectors are zero. Both methods have been shown to provide reasonable image quality.
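The weighting idea can be illustrated with a minimal one-dimensional sketch. It assumes a downsampling factor of 2, upsampling by sample replication, bilinear half-pel interpolation, and downsampling by averaging adjacent samples; these filter choices and the function name `prediction_weights` are illustrative assumptions, not necessarily either of the methods referred to above.

```python
from collections import defaultdict
from fractions import Fraction

FACTOR = 2  # assumed downsampling factor for this 1-D sketch


def prediction_weights(mv_half_pel, k):
    """Weights on stored (reduced-resolution) samples for reduced output sample k.

    mv_half_pel is the full-resolution motion vector in half-pel units.
    Returns {stored_sample_index: weight}; the weights sum to 1.
    """
    weights = defaultdict(Fraction)
    whole, half = divmod(mv_half_pel, 2)  # integer-pel part, half-pel flag
    # Output sample k is the average of full-resolution samples 2k and 2k+1.
    for n in range(FACTOR * k, FACTOR * (k + 1)):
        if half == 0:
            taps = [(n + whole, Fraction(1))]
        else:  # bilinear half-pel interpolation between two neighbours
            taps = [(n + whole, Fraction(1, 2)), (n + whole + 1, Fraction(1, 2))]
        for full_pos, w in taps:
            # Replication upsampling: full-res sample m copies stored sample m // FACTOR.
            weights[full_pos // FACTOR] += w / FACTOR
    return dict(weights)


print(prediction_weights(0, 5))  # zero motion: only stored sample 5, weight 1
print(prediction_weights(1, 5))  # half-pel shift: samples 5 and 6
```

With these assumed filters a zero motion vector gives a single weight of 1 on the co-located stored sample, which is the identity property described above.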
For the residual/intra downsampling process it is possible to use frequency domain filtering in lieu of spatial filtering to control aliasing. Frequency domain filtering is naturally accomplished by "zeroing" the DCT coefficients that correspond to high spatial frequencies. Note that the prediction filtering may introduce a spatial shift; this can be accommodated by introducing a matching shift in the residual/intra downsampling process, or by appropriately biasing the motion vectors before use.
When processing interlaced pictures, the question arises as to whether upsampling and downsampling should be done on a field basis or on a frame basis. Field-based processing preserves the greatest degree of temporal resolution, whereas frame-based processing potentially preserves the greatest degree of spatial resolution. A brute-force approach would be to choose a single mode (either field or frame) for all downsampling.
A more elaborate scheme involves deciding whether to upsample or downsample each macroblock on a field basis or frame basis, depending on the amount of local motion and the high-frequency content. Field based processing is most appropriate when there is not much high-frequency content and/or a great deal of motion. Frame-based processing is most appropriate when there is significant high-frequency content and/or little motion.
One especially simple way of making this decision is to follow the choice made by the encoder for each macroblock in the area of field or frame DCT and/or field- or frame-motion compensation, since the same criteria may apply to both types of decisions. Although field conversion is not optimal in areas of great detail, such as horizontal lines, simulations show that if a single mode is used, field is probably the better choice.
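A minimal sketch of following the encoder's per-macroblock choice is given here. The string labels, the function name `resampling_mode`, and the tie-break rule (any field-based choice selects field resampling) are assumptions made for illustration; the default of field mode reflects the simulation result noted above.

```python
def resampling_mode(dct_type=None, motion_type=None, default="field"):
    """Choose "field" or "frame" resampling for one macroblock.

    dct_type and motion_type are hypothetical labels ("field" or "frame")
    for the encoder's DCT and motion-compensation choices; None means the
    choice is not signalled for this macroblock.
    """
    signalled = [choice for choice in (dct_type, motion_type) if choice is not None]
    if not signalled:
        return default
    # Assumed tie-break: a field-based choice suggests motion, so prefer field.
    return "field" if "field" in signalled else "frame"


print(resampling_mode("frame", "frame"))
print(resampling_mode("frame", "field"))
```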
In MPEG parlance, SDTV corresponds to Main Level, which is limited to 720 × 480 pixels at 60 Hz, for a total of 345,600 pixels. U.S. ATV allows pictures as large as 1920 × 1080 pixels. Sequences received in this format can be conveniently downsampled by a factor of 3 horizontally and a factor of 2 vertically to yield a maximum resolution of 640 × 540, a total of 345,600 pixels. Thus the memory provided for SDTV would be adequate for the reduced-resolution HD decoder as well. It would be possible to use the same techniques with a smaller amount of downsampling for less memory savings.
In a cost-effective video decoder, the channel buffer and picture-storage buffers are typically combined into a single memory subsystem. The amount of storage available for the channel buffer is the difference between the memory size and the amount of memory needed for picture storage. Table 1 shows the amount of picture-storage memory required to decode the two high-definition formats with downsampling. The last column shows the amount of free memory when a single 16-Mbit memory unit is used for all of the decoder storage requirements. This is important since cost-effective SDTV decoders use an integrated 16-Mbit memory architecture. The memory not needed for picture storage can be used for buffering the compressed video bit stream.
As indicated in Table 1, the 1920 × 1080 format is downsampled by 3 horizontally and 2 vertically. This results in efficient use of memory (exactly the same storage requirements as MP@ML) and leaves a reasonable amount of free memory for use as a channel buffer.
The natural approach for the 1280 × 720 format would be to downsample by 2 vertically and horizontally. This leaves sufficient free memory that the downconverter would never need to consider channel buffer fullness when deciding which data to discard.
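The memory budget can be checked with a short calculation. This is a sketch under stated assumptions: 8-bit 4:2:0 sampling (12 bits per pixel), three stored frames, and a "16-Mbit" memory of 2**24 bits. The function names are illustrative, and the results are not taken from Table 1.

```python
BITS_PER_PIXEL = 12        # assumed 8-bit 4:2:0: luma plus two quarter-size chroma planes
FRAMES_STORED = 3          # assumed: two anchor frames plus the frame being decoded
MEMORY_BITS = 16 * 2**20   # assumed meaning of "16-Mbit"


def picture_memory_bits(width, height, h_factor=1, v_factor=1):
    """Bits needed to store the downsampled frames."""
    pixels = (width // h_factor) * (height // v_factor)
    return pixels * BITS_PER_PIXEL * FRAMES_STORED


def free_bits(width, height, h_factor=1, v_factor=1):
    """Bits left over for the channel buffer in the shared memory."""
    return MEMORY_BITS - picture_memory_bits(width, height, h_factor, v_factor)


for name, args in [("MP@ML 720x480", (720, 480)),
                   ("1920x1080 down 3x2", (1920, 1080, 3, 2)),
                   ("1280x720 down 2x2", (1280, 720, 2, 2))]:
    print(name, picture_memory_bits(*args), free_bits(*args))
```

Under these assumptions the downsampled 1920 × 1080 format needs the same picture storage as MP@ML and leaves about 4.3 million bits free, while the downsampled 1280 × 720 format leaves roughly twice that.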
After decoding of a given macroblock, it might be immediately downsampled for storage or retained in a small buffer that contains several scan lines of full-resolution video to allow for filtering before downsampling. The exact method of upsampling and downsampling is discussed below; it can greatly affect image quality, since even small differences are made noticeable after many generations of predictions.1 The upsampling and downsampling functions are additional costs beyond that for an SD decoder.
The general concept of reducing memory storage requirements for a lower-cost HDTV decoder is known in the literature. This paper adds pre-parsing and new techniques for performing downsampling and upsampling.
Pre-parser and Channel Buffer
A fully compliant HDTV decoder requires at least 8 Mbits of high-speed RAM, with peak output bandwidth of 140 MBytes/sec for the channel buffer. With the use of a pre-parser to discard some of the incoming data before buffering, the output bandwidth can be reduced to a peak of 23 MBytes/sec and the size of the channel buffer can be reduced to 1.8 to 4.3 Mbits. (The lower number is required for MP@ML and the higher number is the amount left over in the SDTV 16-Mbit memory after a 1920 × 1080 image is downsampled by 3 horizontally and 2 vertically, including the required 3 frames of storage.)
The pre-parser examines the incoming bit stream and discards less important coding elements, specifically high-frequency DCT coefficients. It may perform this data selection while the DCT coefficients are still in the run-length/amplitude domain (i.e., while still variable-length encoded). The pre-parser thus serves two functions:
The pre-parser only discards full MPEG code words, creating a compliant but reduced data rate and reduced-quality bit stream. The picture degradation caused by the pre-parsing operation is generally minimal when downsampled for display at reduced resolution. The goal of the pre-parser is to reduce peak requirements in later functions rather than to significantly reduce average data rates. The overall reduction of the data rate through the pre-parser is generally small; for example, 18 Mbps may be reduced to approximately 12 to 14 Mbps.
The channel buffer in a full HDTV decoder must have high output bandwidth because it must output a full macroblock's data in the time it takes to process a macroblock. The pre-parser limits the maximum number of bits per macroblock to reduce the worst-case channel buffer output requirement. The peak number of bits allowed per macroblock in U.S. HDTV is 4608; this requires an output bandwidth of 140 MBytes/sec even though the average number of bits per macroblock is only 74. The pre-parser retains no more than 768 bits for each coded macroblock, thereby lowering the maximum output bandwidth to 23 MBytes/sec, the same as for MP@ML.
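The bandwidth figures can be reproduced approximately with the following sketch. It assumes a 1920 × 1080 picture at 30 frames per second, 16 × 16 macroblocks with a partial macroblock row counted as a full row, and a buffer that must deliver a worst-case macroblock within one macroblock period. The function names are illustrative.

```python
MB_SIZE = 16  # macroblock width and height in luma samples


def macroblocks_per_second(width, height, frame_rate):
    """Macroblock rate, rounding partial macroblocks up."""
    mbs_wide = -(-width // MB_SIZE)   # ceiling division
    mbs_high = -(-height // MB_SIZE)
    return mbs_wide * mbs_high * frame_rate


def peak_bandwidth_mbytes(bits_per_macroblock, mb_rate):
    """Worst-case buffer output bandwidth in millions of bytes per second."""
    return bits_per_macroblock * mb_rate / 8 / 1e6


rate = macroblocks_per_second(1920, 1080, 30)  # assumed HD format and frame rate
print(rate)
print(peak_bandwidth_mbytes(4608, rate))   # unrestricted HD worst case
print(peak_bandwidth_mbytes(768, rate))    # after the pre-parser cap
print(18e6 / rate)                         # average bits per macroblock at 18 Mbps
```

Under these assumptions the results are about 141 and 23.5 MBytes/sec, and about 74 bits per macroblock on average at 18 Mbps, in line with the figures quoted above.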
The pre-parser also removes high-frequency information (i.e., it does not retain any non-zero DCT coefficients outside of a predetermined low-frequency region). Pre-parsing could remove coefficients after a pre-specified coefficient position in the coded scan pattern, or it could remove only those coefficients that will not be retained for use in the IDCT. This reduces the total number of bits to be stored in the channel buffer.
In addition to discarding data to limit bits per coded macroblock and high-frequency coefficients, the pre-parser also alters its behavior based on the channel buffer fullness. The pre-parser keeps a model of buffer occupancy and removes coefficients as needed to ensure that the reduced-size channel buffer will never overflow. As this buffer increases in occupancy, the pre-parser becomes more aggressive about the amount of high-frequency DCT coefficient information to be discarded.
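The three discard rules (a per-macroblock bit cap, a scan-position cutoff, and a cutoff that tightens with buffer fullness) might be combined as in the sketch below. The data layout, the function names, and the linear fullness policy with its `loosest` and `tightest` limits are hypothetical choices for illustration; only the 768-bit cap comes from the text. End-of-block codes and header parsing are omitted.

```python
MAX_BITS_PER_MACROBLOCK = 768  # per-macroblock cap quoted in the text


def scan_limit(buffer_fullness, loosest=20, tightest=5):
    """Hypothetical policy: highest scan position kept, tightening as the buffer fills."""
    fullness = min(max(buffer_fullness, 0.0), 1.0)
    return round(loosest - fullness * (loosest - tightest))


def pre_parse_macroblock(blocks, buffer_fullness, header_bits=0):
    """Drop whole code words from one macroblock.

    blocks: list of blocks, each a list of (run, level, code_bits) tuples
    in coded scan order. Returns the retained code words per block.
    """
    limit = scan_limit(buffer_fullness)
    budget = MAX_BITS_PER_MACROBLOCK - header_bits
    kept_blocks = []
    for block in blocks:
        kept, position = [], -1
        for run, level, code_bits in block:
            position += run + 1  # scan index of this non-zero coefficient
            if position > limit or code_bits > budget:
                break  # discard this and all later code words in the block
            kept.append((run, level, code_bits))
            budget -= code_bits
        kept_blocks.append(kept)
    return kept_blocks


block = [(0, 10, 6), (0, 5, 5), (6, 2, 8), (17, 1, 12)]  # scan positions 0, 1, 8, 26
print(pre_parse_macroblock([block], buffer_fullness=0.0))
print(pre_parse_macroblock([block], buffer_fullness=1.0))
```

Because only complete code words are dropped, the retained data can still be parsed as run-length/amplitude pairs. A practical design would also need to share the bit budget fairly across the blocks of a macroblock, which this sketch does not attempt.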
This decoder management of its own buffer is a key difference between the enhanced SDTV decoder and a "normal" SDTV decoder. In a "normal" encoder/decoder combination, the encoder limits the peak data rate to match the specifications of the decoder buffer; it is the responsibility of the encoder to assure that the decoder buffer does not overflow. In the enhanced SDTV decoder outlined in this paper, the decoder can accept bit streams intended for a much larger buffer (i.e., an HDTV bit stream) and can perform its own triage on the incoming bit stream to maintain correct buffer occupancy.2
This pre-parser is an additional cost over a stand-alone SD decoder, but the cost and complexity are low since it can run at the relatively low average incoming-bit rate. The pre-parser is significantly less complex than a full-rate SP and VLD because of its slower speed requirement and because it parses but does not have to actually decode values from all of the variable length codes.3
Syntax Parser and Variable-Length Decoder
The computational requirements for the SP and VLD units of the downconverter are substantially reduced by implementing a simplified bit-serial pre-parser as described above. The pre-parser limits the maximum number of bits per macroblock. It also operates to limit the number of DCT coefficients in a block by discarding coefficients after a certain number, thus reducing the speed requirements of the SP and VLD units.
At the speeds and data rates specified for U.S. HDTV, multiple SP/VLD logic units operating in parallel may be required. The pre-parser limits the processing speed requirements for the HD downconverter to SDTV levels. Thus the only additional requirement on the SP/VLD block for decoding HDTV is the need for slightly larger registers for storing the larger picture sizes and other related information, as shown in Figure 2.
Inverse Quantization and Inverse Discrete Cosine Transform
Reduced complexity inverse quantizer (IQ) and inverse discrete cosine transform (IDCT) units could be designed by forcing some predetermined set of high frequency coefficients to zero. MPEG MP@ML allows for pixel rates of up to 10.4 million per second. U.S. ATV allows pixel rates of up to 62.2 million per second. It is therefore possible to use SDTV-level IQ circuitry for HDTV decoding by ignoring all but the 10 or 11 most critical coefficients. Some of the ignored coefficients (the 8 × 8 coefficients other than the 10 or 11 critical coefficients) will probably have already been discarded by the pre-parser. However, the pre-parser is not required to discard all of the coefficients to be ignored. The pre-parser may discard coefficients according to coded scan pattern order, which will not, in general, result in deleting all of the coefficients that should be ignored by later processing stages.
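The figure of 10 or 11 coefficients follows from the ratio of the two pixel rates. The sketch below assumes 30 frames per second for both formats and assumes that IQ throughput, measured in coefficients per second, scales with pixel rate.

```python
SD_PIXEL_RATE = 720 * 480 * 30     # about 10.4 million pixels per second
HD_PIXEL_RATE = 1920 * 1080 * 30   # about 62.2 million pixels per second

rate_ratio = HD_PIXEL_RATE / SD_PIXEL_RATE
coefficients_per_block = 64 / rate_ratio  # budget per 8 x 8 block at SD throughput

print(SD_PIXEL_RATE, HD_PIXEL_RATE)
print(rate_ratio)
print(coefficients_per_block)
```

The ratio is 6, so an IQ unit sized for MP@ML can process between 10 and 11 of the 64 coefficients in each HD block.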
Processing only 11 of 64 coefficients reduces the IQ computational requirement and significantly decreases the complexity of the IDCT. The complexity of the IDCT can be further reduced by combining the zeroing of coefficients with the picture downsampling described above.
IDCT circuitry for performing 8 × 8 IDCT is required for decoding SD bit streams. A common architecture for computing the two-dimensional IDCT is to use an engine that is capable of a fast, one-dimensional, 8-point IDCT. If the SD IDCT engine were used when decoding HD bit streams, it could perform about three 8-point IDCTs in the time of an HDTV block. Thus the SD IDCT can be used to compute the IDCT of the first three columns of coefficients. The remaining columns of coefficients would be treated as zero and thus require no IDCT resources.
A special-purpose IDCT engine would be implemented to do the row IDCTs. It would be especially simple since five of the eight coefficients would always be zero, and only two or three output points would have to be computed for each transform. Note that only four rows would have to be transformed if no additional filtering were to be performed prior to downsampling.
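The column-then-row structure can be sketched as follows. This is a floating-point illustration with a direct-form 8-point IDCT, not a fast or bit-exact implementation. The names `idct_1d`, `reduced_idct_2d`, and `KEPT_COLUMNS` are illustrative, and the row pass computes all eight output points rather than only those needed after downsampling.

```python
import math

KEPT_COLUMNS = 3  # coefficient columns (horizontal frequencies) retained


def idct_1d(coeffs):
    """Direct-form orthonormal inverse DCT of a short list of coefficients."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for u, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
            total += scale * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(total)
    return out


def reduced_idct_2d(block):
    """2-D IDCT of an 8 x 8 block using only the first KEPT_COLUMNS columns.

    block[v][u]: v is vertical frequency, u is horizontal frequency.
    Coefficients with u >= KEPT_COLUMNS are treated as zero.
    """
    # Column pass: three full 8-point IDCTs (the SD engine's share of the work).
    columns = [idct_1d([block[v][u] for v in range(8)]) for u in range(KEPT_COLUMNS)]
    # Row pass: each row has at most three non-zero inputs.
    pixels = []
    for y in range(8):
        row_in = [columns[u][y] for u in range(KEPT_COLUMNS)] + [0.0] * (8 - KEPT_COLUMNS)
        pixels.append(idct_1d(row_in))
    return pixels


dc_only = [[8.0 if (u, v) == (0, 0) else 0.0 for u in range(8)] for v in range(8)]
print(reduced_idct_2d(dc_only)[0][:3])
```

A special-purpose row engine would skip the five zero inputs and evaluate only the output points that survive downsampling, which is where the additional savings described above would come from.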
For blocks in progressive frames, or that use field IDCTs, coefficients might be selected according to the following pattern (retained coefficients are represented by "x"):
For blocks that use frame DCTs on an interlaced picture, we might discard coefficients with the following pattern:
This pattern of retained coefficients maintains temporal resolution represented by differences between the two fields of the frame in moving images.
Motion Compensated Prediction (MCP)
Assume that the anchor pictures4 have been downsampled, as described above. The data bandwidth to the motion compensation circuitry is thus reduced by the same factor as the storage requirement. As described above, motion compensation is accomplished by appropriately interpolating the reduced resolution picture-reference data according to the values of the motion vectors. The weights of this interpolation operation are chosen to correspond to the concatenation of an anti-imaging upsampling filter, bilinear half-pel interpolation operation (depending on the motion vectors), and optional downsampling filter.
A block diagram of the HD-capable video decoder is shown in Figure 2. This can be compared with Figure 1, "SDTV Video Decoder Block Diagram," to identify the additional processing required over an SD decoder. Complexity comparisons between a full-resolution HD decoder, SD decoder, prior art HD downconverter,5 and the HD-capable decoder described in this paper are shown in Table 2. The total costs of the HD downconverter/SD decoder are not significantly greater than the cost of the SD decoder alone.
In MPEG video coding a significant portion of the coding gain is achieved by having the decoder construct a prediction of the current frame based on previously transmitted frames. In the most common case, the prediction process is initialized by periodic transmission of intra-coded frames (I-frames). Predicted frames (P-frames) are coded with respect to the most recently transmitted I- or P-frames. Bidirectionally predicted frames (B-frames) are coded with respect to the two most recently transmitted frames of types I or P.
Let the first P-frame following some particular I-frame be labeled P1. Recall that the decoder described above downsamples the decoded frames before storage. Thus, when P1 is to be decoded, the stored I-frame used for constructing the prediction differs from the corresponding full-resolution prediction maintained in the encoder. The version of P1 produced by the HD-capable decoder will thus be degraded by the use of an imperfect prediction reference, as well as by the pre-parsing and downsampling directly applied to P1. The next decoded P-frame suffers from two generations of this distortion. In this way the decoder prediction "drifts" away from the prediction maintained by the encoder, as P-frames are successively predicted from one another. Note that the coding of B-frames does not contribute to this drift, since B-frames are never used as the basis for predictions.
Prediction drift can cause visible distortion that changes cyclically at the rate of recurrence of I-frames. The effect of prediction drift can be reduced by reducing the number of P-frames between I-frames. This can be done by increasing the ratio of B-frames to P-frames, decreasing the number of frames between I-frames, or both. As a practical matter, however, special encoding practices are neither needed nor recommended. Experiments have shown that reasonable HD video-encoding practices lead to acceptable quality from the HD-capable decoder described here.
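The dependence of drift on GOP structure can be made concrete by counting the longest chain of P-frames between I-frames. The sketch below is illustrative: the function name and the example GOP patterns are assumptions, not the encoding parameters used in the experiments.

```python
def max_drift_generations(gop_pattern):
    """Longest chain of successive P-frame predictions in a frame-type string."""
    worst = depth = 0
    for frame_type in gop_pattern:
        if frame_type == "I":
            depth = 0  # an intra-coded frame resets the prediction chain
        elif frame_type == "P":
            depth += 1
            worst = max(worst, depth)
        # B-frames are never used as references, so they do not extend the chain.
    return worst


print(max_drift_generations("IBBPBBPBBPBBPBB"))  # 15 frames, two B-frames per anchor
print(max_drift_generations("IPPPPPPPPPPPPPP"))  # 15 frames, no B-frames
print(max_drift_generations("IBBPBBPBB"))        # shorter GOP
```

For the same 15-frame GOP, using two B-frames between anchors gives 4 generations of drift instead of 14, and shortening the GOP to 9 frames gives 2.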
Test images were chosen from material judged to be challenging for both HDTV and SDTV encoding and decoding. Progressive sequences were in the 1280 × 720 format; interlaced sequences were 1920 × 1024 and 1920 × 1035 (we did not have access to 1920 × 1080 material). The images contained significant motion and included large areas of complex detail.
Some of the bit streams used for testing were encoded by the authors using their MPEG-2 MP@HL software; others were provided by Thomson Consumer Electronics. Decoding at HDTV was done using the authors' MPEG-2 MP@HL software. The HD-capable decoder algorithms described above were simulated in "C" and tested with the HDTV bit streams.
The simulation included accurate modeling of all the processes described here. The bit stream was pre-parsed; channel buffer and picture memory sizes, IDCT processing, and IDCT coefficient selection were all in accord with these explanations; upsampling and downsampling were applied so that the original motion vectors could be used.
The resulting HDTV and downconverted images were examined and compared. Although the downconverted images were of discernibly lower quality than the decoded full-HDTV images, observers agreed that the downconversion process met performance expectations of "SDTV quality."
This paper describes an enhanced SDTV receiver that can decode an HDTV bit stream. The enhancements needed to add HD decoding capability to an SD decoder are modest, even by consumer electronics standards. If all receivers included the capabilities described here, an introduction of SDTV would not preclude later introduction of HDTV because fully capable digital receivers would already be in use. The techniques described in this paper also permit design of low-cost, set-top boxes that would permit reception of the new digital signals for display on NTSC sets. The existence of such boxes at low cost is essential to the eventual termination of NTSC service.
Lee, D.H., et al., Goldstar, "HDTV Video Decoder Which Can Be Implemented with Low Complexity," Proceedings of the 1994 IEEE International Conference on Consumer Electronics, TUAM 1.3, pp. 6–7.
Ng, S., Thomson Consumer Electronics, "Lower Resolution HDTV Receivers," U.S. Patent 5,262,854, November 16, 1993.
1. The "predictions" mentioned here are the P-frames within the GOP sequence. The downsampling and preparsing processes alter the image data somewhat, so that small errors may accumulate if unusually long GOP sequences contain many P-frames. B-frames do not cause this kind of small error accumulation, and so good practice would be to increase the proportion of B-frames in long GOP sequences or to use GOPs of modest length. Receiver processing techniques can also reduce any visible effects, although they are probably unnecessary.
2. In this paragraph, the buffering operation and its associated memory are treated as distinct from the picture-storage memory. This distinction is useful for tutorial purposes, even though the two functions may actually share the same physical, 16-Mbit memory module.
3. Note that the macroblock type, coded block pattern, and run-length information must be decoded.
4. Anchor pictures are I- and P-frames in the MPEG GOP sequence. The downsampling that has been applied to them by the decoding techniques described here means that the motion vectors computed by the encoder can no longer be directly applied.
5. The term "downconverter" as used here applies to hardware that reduces the full HDTV image data to form an SDTV-resolution picture. The appropriately enhanced SDTV decoder described here inherently includes such an HDTV "downconverter."