Saturday, November 24, 2007

Frame Num and Picture Order Count

- The standard says frame_num is used as an identifier for pictures and that it has a strong relationship with PrevRefFrameNum. However, I do not quite understand the usage of frame_num during encoding/decoding.

The concept is simple, but it became more complicated as it was refined. It is actually primarily a loss robustness feature. It may actually sometimes be helpful for you to ignore the name of the syntax element and try to think very strictly only about how it behaves --
not what it is called. The name is only a hint -- a way to help you remember which syntax element we're talking about when we talk about some particular one. It might be better to just think about it as if its name was any_name or something like that. (This is true of all syntax elements, actually -- but it is especially true of this one.)

Primarily, the idea of the syntax element any_name was to have a counter that increments each time you decode a picture so that if there are losses of data, the decoder can detect that some picture(s) were missing and would be able to conceal the problem without losing track of what was going on.

You can see this idea reflected in the way that the behavior of any_name depends on whether the picture is a reference picture or not (i.e., on nal_ref_idc). Since the proper decoding of a non-reference picture is not necessary for the proper decoding of other pictures that arrive later, any_name was designed so that the loss of a non-reference picture would not cause any_name to indicate a problem.
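A minimal sketch of the resulting decoder-side check (function and variable names are illustrative, not from the standard; the case where gaps in the value are explicitly allowed is ignored):

```python
MAX_FRAME_NUM = 1 << 8  # illustrative: 2**(log2_max_frame_num_minus4 + 4) with the minus4 value equal to 4

def check_frame_num(frame_num, prev_ref_frame_num, is_idr):
    """Illustrative check of the behavior described above.

    frame_num stays equal to PrevRefFrameNum when the previously decoded
    picture was a non-reference picture, and otherwise increments by one
    (modulo MaxFrameNum).  Any other value suggests that one or more
    reference pictures were lost.
    """
    if is_idr:
        return frame_num == 0  # frame_num is 0 for IDR pictures
    expected = {prev_ref_frame_num, (prev_ref_frame_num + 1) % MAX_FRAME_NUM}
    return frame_num in expected
```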

Since the value of any_name often changes from picture to picture (and does not change within a picture), it can be used (subclause 7.4.1.2.4) as part of a method to detect when a new picture begins in the bitstream.
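A simplified sketch of that detection, comparing only a few of the slice header fields listed in subclause 7.4.1.2.4 (the inputs are assumed to be already-parsed slice headers; a real implementation checks the full list of conditions):

```python
def starts_new_picture(prev_hdr, curr_hdr):
    """Return True if the second slice header begins a new primary coded
    picture, per a subset of the subclause 7.4.1.2.4 conditions.  The headers
    are assumed to be dicts of parsed syntax element values (illustrative)."""
    checks = ("frame_num", "pic_parameter_set_id", "field_pic_flag",
              "bottom_field_flag", "idr_pic_id")
    return any(prev_hdr.get(f) != curr_hdr.get(f) for f in checks)
```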

Then there is the notion that you ought to be able to splice different coded video sequences together without changing all the any_name variables in every picture. And the decoding process for different coded video sequences is independent anyway, so the value of any_name was reset to zero whenever a new coded_video_sequence begins.

Then, we find that under some circumstances (e.g., esp. for redundant pictures that correspond to IDR primary pictures) it might be nice to be able to reset the value of any_name without necessarily using an IDR picture to do it (since IDR pictures carry a significant penalty in rate-distortion performance relative to other types of pictures). This led to the feature embodied as memory_management_control_operation equal to 5.

We also found that if we governed the behavior of any_name within a coded video sequence too strictly, it would prevent the ability to have efficient multi-layer temporal scalability (the ability to remove some pictures from a bitstream and still have a decodable remaining sequence of pictures). This led to the features embodied in the standard as "gaps in any_name value" and "sub-sequences".

Then, finally, we get to interlace support and coded fields. Parity can be used to distinguish between a top field and a bottom field, so it is not necessary for pictures to have a different value of any_name to let you know whether an individual field is missing. So fields of different parity can share the same value of any_name.

Finally we get to the way fields are stored into memory for operation of the decoding process for PicAFF and MBAFF coding (picture- and macroblock-adaptive frame/field coding, respectively). If we let a top field be paired with a bottom field for use as a decoded reference frame, this means that we need some way for the decoder to know how to pair different fields together for that purpose. And we thought that it was probably not really necessary to allow any individual top field to be paired with any arbitrarily-selected bottom field for that purpose, since typically an encoder might not really be interested in doing that. Conceptually, it is simpler to be able to just store the data for two fields into a memory space that would ordinarily hold a frame, and not need to do extra work to be able to create an association between any arbitrary pair of fields. Then a decoder could just change the stride it uses when addressing a surface to
control whether it is accessing the samples of an individual field or a unified frame. So the decoded picture buffer (DPB) was designed to manage its memory model as a collection of frame stores, not as a collection of individual fields.

That is really essentially the entire purpose and design relating to any_name (i.e., frame_num). That is ALL it is. It is natural to want to think of any_name as essentially a numbering of source frames at the input to the encoder. Although this is what most encoders will probably do, it is not a strictly correct understanding sufficient to build a well-designed decoder. (It is important to keep in mind that we do not specify how encoders or displays will operate -- only
decoders.) For example, that thinking could lead to some incorrect assumptions about the allowed timing relationship of pictures at the output of the decoder. The syntax element is not really for that purpose. Instead, it is a way to achieve picture loss robustness without sacrificing too much flexibility for the way the video can be used, and a way to simplify the picture buffering model management in decoders for frame/field adaptive coding.


- The standard says "Picture order counts are used to determine initial picture orderings for reference pictures in the decoding of B slices",which means we don't need to consider pic_order_cnt_type when dealing with baseline profile?

The basic concept of POC is to provide a counter that specifies the relative order of the pictures in the bitstream in output order (which may differ from the relative order in which the coded pictures appear in the data of the bitstream, which is referred to as the decoding order).
The relative order of the pictures is indicated in POC, rather than the timing of the pictures. This allows systems that carry the video bitstream to control the exact timing of the processing and output of the video bitstream without affecting the decoding process for the values of the samples in the luma and chroma sample arrays of the pictures. In some cases, the values of the samples in the luma and chroma sample arrays will depend on POC values. However, the values of
the samples in the luma and chroma sample arrays will never depend on the timing of the pictures.

There are three modes of POC operation:

In POC type 0, each slice header contains a simple fixed-length counter syntax element (pic_order_cnt_lsb) that provides the LSBs of the current POC. The MSBs of the current POC are calculated by the decoder by tracking modulus wrapping in the LSBs.
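A sketch of that wrap tracking, following the derivation of subclause 8.2.1.1 (the special handling for memory_management_control_operation equal to 5 and for fields is omitted; names are illustrative):

```python
def poc_type0(pic_order_cnt_lsb, prev_poc_lsb, prev_poc_msb, max_poc_lsb):
    """Compute the picture order count for POC type 0.

    max_poc_lsb = 2**(log2_max_pic_order_cnt_lsb_minus4 + 4); prev_poc_lsb and
    prev_poc_msb come from the previous reference picture in decoding order.
    """
    if (pic_order_cnt_lsb < prev_poc_lsb and
            prev_poc_lsb - pic_order_cnt_lsb >= max_poc_lsb // 2):
        poc_msb = prev_poc_msb + max_poc_lsb      # LSBs wrapped upwards
    elif (pic_order_cnt_lsb > prev_poc_lsb and
            pic_order_cnt_lsb - prev_poc_lsb > max_poc_lsb // 2):
        poc_msb = prev_poc_msb - max_poc_lsb      # LSBs wrapped downwards
    else:
        poc_msb = prev_poc_msb
    return poc_msb + pic_order_cnt_lsb            # TopFieldOrderCnt for a frame
```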

In POC type 1, each slice header contains one or two variable-length-encoded syntax elements that provide the difference to apply to a prediction of the current POC to compute the actual current
POC. This POC type provides the encoder with the ability to encode the POC values using significantly fewer bits per slice than what would otherwise be needed when using POC type 0 in cases where the encoder will usually be using a repetitive pattern of POC behavior.

In POC type 2, no data is carried in the slice header to compute the current POC. When POC type 2 is in use, the output order of the pictures in the bitstream will be the same as the order in which the coded pictures appear in the data of the bitstream. This POC type eliminates the need for the encoder to send any syntax data in the slice header for POC derivation. However, it provides no flexibility to allow the output order of the pictures in the bitstream to differ from their decoding order.

That statement would ordinarily be true. However, picture order count can also be used to determine the output order of pictures. The decoder ought to have other sources of information to determine that (e.g., timestamps on pictures carried at a systems level), so a Baseline decoder may not need to pay attention to picture order count. But it does need to figure out the output order of pictures one way or another.

Picture order count is also used to determine weights for temporal weighted prediction. Of course, that's not part of the Baseline profile either.

I think the only dependencies between picture order count and the processes for determining the values of decoded picture samples are the following:
1) The ordering of the initial reference picture lists in B slices
2) Temporal weighted prediction in B slices
3) Temporal direct prediction in B slices

So the summary is that if you're not supporting B slices you don't need picture order count for determining the values of decoded picture samples.

The only other issue is how to determine the output order of pictures. But a system may provide that information in some way that doesn't depend on picture order count.

-- From mpegif.org

Tuesday, November 20, 2007

Concepts of H.264

0. Abbreviations
- Access Unit: a set of NAL units always containing exactly one primary coded picture, and possibly one or more redundant coded pictures or other NAL units that do not contain slices or slice data partitions of a coded picture. The decoding of an access unit always results in a decoded picture.
- Coded Frame/Field: H.264 does not use the frame-picture/field-picture concepts of MPEG-2. A coded frame consists of two fields coded together as a single picture, while a complementary field pair consists of two fields coded as separate pictures. No particular picture order count relationship is required between the two fields of a coded frame or a complementary field pair; in general they are stored in one frame buffer. The only requirement is that no other pictures have order counts that fall between the order counts of these two fields. Once a frame is decoded, it contains two fields: the two fields together can be used to predict a coded frame, or each field can be used separately as a reference picture to predict a coded field. Two subsequent fields can be coded as separate pictures which, once decoded, are combined into a complementary reference or non-reference field pair. Note that coded fields may either be part of complementary field pairs or be non-paired fields.
There are two kinds of complementary field pairs: complementary reference field pairs (both fields are reference pictures) and complementary non-reference field pairs (both fields are non-reference pictures). If the two fields of a frame differ in their reference property, for example one is a reference picture and the other is not, then each field is a non-paired field: the reference one is called a non-paired reference field and the non-reference one a non-paired non-reference field.
If field pictures are used they should occur in pairs and together constitute one coded frame. When coding interlaced sequences using frame pictures, two fields should be interleaved with one another and then the entire frame is coded as one frame picture. -- MPEG-2
- IDR: Instantaneous Decoding Refresh, similar to an I picture. A picture with a memory management control operation equal to 5, which marks all reference pictures as unused for reference, has a similar function. A coded video sequence shall start with one IDR picture, and the following pictures are all non-IDR pictures; so for H.264 the sequence plays a role similar to the GOP of MPEG-2. There is also an end-of-stream NAL unit, indicating the end of the video stream. An IDR picture can be used as a short-term or a long-term reference picture; a non-IDR reference picture is initially marked as a short-term reference picture.
- Decoded Picture Buffer (DPB): stores reconstructed pictures that are needed for reference or are waiting to be output.
- Picture order count: a counter, reset at each IDR picture, that increases with output order. It is used to determine the OUTPUT (display) ordering of pictures.
- Frame number: numbers REFERENCE pictures in decoding order rather than presentation order. If B pictures are not reference pictures they can be ignored, and only the frame_num of the reference I/P pictures increments; if B pictures are reference pictures, frame_num follows the decoding order exactly. Note: it is reset to zero when an IDR picture is received.
- PicNum: the number of a short-term reference picture, derived from the current frame_num and the reference picture's frame_num.
- LongTermPicNum: derived from LongTermFrameIdx, which is assigned explicitly by memory management control operations.

1. The index of MB in pictures
In general MBs are indexed in raster scan order. In MB-adaptive frame/field mode, MB pairs are used: each MB is indexed first within its MB pair, and the index then increments in the raster scan order of MB pairs. This indexing is used by the inverse scanning processes (6.4).
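A sketch of the corresponding inverse scan (per subclause 6.4.1), mapping a macroblock address to the luma location of its upper-left sample; names are illustrative:

```python
def mb_to_luma_xy(mb_addr, pic_width_in_mbs, mbaff, is_field_mb=False):
    """Inverse macroblock scanning.  In MBAFF mode macroblocks come in
    vertical pairs, so the pair is located first and then the top/bottom
    macroblock of the pair is placed inside it."""
    if not mbaff:
        x = (mb_addr % pic_width_in_mbs) * 16
        y = (mb_addr // pic_width_in_mbs) * 16
        return x, y
    pair_addr = mb_addr // 2
    x = (pair_addr % pic_width_in_mbs) * 16
    y = (pair_addr // pic_width_in_mbs) * 32
    if is_field_mb:
        # field macroblock: the two MBs of the pair use interleaved lines
        return x, y + (mb_addr % 2)
    # frame macroblock: top MB on the upper 16 lines, bottom MB below it
    return x, y + (mb_addr % 2) * 16
```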

3. Availability for current MB and neighbouring MB
- A macroblock is marked as not available if one of three conditions holds: its address is less than 0, its address is greater than the current MB address, or it belongs to a different slice than the current MB.
- Special cases for neighbouring MB

4. Coordinates in the picture
- X: right is positive
- Y: down is positive

5. Derivation process for neighbouring MB, block (4X4 or 8X8) and partitions (6.4.8)
- The objective is to get the index of the neighbouring units (A B C D). The key step is to use the routine in (6.4.9). Its input is a luma or chroma location (xN, yN) expressed relative to the upper left corner of the current MB. It outputs the MB index that contains (xN, yN) and its location relative to the upper left corner of this resulting MB.
- The location difference in Table 6-2
From (6.4.1) to (6.4.6), a location of the unit relative to the picture, MB or sub-partition can be calculated. To use the routine in (6.4.9) to get the index of the neighbouring unit, a location within the neighbouring unit is needed; Table 6-2 gives the relationship between these two locations.
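A sketch of the non-MBAFF case of that routine, combining it with the usual availability rules (the availability callback and all names are illustrative):

```python
def neighbouring_location(curr_mb_addr, xN, yN, pic_width_in_mbs,
                          mb_available, maxW=16, maxH=16):
    """Map a location (xN, yN), given relative to the upper-left sample of the
    current macroblock, to the macroblock containing it and the location
    relative to that macroblock (non-MBAFF sketch of subclause 6.4.9).
    'mb_available' is assumed to check that the address is in range and in the
    same slice."""
    if yN < 0:
        if xN < 0:                     # neighbour D (above-left)
            mb_addr_n = curr_mb_addr - pic_width_in_mbs - 1
            edge_ok = curr_mb_addr % pic_width_in_mbs != 0
        elif xN < maxW:                # neighbour B (above)
            mb_addr_n = curr_mb_addr - pic_width_in_mbs
            edge_ok = True
        else:                          # neighbour C (above-right)
            mb_addr_n = curr_mb_addr - pic_width_in_mbs + 1
            edge_ok = (curr_mb_addr + 1) % pic_width_in_mbs != 0
    elif yN < maxH:
        if xN < 0:                     # neighbour A (left)
            mb_addr_n = curr_mb_addr - 1
            edge_ok = curr_mb_addr % pic_width_in_mbs != 0
        elif xN < maxW:                # inside the current macroblock
            mb_addr_n, edge_ok = curr_mb_addr, True
        else:
            return None, None          # to the right of the MB: not available
    else:
        return None, None              # below the MB: not available

    if not edge_ok or not mb_available(mb_addr_n):
        return None, None
    return mb_addr_n, (xN % maxW, yN % maxH)   # (xW, yW) inside the neighbour
```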

Monday, November 19, 2007

Energy and Power Spectral Density

Energy and/or power are key properties of a signal in the DSP domain. The various transforms, such as the Fourier, DCT and wavelet transforms, are in essence used to study the energy and/or power of the signal.

* Energy signals
- Deterministic or random signals.
- Square integrable or square summable.
- The energy spectral density would be derived from its Fourier transformation.

* Power signals
- Deterministic or random signals.
- Not square integrable or square summable.
- The power spectral density would be the Fourier transformation of its autocorrelation function.

* Properties of ESD and PSD
- Non-negative
- The area under the energy spectral density curve equals the total energy of the signal, i.e. the area under the square of the magnitude of the signal, whether the signal is continuous or discrete. The total power obtained from a power spectral density equals the corresponding MEAN total signal power, which is the autocorrelation function at zero lag.
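A small numpy check of these properties (illustrative; numpy's unnormalised FFT convention introduces the 1/N factor):

```python
import numpy as np

# Parseval's relation for the DFT: energy computed in the time domain equals
# the energy computed from the spectrum, up to the 1/N factor.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

energy_time = np.sum(np.abs(x) ** 2)
X = np.fft.fft(x)
energy_freq = np.sum(np.abs(X) ** 2) / len(x)
assert np.isclose(energy_time, energy_freq)

# The autocorrelation at zero lag equals the mean signal power, matching the
# PSD property stated above.
r0 = np.dot(x, x) / len(x)
assert np.isclose(r0, np.mean(x ** 2))
```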

H.264

H.264, Advanced Video Coding (AVC) or MPEG-4 Part 10 outperforms the MPEG-4 Visual and H.263 standards, providing better compression of video images. It can deliver the same quality as MPEG-2 at roughly half the bitrate.

* Picture Format Supported
Almost all video resolutions from SubQCIF to BT.709 are supported, with both progressive and interlaced scanning.
Like MPEG-2, the default color sampling is 4:2:0, and the phase relationship between Y and C samples is the same as in MPEG-2.

* Coded Data Format (Data Stream Syntax)
- Video Coding Layer (VCL): the output of encoding process, a sequence of bits representing the coded video data, which are mapped to NAL units prior to transmission or storage.
- Network Abstraction Layer (NAL): basic unit of a coded H.264 video sequence. Each NAL unit contains a Raw Byte Sequence Payload (RBSP). The type of RBSP is indicated in the one-byte NAL unit header, and the RBSP data makes up the rest of the NAL unit. Some important RBSPs are the Parameter Sets (sequence or picture), Coded Slices and End of Sequence, etc.
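A minimal sketch of reading that one-byte header (field names follow the standard; start-code and emulation-prevention handling are omitted):

```python
def parse_nal_header(byte):
    """Split the one-byte NAL unit header into its three fields."""
    forbidden_zero_bit = (byte >> 7) & 0x1   # must be 0
    nal_ref_idc = (byte >> 5) & 0x3          # nonzero => content used for reference
    nal_unit_type = byte & 0x1F              # e.g. 5 = IDR slice, 7 = SPS, 8 = PPS
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

# e.g. 0x67 is a typical SPS header byte: nal_ref_idc = 3, nal_unit_type = 7
assert parse_nal_header(0x67) == (0, 3, 7)
```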

* Profile and Level
H.264 supports four profiles only, unlike MPEG-4. They are Baseline for low-bitrate applications, Main for broadcasting and storage, Extended for media streaming applications, and High for high-definition and studio applications.

* Video Coding Tools
- No GOB or GOP in the bitstream; the coded video sequence is similar to a GOP. Sequences may be progressive or interlaced, and pictures may be fields or frames. Each picture has a picture order count, which defines its presentation order. Reference pictures are organized into one or two lists, list0 and list1, identified by their frame numbers.

- A coded picture consists of slices, each of which is a set of MBs or MB pairs in raster scan order. There are I-, P- and B-slices, and MBs likewise have three types: I-, P- and B-MB. An I-slice contains only I-MBs; a P-slice may contain P- and I-MBs, and a B-slice may contain B- and I-MBs. Slices are still the basic unit for resynchronisation and error recovery, and they remain independent of each other because intra prediction and motion vector prediction are applied only within the same slice.

- I-MB, i.e. intra MB, is quite different from that of previous standards. Intra prediction from decoded samples in the current slice is used for an I-MB, and the residual data is transformed, coded and transmitted. This is essentially the technique of DPCM, but note that it is applied to pixel samples rather than to the DC component in the frequency domain. An alternative to intra prediction is I-PCM for an I-MB, which enables an encoder to transmit the values of the image samples directly without prediction or transformation.

P- and B-MBs are inter MBs with inter prediction. A P-MB uses list0 and a B-MB uses both list0 and list1. MB partitions and MB sub-partitions are supported, and the reference picture may differ for each of them. The reference pictures may be before or after the current picture in temporal order.

For a B-MB, many prediction modes can be used: direct mode, MC from list0, MC from list1, or MC from both list0 and list1. Different modes may be chosen for each partition, and if the 8X8 partition size is used, the chosen mode is applied to all sub-partitions within that partition. Note that the strict notions of backward and forward prediction are not really applicable anymore here.

- Inter Prediction
The differences between H.264 and earlier standards include the support for a range of block sizes and finer sub-sample motion vectors.

The luma component of one MB can be split up in FOUR ways: one 16X16, two 16X8 partitions, two 8X16 partitions or four 8X8 partitions. For 8X8 partitions, another FOUR sub-partitions are supported: one 8X8, two 8X4, two 4X8 or four 4X4. The chroma components are partitioned in the same way, except that the sizes have exactly half the horizontal and vertical resolution of the luma ones.

Each partition or sub-partition in an inter MB is predicted from an area of the SAME size in the reference picture. Note that different MBs or partitions may use different reference pictures. The offset between the two areas has quarter-sample resolution for the luma component and one-eighth-sample resolution for the chroma components, so interpolation of the reference pictures may be necessary. Because the resolution of the chroma components is half that of luma, the MV is halved when applied to the chroma blocks, and the precision for chroma prediction is half that of luma, i.e. one-eighth sample. For luma interpolation, half samples are generated first with a six-tap FIR filter and then quarter samples by averaging. For chroma interpolation, linear interpolation (a weighted average) is used.
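A sketch of those luma interpolation steps, assuming 8-bit samples (function names are illustrative):

```python
def half_sample(p):
    """Luma half-sample value between p[2] and p[3] using the six-tap filter
    (1, -5, 20, 20, -5, 1).  'p' holds the six neighbouring integer-position
    luma samples along one direction."""
    val = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5]
    return min(255, max(0, (val + 16) >> 5))   # round and clip to 8 bits

def quarter_sample(a, b):
    """Quarter-sample positions are the rounded average of the two nearest
    integer- or half-sample values."""
    return (a + b + 1) >> 1
```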

Using smaller prediction blocks yields residual data with less energy to be coded. However, the number of MVs increases greatly, and more side information is needed for correct decoding. To decrease the bitrate further, motion vector prediction is used. MV prediction operates within the MB, which may have partitions or sub-partitions, and different prediction modes may be used depending on the motion compensation partition size and on the availability of nearby vectors. In general, three neighbouring partitions are used: the left one, the upper one and the upper-right one.
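A sketch of the basic median predictor formed from those three neighbouring vectors (the special cases of the standard, such as unavailable neighbours, 16X8/8X16 partitions and matching reference indices, are omitted; names are illustrative):

```python
def median_mv_prediction(mv_a, mv_b, mv_c):
    """Component-wise median of the left (A), upper (B) and upper-right (C)
    neighbouring motion vectors; each mv is an (x, y) pair."""
    def median3(a, b, c):
        return a + b + c - min(a, b, c) - max(a, b, c)
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))
```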

- Direct Prediction
No MV is transmitted for a B-MB or its partitions in Direct Mode; they differ from skipped B-MBs in that residual data may still be transmitted. The MVs for them are reconstructed using direct prediction.

- Weighted Prediction
Weighted prediction modifies (scales and offsets) the prediction samples of a P/B-MB before the final motion-compensated prediction is formed.
There are two ways to do this: explicit and implicit weighted prediction.

- Intra Prediction
For luma component, the sizes for intra prediction could be 4X4 blocks with nine modes or 16X16 blocks with four modes. For chroma components, 8X8 blocks are used with four modes.

Note the intra prediction depends on the availability of all the required prediction samples.

Predictive coding is used to signal the 4X4 intra modes. The modes of the left and upper blocks are used to derive the most probable prediction mode.
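A sketch of how the most probable mode is used to decode the 4X4 intra mode, following subclause 8.3.1.1 (names are illustrative):

```python
def intra4x4_pred_mode(mode_a, mode_b, prev_flag, rem_mode):
    """Decode the 4x4 intra prediction mode.  mode_a / mode_b are the modes of
    the left and upper 4x4 blocks; use 2 (DC) when a neighbour is unavailable
    or not intra-4x4 coded."""
    predicted = min(mode_a, mode_b)
    if prev_flag:                 # prev_intra4x4_pred_mode_flag == 1
        return predicted
    # otherwise rem_intra4x4_pred_mode selects one of the 8 remaining modes
    return rem_mode if rem_mode < predicted else rem_mode + 1
```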

- Deblocking Filter

- Transform and Quantisation
Because the minimum prediction size is 4X4, the transform size is 4X4 instead of 8X8. The DC components of the chroma blocks are further transformed with a 2X2 Hadamard transform. A special case is the intra MB with 16X16 prediction, where a 4X4 Hadamard transform is applied to the DC coefficients of the sixteen 4X4 luma blocks.
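An illustrative matrix-form sketch of the 4X4 core transform (the scaling factors are folded into quantisation and are not shown; this is not a bit-exact implementation):

```python
import numpy as np

# Core transform matrix Cf; the transform of a residual block X is W = Cf . X . Cf^T
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_4x4(block):
    """Forward 4x4 integer transform of a residual block (4x4 numpy array)."""
    return Cf @ block @ Cf.T
```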

The transmission order for one MB: in general 26 blocks, with indices 0 to 25, are transmitted in order (24 4X4 blocks plus 2 additional chroma DC blocks). For an intra MB with 16X16 prediction, one additional block with index -1 is needed and is transmitted first.

The quantisation step is determined by the Quantisation Parameter (QP). In total 52 values are supported, and the step size doubles for every increment of six in QP. This arrangement makes fine control of bitrate and video quality possible. Predictive coding is also used for QP within one slice.
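An illustrative sketch of that QP-to-step-size relationship, using the commonly quoted base step values (the standard itself expresses this through its scaling tables rather than an explicit Qstep):

```python
# Base step sizes for QP 0..5; Qstep doubles for every increment of six in QP.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

assert qstep(4) == 1.0 and qstep(10) == 2.0 and qstep(51) == 224.0
```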

4X4 blocks are scanned in zig-zag order for frame blocks and in an alternate scan order for field blocks.
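A sketch of the zig-zag scan for a frame block (the alternate field scan is not shown):

```python
# Zig-zag scan order for a 4x4 frame block, as indices into the block in raster order.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def scan_4x4(coeffs_raster):
    """Reorder a 4x4 block's coefficients (given in raster order) into the
    zig-zag transmission order."""
    return [coeffs_raster[i] for i in ZIGZAG_4X4]
```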

- Entropy Coding

- Interlaced Video