Saturday, November 24, 2007

Frame Num and Picture Order Count

- The standard told me frame_num is used as an identifier for pictures and that it has a strong relationship with PrevRefFrameNum. However, I do not quite understand the usage of frame_num during encoding/decoding.

The concept is simple, but it became more complicated as it was refined. It is primarily a loss-robustness feature. It may sometimes be helpful for you to ignore the name of the syntax element and think very strictly only about how it behaves --
not what it is called. The name is only a hint -- a way to help you remember which syntax element we're talking about. It might be better to just think about it as if its name were any_name or something like that. (This is true of all syntax elements, actually -- but it is especially true of this one.)

Primarily, the idea of the syntax element any_name was to have a counter that increments each time you decode a picture so that if there are losses of data, the decoder can detect that some picture(s) were missing and would be able to conceal the problem without losing track of what was going on.

You can see this idea reflected in the way that the behavior of any_name depends on whether the picture is a reference picture or not (i.e., on nal_ref_idc). Since the proper decoding of a non-reference picture is not necessary for the proper decoding of other pictures that arrive later, any_name was designed so that a missing non-reference picture would not cause it to indicate the presence of a problem.
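As a rough sketch (not text from the standard), the continuity rule for reference pictures can be checked like this. The constant and function names are my own; MaxFrameNum would really come from the SPS as 2 ** (log2_max_frame_num_minus4 + 4), and 16 here is purely illustrative:

```python
# Illustrative value; the real MaxFrameNum is derived from the SPS.
MAX_FRAME_NUM = 16

def reference_picture_missing(frame_num, prev_ref_frame_num):
    """Return True if the gap between the current reference picture's
    frame_num and PrevRefFrameNum implies at least one lost reference
    picture. The lossless cases are: the same value (e.g., paired fields
    of one frame) or an increment of exactly 1 modulo MaxFrameNum."""
    expected = (prev_ref_frame_num + 1) % MAX_FRAME_NUM
    return frame_num not in (prev_ref_frame_num, expected)
```

For example, seeing frame_num equal to 2 when PrevRefFrameNum is 0 implies a lost reference picture, while wrapping from 15 to 0 does not. Note that a missing non-reference picture never trips this check, since non-reference pictures do not advance PrevRefFrameNum.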

Since the value of any_name often changes from picture to picture (and does not change within a picture), it can be used (subclause 7.4.1.2.4) as part of a method to detect when a new picture begins in the bitstream.
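A sketch of part of that detection rule, assuming the slice headers have been parsed into dictionaries (only a subset of the conditions in subclause 7.4.1.2.4 is shown; the full test also compares nal_ref_idc, IDR flags, POC fields, and more):

```python
def starts_new_picture(curr, prev):
    """Partial sketch of subclause 7.4.1.2.4: a new primary coded picture
    begins when any of several slice-header values changes between
    consecutive slices. curr and prev are dicts of parsed header fields."""
    return (curr["frame_num"] != prev["frame_num"]
            or curr["pic_parameter_set_id"] != prev["pic_parameter_set_id"]
            or curr["field_pic_flag"] != prev["field_pic_flag"])
```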

Then there is the notion that you ought to be able to splice different coded video sequences together without changing all the any_name values in every picture. And the decoding process for different coded video sequences is independent anyway, so the value of any_name is reset to zero whenever a new coded video sequence begins.

Then, we find that under some circumstances (especially for redundant pictures that correspond to IDR primary pictures) it might be nice to be able to reset the value of any_name without necessarily using an IDR picture to do it (since IDR pictures carry a significant penalty in rate-distortion performance relative to other types of pictures). This led to the feature embodied as memory_management_control_operation equal to 5.

We also found that if we governed the behavior of any_name within a coded video sequence too strictly, it would prevent the ability to have efficient multi-layer temporal scalability (the ability to remove some pictures from a bitstream and still have a decodable remaining sequence of pictures). This led to the features embodied in the standard as "gaps in any_name value" and "sub-sequences".

Then, finally, we get to interlace support and coded fields. Parity can be used to distinguish between a top field and a bottom field, so it is not necessary for pictures to have a different value of any_name to let you know whether an individual field is missing. So fields of different parity can share the same value of any_name.

Finally we get to the way fields are stored into memory for operation of the decoding process for PicAFF and MBAFF coding (picture- and macroblock-adaptive frame/field coding, respectively). If we let a top field be paired with a bottom field for use as a decoded reference frame, this means that we need some way for the decoder to know how to pair different fields together for that purpose. And we thought that it was probably not really necessary to allow any individual top field to be paired with any arbitrarily-selected bottom field for that purpose, since typically an encoder might not really be interested in doing that. Conceptually, it is simpler to be able to just store the data for two fields into a memory space that would ordinarily hold a frame, and not need to do extra work to be able to create an association between any arbitrary pair of fields. Then a decoder could just change the stride it uses when addressing a surface to
control whether it is accessing the samples of an individual field or a unified frame. So the decoded picture buffer (DPB) was designed to manage its memory model as a collection of frame stores, not as a collection of individual fields.
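The stride idea can be illustrated with a toy buffer (my own sketch, not decoder code): two fields stored interleaved in one frame-sized memory area, where reading every row yields the frame and reading every other row, with the starting row selecting the parity, yields one field:

```python
def frame_rows(buf, width, height):
    """Read the buffer as a unified frame: one row per line, normal stride."""
    return [buf[r * width:(r + 1) * width] for r in range(height)]

def field_rows(buf, width, height, bottom):
    """Read the same memory as a single field: doubled stride (every other
    row); the starting row (0 or 1) selects top or bottom field parity."""
    return [buf[r * width:(r + 1) * width] for r in range(int(bottom), height, 2)]
```

The decoder never has to copy or re-associate field data: one frame store holds both fields, and only the addressing changes.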

That is really essentially the entire purpose and design relating to any_name (i.e., frame_num). That is ALL it is. It is natural to want to think of any_name as essentially a numbering of source frames at the input to the encoder. Although this is what most encoders will probably do, that understanding is not strictly correct and is not sufficient for building a well-designed decoder. (It is important to keep in mind that we do not specify how encoders or displays will operate -- only
decoders.) For example, that thinking could lead to some incorrect assumptions about the allowed timing relationship of pictures at the output of the decoder. The syntax element is not really for that purpose. Instead, it is a way to achieve picture loss robustness without sacrificing too much flexibility in the way the video can be used, and a way to simplify picture buffering model management in decoders for frame/field adaptive coding.


- The standard says "Picture order counts are used to determine initial picture orderings for reference pictures in the decoding of B slices", which means we don't need to consider pic_order_cnt_type when dealing with the Baseline profile?

The basic concept of POC is to provide a counter that specifies the relative order of the pictures in the bitstream in output order (which may differ from the relative order in which the coded pictures appear in the data of the bitstream, which is referred to as the decoding order).
The relative order of the pictures is indicated in POC, rather than the timing of the pictures. This allows systems that carry the video bitstream to control the exact timing of the processing and output of the video bitstream without affecting the decoding process for the values of the samples in the luma and chroma sample arrays of the pictures. In some cases, the values of the samples in the luma and chroma sample arrays will depend on POC values. However, the values of
the samples in the luma and chroma sample arrays will never depend on the timing of the pictures.

There are three modes of POC operation:

In POC type 0, each slice header contains a simple fixed-length counter syntax element (pic_order_cnt_lsb) that provides the LSBs of the current POC. The MSBs of the current POC are calculated by the decoder by tracking modulus wrapping in the LSBs.
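That MSB tracking (subclause 8.2.1.1) can be sketched as follows. MaxPicOrderCntLsb would really be 2 ** (log2_max_pic_order_cnt_lsb_minus4 + 4) from the SPS; 16 here is illustrative:

```python
# Illustrative value; the real MaxPicOrderCntLsb is derived from the SPS.
MAX_POC_LSB = 16

def poc_type0(lsb, prev_msb, prev_lsb):
    """Recover the full POC from its transmitted LSBs by detecting
    modulus wrap-around relative to the previous reference picture's
    (prev_msb, prev_lsb) pair."""
    if lsb < prev_lsb and (prev_lsb - lsb) >= MAX_POC_LSB // 2:
        msb = prev_msb + MAX_POC_LSB   # LSBs wrapped forward
    elif lsb > prev_lsb and (lsb - prev_lsb) > MAX_POC_LSB // 2:
        msb = prev_msb - MAX_POC_LSB   # LSBs wrapped backward
    else:
        msb = prev_msb
    return msb + lsb
```

For example, with MaxPicOrderCntLsb equal to 16, transmitted LSBs of 2 following previous LSBs of 14 indicate a forward wrap, so the decoded POC is 18 rather than 2.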

In POC type 1, each slice header contains one or two variable-length-encoded syntax elements that provide the difference to apply to a prediction of the current POC to compute the actual current
POC. This POC type provides the encoder with the ability to encode the POC values using significantly fewer bits per slice than what would otherwise be needed when using POC type 0 in cases where the encoder will usually be using a repetitive pattern of POC behavior.

In POC type 2, no data is carried in the slice header to compute the current POC. When POC type 2 is in use, the output order of the pictures in the bitstream will be the same as the order in which the coded pictures appear in the data of the bitstream. This POC type eliminates the need for the encoder to send any syntax data in the slice header for POC derivation. However, it provides no flexibility to allow the output order of the pictures in the bitstream to differ from their decoding order.
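A simplified sketch of the type 2 derivation (the function name is mine, and the frame_num wrap-around handling via FrameNumOffset is reduced to a parameter here): reference pictures get even POC values, and each non-reference picture slots in just before the next reference picture, so output order matches decoding order:

```python
def poc_type2(frame_num, is_reference, frame_num_offset=0):
    """Sketch of POC type 2: the POC is derived purely from frame_num.
    frame_num_offset stands in for the FrameNumOffset variable that
    accumulates frame_num wrap-arounds (simplified here)."""
    tmp = 2 * (frame_num_offset + frame_num)
    return tmp if is_reference else tmp - 1
```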

That statement would ordinarily be true. However, picture order count can also be used to determine the output order of pictures. The decoder ought to have other sources of information to determine that (e.g., timestamps on pictures carried at a systems level), so a Baseline decoder may not need to pay attention to picture order count. But it does need to figure out the output order of pictures one way or another.

Picture order count is also used to determine weights for temporal weighted prediction. Of course, that's not part of the Baseline profile either.

I think the only dependencies between picture order count and the processes for determining the values of decoded picture samples are the following:
1) The ordering of the initial reference picture lists in B slices
2) Temporal weighted prediction in B slices
3) Temporal direct prediction in B slices

So the summary is that if you're not supporting B slices, you don't need picture order count for determining the values of decoded picture samples.

The only other issue is how to determine the output order of pictures. But a system may provide that information in some way that doesn't depend on picture order count.

-- From mpegif.org
