Saturday, January 22, 2011

H.264 decoding delay

Quoting an email thread from the Intel IPP forum that helps explain the DPB operation / decoding-delay behavior in H.264:

Well, I think you're incorrect in saying that the buffer (DPB) described in the H.264 specification is merely a suggestion. The buffering mechanism (however it is handled) has to adhere to the specification for the decoder to claim conformance (see Annex C). Note that there are two types of conformance, output timing conformance and output order conformance.

To my knowledge, the Intel implementation in the IPP samples can only deliver frames in the correct reordered output order, whereas some other codecs also allow decoding-order (immediate) output. When providing reordered output, the decoder needs to buffer pictures to take care of the reordered frames. Usually these would be B-frames, but in H.264 they can also be P-frames. Therefore, for the GOP pattern described, we cannot really know - but we assume that no reordering is taking place (as it just adds to the delay).

In general, the decoder cannot know in advance whether a reordered picture may appear at some point in the stream. Therefore, it seems that Intel has chosen a "safe path" in that the decoder uses the "worst" possible buffering (delaying) that could be necessary to deliver the stream in a fluent manner. Elaborating on that: if the decoder did not buffer (delay) and an out-of-order picture suddenly appeared, the flow out of the decoder would contain a gap, as the out-of-order picture would need to be buffered before output. In other words, the decoder will buffer up to the maximum number of pictures allowed for a given stream (I'll come to that later) to be able to deliver the frames in a fluent (one-by-one) flow.

The maximum buffering required is determined by the 'max_dec_frame_buffering' parameter as described in the H.264 specification in Annex E. This is part of the bitstream_restrictions in the VUI parameters of the SPS. As it is optional, the parameter is to be derived from 'MaxDpbSize', which again is specified/derived from the profile and level and the coded picture resolution as defined in Annex A. Note that the 'max_dec_frame_buffering' parameter is constrained at the low end to be >= the 'num_ref_frames' parameter of the SPS. The Intel decoder uses the 'max_dec_frame_buffering' parameter to set the "worst-case" buffering, and thus you can with the right encoding parameters and with the proper addition of the VUI parameters set this as low as possible to obtain the smallest possible buffering.

The 'max_dec_frame_buffering' parameter defines the maximum for the 'num_reorder_frames' parameter, which is also given in the VUI. This sets a limit on the amount of reordering that can occur in a stream, and it is actually the only information about reordering that a decoder can derive directly from an H.264 stream. The SPS does not explicitly state whether there will be B-pictures in a stream (and P-pictures may also be reordered), nor whether reordering actually occurs; i.e., even if 'num_reorder_frames' is > 0, the stream is not required to actually use that much reordering.

That said, this does not mean you cannot handle it otherwise, especially if you have a closed-circuit system with control over both the encoder and decoder side, as seems to be the case here. Then it is essential to choose the right encoding parameters and provide the right information in the stream, and/or adapt the decoder to use as little buffering as possible.

Hope this helps shed some light on the subject...


- Jay

Sunday, December 7, 2008

Notes of Linux Device Driver 3rd (2)

* Concurrency and race condition
Two aspects of resource sharing:
- Resource sync
- Reference count

Utils for resource sync provided in Linux Kernel
- Semaphores and mutexes
The calling process may block and sleep, so they must not be used in an ISR.
A semaphore is not the preferred primitive for event synchronization here; it is usually optimized for the "available" (uncontended) case. Completions should be used for event sync instead.
- RW semaphores
Best used when write access is required only rarely and writer access is held for short periods of time.
- Spinlocks
The caller never blocks. Higher performance than semaphores, but with several constraints:
-- mainly useful on SMP systems
-- while holding a lock, the code must be atomic: it must not sleep or perform any blocking operation. Preemption is therefore disabled on the local core while the lock is held, and interrupts may also need to be disabled on that core (e.g. with the irqsave variants).
-- the lock should be held for as short a time as possible.
- RW spinlocks
The reader/writer analogue of spinlocks: multiple readers or a single writer.

Alternatives to locking
- Lock-free algorithms: circular buffer
- Atomic variables
- Bit operations
- seqlocks: for small, simple, frequently read data where write access is rare but must be fast. The protected data must not contain pointers.
- Read-Copy-Update (RCU): for when reads are common and writes are rare. The resources must be accessed via pointers, and all references to those resources must be held only by atomic (non-sleeping) code.

Notes of Linux Device Driver 3rd (1)

* User space and kernel space
- Transfer from user space to kernel space happens via system calls (exceptions) or hardware interrupts. Note the difference between these two: the execution context. A system call runs in the context of a process (and can therefore access data in the process's address space), while an ISR runs in its own context, unrelated to any process.

- Concurrency in the kernel
-- multiple processes that might use the driver at the same time
-- HW interrupts and deferred work such as softirqs, timers, tasklets and workqueues, etc.
-- the kernel is preemptive since 2.6
All of these mean the kernel and its drivers must be reentrant.

- Multithreading mapping (user space to kernel space)
-- Many to one: all threads are mapped to a single scheduling unit. All are blocked if one thread blocks.
-- One to one: each thread is mapped to its own scheduling unit. Too complicated for communication.
-- Many to many: lightweight processes (LWPs) share data between them and form the process group. Linux supports a one-to-one mapping between threads and LWPs.

* Major and minor numbers
- The major number identifies the driver associated with the device. Usually one major number per driver.
- The minor number identifies exactly which device is being referred to.

* Some important data structs related to the device driver
Most of the fundamental driver operations involve three important kernel data structs, called file_operations, file and inode.
- file_operations: implements the various system-call interfaces
- file: represents an open file
- inode: represents a file internally in the kernel. One file can have multiple file structs (one per open) but only one inode struct.

Sunday, June 22, 2008

Resource Pattern

How to solve the two problems of resource synchronization, priority inversion and deadlock?

* Critical section pattern
pros: avoids both problems
cons: high cost (preemption is disabled even for tasks that do not share the resource)

* Priority inheritance
pros: avoids priority inversion; simple
cons: deadlock and chain blocking can still happen
chain blocking: J1 needs both S1 and S2, but S1 is held by J2 and S2 by J3, with priorities P1 > P2 > P3. J1 must therefore wait for both of them to finish, blocking once per resource.

* Highest locker pattern
One priority ceiling is defined for each resource at system design time. The basic idea is that the task owning a resource runs at the highest priority ceiling of all the resources it currently owns, provided that it is blocking one or more higher-priority tasks. In this way, chain blocking is avoided.

pros: avoids chain blocking
cons: deadlock can still occur

* Priority ceiling pattern
The idea is to ensure that when a job J preempts the critical section of another job and executes its own critical section, the priority at which this new critical section executes is guaranteed to be higher than both the inherited priorities (so it can run) AND the ceiling priorities (so it can continue) of all preempted critical sections in the system. The difference from Highest Locker is that the job is not immediately assigned the priority ceiling of the locked resource when it acquires it.
pros: avoid both
cons: cost high

Memory Patterns

* Static allocation pattern
pros: deterministic allocation/deallocation time, no memory fragmentation, easy to maintain
cons: long initialization time and large memory footprint; inflexible

* Dynamic allocation pattern
pros: flexible; memory use scales with actual demand
cons: non-deterministic allocation/deallocation time and memory fragmentation; pointer-managed memory is hard to maintain

* Pool allocation pattern
pros: more flexible than static allocation for satisfying dynamic requirements; nearly deterministic allocation/deallocation time (unless the pool runs out) and no memory fragmentation
cons: the number of objects in the pool must be tuned per system for best performance

* Fixed sized buffer pattern
pros: no memory fragmentation, since it always allocates the worst-case memory requirement
cons: wastes memory on average. Can be improved by managing several heaps of different fixed sizes.

* Smart ptr pattern

* Garbage collection pattern

* Garbage compactor pattern

Wednesday, March 12, 2008

Linux Device Driver Concepts

The following concepts are not equal:
1. Device : the physical chip
2. Device driver : the code that controls the chip
3. Interface : the routines provided to users, hiding the chip details underneath
4. ISR : the interrupt service routine, invoked in response to a hardware interrupt
5. Kernel module : an object file that can be loaded at run time to extend the kernel's functionality

Sunday, February 17, 2008

Skip, Direct Pred Modes

* Diff from normal pred:
- Need to derive both refIdx and mv (normal prediction only derives mv; refIdx is coded).
-- P Skip: refidxL0 = 0
-- B Temporal: refidxL1 = 0

* Diff between B temporal and spatial modes:
- refidx:
-- temporal mode: always L0 and L1
-- spatial mode: two or one (either L0 or L1)

- motion info (mi) sharing
-- temporal mode: all 16 4x4 blks share one (if the colMB is in 16x16 mode), or
each group of four 4x4 blks shares one (direct_8x8_inference_flag == 1), or
no sharing among the 16 4x4 blks.
-- spatial mode: all 16 4x4 blks share one (because the neighboring derivation is at the MB boundary), or
each group of four 4x4 blks shares one (direct_8x8_inference_flag == 1), or
no sharing among the 16 4x4 blks.

Note:
- ColPic is a different concept from refIdxL1.
- A B picture can also serve as a reference picture.