Sunday, February 17, 2008

Skip, Direct Pred Modes

* Difference from normal prediction:
- Both the refIdx and the MV must be derived (normal modes derive only the MV); see the P-Skip sketch below.
-- P Skip: refIdxL0 = 0
-- B Temporal: refIdxL1 = 0
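
A minimal C sketch of the P-Skip derivation above (H.264 8.4.1.1): refIdxL0 is fixed to 0 and only the MV needs deriving. The types and helper names are invented stand-ins, not a real decoder API, and the median predictor is simplified.

    typedef struct { int x, y; } MV;

    typedef struct {
        int available;   /* neighbor MB exists and is inter-coded */
        int ref_idx_l0;  /* its L0 reference index */
        MV  mv_l0;       /* its L0 motion vector */
    } Neighbor;

    static int median3(int a, int b, int c)
    {
        return a > b ? (b > c ? b : (a > c ? c : a))
                     : (a > c ? a : (b > c ? c : b));
    }

    /* Simplified median prediction: assumes A, B and C are all
     * available; the spec's availability/refIdx special cases are
     * elided in this sketch. */
    static MV median_mv_pred(const Neighbor *a, const Neighbor *b,
                             const Neighbor *c)
    {
        MV m;
        m.x = median3(a->mv_l0.x, b->mv_l0.x, c->mv_l0.x);
        m.y = median3(a->mv_l0.y, b->mv_l0.y, c->mv_l0.y);
        return m;
    }

    /* P-Skip: refIdxL0 = 0 always; the MV is zero when a neighbor is
     * missing or already uses a zero MV with refIdxL0 == 0, otherwise
     * it falls back to the normal median prediction. */
    MV derive_p_skip_mv(const Neighbor *a, const Neighbor *b,
                        const Neighbor *c)
    {
        MV zero = {0, 0};

        if (!a->available || !b->available)
            return zero;
        if (a->ref_idx_l0 == 0 && a->mv_l0.x == 0 && a->mv_l0.y == 0)
            return zero;
        if (b->ref_idx_l0 == 0 && b->mv_l0.x == 0 && b->mv_l0.y == 0)
            return zero;
        return median_mv_pred(a, b, c);
    }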

* Differences between the B temporal and spatial direct modes:
- refIdx:
-- temporal mode: always both L0 and L1
-- spatial mode: two, or only one (either L0 or L1)

- motion info (mi) sharing; see the sketch after this list:
-- temporal mode: all 16 4x4 blocks share one mi (if the colMB is in 16x16 mode), or
each group of four 4x4 blocks shares one (direct_8x8_inference_flag == 1), or
there is no sharing among the 16 4x4 blocks.
-- spatial mode: all 16 4x4 blocks share one mi (because the neighbor derivation is at the MB boundary), or
each group of four 4x4 blocks shares one (direct_8x8_inference_flag == 1), or
there is no sharing among the 16 4x4 blocks.
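
A rough sketch of picking the sharing granularity from the cases above. Everything except direct_8x8_inference_flag is an invented name, and the spatial-mode condition (whether all per-block derivations agree) is collapsed into one input.

    enum ShareGranularity {
        SHARE_16x16,  /* all sixteen 4x4 blocks share one mi */
        SHARE_8x8,    /* each 8x8 quadrant's four 4x4 blocks share one */
        SHARE_NONE    /* every 4x4 block carries its own mi */
    };

    enum ShareGranularity
    direct_mi_share(int spatial_mode, int col_mb_is_16x16,
                    int spatial_blocks_agree, int direct_8x8_inference_flag)
    {
        /* Temporal: one mi for the whole MB when the colMB is 16x16.
         * Spatial: one mi when the MB-boundary derivation applies to
         * every 4x4 block. */
        if (spatial_mode ? spatial_blocks_agree : col_mb_is_16x16)
            return SHARE_16x16;
        if (direct_8x8_inference_flag)
            return SHARE_8x8;
        return SHARE_NONE;
    }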

Note:
- colPic is a different concept from refIdxL1.
- A B picture can itself be used as a reference picture.

Saturday, February 2, 2008

Optimization

The principles of optimization:
* Cases: Know what the application will typically do, and whether average-case or worst-case performance matters more, before optimizing.
* Order: Algorithm & design -> local optimization -> assembly
* Context: The true execution time of one instruction is measured in a certain context, and that time is meaningful only in that context. It is also more meaningful to optimize a segment of code than a single instruction.

Algorithm & Design:
* Two kinds of computing:
-- Data processing (data-streaming algorithms): data is processed sequentially and, in general, only once. The relationships among data items are not the concern of the algorithm, so the cache seems overkill here. Digital signal processing is a typical application. Reducing the computing per item can be an efficient way to optimize; see the sketch below.
-- Data transaction processing (data-block algorithms): the relationships among the data in one block are the focus of the algorithm, so the data cache is very important. Sorting and searching are typical examples.
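
A small illustration of reducing the computing per item in a streaming pass: a hypothetical kernel that clamps 16-bit samples to 8 bits through a precomputed table instead of per-sample branches. The 1024-entry range (samples in [-512, 511]) is an assumption of the sketch.

    #include <stdint.h>

    static uint8_t clip_tab[1024];      /* indexed by sample + 512 */

    void init_clip_tab(void)
    {
        for (int i = 0; i < 1024; i++) {
            int v = i - 512;            /* restore the signed value */
            clip_tab[i] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
        }
    }

    void clip_stream(uint8_t *dst, const int16_t *src, int n)
    {
        /* Each sample is touched exactly once, in order: a streaming
         * pass where per-item work, not the cache, dominates. */
        for (int i = 0; i < n; i++)
            dst[i] = clip_tab[src[i] + 512];
    }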

Instruction-Level Optimization:
* Avoid branch hazards:
-- Unroll loops: the best result may differ for write-only, read-only, and mixed read/write loops. In general an unrolling depth of 8~16 is good; see the sketch after this list.
-- Combine loops.
-- Fill in branch-delay slots in assembly.
* Avoid data hazards:
-- Eliminate data dependences.
-- Prefetch data and instructions: use memory/cache throughput instead of latency (also shown in the sketch below).
* Understand the details of and differences between the assembly instructions, and use the best one.
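
A sketch of both ideas on a read-mostly loop, assuming GCC or Clang for __builtin_prefetch; the 8x unrolling depth and the prefetch distance are tuning guesses, not measured values.

    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_u32(const uint32_t *p, size_t n)
    {
        uint64_t s = 0;
        size_t i = 0;

        for (; i + 8 <= n; i += 8) {
            /* Prefetch ~4 iterations ahead so the loop runs at memory
             * throughput rather than stalling on load latency. */
            __builtin_prefetch(p + i + 32);

            /* 8x unrolled body: one loop branch per 8 elements. */
            s += p[i]     + p[i + 1] + p[i + 2] + p[i + 3]
               + p[i + 4] + p[i + 5] + p[i + 6] + p[i + 7];
        }
        for (; i < n; i++)              /* remainder */
            s += p[i];
        return s;
    }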

Memory Usage Optimization:
* Optimize data structures; see the AoS -> SoA sketch after this list:
-- Split large reference data structures, such as lists and trees, into small ones and access them simultaneously.
-- Pack data structures as densely as possible to reduce the access stride and increase the utilization of cache lines.
-- Keep separate data structures in separate regions, rather than interleaved, to reduce the access stride.
* Use the memory/cache pipeline: the basic unit transferred between the cache/TLB and memory is one cache line, so make the read stride the length of one cache line.
* Process memory as int-sized data instead of char, especially for reads; it seems not to matter for writes.
* Align source and destination addresses.
* Combine writes and reads for large data blocks that exceed the cache size. For small ones, overlapping cache accesses might be harmful.
* Pack data within one memory page as densely as possible, because of the TLB and the page table.
* Spread data across different memory banks to hide DRAM refresh and to access multiple banks in parallel. One page size is a good increment.
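
A sketch of the struct-splitting idea (AoS -> SoA); the field names and sizes are invented for illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* AoS: a scan that reads only `cost` still drags 24 bytes per
     * node through the cache. */
    struct NodeAoS {
        uint32_t cost;
        uint32_t flags;
        uint64_t payload[2];
    };

    /* SoA: the same scan now reads a dense uint32_t stream; one
     * 64-byte cache line feeds 16 iterations instead of about 2. */
    struct NodesSoA {
        uint32_t *cost;
        uint32_t *flags;
        uint64_t (*payload)[2];
        size_t    n;
    };

    size_t count_cheap(const struct NodesSoA *ns, uint32_t limit)
    {
        size_t hits = 0;
        for (size_t i = 0; i < ns->n; i++)
            hits += ns->cost[i] < limit;
        return hits;
    }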

Cache Usage Optimization:
* Keep the data and instruction working sets within the cache size. For a set-associative cache, the maximum attainable size of cached data is CacheSize + (CacheSize / Ways).
* Align addresses and avoid data that splits across cache lines.
* Place data in different cache banks in order to access multiple cache banks in parallel.
* The memory access stride should not be a multiple of the size of one cache way (CacheSize / Ways); otherwise conflict misses happen. See the padding sketch below.
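
A sketch of breaking such a stride by padding, assuming for illustration a 32 KiB, 8-way cache with 64-byte lines (so one way spans 4 KiB).

    #define ROWS 1024
    #define COLS 1024   /* 4096-byte rows: a column walk over
                         * float[ROWS][COLS] would put every element
                         * in the same cache set and thrash its
                         * 8 ways (conflict misses). */

    /* Padding each row by one cache line (16 floats) makes successive
     * rows map to different sets. */
    static float table[ROWS][COLS + 16];

    float column_sum(int col)
    {
        float s = 0.0f;
        for (int r = 0; r < ROWS; r++)
            s += table[r][col];
        return s;
    }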

Write Buffer Usage Optimization:
* Access the same data types, especially for a read after a write.
* Keep the code that writes one region together to take advantage of the write buffer, since its entries might be flushed in any order; see the sketch below.
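
A sketch of keeping the stores to one record together; the record layout is invented, and the benefit depends on the actual write-buffer policy.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint32_t id;
        uint32_t len;
        uint8_t  tag;
        uint8_t  pad[55];   /* brings the record to 64 bytes */
    } Record;

    void emit_records(Record *out, const uint32_t *ids, int n)
    {
        for (int i = 0; i < n; i++) {
            /* All stores to out[i] happen back to back, so entries for
             * the same line can merge in the write buffer. A
             * field-at-a-time pass over the whole array would revisit
             * each line long after its entry may have been flushed. */
            out[i].id  = ids[i];
            out[i].len = 64;
            out[i].tag = 0x47;
            memset(out[i].pad, 0, sizeof out[i].pad);
        }
    }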