Wednesday, December 26, 2007

Inter Prediction MV Derivation

In total, three modes are used to derive the inter prediction motion vector for MB partitions/sub-MB partitions:
* P skip mode
* B skip/direct mode (spatial/temporal)
* Others

The MVs can be calculated directly in the first two modes, while in the third mode they are derived from mvp and mvd. Two issues are very important in the course of the derivation: how to locate the colPartition/colBlk in the colPic, and how to locate the neighboring partition/blk in the current pic.

* ColPartition/colBlk
Note: The basic processing unit for B_skip, B_16X16_direct and B_8X8_direct is the 4X4 blk. For P skip it is 16X16.
Three parameters are needed for temporal direct mode: colPic, colPart/colBlk, and refIdxL0
- colPic: different combinations of fld, frm and afrm. Table 8-6
- colPart/colBlk: The basic unit is the 4X4 block in the current MB and the colMB. If the partition in the colMB is larger than this unit, the motion information is copied onto all of its constituent blocks (see the sketch after this list). When 8X8_direct_flag is enabled, the basic units within one 8X8 sub partition of the current MB share the same mv and refidx; otherwise, the 16 blocks each have their own mv and refidx. To save memory, mv and refidx are needed for only a few units in the colMB for the derivation. The basic unit mapping is defined by Table 8-8; a LUT could be used to implement this kind of mapping.
- refIdxL0: Keep in mind that the picture structure of colRef might be different. The case where the current MB is a field MB in an afrm is always special.
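
Below is a minimal sketch of the copy operation mentioned for colPart/colBlk, assuming a hypothetical MotionInfo struct and a 4x4 grid of 16 blocks per MB; the names and the 4x4-unit convention are illustrative, not from the standard:

typedef struct { short mvx, mvy; signed char refIdx; } MotionInfo;

/* Replicate the motion info of one partition onto all of its
   constituent 4x4 blocks; x0/y0/blkW/blkH are in 4x4-block units. */
void copy_motion(MotionInfo blk[4][4], int x0, int y0,
                 int blkW, int blkH, MotionInfo mi)
{
    int x, y;
    for (y = y0; y < y0 + blkH; y++)
        for (x = x0; x < x0 + blkW; x++)
            blk[y][x] = mi;   /* every covered 4x4 block shares mv/refIdx */
}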

* Neighboring partition/blk
The basic unit is still the 4X4 blk as above, and multiple units may share the same mv information due to the copy operation. It is not necessary to store all the blk information for the derivation; for example, only 4 units are needed for B and C. The processing of A and D is special, however, and all the blk information is needed within the current MB.

Tuesday, December 11, 2007

Neighboring Location Derivation

Subclause 6.4.9 defines the address of the neighboring MB/partition/sub-MB partition, given one location's coordinates. The derivation becomes much more complicated when MBAFF is enabled, in which case Table 6-4 is used. Under this circumstance, the following points should be kept in mind.

- The basic unit in one slice is an MB pair instead of an MB, so the indexing order of MBs is different.
- If MBAFF is disabled, only PAFF is applicable, which means the slice/MB structure is the same for the whole picture. When MBAFF is enabled, a neighboring MB may be either a field or a frame MB.
- Only the top MbAddr can be derived using subclause 6.4.7.
- Field and frame MBs have different vertical starting points and steps. Keep in mind that the two fields are interleaved in the FRAME grid: the starting point for the top field is the leftmost point of the first line, and for the bottom field the leftmost point of the second line. The vertical unit step is two for fields instead of one for frames.
- When coordinates are negative, only one value is possible: -1.
- Note this clause defines only the derivation of the neighboring MB index, since MaxW and MaxH give the MB boundary. When calculating neighboring sub-MB partitions, it is possible that neighboring partition C is located in the top MB. Here the key parameter is the luma/chroma location difference xD, yD for MB, MB partition, sub-MB partition, luma8X8Blk, luma4X4Blk and chroma4X4Blk.
- The predPartWidth used for xD is special for P skip, B skip, B direct 16X16 and B direct 8X8: it is 16. Otherwise it is SubMbPartWidth/MbPartWidth.
- In general the MB can be divided into 16 blocks, each of which stores motion information (which may be identical across one partition or sub-MB). Given one MB/partition/sub-MB partition, the neighboring ones can be located from the first block of that partition with the help of x, y and predPartWidth, according to Table 6-3/6-4.
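
For the non-MBAFF case, the rule of 6.4.9 reduces to a few comparisons. A minimal sketch, assuming currMbAddr/picWidthInMbs are known and leaving out the slice-boundary and picture-edge availability checks; maxW/maxH are 16 for luma MBs:

/* Sketch of subclause 6.4.9 without MBAFF: given (xN, yN) relative to
   the upper-left corner of the current MB, return the MB address that
   contains it (-1 = not available) and its location inside that MB. */
int neighbour_mb(int xN, int yN, int maxW, int maxH,
                 int currMbAddr, int picWidthInMbs, int *xW, int *yW)
{
    int mbAddrN;
    if (xN < 0 && yN < 0)                      /* D: above-left  */
        mbAddrN = currMbAddr - picWidthInMbs - 1;
    else if (xN < 0 && yN >= 0 && yN < maxH)   /* A: left        */
        mbAddrN = currMbAddr - 1;
    else if (xN >= 0 && xN < maxW && yN < 0)   /* B: above       */
        mbAddrN = currMbAddr - picWidthInMbs;
    else if (xN >= maxW && yN < 0)             /* C: above-right */
        mbAddrN = currMbAddr - picWidthInMbs + 1;
    else if (xN >= 0 && xN < maxW && yN >= 0 && yN < maxH)
        mbAddrN = currMbAddr;                  /* inside the current MB */
    else
        return -1;                             /* not available */
    *xW = (xN + maxW) % maxW;
    *yW = (yN + maxH) % maxH;
    return mbAddrN;            /* caller still checks slice/edge validity */
}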

Monday, December 3, 2007

Reference Picture Management

- The processing flow is: reference list initialization (setup and sort) -> resorting (reordering) -> decoding one picture -> reference picture marking
- Before decoding one picture, the reference pictures for every MB/partition should be ready in reference list0/1. The pictures in the DPB flagged as reference are put into the lists (a non-reference picture may still be held in the DPB awaiting output, but it is not put into the lists). Sorting means that the short-term reference pictures come first, in decreasing PicNum order, and the long-term reference pictures follow, in increasing LongTermPicNum order.
- Some reference pictures, whether short or long term, may be used by MBs more often than others. Giving these pictures small list indexes reduces the bitrate further, so the resorting process is applied to the lists based on the slice header information of the current slice.
- After the current picture is decoded, it is marked in one of three ways: unused for reference, used for short-term reference, or used for long-term reference, and it is stored into the DPB if it is used for reference. This is called reference picture marking, and it includes the memory management of the DPB. Two mechanisms are used for this management: the sliding window, or adaptive_ref_pic_marking_mode with 7 commands.
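
A minimal sketch of the list0 initialization order for P slices described above, using a hypothetical RefPic record (B-slice initialization orders by POC instead): short-term pictures first by decreasing PicNum, then long-term by increasing LongTermPicNum.

#include <stdlib.h>

typedef struct { int isLongTerm; int picNum; int longTermPicNum; } RefPic;

static int cmp_init_order(const void *pa, const void *pb)
{
    const RefPic *a = pa, *b = pb;
    if (a->isLongTerm != b->isLongTerm)
        return a->isLongTerm - b->isLongTerm;      /* short-term first */
    if (!a->isLongTerm)
        return b->picNum - a->picNum;              /* decreasing PicNum */
    return a->longTermPicNum - b->longTermPicNum;  /* increasing LongTermPicNum */
}

/* usage: qsort(list0, numRef, sizeof(RefPic), cmp_init_order); */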

Syntax elements of H.264

* Syntax for video sequence stream
- VCL represents the content of the video data; NAL formats that data and provides header information for communication and storage.
- Bytes of the video stream are in big-endian order; within one byte the LSBit is on the right, and the MSBit always comes first in the bitstream.
- One coded slice NAL must contain all the data of one slice. The data structure is slice header, slice data and trailing bits. This means a NAL represents one slice of a picture rather than one whole picture.
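
A minimal MSB-first bit reader matching the bit order above (a sketch, not a spec-conformant NAL parser; the BitReader type is illustrative):

typedef struct { const unsigned char *buf; unsigned pos; } BitReader;

/* Read n (1..32) bits, most significant bit first. */
static unsigned read_bits(BitReader *br, int n)
{
    unsigned v = 0;
    while (n-- > 0) {
        v = (v << 1) | ((br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1);
        br->pos++;
    }
    return v;
}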

- Syntax elements for NAL
1). nal_ref_idc:
For sequence and picture parameter set NALs, it shall be non-zero.
For slices of a reference picture, it shall be non-zero.
For slices of a non-reference picture, it shall be 0.
2). nal_unit_type:

- Syntax elements for Seq Parameter Set (SPS)
0) seq_parameter_set_id: [0, 31]
1) log2_max_frame_num_minus4: [0, 12] (so MaxFrameNum is at most 2^16; see the derivation below this list)
2) pic_order_cnt_type: specify the method to decode picture order count. [0, 2]
3) log2_max_pic_order_cnt_lsb_minus4: [0, 12] (MaxPicOrderCntLsb)
4) several elements for the decoding of picture order count
5) num_ref_frames: [0, MaxDpbSize], the total of reference frames, complementary reference field pairs and non-paired reference fields
6) frame_mbs_only_flag: indicate only frames exist in the video seq
7) mb_adaptive_frame_field_flag:
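
The decoder-side derivations behind items 1) and 3) above, following the standard's formulas:

MaxFrameNum = 1 << (log2_max_frame_num_minus4 + 4);               /* at most 2^16 */
MaxPicOrderCntLsb = 1 << (log2_max_pic_order_cnt_lsb_minus4 + 4); /* at most 2^16 */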

- Syntax elements for Pic Parameter Set (PPS)
0) pic_parameter_set_id: [0, 255]
1) mb to slice group map
2) QP initial value for Y/C

- Syntax elements for Slice header
0) first_mb_in_slice: MB index in general and MB pair index for MBAFF
1) slice_type: IDR only contains I/SI slices and so does the video seq when num_ref_frames is 0
2) frame_num: used to number reference pictures.
3) field_pic_flag: when set, this slice is part of a coded field, i.e. the picture is a field picture. The picture structure can be determined from this flag; but if it is 0, the MB structure may still be either frame or field (MBAFF).
4) bottom_field_flag: this slice is part of a coded bottom field. The picture is bottom field.
5) pic_order_cnt_lsb: the picture order count modulo MaxPicOrderCntLsb for the top field of a coded frame or for a coded field.
6) delta_pic_order_cnt_bottom
7) delta_pic_order_cnt[0-1]?
8) idr_pic_id: identifies an IDR picture. All slices in one IDR have the same value of idr_pic_id.

- Syntax elements for slice data
0) mb_field_decoding_flag: indicates whether the current MB pair is decoded in field or frame mode when MBAFF is enabled

Saturday, November 24, 2007

Frame Num and Picture Order Count

- The standard tells me frame_num is used as an identifier for pictures and has a strong relationship with PrevRefFrameNum. However, I do not quite understand the usage of frame_num during encoding/decoding.

The concept is simple, but it became more complicated as it was refined. It is actually primarily a loss robustness feature. It may actually sometimes be helpful for you to ignore the name of the syntax element and try to think very strictly only about how it behaves --
not what it is called. The name is only a hint -- a way to help you remember which syntax element we're talking about when we talk about some particular one. It might be better to just think about it as if its name was any_name or something like that. (This is true of all syntax elements, actually -- but it is especially true of this one.)

Primarily, the idea of the syntax element any_name was to have a counter that increments each time you decode a picture so that if there are losses of data, the decoder can detect that some picture(s) were missing and would be able to conceal the problem without losing track of what was going on.

You can see this idea reflected in the way that the behavior of any_name depends on whether the picture is a reference picture or not (i.e., on nal_ref_idc). Since the proper decoding of a non-reference picture is not necessary for the proper decoding of other pictures that arrive later, any_name was designed so that a missing non-reference picture would not cause any_name to indicate the presence of a problem when a non-reference picture is missing.

Since the value of any_name often changes from picture to picture (and does not change within a picture), it can be used (subclause 7.4.1.2.4) as part of a method to detect when a new picture begins in the bitstream.

Then there is the notion that you ought to be able to splice different coded video sequences together without changing all the any_name variables in every picture. And the decoding process for different coded video sequences is independent anyway, so the value of any_name was reset to zero whenever a new coded_video_sequence begins.

Then, we find that under some circumstances (e.g., esp. for redundant pictures that correspond to IDR primary pictures) it might be nice to be able to reset the value of any_name without necessarily using an IDR picture to do it (since IDR pictures carry a significant penalty in rate-distortion performance relative to other types of pictures). This led to the feature embodied as memory_management_control_operation equal to 5.

We also found that if we governed the behavior of any_name within a coded video sequence too strictly, it would prevent the ability to have efficient multi-layer temporal scalability (the ability to remove some pictures from a bitstream and still have a decodable remaining sequence of pictures). This led to the features embodied in the standard as "gaps in any_name value" and "sub-sequences".

Then, finally, we get to interlace support and coded fields. Parity can be used to distinguish between a top field and a bottom field, so it is not necessary for pictures to have a different value of any_name to let you know whether an individual field is missing. So fields of different parity can share the same value of any_name.

Finally we get to the way fields are stored into memory for operation of the decoding process for PicAFF and MBAFF coding (picture- and macroblock-adaptive frame/field coding, respectively). If we let a top field be paired with a bottom field for use as a decoded reference frame, this means that we need some way for the decoder to know how to pair different fields together for that purpose. And we thought that it was probably not really necessary to allow any individual top field to be paired with any arbitrarily-selected bottom field for that purpose, since typically an encoder might not really be interested in doing that. Conceptually, it is simpler to be able to just store the data for two fields into a memory space that would ordinarily hold a frame, and not need to do extra work to be able to create an association between any arbitrary pair of fields. Then a decoder could just change the stride it uses when addressing a surface to
control whether it is accessing the samples of an individual field or a unified frame. So the decoded picture buffer (DPB) was designed to manage its memory model as a collection of frame stores, not as a collection of individual fields.

That is really essentially the entire purpose and design relating to any_name (i.e., frame_num). That is ALL it is. It is natural to want to think of any_name as essentially a numbering of source frames at the input to the encoder. Although this is what most encoders will probably do, it is not a strictly correct understanding sufficient to build a well-designed decoder. (It is important to keep in mind that we do not specify how encoders or displays will operate -- only
decoders.) For example, that thinking could lead to some incorrect assumptions about the allowed timing relationship of pictures at the output of the decoder. The syntax element is not really for that purpose. Instead, it is a way to achieve picture loss robustness without sacrificing too much flexibility for the way the video can be used, and a way to simplify the picture buffering model management in decoders for frame/field adaptive coding.


- The standard says "Picture order counts are used to determine initial picture orderings for reference pictures in the decoding of B slices", which means we don't need to consider pic_order_cnt_type when dealing with the baseline profile?

The basic concept of POC is to provide a counter that specifies the relative order of the pictures in the bitstream in output order (which may differ from the relative order in which the coded pictures appear in the data of the bitstream, which is referred to as the decoding order).
The relative order of the pictures is indicated in POC, rather than the timing of the pictures. This allows systems that carry the video bitstream to control the exact timing of the processing and output of the video bitstream without affecting the decoding process for the values of the samples in the luma and chroma sample arrays of the pictures. In some cases, the values of the samples in the luma and chroma sample arrays will depend on POC values. However, the values of
the samples in the luma and chroma sample arrays will never depend on the timing of the pictures.

There are three modes of POC operation:

In POC type 0, each slice header contains a simple fixed-length counter syntax element (pic_order_cnt_lsb) that provides the LSBs of the current POC. The MSBs of the current POC are calculated by the decoder by tracking modulus wrapping in the LSBs.
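
The MSB tracking is the core of POC type 0 (subclause 8.2.1.1); a sketch of the update, where prevPicOrderCntMsb/prevPicOrderCntLsb come from the previous reference picture in decoding order:

if (pic_order_cnt_lsb < prevPicOrderCntLsb &&
    (prevPicOrderCntLsb - pic_order_cnt_lsb) >= MaxPicOrderCntLsb / 2)
    PicOrderCntMsb = prevPicOrderCntMsb + MaxPicOrderCntLsb;  /* LSBs wrapped up */
else if (pic_order_cnt_lsb > prevPicOrderCntLsb &&
         (pic_order_cnt_lsb - prevPicOrderCntLsb) > MaxPicOrderCntLsb / 2)
    PicOrderCntMsb = prevPicOrderCntMsb - MaxPicOrderCntLsb;  /* LSBs wrapped down */
else
    PicOrderCntMsb = prevPicOrderCntMsb;
TopFieldOrderCnt = PicOrderCntMsb + pic_order_cnt_lsb;        /* frames / top fields */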

In POC type 1, each slice header contains one or two variable-length-encoded syntax elements that provide the difference to apply to a prediction of the current POC to compute the actual current
POC. This POC type provides the encoder with the ability to encode the POC values using significantly fewer bits per slice than what would otherwise be needed when using POC type 0 in cases where the encoder will usually be using a repetitive pattern of POC behavior.

In POC type 2, no data is carried in the slice header to compute the current POC. When POC type 2 is in use, the output order of the pictures in the bitstream will be the same as the order in which the coded pictures appear in the data of the bitstream. This POC type eliminates the need for the encoder to send any syntax data in the slice header for POC derivation. However, it provides no flexibility to allow the output order of the pictures in the bitstream to differ from their decoding order.

That statement would ordinarily be true. However, picture order count can also be used to determine the output order of pictures. The decoder ought to have other sources of information to determine that (e.g., timestamps on pictures carried at a systems level), so a Baseline decoder may not need to pay attention to picture order count. But it does need to figure out the output order of pictures one way or another.

Picture order count is also used to determine weights for temporal weighted prediction. Of course, that's not part of the Baseline profile either.

I think the only dependencies between picture order count and the processes for determining the values of decoded picture samples are the following:
1) The ordering of the initial reference picture lists in B slices
2) Temporal weighted prediction in B slices
3) Temporal direct prediction in B slices

So the summary is that if you're not supporting B slices you don't need picture order count for determining the values of decoded picture samples.

The only other issue is how to determine the output order of pictures. But a system may provide that information in some way that doesn't depend on picture order count.

-- From mpegif.org

Tuesday, November 20, 2007

Concepts of H.264

0. Abbreviations
- Access Unit: a set of NAL units that always contains exactly one primary coded picture, and may also contain one or more redundant coded pictures or other NAL units not containing slices or slice data partitions of a coded picture. The decoding of an access unit always results in a decoded picture.
- Coded Frame/Field: H.264 has no frame/field picture concepts as such. A coded frame consists of two fields coded together as a single picture; a complementary field pair consists of two fields coded as separate pictures. No picture order count relationship is required between the fields of a coded frame or a complementary field pair; in general they are stored in one frame buffer. The only requirement is that no other pictures have order counts falling between the order counts of these two fields. Once a frame is decoded, it contains two fields: the two fields together can be used to predict a coded frame, or each field can be used separately as a reference picture to predict a coded field. Two subsequent fields can be coded as separate pictures which, once decoded, are combined as a complementary reference OR non-reference field pair. Note that coded fields may either be part of complementary field pairs or be non-paired fields.
There are two kinds of complementary field pairs: complementary reference field pairs (both fields are reference pictures) and complementary non-reference field pairs (both fields are non-reference pictures). If the two fields of a frame differ in their reference property, for example one is a reference picture and the other is not, then each field is a non-paired field: the reference one is called a non-paired reference field, and the non-reference one a non-paired non-reference field.
If field pictures are used they should occur in pairs and together constitute one coded frame. When coding interlaced sequences using frame pictures, two fields should be interleaved with one another and then the entire frame is coded as one frame picture. -- MPEG-2
- IDR: Instantaneous Decoding Refresh, similar to an I picture. A picture carrying memory_management_control_operation equal to 5, which marks all reference pictures as unused for reference, has the same function. A video sequence shall start with an IDR picture and all following pictures are non-IDR, so for H.264 a sequence is similar to a GOP in MPEG-2. There is also an end-of-stream NAL, indicating the end of the video stream. An IDR can be used as a short-term or long-term reference picture; a non-IDR would normally be a short-term reference picture.
- Decoded Picture Buffer (DPB): Store all the reconstructed pictures
- Picture order count: counts pictures relative to the previous IDR picture; it determines the OUTPUT ordering of pictures, which may differ from the decoding order.
- Frame number: numbers REFERENCE pictures in decoding order rather than presentation order. If B pictures are not reference pictures, they can be ignored and frame_num increments only across I/P pictures; if B pictures are reference pictures, frame_num follows the decoding order exactly. Note: it is reset to zero at an IDR picture.
- PicNum: derived for each short-term reference picture from the current frame_num and the reference picture's frame_num (see the sketch below)
- LongTermPicNum: specified externally
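
A sketch of the PicNum derivation mentioned above (subclause 8.2.4.1), for frame coding of short-term reference pictures; field coding uses 2 * FrameNumWrap, plus 1 for same parity:

/* FrameNum is the reference picture's frame_num; frame_num is the
   current picture's value. */
if (FrameNum > frame_num)
    FrameNumWrap = FrameNum - MaxFrameNum;
else
    FrameNumWrap = FrameNum;
PicNum = FrameNumWrap;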

1. The index of MB in pictures
In general MBs are indexed in the raster scanning order. In the case of MB-adaptive frame/field mode, the MB pair is used and each MB would be indexed first in its MB pair and then incremented in the raster scanning order of MB pair. This might be used for inverse scanning processes (6.4)

3. Availability for current MB and neighbouring MB
- Not available if one of three conditions is satisfied for current MB.
- Special cases for neighbouring MB

4. Coordinates in the picture
- X: right is positive
- Y: down is positive

5. Derivation process for neighbouring MB, block (4X4 or 8X8) and partitions (6.4.8)
- The objective is to get the index of the neighbouring units (A, B, C, D). The key step is the routine in (6.4.9). Its input is a luma or chroma location (xN, yN) expressed relative to the upper-left corner of the current MB. It outputs the index of the MB that contains (xN, yN) and the location relative to the upper-left corner of that MB.
- The location difference Table 6-2?
From (6.4.1) to (6.4.6), the location of a unit relative to the picture, MB or sub-partition can be calculated. To use the routine in (6.4.9) to get the index of a neighbouring unit, a location within that neighbouring unit is needed; Table 6-2 gives the relationship between these two locations.

Monday, November 19, 2007

Energy and Power Spectral Density

Energy and/or power are the key properties of a signal in the DSP domain. The various transforms (Fourier, DCT, wavelet, etc.) are, in essence, used to study the energy and/or power of the signal.

* Energy signals
- Deterministic or random signals.
- Square integrable or square summable.
- The energy spectral density would be derived from its Fourier transformation.

* Power signals
- Deterministic or random signals.
- Not square integrable or square summable.
- The power spectral density would be the Fourier transformation of its autocorrelation function.

* Property for ESD and PSD
- Non negative
- The area under the energy spectral density curve equals the signal's total energy (the area under the squared magnitude of the signal), whether the signal is continuous or discrete (Parseval's theorem). The total power under a power spectral density equals the corresponding MEAN total signal power, which is the autocorrelation function at zero lag.

H.264

H.264, Advanced Video Coding (AVC) or MPEG-4 Part 10, outperforms the MPEG-4 Visual and H.263 standards, providing better compression of video images. It can deliver roughly the same quality as MPEG-2 at about half the bitrate.

* Picture Format Supported
Almost all video resolutions from SubQCIF to BT.709 are supported, with both progressive and interlaced scanning.
Like MPEG-2, the default color sampling is 4:2:0 and the phase relationship between Y and C samples is the same as in MPEG-2.

* Coded Data Format (Data Stream Syntax)
- Video Coding Layer (VCL): the output of encoding process, a sequence of bits representing the coded video data, which are mapped to NAL units prior to transmission or storage.
- Network Abstraction Layer (NAL): the basic unit of a coded H.264 video sequence. Each NAL unit contains a Raw Byte Sequence Payload (RBSP); the RBSP type is indicated in the one-byte NAL header and the RBSP data makes up the rest of the NAL unit. Some important RBSPs are the parameter sets (sequence or picture), coded slices and end of sequence, etc.

* Profile and Level
Unlike MPEG-4, H.264 supports only four profiles: Baseline for low-bitrate applications, Main for broadcasting and storage, Extended for media streaming applications and High for HD and studio applications.

* Video Coding Tools
- There is no GOB or GOP in the bitstream; the sequence is similar to a GOP and supports both progressive and interlaced material. Picture formats are field and frame. Each picture has a picture order count, which defines its presentation order. Reference pictures are organized into one or two lists, list0 and list1, indexed via frame numbers.

- A coded picture consists of slices, each of which is a set of MBs or MB pairs in raster scan order. There are I, P and B slices, and likewise I, P and B MBs. An I slice uses only I MBs; a P slice may contain P and I MBs, and a B slice may contain B and I MBs. Slices are still the basic unit for resync and error recovery, and they remain independent of each other because intra prediction and motion vector prediction are applied only within the same slice.

- The I MB, i.e. intra MB, is totally different from that of previous standards. Intra prediction from decoded samples in the current slice is used for an I MB, and the residual data is transformed, coded and transmitted. This is essentially DPCM, applied to pixel samples rather than to the DC component in the frequency domain. An alternative to intra prediction is I-PCM, which enables an encoder to transmit the values of the image samples directly, without prediction or transformation.

P and B MBs are inter MBs with inter prediction. A P MB uses list0 and a B MB uses both list0 and list1. MB partitions and sub-partitions are supported, and the reference picture may differ for each of them. The reference pictures may be before or after the current picture in temporal order.

For a B MB, several prediction modes can be used: direct mode, MC from list0, MC from list1, or MC from both list0 and list1. Different modes may be chosen for each partition, and if the 8X8 partition size is used, the chosen mode applies to all sub-partitions within that partition. Note that the notions of backward and forward prediction are not really applicable anymore.

- Inter Prediction
The differences between H.264 and earlier standards include the support for a range of block size and fine subsample motion vectors.

The luma component of one MB can be split up in FOUR ways: one 16X16, two 16X8 partitions, two 8X16 partitions or four 8X8 partitions. For 8X8 partitions, another FOUR sub-partition shapes are supported: one 8X8, two 8X4, two 4X8 or four 4X4. The chroma components are partitioned the same way, except that the sizes have exactly half the horizontal and vertical resolution of the luma ones.

Each partition or sub-partition in an inter MB is predicted from an area of the SAME size in a reference picture (different MBs or partitions may use different reference pictures). The offset between the two areas has quarter-sample resolution for the luma component and one-eighth-sample resolution for the chroma components, so interpolation of the reference picture may be necessary. Since the chroma resolution is half that of luma, the MV is halved when applied to the chroma blocks, and the chroma prediction precision is half that of luma, i.e. one-eighth sample. For luma interpolation, half samples are generated first with a six-tap FIR filter, then quarter samples by averaging. For chroma interpolation, linear interpolation (a weighted average) is used.
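
A sketch of the luma interpolation just described, using the standard's 6-tap kernel (1, -5, 20, 20, -5, 1) for half samples and a rounded average for quarter samples (function names are illustrative):

static unsigned char clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Horizontal half-pel between src[0] and src[1]; the caller must
   provide at least 2 samples of margin on each side. */
static unsigned char half_pel(const unsigned char *src)
{
    int v = src[-2] - 5*src[-1] + 20*src[0] + 20*src[1] - 5*src[2] + src[3];
    return clip255((v + 16) >> 5);
}

/* Quarter-pel: rounded average of the two nearest integer/half samples. */
static unsigned char quarter_pel(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}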

Smaller prediction blocks produce residual data with less energy to code; however, the number of MVs increases greatly and more side information is needed for correct decoding. To reduce the bitrate further, motion vector prediction is used. MV prediction operates per MB, which may have partitions or sub-partitions. Different prediction modes may be used depending on the motion compensation partition size and on the availability of nearby vectors. In general, three neighbouring partitions are used: the left one, the upper one and the upper-right one.
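
In the common case the predictor is the component-wise median of the three neighbouring vectors; a minimal sketch, ignoring availability handling and the 16X8/8X16 special cases:

static int median3(int a, int b, int c)
{
    /* clamp c into [min(a,b), max(a,b)]: the middle of the three */
    int mx = a > b ? a : b, mn = a < b ? a : b;
    return c > mx ? mx : (c < mn ? mn : c);
}

/* mvp.x = median3(mvA.x, mvB.x, mvC.x);
   mvp.y = median3(mvA.y, mvB.y, mvC.y); */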

- Direct Prediction
No MV is transmitted for a B MB or partition in direct mode; unlike skipped B MBs, residual data may still be transmitted. The MVs are reconstructed using direct prediction.

- Weighted Prediction
Weighted prediction modifies the samples of the prediction data in a P/B MB before compensation.
There are two variants of this function: explicit and implicit weighted prediction.

- Intra Prediction
For the luma component, intra prediction uses 4X4 blocks with nine modes or 16X16 blocks with four modes. For the chroma components, 8X8 blocks are used with four modes.

Note the intra prediction depends on the availability of all the required prediction samples.

Predictive coding is used to signal the 4X4 intra modes. The left and upper sub-partitions are used to derive the most probable prediction mode.

- Deblocking Filter

- Transform and Quantisation
Because the minimal prediction size is 4X4, the transform size is 4X4 instead of 8X8. The DC components of the chroma blocks are further transformed with a 2X2 Hadamard transform. A special case is the intra MB with 16X16 prediction, where a 4X4 Hadamard transform is applied to the DC components of each block.

The transmission order for one MB: in general, 26 blocks indexed 0 to 25 are transmitted in order (24 blocks + 2 additional chroma DC blocks). For an intra MB with 16X16 prediction, one additional block with index -1 is needed and is transmitted first.

The quantisation step is determined by the Quantisation Parameter (QP). In total 52 values are supported, and Qstep doubles for every increment of six in QP. This arrangement makes fine control of the bitrate and video quality possible. Predictive coding is also applied to QP within one slice.
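
Because Qstep doubles every six QP steps, the whole 52-entry table can be generated from its first six values; a sketch using the base values as commonly tabulated (e.g. in Richardson's AVC book):

/* Qstep for QP in [0, 51]; doubles for every increment of 6 in QP. */
static double qstep(int qp)
{
    static const double base[6] = { 0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125 };
    return base[qp % 6] * (double)(1 << (qp / 6));
}
/* e.g. qstep(0) = 0.625, qstep(6) = 1.25, qstep(12) = 2.5, ... */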

A 4X4 block is scanned in zig-zag order for frame blocks and in the alternate order for field blocks.

- Entropy Coding

- Interlaced Video

Saturday, September 29, 2007

Board Support Package

* Two Steps
- Basic init
1) Disable interrupts and cache
2) Init memory controller and cache
3) Init stack register and other registers
4) Init UART for debugging
This step generally uses Assembly to implement the above tasks
- Board init
1) Clear memory
2) Init interrupts
3) Init timers
4) Init other hardware
5) Init RTOS
This step runs C code

Sunday, September 23, 2007

Fixed Point Arithmetic in DSP

* Rules for Fixed-Point Arithmetic
- Unsigned wordlength: U(a, b) is a + b
Range: [0, 2**a - 2**(-b)]
- Signed wordlength: A(a, b) is a + b + 1(sign)
Range: [-2**a, 2**a - 2**(-b)]
- Addition
Scale must be the same for all the operands
X(e, f) + Y(e, f) = Z(e+1, f)
- Multiplication
X(a, b) * Y(e, f) = Z(a+e, b+f) Note: exclude the sign bit
- Shifting to divide/multiple by a power of two, or to change the scaling
1) multiply/divide by a power of two: X(a, b) >> n = X(a, b)
where n > 0 means a right shift (divide) and n < 0 means a left shift (multiply)
2) modifying scale: X(a, b) >> n = X(a+n, b-n)
Note these operations might result in a loss of precision or overflow.
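
A sketch of these rules in C for the common Q15 case, i.e. A(0, 15) operands in 16-bit words (names are illustrative):

typedef short q15_t;                 /* A(0,15): range [-1, 1 - 2^-15] */

/* Multiply two Q15 numbers -> Q15, with rounding; the Q30 product is
   shifted back down, discarding the extra sign bit. */
static q15_t q15_mul(q15_t x, q15_t y)
{
    long p = (long)x * y;            /* Q30 product, fits in 31 bits */
    return (q15_t)((p + (1 << 14)) >> 15);
}

/* Rescale X(a, b) to X(a+n, b-n): a plain arithmetic right shift;
   precision may be lost, as noted above. */
static int rescale(int x, int n) { return x >> n; }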

* How to scale
- For A(a, b) scale, the fixed point num would be X = (x + 2**(-b-1)) * 2**b
- Rounding is better than truncation for scaling in terms of error: always add 2**(b-1) before shifting right by b.
- The maximal value of output for FIR
Ymax = Xmax * sum(abs(bi)) where bi is the coefficients of FIR. So sum(abs(bi)) is generally one.
- Scale the coefficients for FIR
For a signed number with M-bit wordlength, in order to represent the maximal value of the coefficients: S1 = floor(log2((2**(M-1) - 1) / max(abs(bi))))
In order to make the final sum not overflow, S2 = A - L - ceiling(log2(sum(abs(bi)))), where A is the length of accumulator, L is the bitwidth of inputs.
So the final scale for the coefficients of FIR should be S = min(S1, S2)
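
The two bounds written out in C (a sketch following the formulas above; maxAbsB = max(abs(bi)), sumAbsB = sum(abs(bi)), and M/A/L as defined in the text):

#include <math.h>

/* S1: largest shift that keeps the biggest coefficient inside an M-bit
   signed word; S2: shift that keeps an A-bit accumulator from
   overflowing for L-bit inputs. Returns the final coefficient scale. */
int coeff_scale(double maxAbsB, double sumAbsB, int M, int A, int L)
{
    int s1 = (int)floor(log2(((double)(1 << (M - 1)) - 1.0) / maxAbsB));
    int s2 = A - L - (int)ceil(log2(sumAbsB));
    return s1 < s2 ? s1 : s2;
}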

Monday, September 17, 2007

Difference Between JPEG and MPEG

- JPEG uses different Q tables and VLC tables for luma and chroma; MPEG quantises and entropy-codes independently of luma and chroma.
- JPEG does not standardize these tables, so customized tables can be transmitted within the stream; MPEG does standardize them, and in particular the VLC tables are not allowed to be changed by the user.
- For VLC, both JPEG and MPEG treat DC (intra MB) and AC (intra MB) differently, and their VLC structures differ too. JPEG codes (run, size) with VLC while the level is coded with minimal bits, the sign being derivable from the level code and size. MPEG does much the same but combines these into one codeword with a different VLC table, and the sign is coded explicitly. The code length can be looked up by the prefix of each code. The two schemes are not compatible with each other.
- MPEG-2 has more precision for the DC component, 8-11 bits, while JPEG only supports 8 bits, like MPEG-1.
- In MPEG, syntax elements above the slice level are fixed-length coded (in general) or variable-length coded. Elements within and below the slice level are variable-length coded; at the MB level in particular everything is VLC-coded: MV difference, CBP, QP difference, etc.

Sunday, September 9, 2007

Random Number Generation

* BigRand and RandInt
#include <stdlib.h>

long long bigrand()
{ return (long long)RAND_MAX * rand() + rand(); }
int randint(int l, int u)
{ return l + (int)(bigrand() % (u - l + 1)); }
Here rand() returns one random number in [0, RAND_MAX]; bigrand() returns one big random number covering roughly [0, RAND_MAX^2] and randint(l, u) returns one random number between l and u (the modulo introduces a small bias, acceptable for a sketch like this).

* Select K random numbers in an array of N with sorted output
To select s numbers out of r remaining, the next number is selected with probability s/r (a runnable C version follows this list). It can be expressed by
if ((bigrand() % r) < s)
The pseudocode is:
select = K
remaining = N
for i = [0, N-1]
   if (bigrand() % remaining) < select
     print i
     select--
   remaining--
Alternatively, the data structure of set could be used:
initialize set S to empty
size = 0
while size < K do
   t = bigrand() % N
   if t is not in S
     insert t into S
     size++
print S in sorted order
The third approach, which also works when repeated values are allowed, is to shuffle the array, like this:
for i = [0, N-1]
   swap(i, randint(i, n-1))
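
A runnable C version of the first pseudocode above (Knuth's selection sampling), assuming the bigrand() defined earlier:

#include <stdio.h>

/* Print K sorted random indices out of [0, N-1], each K-subset
   equally likely; select/remaining follow the s/r rule above. */
void sample_sorted(int n, int k)
{
    int i, select = k, remaining = n;
    for (i = 0; i < n; i++) {
        if (bigrand() % remaining < select) {
            printf("%d\n", i);
            select--;
        }
        remaining--;
    }
}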

Saturday, September 8, 2007

Searching

* Sequential Search O(N)

* Binary Search O(logN)

* Hashing O(1)

* Binary Search Tree O(logN)

* Key Index to Array O(1)
Note its different space requirement from that of hashing

* Examples
- Find the missing/duplicate element in one unsorted array
Binary search for large array. In some cases, it could be solved by calculating the difference between the sum of the array and the sum of the range of number.
- Find all sets of anagrams in one dictionary
Compute a key for each word (e.g. its letters sorted), then sort by the key to group them.
- Find the anagrams of one word in one dictionary
After finishing the previous problem, use binary search to do it.
- Find the Kth largest number in one non-repeated array
Order statistics
- Find the median of two sorted arrays
- Find one number which duplicates more than half size of one array
- Bitmap sort actually uses key index

Sorting

* Selection Sort
Find the smallest element and put it in the leftmost slot, then find the second smallest and put it in the second slot, and so on. The average and worst cases are O(N^2).

* Insertion Sort
Compare each new element with the sorted list on its left. The worst case is O(N^2); it is generally used for almost-sorted arrays, where it runs in O(N). It is stable.

* Heapsort
In place sort without recursion and the worst case is O(NlogN).

* Merge Sort
Divide and conquer. The worst case is O(NlogN) with O(N) space. It is very useful for sorting data on disk that is too large to fit entirely into primary memory.

* Quick Sort
Partition and recursion. The average case is O(NlogN) with O(logN) space.

* Bitmap Sort
The data in one specific range with limited duplication. O(N)

* Binary Search Tree
If the size of the array is not known in advance, the binary search tree could be used to do the sorting. Heap could be used too. However, it might be harder to implement heap with list than with array.

* Examples
- Sort a file containing at most n (10^7) positive integers, each less than n, without duplication. The main memory available is limited to one megabyte.
Merge sort, quick sort with multiple passes, and bitmap sort (with multiple passes)
- What if the number in Q1 might occur at most ten times or each non-repeated number is associated with one of three prefix: 800, 877 and 880?
- Transpose one 4000X4000 matrix
Tag each element with (row, col); sort by row first, then run a stable sort (e.g. insertion sort) by col, so the result is ordered by col with ties broken by row, which is the transposed order.

Friday, September 7, 2007

Bit Operation

* Bit Operations
~, &, |, ^, << and >>.

* Examples
- Check whether the system is big endian or little endian
Why does this solution not work?
int i = 1;
char c = (char)i; /* value conversion: c == 1 no matter what endianness */
return c;
The cast converts the value arithmetically instead of reinterpreting memory. Inspect the first byte in memory instead:
int i = 1;
return *(char *)&i; /* 1 on little endian, 0 on big endian */
- Get/clear the lowest bit of 1 (or 0 by using ~x as input)
x & -x; and x & (x-1);
- Get the index of the lowest/highest bit of 1 (x assumed non-zero)
lowest:  k = 0; while ((x & 0x1) == 0) { ++k; x >>= 1; }
highest: k = 0; while (x >>= 1) ++k;
- Calculate the number of '1' in the one integer
while (x != 0)
{ x = x & (x-1); ++count; }
- Check whether the integer is the power of 2
The number of '1' in the integer is one.
- Get the next power of 2
unsigned int n = 1U << highestOne(x);
if (n == x) return n;
return n << 1;
- Reverse the bit order in one integer (a full 32-bit version follows this list)
x = ((x & m) << n) | ((x & ~m) >> n); /* m is the masking code and n is the swapping distance */
- Reverse two bits in one word: swap2(x, k1, k2)
- Circular Rotation
x = (x << r) | (x >> (BITNUM - r));
- Circular Block Shifting
x = (1 << n) - 1;
x = x << p | (x >> (BITNUM - p));
- Max(0, x) and Min(0, x) (assumes an arithmetic right shift of signed x)
return x & ~(x >> (BITNUM - 1)); and
return x & (x >> (BITNUM - 1));
- Overflow for signed num
addition: ~(a^b)&(a^s)&0x80000000
subtraction: (a^b)&(a^s)&0x80000000
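
Applying the swap step from the reverse-bits item above with successively smaller masks gives the full 32-bit reversal:

unsigned reverse32(unsigned x)
{
    x = ((x & 0x55555555u) << 1)  | ((x >> 1)  & 0x55555555u); /* swap bits    */
    x = ((x & 0x33333333u) << 2)  | ((x >> 2)  & 0x33333333u); /* swap pairs   */
    x = ((x & 0x0F0F0F0Fu) << 4)  | ((x >> 4)  & 0x0F0F0F0Fu); /* swap nibbles */
    x = ((x & 0x00FF00FFu) << 8)  | ((x >> 8)  & 0x00FF00FFu); /* swap bytes   */
    return (x << 16) | (x >> 16);                              /* swap halves  */
}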

Thursday, September 6, 2007

String And Array

* Examples
- To get the first letter which is not repeated in the string
Build the key index between character and its occurrence with array or hashtable.
- Remove the specified characters in the string
Build the key index between character and its status (stay or removed). In place deleting operation with two pointers.
- Reverse the order of a sequence of words separated with space OR circularly shift the string
In place reverse with single word reverse after getting the start and end of words.
- Conversion between signed integer and string
1) To convert between char 0~9 and its integer with the aid of '0'
2) Accumulate in the conversion of atoi from left to right
3) Separate digits in the conversion of itoa from right to left and so reverse order output
- Merge two sorted arrays with the length of M and N (M>>N)
Binary search in M

Tree

* Binary Search Tree
- Search in BALANCED BST: O(logN)
- Insert and delete in BALANCED BST: O(logN)
- Obtain the parent node for a given node: O(logN)
- How to get maximal and minimal value in BST
- Output all the nodes SORTED in BST: O(N)
- Recursive algorithm associated

* Heap: Complete Binary Tree
- Implemented efficiently by an array due to its order and shape properties
- Insert, delete_min/max and delete: O(logN)
- Increase and decrease key: O(logN)
- Build heap: O(N) bottom-up (repeated insertion costs O(NlogN))
- Search in heap: O(N)

* Traversal
Preorder, in-order and postorder (all DFS), and BFS (level order)

* Set
A set contains a sorted set of unique objects, which is usually implemented by BST.

* Examples
- Preorder traversal of BST without recursion
Stack is used to store the intermediate variables.
- Find the lowest ancestor of two nodes in one BST
The one with value between these two nodes would be the answer.
- Find the first K largest numbers or to find a subset of K numbers which sums up to at most T
Partial sort with one heap of K units.
- Find the K largest/smallest number in a heap of N elements with O(KlogK)/O(1). K is much less than N
- Construct a Huffman code and compute the sum of a large set of floating point number

List

* Single Linked List
- One Head Pointer Must Be Associated
To change the head pointer inside a sub-routine, the pointer itself rather than a copy must be changed. Therefore a pointer to the head pointer must be passed to ANY sub-routine that changes the head/tail pointers.
- Lists End With NULL
This must be checked when the traverse happens.
- Insert And Delete
When deleting one element in the list, the element before this one must be obtained.
When inserting one element, it depends on whether it is inserted before or after a specific element.
- Free Every Element in The List
- Special Cases Processing: Head and Tail; Length of List == 0, 1, 2, ...

* Examples
- Implement stack data structure
We could implement stack with variable array or single linked list. Due to the heavy overhead of maintaining the memory of elements, variable array might be suitable for small data set.
- Insert and delete function in list with head and tail pointers
How to process special cases?
- Get the mth element in reverse order in one list
1) n = N - m; Two traverses to get N and n; O(N)
2) Two pointers with the distance of m
- How to check if the list contains loop?
Fast and slow pointers
- How to delete one element if only the pointer to this element is given, i.e. without traversing?
Copy the content of next element to this one and delete next element.
- How to reverse one list with/without recursion?

Wednesday, September 5, 2007

Sync in MPEG

* SCR/PCR, PTS and DTS
Sync in MPEG is handled at the packet layer (PES), with the SCR/PCR, PTS and DTS fields serving as instruments. Their timing base is 90 kHz (27 MHz / 300). They are not necessarily encoded for each video frame or audio presentation unit, but are only required to occur at intervals not exceeding 0.7 s (700 ms) for periodic updating of the decoders' clocks; i.e., the maximal length of a pack in PS is 0.7 s. For B frames of video and for audio packets, the PTS is equal to the DTS; the same holds in a stream with I and P pictures only. In general, I and P frames of video have a reordering delay (until the next I/P frame comes) between DTS and PTS. The PTS can actually be interpolated based on the above information.
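
With the 90 kHz base, the PTS of the n-th frame at a constant frame rate is a one-liner; a minimal sketch (integer fps assumed; for 29.97 fps use the exact 30000/1001 ratio instead):

/* PTS of the n-th frame in 90 kHz ticks, e.g. 3600 ticks per frame at 25 fps. */
long long pts_90khz(long long n, int fps)
{ return n * 90000LL / fps; }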

* Sync Using A Master Stream
All of the media streams being decoded and displayed must have exactly one independent master. Each of the individual media display units must slave the timing of its operation to the master stream. The master stream may be chosen depending on the application. Whichever media stream is the master, all the media streams but the master must slave the timing of their respective displays to the PTSs extracted from the master media stream.

Audio is always chosen to be the master in MPEG decoding. The audio stream is played back continuously, with the clock continually updated to equal the PTS value of the current audio unit. In particular, the STD clock is typically initialized to the value encoded in the first SCR/PCR field when it enters the decoder's buffer. Thereafter the audio decoder controls the STD clock by updating it with the PTS values (coarse sync without a PLL; fine sync would lock a 27 MHz PLL to the PTSs).

The other decoders simply use the audio-controlled clock to determine the correct time to present their decoded data, namely when their PTSs equal the current value of the clock. Therefore, if sync is missed, a video picture may be skipped or repeated, but an audio unit never is.

Monday, September 3, 2007

Extern "C"

Extern "C"
* It is used for mixed C/C++ programming. It has two aspects:
- All the variables and functions it covers have extern linkage
- All of them are compiled and linked with C conventions

* C and C++ Object Name
- Functions and variables in C are identified by their names
- Functions and variables in C++ are identified by a combination of their name, input arguments, return value, associated class name, etc. (name mangling), due to overloading and other C++ features.

* Usage
It is C++-only syntax: it may appear in C++ header files but must not be seen by a C compiler.
- When C++ uses C header files, use the following in the C++ header file:
extern "C"
{
     #include "xxx.h"
}
The C++ files must include this header file instead of using extern directly for C variables and functions.
- When C uses C++ header files, use extern instead of including the C++ header file:
cppfile.h:
extern "C"
{
     void xxx(void);
}
cfile.c
extern void xxx(void);

* __cplusplus
In order to share this header file between C and C++, the predefined macro __cplusplus is used to wrap the extern "C":
#ifdef __cplusplus
extern "C"{
#endif
... ... ...
#ifdef __cplusplus
}
#endif

Sunday, September 2, 2007

Abstraction Relationship

* Three Relationships In General
- is-a
Public inheritance. A derived class is a special kind of base class and can be used in any circumstance where the base class is used; this is the key point of dynamic binding. The converse does not hold: a base class is not a derived class.
- has-a
The composition implementation, with handles to the component objects.
- use-a
Two classes interact at some point in time. Do not implement this with private inheritance; use composition with handles to the used objects.

* Has-a and Use-a
It is difficult to distinguish these two relationships in code, since both are generally implemented with handles. For has-a, the master class has the responsibility to create, maintain and delete its composite object. For use-a, the used object might be shared with others. The setter of the handle comp might look like:
- Has-a
{ delete comp; comp = newcomp; }
{ delete comp; comp = newcomp->clone(); }
- Use-a
{ comp = newcomp; }

C++ Dynamic Binding

* C++ Dynamic Binding
- Implemented by public inheritance and virtual functions
- Base and derived classes: general and special
- A base class must have at least one virtual function, namely the destructor. This affects the following two points.
- Virtual pointer to the virtual function table and its implementation in the derivation.
- Conversion/casting between base class object and derived class object.
- Public inheritance preferred. Use composite instead of private inheritance. No protected inheritance.
- Distinguish function overloading, overriding, and hiding.

* Inheritance
- Base class consists of public interface (public accessibility) and private interface for derived classes (public and protected accessibility)
- Pure virtual function allows only interface without implementation for derived class.
- Virtual function allows interface with default implementation for derived class.
- Non-virtual function allows interface with mandatory implementation for derived class.

* Virtual Function
- Virtual function can not be static
- Redefine all the overloaded virtual functions
- No default arguments for virtual functions
- No virtual copy and assignment operator
- Do not call virtual functions in constructors or destructors

* Abstract Class and Virtual Base Class

C++ Data Abstraction

* Abstract Data Type (class and object)
- Public interface
- Private implementation and private interface for derived classes

* Class
The following are very important for the abstract data type. They might be generated by the compiler and are not inherited (nor virtual, except the destructor). (Note that hidden data members or member functions are likewise not inherited.)
- Constructor
1) Use the initialization list in the constructor as much as possible to improve initialization efficiency. Const and reference members must be put in the list, while static and array members must not. Base class, const, reference and object members must be initialized.
2) Note the consistent order of member initialization.
3) Constructor with only one input or default inputs performs implicit casting.
4) Prefer direct initialization over copy initialization, for example:
X a(10);     /* direct initialization */
X b = X(10); /* copy initialization */
X c = 10;    /* copy initialization via implicit conversion */

- Destructor
1) If the class is designed to be derived from, the destructor must be virtual so that the correct destructor is called under dynamic binding and the data type maps uniformly in memory. If not, do not declare it virtual, to save memory.
2) Even if the destructor is a pure virtual function, it still needs to be defined (and that is allowed).

- Copy constructor
1) If the class contains a handle to an object/memory outside the object, the copy constructor must be redefined.
2) Each member must be copied explicitly in the copy constructor, esp. the base class part.
3) The standard format for the copy constructor:
X(X const &x)
4) Put it in the private area without a definition to prevent illegal use.

- Assignment operator
Note the difference between initialization and assignment.
1) If the class contains a handle to an object/memory outside the object, the assignment operator must be redefined: delete the original memory allocation, then reallocate and assign.
2) Self-assignment must be checked:
if (this != &that) ...
3) Each member must be assigned explicitly in the assignment operator, esp. by calling the assignment operator of the base class:
B::operator=(that);
4) The standard format for the assignment operator:
X const &operator=(X const &x)
5) Put it in the private area without a definition to prevent illegal use.

Wednesday, August 29, 2007

Digital Signal Processing

*Compared to IIR filters, FIR filters offer the following advantages:
- They can easily be designed to be "linear phase" (and usually are). Put simply, linear-phase filters delay the input signal but don't distort its phase. A FIR filter is linear-phase if (and only if) its coefficients are symmetrical around the center coefficient. For a FIR filter with N taps, the delay is (N - 1) / (2 * Fs), where Fs is the sampling frequency; for example, 21 taps at 48 kHz give a delay of about 208 microseconds. So the center of the FIR (1-D or 2-D) generally determines the exact output.
- They could be designed effectively by inverse-transforming desired freq response and windowing. And they are simple to implement. On most DSP microprocessors, the FIR calculation can be done by looping a single instruction.
- They are suited to multi-rate applications. By multi-rate, we mean either "decimation" (reducing the sampling rate), "interpolation" (increasing the sampling rate), or both. Whether decimating or interpolating, the use of FIR filters allows some of the calculations to be omitted, thus providing an important computational efficiency. In contrast, if IIR filters are used, each output must be individually calculated, even if that output will be discarded (because the feedback must still be incorporated into the filter).
- They have desirable numeric properties. In practice, all DSP filters must be implemented using "finite-precision" arithmetic, that is, a limited number of bits. The use of finite-precision arithmetic in IIR filters can cause significant problems like overflow/fluctuation due to the use of feedback, but FIR filters have no feedback, so they can usually be implemented using fewer bits, and the designer has fewer practical problems to solve related to non-ideal arithmetic.
- They can be implemented using fractional arithmetic. Unlike IIR filters, it is always possible to implement a FIR filter using coefficients with magnitude of less than 1.0. (The overall gain of the FIR filter can be adjusted at its output, if desired.) This is an important consideration when using fixed-point DSP's, because it makes the implementation much simpler.

- Compared to IIR filters, FIR filters sometimes have the disadvantage that they require more memory and/or calculation to achieve a given filter response characteristic. Also, certain responses are not practical to implement with FIR filters.

Reference: DSPGuru Fir FAQ

Digital Signal Processor

* RISC for data manipulation and DSP for math processing

* Harvard Arch
Separate memories for data and program instructions, with separate buses for each.

* Super Harvard Arch
Harvard Arch + Instruction cache + I/O DMA

The filter coefficients are put in program memory, while the input signal is in data memory. For each multiply-accumulate, one input sample must come over the data memory bus, but two values (the program instruction and the coefficient) must come over the program memory bus. The first time through a loop, the program instructions must be fetched over the program memory bus, which slows operation because of the conflict with the coefficients fetched along the same path. On subsequent passes of the loop, however, the program instructions can be pulled from the instruction cache. This means that all of the memory-to-CPU transfers can be accomplished in a single cycle: the input sample comes over the data memory bus, the coefficient over the program memory bus, and the program instruction from the instruction cache. In the jargon of the field, this efficient transfer of data is called a high memory-access bandwidth.

In order to improve throughput, I/O controller, DMA, is connected to data memory. This is how the signals enter and exit the system.

* Circular Buffer and Data Address Generator
Data Address Generator (DAG), one for each of the program and data memory. These control the addresses sent to the program and data memories, specifying where the information is to be read from or written to. DSPs are designed to operate with circular buffers, and benefit from the extra hardware to manage them efficiently. This avoids needing to use precious CPU clock cycles to keep track of how the data are stored.

There are many circular buffers in DSP. Some DSP algorithms are best carried out in stages. For instance, IIR filters are more stable if implemented as a cascade of biquads (a stage containing two poles and up to two zeros). Multiple stages require multiple circular buffers for the fastest operation. The DAGs are also designed to efficiently carry out the Fast Fourier transform. In this mode, the DAGs are configured to generate bit-reversed addresses into the circular buffers, a necessary part of the FFT algorithm. In addition, an abundance of circular buffers greatly simplifies DSP code generation- both for the human programmer as well as high-level language compilers, such as C.

* ALU, MAC and Barrel Shifter
A long bit accumulator is built into the multiplier to reduce the round-off error associated with multiple fixed-point math operations, esp. for IIR. The multiplier is designed with combination logic and could execute MAC operation in one cycle.

In some cases, shadow registers could be used for all the CPU's key registers. These are duplicate registers that can be switched with their counterparts in a single clock cycle. They are used for fast context switching, the ability to handle interrupts quickly. When an interrupt occurs in traditional microprocessors, all the internal data must be saved before the interrupt can be handled. This usually involves pushing all of the occupied registers onto the stack, one at a time. In comparison, an interrupt is handled by moving the internal data into the shadow registers in a single clock cycle. When the interrupt routine is completed, the registers are just as quickly restored.

Tuesday, August 28, 2007

I/O Subsystem

* I/O Subsystem
- Interfaces between a device and the main processor occur in two ways: port mapped and memory mapped.
- I/O devices are classified as either character-mode devices or block-mode devices. The classification refers to how the device handles data transfer with the system.
- I/O devices could be active(issue interrupts periodically or aperiodically) or passive(no interrupts, CPU needs polling for read).
- DMA controllers allows data transfer bypassing the main processor.
- I/O Modes
1) Interrupt-driven I/O (active)
An input buffer is filled at device-interrupt time (Ready) with DMA and is emptied by processes that read the device, probably triggered by a DMA-complete interrupt; an output buffer is filled by processes that write to the device and is emptied at device-interrupt time (Ready) with DMA. This might overload the CPU if the I/O traffic is heavy.
2) Polling in real-time (passive)
Within the regular interrupt of a timer, CPU would poll every I/O.
- Task Assignments for I/O
1) ISR associated tasks for active I/O devices
2) Timer triggered tasks for polling of passive I/O devices
3) Resource control task to control a shared I/O device or a group devices
4) Request dispatcher to multiple tasks from devices
- I/O subsystem associates ISR closely.
- I/O subsystems must be flexible enough to handle a wide range of I/O devices, which hides device peculiarities from applications.
- The I/O subsystem maintains a driver table that associates uniform I/O calls, like create, delete, read, write, ioctrl, etc. with driver-specific I/O routines.
- The I/O subsystem maintains a device table and forms an association between this table and the driver table.

Exception and Interrupt

* Classification of General Exceptions
- Async non-maskable
- Async maskable
- Sync precise: PC points to the exact instruction caused that exception
- Sync imprecise

* Priority
- Async non-maskable
- Sync precise
- Sync imprecise
- Async maskable
- Programmable Task

* Processing General Exceptions
- Install exception handles
It requires replacing the appropriate vector table entry (accessed by IRQ) with the address of the desired ESR or ISR. To install handles when the device is used instead of initialization in order to share the precious interrupt resources.
- Exception frame or interrupt stack
The main reasons for needing an exception frame are 1) to handle nested exceptions and 2) the portion of the ISR written in C/C++ requires a stack on which to pass function parameters and to invoke library functions. Do not use the task stack for the exception frame, because that can cause a stack overflow which is very hard to debug (it depends on which task is running, which interrupt fires, and its frequency and timing, etc.).
- Three ways to mask interrupts (no effect on non-maskable interrupts)
1) Disable the device
2) Disable the interrupts of the same or lower priority levels
3) Disable global system-wide interrupts
They might be used because:
1) the ISR tries to reduce the total number of interrupts raised by the device,
2) the ISR is non-reentrant,
3) the ISR needs to perform some atomic operations
The masking can be done by setting the interrupt mask register, which should be saved at the beginning of the interrupt handler and restored at the end. An ISR can therefore prevent other ISRs from running and must save the status register; exception handlers do not have this ability. A sketch of the save-modify-restore pattern follows.
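
A C sketch of that pattern (the register name and mask bits are hypothetical; real code would use the target's intrinsics):

    /* Mask same/lower-priority interrupts around a critical region,
       then restore the previous mask so that nesting still works. */
    extern volatile unsigned IRQ_MASK_REG;   /* hypothetical register */

    void isr_body(void)
    {
        unsigned saved = IRQ_MASK_REG;       /* save current mask */
        IRQ_MASK_REG = saved | 0x0F;         /* hypothetical mask bits */

        /* ... non-reentrant or atomic work ... */

        IRQ_MASK_REG = saved;                /* restore on the way out */
    }
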
- Exception processing time
The interrupt frequency of each device that can assert an interrupt is very important for ISR design. It is possible for the entire processing to be done within the context of the interrupt, that is, with interrupts disabled. Notice, however, that the processing time of a higher-priority interrupt is a source of interrupt latency for lower-priority interrupts. Another approach is to have one section of the ISR run in the context of the interrupt and another section run in the context of a task. The first section services the device so that the service request is acknowledged and the device is put into a known operational state and can resume operation; it then packages the device service request and sends it to the remaining section of the ISR, which executes within the context of a task. This latter part is typically implemented as a dedicated daemon task. Note, however, that the interrupt response time increases: the increase is attributed to the scheduling delay, and the daemon task might have to yield to higher-priority tasks. In conclusion, how much of the ISR runs in the context of the interrupt depends on the number of interrupt sources in the system and the frequency of each.
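
A sketch of this split-ISR pattern in C, under an assumed RTOS message-queue API (msgq_send/msgq_recv and the UART helpers are placeholders, not a real API):

    /* Assumed RTOS primitives and device helpers (placeholders): */
    typedef struct msgq msgq;
    extern msgq *uart_q;
    extern void msgq_send(msgq *q, const void *m, unsigned n);
    extern void msgq_recv(msgq *q, void *m, unsigned n);
    extern char uart_read_data_reg(void);
    extern void process_char(char c);

    /* First section: runs in interrupt context, kept short. */
    void uart_isr(void)
    {
        char c = uart_read_data_reg();     /* ack device, known state */
        msgq_send(uart_q, &c, sizeof c);   /* hand off to the daemon */
    }

    /* Second section: dedicated daemon task, runs in task context. */
    void uart_daemon(void *arg)
    {
        char c;
        (void)arg;
        for (;;) {
            msgq_recv(uart_q, &c, sizeof c);   /* blocks until work */
            process_char(c);                   /* the long part */
        }
    }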

* General Guides
On architectures where interrupt nesting is allowed:
- An ISR should disable interrupts of the same level if the ISR is non-reentrant.
- An ISR should mask all interrupts if it needs to execute a sequence of code as one atomic operation.
- An ISR should avoid calling non-reentrant functions. Some standard library functions are non-reentrant, such as many implementations of malloc and printf. Because interrupts can occur in the middle of task execution and because tasks might be in the midst of the "malloc" function call, the resulting behavior can be catastrophic if the ISR calls this same non-reentrant function.
- An ISR must never make any blocking or suspend calls. Making such a call might halt the entire system.
- If an ISR is partitioned into two sections with one section being a daemon task, the daemon task does not have a high priority by default. The priority should be set with respect to the rest of the system.

Monday, August 27, 2007

Video Coding

A popular and effective video coding method is block-based temporal prediction combined with transform coding. This method is essentially the core of all the international video coding standards.

* Block-Based Hybrid Video Coding
- Block-based motion estimation => MV
- Motion-compensated prediction => predicted block
- DCT on the prediction error block
- Quantization of the DCT coeffs
- Run-length scanning and variable length coding on MVs and quantized coeffs
- Loop filtering to reduce block artifacts in the reconstructed picture
Thus temporal, spatial and symbol redundancies are reduced as much as possible.

The MB is the basic unit of motion estimation and compensation. In general, the three MB modes are I-, P- and B-mode. Because the MVs and DC coeffs (the latter only for I-mode MBs) of adjacent MBs or blocks are similar, they are typically coded predictively within the same GOB or slice to avoid error propagation. This is extended in H.264 and called intra prediction. The quantization parameter or factor is also determined at the MB level.

A frame could be coded entirely in intra mode; such a picture is called an I-picture or intra-picture. A MB in a P-picture can be coded in I- or P-mode, while a MB in a B-picture can be in I-, P- or B-mode, depending on the prediction error. Both P- and B-pictures are called inter-pictures. Also, their picture structure can be field or frame.

For ME/MC, the key parameters are block size, search range, search precision, MV prediction, block prediction, number of MVs, etc. Improvements include unrestricted motion vectors, OBMC, global MC, multiple references, etc. For the DCT, more advanced transforms could be used, like the integer transform, wavelets, etc. For quantization, the stepsize (QP) and non-linear or vector quantization have been explored. Alternate scanning (horizontal or vertical) has been proposed, and so have arithmetic coding and 3-D VLC in place of Huffman coding. For improved performance, the choice among different parameters can be made by a rate-distortion optimization approach.

One important issue in video coding is rate control, which refers to how to code a video so that the resulting bitstream satisfies a target bit rate. This leads to how to choose the coding parameters, like frame rate, MB mode, QP, etc., to meet the rate constraint. Because of the VLC video coder and the varying complexity of the scene, the target bit rate is really an average rate over short intervals of time; the variability of the bit rate within each interval must be absorbed by a smoothing buffer following the encoder. The buffer size is determined by the time interval and the delay requirement of the application. RC is typically accomplished in three steps (a toy sketch of the second step follows the list):
- Update the target average bitrate for each short time interval, based on the bandwidth, delay requirement and buffer fullness
- Determine the coding mode for picture (I, B and P) and the target bit budget for each picture to be coded in this interval, based on the target average rate for the interval and the current buffer fullness
- Determine the coding mode and QP for each MB in a picture to meet the target rate for this frame
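
A toy C sketch of the per-picture budget step, assuming a simple buffer-fullness correction (the flat share and the correction formula are made up for illustration, not taken from any standard rate-control model):

    /* Spread the interval's remaining bits over the pictures left,
       then lean on buffer fullness to correct the flat share. */
    double frame_budget(double interval_bits_left, int pics_left,
                        double buf_fullness, double buf_size)
    {
        double even = interval_bits_left / pics_left;   /* flat share */
        /* spend less when the buffer is filling, more when draining */
        double corr = 1.5 - buf_fullness / buf_size;
        return even * corr;
    }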

Video Bit Stream Syntax

In order to support different applications, the syntax of video bit stream must be flexible. This is achieved by having a hierarchy of different layers that each start with a header. Each layer performs a different logical function. Most headers can be uniquely identified in the bit stream since they begin with a start code and a start code identifier.
* Sequence and VOL for MPEG-4 Visual
Sequence header and end-of-sequence code. Global parameters are contained in the sequence header and its extension.

* Group of Picture (GOP)
GOP allows random access to and interaction with the video sequence, since GOPs are coded independently of each other. For MPEG-4 Visual it is the GOV (group of VOPs); H.26x has no GOP, since those standards target real-time interactive applications. The first picture in a GOP must be coded independently. A closed GOP means the B-pictures at the end of the GOP may not use the I-picture of the following GOP as a reference; it is only used for video editing.

* Picture
The basic coding unit of a video sequence. It consists of three components: Y, Cb and Cr. The picture header indicates the picture type, picture structure and others. A VOP is the coding unit in MPEG-4 Visual.

* GOB, Slice and Video Packet: Groups of MBs
- GOB has a fixed structure of three MB lines with eleven MBs per line. H.26x.
- Slice consists of one line of MBs with variable length. Note its definitions in MPEG-1 and MPEG-2 differ. DC components of intra MBs are coded predictively within one slice, and MVs of non-intra MBs are coded predictively within one slice, too.
- Video packet is delimited by a bit-count threshold defined by the encoder, rather than by a number of MBs.
All three structures are used for resynchronization and error recovery.

* Macro Block (MB)
Eight blocks for 4:2:2 and six blocks for 4:2:0. Basic unit for motion compensation. No start code for MB.

* Block
Transform and entropy coding unit. No start code for blocks.

Sunday, August 26, 2007

MPEG-1/2/4

MPEG-1, ISO 11172, was designed for progressively scanned video with a target bit rate around 1.2Mbps (1.5Mbps in total, including audio and data). It also had to support basic VCR-like interactivity, like fast forward, fast reverse, etc. MPEG-1 consists of five parts; only the video part is considered here.

* Picture Formats Supported
SIF, progressive scanning

* Video Coding Tools
- I, P and B frames. Bidirectional motion compensation is used for B frames, which need two MVs. The coding order is different from the display order.
- For I frames, a weighting matrix is used to adapt the DCT coeffs to the human visual system before they are uniformly quantized. The DC coefficient of an I-block is coded predictively.
- Half-pel prediction precision; the MV range is +/-64 pels (7 bits).
- GOP structure for random access and interactivity

MPEG-2, ISO 13818 or H.262, was designed to extend MPEG-1 functionality to interlaced pictures, primarily using the BT601 4:2:0 format. The target was TV-quality pictures at data rates of 4~8Mbps and high-quality pictures at 10~15Mbps. MPEG-2 consists of nine parts.
* Picture Formats Supported
SIF, BT601, SMPTE 296/295M, progressive and interlaced scanning

* Video Coding Tools
- Chroma samples in 4:2:0 are located horizontally shifted by 0.5 pel compared to those of MPEG-1, H.261 and H.263
- Two picture structures: frame picture and field picture with I, P and B picture mode
- More prediction modes for interlaced video
1) Field prediction for field pictures: the reference field is chosen more flexibly for P and B pictures
2) Field prediction for frame pictures: split the MB into 16X8 halves and apply field prediction; two MVs for a P and four MVs for a B picture
3) Dual prime for P pictures: an MV and a DMV address the two reference fields; the two field blocks are averaged to obtain the prediction block
4) 16X8 motion prediction: the two halves of one MB in a field picture are predicted separately with two or four MVs, due to the low vertical resolution of field pictures
- Field DCT
Reorganize the pels in MB to increase the vertical correlation within a block
- Alternate scan
Besides the zig-zag scan, an alternate scan can be used, which also increases vertical correlation for frame pictures.
- Modes of scalability: data partitioning (like spectral selection in progressive JPEG), spatial (like hierarchical coding in JPEG), temporal (subsampling in time), and SNR (like successive approximation in progressive JPEG)

- For a progressive sequence, all pictures are frame pictures. For an interlaced sequence, a picture can be either a frame or a field picture. At the beginning of an interlaced sequence, the field-parity order is specified and kept throughout, like {top, bottom, top, ...} for NTSC or {bottom, top, bottom, ...} for PAL and 1080i. Field pictures occur in pairs of opposite parity but the same picture mode, so the GOP is in units of frames. Note the luma and chroma sample lattices for field pictures with 4:2:0 color sampling: the chroma samples of the top field are shifted up by one quarter luma sample relative to the field sampling grid, while the chroma samples of the bottom field are shifted down by one quarter luma sample. Frame and field are storage types of pictures and have almost nothing to do with coding.
- In order to display a progressive sequence on interlaced devices, 2-3 pull-down is performed: half of the frames repeat one of their fields to generate the required field rate (24 * 2 + 12 = 60 fields/sec for NTSC).
- For a field picture, the predictor must be a field predictor for all MBs. For a frame picture, the predictor can be chosen MB by MB, either field-prediction or frame-prediction mode, based on the motion activity between the two fields. Both the MB type (I, B or P) and the prediction mode determine the exact prediction type of a single MB.
- Frame DCT can be used in a progressive sequence, and in an interlaced sequence when the motion activity between the two fields is small. In an interlaced sequence, field DCT can be used in frame pictures when the motion activity is high. And it is possible for a field picture to use frame DCT, for example in a 2-3 pulled-down sequence.
- The horizontal search range is coded with up to 10 bits and the vertical one with up to 7 bits, at half-pel precision.
- Unlike JPEG, the quantization tables and VLC tables for MPEG are standardized, even though they can be overridden by ones carried in the video sequence. Like JPEG, intra MBs are perceptually weighted when quantized. Non-intra MBs carry prediction error that is not viewed directly, so their quantization matrix is flat. Furthermore, in order to do adaptive quantization for rate control, a quantization scale factor (MQUANT) is used to scale the standardized quantization tables up and down.

MPEG-4 is designed for the new generation of highly interactive multimedia applications while still supporting traditional ones. It is an object-based coding standard for video, audio, and graphics.

Profiles and Levels
* Profiles describe the tools required for decoding a bit stream. For MPEG-2 these are primarily picture mode, color subsampling format, and scalability; MPEG-4 has more.

* Levels describe the parameter ranges for these tools. They are the size of the picture, frame rate and bit rate requirement.

For MPEG-2, common combinations are MP@ML, MP@HL. For MPEG-4, they are simple profile and advanced simple profile.

H.261 and H.263

H.261 was published in 1990 to enable video conferencing over 1~30 ISDN B channels (p x 64 kbps).
* Picture Formats Supported
QCIF(7.5Hz) and CIF(15Hz), progressive scanning

* Video Coding Tools
- I and P pictures only; due to the real-time requirement, no B pictures.
- Intra and inter MB modes
- Forward motion compensation only, with 1-pel precision
- Motion vector range of +/-16 pels
- Two quantizers for DCT coeffs: a uniform quantizer with stepsize eight for the DC in intra mode, and a midtread quantizer with variable stepsize for the other cases
- Run-length and variable length coding
- A loop filter to reduce the prediction noise

H.263/H.263+/H.263++ are based on the framework of H.261. The standard was completed in stages from 1995 to 2000.
* Picture Formats Supported
Sub-QCIF(128X96; 4:2:0), QCIF and CIF, progressive scanning

* Video Coding Tools
- I and P picture only
- Intra and inter MB modes
- Forward motion compensation only, with 0.5-pel precision
- The MV predictor can be the median of the three MVs of neighboring MBs
- Improved VLC: 3D VLC, i.e. (last, run, level) run-length coding
- Reduced GOB structure

In addition to these, H.263 offers a list of optional features that are defined in annexes to the standard.
- Unrestricted motion vectors. The MV range is [-31.5,31].
- Syntax-based arithmetic coding
- Advanced prediction mode: OBMC, and four block MVs for one MB
- Advanced Intra Coding
- PB picture

The system/signaling part for audiovisual comm. associated with H.261/263 is H.323 for Internet comm. and H.324 for PSTN comm.

Video Quality Measure

* Mean Square Error in Two Video Sequences of N Frames

* Peak Signal to Noise Ratio for One Video Sequence of N Frames (dB)
PSNR = 10 * log_10 (maximal intensity^2 / MSE), where the maximal intensity is 255 (or 235) for the luminance component.
- Calculate the MSE of the whole sequence (not of one frame) first, then convert that MSE to obtain the PSNR of the sequence (see the sketch after this list).
> 40dB - excellent
30~40dB - good
20~30dB - poor
< 20dB - unacceptable
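
A small C sketch of the sequence-level PSNR described above, accumulating one MSE over all N frames before taking the log (an 8-bit peak of 255 is assumed):

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    double seq_psnr(const uint8_t *ref, const uint8_t *rec,
                    size_t samples_per_frame, int n_frames)
    {
        double sse = 0.0;
        size_t total = samples_per_frame * (size_t)n_frames;
        for (size_t i = 0; i < total; i++) {
            double d = (double)ref[i] - (double)rec[i];
            sse += d * d;                  /* squared error, all frames */
        }
        if (sse == 0.0)
            return INFINITY;               /* identical sequences */
        double mse = sse / (double)total;  /* one MSE for the sequence */
        return 10.0 * log10(255.0 * 255.0 / mse);
    }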

Digital Video

* Characterization of a Digital Video
- frame rate (frames/sec)
- line number (lines/frame)
- sample number per line
- image aspect ratio
- pixel aspect ratio
It is related to the IAR: PAR = IAR * active lines per frame / active samples per line. For BT601 NTSC this gives 4/3 * 480/720 = 8/9, and for PAL 4/3 * 576/720 = 16/15. For all HDTV formats it is 1. The PAR must be matched by the display device.
- chrominance subsampling format
4:4:4, 4:2:2 (horizontal subsampling only), 4:1:1 (horizontal subsampling only) and 4:2:0 (the chroma phase might differ between standards)
- bit number per pixel
4:4:4 => 24 bits/pixel
4:2:2 => 16 bits/pixel = (4*8+4*8)/4
4:1:1 and 4:2:0 => 12 bits/pixel

* Digital Video Format
- ITU-R BT601 for SDTV
Uniform sampling frequency for NTSC and PAL: 13.5MHz
Pixels per line: 858 (NTSC) and 864 (PAL), with 720 active pixels per line
Active lines: 480 of 525 (NTSC) and 576 of 625 (PAL)
Raster scanning: 50I/60I
Color coordinate: YCbCr; 4:4:4, 4:2:2 (the default) and 4:2:0
- ITU-R BT709 for HDTV
1920X1080; 24P/30P/60I; 4:2:2; 16:9; 74.25MHz sampling rate; color coordinates differ from BT601
- SMPTE 296M
1280X720; 24P/30P/60P; 4:2:0; 16:9
- SMPTE 295M
1920X1080; 24P/30P/60I; 4:2:0; 16:9
- SIF (for VCD/CVD/SVCD)
352X240/288; 30P/25P; 4:2:0; 4:3
- 4CIF
704X480/576; 50I/60I; 4:2:0; 4:3
- CIF
352X240/288; 30P; 4:2:0; 4:3
- QCIF
176X144; 30P; 4:2:0; 4:3
- Sub QCIF
128X96; <30P; 4:2:0; 4:3

* Bit Rate for Some Applications
- HDTV: SMPTE 296/295M, MPEG-2 20~45Mbps
- Video Studio: BT601, MPEG-2 15~50Mbps
- SDTV, DVD: BT601, MPEG-2 4~8Mbps
- VCD, WWW: SIF, MPEG-1 1.5Mbps
- Video conference over ISDN/Internet: CIF, H.26x, 128~384Kbps
- Video telephony over wired/wireless modem: QCIF, H.263, 20~64Kbps
- DVCPRO 50: BT601(4:2:2), MPEG-2, 50Mbps
DVCPRO 25: BT601(4:1:1), MPEG-2, 25Mbps

Analog Video

* Two Raster Scans
Progressive Scan and Interlaced Scan

* Characterization of a Video Raster
- frame rate (frames/sec)
- line number (lines/frame)
These two parameters define the temporal and vertical sampling rates of a raster scan; many other parameters, like the line rate, can be derived from them.
- image aspect ratio

* Retrace and Sync
Suppose fl is the line rate (lines/sec) and Tl is the line period. Note Tl includes the horizontal retrace time; similarly, the frame period includes the vertical retrace time. Within the H/V retrace intervals, line/field/frame sync pulses are inserted. Note the analog scanning signal uses an inverted black/white voltage scale.

* Analog Color TV System
- NTSC: 525 lines/frame; 29.97 frames/sec, 59.94 fields/sec; 4:3; YIQ; 6 MHz composite bandwidth.
- PAL: 625 lines/frame; 25 frames/sec, 50 fields/sec; 4:3; YUV; 8 MHz composite bandwidth.
Note the line counts for these systems are not the active lines; they include the vertical retrace periods. For NTSC, the active lines are 483 per frame.

Color Coordinates

* Two Attributes
which describe the color sensation of a human being: luminance and chrominance. Luminance refers to the perceived brightness of the light and it is proportional to the total energy in the visible band. Chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light.

* Trichromatic Theory of Color Mixture
Most colors can be produced by mixing three properly chosen primary colors.
- For illuminating light sources, they are R, G and B
- For reflecting light sources, they are C, M and Y (plus K for printing applications)
- All present display systems use RGB primaries, while printers use CMYK primaries.

* Color Specification by Luminance And Chrominance
In many applications, it is desirable to describe a color in terms of its luminance and chrominance content separately, to enable more efficient processing and transmission of color signals.

* Composite versus Component Video
A video in the format of RGB or YUV is called component video. The three signals can be multiplexed into a single signal to construct composite video, which relies on the property that the chrominance signals have a significantly smaller bandwidth than the luminance component. Composite video can thus be transmitted or stored more efficiently, at a cost in image quality. As a compromise between data rate and image quality, S-video consists of two components: the luminance and a single chrominance component that multiplexes the two original chrominance signals.

* Gamma Correction
In reality, the RGB output signals of most cameras are not linearly related to the actual color values. Similarly, most display devices suffer from a nonlinear relation between input and displayed color intensity. In order to present true colors, compensation must be applied to the camera output, and before sending real image values to a display, compensation must be made for the gamma effect of the display device. These processes are known as gamma correction. In practice, gamma correction is done once at the sender side on the RGB signals, which are then converted to YUV or YIQ for transmission; receivers need only convert them back to RGB for display.
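
A minimal sketch of both directions in C (the exponent 2.2 is the conventional CRT value; real standards use a piecewise curve with a linear segment near black):

    #include <math.h>

    /* Sender side: encode a linear intensity in [0,1] with the
       inverse power so the display's gamma cancels it out. */
    double gamma_encode(double linear)
    {
        return pow(linear, 1.0 / 2.2);
    }

    /* What a gamma-2.2 display effectively does to its input. */
    double gamma_decode(double coded)
    {
        return pow(coded, 2.2);
    }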

* YUV, YIQ, YPbPr and YCbCr
- As said before, for video capture and display, all the analog and digital systems use RGB primaries. However, the color coordinate systems used in the transmission of analog TV (NTSC, PAL and SECAM) and in digital systems (BT601 and BT656) are different. In general, a luminance/chrominance coordinate is employed.
- The YUV coordinate is the basis for all of these. Y is the luminance component, while U and V are proportional to the color differences B-Y and R-Y, respectively, scaled to have the desired range. YUV is used in the PAL and SECAM TV systems.
- YIQ is used in NTSC; the I and Q components are versions of the U and V components rotated by 33 degrees.
- YCbCr is the digital color coordinate introduced in BT601. Y, Cb and Cr are scaled and shifted versions of Y, U and V so that the values fit in the range [0-255] (see the sketch after this list).
- YPbPr is the analog color coordinate used in BT709 for the HDTV video interconnect signal. The offsets and scaling factors are specified to prevent saturating the amplifiers of the analog signals.
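
A C sketch of the RGB-to-YCbCr step using the widely quoted BT601 8-bit coefficients (quoted from memory; consult the standard before relying on them):

    #include <stdint.h>

    /* 8-bit gamma-corrected R'G'B' in [0,255] to BT601 YCbCr with
       the nominal ranges Y:[16,235], Cb/Cr:[16,240]. */
    void rgb_to_ycbcr601(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        *y  = (uint8_t)( 16.0 + ( 65.738*r + 129.057*g +  25.064*b) / 256.0);
        *cb = (uint8_t)(128.0 + (-37.945*r -  74.494*g + 112.439*b) / 256.0);
        *cr = (uint8_t)(128.0 + (112.439*r -  94.154*g -  18.285*b) / 256.0);
    }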

* Common Video Input Formats
YUY2: 4:2:2, interleaved as Y,Cb,Y,Cr
YV12: 4:2:0; Y, Cr and Cb are stored as separate consecutive planes, in that order
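
A sketch of the YV12 plane arithmetic (assuming even dimensions and no row padding):

    #include <stddef.h>

    /* Plane offsets inside one YV12 frame buffer.
       Order is Y, then Cr (V), then Cb (U); chroma is w/2 x h/2. */
    void yv12_planes(size_t w, size_t h,
                     size_t *y_off, size_t *v_off, size_t *u_off,
                     size_t *frame_size)
    {
        *y_off = 0;
        *v_off = w * h;                      /* Cr follows the Y plane */
        *u_off = w * h + (w / 2) * (h / 2);  /* Cb follows Cr */
        *frame_size = w * h * 3 / 2;         /* 12 bits per pixel */
    }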

Friday, August 24, 2007

Memory Management

* How to Use Memory?
In terms of how to use memory/buffer, components could be divided into three categories.
- In place read-only
- In place read and write
- Transform
Transform components accept the data in the input buffer, perform some kind of transformation on it, put the new data into a newly requested buffer, and free the old one.

* Two Kinds of Buffer Factory
- Discrete buffer factory: fixed size data buffers
- Circular buffer factory: variable size data buffers

* Discrete Buffer
- Reference count for zero-copy reuse
When discrete data buffers are shared, one solution is to make copies of the buffer data, but that wastes memory. Reference passing is a better way: the reference count of a buffer equals the number of filters that have access to it, and the buffer memory is freed only when the count falls to zero. To support reference counting, all the downstream components must be in-place read-only ones. The buffer's reference count is automatically incremented by one, or by N for a multi-output stream connection, when the buffer is written to a stream connection; each component must release the buffer once it has finished using it and has passed its pointer to the stream connection. A sketch follows.
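
A minimal C sketch of such a reference-counted buffer (single-threaded for brevity; a real implementation would protect the count with a lock or atomics):

    #include <stdlib.h>

    typedef struct {
        int   refcnt;   /* number of filters holding this buffer */
        void *data;
    } ref_buf;

    /* Called by the stream connection, once per downstream reader. */
    void buf_addref(ref_buf *b) { b->refcnt++; }

    /* Called by each component when it is done with the buffer. */
    void buf_release(ref_buf *b)
    {
        if (--b->refcnt == 0) {   /* last holder frees the memory */
            free(b->data);
            free(b);
        }
    }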

* Circular buffer
- One exclusive write and multiple reads
One write pointer and multiple read pointers are maintained for this kind of buffer. The amount of empty space in the buffer equals the distance from the write pointer to the closest read pointer; in other words, the slowest reader controls the amount of available space, and in this way mutual exclusion is realized. Components that underflow the circular buffer are responsible for combining the new data with the old, or they can leave their read pointer unchanged and try again later. The free-space rule is sketched below.
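
A C sketch of the free-space rule with one writer and several readers (indices wrap modulo the buffer size; synchronization omitted; one slot is kept open to distinguish full from empty):

    /* Free space = distance from the write index to the slowest
       (closest-behind) read index, modulo the buffer size. */
    unsigned cb_free_space(unsigned wr, const unsigned *rd,
                           int n_readers, unsigned size)
    {
        unsigned min_free = size - 1;          /* empty-buffer case */
        for (int i = 0; i < n_readers; i++) {
            unsigned gap = (rd[i] + size - wr - 1) % size;
            if (gap < min_free)
                min_free = gap;                /* slowest reader wins */
        }
        return min_free;
    }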

Data Streaming Model

* Data streaming model is a fundamental software architecture, which has wide applications in the field of data processing. This model is based on the following basic ideas for the embedded applications:
- Multi-threading
- Object-oriented: components

* Components
A component is a hardware-independent software abstraction of a hardware or microcode function. It has its own interface that performs component-specific functions. Components are the basic units used to construct a data stream or data flow, and a component may or may not be associated with a thread. Two key features of components are inheritance/polymorphism and dynamic creation and linking. Dynamic creation means components are allocated dynamically at startup or runtime instead of globally or locally; dynamic creation and linking keep the application and the low-level support, like middleware, as separate as possible.

* Types of Components
- Stream filter
It is the processing element (brick) in a data stream. In general, it is associated with one thread. It has input/output ports and command/message ports for async events.
- Stream connection
It is used to connect two or more stream filters to construct a data stream (mortar). In essence, a stream connection is a message queue. In general, no thread is bound to a stream connection. The comm. among tasks is therefore implemented with message passing instead of shared data structures. This model might be more suitable for multi-core applications since it helps to hide the existence of other cores.
- Buffer factory
It is used to manage memory usage within a data stream. One or more buffer factories can be associated with one flow as needed. No threads are associated with them.
- Flow controller
This macro component creates, initializes and connects multiple building components as mentioned above to construct a data flow. Also it is the interface to the higher application level and emits commands to and receives status messages from the whole flow. It could be considered to be a special case of stream filters.
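
A C sketch of how these component types might fit together (all names illustrative):

    typedef struct stream_conn stream_conn;    /* message queue */

    /* A stream filter: the processing brick of a data flow. */
    typedef struct stream_filter {
        const char  *name;
        stream_conn *in, *out;       /* data ports */
        stream_conn *cmd, *msg;      /* async command/message ports */
        void (*process)(struct stream_filter *f);  /* thread body */
    } stream_filter;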

New components are created and old ones may be deleted when a new data flow is built, in order to use memory efficiently. However, some kinds of components provide facilities for all data streams, so they are created once at startup and stay alive for the whole system lifetime.

Sunday, August 19, 2007

Scheduling

* Mode
- Process Contention Scope
- System Contention Scope

* Types
- Time Sharing with Equal or Dynamic Priority (Fair)
- Realtime with Different and Static Priority (non-fair, in order to have a predictable response): preemptive between different priorities and round-robin (time slicing) within the same priority

* Priority Assignment
Pre-condition:
All tasks have worst-case execution times that are less than or equal to their deadlines.
- Rate Monotonic Scheduling
For periodic tasks whose deadlines fall at the end of their periods: tasks with shorter periods get higher priority.
- Minimum Interarrival Time Scheduling
For aperiodic tasks whose deadlines equal their minimum interarrival times: tasks with shorter interarrival times get higher priority.
- Deadline Monotonic Scheduling
For periodic or aperiodic tasks whose deadlines are shorter than their periods or minimum interarrival times: tasks with shorter deadlines get higher priority.
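
A worked bound that goes with rate monotonic scheduling (the classic Liu-Layland result, quoted from memory): n periodic tasks are schedulable under RMS if the total utilization satisfies sum(Ci/Ti) <= n * (2^(1/n) - 1). For n = 2 the bound is about 0.83, and as n grows it approaches ln 2, roughly 0.69. Utilization above the bound does not necessarily mean failure, only that this test no longer guarantees schedulability.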

* Issues
- Priority inversion
It involves at least three threads of different priorities. The solutions are priority inheritance and priority ceiling, which mutexes generally support. When a preemption lock is used, priority inversion might also happen, since a task with higher priority cannot be executed in time.

Communications

* Types of Communication
- Loosely coupled comm.
- Tightly coupled comm.

* Approaches
Shared Data Structure
- Resource sync needed.
Message Passing (Message Queue)
- Provide sync and comm together
- No owner and mutual exclusion implemented internally
- Value passing for small messages and reference passing for large messages
- Semaphore implementation: a mailbox of one valid message

Usage for Comm
- Loosely coupled one-way comm: ONE message queue
- Tightly coupled one-way comm: ONE message queue and ONE binary semaphore
- Tightly coupled two-way comm (Rendezvous): TWO message queues with one or multiple entries
Note: the task model here could be "one to one", "multiple to one" and "one to multiple", etc.

Synchronization

* Types of Synchronization
- Resource sync: critical section and resource mutual exclusion
- Activity sync

* Approaches
Lightweight ones: TSL (test-and-set lock, H/W), semaphore, mutex, conditional variable
Heavyweight ones: message passing, spinlock, monitor, read-write lock, barrier, interrupt lock, and preemption (scheduling) lock
- Resource sync
Resource could be either data or codes:
mutex for single shared resource access
recursive mutex for recursive shared resource access
counting semaphore plus mutex (mixing non-blocking and blocking locking) for access to multiple shared resources
interrupt lock for the shared resource with ISR
preemption lock for the shared resource without waiting
read-write lock for multiple reads and exclusive write (broadcasting)
locking hierarchy and locking chain for dependent resources
- Activity sync
1) Single-task waiting and single-task signaling: conditional variable and semaphore
2) Multiple-task waiting and single-task signaling: conditional variable(broadcasting) and semaphore(flushing)
3) Single-task waiting and multiple-task signaling: ONE semaphore with iterative checking of MULTIPLE input sources from ISRs; or ONE counting semaphore associated with each input queue, which might need to be redesigned.
4) Flow control: counting semaphore
5) Rendezvous and barrier sync: TWO semaphores without comm.; two conditional variables; TWO message queues of one entry each for comm.; a callback mechanism with comm.; and a barrier for multiple tasks.
6) Event occurrence: conditional variable or event object; counting semaphore (which can count the number of times the event has happened)

* Issues
- Race
A race occurs when two or more threads try to get something or do something at the same time, and only one can win. Races are the result of resource conflicts caused by insufficient sync. They are difficult to diagnose since every thread keeps going.
- Deadlock
Deadlock is the result of resource conflicts caused by wrong sync. It is easy to detect, since the threads stop. In general, when multiple mutexes are used sequentially for dependent resources, deadlock is likely. The general solutions are:
1) fixed locking hierarchy and
2) try and backoff with reverse unlocking sequence.
Note: lock chaining is a special case of locking hierarchy in which the scopes of two locks overlap: with one mutex locked, the code enters a region where another mutex is required; after successfully locking that second mutex, the first is no longer needed and can be released. A sketch of solution 2) follows.
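
A POSIX-threads sketch of try-and-back-off (illustrative; error handling trimmed):

    #include <pthread.h>
    #include <sched.h>

    /* Take two dependent locks without a fixed hierarchy: if the
       second is busy, release the first (reverse unlocking order)
       and retry, so no thread holds one lock while blocking on
       the other. */
    void lock_both(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        for (;;) {
            pthread_mutex_lock(a);
            if (pthread_mutex_trylock(b) == 0)
                return;                  /* got both locks */
            pthread_mutex_unlock(a);     /* back off */
            sched_yield();               /* let the other side finish */
        }
    }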

* Sleep and Non-sleep
Many facilities may put the current thread to sleep, like semaphores, mutexes, etc. However, sleeping is not allowed in the following cases:
- ISR
- critical sections

* Semaphore, mutex and conditional variable
- Binary semaphores are primitive and error-prone to use, since programmers must ensure both mutual exclusion and synchronization of threads by combining multiple semaphores. A semaphore has no owner, and anyone can release it. Also, it has no associated predicate.
- A mutex is for mutual exclusion, and it has an owner, which must release the mutex after using it. Mutexes support task-deletion safety, recursive access, and priority inheritance against priority inversion.
- A conditional variable is for thread synchronization and is tied to a SINGLE (not multiple) predicate over the shared resource. Always test your predicate, and then test it again, by using a while loop (see the sketch below). It must be associated with a mutex, but a mutex may have any number of condition variables associated with it.
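
The while-loop idiom with POSIX threads (a minimal sketch):

    #include <pthread.h>

    pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    int data_ready = 0;                /* the single predicate */

    void consumer(void)
    {
        pthread_mutex_lock(&m);
        while (!data_ready)            /* re-test after every wakeup */
            pthread_cond_wait(&cv, &m);
        /* predicate is true here, and the mutex is held */
        data_ready = 0;
        pthread_mutex_unlock(&m);
    }

    void producer(void)
    {
        pthread_mutex_lock(&m);
        data_ready = 1;                /* set the predicate first */
        pthread_cond_signal(&cv);      /* then wake the waiter */
        pthread_mutex_unlock(&m);
    }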

* Spinlock and monitor
- Spinlocks are effective only in very restricted circumstances: the critical section must be short, there must be significant contention for the lock, and you must be running on more than one CPU, since a spinlock does not block the task. The trylock operation of a mutex can be used to implement one. A spinlock can be used in an ISR for resource sync thanks to its non-sleeping, non-interruptible behavior.
- Monitors integrate the shared resource with its mutual-exclusion utilities. In many situations, monitors are an excellent synchronization technique: by forcing the programmer to manipulate shared data via monitored code, you ensure that there can never be a failure to lock its use, and that the access methods are well-defined. Monitors cannot handle all types of locking situations, however; when you want to use the "trylock" functions, you must use regular mutexes.

* Initialization/destroy and blocking/non-blocking
All these utilities need to be initialized before use and deleted at the end. Memory may be allocated during initialization, and that memory should be freed in the delete routine. Also, one must make sure that when they are deleted no other tasks are waiting on or holding them, so that no deadlock happens. One general deletion sequence is:
{
     lock(mutex);
     unlock(mutex);
     delete(mutex);
}
On the other hand, when these utilities are used, two types of functions are provided in general, i.e. blocking access and non-blocking or timed-out access.

Concurrency Control

* In order to create concurrent execution context, and control how they operate within the library and application, three facilities need to be provided:
- Execution context: the state of a concurrent entity.
- Scheduling: to determine which execution context runs, and/or on which CPU
- Synchronization: to cooperate between concurrent execution contexts

* Thread
- All threads in one process share all of its resources, like the virtual address space, file descriptors/pointers, working directory, etc.
- A thread's registers, PC, stack and heap allocations are private to the thread unless it communicates pointers to that memory to other threads.
- A thread is fully synchronous with itself unless some asynchronous event happens, like an ISR.

* Thread Working Modes
- Pipeline
A task is broken into a series of suboperations, each of which is handled in series, but concurrently, by a different thread.
- Client/Server
Multiple clients contact an independent server anonymously, one request per job (multiple to one).
- Manager/worker
A single thread, the manager, assigns work to other threads, the workers (one to multiple). Typically the manager handles all input and parcels out work to the other tasks. At least two forms of the manager/worker model are common: the static worker pool and the dynamic worker pool.
- Work Crew
Multiple threads work together independently. It is a special form of the manager/worker model in which the manager joins the workers after creating them.

Memory Organization

* Memory Hierarchy
- Objective: Reduce the performance gap between memory and CPU
- Principle: The principle of locality (temporal and spatial)
- Why a hierarchy? Trade-off between performance and cost
- Result: Program code and data are spread among multiple locations in the memory hierarchy.
- Order in performance: Register, cache, main memory and disk
- Performance metrics:
latency (response time, execution time): how fast is one access?
throughput (bandwidth): how much per unit time?
For example, registers have low latency and high throughput, while a cache has low latency but lower throughput.
Sometimes power consumption is important too.
- Two things are very important in memory hierarchy: Cache and virtual memory

* Cache
- Basic unit
Block/line of multiple bytes. The address of the first byte is aligned to the cache line size.
- Where to put cache lines?
Direct mapping (one-way set associative, n sets), (n/m)-way set associative (m sets), and fully associative (one set).
- How to locate cache lines?
Memory address: Tag + Set index + Block offset (see the sketch after this list)
- Replacement rule
FIFO, LRU (Least Recently Used), and random
- Write policy
Write-back and write-through
Write-allocate and no-write-allocate
Write buffers and non-blocking cache operations
- Cache size
Cache line = cache tag + state bits (valid; dirty for write-back) + cache data
Cache size in bytes = block size * associativity * number of sets
- Cache Organization
Two logical dimensions:
Horizontal: associativity (ways)
Vertical: sets
Each way could be one cache bank. Note a large bulk of data with sequential addresses is stored vertically (across sets), not horizontally.
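
A C sketch of the tag/set/offset decomposition for a cache with 64-byte lines and 128 sets; with 4 ways this is the 64 * 4 * 128 = 32KB cache of the size formula above (parameters are illustrative):

    #include <stdint.h>

    #define LINE_BYTES 64u    /* block size -> 6 offset bits */
    #define NUM_SETS   128u   /* sets       -> 7 index bits  */

    /* Split a physical address into cache coordinates. */
    void cache_decompose(uint32_t addr, uint32_t *tag,
                         uint32_t *set, uint32_t *offset)
    {
        *offset = addr % LINE_BYTES;               /* block offset */
        *set    = (addr / LINE_BYTES) % NUM_SETS;  /* set index    */
        *tag    = addr / (LINE_BYTES * NUM_SETS);  /* the rest     */
    }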

* Virtual Memory
- Virtual address space
Each process has its own virtual address space, whose size is determined by the address width of the system. The process can use this address space fully; its code and data can be stored anywhere in the space. The text, data and bss regions, the stack, and the heap are organized in the virtual address space, not in the physical address space. Therefore the addresses a program sees are virtual addresses, not physical addresses of main memory.
- Virtual memory
Virtual memory expands the concept of main memory to include the disk; main memory can be considered a cache of the disk.
- Advantage of virtual memory
1) Make multiple processes share main memory
2) No need to worry about the size of main memory
3) Make it possible to relocate code and data (swapping)
- Virtual address to physical address
Caches and main memory are accessed with physical addresses, so a virtual address must be translated to a physical address after it leaves the CPU. The TLB and the per-process page table serve this purpose: the page table maps virtual addresses to physical addresses. Since the page table may be large and is itself located in main memory, every address access would otherwise need two memory accesses; exploiting the principle of locality, the TLB (translation lookaside buffer) does the job quickly.
- Basic unit
Page or segment; the size of page could be the size of one cache bank (sets * block size)
- Where to put pages?
Fully associative
- How to locate pages?
TLB and Page Table for each process
- Replacement rule
LRU
- Write policy
Write back
- Virtual address format
Virtual address = virtual page number + page offset
The virtual page number is used to look up the page table.
- TLB
The TLB is the cache of the process's page table in main memory, and it has the same properties as a cache. The exception is that a cache is controlled completely by hardware, whereas once a page fault happens the OS takes charge, since the cost of accessing the disk is high enough to justify switching to another process context.

* Main Memory Organization
Main memory can be divided into multiple contiguous address sections, or banks. Each memory bank can be mapped to memory chips. Inside a memory chip, bits can be organized into banks with addresses interleaved among them to improve access bandwidth. The two kinds of "banks" here differ in their address mapping.