Wednesday, August 29, 2007

Digital Signal Processing

*Compared to IIR filters, FIR filters offer the following advantages:
- They can easily be designed to be "linear phase" (and usually are). Put simply, linear-phase filters delay the input signal but don't distort its phase. A FIR filter is linear-phase if (and only if) its coefficients are symmetrical around the center coefficient. For a FIR filter with N taps, the delay is (N - 1) / (2 * Fs), where Fs is the sampling frequency. So the center tap of the FIR (1-D or 2-D) generally determines where the output is aligned in time.
- They can be designed efficiently by inverse-transforming the desired frequency response and windowing the result. And they are simple to implement: on most DSP microprocessors, the FIR calculation can be done by looping over a single MAC instruction (see the sketch after this list).
- They are suited to multi-rate applications. By multi-rate, we mean either "decimation" (reducing the sampling rate), "interpolation" (increasing the sampling rate), or both. Whether decimating or interpolating, the use of FIR filters allows some of the calculations to be omitted, thus providing an important computational efficiency. In contrast, if IIR filters are used, each output must be individually calculated, even if that output will be discarded (its feedback still gets incorporated into the filter).
- They have desirable numeric properties. In practice, all DSP filters must be implemented using "finite-precision" arithmetic, that is, a limited number of bits. The use of finite-precision arithmetic in IIR filters can cause significant problems, such as overflow and limit cycles, due to the use of feedback. FIR filters have no feedback, so they can usually be implemented using fewer bits, and the designer has fewer practical problems to solve related to non-ideal arithmetic.
- They can be implemented using fractional arithmetic. Unlike IIR filters, it is always possible to implement a FIR filter using coefficients with magnitude less than 1.0. (The overall gain of the FIR filter can be adjusted at its output, if desired.) This is an important consideration when using fixed-point DSPs, because it makes the implementation much simpler.

- Compared to IIR filters, FIR filters sometimes have the disadvantage that they require more memory and/or calculation to achieve a given filter response characteristic. Also, certain responses are not practical to implement with FIR filters.

Reference: DSPGuru FIR FAQ
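
To make the linear-phase and MAC-loop points concrete, here is a minimal direct-form FIR filter in C (a sketch; the 5-tap symmetric coefficient values are made up for illustration):

#include <stddef.h>

/* y[n] = sum_k h[k] * x[n-k]; linear phase because h[] is symmetric,
   group delay = (N-1)/(2*Fs) seconds for N taps at sample rate Fs */
static const double h[5] = { 0.1, 0.2, 0.4, 0.2, 0.1 }; /* symmetric */
static double x[5];   /* delay line, x[0] holds the newest sample */

double fir_step(double in)
{
    double acc = 0.0;
    for (size_t k = 4; k > 0; k--)
        x[k] = x[k - 1];          /* shift the delay line */
    x[0] = in;
    for (size_t k = 0; k < 5; k++)
        acc += h[k] * x[k];       /* the MAC loop a DSP runs in one instruction */
    return acc;
}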

Digital Signal Processor

* RISC for data manipulation and DSP for math processing

* Harvard Arch
Separate memories for data and program instructions, with separate buses for each.

* Super Harvard Arch
Harvard Arch + Instruction cache + I/O DMA

The filter coefficients are put in program memory, while the input signal is kept in data memory. For one MAC operation, one input sample must come over the data memory bus, but two values must come over the program memory bus (the program instruction and the coefficient). The first time through a loop, the program instructions must be passed over the program memory bus; this results in slower operation because of the conflict with the coefficients that must also be fetched along this path. However, on additional executions of the loop, the program instructions can be pulled from the instruction cache. This means that all of the memory-to-CPU information transfers can be accomplished in a single cycle: the sample from the input signal comes over the data memory bus, the coefficient comes over the program memory bus, and the program instruction comes from the instruction cache. In the jargon of the field, this efficient transfer of data is called a high memory-access bandwidth.

To improve throughput, an I/O controller with DMA is connected to data memory. This is how signals enter and exit the system.

* Circular Buffer and Data Address Generator
There is a Data Address Generator (DAG) for each of the program and data memories. These control the addresses sent to the program and data memories, specifying where information is to be read from or written to. DSPs are designed to operate with circular buffers and benefit from extra hardware to manage them efficiently. This avoids spending precious CPU clock cycles keeping track of how the data are stored.
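
A software analogue of the DAG's modulo addressing, as a sketch (on a real DSP the wrap-around below is done by the address-generator hardware at no cycle cost):

#define BUF_LEN 64

static float buf[BUF_LEN];
static unsigned int widx;               /* next write position */

void circ_put(float sample)             /* store newest, overwrite oldest */
{
    buf[widx] = sample;
    widx = (widx + 1) % BUF_LEN;        /* the modulo a DAG hides */
}

float circ_get(unsigned int age)        /* sample written 'age' steps ago */
{
    return buf[(widx + BUF_LEN - 1 - age) % BUF_LEN];
}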

There are many circular buffers in a DSP. Some DSP algorithms are best carried out in stages. For instance, IIR filters are more stable if implemented as a cascade of biquads (a stage containing two poles and up to two zeros). Multiple stages require multiple circular buffers for the fastest operation. The DAGs are also designed to efficiently carry out the Fast Fourier Transform: in this mode, they are configured to generate bit-reversed addresses into the circular buffers, a necessary part of the FFT algorithm. In addition, an abundance of circular buffers greatly simplifies DSP code generation, both for the human programmer and for high-level language compilers, such as C compilers.

* ALU, MAC and Barrel Shifter
An accumulator with extra guard bits is built into the multiplier to reduce the round-off error associated with chained fixed-point math operations, especially for IIR filters. The multiplier is designed with combinational logic and can execute a MAC operation in one cycle.

In some cases, shadow registers can be used for all of the CPU's key registers. These are duplicate registers that can be switched with their counterparts in a single clock cycle. They are used for fast context switching, i.e., the ability to handle interrupts quickly. When an interrupt occurs in a traditional microprocessor, all the internal data must be saved before the interrupt can be handled, usually by pushing all of the occupied registers onto the stack, one at a time. In comparison, with shadow registers an interrupt is handled by moving the internal data into them in a single clock cycle. When the interrupt routine is completed, the registers are just as quickly restored.

Tuesday, August 28, 2007

I/O Subsystem

* I/O Subsystem
- Interfaces between a device and the main processor occur in two ways: port mapped and memory mapped.
- I/O devices are classified as either character-mode devices or block-mode devices. The classification refers to how the device handles data transfer with the system.
- I/O devices can be active (they issue interrupts, periodically or aperiodically) or passive (no interrupts; the CPU must poll to read).
- DMA controllers allow data transfers that bypass the main processor.
- I/O Modes
1) Interrupt-driven I/O (active)
in which an input buffer is filled at device-interrupt time (Ready) via DMA and is emptied by processes that read the device, typically triggered by a DMA-complete interrupt; an output buffer is filled by processes that write to the device and is emptied at device-interrupt time (Ready) via DMA. This might overload the CPU if the I/O traffic is heavy.
2) Polling in real-time (passive)
Within the regular interrupt of a timer, the CPU polls every I/O device.
- Task Assignments for I/O
1) ISR associated tasks for active I/O devices
2) Timer triggered tasks for polling of passive I/O devices
3) Resource control task to control a shared I/O device or a group of devices
4) Request dispatcher to dispatch requests from devices to multiple tasks
- The I/O subsystem is closely associated with ISRs.
- I/O subsystems must be flexible enough to handle a wide range of I/O devices, which hides device peculiarities from applications.
- The I/O subsystem maintains a driver table that associates uniform I/O calls, like create, delete, read, write, ioctl, etc., with driver-specific I/O routines (see the sketch after this list).
- The I/O subsystem maintains a device table and forms an association between this table and the driver table.
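
In C, such a driver table is naturally an array of function-pointer records; the structure below is a hedged sketch, not any particular RTOS's actual layout:

typedef struct {
    const char *name;                      /* device name, e.g. "uart0" */
    int (*create)(void);
    int (*destroy)(void);
    int (*read)(void *buf, unsigned int len);
    int (*write)(const void *buf, unsigned int len);
    int (*ioctl)(int cmd, void *arg);      /* driver-specific routines */
} driver_entry;

static driver_entry driver_table[16];      /* entries filled in at driver registration */

int io_read(int dev, void *buf, unsigned int len)
{
    return driver_table[dev].read(buf, len);  /* uniform call -> driver routine */
}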

Exception and Interrupt

* Classification of General Exceptions
- Async non-maskable
- Async maskable
- Sync precise: the PC points to the exact instruction that caused the exception
- Sync imprecise

* Priority
- Async non-maskable
- Sync precise
- Sync imprecise
- Async maskable
- Programmable Task

* Processing General Exceptions
- Install exception handlers
This requires replacing the appropriate vector table entry (indexed by IRQ) with the address of the desired ESR or ISR. Install handlers when the device is actually used, rather than at initialization, in order to share the scarce interrupt resources.
- Exception frame or interrupt stack
The main reasons for needing an exception frame are 1) to handle nested exceptions, and 2) the portion of the ISR written in C/C++ requires a stack on which to pass function parameters and to invoke library functions. Do not use the task stack for the exception frame: that can cause a stack overflow which is very hard to debug (it depends on which task, which interrupt, and their frequency and timing, etc.).
- Three ways to mask interrupts (no effect on non-maskable interrupts)
1) Disable the device
2) Disable the interrupts of the same or lower priority levels
3) Disable global system-wide interrupts
They might be used because:
1) the ISR tries to reduce the total number of interrupts raised by the device,
2) the ISR is non-reentrant,
3) the ISR needs to perform some atomic operations
Masking is done by setting the interrupt mask register, which should be saved at the beginning of an interrupt handler and restored at the end, as sketched below. An ISR can thus prevent other interrupts from occurring (and needs to save the status register), an ability exception handlers do not have.
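
The save/mask/restore pattern, sketched with hypothetical accessors read_int_mask/write_int_mask (the real accessors are processor-specific):

extern unsigned int read_int_mask(void);      /* hypothetical accessors */
extern void write_int_mask(unsigned int m);
#define MASK_ALL 0xFFFFFFFFu

void atomic_update(void)
{
    unsigned int saved = read_int_mask();     /* save the mask register */
    write_int_mask(MASK_ALL);                 /* mask interrupts        */
    /* ... perform the atomic operations ... */
    write_int_mask(saved);                    /* restore at the end     */
}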
- Exception processing time
The interrupt frequency of each device that can assert an interrupt is very important for the ISR design. It is possible for the entire processing to be done within the context of the interrupt, that is, with interrupts disabled. Notice, however, that the processing time for a higher priority interrupt is a source of interrupt latency for the lower priority interrupt. Another approach is to have one section of ISR running in the context of the interrupt and another section running in the context of a task. The first section of the ISR code services the device so that the service request is acknowledged and the device is put into a known operational state so it can resume operation. This portion of the ISR packages the device service request and sends it to the remaining section of the ISR that executes within the context of a task. This latter part of the ISR is typically implemented as a dedicated daemon task. Note, however, the interrupt response time increases. The increase in response time is attributed to the scheduling delay, and the daemon task might have to yield to higher priority tasks. In conclusion, the duration of the ISR running in the context of the interrupt depends on the number of interrupts and the frequency of each interrupt source existing in the system.
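
A sketch of this two-section ISR structure, using hypothetical RTOS calls queue_send_from_isr/queue_recv (real message-queue APIs differ per kernel):

typedef struct { int dev_id; int request; } dev_msg;

extern int queue_send_from_isr(int q, const void *msg, unsigned int len);
extern int queue_recv(int q, void *msg, unsigned int len);   /* blocking */
#define DEV_QUEUE 1

void dev_isr(void)                 /* section 1: interrupt context, keep short */
{
    dev_msg m = { 0, 0 };
    /* acknowledge the device and put it into a known state, then: */
    queue_send_from_isr(DEV_QUEUE, &m, sizeof m);
}

void dev_daemon_task(void)         /* section 2: dedicated daemon task */
{
    dev_msg m;
    for (;;) {
        queue_recv(DEV_QUEUE, &m, sizeof m);  /* scheduling delay adds latency */
        /* ... lengthy processing with interrupts enabled ... */
    }
}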

* General Guides
On architectures where interrupt nesting is allowed:
- An ISR should disable interrupts of the same level if the ISR is non-reentrant.
- An ISR should mask all interrupts if it needs to execute a sequence of code as one atomic operation.
- An ISR should avoid calling non-reentrant functions. Some standard library functions are non-reentrant, such as many implementations of malloc and printf. Because interrupts can occur in the middle of task execution and because tasks might be in the midst of the "malloc" function call, the resulting behavior can be catastrophic if the ISR calls this same non-reentrant function.
- An ISR must never make any blocking or suspend calls. Making such a call might halt the entire system.
- If an ISR is partitioned into two sections with one section being a daemon task, the daemon task does not have a high priority by default. The priority should be set with respect to the rest of the system.

Monday, August 27, 2007

Video Coding

A popular and effective video coding method is the one that uses block-based temporal prediction and transform coding. This method is essentially the core of all the international video coding standards.

* Block-Based Hybrid Video Coding
- Block-based motion estimation => MV
- Motion-compensated prediction => predicted block
- DCT on the prediction-error block
- Quantization of DCT coeffs
- Run-length scanning and variable-length coding of MVs and quantized coeffs
- Loop filtering to reduce block artifacts in the reconstructed picture
Thus the temporal, spatial and symbol redundancies are reduced as much as possible.

The MB is the basic unit of motion estimation and compensation. In general, the three MB modes are I-, P- and B-mode. Because the MVs and DC coeffs (the latter only for I-mode MBs) of adjacent MBs or blocks are similar, they are typically coded predictively, but only within the same GOB or slice, to confine error propagation. This is extended in H.264 and called intra prediction. The quantization parameter (factor) is also determined at the MB level.

A frame can be coded entirely in intra mode; such a picture is called an I-picture or intra-picture. An MB in a P-picture can be coded in I- or P-mode, while an MB in a B-picture can be in I-, P- or B-mode, depending on the prediction error. Both P- and B-pictures are called inter-pictures. Their picture structure can be either field or frame.

For ME/MC, the key parameters are block size, search range, search precision, MV prediction, block prediction, the number of MVs, etc. Improvements include unrestricted motion vectors, OBMC, global MC, multiple references, etc. For the DCT, more advanced transforms can be used, like the integer transform, wavelets, etc. For quantization, the stepsize (QP) and non-linear or vector quantization have been explored. Alternate (horizontal or vertical) scanning has been proposed, and so have arithmetic coding and 3-D VLC in place of Huffman coding. For improved performance, the choice among different parameters can be determined by a rate-distortion optimization approach.

One important issue in video coding is rate control (RC), which refers to how to code a video so that the resulting bitstream satisfies a target bit rate. This leads to how to choose the coding parameters, like frame rate, MB mode, QP, etc., to meet the rate constraint. Because of the VLC video coder and varying scene complexity, the target bit rate can only be an average rate over short intervals of time; the variability of the bit rate within each interval must be handled by a smoothing buffer following the encoder. The buffer size is determined by the time interval and the delay requirement of the application. RC is typically accomplished in three steps (see the sketch after this list):
- Update the target average bitrate for each short time interval, based on the bandwidth, delay requirement and buffer fullness
- Determine the coding mode for picture (I, B and P) and the target bit budget for each picture to be coded in this interval, based on the target average rate for the interval and the current buffer fullness
- Determine the coding mode and QP for each MB in a picture to meet the target rate for this frame
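
A toy illustration of steps 2 and 3; the constants and the linear buffer-feedback rule here are illustrative assumptions, not taken from any standard:

/* target bits for the next picture, from the interval's average rate
   and the buffer state: spend less while the buffer is filling up */
long picture_budget(long avg_bits_per_pic, long fullness, long buf_size)
{
    return avg_bits_per_pic - (fullness - buf_size / 2) / 4;
}

/* crude MB-level QP adaptation toward the picture target */
int next_qp(int qp, long bits_spent, long bits_target)
{
    if (bits_spent > bits_target && qp < 31) qp++;  /* too costly: coarser */
    if (bits_spent < bits_target && qp > 1)  qp--;  /* under budget: finer */
    return qp;
}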

Video Bit Stream Syntax

In order to support different applications, the syntax of video bit stream must be flexible. This is achieved by having a hierarchy of different layers that each start with a header. Each layer performs a different logical function. Most headers can be uniquely identified in the bit stream since they begin with a start code and a start code identifier.
* Sequence and VOL for MPEG-4 Visual
Sequence header and end-of-sequence code. Global parameters are contained in the sequence header and its extension.

* Group of Picture (GOP)
A GOP allows random access and interaction with the video sequence, since GOPs are coded independently of each other. For MPEG-4 Visual the equivalent is the GOV (group of VOPs); H.26x has no GOP, since those standards target realtime interactive applications. The first picture in a GOP must be coded independently. A closed GOP means the B-pictures at the end of the GOP cannot use the I-picture of the following GOP as a reference; it is only used for video editing.

* Picture
The basic coding unit of a video sequence. It consists of three components: Y, Cb and Cr. The picture header indicates the picture type, picture structure and others. A VOP is the coding unit in MPEG-4 Visual.

* GOB, Slice and Video Packet: Groups of MBs
- A GOB has a fixed structure of three lines of MBs, with eleven MBs per line (H.26x).
- A slice is one line of MBs with variable length. Note its definition in MPEG-1 and MPEG-2 differs. DC components of intra MBs are coded predictively within one slice, and MVs of non-intra MBs are coded predictively within one slice too.
- A video packet is delimited by a bit-count threshold defined by the encoder, instead of an MB count.
All three structures are used for resync and error recovery.

* Macro Block (MB)
Eight blocks for 4:2:2 and six blocks for 4:2:0. The basic unit for motion compensation. No start code for MBs.

* Block
Transform and entropy coding unit. No start code for blocks.

Sunday, August 26, 2007

MPEG-1/2/4

MPEG-1, ISO 11172, was designed for progressively scanned video at a target bit rate around 1.2 Mbps (1.5 Mbps in total, including audio and data). It had to support basic VCR-like interactivity: fast forward, fast reverse, etc. MPEG-1 consists of five parts; only the video part is considered here.

* Picture Formats Supported
SIF, progressive scanning

* Video Coding Tools
- I, P and B frames. Bidirectional motion compensation is used for B-frames, which need two MVs. The coding order therefore differs from the display order.
- For I-frames, a weighting matrix is used to adapt the DCT coeffs to the human visual system before their uniform quantization. The DC of an I-block is coded predictively.
- Half-pel prediction precision; the MV range is +/-64 pels (7 bits).
- GOP structure for random access and interactivity

MPEG-2, ISO 13818 or H.262, was designed to extend MPEG-1 functionality to interlaced pictures, primarily using the BT601 4:2:0 format. The target was TV-quality pictures at data rates of 4~8 Mbps and high-quality pictures at 10~15 Mbps. MPEG-2 consists of nine parts.
* Picture Formats Supported
SIF, BT601, SMPTE 296/295M, progressive and interlaced scanning

* Video Coding Tools
- Chroma samples in 4:2:0 are shifted horizontally by 0.5 pel compared to MPEG-1, H.261 and H.263
- Two picture structures: frame picture and field picture with I, P and B picture mode
- More prediction modes for interlaced video
1) Field prediction for field pictures: the reference field is chosen more flexibly for P- and B-pictures
2) Field prediction for frame pictures: split the MB 16X8 and apply field prediction. Two MVs for P- and four MVs for B-pictures
3) Dual prime for P-pictures: an MV and a DMV for the two reference fields; the two field blocks are averaged to obtain the prediction block
4) 16X8 motion prediction: the two halves of one MB in a field picture are predicted with two or four MVs, due to the low vertical resolution of field pictures
- Field DCT
Reorganize the pels in the MB to increase the vertical correlation within a block
- Alternate scan
Besides the zig-zag scan, an alternate scan can also be used for frame pictures to better exploit vertical correlation.
- Modes of scalability: data partitioning (like spectral selection in progressive JPEG), spatial (like hierarchical coding in JPEG), temporal (subsampling in time), and SNR (like successive approximation in progressive JPEG)

- For a progressive sequence, all pictures are frame-pictures. For an interlaced sequence, a picture can be either a frame or a field picture. At the beginning of an interlaced sequence, the field parity order is specified and kept throughout, like {top, bottom, top, ...} for NTSC or {bottom, top, bottom, ...} for PAL and 1080i. Field pictures occur in pairs of opposite parity but the same picture mode, so the GOP is counted in units of frames. Note the luma and chroma sample lattices for field pictures with 4:2:0 color sampling: the top-field chroma samples are shifted up by one quarter luma sample relative to the field sampling grid, while the bottom-field chroma samples are shifted down by one quarter luma sample. Frame and field describe how pictures are stored and have almost nothing to do with coding.
- In order to display a progressive sequence on interlaced devices, 2-3 pull-down is done: half of the frames repeat one of their fields to generate the required fields (24 * 2 + 12 = 60 for NTSC).
- For a field picture, the predictor must be a field predictor for all MBs. For a frame picture, the predictor can be chosen MB by MB, either field-prediction or frame-prediction mode, based on the motion activity between the two fields. The MB type (I, B or P) and the prediction mode together determine the exact prediction type of a single MB.
- Frame DCT can be used in a progressive sequence, and in an interlaced sequence as long as the motion activity between the two fields is small. In an interlaced sequence, field DCT can be used in frame pictures when the motion activity is high. It is also possible for a field picture to use frame DCT, for example in a 2-3 pull-down sequence.
- The horizontal search range is +/-10 bits and the vertical one +/-7 bits, with half-pel precision.
- Unlike JPEG, the quantization tables and VLC tables in MPEG are standardized, even though they can be overridden within the video sequence. Like JPEG, intra MBs are perceptually weighted during quantization. Non-intra MBs, however, contain prediction errors that are not viewed directly, so their quantization matrix is flat. Furthermore, to perform adaptive quantization for rate control, a quantization scale factor (MQUANT) is used to scale the standardized quantization tables up and down.

MPEG-4 is designed for a new generation of highly interactive multimedia applications, while still supporting traditional applications. It is an object-based coding standard for video, audio, and graphics.

Profiles and Levels
* Profiles describe the tools required for decoding a bit stream. For MPEG-2 they are primarily the picture mode, color subsampling format, and scalability; MPEG-4 has more.

* Levels describe the parameter ranges for these tools: the picture size, frame rate and bit rate requirements.

For MPEG-2, common combinations are MP@ML and MP@HL. For MPEG-4, they are the simple profile and the advanced simple profile.

H.261 and H.263

H.261 was published in 1990 to enable video conferencing over 1~30 ISDN B channels (p x 64 kbps).
* Picture Formats Supported
QCIF(7.5Hz) and CIF(15Hz), progressive scanning

* Video Coding Tools
- I- and P-pictures only; no B-pictures, due to the realtime requirement
- Intra and inter MB modes
- Forward motion compensation only, with a precision of 1 pel
- The motion vector range is +/-16 pels
- Two quantizers for DCT coeffs: a uniform quantizer of stepsize eight for the DC in intra mode, and a midtread quantizer with variable stepsize for all other cases
- Run-length and variable-length coding
- A loop filter is used to reduce the prediction noise

H.263/H.263+/H.263++ are based on the framework of H.261. These standards were completed between 1995 and 2000.
* Picture Formats Supported
Sub-QCIF(128X96; 4:2:0), QCIF and CIF, progressive scanning

* Video Coding Tools
- I and P picture only
- Intra and inter MB modes
- Forward motion compensation mode only with precision of 0.5 pel
- The MV predictor is the median of the MVs of three neighboring MBs
- Improved VLC: 3D VLC, i.e. (last, run, level) run-length coding
- Reduced GOB structure

In addition to these, H.263 offers a list of optional features that are defined in annexes to the standard.
- Unrestricted motion vectors. The MV range is [-31.5,31].
- Syntax-based arithmetic coding
- Advanced prediction mode: OBMC, and four block MVs for one MB
- Advanced Intra Coding
- PB picture

The system/signaling part for audiovisual comm. associated with H.261/263 is H.323 for Internet comm. and H.324 for PSTN comm.

Video Quality Measure

* Mean Square Error in Two Video Sequences of N Frames

* Peak Signal to Noise Ratio for One Video Sequence of N Frames (dB)
PSNR = 10 * log10(I_max^2 / MSE), where the maximal intensity I_max is 255 (or 235) for the luminance component.
- Calculate the MSE of the whole sequence (not of one frame) first, then convert that MSE to obtain the PSNR of the sequence (see the sketch below).
> 40dB - excellent
30~40dB - good
20~30dB - poor
< 20dB - unacceptable
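
A direct C rendering of the rule above (a sketch assuming 8-bit luma frames; the MSE is pooled over the whole sequence before the single conversion to dB):

#include <math.h>
#include <stddef.h>

double sequence_psnr(const unsigned char *ref, const unsigned char *test,
                     size_t pixels_per_frame, size_t nframes)
{
    double sse = 0.0;
    size_t n = pixels_per_frame * nframes;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)test[i];
        sse += d * d;                       /* pooled over all frames */
    }
    return 10.0 * log10(255.0 * 255.0 / (sse / (double)n));
}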

Digital Video

* Characterization of a Digital Video
- frame rate (frames/sec)
- line number (lines/frame)
- sample number per line
- image aspect ratio
- pixel aspect ratio
It is related to the IAR: PAR = IAR * active lines per frame / active samples per line. NTSC is 8/9 while PAL is 16/15. For all HDTV formats, it is 1. The PAR must be matched by the display device.
- chrominance subsampling format
4:4:4, 4:2:2 (horizontal subsampling only), 4:1:1 (horizontal subsampling only) and 4:2:0 (the chrominance phase might differ between standards)
- bit number per pixel
4:4:4 => 24 bits/pixel
4:2:2 => 16 bits/pixel = (4*8+4*8)/4
4:1:1 and 4:2:0 => 12 bits/pixel

* Digital Video Format
- ITU-T BT601 for SDTV
Uniform sampling frequency for NTSC and PAL: 13.5MHz
Pixel number per line: 858/NTSC and 864/PAL with active pixel of 720/line
Active line numbers: 480 of 525/NTSC and 576 of 625/PAL
Raster scanning: 50I/60I
Color coordinate: YCbCr; 4:4:4, 4:2:2(by default) and 4:2:0
- ITU-T BT709 for HDTV
1920X1080; 24P/30P/60I; 4:2:2; 16:9; 74.25MHz sampling rate; color coordinates differ from BT601
- SMPTE 296M
1280X720; 24P/30P/60P; 4:2:0; 16:9
- SMPTE 295M
1920X1080; 24P/30P/60I; 4:2:0; 16:9
- SIF (for VCD/CVD/SVCD)
352X240/288; 30P/25P; 4:2:0; 4:3
- 4CIF
704X480/576; 50I/60I; 4:2:0; 4:3
- CIF
352X240/288; 30P; 4:2:0; 4:3
- QCIF
176X144; 30P; 4:2:0; 4:3
- Sub QCIF
128X96; <30P; 4:2:0; 4:3

* Bit Rate for Some Applications
- HDTV: SMPTE 296/295M, MPEG-2 20~45Mbps
- Video Studio: BT601, MPEG-2 15~50Mbps
- SDTV, DVD: BT601, MPEG-2 4~8Mbps
- VCD, WWW: SIF, MPEG-1 1.5Mbps
- Video conference over ISDN/Internet: CIF, H.26x, 128~384Kbps
- Video telephony over wired/wireless modem: QCIF, H.263, 20~64Kbps
- DVCPRO 50: BT601(4:2:2), MPEG-2, 50Mbps
DVCPRO 25: BT601(4:1:1), MPEG-2, 25Mbps

Analog Video

* Two Raster Scans
Progressive Scan and Interlaced Scan

* Characterization of a Video Raster
- frame rate (frames/sec)
- line number (lines/frame)
These two parameters define the temporal and vertical sampling rates of a raster scan. From them, many other parameters could be derived, like line rate.
- image aspect ratio

* Retrace and Sync
Suppose fl is the line rate (lines/sec) and Tl is the line period. Note that Tl includes the horizontal retrace time; similarly, the frame period includes the vertical retrace time. Within the H/V retrace durations, the line/field/frame syncs are added. Note the analog scanning signal uses an inverted black/white voltage level.

* Analog Color TV System
- NTSC: 525 lines/frame; 29.97 frames/sec, 59.94 fields/sec; 4:3; YIQ; 6 MHz composite bandwidth.
- PAL: 625 lines/frame; 25 frames/sec, 50 fields/sec; 4:3; YUV; 8 MHz composite bandwidth.
Note the line numbers for these systems are not the active lines; they include the vertical retrace periods. For NTSC, the active lines are 483/frame.

Color Coordinates

* Two Attributes
which describe the color sensation of a human being: luminance and chrominance. Luminance refers to the perceived brightness of the light and it is proportional to the total energy in the visible band. Chrominance describes the perceived color tone of a light, which depends on the wavelength composition of the light.

* Trichromatic Theory of Color Mixture
Most colors can be produced by mixing three properly chosen primary colors.
- For illuminating light source, they are R, G and B
- For reflecting light source, they are C, M and Y (plus K for printing application)
- All present display systems use RGB primaries, while printers use CMYK primaries.

* Color Specification by Luminance And Chrominance
In many applications, it is desirable to describe a color in terms of its luminance and chrominance content separately, to enable more efficient processing and transmission of color signals.

* Composite versus Component Video
A video in the format of RGB or YUV is called component video. Three signals could be multiplexed into a single signal to construct a composite video, which relies on the property that the chrominance signals have a significantly smaller bandwidth than the luminance component. Thus, it could be transmitted or stored more efficiently at a cost of image quality. As a compromise between data rate and image quality, S-video consists of two components: the luminance and a single chrominance component that is the multiplex of two original chrominance signals.

* Gamma Correction
In reality, the RGB output signals from most cameras are not linearly related to the actual color values, and most display devices likewise have a nonlinear relation between input and displayed color intensity. To present true colors, compensation must be applied to the camera output, and before real image values are sent for display, compensation must be made for the gamma effect of the display device. These processes are known as gamma correction. In practice, gamma correction is applied once, at the sender side, to the RGB signals, which are then converted to YUV or YIQ for transmission; receivers need only convert them back to RGB for display.

* YUV, YIQ, YPbPr and YCbCr
- As said before, for video capture and display, all analog and digital systems use RGB primaries. However, the color coordinate systems used for transmission in analog TV systems (NTSC, PAL and SECAM) and digital systems (BT601 and BT656) are different. In general, a luminance/chrominance coordinate is employed.
- The YUV coordinate is the basis for all of these. Y is the luminance component, while U and V are proportional to the color differences B-Y and R-Y, respectively, scaled to have the desired range. YUV is used in the PAL and SECAM TV systems.
- YIQ is used in NTSC; the I and Q components are rotated (by 33 degrees) versions of the U and V components.
- YCbCr is the digital color coordinate introduced in BT601. Y, Cb and Cr are scaled and shifted versions of Y, U and V so that the value range fits within [0, 255].
- YPbPr is the analog color coordinate used in BT709 for the HDTV video interconnect signal. The offsets and scaling factors are specified to prevent saturation of the analog signal amplifiers.

* Common Video Input Formats
YUY2 4:2:2, interleaved Y,Cb,Y,Cr
YV12 4:2:0, planar: Y, then Cr, then Cb stored as separate planes, in that order

Friday, August 24, 2007

Memory Management

* How to Use Memory?
In terms of how they use memory/buffers, components can be divided into three categories.
- In place read-only
- In place read and write
- transform
Transform components accept the data in the input buffer, perform some kind of transformation on the data, then put the new data into a newly requested buffer and free the old one.

* Two Kinds of Buffer Factory
- Discrete buffer factory: fixed size data buffers
- Circular buffer factory: variable size data buffers

* Discrete Buffer
- Reference count for zero-copy reuse
When sharing discrete data buffers, one solution is to make copies of the buffer data, but this wastes memory. Reference passing is better: the reference count of a buffer equals the number of filters that have access to it, and the buffer memory is freed only when its reference count falls to zero. To support reference counting, all downstream components must be in-place read-only ones. The buffer's reference count is automatically incremented by one, or by N for a multi-output stream connection, when the buffer is written to a stream connection; each component must release the buffer once it has finished using it and passed its pointer on to the stream connection.
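
A minimal reference-counted discrete buffer might look like this sketch (single-threaded for brevity; a real system would make the count updates atomic):

#include <stdlib.h>

typedef struct {
    int refcount;               /* number of filters holding the buffer */
    size_t size;
    unsigned char data[];       /* C99 flexible array member */
} dbuf;

dbuf *dbuf_alloc(size_t size)
{
    dbuf *b = malloc(sizeof *b + size);
    if (b) { b->refcount = 1; b->size = size; }
    return b;
}

void dbuf_addref(dbuf *b, int n) { b->refcount += n; }  /* +N on fan-out */

void dbuf_release(dbuf *b)      /* called by each component when done */
{
    if (--b->refcount == 0)
        free(b);                /* freed only when the count hits zero */
}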

* Circular buffer
- One exclusive write and multiple reads
One write pointer and multiple read pointers are maintained for this kind of buffer. The amount of empty space in the buffer equals the distance from the write pointer to the closest read pointer; in other words, the slowest reader controls the amount of available space, and in this way mutual exclusion is realized. Components that underflow the circular buffer are responsible for combining the new data with the old, or they can leave the read pointer unchanged and try again later.
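
The "slowest reader controls the space" rule reduces to a small modulo computation, sketched here (one slot is left unused to distinguish full from empty):

unsigned int writable(unsigned int wr, const unsigned int *rd,
                      int nreaders, unsigned int size)
{
    unsigned int space = size - 1;                /* reserve one slot    */
    for (int i = 0; i < nreaders; i++) {
        unsigned int gap = (rd[i] + size - wr - 1) % size;
        if (gap < space)
            space = gap;                          /* slowest reader wins */
    }
    return space;
}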

Data Streaming Model

* Data streaming model is a fundamental software architecture, which has wide applications in the field of data processing. This model is based on the following basic ideas for the embedded applications:
- Multi-threading
- Object-oriented: components

* Components
A component is a hardware-independent software abstraction of a hardware or microcode function. It has its own interface that performs component-specific functions. Components are the basic units used to construct a data stream or data flow. A component may or may not be associated with a thread. Two key features of components are inheritance and polymorphism, and dynamic creation and linking. Dynamic creation means components are dynamically allocated at startup or runtime instead of being globally or locally allocated. Dynamic creation and linking keep the application and the low-level supports, like middleware, as separate as possible.

* Types of Components
- Stream filter
It is the processing element (brick) in a data stream. In general, it is associated with one thread. It has input/output ports and command/message ports for async events.
- Stream connection
It is used to connect two or more stream filters to construct a data stream (mortar). In essence, a stream connection is a message queue. In general, no thread is bound to a stream connection. Therefore, the comm. among tasks is implemented with message passing instead of shared data structures. This model might be more suitable for multi-core applications, since it helps to hide the existence of other cores.
- Buffer factory
It is used to manage memory usage within a data stream. One or more buffer factories could be associated with one flow as needed. No threads are with them.
- Flow controller
This macro component creates, initializes and connects multiple building components, as mentioned above, to construct a data flow. It is also the interface to the higher application level: it emits commands to, and receives status messages from, the whole flow. It can be considered a special case of a stream filter.

New components are created and old components might be deleted when a new data flow is built, in order to efficiently make use of memory. However, some kinds of components provide facilities for all data streams, so they need to be created once at startup and stay in the whole system lifetime.

Sunday, August 19, 2007

Scheduling

* Mode
- Process Contention Scope
- System Contention Scope

* Types
- Time Sharing with Equal or Dynamic Priority (Fair)
- Realtime with Different and Static Priority (Non-fair in order to have predictable response): Preemptive for different priorities and Round-Robin (time slicing) for the same priority

* Priority Assignment
Pre-condition:
All tasks have worst-case execution times that are less than or equal to their deadlines.
- Rate Monotonic Scheduling
For periodic tasks whose deadlines are at the end of their periods, tasks with shorter periods have higher priority.
- Minimum Interarrival Time Scheduling
For aperiodic tasks whose deadlines equal their minimum interarrival times, tasks with shorter interarrival times have higher priority.
- Deadline Monotonic Scheduling
For periodic or aperiodic tasks whose deadlines are shorter than their periods or minimum interarrival times, tasks with shorter deadlines have higher priority.

* Issues
- Priority inversion
It involves at least three threads of different priorities. The solutions are priority inheritance and priority ceiling; mutexes generally provide this support. When using a preemption lock, priority inversion might happen, since the task with higher priority cannot be executed in time.

Communications

* Types of Communication
- Loosely coupled comm.
- Tightly coupled comm.

* Approaches
Shared Data Structure
- Resource sync needed.
Message Passing (Message Queue)
- Provide sync and comm together
- No owner and mutual exclusion implemented internally
- Value passing for small messages and reference passing for large messages
- Semaphore implementation: a mailbox of one valid message

Usage for Comm
- Loosely coupled one-way comm: ONE message queue
- Tightly coupled one-way comm: ONE message queue and ONE binary semaphore
- Tightly coupled two-way comm (Rendezvous): TWO message queues with one or multiple entries
Note: the task model here could be "one to one", "multiple to one" and "one to multiple", etc.

Synchronization

* Types of Synchronization
- Resource sync: critical section and resource mutual exclusion
- Activity sync

* Approaches
Low weight ones: TSL (Test-and-Set Lock, H/W), semaphore, mutex, condition variable
High weight ones: message passing, spinlock, monitor, read-write lock, barrier, interrupt lock, and preemption (scheduling) lock.
- Resource sync
Resource could be either data or codes:
mutex for single shared resource access
recursive mutex for recursive shared resource access
counting semaphore plus mutex, the mixing of non-blocking and blocking mutexes for multiple shared resource access
interrupt lock for the shared resource with ISR
preemption lock for the shared resource without waiting
read-write lock for multiple reads and exclusive write (broadcasting)
locking hierarchy and locking chain for dependent resources
- Activity sync
1) Single-task waiting and single-task signaling: condition variable and semaphore
2) Multiple-task waiting and single-task signaling: condition variable (broadcasting) and semaphore (flushing)
3) Single-task waiting and multiple-task signaling: ONE semaphore with iterative checking of MULTIPLE input sources from ISRs; or ONE counting semaphore associated with each input queue, which might require redesign.
4) Flow control: counting semaphore
5) Rendezvous and barrier sync: TWO semaphores without comm.; two condition variables; TWO message queues of one entry each for comm.; a callback mechanism with comm.; and a barrier for multiple tasks.
6) Event occurrence: condition variable or event object, or counting semaphore (which can count the number of times the event happened)

* Issues
- Race
A race occurs when two or more threads try to get or do something at the same time, and only one can win. A race is the result of resource conflicts caused by insufficient sync. Races are difficult to diagnose, since every thread keeps going.
- Deadlock
Deadlock is the result of resource conflicts caused by wrong sync. It is easy to detect, since the threads stop. In general, if multiple mutexes are taken sequentially for dependent resources, deadlock can happen. The general solutions are:
1) fixed locking hierarchy and
2) try and backoff with reverse unlocking sequence.
Note: Lock chaining is a special case of locking hierarchy, where the scopes of two locks overlap. With one mutex locked, the code enters a region where another mutex is required; after successfully locking that second mutex, the first is no longer needed and can be released.
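
Lock chaining in code, using pthread mutexes for illustration:

#include <pthread.h>

static pthread_mutex_t ma = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mb = PTHREAD_MUTEX_INITIALIZER;

void chained_access(void)
{
    pthread_mutex_lock(&ma);     /* hold A while locating resource B */
    /* ... find the next resource, guarded by B ... */
    pthread_mutex_lock(&mb);     /* the two scopes overlap here      */
    pthread_mutex_unlock(&ma);   /* A is no longer needed: release   */
    /* ... work on the resource guarded by B ... */
    pthread_mutex_unlock(&mb);
}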

* Sleep and Non-sleep
Many facilities can put the current thread to sleep, like semaphores, mutexes, etc. However, in the following contexts sleeping is not allowed:
- ISR
- critical sections

* Semaphore, mutex and conditional variable
- A binary semaphore is deceptively simple: programmers must themselves ensure both mutual exclusion and synchronization of threads by combining multiple semaphores. A semaphore has no owner, so anyone can release it; also, it has no predicate associated with it.
- A mutex is for mutual exclusion; it has an owner, which must release the mutex after using it. Mutexes support task-deletion safety, recursive access, and priority inheritance against priority inversion.
- A condition variable is for thread synchronization and depends on a SINGLE (not multiple) predicate over the shared resource. Always test your predicate, and then test it again, using a while loop (see the sketch below). A condition variable must be associated with a mutex, but a mutex may have any number of condition variables associated with it.
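
The while-loop predicate test in standard pthread form:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;          /* the single predicate */

void consumer(void)
{
    pthread_mutex_lock(&lock);
    while (!data_ready)             /* test again: wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);
    data_ready = 0;
    /* ... use the shared resource ... */
    pthread_mutex_unlock(&lock);
}

void producer(void)
{
    pthread_mutex_lock(&lock);
    data_ready = 1;                 /* change the predicate under the mutex */
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}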

* Spinlock and monitor
- A spinlock is effective only in very restricted circumstances: the critical section must be short, there must be significant contention for the lock, and you must be running on more than one CPU, since a spinlock does not block the task. The trylock operation of a mutex can be used to implement one. A spinlock can be used in an ISR for resource sync, thanks to its non-sleeping, non-interruptible property.
- A monitor integrates the shared resource with its mutual-exclusion utilities. In many situations, monitors are an excellent synchronization technique: by forcing the programmer to manipulate shared data via monitored code, you ensure that there can never be a failure to lock its use, and that the access methods will be well-defined. Monitors cannot handle all types of locking situations, however; when you want to use the "trylock" functions, you must use regular mutexes.

* Initialization/destroy and blocking/non-blocking
All these utilities need to be initialized before use and deleted at the end. Memory may be allocated during initialization, and that memory should be freed in the delete routine. Also, make sure that no other tasks are waiting on or holding a utility when it is deleted, so that no deadlock happens. One general deletion pattern is:
{
     lock(mutex);
     unlock(mutex);
     delete(mutex);
}
On the other hand, when these utilities are used, two types of functions are provided in general, i.e. blocking access and non-blocking or time-out access.

Concurrency Control

* In order to create concurrent execution contexts, and control how they operate within the library and application, three facilities need to be provided:
- Execution context: the state of a concurrent entity.
- Scheduling: to determine which execution context and/or CPU
- Synchronization: to cooperate between concurrent execution contexts

* Thread
- All threads in one process share all of its resources, like the virtual address space, file descriptors/pointers, working directory, etc.
- A thread's registers, PC, stack and heap memory are private to the thread, unless the thread communicates a pointer to that memory to other threads.
- A thread is fully synchronous with itself unless some asynchronous events happen, like ISR.

* Thread Working Modes
- Pipeline
A task is broken into a series of suboperations, each of which is handled in series, but concurrently, by a different thread.
- Client/Server
Multiple clients contact an independent server for each job anonymously (multiple to one).
- Manager/worker
A single thread, the manager, assigns work to other threads, the workers (one to multiple). Typically, the manager handles all input and parcels out work to the other tasks. At least two forms of the manager/worker model are common: a static worker pool and a dynamic worker pool.
- Work Crew
Multiple threads work together independently. This is a special form of the manager/worker model in which the manager joins the workers after creating them.

Memory Organization

* Memory Hierarchy
- Objective: Reduce the performance gap between memory and CPU
- Principle: The principle of locality (temporal and spatial)
- Why hierarchy? Trade-off between performance and cost
- Result: Program code and data are spread among multiple locations in the memory hierarchy.
- Order in performance: Register, cache, main memory and disk
- Performance metrics:
latency (response time, execution time): how fast can data be fetched?
throughput (bandwidth): how much data can be fetched per unit time?
For example, a register has low latency and high throughput, while cache has low latency but lower throughput.
Sometimes power consumption is important too.
- Two things are very important in memory hierarchy: Cache and virtual memory

* Cache
- Basic unit
Block/line of multiple bytes. The address of the first byte is aligned to the cache-line stride.
- Where to put cache lines?
Direct mapping (one-way set associative, n sets), n/m-way set associative (m sets), and fully associative (one set).
- How to locate cache lines?
Memory address: Tag + Set index + Block offset (see the sketch after this list)
- Replacement rule
FIFO, LRU (Least Recently Used), and Random
- Write policy
Write back and write through
Write allocate and non-write allocate
write buffer, Non-block cache operations
- Cache size
Cache line component = cache tag + state bits (valid, dirty for WB) + cache data
Cache size in bytes = block size * set associative * sets
- Cache Organization
Two logical dimensions:
Horizontal: associativity (ways)
Vertical: sets
Each way could be one cache bank. Note that a large bulk of data with sequential addresses is stored vertically (across sets), not horizontally.
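
Decomposing an address into the three fields above, for a hypothetical 32 KB, 4-way cache with 64-byte lines (so 32768 / 64 / 4 = 128 sets):

#include <stdint.h>

#define BLOCK_BITS 6    /* 64-byte lines   */
#define SET_BITS   7    /* 128 sets = 2^7  */

void split_addr(uint32_t addr, uint32_t *tag, uint32_t *set, uint32_t *off)
{
    *off = addr & ((1u << BLOCK_BITS) - 1);
    *set = (addr >> BLOCK_BITS) & ((1u << SET_BITS) - 1);
    *tag = addr >> (BLOCK_BITS + SET_BITS);
}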

* Virtual Memory
- Virtual address space
Each process has its own virtual address space, which is determined by the bus width of the system. The process can use this address space fully, and its code and data can be stored anywhere in the space. The organization of the text, data and bss regions, stack and heap is within the virtual address space, not the physical address space. Therefore the addresses a program sees are virtual addresses, not physical addresses of main memory.
- Virtual memory
Virtual memory expands the concept of main memory to include the disk: main memory can be considered a cache of the disk.
- Advantage of virtual memory
1) Make multiple processes share main memory
2) No need to worry about the size of main memory
3) Make it possible to relocate code and data (swapping)
- Virtual address to physical address
Cache and memory are accessed with physical addresses, so a virtual address needs to be translated to a physical address after it leaves the CPU. The TLB and a per-process page table are used for this purpose. The page table maps virtual addresses to physical addresses. Since the page table might be large and is located in main memory, two memory accesses would be needed for each address access. Exploiting the principle of locality, the TLB (translation lookaside buffer) is used to do this job fast.
- Basic unit
Page or segment; the size of page could be the size of one cache bank (sets * block size)
- Where to put pages?
Fully associative
- How to locate pages?
TLB and Page Table for each process
- Replacement rule
LRU
- Write policy
Write back
- Virtual address format
Virtual address component = virtual address index + page offset
Virtual address index is used to look up the page table.
- TLB
The TLB is the cache of a process's page table in main memory. It has the same properties as a cache, with one exception: a cache is controlled completely by hardware, but once a page fault happens, the OS takes charge, since the cost of accessing the disk is high enough to be worth switching to another process context.

* Main Memory Organization
Main memory can be divided into multiple contiguous address sections, or banks. Each memory bank can be mapped to memory chips. Inside a memory chip, bits can be organized in banks too, with addresses interleaved among the banks to improve access bandwidth. The two uses of "bank" here differ in terms of address mapping.

Tuesday, August 14, 2007

L/R Values in C

Every expression in C and C++ is either an l-value or an r-value. An l-value is an expression that designates (refers to) an object. Every l-value is, in turn, either modifiable or non-modifiable. An r-value is any expression that isn't an l-value. Operationally, the difference between these kinds of expressions is this:
The address of each l-value is known at compile time and is where the variable will be kept at runtime; i.e., a modifiable l-value is addressable (can be the operand of unary &) and assignable (can be the left operand of =). So if the compiler needs to do something with an address (add an offset to it, perhaps), it can do so directly and does not need to plant code to retrieve the address first. In contrast, an r-value is neither addressable nor assignable. For example, the following are typical r-values and cannot be modified:
- function return value
- the result of ? :

A non-modifiable l-value is addressable but not assignable, like a const-qualified object (what a pointer to const points to) or an array name.
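
A few of these rules in code:

void lvalue_demo(void)
{
    int i = 0, a[4];
    int *p = &i;
    const int c = 5;
    (void)&a[0]; (void)c;

    i = 3;       /* modifiable l-value: assignable and addressable       */
    *p = 4;      /* *p designates an object, so it is an l-value         */
    /* c = 6;       error: const makes c a non-modifiable l-value        */
    /* a = p;       error: an array name is a non-modifiable l-value     */
    /* &(i + 1);    error: (i + 1) is an r-value, not addressable        */
}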

Endianness

* Memory Allocation Rule
In general, memory for data is allocated starting from the low address, both in the stack (which grows from high addresses to low) and in the heap.

* Endianness
Most modern computer processors agree on bit ordering "inside" individual bytes (this was not always the case). This means that any single-byte value will be read the same on almost any computer one may send it to.

Integers are usually stored as sequences of bytes, so that the encoded value can be obtained by simple concatenation. The two most common orderings are:
- increasing numeric significance with increasing memory addresses, known as little-endian, and
- its opposite, called big-endian.
The bit order of CPU registers is fixed for both endiannesses, and the hardware takes charge of the conversion between register and memory.

* Endianness in Networking
Data is generally transferred byte by byte from the low address (char *buffer). Networks generally use big-endian order for byte transmission (though not necessarily in memory), and this is called network order when sending information over a network in a common format. Routines are generally provided for the conversions between network and host order (see the sketch below).
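
The standard POSIX conversion routines in use, plus a one-line runtime endianness probe:

#include <arpa/inet.h>
#include <stdint.h>

uint32_t to_wire(uint32_t host_val)   { return htonl(host_val); } /* host -> big-endian */
uint32_t from_wire(uint32_t wire_val) { return ntohl(wire_val); } /* big-endian -> host */

int is_little_endian(void)
{
    uint16_t one = 1;
    return *(const unsigned char *)&one == 1;  /* low byte stored first */
}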

* Bit-Level Endianness
Bit endianness is used to refer to the transmission order of bits over a serial medium. Most often that order is transparently managed by the hardware and is the bit-level analogue of little-endian (low-bit first), although protocols exist which require the opposite ordering (e.g. I2C). In networking, the decision about the order of transmission of bits is made in the very bottom of the data link layer of the OSI model.

* Bit-Shift and Endianness
Bit shifts have nothing to do with endianness. The results of these operations are defined by the standard and implemented by the compiler.
Reference: Endianness in Wikipedia

Optimization in C

* Better Algorithm or Data Structure

* Mathematics Solution
- Get correct mathematics formula for the problem. For example, calculate N!.

* More Space Less Time
- Using macros for short functions
- Look-up Table

* Use Bit Operations (Be cautious)
- Shift and bit mask for division and module

* Language and compiler features
- Use pointer to operate an array
- No need to define useless return value
- Define a variable in the register instead of stack by using `register'
- Prefer post ++/-- to prefix ++/-- for usage without reference to the result
- Organize the order of cases in a switch: the more frequently a case occurs, the earlier it should be placed
- In some cases, an array of pointer to functions might be more efficient than switch statement. For example,
int handleMsg1(void);
int handleMsg2(void);
int handleMsg3(void);
int (*MsgFunction[])(void) = {handleMsg1, handleMsg2, handleMsg3};
status = MsgFunction[ReceiveMessage()]();

* Hardware features
- Copy codes and data from FLASH to RAM for running
- Fully use the UART buffer for data transfer
- Use DMA

* Embed Assembly
- However, it is non-portable

Std Library Functions

* Library Calls And System Calls
- Library calls are part of the language or application, while system calls are part of the OS. A system call is triggered by a trap or interrupt. The C library is the same on every ANSI C implementation: library calls are calls to routines in a library, linked with the user program. They execute in the user address space, with low calling overhead, and count as part of user time. The system calls, on the other hand, differ in each OS. They are calls to the kernel for a service and are entry points to the OS. They execute in the kernel address space, with high calling overhead, and count as part of system time.

In practice, many C library functions do their jobs by making system calls.

* getchar
- Its return value is integer instead of char.

* strcpy, strcat, strncpy and strncat
- strcpy and strcat do not check the buffer size; their only ending condition is '\0'. So they can overrun buffer boundaries; prefer strncpy and strncat.

* strlen
- Its return value does not include the terminating '\0' character

* scanf
- It expects a pointer to an integer, not to a char, to read an integer. The formats for float and double differ: %f for float and %lf for double. In printf, %f serves both.

* memcpy and memmove
- memcpy cannot copy overlapping src and dst memory blocks; memmove can, at a cost in performance.

* setbuf
- When the main function returns, the library FREES and cleans up the memory setbuf used. Therefore this buffer must not be on the stack; it should be heap memory or a static/global array.

* fread, fseek and fwrite
- fseek needs to be called between calls to fread and fwrite: one file cannot be read and then written in sequence without this state change.

* errno
- This global variable always holds the last error number. The general usage is:
if (ret == -1)   /* the call reported an error */
{
     printf("error: %s\n", strerror(errno));   /* check errno right away */
}

Monday, August 13, 2007

Rules of Thumb

Function Interface Design
* Write function comment in the head of each function
* Give a good name to each function and avoid using undefined verbs
* Return a status and pass the desired return values by pointers
* The number of input parameters should be no more than 7; otherwise use a struct
* Write function prototype and keep data types of input and return parameters match their declarations and no casting
* Const-qualify the read-only parameters to which input pointers point
* Follow some order of input parameters, like (dst, src, num)
* Check the legitimacy of input parameters and input global variables by using `assert'
* Avoid using input parameters, esp. pointers, as working variables. Use their local copies
* Check the legitimacy of function return value, esp. the return value of std library functions
* Use macros to replace variables of multiple references (i.e., of very long names)

Function Design
* Divide large scale codes to multiple level function calls.
* Design functions with high input fans and low output fans
* Write re-entrant function as much as possible
* Define only one logic task for each function as much as possible
* Keep the length of each function within 200~300 lines
* Separate implementation code from control code

Physical Structure of Code
* The basic code unit is a set of files, c/cpp file and its associated h files which might be more than one, like one for the public and another one for the private. The c/cpp file in one unit is supposed to include its h files.
* The h files should include as few non-std h files as possible; those should be included in the c/cpp file instead. An h file can make forward declarations at its beginning for the types it declares, to avoid including other h files.
* Each h file should be guarded with #ifndef/#define and its unique tag.
* Avoid using `extern' functions directly in c/cpp file but try to include the h files of their declarations.
* Use `static' to restrict the scope of functions and variables. Global variables should be declared `extern' in their h file and defined & initialized in the c/cpp file. Include the h file of a global variable before using it.

Misc
* Don't do too much in ONE single statement.
* malloc and free
- Check the return pointer of malloc; reset the allocated memory to zeros with memset.
- Free the memory allocated by malloc at the same code level, and assign NULL to the pointer afterward.
* fopen and fclose
- Check the return handle of fopen
- Close the file handle in the end
* sizeof and size_t
- size_t is an unsigned integer type (often unsigned long int)
- sizeof data type instead of variable name
- Prefer (num * sizeof data type) to (sizeof data type * num)
* Condition Check with Boolean, Int, Float and Pointer
- Boolean: if (!Flag)
- Int: if (0 == Flag)
- Float: if (Flag >= -EPSILON && Flag <= EPSILON)
- Pointer: if (NULL == p)
* Infinite Loop - while (1) {...}

Re-entrant Code

* Re-entrant Code
Re-entrant code can be used by multiple threads, and its result depends only on its input parameters. Re-entrant code is thread-safe, but thread-safe code might not be re-entrant, since the state of the code might change due to global state. Re-entrant functions are similar to those of functional programming.

* Non Re-entrant Code
- Uses non-const global variables
- Uses static global/local variables
- Calls non re-entrant functions, like malloc, free, printf, fopen, and other I/O std functions

Interrupt Service Routine (ISR)

* ISR Issues
Find out what is wrong in the following ISR:
__interrupt double compute_area(double radius)
{
     double area = PI * radius * radius;
     printf("\n area = %f\n", area);
     return area;
}
An ISR is supposed to:
- Take no input parameters
- Return nothing
- Be compact and simple. Note: floating-point arithmetic operations and printf are complex and not re-entrant.
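
A common pattern that respects these rules keeps the ISR minimal and defers the real work to the main loop. A sketch (the `__interrupt' keyword is compiler-specific; READ_ADC_REGISTER and process_sample are hypothetical):

     static volatile int g_sample_ready = 0;   /* shared with the ISR, hence volatile */
     static volatile int g_sample = 0;

     __interrupt void adc_isr(void)            /* no input parameters, no return */
     {
          g_sample = READ_ADC_REGISTER();      /* hypothetical hardware read */
          g_sample_ready = 1;                  /* just set a flag: no printf, no floating point */
     }

     void main_loop(void)
     {
          for (;;) {
               if (g_sample_ready) {
                    g_sample_ready = 0;
                    process_sample(g_sample);  /* the heavy work happens outside the ISR */
               }
          }
     }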

Const and Volatile Qualifier

* Const
- Read only
- Initialization
const type must be initialized when it is declared.
     const int j; /* error */
- Specify the exact data type
     #define i 10
     const long j = 10;
     char h = i;
     char k = j; /* compiler may warn: long-to-char conversion can truncate */
- Save memory
With the const array, memory is allocated once; with the macro, the compiler may emit a separate copy of the string literal at each use (depending on whether it pools identical literals).
     #define STRING "abcdefg"
     const char string[] = "abcdefg";
     printf(STRING); /* possibly a first allocation of the literal */
     printf(string);
     printf(STRING); /* possibly a second allocation */
     printf(string);
- Change const (possible, but undefined behavior)
     const int i = 0;
     int *p = (int *)&i;
     *p = 10; /* undefined behavior: modifying an object defined as const */

* Volatile
- Usage
-- Hardware registers or ports which might be changed by I/O
-- Non-automatic (static or global) variables which might be changed by an ISR
-- Global variables in multi-threading environment which might be changed by other thread.
- Effect
The compiler re-reads the variable from memory on every access, instead of reusing a copy cached in a register.
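
A typical use is polling a memory-mapped status register; a minimal sketch (the address and bit mask are hypothetical):

     #define UART_STATUS (*(volatile unsigned int *)0x40001000)

     void wait_for_tx_ready(void)
     {
          /* without volatile, the compiler could read the register once,
             keep the value in a register, and turn this into an infinite loop */
          while ((UART_STATUS & 0x1) == 0)
               ;
     }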

* Qualifier Meaning
- int const/volatile *p = &i;      /* pointer to const/volatile int: the pointed-to int is qualified */
- int * const/volatile p = &i;     /* const/volatile pointer to int: the pointer itself is qualified */
- int volatile * const p = &i;     /* const pointer to volatile int */

* Declaration of Const and Volatile
- Two types of declaration
     const/volatile int i; <=> int const/volatile i;
The latter one might be better. Think about this one:
     typedef char * pchar;
     const pchar p;
It might be read as "const char *", but it is actually "char * const". The declaration "pchar const p" has no such confusion.
- Declaring an entire object to be volatile and/or const effectively
declares each member of that object as volatile and/or const.
- Defining a data type to be const and/or volatile might be more useful than just defining an object of this type to be const and/or volatile. The reason is that there is no need to consider type matching (as explained below) when pointer assignments happen, e.g., parameter passing in function calls. Otherwise, all those occurrences would have to be declared const and/or volatile.
Note: typedef struct A const AA; =>
AA is const but struct A is not. So use this format:
typedef struct A
{
...
} const B;

* Const/Volatile Pointer Assignment
     int *p = &i;
     int const *cp; /* no initializer required: cp itself is not const; it is a modifiable pointer to const int */
     p = cp; /* error */
     cp = p;
In the above example, cp is a pointer to a qualified data type while p is a pointer to an unqualified data type. In an assignment, the type pointed to by the left pointer must carry all the qualifiers of the type pointed to by the right pointer. So cp may be assigned from p, but assigning cp to p (p = cp) is an error, because it would drop the const qualifier.
     int **pp = &p;
     int const **cpp;
     cpp = pp; /* error */
which might happen in the parameter passing of function calling. Why?
pp => a pointer to a pointer to int
cpp => a pointer to a pointer to const int
"A pointer to int" is not the same data type as "a pointer to const int". Therefore, pp and cpp is different pointers. Casting needed here. If
pp => a pointer to a pointer to int (int **)
cpp => a pointer to a const pointer to int (int * const *) or
           a const pointer to a const pointer to int (int * const * const) or
           a const pointer to a pointer to int (int ** const)
then cpp = pp; is legal.

Sunday, August 12, 2007

Casting in C

* The cast operator forces the conversion of its SCALAR operand to a specified SCALAR data type, or to void. The operator consists of a type-name, in parentheses, that precedes an expression, as follows:
( type-name ) expression

The type-name can also be an enum specifier or a typedef name. The type-name can involve a structure or union only through a pointer: it can be a pointer to a structure or union, but cannot be a structure or union itself, because structures and unions are not scalar types. For example:
     (struct abc *)x /* allowed */
     (struct abc)x /* not allowed */

Cast operations cannot force the conversion of any expression to an array, function, structure, or union. The following example casts the pointer p to a pointer to an array of int:
     (int (*)[10]) p;
This kind of cast operation does not change the contents of p; it only causes the compiler to treat the value of p as a pointer to such an array.
     p + 1; /* Increments by 10*sizeof(int) */

* Cast operators can be used in the following conversions that involve pointers:
- A pointer can be converted to an integral type. On many platforms a pointer occupies the same amount of storage as an object of type int or long (or their unsigned equivalents), so on such platforms a pointer can be converted to one of these integer types and back again without changing its value; note this is platform behavior, not a guarantee of the standard. No scaling takes place, and the representation of the value does not change. Converting from a pointer to a shorter integer type is similar to converting from an unsigned long type to a shorter integer type; that is, the high-order bits of the pointer are discarded. Converting from a shorter integer type to a pointer is similar to the conversion from a shorter integer type to an object of unsigned long type; that is, the high-order bits of the pointer are filled with copies of the sign bit.
- A pointer to an object or incomplete type can be converted to a pointer to a different object or a different incomplete type. The resulting pointer might not be valid if it is improperly aligned for the type pointed to. For example,
     char c[10];
     int *p = (int *)&c[1]; /* possible misalignment */
It is guaranteed, however, that a pointer to an object of a given alignment can be converted to a pointer to an object of the same alignment or less strict alignment, and back again. The result is equal to the original pointer. (An object of character type has the least strict alignment.) For example,
struct B
{
     int x;
};
struct A
{
     struct B b;     /* first member: same address as the whole struct */
     int c;
} a, *pa, *pa2;
struct B *pb;
pa = &a;
pb = (struct B *)pa;      /* Allowed and safe. Recall the alignment property of struct */
pa2 = (struct A *)pb;     /* Now pa == pa2 */
- A pointer to a function of one type can be converted to a pointer to a function of another type and back again; the result is equal to the original pointer. If a converted pointer is used to call a function that has a type not compatible with the type of the called function, the behavior is undefined.

Reference: EETime Embedded

Saturday, August 11, 2007

void and void *

* void
- There is NO INSTANCE of the void data type, like "void a". The void type is abstract, like an abstract class in C++.
- void usage: as a function return type, and as an empty function parameter list

* "void *"
- A pointer of any type may be assigned to a void * without a cast. The reverse assignment, from void * to a pointer of another data type, requires a cast in C++ (and is commonly written with one in C for clarity, although C permits the implicit conversion).
- A void * pointer cannot be used in pointer arithmetic, since the size of the pointed-to type is unknown:
void *p;
p++;    /* not allowed: no increment on void * */
- A void * can serve as the parameter or return type of functions that accept or return a pointer to any data type. For example,
void *memcpy(void *dst, const void *src, size_t len);
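
The standard qsort is a good illustration: it sorts elements of any type precisely because it traffics only in void * and delegates interpretation to a user-supplied comparison function.

     #include <stdio.h>
     #include <stdlib.h>

     static int cmp_int(const void *a, const void *b)
     {
          /* cast the generic pointers back to the concrete element type */
          int x = *(const int *)a;
          int y = *(const int *)b;
          return (x > y) - (x < y);
     }

     int main(void)
     {
          int data[] = { 3, 1, 2 };
          qsort(data, sizeof data / sizeof data[0], sizeof data[0], cmp_int);
          printf("%d %d %d\n", data[0], data[1], data[2]);   /* prints 1 2 3 */
          return 0;
     }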

Struct and Union

* Typical Usage of Struct and Union
struct A
{
     int a;
     char b;
};
struct B
{
     short c;
     char d[2];
};
struct C
{
     int e;
     long f;
     short g;
};

struct CommonPacket
{
     int PacketType;
     union
     {
          struct A PacketA;
          struct B PacketB;
          struct C PacketC;
     } u;     /* in C89/C99 the union member needs a name */
};

In general, a struct can be used effectively to describe a section of contiguous memory slots or registers, which is then accessed through a pointer to the head of the section.
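
Dispatching on the discriminator is the usual way to consume such a tagged union; a sketch (the tag values and handler functions are hypothetical):

     enum { PACKET_A, PACKET_B, PACKET_C };        /* hypothetical values of PacketType */

     void handle(const struct CommonPacket *pkt)
     {
          switch (pkt->PacketType) {
          case PACKET_A:
               process_a(&pkt->u.PacketA);         /* hypothetical handlers */
               break;
          case PACKET_B:
               process_b(&pkt->u.PacketB);
               break;
          case PACKET_C:
               process_c(&pkt->u.PacketC);
               break;
          default:
               break;                              /* unknown type: ignore or report */
          }
     }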

* Alignment in Struct (How to estimate the size of struct?)
- "Although you can never be absolutely sure how your compiler will pad the members within a structure, the Standard guarantees there will be no padding before the first member. The Standard also mandates that each member in a structure must be allocated in the order in which it's declared."

- Natural Alignment:
By default, each data member is aligned according to the size of its data type. Padding after a member is whatever makes the next data member land on its own alignment boundary. Finally, the size of the struct must be a multiple of the largest member alignment; this maximum is the alignment size of the struct itself. For example,
struct A
{
     char a;
     long b;
};
struct B
{
     short c;
     struct A d;
};
Then for struct A, a is aligned on 1 byte (sizeof(char)). Since the alignment of b is 4 bytes (sizeof(long int), assuming a 32-bit long), three bytes are padded after a to guarantee that b lands at an address that is a multiple of 4. The final size of struct A is 8, a multiple of 4, so the alignment of struct A is 4. (Think about what happens if the positions of a and b are switched in struct A.) As for struct B, it contains struct A as a compound member, so the alignment of that compound is considered first; it is 4, as explained above. Two bytes are padded after c, and the final size of struct B is 12, which satisfies the requirement.
Note: 1) The final size of a struct is not necessarily (N x Len), where N is the number of data members and Len is the maximal size among them; the compiler packs members to save memory as much as possible. Look at this example:
struct C
{
     char x1;
     short x2;
     int x3;
     char x4;
};
The size of struct C is not 16 but 12: x1 at offset 0, one pad byte, x2 at offset 2, x3 at offset 4, x4 at offset 8, then three tail pad bytes (assuming the natural alignments above).
2) For arrays, the alignment size is the size of the element data type, not the size of the whole array. For example,
long int c[20];
The alignment size for c is sizeof(long) but not sizeof(c).
- Alignment with #pragma
Force structs to align n bytes: #pragma pack(n)
Cancel alignment of n bytes: #pragma pack()
The effective alignment of each data member is the smaller of its natural alignment size and n.
In summary: 1) determine the alignment size of each data member (compound members first); 2) lay the members out in declaration order; 3) save memory as much as possible.
- Offset Calculation (essentially how the standard offsetof macro is often implemented)
     (size_t)((char *)&((struct A *)0)->f - (char *)((struct A *)0))
where f is the member of struct A whose offset is wanted.
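
A quick way to check these layout estimates on a given compiler is simply to print them; a minimal sketch reusing struct C from above:

     #include <stdio.h>
     #include <stddef.h>

     struct C
     {
          char x1;
          short x2;
          int x3;
          char x4;
     };

     int main(void)
     {
          printf("sizeof(struct C) = %lu\n", (unsigned long)sizeof(struct C));
          printf("x2 at %lu, x3 at %lu, x4 at %lu\n",
                 (unsigned long)offsetof(struct C, x2),
                 (unsigned long)offsetof(struct C, x3),
                 (unsigned long)offsetof(struct C, x4));
          /* typically prints 12 and offsets 2, 4, 8 on a 32-bit natural-alignment target */
          return 0;
     }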

* Initialization of struct
- struct A a = {'t', 'c', 8, 0.99, "example"}; or
struct A a = {0}; /* Every member is 0 now, no matter which type.*/

* Assignment of struct
struct A
{
     char *p;
     char c;
} a, b;
char cc = 'c';
a.p = &cc;
a.c = 15;
b = a;
*b.p = 30;      /* cc now is changed */
If a pointer is contained in a struct, then after an assignment between two struct variables the two pointers point to the same memory.
Although arrays cannot be assigned to each other directly, they are copied when the structs containing them are assigned, like this:
struct A
{
     char array[10];
} a, b;
for (int i = 0; i < 10; ++i)
     a.array[i] = i;
b = a;     /* b.array now is the same with a.array */

* Struct For Bit Map
Under some circumstances, struct could be used to do bit map for a block of memory, like this
struct A
{
     int a:1;
     int b:7;
} t;
t.a = 1;
t.b = 0x7f;
- The total number of bits should be reasonable; typically it matches the size of one of the basic data types, like char, short, int, long int, etc.
- Be cautious that t.a and t.b are defined as SIGNED int bit-fields. One bit is therefore the sign bit, and their value ranges are {0, -1} and [-64, 63] respectively (on typical implementations, the assignments above actually store -1 in both fields). Unsigned types are usually more useful for this kind of struct, since every bit in it should be meaningful; see the sketch below.
- Almost everything about bit-fields in a struct is implementation-dependent. Make sure of everything (which end the bit order starts from, whether a field may cross a byte boundary, etc.) before using this data structure.
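
An unsigned variant avoids the sign-bit surprises (a minimal sketch):

     struct flags
     {
          unsigned int a:1;     /* range [0, 1] */
          unsigned int b:7;     /* range [0, 127] */
     } t2;

     t2.a = 1;       /* stays 1, not -1 */
     t2.b = 0x7f;    /* stays 127 */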

* Union in Memory
In general, all data members of one union start at the same low memory address. This property would be used to exploit some special usages of union.
union bits32
{
     char bytes[4];
     int whole;
} t;
t.whole = 0x12345678;
t.bytes[0] = 0x90; => Now t.whole becomes 0x12345690 in little endian system.
Another classic example of union to check system endian:
t.whole = 1;
return (t.bytes[0] == 1); /* True is little endian and false is big endian */
Or:
union bits32 endian_test = { { 'l', '?', '?', 'b' } };
#define ENDIANNESS ((char)endian_test.whole) /* yields 'l' on a little-endian system, 'b' on a big-endian one */
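
Wrapped into a runnable check (a minimal sketch):

     #include <stdio.h>

     union bits32
     {
          char bytes[4];
          int whole;
     };

     int main(void)
     {
          union bits32 t;
          t.whole = 1;
          printf("%s endian\n", t.bytes[0] == 1 ? "little" : "big");
          return 0;
     }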

Unsigned and Signed Integer

* Two attributes for char, short, int, long and long long:
bitwidth (8, 16, 32, 64 bits) and sign (unsigned, signed).

* Conversion Rule
- When an expression operates on (signed/unsigned) char, (signed/unsigned) short, bit-fields, or enums, these types are first promoted to int; similarly, float is promoted to double in old-style and variadic calls. This is called type promotion.
- When an expression contains variables or numbers of different bitwidths, all of them are converted to the wider data type (signed or unsigned) before the operation continues. These are the usual arithmetic conversions. Widening a signed value extends the sign bit into the new high-order bits; the opposite conversion, from a wider to a narrower bitwidth, discards the high-order bits, including the sign. Endianness does not change these value conversions; it matters only when the bytes of a value are reinterpreted through pointers or unions.

* Arithmetic Operation of Unsigned
- When an expression contains variables or numbers with the SAME bitwidth but different signedness, the signed operand is converted to unsigned. This can bring trouble when it happens in a condition check, like this:
unsigned int a = 6;
int b = -20;
int c = (a+b>6) ? a : b;
c always ends up equal to a: b is converted to a huge unsigned value, so (a+b > 6) is true.
- The general arithmetic operations on unsigned:
c = (a ± b) mod 2^n
where n is the bitwidth of the data type. Therefore there is no overflow or underflow for unsigned data; results simply wrap around, which might not be what is expected in some cases.

If both operands are signed, the result of overflow/underflow is UNDEFINED. In general, it is hard to test for overflow/underflow of SIGNED integer operations portably in C (in Assembly one can check the flags of a status register). However, if x and y are two integers known to be non-negative, it can be done this way:
if ((int)((unsigned)x + (unsigned)y) < 0)
    complain();

* Shift Operations
- If the item is left-shifted, zeros are shifted in on the right. Do not left-shift signed data; shifting into or past the sign bit is undefined.
- "If the item being right shifted is unsigned, zeroes are shifted in. If the item is signed, the implementation is permitted to fill vacated bit positions either with zeroes or with copies of the sign bit. If you care about vacated bits in a right shift, declare the variable in question as unsigned. You are then entitled to assume that vacated bits will be set to zero."
- "if the item being right or left shifted is n bits long, then the shift count must be greater than or equal to zero and strictly less than n. Thus, it is not possible to shift all the bits out of a value in a single operation."
- By shifting bits, the multiplication and division for unsigned and multiplication for signed are safe and correct. "Note that a right shift of a signed integer is generally not equivalent to division by a power of two, even if the implementation copies the sign into vacated bits. To prove this, consider that the value of (-1)>>1 cannot possibly be zero."
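
A small illustration of the last point (a minimal sketch):

     unsigned int u = 100;
     unsigned int q = u >> 2;   /* 25, same as u / 4: shifting unsigned is safe */
     int s = -1;
     int t = s >> 1;            /* implementation-defined: often -1 (sign fill), never 0 */
     /* yet -1 / 2 is 0 under C99's truncation toward zero, so shift != division here */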

* size_t
- size_t is an unsigned integer type (often unsigned long int, though this is implementation-defined)
- The return data type of sizeof is size_t. Keep in mind the rules of conversion and unsigned arithmetic operations. For example,
#define TOTAL (sizeof(array)/sizeof(array[0]))
{
     int d = -1;
     if (d <= TOTAL-2)
         x = array[d+1];
}
Here TOTAL has type size_t, so d is converted to a huge unsigned value in the comparison; the test is unexpectedly false and the intended access of array[0] never happens. A fix is sketched below.
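
One way to avoid the surprise is to force the comparison into signed arithmetic; a minimal sketch:

     int d = -1;
     if (d <= (int)TOTAL - 2)   /* cast the size_t side so the comparison is signed */
         x = array[d+1];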

* Post-fix UL and L
If an expression can produce an overflowed value, consider putting these suffixes on integer constants: U, L, and UL. For example, write a routine to calculate 1 + 2 + ... + n, assuming the result does not overflow long int:
long foo(int n)
{
     return ((n + 1L) * n / 2);   /* the 1L suffix forces the arithmetic into long, so the product doesn't overflow int */
}

* Usage of Unsigned Data in C
- Use the unsigned versions ONLY in BIT OPERATIONS (&, |, ~, >>, <<); otherwise, CAST to the signed version.

* Identify Whether a Data Type or Variable Is Unsigned
- For variables: #define ISUNSIGNED(a) ((a) >= 0 && ~(a) >= 0)
- For data types: #define ISUNSIGNED(type) ((type)0 - 1 > 0)
Note: due to integer promotion, both macros are reliable only for types at least as wide as int; unsigned char and unsigned short are promoted to (signed) int before the comparison.