Saturday, July 21, 2007

Functions

* Memory Footprint of a Process
- "To understand what stack buffers are we must first understand how a process is organized in memory. Processes are divided into three regions:
Text, Data, and Stack.

The text region is fixed by the program and includes code (instructions) and read-only data. This region corresponds to the text section of the executable file. This region is normally marked read-only and any attempt to write to it will result in a segmentation violation.

The data region contains initialized and uninitialized data. Static variables are stored in this region. The data region corresponds to the data-bss sections of the executable file. Its size can be changed with the brk system call. If the expansion of the bss data or the user stack exhausts available memory, the process is blocked and is rescheduled to run again with a larger memory space. New memory is added between the data and stack segments."

Therefore, the memory footprint of one process is like this: (from high address to low address): Stack (grow from high to low), uninitialized data region, initialized data region and text region.

* Stack and Frame
For single threading programs, all of the subroutines would share the same stack which is allocated when it is executed. In general three addresses are needed for proper stack operations, and they are stored in processor registers. They are stack pointer(SP), stack base and stack limit.

One more register, namely frame pointer(FP), is needed for the normal function calling. When procedure is called, after input parameters (the pushing order of input parameters is from right to left) and return address (it is the current PC and then *PC = addr of procedure) are pushed, the frame pointer is pushed into the stack, and FP = SP (here SP points to the last occupied address instead of the first free location of the stack). Then the local variables would be allocated. All of these elements consist of a stack frame.

Frame pointer is used to support variable length of the stack. Once returned, the function would assign SP = FP, which points to the previous FP. Pop it up to assign FP. Thus the context goes back to the last one. Pop up one more to get return PC and continue the process execution. Meanwhile, during the execution of the callee function, FP would be the base address to access the local variables and input parameters, i.e., the offsets of local variables are negative and offsets of input parameters are positive. For example,
void function(int a, int b, int c)
{
     char buffer1[5];
     char buffer2[10];
}
void main()
{
     function(1,2,3);
}
The assembly of codes by using "gcc -S -o example1.s example1.c" is
     pushl $3
     pushl $2
     pushl $1
     call function
This pushes the 3 arguments to function backwards into the stack, and calls function(). The instruction 'call' will push PC onto the stack.The first thing done in function is the procedure prolog:
     pushl %ebp /* EBP is FP */
     movl %esp,%ebp /* ESP is SP */
     subl $20,%esp /* allocate buffer1 */
The stack now looks like:
     buffer1 FP RET a b c (Bottom of stack)
    [      ] [] [] [][][]
Note: the head of buffer1 locates in the low address (left side). So once the overflow happens on buffer1, FP, RET would be overwritten and segmentation fault occurs.

* Function Name
The function name is the address of the function codes in text region. It could directly be assigned to one pointer to function. Meanwhile, it is allowable to specify any address in text region to a pointer to function (no function body at all) and execute it. This is typically used to reset system.

* Function Parameters Casting
"It is the programmer's responsibility to ensure that the argument to a function are of the right type."

* Function Parameters
The function parameter is always passed by value into function stack, never by reference in C. That means the copies of function parameter, instead of themselves, are used within the function. Therefore any changes on copies can not be seen outside the function. The only way to achieve such changes on original variables is to use pointers as parameters and dereference them inside. Keep in mind the key here is to change the variables pointed by pointers. The pointers themselves within the function are still the copies of original ones. Changes on pointers within functions are not meaningful. For example,
void GetMem(char *p, int num)
{
        p = (char *)malloc(num * sizeof(char));
}
void FreeMem(char *p)
{
        free(p);
        p = NULL;
}
Two solutions for this: to use **p or to return the local pointer.

* Local Variables
Local variables are allocated in function stack and they would freed automatically when function returns. Generally local variables, like pointers (to non-local memory, like static variable), struct, union, etc., could be returned by value, not reference. The exception is arrays. For one thing, arrays could not be treated as a whole unit in C and therefore could not be returned. On the other hand, the memory of arrays is freed once the function returns. No way to dereference this data outside the function.

One typical example for this case is when using the library function setbuf(stdout, buf). Before handling control back to the OS, the library would flush(not free) the remaining in the buf. This happens just after the main function returns generally. If this buf is allocated in stack by using arrays, this last flush would be wrong since the memory has been freed. The solution is to use a global array or a static array. What if the buf in the heap?

Tuesday, July 17, 2007

Pointer

* Data Type For Pointer
In general, a pointer is of UNSIGNED LONG int, which means the pointer has sizeof(unsigned long) bytes.

* Initialize Pointers First
Pointers should be initialized after declaration, with &, malloc or NULL. Keep in mind this rule when deal with multi-dimension pointers. The following is a typical example:
char *p = "Hello World";
char **pp; /* No initialization here */
*pp = p;

The usage of **p in this example is a little bit special:
void GetMem(char **p, size_t num)
{
     *p = (char *)malloc(num * sizeof(char));
}
The caller must initialize p first with &, then pass it to this subroutine. No way to guarantee that p is valid within the subroutine. Otherwise the result of dereferencing p is undefined.

Another way to initialize pointers is:
int *p = (int *)0x12345678;
*p = 1; == *((int * const)0x12345678) = 1;

* Check Pointers Before Use Them
Dereferencing a NULL pointer or an illegal but not NULL pointer is undefined. Therefore checking if it is NULL is always the first thing for pointers as function input parameters or return value.

* The Precedence of Symbol `*'
The symbol `*' has lower precedence than postfix operators, like `.', ->, (), [], ++, --, etc. Keep this in mind to understand the correct meaning of some expressions:
*p.f, *a[0], *fp(), *p++

* Pointer Operations
- Subtraction is meaningful for two pointers while addition is not when both are doing with the same memory region, like array or buffer.
- When one constant is added/subtracted to a pointer, the resulting address is not just the addition/subtraction of them. It depends on the type of pointer and is calculated by the compiler.

* Operator []
When use a pointer to an array or memory block(allocated with malloc), p[i] and p+i is totally different. p[i] is dereferenced and not a pointer any more.

* Exchangeable Pointer and Array Name
As mentioned before, the array name actually is a pointer of data type of array's elements for most of time. So a pointer could be one to a single variable, to an array or to a memory region.
    int a[3][5];
    int (*p)[5];
    p = a; =>
    p[2][5] == a[2][5]
On the other hand, be careful about this:
    int a[3];
    int (*p)[3];
    p = &a; =>
    (*p)[2] == a[2]

* Print Address
    printf("The value of the pointer is %p\n", (void *)p);

Sunday, July 15, 2007

Buffer Pointers

* The Indexes In Array
As we know, the first index of array in C is 0 and the last one is N-1. It seems not good but actually it is very convenient on the calculation of array length without the off-by-one error IN PROGRAMMING. The basic idea is that zero is the first valid index in array while N is the first invalid index after array, thus the length of this array would be equal to (N-0). So it always looks like:
  for (int i=0; i< N; ++i)

* Buffer Pointers
We could apply the same idea to the definitions of buffer pointers. The size of buffer is N and the pointer to buffer is buf. Furthermore we need one pointer, say bufptr, to indicate the usage of this buffer. It is better off to set bufptr to the first unused place. Thus the length of used region is (bufptr-buf) and that of the unused is (N-(bufptr-buf)). The buffer could be filled in simply by *bufptr++. The full condition could be checked by ((bufptr-buf) == N).

This is the typical example with this technique. Pay attention to the updates of pointers:

void bufwrite(char *p, int n)
{
  while (n > 0)
  {
        int k, rem;
        if (bufptr-buf == N)
        {
              flushbuffer();
        }
        rem = N - (bufptr - buf);
        k = n > rem ? rem : n;
        memcpy(bufptr, p, k);
        bufptr += k;
        p += k;
        n -= k;
  }
}

Arrays

* Array in C
- No way to process an array as a whole unit in C.
- Only one dimension array exists. And the size of the array must be known in compilation.
- Arrays could not be copied and compared with each other directly.
- Arrays could not be returned from functions.
- Two things could be done to an array: to get the size of the array and to get the pointer to the first element of this array. All other operation on arrays would be done with this pointer.
- Two operators with the array name would do with the whole array: & and sizeof. & would get the address of this array, i.e. it is supposed to assign one pointer to the array. The size of array in byte would be obtained with sizeof. Another situation is in printf, which prints the whole string with the name of the array/pointer(NO *) and `%s'. For all other cases, the array name means the CONST pointer to the first element of the array. But when the array name is used as a function argument, the const property is lost. Changes are allowed on it in this case.
- The Nth element of an array does not exist. But its address could be used to do some checking, like
    if (p != &array[N]) p++;
- Multi-dimensional array
Note in C only one dimensional array exists. Multi-dimensional array is of linear memory mapping. Another way to get multi-dimensional array is to use multi-dimensional pointer and dynamic allocation. Note in this way the memory mapping might not be linear unless the total size of multi-dimensional array is allocated in the beginning.

Preprocessing Notes

* Scope of Macros
- Until #undef
- Or it is global static in current module
- Or it is global when included

* No Semicolon After Preprocessors

* Be cautious of the space in Macros
This example might not work as you expect:
#define max_(a, b) ((a) > (b) ? (a) : (b)) /* _ represents one space here */

* Arithmetic Calculation Is Allowed in Macro
#define SEC_PER_YEAR (60*60*24*365)UL

* Macros Are Not Functions
- Parenthese usage (the whole Macro and its oprands)
- Avoid evaluating operands more than once, like no such operators in operands, ++, --, etc.

* Macros Are Not Type Definitions
When declaring a new name for pointers, it is better off to use typedef instead of macros.

It is ok to
#define FOOTYPE struct foo
FOOTYPE a, b;

But this is not
#define FOOTYPE struct foo *
FOOTYPE a, b;

Meanwhile, this is allowed for macro but not for typedef:
#define FOOTYPE int
typedef int DOOTYPE;
unsigned FOOTYPE a;
unsigned DOOTYPE d; /* error */

* How to Define a Macro as An Expression?
It is desirous that macros could be used as a single statement and put a semicolon after them. A general solution for this is to use this structure:
do {\
    ...;\
    ...;\ /* No `break;' inside */
    ...;\
} while(0)    /* No semicolon here */
Note: the new line is supposed to begin right after symbol \

A special case for this is if-else. The ?: and || could be used for simplicity due to its specific evaluation order of operands, like the definition of assert macro:
#define assert(exp) ((exp) || _assert_())
Here _assert_() is used to report errors.

* String in Macros
- #define str(s) (b = #s) =>
    str(Hello); == b = "Hello";

* Function Parameters in Macro
- #define Print(x) (printf x) =>
    print(("hello")); == printf("hello");

* ##
For example,
#define bwMCDR2_ADDRESS 4
#define bsMCDR2_ADDRESS 17
#define bmMCDR2_ADDRESS BIT_MASK(MCDR2_ADDRESS)
#define BIT_MASK(__bf) (((1U << (bw ## __bf)) - 1) << (bs ## __bf))
#define SET_BITS(__dst, __bf, __val) \
((__dst) = ((__dst) & ~(BIT_MASK(__bf))) | \
(((__val) << (bs ## __bf)) & (BIT_MASK(__bf))))

SET_BITS(MCDR2, MCDR2_ADDRESS, RegisterNumber);

* Misc
- #if, #elif, #else, #ifdef, #ifndef, #endif, #define, #undef
- #ifdef A; #if defined(A)||!defined(B)
- __LINE__(int), __FILE__(char *), __DATE__(char *), __TIME__(char *)

Reference:
Andrew Koenig, "C Traps and Pitfalls", AT&T Bell Lab

Saturday, July 14, 2007

Semantic Notes

* The Interpretation of Numbers
In the occurrence of numbers in C codes, note the default data types of them, since this would affect the conversion of data types in the expression.
- Integer number is of signed int.
- float point number is of double, not float.

* Expression Evaluation Sequence
Only four operators in C, &&, ||, ?: and , specify an order of evaluation. All other operators evaluate their operands in undefined order. The assignment operator does not guanantee the evaluation order. Like y[i] = x[i++];

* The meaning of 0
- It is the NULL pointer with (void *) type and its content can not be referenced.
- It is the binary number of the ending symbol of a string, '\0'.
- It could be casted to the pointer to any data type and perform some calculation, like
    (size_t)((char *)&(((Type *)0)->partx) - (char *)((Type *)0))
- Complement of 0: unsigned int compzero = ~0;

Reference:
Andrew Koenig, "C Traps and Pitfalls", AT&T Bell Lab

Syntactic Notes

* Precedence
Postfix like ++, --, ->, (), [], ., etc have higher precedence than others and are left associative. Note the difference between postfix and prefix ++, --. Both prefix and postfix ones would increment/decrement its operand/variable directly. But prefix ones would return this variable while postfix ones returns the previous value of this variable. If the return value is not used further, like "i++;" and "++i;", prefix ones are preferred for efficiency since it do not need another temporary variable to store the previous value.

To apply the parenthesis to expressions is the best solution for the precedence confusion.

* Typedef/Define Complex Declarations
The key point is to understand the precedence of * and (), [], etc, when you identify the compound data types.
Two steps to declare compound data types by using typedef or #define: First figure out the correct type format; secondly remove the variable name (and the semicolon for #define) from the format and combine it with typedef/#define.
e.g. (*(void (*)())0)() and we could rewrite this as
typedef void (*fp)();
(*(fp)0)();
Try this now:
void (*((*p)[10]))(void);

* Semicolon
A single semicolon means a null statement in C. Two cases are special for semicolon usage.
One is if or while clause(semicolon might not necessary) and the other is the end of a declaration just before a function definition(semicolon is necessary since it might confuse the return type of the function).

* The Switch Statement
Keep in mind the weakness and strength of "break" in switch.

* The else Problem
The "else" is always associated with the closest unmatched if. The solution is to put curly brackets always even though only one statement is followed.

* Static
- static variables with the scopes only in one function
- static global variables with the scopes only in one file
- static global functions with the scopes only in one file

* +=; -=, *= and /=
The advantage of these operators are left values of them are just evaluated once when these l-values are compounds. For example,
array[i++] += y; =>
array[i] = array[i] + y; i++;
instead of
array[i++] = array[i++] + y;

Reference:
Andrew Koenig, "C Traps and Pitfalls", AT&T Bell Lab

Lexical Notes

* Assignment and Logic Comparison (i.e. = is not ==)
In most cases, the solution is to switch the arguments in comparison, like if (0 == x) instead of if (x == 0). This is invalid if both arguments are variables.

* Bit Operations and Logic Operations (i.e. & is not &&, | is not ||)
& and | treat their arguments as a sequence of bits while && and || does as 'true' or 'false'.
For bit operations, the portable formats:
#define BIT3 (0x1 << 3)
unsigned int a = 0x1;
a |= BIT3; /* Set */
a &= ~BIT3; /* Clear */

* Multi-character Tokens
How to explain these cases?
y = x/*p; z = y+++x; (assuming p is the pointer)
The greedy interpretion rule!

Reference:
Andrew Koenig, "C Traps and Pitfalls", AT&T Bell Lab