Sunday, December 7, 2008

Notes of Linux Device Driver 3rd (2)

* Concurrency and race conditions
Two aspects of resource sharing:
- Resource synchronization
- Reference counting (a resource must not be freed while it is still in use)

Utils for resource sync provided in the Linux kernel
- Semaphores and mutexes
The caller can block and sleep while waiting, so they must not be used in an ISR.
A semaphore is not the preferred tool for event synchronization; it is usually optimized for the "available" (uncontended) case. Use completions for event sync instead (see the sketch after this list).
- RW semaphores
Best used when write access is required only rarely and the writer lock is held for a short period of time.
- Spinlocks
The holder must NOT block. Spinlocks give higher performance than semaphores but come with a lot of constraints:
-- They mainly pay off on SMP systems.
-- Code holding a spinlock must be atomic: it must not sleep or block. Preemption is disabled on the local core while the lock is held, and if the lock can also be taken in interrupt context, interrupts must be disabled on that core as well (spin_lock_irqsave).
-- Hold the lock for as short a time as possible.
- RW spinlocks
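
A minimal sketch of how a spinlock and a completion pair up in a driver; the device struct and handler names (my_dev, my_irq_handler) are hypothetical:

#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/interrupt.h>

struct my_dev {                     /* hypothetical per-device state */
    spinlock_t lock;                /* protects hw_busy */
    struct completion done;         /* signals "operation finished" */
    int hw_busy;
};

static void my_dev_setup(struct my_dev *dev)
{
    spin_lock_init(&dev->lock);
    init_completion(&dev->done);
}

static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_dev *dev = data;
    unsigned long flags;

    /* ISR context: spinlocks are fine here, semaphores are not */
    spin_lock_irqsave(&dev->lock, flags);
    dev->hw_busy = 0;
    spin_unlock_irqrestore(&dev->lock, flags);

    complete(&dev->done);           /* event sync: wake the waiter */
    return IRQ_HANDLED;
}

static void my_dev_wait(struct my_dev *dev)
{
    /* process context only: may sleep until the ISR completes us */
    wait_for_completion(&dev->done);
}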

Alternatives to locking
- Lock-free algorithms: e.g. a single-producer/single-consumer circular buffer
- Atomic variables
- Bit operations
- Seqlocks: for data that is small, simple, and frequently read, where write access is rare but must be fast. The protected data must not contain pointers (see the seqlock sketch after this list).
- Read-Copy-Update (RCU): for when reads are common and writes are rare. The protected resources must be accessed via pointers, and all references to them may be held only by atomic code.
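
A minimal seqlock sketch, assuming a frequently-read pair of plain values (the names ts_sec/ts_nsec are illustrative):

#include <linux/types.h>
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(ts_lock);
static u64 ts_sec, ts_nsec;         /* plain values, no pointers */

static void ts_write(u64 sec, u64 nsec)   /* rare but fast writer */
{
    write_seqlock(&ts_lock);
    ts_sec = sec;
    ts_nsec = nsec;
    write_sequnlock(&ts_lock);
}

static void ts_read(u64 *sec, u64 *nsec)  /* lockless reader, retries on a race */
{
    unsigned int seq;

    do {
        seq = read_seqbegin(&ts_lock);
        *sec = ts_sec;
        *nsec = ts_nsec;
    } while (read_seqretry(&ts_lock, seq));
}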

Notes of Linux Device Driver 3rd (1)

* User space and kernel space
- Transfer from user space to kernel space happens via system calls (exceptions) or hardware interrupts. Note the difference between these two: the execution context. A system call runs in the context of a process (and can therefore access data in that process's address space), while an ISR runs in its own context, unrelated to any process.

- Concurrency in the kernel
-- multi-processes which might use the driver at the same time
-- HW interrupts and deferred-execution mechanisms such as softirqs, timers, tasklets, workqueues, etc.
-- The kernel is preemptible since 2.6
All of these require the kernel and drivers to be reentrant.

- Multithreading mapping (user space to kernel space)
-- Many to one: all threads map to a single schedulable unit; the whole process blocks if one thread blocks.
-- One to one: each thread maps to its own schedulable unit; communication between threads is more complicated.
-- Many to many: lightweight processes (LWPs) share data and form a process group. Linux supports a one-to-one mapping between threads and LWPs.

* Major and minor numbers
- The major num identifies the driver associated with the device. Usually one major num maps to one driver.
- The minor num identifies exactly which device is being referred to.
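
A minimal sketch of how a driver obtains and decomposes its device numbers; the name "mydrv" is made up:

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/kdev_t.h>

static dev_t devno;

static int __init mydrv_init(void)
{
    /* let the kernel pick a free major; reserve 1 minor starting at 0 */
    int ret = alloc_chrdev_region(&devno, 0, 1, "mydrv");
    if (ret)
        return ret;
    printk(KERN_INFO "mydrv: major=%d minor=%d\n",
           MAJOR(devno), MINOR(devno));
    return 0;
}

static void __exit mydrv_exit(void)
{
    unregister_chrdev_region(devno, 1);
}

module_init(mydrv_init);
module_exit(mydrv_exit);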

* Some important data structs related to the device driver
Most of the fundamental driver operations involve three important kernel data structs, called file_operations, file and inode.
- file_operations: implements the interfaces behind the various system calls
- file: represents an open file
- inode: represents a file internally in the kernel. One file can have multiple file structs (one per open) but only one inode struct.
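
A minimal sketch of the three structs meeting in one driver; the method names are hypothetical:

#include <linux/module.h>
#include <linux/fs.h>

static int my_open(struct inode *inode, struct file *filp)
{
    /* inode: the single per-file kernel object; filp: this open instance */
    return 0;
}

static ssize_t my_read(struct file *filp, char __user *buf,
                       size_t count, loff_t *f_pos)
{
    return 0;                       /* pretend immediate EOF */
}

static const struct file_operations my_fops = {
    .owner = THIS_MODULE,
    .open  = my_open,
    .read  = my_read,
};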

Sunday, June 22, 2008

Resource Pattern

How can resource synchronization avoid its two classic problems, priority inversion and deadlock?

* Critical section pattern
pros: avoids both problems
cons: high cost

* Priority inheritance
pros: avoids priority inversion and is simple
cons: deadlock and chain blocking can still happen
Chain blocking: J1 needs S1 and S2, but S1 is held by J2 and S2 by J3, with priorities P1 > P2 > P3. J1 must therefore wait for both J2 and J3 to finish.
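
POSIX exposes priority inheritance directly on mutexes; a minimal sketch (make_pi_mutex is a made-up helper name):

#include <pthread.h>

/* a mutex whose owner inherits the priority of any
   higher-priority thread it blocks */
static int make_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret;

    pthread_mutexattr_init(&attr);
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (ret == 0)
        ret = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return ret;
}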

* Highest locker pattern
One prio ceiling is defined for each resource in system design. The basic idea is that the task owning the resource runs at the highest prio ceiling of all the resources it currently owns, provided that it is blocking one or more higher-prio tasks. In this way, chain blocking is avoided.

pros: avoids chain blocking
cons: deadlock is still possible
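
POSIX's PTHREAD_PRIO_PROTECT implements the immediate-ceiling variant of this pattern (the owner is raised to the ceiling as soon as it takes the lock, whether or not it is blocking anyone); a minimal sketch with a made-up helper name:

#include <pthread.h>

static int make_ceiling_mutex(pthread_mutex_t *m, int ceiling)
{
    pthread_mutexattr_t attr;
    int ret;

    pthread_mutexattr_init(&attr);
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    if (ret == 0)
        ret = pthread_mutexattr_setprioceiling(&attr, ceiling);
    if (ret == 0)
        ret = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return ret;
}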

* Priority ceiling pattern
The idea is to ensure that when a job J preempts the critical section of another job and executes its own critical section, the priority at which this new critical section executes is guaranteed to be higher than both the inherited priorities (so it can run) AND the ceiling priorities (so it can continue) of all the preempted critical sections in the system. The difference from Highest Locker is that the job is not raised to the priority ceiling of the locked resource in the first place; its priority is raised only on demand.
pros: avoids both problems
cons: high cost

Memory Patterns

* Static allocation pattern
pros: deterministic allocation/deallocation time, no memory fragmentation, easy to maintain.
cons: long init time, large memory demand, inflexible.

* Dynamic allocation pattern
pros: flexible; memory footprint matches actual demand.
cons: non-deterministic allocation/deallocation time and memory fragmentation. Pointer-managed memory is hard to maintain.

* Pool allocation pattern
pros: more flexible than static allocation for dynamic requirements; near-deterministic allocation/deallocation time (unless the pool runs out) and no memory fragmentation.
cons: the number of objects in the pool must be tuned for the best performance on each system (see the sketch below).
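
A minimal fixed pool sketch in C; the object type and names (msg_t, pool_alloc) are illustrative, and POOL_SIZE is the number that must be tuned:

#include <stddef.h>

#define POOL_SIZE 32                /* the number to tune per system */

typedef struct msg {
    struct msg *next;               /* free-list link, unused while allocated */
    char payload[64];
} msg_t;

static msg_t pool[POOL_SIZE];       /* all storage reserved up front */
static msg_t *free_list;

void pool_init(void)
{
    int i;

    for (i = 0; i < POOL_SIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = &pool[0];
}

msg_t *pool_alloc(void)             /* O(1); NULL when the pool runs out */
{
    msg_t *m = free_list;

    if (m)
        free_list = m->next;
    return m;
}

void pool_free(msg_t *m)            /* O(1), never fragments */
{
    m->next = free_list;
    free_list = m;
}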

* Fixed sized buffer pattern
pros: no memory fragmentation, since every buffer is sized for the worst-case requirement.
cons: wastes memory on average; this can be improved by managing several heaps of different fixed sizes.

* Smart ptr pattern (see the refcount sketch after this list)

* Garbage collection pattern

* Garbage compactor pattern
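
The smart pointer pattern boils down to reference counting; a minimal single-threaded C sketch (the names are mine; real code needs atomic counts):

#include <stdlib.h>

typedef struct rc_buf {
    int refs;                       /* how many owners exist */
    char data[128];
} rc_buf_t;

rc_buf_t *rc_new(void)
{
    rc_buf_t *b = calloc(1, sizeof(*b));

    if (b)
        b->refs = 1;                /* the creator holds the first reference */
    return b;
}

rc_buf_t *rc_get(rc_buf_t *b)       /* take another reference */
{
    b->refs++;
    return b;
}

void rc_put(rc_buf_t *b)            /* drop one; the last owner frees */
{
    if (--b->refs == 0)
        free(b);
}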

Wednesday, March 12, 2008

Linux Device Driver Concepts

The following concepts are not equal:
1. Device : physical chip
2. Device Driver : codes to control the chip
3. Interface : routines exposed to users that hide the chip details underneath.
4. ISR : interrupt service routine, executed in interrupt context in response to the device's interrupts.
5. Kernel module : an object file that can be loaded at run time to extend kernel functionality.

Sunday, February 17, 2008

Skip, Direct Pred Modes

* Diff from normal pred:
- Need to derive both refIdx and mv (normal modes derive only mv).
-- P Skip: refidxL0 = 0
-- B Temporal: refidxL1 = 0

* Diff between B temporal and spatial modes:
- refidx:
-- temporal mode: always L0 and L1
-- spatial mode: two or one (either L0 or L1)

- mi sharing
-- temporal mode: all 16 4x4 blks share one mi (if the colMB is in 16x16 mode), or
each group of four 4x4 blks shares one (direct_8x8_inference_flag == 1), or
no sharing among the 16 4x4 blks.
-- spatial mode: all 16 4x4 blks share one mi (because the neighbor derivation is done at the MB boundary), or
each group of four 4x4 blks shares one (direct_8x8_inference_flag == 1), or
no sharing among the 16 4x4 blks.

Note:
- ColPic is a different concept from refIdxL1.
- A B picture can itself serve as a reference picture.

Saturday, February 2, 2008

Optimization

The principle of optimization:
* Cases: know what the application will typically do, and whether average-case or worst-case performance matters more, before optimizing.
* Order: Algorithm & Design -> Local optimization -> Assembly
* Context: the execution time of an instruction is measured in a certain context and is meaningful only in that context. It is also more meaningful to optimize a segment of code than a single instruction.

Algorithm & Design:
* Two kinds of computing:
-- data processing (data streaming algorithms): data is processed sequentially and, in general, only once. Relationships among data items are not the algorithm's concern, so a cache is of little help. Digital signal processing is a typical application; reducing the amount of computation is usually the effective optimization.
-- data transaction processing (data block algorithms): the relationships among the data within one block are the focus of the algorithm, so the data cache matters a great deal. Sorting and searching are typical examples.

Instruction Level Optimization:
* Avoid branch hazard:
-- Unroll loops: results may differ for write-only, read-only, and mixed read/write loops. In general an unrolling depth of 8~16 works well (see the sketch after this list).
-- Combine loops
-- Fill-in branch delay in Assembly
* Avoid data hazard:
-- Eliminate data dependences
-- Prefetch data and instructions: exploit memory/cache throughput instead of paying full latency.
* Understand the details and differences of each assembly instruction, and use the best one.
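
A sketch of an 8x-unrolled loop; the four independent accumulators also break the data dependence chain mentioned above (assumes n is a multiple of 8 for brevity):

#include <stddef.h>

long sum8(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i < n; i += 8) {    /* one branch per 8 elements */
        s0 += a[i]     + a[i + 4];
        s1 += a[i + 1] + a[i + 5];
        s2 += a[i + 2] + a[i + 6];
        s3 += a[i + 3] + a[i + 7];
    }
    return s0 + s1 + s2 + s3;       /* combine the partial sums */
}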

Memory Usage Optimization:
* Optimize data structs
-- Split large linked structures, like lists, trees, etc., into smaller ones and access them simultaneously.
-- Compact data structs as densely as possible to reduce the access stride and increase the utilization of each cache line (see the sketch after this list).
-- Separate data structs to reduce the access stride.
* Use the mem/cache pipeline: the basic transfer unit between cache/TLB and memory is one cache line, so make the read stride the length of one cache line.
* Process memory in int-sized units instead of char, especially for reads; it seems not to matter for writes.
* Addr alignment for src/dst
* Combine writes and reads for large data blocks that exceed the cache size. For small blocks, overlapping cache access might be harmful.
* Compact data within one mem page as densely as possible, due to TLB and page table.
* Put data into different memory banks to hide DRAM refresh and to access multiple banks in parallel. One page size as the increment works well.
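
A sketch of the struct-splitting and compaction ideas above; the node layout is made up:

#include <stdint.h>

/* hot/cold split: fields touched on every scan are packed densely,
   rarely-used fields live in a parallel array */
struct node_hot {
    uint32_t key;
    uint32_t next_idx;              /* an index is denser than a 64-bit ptr */
};

struct node_cold {
    char name[48];                  /* touched only on a key match */
    uint64_t stats;
};

/* scanning hot[] walks memory in small dense steps, so a 64-byte cache
   line carries 8 useful entries instead of being mostly cold bytes */
static struct node_hot  hot[1024];
static struct node_cold cold[1024];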

Cache Usage Optimization:
* Keep data and instruction sizes within the cache size: the maximum attainable size of cached data is CacheSize + (CacheSize / Way). E.g. for a 32 KB 4-way cache, at most 32 KB + 8 KB = 40 KB.
* Addr alignment and no line split data
* Put data on diff cache banks in order to access multiple cache banks.
* The memory access stride should not be a multiple of the span of one cache way (CacheSize / Way); otherwise the accesses map to the same set and conflict misses occur.

Write Buffer Usage Optimization:
* Access the same data types, esp. for reads after writes.
* Group writes together to take advantage of the write buffer, since its entries may be flushed in arbitrary order.

Thursday, January 3, 2008

Multiprocessing and Multithread

Multiprocessing: multiple CPU, like CMP, SMP, etc.
Multithreading: One CPU but probably with more threading HW support, like multiple Register files and PC's.

Superscalar: HW-supported dynamic multiple-instruction issue and branch prediction
VLIW: software (compiler) supported multiple-instruction issue and BP

ccNUMA SMP: symmetric multiprocessing where the OS can run on any of the CPUs with the help of resource locking. "cc" means cache coherent; each CPU has its own local memory banks (NUMA).